## Features conversion 

We compare three techniques:

* Bag of Words features: learns a vocabulary from all the documents, then models each document by counting the number of times each word appears.

* TF-IDF features: words are weighted by their TF-IDF score, which measures relevance rather than raw frequency. It emphasizes words that occur frequently in a given document while de-emphasizing words that occur in many documents.

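* Word2Vec features: word embeddings learned with one of two architectures, CBOW or skip-gram. Both are shallow neural networks whose prediction target is itself a word: CBOW predicts a word from its surrounding context, while skip-gram predicts the context words from a given word.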
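
To make the difference between the first two representations concrete, here is a minimal, library-free sketch on a hypothetical three-document corpus. It uses the textbook formula `idf = log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed variant and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized on whitespace.
docs = [
    "chatgpt write code",
    "chatgpt answer question",
    "chatgpt write good code code",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens):
    """Bag of Words: raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Textbook inverse document frequency: log(N / document frequency)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """TF-IDF: term count scaled by how rare the word is across documents."""
    counts = Counter(tokens)
    return [counts[w] * idf(w) for w in vocab]

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[2])
print([round(x, 3) for x in tfidf[2]])
```

In this toy corpus `chatgpt` appears in every document, so its TF-IDF weight is zero even though its count is not, while `good` (present in a single document) gets the largest weight per occurrence.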

In [41]:
import numpy as np
import pandas as pd

df = pd.read_csv('datasets/preprocessed_sentiment.csv', usecols=['tweets','labels'])
df.labels.value_counts() 

 1    56011
 0    55487
-1    53898
Name: labels, dtype: int64

In [42]:
df.sample(5)

Unnamed: 0,tweets,labels
14499,imagin abl creat type product want comfort hom...,1
39989,openai api die right now timeout everywher dav...,1
58769,student room phone smart watch radio receiv pr...,-1
16408,agre ryan here imho chatgpt gpt solv key biote...,1
148013,chatgpt good imit everyth itself small step cl...,0


In [43]:
# check if there is any NaN value

df.tweets.isnull().values.any()
df.tweets.isnull().sum()

6

In [44]:
# drop the NaN values if any

df = df.dropna()
df.labels.value_counts() 

 1    56011
 0    55487
-1    53892
Name: labels, dtype: int64

In [14]:
# Bag of Words : 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import gensim

# need to tune the parameters 
BOWvectorize = CountVectorizer(max_df = 0.90, min_df = 2, max_features = 1000, stop_words='english')
BOW = BOWvectorize.fit_transform(df.tweets)

In [15]:
BOW.shape

(165390, 1000)
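
The `max_df` and `min_df` arguments filter the vocabulary by document frequency before `max_features` keeps the most frequent remaining terms. The sketch below reproduces only the filtering step by hand on a hypothetical four-document corpus, assuming a float `max_df` is read as a proportion of documents and an integer `min_df` as an absolute count.

```python
from collections import Counter

# Toy corpus (hypothetical); each document is reduced to its set of words.
docs = [
    "chatgpt write code",
    "chatgpt answer question",
    "chatgpt write essay",
    "chatgpt rare",
]
tokenized = [set(d.split()) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
doc_freq = Counter(w for doc in tokenized for w in doc)

max_df, min_df = 0.90, 2  # same values as in the vectorizer call
kept = sorted(
    w for w, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

Here `chatgpt` is dropped because it appears in more than 90% of the documents, and every word seen in only one document is dropped by `min_df`.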

In [18]:
# TF-IDF features: 
TfidfVect = TfidfVectorizer(max_df = 0.90, min_df = 2, max_features = 1000, stop_words='english')
Tfidf = TfidfVect.fit_transform(df.tweets)

In [19]:
Tfidf.shape

(165390, 1000)

In [49]:
# Word2Vec features: 
tokenize_tweet = df.tweets.apply(lambda x: x.split())

model_W2V = gensim.models.Word2Vec(tokenize_tweet, 
                                   vector_size = 200, # No. of features
                                   window =  5, # default window
                                   min_count = 2, 
                                   sg = 1, # 1 for skip-gram model
                                   hs = 0,
                                   negative = 10, # for negative sampling
                                   workers = 2,  # No. of cores
                                   seed = 34 )

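# Note: passing the corpus to the constructor above already builds the vocabulary
# and trains once; this call continues training for 20 more epochs
model_W2V.train(tokenize_tweet, total_examples= len(df.tweets), epochs=20)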

(32280296, 38528160)
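
With `sg = 1` the model is trained as skip-gram, which learns from (center word, context word) pairs taken from a window around each position. The following is a simplified sketch of how such pairs could be enumerated; the helper `skipgram_pairs` is hypothetical, and real implementations add refinements (for example negative sampling and frequent-word subsampling) that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "openai api die right now".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

CBOW uses the same windows in the opposite direction: the context words jointly predict the center word.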

In [51]:
w2v_words = list(model_W2V.wv.index_to_key)
print("number of words that occurred at least 2 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

number of words that occurred at least 2 times  34919
sample words  ['chatgpt', 'openai', 'write', 'ask', 'like', 'use', 'googl', 'new', 'gener', 'code', 'it', 'chatbot', 'answer', 'good', 'time', 'question', 'think', 'know', 'work', 'tri', 'creat', 'thing', 'help', 'gpt', 'human', 'peopl', 'search', 'way', 'tool', 'need', 'learn', 'amp', 'model', 'better', 'futur', 'intellig', 'world', 'languag', 'day', 'bot', 'go', 'play', 'prompt', 'chat', 'technolog', 'year', 'look', 'want', 'talk', 'amaz']


In [53]:
from tqdm import tqdm

vector = []
for sent in tqdm(tokenize_tweet):
    sent_vec = np.zeros(200)
    count = 0
    for word in sent: 
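        if word in model_W2V.wv.key_to_index:  # dict lookup, faster than scanning the w2v_words list
            vec = model_W2V.wv[word]
            sent_vec += vec 
            count += 1
    if count != 0:
        sent_vec /= count  # average the vectors of the words found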
    vector.append(sent_vec)
    
print(len(vector))
print(len(vector[0]))

100%|█████████████████████████████████████████████████████████████████████████| 165390/165390 [04:11<00:00, 657.65it/s]

165390
200
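
The loop above turns each tweet into the mean of its word vectors, skipping out-of-vocabulary words and leaving a zero vector when no word is known. A small library-free sketch of the same idea, using hypothetical 3-dimensional embeddings instead of the 200-dimensional ones trained here:

```python
# Hypothetical toy embeddings; the notebook's real vectors have 200 dimensions.
toy_vectors = {
    "chatgpt": [1.0, 0.0, 2.0],
    "good": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # out-of-vocabulary word: ignored
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]

print(average_vector(["chatgpt", "good", "unseenword"], toy_vectors, 3))
print(average_vector(["unseenword"], toy_vectors, 3))
```

Averaging discards word order, so two tweets containing the same words in a different order receive the same feature vector.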





## Model building and testing

For each of the three feature representations, we will compare the following models: 
* Logistic Regression 
* SVM 
* Random Forest 
* KNN 
* CNN 

### BOW classification 

In [56]:
# Split the train and test datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

x_train_bow, x_test_bow, y_train_bow, y_test_bow = train_test_split(BOW, df.labels, test_size=0.2) 

In [58]:
from sklearn.linear_model import LogisticRegression


LR = LogisticRegression(solver='lbfgs', max_iter=500, multi_class='multinomial')
LR.fit(x_train_bow, y_train_bow)

prediction = LR.predict_proba(x_test_bow)

# Threshold the second probability column at 0.3: 1 if >= 0.3, else 0.
# Caution: the labels take three values (-1, 0, 1) and column 1 holds P(label = 0),
# so these 0/1 predictions do not line up with the true classes, which likely
# explains the low score below.
pred = prediction[:,1] >= 0.3
pred = pred.astype(int)  # np.int was deprecated in NumPy 1.20 and later removed

f1_score(y_test_bow, pred, average='micro')


0.17800350686256725
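
Thresholding one probability column suits binary labels, but the labels here take three values. The sketch below uses hypothetical probabilities (not the model's real output) to contrast that threshold with picking the most probable class, assuming the probability columns follow the sorted label order `[-1, 0, 1]`. With a fitted scikit-learn classifier, `LR.predict(x_test_bow)` should return the most probable class directly.

```python
# Assumed class order: labels sorted ascending, as in a classifier's classes_ attribute.
classes = [-1, 0, 1]

# Hypothetical predicted probabilities, one row per tweet, columns follow `classes`.
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.5, 0.4, 0.1],
]
y_true = [-1, 0, 1, -1]

def accuracy(y, pred):
    """For single-label multiclass data, micro-averaged F1 equals accuracy."""
    return sum(a == b for a, b in zip(y, pred)) / len(y)

# Thresholding column 1 yields only 0 or 1, and a 1 here means "P(label = 0) >= 0.3".
thresholded = [int(row[1] >= 0.3) for row in proba]

# Picking the most probable class maps each row back to a real label.
argmax = [classes[max(range(len(row)), key=row.__getitem__)] for row in proba]

print(thresholded, accuracy(y_true, thresholded))
print(argmax, accuracy(y_true, argmax))
```

The thresholded predictions can never be -1, so every tweet with that label is counted as an error regardless of how good the underlying probabilities are.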

In [None]:
# SVM for BOW 
from sklearn.svm import SVC

#parameters in SVC
# c_list=list(range(1,51))
param_grid_svc = {'C': [1, 10, 100, 1000],
                  'kernel': ['linear','poly','rbf','sigmoid'],
                  'degree': [1,2,3,4]}
print(param_grid_svc)

model_svc = SVC()

In [None]:
#best parameters for SVC
from sklearn.model_selection import RandomizedSearchCV

SVC_RandomGrid = RandomizedSearchCV(estimator = model_svc, param_distributions = param_grid_svc, cv = 10, verbose=2, n_jobs = 4)
SVC_RandomGrid.fit(x_train_bow, y_train_bow)
SVC_RandomGrid.best_params_
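
To get a feel for the cost of this search, the sketch below counts the combinations in the SVC grid and draws a random subset of them without scikit-learn. It assumes `n_iter=10`, which is the documented default of `RandomizedSearchCV` and is what the call above uses implicitly; the fixed seed is only there to make the sketch repeatable.

```python
import itertools
import random

param_grid_svc = {
    "C": [1, 10, 100, 1000],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [1, 2, 3, 4],
}

# Every combination an exhaustive grid search would try.
keys = list(param_grid_svc)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_grid_svc.values())]

# A randomized search over lists draws a fixed number of combinations
# without replacement; n_iter=10 is assumed here.
rng = random.Random(0)
n_iter, cv = 10, 10
sampled = rng.sample(all_combos, n_iter)

print(len(all_combos))        # combinations in the full grid
print(n_iter * cv)            # model fits performed by the randomized search
print(len(all_combos) * cv)   # fits an exhaustive search would need
```

Note that `degree` only affects the `poly` kernel, so many of the sampled combinations are effectively duplicates; restricting `degree` to the polynomial case would make better use of the ten draws.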

In [None]:
from sklearn.metrics import classification_report

model_svc = SVC(kernel='linear',degree=1, C=1)
model_svc = model_svc.fit(x_train_bow, y_train_bow)
prediction_svc = model_svc.predict(x_test_bow)

print(classification_report(y_test_bow, prediction_svc))

In [None]:
# Random Forest for BOW: 
from sklearn.ensemble import RandomForestClassifier  # public import path

model_forest = RandomForestClassifier()

In [None]:
#best parameter for RF
from sklearn.model_selection import RandomizedSearchCV

#parameters in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 200, num = 20)] #tree number
max_features = ['sqrt','log2']  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
max_depth = [10,20,30,40]
min_samples_split = [2, 5, 10, 15]
min_samples_leaf = [1, 2, 5, 10]

# Create the param grid
param_grid_forest = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

RF_RandomGrid = RandomizedSearchCV(estimator = model_forest, param_distributions = param_grid_forest, cv = 10, verbose=2, n_jobs = 4)
RF_RandomGrid.fit(x_train_bow, y_train_bow)
RF_RandomGrid.best_params_

In [None]:
# model establishment and results 
# Random Forest
# max_features='sqrt' is what 'auto' meant for classifiers before it was removed
model_forest = RandomForestClassifier(n_estimators = 140, min_samples_split=10, min_samples_leaf=2, max_features='sqrt', max_depth=40)
model_forest.fit(x_train_bow, y_train_bow)
prediction = model_forest.predict(x_test_bow)

from sklearn.metrics import classification_report 
print(classification_report(y_test_bow, prediction))