## Feature conversion

We compare three feature-extraction techniques:

* Bag of Words features: learns a vocabulary from all the documents, then represents each document by counting how many times each vocabulary word appears in it.

* TF-IDF features: each word is weighted by its TF-IDF score, which measures relevance rather than raw frequency. The weighting emphasizes words that occur frequently in a given document while de-emphasizing words that occur in many documents.

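* Word2Vec features: dense word embeddings learned with one of two architectures, CBOW or skip-gram. Both are shallow neural networks whose input and target are words: CBOW predicts a word from its surrounding context, while skip-gram predicts the context words from a given word.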
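
To make the difference between the first two techniques concrete, here is a minimal pure-Python sketch on a toy corpus. The three documents and the helper names (`bag_of_words`, `tfidf`) are made up for illustration. The smoothed IDF and L2 normalization used here are one common variant, so the exact weights produced by a library vectorizer depend on its settings.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = [
    "chatgpt write code",
    "chatgpt write poem",
    "python code code",
]
tokenized = [d.split() for d in docs]

# Vocabulary learned from all documents, in a fixed (sorted) order
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Count how many times each vocabulary word appears in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Document frequency: in how many documents does each word occur?
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer words get a larger weight
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf(tokens):
    """Weight raw counts by IDF, then scale the vector to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights)) or 1.0
    return [x / norm for x in weights]

bow = [bag_of_words(doc) for doc in tokenized]
weighted = [tfidf(doc) for doc in tokenized]

print(vocab)
print(bow[2])       # raw counts for "python code code"
print(weighted[1])  # TF-IDF weights for "chatgpt write poem"
```

In the second document every word appears once, so Bag of Words treats them equally, while TF-IDF gives "poem" a larger weight because it occurs in only one document.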

In [9]:
import numpy as np
import pandas as pd

df = pd.read_csv('datasets/preprocessed_sentiment.csv', usecols=['tweets','labels'])
df.labels.value_counts() 

 1    56011
 0    55487
-1    53898
Name: labels, dtype: int64

In [10]:
df.sample(5)

| | tweets | labels |
|---|---|---|
| 148362 | bit python code today deep rabbit hole pyplot ... | -1 |
| 81991 | learn absolut fun prompt today nnignor previou... | 1 |
| 71397 | chatgpt good free capitalist perspect | 1 |
| 76924 | hey check cool site found topic viamytwitternam | 0 |
| 58997 | elon musk found critic compani buzzi new chatb... | -1 |


In [11]:
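# Check for NaN values: .any() reports whether any exist, .sum() counts them.
# Only the last expression of a cell is displayed, so the output below is the count.

df.tweets.isnull().values.any()
df.tweets.isnull().sum()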

6

In [13]:
# drop rows containing NaN values (dropna() keeps the original index labels, so the index now has gaps)

df = df.dropna()
df.labels.value_counts() 

 1    56011
 0    55487
-1    53892
Name: labels, dtype: int64

In [14]:
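# Bag of Words:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import gensim

# These parameters are untuned starting values:
# max_df=0.90 drops terms found in more than 90% of tweets, min_df=2 drops terms
# found in fewer than 2 tweets, and max_features keeps the 1000 most frequent terms.
BOWvectorize = CountVectorizer(max_df = 0.90, min_df = 2, max_features = 1000, stop_words='english')
BOW = BOWvectorize.fit_transform(df.tweets)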

In [15]:
BOW.shape

(165390, 1000)

In [18]:
# TF-IDF features: 
TfidfVect = TfidfVectorizer(max_df = 0.90, min_df = 2, max_features = 1000, stop_words='english')
Tfidf = TfidfVect.fit_transform(df.tweets)

In [19]:
Tfidf.shape

(165390, 1000)

In [23]:
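# Word2Vec features:
tokenize_tweet = df.tweets.apply(lambda x: x.split())

# Passing the corpus to the constructor builds the vocabulary and runs an initial training pass.
model_W2V = gensim.models.Word2Vec(tokenize_tweet,
                                   vector_size = 200, # dimensionality of the word vectors
                                   window =  5, # context window size (the default)
                                   min_count = 2, # ignore words seen fewer than 2 times
                                   sg = 1, # 1 for skip-gram, 0 for CBOW
                                   hs = 0, # no hierarchical softmax
                                   negative = 10, # number of negative samples per positive example
                                   workers = 2,  # number of worker threads
                                   seed = 34 )

# Continue training for 20 more epochs on the same corpus.
model_W2V.train(tokenize_tweet, total_examples = len(df.tweets), epochs = 20)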

(32281683, 38528160)
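
To illustrate what `sg = 1` and `window` mean, the sketch below builds the (center word, context word) pairs that a skip-gram model learns to predict. It is a simplified illustration, not gensim's implementation: the helper `skipgram_pairs` is hypothetical, and real training also involves details such as negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy tokenized tweet (hypothetical)
pairs = skipgram_pairs(["chatgpt", "write", "good", "code"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window, such as the 5 used above, produces more pairs per word and captures broader context.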

In [37]:
# Each word has its own vector. A tweet can be represented by the average of its
# word vectors (their sum divided by the number of words found) or by the plain sum.

def word2vec_tweet(tokens, size):
    """Return the average Word2Vec vector of a tokenized tweet, shape (1, size)."""
    vector = np.zeros((1, size))
    vector_cnt = 0
    for word in tokens:
        try:
            vector += model_W2V.wv[word].reshape((1, size))
            vector_cnt += 1

        except KeyError:
            # the word is not in the model vocabulary (e.g. dropped by min_count)
            print(word, 'not found')

    if vector_cnt != 0:
        vector = vector / vector_cnt  # average over the words that were found

    return vector
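
The averaging step can be checked on a tiny example without a trained model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `model_W2V.wv`, and the helper `average_vector` is an illustrative stand-in for the function above, not a replacement for it.

```python
import numpy as np

# Hypothetical toy embeddings standing in for a trained model's vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "code": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "unseenword" has no vector, so only "good" and "code" are averaged
tweet_vec = average_vector(["good", "code", "unseenword"], toy_vectors, size=3)
empty_vec = average_vector(["unseenword"], toy_vectors, size=3)
print(tweet_vec)
print(empty_vec)
```

Out-of-vocabulary words are skipped rather than counted, so they do not pull the average toward zero.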

In [32]:
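tweet_arr = np.zeros((len(tokenize_tweet), 200))

# Iterate by position, not by label: dropna() left gaps in the index, so
# tokenize_tweet[i] looks up label i. That is the likely cause of the
# "KeyError: 84349" this cell originally raised.
for i, tokens in enumerate(tokenize_tweet):
    tweet_arr[i, :] = word2vec_tweet(tokens, 200)

tweet_vec_df = pd.DataFrame(tweet_arr)
tweet_vec_df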
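
Label-based and positional indexing behave differently once `dropna()` has removed rows, which is worth keeping in mind for any loop over `range(len(...))`. A minimal sketch on a toy Series (hypothetical data, not the tweets dataset):

```python
import pandas as pd

# Three rows, the middle one missing; dropna() removes it but keeps labels 0 and 2
s = pd.Series(["a b", None, "c d"]).dropna()
tokens = s.apply(lambda x: x.split())

print(list(tokens.index))  # the index labels that remain after dropna()

# Positional access works for every i in range(len(tokens))
by_position = tokens.iloc[1]

# Label access fails for a label that was dropped
try:
    tokens[1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(by_position, label_lookup_failed)
```

Using `.iloc`, iterating directly over the Series, or calling `reset_index(drop=True)` after `dropna()` all avoid the mismatch.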