## TF-IDF

In this notebook, we compute TF-IDF features so that we can later use them in our model. The goal is to identify the frequent words that may play a role in a comment receiving a high score.

### Set-up

In [None]:
#!pip install wordcloud

In [None]:
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from wordcloud import WordCloud

Input your repository path here:

In [None]:
repsource = "C:/Users/s1027177/OneDrive - Syngenta/Documents/FOAD/au_secours/"

In [None]:
# Load the cleaned comments dataframe; the file is closed automatically.
with open(repsource + "df_body_cleaned", "rb") as file1:
    df_cleaned = pickle.load(file1)

### TF-IDF

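In traditional NLP, we represent documents with word-occurrence vectors. TF-IDF weights each term's count in a comment (term frequency) by the inverse of the number of comments that contain the term (inverse document frequency), so that words appearing in nearly every comment carry less weight.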
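To make the weighting concrete, here is a minimal hand-computed sketch on a toy corpus. It uses the smoothed IDF variant that scikit-learn applies by default, and it skips the length normalization that `TfidfVectorizer` also performs. The corpus and the names `toy_comments`, `smoothed_idf` and `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string stands in for one cleaned comment.
toy_comments = ["good game", "good good movie", "bad movie"]
tokenized = [comment.split() for comment in toy_comments]
n_docs = len(tokenized)

# Document frequency: number of comments that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, the default variant in scikit-learn.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_weights(tokens):
    # Raw term count times IDF (no length normalization in this sketch).
    counts = Counter(tokens)
    return {term: count * smoothed_idf(term) for term, count in counts.items()}

weights = [tfidf_weights(tokens) for tokens in tokenized]
```

In the first toy comment, `game` receives a larger weight than `good` because it appears in fewer comments.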

TfidfVectorizer parameter values:

* Filter out words that appear in fewer than 1% of the comments (`min_df=0.01`)
* Use IDF weighting (`use_idf=True`, the default)
* Keep only the `number_of_dimensions` most frequent words (`max_features`)
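The effect of these parameters can be checked on a small example. The sketch below uses a made-up four-comment corpus with deliberately exaggerated values (`min_df=0.5`, `max_features=2`) so that the filtering is visible; the names `toy_comments`, `toy_tfidf` and `toy_matrix` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned comment bodies.
toy_comments = [
    "great post thanks",
    "great point",
    "thanks for sharing",
    "great thanks",
]

# min_df=0.5 keeps terms found in at least half of the comments;
# max_features=2 then keeps the 2 most frequent of the remaining terms.
toy_tfidf = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 1),
    max_features=2,
    min_df=0.5,
)
toy_matrix = toy_tfidf.fit_transform(toy_comments)

# Terms that survived the filtering, in alphabetical order.
toy_vocabulary = sorted(toy_tfidf.vocabulary_)
print(toy_vocabulary)
print(toy_matrix.shape)
```

Only `great` and `thanks` occur in at least half of the toy comments, so the matrix has one row per comment and two columns.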

In [None]:
number_of_dimensions = 1000

In [None]:
tfidf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 1),
    max_features=number_of_dimensions, 
    min_df=0.01)

In [None]:
def test_split(dataset):
    # Test rows are the ones whose `ups` score is missing.
    return dataset[dataset.ups.isna()]

def train_split(dataset):
    # Train rows are the ones with a known `ups` score.
    return dataset[dataset.ups.notna()]

In [None]:
Train_DF = train_split(df_cleaned)
Test_DF = test_split(df_cleaned)

In [None]:
print(Train_DF.shape)
print(Test_DF.shape)

In [None]:
train_text = Train_DF['body']
test_text = Test_DF['body']
#all_text = pd.concat([train_text, test_text])
#all_text.shape

Fit TF-IDF on the pre-processed `body` variable of the train set to learn the vocabulary, then return the document-term matrix for both the train and test sets. We save them in a pickle file because computing them is time- and memory-consuming (RAM).

In [None]:
tfidf.fit(train_text)
train_word_features = tfidf.transform(train_text)
test_word_features = tfidf.transform(test_text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = tfidf.get_feature_names_out()

In [None]:
# Both matrices go into the same file, train first, then test.
with open(repsource + "word_features", "wb") as file2:
    pickle.dump(train_word_features, file2)
    pickle.dump(test_word_features, file2)
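Because both matrices are dumped into the same file, they have to be read back with two `pickle.load` calls in the same order. The sketch below shows the round trip with small stand-in objects and a temporary path; `train_stub`, `test_stub` and `path` are hypothetical names used only for this illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for the two matrices; any picklable objects behave the same way.
train_stub = {"name": "train", "rows": 3}
test_stub = {"name": "test", "rows": 2}

# Hypothetical temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "word_features")

with open(path, "wb") as handle:
    pickle.dump(train_stub, handle)  # first object written
    pickle.dump(test_stub, handle)   # second object appended to the same stream

with open(path, "rb") as handle:
    loaded_train = pickle.load(handle)  # objects come back in write order
    loaded_test = pickle.load(handle)
```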

Convert the matrices into dataframes, then sum the TF-IDF weights of each term across all comments in the train set to build a word cloud of the terms with the largest totals. This visualization technique represents text data: the size of each word indicates its frequency or importance.

In [None]:
train_df_features = pd.DataFrame(train_word_features.toarray(), columns=feature_names)
test_df_features = pd.DataFrame(test_word_features.toarray(), columns=feature_names)

In [None]:
# Both dataframes go into the same file, train first, then test.
with open(repsource + "word_df_features", "wb") as file6:
    pickle.dump(train_df_features, file6)
    pickle.dump(test_df_features, file6)

In [None]:
print(train_df_features.shape)
print(test_df_features.shape)

In [None]:
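# Sum each term's TF-IDF weight over all train comments and use the totals as word sizes.
term_weights = train_df_features.sum(axis=0)
Cloud = WordCloud(background_color="white", max_words=100).generate_from_frequencies(term_weights)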

In [None]:
plt.imshow(Cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
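Since the dense dataframes are memory-hungry, the per-term totals could also be computed directly on the sparse matrix. The following is a minimal sketch of that alternative, not what the notebook does; the small matrix `toy_features` and the vocabulary `toy_names` are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3-comment, 4-term TF-IDF matrix and its vocabulary.
toy_features = sparse.csr_matrix(np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.25, 0.0, 0.5, 0.25],
]))
toy_names = ["game", "people", "time", "year"]

# Column sums computed on the sparse matrix, flattened to a 1-D array.
term_totals = np.asarray(toy_features.sum(axis=0)).ravel()

# Mapping term -> summed weight, then the two terms with the largest totals.
frequencies = dict(zip(toy_names, term_totals.tolist()))
top_terms = sorted(frequencies, key=frequencies.get, reverse=True)[:2]
```

A dictionary such as `frequencies` can be passed to `WordCloud.generate_from_frequencies` in place of the summed dataframe.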

After looking at the TF-IDF matrix and this word cloud, it seems that we could have applied POS tagging to restrict TF-IDF to nouns, or extended our stop word list. Another problem is that TF-IDF does not capture the semantic meaning of the words or their associations. Word embeddings are better suited to this; we use them in the notebook `022_word_association`.
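As an illustration of the stop word idea, one possible approach is to extend scikit-learn's built-in English list and pass it to the vectorizer. This is only a sketch: the extra words in `extra_stop_words` and the toy comments are hypothetical, and the real additions would come from inspecting the word cloud.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical extra stop words chosen for illustration only.
extra_stop_words = {"just", "like"}

# stop_words expects a list, so the combined set is converted.
stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Hypothetical toy comments.
toy_comments = [
    "just like the movie",
    "like this movie a lot",
    "the plot was just fine",
]

filtered_tfidf = TfidfVectorizer(analyzer="word", stop_words=stop_words)
filtered_tfidf.fit(toy_comments)

# Vocabulary learned after removing the stop words.
kept_terms = sorted(filtered_tfidf.vocabulary_)
```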