<a href="https://colab.research.google.com/github/Tikquuss/word_embeddings/blob/main/BOW%26TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The `Bag of words` and `TF-IDF` parts are inspired by a [Coursera MOOC](https://www.coursera.org/learn/language-processing/home/week/1) that I followed recently.  


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import pickle

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Data**

In [None]:
! wget -c https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv

--2021-01-05 21:42:35--  https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/plain]
Saving to: ‘IMDb_Reviews.csv’


2021-01-05 21:42:36 (196 MB/s) - ‘IMDb_Reviews.csv’ saved [65862309/65862309]



In [None]:
df = pd.read_csv('/content/IMDb_Reviews.csv')

In [None]:
df.shape

(50000, 2)

In [None]:
df.head(5)

Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


**Summary of the dataset**

In [None]:
#df.describe()

In [None]:
df['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

# **Splitting the dataset into training and test sets**

In [None]:
X, y = df['review'].values, df['sentiment'].values

In [None]:
seed = 1234 # For reproducibility
test_ratio = 0.2

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_ratio, random_state = seed)

In [None]:
len(X_train), len(X_test),

(40000, 10000)

In [None]:
X_train[0]

"Basically, this movie is one of those rare movies you either hate and think borders on suicide as the next best thing to do, rather than having to sit through it for two hours. Or, as in my case, you see it as a kult hit, one of those movies wherein the humour, the plot, the acting, is actually very hidden but for those of us willing to go looking for it, trusting the director well, the reward is: U laugh your A.. of !! The fact that U have to find the things mentioned above, actually makes the movie even more funny, because u get the impression the director isn't even aware of how funny his movie is, which doesn't seem likely and therein lies the intelligence at the helm of this magnificient project called : Spaced Invaders !!"

# **Text preparation**

In [None]:
# Raw strings keep the backslashes in the patterns from being read as string escapes
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: the lowercased string without punctuation, other symbols and stopwords
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text

In [None]:
a = X_train[0]
X_train = [text_prepare(x) for x in X_train]
X_test = [text_prepare(x) for x in X_test]

In [None]:
print(a)
print(X_train[0])

Basically, this movie is one of those rare movies you either hate and think borders on suicide as the next best thing to do, rather than having to sit through it for two hours. Or, as in my case, you see it as a kult hit, one of those movies wherein the humour, the plot, the acting, is actually very hidden but for those of us willing to go looking for it, trusting the director well, the reward is: U laugh your A.. of !! The fact that U have to find the things mentioned above, actually makes the movie even more funny, because u get the impression the director isn't even aware of how funny his movie is, which doesn't seem likely and therein lies the intelligence at the helm of this magnificient project called : Spaced Invaders !!
basically movie one rare movies either hate think borders suicide next best thing rather sit two hours case see kult hit one movies wherein humour plot acting actually hidden us willing go looking trusting director well reward u laugh fact u find things mentione

# **Transforming text to a vector**

## **1) Bag of words**   




1. Find the *N* most frequent words in the training corpus and number them. This gives a dictionary of the most frequent words.
2. For each text in the corpus, create a zero vector of dimension *N*.
3. For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.  

Drawbacks:
- The dimension of the vectors grows with the vocabulary size.
- The vectors contain many zeros, which results in a sparse matrix.
- No information is retained about the grammar of the sentences or the order of the words in the text.


Let's try to do it for a toy example. Imagine that we have *N* = 4 and the list of the most popular words is 

    ['hi', 'you', 'me', 'are']

Then we need to number them, for example like this: 

    {'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:

    'hi how are you'

For this text we create a corresponding zero vector 

    [0, 0, 0, 0]
    
And iterate over all words, and if the word is in the dictionary, we increase the value of the corresponding position in the vector:

    'hi':  [1, 0, 0, 0]
    'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
    'are': [1, 0, 0, 1]
    'you': [1, 1, 0, 1]

The resulting vector will be 

    [1, 1, 0, 1]
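
The toy example above can be reproduced in a few lines. This is a minimal sketch that uses a plain list instead of a NumPy array; the name `toy_bag_of_words` is made up for illustration.

```python
# Toy dictionary from the example above: word -> position in the vector.
words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}

def toy_bag_of_words(text, words_to_index):
    """Count how often each dictionary word occurs in `text`."""
    vector = [0] * len(words_to_index)
    for word in text.split():
        index = words_to_index.get(word)
        if index is not None:  # words outside the dictionary are skipped
            vector[index] += 1
    return vector

print(toy_bag_of_words('hi how are you', words_to_index))  # [1, 1, 0, 1]
```

A word that occurs several times increases its coordinate several times, so the vector holds counts rather than presence flags.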



---



Use the training data to find the most common words.

**Words counts and most common words**

In [None]:
words_counts = {}
for line in X_train:
  word_list = line.split()
  for word in word_list: 
    words_counts[word] = words_counts.get(word, 0) + 1

In [None]:
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:10]
print(most_common_words)

[('br', 94212), ('movie', 66976), ('film', 60162), ('one', 41018), ('like', 31214), ('good', 22955), ('even', 19621), ('would', 19229), ('time', 18757), ('really', 18386)]
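
The most frequent token, `br`, is most likely residue of HTML line breaks (`<br />`) in the raw reviews: `text_prepare` replaces `/` with a space and deletes `<` and `>`, which leaves the bare token `br`. A minimal sketch of one way to drop these tags before calling `text_prepare` follows; the helper `strip_html_breaks` and the sample sentence are assumptions and not part of the original notebook.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG_RE = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_html_breaks(text):
    """Replace every HTML line-break tag with a single space."""
    return BR_TAG_RE.sub(' ', text)

sample = "Great film.<br /><br />Would watch again."
print(strip_html_breaks(sample))
```

Applying such a step first would keep `br` out of the dictionary and free a slot for a real word.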


In [None]:
DICT_SIZE = 10000 # size of the dictionary
WORDS_TO_INDEX = {key: rank for rank, key in enumerate(sorted(words_counts.keys(), key=lambda x: words_counts[x], reverse=True)[:DICT_SIZE], 0)}
INDEX_TO_WORDS = {y:x for x,y in WORDS_TO_INDEX.items()}

In [None]:
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        words_to_index: a dictionary that maps each known word to its position in the vector
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    for item in text.split():
        if item in words_to_index: # words outside the dictionary are ignored
            result_vector[words_to_index[item]] += 1
    return result_vector

In [None]:
my_bag_of_words(X_train[0], WORDS_TO_INDEX, DICT_SIZE)

array([0., 3., 0., ..., 0., 0., 0.])

Now apply the implemented function to all samples.  
We use [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) (Compressed Sparse Row matrix), which stores only the non-zero entries and supports fast matrix-vector products, and [scipy.sparse.vstack](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.vstack.html#scipy.sparse.vstack) to stack the sparse matrices vertically (row-wise).

In [None]:
# sparse matrix package for numeric data.
from scipy import sparse as sp_sparse 

In [None]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

In [None]:
print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

X_train shape  (40000, 10000)
X_test shape  (10000, 10000)
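
Creating one dense vector of length `DICT_SIZE` per review and stacking the results can be slow and memory-hungry. As an alternative sketch, the matrix can be built in a single pass from (row, column) coordinates. The helper `bag_of_words_matrix` and the toy inputs are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from scipy import sparse as sp_sparse

def bag_of_words_matrix(texts, words_to_index, dict_size):
    """Build the whole bag-of-words matrix in one pass (hypothetical helper)."""
    rows, cols = [], []
    for row, text in enumerate(texts):
        for word in text.split():
            col = words_to_index.get(word)
            if col is not None:  # skip words outside the dictionary
                rows.append(row)
                cols.append(col)
    data = np.ones(len(rows))
    # Repeated (row, col) pairs are summed, which yields the word counts.
    return sp_sparse.csr_matrix((data, (rows, cols)), shape=(len(texts), dict_size))

toy_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
toy_matrix = bag_of_words_matrix(['hi how are you', 'you and you'], toy_index, 4)
print(toy_matrix.toarray())
```

The second toy row contains `you` twice, so its count ends up as 2 in column 1.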


## **2) TF-IDF (Term Frequency-Inverse Document Frequency)**


TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

- *Term Frequency (TF)*: a measure of how frequently a term $t$ appears in a document $d$:
$$tf(t, d) = \frac{\text{number of times the term } t \text{ appears in the document } d}{\text{number of terms in the document } d}$$

- *Inverse Document Frequency (IDF)*: a measure of how informative a term is across the corpus. TF alone is not sufficient, because a term that appears in almost every document says little about any one of them.

$$idf(t) = \log \bigg( \frac{\text{number of documents}}{\text{number of documents containing the term } t} \bigg)$$

- We can now compute the TF-IDF score of each term in each document. Terms with a higher score are more important to that document, and those with a lower score are less important.

$$tf\_idf(t, d) = tf(t, d) \cdot idf(t)$$


TF-IDF takes into account how many documents of the corpus contain each word. It penalizes words that are too frequent and provides a better feature space. 
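
The three formulas can be checked on a tiny corpus. This is a minimal sketch in plain Python; the tokenized corpus is made up for illustration, and `idf` assumes that the term occurs in at least one document.

```python
import math

def tf(term, document):
    """Share of the tokens in `document` that are equal to `term`."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# Tiny tokenized corpus, made up for illustration.
corpus = [
    ['good', 'movie'],
    ['bad', 'movie'],
    ['good', 'good', 'plot'],
]

print(tf_idf('movie', corpus[0], corpus))  # 0.5 * log(3 / 2)
print(tf_idf('plot', corpus[2], corpus))   # (1 / 3) * log(3 / 1)
```

`movie` fills half of the first document but appears in two of the three documents, so it scores lower than `plot`, which appears in a single document.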

- We use the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from *scikit-learn*. 
- We use the *train* corpus to fit the vectorizer. 
- We filter out words that are too rare (they occur in fewer than 5 reviews) and words that are too frequent (they occur in more than 90% of the reviews).
- We use bigrams along with unigrams in our vocabulary.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

How does it work?

In [None]:
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X_dummy = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)
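# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(vectorizer.get_feature_names()) 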
print(X_dummy.shape)
print(X_dummy)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483
  (1, 5)	0.5386476208856763
  (1, 1)	0.6876235979836938
  (1, 6)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 8)	0.281088674033753
  (2, 4)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 0)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 8)	0.267103787642168
  (3, 1)	0.46979138557992045
  (3, 2)	0.5802858236844359
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 8)	0.38408524091481483
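
The weights printed above do not follow the plain $tf \cdot idf$ formula. With its default settings, `TfidfVectorizer` uses the raw count as term frequency, a smoothed idf, $\ln\frac{1 + n}{1 + df(t)} + 1$, and then scales each row to unit Euclidean length. The sketch below recomputes the first row in plain Python under these assumptions; the function name `sklearn_like_tfidf` is made up for illustration.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase the text and keep tokens of two or more word characters.
docs = [re.findall(r'\b\w\w+\b', text.lower()) for text in corpus]
n_docs = len(docs)
vocabulary = sorted({token for doc in docs for token in doc})
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

def sklearn_like_tfidf(doc):
    """Raw counts times smoothed idf, scaled to unit Euclidean length."""
    weights = {
        term: doc.count(term) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in set(doc)
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_row = sklearn_like_tfidf(docs[0])
print(round(first_row['first'], 4), round(first_row['document'], 4), round(first_row['this'], 4))
```

The three rounded values should agree with the entries `(0, 2)`, `(0, 1)` and `(0, 8)` of the output above.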


In [None]:
def tfidf_features(X_train, X_test):
    """
        X_train, X_test — samples        
        return TF-IDF vectorized representation of each sample and vocabulary
    """
    # Create TF-IDF vectorizer with a proper parameters choice
    # Fit the vectorizer on the train set
    # Transform the train and test sets and return the result
    
    
    tfidf_vectorizer = TfidfVectorizer(
        lowercase = True, 
        min_df=5, 
        max_df=0.9, 
        ngram_range=(1, 2), 
        #token_pattern='(\S+)' # todo
    )
    
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    return X_train, X_test, tfidf_vectorizer, tfidf_vectorizer.vocabulary_

In [None]:
X_train_tfidf, X_test_tfidf, tfidf_vectorizer, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

In [None]:
print(tfidf_vectorizer.transform([X_train[0]]))

  (0, 141685)	0.13766236349138156
  (0, 141675)	0.0893839369132375
  (0, 141017)	0.12594728983001974
  (0, 140062)	0.037648472901724864
  (0, 135446)	0.05220524424041844
  (0, 133749)	0.09541472954660453
  (0, 133619)	0.04380564551985267
  (0, 132820)	0.1341276887781818
  (0, 128269)	0.04173390750443947
  (0, 128072)	0.05149263147285057
  (0, 127810)	0.04801861800733908
  (0, 127533)	0.13639775235602547
  (0, 127532)	0.12694346361735465
  (0, 124122)	0.09064994063599877
  (0, 119784)	0.13976290036945693
  (0, 117283)	0.07635550509899132
  (0, 112639)	0.16111250408393288
  (0, 112583)	0.05910500108832005
  (0, 111844)	0.03513215105565094
  (0, 106803)	0.11794474007926736
  (0, 103247)	0.15638535971459747
  (0, 103134)	0.05641522903117024
  (0, 103055)	0.14605616589456188
  (0, 103040)	0.08514258494068441
  (0, 101203)	0.0831683934229732
  :	:
  (0, 47650)	0.04933971841386271
  (0, 42601)	0.15857952553589774
  (0, 42544)	0.05204299770481386
  (0, 39893)	0.11698354005840478
  (0, 39699)	0

In [None]:
print('X_train_tfidf shape ', X_train_tfidf.shape)
print('X_test_tfidf shape ', X_test_tfidf.shape)

X_train_tfidf shape  (40000, 146247)
X_test_tfidf shape  (10000, 146247)


In [None]:
assert list(tfidf_vocab.keys())[:10] == list(tfidf_reversed_vocab.values())[:10], "An error occurred"
list(tfidf_vocab.keys())[:10]

['basically',
 'movie',
 'one',
 'rare',
 'movies',
 'either',
 'hate',
 'think',
 'borders',
 'suicide']

# **Classifiers**


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  

## **Exhaustive search over specified parameter values for an estimator.**

In [None]:
parameters = {'C': np.linspace(start = 0.0001, stop = 100, num=100)}

In [None]:
grid_search_mybag = GridSearchCV(LogisticRegression(), parameters, n_jobs = -1)
grid_search_tfidf = GridSearchCV(LogisticRegression(), parameters, n_jobs = -1)

In [None]:
grid_search_mybag.fit(X_train_mybag, y_train)
grid_search_tfidf.fit(X_train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


GridSearchCV(cv=None, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': array([1.00000e-04, 1.010...
       7.57576e+01, 7.67677e+01, 7.77778e+01, 7.87879e+01, 7.97980e+01,
       8.08081e+01, 8.18182e+01, 8.28283e+01, 8.38384e+01, 8.48485e+01,
       8.58586e+01, 8.68687e+01, 8.78788e+01, 8.88889e+01, 8.98990e+01,
       9.09091e+01, 9.19192e+01, 9.29293e+01, 9.39394e+01, 9.49495e+01,
    

In [None]:
print('best parameters mybag: ', grid_search_mybag.best_params_)
print('best score mybag: ', grid_search_mybag.best_score_)

print('best parameters tfidf: ', grid_search_tfidf.best_params_)
print('best score tfidf: ', grid_search_tfidf.best_score_)

best parameters mybag:  {'C': 1.0102}
best score mybag:  0.8708500000000001
best parameters tfidf:  {'C': 9.091}
best score tfidf:  0.9049250000000001
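
The grid used above is linear: its step is about 1.01, so only its first point (0.0001) lies below 1, and the best bag-of-words value is the second point. Regularization strengths are often searched on a logarithmic scale instead. The sketch below builds such a grid; the name `parameters_log` and the choice of 13 points are assumptions, not the setup used for the results in this notebook.

```python
import numpy as np

# 13 values from 1e-4 to 1e2, evenly spaced on a log scale (two per decade).
log_spaced_C = np.logspace(-4, 2, num=13)
parameters_log = {'C': log_spaced_C}
print(log_spaced_C)
```

The convergence warnings in the output above also suggest passing a larger `max_iter` to `LogisticRegression`, as the warning text itself recommends.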


In [None]:
C_mybag = 1.0102
C_tfidf = 9.091 # best C found by the grid search (not the best score)

Train the classifiers for the two data transformations: *bag-of-words* and *tf-idf*.

In [None]:
classifier_mybag = LogisticRegression(penalty = "l2", C = C_mybag, solver = "newton-cg", random_state = 0, n_jobs = -1).fit(
    X_train_mybag, y_train
)

classifier_tfidf = LogisticRegression(penalty = "l2", C = C_tfidf, solver = "newton-cg", random_state = 0, n_jobs = -1).fit(
    X_train_tfidf, 
    y_train
)

Create predictions for the test data: labels and scores.

In [None]:
y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)
y_test_predicted_scores_mybag = classifier_mybag.decision_function(X_test_mybag)

y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(X_test_tfidf)
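
For a binary logistic regression, the two outputs are closely related: the predicted label is the positive class exactly when the decision score is greater than 0, and the logistic (sigmoid) function turns the score into the estimated probability of the positive class. A minimal sketch with made-up scores:

```python
import math

def sigmoid(score):
    """Map a decision score to the estimated probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))

# Decision scores made up for illustration.
for score in (-2.0, -0.1, 0.3, 4.0):
    label = 1 if score > 0 else 0
    print(score, label, round(sigmoid(score), 3))
```

Scores far from 0 correspond to confident predictions, while scores near 0 correspond to probabilities close to 0.5.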

In [None]:
print('===== Bag-of-words : ', classifier_mybag.score(X_test_mybag, y_test))
print('===== Tfidf : ', classifier_tfidf.score(X_test_tfidf, y_test))

===== Bag-of-words :  0.8765
===== Tfidf :  0.9024


### Evaluation

To evaluate the results we will use several classification metrics:
 - [Accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
 - [F1-score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 - [Recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
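
The evaluation code in this section reports macro-averaged scores, that is, each metric is computed per class and the per-class values are then averaged without weighting. The sketch below spells this out for two classes in plain Python; the function `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, macro-averaged recall and macro-averaged F1 for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1_scores = [], []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives for cls
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives for cls
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives for cls
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1_scores.append(f1)
    return accuracy, sum(recalls) / 2, sum(f1_scores) / 2

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

On a balanced test set such as this one, macro-averaged recall coincides with accuracy, which is consistent with the nearly identical numbers reported for the two classifiers.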

 

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score  

In [None]:
def print_evaluation_scores(y, predicted):
    print("accuracy_score : ", accuracy_score(y, predicted))
    print("f1_score : ", f1_score(y, predicted, average="macro"))
    print("recall_score : ", recall_score(y, predicted, average="macro"))

In [None]:
print('===== Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('===== Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)

===== Bag-of-words
accuracy_score :  0.8765
f1_score :  0.8764981215364286
recall_score :  0.8764965821550783
===== Tfidf
accuracy_score :  0.9024
f1_score :  0.9023864082834894
recall_score :  0.9023871394374807


### **Deploy the models with Gradio**

In [None]:
! pip install gradio


In [None]:
import gradio as gr

In [None]:
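def mybag_predict(review):
    # Vectorize the cleaned review with the bag-of-words dictionary
    vec = my_bag_of_words(text_prepare(review), WORDS_TO_INDEX, DICT_SIZE)
    output = classifier_mybag.predict([vec])[0]
    return "Positive" if output == 1 else "Negative"

def tfidf_predict(review):
    # Vectorize the cleaned review with the fitted TF-IDF vectorizer
    vec = tfidf_vectorizer.transform([text_prepare(review)])
    output = classifier_tfidf.predict(vec)[0]
    return "Positive" if output == 1 else "Negative"

def predict(model_name, review):
    # The model names must match the choices offered by the dropdown of the interface
    if model_name == "Bag of word":
        return mybag_predict(review)
    elif model_name == "TD-IDF":
        return tfidf_predict(review)
    raise ValueError("Unknown model name: " + str(model_name))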



---



In [None]:
inputs = gr.inputs.Textbox(placeholder="Your review", label = "Review", lines=10)
output = gr.outputs.Textbox()
gr.Interface(fn = mybag_predict, inputs = inputs, outputs = output).launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
This share link will expire in 24 hours. If you need a permanent link, email support@gradio.app
Running on External URL: https://12241.gradio.app
Interface loading below...


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7860/',
 'https://12241.gradio.app')

In [None]:
inputs = gr.inputs.Textbox(placeholder="Your review", label = "Review", lines=10)
output = gr.outputs.Textbox()
gr.Interface(fn = tfidf_predict, inputs = inputs, outputs = output).launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
This share link will expire in 24 hours. If you need a permanent link, email support@gradio.app
Running on External URL: https://36240.gradio.app
Interface loading below...


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7861/',
 'https://36240.gradio.app')

In [None]:
inputs = gr.inputs.Textbox(placeholder="Your review", label = "Review", lines=10)
model_name = gr.inputs.Dropdown(["Bag of word", "TD-IDF"], label = "model name")
output = gr.outputs.Textbox()
gr.Interface(fn = predict, inputs = [model_name, inputs], outputs = output).launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
This share link will expire in 24 hours. If you need a permanent link, email support@gradio.app
Running on External URL: https://12189.gradio.app
Interface loading below...


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7863/',
 'https://12189.gradio.app')