# Welcome!

> Let's see what we can do together. Let's take a look at some sentiment analysis.

***To tell good sentences from bad ones, we will use TF-IDF matrices: Term Frequency - Inverse Document Frequency***


### What is it?
Let's say we have three sentences (our documents), for example: **"I love this cat"**, **"I love this dog"** and **"I hate the movie"**.
In this case, the frequency of "love" is 1 in the first sentence, 1 in the second and 0 in the third.
Its global frequency is 2.
Now let's take a look at "I": its frequency is 1 in each of the three sentences.
The global frequency of "I" over the whole corpus (3 sentences) is 3.

### Does it mean "I" is more important than "love"?
No, you're right, "I" is not more important. To discount words that appear in almost every document, we multiply the TF (term frequency) by the IDF (inverse document frequency):

***TF * log(N/df)***

N = 3 (because there are 3 sentences)
* For "I":
TF = 1 for the first sentence
df = 3 because "I" appears in all 3 documents

**TF-IDF("I") = 1 * log(3/3) = 0**, therefore "I" carries no useful information in the first document.

* For "love":
TF = 1 for the first sentence
df = 2 because "love" appears in 2 of the 3 documents

**TF-IDF("love") = 1 * log10(3/2) ≈ 0.18** (using a base-10 logarithm), therefore "love" is a little valuable in the first sentence.
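The arithmetic above can be checked in a few lines. This is only a minimal sketch: it assumes a base-10 logarithm (so that log10(3/2) ≈ 0.176) and plain whitespace tokenization, and the helper name `tf_idf` is hypothetical, not something defined later in this notebook.

```python
import math

docs = ["I love this cat", "I love this dog", "I hate the movie"]
tokenized = [d.split() for d in docs]

def tf_idf(word, doc_index):
    """TF * log10(N / df) for `word` in the document at `doc_index`.

    Assumes `word` appears in at least one document (df > 0).
    """
    tf = tokenized[doc_index].count(word)                   # count in this document
    df = sum(1 for tokens in tokenized if word in tokens)   # documents containing the word
    return tf * math.log10(len(tokenized) / df)

score_i = tf_idf("I", 0)        # "I" is in every document, so log10(3/3) = 0
score_love = tf_idf("love", 0)  # "love" is in 2 of 3 documents
print(score_i, round(score_love, 3))
```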

***PS: this approach is quite crude, as it does not take any context into account.***


In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

### Below are some bad and good sentences

We will use these sentences to create two TF-IDF matrices: one good, one bad.
Later, we will compare any sentence whose sentiment we want to predict with the two matrices, and check which of the two has the highest cosine similarity with it in terms of TF-IDF values.
If you don't understand yet, no worries, keep going!

In [2]:
bad_sentences = [
    ['I', 'hate', 'this'],
    ['I', 'don\'t', 'like', 'it'],
    ['I', 'hate', 'that'],
    ['What', 'a', 'disaster'],
    ['Not', 'good'],
    ['It', 'is', 'bad'],
    ['I', 'am', 'not', 'happy'],
    ['I', 'am', 'unhappy'],
    ['I', 'am', 'sad'],
    ['I', 'am', 'mad'],
    ['I', 'am', 'angry'],
    ['I', 'am', 'not', 'glad'],
    ['I', 'am', 'not', 'pleased'],
    ['I', 'am', 'not', 'satisfied'],
    ['I', 'am', 'not', 'content'],
    ['I', 'am', 'not', 'cheerful'],
    ['I', 'am', 'not', 'delighted'],
    ['I', 'am', 'not', 'joyful'],
    ['I', 'am', 'not', 'joyous'],
    ['I', 'am', 'not', 'jubilant'],
    ['I', 'am', 'not', 'ecstatic'],
    ['I', 'am', 'not', 'elated'],
    ['I', 'am', 'not', 'overjoyed'],
    ['I', 'am', 'not', 'thrilled'],
    ['I', 'am', 'not', 'excited'],
    ['I', 'am', 'not', 'exhilarated'],
    ['I', 'am', 'not', 'euphoric'],
    ['I', 'am', 'not', 'blissful'],
    ['I', 'am', 'not', 'cheery'],
    ['I', 'am', 'not', 'chipper'],
    ['I', 'am', 'not', 'contented'],
    ['I', 'am', 'not', 'enjoying'],
    ['I', 'am', 'not', 'glad'],
    ['I', 'am', 'not', 'gratified'],
    ['I', 'am', 'not', 'gratifying'],
    ['I', 'am', 'not', 'happy'],
    ['I', 'am', 'not', 'joyous'],
    ['I', 'am', 'not', 'jubilant'],
    ['I', 'am', 'not', 'pleased'],
    ['I', 'am', 'not', 'pleasing'],
    ['I', 'am', 'not', 'satisfied']
]

good_sentences = [
    ['I', 'love', 'this'],
    ['I', 'like', 'it'],
    ['I', 'love', 'it'],
    ['What', 'a', 'wonderful', 'day'],
    ['Good'],
    ['It', 'is', 'good'],
    ['I', 'am', 'happy'],
    ['I', 'am', 'glad'],
    ['I', 'am', 'pleased'],
    ['I', 'am', 'satisfied'],
    ['I', 'am', 'content'],
    ['I', 'am', 'cheerful'],
    ['I', 'am', 'delighted'],
    ['I', 'am', 'joyful'],
    ['I', 'am', 'joyous'],
    ['I', 'am', 'jubilant'],
    ['I', 'am', 'ecstatic'],
    ['I', 'am', 'elated'],
    ['I', 'am', 'overjoyed'],
    ['I', 'am', 'thrilled'],
    ['I', 'am', 'excited'],
    ['I', 'am', 'exhilarated'],
    ['I', 'love', 'that', 'a', 'lot'],
    ['I', 'am', 'euphoric'],
    ['I', 'am', 'pleased'],
    ['I', 'am', 'pleasing'],
    ['I', 'am', 'satisfied']

]

### 1. Term Frequency
As we said earlier, we will first calculate the term frequency of each word, that is, how many times it appears across all the sentences of a set.

->TIPS:
- Create an empty dictionary
- Loop over each sentence
- Loop over each word in the sentence
- If the word is already in the dictionary, add 1 to the value
- If the word is not in the dictionary, add the word as a key and set the value to 1

In [3]:
def term_frequency(tokenized_sentences):
    # Map each word to the number of times it appears across all sentences.
    d = {}

    for phrase in tokenized_sentences:
        for mot in phrase:
            if mot in d:
                d[mot] += 1
            else:
                d[mot] = 1

    return d


In [4]:
tf_bad = term_frequency(bad_sentences)
tf_good = term_frequency(good_sentences)
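The same counting can also be written with the standard library. This is just an alternative sketch, not a replacement for the function above; the name `term_frequency_counter` and the `sample` data are made up for the illustration.

```python
from collections import Counter
from itertools import chain

def term_frequency_counter(tokenized_sentences):
    """Count every word across all sentences (same result as a manual dict loop)."""
    # chain.from_iterable flattens the list of sentences into one stream of words.
    return dict(Counter(chain.from_iterable(tokenized_sentences)))

sample = [["I", "love", "this"], ["I", "love", "it"]]
counts = term_frequency_counter(sample)
print(counts)
```

Like a plain dictionary, the result keeps words in the order they were first seen, which matters here because the position of a word in the dictionary is later used as its column index.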

In [5]:
# Upper bound on the vocabulary size: words present in both sets are counted twice.
global_shape = len(tf_bad) + len(tf_good)

### 2. TF Matrix
Now that we have the term frequency of each word, we will create a matrix with one row per sentence and global_shape columns, where each word of the dictionary has its own column.

->TIPS:
- Create an empty matrix (list of lists)
- Loop over each sentence
- Create a zero-filled vector (list) of size global_shape (this is because we are going to create two TF-IDF matrices: one for good sentences and one for bad sentences. The global_shape will be the same for both matrices)
- Loop over each word in the sentence
- If the word is in the dictionary, store the number of times the word appears in the sentence at the index of that word in the dictionary
- If the word is not in the dictionary, leave the vector unchanged (its entries are already 0)
- Add the vector to the matrix
- Return the matrix

In [6]:
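def tf_matrix(tokenized_sentences, tf):
    # Column index of each word = its position in the tf dictionary.
    index = {mot: i for i, mot in enumerate(tf)}
    matrice = []
    for phrase in tokenized_sentences:
        # One row per sentence, padded with zeros up to global_shape columns.
        v = [0] * global_shape
        for mot in phrase:
            if mot in index:
                # Number of occurrences of the word in this sentence.
                v[index[mot]] = phrase.count(mot)
            # Unknown words are skipped so that every row keeps the same length.
        matrice.append(v)
    return matrice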


In [13]:
tf_matrix_bad = tf_matrix(bad_sentences, tf_bad)
tf_matrix_good = tf_matrix(good_sentences, tf_good)
print(tf_matrix_bad)

[[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
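To see the row and column layout on something small, here is a sketch on the three toy sentences from the introduction. It assumes NumPy and a single shared word-to-column dictionary; the helper names `build_vocab` and `count_matrix` are hypothetical and differ from the `tf_matrix` function above, which pads every row to `global_shape`.

```python
import numpy as np

toy_sentences = [
    ["I", "love", "this", "cat"],
    ["I", "love", "this", "dog"],
    ["I", "hate", "the", "movie"],
]

def build_vocab(tokenized_sentences):
    """Give each distinct word a column index, in order of first appearance."""
    vocab = {}
    for sentence in tokenized_sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

def count_matrix(tokenized_sentences, vocab):
    """One row per sentence, one column per word, cells hold word counts."""
    matrix = np.zeros((len(tokenized_sentences), len(vocab)), dtype=int)
    for row, sentence in enumerate(tokenized_sentences):
        for word in sentence:
            if word in vocab:  # words outside the vocabulary are ignored
                matrix[row, vocab[word]] += 1
    return matrix

toy_vocab = build_vocab(toy_sentences)
toy_tf = count_matrix(toy_sentences, toy_vocab)
print(toy_tf.shape)
print(toy_tf)
```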

### 3. IDF Matrix
Now that we have the TF matrix, we will create the IDF matrix.

->TIPS:
- Create an empty vector (list)
- Loop over each column (word) of the tf_matrix
- Create a counter
- Loop over each row (sentence) of the tf_matrix
- If the word is in the sentence (non-zero value), add 1 to the counter
- Add log(number of sentences / (counter + 1)) to the vector
- Return the vector


In [14]:
def idf_matrix(tf_matrix):
    IDF = []
    # Outer loop: one column per word. Inner loop: one row per sentence.
    for col in range(len(tf_matrix[0])):
        nb_phrase = 0  # number of sentences containing this word
        for row in range(len(tf_matrix)):
            if tf_matrix[row][col] != 0:
                nb_phrase += 1
        # The +1 avoids a division by zero for the unused padding columns.
        IDF.append(np.log(len(tf_matrix) / (nb_phrase + 1)))

    print("IDF matrix shape: \n", np.array(IDF).shape)
    print("IDF matrix: \n", np.array(IDF))
    return IDF

In [15]:
idf_matrix_bad = idf_matrix(tf_matrix_bad)
idf_matrix_good = idf_matrix(tf_matrix_good)

IDF matrix shape: 
 (80,)
IDF matrix: 
 [0.05001042 2.61495978 3.02042489 3.02042489 3.02042489 3.02042489
 3.02042489 3.02042489 3.02042489 3.02042489 3.02042489 3.02042489
 3.02042489 3.02042489 3.02042489 0.13005313 0.24783616 2.61495978
 3.02042489 3.02042489 3.02042489 3.02042489 2.61495978 2.61495978
 2.61495978 3.02042489 3.02042489 3.02042489 3.02042489 2.61495978
 2.61495978 3.02042489 3.02042489 3.02042489 3.02042489 3.02042489
 3.02042489 3.02042489 3.02042489 3.02042489 3.02042489 3.02042489
 3.02042489 3.02042489 3.02042489 3.02042489 3.71357207 3.71357207
 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207
 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207
 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207
 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207
 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207 3.71357207
 3.71357207 3.71357207]
IDF matrix shape: 
 (80,)
IDF matrix: 
 [0.
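For comparison, the same smoothed IDF, log(N / (count + 1)), can be computed without explicit loops. This is a sketch assuming NumPy; the name `idf_vector` and the tiny `toy_tf` matrix are invented for the example.

```python
import numpy as np

def idf_vector(tf_rows):
    """Smoothed IDF per column: log(n_sentences / (document_frequency + 1))."""
    tf = np.asarray(tf_rows)
    n_sentences = tf.shape[0]
    # Number of rows (sentences) with a non-zero count, for each column (word).
    doc_freq = np.count_nonzero(tf, axis=0)
    return np.log(n_sentences / (doc_freq + 1))

# 3 sentences, 3 columns: word 0 is everywhere, word 1 in two sentences, column 2 unused.
toy_tf = [[1, 1, 0],
          [1, 0, 0],
          [1, 1, 0]]
toy_idf = idf_vector(toy_tf)
print(toy_idf)
```

Note that with this smoothing a word present in every sentence gets a slightly negative value (log(3/4) here), and an unused column gets the largest value (log(3/1)).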

### 4. TF-IDF Matrix
Now that we have the TF matrix and the IDF matrix, we will create the TF-IDF matrix.

->TIPS:
- Create an empty matrix (list of lists)
- Loop over each row (sentence) of the TF matrix
- Create an empty vector (list)
- Loop over each column (word) of the row
- Multiply the TF value by the IDF value of that word
- Add the value to the vector
- Add the vector to the matrix
- Return the matrix

In [44]:
def tf_idf_matrix(tf_matrix, idf_matrix):
    matrice = []
    for i in range(len(tf_matrix)):
        vect = []
        for mot in range(len(tf_matrix[0])):
            # Weight the count of each word by its IDF.
            vect.append(tf_matrix[i][mot] * idf_matrix[mot])
        matrice.append(vect)
    print("TF-IDF matrix shape: \n", np.array(matrice).shape)
    print("TF-IDF matrix: \n", np.array(matrice))
    return matrice

In [45]:
tf_idf_matrix_bad = tf_idf_matrix(tf_matrix_bad, idf_matrix_bad)
tf_idf_matrix_good = tf_idf_matrix(tf_matrix_good, idf_matrix_good)

TF-IDF matrix shape: 
 (41, 80)
TF-IDF matrix: 
 [[0.05001042 2.61495978 3.02042489 ... 0.         0.         0.        ]
 [0.05001042 0.         0.         ... 0.         0.         0.        ]
 [0.05001042 2.61495978 0.         ... 0.         0.         0.        ]
 ...
 [0.05001042 0.         0.         ... 0.         0.         0.        ]
 [0.05001042 0.         0.         ... 0.         0.         0.        ]
 [0.05001042 0.         0.         ... 0.         0.         0.        ]]
TF-IDF matrix shape: 
 (27, 80)
TF-IDF matrix: 
 [[0.07696104 1.9095425  2.60268969 ... 0.         0.         0.        ]
 [0.07696104 0.         0.         ... 0.         0.         0.        ]
 [0.07696104 1.9095425  0.         ... 0.         0.         0.        ]
 ...
 [0.07696104 0.         0.         ... 0.         0.         0.        ]
 [0.07696104 0.         0.         ... 0.         0.         0.        ]
 [0.07696104 0.         0.         ... 0.         0.         0.   
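The element-wise product of each row with the IDF vector is exactly what NumPy broadcasting does. Here is a small sketch with made-up numbers (`toy_tf` and `toy_idf` are illustrative values, not results from this notebook):

```python
import numpy as np

# 2 sentences x 3 words (counts), and one IDF weight per word.
toy_tf = np.array([[1, 1, 0],
                   [1, 0, 2]])
toy_idf = np.array([0.0, 0.5, 1.5])

# Broadcasting multiplies every row of toy_tf by toy_idf, column by column.
toy_tf_idf = toy_tf * toy_idf
print(toy_tf_idf)
```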

### 5. Check sentiment
Now that we have the two TF-IDF matrices, we can check the sentiment of any sentence we want.
To check the sentiment, we will calculate the cosine similarity between the sentence we want to check and the two TF-IDF matrices.
The TF-IDF matrix with the highest cosine similarity is the one closest to the sentence we want to check.

For example, for the sentence "I love this cat", the cosine similarity with the good TF-IDF matrix could be 0.5 and the cosine similarity with the bad TF-IDF matrix could be 0.2.
In this case, the sentence "I love this cat" will be considered as a good sentence.
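If cosine similarity is new to you, it is the dot product of two vectors divided by the product of their lengths. Below is a minimal sketch with NumPy; the function name `cosine` and the three toy vectors are assumptions made for the illustration, and returning 0.0 for an all-zero vector is a choice made here.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: (u . v) / (|u| * |v|), or 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

query = [1.0, 2.0, 0.0]
good_row = [1.0, 2.0, 0.0]   # same direction as the query
bad_row = [0.0, 0.0, 3.0]    # shares no non-zero column with the query

sim_good = cosine(query, good_row)
sim_bad = cosine(query, bad_row)
print(sim_good, sim_bad)
```

Vectors pointing in the same direction score close to 1, and vectors with no word in common score 0, whatever their lengths.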

->TIPS:
- Create a function that takes a query as an input (a sentence)
- Create a zero-filled vector (list) of size global_shape (this is because we are going to create two TF-IDF matrices: one for good sentences and one for bad sentences. The global_shape will be the same for both matrices)
- Combine all words from both tf_bad and tf_good as a set: **all_words = set(list(tf_bad.keys()) + list(tf_good.keys()))**
- Loop over each word in the set
- If the word is in the query:
    - If the word is in tf_bad:
        - Store the TF-IDF value (count of the word in the query * idf_matrix_bad value) in the vector, at the index of the word in tf_bad
    - If the word is in tf_good:
        - Store the TF-IDF value (count of the word in the query * idf_matrix_good value) in the vector, at the index of the word in tf_good
- Calculate the cosine similarity between the query vector and the good TF-IDF matrix: **cosine_similarity([query_vector], tf_idf_matrix_good)[0][0]**
- Calculate the cosine similarity between the query vector and the bad TF-IDF matrix: **cosine_similarity([query_vector], tf_idf_matrix_bad)[0][0]**
- If the cosine similarity with the good TF-IDF matrix is higher than the cosine similarity with the bad TF-IDF matrix:
    - Print the query with a smiley (or whatever :p)
- Else:
    - Print the query with a sad smiley (or whatever :p)

In [46]:
good_query = "WOW, I am so happy to meet you!"
bad_query = "I hate it a lot"

In [47]:
def check_sentiment(query):
    # Whitespace tokenization: punctuation stays attached to words ("WOW,", "you!").
    query_mat = query.split()
    vect = [0] * global_shape
    bad_words = list(tf_bad.keys())
    good_words = list(tf_good.keys())
    all_words = set(bad_words + good_words)
    for mot in all_words:
        if mot in query_mat:
            if mot in tf_bad:
                ind = bad_words.index(mot)
                vect[ind] = query_mat.count(mot) * idf_matrix_bad[ind]
            elif mot in tf_good:
                # This index comes from the good vocabulary, whose column order differs from the bad one.
                ind = good_words.index(mot)
                vect[ind] = query_mat.count(mot) * idf_matrix_good[ind]
    # [0][0] keeps only the similarity with the first sentence of each matrix.
    coss_good = cosine_similarity([vect], tf_idf_matrix_good)[0][0]
    coss_bad = cosine_similarity([vect], tf_idf_matrix_bad)[0][0]

    # Ties are counted as good.
    if coss_good >= coss_bad:
        print(query, ":)")
    else:
        print(query, ":'(")


In [48]:
check_sentiment(good_query)
check_sentiment(bad_query)

WOW, I am so happy to meet you! :)
I hate it a lot :'(
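As a possible next step, here is a sketch of a variant that uses one shared word-to-column dictionary for both matrices, strips punctuation from the query, and compares the query with every row instead of only the first. This is not the method used in this notebook: the helper names (`tokenize`, `vectorize`, `best_similarity`), the tiny `vocab` and the weights in `good_rows` and `bad_rows` are all made up for the illustration.

```python
import string
import numpy as np

def tokenize(query):
    """Split on whitespace and strip punctuation around each token."""
    tokens = (t.strip(string.punctuation) for t in query.split())
    return [t for t in tokens if t]

def vectorize(tokens, vocab):
    """Count vector of the query over a vocabulary shared by both matrices."""
    vec = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:  # words outside the vocabulary are ignored
            vec[vocab[t]] += 1
    return vec

def best_similarity(query_vec, matrix):
    """Highest cosine similarity between the query and any row of the matrix."""
    m = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(m, axis=1) * np.linalg.norm(query_vec)
    # Rows (or a query) with zero norm get a similarity of 0 instead of NaN.
    sims = np.divide(m @ query_vec, norms, out=np.zeros(len(m)), where=norms > 0)
    return float(sims.max())

# Toy shared vocabulary and hand-written weights (illustrative values only).
vocab = {"I": 0, "love": 1, "hate": 2, "it": 3}
good_rows = [[0.1, 1.0, 0.0, 0.0], [0.1, 1.0, 0.0, 0.5]]
bad_rows = [[0.1, 0.0, 1.0, 0.0], [0.1, 0.0, 1.0, 0.5]]

tokens = tokenize("I hate it!")
query_vec = vectorize(tokens, vocab)
best_good = best_similarity(query_vec, good_rows)
best_bad = best_similarity(query_vec, bad_rows)
label = "good" if best_good >= best_bad else "bad"
print(tokens, label)
```

With a shared vocabulary, column `i` means the same word in the query, in the good matrix and in the bad matrix, so the cosine similarities compare like with like.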
