# TfidfVectorizer

**Raw word counts, as produced by CountVectorizer, are a good starting point, but they are very basic.**

One problem with raw counts is that words such as "the", "and", "or", and "there" appear many times. Their large counts carry little meaning in the encoded vectors and can lead the model to give these frequent words more weight than rarer, more informative ones.

An alternative is TF-IDF. The first part, TF (Term Frequency), measures how often a word appears in a document. The second part, IDF (Inverse Document Frequency), scales down words that appear in many documents.
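To make the two parts concrete, here is a minimal pure-Python sketch on a toy English corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and skips the final L2 normalisation step. The corpus and the helper names (`docs`, `idf`, `tf_idf`) are illustrative and not part of this notebook.

```python
import math
from collections import Counter

# Toy corpus (illustrative, not from the review data set)
docs = [
    "the phone is great",
    "the battery is weak",
    "the screen is great and the price is fair",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tf_idf(tokens):
    # Raw term count times IDF; L2 normalisation is omitted in this sketch
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tf_idf(tokenized[2])
print(round(idf("the"), 3), round(idf("great"), 3), round(idf("battery"), 3))
```

Here `the` occurs in every document, so its IDF is the minimum value of 1.0, while `battery` occurs in only one document and gets a larger weight per occurrence.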



In [1]:
import os
import sys
import pymongo
from time import sleep
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# splitting data
from sklearn.model_selection import train_test_split

# Features Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


In [2]:
# Import the data-cleaning module because we need some of its functions

In [3]:
sys.path.append(os.path.abspath('../scraping_cleaning'))
from cleaning_data import *

Our scraped now are 5777 Product
one product Data

 {'_id': ObjectId('5e48539b980cc3c34837af7f'), 'product_title': 'هاتف ابل ايفون 11 مع فيس تايم بشريحة واحدة وشريحة الكترونية - ذاكرة تخزين 64 جيجا، ذاكرة وصول عشوائية 4 جيجا، شبكة ال تي اي من الجيل الرابع - ارجواني', 'product_url': 'https://egypt.souq.com/eg-ar/%D9%87%D8%A7%D8%AA%D9%81-%D8%A7%D8%A8%D9%84-%D8%A7%D9%8A%D9%81%D9%88%D9%86-11-%D9%85%D8%B9-%D9%81%D9%8A%D8%B3-%D8%AA%D8%A7%D9%8A%D9%85-%D8%A8%D8%B4%D8%B1%D9%8A%D8%AD%D8%A9-%D9%88%D8%A7%D8%AD%D8%AF%D8%A9-%D9%88%D8%B4%D8%B1%D9%8A%D8%AD%D8%A9-%D8%A7%D9%84%D9%83%D8%AA%D8%B1%D9%88%D9%86%D9%8A%D8%A9-%D8%B0%D8%A7%D9%83%D8%B1%D8%A9-%D8%AA%D8%AE%D8%B2%D9%8A%D9%86-64-%D8%AC%D9%8A%D8%AC%D8%A7-%D8%B0%D8%A7%D9%83%D8%B1%D8%A9-%D9%88%D8%B5%D9%88%D9%84-%D8%B9%D8%B4%D9%88%D8%A7%D8%A6%D9%8A%D8%A9-4-%D8%AC%D9%8A%D8%AC%D8%A7-%D8%B4%D8%A8%D9%83%D8%A9-%D8%A7%D9%84-%D8%AA%D9%8A-%D8%A7%D9%8A-%D9%85%D9%86-%D8%A7%D9%84%D8%AC%D9%8A%D9%84-%D8%A7%D9%84%D8%B1%D8%A7%D8%A8%D8%B9-%D8%A7%D8%B1%D8%AC%D9%88%D8%A7%

## Read our classified file

In [4]:
df_file = pd.read_csv('../csv_files/file_classified_reviews_updated.csv')
df_file.head()

Unnamed: 0,Arabic Reviews,polarity
0,ممتاز صراح مريح خاصت في خرج لي من غير عرب,1
1,عمل,1
2,جيد وع,0
3,حل,1
4,سرير او محجر كما سمي ممتاز وم عب امان خاص اطفا...,1


In [5]:
print("The number of reviews in our data set is: ", len(df_file))

The number of reviews in our data set is:  6057


**Shuffle the data**

In [6]:
df_file = df_file.sample(frac=1).reset_index(drop=True)

In [7]:
# Preview the shuffled data
df_file.head()

Unnamed: 0,Arabic Reviews,polarity
0,جيد حجم سط ليس بير شكل نفس,0
1,منتج صل ضم ممتاز,1
2,صراح ناس غا في ادب نشاط,1
3,برامج العاب تنزل على ذاكره داخليه صغيره جيج لي...,1
4,موبايل ممتاز خام ممتاز جوانب من معدن سعر رخص م...,1


## Split the data into training and testing sets

In [8]:
X = df_file['Arabic Reviews'] 
y = df_file['polarity']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)
X_train[:5]

96                 نصح بها عمل جد مريح لكن ضعيف لي حد ما 
3437             ثاب جد راءع ممتاز ذات راءح واح جد روووع 
312                                    مع جد في ضوء نهار 
5582    رام 4GB ذاكر بير قدر حط ذاكر خارج 400GB كثر من...
1612                             حجم صغر من جوال ايف بلس 
Name: Arabic Reviews, dtype: object

In [10]:
print("Our training data now are: " + str(len(X_train))  + " Reviews")
print("Our testing data now are: " + str(len(X_test))  + " Reviews")
print("Our training data now are: " + str(len(y_train))  + " labels")
print("Our testing data now are: " + str(len(y_test))  + " labels")

Our training data now are: 5451 Reviews
Our testing data now are: 606 Reviews
Our training data now are: 5451 labels
Our testing data now are: 606 labels
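The split above is random, so each run yields a different test set, and the class ratio in the test set can drift from that of the full data. A possible variant, sketched here on toy labels rather than the review data, fixes `random_state` for reproducibility and passes `stratify` so both splits keep the same class proportions. The toy lists and the seed value 42 are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 16 positive and 4 negative labels (illustrative only)
texts = [f"review {i}" for i in range(20)]
labels = [1] * 16 + [0] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,     # 5 of the 20 samples go to the test set
    stratify=labels,    # keep the 80/20 class ratio in both splits
    random_state=42,    # fixed seed so the split is repeatable
)
print(len(X_tr), len(X_te), sum(y_te))
```

With stratification, the test split holds four positive reviews and one negative review, mirroring the 80/20 ratio of the toy labels.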


### Create a TfidfVectorizer object & fit the data
TfidfVectorizer provides a simple way to tokenize a **collection of text documents** like:
![example of data](../images/TfidfVectorizer_df_file.png)


build a vocabulary of known words, and encode new documents using that vocabulary like:
![example of output](../images/TfidfVectorizervocabulary_.png)

**An inverse document frequency is calculated for each word in the vocabulary. The most frequently observed words get the lowest scores; with scikit-learn's default smoothing, a word that appears in every document gets the minimum score of 1.0.**

In [11]:
def tfidf_vectorizer(df):
    '''
    Argument:
        df: Series of reviews used to fit the vocabulary and IDF weights
    Return:
        Train & test arrays that can be fed to the model
    Note:
        Reads X_train and X_test from the notebook's global scope.
    '''
    # Fit the vectorizer on all of the data passed in.
    # Caution: this includes the test reviews, so test-set vocabulary
    # and IDF statistics leak into the features.
    vectorizer = TfidfVectorizer()
    vectorizer = vectorizer.fit(df)
    word_idf_weights = vectorizer.idf_
    print("Our 10 words weights\n\n", word_idf_weights[:10])
    # Transform the train/test splits with the fitted vectorizer
    testing_data = vectorizer.transform(X_test)
    training_data = vectorizer.transform(X_train)
    # Convert the sparse matrices to dense arrays for the ML models
    training_data = training_data.toarray()
    testing_data = testing_data.toarray()
    return training_data, testing_data
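
Fitting on all of `X` lets the test reviews influence the vocabulary and the IDF weights. A common alternative, sketched below on toy documents, fits the vectorizer on the training split only and reuses it to transform the test split. It also keeps the sparse matrices, which scikit-learn classifiers such as `MultinomialNB`, `LogisticRegression` and `SVC` accept directly. The function name `vectorize_split` and the sample documents are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_split(train_docs, test_docs):
    # Learn vocabulary and IDF weights from the training documents only
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_docs)
    # Words unseen during fitting are ignored when transforming test docs
    test_matrix = vectorizer.transform(test_docs)
    return train_matrix, test_matrix, vectorizer

train_docs = ["good phone", "bad battery", "good screen"]
test_docs = ["good camera"]
train_matrix, test_matrix, vec = vectorize_split(train_docs, test_docs)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices share the same number of columns because the test split is encoded with the vocabulary learned from the training split.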

In [12]:
training_data, testing_data = tfidf_vectorizer(X)

Our 10 words weights

 [9.01598781 9.01598781 7.07007766 6.49025917 9.01598781 9.01598781
 9.01598781 8.32284063 9.01598781 9.01598781]
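
The printed weights can be checked by hand. Assuming the vectorizer was fitted on all 6057 reviews with scikit-learn's default smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, the sketch below recomputes the IDF for a given document frequency. `smoothed_idf` is an illustrative helper, not a scikit-learn function.

```python
import math

N_REVIEWS = 6057  # size of the data set reported earlier in the notebook

def smoothed_idf(doc_freq, n_docs=N_REVIEWS):
    # Assumed formula: scikit-learn's default IDF with smooth_idf=True
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

for doc_freq in (1, 3, 13, 24):
    print(doc_freq, round(smoothed_idf(doc_freq), 5))
```

Under that assumption, the value 9.01598781 corresponds to a term found in a single review and 8.32284063 to a term found in three reviews, while the lower weights 7.07007766 and 6.49025917 correspond to terms found in 13 and 24 reviews.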


In [13]:
# first dimension is the number of reviews, second is the vocabulary size
print("Our new vectorized training data: " + str(training_data.shape))
print("Our new vectorized testing data: " + str(testing_data.shape))
print("The first 2 reviews after transform: \n", testing_data[:2])

Our new vectorized training data: (5451, 5464)
Our new vectorized testing data: (606, 5464)
The first 2 reviews after transform: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Results using the naive_bayes MultinomialNB model

In [14]:
clf_MultinomialNB = MultinomialNB()

In [15]:
model = clf_MultinomialNB.fit(training_data, y_train)

In [16]:
predict = model.predict(training_data)

In [17]:
print("F1 score of our training data is: ", f1_score(y_train, predict, average='micro'))

F1 score of our training data is:  0.8460832874701889


In [18]:
print("Evaluation Matrix of training data is \n", confusion_matrix(y_train, predict))

Evaluation Matrix of training data is 
 [[ 211  837]
 [   2 4401]]


In [19]:
predict = model.predict(testing_data)

In [20]:
print("F1 score of our testing data is: ", f1_score(y_test, predict, average='micro'))

F1 score of our testing data is:  0.8415841584158416


In [21]:
print("Evaluation Matrix of testing data is \n", confusion_matrix(y_test, predict))

Evaluation Matrix of testing data is 
 [[ 10  96]
 [  0 500]]
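
For single-label classification, `f1_score(..., average='micro')` equals plain accuracy, so it hides how the minority class is doing. The sketch below derives accuracy, per-class recall and macro-averaged F1 from the test confusion matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The helper `summarize` is illustrative.

```python
def summarize(matrix):
    # matrix[i][j]: number of samples with true label i predicted as j
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
    recalls, f1s = [], []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # true members of class i
        predicted = sum(row[i] for row in matrix)  # predicted as class i
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return accuracy, recalls, sum(f1s) / len(f1s)

# MultinomialNB test confusion matrix from the output above
accuracy, recalls, macro_f1 = summarize([[10, 96], [0, 500]])
print(accuracy, recalls, macro_f1)
```

The accuracy comes out at 510/606, the same 0.8416 reported as micro F1, while recall for class 0 is only 10/106: the model labels almost every review as class 1.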


### Results using the LogisticRegression model

In [22]:
clf_LogisticRegression = LogisticRegression(penalty='l2', tol=0.00001, solver='liblinear',max_iter=1000)

In [23]:
logistic_model = clf_LogisticRegression.fit(training_data, y_train)

In [24]:
predict = logistic_model.predict(training_data)

In [25]:
print("F1 score of our training data is: ", f1_score(y_train, predict, average='micro'))

F1 score of our training data is:  0.8985507246376812


In [26]:
print("Evaluation Matrix of training data is \n", confusion_matrix(y_train, predict))

Evaluation Matrix of training data is 
 [[ 539  509]
 [  44 4359]]


In [27]:
predict = logistic_model.predict(testing_data)

In [28]:
print("F1 score of our testing data is: ", f1_score(y_test, predict, average='micro'))

F1 score of our testing data is:  0.9042904290429042


In [29]:
print("Evaluation Matrix of testing data is \n", confusion_matrix(y_test, predict))

Evaluation Matrix of testing data is 
 [[ 54  52]
 [  6 494]]


### Results using the SVC model

In [30]:
clf_SVC = SVC(kernel='linear')

In [31]:
svc_model = clf_SVC.fit(training_data, y_train)

In [32]:
predict = svc_model.predict(training_data)

In [33]:
print("F1 score of our training data is: ", f1_score(y_train, predict, average='micro'))

F1 score of our training data is:  0.9308383782792148


In [34]:
print("Evaluation Matrix of training data is \n", confusion_matrix(y_train, predict))

Evaluation Matrix of training data is 
 [[ 727  321]
 [  56 4347]]


In [35]:
predict = svc_model.predict(testing_data)

In [36]:
print("F1 score of our testing data is: ", f1_score(y_test, predict, average='micro'))

F1 score of our testing data is:  0.9075907590759076


In [37]:
print("Evaluation Matrix of testing data is \n", confusion_matrix(y_test, predict))

Evaluation Matrix of testing data is 
 [[ 61  45]
 [ 11 489]]


In [38]:
clf_SVC = SVC(kernel='poly', degree=2)

In [39]:
svc_model = clf_SVC.fit(training_data, y_train)



In [40]:
predict = svc_model.predict(training_data)

In [41]:
print("F1 score of our training data is: ", f1_score(y_train, predict, average='micro'))

F1 score of our training data is:  0.8077416987708678


In [42]:
print("Evaluation Matrix of training data is \n", confusion_matrix(y_train, predict))

Evaluation Matrix of training data is 
 [[   0 1048]
 [   0 4403]]


In [43]:
predict = svc_model.predict(testing_data)

In [44]:
print("F1 score of our testing data is: ", f1_score(y_test, predict, average='micro'))

F1 score of our testing data is:  0.8250825082508251


In [45]:
print("Evaluation Matrix of testing data is \n", confusion_matrix(y_test, predict))

Evaluation Matrix of testing data is 
 [[  0 106]
 [  0 500]]
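
A confusion matrix whose first column is all zeros means the model predicted class 1 for every review. A quick check, using only the class counts shown in the matrices above, is to compare the reported scores with the accuracy of a majority-class baseline. `majority_baseline` is an illustrative helper.

```python
def majority_baseline(class_counts):
    # Accuracy obtained by always predicting the most frequent class
    return max(class_counts) / sum(class_counts)

train_baseline = majority_baseline([1048, 4403])  # training class counts
test_baseline = majority_baseline([106, 500])     # testing class counts
print(train_baseline, test_baseline)
```

Both values match the scores reported for the degree-2 polynomial kernel (about 0.8077 on training and 0.8251 on testing), so this model does no better than always predicting the majority class.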


In [46]:
clf_SVC = SVC(kernel='sigmoid')

In [47]:
svc_model = clf_SVC.fit(training_data, y_train)



In [48]:
predict = svc_model.predict(training_data)

In [49]:
print("F1 score of our training data is: ", f1_score(y_train, predict, average='micro'))

F1 score of our training data is:  0.8077416987708678


In [50]:
predict = svc_model.predict(testing_data)

In [51]:
print("F1 score of our testing data is: ", f1_score(y_test, predict, average='micro'))

F1 score of our testing data is:  0.8250825082508251


In [52]:
print("Evaluation Matrix of testing data is \n", confusion_matrix(y_test, predict))

Evaluation Matrix of testing data is 
 [[  0 106]
 [  0 500]]
