<a href="https://colab.research.google.com/github/PaulBarriere/TSE-NBSVM/blob/main/Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Report Project NBSVM (Naive-Bayes Support Vector Machines)
Nicolas Le Gall and Paul Barriere, Master 2 students in Econometrics and Statistics at Toulouse School of Economics.

The aim of this report is to illustrate the functions we built to implement the NB-SVM method introduced in *Baselines and Bigrams: Simple, Good Sentiment and Topic Classification*.

In [1]:
!git clone https://@github.com/PaulBarriere/TSE-NBSVM.git

Cloning into 'TSE-NBSVM'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 101 (delta 46), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (101/101), 168.25 KiB | 3.00 MiB/s, done.
Resolving deltas: 100% (46/46), done.


# 0. Execute all the .py files

*This step imports and downloads all the packages we need and defines the functions we created. Keeping them in separate .py files avoids an overloaded notebook, so this notebook stays easy to read and comfortable for users.*

In [2]:
# Functions for the data
%run TSE-NBSVM/Functions_Import_Data.py

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Functions NBSVM
%run TSE-NBSVM/Functions_NBSVM.py

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now that we have imported the functions and packages, let's have a look at the data we chose to illustrate our work.

# 1. Import data

In [4]:
liste_fichier = ['Nokia_6610.txt', 'Apex_AD2600_Progressive_scan_DVD_player.txt', 
                 'Canon_G3.txt', 'Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt', 
                 'Nikon_coolpix_4300.txt']

In [5]:
Nokia_6610_x, Nokia_6610_y, Apex_AD2600_Progressive_scan_DVD_player_x, Apex_AD2600_Progressive_scan_DVD_player_y, Canon_G3_x, Canon_G3_y, Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB_x, Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB_y, Nikon_coolpix_4300_x, Nikon_coolpix_4300_y = import_all_data(liste_fichier)

Number of reviews for this product :36
Number of reviews for this product :97
Number of reviews for this product :42
Number of reviews for this product :91
Number of reviews for this product :30


There are several products, and it could be interesting to see whether we can mix them, for example. We have imported and cleaned our data with the previous function (import_all_data); let us look at the result for the first review of Apex_AD2600_Progressive_scan_DVD_player and its grade.

In [7]:
Apex_AD2600_Progressive_scan_DVD_player_x[0]

'bought apex dvd players player p play everything died shortly getting back using player nice machines consider quality pretty low'

In [8]:
Apex_AD2600_Progressive_scan_DVD_player_y[0]

-1.0

The samples differ. For example, the reviews of Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB are much longer on average than those of Apex_AD2600_Progressive_scan_DVD_player.

In [9]:
products_x = [Nokia_6610_x, Apex_AD2600_Progressive_scan_DVD_player_x, Canon_G3_x, Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB_x, Nikon_coolpix_4300_x]
for name, reviews in zip(liste_fichier, products_x):
  lengths = [len(review) for review in reviews]  # review lengths in characters
  avg_mean = round(np.mean(lengths))
  avg_std = round(np.std(lengths))
  print('The average number of character in {} is {}. And the standard deviation is {}'.format(name, avg_mean, avg_std))

The average number of character in Nokia_6610.txt is 779. And the standard deviation is 648
The average number of character in Apex_AD2600_Progressive_scan_DVD_player.txt is 336. And the standard deviation is 259
The average number of character in Canon_G3.txt is 822. And the standard deviation is 703
The average number of character in Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt is 1031. And the standard deviation is 868
The average number of character in Nikon_coolpix_4300.txt is 656. And the standard deviation is 486


Let's fit the model on Apex_AD2600_Progressive_scan_DVD_player first. We split it into a train set and a test set, with 70% of the reviews in the train set.

In [6]:
train_x, test_x, train_y, test_y = train_test_split(Apex_AD2600_Progressive_scan_DVD_player_x, 
                                                    Apex_AD2600_Progressive_scan_DVD_player_y, 
                                                    test_size=0.3, 
                                                    random_state=2000)

# 3. Naive Bayes: 

### Explanation and illustration of the model
As explained in the paper, this model uses a vector $V$, which is the vocabulary present in the train sample. We can choose whether to use unigrams, bigrams, or longer n-grams. Here we use both unigrams and bigrams.

In [7]:
''' Illustration on this example with 10 random elements of vector V '''
V, F, vectorization = vectorize(Train_x_sample = train_x, ngrams=(1,2))
random.sample(V,10)

['multiformat',
 'occasionally lip',
 'known problem',
 'past',
 'sharp',
 'name',
 'dvd dvd',
 'mixed bag',
 'reads quietly',
 'always']

Then we have a matrix $F$ whose columns are reviews and whose rows are the elements of the vector $V$. So $F_{i,j}$ is the number of times the element $i$ of the vector $V$ ($V_i$) appears in the review $j$.

In [8]:
''' Illustration on this example with the first 5 columns of the matrix F (i.e. the 
first 5 reviews) and 20 rows of V '''

F[30:50,:5]

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
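
The construction of $V$ and $F$ described above can be sketched in plain Python on a toy corpus. This is only an illustration under our own assumptions: the helper names `ngrams` and `build_V_and_F` and the toy reviews are hypothetical, and the sketch is not the implementation of the `vectorize` function used in this notebook.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (space-joined strings) with n_min <= n <= n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_V_and_F(reviews, n_min=1, n_max=2):
    """Build the vocabulary V and the count matrix F.

    F[i][j] is the number of times the term V[i] appears in review j,
    so rows are terms and columns are reviews.
    """
    counts = [Counter(ngrams(review.split(), n_min, n_max)) for review in reviews]
    V = sorted(set().union(*counts))
    F = [[c[term] for c in counts] for term in V]
    return V, F

# Hypothetical toy corpus (already cleaned, whitespace-tokenized)
toy_reviews = ["good player good price", "bad player"]
V_toy, F_toy = build_V_and_F(toy_reviews)
for term, row in zip(V_toy, F_toy):
    print(term, row)
```

Each printed row gives one term of the toy vocabulary followed by its counts in the two reviews, which is the layout of the slice of $F$ shown above.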

We have thus rewritten the train sample as a matrix of integers. The idea is now to compute the log-count ratio vector $R$, which is a vector of size len($V$). In order to compute $R$ we define two vectors $P$ and $Q$:
* $P = \alpha + \sum_{i:y^{(i)}=1} f^{(i)}$ with smoothing parameter $\alpha$.
* $Q = \alpha + \sum_{i:y^{(i)}=-1} f^{(i)}$ with smoothing parameter $\alpha$ (the same).

$f^{(i)}$ is the vector of occurrences of the elements of $V$ in the review $i$. So $P$ and $Q$ are basically sums of columns of $F$ (positive reviews for $P$ and negative reviews for $Q$), plus $\alpha$.



Then we define $R$ by: 
$R = \log\left(\frac{P/\| P\|_1}{Q/\| Q\|_1}\right)$

In [9]:
# We implemented a function to compute directly R:
R = get_R(alpha = .1,Train_y_sample = train_y, F = F)
R

array([-0.37430794, -0.37430794, -2.77220321, ..., -0.37430794,
       -0.37430794, -0.37430794])

This log-count ratio vector plays the role of the weight vector $W$ in the model explained in the paper:

$y^{(k)} = \text{sign}(W^T X^{(k)} + b)$, where:
* $y^{(k)}$ is the prediction for the $k^{th}$ review
* $W$ is the log-count ratio ($R$) 
* $X^{(k)}$ is the $k^{th}$ review expressed as a vector of occurrences of the terms in $V$. 
* $b$ is defined as follows: $b = \log(N_+ / N_-)$, where $N_+$ and $N_-$ are the numbers of positive and negative reviews in the train sample.

So in order to predict on a new sample we have to express the test sample as vectors over $V$. The matrix $Sample$ defined below is such that $Sample_{i,j}$ is the number of times the element $i$ of the vector $V$ appears in the review $j$ of test_x. 

In [10]:
sample = vectorization.transform(test_x).toarray().T
# We do have :
len(sample) == len(V)

True

In [11]:
# And the number of columns is equal to number of test_x reviews:
len(sample[0]) == len(test_y)

True

In [12]:
sample

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Then to compute the predictions we apply the model and compute $W^T Sample + B$, where $B = \log(N_+ / N_-)$, with $N_+$ and $N_-$ the numbers of positive and negative reviews in the train sample.

In [13]:
nb_neg = len([i for i, value in enumerate(train_y) if value == -1])
nb_pos = len([i for i, value in enumerate(train_y) if value == 1])
B = np.log(nb_pos/nb_neg)
B

-0.3930425881096072

We compute the product of $W^T$ and $Sample$ and add $B$:

In [14]:
np.dot(R.T,sample) + B

array([ 4.70671658e+01,  5.09275942e+00,  4.40375276e+01,  4.35835184e-02,
       -4.39010698e+01, -1.57494584e+01, -3.59975026e+01,  1.91536760e+01,
        2.61759522e+01,  5.55536435e+00,  1.32167523e+01,  2.03342614e+01,
       -2.04423485e+00,  4.20131554e+00,  7.30995692e+01,  2.33973428e-01,
        6.21047019e-01,  2.86168938e+01, -1.92038291e+00,  4.94957117e+00,
       -1.14154734e+01,  7.90024495e+00, -2.18059360e+01, -2.25150590e+00,
        7.36509311e-01,  5.52234153e+00,  1.00315615e+01,  1.09743408e+00,
        1.46057979e+01,  2.47714136e+00])

And now we take the sign of this expression to have the predictions:

In [15]:
np.sign(np.dot(R.T,sample) + B)

array([ 1.,  1.,  1.,  1., -1., -1., -1.,  1.,  1.,  1.,  1.,  1., -1.,
        1.,  1.,  1.,  1.,  1., -1.,  1., -1.,  1., -1., -1.,  1.,  1.,
        1.,  1.,  1.,  1.])

### Functions implemented
The previous cells showed step by step how to construct the model as explained in the paper. We also created a function which directly computes every vector and matrix needed. 

In [16]:
W, B, binarized, vectorization = MNB_fit_model(Train_x_sample = train_x, 
                                              Train_y_sample = train_y,
                                              ngrams = (1,2), 
                                              alpha = .1, 
                                              binarized=True)

In [17]:
pred = predict(binarized,W,B,vectorization,test_x)
pred

array([ 1.,  1.,  1.,  1., -1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1., -1., -1.,  1.,  1.,
        1.,  1.,  1.,  1.])

We can look at the accuracy of the model:

In [18]:
eval(pred,test_y)

accuracy 0.6
              precision    recall  f1-score   support

        -1.0       0.83      0.31      0.45        16
         1.0       0.54      0.93      0.68        14

    accuracy                           0.60        30
   macro avg       0.69      0.62      0.57        30
weighted avg       0.70      0.60      0.56        30



Considering the small size of the sample, we should use cross-validation to evaluate the model. We implemented a function to do this and we obtain the following result:

In [19]:
random.seed(2022)
cross_validation_NBSVM(Apex_AD2600_Progressive_scan_DVD_player_x, 
                       Apex_AD2600_Progressive_scan_DVD_player_y,
                       cv = 4,
                       alpha = .1)

0.625

This is not a very good result considering that about 58% of the reviews in this sample are negative (56 out of 97, see the counts below). A naive model that always predicts negative would therefore reach an accuracy of about 0.58, close to the 0.625 obtained here. We think the small size of our sample is responsible for this outcome. 
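
The principle of the cross-validation used above can be sketched as follows. This is a generic k-fold loop under our own assumptions, not the code of `cross_validation_NBSVM`: the names `k_fold_accuracy`, `fit_majority` and `predict_majority` are hypothetical, and the toy "model" simply predicts the majority class of its training fold.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=4, seed=0):
    """Mean accuracy over k folds.

    fit(X, y) returns a model; predict(model, X) returns predicted labels.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffled index folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))

def fit_majority(X, y):
    """Toy 'model': remember the most frequent label of the training fold."""
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def predict_majority(model, X):
    """Predict the remembered label for every row of X."""
    return np.full(len(X), model)

# Hypothetical toy data: 12 negative and 8 positive labels
y_toy = np.array([-1] * 12 + [1] * 8)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)
cv_score = k_fold_accuracy(X_toy, y_toy, fit_majority, predict_majority, k=4)
print(cv_score)
```

Passing `fit` and `predict` as arguments keeps the loop independent of the model, so the same skeleton could wrap a Naive Bayes, SVM or NBSVM fit.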

In [20]:
nb_neg = len([i for i, value in enumerate(Apex_AD2600_Progressive_scan_DVD_player_y) if value == -1])
nb_pos = len([i for i, value in enumerate(Apex_AD2600_Progressive_scan_DVD_player_y) if value == 1])
nb_pos, nb_neg

(41, 56)
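
From these counts, the accuracy of the naive model that always predicts the majority class follows directly. The short computation below only reuses the two numbers printed above; the variable name `baseline_accuracy` is ours.

```python
# Class counts reported in the cell above
nb_pos, nb_neg = 41, 56

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = max(nb_pos, nb_neg) / (nb_pos + nb_neg)
print(round(baseline_accuracy, 3))
```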

# 4. SVM
To implement the SVM we used the svm module of the sklearn package. We need to express our reviews as vectors over $V$ once again.

In [6]:
train_x, test_x, train_y, test_y = train_test_split(Apex_AD2600_Progressive_scan_DVD_player_x, 
                                                    Apex_AD2600_Progressive_scan_DVD_player_y, 
                                                    test_size=0.3, 
                                                    random_state=2000)

In [9]:
# We vectorize again train_x:
V, F, vectorization = vectorize(Train_x_sample = train_x, ngrams=(1,2))
# Since binarized = True always here.
F = np.where(F > 0,1,0)

In [23]:
# We define and fit the model:
clf = svm.LinearSVC()
clf.fit(F.T, train_y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [24]:
# We transform test_x into a matrix of counts over V
sample = vectorization.transform(test_x).toarray().T
# We predict the outcome of test_x:
pred_svm = clf.predict(vectorization.transform(test_x))
pred_svm

array([-1., -1., -1., -1., -1., -1., -1., -1., -1.,  1., -1., -1., -1.,
        1., -1., -1., -1., -1., -1.,  1., -1., -1., -1.,  1., -1.,  1.,
       -1.,  1.,  1.,  1.])

In [25]:
eval(pred_svm,test_y)

accuracy 0.6
              precision    recall  f1-score   support

        -1.0       0.59      0.81      0.68        16
         1.0       0.62      0.36      0.45        14

    accuracy                           0.60        30
   macro avg       0.61      0.58      0.57        30
weighted avg       0.61      0.60      0.58        30



Once again we implemented a function that defines the model directly, with an argument NB set to True for NBSVM and to False for a plain SVM.

In [7]:
clf = NB_SVM_fit_model(Train_x_sample = train_x,
                       Train_y_sample = train_y,
                       ngrams = (1,2),
                       NB=False) 

In [11]:
# Then to predict we do as above:

# We transform test_x into a matrix of counts over V
sample = vectorization.transform(test_x).toarray().T
# We predict the outcome of test_x:
pred_svm = clf.predict(vectorization.transform(test_x))
pred_svm

array([-1., -1., -1., -1., -1., -1., -1., -1., -1.,  1., -1., -1., -1.,
        1., -1., -1., -1., -1., -1.,  1., -1., -1., -1.,  1., -1.,  1.,
       -1.,  1.,  1.,  1.])

In [12]:
eval(pred_svm,test_y)

accuracy 0.6
              precision    recall  f1-score   support

        -1.0       0.59      0.81      0.68        16
         1.0       0.62      0.36      0.45        14

    accuracy                           0.60        30
   macro avg       0.61      0.58      0.57        30
weighted avg       0.61      0.60      0.58        30



# 5. NBSVM
The idea of NBSVM is to fit the same model as the SVM, but with the features changed from $x = \hat{F}$ to $x = \hat{F} \circ \hat{r}$ (elementwise product). Recall that $F$ is a matrix of size len($V$) $\times$ number of reviews and that $\hat{r}$ is the log-count ratio, so a vector of size len($V$).

In [13]:
train_x, test_x, train_y, test_y = train_test_split(Apex_AD2600_Progressive_scan_DVD_player_x, 
                                                    Apex_AD2600_Progressive_scan_DVD_player_y, 
                                                    test_size=0.3, 
                                                    random_state=2000)

In [14]:
# We vectorize again train_x:
V, F, vectorization = vectorize(Train_x_sample = train_x, ngrams=(1,2))
# Since binarized = True always here.
F = np.where(F > 0,1,0)

In [15]:
# We compute R, the log count ratio (it is R hat here since we use F hat to compute it)
R = get_R(alpha = 10, Train_y_sample = train_y, F = F)
R

array([ 0.0365331 ,  0.0365331 , -0.05877708, ...,  0.0365331 ,
        0.0365331 ,  0.0365331 ])

We now compute the elementwise product of $\hat{r}$ and $\hat{F}$:

In [16]:
# We define first a matrix r with the same number of columns as F and with each 
# column equal to R.
r = np.array([list(R),]*len(F[0]))
r = r.transpose()
r

array([[ 0.0365331 ,  0.0365331 ,  0.0365331 , ...,  0.0365331 ,
         0.0365331 ,  0.0365331 ],
       [ 0.0365331 ,  0.0365331 ,  0.0365331 , ...,  0.0365331 ,
         0.0365331 ,  0.0365331 ],
       [-0.05877708, -0.05877708, -0.05877708, ..., -0.05877708,
        -0.05877708, -0.05877708],
       ...,
       [ 0.0365331 ,  0.0365331 ,  0.0365331 , ...,  0.0365331 ,
         0.0365331 ,  0.0365331 ],
       [ 0.0365331 ,  0.0365331 ,  0.0365331 , ...,  0.0365331 ,
         0.0365331 ,  0.0365331 ],
       [ 0.0365331 ,  0.0365331 ,  0.0365331 , ...,  0.0365331 ,
         0.0365331 ,  0.0365331 ]])

In [17]:
# We compute the elementwise product of R and F.
product = np.multiply(r,F)
product

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.        , -0.        , -0.        , ..., -0.        ,
        -0.05877708, -0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [18]:
# We define the model:
clf = svm.LinearSVC()
clf.fit(product.T, train_y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [19]:
# Then to predict we do as above:

# We transform test_x into a matrix of counts over V
sample = vectorization.transform(test_x).toarray().T
# We predict the outcome of test_x:
pred_nbsvm = clf.predict(vectorization.transform(test_x))
pred_nbsvm

array([-1., -1.,  1.,  1., -1., -1.,  1., -1.,  1.,  1.,  1.,  1., -1.,
       -1.,  1.,  1., -1., -1.,  1.,  1.,  1., -1., -1.,  1., -1.,  1.,
       -1., -1.,  1.,  1.])

In [20]:
eval(pred_nbsvm,test_y)

accuracy 0.6
              precision    recall  f1-score   support

        -1.0       0.64      0.56      0.60        16
         1.0       0.56      0.64      0.60        14

    accuracy                           0.60        30
   macro avg       0.60      0.60      0.60        30
weighted avg       0.61      0.60      0.60        30
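
In the cells above the classifier is fitted on $\hat{F} \circ \hat{r}$ but predicts on the raw counts of test_x. The sketch below shows, on toy data, one way to apply the same transformation to the train and test matrices; it assumes that test features should be binarized and scaled by $\hat{r}$ exactly like the train features, and the names `nb_features`, `r_toy`, `train_counts` and `test_counts` are hypothetical. It also uses NumPy broadcasting instead of building the tiled matrix `r`.

```python
import numpy as np

def nb_features(counts, r):
    """Binarize counts of shape (len(V), n_reviews) and scale row i by r[i]."""
    binarized = (np.asarray(counts) > 0).astype(float)
    # r[:, None] has shape (len(V), 1) and broadcasts across the review columns.
    return binarized * np.asarray(r)[:, None]

# Hypothetical toy data: 3 terms, 2 train reviews, 1 test review
r_toy = np.array([0.5, -1.0, 0.2])
train_counts = np.array([[1, 0],
                         [0, 2],
                         [3, 1]])
test_counts = np.array([[0],
                        [1],
                        [4]])

# Transpose so that rows are reviews, as in clf.fit(product.T, train_y)
X_train = nb_features(train_counts, r_toy).T
X_test = nb_features(test_counts, r_toy).T
print(X_train)
print(X_test)
```

With this convention, `X_train` would be passed to the classifier's `fit` and `X_test` to its `predict`, so that both live in the same feature space. We have not re-run the notebook with this change, so the scores reported above are left as they are.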



We can use the function implemented directly:

In [21]:
clf = NB_SVM_fit_model(train_x, train_y, ngrams = (1,2), NB=True)

In [22]:
sample = vectorization.transform(test_x).toarray().T
# We predict the outcome of test_x:
pred_nbsvm = clf.predict(vectorization.transform(test_x))
pred_nbsvm

array([-1., -1.,  1.,  1., -1., -1.,  1., -1.,  1.,  1.,  1.,  1., -1.,
       -1.,  1.,  1., -1., -1.,  1.,  1.,  1., -1., -1.,  1., -1.,  1.,
       -1., -1.,  1.,  1.])

In [23]:
eval(pred_nbsvm,test_y)

accuracy 0.6
              precision    recall  f1-score   support

        -1.0       0.64      0.56      0.60        16
         1.0       0.56      0.64      0.60        14

    accuracy                           0.60        30
   macro avg       0.60      0.60      0.60        30
weighted avg       0.61      0.60      0.60        30

