## Logistic Regression

We set the random seed to make our results reproducible.

In [68]:
import random

random.seed(10)

First we import everything we need for this sheet.

In [69]:
# import datasets
from datasets import load_dataset, concatenate_datasets
import pandas as pd

We download the dataset from Hugging Face. We will split the data into train and test sets manually, so we first merge the provided train and test splits into a single dataset of 50,000 elements.

In [70]:
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


In [71]:
dataset = concatenate_datasets([dataset_train, dataset_test])

Now that we have our data, we want to convert it to a DataFrame to facilitate manipulations.

In [72]:
from typing import List, Tuple

def create_dataframe(data: List[Tuple[int, str]], columns: List[str]) -> pd.DataFrame:
    """ Convert a list of (label, text) pairs into a DataFrame with the given column names """

    rtn = pd.DataFrame(data, columns=columns)
    return rtn

df = create_dataframe(list(zip(dataset['label'], dataset['text'])), ['Label', 'Text'])
df.head()

| | Label | Text |
|---|---|---|
| 0 | 1 | Bromwell High is a cartoon comedy. It ran at t... |
| 1 | 1 | Homelessness (or Houselessness as George Carli... |
| 2 | 1 | Brilliant over-acting by Lesley Ann Warren. Be... |
| 3 | 1 | This is easily the most underrated film inn th... |
| 4 | 1 | This is not the typical Mel Brooks film. It wa... |


First, we need to convert the text into numbers that we can compute with. We use word frequencies: each review is transformed into a vector whose entries count how often each vocabulary word occurs in that review.

For this we use `CountVectorizer` from `sklearn`.
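To illustrate what the vectorizer produces, here is a minimal sketch on a two-sentence toy corpus. The corpus and the names `toy_corpus`, `toy_cv` and `toy_counts` are made up for this example and are not part of the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only meant to show the shape of the output.
toy_corpus = [
    "good movie good cast",
    "bad movie",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus).toarray()

# Recover the vocabulary in column order from the word -> column mapping.
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocabulary)   # one column per word
print(toy_counts)   # one row per document, entries are word counts
```

Each row is one document and each column one word, so "good" appearing twice in the first sentence yields a 2 in that column. On the real data, `max_features` keeps only the most frequent words of the corpus as columns.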

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
 
X = cv.fit_transform(df['Text']).toarray()
y = df['Label']

`train_test_split` shuffles the whole dataset before splitting it. In our case, we use 75% of the data for training and 25% for testing.

In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size = 0.25, random_state = 0)
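As a quick sanity check of the split proportions, the sketch below applies the same call to a small made-up array (all `toy_*` names are hypothetical). It also shows that features and labels stay aligned after shuffling, and that the split is controlled by `random_state` rather than by Python's `random.seed`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 hypothetical samples with 2 features each; row i is [2*i, 2*i + 1].
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)  # label i belongs to row i

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

# Same random_state, same split: the shuffle is deterministic.
_, _, _, toy_y_test_again = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

print(len(toy_X_train), len(toy_X_test))
print((toy_y_test == toy_y_test_again).all())
```

With 50,000 reviews and `test_size = 0.25`, the same arithmetic gives 37,500 training and 12,500 test examples, which matches the `support` totals reported further down.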

In [75]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We use `confusion_matrix` from `sklearn` to display the number of correct (true positive and true negative) and incorrect (false positive and false negative) predictions.

In [76]:
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[5453,  799],
       [ 773, 5475]])
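In `sklearn`'s convention, rows are the true classes and columns the predicted classes, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below copies the counts printed above into a new array (`cm_counts` is our own name) and derives the metrics for the positive class by hand.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm_counts = np.array([[5453, 799],
                      [773, 5475]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm_counts.ravel()

precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many are
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(float(precision_pos), 2),
      round(float(recall_pos), 2),
      round(float(f1_pos), 2))
```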

We use `classification_report` from `sklearn` to display the precision, recall, and F1-score for both classes on the test data.

In [77]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.87      6252
           1       0.87      0.88      0.87      6248

    accuracy                           0.87     12500
   macro avg       0.87      0.87      0.87     12500
weighted avg       0.87      0.87      0.87     12500



Compared to the Naive Bayes approach, logistic regression on word frequencies gives better results. This may be due to the independence assumption that Naive Bayes makes: it treats the words of a review as conditionally independent given the class, which rarely holds for natural language.
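To make the contrast concrete, here is a sketch of the two decision rules in standard textbook notation. The symbols $w_i$ (the words of a review), $x$ (its count vector), $\theta$ and $b$ are our own notation and do not appear in the code above.

Naive Bayes factorizes the likelihood word by word, which is exactly the independence assumption:

$$P(y \mid w_1, \dots, w_n) \propto P(y) \prod_{i=1}^{n} P(w_i \mid y)$$

Logistic regression instead models the class probability directly from a weighted sum of the counts, with no factorization over words:

$$P(y = 1 \mid x) = \sigma(\theta^\top x + b) = \frac{1}{1 + e^{-(\theta^\top x + b)}}$$

Because the weights $\theta$ are fitted jointly, correlated words can share or offset their influence, which may explain part of the difference in results.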



--------------------------------------------------------------------------


Now that we have built a logistic regression model based on word frequencies using `CountVectorizer`, we will try a logistic regression model based on hand-made features.

We have chosen to implement 4 features: whether the keyword "no" is present, whether "!" is present, and the number of positive and of negative keywords according to the `vaderSentiment` package.

In [78]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
analyzer = SentimentIntensityAnalyzer()

def countPositiveNegative(sentence: str) -> Tuple[int, int]:
  """ Return the number of positive and negative words in the given sentence """
  positive = 0
  negative = 0
  # Split the sentence into words and score each word on its own.
  res = re.findall(r'\w+', sentence)
  for word in res:
    vs = analyzer.polarity_scores(word)
    if vs['compound'] >= 0.05:
      positive += 1
    elif vs['compound'] <= -0.05:
      negative += 1
  return positive, negative

In [79]:
import numpy as np

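def getAllFeatures(df: pd.DataFrame) -> None:
  """ Add the binary features 'containsNo' and 'containsExclamation' to df in place """
  # No capture group in the pattern: str.contains warns when the pattern contains one.
  df['containsNo'] = np.where(df['Text'].str.contains(r"\bno\b", case = False), 1, 0)
  df['containsExclamation'] = np.where(df['Text'].str.contains("!"), 1, 0)

getAllFeatures(df)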
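The word boundaries `\b` in the pattern matter: without them, "no" would also match inside words such as "know" or "nothing". A minimal sketch on made-up sentences (the `toy_texts` series is hypothetical) shows the behaviour of both binary features.

```python
import pandas as pd

# Hypothetical sentences, chosen to exercise the word-boundary pattern.
toy_texts = pd.Series([
    "No, I did not like it",
    "I know nothing about this director",
    "There is no plot!",
])

# \b restricts the match to the standalone word "no", in any letter case.
contains_no = toy_texts.str.contains(r"\bno\b", case=False).astype(int)

# regex=False treats "!" as a plain character to search for.
contains_exclamation = toy_texts.str.contains("!", regex=False).astype(int)

print(contains_no.tolist())
print(contains_exclamation.tolist())
```

The second sentence contains the letters "no" twice but never as a standalone word, so its `contains_no` value is 0.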



In [80]:
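# Count positive and negative words per review and append both counts as new columns.
word_counts = df['Text'].apply(
    lambda cell: pd.Series(countPositiveNegative(cell), index = ['positiveWords', 'negativeWords']))
df = pd.concat([df, word_counts], axis = 1)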

In [86]:
df

| | Label | Text | containsNo | containsExclamation | positiveWords | negativeWords |
|---|---|---|---|---|---|---|
| 0 | 1 | Bromwell High is a cartoon comedy. It ran at t... | 0 | 1 | 2 | 2 |
| 1 | 1 | Homelessness (or Houselessness as George Carli... | 0 | 0 | 22 | 13 |
| 2 | 1 | Brilliant over-acting by Lesley Ann Warren. Be... | 0 | 0 | 9 | 5 |
| 3 | 1 | This is easily the most underrated film inn th... | 0 | 0 | 8 | 3 |
| 4 | 1 | This is not the typical Mel Brooks film. It wa... | 0 | 0 | 7 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 49995 | 0 | I occasionally let my kids watch this garbage ... | 0 | 0 | 2 | 6 |
| 49996 | 0 | When all we have anymore is pretty much realit... | 0 | 0 | 12 | 10 |
| 49997 | 0 | The basic genre is a thriller intercut with an... | 1 | 0 | 13 | 7 |
| 49998 | 0 | Four things intrigued me as to this film - fir... | 0 | 0 | 11 | 2 |


In [81]:
X = df.loc[:, ~df.columns.isin(['Text','Label'])]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size = 0.25, random_state = 0)

In [83]:
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [84]:
from sklearn.metrics import confusion_matrix
y_pred = logreg2.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[4401, 1851],
       [1838, 4410]])

In [85]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.70      0.70      6252
           1       0.70      0.71      0.71      6248

    accuracy                           0.70     12500
   macro avg       0.70      0.70      0.70     12500
weighted avg       0.70      0.70      0.70     12500
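To put the two models side by side, the sketch below recomputes accuracy from the two confusion matrices printed in this sheet. The helper `accuracy_from_cm` and the array names are our own; the counts are copied from the outputs above.

```python
import numpy as np

# Counts copied from the two confusion matrices printed in this sheet.
cm_word_counts = np.array([[5453, 799], [773, 5475]])    # CountVectorizer features
cm_hand_made = np.array([[4401, 1851], [1838, 4410]])    # 4 hand-made features

def accuracy_from_cm(cm: np.ndarray) -> float:
    """Correct predictions (the diagonal) divided by all predictions."""
    return float(np.trace(cm) / cm.sum())

acc_word_counts = accuracy_from_cm(cm_word_counts)
acc_hand_made = accuracy_from_cm(cm_hand_made)

print(round(acc_word_counts, 4), round(acc_hand_made, 4))
```

On this split, the four hand-made features reach roughly 70% accuracy against roughly 87% for the 1500 word-frequency features.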

