## Logistic Regression Pretreatment

We set the random seed to make our results reproducible.

In [36]:
import random

random.seed(10)
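As a quick check of what seeding buys us, the sketch below reseeds Python's `random` module and confirms that the same draws come back. It only covers the standard-library generator; other libraries keep their own random state, which is why the `train_test_split` calls in this notebook pass `random_state` explicitly. The helper name `draw` is ours and is not part of the notebook.

```python
import random

def draw(seed: int, n: int = 3) -> list:
    """Seed the global generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first_run = draw(10)
second_run = draw(10)
other_seed = draw(11)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # a different seed gives different draws
```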

First we import everything we need for this sheet.

In [37]:
# import datasets
from datasets import load_dataset, concatenate_datasets
import pandas as pd

We download the dataset from HuggingFace. Since we will split the data into train and test sets ourselves, we first merge the provided train and test splits into a single dataset of 50,000 elements.

In [38]:
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


In [39]:
dataset = concatenate_datasets([dataset_train, dataset_test])

Now that we have our data, we want to convert it to a DataFrame to facilitate manipulations.

In [40]:
from typing import List, Tuple

def create_dataframe(data: List[Tuple[int, str]], columns: List[str]) -> pd.DataFrame:
    """Build a DataFrame from (label, text) pairs; labels are kept as integers."""
    return pd.DataFrame(data, columns=columns)

df = create_dataframe(list(zip(dataset['label'], dataset['text'])), ['Label', 'Text'])
df.head()

Unnamed: 0,Label,Text
0,1,Bromwell High is a cartoon comedy. It ran at t...
1,1,Homelessness (or Houselessness as George Carli...
2,1,Brilliant over-acting by Lesley Ann Warren. Be...
3,1,This is easily the most underrated film inn th...
4,1,This is not the typical Mel Brooks film. It wa...


In [41]:
# import packages for stemming
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

In [42]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let's start by removing HTML tags.

In [43]:
# Remove html tags
from bs4 import BeautifulSoup
df['Text'] = df['Text'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())

Let's continue with the word tokenization.

In [44]:
# Tokenization
df['Text'] = df['Text'].apply(lambda x: " ".join(word_tokenize(x)))

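Now let's apply stemming to every token made up only of word characters (those matching `^\w+$`). The text is lowercased, tokenized, and each word is stemmed. Punctuation tokens are dropped, so the resulting text no longer contains any punctuation.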

In [45]:
# Stemming
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

def stemming(text):
    return [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
df['Text'] = df['Text'].apply(lambda x: " ".join(stemming(x)))
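To make the effect of the `^\w+$` filter concrete, here is a minimal sketch that applies the same pattern to a hand-written token list. It assumes the tokens are already split (the notebook uses `word_tokenize` for that) and leaves out the stemmer, so it only shows which tokens survive. The helper name `keep_word_tokens` and the sample tokens are ours.

```python
import re

re_word = re.compile(r"^\w+$")

def keep_word_tokens(tokens):
    """Lowercase tokens and keep only those made entirely of word characters."""
    return [t.lower() for t in tokens if re_word.match(t)]

# Hand-written sample tokens (assumption: already tokenized)
tokens = ["Great", "movie", "!", "10/10", "n't", "mst3k", "..."]
kept = keep_word_tokens(tokens)
print(kept)  # ['great', 'movie', 'mst3k']
```

Anything containing a non-word character, such as `!`, `10/10` or `n't`, is discarded entirely rather than cleaned.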

In [46]:
df['Text'][:5]

0    bromwel high is a cartoon comedi it ran at the...
1    homeless or houseless as georg carlin state ha...
2    brilliant by lesley ann warren best dramat hob...
3    this is easili the most underr film inn the br...
4    this is not the typic mel brook film it was mu...
Name: Text, dtype: object

In [47]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Let's lemmatize all the tokens we can find.

In [48]:
# Lemmatization
import spacy

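# Only token.lemma_ is needed, so the heavier pipeline components are disabled.
# Note: with the spaCy 2.x model downloaded above, "lemmatizer" is not a pipeline
# component, so listing it here has no effect. On spaCy 3.x the lemmatizer is a
# real component, and disabling it would leave token.lemma_ empty.
nlp = spacy.load("en_core_web_sm", disable = ['ner', 'tagger', 'parser', 'textcat', "lemmatizer"])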

def lemmatization(text):
    return [token.lemma_ for token in nlp(text.lower()) if re_word.match(token.text)]

df['Text'] = df['Text'].apply(lambda x: " ".join(lemmatization(x)))

Next, we need to convert the text into numbers that we can compute with. We use word frequencies: each text is transformed into a vector holding the number of occurrences of each word.

For this we use `CountVectorizer` from `sklearn`, keeping only the 1500 most frequent words (`max_features = 1500`).

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
 
X = cv.fit_transform(df['Text']).toarray()
y = df['Label']
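The idea behind the count vectors can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace only, and the function name `count_vectorize` and the three sample texts are ours.

```python
from collections import Counter

def count_vectorize(texts, max_features):
    """Toy bag-of-words: keep the most frequent words, then count them per text."""
    totals = Counter(word for text in texts for word in text.split())
    # Rank by descending corpus frequency, breaking ties alphabetically
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocab = sorted(top)  # columns in alphabetical order
    rows = []
    for text in texts:
        counts = Counter(text.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

texts = ["good movi good plot", "bad movi", "good actor bad plot"]
vocab, rows = count_vectorize(texts, max_features=3)
print(vocab)  # ['bad', 'good', 'movi']
print(rows)   # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

Each row is one text and each column one vocabulary word, which is the shape of the `X` matrix passed to the classifier.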

`train_test_split` shuffles the whole dataset before splitting. In our case, we use 75% of the data for training and 25% for testing.

In [50]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size = 0.25, random_state = 0)
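For intuition, the shuffle-then-split behaviour can be sketched with the standard library alone. This is an illustration under our own assumptions, not scikit-learn's code, and it will not reproduce the same row order as `train_test_split`. The function name `shuffle_split` is ours.

```python
import math
import random

def shuffle_split(items, test_size=0.25, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut it into train and test parts."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = shuffle_split(range(50_000))
print(len(train), len(test))  # 37500 12500
```

With 50,000 reviews, a 25% test share gives the 12,500 test samples that appear as the total `support` in the reports of this notebook.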

In [51]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We use sklearn's `confusion_matrix` to display the number of correct (true positive and true negative) and incorrect (false positive and false negative) predictions. Rows correspond to the true labels and columns to the predicted labels.

In [52]:
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[5390,  862],
       [ 768, 5480]])

We use the classification_report of sklearn to display the precision, recall, and F1-score for both classes on the test data.

In [53]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      6252
           1       0.86      0.88      0.87      6248

    accuracy                           0.87     12500
   macro avg       0.87      0.87      0.87     12500
weighted avg       0.87      0.87      0.87     12500
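The figures in this report can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check. The sketch below hard-codes that matrix and assumes the usual layout, with rows as true labels and columns as predicted labels. The variable names are ours.

```python
# Confusion matrix copied from the output above
cm = [[5390, 862],
      [768, 5480]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # class 1
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)  # class 0
recall_neg = tn / (tn + fp)

print(f"accuracy          {accuracy:.2f}")                          # 0.87
print(f"class 0 prec/rec  {precision_neg:.2f} {recall_neg:.2f}")    # 0.88 0.86
print(f"class 1 prec/rec  {precision_pos:.2f} {recall_pos:.2f}")    # 0.86 0.88
```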



Compared to logistic regression without pretreatment, accuracy and per-class precision are almost the same. For logistic regression on word-frequency features, this pretreatment does not have much impact.

In [54]:
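# Keep the test rows whose prediction differs from the true label
bad_predict_df = y_test[y_test != y_pred]
indexes = bad_predict_df.index
# The index holds row labels, so select with .loc rather than .iloc
df.loc[indexes]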

Unnamed: 0,Label,Text
12968,0,i see this movi a a veri young girl i 27 now a...
43460,0,ray bradburi run and hide this tacki film vers...
34478,1,this be a veri dark movi somewhat well than th...
19883,0,i read comment that you should watch this film...
2838,1,okay yes this be a movi with continu error lik...
...,...,...
11686,1,i probabl doubl my knowledg of iran when i see...
34600,1,the ﻿1 time i see this film i want to like it ...
21800,0,i ﻿1 see this movi on mst3k and although i lau...
27615,1,sheba babi be alway underr much like becaus it...


In [55]:
df.iloc[12968]["Text"]

'i see this movi a a veri young girl i 27 now and it scare me witless for year i have nightmar about everi aspect of this film from the way it be draw to the music to obvious the violenc my parent still argu about who allow me to watch it and both of them say that they would never let me watch such a movi i think they onli say that know that i have such strong feel about it 0 i be current read the book out of morbid curio and the fact that it a classic and it be realli a great stori howev i do think that it should have be make into a cartoon ever good mayb kid nowaday would find it quaint but it give me nightmar for week and week and i still have a hard time see rabbit draw in a similar way give me a littl heart palpit everi time yah i be a wuss but i strong suggest that anus parent look to show this movi to their kid read them the book instead or watch it \ufeff1 to make certain that they approv of the content not everyon find it a disturb a i do but we be out there 0'



--------------------------------------------------------------------------


Now that we have built a logistic regression model on word frequencies using CountVectorizer, we try a logistic regression model based on hand-made features.

We chose to implement four features: whether the text contains the keyword "no", whether it contains "!", and the numbers of positive and negative words according to vaderSentiment.
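The two binary features can be sketched on plain strings with the standard `re` module. This is an illustration only: the function name `binary_features` and the sample sentences are ours, and the notebook computes the same flags column-wise with pandas.

```python
import re

# \b marks a word boundary, so "know" and "nothing" do not count as "no"
re_no = re.compile(r"\bno\b", flags=re.IGNORECASE)

def binary_features(text):
    """Return (containsNo, containsExclamation) as 0/1 flags."""
    contains_no = 1 if re_no.search(text) else 0
    contains_exclamation = 1 if "!" in text else 0
    return contains_no, contains_exclamation

print(binary_features("No, I did not like it!"))   # (1, 1)
print(binary_features("I know nothing about it"))  # (0, 0)
print(binary_features("there be no plot"))         # (1, 0)
```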

In [56]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
analyzer = SentimentIntensityAnalyzer()

def countPositiveNegative(sentence: str) -> Tuple[int, int]:
    """Return the number of positive and negative words in a sentence."""
    positive = 0
    negative = 0
    # Score each word on its own with VADER's compound score
    for word in re.findall(r'\w+', sentence):
        vs = analyzer.polarity_scores(word)
        if vs['compound'] >= 0.05:
            positive += 1
        elif vs['compound'] <= -0.05:
            negative += 1
    return positive, negative
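The thresholding logic can be tried without installing vaderSentiment by swapping the analyzer for a small dictionary. The scores in `TOY_SCORES` are invented for illustration and are not taken from the VADER lexicon. Only the counting rule (a word is positive at or above 0.05, negative at or below -0.05) mirrors the function above.

```python
import re

# Hypothetical mini-lexicon: these scores are made up for the example
TOY_SCORES = {"great": 0.6, "love": 0.6, "bad": -0.5, "boring": -0.3}

def count_polar_words(sentence, scores=TOY_SCORES, threshold=0.05):
    """Count words whose score is beyond +threshold or -threshold."""
    positive = negative = 0
    for word in re.findall(r"\w+", sentence.lower()):
        score = scores.get(word, 0.0)  # unknown words are neutral
        if score >= threshold:
            positive += 1
        elif score <= -threshold:
            negative += 1
    return positive, negative

print(count_polar_words("Great cast, bad script, boring ending"))  # (1, 2)
```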


In [58]:
import numpy as np

def getAllFeatures(df: pd.DataFrame) -> None:
    """Add the binary feature columns to df in place."""
    # No capture group in the pattern: a group makes str.contains emit a UserWarning
    df['containsNo'] = np.where(df['Text'].str.contains(r"\bno\b", case=False), 1, 0)
    df['containsExclamation'] = np.where(df['Text'].str.contains("!", regex=False), 1, 0)

getAllFeatures(df)


In [59]:
df = pd.concat([df, df['Text'].apply(lambda cell: pd.Series(countPositiveNegative(cell), index = ['positiveWords', 'negativeWords']))], axis = 1)

In [60]:
df

Unnamed: 0,Label,Text,containsNo,containsExclamation,positiveWords,negativeWords
0,1,bromwel high be a cartoon comedi it run at the...,0,0,0,0
1,1,homeless or houseless a georg carlin state hav...,0,0,19,10
2,1,brilliant by lesley ann warren well dramat hob...,0,0,7,4
3,1,this be easili the much underr film inn the br...,0,0,4,3
4,1,this be not the typic mel brook film it be muc...,0,0,6,2
...,...,...,...,...,...,...
49995,0,i occasion let my kid watch this garbag so the...,0,0,1,4
49996,0,when all we have anymor be pretti much realiti...,0,0,6,8
49997,0,the basic genr be a thriller intercut with a u...,1,0,9,4
49998,0,four thing intrigu me a to this film ﻿1 it sta...,0,0,10,1


In [61]:
X = df.loc[:, ~df.columns.isin(['Text','Label'])]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size = 0.25, random_state = 0)

In [62]:
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [63]:
from sklearn.metrics import confusion_matrix
y_pred = logreg2.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[4181, 2071],
       [1967, 4281]])

In [64]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.67      0.67      6252
           1       0.67      0.69      0.68      6248

    accuracy                           0.68     12500
   macro avg       0.68      0.68      0.68     12500
weighted avg       0.68      0.68      0.68     12500



With the hand-made features, accuracy drops to 0.68 (against 0.87 with word frequencies), and precision and recall are similarly weak for both classes. These four features are probably too coarse for the IMDB dataset. Applying vaderSentiment to each whole review, rather than only counting positive and negative words, would probably have worked better.

The pretreatment also plays a role: the stemming step removes all punctuation from the text, so no review contains '!' any more and the `containsExclamation` feature is useless.