<h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

## Introduction

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Limits the number of reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy

## 1. Loading the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/movie_data.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [3]:
df.iloc[1, 0]

"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my money back! I want my 2 hou

## 2. Transforming documents into feature vectors

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()

In [5]:
docs = np.array(['the sun is shining', 
                 'the weather is sweet',
                 'make you want to move your dancing feet'])
bag = count.fit_transform(docs)

In [6]:
count.vocabulary_

{'the': 8,
 'sun': 6,
 'is': 2,
 'shining': 5,
 'weather': 11,
 'sweet': 7,
 'make': 3,
 'you': 12,
 'want': 10,
 'to': 9,
 'move': 4,
 'your': 13,
 'dancing': 0,
 'feet': 1}

In [7]:
bag.toarray()

array([[0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1]], dtype=int64)
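
The link between `count.vocabulary_` and the columns of `bag` can be reproduced by hand. The block below is a minimal pure-Python sketch, not scikit-learn's implementation. The names `tokenize`, `vocabulary`, and `counts` are illustrative, and the regular expression is assumed to behave like `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters).

```python
import re

docs = ['the sun is shining',
        'the weather is sweet',
        'make you want to move your dancing feet']

def tokenize(text):
    # Assumption: lowercase, then keep tokens of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# Column index = position of the term in the alphabetically sorted vocabulary.
terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}

# One row per document, one column per vocabulary term.
counts = [[0] * len(vocabulary) for _ in docs]
for row, doc in zip(counts, docs):
    for term in tokenize(doc):
        row[vocabulary[term]] += 1

print(vocabulary)
for row in counts:
    print(row)
```

Each row counts how often every vocabulary term occurs in one document, so the first row has a 1 in columns 2, 5, 6, and 8 ('is', 'shining', 'sun', 'the').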

## 3. Word relevancy using term frequency-inverse document frequency 

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)

In [9]:
(tfidf.fit_transform(bag)).toarray()

array([[0.        , 0.        , 0.42804604, 0.        , 0.        ,
        0.5628291 , 0.5628291 , 0.        , 0.42804604, 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.42804604, 0.        , 0.        ,
        0.        , 0.        , 0.5628291 , 0.42804604, 0.        ,
        0.        , 0.5628291 , 0.        , 0.        ],
       [0.35355339, 0.35355339, 0.        , 0.35355339, 0.35355339,
        0.        , 0.        , 0.        , 0.        , 0.35355339,
        0.35355339, 0.        , 0.35355339, 0.35355339]])

In [10]:
np.set_printoptions(precision=2)
(tfidf.fit_transform(bag)).toarray()

array([[0.  , 0.  , 0.43, 0.  , 0.  , 0.56, 0.56, 0.  , 0.43, 0.  , 0.  ,
        0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.43, 0.  , 0.  , 0.  , 0.  , 0.56, 0.43, 0.  , 0.  ,
        0.56, 0.  , 0.  ],
       [0.35, 0.35, 0.  , 0.35, 0.35, 0.  , 0.  , 0.  , 0.  , 0.35, 0.35,
        0.  , 0.35, 0.35]])
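
To see where these numbers come from, the weights can be recomputed by hand. This is a sketch under the assumption that `smooth_idf=True` uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by l2 normalization of each row. The names `doc_freq`, `idf`, and `tfidf_row` are illustrative and not part of scikit-learn.

```python
import math

# Raw term counts from the bag-of-words example (rows = documents).
counts = [
    [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    # Weight each count by its idf, then scale the row to unit l2 length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print([round(v, 2) for v in row])
```

Terms that appear in two documents ('is', 'the') get a lower weight than terms unique to one document ('sun', 'shining'), which is why the first row mixes 0.43 and 0.56.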

## 4. Data preparation

In [11]:
df.iloc[0, 0][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [12]:
# Remove HTML markup and punctuation,
# and move emoticons to the end of the text
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

In [13]:
preprocessor(df.iloc[0, 0][-50:])

'is seven title brazil not available'

In [14]:
preprocessor('this is :) a test :( ! :-) </a>')

'this is a test :) :( :)'

## 5. Tokenization of documents

In [15]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

In [16]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [17]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

To remove stop words such as "and" and "or", first download the NLTK stop-word list:

In [18]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [20]:
# Redefine the tokenizer so that stop words are dropped before stemming
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split() if word not in stop]

In [21]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'thu', 'run']

## 6. Transform Text Data into TF-IDF Vectors

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                       lowercase=False,
                       preprocessor=preprocessor,
                       tokenizer=tokenizer_porter,
                       use_idf=True,
                       norm='l2',
                       smooth_idf=True)

In [23]:
y = df.sentiment.values
X = tfidf.fit_transform(df.review)

## 7. Document Classification using Logistic Regression

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
# Use 50% of the data as the test set, following the split defined by the dataset's creators

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,
                                                   test_size=0.5,
                                                   shuffle=False)
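
With `shuffle=False` the split is a plain slice: the first half of the rows becomes the training set and the second half the test set, so `random_state` should play no role here. The sketch below illustrates that behaviour in pure Python. `ordered_split` is a hypothetical helper, and rounding the test size up is an assumption about how fractional sizes are handled.

```python
import math

def ordered_split(items, test_size=0.5):
    """Split a sequence into leading train and trailing test parts without shuffling."""
    n_test = math.ceil(test_size * len(items))  # assumption: test size is rounded up
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

rows = list(range(10))  # stand-in for the row indices of the dataset
train_rows, test_rows = ordered_split(rows, test_size=0.5)
print(train_rows)
print(test_rows)
```

Because the order is preserved, the result depends on how the rows are arranged in `movie_data.csv`. If the file were sorted by label, an unshuffled split would give unbalanced classes.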

In [26]:
import pickle
from sklearn.linear_model import LogisticRegressionCV

In [27]:
clf = LogisticRegressionCV(cv=5,
                          scoring='accuracy',
                          random_state=0,
                          n_jobs=-1,
                          verbose=3,
                          max_iter=300).fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.0min remaining:  3.0min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.7min finished
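
Once fitted, a logistic regression model scores a review by passing a weighted sum of its TF-IDF features through the sigmoid function and thresholding the result at 0.5. The sketch below shows that step in isolation. The three-term vocabulary, the weights, the bias, and the feature values are all made up for illustration, and `LogisticRegressionCV` additionally tunes the regularization strength by cross-validation, which is left out here.

```python
import math

def sigmoid(z):
    # Maps any real number to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Probability of the positive class for one feature vector.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for a tiny vocabulary: ['bad', 'great', 'movie'].
weights = [-2.0, 1.5, 0.0]
bias = 0.1
review = [0.0, 0.8, 0.6]  # made-up TF-IDF values for one review

p_positive = predict_proba(weights, bias, review)
label = int(p_positive >= 0.5)  # 1 = positive, 0 = negative
print(round(p_positive, 3), label)
```

A positive weight pushes the prediction toward label 1 and a negative weight toward label 0, so inspecting the largest coefficients of a fitted model is one way to see which terms drive its decisions.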


In [28]:
# Save the fitted model to disk

with open('saved_model.sav', 'wb') as saved_model:
    pickle.dump(clf, saved_model)

## 8. Model Evaluation

In [29]:
# Load the saved model

filepath = 'saved_model.sav'
with open(filepath, 'rb') as saved_model:
    saved_clf = pickle.load(saved_model)

In [30]:
saved_clf.score(X_test, y_test)

0.89472

An accuracy of about 89.5% is quite good for such a simple model.

In [58]:
predictions = saved_clf.predict(X_test)

In [59]:
from sklearn import metrics

In [60]:
acc = metrics.accuracy_score(y_test, predictions)
print('Accuracy: {}\n'.format(acc))

Accuracy: 0.89472



In [62]:
mse = metrics.mean_squared_error(y_test, predictions)
print('MSE: {}\n'.format(mse))

MSE: 0.10528
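
For 0/1 labels the squared error of a prediction is 1 when it is wrong and 0 when it is right, so the MSE is simply the error rate: 1 - 0.89472 = 0.10528. The small check below illustrates this with made-up labels and predictions, not the IMDB data.

```python
y_true = [1, 0, 0, 1, 1, 0, 1, 0]  # made-up ground-truth labels
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up predictions (two mistakes)

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

print('Accuracy:', accuracy)
print('MSE:', mse)
print('Accuracy + MSE:', accuracy + mse)
```

The MSE therefore adds no information beyond accuracy for a binary classifier. Metrics such as precision and recall would say more about which kind of mistake the model makes.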

