# "Sklearn: IMDB Sentiment with TF_IDF and Stemming"

- title: "Sklearn: IMDB Sentiment with TF_IDF and Stemming"
- toc: true
- badges: False
- comments: true
- author: Sam Treacy
- categories: [sklearn, pandas, gridsearch, stemming, stopwords, sentiment, nlp, classification, python]

In [None]:
# Sklearn NLP TF-IDF Sentiment Stemming GridSearch - IMDB Reviews

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('DATA/IMDB Dataset.csv')
df.shape

(50000, 2)

In [3]:
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [4]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

## Clean Data

In [5]:
df['review'] = df['review'].str.lower()
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [6]:
def remove_punctuation(text):
    cleaned = ''.join(char for char in text if char not in ('?', '!', '-','_','.',
                                                            '@','#',':',';','"',',',"'",')','(') )
    return cleaned

In [7]:
df['review'] = df['review'].apply(remove_punctuation)

In [8]:
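# Replace paired HTML line breaks first, then any remaining single <br /> or <br>
df['review'] = (df['review'].str.replace('<br /><br />', ' ').str.replace('<br />', ' ')
                            .str.replace('<br>', ' '))
df['review'][1]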

'a wonderful little production  the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece  the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life  the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done'
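The cleaning steps above (lower-casing, dropping punctuation, replacing line-break tags) can also be folded into one function. The following is only a minimal sketch, not the notebook's method: `clean_review`, `TAG_RE` and `PUNCT_TABLE` are hypothetical names, and it assumes that removing every ASCII punctuation character and every HTML-style tag is acceptable, which is broader than the hand-picked character list used above.

```python
import re
import string

# Assumption: any <...> span is markup (e.g. <br />, <br>, <i>) and can be dropped.
TAG_RE = re.compile(r'<[^>]+>')
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_review(text):
    """Lower-case, strip tags, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = TAG_RE.sub(' ', text)        # tags become a single space
    text = text.translate(PUNCT_TABLE)  # delete punctuation characters
    return ' '.join(text.split())       # collapse runs of whitespace

sample = ('A wonderful little production. <br /><br />The filming technique '
          'is very unassuming- very old-time-BBC fashion.')
print(clean_review(sample))
# a wonderful little production the filming technique is very unassuming very oldtimebbc fashion
```

Stripping tags before punctuation means the tag pattern does not depend on which punctuation characters happen to survive the earlier step.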

## Remove Stopwords

In [9]:
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

# Add some extra stopwords
stopwords.extend(['well', 'use']) 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samtreacy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
def remove_stopwords(text):
    # Split on single spaces and keep every word that is not a stopword
    cleaned = ' '.join(word for word in text.split(' ') if word not in stopwords)
    return cleaned

In [12]:
df['review'] = df['review'].apply(remove_stopwords)
df['review'][1]

'wonderful little production  filming technique unassuming oldtimebbc fashion gives comforting sometimes discomforting sense realism entire piece  actors extremely chosen michael sheen got polari voices pat truly see seamless editing guided references williams diary entries worth watching terrificly written performed piece masterful production one great masters comedy life  realism really comes home little things fantasy guard rather traditional dream techniques remains solid disappears plays knowledge senses particularly scenes concerning orton halliwell sets particularly flat halliwells murals decorating every surface terribly done'
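One thing to watch: the stopword list contains entries with apostrophes such as "you've", but the apostrophes were already removed from the reviews, so a token like `youve` no longer matches. A possible workaround, sketched here under that assumption, is to strip apostrophes from the stopwords too and keep them in a set for faster lookups. The names `normalise_stopwords` and `remove_stopwords_fast` are hypothetical, and the short list below is a hand-picked stand-in for the nltk list.

```python
def normalise_stopwords(words):
    """Strip apostrophes so entries match text whose punctuation was removed."""
    return {w.replace("'", '') for w in words}

def remove_stopwords_fast(text, stopword_set):
    """Drop every whitespace-separated token found in stopword_set."""
    return ' '.join(w for w in text.split() if w not in stopword_set)

# Assumption: a tiny stand-in for nltk.corpus.stopwords.words('english')
sample_stopwords = ['you', "you've", "it's", 'the', 'a', 'is']
stopword_set = normalise_stopwords(sample_stopwords)

print(remove_stopwords_fast('youve got the feeling its a classic', stopword_set))
# got feeling classic
```

In the notebook this would be called as `normalise_stopwords(stopwords)` on the full list. Note that `text.split()` also collapses the double spaces that `text.split(' ')` leaves behind.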

## Stem words

In [13]:
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

In [14]:
porter_stemmer.stem('programers')

'program'

In [15]:
def stem_words(text):
    combined = []
    split_text = text.split()
    for word in split_text:
        stemmedword = porter_stemmer.stem(word)
        combined.append(stemmedword)
        
    combined = ' '.join(combined)
    return combined

In [16]:
df['review'] = df['review'].apply(stem_words)

In [17]:
df['review'][20]

'success die hard sequel surpris realli 1990 glut die hard movi cash wrong guy wrong place wrong time concept cliffhang die hard mountain time rescu sli stop mom shoot stallon career cliffhang one big nitpick dream especi expert mountain climb basejump aviat facial express act skill full excus dismiss film one overblown pile junk stallon even manag get outact hors howev forget nonsens actual lovabl undeni entertain romp deliv plenti thrill unintent plenti laugh youv got love john lithgow sneeri evil tick everi box band baddi best perman harass hapless turncoat agent rex linn traver may henri portrait serial killer michael rooker noteworthi cringeworthi perform hal insist constantli shriek pain disbelief captor man never hurt anybodi whilst sure cant realli look like ralph wait frank charact grin girl plummet death mention must go former london burn actor craig fairbrass brit bad guy come cropper whilst use hal human footbal ye cant help enjoy bit hal need good kick forget better judgem
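Stemming 50,000 reviews word by word repeats a lot of work, because the same words come up again and again. A rough sketch of one way to avoid that is to remember the stem of each distinct word in a dictionary. `make_cached_stemmer` and `toy_stem` are hypothetical names; `toy_stem` is only a stand-in so the example runs without nltk, and it is not the Porter algorithm.

```python
def make_cached_stemmer(stem):
    """Wrap a word-level stem function so each distinct word is stemmed once."""
    cache = {}

    def stem_text(text):
        stemmed = []
        for word in text.split():
            if word not in cache:
                cache[word] = stem(word)  # computed only on first sight
            stemmed.append(cache[word])
        return ' '.join(stemmed)

    return stem_text, cache

def toy_stem(word):
    # Hypothetical stand-in: drops a trailing 's' from longer words.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

stem_text, cache = make_cached_stemmer(toy_stem)
print(stem_text('movies movies thrills movies'))  # movie movie thrill movie
print(len(cache))                                 # 2
```

With the notebook's objects this would be `stem_text, _ = make_cached_stemmer(porter_stemmer.stem)` followed by `df['review'].apply(stem_text)`.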

## Define Target and Features

In [18]:
y = df['sentiment']

X = df['review']

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((40000,), (40000,), (10000,), (10000,))

## Vectorise using TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorize = TfidfVectorizer()

X_train = vectorize.fit_transform(X_train)
X_test  = vectorize.transform(X_test)
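To make the numbers behind `TfidfVectorizer` less opaque, here is a hand-rolled sketch of the weighting its default settings apply: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit L2 length. The function `tfidf` is a hypothetical helper. It tokenises with a plain `split()`, whereas the real vectorizer also lower-cases and ignores single-character tokens, so treat it as an illustration and not a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenised for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(['great movi', 'bad movi'])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

The term `movi` appears in both documents, so it gets a lower weight than `great`, which appears in only one.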

## Import ML modules

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier

## Evaluate Baseline

In [22]:
model = DummyClassifier(strategy='uniform')

model.fit(X_train, y_train)

DummyClassifier(strategy='uniform')

In [23]:
predictions = model.predict(X_test)

In [24]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(y_test, predictions),'\n')
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

    negative       0.50      0.50      0.50      4961
    positive       0.50      0.50      0.50      5039

    accuracy                           0.50     10000
   macro avg       0.50      0.50      0.50     10000
weighted avg       0.50      0.50      0.50     10000
 

[[2474 2487]
 [2516 2523]]


In [25]:
model = LogisticRegression()

model.fit(X_train, y_train)

predictions = model.predict(X_test)

In [26]:
print(classification_report(y_test, predictions),'\n')
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

    negative       0.90      0.87      0.89      4961
    positive       0.88      0.91      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
 

[[4324  637]
 [ 478 4561]]
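The figures in the classification report can be recovered from the confusion matrix, where rows are the true labels and columns are the predictions. The sketch below recomputes accuracy, precision and recall for the logistic regression matrix above; `report_from_confusion` is a hypothetical helper, not an sklearn function.

```python
def report_from_confusion(cm, labels):
    """Compute accuracy plus per-class precision and recall.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    stats = {}
    for i, label in enumerate(labels):
        predicted_as_i = sum(row[i] for row in cm)  # column total
        actually_i = sum(cm[i])                     # row total
        stats[label] = {
            'precision': cm[i][i] / predicted_as_i,
            'recall': cm[i][i] / actually_i,
        }
    return correct / total, stats

cm = [[4324, 637],
      [478, 4561]]
accuracy, stats = report_from_confusion(cm, ['negative', 'positive'])
print(round(accuracy, 4))  # 0.8885
for label, s in stats.items():
    print(label, round(s['precision'], 2), round(s['recall'], 2))
```

The same function applied to the `DummyClassifier` matrix gives values close to 0.50 throughout, which is what random guessing on two roughly balanced classes should produce.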


## Grid Search

In [30]:
from sklearn.model_selection import GridSearchCV

model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [45]:
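# Note the comma after 'newton-cg': without it Python joins the two adjacent string
# literals into the single invalid solver name 'newton-cglbfgs' seen in the log below.
param_grid = {'penalty': ['l1','l2'],
              'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}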

In [46]:
grid = GridSearchCV(LogisticRegression(), param_grid, refit=True, verbose=2)

In [47]:
#collapse
grid.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] penalty=l1, solver=newton-cglbfgs ...............................
[CV] ................ penalty=l1, solver=newton-cglbfgs, total=   0.0s
[CV] penalty=l1, solver=newton-cglbfgs ...............................
[CV] ................ penalty=l1, solver=newton-cglbfgs, total=   0.0s
[CV] penalty=l1, solver=newton-cglbfgs ...............................
[CV] ................ penalty=l1, solver=newton-cglbfgs, total=   0.0s
[CV] penalty=l1, solver=newton-cglbfgs ...............................
[CV] ................ penalty=l1, solver=newton-cglbfgs, total=   0.0s
[CV] penalty=l1, solver=newton-cglbfgs ...............................
[CV] ................ penalty=l1, solver=newton-cglbfgs, total=   0.0s
[CV] penalty=l1, solver=liblinear ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ..................... penalty=l1, solver=liblinear, total=   0.8s
[CV] penalty=l1, solver=liblinear ....................................
[CV] ..................... penalty=l1, solver=liblinear, total=   1.0s
[CV] penalty=l1, solver=liblinear ....................................
[CV] ..................... penalty=l1, solver=liblinear, total=   0.9s
[CV] penalty=l1, solver=liblinear ....................................
[CV] ..................... penalty=l1, solver=liblinear, total=   0.9s
[CV] penalty=l1, solver=liblinear ....................................
[CV] ..................... penalty=l1, solver=liblinear, total=   0.9s
[CV] penalty=l1, solver=sag ..........................................
[CV] ........................... penalty=l1, solver=sag, total=   0.0s
[CV] penalty=l1, solver=sag ..........................................
[CV] ........................... penalty=l1, solver=sag, total=   0.0s
[CV] penalty=l1, solver=sag ..........................................
[CV] .

[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  1.4min finished


GridSearchCV(estimator=LogisticRegression(),
             param_grid={'penalty': ['l1', 'l2'],
                         'solver': ['newton-cglbfgs', 'liblinear', 'sag',
                                    'saga']},
             verbose=2)

In [50]:
predictions = grid.predict(X_test)

In [52]:
print(classification_report(y_test, predictions),'\n')
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

    negative       0.90      0.87      0.89      4961
    positive       0.88      0.91      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
 

[[4324  637]
 [ 478 4561]]


In [54]:
grid.best_params_

{'penalty': 'l2', 'solver': 'liblinear'}
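Several of the fits in the log finish in 0.0s. This is probably because not every solver supports every penalty: `newton-cg`, `lbfgs` and `sag` only handle `l2`, while `liblinear` and `saga` handle both `l1` and `l2`. Those fits fail immediately, and the warnings are suppressed. One way to search only valid combinations is to pass `GridSearchCV` a list of dictionaries, one per group of compatible settings. The sketch below builds such a grid and counts the candidates with a small hypothetical helper, `expand`, so that it runs without sklearn.

```python
from itertools import product

# Each dict is searched separately, so invalid solver/penalty pairs never occur.
param_grid = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2']},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2']},
]

def expand(grid):
    """List every parameter combination described by a list-of-dicts grid."""
    combos = []
    for block in grid:
        keys = sorted(block)
        for values in product(*(block[k] for k in keys)):
            combos.append(dict(zip(keys, values)))
    return combos

candidates = expand(param_grid)
print(len(candidates))  # 7
for params in candidates:
    print(params)
```

The same `param_grid` can be handed directly to `GridSearchCV(LogisticRegression(), param_grid, refit=True, verbose=2)`. With the default 5 folds that would be 35 fits, none of them wasted on unsupported combinations.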