# Logistic Regression on IMDB Data

The main purpose of this tutorial is to serve as a reference for implementing Logistic Regression whenever I forget the details of the concept.

### IMDB Data Analysis

In [71]:
import imdb_functions  # Includes load_imdb()
import numpy as np

# Load the IMDB review data; change the path to your local data directory.
X_train_corpus, y_train, X_test_corpus, y_test = imdb_functions.load_imdb(path="../../Fall_19/aclImdb")

Loading the imdb data
Train Data loaded.
Test Data loaded.


In [72]:
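# Note: despite the label, this counts the training documents only (the test set is loaded separately).
print("Total Data: "+ str(len(X_train_corpus)))
# Count positive labels (nonzero entries) across train and test; the remainder are negative.
num_positive=len(y_train.nonzero()[0])+len(y_test.nonzero()[0])
num_negative=len(y_train)+len(y_test)-num_positive
print("# positive labels: "+ str(num_positive) + "\n# negative labels: "+ str(num_negative))

# Note: this is the fraction of positive labels in the training set, not a positive/negative ratio.
print("# positive/negative ratio in both train and test: " +str(len(y_train.nonzero()[0])/len(X_train_corpus)))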

Total Data: 25000
# positive labels: 25000
# negative labels: 25000
# positive/negative ratio in both train and test: 0.5


------------------

## Logistic Regression
A naive implementation 

In [9]:
import re # To tokenize without using CountVectorizer
import os
import sklearn as sk

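# The sklearn.feature_extraction.stop_words module is unavailable in recent scikit-learn releases;
# the built-in English stop word list is exposed as ENGLISH_STOP_WORDS instead.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS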
from sklearn.preprocessing import Binarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.linear_model import LogisticRegression

In [4]:
X_train_corpus[2]

"Beast Wars is a show that is over-hyped, overpraised and overrated. Let's meet the characters of this obnoxious show whose creators must have been on acid to try and make a show like this.<br /><br />Cheetor- Seriously, they need to have censor bars on this guy. How come he dosen't creep out the viewers having the same voice as baby Taz? (at least Razzoff from Rayman 3: Hoodlum Havoc is voiced by Slip & Slide) Action Blast- If you want a line of show that suck, get G4 Tranceformers Cybertron- A show that should go down in a toilet. Good Job Creators (Sarcasm) Show it self-Retarded & boring (at least the Super Mario games are better) This show had a lot of followers sayin' bring it back, but I believe that it was cancelled for its own good."

In [5]:
# Using Regex to convert paragraphs into words. Lower for consistency.
re.findall(r"(?u)\b[\w\'/]+\b", X_train_corpus[0].lower())

['this',
 'movie',
 'is',
 'another',
 'christian',
 'propaganda',
 'film',
 'in',
 'the',
 'line',
 'of',
 'the',
 'omega',
 'code',
 'not',
 'that',
 'that',
 'is',
 'necessarily',
 'bad',
 'but',
 'for',
 'the',
 'fact',
 'that',
 'most',
 'propaganda',
 'films',
 'sacrifice',
 'sincerity',
 'and',
 'realism',
 'for',
 'the',
 'message',
 'they',
 'wish',
 'to',
 'deliver',
 'if',
 'you',
 'enjoy',
 'a',
 'styrofoam',
 'portrayal',
 'of',
 'life',
 'on',
 'the',
 'streets',
 'and',
 'the',
 'way',
 'the',
 'gospel',
 'can',
 'change',
 'a',
 'life',
 'than',
 'perhaps',
 'you',
 'may',
 'enjoy',
 'this',
 'movie',
 'i',
 'say',
 'save',
 'your',
 'money',
 'and',
 'rent',
 'the',
 'cross',
 'and',
 'the',
 'switchblade',
 'or',
 'the',
 'mission',
 'when',
 'will',
 'christian',
 'directors',
 'learn',
 'that',
 'sometimes',
 'people',
 'say',
 'bad',
 'words',
 'it',
 'was',
 'frustrating',
 'to',
 'see',
 'criminals',
 'depicted',
 'who',
 'are',
 'not',
 'allowed',
 'to',
 'swear

_TODO: also build a dictionary (vocabulary) of the tokens and count their occurrences._
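
One way to build the word-count dictionary mentioned in the note above is sketched here. It reuses the token pattern from the regex cell; `count_tokens` and `toy_corpus` are hypothetical names introduced for illustration, and the two toy reviews stand in for `X_train_corpus`.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b[\w\'/]+\b"  # same pattern as in the regex cell


def count_tokens(documents):
    """Return a Counter mapping each lowercased token to its total count."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(TOKEN_PATTERN, doc.lower()))
    return counts


# Toy stand-in for X_train_corpus (made-up reviews, not IMDB data)
toy_corpus = ["A great movie, 8/10.", "Not a great movie."]
word_counts = count_tokens(toy_corpus)
print(word_counts.most_common(3))
```

Because `/` is part of the token pattern, a rating such as `8/10` survives as a single token.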

__Notes:__

Miscommunication: I tried to implement LR without using built-in functions, but it looks like I misunderstood the task. Abandoning that attempt and implementing a simple Logistic Regression instead.
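
For reference, a from-scratch Logistic Regression of the kind the note above refers to could look roughly like the sketch below. This is not the abandoned code, only a minimal illustration using batch gradient descent on the mean log-loss. The function names, the toy features, and the hyperparameters (`lr`, `epochs`, `l2`) are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_regression(X, y, lr=0.5, epochs=2000, l2=0.0):
    """Batch gradient descent on the mean log-loss.

    X: list of feature lists, y: list of 0/1 labels.
    Returns (weights, bias). l2 is an optional L2 penalty on the weights.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p(y=1 | x) - y
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Probability of the positive class for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


# Toy binary bag-of-words features: columns = ["great", "boring"] (hypothetical)
X_toy = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
y_toy = [1, 1, 0, 0, 1, 0]
w, b = fit_logistic_regression(X_toy, y_toy)
preds = [int(p >= 0.5) for p in predict_proba(X_toy, w, b)]
print(w, b, preds)
```

The loop is written for readability rather than speed; on the real 25000-document matrix a vectorized or sparse implementation would be needed.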

--------------

### Logistic Regression: the Second Try

In [23]:
import sklearn as sk

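# The sklearn.feature_extraction.stop_words module is unavailable in recent scikit-learn releases;
# the built-in English stop word list is exposed as ENGLISH_STOP_WORDS instead.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS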
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Tokenization and Vectorization

In [40]:
# Configuration for this run:
# regular scikit-learn LogisticRegression on CountVectorizer features, binary=True, unigrams only (ngram_range=(1, 1)).

token = r"(?u)\b[\w\'/]+\b"
vectorizer = CountVectorizer(token_pattern=token, 
                             binary=True,
                             ngram_range=(1,1),
                             min_df=5,
                             stop_words=["the","a","of","and","br","to"])
X_train_vector = vectorizer.fit_transform(X_train_corpus)
X_test_vector = vectorizer.transform(X_test_corpus)
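
To make the effect of `binary=True`, `min_df`, and the custom stop word list concrete, here is a simplified pure-Python sketch of the same idea. It is not scikit-learn's implementation; `binary_bow` and `toy_docs` are hypothetical names, and the toy example uses `min_df=2` so the filtering is visible on three documents.

```python
import re


def binary_bow(documents, token_pattern, min_df=1, stop_words=()):
    """Build a sorted vocabulary and binary document-term rows."""
    stop = set(stop_words)
    # A set per document: repeated tokens collapse to presence/absence
    doc_tokens = [
        {t for t in re.findall(token_pattern, doc.lower()) if t not in stop}
        for doc in documents
    ]
    # Document frequency: number of documents containing each token
    df = {}
    for tokens in doc_tokens:
        for t in tokens:
            df[t] = df.get(t, 0) + 1
    # Keep only tokens that appear in at least min_df documents
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    rows = [[1 if t in tokens else 0 for t in vocab] for tokens in doc_tokens]
    return vocab, rows


toy_docs = ["The movie was great great great", "The movie was boring", "Great cast"]
vocab, rows = binary_bow(toy_docs, r"(?u)\b[\w\'/]+\b", min_df=2, stop_words=["the"])
print(vocab)
print(rows)
```

With binary features, a word repeated many times in one review counts the same as a word used once, so column sums are document frequencies rather than raw term counts.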

In [52]:
X_train_vector.sum(axis=0)

matrix([[153,  49,  78, ...,  13,  17,   5]], dtype=int64)

In [53]:
# Top 10 words by document frequency (binary=True, so each word counts at most once per document):
sum_words = X_train_vector.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
words_freq[:10]

[('this', 22639),
 ('is', 22426),
 ('in', 22039),
 ('it', 21341),
 ('that', 20046),
 ('i', 19244),
 ('but', 17981),
 ('for', 17885),
 ('with', 17467),
 ('was', 16161)]

Fitting Logistic Regression

In [10]:
lr = LogisticRegression()

In [38]:
lr.fit(X_train_vector, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [57]:
lr.predict(X_test_vector)

array([0, 1, 1, ..., 1, 0, 0])

In [70]:
lr.predict_proba(X_test_vector)

array([[8.34583173e-01, 1.65416827e-01],
       [4.58080746e-01, 5.41919254e-01],
       [8.32109572e-03, 9.91678904e-01],
       ...,
       [2.70875533e-04, 9.99729124e-01],
       [9.95231088e-01, 4.76891166e-03],
       [7.78532148e-01, 2.21467852e-01]])
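
Each row above is `[P(negative), P(positive)]` and sums to 1. For a binary model the positive probability is the logistic function applied to the linear score `w·x + b`. The sketch below shows that computation with made-up weights; `positive_probability`, `weights`, and `bias` are hypothetical and are not the fitted values of `lr`.

```python
import math


def positive_probability(x, weights, bias):
    """P(y=1 | x) for a binary logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical two-feature model (not the coefficients fitted above)
weights = [1.5, -2.0]
bias = 0.1
x = [1, 0]  # the first feature is present, the second is absent

p_pos = positive_probability(x, weights, bias)
row = [1.0 - p_pos, p_pos]       # same layout as one predict_proba row
predicted_label = int(p_pos >= 0.5)
print(row, predicted_label)
```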

In [73]:
lr.score(X_train_vector, y_train)

0.99764

In [58]:
lr.score(X_test_vector, y_test)

0.87316

#### Some exciting stuff...
Wooow, exciting.

In [65]:
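# Find the largest (most positive) coefficients.
# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out(); get_feature_names() was removed in later releases.
coefs = list(zip(vectorizer.get_feature_names(), lr.coef_[0]))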
rank_coefs = sorted(coefs, key = lambda x: x[1], reverse=True)
rank_coefs[:10]

[('7/10', 3.177890766667874),
 ('8/10', 1.893127889566014),
 ('refreshing', 1.5762645208255615),
 ('appreciated', 1.5270690687232276),
 ('excellent', 1.5045416011211958),
 ('7', 1.4659954500589691),
 ('hooked', 1.434411192007017),
 ('perfect', 1.3946349369805615),
 ('superb', 1.3491485762585986),
 ('wonderfully', 1.3102587357216458)]
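
A coefficient is easier to read as an odds ratio: with binary features, the presence of a token multiplies the odds of the positive class by `exp(coefficient)`, holding everything else fixed. The short sketch below applies this to the `'7/10'` coefficient printed above; the 50/50 baseline is an assumption chosen only for illustration.

```python
import math

coef_7_10 = 3.177890766667874  # coefficient of '7/10' from the output above

# Presence of the token multiplies the odds of a positive review by this factor
odds_ratio = math.exp(coef_7_10)

# Illustration: start from an assumed baseline of even odds (probability 0.5)
baseline_odds = 1.0
new_odds = baseline_odds * odds_ratio
new_probability = new_odds / (1.0 + new_odds)
print(odds_ratio, new_probability)
```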

In [68]:
# Documents the model is most confident are positive (highest predicted positive probability)
probs = lr.predict_proba(X_train_vector)
positive_probs = list(zip(*probs))[1]
word_prob = list(zip(X_train_corpus, positive_probs))
prob_sorted = sorted(word_prob, key = lambda x: x[1], reverse=True)
prob_sorted[:2]

[('By now you\'ve probably heard a bit about the new Disney dub of Miyazaki\'s classic film, Laputa: Castle In The Sky. During late summer of 1998, Disney released "Kiki\'s Delivery Service" on video which included a preview of the Laputa dub saying it was due out in "1999". It\'s obviously way past that year now, but the dub has been finally completed. And it\'s not "Laputa: Castle In The Sky", just "Castle In The Sky" for the dub, since Laputa is not such a nice word in Spanish (even though they use the word Laputa many times throughout the dub). You\'ve also probably heard that world renowned composer, Joe Hisaishi, who scored the movie originally, went back to rescore the excellent music with new arrangements. Laputa came out before My Neighbor Totoro and after Nausicaa of the Valley of the Wind, which began Studio Ghibli and it\'s long string of hits. And in my opinion, I think it\'s one of Miyazaki\'s best films with a powerful lesson tuckered inside this two hour and four minute

In [69]:
# Most Uncertain Documents.
probs = lr.predict_proba(X_train_vector)
positive_probs = [x[1] for x in probs]
word_prob = list(zip(X_train_corpus, abs(np.array(positive_probs)-0.5))) # Notice -0.5
prob_sorted = sorted(word_prob, key = lambda x: x[1], reverse=False)
prob_sorted[:2]

[('Those engaging the movie camera so early in the century must have figured out some of its potential very early on. This is a good story of a playboy type who needs money and inadvertently sells his soul to Satan for a lot of money. Unfortunately, the soul is his double and he must confront him frequently, tearing his life apart. There are some wonderful scenes with people fading out and, of course, the scenes when the two are on the stage at the same time. The middle part is a bit dull, but the Faustian story is always in the minds of the viewer. One thing I have to mention is the general unattractiveness of the people in the movie. Also, they pretty much shied away from much action which would have at least given some life to the thing. I first was made aware of this movie about 25 years ago and have finally been able to see it. I was not disappointed.',
  0.0024264958418332983),
 ('I don\'t know where to start; the acting, the special effects and the writing are all about as bad a

## RESULTS

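- Logistic Regression test accuracy for ngram_range=(1,1) is 87.316% (training accuracy is 99.764%).

- The most frequent word, counted by the number of training documents that contain it after removing the custom stop words, is 'this'.

- The word with the largest positive coefficient is '7/10'.

- The most confident document is fairly long and contains many words with positive coefficients.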

- The most uncertain documents tend to be short.