# Text Classifier

## Load Data

In [375]:
import numpy as np
import pandas as pd

In [376]:
# !pip install black

In [377]:
import warnings
warnings.filterwarnings('ignore')

I first loaded the data with glob, but this turned out to be less convenient than scikit-learn's load_files, which is used further below.

Code used with the **glob** module:
import glob

# get data file names
path = (
    r"/Users/antoniaschulze/Desktop/03 Movie/aclImdb/test/pos/"
)
filenames = glob.glob(path + "/*.txt")

# read each review file into its own DataFrame
test_file = []
for filename in filenames:
    with open(filename, encoding="utf8") as file_object:
        lines = file_object.readlines()
        test_file.append(pd.DataFrame(lines))

# concatenate all data into one DataFrame
test = pd.concat(test_file, ignore_index=True)


Loading the files with load_files, which treats each subfolder (pos, neg) as one class:

In [378]:
from sklearn.datasets import load_files

reviews_train = load_files(
    "/Users/antoniaschulze/Desktop/03 Movie/aclImdb/train/",
    encoding="utf8",  # specifies the encoding
    categories=["pos", "neg"],  # identified as target variable
)
reviews_test = load_files(
    "/Users/antoniaschulze/Desktop/03 Movie/aclImdb/test/",
    encoding="utf8",
    categories=["pos", "neg"],
)
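To make the folder-to-label idea concrete, here is a rough standard-library sketch. `read_labeled_folder` is a hypothetical helper written for illustration, not scikit-learn's implementation; it only assumes that labels follow the alphabetical order of the folder names, which matches the outputs in this notebook (neg is 0, pos is 1). Unlike load_files, it does not shuffle the samples.

```python
import tempfile
from pathlib import Path


def read_labeled_folder(root, categories):
    """Read every .txt file under root/<category>/ and label it by folder.

    Hypothetical helper for illustration only.
    """
    texts, targets = [], []
    # Labels follow the sorted folder names: "neg" -> 0, "pos" -> 1
    for label, category in enumerate(sorted(categories)):
        for path in sorted(Path(root, category).glob("*.txt")):
            texts.append(path.read_text(encoding="utf8"))
            targets.append(label)
    return texts, targets


# Build a tiny throwaway corpus to try the helper on
with tempfile.TemporaryDirectory() as root:
    for category, text in [("pos", "loved it"), ("neg", "hated it")]:
        folder = Path(root, category)
        folder.mkdir()
        (folder / "0_1.txt").write_text(text, encoding="utf8")
    toy_texts, toy_targets = read_labeled_folder(root, ["pos", "neg"])

print(toy_texts, toy_targets)
```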

Load the files into a DataFrame and assign data and target to the columns review and sentiment:

In [379]:
train = pd.DataFrame(reviews_train.data, columns=["review"])
train["sentiment"] = reviews_train.target
train.sample(n=10)

Unnamed: 0,review,sentiment
15777,what kind of sh*t is this? Power rangers vs Fr...,0
4837,This was one of the most boring movies I've ev...,0
13938,Reviewed at the World Premiere screening Sept....,1
21617,i'm not sure if it is available worldwide - bu...,1
3329,I've read some terrible things about this film...,1
11287,"This is a bad, bad movie. I'm an actual fencer...",0
20252,Obviously a film that has had great influence ...,1
19858,This movie was the beatliest mormon movie made...,0
11377,One of my favorite movies to date starts as an...,1
17301,I sat down to watch this movie with my friends...,0


In [380]:
test = pd.DataFrame(reviews_test.data, columns=["review"])
test["sentiment"] = reviews_test.target
test.sample(n=10)

Unnamed: 0,review,sentiment
23482,"LE CERCLE ROUGE is a very good film, though it...",1
9883,"Like a lot of the comments above me, also I th...",0
9250,With Pep Squad receiving an average of 4.7 on ...,0
20121,(Contains really bad Spoilers) So what can I s...,0
9926,A woman and her aunt go to Scotland to locate ...,1
23996,Abysmal Indonesian action film from legendary ...,0
14310,If you've never experienced the thing that is ...,1
20037,Fantastic movie! One of the best film noir mov...,1
11529,Maybe our standards for Vientam movies have in...,0
19644,"Maybe it's just that it was made in 1997, or m...",0


Let's look at one specific review to see what needs to be cleaned:

In [381]:
stored_example = train["review"][24972]
stored_example

"Horrendous pillaging of a classic.<br /><br />It wasn't written convincingly at all why Mary should develop such sympathy for Bates. He may be more stable until they start playing pranks with him, but he still doesn't help himself at all with his actions. (inviting a comparative stranger to stay alone with him in his until recently disused motel; telling the attractive young girl of his past mental issues; lying about the knives, etc... ) This, in addition to her previous knowledge should have kept Mary extremely wary of him, but this somehow doesn't happen just so they can play the 'mistaken-identity-murder-game later on. Which in itself is also ridiculous: 'So-and-so is the real killer - plus her as well - also him! There were too many contrived twists in order to slap a story on screen when the narrative didn't need extending.<br /><br />It was good to see Perkins reprising his famous role again, but that's about the only small pleasure to be had. It's definitely not a patch on Hit

We can see punctuation, HTML line breaks, spelling mistakes and more, all of which we will handle in the following steps.

### Preprocess the text 
To achieve the best model performance, the classifier should be fed high-quality data. For text classification this means running several preprocessing steps before building the model:

a.) Convert to lower case so that the same word does not appear in different casings. Otherwise the model treats "MoVIe" and "moviE" as two unrelated words and cannot combine their counts.

In [382]:
# Lower-case every review
train['review'] = train['review'].apply(lambda x: x.lower())
# Alternative: split on whitespace, lower-case each word and join again
# (this would also collapse repeated whitespace)
# train['review'] = train['review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
test['review'] = test['review'].apply(lambda x: x.lower())
train.head(n=20)

Unnamed: 0,review,sentiment
0,"zero day leads you to think, even re-think why...",1
1,words can't describe how bad this movie is. i ...,0
2,everyone plays their part pretty well in this ...,1
3,there are a lot of highly talented filmmakers/...,0
4,i've just had the evidence that confirmed my s...,0
5,"the movie was sub-par, but this television pil...",1
6,this movie has a special way of telling the st...,1
7,the single worst film i've ever seen in a thea...,0
8,the plot of this terrible film is so convolute...,0
9,i had no idea that mr. izzard was so damn funn...,1


b.) Remove punctuation, as it adds little value for this sentiment analysis

Keeping our example above in mind, we replace punctuation with a whitespace instead of deleting it. Otherwise hyphenated words such as "mistaken-identity-murder-game" would be fused into a single token.

In [383]:
# Replace every character that is neither a word character nor whitespace with a space.
# Replacing (rather than deleting) keeps words-linked-by-hyphens apart.
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
train["review"] = train["review"].str.replace(r"[^\w\s]", " ", regex=True)
test["review"] = test["review"].str.replace(r"[^\w\s]", " ", regex=True)
train.head()

Unnamed: 0,review,sentiment
0,zero day leads you to think even re think why...,1
1,words can t describe how bad this movie is i ...,0
2,everyone plays their part pretty well in this ...,1
3,there are a lot of highly talented filmmakers ...,0
4,i ve just had the evidence that confirmed my s...,0
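The difference between replacing and deleting punctuation can be checked on a single string. This is a small sketch with the standard `re` module, using the same pattern as the cell above; the sample sentence is adapted from the stored example.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]")  # same pattern as in the cell above

sample = "mistaken-identity-murder-game, isn't it?"

# Replace punctuation with a space: hyphenated parts stay separate words
replaced = PUNCTUATION.sub(" ", sample).split()
# Delete punctuation: hyphenated parts are fused into one long token
deleted = PUNCTUATION.sub("", sample).split()

print(replaced)
print(deleted)
```

Note that the replacement variant also splits contractions ("isn't" becomes "isn" and "t"), which is why tokens such as "t" and "ve" show up in the table above.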


c.) Delete English stopwords

Stopwords are very frequent English words (e.g. "the", "is", "and") that add little meaning to a sentence and can therefore be dropped for sentiment analysis / NLP.

In [384]:
from nltk.corpus import stopwords
# requires a one-time nltk.download('stopwords')
stop = set(stopwords.words('english'))  # set for fast membership tests
# keep a word only if it is not a stopword, then join the words back into the review text
train['review'] = train['review'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
test['review'] = test['review'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train.head()

Unnamed: 0,review,sentiment
0,zero day leads think even think two boys young...,1
1,words describe bad movie explain writing see g...,0
2,everyone plays part pretty well little nice mo...,1
3,lot highly talented filmmakers actors germany ...,0
4,evidence confirmed suspicions bunch kids 14 22...,0
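The filtering step itself can be sketched without NLTK. `STOP` below is a hypothetical mini stopword list chosen for illustration; NLTK's English list is much longer. The example also shows a side effect worth keeping in mind for sentiment analysis: negations disappear, just as "wasn't written convincingly" in the stored example ends up as "written convincingly".

```python
# Hypothetical mini stopword list, for illustration only
STOP = {"it", "was", "not", "at", "all", "a", "the"}


def remove_stopwords(text, stop=STOP):
    """Keep only the words that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stop)


cleaned = remove_stopwords("it was not convincing at all")
print(cleaned)
```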


In the beginning we had `<br /><br />` in the review column.
Before:

In [385]:
stored_example

"Horrendous pillaging of a classic.<br /><br />It wasn't written convincingly at all why Mary should develop such sympathy for Bates. He may be more stable until they start playing pranks with him, but he still doesn't help himself at all with his actions. (inviting a comparative stranger to stay alone with him in his until recently disused motel; telling the attractive young girl of his past mental issues; lying about the knives, etc... ) This, in addition to her previous knowledge should have kept Mary extremely wary of him, but this somehow doesn't happen just so they can play the 'mistaken-identity-murder-game later on. Which in itself is also ridiculous: 'So-and-so is the real killer - plus her as well - also him! There were too many contrived twists in order to slap a story on screen when the narrative didn't need extending.<br /><br />It was good to see Perkins reprising his famous role again, but that's about the only small pleasure to be had. It's definitely not a patch on Hit

The punctuation is now removed, but the token "br" remains. It will be deleted in the next step anyway, so there is no need to worry right now.

In [386]:
train['review'][24972]

'horrendous pillaging classic br br written convincingly mary develop sympathy bates may stable start playing pranks still help actions inviting comparative stranger stay alone recently disused motel telling attractive young girl past mental issues lying knives etc addition previous knowledge kept mary extremely wary somehow happen play mistaken identity murder game later also ridiculous real killer plus well also many contrived twists order slap story screen narrative need extending br br good see perkins reprising famous role small pleasure definitely patch hitchcock intention even trying get close bothering'
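An alternative to relying on a later step would be to strip the HTML line breaks explicitly before removing punctuation. The following is only a sketch of that idea, not what this notebook does; `strip_break_tags` is a hypothetical helper.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BREAK_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_break_tags(text):
    """Replace HTML line breaks with a space so neighbouring words stay apart."""
    return BREAK_TAG.sub(" ", text)


raw = "Horrendous pillaging of a classic.<br /><br />It wasn't written convincingly"
tokens = strip_break_tags(raw).split()
print(tokens)
```

With this in place, "br" never enters the word counts, so it would not take up one of the ten "most common word" slots later on.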

d.) Remove the most common words, as they might affect model performance

Words that occur in almost every review add little value to the model, e.g. "movie" or "film" in this case.

In [387]:
# Count the occurrences of every word across all reviews (not per review)
# value_counts() sorts descending by default, so [:10] gives the 10 most frequent words
freq_train_top = pd.Series(" ".join(train["review"]).split()).value_counts()[:10]
freq_train_top

br       101871
movie     44047
film      40159
one       26795
like      20281
good      15147
time      12727
even      12655
would     12436
story     11988
dtype: int64

You could argue whether "good" should be dropped here, since it carries sentiment, but for now I decided to drop it along with the others.

In [388]:
freq_test_top = pd.Series(' '.join(test['review']).split()).value_counts()[:10]
freq_test_top

br       100080
movie     43924
film      39546
one       26808
like      19891
good      14606
time      12383
even      12216
would     12166
see       11550
dtype: int64

Convert the frequency Series to a list and then keep only the words that are not in that list. Note that the lists are built separately for train and test and differ in one word ("story" vs. "see").

In [389]:
freq_train_top = list(freq_train_top.index)
train['review'] = train['review'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_train_top))
freq_test_top = list(freq_test_top.index)
test['review'] = test['review'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_test_top))
train.head()

Unnamed: 0,review,sentiment
0,zero day leads think think two boys young men ...,1
1,words describe bad explain writing see get gri...,0
2,everyone plays part pretty well little nice be...,1
3,lot highly talented filmmakers actors germany ...,0
4,evidence confirmed suspicions bunch kids 14 22...,0
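One way to avoid different word lists for train and test is to derive the list from the train data only and apply it to both sets. The sketch below uses `collections.Counter` on toy documents; `top_words` and `drop_words` are hypothetical helper names, and this is a variation on the cell above, not the code that produced the outputs in this notebook.

```python
from collections import Counter


def top_words(docs, k):
    """Return the k most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(k)}


def drop_words(docs, words):
    """Remove the given words from every document."""
    return [" ".join(w for w in doc.split() if w not in words) for doc in docs]


toy_train = ["br movie good br", "movie br bad", "film br movie"]
toy_test = ["movie br fine", "br see see see"]

common = top_words(toy_train, k=2)        # learned from train only
clean_train = drop_words(toy_train, common)
clean_test = drop_words(toy_test, common)  # same list applied to test

print(common, clean_train, clean_test)
```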


f.) Lemmatization

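Lemmatization maps inflected word forms to their dictionary form, so that the model does not keep separate counts for variants of the same word, which can help performance. Note that Word.lemmatize() treats every token as a noun by default, so in practice mostly plurals are reduced (e.g. "pranks" becomes "prank"), while "-ing" and "-ly" forms such as "playing" or "convincingly" stay unchanged, as the before/after comparison at the end of this section shows.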

In [390]:
# !pip install textblob
from textblob import Word

In [391]:
train['review'] = train['review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
test['review'] = test['review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train.head()

Unnamed: 0,review,sentiment
0,zero day lead think think two boy young men co...,1
1,word describe bad explain writing see get grip...,0
2,everyone play part pretty well little nice bel...,1
3,lot highly talented filmmaker actor germany no...,0
4,evidence confirmed suspicion bunch kid 14 22 p...,0


g.) Drop words with count <= 5

If a word does not occur more than 5 times across all reviews, it will hardly help prediction. So I create another Series with the same structure as freq_train_top, containing the count for every word, and then keep a word only if its count is greater than 5.

Comment: something similar could have been done in the CountVectorizer itself, e.g. CountVectorizer(min_df=5), which keeps words that appear in at least 5 documents.

In [392]:
freq_train = pd.Series(" ".join(train["review"]).split()).value_counts()
freq_test = pd.Series(" ".join(test["review"]).split()).value_counts()
freq_train.head()

character    14183
get          12515
make         12229
see          12016
really       11738
dtype: int64

In [393]:
train['review'] = train['review'].apply(lambda x: " ".join(x for x in x.split() if freq_train[x] > 5))
test['review'] = test['review'].apply(lambda x: " ".join(x for x in x.split() if freq_test[x] > 5))
train.head()

Unnamed: 0,review,sentiment
0,zero day lead think think two boy young men co...,1
1,word describe bad explain writing see get grip...,0
2,everyone play part pretty well little nice bel...,1
3,lot highly talented filmmaker actor germany no...,0
4,evidence confirmed suspicion bunch kid 14 22 p...,0
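The filter above uses the total count of a word, while min_df in CountVectorizer works on document frequency (the number of reviews that contain the word). The toy sketch below, using only `collections.Counter`, shows that the two can disagree; the documents are made up for illustration.

```python
from collections import Counter

toy_docs = ["spam spam spam spam spam spam", "eggs ham", "eggs"]

# Total count: every occurrence of a word counts
total_count = Counter(word for doc in toy_docs for word in doc.split())
# Document frequency: a word counts at most once per document
doc_frequency = Counter(word for doc in toy_docs for word in set(doc.split()))

# Same rule as in the cell above: keep words that occur more than 5 times overall
kept = sorted(word for word, count in total_count.items() if count > 5)

print(total_count["spam"], doc_frequency["spam"])
print(total_count["eggs"], doc_frequency["eggs"])
print(kept)
```

Here "spam" passes the total-count filter although it appears in only one document, so a document-frequency threshold would treat it differently.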


h.) Compare review before and after

In [394]:
stored_example

"Horrendous pillaging of a classic.<br /><br />It wasn't written convincingly at all why Mary should develop such sympathy for Bates. He may be more stable until they start playing pranks with him, but he still doesn't help himself at all with his actions. (inviting a comparative stranger to stay alone with him in his until recently disused motel; telling the attractive young girl of his past mental issues; lying about the knives, etc... ) This, in addition to her previous knowledge should have kept Mary extremely wary of him, but this somehow doesn't happen just so they can play the 'mistaken-identity-murder-game later on. Which in itself is also ridiculous: 'So-and-so is the real killer - plus her as well - also him! There were too many contrived twists in order to slap a story on screen when the narrative didn't need extending.<br /><br />It was good to see Perkins reprising his famous role again, but that's about the only small pleasure to be had. It's definitely not a patch on Hit

In [395]:
train["review"][24972]

'horrendous classic written convincingly mary develop sympathy bates may stable start playing prank still help action inviting comparative stranger stay alone recently disused motel telling attractive young girl past mental issue lying knife etc addition previous knowledge kept mary extremely wary somehow happen play mistaken identity murder game later also ridiculous real killer plus well also many contrived twist order slap screen narrative need extending see perkins reprising famous role small pleasure definitely patch hitchcock intention trying get close bothering'

# CountVectorizer

In [409]:
train.sample(n=10)

Unnamed: 0,review,sentiment
3570,first got give people got thing together 9 11 ...,1
2000,watched babysitter part eclipse drive cult cla...,1
21790,much potential anyone followed jeffrey know ma...,0
14688,actually lie shrek 3 actually first 3d animate...,0
9895,saw dull waste hbo comedy channel quite innoce...,0
3684,saw sneak two day official opening must say ex...,0
18120,guess fitting tribute first superman crummy pa...,0
3686,big fan musical loved film fred astaire ginger...,1
8679,actor filmmaker certainly audience among air p...,0
11106,watching many next action star reality tv eps ...,0


Declare the feature and target sets needed for the logistic regression:

In [410]:
y_test, y_train = test["sentiment"], train["sentiment"]  # target variable
X_test, X_train = test["review"], train["review"]

Apply the *CountVectorizer* to the train data to get a matrix with one row per review and one column per unique word. Each cell holds the number of times that word occurs in that review.

In [411]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# learn the vocabulary from the train data
vec.fit(X_train)
X_train = vec.transform(X_train)
X_train  # vectorized data --> "bag of words"
X_train.shape  # (number of reviews, vocabulary size)

(25000, 23649)

In [412]:
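# Densify only the first five rows instead of the whole matrix to save memory.
# On scikit-learn >= 1.0 use vec.get_feature_names_out() instead of get_feature_names().
pd.DataFrame(X_train[:5].toarray(), columns=vec.get_feature_names())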

Unnamed: 0,00,000,007,00s,01,02,03,04,05,06,...,zorak,zorro,zp,zu,zucco,zucker,zuckerman,zulu,zuniga,zwick
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [413]:
# transform test data
X_test = vec.transform(X_test)

# Modeling

Fit a logistic regression to the bag-of-words matrix. The model is fitted on the **train** data only; the default evaluation metric for scikit-learn classifiers is accuracy.

In [414]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# declare model as type of Logit Regression
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)

I understood "compute the accuracy score" as follows: predict the labels for the test data and then compute the accuracy of these predictions against the true labels.

In [422]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, prediction)

0.855
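Accuracy is simply the share of predictions that match the true labels. A minimal plain-Python sketch of that definition follows; `accuracy` is a hypothetical helper written for illustration, and the toy labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


toy_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 predictions are right
print(toy_score)
```

Read this way, a score of 0.855 on the 25,000 test reviews means that about 21,375 of them were classified correctly.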

Both matrices have the same number of columns, because the vectorizer was fitted on the train set and can only count words it already knows from there. Words that appear only in the test set are ignored.

In [423]:
X_train.shape

(25000, 23649)

In [424]:
X_test.shape

(25000, 23649)
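Why the two shapes agree can be reproduced on a toy corpus. The sketch below assumes scikit-learn is installed; the documents and the `toy_` names are made up for illustration. It reads the column order from `vocabulary_`, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_docs = ["good movie", "bad movie", "good good plot"]
toy_test_docs = ["good unseen word"]

toy_vec = CountVectorizer()
toy_train = toy_vec.fit_transform(toy_train_docs)  # learn vocabulary + count
toy_test = toy_vec.transform(toy_test_docs)        # count known words only

# Words sorted by their column index
toy_vocabulary = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(toy_vocabulary)
print(toy_train.toarray())
print(toy_test.toarray())
```

The test matrix has the same four columns as the train matrix; "unseen" and "word" are silently dropped because they were not in the train vocabulary.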

# Model Optimization
Using GridSearchCV to optimize the C hyperparameter of the logistic regression

Concatenate the train and test data:

In [425]:
df = [train, test]
dataset = pd.concat(df)
# .size counts cells: 50,000 rows x 2 columns = 100,000
print("Proof it worked: ", dataset.size)

Proof it worked:  100000


- Declare different sets again 
- Apply vectorizer again to the entire dataset bag of words

In [426]:
X_dataset, y_dataset = dataset['review'], dataset['sentiment']
vec.fit(X_dataset)  # load words "learn vocabulary"
X_dataset = vec.transform(X_dataset)
X_dataset.shape

(50000, 27983)

Define param_grid with the range of C values to try and set up the cross-validated grid search:

In [429]:
param_grid = {"C": np.linspace(0.0001, 1, num=10)}  # 10 evenly spaced values between 0.0001 and 1
grid = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=10, n_jobs=-1)
grid
# n_jobs = -1 declares using all cores

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'C': array([1.000e-04, 1.112e-01, 2.223e-01, 3.334e-01, 4.445e-01, 5.556e-01,
       6.667e-01, 7.778e-01, 8.889e-01, 1.000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Fit the model with the different C values on the combined dataset

Collect the cross-validation results for the different hyperparameters

In [430]:
grid.fit(X_dataset, y_dataset)
results = pd.DataFrame(grid.cv_results_)
# results

Print the best score achieved with the corresponding best C-value

In [431]:
print("Best Score: ", grid.best_score_, " with ", grid.best_params_)

Best Score:  0.89198  with  {'C': 0.11120000000000001}
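A possible variation, sketched here on made-up toy data, is to put the vectorizer and the classifier into one Pipeline and run the grid search over that. Each cross-validation fold then learns its vocabulary from its own training part only. This is not what the cells above do; the `toy_` names, the step names "vec" and "clf", and cv=2 are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

toy_texts = [
    "great film loved it", "wonderful acting great plot",
    "loved the story", "great fun wonderful cast",
    "terrible film hated it", "awful acting boring plot",
    "hated the story", "boring awful waste",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ("vec", CountVectorizer()),     # refitted inside every CV fold
    ("clf", LogisticRegression()),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
toy_param_grid = {"clf__C": np.linspace(0.0001, 1, num=10)}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_param_grid, cv=2)
toy_grid.fit(toy_texts, toy_labels)

toy_best_C = toy_grid.best_params_["clf__C"]
print("Best C:", toy_best_C)
print(toy_grid.predict(["great wonderful film"]))
```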


Use the best estimator found by the grid search:

In [432]:
grid.best_estimator_

LogisticRegression(C=0.11120000000000001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [433]:
model = grid.best_estimator_
# use the best C for the original train data and then predict it and compute accuracy for test data
model.fit(X_train, y_train)
prediction = model.predict(X_test)
accuracy_score(y_test, prediction)

0.87032