# Movie Reviews (Text Classification Project)
## 1. Prepare Data
### Read in Data
* Drop all nulls
* Remove any blank reviews (i.e. whitespace only)

In [10]:
# load libraries
import pandas as pd
import numpy as np

# read in dataset
in_path = 'C:/Users/Matthew.Allen2/Documents/GitHub/Data-Science-Private/Udemy/NLP Course Files/TextFiles/'
df = pd.read_csv(in_path + 'moviereviews.tsv', sep='\t')

# peek at data
df.head()

   label  review
0  neg    how do films like mouse hunt get into theatres...
1  neg    some talented actresses are blessed with a dem...
2  pos    this has been an extraordinary year for austra...
3  pos    according to hollywood movies made in last few...
4  neg    my first press screening of 1998 and already i...


In [11]:
# check shape
df.shape

(2000, 2)

In [12]:
# check nulls
df.isnull().sum()

label      0
review    35
dtype: int64

In [13]:
# drop nulls
df.dropna(inplace=True)

In [14]:
# check for blanks
blanks = []

# iterate through (index, label, review)
for i, lb, rv in df.itertuples():
    # check if whitespace
    if rv.isspace():
        # store index of blanks
        blanks.append(i)
        
# remove blanks at selected index
df.drop(blanks, inplace=True)

# check output shape
df.shape

(1938, 2)
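
A side note on the whitespace check above: `str.isspace()` returns `False` for an empty string, so the loop only flags reviews that contain at least one character, all of it whitespace. In this notebook that is most likely harmless, since `pd.read_csv` by default loads empty fields as NaN and `dropna` has already removed them. A minimal sketch of the difference, using made-up strings (`samples` is hypothetical, not taken from the dataset):

```python
# Hypothetical review values, invented to exercise each case
samples = ["a fine film", "   ", "\t\n", ""]

# the check used in the loop: True only for non-empty, all-whitespace strings
flagged_by_isspace = [s for s in samples if s.isspace()]

# a stricter check: also flags the empty string
flagged_by_strip = [s for s in samples if not s.strip()]

print(len(flagged_by_isspace), len(flagged_by_strip))
```

If empty strings could ever reach this step, `not s.strip()` would cover both cases.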

### Train/Test Split
* Hold out 30% of the reviews as a test set and train on the remaining 70%

In [16]:
# load libraries
from sklearn.model_selection import train_test_split

# extract X and y
X, y = df['review'], df['label']

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
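
As a sanity check on the split, the set sizes can be worked out by hand. This sketch assumes scikit-learn's behaviour of rounding the test count up when `test_size` is a fraction, and takes the row count from the cleaned DataFrame above:

```python
import math

n_samples = 1938   # rows left after dropping nulls and blanks
test_size = 0.3    # fraction passed to train_test_split

# assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The test count of 582 agrees with the support total in the classification report further down.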

## 2. Modelling
### Build Pipeline
* Stages for:
    * TF-IDF Vectorize (`TfidfVectorizer` does both of the following in a single stage)
        * Count Vectorize (frequency counts of each unique word)
        * TF-IDF Transformation (importance of words relative to their frequency across all documents)
    * LinearSVC model (linear classifier that maximises the margin between the decision boundary and the nearest training points, the support vectors)
* Fit pipeline to training data

In [18]:
# load libraries
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# create pipeline object
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# fit pipeline to training data
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
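
To make the TF-IDF stage more concrete, here is a pure-Python sketch of the weighting on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which to my understanding matches the `TfidfVectorizer` defaults. Tokenisation is simplified to a whitespace split, and `docs` and `tfidf` are invented for illustration:

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not from the review dataset)
docs = [
    "great film great cast",
    "boring film",
    "great fun",
]

def tfidf(docs):
    """Return (idf per term, one {term: weight} dict per document)."""
    tokenised = [d.split() for d in docs]  # simplified tokeniser
    n_docs = len(tokenised)

    # document frequency: number of documents containing each term
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))

    # smoothed inverse document frequency (assumed default formula)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)  # the "count vectorize" step
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})  # L2 normalise
    return idf, rows

idf, rows = tfidf(docs)
print(rows[1])
```

In the second toy document, "boring" ends up with a larger weight than "film" because it appears in fewer documents, which is the effect the TF-IDF bullet describes.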

### Make Predictions
* Make predictions on the test set with the trained model
* Assess performance (accuracy, precision, recall, f1) of predicted values against known values on the test set

In [20]:
# load libraries
from sklearn.metrics import classification_report, accuracy_score

# make test predictions
y_pred = text_clf.predict(X_test)

# evaluate model
# confusion matrix via crosstab, which labels the actual and predicted classes explicitly
pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted  neg  pos  All
Actual
neg        235   47  282
pos         41  259  300
All        276  306  582


In [21]:
# classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [22]:
# accuracy score
print(round(accuracy_score(y_test, y_pred), 2))

0.85
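
The figures in the classification report can be reproduced by hand from the crosstab counts. A minimal sketch using the standard definitions of precision, recall and f1 (the variable and function names are my own):

```python
# counts read off the crosstab above (actual -> predicted)
neg_as_neg, neg_as_pos = 235, 47
pos_as_neg, pos_as_pos = 41, 259

def prf(correct, wrongly_predicted_as_class, missed_from_class):
    """Precision, recall and f1 for one class."""
    precision = correct / (correct + wrongly_predicted_as_class)
    recall = correct / (correct + missed_from_class)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# for 'neg': false positives are pos reviews predicted neg, misses are neg predicted pos
neg_scores = prf(neg_as_neg, pos_as_neg, neg_as_pos)
pos_scores = prf(pos_as_pos, neg_as_pos, pos_as_neg)

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

print([round(s, 2) for s in neg_scores])
print([round(s, 2) for s in pos_scores])
print(round(accuracy, 2))
```

Rounded to two decimals, these agree with the report: 0.85 / 0.83 / 0.84 for `neg`, 0.85 / 0.86 / 0.85 for `pos`, and 0.85 accuracy.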


Summary:
* The initial model is a reasonable baseline: overall accuracy is 85% on the test set
* Precision, recall and f1-scores are similar for positive and negative reviews (0.83 to 0.86)
* So the model is not struggling with one particular class; it is a reasonably good predictor of both
* There is still a fair amount of misclassification: 88 of 582 test reviews, split roughly evenly between the two error types (47 negative reviews predicted as positive, 41 positive reviews predicted as negative)