In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('/content/moviereviews2.tsv', sep='\t')
df.head()

,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


In [3]:
len(df)

6000

In [4]:
print(df['review'][0])  ## Let's check a positive review of the movie (row 0 is labelled 'pos')

I loved this movie and will watch it again. Original twist to Plot of Man vs Man vs Self. I think this is Kurt Russell's best movie. His eyes conveyed more than most actors words. Perhaps there's hope for Mankind in spite of Government Intervention?


In [5]:
print(df['review'][2]) ## Let's check a positive review of the movie

I was not expecting the powerful filmmaking experience of "Girlfight". It's an Indie; low-budget, no big-name actors, freshman director. I had heard it was good, but not this good.<br /><br />Placed in a contemporary, ethnic, working-class Brooklyn, Karyn Kusama has done an extraordinary job of capturing the day-do-day struggles of urban Latinos. Diana, the protagonist, is seething with anger and lashes out at her high school peers, getting in trouble with the school and her friends. She is being raised by her single father, who appears to love her and her brother, but applies a strict, sex-based double standard on his children. The father's double standard is illustrated by the fact that Tiny, the brother, is taking boxing lessons at the local gym, but Diana is denied similar pursuits. On an errand to the gym to meet Tiny, Diana is captivated by boxing. Tiny doesn't like boxing, so he and Diana trade places; he gets the money from Dad then gives it to Diana to take the lessons in his 



```
# Let's train a model on the raw text to predict whether a new review is positive or negative!
```



In [6]:
## Checking for missing values first
df.isnull().sum()

label      0
review    20
dtype: int64



```
# In the output above, no labels are missing, but 20 reviews are. Let's remove those rows.
```



In [7]:
df.dropna(inplace=True)  # inplace=True modifies df directly instead of returning a new DataFrame

In [8]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [9]:
## Collect the index of every review that contains only whitespace.
blanks = []

for i, lb, rv in df.itertuples():
  if rv.isspace():
    blanks.append(i)  # Store the index label of each whitespace-only review

In [10]:
blanks

[]

In [11]:
# Drop the rows whose reviews are blank (a no-op here, since `blanks` is empty)
df.drop(blanks, inplace=True)
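One caveat about the loop above: `str.isspace()` returns `False` for a truly empty string, so `''` would not be collected. Here is a minimal sketch of the difference, using a made-up `samples` list that is not taken from the dataset. Checking `not s.strip()` is one possible way to catch both cases.

```python
# Hypothetical sample strings, for illustration only.
samples = ["", "   ", "\t\n", "A real review."]

# str.isspace() is False for the empty string, so "" slips through.
flagged_by_isspace = [s for s in samples if s.isspace()]

# Stripping first treats empty and whitespace-only strings alike.
flagged_by_strip = [s for s in samples if not s.strip()]

print(flagged_by_isspace)  # ['   ', '\t\n']
print(flagged_by_strip)    # ['', '   ', '\t\n']
```

In this notebook `blanks` came back empty, so the distinction may not matter for this file, but it is worth keeping in mind for other datasets.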

In [12]:
len(df)  ## Length of the cleaned DataFrame (after dropping missing and whitespace-only reviews)

5980

In [13]:
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
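A quick sanity check on the split sizes. This sketch assumes that, for a fractional `test_size`, scikit-learn rounds the test share up and gives the remaining rows to the training set; that is my understanding of its behaviour, so treat it as an assumption rather than a guarantee.

```python
import math

n_samples = 5980   # rows left after cleaning, from len(df) above
test_size = 0.33   # same fraction passed to train_test_split

# Assumed rule: test rows are rounded up, training rows get the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4006 1974
```

The 1974 test rows agree with the total support shown in the classification report further down.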



```
# Let's build a pipeline
```



In [15]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [16]:
test_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())]).fit(X_train, y_train)

In [17]:
predictions = test_clf.predict(X_test)
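Since the goal is to classify new reviews, here is a rough sketch of calling `predict` on unseen text. To keep it self-contained it fits the same kind of pipeline on a tiny made-up corpus; `toy_reviews`, `toy_labels`, `toy_clf` and `new_reviews` are hypothetical names and data, not part of the notebook, and a model trained on four sentences is not meaningful beyond showing the call pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, for illustration only (not from moviereviews2.tsv).
toy_reviews = [
    "I loved this movie, a warm and touching story",
    "Wonderful acting and a great script",
    "Terrible plot and awful acting",
    "A boring, badly written waste of time",
]
toy_labels = ["pos", "pos", "neg", "neg"]

# Same two steps as the notebook pipeline: TF-IDF features, then a linear SVM.
toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
]).fit(toy_reviews, toy_labels)

# predict() takes an iterable of raw strings; the pipeline vectorizes them itself.
new_reviews = ["What a great, touching movie", "Awful and boring"]
new_predictions = toy_clf.predict(new_reviews)
print(new_predictions)
```

With the pipeline fitted in this notebook, `test_clf.predict(["some new review text"])` should work the same way, as long as the input is a list of strings rather than a single bare string.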

In [18]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

In [19]:
print(confusion_matrix(y_test, predictions))
print('--------------------------------------------------------')
print(classification_report(y_test, predictions))

[[900  91]
 [ 63 920]]
--------------------------------------------------------
              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [20]:
print(metrics.accuracy_score(y_test, predictions))

0.9219858156028369
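The figures above can be reproduced by hand from the confusion matrix. This is a small sketch in plain Python, assuming the usual scikit-learn layout where rows are true labels and columns are predicted labels, sorted as `['neg', 'pos']`. The `cm`, `labels` and `report` names are my own, introduced only for this illustration.

```python
# Counts copied from the confusion matrix printed above.
# Rows are true labels, columns are predicted labels, ordered ['neg', 'pos'].
cm = [[900, 91],
      [63, 920]]
labels = ['neg', 'pos']
total = sum(sum(row) for row in cm)

report = {}
for k, name in enumerate(labels):
    true_k = sum(cm[k])                      # support: rows that really are class k
    predicted_k = sum(row[k] for row in cm)  # rows the model assigned to class k
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / true_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (precision, recall, f1, true_k)
    print(f'{name}: precision={precision:.2f} recall={recall:.2f} '
          f'f1={f1:.2f} support={true_k}')

# Accuracy is the diagonal (correct predictions) over all test rows.
accuracy = sum(cm[k][k] for k in range(len(labels))) / total
print(f'accuracy={accuracy:.4f}')  # accuracy=0.9220
```

So 1820 of the 1974 test reviews were classified correctly, with 91 negative reviews predicted as positive and 63 positive reviews predicted as negative.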
