## 1. Import the libraries:

In [56]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics

## 2. Load the dataset:

In [25]:
data = pd.read_csv("./moviereviews.tsv", sep="\t")
data.head()

   label                                             review
0    neg  how do films like mouse hunt get into theatres...
1    neg  some talented actresses are blessed with a dem...
2    pos  this has been an extraordinary year for austra...
3    pos  according to hollywood movies made in last few...
4    neg  my first press screening of 1998 and already i...


In [68]:
print(data['review'][0])

how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternate

## 3. Exploratory Data Analysis:

In [28]:
data.shape

(2000, 2)

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   2000 non-null   object
 1   review  1965 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


#### 3.1 Check for the null values:

In [30]:
data.isna().sum()

label      0
review    35
dtype: int64

There are 35 null values in the `review` column. We drop these rows:

In [31]:
data.dropna(inplace=True)

In [32]:
data.isna().sum()

label     0
review    0
dtype: int64

#### 3.2 Check for empty strings:

In [33]:
blank_indexes = []


# Each tuple from itertuples() is (index, label, review)
for ind, lbl, rev in data.itertuples():
    if rev.isspace():
        blank_indexes.append(ind)

In [34]:
print(f"The dataset also contains empty (whitespace-only) reviews.\n"
      f"Total empty reviews: {len(blank_indexes)}")

The dataset also contains empty (whitespace-only) reviews.
Total empty reviews: 27


In [35]:
blank_indexes

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]

##### Dropping the empty reviews:

In [36]:
data.drop(blank_indexes, inplace=True)

In [37]:
data.shape

(1938, 2)

We started with 2000 rows. After dropping the 35 missing reviews and the 27 whitespace-only reviews, we are now left with 1938 rows.
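
The two cleaning steps above (dropping nulls, then dropping whitespace-only reviews) could also be written as a vectorized filter instead of a Python loop. The sketch below is only an illustration on a hypothetical four-row frame named `toy`, not the real `moviereviews.tsv`. Note one difference from the loop: `str.strip() != ""` also removes truly empty strings, which `isspace()` does not flag.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (assumption for illustration only)
toy = pd.DataFrame({
    "label": ["neg", "pos", "pos", "pos"],
    "review": ["bad film", None, "   ", "great film"],
})

# Step 1: drop rows whose review is missing
cleaned = toy.dropna(subset=["review"])

# Step 2: keep only rows whose review still has text after stripping whitespace
cleaned = cleaned[cleaned["review"].str.strip() != ""]

print(cleaned.shape)
```

On the toy frame, only the first and last rows survive.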

## 4. Encoding and Splitting the data into training and testing data:

In [41]:
data.head(2)

   label                                             review
0    neg  how do films like mouse hunt get into theatres...
1    neg  some talented actresses are blessed with a dem...


In [42]:
X = data.iloc[:, 1].values
y = data.iloc[:, 0].values

#### 4.1 Encoding the target values into binary using LabelEncoder:

In [47]:
encoder = LabelEncoder()
encoder

In [48]:
y = encoder.fit_transform(y)
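
To see which integer each label received, the fitted encoder's `classes_` attribute can be inspected. `LabelEncoder` sorts the classes, so `neg` should map to 0 and `pos` to 1. A minimal sketch, assuming a small hypothetical list `labels` in place of the real label column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the label column (assumption for illustration only)
labels = ["neg", "pos", "pos", "neg"]

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

# classes_ is sorted, so the string at position i is encoded as integer i
mapping = {str(cls): code for code, cls in enumerate(enc.classes_)}

print(mapping)
print(encoded.tolist())
```

`enc.inverse_transform` converts integer predictions back to the original string labels.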

#### 4.2 Splitting the data into training and testing data:

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
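
The split above does not pass `stratify`, so the share of positive and negative reviews may differ slightly between the training and testing sets. As a hedged sketch on hypothetical toy arrays (`X_toy` and `y_toy` are not part of this notebook), passing `stratify` keeps the class proportions equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 negative (0) and 10 positive (1) samples
X_toy = np.array([f"review {i}" for i in range(20)])
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy preserves the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```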

#### 4.3 Build a Pipeline:

##### Logistic Regression:

In [54]:
pipe_one = Pipeline([('tf_idf_vect', TfidfVectorizer()),
                    ('log_reg', LogisticRegression())])

pipe_one.fit(X_train, y_train)

In [59]:
pred = pipe_one.predict(X_test)

In [61]:
print(f"Confusion Matrix:\n{metrics.confusion_matrix(y_test, pred)}")

Confusion Matrix:
[[233  49]
 [ 51 249]]
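
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. Since `LabelEncoder` sorts the labels, class 0 is `neg` and class 1 is `pos`. The worked example below copies the four counts from the matrix above and derives accuracy, precision, and recall for the `pos` class from them. Treating `pos` as the class of interest is an assumption made for illustration.

```python
import numpy as np

# Counts copied from the Logistic Regression confusion matrix
cm = np.array([[233, 49],
               [51, 249]])

# ravel() flattens row by row: true-neg, false-pos, false-neg, true-pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # of predicted positives, how many are truly positive
recall = tp / (tp + fn)           # of true positives, how many were found

print(f"Accuracy:  {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall:    {recall * 100:.2f}%")
```

The accuracy computed this way, 482 correct out of 582 test reviews, is the same quantity that `pipe_one.score(X_test, y_test)` reports.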


In [63]:
print(f"Training: {pipe_one.score(X_train, y_train) * 100:.2f}%")
print(f"Testing: {pipe_one.score(X_test, y_test) * 100:.2f}%")

Training: 96.46%
Testing: 82.82%


The gap between training accuracy (96.46%) and testing accuracy (82.82%) suggests that the model is **overfitting**.
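
One common thing to try against overfitting is stronger regularization. In `LogisticRegression`, a smaller `C` means a stronger penalty. The sketch below is not part of the original notebook: it searches a few `C` values with `GridSearchCV` on a hypothetical toy corpus (`texts`, `labels`), and the grid values are arbitrary assumptions. Whether this actually narrows the gap on the movie reviews would have to be checked on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
texts = ["great film loved it", "wonderful acting great plot",
         "loved the story", "brilliant and moving",
         "terrible film hated it", "awful acting bad plot",
         "hated the story", "boring and dull"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipe = Pipeline([("tf_idf_vect", TfidfVectorizer()),
                 ("log_reg", LogisticRegression())])

# "<step name>__<parameter>" addresses a parameter inside the pipeline;
# the candidate values for C are illustrative only
param_grid = {"log_reg__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(texts, labels)

print(grid.best_params_)
print(f"Mean cross-validated accuracy: {grid.best_score_:.2f}")
```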

##### SVC:

In [64]:
pipe_two = Pipeline([('tf_idf_vect', TfidfVectorizer()),
                    ('svc', LinearSVC())])


pipe_two.fit(X_train, y_train)

In [65]:
pred = pipe_two.predict(X_test)

In [66]:
print(f"Confusion Matrix:\n{metrics.confusion_matrix(y_test, pred)}")

Confusion Matrix:
[[235  47]
 [ 41 259]]


In [67]:
print(f"Training: {pipe_two.score(X_train, y_train) * 100:.2f}%")
print(f"Testing: {pipe_two.score(X_test, y_test) * 100:.2f}%")

Training: 100.00%
Testing: 84.88%


Compared to Logistic Regression, Linear SVC seems to overfit even more: it fits the training data perfectly (100.00%), yet reaches only 84.88% on the testing data.
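
The comparison can be made concrete by computing the train/test gap for each model from the accuracies printed above. This is plain arithmetic on the reported numbers. The dictionary names `scores` and `gaps` are introduced here for illustration.

```python
# (training accuracy, testing accuracy) in percent, copied from the outputs above
scores = {
    "LogisticRegression": (96.46, 82.82),
    "LinearSVC": (100.00, 84.88),
}

# Gap in percentage points between training and testing accuracy
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2f} points")
```

The gap is wider for Linear SVC (15.12 points against 13.64), although its testing accuracy is still about two points higher than that of Logistic Regression.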