
___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report,precision_score,recall_score,confusion_matrix
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline



In [33]:
df=pd.read_csv('https://raw.githubusercontent.com/Ali-Asgar-Lakdawala/NLP/main/Data/moviereviews2.tsv',sep='\t')
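
As a minimal sketch of what `sep='\t'` does, the snippet below parses a small in-memory sample laid out like the file (a `label` column and a `review` column). The three rows are made up for illustration and are not taken from the dataset. Note that an empty field is read as `NaN`, while a field containing only spaces is kept as a string, which is why the two cases are checked separately later on.

```python
import io
import pandas as pd

# Hypothetical sample in the same tab-delimited layout as the dataset
sample_tsv = (
    "label\treview\n"
    "pos\tA warm, touching movie.\n"
    "neg\t\n"        # empty field: parsed as NaN
    "neg\t   \n"     # spaces only: kept as a string
)

# sep='\t' tells pandas that columns are separated by tabs, not commas
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)
print(sample_df.isna().sum())
```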

### Task #2: Check for missing values:

In [3]:
# Check for NaN values:
df.isna().sum()

label      0
review    20
dtype: int64

In [34]:
df.head()

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


In [5]:
df.review

0       I loved this movie and will watch it again. Or...
1       A warm, touching movie that has a fantasy-like...
2       I was not expecting the powerful filmmaking ex...
3       This so-called "documentary" tries to tell tha...
4       This show has been my escape from reality for ...
                              ...                        
5995    Of the three remakes of this plot, I like them...
5996    Poor Whoopi Goldberg. Imagine her at a friend'...
5997    Honestly before I watched this movie, I had he...
5998    This movie is essentially shot on a hand held ...
5999    It has singing. It has drama. It has comedy. I...
Name: review, Length: 6000, dtype: object

### Task #3: Remove NaN values:

In [8]:
# Show the rows whose review is missing (NaN)
df[df['review'].isna()]

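| | label | review |
|---|---|---|
| 825 | neg | NaN |
| 895 | neg | NaN |
| 1889 | neg | NaN |
| 2038 | pos | NaN |
| 2260 | pos | NaN |
| 2452 | neg | NaN |
| 2713 | pos | NaN |
| 2980 | pos | NaN |
| 3182 | neg | NaN |
| 3250 | pos | NaN |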


In [9]:
df.dropna(inplace=True)

In [10]:
# Check for whitespace strings (it's OK if there aren't any!):
blank=[]
for index,label,review in df.itertuples():
  # isspace must be called; comparing the method object itself to True is always False
  if review.isspace():
    blank.append(index)

In [11]:
blank


[]
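
The same check can also be written without an explicit loop. The sketch below uses the pandas string accessor `Series.str.isspace()` on a small made-up DataFrame (the `toy` data is hypothetical). One caveat: `isspace()` is `False` for an empty string, so this only flags strings that contain at least one character and nothing but whitespace.

```python
import pandas as pd

# Hypothetical sample; the second review is whitespace only
toy = pd.DataFrame({
    "label": ["pos", "neg", "neg"],
    "review": ["Great film.", "   ", "Dull."],
})

# Boolean mask: True where the review consists solely of whitespace
blank_mask = toy["review"].str.isspace()

# Index labels of the blank rows, then a copy of the frame without them
blank_idx = toy[blank_mask].index.tolist()
cleaned = toy.drop(index=blank_idx)

print(blank_idx)
print(len(cleaned))
```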

### Task #4: Take a quick look at the `label` column:

In [12]:
df.label.head()

0    pos
1    pos
2    pos
3    neg
4    pos
Name: label, dtype: object

In [14]:
df.label.value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [21]:
X=df.review
y=df.label

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state=42)
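
The split above does not use stratification, so the class proportions in the test set can drift slightly from those in the full data. If identical proportions matter, `train_test_split` accepts a `stratify` argument. The following is a sketch on made-up data (the `toy_X` and `toy_y` names are hypothetical), not a change to the settings suggested for this assessment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 20 'pos' and 20 'neg' labels
toy_X = pd.Series([f"review {i}" for i in range(40)])
toy_y = pd.Series(["pos", "neg"] * 20)

# stratify=toy_y keeps the pos/neg ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(y_te.value_counts())
```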

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [25]:
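from sklearn.svm import LinearSVC

# Convert the raw review text into TF-IDF feature vectors.
# Caution: fitting on all of X lets vocabulary and IDF statistics from the
# test reviews leak into training; fitting on X_train alone avoids this.
tfidf_vect=TfidfVectorizer()
tfidf_vect.fit(X)
X_train_vect=tfidf_vect.transform(X_train)
X_test_vect=tfidf_vect.transform(X_test)

# Train a linear support vector classifier on the vectorized training set
linear_svm=LinearSVC()
linear_svm_model=linear_svm.fit(X_train_vect,y_train)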
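
Since the task asks for a pipeline, here is a minimal sketch of how the vectorizer and classifier could be chained with scikit-learn's `Pipeline`. The toy corpus and the step names `tfidf` and `clf` are assumptions made for illustration. A pipeline fits the vectorizer only on the data passed to `fit`, and it accepts raw text directly in `predict`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for X_train / y_train
train_texts = [
    "I loved this movie",
    "A warm, touching film",
    "Terrible plot and bad acting",
    "Boring and far too long",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Each step is a (name, estimator) pair; the last step is the classifier
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# fit() vectorizes the training text and trains the classifier in one call
text_clf.fit(train_texts, train_labels)

# predict() applies the already-fitted vectorizer before classifying
toy_pred = text_clf.predict(["touching movie", "bad and boring"])
print(toy_pred)
```

With the real data, the equivalent calls would be `text_clf.fit(X_train, y_train)` followed by `text_clf.predict(X_test)`.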

### Task #7: Run predictions and analyze the results

In [27]:
# Form a prediction set
y_pred=linear_svm_model.predict(X_test_vect)

In [29]:
# Report the confusion matrix
print(confusion_matrix(y_test,y_pred))


[[898  93]
 [ 61 922]]
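
To read this matrix: scikit-learn puts the true labels on the rows and the predicted labels on the columns, with labels sorted alphabetically (`neg`, then `pos`). The sketch below copies the four counts from the output above and recomputes the overall accuracy by hand; treating `pos` as the positive class is an assumption made for naming the cells.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
# rows: true neg, true pos; columns: predicted neg, predicted pos
cm = np.array([[898, 93],
               [61, 922]])

# With 'pos' as the positive class, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of reviews on the diagonal (correct predictions)
accuracy = (tn + tp) / cm.sum()

print(tn, fp, fn, tp)
print(accuracy)
```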


In [30]:
# Print a classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         neg       0.94      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974
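
The per-class figures in this report follow directly from the confusion matrix counts. As a worked check, the sketch below recomputes precision, recall, and F1 for both classes from the four counts reported earlier (898, 93, 61, 922). The variable names are chosen for illustration.

```python
# Counts from the confusion matrix: rows are true labels, columns predictions
true_neg_pred_neg, true_neg_pred_pos = 898, 93
true_pos_pred_neg, true_pos_pred_pos = 61, 922

# Precision: of everything predicted as a class, how much was correct
precision_neg = true_neg_pred_neg / (true_neg_pred_neg + true_pos_pred_neg)
precision_pos = true_pos_pred_pos / (true_pos_pred_pos + true_neg_pred_pos)

# Recall: of everything truly in a class, how much was found
recall_neg = true_neg_pred_neg / (true_neg_pred_neg + true_neg_pred_pos)
recall_pos = true_pos_pred_pos / (true_pos_pred_pos + true_pos_pred_neg)

# F1: harmonic mean of precision and recall
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(precision_neg, 2), round(recall_neg, 2), round(f1_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
```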



In [32]:
from sklearn.metrics import accuracy_score
# Print the overall accuracy
accuracy_score(y_test,y_pred)

0.9219858156028369

## Great job!