
## Bag of Words and feature reduction

In this notebook, we will see how to develop a machine learning model on textual inputs. The goal of the project is to classify the sentiment of movie reviews.

Let's import useful Python packages.

In [159]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset

Stanford University researchers took 50,000 movie reviews from [IMDB](https://www.imdb.com/), labelled them as either positive or negative, and [made them available](http://ai.stanford.edu/~amaas/data/sentiment/). We created a dataset with 2,500 positive reviews and 2,500 negative reviews.

Let's read the dataset, which is available at: https://github.com/andvise/DataAnalyticsDatasets/blob/8e8f6475f49d2a587e4f5c76cdf0b011b22c6ac1/dataset_5000_reviews.csv

In [160]:
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/8e8f6475f49d2a587e4f5c76cdf0b011b22c6ac1/dataset_5000_reviews.csv?raw=true")

In [161]:
df.head()

,Review,Sentiment
0,Shiri Appleby is the cutest little embodiment ...,Negative
1,"Normally, I have much better things to do with...",Negative
2,this movie is not good.the first one almost su...,Negative
3,"As a biographical film, ""The Lady With Red Hai...",Positive
4,I do not fail to recognize Haneke's above-aver...,Negative


In [162]:
df.tail()

,Review,Sentiment
4995,Los Angeles TV news reporter Jennifer (the bea...,Positive
4996,"This film is absolutely awful, but nevertheles...",Negative
4997,...however I am not one of them. Caro Diario a...,Negative
4998,This film had a great cast going for it: Chris...,Negative
4999,If you look at Corey Large's information here ...,Negative


In [163]:
df['Sentiment'].value_counts()

Sentiment,count
Negative,2500
Positive,2500


## Preprocessing

Let's encode the labels as 0 and 1 using the *LabelEncoder*.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [164]:
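# YOUR CODE HERE
from sklearn.preprocessing import LabelEncoder

X = df["Review"]     # raw review text (features)
y = df["Sentiment"]  # "Negative" / "Positive" labels

encoder = LabelEncoder()

# Could encoding before the split cause any issues? Not here: the encoder only
# maps the class names to integers (Negative -> 0, Positive -> 1) and learns
# nothing from the review text.
y = encoder.fit_transform(y)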
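
To make the label mapping concrete, here is a minimal sketch on a hypothetical toy list of labels (the `toy_*` names are illustrative and not part of the lab data). It shows that `LabelEncoder` assigns integers in sorted class order and that `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, for illustration only (not the review dataset).
toy_labels = ["Positive", "Negative", "Negative", "Positive"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ holds the sorted class names; the position of a name is its code.
toy_classes = list(toy_encoder.classes_)

# inverse_transform maps the integer codes back to the original strings.
toy_decoded = list(toy_encoder.inverse_transform(toy_encoded))

print(toy_classes)
print(toy_encoded)
print(toy_decoded)
```

Because the classes are sorted alphabetically, "Negative" becomes 0 and "Positive" becomes 1, so the mean of the encoded labels is the fraction of positive reviews.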

In [165]:
df.head()

,Review,Sentiment
0,Shiri Appleby is the cutest little embodiment ...,Negative
1,"Normally, I have much better things to do with...",Negative
2,this movie is not good.the first one almost su...,Negative
3,"As a biographical film, ""The Lady With Red Hai...",Positive
4,I do not fail to recognize Haneke's above-aver...,Negative


Split the dataset into training set (80%) and test set (20%).

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [166]:
# YOUR CODE HERE

from sklearn.model_selection import train_test_split

# random_state makes the split reproducible.
# stratify=y keeps the same class proportions in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of positive (label 1) reviews in the test set
np.mean(y_test)

0.5

The test set contains 50% positive reviews, matching the class balance of the full dataset.
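
The effect of `stratify` is easier to see on imbalanced labels. The following sketch uses hypothetical toy arrays (`toy_X`, `toy_y`, 80 zeros and 20 ones), not the review data, and compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 zeros and 20 ones.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# Stratified split: the test set keeps the 80/20 class ratio.
_, _, _, y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# Plain random split: the class ratio in the test set may drift.
_, _, _, y_test_plain = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

print("stratified share of ones:", y_test_strat.mean())
print("unstratified share of ones:", y_test_plain.mean())
```

With a perfectly balanced dataset such as this one, the difference is usually small, but stratifying removes it entirely.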

In [167]:
print("Train_X:", X_train.shape)
print("Train_y:", y_train.shape)
print("Test_X:", X_test.shape)
print("Test_y:", y_test.shape)

Train_X: (4000,)
Train_y: (4000,)
Test_X: (1000,)
Test_y: (1000,)


## Classification Task



*   Build a machine learning pipeline using CountVectorizer and a KNN classifier.
*   Use the given parameter grid and GridSearchCV to find the optimal set of hyperparameters.
*   Fit the best model on the full training set.
*   Evaluate its performance on the test set.
*   Assess whether adding TruncatedSVD (feature extraction) improves the performance.


https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
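
One way to sketch the workflow in the task list is to chain the vectorizer and the classifier in a scikit-learn `Pipeline` and hand that to `GridSearchCV`. The example below is only an illustration under stated assumptions: the toy corpus, the `toy_*` names, the reduced grid and `cv=3` are all hypothetical and chosen so that the snippet runs on twelve short documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the review dataset): 6 positive, 6 negative.
toy_reviews = [
    "great film loved it", "wonderful acting great story",
    "loved the great cast", "wonderful and moving film",
    "great fun loved every minute", "wonderful great loved",
    "awful film hated it", "terrible acting awful story",
    "hated the awful cast", "terrible and boring film",
    "awful mess hated every minute", "terrible awful hated",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is refitted inside every cross-validation fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as "<step name>__<parameter name>".
toy_grid = {"knn__n_neighbors": [1, 3], "knn__p": [1, 2]}

search = GridSearchCV(pipe, toy_grid, cv=3)
search.fit(toy_reviews, toy_labels)

print(search.best_params_)

# The refitted best pipeline accepts raw text directly.
toy_predictions = search.predict(
    ["loved this wonderful film", "hated this terrible film"]
)
print(toy_predictions)
```

Passing raw text to the pipeline means the vocabulary is learned only from the training part of each fold, which keeps the cross-validation score honest.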


In [168]:
# YOUR CODE HERE

# CountVectorizer (Bag of Words) followed by TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

vect = CountVectorizer()
X_train = vect.fit_transform(X_train)  # Bag of Words: sparse matrix of token counts

# TruncatedSVD is similar to PCA but works directly on sparse matrices.
# Note: with no arguments it keeps only n_components=2 features.
svd = TruncatedSVD()
X_train = svd.fit_transform(X_train)
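
To see how strongly the default `TruncatedSVD()` compresses a Bag of Words matrix, the sketch below compares the default with an explicit `n_components` on a hypothetical six-document corpus (`toy_docs` and the value 4 are illustrative assumptions, not tuned settings).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only.
toy_docs = [
    "great film loved it",
    "wonderful acting great story",
    "loved the great cast",
    "awful film hated it",
    "terrible acting awful story",
    "hated the awful cast",
]

# Sparse document-term matrix: one column per vocabulary word.
counts = CountVectorizer().fit_transform(toy_docs)

# Default settings keep only two components.
default_svd = TruncatedSVD(random_state=0)
reduced_default = default_svd.fit_transform(counts)

# Asking for more components retains more of the original information.
wider_svd = TruncatedSVD(n_components=4, random_state=0)
reduced_wider = wider_svd.fit_transform(counts)

print("original shape:", counts.shape)
print("default SVD shape:", reduced_default.shape)
print("4-component SVD shape:", reduced_wider.shape)
print("variance kept (default):", default_svd.explained_variance_ratio_.sum())
print("variance kept (4 comps):", wider_svd.explained_variance_ratio_.sum())
```

On a real vocabulary with tens of thousands of words, two components are unlikely to separate the classes, so `n_components` is worth treating as a hyperparameter.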

In [169]:
# KNN classifier tuned with a grid search
from sklearn import neighbors
from sklearn import model_selection

knn = neighbors.KNeighborsClassifier()

# n_neighbors: number of neighbours that vote
# p: Minkowski power (1 = Manhattan distance, 2 = Euclidean distance)
parameters = {'n_neighbors': [1, 3, 5, 7], 'p': [1, 2]}

# GridSearchCV uses 5-fold cross-validation by default and refits the best
# model on the full training set.
clf = model_selection.GridSearchCV(knn, parameters)
clf.fit(X_train, y_train)

print("The best classifier is:", clf.best_estimator_)
print("Its accuracy is:", clf.best_score_)  # mean cross-validated accuracy
print("Its parameters are:", clf.best_params_)

The best classifier is: KNeighborsClassifier(n_neighbors=7, p=1)
Its accuracy is: 0.50875
Its parameters are: {'n_neighbors': 7, 'p': 1}


In [170]:
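# Why is accuracy so low after dimensionality reduction? It is close to chance
# (0.5): TruncatedSVD() keeps only 2 components by default, so almost all of
# the Bag of Words information is discarded.
# Why may there still be some data leakage? vect and svd were fitted on the
# whole training set before cross-validation, so every validation fold
# influenced the features. The test set is only transformed here.

X_test = vect.transform(X_test)
X_test = svd.transform(X_test)
clf.best_estimator_.score(X_test, y_test)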

0.51
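
A possible way to assess the effect of TruncatedSVD without leaking information between folds is to put every step in one `Pipeline` and let the grid search switch the SVD step on and off. This is a sketch under assumptions, not the lab's reference solution: the toy corpus, the `toy_*` names, the candidate `n_components` values and `cv=3` are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the review dataset): 6 positive, 6 negative.
toy_reviews = [
    "great film loved it", "wonderful acting great story",
    "loved the great cast", "wonderful and moving film",
    "great fun loved every minute", "wonderful great loved",
    "awful film hated it", "terrible acting awful story",
    "hated the awful cast", "terrible and boring film",
    "awful mess hated every minute", "terrible awful hated",
]
toy_labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# "passthrough" disables the SVD step, so one search covers both variants.
toy_grid = {
    "svd": [
        "passthrough",
        TruncatedSVD(n_components=2, random_state=0),
        TruncatedSVD(n_components=4, random_state=0),
    ],
    "knn__n_neighbors": [1, 3],
}

search = GridSearchCV(pipe, toy_grid, cv=3)
search.fit(toy_reviews, toy_labels)

# Mean cross-validated accuracy for each candidate configuration.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

On the real data, the same pattern would take the raw `Review` strings as input, with the vectorizer and SVD refitted on the training part of each fold.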