
## Bag of Words and feature reduction

In this notebook, we will see how to develop a machine learning model on textual inputs. The goal of the project is to classify the sentiment of movie reviews.

Let's import useful Python packages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset

Stanford University researchers took 50,000 movie reviews from [IMDB](https://www.imdb.com/), labelled them as either positive or negative, and [made them available](http://ai.stanford.edu/~amaas/data/sentiment/). We created a smaller dataset with 2,500 positive reviews and 2,500 negative reviews.

Let's read the dataset, which is available at: https://github.com/andvise/DataAnalyticsDatasets/blob/8e8f6475f49d2a587e4f5c76cdf0b011b22c6ac1/dataset_5000_reviews.csv

In [None]:
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/8e8f6475f49d2a587e4f5c76cdf0b011b22c6ac1/dataset_5000_reviews.csv?raw=true")

In [None]:
df.head()

Unnamed: 0,Review,Sentiment
0,Shiri Appleby is the cutest little embodiment ...,Negative
1,"Normally, I have much better things to do with...",Negative
2,this movie is not good.the first one almost su...,Negative
3,"As a biographical film, ""The Lady With Red Hai...",Positive
4,I do not fail to recognize Haneke's above-aver...,Negative


In [None]:
df.tail()

Unnamed: 0,Review,Sentiment
4995,Los Angeles TV news reporter Jennifer (the bea...,Positive
4996,"This film is absolutely awful, but nevertheles...",Negative
4997,...however I am not one of them. Caro Diario a...,Negative
4998,This film had a great cast going for it: Chris...,Negative
4999,If you look at Corey Large's information here ...,Negative


In [None]:
df['Sentiment'].value_counts()

Sentiment,count
Negative,2500
Positive,2500


## Preprocessing

Let's encode the labels as 0 and 1 using the *LabelEncoder*.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [None]:
from sklearn.preprocessing import LabelEncoder

y = df['Sentiment']

X = df['Review']

encoder = LabelEncoder()

y = encoder.fit_transform(y)
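
To see which label ends up as 0 and which as 1, the following minimal sketch runs `LabelEncoder` on a handful of made-up labels (the `labels` list and the `toy_encoder` name are illustrative and not part of the dataset). The encoder sorts the distinct classes, so `Negative` should map to 0 and `Positive` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration only
labels = ["Positive", "Negative", "Negative", "Positive"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(labels)

# classes_ holds the sorted class names; the position is the encoded value
print(toy_encoder.classes_)
print(encoded)

# inverse_transform maps the integers back to the original strings
print(toy_encoder.inverse_transform([0, 1]))
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns model predictions back into readable labels.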


Split the dataset into training set (80%) and test set (20%).



https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [None]:
print("Train_X:", X_train.shape)
print("Train_y:", y_train.shape)
print("Test_X:", X_test.shape)
print("Test_y:", y_test.shape)

Train_X: (4000,)
Train_y: (4000,)
Test_X: (1000,)
Test_y: (1000,)
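
The `stratify=y` argument keeps the class proportions the same in both splits. A small sketch on synthetic, balanced labels (the `toy_X` and `toy_y` arrays are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 50 per class (illustrative only)
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 50 + [1] * 50)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(toy_y_train)
test_counts = np.bincount(toy_y_test)
print("train:", train_counts, "test:", test_counts)
```

With stratification, both splits stay balanced, which matters because accuracy is easier to interpret when the classes are evenly represented.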


## Classification Task



*   Build a classification approach using CountVectorizer and a KNN classifier.
*   Use the given parameter grid and GridSearchCV to find the best set of hyperparameters.
*   Fit the best model on the full training set.
*   Evaluate its performance on the test set.
*   Assess whether adding TruncatedSVD (feature extraction) improves performance.


https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html


In [None]:
# Import necessary libraries (matplotlib was already imported at the top of the notebook)
from sklearn import neighbors
from sklearn import metrics
from sklearn import model_selection

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer (for Bag of Words model)
vec = CountVectorizer()

# Initialize TruncatedSVD for dimensionality reduction (similar to PCA)
svd = TruncatedSVD(n_components=50)


# WORDS TO NUMBERS (CountVectorizer)
# Apply Bag of Words model on training data
X_train = vec.fit_transform(X_train)  # Convert text data to a sparse matrix of word counts (overwrites the raw text)

# DIMENSIONALITY REDUCTION
# Apply TruncatedSVD to reduce dimensionality of the feature space
X_train = svd.fit_transform(X_train)  # Project the word counts onto 50 components


# Initialize KNN classifier
knn = neighbors.KNeighborsClassifier()

# Define the grid of hyperparameters to search over
parameters = {'n_neighbors': [1, 3, 5],  # Number of neighbors to use
              'p': [1, 2]}  # Power of the Minkowski distance (1 = Manhattan, 2 = Euclidean)

# Initialize GridSearchCV to search for the best hyperparameters
clf = model_selection.GridSearchCV(knn, parameters)

# Fit the grid search with training data
clf.fit(X_train, y_train)  # Perform cross-validation to find the best hyperparameters

# Print the best classifier found by the grid search
print("The best classifier is:", clf.best_estimator_)
# Print the best accuracy score from cross-validation
print("Its accuracy is:", clf.best_score_)
# Print the best hyperparameters found during grid search
print("Its parameters are:", clf.best_params_)


The best classifier is: KNeighborsClassifier(p=1)
Its accuracy is: 0.6102500000000001
Its parameters are: {'n_neighbors': 5, 'p': 1}
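
In the cell above, the vectorizer and the SVD are fit on the whole training set before cross-validation, so every validation fold has already contributed to the vocabulary and the components. One common way to avoid this is to wrap the three steps in a scikit-learn `Pipeline`, so that each fold refits them on its own training part. The following is a minimal sketch under stated assumptions: the toy corpus is made up, the step names `vec`, `svd` and `knn` are arbitrary, and `n_components=2` is used only because the toy vocabulary is tiny.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Made-up corpus, repeated so that 5-fold cross-validation has enough samples
pos = ["great film loved it", "wonderful acting great story", "loved the plot",
       "brilliant and moving", "great fun wonderful cast"]
neg = ["terrible film hated it", "awful acting boring story", "hated the plot",
       "dull and boring", "awful waste terrible cast"]
texts = pos * 4 + neg * 4
labels = [1] * 20 + [0] * 20

# All preprocessing lives inside the pipeline, so it is refit within each fold
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [1, 3, 5], "knn__p": [1, 2]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(texts, labels)  # raw text goes in; no manual vectorizing needed

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
```

In this notebook, the same idea would mean passing the raw review text (before it is overwritten by the vectorized matrix) together with `n_components=50`. The fitted `search` object can then predict directly from raw strings, for example `search.predict(["great wonderful film"])`.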


In [None]:

# Preprocess the test data using the same vectorizer and SVD
X_test = vec.transform(X_test)  # Convert test text data to the same sparse matrix form
X_test = svd.transform(X_test)  # Apply the same dimensionality reduction to the test data

# Evaluate the best model from grid search on the test set
clf.best_estimator_.score(X_test, y_test)  # Compute the accuracy of the best model on test data


# clf.best_estimator_ gives you the best model found by grid search.
# clf.best_estimator_.score() gives you the performance of that model on test data

0.613
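
The last task asks whether TruncatedSVD helps, which requires a baseline without it. One way to set up that comparison is to cross-validate two pipelines that differ only in the SVD step. The sketch below runs on a made-up corpus, so its scores say nothing about the movie reviews; it fixes `n_neighbors=5`, leaves `p` at its default of 2 for simplicity, and the names `bow_only` and `bow_svd` are hypothetical.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Made-up corpus, for illustration only
pos = ["great film loved it", "wonderful acting great story", "loved the plot",
       "brilliant and moving", "great fun wonderful cast"]
neg = ["terrible film hated it", "awful acting boring story", "hated the plot",
       "dull and boring", "awful waste terrible cast"]
texts = pos * 4 + neg * 4
labels = [1] * 20 + [0] * 20

# Two candidates that differ only in whether the SVD step is present
candidates = {
    "bow_only": make_pipeline(
        CountVectorizer(),
        KNeighborsClassifier(n_neighbors=5),
    ),
    "bow_svd": make_pipeline(
        CountVectorizer(),
        TruncatedSVD(n_components=2, random_state=42),
        KNeighborsClassifier(n_neighbors=5),
    ),
}

# Mean 5-fold accuracy for each candidate
scores = {
    name: cross_val_score(model, texts, labels, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Running the same comparison on the raw training reviews, with `n_components=50`, would give a like-for-like answer to whether the reduction improves on the plain bag-of-words features.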