# Week8 - Homework KNN-NB-SVM

- Use GridSearchCV on X_train dataset
    - KNN, NB, SVM, Logistic Regression, Decision Trees
- Test on X_test dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
X_train = pd.read_csv('https://github.com/msaricaumbc/DS_data/raw/master/ds602/movie/X_train.csv')
y_train = pd.read_csv('https://github.com/msaricaumbc/DS_data/raw/master/ds602/movie/y_train.csv')

In [3]:
X_test = pd.read_csv('https://github.com/msaricaumbc/DS_data/raw/master/ds602/movie/X_final.csv')
y_test = pd.read_csv('https://github.com/msaricaumbc/DS_data/raw/master/ds602/movie/y_final.csv')

# Exploratory Data Analysis (EDA)

We start with exploratory data analysis because we do not yet know what the data looks like.
Inspecting its shape, a few sample rows, and any missing values tells us what cleaning, if any, is needed:

In [4]:
print("Checking for the train data containing the number of rows and columns")
print("X_train data:", X_train.shape)
print("y_train data:", y_train.shape)

Checking for the train data containing the number of rows and columns
X_train data: (40000, 1)
y_train data: (40000, 1)


In [5]:
print("Checking for the test data containing number of rows and columns")
print("X_test data:", X_test.shape)
print("y_test data:", y_test.shape)

Checking for the test data containing number of rows and columns
X_test data: (10000, 1)
y_test data: (10000, 1)


Previewing the first rows of the training data to see how it is represented:

In [6]:
print("Sample data in X_train:")
print(X_train.head())

Sample data in X_train:
                                              review
0  Shame, is a Swedish film in Swedish with Engli...
1  I know it's rather unfair to comment on a movi...
2  "Bread" very sharply skewers the conventions o...
3  After reading tons of good reviews about this ...
4  During the Civil war a wounded union soldier h...


In [7]:
print("Sample data of y_train:")
print(y_train.head())

Sample data of y_train:
   sentiment
0          1
1          0
2          1
3          1
4          1


Previewing the first rows of the test data to see how it is represented:

In [8]:
print("Sample data in X_test:")
print(X_test.head())

Sample data in X_test:
                                              review
0  I first saw Heimat 2 on BBC2 in the 90's when ...
1  I sat down to watch "Midnight Cowboy" thinking...
2  I can never fathom why people take time to rev...
3  With that line starts one silly, boring Britis...
4  Here's the spoiler: At the end of the movie, a...


In [9]:
print("Sample data in y_test:")
print(y_test.head())

Sample data in y_test:
   sentiment
0          1
1          1
2          1
3          0
4          0


To check for null or missing values in the data, we use the isnull().sum() method:

In [10]:
print("Number of missing values present in X_train:\n", X_train.isnull().sum())
print("Number of missing values present in y_train:\n", y_train.isnull().sum())

Number of missing values present in X_train:
 review    0
dtype: int64
Number of missing values present in y_train:
 sentiment    0
dtype: int64


# Train-Test Split

In [11]:
# squeeze() converts the single-column DataFrames into 1-dimensional Series
y_train = y_train.squeeze()
y_test = y_test.squeeze()
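
As a small illustration of what `squeeze()` does here, the sketch below uses a made-up four-row frame (`labels_df` is a hypothetical stand-in for `y_train`, not the real data):

```python
import pandas as pd

# Hypothetical single-column frame standing in for y_train
labels_df = pd.DataFrame({"sentiment": [1, 0, 1, 1]})
labels_series = labels_df.squeeze()

print(labels_df.shape)               # (4, 1)
print(labels_series.shape)           # (4,)
print(type(labels_series).__name__)  # Series
```

One caveat: on a frame with a single row and a single column, `squeeze()` returns a scalar. Passing `squeeze("columns")` collapses only the column axis, which may be the safer choice when the row count is not known in advance.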

In [12]:
# The full training set is large, so we keep only a 10% sample to reduce training time
subset_size = 0.1  
X_train, _, y_train, _ = train_test_split(X_train, y_train, train_size=subset_size, random_state=42)
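
The subsample above is drawn without stratification. If preserving the class ratio in the sample matters, `train_test_split` accepts a `stratify` argument. The sketch below shows this on synthetic data; the 70/30 imbalance, `texts`, and `labels` are assumptions made for illustration and say nothing about the actual class balance of the movie reviews.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 "reviews" with a 70/30 class imbalance (assumed)
texts = np.array([f"review {i}" for i in range(1000)])
labels = np.array([1] * 700 + [0] * 300)

# stratify=labels keeps the class proportions in the 10% sample
texts_sub, _, labels_sub, _ = train_test_split(
    texts, labels, train_size=0.1, stratify=labels, random_state=42
)

print(len(texts_sub))     # size of the 10% sample
print(labels_sub.mean())  # share of positive labels in the sample
```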

In [13]:
label_encoder_y = LabelEncoder()
y_train_encoded = label_encoder_y.fit_transform(y_train)
y_test_encoded = label_encoder_y.transform(y_test)
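
Since the `sentiment` column already holds the integers 0 and 1, the encoding step should leave the values unchanged. A minimal check on made-up labels (the array below is hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels that are already 0/1 integers
labels = np.array([1, 0, 1, 1, 0])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                 # [0 1]
print(np.array_equal(encoded, labels))  # True
```

The encoder becomes necessary only when labels arrive as strings such as "pos" and "neg".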

# Pipelines

This dataset contains only raw review text, with no numerical or categorical features. Each pipeline therefore starts with a TfidfVectorizer, which turns the text into numeric TF-IDF features that the classifier can work with.

In [14]:
# Defining the pipelines and parameter grids
pipelines = {
    'Logistic Regression': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ]),
    'SVM': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('clf', SVC())
    ]),
    'KNN': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('clf', KNeighborsClassifier())
    ]),
    'Naive Bayes': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('clf', MultinomialNB())
    ]),
    'Decision Tree': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('clf', DecisionTreeClassifier())
    ])
}
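
To make the vectorizer step concrete, here is a minimal sketch of what `TfidfVectorizer` produces with its default settings. The three short documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the review text
docs = ["a great movie", "a boring movie", "great acting, great story"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary term
print(tfidf.shape)
# The default tokenizer drops one-character tokens such as "a"
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix, which is why the same vectorizer step can sit in front of every classifier in the pipelines above.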

# Hyperparameter Tuning

In [15]:
param_grids = {
    'Logistic Regression': {
        'clf__C': [0.1, 1, 10]
    },
    'SVM': {
        'clf__C': [0.1, 1, 10],
        'clf__gamma': [0.01, 0.1, 1]
    },
    'KNN': {
        'clf__n_neighbors': [3, 5, 7, 9]
    },
    'Naive Bayes': {},
    'Decision Tree': {
        'clf__max_depth': [None, 10, 20]
    }
}
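
To get a feel for how much work each grid implies, the sketch below counts the candidate settings per model with scikit-learn's `ParameterGrid`. The dictionary `grids` is a copy of the grids above under a different name, and `cv_folds = 3` assumes 3-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the parameter grids above, renamed to avoid shadowing
grids = {
    'Logistic Regression': {'clf__C': [0.1, 1, 10]},
    'SVM': {'clf__C': [0.1, 1, 10], 'clf__gamma': [0.01, 0.1, 1]},
    'KNN': {'clf__n_neighbors': [3, 5, 7, 9]},
    'Naive Bayes': {},
    'Decision Tree': {'clf__max_depth': [None, 10, 20]},
}
cv_folds = 3  # assumed number of cross-validation folds

counts = {}
for name, grid in grids.items():
    # An empty grid still yields one candidate: the default settings
    counts[name] = len(ParameterGrid(grid))
    print(f"{name}: {counts[name]} candidates, {counts[name] * cv_folds} CV fits")
```

The empty Naive Bayes grid means that model is only cross-validated with its defaults rather than tuned.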

In [16]:
# Run a 3-fold cross-validated grid search for every model
results = {}
for model_name, pipeline in pipelines.items():
    param_grid = param_grids[model_name]
    grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
    grid_search.fit(X_train.squeeze(), y_train_encoded)
    results[model_name] = grid_search
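
After fitting, each `GridSearchCV` object exposes the selected settings through `best_params_` and the mean cross-validated score through `best_score_`. The sketch below shows how these could be inspected, using a made-up twelve-document corpus (`toy_texts`, `toy_labels`) in place of the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus: six positive and six negative snippets
toy_texts = [
    "great movie", "loved the acting", "wonderful story", "great fun",
    "really enjoyable film", "a wonderful cast",
    "boring movie", "terrible acting", "awful story", "boring and slow",
    "really terrible film", "an awful mess",
]
toy_labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
search = GridSearchCV(pipeline, {'clf__C': [0.1, 1, 10]}, cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)  # the C value with the best mean CV accuracy
print(search.best_score_)   # that mean CV accuracy
```

In the notebook, the same attributes would be available as `results[model_name].best_params_` and `results[model_name].best_score_`.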

In [17]:
# Score each tuned model (refit with its best parameters) on the test set
for model_name, grid_search in results.items():
    accuracy = grid_search.score(X_test.squeeze(), y_test_encoded)
    print(f'{model_name} - Test Accuracy: {accuracy}')
    print()


Logistic Regression - Test Accuracy: 0.8641

SVM - Test Accuracy: 0.8662

KNN - Test Accuracy: 0.6924

Naive Bayes - Test Accuracy: 0.8355

Decision Tree - Test Accuracy: 0.7054
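
Accuracy alone does not show which class a model gets wrong. A confusion matrix could complement the scores above. The sketch below uses invented `y_true` and `y_pred` lists, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy_score(y_true, y_pred))
```

In the notebook, the predictions for a given model would come from `grid_search.predict(X_test.squeeze())`.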



# Conclusion

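I observed that SVM (0.8662) and Logistic Regression (0.8641) achieved the highest test accuracy,
and Naive Bayes was also strong (0.8355).
I was disappointed with KNN, which had the lowest accuracy of all the models (0.6924), just below the Decision Tree (0.7054).
Note that every model was trained on only a 10% sample of the training data.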