
# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Module 2 Mini Project: Sentiment Analysis using linear classifiers and unsupervised clustering.

## Learning Objectives

At the end of the mini project, you will be able to -

* use a real world dataset.
* undertake several important steps like cleaning the data and normalizing the data points.
* do sentiment classification.
* compare different types of classification methods and their pros and cons.
* compare supervised and unsupervised (clustering) techniques.

### Goal of the project
The goal of this project is to train linear classification models that can recognize the sentiment of the reviewer. In this project we will be dealing with only positive and negative sentiments (binary classification).

**Disclaimer**: 
There are multiple ways to solve this problem, and no single formula for doing so.
This is just one such approach.


**Packages used:**  
* `Pandas` for data frames and for reading CSV files easily
* `Numpy` for array and matrix mathematics functions
* `Matplotlib` and `Seaborn` for visualization
* `sklearn` for the models, metrics and pre-processing
* `scipy` for helper functions required at various stages of the project
* `warnings` to suppress warnings from the different libraries used in the project

### Importing the packages

In [None]:
# Importing standard libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import pandas as pd
import scipy
import math
import random

# Importing linear classification algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis       
from sklearn.tree import DecisionTreeClassifier       
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import VotingClassifier, RandomForestClassifier

# Importing the clustering algorithms
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Importing preprocessing functions
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

# Importing metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

### Downloading a dataset containing amazon review information along with ratings
To download the data, we will use **`!gdown`**. 


In [None]:
# Downloading the dataset from the Google Drive link.
!gdown https://drive.google.com/uc?id=1kd0RZvI4ur2ehkv4zAriXg6g2W1Mh3xO

Downloading...
From: https://drive.google.com/uc?id=1kd0RZvI4ur2ehkv4zAriXg6g2W1Mh3xO
To: /content/amazon_baby.csv


## What does the dataset look like?
Let's use a standard dataset from Amazon that contains customer reviews and ratings. The original dataset has three features: name (the name of the product), review (the customer's review of the product), and rating (the customer's rating of the product, ranging from 1 to 5). The review column will be the input column, and the rating column will be used to derive the sentiment of the review. Here are some important data preprocessing steps:

* The dataset has about 183,500 rows of data. There are 1147 rows with null values, which will be removed.
* As the dataset is pretty big, some machine learning algorithms take a long time to run on it. We will use 30% of the data in this project, which is still 54,000+ data points! The sample should be representative of the whole dataset.
* If the rating is 1 or 2, the review will be considered negative. If the rating is 3, 4, or 5, the review will be considered positive. We add a new column named `sentiment` to the dataset that uses 1 for the positive reviews and 0 for the negative reviews.

We read and display the contents of the dataset below.
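The preprocessing steps described above can be sketched on a tiny, made-up data frame. The rows below are hypothetical stand-ins for `amazon_baby.csv`, and the 50% sampling fraction is chosen only because the toy frame is so small; the project itself uses 30%.

```python
import pandas as pd

# Toy stand-in for amazon_baby.csv (hypothetical rows, same three columns)
toy = pd.DataFrame({
    "name": ["A", "B", None, "D", "E", "F"],
    "review": ["great", "bad", "ok", None, "love it", "broke quickly"],
    "rating": [5, 1, 3, 4, 4, 2],
})

# Step 1: drop rows with a missing name or review
toy = toy.dropna()

# Step 2: ratings 1-2 -> 0 (negative), ratings 3-5 -> 1 (positive)
toy["sentiment"] = (toy["rating"] >= 3).astype(int)

# Step 3: draw a random subsample (50% here only because the toy frame is tiny)
sample = toy.sample(frac=0.5, random_state=42)

print(toy[["rating", "sentiment"]])
print(len(sample))
```

Using `sample` with a fixed `random_state` draws rows at random but reproducibly, which is one way to keep the subsample representative of the whole dataset.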

For Exercises 2 to 9, use the sklearn library to model, fit, train and report the metrics [Accuracy and F1_score]. Writing your own custom functions is not required.
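As a quick illustration of the two metrics, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical, not taken from the dataset). Both sklearn functions take the true labels first and the predictions second.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 1, 0, 1, 0]   # hypothetical ground-truth sentiments
y_pred = [1, 1, 1, 1, 1, 0]   # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall for class 1

print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

Here five of six predictions are correct, while the F1 score is computed only from how well the positive class (label 1) is recovered.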


1.   **Exercise 1**: Load the data and perform the following : (1 point)
      - Exploratory Data Analysis 
      - Preprocessing 
2.   **Exercise 2**: **Implementation using K-Nearest Neighbor (KNN) Classifier**:  (1 point)

3.   **Exercise 3**: **Implementation using Support Vector Machines (SVM) Classifier**:  (3 points)
      - First Reduce the features using PCA
      - use Hard-Margin Classifier
      - use Soft-Margin Classifier
      - use Kernel SVM Classifier
4.   **Exercise 4**: **Implementation using Decision Trees**:  (1 point)
5.   **Exercise 5**: **Implementation using Ensemble Classifier**:  (1 point) 
      - use LogisticRegression, KNN, and SVM as the weak classifiers, combined through VotingClassifier

6.   **Exercise 6**: **Implementation using Random Forest Classifier**:  (1 point)
7.   **Exercise 7**: **Implementation using Clustering**: (1 point)
      - k Means Clustering
      - Gaussian Mixture Models
8.   **Exercise 8**: **Test your own sentence**: (1 point)
      - Input your sentences (one positive and one negative)
      - Print the output sentiment.

**Sample code using Logistic Regression**

The logistic function, more popularly called the sigmoid function, was developed to describe properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment.

It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

$\sigma(value) = \frac{1}{1 + e^{-value}}$

Where $e$ is the base of the natural logarithm and $value$ is the actual numerical value that you want to transform. For example, numbers between $-5$ and $5$ are transformed into the range $0$ to $1$: values near $-5$ map close to $0$, $0$ maps to $0.5$, and values near $5$ map close to $1$.
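The transformation can be checked numerically with a few lines of NumPy. This is a minimal sketch; the helper name `sigmoid` and the choice of eleven evenly spaced points are illustrative assumptions.

```python
import numpy as np

def sigmoid(value):
    """Logistic function 1 / (1 + e^(-value))."""
    return 1.0 / (1.0 + np.exp(-value))

values = np.linspace(-5, 5, 11)      # -5, -4, ..., 5
transformed = sigmoid(values)

# Print each input next to its transformed value
for v, t in zip(values, transformed):
    print(f"{v:+.0f} -> {t:.4f}")
```

Passing `values` and `transformed` to `plt.plot` would draw the S-shaped curve.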



In [None]:
# NOTE: X_train_vec, X_test_vec, y_train and y_test are assumed to be the vectorized
# train/test features and their labels. They are only created in Exercise 2
# (there named X_train and X_test), so running this cell earlier raises a NameError.

# Logistic regression model is defined
logistic_regression = LogisticRegression()

# Training the logistic regression classifier
logistic_regression.fit(X_train_vec, y_train)

# Calculating accuracy on the logistic regression classifier
# The accuracy is within 0 and 1 in this snippet
lr_score = logistic_regression.score(X_test_vec, y_test)
print("Accuracy of the sentiment classification using the Logistic Regression based classifier: ", lr_score)

# Predicting on the test set
y_pred_lr = logistic_regression.predict(X_test_vec)

# F1 score calculation (sklearn expects the true labels first, then the predictions)
lr_f1_score = f1_score(y_test, y_pred_lr)

print("F1 Score for sentiment classification using the Logistic Regression based classifier: ", lr_f1_score)

NameError: ignored

**Exercise 1**: Load the data and perform the following: (1 point)

- Exploratory Data Analysis (Use Pandas, Seaborn)
- Preprocessing (Use Pandas)

**Hints:** 

- checking for the number of rows and columns
- summary of the dataset
- statistical description of the features 
- check for the duplicate values
- Show the top 5 and the last 5 rows of the data
- check for the null values, and handle them if *any*

In [None]:
# YOUR CODE HERE
review_data = pd.read_csv('amazon_baby.csv')
review_data

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


In [None]:
review_data.shape #checking for the number of rows and columns

(183531, 3)

In [None]:
review_data.info() #summary of the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


In [None]:
review_data.describe() #statistical description of the features

Unnamed: 0,rating
count,183531.0
mean,4.120448
std,1.285017
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


In [None]:
#check for the duplicate values
review_data.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
183526    False
183527    False
183528    False
183529    False
183530    False
Length: 183531, dtype: bool

In [None]:
print(review_data.duplicated().value_counts()) #count of duplicated values if any

False    183469
True         62
dtype: int64


In [None]:
review_data.head() #top 5 rows of the dataset

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [None]:
review_data.tail() #last 5 rows of the data

Unnamed: 0,name,review,rating
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5
183530,Best 2 Pack Baby Car Shade for Kids - Window S...,I love this product very mush . I have bought ...,5


In [None]:
#check for the null values
review_data.isnull().values.any()

True

In [None]:
review_data.isnull().value_counts() #count null values if any

name   review  rating
False  False   False     182384
       True    False        829
True   False   False        318
dtype: int64

In [None]:
len(review_data) - len(review_data.dropna())

1147

In [None]:
# Handle null values, if any, by dropping the affected rows
review_data.dropna(inplace=True)

# Other ways pandas can handle null values:
# dropna()
# fillna(), or ffill() / bfill() to propagate neighbouring values
# replace()
# interpolate()

In [None]:
review_data.shape

(182384, 3)

**Exercise 2**: **Implementation using K-Nearest Neighbor (KNN) Classifier**:  (1 point)


[Refer to the Logistic Regression Example in the above cells]

- Define the KNN classifier with Number of neighbours=5 using sklearn's **KNeighborsClassifier** function
- Train the KNN classifier
- Predict the test set
- Calculate accuracy on the KNN classifier
- Compute the F1 score

In [None]:
# Let's first create the sentiment column with 0/1 values (0: rating 1 or 2; 1: rating 3, 4 or 5)
review_data['sentiment'] = 1 
review_data.loc[review_data['rating'] < 3, 'sentiment'] = 0

In [None]:
int(0.3 * len(review_data))

54715

In [None]:
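# Keep 30% of the rows. Note that this slice takes the first 54,715 rows in file order,
# not a random draw; review_data.sample(frac=0.3, random_state=42) would give a random sample.
review_data = review_data[:54715]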

In [None]:
X = review_data['review'] #input
y = review_data['sentiment'] #target

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42) #split data into train, test

In [None]:
# convert the review text into token-count vectors (bag of words)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train = cv.fit_transform(x_train)
X_test = cv.transform(x_test)
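To see what this vectorization step produces, here is a minimal sketch on a few made-up sentences (the documents and the name `toy_cv` are hypothetical). It also shows why new text must go through `transform` on the already fitted vectorizer: the columns are defined by the vocabulary learned during `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great product", "bad product", "great great quality"]  # hypothetical reviews
new_docs = ["great quality product", "terrible"]

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(train_docs)   # learns the vocabulary, returns a sparse matrix
new_counts = toy_cv.transform(new_docs)           # reuses that vocabulary; unseen words are dropped

vocab = sorted(toy_cv.vocabulary_)
print(vocab)                       # ['bad', 'great', 'product', 'quality']
print(train_counts.toarray())
print(new_counts.toarray())
```

The word "terrible" never appeared during fitting, so the second new document becomes an all-zero row.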

In [None]:
knn = KNeighborsClassifier() # instantiate the model object (n_neighbors defaults to 5)

In [None]:
knn.fit(X_train, y_train) #fit on training data

KNeighborsClassifier()

In [None]:
test_pred = knn.predict(X_test) #predict on test data
test_pred

array([1, 1, 1, ..., 1, 1, 1])

In [None]:
accuracy_score(y_test, test_pred) #accuracy

0.8369032824036845

In [None]:
f1_score(y_test, test_pred) #f1 score

0.9101019462465245

**Exercise 3**: **Implementation using Support Vector Machines (SVM) Classifier**:  (3 points)
  - First Reduce the features using PCA
  - use Hard-Margin Classifier
  - use Soft-Margin Classifier
  - use Kernel SVM Classifier



Background:
The next classifiers we look into are support vector machines (SVMs).

![wget](https://cdn.talentsprint.com/aiml/aiml_2020_b14_hyd/experiment_details_backup/linear_data.png)

While other classifiers such as the perceptron and logistic regression use a similar concept of finding a boundary between two classes using a straight line, SVMs aim to maximize the margin around this boundary. Therefore, an SVM not only tries to find a boundary, it tries to find the best boundary separating the two classes. With very simple tricks, two-class classification can be extended to multiclass classification. The formal formulation of an SVM is as follows.

$g(x) = w^Tx + b$ is the equation of the line we want to find, with weights $w$ and a bias $b$.

As seen from the figure, $g(x) = k$ and $g(x) = -k$ give the two worst lines for classification, as they are right at the boundary of one of the classes. We need to maximize the distance of the line from both classes.

Therefore,

Maximize $k$ (for a fixed $||w||$) such that:

$w^Tx_i + b \geq k \: for \: d_i = 1$

$w^Tx_i + b \leq -k \: for \: d_i = -1$

Since $w$ and $b$ can be rescaled freely, we fix $k = 1$, so that $d_i\,g(x_i) \geq 1$ for every training point. The distance between the lines $g(x) = 1$ and $g(x) = -1$ is then $2/||w||$, so maximizing the margin is the same as minimizing $||w||$.

Introducing a Lagrange multiplier $\alpha_i \geq 0$ for each constraint gives the final objective:

$J(w, b, \alpha) = \frac{1}{2}w^Tw - \sum_{i=1}^{N}\alpha_i d_i(w^Tx_i + b) + \sum_{i=1}^{N}\alpha_i$

which is minimized with respect to $w$ and $b$ and maximized with respect to $\alpha$.
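The relation between $w$ and the margin can be checked on a small example. The sketch below uses six made-up, linearly separable points (`X_toy`, `d_toy` are hypothetical) and a very large `C` as an assumption to approximate the hard-margin problem with sklearn's `svm.SVC`.

```python
import numpy as np
from sklearn import svm

# Hypothetical, linearly separable 2-D points with labels d_i in {-1, +1}
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
d_toy = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no room for margin violations (close to hard margin)
clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(X_toy, d_toy)

w = clf.coef_[0]
b = clf.intercept_[0]
g = X_toy @ w + b                        # g(x) = w^T x + b for every point
margin_width = 2.0 / np.linalg.norm(w)   # distance between g(x) = 1 and g(x) = -1

print("d_i * g(x_i):", np.round(d_toy * g, 3))
print("margin width:", round(margin_width, 3))
```

Every product $d_i\,g(x_i)$ should come out at roughly 1 or larger, and the points where it is close to 1 are the support vectors.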

There are multiple types of SVM. We first use the standard linear SVM and check the performance of the model. However, an SVM cannot be used directly on this dataset.

The data is too large, and the standard SVM function from `sklearn` would take a long time to run. Therefore, we first apply a PCA-based dimensionality reduction technique to the input data. This is followed by different types of SVM, whose performance can be compared. Since dimensionality reduction is applied, a slight drop in performance is expected. However, given the improvement in SVM training time, it is worth applying PCA-based dimensionality reduction first.

Principal component analysis finds a list of the principal axes in the data, i.e. the directions along which the data varies the most, and uses those axes to describe the dataset. Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
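A minimal sketch of this reduction on a made-up mini-corpus is shown below. The six documents are hypothetical, and `TruncatedSVD` is used because it accepts the sparse matrix produced by `CountVectorizer` directly; unlike classical PCA, it does not center the data first.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # hypothetical mini-corpus
    "soft warm blanket", "warm soft quilt", "blanket soft and warm",
    "bottle leaks badly", "leaks from the bottle", "bottle cap leaks",
]
counts = CountVectorizer().fit_transform(docs)     # sparse matrix, one column per word

reducer = TruncatedSVD(n_components=2, random_state=0)
reduced = reducer.fit_transform(counts)            # dense array with 2 columns

print(counts.shape, "->", reduced.shape)
print(reducer.explained_variance_ratio_)           # share of variance kept by each component
```

The `explained_variance_ratio_` attribute gives a rough idea of how much information survives the projection, which helps when choosing `n_components`.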


**Hints**
- Define the PCA model using sklearn's **TruncatedSVD** (a PCA-like decomposition that works directly on sparse matrices)
- Fit the training data using **model.fit**
- Reduce the dimensions of the training data using **model.transform**
- Reduce the dimensions of the testing data using **model.transform**


- Use sklearn's **svm.SVC**. Appropriately choose the arguments - *kernel*, *gamma*, and *C* for hard-margin, soft-margin and kernel SVM classifiers.
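One possible way to set these arguments is sketched below on synthetic data. The data comes from `make_classification`, not from the reviews, and the specific values of `C` and `gamma` are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in for the reduced features (not the review data)
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                     n_informative=2, random_state=0)

svm_variants = {
    # Large C: margin violations are penalized heavily (approximates a hard margin)
    "hard-margin (approx.)": svm.SVC(kernel="linear", C=1e3),
    # Small C: more violations are tolerated in exchange for a wider margin
    "soft-margin": svm.SVC(kernel="linear", C=0.1),
    # Non-linear boundary through the RBF kernel; gamma controls its width
    "kernel (RBF)": svm.SVC(kernel="rbf", C=1.0, gamma="scale"),
}

scores = {}
for name, clf in svm_variants.items():
    clf.fit(X_demo, y_demo)
    scores[name] = clf.score(X_demo, y_demo)   # training accuracy, for illustration only
    print(f"{name}: {scores[name]:.3f}")
```

Note that `gamma` only affects non-linear kernels such as `rbf`; it is ignored when `kernel="linear"`.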



In [None]:
pca_model = TruncatedSVD() # instantiate the PCA-like model (n_components defaults to 2)

In [None]:
pca_model.fit(X_train, y_train) # fit the model (y_train is ignored; the fit is unsupervised)

TruncatedSVD()

In [None]:
red_x_train = pca_model.transform(X_train) #transform training data
red_x_test = pca_model.transform(X_test) #transform test data

In [None]:
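#hard-margin
# Note: SVC() with its defaults (C=1, kernel='rbf') is a soft-margin RBF classifier.
# A hard margin is usually approximated with kernel='linear' and a very large C.
model = svm.SVC()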

In [None]:
model.fit(red_x_train, y_train)

SVC()

In [None]:
test_pred = model.predict(red_x_test)

In [None]:
accuracy_score(y_test, test_pred)

0.8385846918634403

In [None]:
f1_score(y_test, test_pred) #f1 score

0.91220675944334

In [None]:
#soft-margin (a small C tolerates more margin violations; gamma is ignored by the linear kernel)
model = svm.SVC(C=0.1, kernel='linear', gamma='auto')

In [None]:
model.fit(red_x_train, y_train)

SVC(C=0.1, gamma='auto', kernel='linear')

In [None]:
test_pred = model.predict(red_x_test)

In [None]:
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred)) #f1 score

0.8385846918634403
0.91220675944334


In [None]:
#kernel SVM

In [None]:
model = svm.SVC(C=0.8, kernel='rbf', gamma='auto')

In [None]:
# YOUR CODE(s) HERE
model.fit(red_x_train, y_train)

SVC(C=0.8, gamma='auto')

In [None]:
test_pred = model.predict(red_x_test)

In [None]:
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred)) #f1 score

0.8385115871043205
0.9121635055071773


**Exercise 4**: **Implementation using Decision Trees**:  (1 point)

Decision Trees are supervised Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They can handle complex datasets. As the name suggests, a decision tree uses a tree-like model to make decisions in order to classify or predict, depending on the problem. It progressively divides the dataset into smaller groups based on a descriptive feature until it reaches sets that are small enough to be described by a single label.

The most important part of a decision tree is its explainability!
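Explainability can be made concrete by printing the rules a fitted tree has learned. The sketch below uses a tiny made-up feature table (counts of the words "great" and "broken", both hypothetical) and sklearn's `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [count of "great", count of "broken"] per review
X_small = [[2, 0], [1, 0], [0, 1], [0, 2], [1, 0], [0, 1]]
y_small = [1, 1, 0, 0, 1, 0]

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(X_small, y_small)

# Human-readable if/else rules describing every split and leaf
rules = export_text(tree_clf, feature_names=["great", "broken"])
print(rules)
```

Each printed line is a threshold test on one feature, so a prediction can be traced from the root down to a leaf by hand.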

Decision tree algorithms have many applications in the real world. For example:

1. In the Healthcare sector: To develop Clinical Decision Analysis tools which allow decision-makers to apply evidence-based medicine and make objective clinical decisions when faced with complex situations.
2. Virtual Assistants (Chatbots): To develop chatbots that provide information and assistance to customers in any required domain.
3. Retail and Marketing: Sentiment analysis detects the pulse of customer feedback and emotions and allows organizations to learn about customer choices and drives decisions.

**Hint**
Use sklearn's **DecisionTreeClassifier** function

In [None]:
DT = DecisionTreeClassifier()

DT.fit(red_x_train, y_train)
test_pred = DT.predict(red_x_test)
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred)) #f1 score

0.72987791505227
0.8383144444930644


**Exercise 5**: **Implementation using Ensemble Classifier**:  (1 point) 
- use LogisticRegression, KNN, SVM, and Naive Bayes as the weak classifiers, combined through VotingClassifier
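What hard voting does can be reproduced by hand. The sketch below is an illustration on synthetic data from `make_classification` (not the review data), with three base models; the SVM is left out here only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

voter = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("gnb", GaussianNB()),
                ("lr", LogisticRegression())],
    voting="hard",
)
voter.fit(X_demo, y_demo)

# Reproduce the hard vote by hand: each fitted base model predicts, then take the majority
individual = np.array([est.predict(X_demo) for est in voter.estimators_])  # shape (3, 200)
manual_vote = (individual.sum(axis=0) >= 2).astype(int)  # at least 2 of 3 say "1"

print(np.array_equal(manual_vote, voter.predict(X_demo)))
```

With `voting="soft"`, the ensemble would instead average the predicted class probabilities of the base models, which requires every base model to support `predict_proba`.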

In [None]:
#clf1 = svm.SVC()
clf2 = KNeighborsClassifier()
clf3 = GaussianNB()
clf4 = LogisticRegression()

In [None]:
eclf1 = VotingClassifier(estimators=[('knn', clf2), ('gnb', clf3), ('lr', clf4)], voting='hard')

In [None]:
# red_x_train is already a dense NumPy array, so it can be passed to fit() directly
eclf1.fit(red_x_train, y_train)

VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                             ('gnb', GaussianNB()),
                             ('lr', LogisticRegression())])

In [None]:
test_pred = eclf1.predict(red_x_test)

In [None]:
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred)) #f1 score

0.8371957014401638
0.911363184079602


**Exercise 6**: **Implementation using Random Forest Classifier**:  (1 point)

A random forest is a collection of decision trees whose results are aggregated into one final result. Random forest is a supervised classification algorithm. In general, adding more trees makes the result more stable and usually more accurate, although the improvement levels off while the training time keeps growing. Building the forest is not the same as constructing a single decision tree with the information gain or Gini index approach: each split considers only a random subset of the features.
Steps:
1. Randomly select “k” features from the total “m” features, where k << m
2. Among the “k” features, calculate the node “d” using the best split point
3. Split the node into child nodes using the best split
4. Repeat steps 1 to 3 until “l” nodes have been reached
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees
6. Take the test features and use the rules of each randomly created decision tree to predict the outcome, storing each predicted outcome (target)
7. Calculate the votes for each predicted target
8. Consider the highest-voted predicted target as the final prediction from the random forest algorithm
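The steps above can be sketched with plain decision trees. This is a simplified illustration on synthetic data, not the implementation used by `RandomForestClassifier`: the bootstrap sampling of rows is an added assumption (the steps do not spell it out, but it is standard in random forests), and the values of `n_trees` and `max_leaf_nodes` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     random_state=0)

rng = np.random.default_rng(0)
n_trees = 25          # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample of the rows (sampling with replacement)
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt": each split considers a random subset of k features (steps 1-3)
    # max_leaf_nodes caps the tree size (step 4)
    t = DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16,
                               random_state=int(rng.integers(0, 1_000_000)))
    trees.append(t.fit(X_demo[rows], y_demo[rows]))

# Steps 6-8: every tree votes, and the majority class is the forest's prediction
votes = np.array([t.predict(X_demo) for t in trees])      # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_demo).mean()
print(f"training accuracy of the hand-built forest: {forest_acc:.3f}")
```

scikit-learn's `RandomForestClassifier` differs in some details; for example, it averages the predicted class probabilities of the trees rather than counting hard votes.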

**Hint**:
- Use sklearn's **RandomForestClassifier**
- Experiment with n_estimators, max_depth, max_leaf_nodes

In [None]:
forest = RandomForestClassifier(n_estimators=1000, max_leaf_nodes=50)

In [None]:

forest.fit(X_train, y_train)
test_pred = forest.predict(X_test)
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred)) #f1 score

0.8385846918634403
0.91220675944334


**Exercise 7**: **Implementation using Clustering**: (1 point)
- k Means Clustering, with and without PCA=2
- Gaussian Mixture Models

**Hints**:
- Use sklearn's **MiniBatchKMeans**
- Use sklearn's **GaussianMixture**

In [None]:
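# A helper function to help label the test predictions
def label(n_clusters, real_labels, labels):
    real_labels = np.asarray(real_labels)
    permutation = []
    for i in range(n_clusters):
        idx = labels == i
        # Choose the most common true label among the data points in the cluster.
        # np.unique is used because the return shape of scipy.stats.mode
        # differs across SciPy versions.
        values, counts = np.unique(real_labels[idx], return_counts=True)
        new_label = values[np.argmax(counts)]
        permutation.append(new_label)
    return permutation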

# Use the above custom function
# YOUR CODE(s) HERE

In [None]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

In [None]:
kmeans = MiniBatchKMeans() # note: n_clusters defaults to 8; pass n_clusters=2 for one cluster per sentiment

In [None]:
kmeans.fit(X_train)

MiniBatchKMeans()

In [None]:
test_pred = kmeans.predict(X_test)
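Cluster ids are arbitrary numbers, so they have to be mapped to sentiment labels before accuracy can be computed. The sketch below shows one way to do this for both k-means and a Gaussian mixture model. It runs on synthetic blobs rather than the review data, and `majority_labels` is a hypothetical, self-contained variant of the `label` helper above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture

def majority_labels(n_clusters, real_labels, cluster_ids):
    """For each cluster, return the most common true label among its members."""
    mapping = []
    for i in range(n_clusters):
        values, counts = np.unique(real_labels[cluster_ids == i], return_counts=True)
        mapping.append(values[np.argmax(counts)])
    return np.array(mapping)

# Synthetic two-class data standing in for the reduced review features
X_blobs, y_blobs = make_blobs(n_samples=400, centers=[[-3, -3], [3, 3]],
                              cluster_std=1.0, random_state=0)

kmeans_demo = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3).fit(X_blobs)
km_ids = kmeans_demo.predict(X_blobs)
km_pred = majority_labels(2, y_blobs, km_ids)[km_ids]   # cluster id -> class label

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_blobs)
gmm_ids = gmm_demo.predict(X_blobs)
gmm_pred = majority_labels(2, y_blobs, gmm_ids)[gmm_ids]

print(accuracy_score(y_blobs, km_pred), accuracy_score(y_blobs, gmm_pred))
```

On the review data, the same pattern would apply with the training labels used to build the mapping and the test set used for scoring; heavily imbalanced classes can lead both clusters to map to the majority label.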

**Exercise 8**: **Test your own sentence**: (1 point)
- Input your sentences (one positive and one negative)
- Print the output sentiment.

In [None]:
negative = 'i did not liek the product, it was damaged'
positive = 'i loved using this kit. it has great quality'

In [None]:
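# Reuse the vectorizer (cv) and reducer (pca_model) fitted on the training data;
# fitting a new CountVectorizer here would produce features the models never saw
test = cv.transform([negative, positive])

In [None]:
test = pca_model.transform(test) # project onto the same reduced components

In [None]:
eclf1.predict(test) # one prediction per sentence, in the order [negative, positive]

In [None]:
model.predict(test)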

In [None]:
X_test

In [None]:
test