## Final Project Day 1 Solution: K Nearest Neighbors Model for a Classification Task 

For the final project, build a K Nearest Neighbors model to predict the __isPositive__ field of the dataset. 
* We are giving you two pieces of code to read your training and test datasets. 
* Use the notebooks from class to implement the model, then train and test it with the corresponding datasets.
* You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review (1 = positive, 0 = negative); this is the target to predict
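
To make the two numeric fields concrete, here is a minimal sketch that decodes them for two (time, log_votes) pairs taken from the training data. It assumes that __time__ is expressed in seconds and that __log_votes__ uses the natural logarithm; the `sample` frame is a hypothetical stand-in, not part of the project code.

```python
import numpy as np
import pandas as pd

# Toy frame with two (time, log_votes) pairs copied from the training data
sample = pd.DataFrame({
    "time": [1361836800, 1441929600],
    "log_votes": [0.0, 1.098612],
})

# UNIX timestamps (assumed to be in seconds) -> calendar dates
sample["date"] = pd.to_datetime(sample["time"], unit="s")

# Invert log(1 + votes) (natural log assumed) to recover the raw vote count
sample["votes"] = np.expm1(sample["log_votes"]).round().astype(int)

print(sample)
```

Under these assumptions a __log_votes__ value of 1.098612 corresponds to 2 votes, since log(1 + 2) ≈ 1.0986.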


### 1. Reading the dataset

We will use the __pandas__ library to read our datasets.

In [1]:
import pandas as pd

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')

Let's look at the first five rows of the training dataset.

In [2]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [3]:
import pandas as pd

df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

In [4]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


### 2. Exploratory Data Analysis and Missing Value Imputation

Let's look at the target distribution for our datasets.

In [5]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [6]:
df_test["isPositive"].value_counts()

1.0    4980
0.0    3020
Name: isPositive, dtype: int64
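
Raw counts are easier to compare across datasets once they are expressed as proportions. The sketch below rebuilds a label column from the training counts printed above (rather than loading the CSV) and uses `value_counts(normalize=True)`; the `labels` series is a hypothetical stand-in for `df_train["isPositive"]`.

```python
import pandas as pd

# Rebuild a label column with the class counts reported for the training set
labels = pd.Series([1.0] * 43692 + [0.0] * 26308, name="isPositive")

# normalize=True returns class proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly 62% of the training reviews are positive, so the classes are moderately imbalanced.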

Checking the number of missing values:    

In [7]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [8]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


Let's replace the missing values in __reviewText__ with a placeholder (__summary__ is not used by the model, so we leave it as is):

In [9]:
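# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
df_train["reviewText"] = df_train["reviewText"].fillna("Missing")
df_test["reviewText"] = df_test["reviewText"].fillna("Missing")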
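
If you also want to use the __summary__ field as a feature, the same placeholder can be applied to several text columns at once. This is only a sketch on a made-up frame (`toy` and `text_cols` are hypothetical names), not a step the solution requires.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration
toy = pd.DataFrame({
    "reviewText": ["works fine", np.nan],
    "summary": [np.nan, "Two Stars"],
})

# Fill both text columns in one call and assign the result back
text_cols = ["reviewText", "summary"]
toy[text_cols] = toy[text_cols].fillna("Missing")

print(toy.isna().sum())
```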

### 3. Stop Word Removal and Stemming

We will apply the text processing methods discussed in class: lowercasing, whitespace and HTML clean-up, stop word removal, and stemming.

In [10]:
# Download the NLTK tokenizer model and the stop word list
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
             'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
            "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of spaces, tabs and newlines (raw string avoids an invalid-escape warning)
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list
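
To see what the normalization steps do before tokenization, here is a minimal sketch that isolates them so it runs without NLTK. The helper name `clean_text` and the sample sentence are hypothetical; stop word removal and stemming are deliberately left out.

```python
import re

def clean_text(sent):
    """Apply only the normalization steps of process_text (no NLTK needed)."""
    sent = sent.lower()                 # lowercase
    sent = sent.strip()                 # trim leading/trailing whitespace
    sent = re.sub(r"\s+", " ", sent)    # collapse runs of whitespace
    sent = re.sub(r"<.*?>", "", sent)   # drop HTML tags (non-greedy match)
    return sent

print(clean_text("  Great <b>product</b>,\n\tdoes NOT crash!  "))
# prints: great product, does not crash!
```

Note that "not" survives in the full pipeline as well, because negations were removed from the stop word list.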

In [12]:
print("Pre-processing training reviewText")
df_train["reviewText"] = process_text(df_train["reviewText"].tolist())

print("Pre-processing test reviewText")
df_test["reviewText"] = process_text(df_test["reviewText"].tolist())


Pre-processing training reviewText
Pre-processing test reviewText


### 4. Splitting the training dataset into training and validation

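The scikit-learn library provides the train_test_split() function for splitting datasets. In the example below, we keep 90% of the data for training and hold out 10% for validation. Because no random seed is set, the exact split (and therefore the validation numbers) will vary between runs.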

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df_train["reviewText"].tolist(), 
                                                  df_train["isPositive"].tolist(), 
                                                  test_size=0.10, 
                                                  shuffle=True)
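
Since the classes are imbalanced, one optional variation is a stratified, seeded split. The sketch below uses made-up data; `stratify` and `random_state` are real train_test_split() parameters, but using them here is a suggestion rather than part of the original solution, and the seed value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 texts with a 60/40 class imbalance (made up for illustration)
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels,
    test_size=0.10,
    shuffle=True,
    stratify=labels,   # keep the class ratio the same in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(len(X_tr), len(X_va), sum(y_va))
```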

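### 5. Computing Bag-of-Words Vectors

We convert each review into a binary bag-of-words vector with __CountVectorizer__: a feature is 1 if the word appears in the review and 0 otherwise. The vectorizer is fitted on the training split only and then applied to the validation and test data. The variable is named tfidf_vectorizer, but no TF-IDF weighting is applied.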

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

X_test = df_test["reviewText"].tolist()
y_test = df_test["isPositive"].tolist()

# Initialize the binary count vectorizer
tfidf_vectorizer = CountVectorizer(binary=True,
                                   max_features=50    # Limit the vocabulary size
                                  )

X_train_text_vectors = tfidf_vectorizer.fit_transform(X_train) # Fit and transform
X_val_text_vectors = tfidf_vectorizer.transform(X_val)       # Only transform
X_test_text_vectors = tfidf_vectorizer.transform(X_test)       # Only transform
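
The difference between binary counts and TF-IDF weights is easiest to see on a tiny corpus. This is a sketch with made-up documents; `binary_vec` and `tfidf_vec` are hypothetical names, and both vectorizers use scikit-learn defaults apart from `binary=True`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration
docs = ["good good product", "bad product", "good value"]

binary_vec = CountVectorizer(binary=True)
tfidf_vec = TfidfVectorizer()

X_bin = binary_vec.fit_transform(docs).toarray()
X_tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column order follows the alphabetically sorted vocabulary
print(sorted(binary_vec.vocabulary_))
print(X_bin)             # entries are 0 or 1, even though "good" occurs twice in the first doc
print(X_tfidf.round(2))  # real-valued weights, each row L2-normalized by default
```

Swapping in TfidfVectorizer for the project would only require changing the vectorizer class; whether it helps KNN on this dataset would need to be checked on the validation set.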


### 6. Fitting the Classifier

We fit scikit-learn's KNeighborsClassifier with k = 5 neighbors. Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [15]:
from sklearn.neighbors import KNeighborsClassifier

knnClassifier = KNeighborsClassifier(n_neighbors=5)
knnClassifier.fit(X_train_text_vectors, 
                  y_train
                 )

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
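
To illustrate what the fitted classifier does at prediction time, the sketch below looks up the nearest neighbors of one query point and takes a majority vote by hand. The data is made up, k = 3 is chosen only to keep the example small, and the manual vote mirrors the default `weights='uniform'` behavior.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy binary feature vectors and labels, made up for illustration
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[1, 1, 1]])
distances, indices = knn.kneighbors(query)   # the 3 closest training points
neighbor_labels = y[indices[0]]

# With uniform weights the prediction is the majority label among the neighbors
majority = np.bincount(neighbor_labels).argmax()
print(neighbor_labels, majority, knn.predict(query)[0])
```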

### 7. Checking the Validation Performance

In [16]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = knnClassifier.predict(X_val_text_vectors)

print("Confusion Matrix: \n", confusion_matrix(y_val, val_predictions))
print("Classification_report: \n", classification_report(y_val, val_predictions))
print("Accuracy Score: \n", accuracy_score(y_val, val_predictions))

Confusion Matrix: 
 [[1470 1140]
 [ 782 3608]]
Classification_report: 
               precision    recall  f1-score   support

         0.0       0.65      0.56      0.60      2610
         1.0       0.76      0.82      0.79      4390

   micro avg       0.73      0.73      0.73      7000
   macro avg       0.71      0.69      0.70      7000
weighted avg       0.72      0.73      0.72      7000

Accuracy Score: 
 0.7254285714285714
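
The numbers in the classification report can be derived directly from the confusion matrix. As a worked check, this sketch recomputes them from the validation matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Validation confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[1470, 1140],
               [ 782, 3608]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column sums)
recall = np.diag(cm) / cm.sum(axis=1)      # per true class (row sums)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision.round(2), recall.round(2), f1.round(2))
```

The low recall for class 0.0 (about 0.56) means that 1140 of the 2610 negative reviews in the validation set were predicted as positive.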


### 8. Checking the Performance on the Test dataset

In [18]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

test_predictions = knnClassifier.predict(X_test_text_vectors)

print("Confusion Matrix: \n", confusion_matrix(y_test, test_predictions))
print("Classification_report: \n", classification_report(y_test, test_predictions))
print("Accuracy Score: \n", accuracy_score(y_test, test_predictions))

Confusion Matrix: 
 [[1653 1367]
 [ 837 4143]]
Classification_report: 
               precision    recall  f1-score   support

         0.0       0.66      0.55      0.60      3020
         1.0       0.75      0.83      0.79      4980

   micro avg       0.72      0.72      0.72      8000
   macro avg       0.71      0.69      0.69      8000
weighted avg       0.72      0.72      0.72      8000

Accuracy Score: 
 0.7245
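
A useful reference point for the test accuracy is a trivial classifier that always predicts the majority class. The sketch below computes that baseline from the test-set counts shown earlier and compares it with the model; the variable names are hypothetical.

```python
# Counts taken from the test-set outputs above
n_positive, n_negative = 4980, 3020
correct = 1653 + 4143          # diagonal of the test confusion matrix
total = n_positive + n_negative

baseline_accuracy = max(n_positive, n_negative) / total   # always predict the majority class
model_accuracy = correct / total

print(f"majority baseline: {baseline_accuracy:.4f}")
print(f"KNN model:         {model_accuracy:.4f}")
```

The KNN model scores about 10 percentage points above the majority baseline (0.7245 vs. 0.6225).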
