<a href="https://colab.research.google.com/github/LxYuan0420/aws-machine-learning-university-accelerated-nlp/blob/master/colab_notebooks/MLA_NLP_Lecture1_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
%pwd

'/content'

In [3]:
%cd /gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks

/gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks


**Machine Learning Accelerator - Natural Language Processing - Lecture 1**

K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative
In this notebook, we use the K Nearest Neighbors method to build a classifier to predict the isPositive field of our review dataset (that is very similar to the final project dataset).

1. Reading the dataset
2. Exploratory data analysis
3. Text Processing: Stop words removal and stemming
4. Train - Validation Split
5. Data processing with Pipeline
6. Train the classifier
7. Test the classifier Find more details on the KNN Classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
8. Ideas for improvement

Overall dataset schema:

- reviewText: Text of the review
- summary: Summary of the review
- verified: Whether the purchase was verified (True or False)
- time: UNIX timestamp for the review
- log_votes: Logarithm-adjusted votes log(1+votes). This field is a processed version of the votes field. People can click on the "helpful" button when they find a customer review helpful. This increases the vote by 1. log_votes is calculated like this log(1+votes). This formulation helps us get a smaller range for votes.
- isPositive: Whether the review is positive or negative (1 or 0)

**1. Reading dataset**

In [5]:
import pandas as pd

df = pd.read_csv("../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv")

print(f"The shape of the dataset is {df.shape}")

The shape of the dataset is (70000, 6)


In [6]:
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


**2. Exploratory data analysis**

In [8]:
df['isPositive'].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [9]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


**3. Text Processing: Stop words removal and steamming**

In [10]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list = []

    for sent in texts:

        if isinstance(sent, str) == False:
            sent = ""

        filtered_sentence = []

        sent = sent.lower()
        sent = sent.strip()
        sent = re.sub("\s+", "", sent) 
        sent = re.compile("<.*?>").sub("", sent)

        for w in word_tokenize(sent):
            if (not w.isnumeric()) and (len(w)>2) and (w not in stop_words):
                filtered_sentence.append(w)
        final_string = " ".join(filtered_sentence)

        final_text_list.append(final_string)

    return final_text_list



**Train - validation Split**

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["reviewText"]],
                                                  df["isPositive"],
                                                  test_size =0.10,
                                                  shuffle=True,
                                                  random_state=324)

In [24]:
print("Processing the reviewText fields")
train_text_list = process_text(X_train["reviewText"].tolist())
val_text_list = process_text(X_val["reviewText"].tolist())

Processing the reviewText fields


**5. Data processing with Pipeline**

(Go to top)

Today we will use a simple pipeline to use our text field and fit a simple K Nearest Neighbors classifier. This example only uses a single field (reviewText). In the next lecture, we will see how to combine multiple fields.

Our CountVectorizer() will return binary values and use 15 vocabulary words. Feel free to experiment with different numbers here.

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("text_vect", CountVectorizer(binary=True, max_features=15)),
    ("knn", KNeighborsClassifier())
])

#Visualize the pipline
pipeline

Pipeline(memory=None,
         steps=[('text_vect',
                 CountVectorizer(analyzer='word', binary=True,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=15, min_df=1, ngram_range=(1, 1),
                                 preprocessor=None, stop_words=None,
                                 strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('knn',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)

**Train the classifier**

In [26]:
X_train = train_text_list
X_val = val_text_list


pipeline.fit(X_train, y_train.values)

Pipeline(memory=None,
         steps=[('text_vect',
                 CountVectorizer(analyzer='word', binary=True,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=15, min_df=1, ngram_range=(1, 1),
                                 preprocessor=None, stop_words=None,
                                 strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('knn',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)

**Test classififer**

In [29]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_preds = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_preds))
print(classification_report(y_val.values, val_preds))
print(f"Accuracy (validation): {accuracy_score(y_val.values, val_preds)}")

[[ 493 2112]
 [ 416 3979]]
              precision    recall  f1-score   support

         0.0       0.54      0.19      0.28      2605
         1.0       0.65      0.91      0.76      4395

    accuracy                           0.64      7000
   macro avg       0.60      0.55      0.52      7000
weighted avg       0.61      0.64      0.58      7000

Accuracy (validation): 0.6388571428571429
