## Part 1: Online Product Review Data
## Part 2: Chatbot

## 1: Introduction

This dataset contains consumer reviews of selected online shopping products.

**Description of the data:**

- **`product_review.csv`** contains the dataset. 
- Each observation (row) in this dataset is a review of a particular product by a particular user.
- The **date** column is the date when the review was provided.
- The **product** column is the name of the product reviewed.
- The **category** column is the primary category of the product reviewed.
- The **text** column is the review text.
- The **user** column is the name of the user who wrote the review.
- The **rating** column is the number of stars (1 through 5) that the reviewer assigned to the product; more stars is better.

**Goal**:
 - Explore the data.
 - Generate training, validation, and test datasets before model building and prediction.

## Import libraries

In [92]:
import pandas as pd
from scipy.sparse import vstack, hstack
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.dummy import DummyClassifier
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# import other libraries/functions if they are needed in your coding

## Task 1.1

Read **product_review.csv** into a Pandas DataFrame and find the **user** who has written the most reviews. Explicitly answer the question after showing the code and printing the dataset.

* **Note**: generic user names such as `Anonymous`, `ByAmazon Customer`, `ByKindle Customer` etc. are not accepted as answers in this task. The answer must be a name that identifies a specific user. However, you should NOT exclude these users in the following tasks.

In [93]:
# Read product_review.csv into a pandas DataFrame

product_review = pd.read_csv('product_review.csv')
product_review

Unnamed: 0,date,product,category,text,user,rating
0,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,I order 3 of them and one of the item is bad q...,Byger yang,3
1,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bulk is always the less expensive way to go fo...,ByMG,4
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5
...,...,...,...,...,...,...
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5
28328,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I bought this for my niece for a Christmas gif...,fireman21,4
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5
28330,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,This Tablet does absolutely everything I want!...,SandyJ,5


In [94]:
# Count reviews per user, most active user first
product_review['user'].value_counts(ascending=False)

ByAmazon Customer    889
Mike                  63
ByKindle Customer     45
Dave                  44
Chris                 38
                    ... 
Bysarah k              1
ByRobert Roussel       1
ByK Davis              1
ByDevin                1
erockmon               1
Name: user, Length: 16269, dtype: int64

Answer: Excluding generic user names such as Anonymous, ByAmazon Customer, and ByKindle Customer, the user with the most reviews is Mike, with 63 reviews.
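
The exclusion of generic names can also be done in code rather than by eye. The following is a minimal sketch on a small made-up DataFrame; `GENERIC_USERS` is a hypothetical exclusion list containing only the three names mentioned in the task note, and a real run would likely need more entries.

```python
import pandas as pd

# Toy stand-in for product_review; these user names and counts are made up.
reviews = pd.DataFrame({
    "user": ["ByAmazon Customer"] * 4 + ["Mike"] * 3 + ["Dave"] * 2 + ["Anonymous"],
})

# Hypothetical exclusion list: only these three names come from the task note.
GENERIC_USERS = {"Anonymous", "ByAmazon Customer", "ByKindle Customer"}

# Count reviews per user, ignoring names that cannot identify a single person.
named_counts = reviews.loc[~reviews["user"].isin(GENERIC_USERS), "user"].value_counts()

top_user = named_counts.idxmax()
top_count = int(named_counts.max())
print(top_user, top_count)
```

Filtering only for this question leaves the original DataFrame untouched, so the generic users remain available for the later tasks.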


## Task 1.2
Create another column named `review_length`, which is the number of words in the review text. 


In [95]:
# Creating a column named 'review_length' that shows the number of words in the review text.
product_review["review_length"] = product_review["text"].apply(lambda n: len(n.split()))
product_review

Unnamed: 0,date,product,category,text,user,rating,review_length
0,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,I order 3 of them and one of the item is bad q...,Byger yang,3,31
1,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bulk is always the less expensive way to go fo...,ByMG,4,13
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10
...,...,...,...,...,...,...,...
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5,29
28328,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I bought this for my niece for a Christmas gif...,fireman21,4,18
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5,57
28330,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,This Tablet does absolutely everything I want!...,SandyJ,5,43
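
The word count can also be computed with pandas string methods, which avoids a Python-level `lambda` and makes it easy to handle missing text. This is a sketch on made-up data; treating a missing review as zero words is an assumption, not something the task specifies.

```python
import pandas as pd

# Toy reviews, including one missing value.
toy = pd.DataFrame({"text": ["Great battery life", "ok", None]})

# .str.split() splits on any whitespace; missing text is treated as an empty review.
toy["review_length"] = toy["text"].fillna("").str.split().str.len()
print(toy["review_length"].tolist())
```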


## Task 1.3

What is the product (or products) with the maximum number of words in a single review?

In [96]:
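# Sort first, then filter the sorted frame so the boolean mask aligns with its index
by_length = product_review.sort_values('review_length', ascending=False)
by_length.loc[by_length['review_length'] == by_length['review_length'].max(), ['product', 'review_length']]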


Unnamed: 0,product,review_length
24278,"Fire HD 8 Tablet with Alexa, 8 HD Display, 32 ...",1539
15434,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539
15435,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539
18411,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539


## Task 1.4
Create a new DataFrame that contains only products with more than `1000` reviews.

In [97]:
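# Count reviews per product, then keep only rows whose product has more than 1000 reviews
review_counts = product_review.groupby('product')['user'].count()
popular_products = review_counts[review_counts > 1000].index
df1 = product_review[product_review['product'].isin(popular_products)]
df1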

Unnamed: 0,date,product,category,text,user,rating,review_length
0,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,I order 3 of them and one of the item is bad q...,Byger yang,3,31
1,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bulk is always the less expensive way to go fo...,ByMG,4,13
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10
...,...,...,...,...,...,...,...
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5,29
28328,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I bought this for my niece for a Christmas gif...,fireman21,4,18
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5,57
28330,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,This Tablet does absolutely everything I want!...,SandyJ,5,43
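
The same filter can be expressed without an intermediate list of product names by broadcasting each product's review count back onto its rows. This is a sketch on made-up data; `MIN_REVIEWS` is a hypothetical name, set to a small value here so the toy example has something to keep.

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 5, 1, 2, 5],
})

MIN_REVIEWS = 2  # toy threshold; the task uses 1000

# transform('size') returns one value per row: the number of rows in that row's group.
reviews_per_product = toy.groupby("product")["rating"].transform("size")
popular = toy[reviews_per_product > MIN_REVIEWS]
print(popular["product"].unique())
```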


## Task 1.5

Create a new DataFrame that only contains the ratings: 1, 2 and 5. Then create a new column `target`, whose value is 1 if rating is 5 and 0 otherwise. 

In [98]:
# Create a DataFrame that contains only the ratings 1, 2 and 5
ratings_125 = df1[df1['rating'].isin([1, 2, 5])].copy()
ratings_125

Unnamed: 0,date,product,category,text,user,rating,review_length
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10
5,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bought a lot of batteries for Christmas and th...,ByPainter Marlow,5,48
6,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,ive not had any problame with these batteries ...,ByAmazon Customer,5,17
...,...,...,...,...,...,...,...
28325,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"its fast, it has good lighting. its got the 16...",erockmon,5,18
28326,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Where do I begin...good clarity, I love the si...",cmorris,5,40
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5,29
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5,57


In [99]:
# Create the 'target' column: 1 if the rating is 5, otherwise 0.
# The vectorized comparison stores integers, not Python objects, so the column works as a label.
ratings_125["target"] = (ratings_125["rating"] == 5).astype(int)

ratings_125

Unnamed: 0,date,product,category,text,user,rating,review_length,target
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12,1
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14,1
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10,1
5,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bought a lot of batteries for Christmas and th...,ByPainter Marlow,5,48,1
6,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,ive not had any problame with these batteries ...,ByAmazon Customer,5,17,1
...,...,...,...,...,...,...,...,...
28325,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"its fast, it has good lighting. its got the 16...",erockmon,5,18,1
28326,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Where do I begin...good clarity, I love the si...",cmorris,5,40,1
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5,29,1
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5,57,1


## Task 1.6

Define X (features) and y (target) from the new DataFrame, and then split X and y into training and testing sets, using the `text` and `product` as the only features and the `target` as the target variable.

In [100]:
# define X and y
X = ratings_125[['text','product']]
y = ratings_125.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2022, shuffle=True, stratify=y)

In [101]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(13724, 2)
(3432, 2)
(13724,)
(3432,)
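
The stated goal also mentions a validation set. One common way to obtain it is to split twice: hold out the test set first, then carve the validation set out of the remainder. The sketch below uses made-up data, and the 60/20/20 proportions are an assumption rather than something the task prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with an imbalanced binary target (80 ones, 20 zeros).
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 80 + [0] * 20)

# First hold out 20% as the test set, preserving the class ratio.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=y)

# Then split the remaining 80% into 75/25, i.e. 60% train and 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=2022, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying both splits keeps the share of 5-star reviews roughly equal across the three sets, which matters when one class dominates.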


## Task 1.7

Use CountVectorizer to create **document-term matrices** from the `text` column of **X_train** and **X_test**.

In [102]:
vectorizer = CountVectorizer()
X_train_text = vectorizer.fit_transform(X_train['text'])
X_test_text = vectorizer.transform(X_test['text'])

print(X_train_text.shape)
print(X_test_text.shape)

(13724, 6762)
(3432, 6762)
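
Both matrices have the same number of columns because the vocabulary is learned from the training text only and then reused for the test text. A small sketch with made-up documents illustrates this, including what happens to a word that appears only in the test set.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great battery", "battery died fast"]
test_docs = ["great screen"]  # 'screen' never appears in the training documents

vec = CountVectorizer()
dtm_train = vec.fit_transform(train_docs)  # learns the vocabulary
dtm_test = vec.transform(test_docs)        # reuses it; unseen words are dropped

vocab = sorted(vec.vocabulary_)
print(vocab)
print(dtm_test.toarray())
```

Calling `fit_transform` on the test text instead would build a different vocabulary and leak information about the test set into the features.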


## Task 1.8
Use one-hot encoding to process the feature **product**. 



In [103]:
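# Output is a sparse matrix by default; the `sparse` argument was renamed `sparse_output` in newer scikit-learn releases
enc = OneHotEncoder(handle_unknown='ignore')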
enc_train_product = enc.fit_transform(X_train[['product']])
enc_test_product = enc.transform(X_test[['product']])

print(enc_train_product.shape)
print(enc_test_product.shape)

(13724, 8)
(3432, 8)
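
The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. In this sketch (made-up product names), a product that appears only in the test data is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"product": ["Tablet", "Batteries", "Tablet"]})
test = pd.DataFrame({"product": ["Batteries", "Speaker"]})  # 'Speaker' is unseen

enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train[["product"]])
test_enc = enc.transform(test[["product"]])

print(enc.categories_[0])
print(test_enc.toarray())
```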


## Task 1.9
Concatenate the feature matrices from **CountVectorizer** and **one hot encoding** for both train and test datasets.

In [104]:
hstack_train = hstack([X_train_text,enc_train_product])
hstack_test = hstack([X_test_text,enc_test_product])

print(hstack_train.shape)
print(hstack_test.shape)

(13724, 6770)
(3432, 6770)
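
Horizontal stacking keeps the number of rows and adds the column counts together (6762 + 8 = 6770 above). The sketch below shows the same operation on two small made-up matrices; converting the result to CSR format is an optional extra step that makes later row slicing possible.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the text features (3 columns) and product features (2 columns).
text_features = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
product_features = csr_matrix(np.array([[1, 0], [0, 1]]))

# hstack appends columns side by side; rows must match in number and order.
combined = hstack([text_features, product_features]).tocsr()
print(combined.shape)
print(combined.toarray())
```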


## 2: Introduction

You are required to create a chatbot using the concepts of vectorization and cosine similarity. The chatbot will use a repository of questions and answers gathered from an online shopping website for electronic items. Because it is trained on Q&A data for electronic items, your chatbot could be deployed as automated Q&A support under the Electronic Items section. The corpus **Electronics_QA.json** is in a JavaScript Object Notation (JSON)-like format. It contains multiple features for each Q&A pair, but you will only use the features **question** and **answer**.

## Import libraries

In [105]:
import numpy as np
import shutil
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

#library for loading json file
import ast

# import other libraries/functions if they are needed in your coding

## Task 2.1

- Import the corpus (**Electronics_QA.json**) into Python by reading it as a text file, then use the ast library's `literal_eval` function or the json library's `loads` function to convert each row from a string to a Python dictionary.
- Store the questions and answers in **separate** lists. While importing, perform the necessary preprocessing step of converting all characters to lowercase.

In [106]:
# Unzip the archive, then parse one dictionary per line
shutil.unpack_archive('Electronics_QA.json.zip', '.')
list_questions = []
list_answers = []
with open('Electronics_QA.json') as f:
    for line in f:
        record = ast.literal_eval(line)
        # Lowercase while importing, as the task requires
        list_questions.append(record['question'].lower())
        list_answers.append(record['answer'].lower())

In [107]:
len(list_questions)

314263

In [108]:
len(list_answers)

314263
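
The two parsers suggested in the task are not interchangeable. The sketch below uses made-up records to show the difference; that the corpus uses Python-style single quotes is an assumption based on the task calling the format "JSON-like", and it would explain why `ast.literal_eval` is used above.

```python
import ast
import json

# Made-up records in the two styles such a file might use.
python_style = "{'question': 'Does it fit?', 'answer': 'Yes'}"  # single quotes
json_style = '{"question": "Does it fit?", "answer": "Yes"}'    # double quotes

record_a = ast.literal_eval(python_style)
record_b = json.loads(json_style)

# Strict JSON requires double quotes, so json.loads rejects the first form.
try:
    json.loads(python_style)
    json_accepts_single_quotes = True
except json.JSONDecodeError:
    json_accepts_single_quotes = False

print(record_a == record_b, json_accepts_single_quotes)
```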

## Task 2.2

Use the `CountVectorizer` class of the sklearn library to convert the questions list into a sparse matrix and apply the TF-IDF transformation. This will generate a repository matrix.
* **Hint**: You should exclude stopwords when using CountVectorizer.

In [109]:
vectorizer = CountVectorizer(stop_words = 'english')
transformer = TfidfTransformer()
X_list_questions = vectorizer.fit_transform(list_questions)
transformer_fit = transformer.fit_transform(X_list_questions)
print(transformer_fit.shape)

(314263, 69189)
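
With default settings, the two-step pipeline above is equivalent to scikit-learn's `TfidfVectorizer`, which combines counting and TF-IDF weighting in one object. The sketch below checks this on three made-up questions; it is shown only as an alternative, since the task explicitly asks for `CountVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

questions = ["does this charger work with iphone",
             "is the battery included",
             "will this work with my camera"]

# Two steps, as in the cell above.
counts = CountVectorizer(stop_words="english").fit_transform(questions)
two_step = TfidfTransformer().fit_transform(counts)

# One step.
one_step = TfidfVectorizer(stop_words="english").fit_transform(questions)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(two_step.toarray(), axis=1)
print(same, row_norms)
```

Each row is scaled to unit length by default, which is why cosine similarity between rows later reduces to a dot product.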


## Task 2.3

Your repository matrix generated in Task 2.2 will be searched every time a new question is entered in the chatbot in order to find the most similar question. To implement this, please create your own function `conversation` here including the following steps:

- Calculate the angle between every row of the repository matrix and the new question vector. Use the sklearn library's `cosine_similarity` function to calculate the cosine between each row and the vector, then convert each cosine into an angle in degrees with numpy's `arccos` and `rad2deg` functions. (1 mark)
- Find the row that has the maximum cosine (or, equivalently, the minimum angle) with the new question vector and return the answer corresponding to that question as the response. If even the smallest angle between the question vector and the rows of the matrix is greater than a threshold value of 60 degrees, consider the question too different and return a message stating that the chatbot cannot understand it. (1 mark)

* **Hint**:
- Transform the input question (`im`) into the same vector space as the repository matrix using the CountVectorizer (and TF-IDF transformer) already fitted in Task 2.2, rather than fitting new ones.
- Apply `np.arccos` to the cosine to get the angle in radians before converting it with `np.rad2deg`.
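
The cosine-to-angle conversion described in these steps can be checked on vectors whose angles are known in advance. The sketch below uses three made-up two-dimensional "repository" rows at roughly 0, 45 and 90 degrees from the query; the clipping step is an added safeguard, not part of the task statement.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy repository rows and one query vector (values are made up).
repository = np.array([[1.0, 0.0],
                       [1.0, 1.0],
                       [0.0, 1.0]])
query = np.array([[1.0, 0.0]])

# cosine_similarity returns shape (n_rows, 1); ravel() flattens it to (n_rows,).
cosines = cosine_similarity(repository, query).ravel()

# Clipping guards against round-off pushing a cosine slightly outside [-1, 1],
# which would make arccos return NaN.
angles = np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))

best = int(np.argmin(angles))
print(np.round(angles, 1), best)
```

With a 60 degree threshold, the first two rows would count as similar enough to the query and the third would not.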

In [110]:
def conversation(im):
    # Vectorize the new question with the CountVectorizer and TF-IDF transformer fitted in Task 2.2
    question_vec = transformer.transform(vectorizer.transform([im]))
    # Cosine between the question and every repository row, flattened to shape (n_questions,)
    cosines = cosine_similarity(transformer_fit, question_vec).ravel()
    # Clip to [-1, 1] so floating-point round-off cannot make arccos return NaN
    angles = np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))
    best = np.argmin(angles)
    # An angle above 60 degrees means no stored question is similar enough
    if angles[best] > 60:
        return "Chatbot cannot understand the question"
    return list_answers[best]

## Task 2.4

Implement the chat, in which:

- The user enters their username and is then greeted by the chatbot. (0.3 mark)
- The chat is initiated with the user asking questions and the bot providing a response based on the `conversation` function created in Task 2.3. (0.5 mark)
- The chat continues until the user types 'bye'. (0.2 mark)

- Please demonstrate the interactions with your chatbot using the functions that you have created. (0.5 mark)

In [111]:
def main():
    name = input("Enter username:")
    print("Hi,", name, ", Welcome!")
    question = input("What would you like to know?")
    # Keep answering until the user types 'bye' (case and surrounding spaces are ignored)
    while question.strip().lower() != 'bye':
        print(conversation(question))
        question = input("What else would you like to know?")
    # Say goodbye even if 'bye' was the very first input
    print("Thank you for connecting with us!")
main()

Enter username:Kevin Owne
Hi, Kevin Owne , Welcome!
What would you like to know?Best reviewed products with excellent ratings
Chatbot cannot understand the question
What else would you like to know?cheap offers for appliances
hello, we are here to bring the savings to our customers and provide a great product as well. thank you, and please don't hesitate to let us know if we can help you with anything else. have a spectacular week! sincerely your cableforge/readyplug/cskins customer service team
What else would you like to know?best batteries available
don't know for sure..i bought extras a while ago by googling the camera name, model, and any numbers on the battery and found them @ an electronic company online.
What else would you like to know?bye
Thank you for connecting with us!
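
Because the loop above reads from the keyboard, it is awkward to test automatically. One possible variation passes the input, output and answer functions in as arguments so that a scripted session can stand in for a real user. The function name `chat`, its parameters, and the stub answer are all hypothetical and used only for illustration.

```python
def chat(ask, reply, respond):
    """Run the chat loop with injectable I/O so it can be tested without a keyboard."""
    name = ask("Enter username:")
    reply(f"Hi, {name}, welcome!")
    question = ask("What would you like to know?")
    while question.strip().lower() != "bye":
        reply(respond(question))
        question = ask("What else would you like to know?")
    reply("Thank you for connecting with us!")


# Scripted session: canned inputs and a stub in place of conversation().
scripted = iter(["Kevin", "best batteries available", "bye"])
transcript = []
chat(ask=lambda prompt: next(scripted),
     reply=transcript.append,
     respond=lambda q: f"(answer to: {q})")
print(transcript)
```

In the notebook itself, the interactive behaviour would correspond to calling `chat(input, print, conversation)`.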
