# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Sentiment Classification on Amazon Food Reviews

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Amazon food review dataset
* train an LSTM model for sentiment classification
* modularize the sentiment classification application

## Dataset description

The dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

The data cover the period from October 1999 to October 2012, with 568,454 reviews, 256,059 users, 74,258 products, and 260 users with more than 50 reviews each.

The data is in CSV format with the following features:

- ***Id:*** Row Id
- ***ProductId:*** Unique identifier for the product
- ***UserId:*** Unique identifier for the user
- ***ProfileName:*** Profile name of the user
- ***HelpfulnessNumerator:*** Number of users who found the review helpful
- ***HelpfulnessDenominator:*** Number of users who indicated whether they found the review helpful or not
- ***Score:*** Rating between 1 and 5
- ***Time:*** Timestamp for the review
- ***Summary:*** Brief summary of the review
- ***Text:*** Text of the review


##  Grading = 10 Points

## Information

Companies often receive thousands of reviews of their products, which can be analysed to get insights into what customers think about them.

Every positive review highlights the key beneficial features of the product, which can be replicated in other products to make them more appealing. On the other hand, every negative review highlights the weaknesses of the product, which can be treated as feedback for making improvements.

### Import required packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from bs4 import BeautifulSoup
import io
import re
import json
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import layers

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/Reviews.csv
!ls | grep ".csv"

**Exercise 1: Read the Reviews.csv dataset**

**Hint:** pd.read_csv()

In [None]:
# YOUR CODE HERE

### Pre-processing and EDA

**Exercise 2: Perform the following operations on the dataset [1 Mark]**

- Remove unnecessary columns - 'Id', 'HelpfulnessNumerator', 'HelpfulnessDenominator'
- Check missing values
- Add a new `Sentiment` column using the `Score` column ('positive' if score >= 3, else 'negative')
- Remove duplicates from the data considering the `Sentiment` and `Text` columns
- Convert `Time` to a proper datetime format


- **Remove unnecessary columns - 'Id', 'HelpfulnessNumerator', 'HelpfulnessDenominator'**

In [None]:
# YOUR CODE HERE

- **Check missing values**

In [None]:
# YOUR CODE HERE

- **Add a new `Sentiment` column using `Score` column**

Consider a review to be negative if the score is less than 3, else positive.
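The labelling rule above can be sketched on a few toy scores. This is only an illustration under stated assumptions: the `toy` frame is made-up data standing in for the real `Score` column, and `np.where` is one of several ways to derive the label.

```python
import numpy as np
import pandas as pd

# Toy scores standing in for the real `Score` column (illustrative data only).
toy = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Scores below 3 are labelled negative; 3 and above are labelled positive.
toy["Sentiment"] = np.where(toy["Score"] < 3, "negative", "positive")

print(toy)
```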

In [None]:
# Add 'Sentiment' column
# YOUR CODE HERE

- **Remove duplicates from data considering `Sentiment` and `Text` columns**

**Hint:** To check for duplicate rows, refer [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html).
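As a rough sketch of how the `subset` argument restricts the comparison to chosen columns, the example below uses a made-up four-row frame (the `toy` data is an assumption, not taken from Reviews.csv). Note that the same text with a different sentiment is not treated as a duplicate.

```python
import pandas as pd

# Illustrative rows only: the first two are identical on both columns.
toy = pd.DataFrame({
    "Text": ["great tea", "great tea", "great tea", "stale chips"],
    "Sentiment": ["positive", "positive", "negative", "negative"],
})

# True for every row that repeats an earlier (Sentiment, Text) pair.
dup_mask = toy.duplicated(subset=["Sentiment", "Text"])
n_duplicates = int(dup_mask.sum())

# Keep the first occurrence of each pair and renumber the rows.
deduped = toy.drop_duplicates(subset=["Sentiment", "Text"], keep="first")
deduped = deduped.reset_index(drop=True)

print(n_duplicates, len(deduped))
```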

In [None]:
# Check duplicates
# YOUR CODE HERE

In [None]:
# Remove duplicates
# YOUR CODE HERE

- **Convert `Time` to a proper datetime format**

The `Time` column is in Unix time; convert it to UTC.

Note: Unix time represents a timestamp as the number of seconds elapsed since January 1st, 1970 at 00:00:00 UTC.

**Hint:** datetime.fromtimestamp()
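One detail worth knowing about the hinted function: without a `tz` argument, `datetime.fromtimestamp()` returns the machine's local time rather than UTC. A minimal sketch, assuming a single integer timestamp (the value below is only an example), could look like this:

```python
from datetime import datetime, timezone

# Example Unix timestamp in seconds (illustrative value).
unix_seconds = 1303862400

# Passing tz=timezone.utc makes the result timezone-aware and in UTC.
utc_time = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(utc_time.isoformat())  # 2011-04-27T00:00:00+00:00
```

For a whole DataFrame column, `pd.to_datetime(column, unit="s", utc=True)` is a vectorised alternative.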

In [None]:
# YOUR CODE HERE

**Exercise 3: Identify the `ProductId`s with the highest number of positive and negative reviews. Use a bar plot. [0.5 Marks]**
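The counting step behind such a plot can be sketched with `value_counts()`, which returns counts sorted in descending order. The `toy` frame and its product IDs are made up for illustration; the plotting call itself is left out.

```python
import pandas as pd

# Illustrative data only; the product IDs are hypothetical.
toy = pd.DataFrame({
    "ProductId": ["P1", "P1", "P2", "P2", "P2", "P3"],
    "Sentiment": ["positive", "positive", "negative",
                  "negative", "positive", "negative"],
})

# Count reviews per product within each sentiment and keep the top 10.
top_positive = toy.loc[toy["Sentiment"] == "positive", "ProductId"].value_counts().head(10)
top_negative = toy.loc[toy["Sentiment"] == "negative", "ProductId"].value_counts().head(10)

print(top_positive)
print(top_negative)
```

Each result is a Series whose index holds the product IDs and whose values hold the counts, so the index and values can be passed as the two axes of a bar plot.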

In [None]:
# YOUR CODE HERE

**Exercise 4: Identify the `UserId`s who have given the highest number of positive and negative reviews**

In [None]:
# YOUR CODE HERE

### Pre-process `Text` reviews

**Exercise 5: Create functions to perform the following tasks: [1.5 Marks]**

- remove HTML, XML, etc. markup and metadata
- remove punctuation
- remove stopwords
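The three cleaning steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the expected solution: it swaps in a regular expression where the notebook imports BeautifulSoup, and `TOY_STOPWORDS` is a tiny hypothetical list standing in for the NLTK stopword corpus. All function names are made up for this example.

```python
import html
import re
import string

# Tiny illustrative stopword list; NLTK's English list is much larger.
TOY_STOPWORDS = {"the", "is", "a", "and", "this", "it"}

def strip_markup(text):
    """Drop anything that looks like a tag, then decode entities such as &amp;."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(no_tags)

def strip_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def strip_stopwords(text, stopwords=TOY_STOPWORDS):
    """Lowercase, split on whitespace, and keep tokens not in `stopwords`."""
    return " ".join(w for w in text.lower().split() if w not in stopwords)

raw = "This tea is <b>great</b> &amp; cheap!<br />"
cleaned = strip_stopwords(strip_punctuation(strip_markup(raw)))
print(cleaned)  # tea great cheap
```

The order matters here: markup is removed before punctuation, otherwise the angle brackets would be stripped and the tag names would survive as ordinary words.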


- **Remove HTML, XML, etc.**


In [None]:
# YOUR CODE HERE

- **Remove punctuation**

In [None]:
# YOUR CODE HERE

- **Remove stopwords**

In [None]:
# YOUR CODE HERE

**Exercise 6: Convert `Sentiment` to numerical values**

In [None]:
# YOUR CODE HERE

### Split data into train, validation, and test set

Use 60% for training, 20% for validation, and 20% for testing.
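A 60/20/20 split can be obtained by calling `train_test_split` twice: first hold out 40% of the data, then divide that held-out part in half. The sketch below uses synthetic `texts` and `labels`; the `stratify` and `random_state` arguments are optional choices assumed here, not requirements of the exercise.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the cleaned reviews and numeric labels.
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Step 1: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.4, stratify=labels, random_state=42
)

# Step 2: split the held-out 40% evenly into validation and test (20% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```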

In [None]:
# YOUR CODE HERE

### Tokenization and padding

**Exercise 7: Convert the review text to sequences of integer values, and pad them to a uniform length [0.5 Marks]**
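To make the two steps concrete, here is a pure-Python sketch of what tokenization and padding do conceptually. It is not the Keras API: the helper names are hypothetical, and the behaviour (indices starting at 1 in order of frequency, 0 reserved for padding, zeros added on the left, long sequences cut from the front) is intended to mirror the defaults of `Tokenizer` and `pad_sequences`, which should be used in the actual exercise.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros; keep only the last `maxlen` items of long sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

train_texts = ["good tea", "good coffee good price", "bad tea"]
word_index = fit_word_index(train_texts)
padded = pad(to_sequences(train_texts, word_index), maxlen=3)

print(word_index)  # {'good': 1, 'tea': 2, 'coffee': 3, 'price': 4, 'bad': 5}
print(padded)      # [[0, 1, 2], [3, 1, 4], [0, 5, 2]]
```

The word index is built from the training texts only, so the validation and test sets are encoded with a vocabulary they did not influence.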

In [None]:
# YOUR CODE HERE

**Exercise 8: Save the Tokenizer fitted on the training set in a JSON file, and load it back [0.5 Marks]**

**Hint:**

- To get a JSON string from the tokenizer, refer [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#to_json).
- To save the JSON string in a JSON file: `json.dumps()`
- To load the JSON string from a JSON file: `json.load()`
- To get a tokenizer instance from the JSON string, refer [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/tokenizer_from_json).
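The file round trip in the hints above can be sketched without TensorFlow. Here `tokenizer_json` is a hypothetical stand-in for the string that `tokenizer.to_json()` would return, and the file name and location are assumptions for this example. `json.dumps()` only produces text, so the result still has to be written to the open file.

```python
import json
import os
import tempfile

# Stand-in for the string returned by tokenizer.to_json() (hypothetical content).
tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {}})

# Hypothetical file location for this demo.
path = os.path.join(tempfile.gettempdir(), "tokenizer_demo.json")

# Save: encode the string as JSON text and write it to the file.
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(tokenizer_json))

# Load: json.load() decodes the file back into the original string.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == tokenizer_json)  # True
```

In the notebook, the restored string would then be passed to `tokenizer_from_json()` to rebuild the tokenizer.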

In [None]:
def save_tokenizer(tokenizer_to_save):
    # YOUR CODE HERE
    pass  # placeholder so the cell parses until the body is written

In [None]:
def load_tokenizer(filename):
    # YOUR CODE HERE
    pass  # placeholder so the cell parses until the body is written

### Build the model using LSTM

**Exercise 9: Create a model using Embedding, LSTM, and Dense layers for sentiment classification [1 Mark]**

In [None]:
# YOUR CODE HERE

**Exercise 10: Train and evaluate the model**

In [None]:
# Train model
# YOUR CODE HERE

In [None]:
# Final evaluation of the model on test data
# YOUR CODE HERE

### Test Prediction

Create a function to get the sentiment prediction for user input data.

In [None]:
# YOUR CODE HERE

## Modularization [5 points]

- Modularize the above sentiment classification application for the Amazon food reviews dataset as per the instructions given in the Instructions document.