## AML 2304 – Natural Language Processing

Instructor: Bhavik Gandhi

Members:

* Anmolpreet Kaur (C0895954)
* Antonio Carlos De Mello Mendes (C0866063)
* Ann Margaret Silva (C0903604)
* Eduardo Jr Morales (C0900536)
* Flora Mae Villarin (C0905584)
* Maria Jessa Cruz (C0910329)
* Prescila Mora (C0896891)

Datasets:
* Bakhet, M. (2022). Amazon Book Reviews. Kaggle. Retrieved from https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews

### **Amazon Book Recommendation with Emotion Analysis**

In [1]:
# Loading libraries
import numpy as np
import pandas as pd
import ast
import torch

from gensim.models import Word2Vec
from transformers import AutoTokenizer, DistilBertModel


In [2]:
# Load cleaned dataset
data_cleaned = pd.read_csv('/kaggle/input/amazon/data_cleaned.csv')

# Display the first 5 entries of the DataFrame
data_cleaned.head(5)

| Unnamed: 0.1 | Unnamed: 0 | Id | categories | User_id | review_helpfulness | review_score | review_text | processed_text | tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1882931173 | ['Comics & Graphic Novels'] | AVCGYZL8FQQTD | 7/7 | 4.0 | This is only for Julie Strain fans. It's a col... | this is only for julie strain fans its a colle... | ['julie', 'strain', 'fan', 'collection', 'phot... |
| 1 | 1 | 826414346 | ['Biography & Autobiography'] | A30TK6U7DNS82R | 10/10 | 5.0 | I don't care much for Dr. Seuss but after read... | i dont care much for dr seuss but after readin... | ['dont', 'care', 'much', 'dr', 'seuss', 'readi... |
| 2 | 2 | 826414346 | ['Biography & Autobiography'] | A3UH4UZ4RSVO82 | 10/11 | 5.0 | If people become the books they read and if "t... | if people become the books they read and if th... | ['people', 'become', 'book', 'read', 'child', ... |
| 3 | 3 | 826414346 | ['Biography & Autobiography'] | A2MVUWT453QH61 | 7/7 | 4.0 | Theodore Seuss Geisel (1904-1991), aka &quot;D... | theodore seuss geisel aka quotdr seussquot wa... | ['theodore', 'seuss', 'geisel', 'aka', 'quotdr... |
| 4 | 4 | 826414346 | ['Biography & Autobiography'] | A22X4XUPKF66MR | 3/3 | 4.0 | Philip Nel - Dr. Seuss: American IconThis is b... | philip nel dr seuss american iconthis is basi... | ['philip', 'nel', 'dr', 'seuss', 'american', '... |


#### **C. Feature Extraction**

Word2Vec learns a dense vector for each word from the contexts in which that word appears, so words used in similar contexts end up with similar vectors. This makes it a good choice when semantic relationships matter and the corpus is large, as in recommendation systems and word-similarity tasks. Here, each review is represented by the average of its word vectors.
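
To make "word similarities" concrete, the sketch below compares embedding vectors with cosine similarity, a measure commonly used with Word2Vec vectors. The three vectors are hypothetical values chosen for illustration; they are not output from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Hypothetical 3-dimensional embeddings (illustrative values only)
toy_vectors = {
    "novel": np.array([0.9, 0.1, 0.0]),
    "book": np.array([0.8, 0.2, 0.1]),
    "battery": np.array([0.0, 0.1, 0.9]),
}

sim_related = cosine_similarity(toy_vectors["novel"], toy_vectors["book"])
sim_unrelated = cosine_similarity(toy_vectors["novel"], toy_vectors["battery"])
print(f"novel vs book:    {sim_related:.3f}")
print(f"novel vs battery: {sim_unrelated:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.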

In [3]:
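# Parse the stringified token lists back into Python lists
data_cleaned['tokens'] = data_cleaned['tokens'].apply(ast.literal_eval)

# Train Word2Vec model (min_count=1 keeps every token in the vocabulary)
word2vec_model = Word2Vec(sentences=data_cleaned['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)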

# Function to extract embeddings for a list of tokens
def extract_embeddings(tokens_list, model):
    embeddings = []
    for token in tokens_list:
        if token in model.wv:
            embeddings.append(model.wv[token])
        else:
            # Use zero vector for out-of-vocabulary tokens
            embeddings.append(np.zeros(model.vector_size))  
    if embeddings:
        # Average of word embeddings
        return np.mean(embeddings, axis=0)  
    else:
        # Return zero vector if no embeddings found
        return np.zeros(model.vector_size)  

# Apply the function to each row in df
data_cleaned['embedding_word'] = data_cleaned['tokens'].apply(lambda tokens: extract_embeddings(tokens, word2vec_model))
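
Substituting zero vectors for out-of-vocabulary tokens pulls the average toward zero. A common alternative is to average only the tokens that have a vector. The sketch below illustrates this with a plain dictionary standing in for `model.wv`; the function name `average_known_embeddings` and the toy vectors are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def average_known_embeddings(tokens, vectors, dim):
    """Average the vectors of tokens present in `vectors`; zero vector if none are."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy lookup table standing in for a trained model's word vectors (hypothetical values)
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "book": np.array([0.0, 1.0]),
}

with_oov = average_known_embeddings(["good", "book", "zzz"], toy_vectors, dim=2)
all_oov = average_known_embeddings(["zzz"], toy_vectors, dim=2)
print(with_oov)  # average of "good" and "book" only
print(all_oov)   # zero vector, since no token is known
```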

In [8]:
# Testing: an empty-string token is out of vocabulary, so the result should be a zero vector
test_tokens = ['']
test_embeddings = extract_embeddings(test_tokens, word2vec_model)
print(f"Embeddings for tokens {test_tokens}:")
print(test_embeddings)

Embeddings for tokens ['']:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]


In [None]:
# Embedding using pretrained distilbert-base-uncased
# Note: this step is slow on large datasets

data_embed_distilbert = data_cleaned.copy()

# Extract the text data from the tokens column and join them into a single string for each review
text_data = data_embed_distilbert['tokens'].apply(lambda tokens: ' '.join(tokens)).tolist()

# Load the tokenizer and model
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = DistilBertModel.from_pretrained(model_ckpt)

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Batch size for processing
batch_size = 16

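# Function to get DistilBERT embeddings for a batch
def get_distilbert_embeddings(text_batch, tokenizer, model):
    inputs = tokenizer(text_batch, return_tensors="pt", truncation=True, padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only; the attention mask zeroes out padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).type_as(outputs.last_hidden_state)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return (summed / counts).cpu().numpy()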

# Process data in batches
embeddings = []
for i in range(0, len(text_data), batch_size):
    batch = text_data[i:i+batch_size]
    batch_embeddings = get_distilbert_embeddings(batch, tokenizer, model)
    embeddings.append(batch_embeddings)

# Flatten the list of embeddings
embeddings = np.vstack(embeddings)

# Add embeddings to the DataFrame
data_cleaned['embedding_word_distilbert'] = list(embeddings)
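
When sequences in a batch are padded to a common length, a mean over the sequence axis counts the padding positions unless they are masked out. The sketch below uses toy NumPy arrays (hypothetical values, not DistilBERT outputs) to show the difference; `masked_mean` is an illustrative helper name.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (hypothetical values)
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],  # 3 real tokens
    [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]],  # 1 real token, 2 padding positions
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 0, 0],
])

def masked_mean(hidden, attention_mask):
    """Mean over the sequence axis, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(hidden.dtype)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return (hidden * mask).sum(axis=1) / counts

plain = hidden.mean(axis=1)
masked = masked_mean(hidden, attention_mask)
print(plain[1])   # influenced by the padding positions
print(masked[1])  # uses the single real token only
```

For the fully unpadded first sequence the two means agree; they differ only where padding is present.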

In [9]:
# Save the DataFrame to a local CSV file (index=False avoids writing an extra "Unnamed: 0" column)
data_cleaned.to_csv("data_embedded.csv", index=False)
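
One caveat: `to_csv` writes each NumPy array in the embedding columns as its string representation, so the vectors come back as text when the CSV is reloaded. One alternative, sketched below, is to stack the vectors into a 2-D array and store it in NumPy's binary format. The array contents and the file name `embeddings.npy` are placeholders, not files from the notebook.

```python
import os
import tempfile
import numpy as np

# Placeholder embeddings: 3 reviews, 4 dimensions each (stand-in for an embedding column)
embedding_column = [np.arange(4, dtype=np.float32) + i for i in range(3)]

# Stack the per-row vectors into one (n_rows, dim) matrix
matrix = np.vstack(embedding_column)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")  # hypothetical file name
    np.save(path, matrix)
    restored = np.load(path)

print(restored.shape, restored.dtype)
```

The restored array keeps its shape and dtype, so no string parsing is needed on reload.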