## AML 2304 – Natural Language Processing

Instructor: Bhavik Gandhi

Members:

* Anmolpreet Kaur (C0895954)
* Antonio Carlos De Mello Mendes (C0866063)
* Ann Margaret Silva (C0903604)
* Eduardo Jr Morales (C0900536)
* Flora Mae Villarin (C0905584)
* Maria Jessa Cruz (C0910329)
* Prescila Mora (C0896891)

Datasets:
* Bakhet, M. (2022). Amazon Book Reviews. Kaggle. Retrieved from https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews

### **Amazon Book Recommendation with Emotion Analysis**

In [10]:
# Loading libraries
import numpy as np
import pandas as pd
import ast
import torch

from gensim.models import Word2Vec
from transformers import AutoTokenizer, DistilBertModel


In [11]:
base_dir = "../data"

# Load cleaned dataset
data_cleaned = pd.read_csv(f"{base_dir}/data_cleaned.csv")

# Display the first 5 entries of the DataFrame
data_cleaned.head(5)

Unnamed: 0.1,Unnamed: 0,Id,categories,User_id,review_helpfulness,review_score,review_text,processed_text,tokens
0,0,B000H9R1Q0,['Juvenile Fiction'],Unknown,0/0,5.0,I read this book before reading any other fant...,i read this book before reading any other fant...,"['read', 'book', 'reading', 'fantasy', 'novel'..."
1,1,B000JC6MMO,['Fiction'],A6PVMUJOXAWRM,0/0,5.0,I read this book not knowing a lot about Dean ...,i read this book not knowing a lot about dean ...,"['read', 'book', 'knowing', 'lot', 'dean', 'ko..."
2,2,1558322175,['Cooking'],A1OX82JPAQLL60,28/30,5.0,"Villas is right, it is American, this casserol...",villas is right it is american this casserole ...,"['villa', 'right', 'american', 'casserole', 'd..."
3,3,0711217599,['Religion'],A2VE83MZF98ITY,31/31,5.0,Augustine's 'Confessions' is among the most im...,augustines confessions is among the most impor...,"['augustine', 'confession', 'among', 'importan..."
4,4,B000OUEI1I,['Fiction'],ABBQNOK1V3521,1/1,3.0,This wasn't the best written book I've ever re...,this wasnt the best written book ive ever read...,"['wasnt', 'best', 'written', 'book', 'ive', 'e..."


#### **C. Feature Extraction**

Word2Vec is a good fit when semantic relationships between words matter and the corpus is large enough to learn them. It learns a dense vector for each word from the contexts in which the word appears, so words used in similar contexts end up with similar vectors. This makes it well suited to recommendation systems and to measuring word similarity.
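
To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors in `toy_vectors` are hand-written assumptions for illustration only; they are not taken from the trained model, which produces 100-dimensional vectors.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, hand-written for illustration only.
# A trained Word2Vec model would supply learned vectors instead.
toy_vectors = {
    "novel": np.array([0.9, 0.1, 0.0]),
    "story": np.array([0.8, 0.2, 0.1]),
    "casserole": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; returns 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

sim_related = cosine_similarity(toy_vectors["novel"], toy_vectors["story"])
sim_unrelated = cosine_similarity(toy_vectors["novel"], toy_vectors["casserole"])

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)  # True
```

The same function could be applied to two averaged review embeddings to score how alike two reviews are, which is one plausible building block for a recommender.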

In [12]:
# Parse the stringified token lists back into Python lists
data_cleaned['tokens'] = data_cleaned['tokens'].apply(ast.literal_eval)

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=data_cleaned['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)

# Function to extract a single averaged embedding for a list of tokens
def extract_embeddings(tokens_list, model):
    """Return the mean Word2Vec vector of tokens_list.

    Out-of-vocabulary tokens contribute a zero vector, so they pull the
    mean toward zero. An empty list yields a zero vector.
    """
    embeddings = []
    for token in tokens_list:
        if token in model.wv:
            embeddings.append(model.wv[token])
        else:
            # Use zero vector for out-of-vocabulary tokens
            embeddings.append(np.zeros(model.vector_size))
    if embeddings:
        # Average of word embeddings
        return np.mean(embeddings, axis=0)
    else:
        # Return zero vector if the token list is empty
        return np.zeros(model.vector_size)

# Apply the function to each row of data_cleaned
data_cleaned['embedding_word'] = data_cleaned['tokens'].apply(lambda tokens: extract_embeddings(tokens, word2vec_model))

In [13]:
# Sanity check: an out-of-vocabulary token (here, the empty string) should yield the zero vector
test_tokens = ['']
test_embeddings = extract_embeddings(test_tokens, word2vec_model)
print(f"Embeddings for tokens {test_tokens}:")
print(test_embeddings)

Embeddings for tokens ['']:
[0. 0. 0. 0. 0.]
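
In `extract_embeddings`, an out-of-vocabulary token adds a zero vector to the average, which shrinks the result toward zero. With `min_count=1` every training token is in the vocabulary, so this only matters for unseen text. One possible alternative, sketched below under assumptions, is to average only the known tokens. The helper `mean_embedding_skip_oov` and the plain dictionary standing in for `model.wv` are hypothetical and not part of the notebook.

```python
import numpy as np

def mean_embedding_skip_oov(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are ignored."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known tokens at all: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional vectors standing in for `model.wv`
toy_vectors = {"good": np.array([1.0, 0.0]), "book": np.array([0.0, 1.0])}

with_oov = mean_embedding_skip_oov(["good", "book", "zzz"], toy_vectors, dim=2)
all_oov = mean_embedding_skip_oov(["zzz"], toy_vectors, dim=2)

print(with_oov)  # [0.5 0.5]
print(all_oov)   # [0. 0.]
```

With the zero-vector approach the first result would be roughly `[0.33, 0.33]` instead, because the unknown token counts toward the denominator. Which behavior is preferable depends on the downstream task.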


In [None]:
'''
TODO: Remove this DistilBERT embedding code; it is too slow on the full dataset.

# Embeddings from the pretrained distilbert-base-uncased checkpoint
# (slow on large data)

data_embed_distilbert = data_cleaned.copy()

# Extract the text data from the tokens column and join them into a single string for each review
text_data = data_embed_distilbert['tokens'].apply(lambda tokens: ' '.join(tokens)).tolist()

# Load the tokenizer and model
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = DistilBertModel.from_pretrained(model_ckpt)

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Batch size for processing
batch_size = 16

# Function to get DistilBERT embeddings for a batch
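def get_distilbert_embeddings(text_batch, tokenizer, model):
    inputs = tokenizer(text_batch, return_tensors="pt", truncation=True, padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only; the attention mask zeroes out padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1).clamp(min=1e-9)).cpu().numpy()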

# Process data in batches
embeddings = []
for i in range(0, len(text_data), batch_size):
    batch = text_data[i:i+batch_size]
    batch_embeddings = get_distilbert_embeddings(batch, tokenizer, model)
    embeddings.append(batch_embeddings)

# Flatten the list of embeddings
embeddings = np.vstack(embeddings)

# Add embeddings to the DataFrame
data_cleaned['embedding_word_distilbert'] = list(embeddings)
'''

In [14]:
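# Use a consistent lowercase name for the user ID column
data_cleaned.rename(columns={'User_id': 'user_id'}, inplace=True)

# Save the embedded dataset locally; index=False avoids writing another "Unnamed: 0" column
data_cleaned.to_csv(f"{base_dir}/data_embedded.csv", index=False)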
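
One caveat when saving embeddings to CSV: each NumPy array is written as its string representation, which is space-separated and may not parse back cleanly with `ast.literal_eval`. A minimal sketch of one workaround, assuming a JSON round-trip is acceptable, follows. The helpers `encode_embedding` and `decode_embedding` are hypothetical names, not part of the notebook.

```python
import json
import numpy as np

def encode_embedding(vec):
    """Serialize a 1-D NumPy array as a JSON list string."""
    return json.dumps(np.asarray(vec).tolist())

def decode_embedding(text):
    """Parse a JSON list string back into a float32 NumPy array."""
    return np.array(json.loads(text), dtype=np.float32)

# Small example vector; values chosen to be exactly representable in float32
original = np.array([0.25, -0.5, 1.0], dtype=np.float32)
encoded = encode_embedding(original)
restored = decode_embedding(encoded)

print(encoded)                             # [0.25, -0.5, 1.0]
print(np.array_equal(original, restored))  # True
```

In this notebook the idea would be to apply `encode_embedding` to the `embedding_word` column before calling `to_csv`, and `decode_embedding` after `read_csv` in the next step. A binary format such as Parquet or `np.save` may be a better choice for large datasets.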

In [15]:
data_cleaned

Unnamed: 0.1,Unnamed: 0,Id,categories,user_id,review_helpfulness,review_score,review_text,processed_text,tokens,embedding_word
0,0,B000H9R1Q0,['Juvenile Fiction'],Unknown,0/0,5.0,I read this book before reading any other fant...,i read this book before reading any other fant...,"[read, book, reading, fantasy, novel, read, si...","[0.0033755056, 0.024587737, 0.029524483, -0.00..."
1,1,B000JC6MMO,['Fiction'],A6PVMUJOXAWRM,0/0,5.0,I read this book not knowing a lot about Dean ...,i read this book not knowing a lot about dean ...,"[read, book, knowing, lot, dean, koontz, story...","[0.04009305, 0.008773962, 0.01123846, -0.01507..."
2,2,1558322175,['Cooking'],A1OX82JPAQLL60,28/30,5.0,"Villas is right, it is American, this casserol...",villas is right it is american this casserole ...,"[villa, right, american, casserole, defines, r...","[0.017385779, 0.023026405, 0.01447912, 0.01746..."
3,3,0711217599,['Religion'],A2VE83MZF98ITY,31/31,5.0,Augustine's 'Confessions' is among the most im...,augustines confessions is among the most impor...,"[augustine, confession, among, important, book...","[-0.0123476405, -0.00028553308, 0.024723677, -..."
4,4,B000OUEI1I,['Fiction'],ABBQNOK1V3521,1/1,3.0,This wasn't the best written book I've ever re...,this wasnt the best written book ive ever read...,"[wasnt, best, written, book, ive, ever, read, ...","[0.02614239, 0.0142108, 0.017003234, -0.038453..."
5,5,B000NWQXBA,['Juvenile Fiction'],A3IB92ACL9LLJ3,1/1,5.0,This was the sole book that got me into readin...,this was the sole book that got me into readin...,"[sole, book, got, reading, book, used, really,...","[-3.464837e-05, 0.017863374, 0.028809272, 0.02..."
6,6,0838460151,['Reference'],AQJ53LIBRLGJY,7/7,5.0,As a middle school teacher of both English lan...,as a middle school teacher of both english lan...,"[middle, school, teacher, english, language, l...","[-0.0010210477, 0.012527325, -0.019546598, -0...."
7,7,1587248611,['Fiction'],A19NEUK1GR692L,0/1,5.0,Another chapter in the shopaholic series & Bec...,another chapter in the shopaholic series beck...,"[another, chapter, shopaholic, series, becky, ...","[0.013754645, 0.0013688209, 0.0141074555, 0.00..."
8,8,B000K4XV8O,['Religion'],A36BIIOWDYI4N7,41/89,3.0,Contains substantial duplicity of standards. C...,contains substantial duplicity of standards cr...,"[contains, substantial, duplicity, standard, c...","[0.001599986, 0.012784105, 0.039912686, -0.011..."
9,9,B000INZAJK,['Technology & Engineering'],Unknown,0/0,4.0,&quot;Pest Control&quot; was one of the best b...,quotpest controlquot was one of the best books...,"[quotpest, controlquot, one, best, book, read,...","[0.014956436, -0.03519292, -0.0072899144, 0.01..."
