## AML 2304 – Natural Language Processing

Instructor: Bhavik Gandhi

Members:

* Anmolpreet Kaur (C0895954)
* Antonio Carlos De Mello Mendes (C0866063)
* Ann Margaret Silva (C0903604)
* Eduardo Jr Morales (C0900536)
* Flora Mae Villarin (C0905584)
* Maria Jessa Cruz (C0910329)
* Prescila Mora (C0896891)

Datasets:

* Bakhet, M. (2022). Amazon Book Reviews. Kaggle. Retrieved from https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews

### **Amazon Book Recommendation with Emotion Analysis**

In [2]:
# Loading libraries
import numpy as np
import pandas as pd
import ast

from gensim.models import Word2Vec

In [9]:
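# Load datasets
from pathlib import Path

# Path objects avoid the invalid "\D" escape in ".\Dataset" and work on any OS
base_dir = Path("Dataset")

# Cleaned Dataset
data_cleaned = pd.read_csv(base_dir / "data_cleaned.csv")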

# Display the first 5 entries of the DataFrame
data_cleaned.head(5)

| Unnamed: 0.1 | Unnamed: 0 | Id | categories | User_id | review_helpfulness | review_score | review_text | processed_text | tokens | embedding_word |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1882931173 | ['Comics & Graphic Novels'] | AVCGYZL8FQQTD | 7/7 | 4.0 | This is only for Julie Strain fans. It's a col... | this is only for julie strain fans its a colle... | [julie, strain, fan, collection, photo, page, ... | [-2.222338, 0.22019848, 0.09945665, 0.01612767... |
| 1 | 1 | 826414346 | ['Biography & Autobiography'] | A30TK6U7DNS82R | 10/10 | 5.0 | I don't care much for Dr. Seuss but after read... | i dont care much for dr seuss but after readin... | [dont, care, much, dr, seuss, reading, philip,... | [-1.4164933, 0.31558242, 0.015342161, 0.611179... |
| 2 | 2 | 826414346 | ['Biography & Autobiography'] | A3UH4UZ4RSVO82 | 10/11 | 5.0 | If people become the books they read and if "t... | if people become the books they read and if th... | [people, become, book, read, child, father, ma... | [-1.858772, -0.14652297, -0.19540891, 0.551445... |
| 3 | 3 | 826414346 | ['Biography & Autobiography'] | A2MVUWT453QH61 | 7/7 | 4.0 | Theodore Seuss Geisel (1904-1991), aka &quot;D... | theodore seuss geisel aka quotdr seussquot wa... | [theodore, seuss, geisel, aka, quotdr, seussqu... | [-1.297682, -0.1976624, -0.31652075, 0.3948929... |
| 4 | 4 | 826414346 | ['Biography & Autobiography'] | A22X4XUPKF66MR | 3/3 | 4.0 | Philip Nel - Dr. Seuss: American IconThis is b... | philip nel dr seuss american iconthis is basi... | [philip, nel, dr, seuss, american, iconthis, b... | [-1.7385675, 0.15008123, -0.16322877, 0.705278... |
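
List-valued columns such as `tokens` generally come back from a CSV file as plain strings, which is why they are parsed with `ast.literal_eval` before training. The sketch below illustrates this round trip with the standard library only. It assumes the file stores each token list as its Python `repr`; the in-memory buffer and the single sample row (taken from the first entry above) are for illustration, not the notebook's actual loading code.

```python
import ast
import csv
import io

# Write one row whose "tokens" field is a list rendered as text (assumed storage format).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "tokens"])
writer.writerow(["1882931173", str(["julie", "strain", "fan"])])

# Read the row back: every CSV field is a string, including the token list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_tokens = row["tokens"]

# literal_eval safely turns the text back into a real Python list.
parsed_tokens = ast.literal_eval(raw_tokens)

print(type(raw_tokens).__name__, type(parsed_tokens).__name__)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is the safer choice for values read from a file.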


#### **C. Feature Extraction**

Word2Vec is a good fit when semantic relationships between words matter and the corpus is large enough to learn them. It learns a dense vector for each word from the contexts in which that word appears, so words used in similar contexts end up with similar vectors. This makes it well suited to recommendation systems and word-similarity tasks.
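
Similarity between word vectors is usually measured with cosine similarity. The following is a minimal, dependency-free sketch of that idea. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical and made up for illustration; they are not values or functions from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real Word2Vec vectors here would have 100 dimensions.
toy_vectors = {
    "book": [0.9, 0.1, 0.3],
    "novel": [0.8, 0.2, 0.4],
    "battery": [-0.5, 0.9, -0.2],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("book", toy_vectors)
print(ranking)
```

In this toy setup "novel" ranks above "battery" for the query "book". On a trained gensim model, `model.wv.most_similar(...)` provides a comparable cosine-based ranking over the full vocabulary.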

In [4]:
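# Train Word2Vec model
# The token lists are stored as strings in the CSV; parse them back into Python lists
# (the isinstance check keeps the cell safe to re-run once the column already holds lists)
data_cleaned['tokens'] = data_cleaned['tokens'].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)
word2vec_model = Word2Vec(sentences=data_cleaned['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)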

# Function to extract embeddings for a list of tokens
def extract_embeddings(tokens_list, model):
    embeddings = []
    for token in tokens_list:
        if token in model.wv:
            embeddings.append(model.wv[token])
        else:
            # Use zero vector for out-of-vocabulary tokens
            embeddings.append(np.zeros(model.vector_size))  
    if embeddings:
        # Average of word embeddings
        return np.mean(embeddings, axis=0)  
    else:
        # Return zero vector if no embeddings found
        return np.zeros(model.vector_size)  

# Apply the function to each row of data_cleaned
data_cleaned['embedding_word'] = data_cleaned['tokens'].apply(lambda tokens: extract_embeddings(tokens, word2vec_model))
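
The function above represents a review as the average of its word vectors and substitutes a zero vector for any out-of-vocabulary token. The sketch below, in plain Python with a hypothetical two-dimensional vocabulary, shows how that choice affects the result compared with simply skipping unknown tokens. The names `mean_vector` and `toy_vocab` are illustrative assumptions, not part of the notebook.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical 2-dimensional vocabulary, made up for illustration.
toy_vocab = {"good": [2.0, 4.0], "story": [4.0, 0.0]}
dim = 2
tokens = ["good", "story", "zzz", "qqq"]  # the last two are out of vocabulary

# Strategy used in the cell above: unknown tokens contribute a zero vector,
# which pulls the average toward the origin.
with_zeros = mean_vector([toy_vocab.get(t, [0.0] * dim) for t in tokens])

# Alternative: ignore unknown tokens and average only the known ones.
known = [toy_vocab[t] for t in tokens if t in toy_vocab]
skipping = mean_vector(known) if known else [0.0] * dim

print(with_zeros)  # [1.5, 1.0]
print(skipping)    # [3.0, 2.0]
```

Because the model is trained with `min_count=1`, every token in the training data is in the vocabulary, so this difference only matters when the function is applied to text the model has not seen.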

In [8]:
# Testing: Retrieve embeddings for specific tokens and verify
test_tokens = ['care']
test_embeddings = extract_embeddings(test_tokens, word2vec_model)
print(f"Embeddings for tokens {test_tokens}:")
print(test_embeddings)

Embeddings for tokens ['care']:
[-0.8181723  -1.0012636  -1.2368373   0.96496713 -0.601813   -4.604459
  3.5553896   1.5855979   0.2679521   0.71615857 -1.1164246   0.70556545
 -5.2474113   0.09633974 -4.860483    5.7150345   0.08756095  1.8314395
  3.359977   -0.24423656 -0.07642583 -1.1243268   1.3301704  -2.2251573
  1.1131893   1.7977425  -0.24453217  1.9105943  -0.5047036  -0.29841867
  1.3712335   4.7104173  -1.7278252   0.83549565 -0.70165735 -3.5643833
  0.830941   -1.1965132  -0.47553265 -2.1157048  -2.9535613   1.4494973
  3.1285124  -0.0110596  -0.6475354  -2.871101   -0.42643344 -1.8404366
 -1.6280702  -3.6207392  -3.5842254   5.950976    0.5037155  -1.1870943
 -0.12848634 -0.2233201  -0.55824894 -2.986611   -1.852268    1.4505099
  2.9567862   1.1599938  -2.9143927  -0.06997781 -1.492328   -3.7699478
 -3.6043122   0.2597901   1.4550586  -3.7549188  -5.75379    -1.0765803
 -1.3230596   0.82481354 -2.5402846  -0.5145821   1.0601343   2.2980456
  0.21901053  2.244655   -0.539