
In [1]:
from google.colab import drive

In [2]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os
os.chdir("/content/drive/MyDrive/Colab_Notebooks/Amazon Review/Dataset")

In [4]:
!ls

readme.txt  test.csv  train.csv


In [5]:
# Libraries for exploratory data analysis (EDA)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries for text preprocessing
import string
import unicodedata  # Unicode character categories, used to strip punctuation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Load the data


In [6]:
# The CSV has no header row, so supply the column names directly
train_df = pd.read_csv('train.csv', header=None, names=['rate', 'title', 'review'])

In [7]:
train_df[:5]

Unnamed: 0,rate,title,review
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...


In [8]:
train_df['rate'].unique()

array([3, 5, 4, 1, 2])

In [9]:
# The CSV has no header row, so supply the column names directly
test_df = pd.read_csv('test.csv', header=None, names=['rate', 'title', 'review'])

In [10]:
test_df[:5]

Unnamed: 0,rate,title,review
0,1,mens ultrasheer,"This model may be ok for sedentary types, but ..."
1,4,Surprisingly delightful,This is a fast read filled with unexpected hum...
2,2,"Works, but not as advertised",I bought one of these chargers..the instructio...
3,2,Oh dear,I was excited to find a book ostensibly about ...
4,2,Incorrect disc!,"I am a big JVC fan, but I do not like this mod..."


In [11]:
# Display basic statistics of numerical columns ('rate' column)
print("Summary Statistics:")
print(train_df['rate'].describe())

Summary Statistics:
count    3.000000e+06
mean     3.000000e+00
std      1.414214e+00
min      1.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rate, dtype: float64


In [12]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   rate    int64 
 1   title   object
 2   review  object
dtypes: int64(1), object(2)
memory usage: 68.7+ MB


In [13]:
# Count the number of reviews in each sentiment rating category
class_counts = train_df['rate'].value_counts()

# Calculate the class distribution in percentages
class_percentages = (class_counts / train_df.shape[0]) * 100
class_percentages = class_percentages.round().astype(int).astype(str) + "%"

# Display the class distribution
print("Class Distribution (Percentage):")
print(class_percentages)



Class Distribution (Percentage):
3    20%
5    20%
4    20%
1    20%
2    20%
Name: rate, dtype: object


In [14]:
train_df.shape

(3000000, 3)

In [15]:
test_df.shape

(650000, 3)

In [16]:
# Display a few sample reviews for each sentiment rating
for rate in range(1, 6):
    sample_reviews = train_df[train_df['rate'] == rate].sample(5)
    print(f"\nSample Reviews for Sentiment Rating {rate}:")
    for index, row in sample_reviews.iterrows():
        print(f"Review Title: {row['title']}")
        print(f"Review Text: {row['review']}\n")


Sample Reviews for Sentiment Rating 1:
Review Title: i have to give her 1 star
Review Text: but I don't want to. There is ONE good song on the CD (Rain on Me). Ashanti, the self-proclaimed Princess of hip hop and r&b, should take the head phones off and LISTEN to the crap she put on this CD. The CD is garbage flat out, straight up. There is not a good thing that I can say except that this CD is the real reason they make STOP buttons on CD players.

Review Title: v9118 bad phone
Review Text: I had the v9110 first and this is the phone vtech replaced it with. It is the worst cordless phone I have ever owned. They are replacing it agian with another model, but they don't give you replacedments for the quality of the one you purchase originally.The 9118 has bad sound quality, 1/2 hour talk battery life and doesn't pick up right away when you answer the phone. My opinion is the phone sucks!

Review Title: Teenager Humor, I Fast Forwarded Most of the Movie
Review Text: I believe that this m

In [17]:
# Check for missing values in the training set
missing_data = train_df.isnull().sum()
print('missing_data: ',missing_data)

missing_data:  rate       0
title     76
review     0
dtype: int64


In [18]:
# Check for missing values in the test set
missing_data = test_df.isnull().sum()
print('missing_data: ',missing_data)

missing_data:  rate       0
title     12
review     0
dtype: int64


- **Dataset Size:** The dataset includes 3,000,000 training observations and 650,000 test observations.

- **Features:** Three columns are available: 'rate', 'title', and 'review'.

- **Missing Values:** 76 titles are missing in the training set and 12 in the test set. No ratings or review texts are missing.

- **Balanced Distribution:** Each sentiment rating (1 to 5) accounts for approximately 20% of the training set.

- **Feature Focus:** To keep processing time manageable on a dataset this large, I use only the 'review' column and drop 'title'.
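
The balance claim can be cross-checked against the summary statistics printed for 'rate': five equally frequent ratings from 1 to 5 have a mean of 3 and a standard deviation of sqrt(2), about 1.414214, which matches the `describe()` output. A minimal sketch of that arithmetic, using a small made-up list in place of the real column (the name `ratings` and its contents are assumptions for illustration):

```python
from collections import Counter
from statistics import mean, pstdev

# Toy stand-in for the 'rate' column: equal counts of each rating (made-up data).
ratings = [1, 2, 3, 4, 5] * 1000

# Share of each rating, in percent.
counts = Counter(ratings)
shares = {rate: 100 * n / len(ratings) for rate, n in sorted(counts.items())}

print(shares)
print(mean(ratings), round(pstdev(ratings), 6))
```

With three million rows, the sample standard deviation that pandas reports is indistinguishable from the population value computed here.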


### Pre-processing

In [19]:
train_df.drop(columns='title', inplace=True)
test_df.drop(columns='title', inplace=True)

In [20]:
train_df[:5]

Unnamed: 0,rate,review
0,3,Gave this to my dad for a gag gift after direc...
1,5,I hope a lot of people hear this cd. We need m...
2,5,I'm reading a lot of reviews saying that this ...
3,4,The music of Yasunori Misuda is without questi...
4,5,Probably the greatest soundtrack in history! U...


In [21]:
# Function for text preprocessing
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove ASCII punctuation and symbols (everything in string.punctuation)
    text = ''.join([char for char in text if char not in string.punctuation])

    # Remove remaining Unicode punctuation (categories starting with 'P'), e.g. curly quotes
    text = ''.join([char for char in text if not unicodedata.category(char).startswith('P')])

    # Tokenization (split text into words)
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Keep only tokens longer than 2 characters and join them into the cleaned string
    clean_text = ' '.join([item for item in tokens if len(item) > 2])

    return clean_text
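
The two punctuation filters in `preprocess_text` are not redundant: `string.punctuation` covers ASCII only, while the `unicodedata` check catches non-ASCII punctuation but leaves symbols such as `$` in place. A minimal sketch of the difference, using hypothetical helper names and a made-up sample string:

```python
import string
import unicodedata

def strip_ascii_punct(text):
    """Drop characters listed in string.punctuation (ASCII only)."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

def strip_unicode_punct(text):
    """Drop characters whose Unicode category starts with 'P'."""
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))

# Made-up review text with curly quotes (U+2019, U+201C, U+201D) and a dollar sign.
sample = "It\u2019s \u201cgreat\u201d, worth $5!"

ascii_only = strip_ascii_punct(sample)      # curly quotes survive
unicode_only = strip_unicode_punct(sample)  # '$' survives (category 'Sc', a symbol)
both = strip_unicode_punct(strip_ascii_punct(sample))

print(repr(ascii_only))
print(repr(unicode_only))
print(repr(both))
```

Only the combination removes both kinds of character, which is the order the function applies.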

In [22]:
# Download NLTK's 'punkt' tokenizer models
nltk.download('punkt')

# Download NLTK's stopwords dataset
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
train_df['clean_review'] = train_df['review'].apply(preprocess_text)
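
This `apply` call runs `preprocess_text` once per row, three million times, and each call rebuilds the stop-word set. One way to cut that overhead is to build the set once and reuse it. The sketch below illustrates the idea under loud assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for `stopwords.words('english')`, whitespace splitting replaces `word_tokenize`, and `preprocess_fast` is an illustrative name, not part of the notebook.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); built once, not per call.
STOP_WORDS = frozenset({'this', 'to', 'my', 'for', 'a', 'the', 'is', 'it', 'and'})
# Translation table that deletes ASCII punctuation in a single pass.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_fast(text, stop_words=STOP_WORDS):
    """Lowercase, strip ASCII punctuation, drop stop words and tokens of <= 2 chars."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return ' '.join(t for t in tokens if t not in stop_words and len(t) > 2)

# Made-up sample sentences.
reviews = ["Gave this to my dad for a gag gift.", "The music is great, and it lasts!"]
cleaned = [preprocess_fast(r) for r in reviews]
print(cleaned)
```

Results will differ from the NLTK-based function on real reviews because the tokenizer and stop-word list are simplified here.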

In [24]:
train_df[:5]

Unnamed: 0,rate,review,clean_review
0,3,Gave this to my dad for a gag gift after direc...,gave dad gag gift directing nunsense got reall...
1,5,I hope a lot of people hear this cd. We need m...,hope lot people hear need strong positive vibe...
2,5,I'm reading a lot of reviews saying that this ...,reading lot reviews saying best game soundtrac...
3,4,The music of Yasunori Misuda is without questi...,music yasunori misuda without question close s...
4,5,Probably the greatest soundtrack in history! U...,probably greatest soundtrack history usually b...


In [25]:
# Save train_df to a CSV file
train_df.to_csv('train_data.csv', index=False)

In [26]:
# Clean the test reviews. The source must be test_df['review']: assigning from
# train_df['review'] aligns on the index and copies training text into the test rows.
test_df['clean_review'] = test_df['review'].apply(preprocess_text)

In [27]:
test_df[:5]

Unnamed: 0,rate,review,clean_review
0,1,"This model may be ok for sedentary types, but ...",...
1,4,This is a fast read filled with unexpected hum...,...
2,2,I bought one of these chargers..the instructio...,...
3,2,I was excited to find a book ostensibly about ...,...
4,2,"I am a big JVC fan, but I do not like this mod...",...
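
Assigning a Series to a DataFrame column aligns on index labels rather than on position, so the frame that the Series comes from determines what lands in each row. A minimal sketch with two made-up frames (the names `train`, `test`, `from_train`, and `from_test` are hypothetical):

```python
import pandas as pd

# Two small made-up frames that share index labels 0 and 1.
train = pd.DataFrame({'review': ['train a', 'train b', 'train c']})
test = pd.DataFrame({'review': ['test x', 'test y']})

# Assignment aligns on the index: rows 0 and 1 of `train` land in `test`,
# and the extra training row is silently dropped. No error is raised.
test['from_train'] = train['review'].str.upper()

# Deriving the column from the same frame keeps each row with its own text.
test['from_test'] = test['review'].str.upper()

print(test)
```

Because both frames use a default `RangeIndex`, a mismatched source produces plausible-looking values instead of an error, so a quick look at the first few rows is a useful check.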


In [28]:
# Save test_df to a CSV file
test_df.to_csv('test_data.csv', index=False)


In [29]:
train_df

Unnamed: 0,rate,review,clean_review
0,3,Gave this to my dad for a gag gift after direc...,gave dad gag gift directing nunsense got reall...
1,5,I hope a lot of people hear this cd. We need m...,hope lot people hear need strong positive vibe...
2,5,I'm reading a lot of reviews saying that this ...,reading lot reviews saying best game soundtrac...
3,4,The music of Yasunori Misuda is without questi...,music yasunori misuda without question close s...
4,5,Probably the greatest soundtrack in history! U...,probably greatest soundtrack history usually b...
...,...,...,...
2999995,1,The high chair looks great when it first comes...,high chair looks great first comes box hill im...
2999996,2,I have used this highchair for 2 kids now and ...,used highchair kids finally decided sell like ...
2999997,2,"We have a small house, and really wanted two o...",small house really wanted two high chairs twin...
2999998,3,I agree with everyone else who says this chair...,agree everyone else says chair hard clean boug...
