In [47]:
import pandas as pd

In [67]:
df = pd.read_csv('/content/drive/MyDrive/Data Science with Advanced Gen AI Internship/Internship Tasks/data.csv')


### Exploratory Data Analysis

In [68]:
df.head()

Unnamed: 0,Reviewer Name,Review Title,Place of Review,Up Votes,Down Votes,Month,Review text,Ratings
0,Kamal Suresh,Nice product,"Certified Buyer, Chirakkal",889.0,64.0,Feb 2021,"Nice product, good quality, but price is now r...",4
1,Flipkart Customer,Don't waste your money,"Certified Buyer, Hyderabad",109.0,6.0,Feb 2021,They didn't supplied Yonex Mavis 350. Outside ...,1
2,A. S. Raja Srinivasan,Did not meet expectations,"Certified Buyer, Dharmapuri",42.0,3.0,Apr 2021,Worst product. Damaged shuttlecocks packed in ...,1
3,Suresh Narayanasamy,Fair,"Certified Buyer, Chennai",25.0,1.0,,"Quite O. K. , but nowadays the quality of the...",3
4,ASHIK P A,Over priced,,147.0,24.0,Apr 2016,Over pricedJust â?¹620 ..from retailer.I didn'...,1


- Remove irrelevant columns that do not add value to the project

In [69]:
df = df[['Review Title',
         'Review text',
         'Ratings']]

In [70]:
df = df.rename(columns = {'Review text':'Review Text'})
df.head()

Unnamed: 0,Review Title,Review Text,Ratings
0,Nice product,"Nice product, good quality, but price is now r...",4
1,Don't waste your money,They didn't supplied Yonex Mavis 350. Outside ...,1
2,Did not meet expectations,Worst product. Damaged shuttlecocks packed in ...,1
3,Fair,"Quite O. K. , but nowadays the quality of the...",3
4,Over priced,Over pricedJust â?¹620 ..from retailer.I didn'...,1


In [71]:
data = df

In [72]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8518 entries, 0 to 8517
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Review Title  8508 non-null   object
 1   Review Text   8510 non-null   object
 2   Ratings       8518 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 199.8+ KB


In [73]:
# we can see there are a few null values in the dataset
# let's count them

In [74]:
data.isnull().sum()

Unnamed: 0,0
Review Title,10
Review Text,8
Ratings,0


In [75]:
len(data)

8518

In [76]:
(10/8518)*100

0.11739845034045551

In [77]:
# let's verify it with code
null_percentage = (data.isnull().sum()/len(data)) * 100
null_percentage

Unnamed: 0,0
Review Title,0.117398
Review Text,0.093919
Ratings,0.0


In [78]:
data[data[['Review Title', 'Review Text']].isnull().any(axis=1)]

Unnamed: 0,Review Title,Review Text,Ratings
8508,,No complaints about the item . Its the best on...,5
8509,,Not sure why we have charged for this product ...,1
8510,,,1
8511,,,1
8512,,,2
8513,,,5
8514,,,2
8515,,,4
8516,,,1
8517,,,4


- Dropping these 10 rows out of 8,518 loses about 0.12% of the data.
- Rows with a missing review title or review text account for about 0.12% of the dataset and were removed to keep the textual inputs semantically consistent.
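The row-level share of missing text and the targeted drop can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's data; the names `toy`, `text_cols`, and `rows_affected_pct` are assumptions introduced here. Restricting `dropna` to the text columns makes the intent explicit: only a missing title or review body should remove a row.

```python
import pandas as pd

# Toy frame standing in for the review data (values are illustrative only).
toy = pd.DataFrame({
    "Review Title": ["Nice", None, "Fair", None],
    "Review Text": ["Good shuttle", "No complaints", "Quite OK", None],
    "Ratings": [5, 5, 3, 1],
})

text_cols = ["Review Title", "Review Text"]

# Share of rows with at least one missing text field, in percent.
# This counts each affected row once, even if both fields are missing.
rows_affected_pct = toy[text_cols].isnull().any(axis=1).mean() * 100

# Drop only rows that miss a text field; other columns do not affect the decision.
cleaned = toy.dropna(subset=text_cols)
```

In the notebook's data, the 8 rows with a missing review text also have a missing title, so the number of affected rows equals the larger per-column count (10).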

In [79]:
data=data.dropna()

In [82]:
len(data)

8508

In [83]:
data.duplicated().sum()

np.int64(1494)

In [84]:
data[data.duplicated()]

Unnamed: 0,Review Title,Review Text,Ratings
165,Unsatisfactory,WorstREAD MORE,1
194,Perfect product!,Nice productREAD MORE,5
322,Simply awesome,Very good productREAD MORE,5
353,Terrific purchase,GoodREAD MORE,5
362,Wonderful,GoodREAD MORE,5
...,...,...,...
8403,Brilliant,GoodREAD MORE,5
8405,Nice,GoodREAD MORE,4
8413,Really Nice,Good shuttleREAD MORE,5
8414,Highly recommended,GoodREAD MORE,4


In [85]:
# One unique opinion should correspond to one training example,
# so we drop these duplicates

In [86]:
# Reassign instead of using inplace=True, which triggers SettingWithCopyWarning here
data = data.drop_duplicates()

In [87]:
len(data)

7014
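The counts line up because `duplicated()` by default flags every repeat except the first occurrence, so 8,508 - 1,494 = 7,014 rows remain. The sketch below shows that behaviour on a toy frame; the frame and variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy frame: the first three rows are identical, the fourth is unique.
toy = pd.DataFrame({
    "Review Title": ["Wonderful", "Wonderful", "Wonderful", "Nice"],
    "Review Text": ["Good", "Good", "Good", "Good"],
    "Ratings": [5, 5, 5, 4],
})

# keep='first' (the default) flags every repeat except the first occurrence.
n_repeats = int(toy.duplicated().sum())

# keep=False flags all members of a duplicated group, including the first.
n_in_dup_groups = int(toy.duplicated(keep=False).sum())

# Reassigning and calling .copy() yields an independent frame, which also
# avoids SettingWithCopyWarning on later column assignments.
deduped = toy.drop_duplicates().copy()
```

Here `n_repeats` is the number of rows that `drop_duplicates()` removes, while `n_in_dup_groups` is useful for inspecting every copy of a repeated review.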

In [88]:
# data remaining after duplicates and null value removal -> 7014

In [89]:
# Rating distribution
data['Ratings'].value_counts()

Ratings,count
5,3978
4,1402
1,759
3,572
2,303


In [None]:
# Rating 3 often contains mild dissatisfaction, so it is grouped with the negative class.
# From a business perspective: "anything below 4 needs attention".
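The thresholding rule described above can also be written as a vectorized comparison. This is a small sketch on an illustrative Series; it is an alternative formulation, not the cell the notebook actually runs.

```python
import pandas as pd

# One example of each possible rating value.
ratings = pd.Series([5, 4, 3, 2, 1], name="Ratings")

# Ratings of 4 or 5 map to 1 (positive); ratings of 1 to 3 map to 0 (negative).
sentiment = (ratings >= 4).astype(int).rename("Sentiment")
```

Applied to the rating counts shown earlier, this rule yields 3,978 + 1,402 = 5,380 positive and 759 + 572 + 303 = 1,634 negative reviews.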

In [92]:
data = data.copy()  # independent copy avoids SettingWithCopyWarning
data['Sentiment'] = data['Ratings'].apply(lambda x: 1 if x >= 4 else 0)


In [100]:
data

Unnamed: 0,Review Title,Review Text,Ratings,Sentiment
0,Nice product,"Nice product, good quality, but price is now r...",4,1
1,Don't waste your money,They didn't supplied Yonex Mavis 350. Outside ...,1,0
2,Did not meet expectations,Worst product. Damaged shuttlecocks packed in ...,1,0
3,Fair,"Quite O. K. , but nowadays the quality of the...",3,0
4,Over priced,Over pricedJust â?¹620 ..from retailer.I didn'...,1,0
...,...,...,...,...
8503,Yones Mavis 350 Blue cap,Wrost and duplicate productDon't buy this sell...,1,0
8504,For Mavis350,Received product intact and sealedREAD MORE,5,1
8505,Very Good,Delivered before time but price is high from m...,3,0
8506,Don't waste your money,up to the mark but same is available in market...,4,1


In [105]:
data = data[['Review Title', 'Review Text', 'Sentiment']]

In [106]:
data

Unnamed: 0,Review Title,Review Text,Sentiment
0,Nice product,"Nice product, good quality, but price is now r...",1
1,Don't waste your money,They didn't supplied Yonex Mavis 350. Outside ...,0
2,Did not meet expectations,Worst product. Damaged shuttlecocks packed in ...,0
3,Fair,"Quite O. K. , but nowadays the quality of the...",0
4,Over priced,Over pricedJust â?¹620 ..from retailer.I didn'...,0
...,...,...,...
8503,Yones Mavis 350 Blue cap,Wrost and duplicate productDon't buy this sell...,0
8504,For Mavis350,Received product intact and sealedREAD MORE,1
8505,Very Good,Delivered before time but price is high from m...,0
8506,Don't waste your money,up to the mark but same is available in market...,1


In [107]:
data['Sentiment'].value_counts()

Sentiment,count
1,5380
0,1634
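The counts above imply a noticeably imbalanced target. A quick way to express them as shares is sketched below, using the two counts copied from the output; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported in the notebook output above.
counts = pd.Series({1: 5380, 0: 1634}, name="count")

# Fraction of the 7,014 reviews that falls into each class.
share = (counts / counts.sum()).round(3)
```

Roughly 77% of the reviews are positive, which is worth keeping in mind when splitting the data and choosing evaluation metrics later on.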


In [110]:
text_to_detect = 'READ MORE'
inconsistencies = data['Review Text'].str.contains(text_to_detect, na=False)
inconsistencies.sum()

np.int64(7014)

In [None]:
# Every review contains a UI-generated 'READ MORE' artifact introduced during scraping,
# so we need to remove it to ensure that the model learns only from genuine user-generated content.

In [116]:
data = data.copy()  # independent copy avoids SettingWithCopyWarning
data['Review Text'] = data['Review Text'].str.replace('read more', '', case=False, regex=False).str.strip()

# case=False also catches casing variants such as "Read More" or "read more"


In [117]:
data['Review Text'].str.contains('READ MORE', na=False).sum()

np.int64(0)
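The replacement above removes the phrase wherever it occurs, including inside a sentence a reviewer actually wrote. A more conservative variant, sketched here under the assumption that the scraping artifact always sits at the very end of the text, anchors the pattern to the end of the string. The sample reviews are made up for illustration.

```python
import re

import pandas as pd

reviews = pd.Series([
    "GoodREAD MORE",
    "Please read more reviews before buyingREAD MORE",
    "Nice  Read More ",
])

# Strip only a trailing "read more" (any casing) plus surrounding whitespace.
cleaned = reviews.str.replace(
    r"\s*read more\s*$", "", flags=re.IGNORECASE, regex=True
).str.strip()
```

The second review keeps its genuine "read more" in the middle of the sentence and loses only the trailing artifact.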

In [118]:
# now let's compute the length (word count) of each review text
data['Review Length'] = data['Review Text'].apply(lambda x: len(x.split()))

In [119]:
data

Unnamed: 0,Review Title,Review Text,Sentiment,Review Length
0,Nice product,"Nice product, good quality, but price is now r...",1,36
1,Don't waste your money,They didn't supplied Yonex Mavis 350. Outside ...,0,19
2,Did not meet expectations,Worst product. Damaged shuttlecocks packed in ...,0,23
3,Fair,"Quite O. K. , but nowadays the quality of the...",0,80
4,Over priced,Over pricedJust â?¹620 ..from retailer.I didn'...,0,16
...,...,...,...,...
8503,Yones Mavis 350 Blue cap,Wrost and duplicate productDon't buy this sell...,0,16
8504,For Mavis350,Received product intact and sealed,1,5
8505,Very Good,Delivered before time but price is high from m...,0,9
8506,Don't waste your money,up to the mark but same is available in market...,1,13


In [120]:
data['Review Length'].describe()

Unnamed: 0,Review Length
count,7014.0
mean,6.155974
std,8.221354
min,1.0
25%,2.0
50%,3.0
75%,7.0
max,95.0


In [121]:
data.groupby('Sentiment')['Review Length'].describe()

Sentiment,count,mean,std,min,25%,50%,75%,max
0,1634.0,8.787638,10.328034,1.0,2.0,5.0,11.0,80.0
1,5380.0,5.356691,7.279899,1.0,2.0,3.0,6.0,95.0
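The word-count feature and the per-class comparison above can be reproduced on a small scale with the vectorized string accessor. The toy frame below is an assumption for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Review Text": ["Good", "Worst product ever", "Nice shuttle"],
    "Sentiment": [1, 0, 1],
})

# Split on whitespace and count the resulting tokens per review.
toy["Review Length"] = toy["Review Text"].str.split().str.len()

# Average word count per sentiment class.
mean_len = toy.groupby("Sentiment")["Review Length"].mean()
```

In the notebook's output, negative reviews average about 8.8 words against about 5.4 for positive ones, so dissatisfied customers tend to write somewhat longer texts.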


In [None]:
# All reviews contained a UI-generated 'READ MORE' artifact introduced during web scraping; it was removed globally so that the model learns only from genuine user-written content.

### Dataset Quality Status -> Officially “Good to Model”

In [122]:
data.to_csv('/content/drive/MyDrive/Data Science with Advanced Gen AI Internship/Internship Tasks/cleaned_data.csv',
            index = False)

### Preprocessing

In [130]:
data['Final Text'] = (
    data['Review Title'].fillna('') + '. ' + data['Review Text'].fillna('')
    ).str.strip()

In [131]:
data['Final Text'].isnull().sum()

np.int64(0)

In [133]:
data = data[['Final Text', 'Sentiment']]
data

Unnamed: 0,Final Text,Sentiment
0,"Nice product. Nice product, good quality, but ...",1
1,Don't waste your money. They didn't supplied Y...,0
2,Did not meet expectations. Worst product. Dama...,0
3,"Fair. Quite O. K. , but nowadays the quality ...",0
4,Over priced. Over pricedJust â?¹620 ..from ret...,0
...,...,...
8503,Yones Mavis 350 Blue cap. Wrost and duplicate ...,0
8504,For Mavis350. Received product intact and sealed,1
8505,Very Good. Delivered before time but price is ...,0
8506,Don't waste your money. up to the mark but sam...,1


In [135]:
data['Sentiment'].value_counts()

# double-checking the class distribution

Sentiment,count
1,5380
0,1634


# Dataset Locked 🔒

In [140]:
import re

def clean_text(text):
  # converting to lower case
  text = text.lower()

  # Replace every character other than letters, digits, periods, straight apostrophes and spaces with a space
  text = re.sub(r"[^a-z0-9\.\' ]+", " ", text)

  # Normalize whitespace
  text = re.sub(r'\s+', ' ', text).strip()

  return text

In [141]:
data = data.copy()  # independent copy avoids SettingWithCopyWarning
data['Final Text'] = data['Final Text'].apply(clean_text)


In [142]:
data

Unnamed: 0,Final Text,Sentiment
0,nice product. nice product good quality but pr...,1
1,don t waste your money. they didn t supplied y...,0
2,did not meet expectations. worst product. dama...,0
3,fair. quite o. k. but nowadays the quality of ...,0
4,over priced. over pricedjust 620 ..from retail...,0
...,...,...
8503,yones mavis 350 blue cap. wrost and duplicate ...,0
8504,for mavis350. received product intact and sealed,1
8505,very good. delivered before time but price is ...,0
8506,don t waste your money. up to the mark but sam...,1
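The output above shows `don t` rather than `don't`, even though the pattern in `clean_text` allows the straight apostrophe. One possible explanation, which is an assumption and not something the notebook confirms, is that the raw text uses a typographic apostrophe (U+2019). The sketch below repeats the function so it runs on its own and contrasts the two cases.

```python
import re


def clean_text(text):
    # Lowercase, replace disallowed characters with a space, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\.\' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# A straight apostrophe is in the allowed set and survives.
straight = clean_text("Don't waste your money!")

# A typographic apostrophe (U+2019) is not in the allowed set and becomes a space.
curly = clean_text("Don\u2019t waste your money!")
```

If contractions matter for the downstream model, normalizing typographic apostrophes to straight ones before calling `clean_text` would keep them intact.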


✅ Workflow Completed

1. Dropped irrelevant columns (Reviewer Name, Month, Place of Review, etc.)

2. Handled null values (removed rows with nulls in critical columns)

3. Removed duplicates (clean dataset: 7014 rows)

4. Encoded target column (Sentiment: 0 = negative, 1 = positive)

5. Detected & removed scraping artifacts (READ MORE)

6. Merged Review Title + Review Text with a period separator

7. Text preprocessing:

    - Lowercased
    - Removed unwanted special characters (kept letters, numbers, spaces, periods, and straight apostrophes)
    - Normalized whitespace

8. Final dataset columns: *Final Text + Sentiment*
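The steps listed above can be chained into a single function. This is a sketch under stated assumptions: `preprocess_reviews` is a hypothetical helper name, the toy input is invented, and the input column names follow the raw CSV shown at the top of the notebook.

```python
import re

import pandas as pd


def clean_text(text: str) -> str:
    # Lowercase, replace disallowed characters with a space, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\.\' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that chains the eight steps listed above."""
    # 1. Keep only the relevant columns and harmonize the column name.
    out = df[["Review Title", "Review text", "Ratings"]].rename(
        columns={"Review text": "Review Text"}
    )
    # 2-3. Drop rows with nulls, then exact duplicates; copy to get an independent frame.
    out = out.dropna().drop_duplicates().copy()
    # 4. Encode the target: ratings of 4 and above are positive.
    out["Sentiment"] = (out["Ratings"] >= 4).astype(int)
    # 5. Remove the scraping artifact, regardless of casing.
    out["Review Text"] = out["Review Text"].str.replace(
        "read more", "", flags=re.IGNORECASE, regex=True
    ).str.strip()
    # 6-7. Merge title and text with a period separator, then clean the result.
    out["Final Text"] = (
        out["Review Title"] + ". " + out["Review Text"]
    ).str.strip().apply(clean_text)
    # 8. Keep only the model inputs.
    return out[["Final Text", "Sentiment"]]


# Toy input: one duplicate pair, one row with a missing title, one negative review.
raw = pd.DataFrame({
    "Reviewer Name": ["A", "B", "C", "D"],
    "Review Title": ["Nice product", "Nice product", None, "Over priced"],
    "Review text": [
        "Good qualityREAD MORE",
        "Good qualityREAD MORE",
        "No complaintsREAD MORE",
        "Too costly!READ MORE",
    ],
    "Ratings": [5, 5, 5, 1],
})

result = preprocess_reviews(raw)
```

Wrapping the steps in one function makes it easier to apply identical preprocessing to new reviews at inference time.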

In [143]:
data.to_csv('/content/drive/MyDrive/Data Science with Advanced Gen AI Internship/Internship Tasks/preprocessed_data.csv',
            index = False)