## Part 2: Preprocessing

We analyze the Amazon Sports_and_Outdoors_5 dataset that was downloaded and filtered in the previous part.

In [1]:
import pandas as pd

# File path
file_path = '/home/marshal/protonotebook/VS Workspace/Sports_and_Outdoors_5_filtered.csv'

# Read CSV file into DataFrame
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,overall,reviewerID,reviewText,summary
0,5,A180LQZBUWVOLF,What a spectacular tutu! Very slimming.,Five Stars
1,1,ATMFGKU5SVEYY,What the heck? Is this a tutu for nuns? I know...,Is this a tutu for nuns?!
2,5,A1QE70QBJ8U6ZG,Exactly what we were looking for!,Five Stars
3,5,A22CP6Z73MZTYU,I used this skirt for a Halloween costume and ...,I liked that the elastic waist didn't dig in (...
4,4,A22L28G8NRNLLN,This is thick enough that you can't see throug...,This is thick enough that you can't see throug...


In [2]:
print(data.overall.value_counts(normalize=True))

overall
5    0.681357
4    0.170021
3    0.072186
1    0.041056
2    0.035380
Name: proportion, dtype: float64


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332447 entries, 0 to 332446
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   overall     332447 non-null  int64 
 1   reviewerID  332447 non-null  object
 2   reviewText  332306 non-null  object
 3   summary     332385 non-null  object
dtypes: int64(1), object(3)
memory usage: 10.1+ MB


### Check NaN Values

In [11]:
nan_check = data.isna().sum()
print(nan_check)

overall         0
reviewerID      0
reviewText    141
summary        62
dtype: int64


In [12]:
# Drop NaN Values
data.dropna(subset=['reviewText'], inplace=True)

### Applying Preprocessing Steps

1. Remove punctuation
2. Convert numbers to words
3. Tokenize
4. Remove stop words
5. Stem (Porter stemmer)

In [13]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import inflect

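p = inflect.engine()  # Initialize inflect engine once, outside the function
porter_stemmer = PorterStemmer()  # Initialize Porter Stemmer
stop_words = set(stopwords.words('english'))  # Build the stop-word set once instead of once per word

def cleaningText(text):
    """Remove punctuation, spell out numbers, drop stop words, and stem."""
    text = re.sub("[^a-zA-Z0-9]", " ", text)  # Replace punctuation with spaces
    # Convert digit-only tokens to words, e.g. "3" -> "three"
    text = ' '.join(p.number_to_words(word) if word.isdigit() else word for word in text.split())
    # The stop-word check runs before lowercasing, so capitalised stop words such as "I" are kept
    text = [porter_stemmer.stem(word.lower()) for word in word_tokenize(text) if word not in stop_words]
    return " ".join(text)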

### Test the function

In [14]:
# Test the function
text = "I have 3 cats and 2 dogs."
cleaned_text = cleaningText(text)
print(cleaned_text)

i three cat two dog
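
Note that "i" survives in the output above: the stop-word check compares the original casing, and NLTK's English list is lowercase, so "I" is not matched. If that is not intended, one option is to lowercase before filtering. A minimal sketch of the idea, using only the standard library; the tiny `STOP_WORDS` set and the `remove_stop_words` helper are assumptions for illustration, not NLTK's list or part of the notebook, and stemming and number conversion are left out:

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"i", "a", "and", "have", "is", "the", "this", "what"}

def remove_stop_words(text):
    """Lowercase first, then drop stop words, so 'I' and 'What' are caught too."""
    tokens = re.sub(r"[^a-zA-Z0-9]", " ", text).lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(remove_stop_words("I have 3 cats and 2 dogs."))
```

The same change in `cleaningText` would amount to lowercasing each token before the membership test rather than after it.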


In [15]:
filtered_data = data[['overall','reviewText']]
filtered_data

Unnamed: 0,overall,reviewText
0,5,What a spectacular tutu! Very slimming.
1,1,What the heck? Is this a tutu for nuns? I know...
2,5,Exactly what we were looking for!
3,5,I used this skirt for a Halloween costume and ...
4,4,This is thick enough that you can't see throug...
...,...,...
332442,5,Works as expected!
332443,5,As described. easy to assemble with shock cord.
332444,5,Really Nice set of Carbon bars that are very l...
332445,5,Ive been using these for about two months so f...


In [16]:
nan_check = filtered_data.isna().sum()
print(nan_check)

overall       0
reviewText    0
dtype: int64


In [20]:
filtered_data['filtered_review'] = filtered_data['reviewText'].apply(lambda x: cleaningText(str(x)))
filtered_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_data['filtered_review'] = filtered_data['reviewText'].apply(lambda x: cleaningText(str(x)))


Unnamed: 0,overall,reviewText,filtered_review
0,5,What a spectacular tutu! Very slimming.,what spectacular tutu veri slim
1,1,What the heck? Is this a tutu for nuns? I know...,what heck is tutu nun i know cut still also se...
2,5,Exactly what we were looking for!,exactli look
3,5,I used this skirt for a Halloween costume and ...,i use skirt halloween costum glu bunch feather...
4,4,This is thick enough that you can't see throug...,thi thick enough see long sure check dimens i ...
...,...,...,...
332442,5,Works as expected!,work expect
332443,5,As described. easy to assemble with shock cord.,as describ easi assembl shock cord
332444,5,Really Nice set of Carbon bars that are very l...,realli nice set carbon bar light strong realli...
332445,5,Ive been using these for about two months so f...,ive use two month far i love the color bright ...


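The warning above is raised because `filtered_data` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant for a copy or for the original. The new column is still added to `filtered_data`, as the output shows. To avoid the warning, assign with `.loc[]` as the message suggests, or create `filtered_data` with an explicit `.copy()`.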
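
A common way to avoid this warning is to take an explicit copy when selecting the columns. A minimal sketch on a toy DataFrame; the sample rows are hypothetical, and `str.lower` stands in for `cleaningText` so the example runs without NLTK:

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical rows, not from the dataset)
data = pd.DataFrame({
    "overall": [5, 1],
    "reviewerID": ["A1", "A2"],
    "reviewText": ["Great tutu!", "Too long."],
})

# .copy() makes filtered_data an independent DataFrame, so adding a column
# to it is unambiguous and does not touch `data`
filtered_data = data[["overall", "reviewText"]].copy()

# str.lower stands in for cleaningText here
filtered_data["filtered_review"] = filtered_data["reviewText"].apply(str.lower)
print(filtered_data)
```

With the copy in place, `data` keeps its original columns and only `filtered_data` gains `filtered_review`.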

In [21]:
filtered_data.to_csv('Sports_and_Outdoors_5_Filtered_review.csv', index=False)