## 2. Data Preprocessing
The uploaded CSV file `1.1_lazada_reviews.csv` contains 658 entries with the following columns:
- **Username**: The username of the reviewer.
- **Rating**: The rating given by the reviewer (all ratings are 5).
- **Date**: The date of the review.
- **Comment**: The review text.

### Observations:
1. **Non-null Entries**: 
   - All entries in `Username`, `Rating`, and `Date` columns are non-null.
   - There are 31 missing entries in the `Comment` column.

2. **Data Types**:
   - `Username`: Object (string)
   - `Rating`: Integer (all entries are 5)
   - `Date`: Object (string)
   - `Comment`: Object (string)

3. **Ratings**: 
   - Every rating is 5, so the `Rating` column has no variance and cannot serve as a sentiment label on its own; any sentiment signal has to come from the `Comment` text.
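
The observations above can be reproduced with a few pandas calls. The following is a minimal sketch, assuming a DataFrame with the same columns as the review file; the `summarize` helper and the toy rows are hypothetical and only stand in for the real data.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect row count, missing values per column, dtypes, and distinct ratings."""
    return {
        "n_rows": len(df),
        "missing_per_column": {col: int(n) for col, n in df.isna().sum().items()},
        "dtypes": df.dtypes.astype(str).to_dict(),
        "unique_ratings": sorted(df["Rating"].unique().tolist()),
    }


# Hypothetical toy frame with the same columns as the review file
toy = pd.DataFrame({
    "Username": ["A***.", "Tiar Y.", "Jeff T."],
    "Rating": [5, 5, 5],
    "Date": ["1 day ago", "14 Mar 2024", "10 Mar 2021"],
    "Comment": ["Great mouse", None, "Fast delivery"],
})

summary = summarize(toy)
print(summary["n_rows"])
print(summary["missing_per_column"])
print(summary["unique_ratings"])
```

On the real file, calling `summarize(pd.read_csv('1.1_lazada_reviews.csv'))` should report the 658 rows and 31 missing comments listed above, provided the file matches the description.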

### Data Preprocessing Steps:

1. **Handle Missing Values**:
   - Remove rows with missing `Comment` values, since they contribute nothing to sentiment analysis. This drops the 31 rows noted above and leaves 627 of the 658 entries.

2. **Text Normalization**:
   - Convert text to lowercase.
   - Remove punctuation and special characters.
   - Remove stop words.
   - Perform lemmatization.
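
To make the effect of the first three normalization steps concrete, here is a minimal sketch on single strings. It is an illustration under stated assumptions, not the notebook's pipeline: the `normalize` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's English stop-word list, and lemmatization is omitted so the example runs without any corpus downloads.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set (stand-in for NLTK's English list)
STOP_WORDS = {"the", "is", "and", "a", "it", "was", "very"}

# Character class matching any ASCII punctuation; re.escape keeps ']' and '\' literal
PUNCT_RE = re.compile("[{}]".format(re.escape(string.punctuation)))


def normalize(text: str) -> str:
    """Lowercase, delete punctuation, and drop stop words."""
    text = text.lower()
    text = PUNCT_RE.sub("", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(normalize("The mouse is VERY responsive, and the delivery was fast!"))
# mouse responsive delivery fast

print(normalize("Cheapest item.Superbly fast"))
# cheapest itemsuperbly fast
```

The second call shows a side effect of deleting punctuation outright: words separated only by a punctuation mark are fused into one token. Replacing punctuation with a space instead of an empty string would avoid this, at the cost of splitting contractions.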

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below if they are not already installed
# (recent NLTK versions look for 'punkt_tab' instead of 'punkt')
for resource in ('stopwords', 'punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)

# Load data
df = pd.read_csv('1.1_lazada_reviews.csv')

# Remove rows with missing comments
df.dropna(subset=['Comment'], inplace=True)

# Convert text to lowercase
df['Comment'] = df['Comment'].str.lower()

# Remove punctuation (re.escape keeps characters such as ']' and '\' literal
# inside the character class)
punct_pattern = '[{}]'.format(re.escape(string.punctuation))
df['Comment'] = df['Comment'].str.replace(punct_pattern, '', regex=True)

# Remove stop words
stop_words = set(stopwords.words('english'))
df['Comment'] = df['Comment'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

# Lemmatization (lemmatize() treats every token as a noun by default)
lemmatizer = WordNetLemmatizer()
df['Comment'] = df['Comment'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in word_tokenize(x)))

# Save the preprocessed data
df.to_csv('2.1_preprocessed_reviews.csv', index=False)

df.head()
```


| Unnamed: 0 | Username | Rating | Date | Comment |
|---|---|---|---|---|
| 0 | A***. | 5 | 1 day ago | highly responsive accurate sensor great feelin... |
| 1 | Tiar Y. | 5 | 14 Mar 2024 | actually quite hesitate whrn buy mouse cu craz... |
| 2 | Metheldis R. | 5 | 31 Jan 2024 | cheapest price itemsuperbly fast drop delivery... |
| 3 | Jeff T. | 5 | 10 Mar 2021 | skeptical buy store zero review product decide... |
| 4 | D***. | 5 | 18 Sep 2021 | fadt delivery well packaging box still intact ... |
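
One thing worth checking after cleaning: a comment made up only of stop words or punctuation becomes an empty string, and pandas reads an empty CSV field back as `NaN`, so missing comments may reappear when the saved file is reloaded. The sketch below shows the effect and one way to filter such rows. The toy frame and its rows are hypothetical, and whether the real data contains such comments is not established here.

```python
import io

import pandas as pd

# Hypothetical cleaned frame: the second comment was reduced to an empty string
cleaned = pd.DataFrame({
    "Username": ["A***.", "B***."],
    "Rating": [5, 5],
    "Comment": ["highly responsive accurate sensor", ""],
})

# Keep only rows whose comment still has non-whitespace content
non_empty = cleaned[cleaned["Comment"].str.strip().ne("")]

# Round trip through CSV without the filter to show the empty field turning into NaN
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

n_kept = len(non_empty)
n_missing_after_reload = int(reloaded["Comment"].isna().sum())
print(n_kept)                  # 1
print(n_missing_after_reload)  # 1
```

Applying the same filter to `df` before the `to_csv` call would keep the saved file free of empty comments.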
