## Data Cleanup
### Introduction

In this notebook, we prepare the dataset for sentiment analysis of reviews. The raw data contains several columns that our predictive models do not need, so we reduce it to the essentials: the review text (`content`) as model input and the rating (`score`) as the target.

### Setup and Preliminary Viewing

First, let's load pandas and preview the dataset to see its current structure:

In [37]:
# Importing the pandas library for data manipulation
import pandas as pd

# Reading the dataset from a CSV file
df = pd.read_csv('../DATASETS/netflix_reviews.csv')

# Displaying the first few rows of the dataset to understand its initial structure
df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,cc1cfcd2-dc8a-4ead-88d1-7f2b2dbb2662,NR Bharadwaj,Plsssss stoppppp giving screen limit like when...,2,0,8.120.0 build 10 50712,2024-07-02 17:17:53,8.120.0 build 10 50712
1,7dfb1f90-f185-4e81-a97f-d38f0128e5a4,Maxwell Ntloko,Good,5,1,,2024-06-26 15:38:06,
2,3009acc4-8554-41cf-88de-cc5e2f6e45b2,Dilhani Mahanama,👍👍,5,0,,2024-06-24 15:29:54,
3,b3d27852-9a3b-4f74-9e16-15434d3ee324,Karen Gulli,Good,3,0,,2024-06-22 15:41:54,
4,8be10073-2368-4677-b828-9ff5d06ea0b7,Ronny Magadi,"App is useful to certain phone brand ,,,,it is...",1,0,8.105.0 build 15 50626,2024-06-22 05:16:03,8.105.0 build 15 50626


### Selecting Relevant Columns

For sentiment analysis we only need the review text and its score. We temporarily keep `reviewId` as well, so that duplicate entries can be identified and removed:

In [38]:
# Selecting only the necessary columns for sentiment analysis
df = df[['reviewId', 'content', 'score']]

# Displaying the updated DataFrame to confirm that only the selected columns remain
df.head()

Unnamed: 0,reviewId,content,score
0,cc1cfcd2-dc8a-4ead-88d1-7f2b2dbb2662,Plsssss stoppppp giving screen limit like when...,2
1,7dfb1f90-f185-4e81-a97f-d38f0128e5a4,Good,5
2,3009acc4-8554-41cf-88de-cc5e2f6e45b2,👍👍,5
3,b3d27852-9a3b-4f74-9e16-15434d3ee324,Good,3
4,8be10073-2368-4677-b828-9ff5d06ea0b7,"App is useful to certain phone brand ,,,,it is...",1
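
As an alternative to selecting columns after loading, the same subset can be requested at read time through the `usecols` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV with made-up rows (the `id-1`, `User A` values are placeholders, not data from the Netflix file) so that it runs without the original dataset:

```python
import io
import pandas as pd

# Hypothetical miniature CSV mimicking the layout of the raw file,
# including the unnamed index column at the start of the header
raw_csv = io.StringIO(
    ",reviewId,userName,content,score\n"
    "0,id-1,User A,Good,5\n"
    "1,id-2,User B,Too many ads,2\n"
)

# Only the listed columns are parsed; everything else is skipped
subset = pd.read_csv(raw_csv, usecols=['reviewId', 'content', 'score'])

print(subset.columns.tolist())
```

Reading only the needed columns can reduce memory use on large files, at the cost of not seeing the discarded columns during the preview step.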


### Identifying and Removing Duplicates and Missing Values

It's essential to ensure our dataset is clean and free from duplicates and missing entries before proceeding:

In [39]:
# Function to summarize missing values (per column) and duplicated rows (whole frame)
def show_details(dataset):
    n_rows = len(dataset)
    missed_values = dataset.isnull().sum()
    missed_values_percent = 100 * missed_values / n_rows
    # duplicated() flags rows that repeat an earlier row in every column,
    # so this single count is repeated for each column in the summary
    duplicated_values = dataset.duplicated().sum()
    duplicated_values_percent = 100 * duplicated_values / n_rows
    info_frame = pd.DataFrame({'Missing_Values': missed_values,
                               'Missing_Values %': missed_values_percent,
                               'Duplicated Values': duplicated_values,
                               'Duplicated Values %': duplicated_values_percent})
    return info_frame.T

# Displaying details of missing and duplicated data
show_details(df)

Unnamed: 0,reviewId,content,score
Missing_Values,0.0,2.0,0.0
Missing_Values %,0.0,0.00176,0.0
Duplicated Values,316.0,316.0,316.0
Duplicated Values %,0.278145,0.278145,0.278145
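
The summary above mixes two kinds of numbers: missing values are counted per column, while the duplicate count refers to whole rows and is therefore identical in every column. A minimal sketch on hypothetical toy data (not taken from the Netflix dataset) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 1, and one review text is missing
toy = pd.DataFrame({
    'reviewId': ['a', 'b', 'b', 'c'],
    'content': ['Good', 'Bad', 'Bad', None],
    'score': [5, 1, 1, 3],
})

# One count per column
missing_per_column = toy.isnull().sum()

# One count for the whole frame: rows equal to an earlier row in all columns
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column.to_dict())
print(duplicate_rows)
```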


Missing values affect only 2 rows (about 0.002% of the data) and duplicates 316 rows (about 0.28%), so we can drop both without meaningfully shrinking the dataset:

In [40]:
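# Removing duplicate rows (reassigning instead of using inplace=True,
# which can raise SettingWithCopyWarning on a column-subset DataFrame)
df = df.drop_duplicates()

# Removing rows with missing values
df = df.dropna()

# Verifying that duplicates and missing values have been removed
show_details(df)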

Unnamed: 0,reviewId,content,score
Missing_Values,0.0,0.0,0.0
Missing_Values %,0.0,0.0,0.0
Duplicated Values,0.0,0.0,0.0
Duplicated Values %,0.0,0.0,0.0
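
`drop_duplicates()` without arguments removes only rows that match in every column. If the same `reviewId` could appear more than once with different text (an assumption, the source does not say whether this happens), deduplicating on the identifier alone would be stricter. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: reviewId 'b' appears twice with different text
toy = pd.DataFrame({
    'reviewId': ['a', 'b', 'b', 'c'],
    'content': ['Good', 'Bad', 'Bad app', 'Ok'],
    'score': [5, 1, 1, 3],
})

# Full-row comparison: nothing is removed because the texts differ
full_row = toy.drop_duplicates()

# Comparison on reviewId only: the second 'b' row is removed
by_id = toy.drop_duplicates(subset='reviewId', keep='first')

print(len(full_row), len(by_id))
```

Which variant is appropriate depends on whether a repeated `reviewId` represents the same review or a later edit of it.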


### Final Column Selection

Now that duplicates have been removed, we no longer need `reviewId`:

In [41]:
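# Removing the 'reviewId' column as it is no longer necessary
# (.copy() yields an independent DataFrame, so later column assignments
# cannot trigger SettingWithCopyWarning)
df = df[['content', 'score']].copy()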

### Normalizing the Score

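Models that predict a continuous target, for example a network with a sigmoid output, can benefit from a score rescaled to the range 0 to 1. `MinMaxScaler` maps the smallest observed score to 0 and the largest to 1: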

In [42]:
# Importing MinMaxScaler for normalization
from sklearn.preprocessing import MinMaxScaler

# Creating a scaler object
scaler = MinMaxScaler()

# Applying the scaler to the 'score' column to normalize its values
df[['score']] = scaler.fit_transform(df[['score']])

# Displaying the first few rows to check the normalized scores
df.head()

Unnamed: 0,content,score
0,Plsssss stoppppp giving screen limit like when...,0.25
1,Good,1.0
2,👍👍,1.0
3,Good,0.5
4,"App is useful to certain phone brand ,,,,it is...",0.0
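
With its default range, min-max scaling computes `(x - min) / (max - min)`, so for ratings from 1 to 5 a score of 2 becomes (2 - 1) / 4 = 0.25. The sketch below reproduces this by hand with pandas on the five scores shown in the preview, and also shows how to map scaled values back to the original rating scale:

```python
import pandas as pd

# The five scores from the preview rows above
scores = pd.Series([2, 5, 5, 3, 1], name='score')

lo, hi = scores.min(), scores.max()

# Min-max scaling: smallest value maps to 0, largest to 1
scaled = (scores - lo) / (hi - lo)

# Inverse mapping, useful for turning model outputs back into ratings
restored = scaled * (hi - lo) + lo

print(scaled.tolist())
print(restored.tolist())
```

Because the minimum and maximum are taken from the data, the result matches the 1-to-5 formula only when both extremes are present in the column.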


### Saving the Cleaned Data

Finally, let's save the cleaned dataset for use in subsequent modeling:

In [43]:
# Saving the cleaned dataset to a new CSV file
df.to_csv('../DATASETS/cleaned_data.csv', index=False)
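
A quick way to gain confidence in the export is to write and re-read the data and compare the result. The sketch below does this with an in-memory buffer and hypothetical rows, so it does not touch the files on disk; the same check could be applied to the saved CSV:

```python
import io
import pandas as pd

# Hypothetical cleaned rows, including an emoji and a comma inside the text
cleaned = pd.DataFrame({
    'content': ['Good', '👍👍', 'App is useful, mostly'],
    'score': [1.0, 1.0, 0.0],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Read the data back and compare with the original
reloaded = pd.read_csv(buffer)

print(reloaded.shape == cleaned.shape)
print(reloaded['content'].tolist() == cleaned['content'].tolist())
```

Writing with `index=False` keeps the row index out of the file, which avoids an extra unnamed column when the data is loaded again.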