## Data Cleanup
### Introduction

In this notebook we transform the dataset into a simplified format to ease later analysis and modeling. We will drop the columns that are not valuable to us, remove duplicate rows, and drop any row with a missing value. The result is a cleaned dataset containing only the `content` and `score` of each review.

### Setup and Preview

Let's start by importing pandas and loading the dataset, then preview the first few rows to understand its current structure:

In [21]:
# Importing the pandas library for data manipulation
import pandas as pd

# Reading the dataset from a CSV file
df = pd.read_csv('../DATASETS/netflix_reviews.csv')

# Displaying the first few rows of the dataset
df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,cc1cfcd2-dc8a-4ead-88d1-7f2b2dbb2662,NR Bharadwaj,Plsssss stoppppp giving screen limit like when...,2,0,8.120.0 build 10 50712,2024-07-02 17:17:53,8.120.0 build 10 50712
1,7dfb1f90-f185-4e81-a97f-d38f0128e5a4,Maxwell Ntloko,Good,5,1,,2024-06-26 15:38:06,
2,3009acc4-8554-41cf-88de-cc5e2f6e45b2,Dilhani Mahanama,👍👍,5,0,,2024-06-24 15:29:54,
3,b3d27852-9a3b-4f74-9e16-15434d3ee324,Karen Gulli,Good,3,0,,2024-06-22 15:41:54,
4,8be10073-2368-4677-b828-9ff5d06ea0b7,Ronny Magadi,"App is useful to certain phone brand ,,,,it is...",1,0,8.105.0 build 15 50626,2024-06-22 05:16:03,8.105.0 build 15 50626


### Selecting Relevant Columns

As noted in the introduction, we only care about the `content` and `score` of each review. We will keep `reviewId` for now so that we can identify duplicate reviews.

In [22]:
# Selecting only the necessary columns
df = df[['reviewId', 'content', 'score']]

# Displaying the updated DataFrame to verify the dropped columns
df.head()

Unnamed: 0,reviewId,content,score
0,cc1cfcd2-dc8a-4ead-88d1-7f2b2dbb2662,Plsssss stoppppp giving screen limit like when...,2
1,7dfb1f90-f185-4e81-a97f-d38f0128e5a4,Good,5
2,3009acc4-8554-41cf-88de-cc5e2f6e45b2,👍👍,5
3,b3d27852-9a3b-4f74-9e16-15434d3ee324,Good,3
4,8be10073-2368-4677-b828-9ff5d06ea0b7,"App is useful to certain phone brand ,,,,it is...",1
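
If the other columns are never needed, the selection can also happen at load time. The sketch below is one possible alternative, not what this notebook does: it uses the `usecols` parameter of `pd.read_csv` on a small in-memory CSV, so only the three columns of interest are parsed. The two sample rows are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV with column names taken from the preview above (illustrative rows only)
sample_csv = io.StringIO(
    "reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion\n"
    "id-1,Alice,Good,5,1,,2024-06-26 15:38:06,\n"
    "id-2,Bob,Too many ads,2,0,8.120.0,2024-07-02 17:17:53,8.120.0\n"
)

# usecols restricts parsing to the listed columns
subset = pd.read_csv(sample_csv, usecols=["reviewId", "content", "score"])
print(subset)
```

Selecting after loading, as done in this notebook, has the advantage that the full preview stays available for inspection first.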


### Identifying and Removing Duplicates and Missing Values

Before we start working with the dataset, it's important to make sure that it contains no missing values or duplicate entries. The function below reports how many values are missing in each column and how many rows are duplicated, so we can see what needs to be cleaned up.

In [23]:
# Function to display detailed information about missing and duplicated data
def show_details(dataset):
    n_rows = len(dataset)

    # Number of missing values per column and their percentage
    missed_values = dataset.isnull().sum()
    missed_values_percent = 100 * missed_values / n_rows

    # Number of fully duplicated rows (one count for the whole frame) and its percentage
    duplicated_values = dataset.duplicated().sum()
    duplicated_values_percent = 100 * duplicated_values / n_rows

    # Collect the results; the scalar duplicate count is repeated for every column
    info_frame = pd.DataFrame({'Missing_Values': missed_values,
                               'Missing_Values %': missed_values_percent,
                               'Duplicated Values': duplicated_values,
                               'Duplicated Values %': duplicated_values_percent})

    # Transpose so that each metric becomes a row
    return info_frame.T

# Displaying details of missing and duplicated data
show_details(df)


Unnamed: 0,reviewId,content,score
Missing_Values,0.0,2.0,0.0
Missing_Values %,0.0,0.00176,0.0
Duplicated Values,316.0,316.0,316.0
Duplicated Values %,0.278145,0.278145,0.278145
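
The duplicate counts are identical in every column because the two pandas calls behave differently: `isnull().sum()` returns one count per column, while `duplicated().sum()` returns a single count of rows that repeat an earlier row in full. The toy frame below (made-up data) illustrates this. It also shows the `subset` argument, which flags rows that share a `reviewId` even when their other fields differ; this is an optional extra check, not something the notebook relies on.

```python
import pandas as pd

# Made-up data: rows 0 and 1 are identical; rows 2 and 3 share an id but differ in content
toy = pd.DataFrame({
    "reviewId": ["a", "a", "b", "b", "c"],
    "content": ["Good", "Good", "Bad", "Bad app", None],
    "score": [5, 5, 1, 1, 3],
})

missing_per_column = toy.isnull().sum()          # Series: one count per column
full_row_duplicates = toy.duplicated().sum()     # scalar: rows equal to an earlier row
same_id_duplicates = toy.duplicated(subset=["reviewId"]).sum()  # scalar: repeated ids

print(missing_per_column)
print("full-row duplicates:", full_row_duplicates)
print("rows with a repeated reviewId:", same_id_duplicates)
```

Here the full-row check finds 1 duplicate, while the `reviewId` check finds 2.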


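As shown above, only 2 reviews are missing their `content` (about 0.002% of rows) and 316 rows are exact duplicates (about 0.28%). The duplicate count is a single row-level figure, which is why it is repeated under every column. Since so little data is affected, we can simply remove the duplicates and drop the rows with missing values without any meaningful loss of data.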

In [24]:
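# Removing duplicate rows (reassigning rather than using inplace=True,
# which can trigger a SettingWithCopyWarning on a column-selected frame)
df = df.drop_duplicates()

# Removing rows with missing values
df = df.dropna()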

# Verifying that duplicates and missing values have been removed
show_details(df)

Unnamed: 0,reviewId,content,score
Missing_Values,0.0,0.0,0.0
Missing_Values %,0.0,0.0,0.0
Duplicated Values,0.0,0.0,0.0
Duplicated Values %,0.0,0.0,0.0
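
A quick way to double-check a cleanup like this is to account for the rows removed at each step. The sketch below applies the same two operations to a made-up frame and records the row count after each one. The variable names and data are illustrative only.

```python
import pandas as pd

# Made-up data: one exact duplicate row and one row with missing content
toy = pd.DataFrame({
    "reviewId": ["a", "a", "b", "c"],
    "content": ["Good", "Good", None, "Slow"],
    "score": [5, 5, 2, 1],
})

rows_before = len(toy)

deduped = toy.drop_duplicates()   # returns a new frame without repeated rows
rows_after_dedup = len(deduped)

cleaned = deduped.dropna()        # returns a new frame without rows containing NaN/None
rows_after_dropna = len(cleaned)

print(f"removed {rows_before - rows_after_dedup} duplicate row(s)")
print(f"removed {rows_after_dedup - rows_after_dropna} row(s) with missing values")
```

On the real dataset, the same bookkeeping should agree with the counts reported by `show_details`, unless a duplicated row also has a missing value, in which case it is counted under duplicates first.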


### Final Column Selection

Now that the dataset is clean, we no longer need `reviewId`:

In [25]:
# Removing the 'reviewId' column as it is no longer necessary
df = df[['content', 'score']]
df.head()

Unnamed: 0,content,score
0,Plsssss stoppppp giving screen limit like when...,2
1,Good,5
2,👍👍,5
3,Good,3
4,"App is useful to certain phone brand ,,,,it is...",1


### Saving the Cleaned Data

Finally, let's save the cleaned dataset for later use:

In [26]:
# Saving the cleaned dataset to a new CSV file
df.to_csv('../DATASETS/cleaned_data.csv', index=False)
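
It can be worth reading the saved file back once to confirm that it round-trips. One subtlety, shown in the hedged sketch below on made-up data: by default `pd.read_csv` treats certain strings, such as `NA`, as missing, so a review whose entire text is one of those strings would come back as a missing value. Passing `keep_default_na=False` is one way to avoid that when the file is known to contain no real gaps. The temporary path and sample rows are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned data; the second review's text is literally "NA"
cleaned = pd.DataFrame({"content": ["Good", "NA", "👍👍"], "score": [5, 3, 5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")
    cleaned.to_csv(path, index=False)

    # Default parsing converts the string "NA" into a missing value
    default_read = pd.read_csv(path)

    # keep_default_na=False keeps such strings as literal text
    literal_read = pd.read_csv(path, keep_default_na=False)

print("missing after default read:", default_read["content"].isna().sum())
print("content preserved:", literal_read["content"].tolist() == cleaned["content"].tolist())
```

Whether this matters depends on the data. It only applies if some review text exactly matches one of the strings pandas treats as missing.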