# Data Cleaning

We clean the data by balancing the number of flagged and clean reviews. This can reduce the model's bias toward the majority class.

## Install packages

In [None]:
%pip install pandas

## Import packages

In [4]:
import pandas as pd

## Match number of clean and flagged reviews

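Clean reviews heavily outnumber flagged ones, so we randomly downsample the clean reviews to match the combined count of the three flagged classes (`advertisement`, `rant_without_visit`, `irrelevant`). This keeps the model from being biased toward the majority class during training.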

In [None]:
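# Load the labeled reviews.
df = pd.read_csv("./data/review-Vermont_10-labeled.csv")
# Keep every flagged (minority-class) review.
minority_classes = ["advertisement", "rant_without_visit", "irrelevant"]
minority_df = df[df['label'].isin(minority_classes)]
minority_total = len(minority_df)
# Downsample clean reviews to the flagged total; the fixed seed keeps the sample reproducible.
clean_df = df[df['label'] == "clean"].sample(minority_total, random_state=42)
# Combine both groups and renumber the rows.
balanced_df = pd.concat([minority_df, clean_df]).reset_index(drop=True)
print(balanced_df['label'].value_counts())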

label
clean                 1071
irrelevant             546
advertisement          271
rant_without_visit     254
Name: count, dtype: int64
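The balancing step above can be wrapped in a small reusable function. The following is only a sketch: the function name `downsample_majority`, its parameters, and the toy data are hypothetical and not part of the original notebook. It also adds a guard, as an assumption, for the case where there are fewer clean rows than flagged rows, since `sample` would raise an error in that case.

```python
import pandas as pd


def downsample_majority(df, minority_classes, majority="clean",
                        label_col="label", seed=42):
    """Keep all minority rows plus an equally sized random sample of majority rows."""
    minority_df = df[df[label_col].isin(minority_classes)]
    majority_df = df[df[label_col] == majority]
    n_minority = len(minority_df)
    # Sampling without replacement needs at least n_minority majority rows.
    if len(majority_df) < n_minority:
        raise ValueError("Not enough majority rows to match the minority total.")
    sampled_df = majority_df.sample(n=n_minority, random_state=seed)
    return pd.concat([minority_df, sampled_df]).reset_index(drop=True)


# Toy data (hypothetical): 7 clean reviews and 3 flagged reviews.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "label": ["clean"] * 7 + ["advertisement", "irrelevant", "rant_without_visit"],
})
balanced_toy = downsample_majority(
    toy, ["advertisement", "rant_without_visit", "irrelevant"]
)
print(balanced_toy["label"].value_counts())
```

On the toy data, the result keeps all three flagged rows and three randomly chosen clean rows.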




## Convert to 0 and 1

Label 0 marks a clean review and label 1 marks a flagged review (`advertisement`, `rant_without_visit`, or `irrelevant`).

In [3]:
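# Map "clean" (case-insensitive) to 0 and every other label to 1.
balanced_df['label'] = balanced_df['label'].apply(lambda x: 0 if str(x).lower() == "clean" else 1)
# Keep only the columns needed for training.
balanced_df = balanced_df[['text', 'label']]
print(balanced_df['label'].value_counts())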

label
1    1071
0    1071
Name: count, dtype: int64
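The same 0/1 mapping can also be written as a vectorized comparison instead of a row-by-row `apply`. This is a sketch on hypothetical toy data, and it assumes the label column holds strings with no missing values.

```python
import pandas as pd

# Toy data (hypothetical), including a differently cased "Clean" label.
toy = pd.DataFrame({
    "text": ["great food", "nice staff", "visit my site", "unrelated post"],
    "label": ["clean", "Clean", "advertisement", "irrelevant"],
})

# Compare lowercased labels against "clean": False -> 0 (clean), True -> 1 (flagged).
toy["label"] = (toy["label"].str.lower() != "clean").astype(int)
print(toy["label"].tolist())
```

The comparison yields a boolean column, and `astype(int)` turns `False` and `True` into 0 and 1.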


## Save to csv

The cleaned dataset is saved to `./data/review-Vermont_10-cleaned.csv`.

In [11]:
balanced_df.to_csv("./data/review-Vermont_10-cleaned.csv", index=False)
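After saving, it can be worth reloading the file to confirm that it has the expected columns and only 0/1 labels. The sketch below uses a temporary directory and hypothetical toy data rather than the real dataset, so the file name `toy-cleaned.csv` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in (hypothetical) for the cleaned dataset.
toy = pd.DataFrame({"text": ["great food", "visit my site"], "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Check the column layout and that every label is 0 or 1.
columns_ok = list(reloaded.columns) == ["text", "label"]
labels_ok = bool(reloaded["label"].isin([0, 1]).all())
print(columns_ok, labels_ok)
```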