# 02. Data Cleaning

This notebook covers the first stage of data cleaning for the project dataset. It loads the raw CSV, drops irrelevant columns, corrects encoding errors, and unifies inconsistent categories, then exports the result for further analysis.


## Library Imports

We import pandas for data handling, `csv` and `os` for the export options, and matplotlib and seaborn for plotting.

In [None]:
import pandas as pd
import csv
import os
import matplotlib.pyplot as plt
import seaborn as sns

## Dataset Loading
Load the raw dataset from `../data/raw/dataset.csv`, trying UTF-8 first and falling back to Latin-1 if decoding fails.

In [None]:
# Define the relative path to the data file
data_path = '../data/raw/dataset.csv'

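# Load the dataset; df stays None if every attempt fails
df = None
try:
    # Attempt to read with UTF-8 encoding first
    df = pd.read_csv(data_path, encoding='utf-8')
except UnicodeDecodeError:
    try:
        # Fall back to Latin-1 encoding if UTF-8 decoding fails
        df = pd.read_csv(data_path, encoding='latin1')
    except Exception as e:
        print(f"Error loading CSV file with Latin-1: {e}")
except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as e:
    # Missing/unreadable file or malformed CSV: report it instead of leaving df undefined
    print(f"Error loading CSV file: {e}")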

# Check if the dataset loaded correctly
if df is not None:
    print(f"Dataset loaded successfully. Shape: {df.shape}")
else:
    print("Error loading dataset. Please verify the file path and encoding.")
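
The UTF-8-then-Latin-1 fallback above can also be written as a loop over candidate encodings, which makes it easy to add more. The following is a minimal sketch, not part of the original notebook: the helper name `read_csv_with_fallback` and the throwaway demo file are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding in order.

    Returns (DataFrame, encoding_used) and re-raises the last decoding
    error if none of the encodings works.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("No encodings were provided")
    raise last_error


# Demo on a throwaway Latin-1 file (not the project dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin1", newline="") as fh:
        fh.write("city,count\nM\u00e1laga,3\n")  # "Málaga" contains a non-ASCII byte
    demo_df, used = read_csv_with_fallback(demo_path)
    print(f"Loaded with encoding: {used}, shape: {demo_df.shape}")
```

Returning the encoding that succeeded makes it possible to log which files needed the fallback. Note that Latin-1 can decode any byte sequence, so a successful Latin-1 read does not prove the file was actually written in Latin-1.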


## Remove Unnecessary Columns

Drop columns that are not relevant for the analysis to simplify the dataset.

In [None]:
# Display initial shape and columns of the dataframe
print(f"Initial dataframe shape: {df.shape}")
print(f"Dataframe columns: {df.columns.tolist()}")


In [None]:
# Columns to drop (modify as needed)
columns_to_drop = ["col1", "col2", "col3", "col4"]

In [None]:
# Drop the specified columns from the dataframe
# (raises KeyError if any listed column is not present)
df = df.drop(columns=columns_to_drop)

In [None]:
# Display final dataframe shape and columns after dropping unnecessary columns
print(f"Final dataframe shape after dropping columns: {df.shape}")
print(f"Dataframe columns: {df.columns.tolist()}")
df.head()
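
Because `columns_to_drop` holds placeholder names, some of them may not exist in a given dataset. As a minimal sketch under that assumption, the list can be split into columns that are present and columns that are missing before dropping. The `demo` dataframe and its `keep` column are made up for illustration.

```python
import pandas as pd

# Illustrative dataframe (not the project dataset)
demo = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "keep": [5, 6]})
columns_to_drop = ["col1", "col2", "col3", "col4"]

# Separate the requested columns into those that exist and those that do not
present = [c for c in columns_to_drop if c in demo.columns]
missing = [c for c in columns_to_drop if c not in demo.columns]

if missing:
    print(f"Columns not found, skipped: {missing}")

# Drop only the columns that are actually present
trimmed = demo.drop(columns=present)
print(f"Remaining columns: {trimmed.columns.tolist()}")
```

Reporting the missing names, rather than silently ignoring them, helps catch typos in the list.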

## Encoding Error Correction and Category Unification

In this step, we correct values damaged by encoding errors and unify similar categories so that each category has a single, consistent label.

For each affected column, a dictionary maps every incorrect or inconsistent value to its corrected or unified form, and `replace` applies the mapping. Columns listed in the dictionary but missing from the dataframe are skipped.


In [None]:
# Example: Replace encoding errors and unify categories for selected columns
corrections = {
    'COLUMN_1': {
        'IncorrectValue1': 'CorrectValue1',
        'IncorrectValue2': 'CorrectValue2',
        # ...
    },
    'COLUMN_2': {
        'OldCategoryA': 'UnifiedCategoryA',
        'OldCategoryB': 'UnifiedCategoryA',
        # ...
    },
    # Add more columns as needed
}

for col, mapping in corrections.items():
    if col in df.columns:
        df[col] = df[col].replace(mapping)


In [None]:
df.head(20)
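
Building the `corrections` dictionary by hand requires knowing which values are damaged. One possible aid, sketched below, is to generate candidate entries automatically. It assumes the damage comes from UTF-8 text that was decoded as Latin-1, which may not hold for this dataset. The helper `fix_mojibake` and the sample values are hypothetical.

```python
import pandas as pd


def fix_mojibake(value):
    """Hypothetical helper: undo UTF-8 text that was decoded as Latin-1.

    Values that cannot be round-tripped (already correct, or not
    strings) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    try:
        return value.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# Illustrative column: "Espa\u00c3\u00b1a" is "España" damaged by a wrong decode
demo = pd.DataFrame(
    {"COLUMN_1": ["Espa\u00c3\u00b1a", "Espa\u00f1a", "Francia", "Espa\u00c3\u00b1a"]}
)

# Inspect the distinct values before deciding on a mapping
print(demo["COLUMN_1"].value_counts())

# Propose a mapping only for values that the helper actually changes
values = demo["COLUMN_1"].dropna().unique()
candidates = {v: fix_mojibake(v) for v in values if fix_mojibake(v) != v}

# Apply the mapping the same way as the corrections loop above
demo["COLUMN_1"] = demo["COLUMN_1"].replace(candidates)
print(demo["COLUMN_1"].value_counts())
```

The generated mapping should be reviewed before it is applied, since the round-trip can also alter strings that were never damaged.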

## Export Cleaned Data to CSV

The cleaned dataset is exported to `../data/processed/cleaned_data.csv` for further analysis.

In [None]:
df.to_csv(
    path_or_buf='../data/processed/cleaned_data.csv',
    sep=',',
    na_rep='',
    header=True,
    index=False,
    encoding='utf-8',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator=os.linesep,
    quotechar='"',
    decimal='.',
    errors='strict'
)
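
The export fails if the target directory does not exist, so it can help to create it first and then read the file back as a quick sanity check. The sketch below uses a temporary directory and a made-up dataframe rather than the project paths.

```python
import os
import tempfile

import pandas as pd

# Illustrative dataframe with one missing value (not the project dataset)
demo = pd.DataFrame({"name": ["a", "b"], "value": [1.5, None]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "processed", "cleaned_data.csv")

    # Create the output directory if it is missing
    os.makedirs(os.path.dirname(out_path), exist_ok=True)

    # Write without the index, then read the file back
    demo.to_csv(out_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(out_path, encoding="utf-8")

    # Basic round-trip checks: same shape and same column names
    same_shape = reloaded.shape == demo.shape
    same_columns = reloaded.columns.tolist() == demo.columns.tolist()
    print(f"Round trip OK: {same_shape and same_columns}")
```

With `na_rep=''`, missing values are written as empty fields, which `read_csv` reads back as `NaN` by default.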