# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data in the dataset.
* Clean the data by handling missing values, transforming variables, and dropping irrelevant features.
* Split the cleaned data into train and test sets.
* Save the cleaned datasets for further analysis and modeling.

## Inputs

* outputs/datasets/collection/insurance.csv

## Outputs

* Cleaned Dataset: outputs/datasets/cleaned/insurance_cleaned.csv
* Cleaned Train Set: outputs/datasets/cleaned/TrainSetCleaned.csv
* Cleaned Test Set: outputs/datasets/cleaned/TestSetCleaned.csv

## Additional Comments

* We are also going to process the categorical variables in the dataset.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
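
Re-running the `os.chdir` cell above moves up one more level each time, which can leave the notebook outside the project. The following is a minimal sketch of a guard against that, not part of the original notebook. It assumes the project root is the folder that already contains `outputs`; the function name `move_to_project_root` and the `marker` parameter are hypothetical.

```python
import os

def move_to_project_root(marker="outputs"):
    """Move up one level only if the marker folder is not visible from here.

    Assumption: the project root is the folder that contains `marker`.
    """
    cwd = os.getcwd()
    if not os.path.isdir(os.path.join(cwd, marker)):
        # Same step as the notebook cell: make the parent the working directory
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the `outputs` folder is then found in the current directory.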

# Load Dataset

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/insurance.csv"
df = pd.read_csv(df_raw_path)
df


---

### Remove Future Warnings

In [None]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
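
The filter above hides every `FutureWarning` for the rest of the session. If that is broader than intended, one alternative is to scope the filter to a single step with `warnings.catch_warnings()`. The sketch below is illustrative only: `noisy_step` is a hypothetical stand-in for a library call, and `record=True` is used here just to show which warnings still get through.

```python
import warnings

def noisy_step():
    # Hypothetical stand-in for a library call that emits a FutureWarning
    warnings.warn("this behaviour will change", FutureWarning)
    return "done"

# Filters set inside the block are discarded when the block exits
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                          # record all warnings
    warnings.simplefilter("ignore", category=FutureWarning)  # except FutureWarning
    result = noisy_step()
    warnings.warn("still visible", UserWarning)

print(result, [w.category.__name__ for w in caught])
```

Only the `UserWarning` is recorded, and the previous filter configuration is restored once the `with` block ends.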

---

# Data Exploration

Check the distribution and shape of any variables with missing data.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
print(f"Variables with missing data: {vars_with_missing_data}")

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

---

# Data Cleaning


The exploration above shows that the dataset has no missing values. We still run a missing-data report to confirm this before saving the data.

In [None]:
def EvaluateMissingData(df):
    """Return one row per column that has missing values, sorted by percentage missing."""
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute / len(df) * 100, 2)
    df_missing_data = (
        pd.DataFrame(
            data={"RowsWithMissingData": missing_data_absolute,
                  "PercentageOfDataset": missing_data_percentage,
                  "DataType": df.dtypes}
        )
        .sort_values(by=['PercentageOfDataset'], ascending=False)
        .query("PercentageOfDataset > 0")
    )

    return df_missing_data

# An empty report means that no column has missing values
missing_data_report = EvaluateMissingData(df)
print(missing_data_report)
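
Because the insurance dataset has no missing values, the report above comes back empty. To show what the same logic produces when values are missing, here is a small sketch on made-up data. The `toy` frame and its values are hypothetical and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "age": [19, np.nan, 28, 33],
    "region": ["southwest", "southeast", None, None],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Same steps as the report function: count, convert to percent, keep columns with gaps
missing_abs = toy.isnull().sum()
missing_pct = round(missing_abs / len(toy) * 100, 2)
toy_report = (
    pd.DataFrame({
        "RowsWithMissingData": missing_abs,
        "PercentageOfDataset": missing_pct,
        "DataType": toy.dtypes,
    })
    .sort_values(by="PercentageOfDataset", ascending=False)
    .query("PercentageOfDataset > 0")
)
print(toy_report)
```

Here `region` (2 of 4 rows missing) is listed before `age` (1 of 4), and `charges` is filtered out because it is complete.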

---

# Push Files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)

# Save the dataframe to a CSV file in the outputs folder
df.to_csv('outputs/datasets/cleaned/insurance_cleaned.csv', index=False)
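
The objectives also list a train/test split, which the cell above does not perform. The following is a minimal sketch of one way to do it with pandas alone. The 80/20 ratio, the fixed seed, and the helper name `split_and_save` are assumptions, not choices made in this notebook; scikit-learn's `train_test_split` is a common alternative.

```python
import os
import pandas as pd

def split_and_save(df, out_dir="outputs/datasets/cleaned", test_frac=0.2, seed=0):
    """Randomly split df into train and test sets and write both to CSV.

    Assumptions: an 80/20 split, a fixed seed, and a unique index on df.
    """
    # Draw the test rows at random, then keep everything else for training
    test_set = df.sample(frac=test_frac, random_state=seed)
    train_set = df.drop(index=test_set.index)

    os.makedirs(out_dir, exist_ok=True)
    train_set.to_csv(os.path.join(out_dir, "TrainSetCleaned.csv"), index=False)
    test_set.to_csv(os.path.join(out_dir, "TestSetCleaned.csv"), index=False)
    return train_set, test_set
```

In the notebook this could be called as `TrainSet, TestSet = split_and_save(df)`, where the variable names are again only illustrative.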

# Conclusions and Next Steps

### Conclusions

We checked the dataset for missing values and found none. The dataset was saved to outputs/datasets/cleaned/insurance_cleaned.csv.

### Next Steps

* Feature Engineering
* Model Development
* Model Evaluation