# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data in the dataset.
* Clean the data by handling missing values, transforming variables, and dropping irrelevant features.
* Split the cleaned data into train and test sets.
* Save the cleaned datasets for further analysis and modeling.

## Inputs

* outputs/datasets/collection/insurance.csv

## Outputs

* Cleaned Train Set: outputs/datasets/cleaned/TrainSetCleaned.csv
* Cleaned Test Set: outputs/datasets/cleaned/TestSetCleaned.csv

## Additional Comments

* We will also encode the categorical variables in the dataset.


---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory to the project root.

We need to change the working directory from the notebook folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/medical-insurance-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/medical-insurance-prediction'
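
Re-running the cell above moves the working directory up one more level each time. As a minimal sketch of one way to guard against that, the helper below only returns the parent folder when the current folder is the notebooks subfolder. The function name project_root and the default folder name are assumptions made for illustration (the folder name is taken from the path printed above); they are not part of the original notebook.

```python
import os


def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder.

    Hypothetical helper: the folder name is assumed from the path shown above.
    """
    if os.path.basename(current_dir) == notebooks_folder:
        return os.path.dirname(current_dir)
    # Already at the project root (or somewhere else): leave the path unchanged
    return current_dir


root = project_root("/workspaces/medical-insurance-prediction/jupyter_notebooks")
print(root)

# In the notebook this could be used as:
# os.chdir(project_root(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is.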

# Load Collected Data

In [4]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/insurance.csv"
df = pd.read_csv(df_raw_path)
df.head(10)


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


---

### Suppress Future Warnings

In [5]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
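
The filter above applies to the whole session. If you would rather silence FutureWarning only around a specific call, a scoped alternative might look like the sketch below. The function legacy_call is a hypothetical stand-in for any library call that emits a FutureWarning; it does not come from the notebook.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library call that emits a FutureWarning
    warnings.warn("behaviour will change in a future release", FutureWarning)
    return "done"


# The filter is active only inside the with-block and is restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = legacy_call()

print(result)
```

Scoping the filter keeps unrelated FutureWarnings visible in the rest of the notebook.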

---

# Data Exploration

Identify any variables with missing data, then profile their distribution and shape

In [6]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
print(f"Variables with missing data: {vars_with_missing_data}")

Variables with missing data: []


In [7]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

There are no variables with missing data


---

# Data Cleaning


The check above shows that the dataset has no missing values. We still run a more detailed missing-data report and then encode the categorical variables so the data is ready for analysis.

In [8]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute / len(df) * 100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                  "PercentageOfDataset": missing_data_percentage,
                                  "DataType": df.dtypes}
                        )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                      )

    return df_missing_data

missing_data_report = EvaluateMissingData(df)
print(missing_data_report)

Empty DataFrame
Columns: [RowsWithMissingData, PercentageOfDataset, DataType]
Index: []
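
Because the insurance data has no missing values, the report above is empty. To show what a non-empty report looks like, the sketch below restates the same logic under an assumed name, evaluate_missing_data, and applies it to a small made-up DataFrame. The toy values are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def evaluate_missing_data(df):
    # Same logic as EvaluateMissingData above, restated so this block runs on its own
    missing_abs = df.isnull().sum()
    missing_pct = round(missing_abs / len(df) * 100, 2)
    return (
        pd.DataFrame({"RowsWithMissingData": missing_abs,
                      "PercentageOfDataset": missing_pct,
                      "DataType": df.dtypes})
        .sort_values(by="PercentageOfDataset", ascending=False)
        .query("PercentageOfDataset > 0")
    )


# Made-up data: one missing age, two missing bmi values, no missing smoker values
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, np.nan, np.nan, 22.7],
    "smoker": ["yes", "no", "no", "no"],
})

report = evaluate_missing_data(toy)
print(report)
```

Only columns with at least one missing value appear in the report, sorted from the highest to the lowest percentage.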


## Transform Categorical Variables

In [9]:
# Encode categorical variables as integers
df.replace({'sex':{'male':0, 'female':1},
            'smoker':{'no':0, 'yes':1},
            'region':{'southwest':0, 'southeast':1, 'northwest':2, 'northeast':3}}, inplace=True)
df.head(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,1,27.9,0,1,0,16884.924
1,18,0,33.77,1,0,1,1725.5523
2,28,0,33.0,3,0,1,4449.462
3,33,0,22.705,0,0,2,21984.47061
4,32,0,28.88,0,0,2,3866.8552
5,31,1,25.74,0,0,1,3756.6216
6,46,1,33.44,1,0,1,8240.5896
7,37,1,27.74,3,0,2,7281.5056
8,37,0,29.83,2,0,3,6406.4107
9,60,1,25.84,0,0,2,28923.13692
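
Mapping region to the integers 0 to 3 gives the categories a numeric order that some models may interpret as meaningful. A possible alternative, shown here only as a sketch and not as the approach used in this notebook, is to keep the binary mappings for sex and smoker and one-hot encode region with pd.get_dummies. The toy rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows using the same category labels as the dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Binary variables: same integer codes as in the cell above
binary_maps = {
    "sex": {"male": 0, "female": 1},
    "smoker": {"no": 0, "yes": 1},
}

encoded = toy.copy()
for col, mapping in binary_maps.items():
    # Series.map leaves NaN for any label missing from the mapping
    encoded[col] = encoded[col].map(mapping)

# One indicator column per region instead of a single ordinal code
encoded = pd.get_dummies(encoded, columns=["region"], prefix="region", dtype=int)
print(encoded)
```

Working on a copy rather than using inplace=True also keeps the original labels available for later inspection.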


## Split Train and Test set

### Separate features and target

In [10]:
x = df.drop("charges", axis=1)
y = df["charges"]

In [11]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(f"{x.shape} X_train shape: {TrainSet.shape}, X_test shape: {TestSet.shape}")

(1338, 6) X_train shape: (1070, 6), X_test shape: (268, 6)
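
Note that TrainSet and TestSet here contain only the six feature columns, because charges was dropped before the split. If the saved files should carry the target as well, one option is to split the full DataFrame in a single call. The sketch below assumes a made-up DataFrame with the same number of rows (1338) and uses hypothetical names train_df and test_df. With test_size=0.2 the test set size is rounded up, which reproduces the 1070 / 268 row counts printed above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same row count as the insurance dataset
n_rows = 1338
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n_rows),
    "charges": rng.uniform(1000, 50000, n_rows),
})

# Splitting the whole frame keeps the target column next to the features
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=0)
print(train_df.shape, test_df.shape)
```

The target can still be separated later with train_df.drop("charges", axis=1) and train_df["charges"].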


## Re-evaluate missing data in Train and Test sets

In [12]:
missing_data_report_train = EvaluateMissingData(TrainSet)
print(f"* There are {missing_data_report_train.shape[0]} variables with missing data \n")
print(missing_data_report_train)

* There are 0 variables with missing data 

Empty DataFrame
Columns: [RowsWithMissingData, PercentageOfDataset, DataType]
Index: []


In [13]:
missing_data_report_test = EvaluateMissingData(TestSet)
print(f"* There are {missing_data_report_test.shape[0]} variables with missing data \n")
print(missing_data_report_test)

* There are 0 variables with missing data 

Empty DataFrame
Columns: [RowsWithMissingData, PercentageOfDataset, DataType]
Index: []


---

# Push files to Repo

* Create the output folder if it does not already exist, then save the cleaned train and test sets as CSV files.

In [14]:
import os
try:
    # Create the output folder; it may already exist from a previous run
    os.makedirs(name='outputs/datasets/cleaned')
except FileExistsError as e:
    print(e)


[Errno 17] File exists: 'outputs/datasets/cleaned'


In [15]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

In [16]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
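
As a quick sanity check on the saved files, the sketch below writes a small made-up DataFrame to a temporary folder and reads it back. It uses os.makedirs with exist_ok=True, which avoids the "File exists" message when the folder is already present. The temporary folder and the toy values are assumptions made for illustration; the notebook itself writes to outputs/datasets/cleaned.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the cleaned train set
train_set = pd.DataFrame({
    "age": [19, 18],
    "bmi": [27.9, 33.77],
    "smoker": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "cleaned")
    # exist_ok=True makes the call safe to repeat
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    path = os.path.join(out_dir, "TrainSetCleaned.csv")
    train_set.to_csv(path, index=False)

    # Read the file back while the temporary folder still exists
    reloaded = pd.read_csv(path)

print(reloaded.equals(train_set))
```

Saving with index=False matters here: without it, the index would be written as an extra unnamed column and the reloaded frame would no longer match.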

---

# Conclusions and Next Steps

### Conclusions

We checked the dataset for missing values and found none, so no imputation or feature removal was required. We then encoded the categorical variables as integers and split the data into train and test sets. The cleaned data is now ready for further analysis and modeling.
Key steps completed in this notebook:

* Evaluated missing data; none was found in the full dataset or in the train and test sets.
* Encoded the categorical variables (sex, smoker, region) as integers.
* Split the cleaned dataset into train and test sets.
* Saved the cleaned datasets for further analysis.

### Next Steps

* Feature Engineering
* Model Development
* Model Evaluation