# Task 1: Data Wrangling

### Table of Contents

1. [Introduction](#introduction)
2. [Data Gathering](#data_gathering)
3. [Data Assessing](#data_assessing)
    * [Assessment Summary](#assessment_summary)
4. [Data Cleaning](#data_cleaning)
5. [Saving](#saving)

## Introduction <a class="anchor" id="introduction"></a>

Data wrangling is performed on the Ames Housing dataset, which describes the features of residential homes in Ames, Iowa. The dataset is split into two files: a training set and a testing set. Both sets have 79 columns describing the features of the homes, and the training set has one additional column containing the sale prices. The explanation for each column can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

Data wrangling consists of three steps: gathering, assessing and cleaning. It is performed on the training set only, because the testing set should be treated as unseen data. For gathering, data comes from a single source, the training set. The data is then assessed visually and programmatically to look for data quality and tidiness issues. Finally, the data is cleaned based on the issues detected during the assessment stage.

## Data Gathering <a class="anchor" id="data_gathering"></a>

In [None]:
# Import libraries
import os
import pandas as pd
import numpy as np
import math
pd.set_option('display.max_columns', None)  

In [None]:
# Define the file path of the training set
dirname = '/kaggle/input'
subdirname = 'house-prices-advanced-regression-techniques'
train_filename = 'train.csv'
train_filepath = os.path.join(dirname, subdirname, train_filename)

# Load the training set
df = pd.read_csv(train_filepath)

In [None]:
# Print out the first 5 rows of df
df.head()

In [None]:
print("Training set: {}".format(df.shape))

## Data Assessing <a class="anchor" id="data_assessing"></a>

Data is assessed visually and programmatically to look for data quality and tidiness issues.

### Visual Assessment

In [None]:
# Sample n rows at random from the data frame for visual assessment
n_samples = 10
df.sample(n = n_samples)

### Programmatic Assessment

#### Step 1: Check if there are any duplicates

In [None]:
# Check for duplicates based on all columns
print("Number of duplicates: {}".format(sum(df.duplicated())))

# Check for duplicates based on all columns except 'Id'
columns_without_id = list(df.columns)
columns_without_id.remove('Id')

print("Number of duplicates (without 'Id'): {}".format(sum(df.duplicated(subset=columns_without_id))))

> ##### Finding(s) for Step 1:
> - There are no duplicates.
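
The reason for repeating the check without `Id` can be seen on a small made-up frame. The values below are illustrative only and are not taken from the dataset: two rows that differ only in `Id` are missed by the first check and caught by the second.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 describe the same house under different Ids
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotArea': [8450, 8450, 9600],
    'YearBuilt': [2003, 2003, 1976],
})

# Duplicates across all columns: none, because Id is unique per row
n_dup_all = int(toy.duplicated().sum())

# Duplicates when Id is ignored: the second row repeats the first
cols_without_id = [c for c in toy.columns if c != 'Id']
n_dup_no_id = int(toy.duplicated(subset=cols_without_id).sum())

print(n_dup_all, n_dup_no_id)
```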


#### Step 2: Determine the number of nulls/missing data and the correct datatypes

In [None]:
# Check number of NaNs in each column and data type of each column
df.info()

> ##### Finding(s) for Step 2:

> - 80 variables (excluding Id)

>> -  34 Numeric variables 

>>> -  14 discrete variables (YearBuilt, YearRemodAdd, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, MoSold, YrSold)

>>> -  20 continuous variables (LotFrontage, LotArea, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, SalePrice)

>> - 46 categorical variables 

>>> - 25 nominal variables (MSSubClass, MSZoning, Street, Alley, Utilities, LotConfig, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, Foundation, Heating, CentralAir, Electrical, GarageType, PavedDrive, MiscFeature, SaleType, SaleCondition)

>>> - 21 ordinal variables (LotShape, LandContour, LandSlope, OverallQual, OverallCond, ExterQual, ExterCond, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, HeatingQC, KitchenQual, Functional, FireplaceQu, GarageFinish, GarageQual, GarageCond, PoolQC, Fence)

> - Wrong data types (MSSubClass) - should be string instead of integer

> - Wrong data types (LotArea, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal) - should be float instead of integer

> - Wrong data types (BsmtFullBath, BsmtHalfBath, GarageYrBlt) - should be integer instead of float

> - 19 variables with nulls/missing data

>> - 3 numeric variables (LotFrontage, MasVnrArea,  GarageYrBlt)

>> - 16 categorical variables (Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1,BsmtFinType2, Electrical, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature)
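
Reading null counts off `df.info()` by hand is error-prone, so a compact summary can help. The following is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration); the same expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, 'Grvl', np.nan, np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count nulls per column and keep only the columns that have at least one
null_counts = toy.isnull().sum()
has_null = null_counts > 0

# Pair each null count with the column's data type, largest count first
null_summary = pd.DataFrame({
    'n_null': null_counts[has_null],
    'dtype': toy.dtypes[has_null].astype(str),
}).sort_values('n_null', ascending=False)

print(null_summary)
```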

#### Step 3: Investigate the missing data further

Based on data_description.txt (can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)), certain variables contain nulls (NaN) because the house does not have the corresponding feature. For example, NaN in the Alley variable means that the house has no alley access. Thus, a null does not mean missing data for these variables.

> ##### Finding(s) for Step 3: 

> - Nulls in 23 variables do not represent missing data 

>> - 9 numeric variables (BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath, GarageYrBlt, GarageCars, GarageArea)

>> - 14 categorical variables (Alley, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature)
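
One way to probe whether a null means "feature absent" is to look at related columns together. The sketch below uses a made-up garage frame (all values are hypothetical) and separates rows where every related column is null from rows where only some are null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has no garage, row 2 has one isolated gap
toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd'],
    'GarageFinish': ['RFn', np.nan, np.nan],
    'GarageQual': ['TA', np.nan, 'TA'],
})

is_null = toy.isnull()

# All columns null: consistent with "the house has no garage"
all_null = is_null.all(axis=1)

# Some but not all columns null: more likely genuinely missing data
partial_null = is_null.any(axis=1) & ~all_null

print(all_null.tolist(), partial_null.tolist())
```

Rows flagged by `partial_null` are the ones worth inspecting by hand, since a blanket "feature absent" interpretation does not fit them.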

#### Step 4: Look for any outliers in numeric variables

In [None]:
# Compute statistics of numeric variables
df.describe()

> ##### Finding(s) for Step 4:

> - The numeric variables look fairly clean, as the summary statistics show no obvious outliers.
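
The finding above rests on reading `df.describe()` by eye. As a complementary check, and not something this notebook ran, Tukey's 1.5 × IQR rule could be used to flag candidates programmatically. The sketch below applies it to made-up values; the series `toy` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy values; 4000 is deliberately extreme
toy = pd.Series([1200, 1350, 1500, 1600, 1750, 4000], name='GrLivArea')

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

print(toy[is_outlier])
```

A flagged value is only a candidate: whether it is an error or a legitimately large house still requires judgment.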

#### Step 5: Look for any abnormal data in categorical variables

In [None]:
# Define categorical variables
cat_vars = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 
            'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 
            'MasVnrType', 'Foundation', 'Heating', 'CentralAir', 'Electrical', 'GarageType', 'PavedDrive', 
            'MiscFeature', 'SaleType', 'SaleCondition', 'LotShape', 'LandContour', 'LandSlope', 'OverallQual', 
            'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
            'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual',
            'GarageCond', 'PoolQC', 'Fence']

# Print the unique values for each categorical variable
for var in cat_vars:
    # Get unique values
    unique_vals = df[var].unique()
 
    print("{} : {}".format(var, unique_vals))

> ##### Finding(s) for Step 5:

> - After comparing the unique values of each variable with data_description.txt (can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)), the categorical variables look fairly clean, and only the following two data quality issues were found.

>> - 'WdShing' in 'Exterior1st' variable is the same as 'Wd Shng' in 'Exterior2nd' variable. For consistency, 'Wd Shng' in 'Exterior2nd' should be renamed as 'WdShing'

>> - 'BrkComm' in 'Exterior1st' variable is the same as 'Brk Cmn' in 'Exterior2nd' variable. For consistency, 'Brk Cmn' in 'Exterior2nd' should be renamed as 'BrkComm'
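
Spelling mismatches between two columns that share a vocabulary can be surfaced with set differences. This is a minimal sketch on made-up rows that mirror the two mismatches above; `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data mirroring the two spelling mismatches
toy = pd.DataFrame({
    'Exterior1st': ['VinylSd', 'WdShing', 'BrkComm', 'MetalSd'],
    'Exterior2nd': ['VinylSd', 'Wd Shng', 'Brk Cmn', 'MetalSd'],
})

first = set(toy['Exterior1st'].dropna())
second = set(toy['Exterior2nd'].dropna())

# Labels that appear in only one of the two columns are candidates for renaming
only_in_first = sorted(first - second)
only_in_second = sorted(second - first)

print(only_in_first, only_in_second)
```

On real columns the differences can also include labels that legitimately occur in only one column, so each candidate still needs a manual check against data_description.txt.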

#### Step 6: Look for any discrepancies in year variables

There are 4 variables that are related to year, namely YearBuilt, YearRemodAdd, GarageYrBlt and YrSold. 

- YearBuilt:  Original construction date
- YearRemodAdd: Remodel date
- GarageYrBlt: Year garage was built
- YrSold: Year Sold

These 4 variables are compared to ensure that their chronological order is logical. YearBuilt should occur first and YrSold should occur last, with YearRemodAdd and GarageYrBlt falling between them. Thus, the expected chronological order is YearBuilt -> YearRemodAdd/GarageYrBlt -> YrSold.


In [None]:
# Define year variables
year_vars = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

# Select YearBuilt, YearRemodAdd, GarageYrBlt and YrSold from the data frame
df_year = df[year_vars]

# Define a function to check the chronological order
def is_correct_order(x):
    # Fall back to 'YearBuilt' when 'YearRemodAdd' or 'GarageYrBlt' is NaN, because any
    # comparison with NaN evaluates to False and would flag the row incorrectly.
    year_remod = x['YearBuilt'] if pd.isna(x['YearRemodAdd']) else x['YearRemodAdd']
    year_garage = x['YearBuilt'] if pd.isna(x['GarageYrBlt']) else x['GarageYrBlt']

    # Check correctness of chronological order
    return (x['YearBuilt'] <= year_remod <= x['YrSold']
            and x['YearBuilt'] <= year_garage <= x['YrSold'])

# Get row(s) whose chronological order is not logical
df_year[~(df_year.apply(is_correct_order, axis=1))]

> ##### Finding(s) for Step 6:

> - For the row with index label 523 (the DataFrame index, not the `Id` column), YearRemodAdd does not fall between YearBuilt and YrSold. YearRemodAdd should be replaced with YearBuilt or YrSold, whichever is closer to YearRemodAdd

> - For the rows with index labels 29, 93, 324, 600, 736, 1103, 1376, 1414 and 1418, GarageYrBlt does not fall between YearBuilt and YrSold. GarageYrBlt should be replaced with YearBuilt or YrSold, whichever is closer to GarageYrBlt
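
The same chronological check can also be written without a row-wise `apply`. The sketch below is an alternative formulation on made-up rows (the frame `toy` is hypothetical), using `Series.between`, which is inclusive on both ends by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has a remodel after the sale, row 2 has no garage year
toy = pd.DataFrame({
    'YearBuilt': [2000, 2007, 1995],
    'YearRemodAdd': [2005, 2008, 1995],
    'GarageYrBlt': [2000.0, 2007.0, np.nan],
    'YrSold': [2008, 2007, 2009],
})

# YearBuilt <= YearRemodAdd <= YrSold, evaluated for all rows at once
remod_ok = toy['YearRemodAdd'].between(toy['YearBuilt'], toy['YrSold'])

# A missing garage year is treated as consistent, as in the row-wise function
garage_ok = (toy['GarageYrBlt'].between(toy['YearBuilt'], toy['YrSold'])
             | toy['GarageYrBlt'].isna())

bad_rows = toy[~(remod_ok & garage_ok)]
print(bad_rows)
```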

#### Step 7: Check consistency in area (square feet)

If the data is correct, the following equations must be satisfied.

- TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF
- GrLivArea = 1stFlrSF + 2ndFlrSF + LowQualFinSF

In [None]:
# Get the row(s) whose TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF is not satisfied
df[df['BsmtFinSF1'] + df['BsmtFinSF2'] + df['BsmtUnfSF'] != df['TotalBsmtSF']]

In [None]:
# Get the row(s) whose GrLivArea = 1stFlrSF + 2ndFlrSF + LowQualFinSF is not satisfied
df[df['1stFlrSF'] + df['2ndFlrSF'] + df['LowQualFinSF'] != df['GrLivArea']]

> ##### Finding(s) for Step 7:

> - Based on the results shown above, all the rows satisfy both 'TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF' and 'GrLivArea = 1stFlrSF + 2ndFlrSF + LowQualFinSF'

#### Step 8: Look for variables with many nulls

In [None]:
# Set the threshold for acceptable null percentage
threshold = 0.2

# Print the variables that have null percentage greater than the threshold
df.loc[:, df.isnull().mean() > threshold].isnull().mean()

> ##### Finding(s) for Step 8:

> - There are 5 variables with a null percentage greater than 0.2 (20%): Alley, FireplaceQu, PoolQC, Fence and MiscFeature. As these variables have too much missing data, they will be dropped from the data frame.
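
The threshold filter works because `isnull()` yields booleans, so a column mean is the fraction of nulls in that column. A minimal sketch on made-up data (the frame `toy` is an assumption) makes the intermediate values visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: PoolQC is mostly null, LotFrontage has a single null
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, 'Gd', np.nan, np.nan, np.nan],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
})

threshold = 0.2

# Fraction of nulls per column (5/6, 1/6 and 0 here)
null_fraction = toy.isnull().mean()

# Keep only the columns whose null fraction exceeds the threshold
above_threshold = null_fraction[null_fraction > threshold]

print(above_threshold)
```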

### Assessment Summary <a class="anchor" id="assessment_summary"></a>

Based on the findings above, the following data quality issues were detected.

- Wrong data types (MSSubClass) - should be string instead of integer
- Wrong data types (LotArea, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal) - should be float instead of integer
- Wrong data types (BsmtFullBath, BsmtHalfBath, GarageYrBlt) - should be integer instead of float
- Nulls in 23 variables do not represent missing data; they mean that the house does not have the corresponding feature. Thus, nulls (if there are any) in these variables should be replaced with a value that indicates the feature is absent.
> - 9 numeric variables (BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath, GarageYrBlt, GarageCars, GarageArea)
> - 14 categorical variables (Alley, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature)

- 'WdShing' in 'Exterior1st' variable is the same as 'Wd Shng' in 'Exterior2nd' variable. For consistency, 'Wd Shng' in 'Exterior2nd' should be renamed as 'WdShing'
- 'BrkComm' in 'Exterior1st' variable is the same as 'Brk Cmn' in 'Exterior2nd' variable. For consistency, 'Brk Cmn' in 'Exterior2nd' should be renamed as 'BrkComm'
- For the row with index label 523 (the DataFrame index, not the `Id` column), YearRemodAdd does not fall between YearBuilt and YrSold. YearRemodAdd should be replaced with YearBuilt or YrSold, whichever is closer to YearRemodAdd
- For the rows with index labels 29, 93, 324, 600, 736, 1103, 1376, 1414 and 1418, GarageYrBlt does not fall between YearBuilt and YrSold. GarageYrBlt should be replaced with YearBuilt or YrSold, whichever is closer to GarageYrBlt
- Alley, FireplaceQu, PoolQC, Fence and MiscFeature should be dropped from the data frame as they have too much missing data.

## Data Cleaning <a class="anchor" id="data_cleaning"></a>

Data is cleaned based on the assessment summary. Each cleaning process has three steps: define, code and test.

- Define: define objective of the cleaning process
- Code: write code for performing the objective
- Test: verify cleaning process is carried out as intended

Before cleaning begins, a copy of the data frame is created. All the cleaning is performed on the copy, so that cleaned and uncleaned data frames can be compared if needed.

In [None]:
# Create a copy of the data frame
df_clean = df.copy()

#### Step 1: Change data types

***Define***

- Convert data type of MSSubClass from integer to string
- Convert data types of LotArea, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal from integer to float
- Convert data types of BsmtFullBath, BsmtHalfBath, GarageYrBlt from float to integer

***Code***

In [None]:
# Convert data type of MSSubClass to string
df_clean['MSSubClass'] = df_clean['MSSubClass'].astype(str)

# Convert data types of LotArea, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, WoodDeckSF, 
# OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal to float

columns = ['LotArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'WoodDeckSF', 'OpenPorchSF', 
           'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

for col in columns:
    df_clean[col] = df_clean[col].astype(float)
    
# Convert data types of BsmtFullBath, BsmtHalfBath, GarageYrBlt to nullable integer ('Int64' keeps nulls)
columns = ['BsmtFullBath', 'BsmtHalfBath', 'GarageYrBlt']

for col in columns:
    df_clean[col] = df_clean[col].astype('Int64')

***Test***

In [None]:
df_clean.head()

In [None]:
df[['MSSubClass', 'LotArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 
    'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
    'MiscVal', 'BsmtFullBath', 'BsmtHalfBath']].info()

In [None]:
df_clean[['MSSubClass', 'LotArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 
          'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
          'MiscVal', 'BsmtFullBath', 'BsmtHalfBath']].info()

#### Step 2: Replace nulls in 23 variables

***Define***

Nulls in these 23 variables do not represent missing data; they mean that the house does not have the corresponding feature. Of the 23 variables, 9 are numeric and 14 are categorical. Some of the variables are closely related to one another. For example, all the information about the garage of a house is held in the GarageYrBlt, GarageCars, GarageArea, GarageType, GarageFinish, GarageQual and GarageCond variables. If a house does not have a garage, nulls in the garage-related categorical variables are replaced with NoGarage, whereas nulls in the garage-related numeric variables are replaced with 0.

We can split all the 23 variables into 3 categories, which are:

- Basement (BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2)
- Garage (GarageYrBlt, GarageCars, GarageArea, GarageType, GarageFinish, GarageQual, GarageCond)
- Others (Alley, FireplaceQu, PoolQC, Fence, MiscFeature)

The 3 categories will be cleaned separately with the sequence shown in the list above.
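
The group-wise replacement described above can be illustrated on a made-up garage frame. The names `toy`, `num_cols` and `cat_cols` are hypothetical; the point of the sketch is that a row is only filled when every categorical column in the group is null, so isolated gaps are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has no garage, row 2 has one isolated gap
toy = pd.DataFrame({
    'GarageArea': [548.0, np.nan, 460.0],
    'GarageType': ['Attchd', np.nan, 'Detchd'],
    'GarageQual': ['TA', np.nan, np.nan],
})

num_cols = ['GarageArea']
cat_cols = ['GarageType', 'GarageQual']

# A house is treated as having no garage only when every categorical column is null
no_garage = toy[cat_cols].isnull().all(axis=1)

# Fill numeric columns with 0 and categorical columns with a placeholder label
toy.loc[no_garage, num_cols] = 0
toy.loc[no_garage, cat_cols] = 'NoGarage'

print(toy)
```

The isolated null in row 2 survives this step, which is why some nulls can remain in the test cell after cleaning.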

***Code***

In [None]:
# Clean basement category

# Define the variables that are related to basement
var_numerical = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
var_categorical = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']

# Get boolean index of rows for house without basement
bool_index = df_clean[var_categorical].isnull().all(axis=1)

# Replace numeric variables with 0
df_clean.loc[bool_index, var_numerical] = 0

# Replace categorical variables with NoBsmt
df_clean.loc[bool_index, var_categorical] = 'NoBsmt'

In [None]:
# Clean garage category

# Define the variables that are related to garage
var_numerical = ['GarageYrBlt', 'GarageCars', 'GarageArea']
var_categorical = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']

# Get boolean index of rows for house without garage
bool_index = df_clean[var_categorical].isnull().all(axis=1)

# Replace numeric variables with 0
df_clean.loc[bool_index, var_numerical] = 0

# Replace categorical variables with NoGarage
df_clean.loc[bool_index, var_categorical] = 'NoGarage'

In [None]:
# Clean other category

# Define the variables in other category
var_categorical = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

for var in var_categorical:
    # Get boolean index of rows with nulls
    bool_index = df_clean[var].isnull()
    
    # Replace categorical variables with No{Variable Name}
    df_clean.loc[bool_index, var] = 'No{}'.format(var)

***Test***

In [None]:
df_clean[['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtCond', 
          'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageType', 
          'GarageFinish', 'GarageQual', 'GarageCond', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']].info()

Based on the information shown in the previous cell, it can be seen that some variables that are related to basement still have nulls/missing data. Let's study the rows with nulls further.

In [None]:
# Get the rows that have nulls or missing data for basement-related variables
df_clean[df_clean[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType2']].isnull().any(axis=1)]

From the rows of data shown above, it can be observed that all the nulls in these variables actually represent missing data, instead of indicating that these variables are not possessed by the house. Missing data will be handled during Exploratory Data Analysis. Therefore, the objective of this particular cleaning process is achieved.

#### Step 3: Match the values of 'Exterior1st' and 'Exterior2nd'

***Define***

Change 'Wd Shng' in 'Exterior2nd' to 'WdShing' ('Exterior1st') and 'Brk Cmn' in 'Exterior2nd' to 'BrkComm' ('Exterior1st')

***Code***

In [None]:
# Replace 'Wd Shng' in 'Exterior2nd' with 'WdShing'
df_clean['Exterior2nd'] = df_clean['Exterior2nd'].replace('Wd Shng', 'WdShing') 

# Replace 'Brk Cmn' in 'Exterior2nd' with 'BrkComm'
df_clean['Exterior2nd'] = df_clean['Exterior2nd'].replace('Brk Cmn', 'BrkComm') 

***Test***

In [None]:
df_clean['Exterior2nd'].unique()

#### Step 4: Fix values of year-related variables

***Define***

- For the row with index label 523 (the DataFrame index, not the `Id` column), YearRemodAdd does not fall between YearBuilt and YrSold. YearRemodAdd should be replaced with YearBuilt or YrSold, whichever is closer to YearRemodAdd

- For the rows with index labels 29, 93, 324, 600, 736, 1103, 1376, 1414 and 1418, GarageYrBlt does not fall between YearBuilt and YrSold. GarageYrBlt should be replaced with YearBuilt or YrSold, whichever is closer to GarageYrBlt
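
Replacing an out-of-range year with the nearer of YearBuilt and YrSold is equivalent to clipping the year to the interval [YearBuilt, YrSold], provided YearBuilt is not later than YrSold. The sketch below shows this alternative on made-up rows (the frame `toy` is hypothetical). It is not the approach taken in the code cells of this step, which update the listed rows one by one.

```python
import pandas as pd

# Hypothetical toy data: row 0 has a garage year before construction,
# row 1 has a garage year after the sale, row 2 is already consistent
toy = pd.DataFrame({
    'YearBuilt': [1957, 2006, 1990],
    'GarageYrBlt': [1950, 2010, 1995],
    'YrSold': [2006, 2007, 2008],
})

# Clipping to [YearBuilt, YrSold] moves an out-of-range year to the nearer bound
# and leaves in-range years untouched (assumes YearBuilt <= YrSold in every row)
toy['GarageYrBlt'] = toy['GarageYrBlt'].clip(
    lower=toy['YearBuilt'], upper=toy['YrSold'], axis=0
)

print(toy['GarageYrBlt'].tolist())
```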

***Code***

In [None]:
# Code for the first item in the list above

# Define the index label of the row to work on
ids_for_remod = 523

year_built = df.loc[ids_for_remod, 'YearBuilt']
year_sold = df.loc[ids_for_remod, 'YrSold']
year_remod_add = df.loc[ids_for_remod, 'YearRemodAdd']

# Compute absolute difference between YearBuilt and YearRemodAdd
diff_built_remod = abs(year_built - year_remod_add)

# Compute absolute difference between YrSold and YearRemodAdd
diff_sold_remod = abs(year_sold - year_remod_add)

# Replace YearRemodAdd with YearBuilt or YrSold, whichever is closer to YearRemodAdd
df_clean.loc[ids_for_remod, 'YearRemodAdd'] = year_built if diff_built_remod <= diff_sold_remod else year_sold

In [None]:
# Code for the second item in the list above

# Define the index labels of the rows to work on
ids_for_garage = [29, 93, 324, 600, 736, 1103, 1376, 1414, 1418]

for row_id in ids_for_garage:
    year_built = df.loc[row_id, 'YearBuilt']
    year_sold = df.loc[row_id, 'YrSold']
    year_garage_built = df.loc[row_id, 'GarageYrBlt']

    # Compute absolute difference between YearBuilt and GarageYrBlt
    diff_built_garage = abs(year_built - year_garage_built)

    # Compute absolute difference between YrSold and GarageYrBlt
    diff_sold_garage = abs(year_sold - year_garage_built)

    # Replace GarageYrBlt with YearBuilt or YrSold, whichever is closer to GarageYrBlt
    df_clean.loc[row_id, 'GarageYrBlt'] = year_built if diff_built_garage <= diff_sold_garage else year_sold

***Test***

In [None]:
# Test for the first item in the list above
df_clean.loc[[ids_for_remod] , ['YearBuilt', 'YearRemodAdd', 'YrSold']]

In [None]:
# Test for the second item in the list above
df_clean.loc[ids_for_garage , ['YearBuilt', 'GarageYrBlt', 'YrSold']]

#### Step 5: Drop variables with too much missing data

***Define***

- Drop Alley, FireplaceQu, PoolQC, Fence and MiscFeature from the data frame because they have too much missing data

***Code***

In [None]:
# Define columns to be dropped
cols = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

df_clean = df_clean.drop(cols, axis=1)

***Test***

In [None]:
# Verify that columns are dropped from the data frame (an empty set is expected)
print(set(cols) & set(df_clean.columns))

## Data Saving <a class="anchor" id="saving"></a>

After data wrangling is completed, the cleaned data frame is saved to a CSV file.

In [None]:
# Save cleaned dataframe to a CSV file
df_clean.to_csv('train_cleaned.csv', index=False)
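
One caveat when reloading the saved file: CSV does not store data types, so a column such as MSSubClass that was converted to string will be inferred as integer again unless a type is passed to `pd.read_csv`. The sketch below demonstrates this on a made-up two-row frame, using an in-memory buffer in place of `train_cleaned.csv`; the names `toy` and `buffer` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical toy frame with MSSubClass stored as strings, as in df_clean
toy = pd.DataFrame({'MSSubClass': ['60', '20'], 'LotArea': [8450.0, 9600.0]})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing infers integers for MSSubClass
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Passing dtype keeps MSSubClass as strings
buffer.seek(0)
reloaded_str = pd.read_csv(buffer, dtype={'MSSubClass': str})

print(reloaded_default['MSSubClass'].tolist(), reloaded_str['MSSubClass'].tolist())
```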