# **Data cleaning**

## Objectives

**Carry out the Business Requirement 2 user story task: data cleaning and preparation ML tasks**
* Find invalid data and correct it where necessary.
* Impute missing data.
* Handle outliers.
* Create a data cleaning pipeline.




## Inputs
* house prices dataset: outputs/datasets/collection/house_prices.csv

## Outputs


---

## Change working directory

Change the working directory from the notebook's folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house price dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')
house_prices_df.dtypes

We know from the data collection notebook that there are no duplicates in the dataset.
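The duplicate claim can be re-checked without reopening the collection notebook. The following is a minimal sketch on a small hypothetical frame, `toy_df`, which stands in for `house_prices_df`; its values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for house_prices_df; values are illustrative only.
toy_df = pd.DataFrame({'LotArea': [8450, 9600, 8450],
                       'SalePrice': [208500, 181500, 208500]})

# duplicated() flags every row that fully repeats an earlier row.
n_duplicates = int(toy_df.duplicated().sum())
print(n_duplicates)
```

Run against `house_prices_df`, a count of 0 would confirm that there are no duplicate rows.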

---

## Invalid data

### Data types

Inspecting the data type of each variable shows no discrepancies from the data type expected for that variable.
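The visual inspection could also be backed by a programmatic check. This is a sketch only: `toy_df` and `expected_categorical` are hypothetical names, and the assumption is that every variable not listed as categorical should be numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for house_prices_df; values are illustrative only.
toy_df = pd.DataFrame({'LotArea': [8450, 9600],
                       'LotFrontage': [65.0, None],
                       'KitchenQual': ['Gd', 'TA']})

# Assumption: variables named here should hold text, all others numbers.
expected_categorical = {'KitchenQual'}

# A column is a mismatch if it is numeric but expected to be categorical,
# or non-numeric but expected to be numeric.
mismatches = [col for col in toy_df.columns
              if is_numeric_dtype(toy_df[col]) == (col in expected_categorical)]
print(mismatches)
```

An empty list means every column has the expected kind of data type.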

### Value ranges

Check that the values of each variable fall within the valid numeric range, or equal one of the valid categorical options, as given in the dataset's metadata.

**First for numeric variables**.

In [None]:
numeric_house_prices_df = house_prices_df.select_dtypes(exclude=['object'])
numeric_house_prices_df.columns.tolist()

In [None]:

def check_value_ranges(variable, value_range):
    """
    Checks whether the non-missing values for a 'house_prices_df' numeric variable are in the valid variable range.

    Args:
        variable (str): name of variable.
        value_range (list): [minimum value, maximum value].

    Returns a boolean indicating whether all values of the variable are in the valid range.

    """
    # drop missing data on a copy, leaving 'house_prices_df' unchanged
    variable_series = house_prices_df[variable].dropna()
    # between() includes both bounds, so the minimum and maximum are both checked
    in_range = variable_series.between(value_range[0], value_range[1])
    return bool(in_range.all())


In [None]:
print('|Variable|Valid range|Data in valid range|')
variable_value_ranges = {'1stFlrSF': [334, 4692], '2ndFlrSF': [0, 2065], 'BedroomAbvGr': [0, 8], 'BsmtFinSF1': [0, 5644],
                         'BsmtUnfSF': [0, 2336], 'TotalBsmtSF': [0, 6110], 'GarageArea': [0, 1418], 'GarageYrBlt': [1900, 2010],
                         'GrLivArea': [334, 5642], 'LotArea': [1300, 215245], 'LotFrontage': [21, 313], 'MasVnrArea': [0, 1600],
                         'EnclosedPorchSF': [0, 286], 'OpenPorchSF': [0, 547], 'OverallCond': [1, 10], 'OverallQual': [1,10],
                         'WoodDeckSF': [0, 736], 'YearBuilt': [1872, 2010], 'YearRemodAdd': [1950, 2010], 'SalePrice': [34900, 755000]}

for variable in numeric_house_prices_df.columns:
    value_range = variable_value_ranges[variable]
    # print one table row per variable, matching the header's column layout
    print(f'|{variable}|{value_range}|{check_value_ranges(variable, value_range)}|')


All non-missing values are in the valid range for each numeric variable.
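If a variable did fail the check, the next step would be to see which values are at fault. The sketch below is one way to list them. It uses a hypothetical frame `toy_df` and hypothetical ranges `toy_ranges`, with deliberately invalid entries, rather than the real dataset.

```python
import pandas as pd

# Hypothetical data with one out-of-range value per variable.
toy_df = pd.DataFrame({'OverallQual': [5, 11, None, 7],
                       'YearBuilt': [1995, 2003, 1850, None]})
toy_ranges = {'OverallQual': [1, 10], 'YearBuilt': [1872, 2010]}

out_of_range = {}
for variable, (low, high) in toy_ranges.items():
    # Ignore missing data, then keep only values outside [low, high].
    values = toy_df[variable].dropna()
    invalid = values[~values.between(low, high)]
    if not invalid.empty:
        out_of_range[variable] = invalid.tolist()
print(out_of_range)
```

An empty dictionary corresponds to every variable passing the range check.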

**Now for categorical variables**.

In [None]:
categorical_house_prices_df = house_prices_df.select_dtypes(include=['object'])
categorical_house_prices_df.columns.tolist()

In [None]:
import numpy as np

# include NaN as a valid value 
result_df = categorical_house_prices_df.isin({'BsmtExposure': ['Gd', 'Av', 'Mn', 'No', 'None', np.nan], 'BsmtFinType1': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None', np.nan],
                                  'GarageFinish': ['Fin', 'RFn', 'Unf', 'None', np.nan], 'KitchenQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', np.nan]})
for col in result_df.columns:
    print(result_df[col].value_counts())

All values are valid for the categorical variables (allowing for missing data).
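Counting `True` and `False` shows whether invalid values exist, but not what they are. The sketch below names any unexpected options directly. `toy_df` and `toy_valid` are hypothetical, and the misspelt `'Good'` entry is planted for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Good' is not a valid KitchenQual option.
toy_df = pd.DataFrame({'KitchenQual': ['Gd', 'TA', np.nan, 'Good'],
                       'GarageFinish': ['Fin', 'None', 'Unf', np.nan]})
toy_valid = {'KitchenQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po'],
             'GarageFinish': ['Fin', 'RFn', 'Unf', 'None']}

invalid_values = {}
for col, options in toy_valid.items():
    # Missing data is allowed, so drop it before comparing with the options.
    present = toy_df[col].dropna()
    unexpected = sorted(set(present) - set(options))
    if unexpected:
        invalid_values[col] = unexpected
print(invalid_values)
```

An empty dictionary would match the finding that all categorical values are valid.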