### Data Preprocessing
In this notebook, we'll look at the following preprocessing steps:
* Identifying and handling missing values, including ensuring the correct data format (data wrangling/cleaning)
* Data standardization
* Data normalization (centering/scaling)
* Binning
* Indicator variables

In [None]:
# let us start with the regular imports
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
sns.set()


In [None]:
path = "/home/nyangweso/Desktop/Ds_1/Data-Analytics-Sample-Projects/automobile.csv"
headers = [
    "symboling",
    "normalized-losses",
    "make",
    "fuel-type",
    "aspiration",
    "num-of-doors",
    "body-style",
    "drive-wheels",
    "engine-location",
    "wheel-base",
    "length",
    "width",
    "height",
    "curb-weight",
    "engine-type",
    "num-of-cylinders",
    "engine-size",
    "fuel-system",
    "bore",
    "stroke",
    "compression-ratio",
    "horsepower",
    "peak-rpm",
    "city-mpg",
    "highway-mpg",
    "price",
]


In [None]:
df = pd.read_csv(path, names=headers)


In [None]:
df.head()


Viewing the dataset, we notice that missing data appears in the form of a '?'. Let's first replace each '?' with NaN.

In [None]:
df.replace("?", np.nan, inplace=True)
df.head()


In [None]:
missing_data = df.isnull()
missing_data.head()


Let us count the missing values in each column. In the output, `True` marks a missing value and `False` a value that is present.

In [None]:
for column in missing_data.columns.values.tolist():
    print(f"{column}\n{missing_data[column].value_counts()}\n")


Based on the summary above, each column has 205 rows of data, and seven of the columns contain missing values:

<ol>
    <li>"normalized-losses": 41 missing values</li>
    <li>"num-of-doors": 2 missing values</li>
    <li>"bore": 4 missing values</li>
    <li>"stroke": 4 missing values</li>
    <li>"horsepower": 2 missing values</li>
    <li>"peak-rpm": 2 missing values</li>
    <li>"price": 4 missing values</li>
</ol>
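
The per-column counts listed above can also be obtained in a single call. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the automobile data) to show that `isnull().sum()` returns the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the automobile data
toy = pd.DataFrame(
    {
        "bore": ["3.47", "?", "2.68", "?"],
        "price": ["13495", "16500", "?", "13950"],
        "make": ["audi", "bmw", "audi", "volvo"],
    }
)

# Same first step as in the notebook: turn "?" into NaN
toy = toy.replace("?", np.nan)

# Count missing entries per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `df`, the same two lines should give one count per column instead of a separate `True`/`False` table for each.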


<h3 id="deal_missing_values">Deal with missing data</h3>
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>
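
As a rough illustration of the options above, here is a minimal sketch on a made-up frame. The frame `toy` and its values are assumptions for demonstration only; the pandas calls (`dropna`, `fillna`, `mean`, `mode`) are the standard ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame(
    {
        "make": ["audi", "bmw", "audi", "volvo"],
        "stroke": [2.5, np.nan, 3.5, 3.0],
        "num-of-doors": ["four", "two", np.nan, "four"],
        "price": [10000.0, 12000.0, 9000.0, np.nan],
    }
)

# 1a. Drop the whole row when a chosen column is missing
rows_dropped = toy.dropna(subset=["price"])

# 1b. Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

# 2a. Replace by mean (numeric column); mean() ignores NaN
stroke_filled = toy["stroke"].fillna(toy["stroke"].mean())

# 2b. Replace by frequency (most common category)
doors_filled = toy["num-of-doors"].fillna(toy["num-of-doors"].mode()[0])
```

None of these calls modifies `toy` itself; each returns a new object, which makes it easy to compare strategies side by side.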


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing which method to use to replace data; however, some methods are more reasonable than others for a given column. We will apply a different method to different columns:

<b>Replace by mean:</b>

<ul>
    <li>"normalized-losses": 41 missing values, replace them with the mean</li>
    <li>"stroke": 4 missing values, replace them with the mean</li>
    <li>"bore": 4 missing values, replace them with the mean</li>
    <li>"horsepower": 2 missing values, replace them with the mean</li>
    <li>"peak-rpm": 2 missing values, replace them with the mean</li>
</ul>

<b>Replace by frequency:</b>

<ul>
    <li>"num-of-doors": 2 missing values, replace them with "four".
        <ul>
            <li>Reason: 84% of sedans have four doors. Since "four" is the most frequent value, it is the most likely one for the missing entries</li>
        </ul>
    </li>
</ul>

<b>Drop the whole row:</b>

<ul>
    <li>"price": 4 missing values, simply delete the whole row
        <ul>
            <li>Reason: price is what we want to predict. An entry without a price cannot be used for prediction; therefore any row without price data is not useful to us</li>
        </ul>
    </li>
</ul>


##### 1. Replacing missing values with the mean for the columns listed under "Replace by mean"

In [None]:
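def replace_with_mean(columns):
    """Fill NaN in each listed column of the global `df` with that column's mean."""
    for column in columns:
        # the values were read as text, so cast to float before averaging
        numeric = df[column].astype("float")
        mean = numeric.mean()
        # assign the result back to df; a chained replace(..., inplace=True)
        # on df[column] does not update df under pandas copy-on-write
        df[column] = numeric.fillna(mean)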

Replace NaN with the column mean in each of the columns listed under "Replace by mean"

In [None]:
ls = ["normalized-losses", "stroke", "bore", "horsepower", "peak-rpm"]
replace_with_mean(ls)
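
Because these columns contained '?', their values are read as text, so the mean has to be computed on a numeric view of the data. Here is a minimal sketch on a made-up series (`raw` and its values are hypothetical), using `pd.to_numeric` as one possible alternative to `astype("float")`:

```python
import numpy as np
import pandas as pd

# Hypothetical column as it might look after "?" was replaced with NaN
raw = pd.Series(["100", np.nan, "200", "150"], name="horsepower", dtype="object")

numeric = pd.to_numeric(raw)           # strings become floats, NaN stays NaN
column_mean = numeric.mean()           # NaN is skipped by default
imputed = numeric.fillna(column_mean)  # fill the gap with the mean

print(column_mean)
print(imputed.tolist())
```

Filling with the mean leaves the column mean unchanged, which is one reason it is a common default for numeric columns.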


In [None]:
df["stroke"].info()

In [None]:
df["num-of-doors"].value_counts()
# One can alternatively use...
# df['num-of-doors'].value_counts().idxmax()

In [None]:
df["num-of-doors"].value_counts().idxmax()

##### 2. Replacing missing values with the most frequent value for the columns listed under "Replace by frequency"

In [None]:
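# assign the result back to df; a chained replace(..., inplace=True)
# on df["num-of-doors"] does not update df under pandas copy-on-write
most_frequent = df["num-of-doors"].value_counts().idxmax()
df["num-of-doors"] = df["num-of-doors"].fillna(most_frequent)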

In [None]:
df["num-of-doors"].value_counts()
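
Note that `value_counts()` skips NaN by default, so the counts before and after the replacement are easiest to compare with `dropna=False`. The sketch below uses a made-up series (`doors` is hypothetical) to show this:

```python
import numpy as np
import pandas as pd

# Hypothetical door counts with two missing entries
doors = pd.Series(["four", "two", np.nan, "four", np.nan], name="num-of-doors")

counts_default = doors.value_counts()               # NaN is excluded
counts_with_nan = doors.value_counts(dropna=False)  # NaN gets its own row

# The most frequent non-missing value is used as the replacement
most_frequent = counts_default.idxmax()
doors_filled = doors.fillna(most_frequent)

print(counts_with_nan)
print(doors_filled.value_counts(dropna=False))
```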

##### 3. Dropping the rows listed under "Drop the whole row"

In [None]:
# dropping all records with NaN in 'price' column
df.dropna(subset=["price"], axis=0, inplace=True)

# resetting index since some rows were dropped
df.reset_index(drop=True, inplace=True)
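
To see why the index is reset after dropping rows, consider this small sketch on a made-up frame (`toy` and its values are hypothetical). After `dropna`, the surviving rows keep their original labels, which leaves gaps. `reset_index(drop=True)` renumbers them from 0 and discards the old labels instead of storing them in a new column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing prices
toy = pd.DataFrame(
    {
        "make": ["audi", "bmw", "dodge", "honda", "volvo"],
        "price": [13495.0, np.nan, 16500.0, np.nan, 17450.0],
    }
)

kept = toy.dropna(subset=["price"])       # index keeps the original labels
kept_reset = kept.reset_index(drop=True)  # index is renumbered from 0

print(list(kept.index))
print(list(kept_reset.index))
```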

#### Next step >> Ensuring the data is in the correct format