# GBA 6070 - Programming Foundation for Business Analytics
# Dr. Mohammad Salehan
# Assignment 11 - Preprocessing
Enter your name below.

In this assignment you will work with a dataset of cars. Let's start by loading the dataset. Missing values in the dataset are marked with ``?``.

In [None]:
import pandas as pd
cars = pd.read_csv('Cars.csv', na_values='?')
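
To illustrate what ``na_values='?'`` does, here is a minimal sketch on a made-up two-column sample (the CSV text and the ``sample`` name are assumptions for illustration, not part of ``Cars.csv``). Without ``na_values``, a column containing ``?`` would be read as strings; with it, the ``?`` becomes ``NaN`` and the column stays numeric.

```python
import io
import pandas as pd

# Hypothetical sample standing in for Cars.csv
csv_text = "make,horsepower\naudi,102\nbmw,?\n"

# '?' is parsed as a missing value instead of a string
sample = pd.read_csv(io.StringIO(csv_text), na_values='?')
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```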

1. Examine the shape of the ``dataframe``.

In [None]:
cars.shape

2. Check the top 5 rows of the ``dataframe`` to see what it looks like.

In [None]:
cars.head()

3. Examine the number of missing values in each column.

In [None]:
cars.isna().sum()

4. Replace the missing values in ``num-of-doors`` with the most frequent value. Examine the missing values again to make sure missing values for ``num-of-doors`` are removed.

In [None]:
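# Assign the result back: fillna(inplace=True) on a selected column may not update cars in recent pandas versions
cars['num-of-doors'] = cars['num-of-doors'].fillna(cars['num-of-doors'].mode()[0])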
cars.isna().sum()
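
As a small sketch of why ``mode()[0]`` is used: ``mode()`` returns a Series (there can be ties), so ``[0]`` picks the first most frequent value. The toy ``doors`` Series below is made up for illustration.

```python
import pandas as pd

# Toy column with one missing entry (illustrative data only)
doors = pd.Series(['four', 'two', 'four', None, 'four'])

# mode() ignores missing values by default and returns a Series
most_frequent = doors.mode()[0]

# fillna returns a new Series, so the result is assigned to a name
doors_filled = doors.fillna(most_frequent)
print(doors_filled.tolist())
```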

5. Replace the remaining missing values with the mean of each column. Examine the missing values again to make sure all of them are removed.

In [None]:
# numeric_only=True skips the string columns, which have no mean
cars.fillna(cars.mean(numeric_only=True), inplace=True)
cars.isna().sum()
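
The sketch below shows how filling with a Series of column means works: ``fillna`` matches the Series index against column names, so each numeric column gets its own mean and string columns are left alone. The ``toy`` frame is an assumption for illustration.

```python
import pandas as pd

# Toy frame with one string column and one numeric column (illustrative data)
toy = pd.DataFrame({'make': ['audi', 'bmw', 'audi'],
                    'price': [10000.0, None, 20000.0]})

# Column means for numeric columns only; the index holds the column names
column_means = toy.mean(numeric_only=True)

# Each missing value is replaced by the mean of its own column
toy_filled = toy.fillna(column_means)
print(toy_filled)
```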

6. Let's examine distinct values in ``num-of-doors``.

In [None]:
cars['num-of-doors'].unique()

Convert the string values in ``num-of-doors`` to their numeric equivalent (2, 4).

In [None]:
cars['num-of-doors'] = cars['num-of-doors'].apply(lambda x: 2 if x=='two' else 4)
cars['num-of-doors'].unique()
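
An alternative sketch, assuming you prefer an explicit lookup: ``Series.map`` with a dictionary leaves any unexpected value as ``NaN`` rather than converting everything that is not ``'two'`` to 4. The toy data and the ``word_to_number`` name are made up for illustration.

```python
import pandas as pd

# Toy column; 'fuor' is a deliberate typo to show what happens to unknown values
doors = pd.Series(['two', 'four', 'fuor'])

# Explicit word-to-number lookup (hypothetical name)
word_to_number = {'two': 2, 'four': 4}

# Values missing from the dictionary become NaN, which makes them easy to spot
converted = doors.map(word_to_number)
print(converted)
```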

7. Do the same thing as above for ``num-of-cylinders``.

In [None]:
cars['num-of-cylinders'].unique()

In [None]:
# Spell out the word-to-number mapping so it does not depend on the order returned by unique()
mappings = {'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'eight': 8, 'twelve': 12}
cars['num-of-cylinders'] = cars['num-of-cylinders'].map(mappings)
cars.iloc[:, -13:].head()

8. For each ``make``, calculate maximum ``price``, minimum ``city-mpg``, and mean ``horsepower``.

In [None]:
cars.groupby('make').aggregate({'price': 'max',
                               'city-mpg': 'min',
                               'horsepower': 'mean'})
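
To make the dictionary form of the aggregation concrete, here is a minimal sketch on a three-row toy frame (the data is made up). Each key names a column and each value names the function applied to that column within every group.

```python
import pandas as pd

# Toy data: two audi rows and one bmw row (illustrative values)
toy = pd.DataFrame({'make': ['audi', 'audi', 'bmw'],
                    'price': [10, 30, 20],
                    'city-mpg': [24, 19, 21],
                    'horsepower': [100, 120, 180]})

# One aggregation function per column, computed separately for each make
summary = toy.groupby('make').agg({'price': 'max',
                                   'city-mpg': 'min',
                                   'horsepower': 'mean'})
print(summary)
```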

9. Which ``make`` is, on average, the most expensive?

In [None]:
# Select price before averaging so the string columns are not passed to mean()
cars.groupby('make')['price'].mean().sort_values(ascending=False)
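
If only the name of the top make is needed, one possible shortcut is ``idxmax``, which returns the index label of the largest value. This is a sketch on made-up data, not the dataset's actual answer.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'make': ['audi', 'audi', 'bmw'],
                    'price': [10, 30, 50]})

# Average price per make, then the label of the largest average
average_price = toy.groupby('make')['price'].mean()
most_expensive_make = average_price.idxmax()
print(most_expensive_make)
```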

10. Create dummies for all categorical columns.

In [None]:
cars = pd.get_dummies(cars, columns=['make', 'fuel-type', 'aspiration',
                                     'num-of-doors', 'body-style', 'drive-wheels',
                                     'engine-location', 'engine-type', 'fuel-system'])
cars.columns
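
The following sketch shows how ``get_dummies`` arranges its output on a toy frame (made-up data): the encoded column is dropped, the untouched columns keep their place at the front, and one new column per category is appended at the end, named ``<column>_<value>``.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative data)
toy = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas'],
                    'price': [10, 20, 30]})

# 'fuel-type' is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=['fuel-type'])
print(list(encoded.columns))
```

This ordering is why a positional slice can later separate the numeric columns from the dummies.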

11. Normalize all numeric values in the dataset (standardize each column to zero mean and unit variance). Exclude the dummies.

In [None]:
from sklearn import preprocessing
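# The first 17 columns are the numeric ones; the dummies follow them
numeric_cols = cars.columns[:17]
# Assign by column name so the integer columns can take the scaled float values
cars[numeric_cols] = preprocessing.scale(cars[numeric_cols])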
cars.head()
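
As a check on what ``preprocessing.scale`` computes, the sketch below compares it with the formula (x - mean) / std applied per column, using the population standard deviation (NumPy's default, ``ddof=0``). The ``values`` array is made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Toy single-column array (illustrative data)
values = np.array([[1.0], [2.0], [3.0]])

# Library result
scaled = preprocessing.scale(values)

# Manual standardization per column; np.std uses ddof=0 by default
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```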

In [None]:
cars.iloc[:, 9:].head()