# Problem Set 2
Applied Machine Learning (Spring 2020)<br>
Instructor: Rahman Peimankar (abpe@mmmi.sdu.dk)

Due: March 20, 11:00.

The goal of this assignment is to review the topics of Lesson 4 (Preprocessing and Feature Transformation) and Lesson 5 (Linear Models for Regression).

# Part 1: ColumnTransformer

As we learnt in the lecture, transforming data in machine learning algorithms/pipelines poses some challenges, such as:
* Different data types need different transformations.
* A proper transformation approach has to be chosen for each feature.

As you know, it is of utmost importance to prepare data before starting any modelling. 

For example, missing values should be replaced, numerical values should be scaled, and categorical variables need to be one hot encoded.

To transform the data in scikit-learn, there are different classes such as [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) for replacing missing values, [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) for scaling numerical features, and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for encoding categorical features.

As an example:

``scaler = MinMaxScaler()``<br>
``scaler.fit(train_x)``<br>
``train_x = scaler.transform(train_x)``<br>

In addition, several data transformations can be chained using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), for example:

``transformPipeline = Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])``<br>
``train_x = transformPipeline.fit_transform(train_x)``<br>

In machine learning projects, you usually need to apply different preprocessing to different columns of your data. For example, you may impute the missing **numerical** values, while for the **categorical** columns you impute the missing values using the most frequent value and then **one hot encode** the categories. Done by hand, this means you need to split your dataset into numerical and categorical data, apply the required transformations to each variable type separately, and then combine the results again to form the entire transformed dataset.

**BUT**, now the ColumnTransformer can be used to do all these operations in one place for you!
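
A minimal sketch of this idea on a made-up toy dataframe is given here. The column names `colour` and `size`, the values, and the names `num_branch`, `cat_branch`, and `toy_ct` are hypothetical and are not part of the assignment data or its solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: one categorical and one numerical column,
# each with one missing value.
toy = pd.DataFrame({
    "colour": pd.Series(["red", "blue", np.nan, "red"], dtype=object),
    "size": [1.0, np.nan, 3.0, 5.0],
})

# Numerical branch: impute the median, then scale to [0, 1].
num_branch = Pipeline(steps=[("i", SimpleImputer(strategy="median")),
                             ("s", MinMaxScaler())])
# Categorical branch: impute the most frequent value, then one hot encode.
cat_branch = Pipeline(steps=[("i", SimpleImputer(strategy="most_frequent")),
                             ("o", OneHotEncoder())])

# Each entry is (name, transformer, list of columns it is applied to).
toy_ct = ColumnTransformer(transformers=[
    ("num", num_branch, ["size"]),
    ("cat", cat_branch, ["colour"]),
])

toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)  # (4, 3): one scaled column plus two one hot columns
```

The split, transform, and combine steps all happen inside `fit_transform`, so no manual bookkeeping of column subsets is needed.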

### Applying ColumnTransformer on the Abalone Dataset

In this exercise, we use the abalone dataset. The aim is to prepare the dataset using ColumnTransformer. 

You can learn more about the abalone dataset from the link below. It is already downloaded for you to use in this assignment!

[Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone)

The aim of this dataset is to predict the age of an abalone using different measurements.

The dataset has 

* 4177 instances/samples,
* 8 input variables of different types, and
* an integer target variable.


Q1. Please read the dataset using the pandas ``read_csv`` function and name your dataframe ``df``. Remember that the dataset does not have a header row, so you should set the ``header`` argument of ``read_csv`` accordingly.

**Hint**: The best way to check whether you have read the dataset correctly is to check the shape of the dataframe by running ``df.shape`` which should give you the output as ``(4177, 9)``.
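
To see what the ``header`` argument does, here is a small sketch that reads a made-up two-row CSV from memory instead of from a file. The text in `csv_text` and the name `toy_df` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row CSV without a header line, standing in for a real file.
csv_text = "M,0.455,15\nF,0.530,9\n"

# header=None tells pandas that the first line is data,
# so the columns are simply numbered 0, 1, 2, ...
toy_df = pd.read_csv(io.StringIO(csv_text), header=None)
print(toy_df.shape)          # (2, 3)
print(list(toy_df.columns))  # [0, 1, 2]
```

Without `header=None`, pandas would treat the first data row as column names and the dataframe would have one row less.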

In [None]:
import pandas as pd
# YOUR CODE HERE
raise NotImplementedError()

Q2. Split the dataset so that the last column is removed and saved as ``y``, and the remaining 8 columns are saved as ``X``.

**Hint**: you may use ``drop`` function.
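
As an illustration of ``drop``, the sketch below separates the last column of a made-up dataframe. The data and the names `toy_df`, `toy_last`, `toy_X`, and `toy_y` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with integer column labels 0, 1, 2.
toy_df = pd.DataFrame([[0.1, 0.2, 7], [0.3, 0.4, 9]])
toy_last = len(toy_df.columns) - 1

toy_y = toy_df[toy_last]               # the last column as a Series
toy_X = toy_df.drop(columns=toy_last)  # everything except the last column
print(toy_X.shape, toy_y.tolist())     # (2, 2) [7, 9]
```

Note that `drop` returns a new dataframe and leaves `toy_df` unchanged.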

In [None]:
last_column = len(df.columns) - 1
# YOUR CODE HERE
raise NotImplementedError()

Q3. Now it is time to figure out which columns are numerical (both *'float64'* and *'int64'* types) and which are categorical (either *'object'* or *'bool'* type).

You should name the numerical and categorical column lists `numerical_col` and `categorical_col`, respectively. 

**Hint**: you may use ``select_dtypes`` function on the `X` dataframe. Read more about `select_dtypes` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html).
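
The following sketch shows how `select_dtypes` behaves on a made-up dataframe with one column of each type. The column names and the variables `toy_X`, `toy_numerical`, and `toy_categorical` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with one column per dtype of interest.
toy_X = pd.DataFrame({
    "kind": pd.Series(["a", "b", "a"], dtype=object),  # object
    "flag": [True, False, True],                       # bool
    "length": [0.1, 0.2, 0.3],                         # float64
    "count": [1, 2, 3],                                # int64
})

# select_dtypes returns a sub-dataframe; .columns gives the matching labels.
toy_numerical = toy_X.select_dtypes(include=["float64", "int64"]).columns
toy_categorical = toy_X.select_dtypes(include=["object", "bool"]).columns
print(list(toy_numerical))    # ['length', 'count']
print(list(toy_categorical))  # ['kind', 'flag']
```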

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Q4. It is time to apply `ColumnTransformer`. We should just one hot encode the first column using `OneHotEncoder` and scale the numerical columns (2-8) using `MinMaxScaler`. Name the ColumnTransformer in your code `ct`.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# 2. kNN Imputation
* As we discussed in the lecture, our dataset may have some missing values. 
* For example, data may be unavailable for some rows in the dataset, either partially or completely.
* Data can be missing for different reasons, such as measurement problems and unavailability.
* Most machine learning algorithms require numerical input values to be able to generate proper outputs, and missing values can prevent the algorithms from working properly.
* The process of finding missing values in a dataset and replacing them with proper values is called data imputation.

### Data Imputation Using ML Model 
* As we learnt in the lecture, one way to impute data is to fit a model to predict the missing values.
* Such a model should be created for the features/attributes in the data that have missing values.
* As an example, kNN can be used to impute a sample by averaging the *closest* samples to it (neighbors) in the training set. 
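
To make the neighbor averaging concrete, here is a small sketch on a made-up array with a single missing value. The data and the names `toy`, `toy_imputer`, and `toy_filled` are hypothetical, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature.
toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 8.0],
])

# Distances are computed on the features that are present in both rows.
# Based on the first feature, rows 0 and 2 are the two closest to row 1,
# so the missing value becomes the mean of 2.0 and 6.0.
toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy)
print(toy_filled[1])  # [2. 4.]
```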

### Dataset
In this exercise, the Horse Colic Dataset is used, which includes medical features of horses with colic. The dataset records whether the horses lived or died. You can read more about the dataset [here](https://archive.ics.uci.edu/ml/datasets/Horse+Colic).

In [None]:
import pandas as pd

Q5. First, please import the dataframe and name it as `df`.

In [None]:
def importDF():
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

If you look into the dataframe by using `importDF().head()`, you can see that the missing values are marked as `?`.

Q6. Now you should replace these `?` with `NaN`. 

**Hint**: one way to do this can be by passing `na_values` in the `read_csv` function when reading the dataset. Please read more about `na_values` on this [page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
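
A minimal sketch of `na_values` on a made-up CSV read from memory is shown here. The text in `csv_text` and the name `toy_df` are hypothetical. It also previews how `.isnull()` and `.sum()` count the missing values per column.

```python
import io
import pandas as pd

# Hypothetical CSV in which missing entries are written as "?".
csv_text = "1,?,3\n4,5,?\n"

# na_values="?" makes pandas read every "?" as NaN.
toy_df = pd.read_csv(io.StringIO(csv_text), header=None, na_values="?")

# isnull() marks the NaN cells; sum() counts them per column.
print(toy_df.isnull().sum().tolist())  # [0, 1, 1]
```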

In [None]:
def replaceMissingValues():
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

Run the cell below to check if the missing values have been replaced with NaN.

In [None]:
replaceMissingValues()

Q7. Complete the code below to report the number of rows with missing values for each column.

**Hint**: You may use `.isnull()` function in pandas and then count the number of missing values for each column in the for loop below using `.sum()` function.

In [None]:
df = replaceMissingValues()
for col in range(df.shape[1]):
    # YOUR CODE HERE
    raise NotImplementedError()
    print('Column {} has {} missing values.'.format(col, n_miss.iloc[0]))

Now it is time to use `KNNImputer` from `sklearn` to implement the kNN imputation. You can read more about `KNNImputer` in [this link](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html).

Q8. Create an imputer object using `KNNImputer` such that `n_neighbors=5`, `weights='uniform'`, and `metric='nan_euclidean'`. Read more about these parameters in the link above.

In [None]:
from sklearn.impute import KNNImputer

In [None]:
def knnImputer():
    # YOUR CODE HERE
    raise NotImplementedError()
    return imputer

You should now be able to apply `.fit()` function on your created `imputer` object above. To do this, you should prepare the input data for the `.fit()` function.

Q9. Complete the code below to first split the input features ($X$) and the labels/outputs ($y$). Then, you can use $X$ to apply the `.fit()` function on the `imputer` object.

**Hint**: Column 23 should be considered as output ($y$). Please check the dataset description file (*horse-colic-names.txt*).

**Attention**: The operation in line 4 (cell below) is called a *list comprehension* in Python. A list comprehension is simply a way to build a list from a for loop in a single expression. You can read more about it [here](https://jakevdp.github.io/WhirlwindTourOfPython/11-list-comprehensions.html).
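
As a small illustration, the sketch below builds the same list twice, once with an explicit for loop and once with a list comprehension. The names `n_columns`, `skip`, `idx_loop`, and `idx_comp` are hypothetical.

```python
n_columns = 5
skip = 2

# Explicit loop: collect every column index except the one to skip.
idx_loop = []
for col in range(n_columns):
    if col != skip:
        idx_loop.append(col)

# The same list built with a list comprehension.
idx_comp = [col for col in range(n_columns) if col != skip]

print(idx_comp)              # [0, 1, 3, 4]
print(idx_loop == idx_comp)  # True
```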

In [None]:
def splitData():
    df = replaceMissingValues() 
    data = df.values
    idx = [col for col in range(data.shape[1]) if col != 23]
    # YOUR CODE HERE
    raise NotImplementedError()
    return X, y

Q10. In the function below, fit the `imputer` object on the training set `X` using the `.fit()` function. Then, transform the training set (`X`) by applying the `transform` function, name the transformed data `Xtransformed`, and return it.

In [None]:
def fitImputer():
    imputer = knnImputer()
    # YOUR CODE HERE
    raise NotImplementedError()
    return Xtransformed

**Hint**: If you would like to test your implementation, you can run the cell below and it should print zero as number of missing values.

In [None]:
from numpy import isnan
print('Number of missing values: %d' % sum(isnan(fitImputer()).flatten()))

# Part 2: Linear Regression

In this exercise, we will practice a step-by-step process for implementing Linear Regression with `sklearn`. As you learnt in the lecture, the goal of regression problems is to estimate continuous values as targets. Linear regression is a way to model the linear relationship between input features and target variables.
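
As a quick illustration of this idea, the sketch below fits `LinearRegression` to made-up points that lie exactly on the line $y = 2x + 1$. The data and the names `toy_X`, `toy_y`, and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1.
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = 2.0 * toy_X.ravel() + 1.0

toy_model = LinearRegression().fit(toy_X, toy_y)

# The fitted slope and intercept should be close to 2 and 1.
print(toy_model.coef_, toy_model.intercept_)
# score() returns the coefficient of determination (R^2);
# it is close to 1 here because the fit is essentially perfect.
print(toy_model.score(toy_X, toy_y))
```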

In this exercise we use the [California Housing Prices](https://www.kaggle.com/camnugent/california-housing-prices) dataset. The dataset has been put in the same folder as the assignment for you to use.

You will apply Linear Regression and ColumnTransformer using a `sklearn` pipeline. 

Q11. Please import the dataset using pandas and name it as `df_housing`

In [None]:
import pandas as pd
# YOUR CODE HERE
raise NotImplementedError()

**Hint**: There are some missing values in the `total_bedrooms` feature, as you can see in the cell below.

In [None]:
df_housing.isna().sum()

Q12. Substitute the missing values using the `median` of `total_bedrooms`. A copy of the original dataframe, `df_filled`, has been created inside the function so that you can use it in your implementation.

**Hint**: You may set `inplace=True` using the `.fillna()` function. Read more about it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).
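
A minimal sketch of median filling on a made-up column is shown here. The column name `rooms` and the names `toy_df` and `toy_median` are hypothetical. This sketch assigns the result back to the column instead of using `inplace=True`, because in recent pandas versions an in-place call on a selected column may not update the parent dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy_df = pd.DataFrame({"rooms": [2.0, np.nan, 4.0, 10.0]})

# median() skips NaN, so this is the median of 2, 4 and 10.
toy_median = toy_df["rooms"].median()

# Fill the gap and assign the result back to the column.
toy_df["rooms"] = toy_df["rooms"].fillna(toy_median)
print(toy_df["rooms"].tolist())  # [2.0, 4.0, 4.0, 10.0]
```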

In [None]:
def missingMedian(df_housing):
    df_filled = df_housing.copy()
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_filled

Now it is time to use ColumnTransformer and Pipeline on our dataset!

First, we divide numeric and categorical columns of our dataset. There is only one categorical feature, which is `ocean_proximity`. The rest are numeric features. Look at the cell below.

In [None]:
missingMedian(df_housing).dtypes

We make two pipelines for both numeric and categorical columns. Name the two numeric and categorical pipelines as `numPipe` and `catPipe`, respectively.

Q13. First make the numeric pipeline (`numPipe`) and apply `PolynomialFeatures` and `StandardScaler`.

**Important**: The degree of polynomial should be set to two (`degree=2`) and the parameters of `StandardScaler` should be left as default. And you should name the steps of the Pipeline as `poly` and `scaler` for `PolynomialFeatures` and `StandardScaler`, respectively.
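
To see what `degree=2` produces, the sketch below expands a single made-up sample with two features $a=2$ and $b=3$. The names `toy` and `toy_poly` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample with two features, a = 2 and b = 3.
toy = np.array([[2.0, 3.0]])

# With degree=2 the output columns are: 1, a, b, a^2, a*b, b^2.
toy_poly = PolynomialFeatures(degree=2)
toy_expanded = toy_poly.fit_transform(toy)
print(toy_expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

With 8 numeric input features, the same expansion yields many more columns, which is one reason for scaling them afterwards.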

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# YOUR CODE HERE
raise NotImplementedError()

Q13. Now make the categorical pipeline (`catPipe`) and apply `OneHotEncoder`. Name the step of the Pipeline `onehot`.

**Important**: Set the `handle_unknown='ignore'` for `OneHotEncoder`!
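
The effect of `handle_unknown='ignore'` can be sketched on made-up categories as follows. The labels `north`, `south`, and `west` and the names `toy_enc` and `toy_encoded` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on two hypothetical categories.
toy_enc = OneHotEncoder(handle_unknown="ignore")
toy_enc.fit(np.array([["north"], ["south"]]))

# A category not seen during fit is encoded as all zeros
# instead of raising an error.
toy_encoded = toy_enc.transform(np.array([["west"]])).toarray()
print(toy_encoded)  # [[0. 0.]]
```

This matters after a train/test split, where a rare category may appear only in the test set.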

In [None]:
from sklearn.preprocessing import OneHotEncoder

# YOUR CODE HERE
raise NotImplementedError()

Q14. You have preprocessed both numeric and categorical features. So, combine them using `ColumnTransformer`. Name the `ColumnTransformer` `colTrans`.

**Hint**: The cells below divide the numeric and categorical features.

In [None]:
numFeat = ['longitude','latitude','housing_median_age', 'total_rooms','total_bedrooms', 'population',
           'households','median_income']
catFeat = ['ocean_proximity']

In [None]:
from sklearn.compose import ColumnTransformer

# YOUR CODE HERE
raise NotImplementedError()

So far, you have completed the preprocessing step and you have a `ColumnTransformer` called `colTrans`. This is very good news!

Now, you can make another Pipeline to integrate `LinearRegression` model and the `ColumnTransformer`.  

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

lr_pipe = Pipeline(steps=[('preprocessing', colTrans),
                      ('classifier', LinearRegression())])

We are now ready to start the fun part! We can train our LinearRegression model. However, first we need to split inputs and target and name them as `X` and `y`.

**Hint**: You may use `df_filled` dataframe, which is the output of `missingMedian(df_housing)` function. The target (`y`) is the *median_house_value* column.

Q15. Define inputs and target data. 

In [None]:
df_filled = missingMedian(df_housing)
# YOUR CODE HERE
raise NotImplementedError()

Q16. Split the inputs and target into train and test sets using `train_test_split`. You should set the `random_state = 42`.
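
A small sketch of `train_test_split` on made-up arrays is given here. The data and the `toy_` names are hypothetical, and the default `test_size` (25% of the samples) is assumed because no other value is specified above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# With the default test_size, 10 samples are split into 7 train and 3 test.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42)
print(toy_X_train.shape, toy_X_test.shape)  # (7, 2) (3, 2)

# A fixed random_state makes the split reproducible.
repeat = train_test_split(toy_X, toy_y, random_state=42)
print(np.array_equal(repeat[1], toy_X_test))  # True
```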

In [None]:
from sklearn.model_selection import train_test_split
def splitData(X, y):
    # YOUR CODE HERE
    raise NotImplementedError()
    return X_train, X_test, y_train, y_test

Q17. Train the Linear Regression model using `X_train` and `y_train` by applying `.fit()` function on the created Pipeline `lr_pipe`.

Then, evaluate the model using `.score()` function on both train and test sets.

**Hint**: For a regression model, `.score()` returns the coefficient of determination ($R^2$). The scores of your model on the train and test sets should be around 0.75 and 0.67 (75% and 67%), respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()