# Exercise 2: Data Types, Distance Functions, Feature Extraction

## Exercise 2-4: KDD Process (Solution)

In this tutorial, we want to gain insight into the big picture of knowledge discovery and data mining. To that end, we work through a practical example and discuss the KDD process along the way.

In general, the steps of the KDD process are:
1. Data Cleaning and Integration
2. Transformation, Selection, Projection
3. Data Mining
4. Evaluation and Visualization

Here, we focus on different aspects of these steps. In doing so, we will get to know useful Python packages and functions.

#### Load dependencies

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Exploration

#### Load dataset

The dataset we are using (`housing.csv`) is the Kaggle dataset "California Housing Prices - Median house prices for California districts derived from the 1990 census" (source: https://www.kaggle.com/camnugent/california-housing-prices).
The data contains features such as population, median income, and median housing price for each block group (in California, a block group typically has a population of 600 to 3,000 people).

Load the dataset into a `pandas.DataFrame`.

In [None]:
cal_housing = pd.read_csv('housing.csv')

Let's look at the data more closely. First, we will print some samples of the dataset:

In [None]:
cal_housing.head()

We can also get a more general overview of the dataset by looking at summary statistics of the data.

In [None]:
cal_housing.describe()

For the preprocessing step (the first step in the KDD pipeline), it is also crucial to know the data types and formats in the dataset.

In [None]:
cal_housing.info()

Another way to get a feel for the numerical attributes is to plot them as histograms.

In [None]:
cal_housing.hist()
plt.show()

Explore the non-numerical attribute.

In [None]:
cal_housing['ocean_proximity'].hist()
plt.show()

Or in a more textual form:

In [None]:
{key:value for key,value in zip(*np.unique(cal_housing['ocean_proximity'], return_counts=True))}
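The same category counts can also be obtained with pandas alone. The following is a minimal sketch on a hypothetical toy column (the values are made up and only stand in for `cal_housing['ocean_proximity']`):

```python
import pandas as pd

# Hypothetical toy column standing in for cal_housing['ocean_proximity'].
toy = pd.Series(['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND', 'INLAND', 'NEAR BAY'])

# value_counts() returns the categories sorted by descending frequency.
counts = toy.value_counts().to_dict()
print(counts)
```

Compared to `np.unique`, `value_counts()` sorts by frequency, which makes rare categories easy to spot.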

#### Correlations in the dataset

Create a scatter plot matrix.

In [None]:
pd.plotting.scatter_matrix(cal_housing, figsize=(16,16))
plt.show()
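A scatter plot matrix gives a visual impression; a numerical complement is the Pearson correlation of each attribute with the target. The sketch below uses hypothetical toy data (the column names mirror the housing dataset, the values are randomly generated for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are synthetic, not taken from housing.csv.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=200)
toy = pd.DataFrame({
    'median_income': income,
    'housing_median_age': rng.uniform(1.0, 50.0, size=200),
    'median_house_value': 40000.0 * income + rng.normal(0.0, 20000.0, size=200),
})

# Pearson correlation of every numeric column with the target, strongest first.
corr_matrix = toy.select_dtypes('number').corr()
corr_with_target = corr_matrix['median_house_value'].sort_values(ascending=False)
print(corr_with_target)
```

On the real data, the same two lines could be applied to `cal_housing` directly. Note that Pearson correlation only captures linear relationships.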

## Preprocessing

#### Data cleaning
Let's search for missing entries (`NaN`) in the data:

In [None]:
nas = cal_housing.isna()
nas.sum()

Option 1: Delete the corresponding entries

In [None]:
cal_housing = cal_housing.dropna(axis=0)

Option 2: Drop the whole feature

In [None]:
#cal_housing = cal_housing.dropna(axis=1)

Option 3: Introduce new values for the missing entries (zero, mean, median, etc.). This has to be done with caution and is omitted here.
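Although the notebook does not use Option 3, a minimal sketch of median imputation might look as follows. The toy frame and the choice of `total_bedrooms` as the affected column are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry.
toy = pd.DataFrame({'total_bedrooms': [100.0, np.nan, 300.0, 500.0],
                    'population': [800.0, 900.0, 1000.0, 1100.0]})

# Median of the observed values (NaN is skipped by default).
median = toy['total_bedrooms'].median()

# Fill the gap with the median of that column.
imputed = toy.fillna({'total_bedrooms': median})
print(imputed)
```

In a real pipeline, the fill value should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.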

#### Create one-hot encoding for categorical attributes

Before that, we replace the category 'ISLAND' with 'NEAR OCEAN', as we can hardly learn anything from 5 samples.

In [None]:
cal_housing['ocean_proximity'] = cal_housing['ocean_proximity'].replace('ISLAND', 'NEAR OCEAN')

In [None]:
one_hot = pd.get_dummies(cal_housing['ocean_proximity'])
one_hot

In [None]:
cal_housing = pd.concat([cal_housing.drop('ocean_proximity', axis=1), one_hot], axis=1)
cal_housing

How else could we encode the ocean proximity feature?

We could consider it as an ordinal feature with [`INLAND`=0] < [`<1H OCEAN`=1] < [`NEAR BAY`=2] < [`NEAR OCEAN`=`ISLAND`=3].
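The ordinal encoding described above could be sketched with a plain dictionary and `Series.map`. The toy column is hypothetical, and whether this ordering really reflects a meaningful "closeness to the ocean" scale is an assumption:

```python
import pandas as pd

# Ordering taken from the text above.
ordinal_map = {'INLAND': 0, '<1H OCEAN': 1, 'NEAR BAY': 2, 'NEAR OCEAN': 3, 'ISLAND': 3}

# Hypothetical toy column standing in for cal_housing['ocean_proximity'].
toy = pd.Series(['INLAND', 'NEAR BAY', 'ISLAND', '<1H OCEAN', 'NEAR OCEAN'])

# Replace each category by its rank.
encoded = toy.map(ordinal_map)
print(encoded.tolist())
```

Unlike one-hot encoding, this uses a single column but forces a linear model to treat the steps between categories as equally large.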

#### Train-test Split
Divide the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(cal_housing, test_size=0.2, random_state=0)

Separate the target (median house value) from the covariates, i.e. create X_train, y_train, X_test, and y_test.

In [None]:
X_train = train_set.drop('median_house_value', axis=1)
X_test = test_set.drop('median_house_value', axis=1)
y_train = train_set['median_house_value']
y_test = test_set['median_house_value']

#### Feature Scaling

Many machine learning algorithms perform best when the numerical input attributes have similar scales. Let's examine sklearn's `StandardScaler` to perform feature scaling on those numerical features.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
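X_train = X_train.to_numpy(dtype=float) # to numpy; the cast keeps boolean one-hot columns from producing an object array
X_test = X_test.to_numpy(dtype=float) # to numpy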
X_train[:,:-4] = scaler.fit_transform(X_train[:,:-4]) # exclude the one-hot features
X_test[:,:-4] = scaler.transform(X_test[:,:-4]) # exclude the one-hot features
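To make explicit what the scaler does: it subtracts the training mean and divides by the training standard deviation, and the test set is transformed with those same training statistics. A minimal sketch on hypothetical toy matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrices standing in for the numerical train/test features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(train)

# StandardScaler uses the population standard deviation (ddof=0) of the training data.
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(scaler.transform(test), manual_test))
```

Fitting on the training data only avoids leaking information about the test set into the preprocessing.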

## Select and Train a Model

Choose one or more models from sklearn and train them.

Here we will train a linear regression model (note: for ordinary least squares, feature scaling does not change the predictions, so the previous scaling step is not strictly necessary for this model).

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)
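The note that scaling is unnecessary for linear regression can be checked empirically. The sketch below uses a hypothetical synthetic regression problem (not the housing data) and compares the predictions with and without standardization:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy problem with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 1000.0])
y = X @ np.array([2.0, 0.5, 0.003]) + rng.normal(scale=0.1, size=100)

X_scaled = StandardScaler().fit_transform(X)

# Fit ordinary least squares once on raw and once on standardized features.
pred_raw = LinearRegression().fit(X, y).predict(X)
pred_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

same_predictions = np.allclose(pred_raw, pred_scaled)
print(same_predictions)
```

The coefficients differ between the two fits, but the predictions agree up to numerical precision. This does not carry over to regularized models such as ridge or lasso, where the penalty depends on the feature scale.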

## Evaluation

Calculate the R² score for the train and test set.

In [None]:
from sklearn.metrics import r2_score

train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))

train_score, test_score
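For reference, R² compares the residual sum of squares of the model with that of a constant predictor (the mean of the targets): R² = 1 - SS_res / SS_tot. A small sketch with hypothetical numbers shows that this matches `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means the model is no better than always predicting the mean, and negative values are possible on unseen data.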

## Visualization

For this part, use an ordinary linear regression model trained with the data above.

Plot the latitudes and longitudes of the houses in the dataset. Color the points according to the predicted house value. What can you see?

Hint: Use `np.argsort()` to plot the houses in ascending order of value (this gives a clearer result).

In [None]:
X_complete = np.concatenate([X_train, X_test], axis=0)
pred_complete = model.predict(X_complete)

In [None]:
order = np.argsort(pred_complete)

In [None]:
plt.scatter(X_complete[order,0], X_complete[order,1], c=pred_complete[order], cmap='coolwarm', alpha=1., s=1)
plt.show()

1. We can see a rough outline of the state of California (not very surprising).
2. Coastal regions seem to have more expensive houses.
3. We can also clearly see that the regions around San Francisco and Los Angeles have the highest house values.

It looks like the predictions are not linear w.r.t. latitude and longitude. How can this be explained?

In [None]:
plt.scatter(X_complete[order,0],pred_complete[order])
plt.show()

The reason is that the other covariates are not held fixed in these plots. Only if we keep them constant can we expect the predictions to change linearly.
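This explanation can be illustrated on a hypothetical two-feature toy model (not the housing model): if one feature is varied on an equally spaced grid while the other is pinned to a constant, the predictions form a straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: column 0 plays the role of longitude,
# column 1 stands for any other covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
toy_model = LinearRegression().fit(X, y)

# Vary only column 0 on a grid and pin column 1 to its mean.
grid = np.linspace(-2.0, 2.0, 50)
X_fixed = np.column_stack([grid, np.full_like(grid, X[:, 1].mean())])
pred_fixed = toy_model.predict(X_fixed)

# On an equally spaced grid, a straight line has constant first differences.
steps = np.diff(pred_fixed)
print(np.allclose(steps, steps[0]))
```

The constant step size equals the coefficient of the varied feature times the grid spacing.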

### Model Inspection

Examine the model and identify important features.

In [None]:
feature_names = list(train_set.columns)
feature_names.remove('median_house_value')
{feature:coef for feature, coef in zip(feature_names, model.coef_)}

As we standardized the numerical features, the absolute values of their coefficients can be seen as indicators of feature importance.

The median income seems to be the strongest predictor.
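To read off such a ranking more easily, the features can be sorted by the absolute value of their coefficients. The names and numbers below are hypothetical placeholders, not the fitted values from this notebook:

```python
import numpy as np

# Hypothetical feature names and coefficients, for illustration only.
names = ['longitude', 'median_income', 'population']
coefs = np.array([-5.0, 7.5, -1.0])

# Sort by magnitude, largest first; the sign only indicates the direction of the effect.
ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

With the notebook's own objects, `names` and `coefs` would correspond to `feature_names` and `model.coef_`.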

Interestingly, longitude and latitude also have a rather strong impact. How would you explain that? Justify your hypothesis with an appropriate experiment.

We exclude longitude and latitude from the features and train the same model again.

In [None]:
model = LinearRegression()

model.fit(X_train[:,2:], y_train)

In [None]:
train_score = r2_score(y_train, model.predict(X_train[:,2:]))
test_score = r2_score(y_test, model.predict(X_test[:,2:]))

train_score, test_score

In [None]:
{feature:coef for feature, coef in zip(feature_names[2:], model.coef_)}

Apparently, the performance hardly decreases. On the other hand, the coefficients of the ocean proximity features grow in absolute value, while the other features' coefficients remain largely the same. That is, the model uses the south-west component of latitude and longitude as an indicator of ocean proximity.