# UNIST MGE303 Data Mining
## Lab Session 01 | 2020-04-20 (MON)
### Seok-Ju Hahn (sjhahn11512@unist.ac.kr)



## Supervised Learning (Regression): House Price Prediction

### Preparation
- Load packages with the `import` statement and give them short aliases with `as`
- Remember the trio: numpy, pandas, matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Munging
* <a href='#Load-data'>Load data</a>
* <a href='#Handle-data'>Handle data</a>
* <a href='#Split-training-and-test-data'>Split training and test data</a>

#### Load data
* Load data using the `read_csv()` function
  - Pandas reads the data into a `DataFrame` object

In [None]:
df = 

* Check the first few rows of data using `head()` method

#### Handle data
* Select parts of the data with the `iloc[row_index, column_index]` (position-based) and `loc[row_index_name, column_name]` (label-based) indexers
  - `:` means 'all'
  - Attaching `.values` returns a numpy array; without it, selecting a single column returns a `pd.Series` object
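
The difference between the two indexers and the effect of `.values` can be sketched on a toy dataframe. The dataframe `toy`, its row labels, and its values are made up for illustration and are not the lab's housing data.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the housing data (values are made up)
toy = pd.DataFrame(
    {"longitude": [-122.2, -122.3, -118.4],
     "total_rooms": [880, 7099, 1467]},
    index=["a", "b", "c"],
)

first_col = toy.iloc[:, 0]             # position-based: all rows, first column -> pd.Series
first_col_np = toy.iloc[:, 0].values   # same data as a numpy array
rooms_b = toy.loc["b", "total_rooms"]  # label-based: row 'b', column 'total_rooms'

print(type(first_col).__name__, type(first_col_np).__name__, rooms_b)
```

With the default integer index of the housing data, row labels and row positions coincide, so `loc[10, 'total_rooms']` and `iloc[10, ...]` would both point at the 11th row.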

In [None]:
"""
- Get the first column of data
"""


In [None]:
"""
- Get the first column of data attaching `.values`
- CAUTION: in Python, indexing starts at 0
"""


In [None]:
"""
- Get the 11th value of the 'total_rooms' feature
"""


* You can refer to a specific column using `[]` notation or `.column_name` notation

In [None]:
"""
- Get the 'latitude' column
"""


* Check a brief description of the data using the `info()` method
  - Column names
  - Column/row counts
  - Data types
  - Counts of non-null samples

* Check a categorical feature using the `value_counts()` method

* Check summary statistics of the numerical features using the `describe()` method
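
As a rough illustration of what these inspection methods return, here is a sketch on a toy dataframe. The category labels and income values are invented for the example.

```python
import pandas as pd

# Toy data: one categorical column and one numerical column (made-up values)
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "INLAND"],
    "median_income": [2.5, 8.3, 3.1, 4.0],
})

category_counts = toy["ocean_proximity"].value_counts()  # samples per category, most frequent first
summary = toy.describe()                                 # count, mean, std, min, quartiles, max of numeric columns

print(category_counts)
print(summary)
```

Note that `describe()` skips the string column by default, which is why the categorical feature is checked separately with `value_counts()`.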

* Plot a histogram of each column as a sanity check of the data
  - Attach the `hist()` method to your dataframe

##### What can you find in the above histograms?
- The 'median_income' variable is not expressed in dollars. (See the horizontal axis.)
- The maximum values of the 'housing_median_age' and 'median_house_value' variables look strange.
  - They were intentionally capped by the data collector.
  - You don't need to worry about it here, but for other datasets you should detect this kind of artifact and ask the data engineers about it.
- The scales (value ranges) of the predictors differ from each other.
  - This is handled in the data pre-processing stage.

#### Split training and test data
- Remember, you **MUST** split off the test data first to simulate unseen data
- Use the `train_test_split()` function in the scikit-learn package
- If you use the test set while training your models, you will underestimate the generalization error. This is known as __data snooping bias__.
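
A minimal sketch of such a split on toy data follows. The values `test_size=0.2` and `random_state=42` are illustrative assumptions, not settings prescribed by this lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})

# test_size and random_state are illustrative choices
train_part, test_part = train_test_split(toy, test_size=0.2, random_state=42)

# Same seed, same split: this is what makes the experiment reproducible
_, test_again = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_part), len(test_part))
print(test_part.index.equals(test_again.index))
```

The two parts never share a row, so the test part can play the role of unseen data as long as it is left untouched until the final evaluation.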

In [None]:
# import method
from sklearn.model_selection import train_test_split

In [None]:
# set `random_state` for reproducibility
training_set, test_set = 

- Check the sample counts of the training and test sets using the `len()` function

In [None]:
TR_LENGTH = 
TE_LENGTH = 

print(f'Training samples: {TR_LENGTH}, Test samples: {TE_LENGTH}')

### Exploratory Data Analysis (EDA)

In [None]:
# copy the data to avoid damaging the raw training data
data = training_set.copy()

* Draw a scatter plot of your training data to show the house locations
  - Attach the `plot(kind='scatter', x='', y='')` method to your dataframe
  - You can add the `alpha` argument to make the density of the points visible

In [None]:
ax = 
ax.set(xlabel='Longitude', ylabel='Latitude')
plt.show()

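* Calculate Pearson's correlation coefficient (r) between each pair of variables
  - Attach the `corr()` method to the dataframe
  - If `corr()` complains about the string column 'ocean_proximity', select the numeric columns first (e.g. with `select_dtypes(include='number')`)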

In [None]:
correlation_matrix = 
correlation_matrix

* Inspect which variables are highly correlated with house value
  - Select the 'median_house_value' column of the correlation matrix you made, then attach the `sort_values(ascending=False)` method to it
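
The two steps (correlation matrix, then ranking against the target) can be sketched on a toy dataframe. The numbers below are made up so that income tracks house value closely.

```python
import pandas as pd

# Made-up values: income rises with house value, age does not
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "housing_median_age": [30, 10, 25, 5, 20],
    "median_house_value": [100, 210, 290, 410, 500],
})

corr = toy.corr()  # Pearson's r is the default method
ranking = corr["median_house_value"].sort_values(ascending=False)

print(ranking)
```

The target always appears first with a correlation of 1 with itself, so the interesting entries start from the second row.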

In [None]:
correlation_matrix

* It seems that 'median_income' is highly correlated with house price
  - Let us look at a scatter plot again

In [None]:
data.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)
plt.show()

* Yes, it seems better to remove the uppermost horizontal lines of points (caused by the intentional limit set by data engineers).
* You can keep only the samples you want by providing a condition inside `[]`.
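
Condition-based filtering can be sketched as follows. The threshold `CAP` is a hypothetical value for this toy data; for the real data, read the cap off the histogram or scatter plot.

```python
import pandas as pd

toy = pd.DataFrame({
    "median_income": [2.0, 9.5, 4.1, 10.2],
    "median_house_value": [150_000, 500_000, 230_000, 500_000],
})

CAP = 500_000  # hypothetical cap for this toy data only

mask = toy["median_house_value"] < CAP  # boolean Series: True for the rows to keep
filtered = toy[mask]

print(len(toy), "->", len(filtered))
```

Several conditions can be combined with `&` and `|`, with each condition wrapped in parentheses.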

In [None]:
data = 

* You can do more!
  - <u>Create new features</u> by combining existing predictors
  - This kind of work can strengthen our hypothesis (if it is done carefully)

* For example, the number of rooms per household is more informative than 'total_rooms'.
* The same logic applies to 'total_bedrooms' and 'population'
  - Let us make new features: 'rooms_per_household', 'bedrooms_per_room', 'people_per_household'

In [None]:
data['rooms_per_household'] = data['total_rooms'] / data['households']
data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']
data['people_per_household'] = data['population'] / data['households']

In [None]:
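# Check the correlation matrix again with the additional features
# (only numeric columns are used, because 'ocean_proximity' holds strings)
correlation_matrix = data.select_dtypes(include='number').corr()
correlation_matrix['median_house_value'].sort_values(ascending=False)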

### Data Pre-processing
* <a href='#Cleanse-data'>Cleanse data</a>
* <a href='#Scale-data'>Scale data</a>

#### Cleanse data
* Handle missing values
  - Simply remove rows with missing values
  - Impute missing values using the mean, the median, or imputation algorithms (NOT covered today)
  - Collect the data again
* Drop unnecessary columns
* Remove duplicated samples
* Convert categorical data into a numerical representation (encoding)

##### Handle missing values
- Remove rows with missing values
- Attach `dropna()` to the dataframe
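
A small sketch of locating and removing missing values, on a toy dataframe with one invented gap:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
cleaned = toy.dropna()                 # returns a new dataframe; `toy` itself is unchanged

print(missing_per_column)
print(cleaned.shape)
```

Because `dropna()` returns a new dataframe, the result has to be assigned back (for example `data = data.dropna()`) for the removal to take effect.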

In [None]:
# We have missing data in 'total_bedrooms' and 'bedrooms_per_room' features


In [None]:
data = 

##### Drop unnecessary columns
- Since we made new features ('rooms_per_household', 'bedrooms_per_room', 'people_per_household'), let us remove the features that were used to build them
- Attach `drop(columns=['COLUMN_NAME'])` to the dataframe

In [None]:
data = data.drop(columns=['total_rooms', 'total_bedrooms', 'population', 'households'])

In [None]:
data.info()

In [None]:
data.shape

##### Drop duplicated samples
- Attach `drop_duplicates()` to the dataframe
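
For illustration, duplicate removal on a toy dataframe in which the second row repeats the first:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row
deduplicated = toy.drop_duplicates()   # keeps the first occurrence by default

print(n_duplicates, deduplicated.shape)
```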

In [None]:
data = 
data.shape

#### (Advanced) Automation - <a href='#AddFeatures'>AddFeatures</a>
- This process can be automated by writing a simple function
- Automating such a process is important because data mining requires fast prototyping and experimentation

##### Encode categorical feature
- Convert the categorical feature stored as strings ('ocean_proximity') into a numerical representation

In [None]:
# let us first separate numerical and categorical columns
cat_feat = ['ocean_proximity']
num_feat = ['longitude', 'latitude', 'housing_median_age', 'median_income', 'rooms_per_household', 'bedrooms_per_room', 'people_per_household']

- *Beware* that we need to split out the dependent variable first!
- Use the `drop()` method and the `loc[]` indexer!
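
One possible way to separate predictors and target is sketched below on toy data. The names `X_toy`, `y_toy`, and `TARGET` are introduced only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "median_income": [2.0, 4.1, 6.3],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [150_000, 230_000, 340_000],
})

TARGET = "median_house_value"

X_toy = toy.drop(columns=[TARGET])  # every column except the target
y_toy = toy.loc[:, TARGET]          # the target as a pd.Series

print(X_toy.columns.tolist(), y_toy.shape)
```

`drop()` returns a new dataframe, so the original still contains the target column afterwards.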

In [None]:
X_train, y_train = 

- Use `OneHotEncoder` provided by the Scikit-Learn package
- Select only the 'ocean_proximity' column using `[]`, `.COLUMN_NAME`, `loc[]`, or `iloc[]`, and pass it to the `fit_transform()` method of `OneHotEncoder`
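
The fit-on-training, transform-on-test pattern for the encoder can be sketched on toy arrays. The category labels are invented, and the sketch converts the default sparse output with `.toarray()` instead of passing a dense-output flag, since the name of that flag differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column, shaped (n_samples, 1) as the encoder expects
train_cat = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]])
test_cat = np.array([["NEAR BAY"], ["INLAND"]])

encoder_demo = OneHotEncoder()   # default output is a sparse matrix
encoder_demo.fit(train_cat)      # learn the categories from the training data only

train_encoded = encoder_demo.transform(train_cat).toarray()
test_encoded = encoder_demo.transform(test_cat).toarray()  # reuse the learned categories

print(encoder_demo.categories_)
print(test_encoded)
```

Each learned category becomes one output column, so the training and test arrays end up with the same number of columns in the same order.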

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
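# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output`;
# use `OneHotEncoder(sparse_output=False)` if your version rejects `sparse`
encoder = OneHotEncoder(sparse=False)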
encoder.fit(data['ocean_proximity'].values.reshape(-1, 1))
X_train_cat = encoder.transform(data['ocean_proximity'].values.reshape(-1, 1))

# OR
# X_train_cat = encoder.fit_transform(data['ocean_proximity'].values.reshape(-1, 1))

# CAUTION! Fit the encoder on the training data only, and ONLY transform the test set!
"""
example code snippet)
  encoder.fit(training_data)
  training_data = encoder.transform(training_data)
  test_data = encoder.transform(test_data)
"""

In [None]:
X_train_cat.shape

### Now, all data are transformed into numerical values!

#### Scale data
* Feature scaling means transforming the ranges of all **numerical** features so that they are similar to each other. <br> (for categorical features, one-hot encoding is enough)
* Standard scaling (standardization) makes each feature have mean 0 and standard deviation 1.
  - It is **TOTALLY different** from converting the data distribution into a Gaussian distribution!!!
  - Except for models that assume Gaussian-distributed data, such as Linear Discriminant Analysis and Gaussian Mixture Models, <br>
  it is NOT necessary to convert the data distribution to a Gaussian.
  - It only shifts and rescales the feature; the shape of its distribution stays the same
* Feature scaling is especially important for algorithms that:
  - are based on Euclidean distance, like K-means clustering and k-NN (different scales distort the distance measure)
  - use gradient-based optimization, like logistic regression and neural networks (different scales distort the loss surface)
  - treat the scale of a feature as its importance, like PCA
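
To make the point about standardization concrete, here is a sketch on a deliberately skewed synthetic feature. The exponential sample and the helper `skewness` are assumptions introduced for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=3.0, size=(1000, 1))  # clearly non-Gaussian feature

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(skewed)

# The same transformation by hand: z = (x - mean) / std
manual = (skewed - skewed.mean(axis=0)) / skewed.std(axis=0)

def skewness(a):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (a - a.mean()) / a.std()
    return float((z ** 3).mean())

print(scaled.mean(), scaled.std())                # approximately 0 and 1
print(skewness(skewed), skewness(scaled))         # the asymmetry is unchanged
```

The mean and standard deviation change, but the skewness before and after scaling is the same, which shows that standardization does not make the data Gaussian.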

##### Scale numerical features
- Scale numerical features to have mean 0 and standard deviation 1
- Use `StandardScaler` provided by the Scikit-Learn package
- Select the numerical feature columns and call the `fit_transform()` method

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[num_feat].values)

In [None]:
X_train_num.shape

##### Finish up data pre-processing
- Now, we need to concatenate the categorical (one-hot encoded) and numerical (standardized) features!
- This is easily done with the `np.concatenate([*arrays], axis=1)` function
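
Column-wise concatenation can be sketched with two small made-up arrays that stand in for the scaled numerical block and the one-hot block:

```python
import numpy as np

num_part = np.array([[0.5, -1.2], [1.1, 0.3], [-0.7, 0.9]])  # 3 samples, 2 scaled numerical features
cat_part = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])    # 3 samples, 2 one-hot columns

# axis=1 joins the arrays side by side; the row counts must match
combined = np.concatenate([num_part, cat_part], axis=1)

print(combined.shape)
```

Both blocks must come from the same rows in the same order, otherwise the features of different samples get glued together.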

In [None]:
X_train = 
X_train.shape

### Don't forget! You should apply the same process to the test set you made <a href='#Split-training-and-test-data'>here</a> (using the encoder and scaler already fitted on the training data)

#### (Advanced) Automation - <a href='#Pipeline'>Pipeline</a>
- This process can also be automated using the `Pipeline` and `ColumnTransformer` classes
- Automating such a process is important because data mining requires fast prototyping and experimentation

### Model training and evaluation
* <a href='#Train-model'>Train model</a>
* <a href='#Evaluate-model'>Evaluate model</a>

#### Train model
* Choose an appropriate algorithm for your problem setting
* There are tons of ready-made algorithms here: <a href='https://scikit-learn.org/stable/supervised_learning.html'>Scikit-Learn</a>

* To train a model, create a model instance such as `LinearRegression()` and call its `fit()` method with the independent and dependent variables.
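
As a minimal sketch of the create-then-fit pattern, the toy data below lie exactly on the line y = 2x + 1, so the fitted coefficients are easy to check. The arrays are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])  # independent variable, shape (n_samples, n_features)
y_toy = 2.0 * X_toy.ravel() + 1.0               # dependent variable on a noiseless line

model = LinearRegression()  # create the model instance
model.fit(X_toy, y_toy)     # estimate slope and intercept from the data

print(model.coef_, model.intercept_)
```

Note that the independent variables must be two-dimensional (samples by features), even when there is only one feature.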

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = 
lin_reg.fit()

* The performance of the trained model on the training set can be found by calling the `score()` method (which returns R-squared), or by passing the result of the `predict()` method to another metric such as `mean_squared_error`.
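
How the three quantities relate can be sketched on a tiny made-up regression problem. The arrays and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.5, 4.5, 7.0])  # made-up, slightly noisy targets

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

r2 = model.score(X_toy, y_toy)          # R-squared on the data passed in
mse = mean_squared_error(y_toy, y_hat)  # argument order: (y_true, y_pred)
mae = mean_absolute_error(y_toy, y_hat)
rmse = np.sqrt(mse)                     # back in the units of the target

print(f'R2: {r2:.4f}, MAE: {mae:.4f}, MSE: {mse:.4f}, RMSE: {rmse:.4f}')
```

RMSE and MAE are in the same units as the target (dollars for house values), which usually makes them easier to interpret than MSE.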

In [None]:
lin_reg.score()

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred_train = lin_reg.predict()

lin_reg_mse = mean_squared_error()
lin_reg_mae = mean_absolute_error()

print(f'MAE: {lin_reg_mae:.4f}, MSE: {lin_reg_mse:.4f}, RMSE: {np.sqrt(lin_reg_mse):.4f}')

#### Evaluate model
* Internal evaluation
  - Evaluate the performance of the model using only the training data, by simulating a training-test split internally.
  - Bootstrapping (NOT covered), cross-validation
* External evaluation
  - Evaluate the performance of the trained model on unseen data (the test set).

##### Internal evaluation
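- Can be done easily with the `cross_val_score()` function
- Passing the argument `cv=10` runs 10-fold CV
- With `scoring='neg_mean_squared_error'` the returned scores are negated MSE values, so flip the sign before taking the square root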

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X_train, y_train, scoring='neg_mean_squared_error', cv=10)
lin_reg_rmse_cv_scores = np.sqrt(-scores)

In [None]:
print(f'Scores: {lin_reg_rmse_cv_scores},\nMean: {np.mean(lin_reg_rmse_cv_scores):.4f},\nStd: {np.std(lin_reg_rmse_cv_scores):.4f}')

##### External evaluation
- After processing the <a href='#Split-training-and-test-data'>test set</a> you split off above in the same way as the training set, measure the performance of the trained model on it
- Use the `predict()` method

In [None]:
y_pred_test = lin_reg.predict()

lin_reg_mse = mean_squared_error()
lin_reg_mae = mean_absolute_error()

print(f'MAE: {lin_reg_mae:.4f}, MSE: {lin_reg_mse:.4f}, RMSE: {np.sqrt(lin_reg_mse):.4f}')

#### (Advanced) Automation
- The full process can be automated

#### AddFeatures

In [None]:
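class AddFeatures():
    """Add ratio features, drop the raw count columns, and remove rows with missing values."""
    def __init__(self):
        pass

    def transform(self, X, y):
        X = X.copy()  # work on a copy so the caller's dataframe is not modified
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        X['bedrooms_per_room'] = X['total_bedrooms'] / X['total_rooms']
        X['people_per_household'] = X['population'] / X['households']

        # the raw counts are no longer needed once the ratios exist
        X = X.drop(columns=['total_rooms', 'total_bedrooms', 'population', 'households'])

        # drop rows with missing values and keep only the matching targets
        X = X.dropna()
        return X, y[y.index.isin(X.index)]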

#### Pipeline

In [None]:
from sklearn.pipeline import Pipeline

# for numerical features
num_pipeline = Pipeline([('standardization', StandardScaler())])

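# for categorical features
# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use the new name if your version rejects `sparse`
cat_pipeline = Pipeline([('one_hot_encoding', OneHotEncoder(sparse=False))])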

#X_train_num = num_pipeline.fit_transform(X_train[num_feat])
#X_train_cat = cat_pipeline.fit_transform(X_train[cat_feat])
#X_train = np.concatenate([X_train_num, X_train_cat], axis=1)

In [None]:
from sklearn.compose import ColumnTransformer

cat_feat = ['ocean_proximity']
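# include the engineered features from AddFeatures; columns not listed here are dropped by ColumnTransformer
num_feat = ['longitude', 'latitude', 'housing_median_age', 'median_income', 'rooms_per_household', 'bedrooms_per_room', 'people_per_household']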

pipelines = ColumnTransformer([('numeric_features', num_pipeline, num_feat), ('categorical_feature', cat_pipeline, cat_feat)])

#### Full Automation

In [None]:
# Prepare and split data
X_train, y_train = AddFeatures().transform(training_set.drop(columns=['median_house_value'], axis=1, inplace=False), training_set.loc[:, ['median_house_value']])
X_test, y_test = AddFeatures().transform(test_set.drop(columns=['median_house_value'], axis=1, inplace=False), test_set.loc[:, ['median_house_value']])

In [None]:
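# For the test set, call only `transform`: the transformers must be fitted on the training data alone
# Keep the raw dataframes in X_train / X_test, because `full_pipeline` selects columns by name
# and would fail on the numpy arrays returned here
X_train_prepared = pipelines.fit_transform(X_train)
X_test_prepared = pipelines.transform(X_test)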

In [None]:
full_pipeline = Pipeline([('pre_processing', pipelines), ('linear_regression', LinearRegression())])

full_pipeline.fit(X_train, y_train)
cv_score = np.sqrt(-cross_val_score(full_pipeline, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
test_prediction = full_pipeline.predict(X_test)

In [None]:
cv_score

In [None]:
test_prediction