EDA, Exploratory Data Analysis
===
Latitude/longitude are given after an unknown conversion. Supposing that this conversion still keeps the original information, how can we extract what is useful for prediction? It is well known that house prices depend on the region where the houses are located; this is why we have to take the lat/lon information seriously.
1. NaN conversion
2. target, $\mathbf{y\Rightarrow\log(1+y)}$ (`np.log1p`), to normalize the target before fitting; later, convert back by $\mathbf{y_p \Rightarrow \exp(y_p)-1}$ (`np.expm1`); or use `np.log` directly if all the target values are greater than 0.
3. latitude/longitude conversion: a°) knn means, b°) DBSCAN, then one-hot conversion;
   `Lasso, 0.7012 ➡︎ 0.6893`; the last one cannot assign a fixed value to fix the data.
4. different models, xgb, lgb, ...; here we try `lightgbm`;
5. stacked models, blended models, ...; install `mlxtend` with pip.
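
The DBSCAN-then-one-hot step for the coordinates could look roughly like the following. This is only a minimal sketch on synthetic coordinates, assuming `scikit-learn` is installed; the `eps` and `min_samples` values and the `region` prefix are illustrative choices, not values taken from this notebook. Since `DBSCAN` in scikit-learn has no `predict` method for unseen rows, a real run would probably have to cluster the train and test coordinates together.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic 'lon'/'lat' values: two well-separated groups of 50 points each
coords = pd.DataFrame(
    np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(10.0, 0.3, size=(50, 2))]),
    columns=['lon', 'lat'])

# One cluster label per row; -1 marks points that DBSCAN treats as noise
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(coords[['lon', 'lat']])

# One-hot encode the cluster labels so that a linear model can use them
region_onehot = pd.get_dummies(pd.Series(labels, name='region'),
                               prefix='region', dtype=int)
print(region_onehot.sum())
```

The resulting columns can then be concatenated to the feature table in place of the raw `lon`/`lat` values.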

Note
---
1. Since 2019/05/17 (`scikit-learn` 0.21.1), two new implementations of gradient boosting trees are supported: `ensemble.HistGradientBoostingClassifier` and `ensemble.HistGradientBoostingRegressor`.
```python
# usage (signature as of scikit-learn 0.21)
from sklearn.experimental import enable_hist_gradient_boosting  # not needed in scikit-learn >= 1.0
from sklearn.ensemble import HistGradientBoostingRegressor
# note: from scikit-learn 1.0 on, this loss is named 'squared_error'
HistGradientBoostingRegressor(loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31,
                              max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256,
                              scoring=None, validation_fraction=0.1, n_iter_no_change=None, tol=1e-07,
                              verbose=0, random_state=None)
```


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm import tqdm,tqdm_notebook
import folium

import seaborn as sns
%matplotlib inline

In [None]:
train_df = pd.read_csv('../dataset/train.csv')
test_df = pd.read_csv('../dataset/test.csv')

In [None]:
print ("Train: ",train_df.shape[0],"sales, and ",train_df.shape[1],"features")
print ("Test:  ",test_df.shape[0],"sales, and ",test_df.shape[1],"features")

In [None]:
def fillnan(df_1,df_2,features):
    # the first two features are filled with 0
    for f in features[:2]:
        df_1[f]=df_1[f].fillna(0)
        df_2[f]=df_2[f].fillna(0)

    # the remaining features are filled with the median over train and test combined
    df=pd.concat([df_1[features],df_2[features]],axis=0)
    for f in features[2:]:
        median=df[f].median()
        df_1[f]=df_1[f].fillna(median)
        df_2[f]=df_2[f].fillna(median)
    return df_1,df_2

nan_features=['parking_area','parking_price','txn_floor','village_income_median']

In [None]:
train_df,test_df=fillnan(train_df,test_df,nan_features)

ToDo
---
Outliers

Data
---
1. Quantitative
   - time-state: 'txn_dt', 'building_complete_dt'
   - non-time:
     'parking_price','building_area','village_income_median','town_population','town_area',
     'town_population_density','I_MIN','II_MIN','III_MIN','IV_MIN','V_MIN','VI_MIN',
     'VII_MIN','VIII_MIN','IX_MIN','X_MIN','XI_MIN','XII_MIN','XIII_MIN','XIV_MIN'
   - location: 'lon','lat'
2. Qualitative
   - 'building_material'(9), 'city'(11), 'total_floor'(29), 'building_type'(5), 'building_use'(10), 'parking_way',
     'parking_area', 'txn_floor', 'doc_rate', 'master_rate', 'bachelor_rate', 'jobschool_rate',
     'highschool_rate', 'junior_rate', 'elementary_rate', 'born_rate',
     'death_rate', 'marriage_rate', 'divorce_rate'
   - 'village'(2899)

In [None]:
quantitative = ['txn_dt', 'building_complete_dt','parking_price','building_area','village_income_median','town_population','town_area',
           'town_population_density','I_MIN','II_MIN','III_MIN','IV_MIN','V_MIN','VI_MIN',
           'VII_MIN','VIII_MIN','IX_MIN','X_MIN','XI_MIN','XII_MIN','XIII_MIN','XIV_MIN',
           'lon','lat']
qualitative=['building_material','city','total_floor','building_type','building_use',
             'parking_way','parking_area','txn_floor','doc_rate', 'master_rate', 
             'bachelor_rate', 'jobschool_rate','highschool_rate', 'junior_rate', 
             'elementary_rate', 'born_rate','death_rate', 'marriage_rate', 'divorce_rate',
             'village']
target=['total_price']

Normality test
---
For the quantitative features, do they follow a normal distribution? The Shapiro-Wilk test, `scipy.stats.shapiro`, helps to filter out the features that do not.
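
As a rough illustration of how `scipy.stats.shapiro` behaves, the sketch below applies it to synthetic samples (not to the competition data): a normal sample, a right-skewed log-normal sample, and the log of the skewed sample. The sample sizes and distribution parameters are arbitrary assumptions. The test returns the statistic W (close to 1 for normal-looking data) and a p-value; a small p-value suggests that the sample is not normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples: one normal, one strongly right-skewed
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# shapiro returns (W statistic, p-value)
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
w_log, p_log = stats.shapiro(np.log(skewed_sample))

print("normal     : W = {:.4f}, p = {:.4g}".format(w_norm, p_norm))
print("skewed     : W = {:.4f}, p = {:.4g}".format(w_skew, p_skew))
print("log(skewed): W = {:.4f}, p = {:.4g}".format(w_log, p_log))
```

With large samples the test tends to reject even mild departures from normality, so the probability plots are worth reading alongside the p-value.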

In [None]:
train_df['total_price_log']=np.log(train_df['total_price'])
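
The round trip between the target and its log scale, as described in the plan at the top, can be checked on a few made-up prices. This is a minimal sketch; the values in `y_demo` are hypothetical and only illustrate that `np.expm1` undoes `np.log1p`, including for a zero value where plain `np.log` would give `-inf`.

```python
import numpy as np

# Hypothetical target values, including 0 to show why log1p can be safer than log
y_demo = np.array([0.0, 10.0, 1_000.0, 1_000_000.0])

y_demo_log = np.log1p(y_demo)      # forward transform: log(1 + y)
y_demo_back = np.expm1(y_demo_log)  # inverse transform: exp(y_p) - 1

print(y_demo_log)
print(y_demo_back)
```

When `np.log` is used for the forward transform instead, predictions have to be converted back with `np.exp`.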

In [None]:
from scipy import stats
from scipy.stats import norm, skew

In [None]:
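def dist_check(y,kind='log'):
    # kind='log' applies a log transform first; any other value leaves y unchanged
    if kind=='log':
        y_c=np.log(y)
    else:
        y_c=y
    plt.figure(figsize=(12,6))
    plt.subplot(121)
    # histogram with a fitted normal curve
    # (note: sns.distplot is deprecated since seaborn 0.11)
    sns.distplot(y_c , fit=norm)
    (mu, sigma) = norm.fit(y_c)
    #print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
    # raw string so that \mu and \sigma are not treated as escape sequences
    plt.legend([r'Normal dist. ($\mu:$ {:.2f}, $\sigma:$ {:.2f} )'.format(mu, sigma)],fontsize=14,
            loc='best')
    plt.title('Convert by %s' %kind)
    plt.ylabel('Frequency')

    plt.subplot(122)
    # normal probability (Q-Q) plot
    res = stats.probplot(y_c, plot=plt)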

In [None]:
y=(train_df['total_price'])

dist_check(y,kind='log')

In [None]:
# any kind other than 'log' plots the raw values without conversion
dist_check(train_df['town_population'],kind='none')

Skewness and Kurtosis
---
\begin{eqnarray}
    \text{Skewness }&=&\frac{E\left[(X-\mu)^3\right]}{\sigma^3}\\
    \text{Kurtosis }&=&\frac{E\left[(X-\mu)^4\right]}{\sigma^4}
\end{eqnarray}
1. Skewness $>0$: long tail on the right side; $<0$: long tail on the left side.
2. Large kurtosis: steep peak.
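
The two definitions can be evaluated directly on a small made-up sample and compared with `scipy.stats`. This is a minimal sketch; the array `x_demo` is hypothetical. Note that pandas' `.skew()` and `.kurt()` use bias-corrected estimators, and `.kurt()` reports excess kurtosis (a normal distribution gives 0 rather than 3), so their numbers differ slightly from the plain moment formulas.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical sample with one large value, i.e. a long right tail
x_demo = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

mu = x_demo.mean()
sigma = x_demo.std()  # population standard deviation (ddof=0)

# Moment definitions: E[(X - mu)^3] / sigma^3 and E[(X - mu)^4] / sigma^4
skew_manual = np.mean((x_demo - mu) ** 3) / sigma ** 3
kurt_manual = np.mean((x_demo - mu) ** 4) / sigma ** 4

print("skewness: manual = {:.4f}, scipy = {:.4f}".format(skew_manual, skew(x_demo)))
print("kurtosis: manual = {:.4f}, scipy = {:.4f}".format(
    kurt_manual, kurtosis(x_demo, fisher=False)))
```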

In [None]:
# skewness and kurtosis of the raw target are too large
print("Skewness: %f" % train_df['total_price'].skew()) 
print("Kurtosis: %f" % train_df['total_price'].kurt())

In [None]:
# try again on the log scale
print("Skewness: %f" % train_df['total_price_log'].skew()) 
print("Kurtosis: %f" % train_df['total_price_log'].kurt())

In [None]:
# same check for another feature, town_population
print("Skewness: %f" % train_df['town_population'].skew()) 
print("Kurtosis: %f" % train_df['town_population'].kurt())

Linear regression assumes that the data are normally distributed. Thus, we want to check whether the quantitative features (continuous variables) follow a normal distribution; if they do not:
```
        convert each feature with |skew| > 1 by logarithm, ln(var);
        translate the data first if any value is smaller than or equal to 0.
```

In [None]:
# check degree of skew (absolute value) of each quantitative feature
skew_df=train_df[quantitative].skew().abs()
skew_df.sort_values(ascending=False)

In [None]:
train_df_skew=train_df.copy()
var_X_ln=skew_df.index[skew_df>1]
for i in var_X_ln:
    col_min=train_df_skew[i].min()
    if col_min<=0:
        # translate by the column minimum so that all values are positive before the log
        train_df_skew[i]=np.log(train_df_skew[i]+abs(col_min)+0.01)
    else:
        train_df_skew[i]=np.log(train_df_skew[i])
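
One possible way to apply the same rule to two tables at once is sketched below. The helper `log_transform_skewed`, its parameters `threshold` and `eps`, and the toy frames are all hypothetical names introduced for illustration; the sketch assumes that skewness is measured on the first table only and that the shift is taken from the minimum over both tables, so that both receive an identical transform.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(train, test, columns, threshold=1.0, eps=0.01):
    """Log-transform columns whose |skew| on `train` exceeds `threshold`."""
    train, test = train.copy(), test.copy()
    skewness = train[columns].skew().abs()
    skewed = skewness.index[skewness > threshold]
    for col in skewed:
        lowest = min(train[col].min(), test[col].min())
        # Shift only when non-positive values exist, so that log is defined
        shift = -lowest + eps if lowest <= 0 else 0.0
        train[col] = np.log(train[col] + shift)
        test[col] = np.log(test[col] + shift)
    return train, test, list(skewed)

# Toy data: 'a' has a long right tail and a zero, 'b' is symmetric
toy_train = pd.DataFrame({'a': [0.0, 1.0, 1.0, 2.0, 100.0],
                          'b': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_test = pd.DataFrame({'a': [0.0, 3.0, 50.0],
                         'b': [2.0, 3.0, 4.0]})

toy_train_t, toy_test_t, transformed = log_transform_skewed(
    toy_train, toy_test, ['a', 'b'])
print(transformed)
print(toy_train_t)
```

Using one shared shift keeps the train and test features on the same scale, which matters once a model fitted on one is applied to the other.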

Practice
---
1. As above, process the test_df data;
2. Repeat the Lasso regression.

In [None]:
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC, Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor

In [None]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

#Validation function
n_folds = 5

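def rmsle_cv(model,X,y):
    # pass the KFold splitter itself so that shuffle and random_state take effect
    # (get_n_splits only returns the number of folds)
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse= np.sqrt(-cross_val_score(model, X.values, y, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)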

In [None]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0001, random_state=1,tol=0.01))

In [None]:
train_df_skew.columns

In [None]:
# Complete the X-y
X =
y = 

In [None]:
score = (rmsle_cv(lasso, X,y))
print("\nLasso score (alpha = {:.5f}): μ {:.4f} (𝝈 {:.4f})\n".format(0.0001,score.mean(), score.std()))
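
To see what the validation step returns before the real `X` and `y` are filled in, the same kind of pipeline can be run on synthetic data. This is only a sketch: `X_demo`, `y_demo`, and the feature names `f1` to `f3` are made up, and the resulting score says nothing about the house-price data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic features and a target that is linear in them plus small noise
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
y_demo = 2.0 * X_demo['f1'] - 1.0 * X_demo['f2'] + rng.normal(scale=0.1, size=200)

model = make_pipeline(RobustScaler(), Lasso(alpha=0.0001, random_state=1, tol=0.01))

# 5-fold cross-validation with shuffling; one RMSE value per fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_demo = np.sqrt(-cross_val_score(model, X_demo.values, y_demo,
                                     scoring="neg_mean_squared_error", cv=kf))
print("RMSE per fold:", np.round(rmse_demo, 4))
print("mean {:.4f} (std {:.4f})".format(rmse_demo.mean(), rmse_demo.std()))
```

When the target is on the log scale, as with `total_price_log`, this RMSE corresponds to a root mean squared logarithmic error on the original prices.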