# House Prices - Advanced Regression Techniques
Hello everyone! In this notebook we are going to predict house prices, using the sklearn library to solve a regression task.


For this notebook **I would like to thank several authors whose notebooks inspired me to write my own**:
1. [PEDRO MARCELINO, PHD. Comprehensive data exploration with Python ](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)
2. [SERIGNE. Stacked Regressions : Top 4% on LeaderBoard](https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard)
3. [PRADNESH LACHAKE. Multiple Linear Regression and Regularization](https://www.kaggle.com/pradneshlachake/multiple-linear-regression-and-regularization)


# 1. Import libraries
For the regression task we are going to use the [sklearn](https://scikit-learn.org/stable/) library.


In [None]:
import math
import random
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt


from scipy import stats
from scipy.stats import norm, skew 


import pickle

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, \
                                 RidgeCV, Lasso, LassoCV, \
                                 ElasticNet, ElasticNetCV

# 2. Read data
Here we read train and test sets.

In [None]:
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', header = 0)
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
df_train.head(5) # look at the first 5 rows of the train set

In [None]:
df_test.head(5) # look at the first 5 rows of the test set

In [None]:
print(df_train.shape)
print(df_test.shape)

# 2. Visualize data

# 2.1. the relation between target column and some other columns

In [None]:
fig = plt.figure(figsize = (18,10))

fig.add_subplot(121)
plt.scatter(x = df_train['GrLivArea'], y = df_train['SalePrice'], color = "g", edgecolor = 'k')
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")

fig.add_subplot(122)
plt.scatter(x = df_train['TotalBsmtSF'], y = df_train['SalePrice'], color = "m", edgecolor = 'k')
plt.xlabel("TotalBsmtSF")
plt.ylabel("SalePrice")

plt.show()

Here we can see the relationship of **'SalePrice'** with **'GrLivArea'** and **'TotalBsmtSF'**. The first looks like a linear dependency; the second could be exponential, although a linear fit also looks plausible.

# 2.2. distribution of target 'SalePrice' variable

First, let's apply the **describe()** method to look at the summary statistics.

In [None]:
stats = df_train['SalePrice'].describe()
stats

After that, we can plot the distribution of the target variable.

In [None]:
def plot_distribution(df):
    # compute mean and std from the data being plotted, so the legend
    # stays correct when the function is called again after the log transform
    mu, sigma = df['SalePrice'].mean(), df['SalePrice'].std()
    fig = plt.figure(figsize = (20,10))
    df['SalePrice'].plot.kde(color = 'r')
    df['SalePrice'].plot.hist(density = True, color = 'blue', edgecolor = 'k', bins = 100)
    # raw string avoids invalid escape sequences in "\mu" and "\sigma"
    plt.legend([r'Kernel density estimate ($\mu =${:.2f} and $\sigma =${:.2f})'.format(mu, sigma)], loc='best')
    plt.title("Frequency distribution plot")
    plt.xlabel("SalePrice")
    # I don't like "1e6" number notation, so style will be 'plain'
    plt.ticklabel_format(style = 'plain', axis = 'y') 
    plt.ticklabel_format(style = 'plain', axis = 'x') 
    plt.show()


In [None]:
plot_distribution(df_train)

The competition [evaluation](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation) computes RMSE between the logarithm of the predicted price and the logarithm of the observed price, so we'll apply the log transform to our target variable:
*Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally*. In addition, the target variable is right-skewed. As linear models work better with normally distributed data, we need to transform this variable to make it more normally distributed.

In [None]:
df_train["SalePrice"] = np.log1p(df_train["SalePrice"])
plot_distribution(df_train)
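
To illustrate why the log transform makes errors on cheap and expensive houses comparable, here is a minimal sketch on made-up numbers (the prices and predictions below are hypothetical and are not taken from the dataset). It assumes `np.log1p` as in the cell above, and shows that `np.expm1` reverses it.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred = np.array([110_000.0, 180_000.0, 420_000.0])

# log1p(x) = log(1 + x); expm1 is its inverse
y_true_log = np.log1p(y_true)
y_back = np.expm1(y_true_log)

# RMSE computed on the log-transformed values
rmse_log = np.sqrt(np.mean((np.log1p(y_pred) - y_true_log) ** 2))

# A 10% overestimate gives almost the same log error for a cheap house
# and for an expensive one
err_cheap = abs(np.log1p(110_000.0) - np.log1p(100_000.0))
err_expensive = abs(np.log1p(440_000.0) - np.log1p(400_000.0))

print(rmse_log, err_cheap, err_expensive)
```

On the original scale the second error would be four times larger than the first; on the log scale the two are practically identical.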

# 2.3. correlation between variables
Here we calculate the correlation between variables.

In [None]:
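# use numeric columns only: recent pandas versions raise an error on string columns
cor_matrix = df_train.select_dtypes(include='number').corr()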
cor_matrix.style.background_gradient(cmap='coolwarm')

For better understanding, we can extract the correlation between the target variable and the other variables and visualize it separately.

In [None]:
cor_matrix2 = cor_matrix["SalePrice"]
cor_matrix2 = cor_matrix2.to_frame()
cor_matrix2.style.background_gradient(cmap='coolwarm')
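
A column of correlations is often easier to read when it is sorted by absolute value. The following is a small sketch on synthetic data (the `toy` frame and its `Noise` column are made up for illustration; only the column names 'GrLivArea' and 'SalePrice' are borrowed from the dataset).

```python
import numpy as np
import pandas as pd

# Synthetic data: SalePrice depends on GrLivArea, Noise is unrelated
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 100 * area + rng.normal(0, 5000, n),
})

# Correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```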

# 2.4. pairplots

In [None]:
# choosing some columns for plotting pairplots
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF']
pd.plotting.scatter_matrix(df_train[cols], alpha=0.2, figsize=(25, 25), color = 'cyan', edgecolor='k')
plt.show()

# 3. Delete NaNs
We can check dataframe for missing data (NaNs).

In [None]:
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'InPercents'])
missing_data.head(35).style.background_gradient(cmap='autumn')

Let's analyse this to understand how to handle the missing data. I will delete every variable that has more than 8 missing values. According to the results above, the columns that will be deleted are: 'BsmtFinType1', 'BsmtCond', ..., 'MiscFeature' and 'PoolQC'. The question is: will we miss this data? I think not, because I don't think that these variables are very important (otherwise, I don't think that so much of their data would be missing).

In [None]:
# choosing columns where more than 8 cells are missing
mask = (missing_data["Total"] > 8)
missing_data = missing_data.loc[mask]

# dropping these columns from original datasets
df_train = df_train.drop(columns = missing_data.index)
df_test = df_test.drop(columns = missing_data.index)

# fill NaNs with "Unknown"
df_train = df_train.fillna("Unknown") 
df_test = df_test.fillna("Unknown") 

print(df_train.shape)
print(df_test.shape)

The remaining NaNs (now marked as "Unknown") we will replace with the **mode** of the column.

In [None]:
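# fill other NaNs with mode
for col in df_train: 
    # infer_objects() restores numeric dtypes for columns that fillna("Unknown") turned into object
    df_train[col] = df_train[col].replace("Unknown", df_train[col].mode()[0]).infer_objects()
for col in df_test:
    df_test[col] = df_test[col].replace("Unknown", df_test[col].mode()[0]).infer_objects()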
    
print(df_train.shape)
print(df_test.shape)
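
As a possible alternative to the "Unknown" placeholder, the mode can be filled in directly with `fillna`. This is only a sketch on a made-up frame (`toy` and its values are hypothetical); it is not the approach used in the cells above, but it keeps numeric columns numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing cell per column
toy = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],
    "MasVnrArea": [0.0, 0.0, np.nan, 120.0],
})

# Fill each column's missing cells with that column's most frequent value
toy_filled = toy.copy()
for col in toy_filled.columns:
    toy_filled[col] = toy_filled[col].fillna(toy_filled[col].mode()[0])

print(toy_filled)
print(toy_filled.dtypes)
```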

And here we have the final check of the NaNs. The 'Total' column must have 0 in each cell.

In [None]:
# check missing data in df_train after working with missed values
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'InPercents'])
missing_data.head(5).style.background_gradient(cmap='autumn')

In [None]:
df_train.dtypes

# 4. Feature engineering
Now it's time to prepare the data before we train our regression models. First, let's convert some obviously categorical variables (like 'MoSold') to a categorical type. We will do this by converting the data to **"str" type.**

In [None]:
# convert some vars to categorical features
def convert_to_categorical(df):
    df['MSSubClass'] = df['MSSubClass'].apply(str)
    df['OverallCond'] = df['OverallCond'].astype(str)
    # df['YrSold'] = df['YrSold'].astype(str)
    df['MoSold'] = df['MoSold'].astype(str)
    
convert_to_categorical(df_train)
convert_to_categorical(df_test)

print(df_train.shape)
print(df_test.shape)

For the training data, we don't need 'SalePrice', because it is the target variable. In addition, we don't need the 'Id' column, because it says nothing about the inner properties of the house (like size, color, etc.); it is just the number of the house in the dataset.

In [None]:
y_train = df_train["SalePrice"].copy()
x_train = df_train.copy().drop(columns = ["Id", "SalePrice"])
x_test = df_test.copy().drop(columns = ["Id"])

print(x_train.shape)
print(x_test.shape)

In [None]:
x_train.head() # check train data

In [None]:
x_test.head() # check test data

After that, we need to encode our string data. We will apply one-hot encoding, **pd.get_dummies()**, to the categorical variables.

P.S. categorical variables have the **'object' dtype**, which pandas uses for columns that hold arbitrary Python objects such as strings.

In [None]:
x_all = pd.concat([x_train, x_test])
# np.object was removed in NumPy 1.24; the builtin object selects the same columns
categorical_cols = x_all.select_dtypes(include=object).columns
x_all = pd.get_dummies(x_all, prefix=categorical_cols)

x_train = x_all[:len(x_train)]
x_test = x_all[len(x_train):]

print(x_train.shape)
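
The train and test sets are concatenated before encoding so that both end up with the same dummy columns. Here is a minimal sketch of that effect on made-up frames (`toy_train` and `toy_test` are hypothetical, with column names borrowed from the dataset).

```python
import pandas as pd

# The category "Grvl" appears only in the toy train set
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11622, 14267]})

# Encoding the test set alone misses the "Street_Grvl" column
separate_test = pd.get_dummies(toy_test)

# Encoding the concatenated frame and splitting it back keeps columns aligned
toy_all = pd.get_dummies(pd.concat([toy_train, toy_test]))
enc_train = toy_all[:len(toy_train)]
enc_test = toy_all[len(toy_train):]

print(list(separate_test.columns))
print(list(enc_test.columns))
```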


# 5. Normalize data
We will apply min-max normalization in range [0,1].

In [None]:
scaler = MinMaxScaler(feature_range = (0, 1)) # range is [0, 1]

Here we normalize train set.

In [None]:
normed = scaler.fit_transform(x_train.copy())
x_train = pd.DataFrame(data=normed, columns=x_train.columns)
x_train.head() # check the result

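And here we normalize the test set, reusing the scaler fitted on the train set so that both sets are scaled with the same minimum and maximum values.

In [None]:
normed = scaler.transform(x_test.copy())
x_test = pd.DataFrame(data=normed, columns=x_test.columns)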
x_test.head() # check the result
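
A common convention is to fit the scaler on the train set only and then apply the same transformation to the test set. Here is a minimal sketch on made-up numbers (the arrays are hypothetical) showing what that looks like:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train and test sets
toy_train = np.array([[0.0], [50.0], [100.0]])
toy_test = np.array([[25.0], [150.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=0, max=100
test_scaled = toy_scaler.transform(toy_test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the train range fall outside [0, 1] (here 150 maps to 1.5). This is expected, because the same feature value always maps to the same scaled value in both sets.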

# 6. Machine learning (regression)
Here we will apply [sklearn linear models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) to the regression task. 

In [None]:
all_regr_models = [
    LinearRegression(),
    Ridge(), 
    RidgeCV(),
    LassoCV(max_iter=100000),
    ElasticNetCV()  
]

In [None]:
# here we will store RMSEs
all_rmse_train = {}

# here we will store accuracies (explained variance scores)
all_acc_train = {}

In [None]:
# learn all regressors, write accuracy and save trained models in pickle-format
for model in all_regr_models:
    
    # get the regressor name 
    model_name = model.__class__.__name__ 
    print("♦ ", model_name)
    
    # train model
    model.fit(x_train, y_train)
    
    # calculate rmse on train set
    y_train_pred = model.predict(x_train)
    mse_train = mean_squared_error(y_train, y_train_pred)
    rmse_train = math.sqrt(mse_train)
    print("- rmse_train =", round(rmse_train,7))
    
    # calculate model accuracy on train set (explained variance score)
    model_acc_train = explained_variance_score(y_train, y_train_pred)
    print("- model_acc_train =", round(model_acc_train,7))
    
    # save its rmse and accuracy on train set
    all_rmse_train[model_name] = rmse_train
    all_acc_train[model_name] = model_acc_train
    
    # predict data on test set (result variable for competition)
    y_test_pred_log1p = model.predict(x_test)
    y_test_pred = np.expm1(y_test_pred_log1p)
    
    # save model (the "with" block closes the file after writing)
    filename = model_name + '_model.pickle'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    
    # # load model
    # loaded_model = pickle.load(open(filename, 'rb'))
    # result = loaded_model.score(x_train, y_train)     
    
    
    # submit prediction for competition
    if (model_name == 'LassoCV'):
        submission = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': y_test_pred})
        submission.to_csv('submission.csv', index=False)
        print('! submission is successful')
        
    print()
    

As we can see, the best results were shown by LinearRegression and LassoCV. I decided to save the LassoCV results as my submission (because LinearRegression achieved ~0.14 RMSE on the test set on the Leaderboard, while LassoCV achieved ~0.13 RMSE).
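
The RMSE values printed above are measured on the training data, so they tend to be optimistic. One way to get a more realistic estimate without the leaderboard is k-fold cross-validation with `KFold` and `cross_val_score`. The sketch below uses synthetic data from `make_regression` and a plain `Ridge` model as stand-ins; these are assumptions for illustration, and `x_train` and `y_train` could be passed instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only to keep the example self-contained
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5 shuffled folds; each fold is held out once for validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# sklearn maximizes scores, so MSE is reported with a negative sign
neg_mse = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=kf)
cv_rmse = np.sqrt(-neg_mse)

print("RMSE per fold:", cv_rmse)
print("mean RMSE:", cv_rmse.mean())
```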

# 7. Visualize results

In [None]:
def visualize_results(fig, subplot_id, sort_order, all_list, color_map, title, y_label):
    # sort by value; sort_order=True sorts from biggest to smallest
    all_list = dict(sorted(all_list.items(), key=lambda item: item[1], reverse=sort_order))

    # get keys and values as lists to build plot
    keys = list(all_list.keys())
    values = list(all_list.values())

    # color map for bar chart
    color = color_map(np.linspace(0, 1, len(keys)))

    # plot
    fig.add_subplot(subplot_id)
    plt.title(title)
    plt.xlabel('regressors')
    plt.ylabel(y_label)
    plt.bar(keys, values, color=color)
    plt.xticks(rotation = 'vertical')


In [None]:
fig = plt.figure(figsize=(15, 5))
visualize_results(fig, 121, True, all_acc_train, plt.cm.jet, 'ACCURACY OF REGRESSORS', 'accuracy')
visualize_results(fig, 122, False, all_rmse_train, plt.cm.copper, 'RMSE OF REGRESSORS', 'RMSE')
plt.show()

# 8. Conclusion
Thank you for reading my new article! I hope you liked it and found it interesting! Here are some more of my articles:
* [Automobile Customer Clustering (K-means & PCA)](https://www.kaggle.com/maricinnamon/automobile-customer-clustering-k-means-pca)
* [Credit Card Fraud detection sklearn](https://www.kaggle.com/maricinnamon/credit-card-fraud-detection-sklearn)
* [Market Basket Analysis for beginners](https://www.kaggle.com/maricinnamon/market-basket-analysis-for-beginners)
* [Neural Network for beginners with keras](https://www.kaggle.com/maricinnamon/neural-network-for-beginners-with-keras)
* [Fetal Health Classification for beginners sklearn](https://www.kaggle.com/maricinnamon/fetal-health-classification-for-beginners-sklearn)
* [Retail Trade Report Department Stores (LSTM)](https://www.kaggle.com/maricinnamon/retail-trade-report-department-stores-lstm)