### EDA and Data Preparation


In this notebook we perform Exploratory Data Analysis (EDA) and apply some data preparation techniques.<br>
The work is divided into the following steps:<br>
- Load necessary libraries<br>
- Load data<br>
- First look at the dataset<br>
    * Shape
    * Check for missing values
- Data Preparation
    * Find new features
    * Outlier handling
    * Encoding categorical data

### Load necessary libraries

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

### Load data

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

test_df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/test.csv')
train_df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')

### First look

In [None]:
train_df.head()

#### Shape

In [None]:
train_df.shape

#### Missing values

In [None]:
train_df.isnull().sum()

In [None]:
train_df.info()
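The raw counts from `isnull().sum()` are easier to judge next to the share of rows they represent. A minimal sketch on a toy frame (the column names and values here are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (hypothetical values).
toy_df = pd.DataFrame({
    "SQUARE_FT": [1200.0, np.nan, 950.0, 1500.0],
    "CITY": ["Pune", "Mumbai", None, "Delhi"],
    "PRICE": [55.0, 120.0, 40.0, 95.0],
})

# Count and percentage of missing values per column, largest first.
missing_report = pd.DataFrame({
    "missing": toy_df.isnull().sum(),
    "percent": (toy_df.isnull().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing_report)
```

The same two expressions can be applied to `train_df` directly.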

### Data Preparation

As we can see, the ADDRESS column contains two comma-separated parts. Let's split it and add a column with the city to the dataset.<br>
The **city** may be an important feature for the prediction model.

In [None]:
train_df['ADDRESS_PART1'] = train_df['ADDRESS'].apply(lambda x: x.split(',')[0].strip())
train_df['CITY'] = train_df['ADDRESS'].apply(lambda x: x.split(',')[1].strip())

In [None]:
train_df.head()
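Indexing the split result with `[1]` raises an `IndexError` for any address that has no comma. A possible alternative, sketched on made-up addresses, is the vectorized `str.rsplit` with `expand=True`, which pads missing parts instead of failing. Treating the text after the last comma as the city is my assumption here, not something the dataset guarantees:

```python
import pandas as pd

# Hypothetical addresses: one regular, one with an extra comma, one without a comma.
toy = pd.DataFrame({"ADDRESS": [
    "Ksfc Layout,Bangalore",
    "Phase 2, Whitefield, Bangalore",
    "Sector 5",
]})

# Split once from the right; rows without a comma get a missing value in column 1.
parts = toy["ADDRESS"].str.rsplit(",", n=1, expand=True)
toy["ADDRESS_PART1"] = parts[0].str.strip()
toy["CITY"] = parts[1].str.strip()

print(toy)
```

Rows where `CITY` ends up missing can then be inspected separately instead of stopping the whole cell.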

Count the unique values in the CITY column

In [None]:
len(train_df['CITY'].unique())

Check the correlation between the numeric columns

In [None]:
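# Restrict to numeric columns: recent pandas versions raise an error on text columns
train_df.select_dtypes(include='number').corr()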

In [None]:
plt.figure(figsize=(10,8))
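# Restrict to numeric columns: recent pandas versions raise an error on text columns
sns.heatmap(train_df.select_dtypes(include='number').corr())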

Unsurprisingly, the strongest correlation is between the price and the area of the flat.
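A heatmap can be hard to read exactly, so it may help to list the correlations with the target as a sorted column. This is a sketch on synthetic data. The column names `ROOMS` and `PRICE` and the generated numbers are hypothetical and only mimic a price that depends on area:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (hypothetical columns and values).
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "SQUARE_FT": area,
    "ROOMS": rng.integers(1, 5, size=200),
    "PRICE": area * 0.05 + rng.normal(0, 10, size=200),
    "CITY": ["A", "B"] * 100,  # text column, excluded from the correlation
})

# Correlation of every numeric feature with the target, strongest first.
target_corr = (
    toy.select_dtypes(include="number")
       .corr()["PRICE"]
       .drop("PRICE")
       .sort_values(key=np.abs, ascending=False)
)

print(target_corr)
```

On the real data the same chain would use `'TARGET(PRICE_IN_LACS)'` in place of `PRICE`.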

#### Outlier Handling

**Outliers** are values that lie far from the rest of the observations in a column.<br>
Outliers can distort summary statistics such as the mean and the standard deviation.<br>
A box plot is a convenient way to visualize them.

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(y='SQUARE_FT', data=train_df)

There are a few outliers, shown as black dots. Let's remove them from the dataset. One possible way to do this is the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers. <a href="https://pypi.org/project/remove-outliers/#:~:text=Multiply%20the%20interquartile%20range%20(IQR,IQR)%20from%20the%20first%20quartile"> More details </a> <br>
I'm going to create a function for this.

In [None]:
def get_outliers(df, column_name):
    """Return a boolean array that is True where the column value is an IQR outlier."""
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    # Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = ((df[column_name] < lower_limit) | (df[column_name] > upper_limit)).to_numpy()
    return outliers
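To see the IQR rule on concrete numbers, here is a small worked sketch on a made-up series of eight values. It repeats the same arithmetic as the function above without depending on it:

```python
import pandas as pd

# Made-up values: seven close together and one far away.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.75
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# True only for values outside [lower, upper]
mask = (values < lower) | (values > upper)
print(values[mask].tolist())   # [100]
print(values[~mask].tolist())  # the remaining seven values
```

Only the value 100 falls outside the limits, so it is the only one removed.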

In [None]:
sqr_ft_outliers = get_outliers(train_df, 'SQUARE_FT')
df_without_outliers = train_df.loc[~(sqr_ft_outliers),]
print(train_df.shape, df_without_outliers.shape)

In [None]:
print("{} rows were deleted".format(
    train_df.shape[0] - df_without_outliers.shape[0]))

Check the result

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(y='SQUARE_FT', data=df_without_outliers)

Check for outliers in the target column

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(y='TARGET(PRICE_IN_LACS)', data=df_without_outliers)

The situation is the same, so I'm going to remove these outliers from the dataset as well.

In [None]:
price_outliers = get_outliers(df_without_outliers, 'TARGET(PRICE_IN_LACS)')
price_outliers.sum()  # number of rows flagged as price outliers

In [None]:
prepared_df = df_without_outliers.loc[~(price_outliers),]
print(train_df.shape, prepared_df.shape)

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(y='TARGET(PRICE_IN_LACS)', data=prepared_df)

In [None]:
print("{} rows were deleted".format(
    train_df.shape[0] - prepared_df.shape[0]))

percent_of_deleted_rows = round((train_df.shape[0] - prepared_df.shape[0]) / train_df.shape[0] * 100, 2)
print("{}% of the data was deleted".format(percent_of_deleted_rows))


In [None]:
prepared_df.index = np.arange(prepared_df.shape[0])
prepared_df.index

Let's look at the CITY column

In [None]:
head_values = prepared_df['CITY'].value_counts().head(20).index.to_list()
head_city = prepared_df[prepared_df['CITY'].isin(head_values)]
plt.figure(figsize=(10,8))
sns.boxplot(y='TARGET(PRICE_IN_LACS)', x='CITY', data=head_city)
plt.xticks(rotation=45)

As we can see, city is an important feature for the prediction model. That's no surprise :)

#### Encoding categorical data
<a href="https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/"> More about encoding data </a>

In [None]:
prepared_df.nunique()

In [None]:
prepared_df = pd.concat([prepared_df, pd.get_dummies(prepared_df['POSTED_BY'])], axis=1)
prepared_df = pd.concat([prepared_df, pd.get_dummies(prepared_df['BHK_OR_RK'])], axis=1)

le = LabelEncoder()
le.fit(prepared_df['CITY'])
prepared_df['LE_CITY'] = le.transform(prepared_df['CITY'])

le.fit(prepared_df['ADDRESS_PART1'])
prepared_df['LE_ADDRESS_PART1'] = le.transform(prepared_df['ADDRESS_PART1'])

In [None]:
prepared_df.head()
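The two encodings used above behave quite differently, which is easier to see on a tiny example. This sketch uses made-up category values. One-hot encoding creates one indicator column per category, while `LabelEncoder` maps categories to integers in sorted order, so the resulting numbers carry an ordering that has no real meaning for cities:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only.
toy = pd.DataFrame({
    "POSTED_BY": ["Owner", "Dealer", "Builder", "Dealer"],
    "CITY": ["Pune", "Bangalore", "Pune", "Mumbai"],
})

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(toy["POSTED_BY"], prefix="POSTED_BY")
print(one_hot.columns.tolist())

# Label encoding: integers assigned in alphabetical order of the classes.
le = LabelEncoder()
city_codes = le.fit_transform(toy["CITY"])
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(city_codes.tolist())
```

Tree-based models such as gradient boosting can usually cope with such integer codes, but for a column with many distinct values it is worth keeping in mind that the codes are arbitrary.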

In [None]:
prepared_df.drop(['POSTED_BY', 'BHK_OR_RK', 'ADDRESS', 'CITY', 'ADDRESS_PART1'], axis=1, inplace=True)

In [None]:
temp = prepared_df[['SQUARE_FT','LONGITUDE', 'LATITUDE', 'TARGET(PRICE_IN_LACS)']]
scaler = StandardScaler()
scaler.fit(temp)
temp_scaled = scaler.transform(temp)

temp_scaled = pd.DataFrame(temp_scaled, 
                           columns=temp.columns)

temp_scaled


In [None]:
prepared_df.drop(['SQUARE_FT','LONGITUDE', 'LATITUDE', 'TARGET(PRICE_IN_LACS)'], axis=1, inplace=True)
prepared_df = pd.concat([prepared_df, temp_scaled], axis=1)

In [None]:
prepared_df
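`StandardScaler` replaces each value x with z = (x - mean) / std, computed per column. Because the target column is scaled too, predictions come out in standardized units rather than in lacs. A minimal sketch, assuming a separate scaler is kept for the target (the prices are made up), shows the formula and how `inverse_transform` recovers the original units:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices as a single-column 2D array, the shape StandardScaler expects.
prices = np.array([[40.0], [55.0], [95.0], [120.0]])

target_scaler = StandardScaler()
scaled = target_scaler.fit_transform(prices)

# The same result by hand: subtract the mean, divide by the population std (ddof=0).
manual = (prices - prices.mean()) / prices.std()

# Map standardized values back to the original units.
restored = target_scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual), np.allclose(restored, prices))
```

With one scaler fitted on four columns at once, as in the cells above, the inverse transform would need all four columns again, which is why a dedicated target scaler is used in this sketch.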

#### Create model

The data is ready. The next step is to create a prediction model.

In [None]:
X = prepared_df.loc[:, prepared_df.columns != 'TARGET(PRICE_IN_LACS)']
y = prepared_df['TARGET(PRICE_IN_LACS)']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)

In [None]:
X_train.shape, X_test.shape

In [None]:
# gbr = GradientBoostingRegressor()

# parameters = {'max_depth':[9,],
#               'n_estimators':[154,],
#               'max_features': [6,],
#               'learning_rate':[x/10 for x in map(float, range(1,5))],
#              }
# clf = GridSearchCV(gbr, parameters)
# clf.fit(X_train, y_train)
# clf.best_score_, clf.best_params_


In [None]:
gbr = GradientBoostingRegressor(max_depth=9, n_estimators=154)
cross_val_score(gbr, X_train, y_train, cv=5)

In [None]:
gbr = GradientBoostingRegressor(max_depth=9, n_estimators=154, max_features=6)
cross_val_score(gbr, X_train, y_train, cv=5)

In [None]:
gbr.fit(X_train, y_train)

In [None]:
gbr.score(X_test, y_test)
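For a regressor, `score` returns the coefficient of determination R². The error metrics imported at the top of the notebook give a complementary view in the units of the target. A small sketch on made-up numbers (not model output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(mae, rmse, r2)
```

In the notebook the same calls would take `y_test` and `gbr.predict(X_test)`. Since the target was standardized, the errors would be in standardized units.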

#### Feature importances

In [None]:
# feature_importances_ holds impurity-based importances, not linear coefficients
pd.DataFrame(gbr.feature_importances_, index=X_train.columns, columns=['importance']).sort_values(by='importance', ascending=False)

#### Difference between real and predicted data

Let's plot the first 150 real and predicted values.

In [None]:
fig, ax = plt.subplots(figsize=(30, 10))
ax.plot(y_test.to_list()[:150], 
        label='Real first 150 values', color='red', linewidth=2)
ax.plot(gbr.predict(X_test)[:150], 
        label='Predicted first 150 values', 
        linestyle='dashed', linewidth=2)
ax.legend(prop={"size":20})

# Thank you