# MODEL 1 

- Encoding zipcode as dummy variables
- Adding a new year column (the renovation date if the house was renovated, otherwise the construction date)
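
A minimal sketch of how such a year column could be derived and grouped by decade. The data file used below already ships with this column, so this is only an illustration: the toy values, the `decade` column name, and the convention that `yr_renovated` is 0 for houses that were never renovated are assumptions.

```python
import pandas as pd

# Toy data; column names follow the dataset used in this notebook.
houses = pd.DataFrame({
    "yr_built": [1955, 1987, 2003],
    "yr_renovated": [0, 2005, 0],  # assumption: 0 means "never renovated"
})

# Use the renovation year when there is one, otherwise the construction year.
effective_year = houses["yr_renovated"].where(
    houses["yr_renovated"] > 0, houses["yr_built"]
)

# Floor to the decade: 1955 -> 1950, 2005 -> 2000.
houses["decade"] = (effective_year // 10) * 10
print(houses)
```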

## Model 1 using Linear Regression

# Data Cleaning & Standardization

## Importing data

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")
from sklearn.pipeline import Pipeline
pipe_lr = Pipeline([('scl', StandardScaler()),('pca', 
                  PCA(n_components=4)),('slr', LinearRegression())])
from sklearn.metrics import r2_score

In [None]:
df=pd.read_excel('regression_data_decade.xls')
pd.set_option("display.max_columns", None)
#pd.set_option("display.max_rows", None)
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## Checking Null Values

Our first step was to look for null values. The dataset has none, so there is nothing to impute. It is still worth asking why no nulls exist, since that can hide bias: rows may have been duplicated to fill gaps, or placeholder or random values may have been inserted in place of missing ones.
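
One way to look for such hidden placeholders is to measure how often each column takes a suspicious value. The sketch below is only an illustration: the helper name `placeholder_report`, the candidate values 0 and -1, and the toy data are assumptions, and a high share of zeros can of course be legitimate (for example, a house without a basement).

```python
import pandas as pd

def placeholder_report(frame, placeholders=(0, -1)):
    """Share of rows per column equal to each candidate placeholder value."""
    report = {value: frame.eq(value).mean() for value in placeholders}
    return pd.DataFrame(report)

# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "yr_renovated": [0, 0, 1991, 0],
    "sqft_basement": [0, 400, 0, 250],
})

report = placeholder_report(toy)
print(report)
```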

In [None]:
df.isnull().sum()

In [None]:
df.notnull().sum() # We haven't detected null values in any columns

## Checking for duplicated Values

Our approach to duplicated values was to first check why the same id was repeated. Since we only have data from 2014 and 2015, the most likely reason for a repeated house is that it was sold twice in this period, at two different prices (while the independent variables remained the same). That's why we decided to keep only the most recent transaction, since it best reflects the current price of the home.
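
The idea can be sketched on toy data: sort by date, then keep the last row per house. The values below are made up, and this sketch deduplicates on `id` alone, whereas the cells that follow also require the other attributes to match.

```python
import pandas as pd

# Toy sales table: house 1 was sold twice.
sales = pd.DataFrame({
    "id": [1, 2, 1, 3],
    "date": pd.to_datetime(["2014-05-02", "2014-06-10", "2015-03-15", "2014-09-01"]),
    "price": [300000, 450000, 340000, 520000],
})

# After sorting by date, keep="last" retains the most recent sale of each house.
latest = (
    sales.sort_values("date")
         .drop_duplicates(subset="id", keep="last")
         .reset_index(drop=True)
)
print(latest)
```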

In [None]:
df.id.duplicated().sum() #checking how many duplicated ids(Houses) are there in the Data set

In [None]:
df.duplicated().sum() #Checking if there are duplicated rows. There are not any identical rows so the duplicated Ids may have some difference. 

In [None]:
df.loc[df.id.duplicated(),:].sort_values(by=['id']) # We check all the duplicated Ids in the Dataframe

In [None]:
df=df.sort_values('date')

In [None]:
pd.set_option("display.max_rows", None)
df[df.duplicated(['id'], keep=False)].sort_values(by=['id'])

We look for ids whose rows are identical in every column except the price and the date, and drop the older records, keeping the most recent one.

In [None]:
df_dupl=df[df.duplicated(['id','bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','condition','grade', 'sqft_above','sqft_basement','yr_built','yr_renovated','zipcode','lat','long','sqft_living15', 'sqft_lot15'], keep=False)].sort_values(by=['id'])


In [None]:
df=df.drop_duplicates( subset=['id','bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','condition','grade', 'sqft_above','sqft_basement','yr_built','yr_renovated','zipcode','lat','long','sqft_living15', 'sqft_lot15'],keep='last')

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df.info() 

# Pre-processing

## Checking data types

We first thought of flooring bathrooms to an integer, but since it can legitimately take decimal values we decided to leave it as a float. However, we did convert floors into an integer.
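
It is worth keeping in mind that `astype(int)` truncates rather than rounds. If the floors column contains half levels, they are merged into the level below. A small sketch with made-up values:

```python
import pandas as pd

# Made-up floor counts, including half levels.
floors = pd.Series([1.0, 1.5, 2.0, 2.5, 3.5])

# astype(int) drops the fractional part: 1.5 -> 1, 2.5 -> 2, 3.5 -> 3.
as_int = floors.astype(int)
print(as_int.tolist())

# Counting the non-integer values first shows how much information is lost.
half_levels = int((floors % 1 != 0).sum())
print(half_levels)
```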

In [None]:
df['floors']=df['floors'].astype(int)

In [None]:
df.head(15)

In [None]:
df.info()

In [None]:
df=df.drop(['id'],axis=1)

## Checking data shapes

We first plot histograms of all columns to detect clear outliers. At first sight, most of the numerical columns (sqft_living, sqft_living15, sqft_lot, sqft_lot15, sqft_above, sqft_basement) seem to have some outliers, which we examine further with a scatter plot matrix. For discrete variables such as bedrooms, we deal with implausible values such as 33 and 11 bedrooms, which are not consistent with the rest of the attributes of those houses. We treat the 33-bedroom house as a typo for 3 and drop the 11-bedroom house.

In [None]:
df.hist(figsize=(15,15),bins=20,layout=(6,4));

In [None]:
df.hist(['bedrooms'], figsize=(13,11),bins=50)

In [None]:
df['bedrooms'] = df['bedrooms'].replace(33,3)

In [None]:
df.drop(df.loc[df['bedrooms']==11].index, inplace=True)

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df.hist(['bedrooms'], figsize=(13,11),bins=50)

In [None]:
df.hist(['floors'], figsize=(13,11),bins=50)

In [None]:
df.describe()

## Check useless columns

To decide which columns to add to our model, we ran both the correlation matrix and the scatter_matrix to check for multicollinearity and to see which variables were most related to the price.
After observing in the scatter_matrix that sqft_living is roughly normally distributed, we decided to handle its outliers by dropping the values more than 3 standard deviations away from its mean. For the numeric variables, we include sqft_living and sqft_basement in the model. We keep sqft_living because it is the variable most correlated with the target; it is also highly correlated with another numeric variable (sqft_above), which we drop to avoid multicollinearity. sqft_basement is not as correlated with sqft_living, so we leave it in the model.
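
Reading pairs off a heatmap by eye gets tedious, so the highly correlated pairs can also be listed programmatically. The sketch below runs on synthetic data; the helper name `correlated_pairs` and the 0.8 threshold are assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair appears only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: sqft_above is built to track sqft_living closely.
rng = np.random.default_rng(0)
living = rng.normal(2000, 500, size=200)
toy = pd.DataFrame({
    "sqft_living": living,
    "sqft_above": living * 0.9 + rng.normal(0, 50, size=200),
    "sqft_lot": rng.normal(8000, 2000, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

Of each pair returned, only one variable would normally stay in a linear model.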

In [None]:
corr_df=df.drop(['date','lat','long','yr_built','zipcode','yr_renovated','grade','condition','view','waterfront'], axis=1)

In [None]:
#Correlation Matrix
corre_matrix=corr_df.corr()
corre_matrix

In [None]:
#Heatmap
import matplotlib.pyplot as plt 
import seaborn as sns

mask=np.zeros_like(corre_matrix)
mask[np.triu_indices_from(mask)]=True
fig,ax = plt.subplots(figsize=(14,10))
ax=sns.heatmap(corre_matrix, mask=mask, center=0, cmap=sns.diverging_palette(220, 20, as_cmap=True), annot=True,);

In [None]:
from scipy import stats
df=df[(np.abs(stats.zscore(df['sqft_living'])) < 3)]

In [None]:
df=df.drop(df.loc[df['sqft_lot']>1000000].index) # drop() returns a new DataFrame; with inplace=True it returns None, which would overwrite df

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
X_num=df[['sqft_living','bedrooms','bathrooms','floors', 'sqft_basement']]
X_num_scatter_matrix=df[['sqft_above','sqft_living','sqft_basement','sqft_lot' ]]

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(X_num_scatter_matrix,alpha=0.2, figsize=(20,12), diagonal='kde');

In [None]:
df.tail().sort_values('sqft_lot')

## Dealing with the categorical variables 

As mentioned before, we added a categorical variable (Decade Build) that holds either the year the house was built or the year it was renovated, grouped by decade. We then dummified all our categorical variables. This was one of the main limitations of this model: dummifying the zipcode creates many small subsamples, some with few observations, which can lead to unreliable estimates. The problem gets worse when we drop outliers, which reduces the size of those subsamples even further.
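
The size of each subsample can be checked before dummifying. A minimal sketch on toy data follows; the zipcodes and the cut-off of 2 observations are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "zipcode": ["98001", "98001", "98001", "98002", "98003", "98003"],
})

# Observations per level: levels with very few rows give unstable coefficients.
counts = toy["zipcode"].value_counts()
rare = counts[counts < 2].index.tolist()  # arbitrary cut-off for the toy data
print(rare)

# drop_first=True removes one level per variable, which becomes the baseline.
dummies = pd.get_dummies(toy, columns=["zipcode"], drop_first=True)
print(dummies.columns.tolist())
```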

In [None]:
feat_crosstab=pd.crosstab(df['condition'],df['grade'], margins=False)
feat_crosstab

In [None]:
from scipy.stats import chi2_contingency 
chi2_contingency(feat_crosstab, correction=False)

In [None]:
feat_crosstab=pd.crosstab(df['view'],df['grade'], margins=False)
feat_crosstab

In [None]:
chi2_contingency(feat_crosstab, correction=False)

In [None]:
X_cat1=df[['grade', 'Decade Build', 'condition','view']].astype(str) # 'grade' listed once to avoid duplicated dummy columns

In [None]:
X_cat1.info()

In [None]:
X_dummies1=pd.get_dummies(X_cat1, drop_first=True)
X_dummies1.head(10)

In [None]:
X_cat2=df[['zipcode']].astype(str)
X_cat2.info()

In [None]:
X_dummies2=pd.get_dummies(X_cat2, drop_first=True)
X_dummies2.head(10)

In [None]:
X_dummies=pd.concat((X_dummies1,X_dummies2),axis=1)

In [None]:
X_final=pd.concat((X_num,X_dummies),axis=1)

In [None]:
X_final.head(10)

# Testing the model

## Train test split

Once the feature engineering and pre-processing are finished, we fit and test our model as usual.

In [None]:
y = df['price']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_final,y,test_size=0.2,random_state=40)

## Linear regression model

In [None]:
from sklearn import linear_model

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train)

In [None]:
preds=lm.predict(X_test)
preds

## Model Validation

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
r2_score(y_test,preds)

In [None]:
mse=mean_squared_error(y_test, preds)
mse

In [None]:
rmse=np.sqrt(mse)
rmse

In [None]:
mean_absolute_error(y_test, preds)

In [None]:
# TESTING-ADAPTING

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingRegressor

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(linear_model.LinearRegression(),
            k_features=5,
            forward=True,
            scoring='r2',
            cv=3)
sfs1 = sfs1.fit(X_train, y_train)

In [None]:
sfs1.subsets_

## Testing other models

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree

from sklearn.metrics import accuracy_score
from sklearn import linear_model

classifiers = ['RandomForestRegressor', 'KNeighborsRegressor','GradientBoostingRegressor', 'linear_model', 'tree_Regressor']
models = [
          RandomForestRegressor(n_estimators=200, random_state=0),
          KNeighborsRegressor(),
          GradientBoostingRegressor(random_state=0), linear_model.LinearRegression(),tree.DecisionTreeRegressor()
         ]
for i in models:
    model = i
    model.fit(X_train,y_train)
    preds=model.predict(X_test)
    print(model,'R2:',r2_score(y_test,preds)) # r2_score is a regression metric, not accuracy

# Scale all of X

We compare three scaling approaches, each wrapped in a small helper:

In [None]:
def maxmin_scaler (X):
    from sklearn.preprocessing import MinMaxScaler
    X_scaled = MinMaxScaler().fit(X).transform(X)
    
    return X_scaled

def st_scaler (X):
    from sklearn.preprocessing import StandardScaler
    X_scaled_st = StandardScaler().fit(X).transform(X)
    return X_scaled_st

def rob_scaler (X):
    from sklearn.preprocessing import RobustScaler
    X_scaled_rob = RobustScaler().fit(X).transform(X)
    return X_scaled_rob



## Min Max Scaler

In [None]:
X_scaled=maxmin_scaler(X_num)

In [None]:
#from sklearn.preprocessing import MinMaxScaler
#scaler=MinMaxScaler()
#X_scaled=scaler.fit_transform(X_num)
#X_scaled

In [None]:
X_scaled_df=pd.DataFrame(X_scaled, columns=X_num.columns)

In [None]:
X_scaled_df.describe()

In [None]:
X_scaled_concat=pd.concat((X_scaled_df,X_dummies),axis=1)

In [None]:
X_scaled_df.head()

In [None]:
X_scaled_concat.head(20)

In [None]:
y.shape

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_scaled_concat,y,test_size=0.2,random_state=40)

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train)

In [None]:
preds=lm.predict(X_test)
preds

In [None]:
r2_score(y_test,preds)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn import linear_model

classifiers = ['RandomForestRegressor', 'KNeighborsRegressor','GradientBoostingRegressor', 'linear_model']
models = [
          RandomForestRegressor(n_estimators=200, random_state=0),
          KNeighborsRegressor(),
          GradientBoostingRegressor(random_state=0), linear_model.LinearRegression(),
         ]
for i in models:
    model = i
    model.fit(X_train,y_train)
    preds=model.predict(X_test)
    print(model,'R2:',r2_score(y_test,preds)) # r2_score is a regression metric, not accuracy

## StandardScaler

In [None]:
X_scaled=st_scaler(X_num)

In [None]:
X_scaled_df=pd.DataFrame(X_scaled, columns=X_num.columns)

In [None]:
X_scaled_df.describe()

In [None]:
X_scaled_concat=pd.concat((X_scaled_df,X_dummies),axis=1)

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_scaled_concat,y,test_size=0.2,random_state=40)

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train)

In [None]:
preds=lm.predict(X_test)
preds

In [None]:
r2_score(y_test,preds)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn import linear_model

classifiers = ['RandomForestRegressor', 'KNeighborsRegressor','GradientBoostingRegressor', 'linear_model']
models = [
          RandomForestRegressor(n_estimators=200, random_state=0),
          KNeighborsRegressor(),
          GradientBoostingRegressor(random_state=0), linear_model.LinearRegression(),
         ]
for i in models:
    model = i
    model.fit(X_train,y_train)
    preds=model.predict(X_test)
    print(model,'R2:',r2_score(y_test,preds)) # r2_score is a regression metric, not accuracy

## RobustScaler

In [None]:
X_scaled=rob_scaler(X_num)

In [None]:
X_scaled_df=pd.DataFrame(X_scaled, columns=X_num.columns)

In [None]:
X_scaled_df.describe()

In [None]:
X_scaled_concat=pd.concat((X_scaled_df,X_dummies),axis=1)

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_scaled_concat,y,test_size=0.2,random_state=40)

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train)

In [None]:
preds=lm.predict(X_test)
preds

In [None]:
r2_score(y_test,preds)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn import linear_model

classifiers = ['RandomForestRegressor', 'KNeighborsRegressor','GradientBoostingRegressor', 'linear_model']
models = [
          RandomForestRegressor(n_estimators=200, random_state=0),
          KNeighborsRegressor(),
          GradientBoostingRegressor(random_state=0), linear_model.LinearRegression(),
         ]
for i in models:
    model = i
    model.fit(X_train,y_train)
    preds=model.predict(X_test)
    print(model,'R2:',r2_score(y_test,preds)) # r2_score is a regression metric, not accuracy

# MODEL 2 - Setting distance to center as dummy

In our second model, we looked for ways to improve the score obtained with the previous model. As mentioned before, one of the main issues there was dealing with so many small subsamples due to the dummification of the zipcode variable. We therefore tried to reduce the number of subsamples with a grouping that leaves more observations in each one. We divided the zipcodes into 5 groups according to how far they are from the most expensive area (the best place to live), since there appears to be a clear pattern in which greater distance from this point means a lower house value (fewer services, longer distance to the business area...). That's why we created the distance_to_center column. For the rest of the model, we follow the same steps as before.
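
The distance_to_center column comes precomputed in the file loaded below, so the exact procedure is not shown here. One possible way to build a similar feature is sketched next, under several assumptions: distances are computed per row from `lat` and `long` with the haversine formula, the reference point is hypothetical, and the 5 groups are equal-width bins. The helper `haversine_km` and the column names `km_to_ref` and `distance_group` are made up for this sketch.

```python
import numpy as np
import pandas as pd

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km between each point and a reference point."""
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((lon - ref_lon) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Toy coordinates moving progressively away from the reference point.
toy = pd.DataFrame({
    "lat": [47.62, 47.60, 47.50, 47.40, 47.30, 47.20],
    "long": [-122.20, -122.30, -122.25, -122.10, -122.00, -121.90],
})
ref_lat, ref_lon = 47.62, -122.20  # hypothetical "most expensive area"

toy["km_to_ref"] = haversine_km(toy["lat"], toy["long"], ref_lat, ref_lon)

# Five equal-width distance bands, coded 0 (closest) to 4 (farthest).
toy["distance_group"] = pd.cut(toy["km_to_ref"], bins=5, labels=False)
print(toy)
```

Quantile-based groups (`pd.qcut`) would be an alternative when the goal is a similar number of houses per group.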

In [None]:
df3=pd.read_excel('regression_data_distance_center.xls')
pd.set_option("display.max_columns", None)
#pd.set_option("display.max_rows", None)
df3.head()

In [None]:
df3.info()

In [None]:
df3.drop(df3.loc[df3['bedrooms']==11].index, inplace=True)
df3['bedrooms'] = df3['bedrooms'].replace(33,3)

In [None]:
df3.drop(df3.loc[df3['bedrooms']==11].index, inplace=True)

In [None]:
df3=df3[(np.abs(stats.zscore(df3['sqft_living'])) < 3)]
df3.reset_index(drop=True, inplace=True)

In [None]:
df3=df3.drop_duplicates(subset=['id','bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','condition','grade', 'sqft_above','sqft_basement','yr_built','yr_renovated','zipcode','lat','long','sqft_living15', 'sqft_lot15'],keep='last')

In [None]:
df3['bathrooms']=df3['bathrooms'].astype(int) 
df3['floors']=df3['floors'].astype(int)

In [None]:
df3=df3.drop(['id'],axis=1)
df3.reset_index(drop=True, inplace=True)

In [None]:
X_cat3=df3[['grade', 'Decade Build', 'condition','distance_to_center']].astype(str)
X_cat3.info()

In [None]:
X_dummies10=pd.get_dummies(X_cat3, drop_first=True)
X_dummies10.info()

In [None]:
X_num2=df3[['sqft_living','bedrooms','bathrooms','floors', 'sqft_basement']] # use df3 so the rows match X_dummies10 and y2

In [None]:
X_final2=pd.concat((X_num2,X_dummies10),axis=1)

In [None]:
X_final2.info()

In [None]:
y2 = df3['price']

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_final2,y2,test_size=0.2,random_state=40)

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train)

In [None]:
preds2=lm.predict(X_test)
preds2

In [None]:
r2_score(y_test,preds2)

In [None]:
mse=mean_squared_error(y_test, preds2)
mse

In [None]:
rmse=np.sqrt(mse)
rmse

In [None]:
mean_absolute_error(y_test, preds2)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn import linear_model

classifiers = ['RandomForestRegressor', 'KNeighborsRegressor','GradientBoostingRegressor', 'linear_model']
models = [
          RandomForestRegressor(n_estimators=200, random_state=0),
          KNeighborsRegressor(),
          GradientBoostingRegressor(random_state=0), linear_model.LinearRegression(),
         ]
for i in models:
    model = i
    model.fit(X_train,y_train)
    preds=model.predict(X_test)
    print(model,'R2:',r2_score(y_test,preds)) # r2_score is a regression metric, not accuracy

In [None]:
# TESTING-ADAPTING

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(linear_model.LinearRegression(),
            k_features=23,
            forward=True,
            scoring='r2',
            cv=3).fit(X_train,y_train)


In [None]:
sfs1.subsets_