## Real Estate House Price Predictor

In [1]:
import pandas as pd
import numpy as np

In [2]:
housing = pd.read_csv("data.csv")

In [3]:
housing.head()

In [4]:
housing.info()

In [5]:
housing.describe()

In [6]:
%matplotlib inline

In [7]:
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize =(20,15))

## Train - Test Splitting

In [8]:
# import numpy as np
# def split_train_test(data,test_ratio):
#     shuffled = np.random.permutation(len(data))
#     test_set_size = int(len(data) * test_ratio)
#     test_indices = shuffled[:test_set_size]
#     train_indices = shuffled[test_set_size: ]
#     return data.iloc[train_indices], data.iloc[test_indices]

In [9]:
# train_set,test_set = split_train_test(housing,0.2)
# print (f"Rows in train set :{len(train_set)}\nRows in test set :{len(test_set)}\n") 

In [10]:
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(housing,test_size=0.2,random_state=42)
print (f"Rows in train set :{len(train_set)}\nRows in test set :{len(test_set)}\n") 

In [11]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index,test_index in split.split(housing,housing['CHAS']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]    

In [12]:
strat_test_set

In [13]:
housing = strat_train_set.copy()

## Looking for Correlations

<!-- Correlation is a statistical technique that measures how one variable changes in relation to another. It indicates the strength and direction of the relationship between two variables, which makes it a bivariate analysis measure. It is often useful to express one quantity in terms of its relationship with others. Correlations can also help reveal errors in the dataset and suggest
new features built by combining existing features or labels. -->
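
As a rough illustration of what a correlation coefficient captures, the sketch below builds a small toy DataFrame (the column names and values are made up for illustration and are not taken from data.csv) and computes the Pearson correlation with `DataFrame.corr()`.

```python
import pandas as pd

# Toy data: hypothetical values chosen for illustration only
toy = pd.DataFrame({
    "rooms": [4, 5, 6, 7, 8],
    "price": [20.0, 25.0, 30.0, 35.0, 40.0],  # rises linearly with rooms
    "crime": [9.0, 7.0, 5.0, 3.0, 1.0],       # falls linearly with rooms
})

toy_corr = toy.corr()  # Pearson correlation is the default method

# A perfectly linear increasing relationship gives a coefficient close to +1,
# a perfectly linear decreasing relationship gives a coefficient close to -1.
rooms_vs_price = toy_corr.loc["rooms", "price"]
rooms_vs_crime = toy_corr.loc["rooms", "crime"]
print(rooms_vs_price, rooms_vs_crime)
```

Real data rarely falls on a perfect line, so coefficients for the housing attributes lie somewhere between these two extremes.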

In [14]:
# Creating the Correlation Matrix
corr_matrix = housing.corr()

In [15]:
# Correlation of the label 'MEDV' with every other attribute. A strong positive correlation, as for RM, means
# MEDV tends to increase as RM increases; a strong negative correlation means MEDV tends to decrease as the
# attribute increases.

corr_matrix['MEDV'].sort_values(ascending=False)

In [16]:
from pandas.plotting import scatter_matrix
attributes = ["MEDV","RM","ZN","LSTAT"]
scatter_matrix(housing[attributes],figsize = (12,8))

In [17]:
# This plot shows several outliers that can mislead the model. For example, some houses with 5 rooms and some with 9 rooms
# have the same price, which is implausible, and some points lie far from the main trend. Plotting the data helps us spot
# such outliers so that they can be removed when cleaning the data.

housing.plot(kind ="scatter", x="RM",y="MEDV",alpha=0.8)

## Trying out Attribute Combinations

In [18]:
housing ["TAXRM"] = housing["TAX"]/housing["RM"]

In [19]:
housing.head()

In [20]:
# The newly created attribute TAXRM shows a fairly strong negative correlation with MEDV

housing.plot(kind ="scatter", x="TAXRM",y="MEDV",alpha=0.8)

In [21]:
# Separate the features from the label
housing = strat_train_set.drop("MEDV",axis=1)
# copy must be called; keep the labels as a Series because later cells use .values and .iloc
housing_labels = strat_train_set["MEDV"].copy()

## Missing Attributes

To handle missing attribute values, we can do one of the following:

1. Get rid of the data points with missing values
2. Get rid of the whole attribute
3. Set the missing values to some value (0, the median, or the mean)
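
The three options above can be sketched with pandas as follows. This is a minimal illustration on a toy DataFrame with made-up values; the `RM` and `TAX` column names are borrowed from the dataset only to keep the example familiar.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing RM value (hypothetical data)
toy = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 5.0],
    "TAX": [300.0, 400.0, 250.0, 350.0],
})

option1 = toy.dropna(subset=["RM"])    # 1. drop the rows where RM is missing
option2 = toy.drop("RM", axis=1)       # 2. drop the RM attribute entirely
rm_median = toy["RM"].median()         # median() skips NaN by default
option3 = toy["RM"].fillna(rm_median)  # 3. fill the gaps with the median

print(option1.shape, list(option2.columns), option3.tolist())
```

None of these calls modifies `toy` in place; each returns a new object, so the result has to be assigned if it should be kept.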

In [22]:
# Option 3: compute the median of RM.
# The same training-set median is reused for the test set,
# because the test set may also contain missing values,
# and for any new data that is added to the dataset later.

median = housing["RM"].median()

In [23]:
housing["RM"].fillna(median)
# fillna returns a new Series; the original housing DataFrame remains unchanged

In [24]:
housing.shape

The next cell creates an imputer object from the SimpleImputer class. Calling its fit method computes the median of every attribute in our dataset.

In [25]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = "median")
imputer.fit(housing)

In [26]:
imputer.statistics_.shape

We can see that it has calculated 15 medians, one for every attribute in our dataset. Although we only need the median for RM, we compute all of them so that the pipeline can also handle missing values in other attributes if they appear when more data is added to the dataset in the future.

In [27]:
X = imputer.transform(housing)

In [28]:
housing_tr = pd.DataFrame(X,columns = housing.columns)

# Build a new DataFrame from the transformed array X,
# reusing the column names from housing

In [29]:
housing_tr.describe()

## Scikit-Learn Design

Scikit-learn objects are primarily of three types:

1. Estimators: Estimate some parameters based on a dataset. Example: the imputer, which has both a fit() and a transform() method.
The fit() method fits the dataset and computes the parameters from it.

2. Transformers: Take input and return output based on what was learned in fit(). They also have a convenience method called fit_transform(), which fits and then transforms.

3. Predictors: The LinearRegression model is an example of a predictor. Two common methods are fit() and predict(). A predictor also provides a score() method that evaluates the predictions.
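
The three roles can be seen in a small sketch like the one below. It uses a toy array with made-up values (not the housing data), with `SimpleImputer` acting as estimator and transformer and `LinearRegression` acting as predictor.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data with one missing feature value (hypothetical numbers)
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 5.0, 8.0])

# Estimator: fit() learns the median of each column and stores it in statistics_
imputer_toy = SimpleImputer(strategy="median")
imputer_toy.fit(X_toy)

# Transformer: transform() applies what fit() learned (fills NaN with the median)
X_filled = imputer_toy.transform(X_toy)

# Predictor: fit() learns from features and labels, predict() produces outputs,
# score() evaluates them (R^2 for regressors)
reg = LinearRegression()
reg.fit(X_filled, y_toy)
preds = reg.predict(X_filled)
r2 = reg.score(X_filled, y_toy)
print(imputer_toy.statistics_, preds, r2)
```

Calling `imputer_toy.fit_transform(X_toy)` would combine the fit and transform steps into one call.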



## Feature Scaling

There are two primary feature scaling methods:

1. Min-Max scaling (normalization)
       (value - min) / (max - min)
       Scikit-learn provides a class called MinMaxScaler for this.

2. Standardization
       (value - mean) / (standard deviation)
       Scikit-learn provides a class called StandardScaler for this.
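
To make the two formulas concrete, here is a small sketch that applies them by hand to a toy column (made-up values) and compares the results with `MinMaxScaler` and `StandardScaler`. Note that `StandardScaler` divides by the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature column (hypothetical values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-Max scaling: (value - min) / (max - min), result lies in [0, 1]
col_min = values.min(axis=0)
col_max = values.max(axis=0)
min_max_manual = (values - col_min) / (col_max - col_min)

# Standardization: (value - mean) / std, result has mean 0 and unit variance
standard_manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The scikit-learn classes compute the same quantities
min_max_sklearn = MinMaxScaler().fit_transform(values)
standard_sklearn = StandardScaler().fit_transform(values)

print(np.allclose(min_max_manual, min_max_sklearn))
print(np.allclose(standard_manual, standard_sklearn))
```

Min-Max scaling is sensitive to outliers because a single extreme value stretches the range, while standardization does not bound values to a fixed interval.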

## Creating a pipeline

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
my_pipeline =Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    #.... add as many tools as you want to
    ('std_scaler',StandardScaler())
])

In [31]:
housing_num_tr = my_pipeline.fit_transform(housing)

In [32]:
housing_num_tr.shape
# fit_transform returns a NumPy array, not a DataFrame

## Selecting a desired model for our Real Estate House Price Prediction

In [33]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(housing_num_tr, housing_labels.values)

In [34]:
some_data = housing.iloc[:5]

In [35]:
some_labels = housing_labels.iloc[:5]

In [36]:
prepared_data = my_pipeline.transform(some_data)

In [41]:
predicted_data =model.predict(prepared_data)

In [43]:
for i in range(len(some_data)):
    # some_labels keeps the original row labels, so index it by position with .iloc
    print("Actual value:", some_labels.iloc[i], "Predicted value:", predicted_data[i])

## Evaluating the model