# Housing Price Prediction

### Importing necessary libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sapi
import statsmodels.formula.api as smf
import seaborn as sns

#### Pandas

'Pandas' is a software library written for the Python programming language for data manipulation and analysis. It is most widely used for data science/data analysis and machine learning tasks.

#### Matplotlib

Matplotlib is a Python library used for data visualization.

#### Sklearn

'Sklearn', also known as 'scikit-learn', is a machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
#### Regression algorithm in Sklearn
It is used to predict a continuous-valued attribute associated with an object.

#### Statsmodels

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

#### statsmodels.api
Cross-sectional models and methods. (Cross-sectional analysis looks at data collected at a single point in time, rather than over a period of time.)
#### statsmodels.formula.api
A convenience interface for specifying models using formula strings and DataFrames.
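
As a minimal sketch of the formula interface, the snippet below fits an ordinary least squares model on a small synthetic DataFrame. The column names `x1`, `x2` and `y` and the values are made up for illustration; they are not part of the housing dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, noise-free data (illustrative only): y = 1 + 2*x1 - 3*x2
toy = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 0.0, 2.0, 1.0, 3.0, 2.0],
})
toy["y"] = 1 + 2 * toy["x1"] - 3 * toy["x2"]

# The formula string names the target on the left of '~' and the
# predictors on the right; an intercept is added automatically.
toy_fit = smf.ols("y ~ x1 + x2", data=toy).fit()
print(toy_fit.params)
```

The fitted parameters are indexed by name (`Intercept`, `x1`, `x2`), which makes them easy to match back to the formula.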

#### Seaborn
Seaborn is a Python data visualization library built on top of matplotlib and closely integrated with pandas data structures. It is used predominantly for statistical graphics, which help in exploring and understanding data.

### Getting data through 'Pandas' Library

In [None]:
data=pd.read_csv("Housing Price Prediction (1).csv")
# Here we load the 'Housing Price Prediction' data with pd.read_csv(file_name.extension) and
# assign the resulting DataFrame to the 'data' variable.

### About Data:
Our data comes from the 'Housing Price Prediction' dataset, which is in CSV (comma-separated values) format. It contains 21597 rows, also known as observations, and 21 columns, also known as features. Our aim is to predict the price of a house from the features that play a major role in determining the price.

In [None]:
data.head(10)
# DataFrame.head() returns the first 5 rows of the data by default; we can change this by passing a number to head().
# For example
# data.head(10)
# returns the first 10 rows of the dataset.

### Data and feature Description

#### id - 
    Unique ID for each home sold
#### date - 
    Date of the home sale
#### price - 
    Price of each home sold
#### bedrooms -
    Number of bedrooms
#### bathrooms - 
    Number of bathrooms
#### sqft_living - 
    Square footage of the apartment's interior living space
#### sqft_lot - 
    Square footage of the land space
#### floors -
    Number of floors
#### waterfront - 
    A dummy variable for whether the apartment was overlooking the waterfront or not
#### view - 
    An index from 0 to 4 of how good the view of the property was
#### condition - 
    An index from 1 to 5 on the condition of the apartment
#### grade - 
    An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
#### sqft_above - 
    The square footage of the interior housing space that is above ground level
#### sqft_basement - 
    The square footage of the interior housing space that is below ground level
#### yr_built - 
    The year the house was initially built
#### yr_renovated - 
    The year of the house’s last renovation
#### zipcode - 
    What zipcode area the house is in
#### lat - 
    Latitude
#### long - 
    Longitude
#### sqft_living15 - 
    The square footage of interior housing living space for the nearest 15 neighbors
#### sqft_lot15 -
    The square footage of the land lots of the nearest 15 neighbors


### Feature Nature:
### From the types of the features and their value counts, we can determine the nature of each feature:

#### Qualitative:

##### Nominal: 
        id, waterfront, zipcode
##### Ordinal: 
        date, view, condition

#### Quantitative:

##### Discrete: 
        bedrooms, sqft_living, sqft_lot, sqft_above, sqft_basement, yr_built, yr_renovated, sqft_living15, sqft_lot15
        
##### Continuous: 
        price, floors, lat, long

In [None]:
data.shape
# .shape tells us the shape (number of rows, number of columns) of the dataset.
# Our dataset (Housing Price Prediction) has 21597 rows, also known as observations, and 21 columns, also known as features.

In [None]:
print(data.isnull().sum())
# This counts the 'Null' values in each column of the dataset.
# In our dataset there is no 'Null' value present.

In [None]:
data.columns
# .columns lists all the columns (features) of the dataset.

In [None]:
data.sort_values(['yr_built'],ascending=False)
# Here we sort the dataset by 'yr_built' in descending order to see the most recently built houses first.

In [None]:
data.describe()
# .describe() gives summary statistics for the numeric columns of the data: count, mean, standard deviation,
# minimum, maximum and percentiles.

In [None]:
data.dtypes
# This returns a Series with the data type of each column.
# The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.

In [None]:
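# Cast a pandas object to a specified data type

data['bathrooms'] = data['bathrooms'].astype('int64')
# it casts the 'bathrooms' column from 'float64' to 'int64'
# note: astype('int64') truncates, so any fractional value (for example 2.25) loses its fractional part

data['floors'] = data['floors'].astype('int64')
# it casts the 'floors' column from 'float64' to 'int64' (fractional values are truncated here too)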
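
A small sketch of what the cast does to fractional values, using a made-up Series (the numbers are hypothetical, not taken from the dataset). It contrasts plain truncation with rounding first, which is one possible alternative if truncation is not wanted.

```python
import pandas as pd

# Hypothetical bathroom counts with fractional values
baths = pd.Series([1.0, 2.25, 2.75, 1.75])

truncated = baths.astype("int64")        # drops the fractional part
rounded = baths.round().astype("int64")  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```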

### Correlation of data

In [None]:
data.corr(numeric_only=True)
# corr() computes the pairwise correlation of the columns in the DataFrame.
# NA values are excluded automatically. numeric_only=True (available in pandas 1.5 and later) skips
# non-numeric columns such as 'date'; without it, pandas 2.0 and later raises an error on them.
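
To make the correlation values concrete, here is a minimal sketch of the Pearson correlation coefficient (the default method of `corr()`) computed by hand with NumPy and checked against `np.corrcoef`. The helper name `pearson_r` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up living areas and prices, roughly linear
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 480000, 610000]

r = pearson_r(sqft, price)
print(r, np.corrcoef(sqft, price)[0, 1])
```

A value close to 1 means the two columns rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.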

### Visualisation of data (Housing Price Prediction)

In [None]:
sns.pairplot(data)
# pairplot() plots pairwise bivariate distributions for the variables in a dataset.
# It shows the relationship for each combination of variables in a DataFrame as a matrix of plots;
# the diagonal plots are the univariate distributions.

In [None]:
sns.set(rc={'figure.figsize':(17,11)})
sns.heatmap(data.corr(numeric_only=True), annot = True, cmap = 'magma')
# A heatmap is a graphical representation of a matrix in which each value is shown as a color.
# With the 'magma' colormap, brighter colors represent higher correlation values and
# darker colors represent lower values. numeric_only=True (pandas 1.5 and later) skips non-numeric columns such as 'date'.
# Heatmaps in Seaborn are plotted with the seaborn.heatmap() function.

In [None]:
sns.histplot(data.loc[:,'price'], kde=True, stat='density')
# histplot() shows the univariate distribution of a variable as a histogram;
# kde=True overlays a kernel density estimate and stat='density' scales the bars to a density.
# It is used here in place of distplot(), which has been deprecated since seaborn 0.11.

###  Regression 

Linear regression models assume that the relationship between a dependent continuous variable Y and one or more explanatory (independent) variables X is linear (that is, a straight line). It’s used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). Linear regression models can be divided into two main types:

#### Simple Linear Regression:
Simple linear regression uses a traditional slope-intercept form, where a and b are the coefficients that we try to “learn” and produce the most accurate predictions. X represents our input data and Y is our prediction.

##### $Y = bX + a$
    where Y: Predicted value
    X: Predictor
    a: Intercept (estimated by regression)
    b: Coefficient (estimated by regression)
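
As a minimal sketch of how the intercept a and the coefficient b can be estimated, the snippet below applies the standard closed-form least-squares formulas to a tiny made-up dataset. The values are illustrative and chosen so that the exact answer is known.

```python
import numpy as np

# Made-up data lying exactly on the line Y = 2X + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Slope: covariance of x and y divided by the variance of x
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Intercept: the fitted line passes through the point of means
a = y.mean() - b * x.mean()

print(a, b)
```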

#### Multivariable Regression
A more complex, multi-variable linear equation might look like this, where w represents the coefficients or weights, our model will try to learn.
##### $Y(x_1, x_2, x_3, \ldots, x_n) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_n x_n$

The variables x_1, x_2, x_3.......,x_n represent the attributes or distinct pieces of information, we have about each observation.
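
The weights can be sketched with NumPy's least-squares solver. This is an illustration on synthetic data with assumed names (`X_toy`, `y_toy`, `w`), not the model fitted later in this notebook; a column of ones is added so that the first weight plays the role of w0.

```python
import numpy as np

# Synthetic observations with two attributes each (illustrative values)
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 2.0],
])
# Targets generated from known weights: w0 = 4, w1 = 1.5, w2 = -2
y_toy = 4.0 + 1.5 * X_toy[:, 0] - 2.0 * X_toy[:, 1]

# Prepend a column of ones so the intercept w0 is learned like any other weight
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve for the weights that minimize the sum of squared errors
w, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
print(w)
```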

## Building model through Sklearn's linear_model 

### Extracting useful data and dividing them in dependent and independent variables

In [None]:
X_var = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built']].values
# independent variables (twelve features; 'id', 'date', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15' and 'sqft_lot15' are left out)
y_var = data['price'].values # dependent variable (the 'price' column, which we want to predict from the independent variables)

### Importing train_test_split
It is a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data. With this function, you don't need to divide the dataset manually. By default, Sklearn train_test_split will make random partitions for the two subsets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_var, y_var, test_size = 0.2, random_state = 0)
# Here X_train, y_train are the training data (0.8, or 80%, of the rows) and X_test, y_test are the test data (the remaining 20%).
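
A short sketch of what the split produces, on a tiny made-up array so the sizes are easy to check. The names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations with two features each; row i has target i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# 80% of the rows go to training, 20% to testing
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, and each feature row stays paired with its target after the split.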

### Building model from 'sklearn's linear_model' library

In [None]:
model=LinearRegression()

# 'LinearRegression' fits a linear model with coefficients that minimize the residual sum of squares between the
# observed targets in the dataset and the targets predicted by the linear approximation.

### Fitting variables to the above built model

In [None]:
model.fit(X_train,y_train)

### Predicting through the above built model

In [None]:
predictions = model.predict(X_test)
# we have assigned all predictions for the test data to the 'predictions' variable.
pd.Series(predictions)
# to convert our array into a pandas Series..

In [None]:
model.coef_
# To show coefficients of various features which helped to predict the price of house..

In [None]:
model.intercept_
# To show above built model's intercept

In [None]:
model.score(X_test,y_test)
# To show the R-squared value (coefficient of determination) on the test data: the share of the variance in price that the model explains.
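
For a regressor, `score()` returns R-squared. The sketch below computes the same quantity by hand with NumPy on made-up numbers; the helper name `r_squared` is an assumption for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

actual = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(actual, actual)      # predictions match exactly
baseline = r_squared(actual, [6.0] * 4)  # always predicting the mean

print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so R-squared measures explained variance rather than the fraction of correct predictions.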

In [None]:
import numpy as np #importing numpy library..
# NumPy is a Python library used for working with arrays. 
# It also has functions for working in domain of linear algebra, fourier transform, and matrices. 

plt.scatter(y_test, predictions)# to scatter 'y_test' and 'predictions'
plt.xlabel('Actual Labels')# to x-label
plt.ylabel('Predicted Labels')# to y-label
plt.title('Predictions vs Actuals')# for title of graph
z = np.polyfit(y_test, predictions, 1)
# The Numpy polyfit() method is used to fit our data inside a polynomial function, which is straight line here.

p = np.poly1d(z)
#The numpy.poly1d() function helps to define a polynomial function. 
#It makes it easy to apply “natural operations” on polynomials.

plt.plot(y_test,p(y_test), color='magenta')
plt.show()

In [None]:
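sns.histplot(y_test - predictions, bins=50, color='red', kde=True, stat='density');
# histogram of the residuals (actual minus predicted price); histplot() is used in place of the deprecated distplot()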

### In the above histogram, the residuals are roughly bell shaped (approximately normally distributed), which is consistent with the assumptions of linear regression and is a good sign for the model's predictions.

In [None]:
data_2=pd.DataFrame(data=y_test,columns=['price'])
# to create a dataframe of test_y (test price) ..
data_2['prediction1']=predictions
# to add predictions in above created dataframe..
data_2

### Importing the 'mean_squared_error' function from 'sklearn.metrics'

In [None]:
from sklearn.metrics import mean_squared_error

#### sklearn.metrics
The 'sklearn.metrics' module implements several loss, score, and utility functions to measure the performance of classification, regression and clustering models. Some classification metrics might require probability estimates of the positive class, confidence values, or binary decision values.

#### mean_squared_error
'mean_squared_error' function computes mean square error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss.

In [None]:
print('MSE:', mean_squared_error(y_test, predictions))
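
The same quantity can be sketched by hand with NumPy. The helper name `mse` and the prices below are made up for illustration; the square root (RMSE) is included because it is expressed in the same units as the price.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical actual and predicted prices
actual_prices = [200000.0, 350000.0, 500000.0]
predicted_prices = [210000.0, 330000.0, 520000.0]

err = mse(actual_prices, predicted_prices)
rmse = err ** 0.5  # back in price units

print(err, rmse)
```

Because the errors are squared, MSE values for house prices are very large numbers; RMSE is usually easier to interpret.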

In [None]:
data3=pd.DataFrame(data=data['price'],columns=['price'])
data3
# to create a dataframe so that we can further add our predictions and compare them..

## Building model through 'statsmodels'

In [None]:
model1=smf.ols('price ~ id+date+bedrooms+bathrooms+sqft_living+sqft_lot+floors+waterfront+view+condition+grade+sqft_above+sqft_basement+yr_built+yr_renovated+zipcode+lat+long+sqft_living15+sqft_lot15',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'id', 'date', 'bedrooms', 'bathrooms', 'sqft_living',
#'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade','sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
#'lat', 'long', 'sqft_living15', 'sqft_lot15' (every column except the target 'price')
# and fitting it to our data

data3['ypred_f'] = model1.predict(data[['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']])

# predicting value with the above built model and features used in building that model

model1.summary()
# to show all the details about above built model..
MSE1 = np.square(np.subtract(data['price'],data3['ypred_f'])).mean()
MSE1
# To show the mean squared error of model1 using numpy...

In [None]:
data3

In [None]:
model2=smf.ols('price ~ sqft_living+grade+sqft_above+sqft_living15+bathrooms+view',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'sqft_living','grade',
# 'sqft_above','sqft_living15','bathrooms' and 'view' and fitting it to our data
data3['ypred_4'] = model2.predict(data[['sqft_living','grade','sqft_above','sqft_living15','bathrooms','view']])
# predicting value with the above built model and features used in building that model
model2.summary()
# to show all the details about above built model..
MSE2 = np.square(np.subtract(data['price'],data3['ypred_4'])).mean()
MSE2
# To show mean squared error using numpy...

In [None]:
data3

In [None]:
model3=smf.ols('price ~ sqft_living+grade+sqft_above+sqft_living15+bathrooms+view+sqft_basement+bedrooms',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'sqft_living','grade',
# 'sqft_above','sqft_living15','bathrooms','view', 'sqft_basement' and 'bedrooms' and fitting it to our data
data3['ypred_3'] = model3.predict(data[['sqft_living','grade','sqft_above','sqft_living15','bathrooms','view','sqft_basement','bedrooms']])
# predicting value with the above built model and features used in building that model
model3.summary()
# to show all the details about above built model..
MSE3 = np.square(np.subtract(data['price'],data3['ypred_3'])).mean()
MSE3
# To show mean squared error using numpy...

In [None]:
data3

In [None]:
model4=smf.ols('price ~ sqft_living+grade+sqft_above+sqft_living15+bathrooms+view+sqft_basement+bedrooms+lat',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'sqft_living','grade','sqft_above','sqft_living15',
# 'bathrooms','view','sqft_basement','bedrooms','lat' and fitting it to our data
data3['ypred_2'] = model4.predict(data[['sqft_living','grade','sqft_above','sqft_living15','bathrooms','view','sqft_basement','bedrooms','lat']])

# predicting value with the above built model and features used in building that model
model4.summary()
# to show all the details about above built model..
MSE4 = np.square(np.subtract(data['price'],data3['ypred_2'])).mean()
MSE4
# To show mean squared error using numpy...

In [None]:
data3

In [None]:
model5=smf.ols('price ~ sqft_living+grade+sqft_above+sqft_living15+bathrooms+view+sqft_basement+bedrooms+lat+yr_renovated+id+floors+waterfront',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'sqft_living','grade','sqft_above',
#'sqft_living15','bathrooms','view','sqft_basement','bedrooms','lat','yr_renovated','id','floors','waterfront' and
# fitting it to our data
data3['ypred_1'] = model5.predict(data[['sqft_living','grade','sqft_above','sqft_living15','bathrooms','view','sqft_basement','bedrooms','lat','yr_renovated','id','floors','waterfront']])
# predicting value with the above built model and features used in building that model
model5.summary()
# to show all the details about above built model..
MSE5 = np.square(np.subtract(data['price'],data3['ypred_1'])).mean()
MSE5
# To show mean squared error using numpy...

In [None]:
model6=smf.ols('price ~ sqft_living+lat+bathrooms+yr_built+bedrooms+grade+floors+view+condition+waterfront+long',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'sqft_living','lat',
# 'bathrooms','yr_built','bedrooms','grade','floors','view','condition','waterfront','long' and
# fitting it to our data

data3['ypred'] = model6.predict(data[['sqft_living','lat','bathrooms','yr_built','bedrooms','grade','floors','view','condition','waterfront','long']])

# predicting value with the above built model and features used in building that model

model6.summary()
# to show all the details about above built model..
MSE6 = np.square(np.subtract(data['price'],data3['ypred'])).mean()
MSE6
# To show mean squared error using numpy...

In [None]:
model7=smf.ols('price ~ id+bedrooms+bathrooms+sqft_living+sqft_lot+waterfront+view+condition+grade+sqft_above+sqft_basement+yr_built+yr_renovated+zipcode+lat+long+sqft_living15+sqft_lot15',data=data).fit()
# building our model through statsmodels.formula.api's ordinary least squares (ols) on the basis of features 'id', 'bedrooms', 'bathrooms', 'sqft_living',
#'sqft_lot', 'waterfront', 'view', 'condition', 'grade','sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
#'lat', 'long', 'sqft_living15', 'sqft_lot15' (all features except 'date' and 'floors')
# and fitting it to our data

data3['ypred_f'] = model7.predict(data[['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']])

# predicting value with the above built model and features used in building that model

model7.summary()
# to show all the details about above built model..
MSE7 = np.square(np.subtract(data['price'],data3['ypred_f'])).mean()
MSE7
# To show mean squared error using numpy...

In [None]:
data3

In [None]:
plt.scatter(data['price'], data3['ypred_f'])# to scatter the actual prices against the predictions in 'ypred_f'
plt.xlabel('Actual Labels')# to x-label
plt.ylabel('Predicted Labels')# to y-label
plt.title('Predictions vs Actuals')# for title of graph
z1 = np.polyfit(data['price'], data3['ypred_f'], 1)
# The Numpy polyfit() method is used to fit our data inside a polynomial function.

p1 = np.poly1d(z1)
"""
The numpy.poly1d() function helps to define a polynomial function. 
It makes it easy to apply “natural operations” on polynomials.
"""

plt.plot(data['price'],p1(data['price']), color='magenta')# use p1, the line fitted to these predictions
plt.show()

#### Hence, on the basis of the p-values of the individual features, we select 'model6', which has an R-squared of about 69.5%, so our further visualisations are based on it.

In [None]:
data3

### Importing 'plotly' library

#### Plotly
Plotly Express is the easy-to-use, high-level interface to Plotly. It provides functions to visualize a variety of types of data and produces easy-to-style figures.

In [None]:
import plotly.express as px

### Visualisation of individual feature's regression model with price 

In [None]:
variable=['sqft_living','lat','bathrooms','sqft_living15','bedrooms','grade','floors','view','sqft_basement','waterfront','sqft_above']
# making a list of the features to plot against price and assigning it to 'variable'.
for i in variable:
    fig2 = px.scatter(
        data, x=i, y='price', opacity=0.65,
        trendline='ols', trendline_color_override='darkblue',
        title=f"Linear regression of price on {i}"
    )
    fig2.show()
# running a for loop over the items in the list 'variable' to visualise a scatter plot of each feature against price,
# together with its regression line, using plotly.express (trendline='ols' requires statsmodels)

### Statistical Summary


#### We have tried several models on our 'Housing Price Prediction' dataset. The three best cases are:

#### Case 1:

##### Our first model, built with the 'statsmodels' library's OLS (ordinary least squares) method, has an R-squared of about 71% and a mean squared error of around 39193614215.43002. It uses all of the features except the price of the house as independent variables ('id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15' and 'sqft_lot15').

#### Case 2:

##### Our last model, built with the 'statsmodels' library's OLS (ordinary least squares) method, has an R-squared of about 70% and a mean squared error of around 39193614215.43002. It uses the features 'id', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15' and 'sqft_lot15' as independent variables (these variables were selected by their correlation with the price of the house and the p-values in the model summary).

#### Case 3:

##### Our first model, built with the sklearn library, has an R-squared of about 66% and a mean squared error of around 40212974895.048164.

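#### Hence, all three models give reasonable fits, but we prefer the models in Case 1 and Case 2 because they have a higher R-squared and a lower mean squared error than the model in Case 3. Note that the Case 3 figures were computed on the held-out test set, while the Case 1 and Case 2 figures were computed on the same data the models were fitted on, so the comparison is not strictly like for like.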