## Day 34 - DIY Solution

**Q1. Problem Statement: Linear Regression** <br>
Load the housing_price.csv dataset into a DataFrame and perform the following tasks:<br>
The housing_price dataset contains only numeric data, and the median_house_value column is the target variable. Use linear regression to build a model that predicts house prices.<br> Perform the tasks below to build the model.

1. Load the housing_price dataset into a DataFrame
2. Find any null values and drop them if present
3. Split x and y into train and test datasets with a test size of 0.2 and random_state of 10
4. Call the LinearRegression model, then fit the model using the train data
5. Print the R2 value, coefficients and intercept
6. Compare actual and predicted values
7. Print the final summary




**Step-1:** Importing Libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import ElasticNet
import statsmodels.api as sm



## Dataset

Input variables:<br>
1 - longitude: used for location<br>
2 - latitude: used for location<br>
3 - housing_median_age: median age of the houses in the area<br>
4 - total_rooms: data is in numeric form<br>
5 - total_bedrooms: data is in numeric form<br>
6 - population: population of the surrounding area<br>
7 - households: data is in numeric form<br>
8 - median_income: data is in numeric form<br>


Output variable:<br>
9 - median_house_value<br>

**Step-2:** Loading the CSV file into a DataFrame.

In [27]:
df = pd.read_csv("/content/housing_price.csv")

In [28]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 |


**Step-3:** Checking the dataset for null values.

In [29]:
# no missing values
df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
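
The check above finds no nulls, so nothing has to be dropped here. For completeness, the following is a minimal sketch of how the "drop it, if it is there" part of the task could look when nulls do exist. The `toy` frame and its values are hypothetical stand-ins, not rows from housing_price.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "median_income": [1.49, np.nan, 1.65, 3.19],
    "median_house_value": [66900, 80100, np.nan, 73400],
})

# Count nulls per column, exactly as df.isna().sum() does above
null_counts = toy.isna().sum()
print(null_counts)

# Drop rows containing any null only when at least one null exists
if null_counts.sum() > 0:
    toy_clean = toy.dropna().reset_index(drop=True)
else:
    toy_clean = toy

print(toy_clean.shape)
```

Dropping rows is only one option; whether it is appropriate depends on how many rows would be lost.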

**Step-4:** Separating dependent and independent features into new arrays.

In [30]:
#Restructuring the DataFrame into dependent and independent arrays
x = df.drop(columns = "median_house_value").values 
y = df.median_house_value.values

In [31]:
print("independent data\n",x)
print("\ndependent data\n",y)

independent data
 [[-114.31     34.19     15.     ... 1015.      472.        1.4936]
 [-114.47     34.4      19.     ... 1129.      463.        1.82  ]
 [-114.56     33.69     17.     ...  333.      117.        1.6509]
 ...
 [-124.3      41.84     17.     ... 1244.      456.        3.0313]
 [-124.3      41.8      19.     ... 1298.      478.        1.9797]
 [-124.35     40.54     52.     ...  806.      270.        3.0147]]

dependent data
 [ 66900  80100  85700 ... 103600  85800  94600]


**Step-5:** Splitting x and y into train and test sets.

In [32]:
#split the data into train and test sets
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state = 10)

In [33]:
#check shape of the data 
print("x_train and x_test dataset shape",x_train.shape,x_test.shape)
print("y_train and y_test dataset shape",y_train.shape,y_test.shape)

x_train and x_test dataset shape (13600, 8) (3400, 8)
y_train and y_test dataset shape (13600,) (3400,)
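
The shapes above follow directly from `test_size=0.2`: 20% of the 17000 rows (3400) go to the test set and the remaining 13600 to the train set. The sketch below illustrates the same arithmetic, and the effect of fixing `random_state`, on a smaller synthetic array. The `x_demo` and `y_demo` arrays are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 8 feature columns
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 8))
y_demo = rng.normal(size=100)

# Two splits with the same random_state
a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=10
)
c_train, c_test, d_train, d_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=10
)

# 20% of 100 rows land in the test set
print(a_train.shape, a_test.shape)

# The same random_state reproduces the same split
print(np.array_equal(a_test, c_test))
```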


**Step-6:** Call the LinearRegression model and use the fit() method with the train dataset for training.

In [34]:
# call model and fit the model using train data
regressor = LinearRegression()  
regressor.fit(x_train, y_train)

LinearRegression()

**Step-7:** Printing R2 value, coefficient and intercept.

In [None]:
#print r2 value
regressor.score(x_test,y_test)

0.6484403017760421

In [None]:
#print R2 value, coefficient and intercept
print("R2 value:",regressor.score(x_test,y_test))
print("\ncoefficient: \n ",regressor.coef_)
print("\nintercept:",regressor.intercept_)

R2 value: 0.6484403017760421

coefficient: 
  [-4.34225673e+04 -4.34584915e+04  1.15417922e+03 -8.34683693e+00
  1.14234465e+02 -3.87425498e+01  5.04252279e+01  4.02554220e+04]

intercept: -3635200.010897606
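
For a regressor, `score()` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below recomputes that value by hand on synthetic data and pairs each coefficient with a feature name, which is easier to read than a bare array. The data and the names `f0`, `f1`, `f2` are hypothetical, not taken from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0
y = y + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R2 computed from its definition
ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("manual R2 :", r2_manual)
print("score() R2:", model.score(X, y))

# Pair each coefficient with a (hypothetical) feature name
feature_names = ["f0", "f1", "f2"]
for name, coef in zip(feature_names, model.coef_):
    print(name, round(float(coef), 3))
print("intercept", round(float(model.intercept_), 3))
```

In the notebook itself, the same pairing could be done with the column names of `df.drop(columns="median_house_value")`, since `x` was built from those columns in that order.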


**Step-8:** Use the trained model to predict on the test data, then compare the predictions with the original test values.

In [None]:
# find actual vs prediction
y_pred = regressor.predict(x_test)

In [None]:
pred = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = pred.head(10)
df1

| | Actual | Predicted |
|---|---|---|
| 0 | 96100 | -8475.675202 |
| 1 | 500001 | 490876.394233 |
| 2 | 177200 | 112662.10799 |
| 3 | 55000 | 218093.753334 |
| 4 | 220800 | 207600.925885 |
| 5 | 158300 | 121540.170888 |
| 6 | 37900 | 180602.126583 |
| 7 | 115600 | 104694.108104 |
| 8 | 359700 | 310765.123759 |
| 9 | 203300 | 265864.990208 |
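
A side-by-side table is easier to judge with an explicit error column. The sketch below rebuilds the ten rows shown above (values copied from that output) and adds signed and absolute errors; the column names `Error` and `AbsError` are illustrative additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# The ten Actual/Predicted pairs from the comparison table
compare = pd.DataFrame({
    "Actual": [96100, 500001, 177200, 55000, 220800,
               158300, 37900, 115600, 359700, 203300],
    "Predicted": [-8475.675202, 490876.394233, 112662.10799,
                  218093.753334, 207600.925885, 121540.170888,
                  180602.126583, 104694.108104, 310765.123759,
                  265864.990208],
})

# Signed error (actual minus predicted) and its magnitude
compare["Error"] = compare["Actual"] - compare["Predicted"]
compare["AbsError"] = compare["Error"].abs()

# The three rows where the model is furthest off
worst = compare.sort_values("AbsError", ascending=False).head(3)
print(worst)

# Mean squared error over these ten rows only
mse_first_10 = np.mean(compare["Error"] ** 2)
print("MSE over these 10 rows:", mse_first_10)
```

Row 0 also shows a limitation of plain linear regression: nothing stops it from predicting a negative price. The ten-row MSE computed here comes out at about 7.01e9, which agrees with the `mse` value printed in the summary cell.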


In [None]:
# summarize the fit of the model
# note: df1 holds only the first 10 test rows, so this mse covers those 10 rows, not the whole test set
mse = np.mean((df1.Predicted-df1.Actual)**2)
print( "coefficient :",regressor.coef_)
print("\n intercept :",regressor.intercept_)
print("\n mse :",mse)
print("\n final score",regressor.score(x_test,y_test))

coefficient : [-4.34225673e+04 -4.34584915e+04  1.15417922e+03 -8.34683693e+00
  1.14234465e+02 -3.87425498e+01  5.04252279e+01  4.02554220e+04]

 intercept : -3635200.010897606

 mse : 7010137827.159782

 final score 0.6484403017760421
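
The notebook imports `mean_squared_error` in Step-1 but does not call it. As a hedged sketch, this is how the error could be computed over a complete set of predictions, together with the RMSE, which is in the same units as the target. The arrays `y_true` and `y_hat` are tiny made-up examples, not values from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Tiny hypothetical arrays: errors are 1, 0 and -2
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# MSE via scikit-learn and via NumPy give the same value
mse_sklearn = mean_squared_error(y_true, y_hat)
mse_numpy = np.mean((y_true - y_hat) ** 2)

# RMSE is the square root of the MSE
rmse = np.sqrt(mse_sklearn)

print("MSE :", mse_sklearn)
print("RMSE:", rmse)
```

In the notebook, passing `y_test` and `y_pred` to `mean_squared_error` would give the error over all 3400 test rows rather than a sample of them.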


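**Step-9:** Print the final summary using a statsmodels OLS model. Note that this fit uses the full dataset (x, y), not only the training split, so its R-squared differs slightly from the test-set score above.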

In [None]:
X2 = sm.add_constant(x)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.641
Method:                 Least Squares   F-statistic:                     3798.
Date:                Wed, 20 Apr 2022   Prob (F-statistic):               0.00
Time:                        03:40:10   Log-Likelihood:            -2.1365e+05
No. Observations:               17000   AIC:                         4.273e+05
Df Residuals:                   16991   BIC:                         4.274e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3.621e+06   6.92e+04    -52.312      0.0