<h2 style="color:green" align="center"> Machine Learning With Python: Linear Regression with Multiple Variables</h2>

<h3 style="color:purple">Sample problem: predicting home prices in Monroe, New Jersey (USA)</h3>

Below is a table of home prices in Monroe Township, NJ. Here the price depends on **area (square feet), number of bedrooms, and age of the home (in years)**. Given these prices, we want to predict the prices of new homes from their area, bedrooms, and age.

<img src="homeprices.jpg" style='height:200px;width:350px'>

Given these home prices, find the price of a home that has:

**3000 sq ft area, 3 bedrooms, 40 years old**

**2500 sq ft area, 4 bedrooms, 5 years old**

We will use linear regression with multiple variables here. The price can be calculated using the following equation:

<img src="equation.jpg" >

Here area, bedrooms, and age are called independent variables or **features**, whereas price is the dependent variable.

In [1]:
import pandas as pd
import numpy as np
#from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv('homeprices.csv')
data

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


**Data Preprocessing: Fill NA values with median value of a column**

In [3]:
# Missing numeric values will be replaced with the median of their column
data.bedrooms.median()

4.0

In [4]:
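# Assign the result back: fillna(inplace=True) on a selected column may not update `data` in newer pandas versions
data.bedrooms = data.bedrooms.fillna(data.bedrooms.median())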
data

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000
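
To make the median fill concrete, here is a minimal pure-Python sketch of the same idea, using only the bedroom values from the table above. The variable names are illustrative and not part of the notebook. It assumes, as pandas does by default, that missing values are skipped when the median is computed.

```python
import math
import statistics

# Bedroom counts from the table above; NaN marks the missing value in row 2.
bedrooms = [3.0, 4.0, float("nan"), 3.0, 5.0, 6.0]

# Compute the median over the observed values only.
observed = [b for b in bedrooms if not math.isnan(b)]
median_bedrooms = statistics.median(observed)

# Replace each missing entry with that median.
filled = [median_bedrooms if math.isnan(b) else b for b in bedrooms]

print(median_bedrooms)
print(filled)
```

The median is often preferred over the mean here because it is less sensitive to a single unusually large or small value.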


In [5]:
x = data.drop('price',axis='columns')
y = data.price

In [6]:
x

Unnamed: 0,area,bedrooms,age
0,2600,3.0,20
1,3000,4.0,15
2,3200,4.0,18
3,3600,3.0,30
4,4000,5.0,8
5,4100,6.0,8


In [7]:
y

0    550000
1    565000
2    610000
3    595000
4    760000
5    810000
Name: price, dtype: int64

In [8]:
model = LinearRegression()
# Train the model: x holds the features (area, bedrooms, age), y holds the price
model.fit(x, y)

LinearRegression()

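With a single variable, the fitted line is $y = mx + c$.

With three features it becomes $y = m_1 x_1 + m_2 x_2 + m_3 x_3 + c$, where $m_1, m_2, m_3$ are the coefficients (`model.coef_`) and $c$ is the intercept (`model.intercept_`).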

In [9]:
model.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [10]:
model.intercept_

221323.00186540408
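
As a rough cross-check of where these numbers come from, the sketch below solves the same ordinary least-squares problem directly with NumPy, using the filled table shown earlier. It is an illustration under stated assumptions, not a description of scikit-learn's internals. The names `X_with_bias` and `params` are made up for this example.

```python
import numpy as np

# Feature rows (area, bedrooms, age) and prices from the filled table above.
X = np.array([
    [2600, 3.0, 20],
    [3000, 4.0, 15],
    [3200, 4.0, 18],
    [3600, 3.0, 30],
    [4000, 5.0, 8],
    [4100, 6.0, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last solved parameter plays the role of the intercept c.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem: minimize ||X_with_bias @ params - y||^2.
params, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coefficients, intercept = params[:3], params[3]

print(coefficients)
print(intercept)
```

The printed values should agree with `model.coef_` and `model.intercept_` above up to floating-point rounding.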

**Find the price of a home with 3000 sq ft area, 3 bedrooms, 40 years old**

In [11]:
model.predict([[3000, 3, 40]])

array([498408.25158031])

In [12]:
model.predict([[3000, 4, 40]])

array([521797.13165825])

In [13]:
# Manual check: m1*area + m2*bedrooms + m3*age + intercept
112.06244194*3000 + 23388.88007794*3 + -3231.71790863*40 + 221323.00186540384

498408.25157402386
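
The manual calculation generalizes to any home. Here is a small sketch that wraps it in a function. `predict_price` is a hypothetical helper name, and the constants are copied from the `model.coef_` and `model.intercept_` outputs above.

```python
# Coefficients and intercept as printed by model.coef_ and model.intercept_ above.
COEFFICIENTS = [112.06244194, 23388.88007794, -3231.71790863]
INTERCEPT = 221323.00186540408


def predict_price(area, bedrooms, age):
    """Return m1*area + m2*bedrooms + m3*age + c (hypothetical helper)."""
    features = [area, bedrooms, age]
    return sum(m * x for m, x in zip(COEFFICIENTS, features)) + INTERCEPT


print(predict_price(3000, 3, 40))
print(predict_price(2500, 4, 5))
```

Because the coefficients are rounded to eight decimal places, the results can differ from `model.predict` in the last few digits.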

**Find the price of a home with 2500 sq ft area, 4 bedrooms, 5 years old**

In [14]:
model.predict([[2500, 4, 5]])

array([578876.03748933])

In [15]:
model.predict([[2500, 4, 0]])

array([595034.6270325])

In [16]:
model.score(x,y)

0.9550196399325818
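
For a regressor, `score` returns the coefficient of determination, R². The sketch below shows how that quantity is defined, using toy numbers chosen only for illustration (they are not from the home price data). `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, used only to illustrate the formula.
actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))                # perfect predictions
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

Note that the score above is computed on the same six rows used for training, so it likely overstates how well the model would do on unseen homes.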

<h3>Exercise</h3>

In the exercise folder (at the same level as this notebook on GitHub) there is **hiring.csv**. This file contains hiring statistics for a firm, such as each candidate's experience, written test score, and personal interview score. Based on these 3 factors, HR decides the salary. Given this data, build a machine learning model for the HR department that can help them decide salaries for future candidates. Use it to predict salaries for the following candidates:


**2 yr experience, 9 test score, 6 interview score**

**12 yr experience, 10 test score, 10 interview score**


In [18]:
newdf = pd.read_csv("Exercise/hiring.csv")
newdf

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [19]:
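# Fill the missing test score with the column minimum, assigning the result back to the DataFrame
test_col = newdf.columns[1]
newdf[test_col] = newdf[test_col].fillna(newdf[test_col].min())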
newdf

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,6.0,7,72000
7,eleven,7.0,8,80000


In [20]:
# The remaining missing values are in the experience column; mark them with the word "zero"
newdf.fillna("zero" , inplace=True)
newdf

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,6.0,7,72000
7,eleven,7.0,8,80000


In [21]:
from word2number import w2n
newdf.experience = newdf.experience.apply(w2n.word_to_num)
newdf

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,6.0,7,72000
7,11,7.0,8,80000
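
If `word2number` is not installed, a small lookup table is enough for this dataset. The sketch below is a stand-in under that assumption. `WORD_TO_NUMBER` and `word_to_number` are hypothetical names, and the table covers only the words that appear in the experience column above.

```python
# Hypothetical lookup table covering only the words seen in the experience column.
WORD_TO_NUMBER = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}


def word_to_number(word):
    """Convert a spelled-out number to an int; raises KeyError on unknown words."""
    return WORD_TO_NUMBER[word.strip().lower()]


experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_number(w) for w in experience_words]
print(experience_years)
```

Raising on unknown words is deliberate here: a silent default would hide typos in the data.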


In [27]:
# Manually set the first candidate's experience to 1 (a choice made in this notebook, not part of the exercise statement)
newdf.loc[0, "experience"] = 1

In [28]:
newdf

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,1,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,6.0,7,72000
7,11,7.0,8,80000


In [29]:
x = newdf.drop(newdf.columns[-1], axis=1)
y = newdf["salary($)"]

In [30]:
from sklearn import linear_model
newmodel = linear_model.LinearRegression()
newmodel.fit(x,y)

LinearRegression()

In [31]:
newmodel.predict([[2,9,6]])

array([53660.32674957])

In [32]:
newmodel.predict([[12,10,10]])

array([94858.81492319])

In [33]:
newmodel.score(x,y)

0.9673945616172213

In [47]:
# Equivalent one-liner: create and fit the model in a single expression (this rebinds `model`)
model = LinearRegression().fit( newdf.drop(newdf.columns[-1], axis=1)  , newdf["salary($)"])

<h3>Answer</h3>

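Reference answers: 53713.86 and 93747.79.

The predictions computed above (about 53660.33 and 94858.81) differ from these, likely because of the preprocessing choices made here, such as filling the missing test score with the column minimum and setting the first candidate's experience to 1.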