# Linear Regression (Multivariate)

## Example 1: House Price Prediction
The price of a house depends on more than one factor (`multiple variables`). Here we use three: area, number of bedrooms, and age. This is closer to a real-world scenario than a single-feature model.
* Our `Target Variable / Dependent Variable` is the price, because that is the variable we are trying to predict.
* The `Features or Independent Variables` are the area, number of bedrooms, and age of the house.

Area (sq ft) | Bedrooms | Age (years) | Price
---- | ---- | ---- | ----
2600 | 3.0 | 20 | 550000
3000 | 4.0 | 15 | 565000
3200 | NaN | 18 | 610000
3600 | 3.0 | 30 | 595000
4000 | 5.0 | 8 | 760000
4100 | 6.0 | 8 | 810000

The multivariate equation looks like:
$$ y = m_1 \cdot \text{area} + m_2 \cdot \text{bedrooms} + m_3 \cdot \text{age} + c $$
The equation can be generalized as:
$$ y = m_1 x_1 + m_2 x_2 + m_3 x_3 + c $$
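
For any number of features the same relationship can be written compactly. The form below is a sketch that assumes $n$ features and uses vector notation ($\mathbf{m}$, $\mathbf{x}$) that this notebook does not otherwise introduce:
$$ y = \sum_{i=1}^{n} m_i x_i + c = \mathbf{m}^\top \mathbf{x} + c $$
With $n = 3$ this reduces to the three-feature equation for area, bedrooms, and age.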


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
import math

### Importing the CSV file

In [2]:
dataset = pd.read_csv("data/homeprices.csv")
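
If `data/homeprices.csv` is not at hand, the table shown earlier can be rebuilt in memory. This is a minimal sketch that assumes the CSV holds exactly those six rows and uses the lowercase column names that appear in the notebook's outputs; the name `dataset_from_table` is hypothetical.

```python
import numpy as np
import pandas as pd

# Rows copied from the table at the top of this example.
# np.nan marks the one missing bedroom count.
dataset_from_table = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3.0, 4.0, np.nan, 3.0, 5.0, 6.0],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})

print(dataset_from_table.shape)
```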

### Check if there are any missing values

In [3]:
dataset.isna().sum()

area        0
bedrooms    1
age         0
price       0
dtype: int64

In [None]:
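# Plot area against price; x and y are not defined yet at this point in the notebook
fig, ax = plt.subplots()
ax.scatter(dataset.area, dataset.price)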
ax.set(title = 'Area of the houses in Monroe Twp, NJ (USA) and their Prices', 
       xlabel = 'Areas', 
       ylabel = 'Prices');

### Dealing with missing values
One value is missing: a bedroom count. There are several ways to fill it.
In this problem, we fill it with the `median` number of bedrooms in the dataset.
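
To compare a few of those options, here is a small sketch on the bedroom column alone. The values are copied from the table above; the variable names are illustrative, and which strategy is best depends on the data.

```python
import numpy as np
import pandas as pd

bedrooms = pd.Series([3.0, 4.0, np.nan, 3.0, 5.0, 6.0], name="bedrooms")

# Option 1: fill with the median (the approach used in this example).
filled_median = bedrooms.fillna(bedrooms.median())

# Option 2: fill with the mean, which is more sensitive to extreme values.
filled_mean = bedrooms.fillna(bedrooms.mean())

# Option 3: drop the incomplete row instead of filling it.
dropped = bedrooms.dropna()

print(bedrooms.median(), bedrooms.mean(), len(dropped))
```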

In [4]:
# Fill the missing value with the median number of bedrooms in the dataset.
# math is already imported in the first cell; floor() turns the float median into an int.
median_bedrooms = math.floor(dataset.bedrooms.median())
median_bedrooms, type(median_bedrooms)

(4, int)

In [5]:
dataset.bedrooms = dataset.bedrooms.fillna(median_bedrooms)
dataset

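| | area | bedrooms | age | price |
| ---- | ---- | ---- | ---- | ---- |
| 0 | 2600 | 3.0 | 20 | 550000 |
| 1 | 3000 | 4.0 | 15 | 565000 |
| 2 | 3200 | 4.0 | 18 | 610000 |
| 3 | 3600 | 3.0 | 30 | 595000 |
| 4 | 4000 | 5.0 | 8 | 760000 |
| 5 | 4100 | 6.0 | 8 | 810000 |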



### Fitting the Model

In [6]:
model = linear_model.LinearRegression()
x = dataset[['area', 'bedrooms', 'age']]
y = dataset[['price']]
model.fit(x,y)

m1 = model.coef_[0][0]
m2 = model.coef_[0][1]
m3 = model.coef_[0][2]
c = model.intercept_
print(f"m1 = {m1}, m2 = {m2}, m3 = {m3}, c = y-intercept = {c}")

m1 = 112.06244194213458, m2 = 23388.88007793922, m3 = -3231.7179086329634, c = y-intercept = [221323.0018654]
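
`LinearRegression` fits by ordinary least squares, so roughly the same coefficients can be reproduced with NumPy alone. The sketch below assumes the CSV holds the six rows shown above (with the missing bedroom value already filled with 4) and adds a column of ones so that the last solved value plays the role of `c`. The names `X_house`, `prices`, and the `_ls` suffix are illustrative.

```python
import numpy as np

# Features (area, bedrooms, age) after the median fill, and the prices.
X_house = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
prices = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the intercept is solved along with the slopes.
X_with_ones = np.column_stack([X_house, np.ones(len(X_house))])

# Least-squares solution of X_with_ones @ params ≈ prices.
params, *_ = np.linalg.lstsq(X_with_ones, prices, rcond=None)
m1_ls, m2_ls, m3_ls, c_ls = params
print(m1_ls, m2_ls, m3_ls, c_ls)
```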


### Making Predictions

Find the price of a home with an area of 3000 sq ft and 3 bedrooms that is 40 years old.

In [7]:
model.predict([[3000, 3, 40]])

array([[498408.25158031]])
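
As a check, the same number follows from plugging the fitted values into the equation by hand. The coefficients below are copied from the printed output above; the name `manual_price` is only illustrative.

```python
# Coefficients and intercept copied from the fitted model's printed output.
m1 = 112.06244194213458
m2 = 23388.88007793922
m3 = -3231.7179086329634
c = 221323.0018654

# y = m1*area + m2*bedrooms + m3*age + c
# for a 3000 sq ft, 3-bedroom, 40-year-old home.
manual_price = m1 * 3000 + m2 * 3 + m3 * 40 + c
print(round(manual_price, 2))  # approximately 498408.25
```

The negative `m3` means that, in this fit, each additional year of age lowers the predicted price.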

## Example 2: Hiring Salary Prediction

The file `hiring.csv` contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. Based on these 3 factors, HR decides the salary. Assuming that a linear regression model suits the problem, build a machine learning model that helps the HR department decide salaries for future candidates. Use it to predict salaries for the following candidates:

* 2 yr experience, 9 test score, 6 interview score

* 12 yr experience, 10 test score, 10 interview score

**Hint:** Use the `word2number` module. Official Documentation: [Link](https://pypi.org/project/word2number/)

In [8]:
dataset = pd.read_csv("data/hiring.csv")
dataset

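| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
| ---- | ---- | ---- | ---- | ---- |
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |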


In [9]:
type(dataset)

pandas.core.frame.DataFrame

In [10]:
dataset.isna().sum()

experience                    2
test_score(out of 10)         1
interview_score(out of 10)    0
salary($)                     0
dtype: int64

### Data Pre-Processing

1. Rewrite the column names.
2. Fill `NaN` values with 0 in the `experience` column.
3. Convert words to numbers in the `experience` column.
4. Fill `NaN` values in the `test_score` column with the mean score, rounded down.

#### **Rewrite column names**

In [11]:
# Drop the "(out of 10)" and "($)" suffixes so the columns are easier to reference
dataset.rename(columns = {'test_score(out of 10)':'test_score',
                          'interview_score(out of 10)':'interview_score',
                          'salary($)':'salary'},inplace = True)
dataset

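| | experience | test_score | interview_score | salary |
| ---- | ---- | ---- | ---- | ---- |
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |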


#### **Fill NaN values with 0 in experience column**

In [12]:
dataset.experience = dataset.experience.fillna(0)
dataset

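| | experience | test_score | interview_score | salary |
| ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |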


#### **Convert words to numbers in the experience column**

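`w2n.word_to_num` from the `word2number` module accepts a string as input, so the column is converted to strings first.
* `from word2number import w2n` is required.
* `df['DataFrame Column'] = df['DataFrame Column'].apply(str)` converts every value in the column to a string; `apply(str)` is a quick way to do this.
* A string of digits is accepted too: `w2n.word_to_num('112')` returns the number `112`, which is why the `0` values filled in earlier pass through the conversion.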
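
To see why the string conversion matters, here is a dependency-free sketch of the conversion step. It does not use `word2number`: `WORDS` and `to_number` are hypothetical stand-ins that cover only the words appearing in this dataset and pass a digit string through as its integer value.

```python
# Hypothetical lookup covering only the words in the experience column.
WORDS = {"two": 2, "three": 3, "five": 5, "seven": 7, "ten": 10, "eleven": 11}

def to_number(text: str) -> int:
    """Convert a number word or a digit string to an int."""
    text = text.strip().lower()
    if text.isdigit():          # "0" stays 0, "112" becomes 112
        return int(text)
    return WORDS[text]          # raises KeyError for words outside the table

# After fillna(0) the column mixes the int 0 with strings, hence str() first.
experience = [0, 0, "five", "two", "seven", "three", "ten", "eleven"]
converted = [to_number(str(value)) for value in experience]
print(converted)  # [0, 0, 5, 2, 7, 3, 10, 11]
```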

In [16]:
# Convert to string and replace the words with num
from word2number import w2n
dataset.experience = dataset.experience.apply(str)
dataset.experience = dataset.experience.apply(w2n.word_to_num)
dataset

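| | experience | test_score | interview_score | salary |
| ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | NaN | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |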


#### **Fill NaN values with mean score (rounded down) in test_score column**

In [17]:
test_score_mean = math.floor(dataset.test_score.mean())
test_score_mean, type(test_score_mean)

(7, int)
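
Note that `math.floor` rounds down, which here gives a different value than rounding to the nearest integer. A small sketch using the seven known test scores from the table (the variable names are illustrative):

```python
import math

# The seven non-missing test scores from the dataset.
known_scores = [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0]
mean_score = sum(known_scores) / len(known_scores)

print(mean_score)              # about 7.857
print(math.floor(mean_score))  # 7, the value used in this example
print(round(mean_score))       # 8, what rounding to nearest would give
```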

In [18]:
dataset.test_score = dataset.test_score.fillna(test_score_mean)
dataset

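| | experience | test_score | interview_score | salary |
| ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 7.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |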


### Fitting the Model

In [19]:
model = linear_model.LinearRegression()
model.fit(dataset[['experience','test_score','interview_score']],dataset['salary'])

LinearRegression()

### Making Predictions

* 2 yr experience, 9 test score, 6 interview score

* 12 yr experience, 10 test score, 10 interview score

In [22]:
print(model.predict([[12, 10, 10]]))

[93747.79628651]


In [23]:
print(model.predict([[2,9,6]]))

[53713.86677124]
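
The fitted coefficients for this example are not printed above. One way to inspect them without scikit-learn is an ordinary least-squares solve in NumPy. This sketch assumes the pre-processed table shown earlier is the full dataset and that `LinearRegression` is fitting plain least squares with an intercept (its default); `X_hiring`, `coefs`, `intercept`, and `predict_salary` are illustrative names.

```python
import numpy as np

# Pre-processed rows: experience, test_score, interview_score.
X_hiring = np.array([
    [0, 8, 9], [0, 8, 6], [5, 6, 7], [2, 10, 10],
    [7, 9, 6], [3, 7, 10], [10, 7, 7], [11, 7, 8],
], dtype=float)
salary = np.array([50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000], dtype=float)

# A trailing column of ones lets the solver estimate the intercept too.
A = np.column_stack([X_hiring, np.ones(len(X_hiring))])
params, *_ = np.linalg.lstsq(A, salary, rcond=None)
coefs, intercept = params[:3], params[3]

def predict_salary(experience, test_score, interview_score):
    """Apply salary = coefs . features + intercept."""
    features = [experience, test_score, interview_score]
    return float(np.dot(coefs, features) + intercept)

print(coefs, intercept)
print(predict_salary(2, 9, 6), predict_salary(12, 10, 10))
```

The two printed predictions should agree closely with the `model.predict` outputs above.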
