# Linear Regression


Source: Introduction to AI and Machine Learning with Python

05 - Machine Learning <br>
video 08 Project - Salary Prediction


In [1]:
import pandas as pd

## Explore the data

In [2]:
# import the data into pandas as a dataframe

file = "./data/008 hiring-salaries-201119-162808.csv"
df = pd.read_csv(file)

In [3]:
df

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [4]:
df.shape

(8, 4)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   experience          6 non-null      object 
 1   test_score_10       7 non-null      float64
 2   interview_score_10  8 non-null      int64  
 3   salary              8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


In [6]:
# check missing values using .isnull() method
df.isnull().sum(axis=0)

experience            2
test_score_10         1
interview_score_10    0
salary                0
dtype: int64

In [7]:
df.describe()

Unnamed: 0,test_score_10,interview_score_10,salary
count,7.0,8.0,8.0
mean,7.857143,7.875,63000.0
std,1.345185,1.642081,11501.55269
min,6.0,6.0,45000.0
25%,7.0,6.75,57500.0
50%,8.0,7.5,63500.0
75%,8.5,9.25,70500.0
max,10.0,10.0,80000.0


## Data Cleaning

In [8]:
df.head()

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000


A NaN in the experience column most likely means the candidate has zero years of experience.<br>
So it makes sense to replace NaN with "zero" (the column stores numbers as words).

In [9]:
df.experience = df.experience.fillna("zero")

In [10]:
df

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


Let's fill the NaN values in test_score_10 with the median of that column

In [11]:
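# fill only the test_score_10 column, so other columns are never touched by accident
df["test_score_10"] = df["test_score_10"].fillna(df["test_score_10"].median())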

In [12]:
df

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


Confirm that no NaN values remain

In [13]:
df.isnull()

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False
6,False,False,False,False
7,False,False,False,False


In [14]:
df.isnull().sum(axis = 0)

experience            0
test_score_10         0
interview_score_10    0
salary                0
dtype: int64
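
The fill value used for test_score_10 can be checked by hand. The sketch below rebuilds that one column from the raw table (it is not part of the original notebook) and compares the median with the mean; `scores` and `filled` are names chosen here for illustration.

```python
import pandas as pd

# test_score_10 values as they appear in the raw table (None marks the missing entry).
scores = pd.Series([8.0, 8.0, 6.0, 10.0, 9.0, 7.0, None, 7.0], name="test_score_10")

median_score = scores.median()  # NaN is skipped by default
mean_score = scores.mean()      # also skips NaN

# Fill only this column's missing entry with its median.
filled = scores.fillna(median_score)

print(median_score, round(mean_score, 6))
print(filled.isnull().sum())
```

The median (8.0) is a value that actually occurs in the column, whereas the mean (about 7.857) would introduce a fractional score. Either choice is defensible on a dataset this small.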

### Issue: the experience column (feature/attribute) is not numeric. We need a numeric value

In [15]:
!pip install word2number



In [16]:
from word2number import w2n

In [17]:
df["experience"] = df["experience"].apply(w2n.word_to_num)

In [18]:
df

Unnamed: 0,experience,test_score_10,interview_score_10,salary
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000
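
If installing word2number is not an option, a plain dictionary lookup can do the same job for this small dataset. This is a substitute technique, not what the notebook uses: `WORD_TO_NUMBER` and `experience_to_int` are hypothetical names, and the table only covers the words that occur in the experience column.

```python
# Hypothetical lookup table covering only the words that occur in this dataset.
WORD_TO_NUMBER = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}


def experience_to_int(word):
    """Convert a spelled-out number to an int, failing loudly on unknown words."""
    key = word.strip().lower()
    if key not in WORD_TO_NUMBER:
        raise ValueError(f"unrecognised experience value: {word!r}")
    return WORD_TO_NUMBER[key]


# The experience column after NaN was replaced with "zero".
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [experience_to_int(w) for w in experience_words]
print(experience_years)  # [0, 0, 5, 2, 7, 3, 10, 11]
```

With a DataFrame, the same function could be passed to `df["experience"].apply(...)` in place of `w2n.word_to_num`.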


## Import Linear Regression from sklearn
#### Steps: 
<ol>
    <li> initialize the model <code> m = Model() </code> 
    <li> fit the model <code> m.fit(train_data) </code>
    <li> predict <code> m.predict(test_data) </code>
</ol>
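
As a minimal sketch of these three steps, here is the pattern on made-up toy data (not the salary data). Note that for a supervised model such as `LinearRegression`, `fit` takes both the features and the targets; `train_X`, `train_y` and `prediction` are names assumed for this example.

```python
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: y = 2*x + 1
train_X = [[1], [2], [3]]
train_y = [3, 5, 7]

m = LinearRegression()          # 1. initialize the model
m.fit(train_X, train_y)         # 2. fit the model on features and targets
prediction = m.predict([[4]])   # 3. predict for an unseen input

print(prediction)
```

Because the toy data lies exactly on a line, the prediction for x = 4 comes out at 9 (up to floating-point rounding).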

In [19]:
# input features go into X
# salary (the target) goes into Y

In [20]:
X = df.drop("salary", axis = 1)
Y = df["salary"]

In [21]:
Y.head()

0    50000
1    45000
2    60000
3    65000
4    70000
Name: salary, dtype: int64

1) Initialize the model


In [22]:
from sklearn.linear_model import LinearRegression

In [23]:
model = LinearRegression()

2) Fit the model

In [24]:
model.fit(X.values, Y)

LinearRegression()

In [25]:
model.coef_

array([2812.95487627, 1845.70596798, 2205.24017467])

In [26]:
model.intercept_

17737.263464337695

> y = 2812.95*X1 + 1845.71*X2 + 2205.24*X3 + 17737.26,
> where X1 = experience, X2 = test_score_10, X3 = interview_score_10 (the column order of X)
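
A prediction can be reproduced by hand from `model.coef_` and `model.intercept_`. The sketch below copies the values printed above, in the column order of X (experience, test_score_10, interview_score_10); the candidate's values are hypothetical.

```python
# Values copied from model.coef_ and model.intercept_ above.
coef = [2812.95487627, 1845.70596798, 2205.24017467]
intercept = 17737.263464337695

# Hypothetical candidate: 2 years of experience, test score 9, interview score 6.
candidate = [2, 9, 6]

# Linear regression prediction: weighted sum of the features plus the intercept.
predicted_salary = sum(c * x for c, x in zip(coef, candidate)) + intercept
print(round(predicted_salary, 2))  # approximately 53205.97
```

This is the same arithmetic that `model.predict([[2, 9, 6]])` performs internally.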

Save the trained model

In [27]:
import joblib
joblib.dump(model, "salary_model.pkl")

['salary_model.pkl']

3) Load the file and do some predictions

Let's create a new notebook called "salary predict"
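
In that notebook, loading the model and predicting might look like the sketch below. To stay self-contained it refits the model on the cleaned table shown earlier and saves it to a temporary directory; in the real notebook only the `joblib.load("salary_model.pkl")` and `predict` calls would be needed. The candidate's values are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Cleaned training data, copied from the table displayed after the word-to-number step.
df = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "test_score_10": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 8.0, 7.0],
    "interview_score_10": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})
X = df.drop("salary", axis=1)
Y = df["salary"]
model = LinearRegression().fit(X.values, Y)

# Save to a temporary directory so the sketch leaves no files behind, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "salary_model.pkl")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# Hypothetical candidate: 2 years of experience, test score 9, interview score 6.
candidate = [[2, 9, 6]]
predicted_salary = loaded_model.predict(candidate)[0]
print(round(predicted_salary, 2))
```

The feature order in `candidate` must match the column order used during training, because the model was fitted on `X.values` and therefore carries no column names.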