<h2 style="color:green" align="center"> Machine Learning Project With Python: Linear Regression with Multiple Variables</h2>

## Problem Statement

The **hiring.csv** file contains hiring statistics for a firm, such as each candidate's experience, written test score, and personal interview score. Based on these 3 factors, HR decides the salary. Given this data, you need to build a machine learning model for the HR department that can help them decide salaries for future candidates. Use this model to predict salaries for the following candidates:


**2 yr experience, 9 test score, 6 interview score**

**12 yr experience, 10 test score, 10 interview score**


In [1]:
#importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

In [2]:
#reading the csv

data = pd.read_csv('hiring.csv')

In [3]:
data        #from the dataset, we observe some null values

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000
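
Rather than spotting the null values by eye, they can be counted per column. The sketch below rebuilds the raw table shown above by hand (the name `raw` is illustrative and not part of the notebook) and uses `isna().sum()` to count the missing entries in each column.

```python
import numpy as np
import pandas as pd

# Raw data copied from the table above; NaN marks the missing entries
raw = pd.DataFrame({
    'experience': [np.nan, np.nan, 'five', 'two', 'seven', 'three', 'ten', 'eleven'],
    'test_score(out of 10)': [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, np.nan, 7.0],
    'interview_score(out of 10)': [9, 6, 7, 10, 6, 10, 7, 8],
    'salary($)': [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

# Number of missing values in each column
null_counts = raw.isna().sum()
print(null_counts)
```

For this table the count should be two missing values in `experience` and one in `test_score(out of 10)`.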


### Data Cleaning

In [4]:
#getting the mean of the test scores and rounding it to 2 decimal places

t = data['test_score(out of 10)'].mean()
t = round(t,2)
t

7.86
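
The fill value can be checked by hand from the seven known test scores in the table. The sketch below uses only the standard library; the median is shown purely as an alternative imputation choice for comparison and is an assumption, not something this notebook uses.

```python
from statistics import mean, median

# The seven non-null test scores from the table above
known_scores = [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0]

# Mean rounded to 2 decimal places, as in the cell above (55 / 7)
mean_fill = round(mean(known_scores), 2)

# Median as an alternative fill value (not used in this notebook)
median_fill = median(known_scores)

print(mean_fill, median_fill)
```

With so few rows, the choice of fill value can noticeably shift the fitted coefficients, so it is worth stating which one was used.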

In [5]:
#filling the null values in 'test_score(out of 10)' column with t(mean value)

data['test_score(out of 10)']= data['test_score(out of 10)'].fillna(t)

In [6]:
data                               #verifying there are no more null values in 'test_score(out of 10)' column

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,7.86,7,72000
7,eleven,7.0,8,80000


In [7]:
#cleaning the 'experience' column by replacing the null values with 'zero'

data['experience']=data['experience'].fillna('zero')

In [8]:
data                             #verifying there are no more null values in 'experience' column

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,7.86,7,72000
7,eleven,7.0,8,80000


In [9]:
#creating a dict to convert the words (object dtype) in the 'experience' column to numbers, as ML models work only with numeric input

number = {'zero':0,'one':1,'two':2, 'three':3,'four':4,'five':5,'six':6,'seven':7, 'eight':8,'nine':9,'ten':10,'eleven':11}


In [10]:
#checking the datatype

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  8 non-null      object 
 1   test_score(out of 10)       8 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


In [11]:
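#replacing the words in the 'experience' column with numbers using the dict created
#(assigning the result back avoids inplace=True on a column selection, which newer pandas versions warn about)

data["experience"] = data["experience"].map(number)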

In [12]:
data                   #now the data is cleaned

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,7.86,7,72000
7,11,7.0,8,80000
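
A word-to-number lookup silently depends on every word being present in the dict. The sketch below is one possible way to guard against that, assuming the same `number` dict as above; the `experience`, `mapped`, and `unmapped` names are illustrative. With `Series.map`, any word missing from the dict becomes NaN, which makes unmapped entries easy to detect.

```python
import pandas as pd

number = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
          'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10, 'eleven': 11}

# The 'experience' column after the nulls were filled with 'zero'
experience = pd.Series(
    ['zero', 'zero', 'five', 'two', 'seven', 'three', 'ten', 'eleven'],
    name='experience',
)

# Words that are not keys of the dict are turned into NaN by map
mapped = experience.map(number)

# Collect any words that could not be converted
unmapped = experience[mapped.isna()]
if not unmapped.empty:
    raise ValueError(f"Unmapped experience values: {unmapped.tolist()}")

print(mapped.tolist())
```

If a new word such as 'twelve' appeared in the data, this check would raise an error instead of leaving a non-numeric value in the column.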


In [13]:
data.info()                       #checking that the 'experience' column previously in object dtype is now in int64 dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  8 non-null      int64  
 1   test_score(out of 10)       8 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 384.0 bytes


In [14]:
#creating an object for linear regression 

reg = linear_model.LinearRegression()
reg

LinearRegression()

In [15]:
#creating the feature matrix x and the target y, and training the model

x = data.drop('salary($)',axis='columns')
y = data['salary($)']
reg.fit(x,y)

LinearRegression()

In [16]:
#getting the coefficient

reg.coef_

array([2827.33518049, 1911.6265094 , 2197.14184739])

In [17]:
#getting the intercept

reg.intercept_

17247.060546901615

#### predict salary for candidate with 2 yr experience, 9 test score, 6 interview score

In [18]:
reg.predict([[2,9,6]])



array([53289.2205768])
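
Because the model was fitted on a DataFrame, recent scikit-learn versions typically warn when `predict` receives a plain list without column names. A minimal sketch of one way around this is to pass a DataFrame with the same columns. It rebuilds the cleaned table by hand so that it runs on its own; the `candidates` and `manual` names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Cleaned data copied from the table in the Data Cleaning section
data = pd.DataFrame({
    'experience': [0, 0, 5, 2, 7, 3, 10, 11],
    'test_score(out of 10)': [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.86, 7.0],
    'interview_score(out of 10)': [9, 6, 7, 10, 6, 10, 7, 8],
    'salary($)': [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

x = data.drop('salary($)', axis='columns')
y = data['salary($)']
reg = linear_model.LinearRegression().fit(x, y)

# Both candidates in one DataFrame that reuses the training column names
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=x.columns)
predictions = reg.predict(candidates)

# The same result computed directly from the coefficients and intercept
manual = candidates.to_numpy() @ reg.coef_ + reg.intercept_

print(predictions)
```

Predicting for both candidates in one call also keeps the two results side by side.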

In [22]:
#linear equation : y = m1*x1 + m2*x2 + m3*x3 + b,  where m1, m2, m3 = coefficients (one per feature), and b = intercept

salary = 2827.33518049*2 + 1911.6265094*9 + 2197.14184739*6  + 17247.060546901615
salary

53289.220576821615

#### predict salary for candidate with 12 yr experience, 10 test score, 10 interview score

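Answer: 53713.86 and 93747.79 (these appear to be reference values for the two candidates). The predictions computed in this notebook, 53289.22 and 92262.77, differ from them, most likely because of how the missing test score was imputed (here, the mean rounded to 7.86).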

In [20]:
reg.predict([[12,10,10]])



array([92262.76628062])

In [23]:
#y = m1*x1 + m2*x2 + m3*x3 + b (m1, m2, m3 are the coefficients and b is the intercept)

2827.33518049*12 + 1911.6265094*10 + 2197.14184739*10 + 17247.060546901615

92262.76628068161
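
The two manual checks above repeat the same arithmetic, so they can be wrapped in a small helper. This is a sketch only: `predict_salary` is a hypothetical name, and the constants are copied from the `reg.coef_` and `reg.intercept_` outputs shown earlier rather than read from the fitted model.

```python
# Coefficients and intercept copied from the reg.coef_ and reg.intercept_ outputs
COEFFICIENTS = (2827.33518049, 1911.6265094, 2197.14184739)
INTERCEPT = 17247.060546901615


def predict_salary(experience, test_score, interview_score):
    """Evaluate y = m1*x1 + m2*x2 + m3*x3 + b for one candidate."""
    features = (experience, test_score, interview_score)
    return sum(m * x for m, x in zip(COEFFICIENTS, features)) + INTERCEPT


salary_a = predict_salary(2, 9, 6)      # first candidate
salary_b = predict_salary(12, 10, 10)   # second candidate
print(round(salary_a, 2), round(salary_b, 2))
```

Both values agree with the `reg.predict` outputs to within rounding of the printed coefficients.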