# Regression Analysis

In this notebook, we'll explore how regression models work in a practical way. 

But we will not use pre-defined functions here. Instead, let's dive into the calculational steps of the process!! 

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mp
import copy
import math

In [2]:
df = pd.read_csv("Employee.csv")

This file contains employee data which describe their education, years of domain experience and how these data might effect on the possibility of leaving the company or not.

In [3]:
df

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1
...,...,...,...,...,...,...,...,...,...
4648,Bachelors,2013,Bangalore,3,26,Female,No,4,0
4649,Masters,2013,Pune,2,37,Male,No,2,1
4650,Masters,2018,New Delhi,3,27,Male,No,5,1
4651,Bachelors,2012,Bangalore,3,30,Male,Yes,2,0


In [4]:
df.shape

(4653, 9)

In [5]:
df.columns


Index(['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age', 'Gender',
       'EverBenched', 'ExperienceInCurrentDomain', 'LeaveOrNot'],
      dtype='object')

In [6]:
df.isna().sum()

Education                    0
JoiningYear                  0
City                         0
PaymentTier                  0
Age                          0
Gender                       0
EverBenched                  0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
dtype: int64

In [7]:
city = df['City'].unique()

In [8]:
edu = df['Education'].unique()

Now, we have to convert string data into representing numbers that means 0,1,2. let's create a data dictionary that stores encoded data for Education, City and Gender for regression analysis. 

In [9]:
dictt = {'City' : {city[i] : i for i in range(len(city))},
        'Education' : {edu[i] : i for i in range(len(edu))}}

In [10]:
dictt['City'][df.iloc[1,2]]

1

In [11]:
dictt

{'City': {'Bangalore': 0, 'Pune': 1, 'New Delhi': 2},
 'Education': {'Bachelors': 0, 'Masters': 1, 'PHD': 2}}

In [12]:
for i,row in df.iterrows():
    df.loc[i, 'City'] = dictt['City'][row['City']]

In [13]:
for i,row in df.iterrows():
    df.loc[i, 'Education'] = dictt['Education'][row['Education']]

In [14]:
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

In [15]:
df['EverBenched'] = df['EverBenched'].map({'Yes': 1, 'No': 0})

In [16]:
df

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,0,2017,0,3,34,1,0,0,0
1,0,2013,1,1,28,0,0,3,1
2,0,2014,2,3,38,0,0,2,0
3,1,2016,0,3,27,1,0,5,1
4,1,2017,1,3,24,1,1,2,1
...,...,...,...,...,...,...,...,...,...
4648,0,2013,0,3,26,0,0,4,0
4649,1,2013,1,2,37,1,0,2,1
4650,1,2018,2,3,27,1,0,5,1
4651,0,2012,0,3,30,1,1,2,0


Now our data is ready to be used for regression analysis! First let's walk through each steps with naive steps. 

#  Linear Regression

In [17]:
### 1. Define cost function for one feature 
### 2. Write Optimization algorithm 
## 3. Scale up and write for multiple features

In [18]:
def cost_function_1(data,y, weight, bias):
    error_sum = 0
    for x in range(len(data)):
        error_sum += (weight*data.iloc[x] + bias-y[x])**2
    mean_sq_error = error_sum / (2 * len(data))
    return mean_sq_error

##[[1,2],
##[4,5]]

This is basically how we calculate mean square error mathematically! But this function has to run over each row in dataset. We have a better approach that is vectorization! 

Transformed our training data into vectors with numpy libary, we can perform parallel computation which can save a lot of time!

In [19]:
def cost_function(data,y,weight,bias):
    error_sum = []
    m = data.shape
    for x in range(len(m)):
        line = np.dot(data.iloc[x],weight) + bias
        error = (line - y[i])**2
    mean_sq_error = np.mean(error)/2
    return mean_sq_error

Now we've got cost_function with input for weight and bias. What we are trying to do in regression is to find 'minimum' cost function that 'fits' the given dataset in an average alignment. 

In [20]:
def derivative(data, y,weight, bias):
    m,n = data.shape
    dw = np.zeros((n,))
    db = 0
    for i in range(m):
        err = np.dot(weight, data.iloc[i]) - y[i]
        for j in range(n):
            dw = dw + err * data.iloc[i].values
        db = db + err
    dw = dw/m
    db = db/m
    return dw,db

In [21]:

def gradient_descent(data, y, weight, bias, cost_function, derivative, alpha, num_iters):
    J_history = []
    for i in range(num_iters):
        dw, db = derivative(data, y, weight, bias)
        weight -= alpha * dw
        bias -= alpha * db
        if i < 100000:
            J_history.append(cost_function(data, y, weight, bias))
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")
    return weight, bias, J_history

In [22]:
train_Y = df.iloc[:,-1]

In [23]:
train_X = df.iloc[:,:-1]

In [24]:
train_Y

0       0
1       1
2       0
3       1
4       1
       ..
4648    0
4649    1
4650    1
4651    0
4652    0
Name: LeaveOrNot, Length: 4653, dtype: int64

In [25]:
train_X

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain
0,0,2017,0,3,34,1,0,0
1,0,2013,1,1,28,0,0,3
2,0,2014,2,3,38,0,0,2
3,1,2016,0,3,27,1,0,5
4,1,2017,1,3,24,1,1,2
...,...,...,...,...,...,...,...,...
4648,0,2013,0,3,26,0,0,4
4649,1,2013,1,2,37,1,0,2
4650,1,2018,2,3,27,1,0,5
4651,0,2012,0,3,30,1,1,2


In [27]:
w,b,j = gradient_descent(train_X,train_Y,[3,3,3,3,3,3,3,3],[2],cost_function,derivative,0.01,10)

Iteration    0: Cost 1995507802638192640.00
Iteration    1: Cost 210656245739925056665549799424.00
Iteration    2: Cost 22237975654294924443971716959864820334592.00
Iteration    3: Cost 2347557080322925625275781560091688375195540895301632.00
Iteration    4: Cost 247820410052019607645688106137744097475542652899769684152287232.00
Iteration    5: Cost 26161219317361637731373503301942705766046562594536162315023840101791694848.00
Iteration    6: Cost 2761715211541460078266546906657753991122343654269081319221705294121412049661054681088.00
Iteration    7: Cost 291541109652992787044335514975664293022573450498353619659370322545678148803262666346805445263360.00
Iteration    8: Cost 30776605155557902467242224581057005143333244120442541004007331047194474729908385325708113818202935511744512.00
Iteration    9: Cost 3248939492713285043119255758548972270907851436633933861553858141763027327164633836417172976145866171373302246793019392.00


The cost values are getting bigger insted of minimizing after each iterations. But our model work with another dataset in http://localhost:8888/notebooks/anaconda3/Discoveries.ipynb. That's because we used the wrong method. It should be binary classification model to predict an employee will leave or not.