# __LINEAR REGRESSION FROM ALMOST SCRATCH__

## Introduction

#### This is a project made by me <a href="https://github.com/Rulios">Robert Lu Zheng</a> to start in this advancing and fast paced environment of ML. 

## Description

#### I'll be implementing a Linear Regression on a dataset using Pandas, NumPy. No framework, no library that has any kind of support that can scaffold this algorithm, everything from scratch (almost)

#### This Jupyter Notebook will contain all the breakdown process of what I did to build the linear regression algorithm. 

### Install libraries (optional)

In [1]:
# Install pip packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib



### Import the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random 

### Import the dataset (as Pandas DataFrame)

In [3]:
data = pd.read_csv("./data/life-expectancy-data.csv")

### Objective

Label to predict: Life Expectancy

### Dataset notes

I use this section to take notes on the dataset when I explore it, and get a better understanding of what I'm trying to accomplish 

This dataset is extracted from <a href="https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download">here</a>


***
Notes: 

- My gosh, this dataset has a lot of features. 
- This data ranges from 2000 to 2015
- The most correlated features with Life Expectancy are *Schooling* and *Income Composition of Resources*

***

***
Experiment: 

- I think I'll start with Schooling as a feature to make the linear regression model. Then, if everything goes better, I'll add more features to the model

***

Run the cells below to get a better understanding of the dataset 

In [4]:
print(data.describe())


              Year  Life expectancy   Adult Mortality  infant deaths  \
count  2938.000000       2928.000000      2928.000000    2938.000000   
mean   2007.518720         69.224932       164.796448      30.303948   
std       4.613841          9.523867       124.292079     117.926501   
min    2000.000000         36.300000         1.000000       0.000000   
25%    2004.000000         63.100000        74.000000       0.000000   
50%    2008.000000         72.100000       144.000000       3.000000   
75%    2012.000000         75.700000       228.000000      22.000000   
max    2015.000000         89.000000       723.000000    1800.000000   

           Alcohol  percentage expenditure  Hepatitis B       Measles   \
count  2744.000000             2938.000000  2385.000000    2938.000000   
mean      4.602861              738.251295    80.940461    2419.592240   
std       4.052413             1987.914858    25.070016   11467.272489   
min       0.010000                0.000000     1.000000

In [5]:
print(data["Schooling"].describe())
print(data["Life expectancy "].describe())

count    2775.000000
mean       11.992793
std         3.358920
min         0.000000
25%        10.100000
50%        12.300000
75%        14.300000
max        20.700000
Name: Schooling, dtype: float64
count    2928.000000
mean       69.224932
std         9.523867
min        36.300000
25%        63.100000
50%        72.100000
75%        75.700000
max        89.000000
Name: Life expectancy , dtype: float64


In [6]:
#Print correlations 

print(data.corr())

                                     Year  Life expectancy   Adult Mortality  \
Year                             1.000000          0.170033        -0.079052   
Life expectancy                  0.170033          1.000000        -0.696359   
Adult Mortality                 -0.079052         -0.696359         1.000000   
infant deaths                   -0.037415         -0.196557         0.078756   
Alcohol                         -0.052990          0.404877        -0.195848   
percentage expenditure           0.031400          0.381864        -0.242860   
Hepatitis B                      0.104333          0.256762        -0.162476   
Measles                         -0.082493         -0.157586         0.031176   
 BMI                             0.108974          0.567694        -0.387017   
under-five deaths               -0.042937         -0.222529         0.094146   
Polio                            0.094158          0.465556        -0.274823   
Total expenditure                0.09074

  print(data.corr())



### Algorithm 

![imagen.png](attachment:4247ad78-b522-4383-ba90-205a79055f2d.png)

In [15]:
def initParams(): 
    
    learningRate = 0.1
    iterations = 20

    
    #these 2 parameter may change (initialize randomly)
    m = random.random()
    b = 0  
    
    return learningRate, m , b, iterations

#predicts Y
def linearRegression(b, m ,x ):
    return b + np.dot(m , x)

#mean squared error
def costfunction(length, computed, expected):
    return (1/(2*length)) * np.sum(np.power(difference(computed, expected), 2))


def difference(computed, expected):
    return computed - expected

def costFunctionDerivative(length, computed, expected):
    return (1/length) * np.sum(difference(computed, expected))
    
def gradientDescent(trainingSet, testSet): 
    
    #here we using y = mx + b, whereas m is the slope and b the intercept
    learningRate, b, m, iterations = initParams()
    
    rows, columns = np.shape(trainingSet)
    
    x = trainingSet[:, 0]
    y = trainingSet[:, 1]
    
    for i in range(iterations):
        newB = b - learningRate * costFunctionDerivative(rows, linearRegression(b, m, x), y)
        
        newM = m - learningRate * costFunctionDerivative(rows, linearRegression(b, m, x),  y) * np.sum(1/x)
        
        print("accuracy: ", np.absolute(linearRegression(b, m, x) -  y) / y)
    
        #update intercept and slope
        b = newB
        m = newM
    
    
    
    return 




#delete all the data that has null values in
# - Schooling 
# - Life expectancy 

data = pd.read_csv("./data/life-expectancy-data.csv")
columns = list(data.columns.values)

#print(columns)

#remove rows with empty values (this should be reviewed in the future, since probably this can cause a hurdle)
data = data.dropna()

#convert it into an np array to perform math calc
data = np.array(data)

#get size of the data
m, n = np.shape(data)

#randomize data by shuffling it 
np.random.shuffle(data)


#divide data set
#training set = 70% of the data 
#test set = 30% of the data
training_set = data[0:int(m*.7)]
test_set = data[int(m*.7):m]
                

#print(training_set[:, 3])
#print(training_set[:, (3,21)])

gradientDescent(training_set[:, (3, 21)], test_set[:, (3, 21)])
    
# Life expectancy is in index 3
# Schooling is in index 21

#print(m, n)






accuracy:  [0.950165725861837 0.9619278526770556 0.9571355544126291 ...
 0.9549292961838673 0.948919869008383 0.9614489577421758]
accuracy:  [109.82988556674948 94.72744545924101 101.44226038949229 ...
 105.708212256696 121.40304212792292 99.4992773830908]
accuracy:  [12809.829657577364 11064.990488064206 11840.828189769627 ...
 12333.819108645597 14147.823907954653 11616.674949258551]
accuracy:  [1493938.0077293476 1290430.6724613085 1380919.570346865 ...
 1438418.8749769577 1649992.5333841464 1354775.3820154744]
accuracy:  [174229658.8787624 150495749.83552703 161048958.05029246 ...
 167754776.7426819 192429416.14490113 157999909.6218009]
accuracy:  [20319433405.723835 17551479945.56857 18782241769.900352 ...
 19564304009.030117 22441969604.262573 18426648254.29897]
accuracy:  [2369742193188.164 2046931218492.092 2190468105889.4385 ...
 2281675663141.024 2617281752268.064 2148997217372.5723]
accuracy:  [276369815537868.12 238722171930541.66 255462078578163.97 ...
 266099107300422.84 