# Supervised Learning: Linear Regression

In [5]:
import numpy as np
import pandas as pd

## Overview
Linear regression is a supervised machine learning algorithm for regression: it predicts a continuous target value from input features.
<br>
The algorithm fits a line of best fit to the data, and predictions are read off that line. The line is described by the equation
$$y = mx + c$$
where $y$ is the predicted value (the y-coordinate), $x$ is the input (the x-coordinate), $m$ is the slope of the line, and $c$ is its y-intercept. Since $y$ is the dependent variable and $x$ is the independent variable, our goal is to find the values of $m$ and $c$ that fit the data best.
<br>
The cost function for our purposes is half the mean squared error between the observed values $y_i$ and the line's predictions:
$$
J(m, c) = {1 \over 2n} \sum_{i=1}^n (y_i - (mx_i + c))^2
$$
The partial derivative of this function with respect to $m$ gives us
$$
{\partial J \over \partial m} = {-1 \over n} \sum_{i=1}^n x_i(y_i - (mx_i + c))
$$
The partial derivative of this function with respect to $c$ gives us
$$
{\partial J \over \partial c} = {-1 \over n} \sum_{i=1}^n (y_i - (mx_i + c))
$$
Together these form the gradient of the cost function. Gradient descent repeatedly moves $m$ and $c$ a small step against this gradient until the cost stops decreasing, which is how we approach its lowest value.
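
One way to sanity-check the derivatives of a cost with the $1 \over 2n$ factor is to compare them against finite differences. The sketch below assumes a small made-up data set and hypothetical helper names (`cost`, `gradients`); it is an illustration, not part of the notebook's model.

```python
import numpy as np

def cost(m, c, x, y):
    """Half mean squared error between y and the line m*x + c."""
    residual = y - (m * x + c)
    return np.sum(residual ** 2) / (2 * len(x))

def gradients(m, c, x, y):
    """Analytic partial derivatives of the cost with respect to m and c."""
    residual = y - (m * x + c)
    dm = -np.mean(x * residual)
    dc = -np.mean(residual)
    return dm, dc

# Hypothetical toy data (assumption: not taken from the medical dataset).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m, c = 0.5, 0.0

dm, dc = gradients(m, c, x, y)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
dm_num = (cost(m + eps, c, x, y) - cost(m - eps, c, x, y)) / (2 * eps)
dc_num = (cost(m, c + eps, x, y) - cost(m, c - eps, x, y)) / (2 * eps)

print("dJ/dm analytic vs numeric:", dm, dm_num)
print("dJ/dc analytic vs numeric:", dc, dc_num)
```

If the analytic and numeric values agree closely, the derivative formulas and the cost function are consistent with each other.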
<br>
For this to happen we need a number of iterations and a learning rate (the size of each step), which will be 1000 and 0.001 respectively.
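
To make the roles of the two settings concrete, here is a minimal single-feature sketch of the update loop. The data (points on the line $y = 2x + 1$) is a made-up assumption, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy data on the line y = 2x + 1 (assumption: not the notebook's dataset).
x = np.linspace(0, 10, 50)
y = 2 * x + 1

m, c = 0.0, 0.0             # start from a flat line through the origin
lr, n_iters = 0.001, 1000   # learning rate and iteration count quoted above
n = len(x)
cost_history = []

for _ in range(n_iters):
    residual = y - (m * x + c)
    cost_history.append(np.sum(residual ** 2) / (2 * n))
    # Partial derivatives of the half mean squared error.
    dm = -np.sum(x * residual) / n
    dc = -np.sum(residual) / n
    # Step against the gradient, scaled by the learning rate.
    m -= lr * dm
    c -= lr * dc

print(f"m = {m:.3f}, c = {c:.3f}")
print(f"cost: {cost_history[0]:.3f} -> {cost_history[-1]:.3f}")
```

With a learning rate this small the cost should fall steadily, but the intercept in particular may still be some way from its true value after 1000 iterations; more iterations or a larger step would be needed to close that gap.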

In [6]:
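class LinearRegression:
    """Linear regression fitted with batch gradient descent."""

    def __init__(self, lr=0.001, n_iters=1000):
        self.lr = lr            # learning rate (step size)
        self.n_iters = n_iters  # number of gradient descent iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Start from all-zero weights and a zero bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            # Predictions with the current parameters
            y_pred = np.dot(X, self.weights) + self.bias

            # Gradients of the cost with respect to the weights and the bias
            dw = (1/n_samples) * np.dot(X.T, (y_pred-y))
            db = (1/n_samples) * np.sum(y_pred-y)

            # Step against the gradient
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred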
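
The `dw` and `db` lines in `fit` can be read as the multi-feature form of the derivatives of a cost with the $1 \over 2n$ factor, written in terms of the prediction error $\hat{y} - y$ so that the minus sign is absorbed. As a sketch, with $X$ the $n \times p$ feature matrix, $w$ the weight vector (`self.weights`) and $b$ the bias (`self.bias`):

$$
\hat{y} = Xw + b, \qquad
{\partial J \over \partial w} = {1 \over n} X^T(\hat{y} - y), \qquad
{\partial J \over \partial b} = {1 \over n} \sum_{i=1}^n (\hat{y}_i - y_i)
$$

With a single feature, $w$ reduces to the slope $m$ and $b$ to the intercept $c$.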

## Medical Price Exercise

In [7]:
df = pd.read_csv('./Medical Price Dataset.csv')
print(df.columns)
print(df.describe())
df.head()

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010


   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


### Dropping Categorical Columns

In [8]:
df.drop(['sex', 'smoker', 'region'], axis=1, inplace=True)

### Training

In [9]:
y_train = df['charges']
x_train = df.drop(['charges'], axis=1)

In [11]:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
lin_reg.predict(x_train)

array([-2.60948617e+215, -2.83006817e+215, -3.45613978e+215, ...,
       -2.97781463e+215, -2.63957618e+215, -5.43825191e+215])
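
Predictions on the order of $10^{215}$ suggest that gradient descent diverged instead of converging. A likely cause is feature scale: `age` and `bmi` take values in the tens, so a learning rate of 0.001 can overshoot on every step. The sketch below illustrates the effect on made-up data with similar ranges (an assumption, since it does not load the CSV) and shows one common remedy, standardizing each feature to zero mean and unit variance. The function `gradient_descent` is a hypothetical stand-in that uses the same update rule as the `LinearRegression` class.

```python
import numpy as np

def gradient_descent(X, y, lr=0.001, n_iters=1000):
    """Batch gradient descent with the same updates as LinearRegression.fit."""
    n_samples, n_features = X.shape
    weights, bias = np.zeros(n_features), 0.0
    for _ in range(n_iters):
        error = X @ weights + bias - y
        weights = weights - lr * (X.T @ error) / n_samples
        bias = bias - lr * error.mean()
    return weights, bias

# Hypothetical synthetic data with ranges loosely resembling the dataset
# (assumption: these are NOT the real medical records).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(18, 64, n),   # age-like feature
    rng.normal(30, 6, n),     # bmi-like feature
    rng.integers(0, 6, n),    # children-like feature
])
y = 250 * X[:, 0] + 300 * X[:, 1] + 500 * X[:, 2] + rng.normal(0, 1000, n)

# Unscaled features: each step overshoots and the parameters blow up.
w_raw, b_raw = gradient_descent(X, y)
pred_raw = X @ w_raw + b_raw

# Standardized features: zero mean and unit variance per column.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma
w, b = gradient_descent(X_scaled, y)
pred_scaled = X_scaled @ w + b

print("largest |prediction|, unscaled:", np.abs(pred_raw).max())
print("largest |prediction|, scaled:  ", np.abs(pred_scaled).max())
```

On the notebook's data the analogous step would be something like `(x_train - x_train.mean()) / x_train.std()` before calling `fit`. Whether 1000 iterations at a learning rate of 0.001 are then enough to converge is not verified here; a larger learning rate or more iterations may be needed.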