# Regression

> 📌  A process for modeling the **relationship** between variables of interest.

Example: If we know the relationship between education and income (say, that more education tends to mean higher income), we can predict someone's income from their education.

_Simply put, learning such a_ **relationship** _is regression._

# More Formally

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (outcome) and one or more independent variables (predictors).

Linear regression is a basic and commonly used type of predictive analysis.

The case of one independent (input) variable is called *Single Input Single Output (SISO)*; the case of more than one is called *Multiple Inputs Single Output (MISO)*.

### Dependent vs Independent Variables
A random sample of eight drivers insured with a company and having similar auto insurance policies was selected. The following table lists their driving experience (in years) and monthly auto insurance premiums.

|Driving Experience (years)| Monthly Auto Insurance Premium ($)|
|:-:|:-:|
|5|64|
|2|87|
|12|50|
|9|71|
|15|44|
|6|56|
|25|42|
|16|60|



Does the insurance premium depend on the driving experience, or does the driving experience depend on the insurance premium?

The insurance premium depends on the driving experience. In this case, therefore, *driving experience* is the **independent variable** and *monthly insurance premium* is the **dependent variable**.



# Correlation Analysis
> 📌 It describes the strength and direction of the relationship between two variables.

  1. Strength indicates how closely two variables are related to each other.
  2. Direction indicates how one variable would change its value as the value of the other variable changes.

## Pearson's *r* Correlation

Pearson's *r* correlation coefficient is calculated with the following formula:

$$
r = \frac{N \sum xy-\sum x \sum y}{\sqrt{\left[N\sum x^2-\left(\sum x\right)^2\right]\left[N\sum y^2-\left(\sum y\right)^2\right]}}
$$

where  
$r$ = Pearson's *r* correlation coefficient, $-1\leq r \leq 1$  
$N$ = number of values in each dataset/column  
$\sum xy$ = sum of the products of the paired scores/columns  
$\sum x$= sum of x column scores  
$\sum y$= sum of y column scores  
$\sum x^2$= sum of squared x column scores  
$\sum y^2$= sum of squared y column scores
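
As a rough illustration, the formula above can be written directly in NumPy. The function name `pearson_r` is a hypothetical helper, and the sample values are the driving-experience and premium figures from the insurance table; this is a sketch, not a replacement for `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from the raw-sum formula (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x ** 2) - np.sum(x) ** 2)
        * (n * np.sum(y ** 2) - np.sum(y) ** 2)
    )
    return numerator / denominator

experience = [5, 2, 12, 9, 15, 6, 25, 16]        # years
premium = [64, 87, 50, 71, 44, 56, 42, 60]       # dollars per month

r = pearson_r(experience, premium)
print(round(r, 6))
```

The result should agree with `np.corrcoef(experience, premium)[0, 1]` up to floating-point rounding.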

**Direction:** A coefficient between 0 and 1 indicates a positive correlation, and a coefficient between 0 and -1 indicates a negative correlation. A positive correlation means that as one variable increases (decreases), the other tends to increase (decrease). A negative correlation means that as one variable increases (decreases), the other tends to decrease (increase).

**Strength:** The closer a correlation coefficient is to 1 or to -1, the stronger the relationship. The following picture [Figure-1] suggests a guideline for interpreting the strength.

![image.png](attachment:image.png)


## Regression vs Correlation

Correlation:
> 1. Correlation may indicate whether two variables are related.
> 2. However, correlation does not describe how one variable is related to another.

Regression:
> 1. Regression may identify how one or more variables are related to an output variable.
> 2. Specifically, it provides details of how the input variables affect the output variable.
> 3. Beyond estimating a relationship, regression is a procedure for **predicting** an output variable from one or more input variables.

# Linear Regression
> 📌 The most common form of regression used in data analysis.  
> 📌 It assumes the **relationship** of the input variables and the output variables is **linear** (can be expressed as a line or hyperplane for higher dimensions).

$$y=\sum \beta_ix_i + b $$
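
To make the formula concrete, the sketch below evaluates it for a single sample. The coefficient values, the intercept, and the helper name `linear_predict` are made up for illustration only.

```python
import numpy as np

def linear_predict(x, beta, b):
    """Evaluate y = sum_i beta_i * x_i + b for one sample."""
    return float(np.dot(beta, x) + b)

beta = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
b = 4.0                            # hypothetical intercept
x = np.array([1.0, 3.0, 2.0])      # one sample with three inputs

y = linear_predict(x, beta, b)
print(y)  # 2*1 - 1*3 + 0.5*2 + 4 = 4.0
```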


## Single Input Single Output (SISO) linear regression

We will start with **Single Input Single Output (SISO)** linear regression.

**Notation**:
Let $x$ be the input variable, and let $y$ be the output variable.

**Model**:
The linear regression model can be stated as:
$$
 y = \alpha + \beta x,
$$

where $\beta$ is the slope of the line (the coefficient of $x$), and $\alpha$ is the intercept.

**Goal**:
Linear regression estimates the best values of $\alpha$ and $\beta$. Then, when a new or previously unobserved data point $x$ arrives with an unknown value of $y$, the value of $x$ and the estimated $\alpha$ and $\beta$ give an estimate of $y$, denoted $\hat{y}$.

The goal of linear regression is to have $\hat{y}$ as close as possible to $y$.
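
A minimal sketch of this goal, assuming NumPy's `np.polyfit` as the least-squares fitter (a swapped-in convenience, not the derivation used in this notebook) and the insurance sample from the table above. The query value of 10 years is an arbitrary choice.

```python
import numpy as np

x = np.array([5, 2, 12, 9, 15, 6, 25, 16], dtype=float)      # experience
y = np.array([64, 87, 50, 71, 44, 56, 42, 60], dtype=float)  # premium

# A degree-1 polynomial fit returns [slope, intercept]
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)

# Estimate y for a previously unobserved x
x_new = 10.0
y_hat = alpha_hat + beta_hat * x_new
print(alpha_hat, beta_hat, y_hat)
```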


## y = $\alpha$ + $\beta$ x

We have to compute $\alpha$ and $\beta$ to form the regression equation:

\begin{equation}
\beta= \frac{SS_{xy}}{SS_{xx}}
\end{equation}

\begin{equation}
\alpha= \bar{y}-\beta \bar{x}
\end{equation}

Here $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$, and $SS_{xx}$, $SS_{yy}$ and $SS_{xy}$ are the sums of squares defined as follows:

## Computing Sum of Squares
\begin{equation}
SS_{xy}= \sum xy - \frac{(\sum x) (\sum y)}{n}
\end{equation}

\begin{equation}
SS_{xx}= \sum x^2 - \frac{(\sum x)^2}{n}
\end{equation}

\begin{equation}
SS_{yy}= \sum y^2 - \frac{(\sum y)^2}{n}
\end{equation}

\begin{equation}
\bar{x} = \frac{\sum x}{n}
\end{equation}

Similarly for $\bar{y}$. Here $n$ is the number of observations (written $N$ in the correlation formula).

|$x$|$y$|$xy$|$x^2$|$y^2$|
|--|--|--|--|--|
|5|64|320|25|4096|
|2|87|174|4|7569|
|12|50|600|144|2500|
|9|71|639|81|5041|
|15|44|660|225|1936|
|6|56|336|36|3136|
|25|42|1050|625|1764|
|16|60|960|256|3600|
|$\sum x = 90$|$\sum y = 474$|$\sum xy = 4739$|$\sum x^2 = 1396$|$\sum y^2 = 29642$|


\begin{equation}
SS_{xy} = \sum xy - \frac{(\sum x) (\sum y)}{n} = 4739-\frac{(90)(474)}{8} = -593.50
\end{equation}

\begin{equation}
SS_{xx}= \sum x^2 - \frac{(\sum x)^2}{n} = 1396-\frac{(90)^2}{8} = 383.50
\end{equation}

\begin{equation}
SS_{yy}= \sum y^2 - \frac{(\sum y)^2}{n} = 29642 - \frac{(474)^2}{8} = 1557.50
\end{equation}

\begin{equation}
\bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25
\end{equation}

\begin{equation}
\bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25
\end{equation}
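
Substituting these values into the formulas for $\beta$ and $\alpha$ gives the regression coefficients (rounded to four decimal places, with the rounded $\beta$ carried into the second step):

\begin{equation}
\beta = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.50}{383.50} \approx -1.5476
\end{equation}

\begin{equation}
\alpha = \bar{y} - \beta \bar{x} = 59.25 - (-1.5476)(11.25) \approx 76.6605
\end{equation}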


## Computing Correlation


$$
r = \frac{N \sum xy-\sum x \sum y}{\sqrt{\left[N\sum x^2-\left(\sum x\right)^2\right]\left[N\sum y^2-\left(\sum y\right)^2\right]}}
$$

\begin{equation}
r = \frac{8 (4739)-(90)(474)}{\sqrt{\left[(8) (1396)- \left(90\right)^2\right]\left[(8)(29642)-\left(474\right)^2\right]}} = -0.767934
\end{equation}



## Another formula for Correlation (using Sum of Squares)

$$
r = \frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}
$$

$$
r = \frac{-593.5000}{\sqrt{(383.5000)(1557.5000)}} = -0.767934
$$
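
The agreement between the two formulas can be checked numerically. The sketch below recomputes the sums of squares with NumPy and evaluates both expressions; the variable names are illustrative.

```python
import numpy as np

x = np.array([5, 2, 12, 9, 15, 6, 25, 16], dtype=float)
y = np.array([64, 87, 50, 71, 44, 56, 42, 60], dtype=float)
n = len(x)

# Sums of squares as defined earlier
ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
ss_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
ss_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n

# Correlation from the sums of squares
r_ss = ss_xy / np.sqrt(ss_xx * ss_yy)

# Correlation from the raw-sum formula
r_raw = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2)
    * (n * np.sum(y ** 2) - np.sum(y) ** 2)
)

print(ss_xy, ss_xx, ss_yy)
print(r_ss, r_raw)
```

The two values coincide because multiplying the numerator and the denominator of the sum-of-squares form by $n$ yields the raw-sum form.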

### Meaning of Correlation Coefficient
Based on the bands in the guideline [Figure-1], $r=-0.767934$ indicates a **strong negative** correlation between driving experience and the monthly auto insurance premium.


### Interpret the meaning of the values of $\alpha$ and $\beta$ calculated
- The value $\alpha = 76.6605$ is the value of $\hat{y}$ at $x = 0$; that is, it is the predicted monthly auto insurance premium for a driver with no driving experience.
- The value of $\beta$ gives the change in $\hat{y}$ for a one-unit change in $x$.
- Thus, $\beta = -1.5476$ indicates that, on average, each additional year of driving experience lowers the monthly auto insurance premium by about \$1.55.

Note that when $\beta$ is negative, $y$ decreases as $x$ increases, i.e. there is negative correlation between $x$ and $y$.


### Thus, our estimated regression line $\hat{y} = \alpha + \beta x$ is:
\begin{equation}
\hat{y} = 76.6605 - 1.5476 x
\end{equation}
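
As a usage sketch, the estimated line can be wrapped in a small function. The function name `predict_premium` and the 10-year query are illustrative; the coefficients are the rounded values quoted in the interpretation above.

```python
ALPHA = 76.6605  # intercept (rounded)
BETA = -1.5476   # slope (rounded)

def predict_premium(years_of_experience):
    """Predicted monthly premium in dollars from the estimated line."""
    return ALPHA + BETA * years_of_experience

print(predict_premium(10))
```

Predictions are most trustworthy inside the observed range of the sample (2 to 25 years of experience); values far outside it are extrapolations.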



In [None]:
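import numpy as np

# Monthly premiums (y) and driving experience in years (x) from the table above
y = np.array([64,87,50,71,44,56,42,60])
x = np.array([5,2,12,9,15,6,25,16])

# Element-wise products and squares used by the sum-of-squares formulas
xy = x * y
x2 = x ** 2
y2 = y ** 2
n = len(x)

print(xy, x2, y2, n)

# SS_xy and SS_xx as defined above
SSxy = np.sum(xy) - (np.sum(x) * np.sum(y)) / n
SSxx = np.sum(x2) - (np.sum(x) ** 2) / n

# Slope (beta) and intercept (alpha) of the fitted line
b = SSxy / SSxx
a = np.mean(y) - b * np.mean(x)

print(b, a)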



In [None]:
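def predict_linear(x):
  # Predict y from x using the fitted intercept a and slope b.
  # Named predict_linear so it does not clash with sklearn's LinearRegression class.
  y = a + (b*x)
  return y

In [None]:
# Pearson's r from the raw-sum formula
r = ( n*sum(xy)  - (sum(x)*sum(y))) / np.sqrt(( n*sum(x2)  - (sum(x)**2)) *( n*sum(y2)  - (sum(y)**2)))

In [None]:
# The sign of r gives the direction of the linear relationship
if r < 0:
  print("x and y are negatively correlated")
elif r > 0:
  print("x and y are positively correlated")
else:
  print("x and y show no linear correlation")

In [None]:
y_pred = predict_linear(x)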

In [None]:
MSE  = sum((y-y_pred)**2)/ n
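
Alongside the MSE, the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$ is a common summary of fit. For a least-squares line with an intercept, evaluated on the data it was fitted to, it equals $r^2$. The sketch below is self-contained and refits the line with the sum-of-squares formulas; the variable names are illustrative.

```python
import numpy as np

x = np.array([5, 2, 12, 9, 15, 6, 25, 16], dtype=float)
y = np.array([64, 87, 50, 71, 44, 56, 42, 60], dtype=float)
n = len(x)

# Least-squares slope and intercept from the sums of squares
ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
ss_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
ss_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
beta = ss_xy / ss_xx
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
mse = ss_res / n
r_squared = 1 - ss_res / ss_tot
r = ss_xy / np.sqrt(ss_xx * ss_yy)

print(mse, r_squared, r ** 2)
```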

In [None]:
from sklearn.model_selection import train_test_split  # splitting the dataset
from sklearn.linear_model import LinearRegression  # linear model
from sklearn.metrics import mean_squared_error, r2_score  # evaluation criteria

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
x = x.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 42)

regression_model = LinearRegression().fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)

In [None]:
regression_model.predict([[15]])

In [None]:
regression_model.intercept_

In [None]:
regression_model.coef_

In [None]:
regression_model.score(x_train, y_train)

In [None]:
regression_model.score(x_test, y_test)

In [None]:
r2_score(y_test, y_predicted)

In [None]:
# RMSE on the test split: square root of the mean squared error
mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse)
print(rmse)

In [None]:
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [None]:
# One-hot encode the categorical columns (month and day)
df = pd.get_dummies(df)

In [None]:
df.columns

Index(['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain',
       'area', 'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'day_fri', 'day_mon', 'day_sat', 'day_sun',
       'day_thu', 'day_tue', 'day_wed'],
      dtype='object')
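
What `pd.get_dummies` does can be seen on a tiny frame. The toy values below are invented for illustration; `dtype=int` is passed so the indicator columns print as 0/1 rather than booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6],
    "day": ["fri", "tue", "sat"],
})

# Each distinct value of the text column becomes its own indicator column;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```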

In [None]:
df

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_nov,month_oct,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,1,0,0,0,0,0,0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,1,0,0,0,0,0,0,1,0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,1,0,0,0,1,0,0,0,0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,1,0,0,0,0,0,0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,4,3,81.6,56.7,665.6,1.9,27.8,32,2.7,0.0,...,0,0,0,0,0,0,1,0,0,0
513,2,4,81.6,56.7,665.6,1.9,21.9,71,5.8,0.0,...,0,0,0,0,0,0,1,0,0,0
514,7,4,81.6,56.7,665.6,1.9,21.2,70,6.7,0.0,...,0,0,0,0,0,0,1,0,0,0
515,1,4,94.4,146.0,614.7,11.3,25.6,42,4.0,0.0,...,0,0,0,0,0,1,0,0,0,0


In [None]:
from sklearn.linear_model import LinearRegression

data = df.copy()
target = data.pop('area')

lr = LinearRegression().fit(data, target)

In [None]:
from sklearn.metrics import mean_squared_error

# R^2
print(lr.score(data, target))

predictions = lr.predict(data)
mse = mean_squared_error(target, predictions)
rmse = np.sqrt(mse)
print(rmse)

0.0457820965080854
62.12143311792723


In [None]:
marks = pd.read_csv('marks.csv')

marks = marks.dropna()

x = marks[["MID", "FINAL"]]
y = marks["OVERALL"]

x = x.values.reshape(-1,2)
y= y.values.reshape(-1,1)

from sklearn.model_selection import train_test_split
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 42)

# Fit the model on the training split
regression_model = LinearRegression().fit(x_train, y_train)
# Predict on the held-out test split
y_predicted = regression_model.predict(x_test)





In [None]:
# model evaluation
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

# printing values
print('Coefficients:', regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Mean squared error: ', mse)
print('Root mean squared error: ', mse**(1/2.0))
print('R2 score: ', r2)
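
Since `marks.csv` is not included here, a self-contained MISO sketch on synthetic data may help. The data-generating rule `y = 1 + 2*x1 + 3*x2` and the sample size are assumptions chosen for illustration; with noise-free data the fit should recover those coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))  # two input variables

# Noise-free target with known coefficients (assumed for this demo)
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```

With two inputs, `coef_` holds one coefficient per input, which is why a single "slope" no longer describes the model.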

In [None]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
Answer = np.array(["Summer", "Winter", "Summer", "Winter", "Spring", "Spring", "Summer"])
ord_enc = OrdinalEncoder().fit_transform(Answer.reshape(-1,1))
print(ord_enc)

[[1.]
 [2.]
 [1.]
 [2.]
 [0.]
 [0.]
 [1.]]
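
In the output above the codes follow the alphabetical order of the categories (Spring = 0, Summer = 1, Winter = 2). When a different order is meaningful, it can be passed through the `categories` parameter. The order Winter < Spring < Summer below is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

seasons = np.array(["Summer", "Winter", "Summer", "Spring"]).reshape(-1, 1)

# Explicit category order: Winter -> 0, Spring -> 1, Summer -> 2
encoder = OrdinalEncoder(categories=[["Winter", "Spring", "Summer"]])
codes = encoder.fit_transform(seasons)

print(codes.ravel())
print(encoder.inverse_transform(codes).ravel())
```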


In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
Answer = np.array(["Summer", "Winter", "Summer", "Winter", "Spring", "Spring", "Summer"])
le = LabelEncoder()
# LabelEncoder expects a 1-D array of labels, so no reshape is needed
target = le.fit_transform(Answer)


In [None]:
target

array([1, 2, 1, 2, 0, 0, 1])

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

In [None]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [None]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
clf = LogisticRegression(random_state=42).fit(X, y)
clf.predict([[6.7, 2, 2, 1.2]])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([1])
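
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch that applies both, assuming a `StandardScaler` in a pipeline and `max_iter=1000` (both values are choices made for this example, not settings from the original run):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize the features, then fit with a larger iteration budget
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_clf.fit(X, y)

print(scaled_clf.predict([[6.7, 2, 2, 1.2]]))
print(scaled_clf.score(X, y))
```

Because the pipeline stores the scaler, new samples passed to `predict` are scaled the same way as the training data.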

In [None]:
clf.predict_proba(X[:2, :])

array([[9.81799409e-01, 1.82005762e-02, 1.43509289e-08],
       [9.71722782e-01, 2.82771875e-02, 3.00214335e-08]])

In [None]:

clf.score(X, y)

0.9733333333333334

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
spam = pd.read_csv('spam_or_not_spam.csv')
spam.head()


Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [None]:
spam.dropna(inplace = True)

X = spam['email']
y = spam['label']
y.shape,X.shape

((2999,), (2999,))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range =(1,3),min_df=5,max_features =8000)
X_vec = vect.fit_transform(X).toarray()
X_vec[:5]

array([[1, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X_vec,y,random_state = 0)

In [None]:
y

0       0
1       0
2       0
3       0
4       0
       ..
2995    1
2996    1
2997    1
2998    1
2999    1
Name: label, Length: 2999, dtype: int64

In [None]:
clf = LogisticRegression(random_state=42).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
clf.predict(X_test)

array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,

In [None]:
clf.score(X_test,y_test)

0.996

In [None]:
clf.score(X_train,y_train)

0.9986660738105825

In [None]:
from sklearn.metrics import mean_squared_error
pre = clf.predict(X_test)
mean_squared_error(y_test, pre )

0.004
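
With 0/1 labels every squared error is either 0 or 1, so the MSE of a classifier's predictions is simply the fraction of misclassified samples, that is, 1 minus the accuracy. This matches the outputs above (0.004 = 1 - 0.996). The sketch below checks the identity on a few invented labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])  # invented labels
y_pred = np.array([0, 0, 0, 1, 1, 0, 0, 1])  # two of eight are wrong

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(mse, 1 - acc)
```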