## Introduction to Linear Regression

Before you run any statistical model, it's usually a good idea to visualize your dataset.
For linear regression, we often check the linearity between two variables beforehand, either by computing their correlation with the `.corr()` function or by drawing a scatterplot.
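
To make the correlation check concrete, here is a minimal pure-Python sketch of the Pearson correlation coefficient, which is what pandas' `.corr()` computes by default. The helper name `pearson_corr` and the toy `size`/`price` values are made up for illustration.

```python
from math import sqrt


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: house size vs. house price
size = [50, 60, 80, 100, 120]
price = [150, 180, 240, 310, 350]

r = pearson_corr(size, price)
print(round(r, 3))  # a value close to 1 suggests a strong positive linear relationship
```

A value near +1 or -1 suggests a linear model is reasonable; a value near 0 suggests there is no linear relationship, although a non-linear one may still exist, which is why the scatterplot is worth drawing as well.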

#### Three types of variables -
1. y variable = Response variable / dependent variable / target variable / output
2. x variable = Explanatory variable / independent variable / feature variable / input
3. y^ (y-hat) variable = Estimated / predicted variable / fitted value

#### Python Packages -
1. statsmodels: optimised for insights
2. scikit-learn: optimised for predictions

#### Two properties -
1. Intercept: the value of y when x = 0
2. Slope: the amount by which y increases when x increases by 1
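
The two properties can be checked directly on a straight line. This is a small sketch with made-up values for `intercept` and `slope`; the `predict` helper is hypothetical and only evaluates the line.

```python
def predict(x, intercept, slope):
    """Evaluate the straight line y^ = intercept + slope * x."""
    return intercept + slope * x


intercept = 2.0  # hypothetical value
slope = 0.5      # hypothetical value

# Intercept: the predicted y when x = 0
y_at_zero = predict(0, intercept, slope)

# Slope: the change in predicted y when x goes up by 1 (here from 10 to 11)
step = predict(11, intercept, slope) - predict(10, intercept, slope)

print(y_at_zero)
print(step)
```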

In [2]:
##
#* 1. Correlation Function:
# df['size_house'].corr(df['price_houses'])



#* 2. Visualising Numeric vs. Numeric variable (ScatterPlot):
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.scatterplot(x="size", y="price", data= df)
# plt.show()


#* 3. Visualising Numeric vs. Numeric variable along with linear line (Regplot):
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.regplot (x="size", y="price", data= df, ci= None)
# plt.show()



#* 4. Visualising Categorical vs. Numeric variable (Displot):
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.displot(x="meat_mass", col="species", col_wrap = 2, bins = 9, data= butchershop)
# plt.show()

#? Displot function generates Histogram
#? col = variable that divides different plots; col_wrap = no. of plots per row


## OLS Regression (Ordinary Least Squares)

A) To calculate the intercept & slope values -

In [1]:

#* OLS regression for numeric vs. numeric:-


#  from statsmodels.formula.api import ols  #? Import the ols function
#  reg_calc = ols("price ~ no_convenience", data = df_realestate).fit() #? Train the Model & fit into an object; Format= "Y o/p variable ~ X i/p variable"
#  print(reg_calc.params) #? parameters of the fitted model



#* OLS regression for categorical vs. numeric:-

#  from statsmodels.formula.api import ols  
#  reg_calc = ols("price ~ house_age_rubrics + 0", data= df_realestate).fit()
#  print(reg_calc.params)

#? With '+ 0' (no intercept), the coefficients of the categorical regression above equal the mean of the 'y' variable per 'x' category.
# mean_price = df_realestate.groupby('house_age_rubrics')['price'].mean()
# Output: house_age_rubrics
#     0 to 15     12.637
#      15 to 30     9.877
#      30 to 45    11.393
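
For the numeric vs. numeric case, the least-squares intercept and slope have a closed form. The sketch below shows that calculation in pure Python as a rough illustration of what the fit produces; it is not statsmodels' actual implementation, and the `fit_line` helper and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data that lies roughly on the line y = 1 + 2x
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

The fitted line always passes through the point (mean of x, mean of y), which is why the intercept follows from the slope.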

B) To predict UNKNOWN range of the values along with given dataset:

In [3]:

#!  Let's assume we have a dataset 'df' with columns 'length' and 'mass'. We take length as the input variable and mass as the target variable. First, draw a regplot to visualise the data. Second, fit the OLS model on this dataset. Third, calculate the predicted mass for new length values.

#* Regplot:
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.regplot (x="length", y="mass", data= df, ci= None)
# plt.show()


#* OLS Model to predict mass based on the length variable:

# from statsmodels.formula.api import ols
# ols_model = ols("mass ~ length", data = df).fit()  #? Train the Model & fit into an object

# df2 = pd.DataFrame({"length": np.arange(20, 41)})  #? Create a dictionary of ANY values in column 'length' and assign into new df
# mass = ols_model.predict(df2)   #? Create the predicted column 'mass' (y variable) as per OLS model
# predicted_df = df2.assign(mass=mass)  #? assign the new column mass inside df2
# print(predicted_df)

#* Visualise the actual dataframe 'df' and model dataframe 'predicted_df' -
# import matplotlib.pyplot as plt
# import seaborn as sns
# fig = plt.figure()      #? Create one figure so both plots are drawn on the same axes.
# sns.regplot (x = "length", y ="mass", ci=None, data = df)
# sns.scatterplot(x ="length", y= "mass", data=predicted_df, color = "red", marker ="s")
# plt.show()



![image.png](attachment:image.png)

C) .PARAMS | .RESID | .FITTEDVALUES | .SUMMARY

In [3]:
### PARAMS attribute - #? Returns intercept & slope values of the fitted regression line.

# from statsmodels.formula.api import ols
# ols_model = ols("mass ~ length", data = df).fit()  #? Train model & fit into an object
# coeffs = ols_model.params  #? pandas Series indexed by term name
# intercept = coeffs["Intercept"]
# slope = coeffs["length"]


### FITTEDVALUES attribute - #? Predictions of y^ (estimated) values for the original dataset
# print (ols_model.fittedvalues)

### RESID attribute - #? Residual = (y - y^) --- The smaller the residuals, the better the regression line fits.
# print (ols_model.resid)



Residual denoted by red line :

![image.png](attachment:image.png)
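
The relationship between fitted values and residuals can be sketched without any library. The toy data below is hypothetical, and the intercept (1.06) and slope (1.97) are assumed to be the least-squares fit for exactly this data.

```python
# Hypothetical toy data and its assumed least-squares line
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]
intercept, slope = 1.06, 1.97

# Fitted values: the line evaluated at each observed x
fitted = [intercept + slope * x for x in xs]

# Residuals: observed y minus fitted y^
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]

# Sum of squared residuals: the quantity that least squares minimises
ssr = sum(r ** 2 for r in residuals)

print([round(r, 2) for r in residuals])
print(round(ssr, 4))
```

For a least-squares line with an intercept, the residuals sum to zero (up to floating-point error), so positive and negative misses balance out.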



In [4]:
### Summary method: Extended printout of the details of the fitted model:
# print (ols_model.summary())

![image.png](attachment:image.png)

## Regression to the Mean -

Regression to the mean/Reversion to Mediocrity refers to the idea that rare or extreme events are likely to be followed by more typical ones. Over time, outcomes “regress” to the average or “mean”.
The term was coined by Sir Francis Galton when he noticed that tall parents tend to have children who are shorter than them, whereas short parents often have children who are taller than them. Graphically, this means that if we plot the height of parents on the x-axis and the height of kids on the y-axis and draw a line through all the data points, we get a line with a slope of less than 1 (or equivalently an angle of less than 45 degrees).

![image.png](attachment:image.png)

Regression to the mean is a statistical fact about the world that is both easy to understand and easy to forget. Because the sequence of events unfolds in this way (extreme, typical, typical, extreme…), our brains automatically and erroneously infer a CAUSAL relationship between the “extreme” event and the “typical” event, i.e. that the extreme event caused the typical event.
This concept is mainly used for COMPARISON, for example in research, sports, and investments.
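
The effect needs no causal mechanism, as a small simulation suggests. This sketch assumes a hypothetical model in which every score is a stable "skill" plus independent "luck"; all names and numbers are made up for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

n = 10_000
skill = [random.gauss(0, 1) for _ in range(n)]   # stable part of each performer
first = [s + random.gauss(0, 1) for s in skill]  # first measurement = skill + luck
second = [s + random.gauss(0, 1) for s in skill] # second measurement, fresh luck

# Select the performers whose FIRST score was in the top 10%
cutoff = sorted(first)[int(0.9 * n)]
top = [i for i in range(n) if first[i] >= cutoff]

mean_first_top = sum(first[i] for i in top) / len(top)
mean_second_top = sum(second[i] for i in top) / len(top)
mean_second_all = sum(second) / n

print(round(mean_first_top, 2), round(mean_second_top, 2), round(mean_second_all, 2))
```

The top group still scores above average the second time, but noticeably less extremely than the first time: part of their first score was luck, and luck does not repeat.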

In [None]:

#* Plotting a graph with both - regression line & reference line with slope 1:

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()  #? Create one figure so both lines are drawn on the same axes

#? family_df is assumed to be already loaded, with columns father_height & son_height
sns.regplot(x="father_height", y="son_height", data=family_df, ci=None, line_kws={"color": "black"})  #? Regression plot
plt.axline(xy1=(150, 150), slope=1, linewidth=2, color="green")  #? Axline = reference line through (150, 150) with slope 1

plt.axis("equal")  #? One unit on the x-axis has the same length as one unit on the y-axis
plt.show()
