
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Let's load the Boston House Pricing Dataset

In [None]:
!pip install scikit-learn



In [None]:
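from sklearn.datasets import load_boston  # removed in scikit-learn 1.2, so an older release is required
boston=load_boston()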

In [None]:
boston.keys()

In [None]:
## Let's check the description of the dataset
print(boston.DESCR)

In [None]:
print(boston.data)

In [None]:
print(boston.target)

In [None]:
print(boston.feature_names)

## Preparing The Dataset

In [None]:
dataset=pd.DataFrame(boston.data,columns=boston.feature_names)

In [None]:
dataset.head()

In [None]:
dataset['Price']=boston.target

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
## Summarizing The Stats of the data
dataset.describe()

In [None]:
## Check the missing Values
dataset.isnull().sum()

In [None]:
### Exploratory Data Analysis
## Correlation
dataset.corr()
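
As a rough illustration of what `dataset.corr()` reports by default (the Pearson coefficient), the sketch below computes it by hand and compares it with NumPy. The arrays `rooms` and `price` are made-up stand-ins, not the Boston data, and the helper `pearson` is a hypothetical name.

```python
import numpy as np

# Hypothetical stand-in data (not the Boston dataset): price rises with rooms.
rng = np.random.default_rng(0)
rooms = rng.normal(6.0, 0.7, size=200)
price = 9.0 * rooms - 30.0 + rng.normal(0.0, 3.0, size=200)

def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(rooms, price)
r_numpy = np.corrcoef(rooms, price)[0, 1]
print(r_manual, r_numpy)  # the two values should agree
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.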

In [None]:
import seaborn as sns
sns.pairplot(dataset)

## Analyzing The Correlated Features

In [None]:
# check how the independent features correlate with each other and with the dependent feature
dataset.corr()

In [None]:
plt.scatter(dataset['CRIM'],dataset['Price'])
plt.xlabel("Crime Rate")
plt.ylabel("Price")
# inversely correlated: price tends to fall as the crime rate rises

In [None]:
# scatter plot
plt.scatter(dataset['RM'],dataset['Price'])
plt.xlabel("RM")
plt.ylabel("Price")
#positive correlation

In [None]:
# regression plot
import seaborn as sns
sns.regplot(x="RM",y="Price",data=dataset)
# this fits a simple linear regression line and shows the spread of the data points around it

In [None]:
sns.regplot(x="LSTAT",y="Price",data=dataset)
#negative correlation

In [None]:
sns.regplot(x="CHAS",y="Price",data=dataset)
# a clear linear trend (positive or negative) between a feature and the target helps a regression model
# CHAS is a binary dummy variable, so it shows no such trend

In [None]:
sns.regplot(x="PTRATIO",y="Price",data=dataset)

In [None]:
## Independent and Dependent features

X=dataset.iloc[:,:-1]# all columns except the last
y=dataset.iloc[:,-1]# Price is the dependent feature/target

**TO CREATE MODELS:**


1.   Prepare independent and dependent features
2.   Train test split



In [None]:
X.head()

In [None]:
y

In [None]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
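
To make the effect of `test_size=0.3` and `random_state=42` concrete, here is a small sketch on toy arrays. The names `X_demo` and `y_demo` are hypothetical and unrelated to the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)        # 7 training rows and 3 test rows
print(np.array_equal(X_te, X_te2))   # identical test sets
```

Fixing `random_state` keeps the split, and therefore the reported metrics, reproducible between runs.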

In [None]:
X_train

In [None]:
X_test

In [None]:
## Standardize the dataset
# StandardScaler rescales each feature to zero mean and unit variance
# putting all features on the same scale helps gradient-descent-based solvers converge faster toward the global minimum
# scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent,
# so here scaling mainly makes the coefficients comparable and keeps preprocessing consistent for deployment
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

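***Why do we standardize the dataset in linear regression?***

When a linear model is trained with **gradient descent**, the aim is to reach the **global minimum** of the loss. If the **independent features** are on very different scales, the updates zig-zag and progress is slow; bringing all features to the **same scale** makes **convergence** much **faster**. scikit-learn's `LinearRegression` uses a direct least-squares solver instead, so scaling is not strictly required for it, but standardized inputs make the coefficients directly comparable and the same scaler can be reused at prediction time.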
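
A minimal sketch of the standardization formula z = (x - mean) / std, written in plain NumPy. The values in `X_demo` are invented to show two features on very different scales.

```python
import numpy as np

# Hypothetical features on very different scales (e.g. a rate and a tax amount).
X_demo = np.array([[0.1, 300.0],
                   [0.5, 400.0],
                   [0.9, 700.0],
                   [0.3, 200.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)           # population std (ddof=0), which StandardScaler also uses
X_scaled = (X_demo - mean) / std

print(X_scaled.mean(axis=0))       # approximately 0 for each column
print(X_scaled.std(axis=0))        # approximately 1 for each column
```

After scaling, both columns contribute on a comparable scale regardless of their original units.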


In [None]:
X_train=scaler.fit_transform(X_train)

In [None]:
X_test=scaler.transform(X_test)
# we don't call fit_transform on the test data; it is transformed with the mean and standard deviation learned from the training data, so no test information leaks into training
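
The sketch below illustrates the fit-on-train, transform-on-test pattern with random data. All `*_demo` names are hypothetical; the point is only that the test rows are scaled with statistics learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train_demo = rng.normal(10.0, 2.0, size=(50, 3))   # hypothetical training features
X_test_demo = rng.normal(10.0, 2.0, size=(20, 3))    # hypothetical test features

demo_scaler = StandardScaler()
X_train_s = demo_scaler.fit_transform(X_train_demo)  # learns mean_ and scale_ from train only
X_test_s = demo_scaler.transform(X_test_demo)        # reuses the training statistics

# Reproduce the test transform by hand from the training mean and std.
manual_test = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_s, manual_test))
```

Because the test set is scaled with the training statistics, its own column means are close to zero but generally not exactly zero.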

In [None]:
import pickle
with open('scaling.pkl','wb') as f:
    pickle.dump(scaler,f)

In [None]:
X_train

In [None]:
X_test

## Model Training

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
regression=LinearRegression()
# initialize the model

In [None]:
regression.fit(X_train,y_train)
# fits a hyperplane to the training data by least squares

In [None]:
## print the coefficients and the intercept
print(regression.coef_)
# one coefficient per independent feature

In [None]:
print(regression.intercept_)
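
To show how `coef_` and `intercept_` turn into predictions, the sketch below fits a model on noise-free synthetic data and rebuilds `predict` by hand. The data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))                    # hypothetical features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0    # exact linear target, no noise

demo_model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the dot product of the features with coef_, plus intercept_.
manual_pred = X_demo @ demo_model.coef_ + demo_model.intercept_
print(demo_model.coef_, demo_model.intercept_)
print(np.allclose(manual_pred, demo_model.predict(X_demo)))
```

With standardized inputs, as in this notebook, a larger absolute coefficient means that feature moves the prediction more per standard deviation.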

In [None]:
## parameters the model was created with
regression.get_params()# returns the estimator's hyperparameters as a dict

These parameters control different aspects of an estimator's behavior and can be adjusted to **fine-tune performance** or adapt to specific requirements. The exact set returned by `get_params()` depends on the installed scikit-learn release. For `LinearRegression` it is typically limited to the first few entries below; the entries from **alpha** onward are typically parameters of other estimators (such as Ridge, Lasso, or LogisticRegression) and are described here for reference.

**copy_X**: This parameter determines whether a copy of the input features (X) is made. If set to True, the original input data is not modified during training; otherwise it may be overwritten.

**fit_intercept**: It specifies whether an intercept term is included in the regression model. If set to True, the model learns an additional intercept term, so the regression line does not have to pass through the origin.

**n_jobs**: This parameter determines the number of parallel jobs used during training. By default it is None, which normally means a single job. If set to an integer value, it enables **parallel computation using multiple jobs**.

**normalize**: Present only in older scikit-learn releases (deprecated in 1.0 and removed in 1.2). If set to True, each feature is centered by subtracting its mean and then divided by its L2 norm before fitting. This is not the same as scaling to unit variance; for that, use `StandardScaler` as done above.

**positive**: If set to True, it forces the **coefficients to be non-negative**. It constrains the coefficients, not the predicted values.

**alpha**: This parameter is used in **regularization methods, such as Ridge regression or Lasso regression**. It controls the strength of regularization, with **higher values** leading to **more regularization** and potentially **simpler models**.

**max_iter**: It specifies the maximum number of **iterations or epochs** that an iterative **algorithm** will perform **during training**. It helps **control the convergence of the optimization process**.

**tol**: This parameter determines the **tolerance** or **threshold for convergence**. If the change in the model's coefficients or **loss function** **falls below this value, the optimization process is considered to have converged**.

**solver**: For certain models, such as logistic regression, this parameter determines **the solver algorithm used to optimize the model's parameters**. Common solver options include 'lbfgs', '**liblinear**', 'newton-cg', 'sag', and 'saga'.

**random_state**: It is used to set a **seed value** for the **random number generator.** By fixing the seed, you can ensure **reproducibility of the results across different runs of the model**.

**verbose**: This parameter controls the level of verbosity or **amount of output generated during training**. **Higher values provide more detailed information, while lower values reduce the amount of output.**
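
A short sketch of inspecting and changing hyperparameters through `get_params()` and `set_params()`. The keys printed depend on the installed scikit-learn release, so the example only relies on `fit_intercept` and `copy_X`; `demo_model` is a hypothetical name.

```python
from sklearn.linear_model import LinearRegression

demo_model = LinearRegression()
params = demo_model.get_params()
print(sorted(params))             # the exact keys depend on the scikit-learn release

# set_params changes a hyperparameter on the existing estimator before fitting.
demo_model.set_params(fit_intercept=False)
print(demo_model.get_params()["fit_intercept"])
```

Only names that appear in the `get_params()` output can be passed to `set_params()`; an unknown name raises an error.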

In [None]:
### Prediction With Test Data
reg_pred=regression.predict(X_test)

In [None]:
reg_pred

## Assumptions

In [None]:
## plot a scatter plot for the prediction
plt.scatter(y_test,reg_pred)
# points lying close to a straight line-->>predictions track the true values well

In [None]:
## Residuals = errors (true value minus prediction)
residuals=y_test-reg_pred

In [None]:
residuals

In [None]:
## Plot the residuals

sns.displot(residuals,kind="kde")# kernel density estimate (KDE)
# residuals that look roughly normal and centered on zero-->>the linear model fits reasonably well
# the right tail is longer here, which points to some outliers with large positive errors

This will generate a KDE plot of the residuals, which can help you analyze the **goodness of fit of your regression model** and identify any patterns or **deviations in the residuals.**

In [None]:
## Scatter plot of predictions against residuals
## residuals scattered randomly around zero with no visible pattern-->>model is performing well
plt.scatter(reg_pred,residuals)
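
As a numeric companion to the residual plots, the sketch below fits ordinary least squares on synthetic data and checks two properties of the training residuals. The data is invented, and the fit uses `np.linalg.lstsq` as a stand-in for the model, so treat it as an illustration under those assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=200)                 # hypothetical single feature
y = 3.0 * x + 5.0 + rng.normal(0.0, 1.0, size=200)   # linear signal plus Gaussian noise

# Ordinary least squares with an intercept column, solved directly.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

print(resid.mean())                       # ~0: training residuals average to zero
print(np.corrcoef(fitted, resid)[0, 1])   # ~0: no linear trend left against the fitted values
```

These properties hold exactly only on the data used for fitting. On a held-out test set, as in this notebook, the residuals should be roughly centered on zero with no visible trend if the model generalizes well.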

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(y_test,reg_pred))#mae
print(mean_squared_error(y_test,reg_pred))#mse
print(np.sqrt(mean_squared_error(y_test,reg_pred)))#rmse

**MAE**: MAE represents the average absolute difference between the predicted and true values. It measures the average magnitude of the errors without considering their direction. A **lower** MAE indicates better model performance, as it suggests that the model's predictions are closer to the true values on average.

**MSE**: MSE represents the average squared difference between the predicted and true values. It gives higher weights to larger errors due to the squaring operation. A **lower** MSE indicates better model performance, as it means that the model's predictions are, on average, closer to the true values with smaller deviations.

**RMSE**: RMSE is the square root of MSE, and it is expressed in the same units as the target variable. It provides a more interpretable measure of the average error magnitude. Similar to MAE and MSE, a **lower** RMSE indicates better model performance.
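
The three metrics can be computed by hand as a sanity check. The arrays below are invented values, not results from this notebook.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # hypothetical true values
y_pred = np.array([2.0, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_pred                   # 1, 0, -2, -1
mae = np.mean(np.abs(errors))              # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)                 # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                        # square root of 1.5
print(mae, mse, rmse)
```

The single error of size 2 contributes half of the MAE total but two thirds of the MSE total, which is the extra weight on larger errors described above.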

## R square and adjusted R square


Formula

**R^2 = 1 - SSR/SST**

where:

R^2: coefficient of determination
SSR: sum of squares of residuals
SST: total sum of squares
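
A minimal sketch of the formula above in NumPy, using invented values for `y_true` and `y_pred`.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # hypothetical true values
y_pred = np.array([2.0, 5.0, 4.0, 8.0])   # hypothetical predictions

ssr = np.sum((y_true - y_pred) ** 2)          # sum of squares of residuals
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ssr / sst
print(ssr, sst, r2)
```

An R^2 of 1 means the residuals are all zero, while a model that always predicts the mean of `y_true` gets an R^2 of 0.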


In [None]:
from sklearn.metrics import r2_score
score=r2_score(y_test,reg_pred)
print(score)

Adjusted R2 is never greater than R2, because it penalizes every additional predictor.

For both scores, the closer the value is to 1, the better the fit.

**Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]**

where:

R2: The R2 of the model
n: The number of observations
k: The number of predictor variables
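
The formula can be wrapped in a small helper. `adjusted_r2` is a hypothetical function name, and the numbers in the example call are made up rather than taken from this notebook's output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k predictors (requires n > k + 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: an R2 of 0.70 on 152 rows with 13 predictors.
example = adjusted_r2(0.70, n=152, k=13)
print(example)
```

Since (n-1)/(n-k-1) is at least 1, the adjusted value cannot exceed R2, and the gap grows as predictors are added without improving the fit.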

In [None]:
#display adjusted R-squared
1 - (1-score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)

## New Data Prediction

In [None]:
boston.data[0].reshape(1,-1)

In [None]:
##transformation of new data
scaler.transform(boston.data[0].reshape(1,-1))

In [None]:
regression.predict(scaler.transform(boston.data[0].reshape(1,-1)))
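
The `reshape(1,-1)` step matters because scikit-learn estimators expect a 2-D array of shape (n_samples, n_features), even for one record. The sketch below shows the shapes on synthetic data; all `demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_demo = rng.normal(50.0, 10.0, size=(80, 4))                   # hypothetical features
y_demo = X_demo @ np.array([1.0, 0.5, -2.0, 0.0]) + 3.0         # exact linear target

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

row = X_demo[0]                  # shape (4,): a single 1-D record
row_2d = row.reshape(1, -1)      # shape (1, 4): one sample, four features
pred = demo_model.predict(demo_scaler.transform(row_2d))
print(row.shape, row_2d.shape, pred.shape)
```

The new record goes through the same scaler that was fitted on the training data before it reaches the model.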

## Pickling The Model file For Deployment

In [None]:
import pickle

In [None]:
with open('regmodel.pkl','wb') as f:
    pickle.dump(regression,f)

In [None]:
with open('regmodel.pkl','rb') as f:
    pickled_model=pickle.load(f)

In [None]:
## Prediction
pickled_model.predict(scaler.transform(boston.data[0].reshape(1,-1)))

array([30.08649576])
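
For deployment, both the scaler and the model have to be restored, since predictions depend on the two together. The sketch below does an in-memory round trip with `pickle.dumps` and `pickle.loads` on synthetic data; the notebook writes `.pkl` files instead, and all `demo` and `restored` names are hypothetical.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(60, 3))                       # hypothetical features
y_demo = X_demo @ np.array([1.5, -0.5, 2.0]) + 1.0

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Serialize both objects to bytes, then restore them.
restored_scaler = pickle.loads(pickle.dumps(demo_scaler))
restored_model = pickle.loads(pickle.dumps(demo_model))

before = demo_model.predict(demo_scaler.transform(X_demo[:5]))
after = restored_model.predict(restored_scaler.transform(X_demo[:5]))
print(np.allclose(before, after))
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that created them.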