## Logistic Regression
Logistic regression is a machine learning model that, like linear regression, models the relationship between the independent features and the dependent feature with a linear function. The difference is that here the output generated by the model is a discrete or categorical value rather than a continuous one. Also we have Z eff to determine the loss function.
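
A minimal sketch of how a linear score can be turned into a discrete class, assuming the standard sigmoid formulation of logistic regression. The weights `w`, bias `b` and sample `x` are made-up values used only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and one sample (illustrative values only).
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b      # linear score, computed as in linear regression
p = sigmoid(z)            # probability of the positive class
label = int(p >= 0.5)     # discrete class obtained by thresholding at 0.5
print(z, p, label)
```

The 0.5 threshold is a common default, not something fixed by the model itself.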

loss --> The loss is the difference between the actual output and the output predicted by the model.

loss function --> During model training, the loss computed for a single data point from the dataset is called the loss function.

cost function --> During model training, the loss computed over the whole dataset is called the cost function.
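
The distinction between the per-point loss and the dataset-level cost can be sketched as follows. This assumes binary cross-entropy as the loss and a simple average as the cost, which is a common choice for logistic regression. The labels and probabilities are hypothetical.

```python
import numpy as np

def log_loss_single(y_true, p):
    """Loss for one data point: binary cross-entropy."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical labels and predicted probabilities for four data points.
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])

per_sample_loss = log_loss_single(y, p)   # one loss value per data point
cost = per_sample_loss.mean()             # cost: average loss over the dataset
print(per_sample_loss, cost)
```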

## Problem statement we are solving:-
We implement a logistic regression model on the iris dataset imported from the sklearn library. The model is trained on the sepal length, sepal width, petal length and petal width of different flowers and, based on these measurements, predicts the species of the flower.

## Importing necessary libraries for operation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge, Lasso,  ElasticNet, LassoCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## Wrapping the dataset into pandas dataframe

In [55]:
iris = load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [56]:
## These are the attributes of the dataset
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [None]:
## This is the data on which the model will be trained
iris.data

In [58]:
## These are the outputs that the model will predict once it is trained
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [59]:
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


## Data preprocessing steps:
1) data cleaning --> This step cleans noisy data, resolves inconsistent data, and handles missing values and outliers.
Outliers == These are unusual data points in a dataset that do not follow the general pattern of the rest of the data.

2) data integration --> This step combines data brought from different data sources into a single, consistent dataset.

3) data selection --> This step selects the data we want to work on and search for hidden patterns.

4) data transformation --> This step scales the values into a particular range so that model training is efficient.

5) data reduction --> This step removes less important or highly correlated attributes.

What exactly does high correlation mean == Correlation measures the strength of the linear relationship between two attributes, with values between -1 and 1. Highly correlated attributes are closely related and have a correlation near 1. Negatively correlated attributes move in opposite directions and have a correlation near -1. A value near 0 means the attributes have little or no linear relationship.
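
As a small illustration of correlation between attributes, the sketch below computes the Pearson correlation matrix of the four iris features with pandas. The variable names `features`, `corr` and `petal_corr` are chosen here for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation matrix: every value lies between -1 and 1.
corr = features.corr()
print(corr.round(2))

# Petal length and petal width are strongly positively correlated.
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print(f"petal length vs petal width: {petal_corr:.2f}")
```

A pair with a correlation this close to 1 is a typical candidate for the data reduction step, since the two attributes carry largely the same information.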


In [60]:
## Checking for missing values
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64


## Current context of the working data

The dataset has no missing values, and we assume it has no outliers, redundant data or inconsistent data that need handling, so the data cleaning part is done.

The data is also already integrated, since it comes from a single source, so the integration step is done. We work with the whole dataset, so the selection step is done as well.

## Splitting data
We will split the dataset into two parts, the train and test datasets, using a train-test split. The code below uses test_size=0.4, i.e. a 60-40 split: 60% of the data is used to train the model and 40% is used to test it.
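
The effect of `test_size` on the split can be sketched on placeholder data with the same number of rows as iris. The arrays below are made up, and `stratify=y` is an optional addition (not used in this notebook) that keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(300).reshape(150, 2)   # 150 placeholder rows, same size as iris
y = np.repeat([0, 1, 2], 50)         # three balanced classes

sizes = {}
for test_size in (0.2, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=32, stratify=y
    )
    sizes[test_size] = (len(X_tr), len(X_te))   # (train rows, test rows)

print(sizes)
```

With 150 rows, `test_size=0.2` leaves 120 rows for training and 30 for testing, while `test_size=0.4` leaves 90 and 60.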

In [70]:
train_df, test_df = train_test_split(df, test_size=0.4, random_state=32)

In [71]:
## performing label encoding (the iris target is already integer-coded, so the values stay 0, 1, 2)
le = LabelEncoder()
train_df['target'] = le.fit_transform(train_df['target'])
test_df['target'] = le.transform(test_df['target'])
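
Label encoding matters most when the labels are strings. A minimal sketch with hypothetical string labels shows what `LabelEncoder` does in that case:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only for illustration.
species = ['setosa', 'virginica', 'versicolor', 'setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)   # classes are sorted alphabetically

print(encoder.classes_.tolist())         # ['setosa', 'versicolor', 'virginica']
print(codes.tolist())                    # [0, 2, 1, 0]
print(encoder.inverse_transform(codes).tolist())
```

`inverse_transform` maps the integer codes back to the original labels, which is useful when reporting predictions.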

## Scaling the dataset
Here we will use the standard scaling technique to scale the dataset. We will also separate the independent and dependent variables.

independent variables --> The features in a dataset that are given to the model as input for training are called the independent variables.

dependent variables --> The value in a dataset that the model is trained to predict from the independent variables is called the dependent variable or label.
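
Standard scaling subtracts the mean of each feature and divides by its standard deviation, both computed on the training data. The sketch below checks this against `StandardScaler` on small hypothetical arrays. The names ending in `_demo` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns).
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Manual standardization using the training mean and (population) std.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_test = (X_test_demo - mean) / std

print(scaler_demo.transform(X_test_demo))
print(manual_test)
```

Fitting the scaler on the training data only and reusing it for the test data keeps information about the test set out of training.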

In [72]:
X_train = train_df.drop('target',axis=1)
y_train = train_df['target']
X_test = test_df.drop('target',axis=1)
y_test = test_df['target']

In [73]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model Training using Hyperparameter techniques
Basic method --> Here we will be using the logistic regression model for model training. While training the model we will give it the X_train_scaled and y_train data.

Hyperparameter method --> Two common hyperparameter tuning techniques are GridSearchCV and RandomizedSearchCV. GridSearchCV tries every combination in the grid, so for bigger datasets or larger search spaces RandomizedSearchCV, which samples a fixed number of combinations, is usually the cheaper choice. Otherwise GridSearchCV is fine.
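
For comparison with the grid search used in this notebook, here is a rough sketch of RandomizedSearchCV on the iris data. The log-uniform range for `C`, the value `n_iter=8` and the unscaled features are assumptions made for this example, not settings taken from the notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample C from a log-uniform range instead of trying every grid value.
param_distributions = {'C': loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=8,            # number of sampled candidates (assumed value)
    cv=5,
    scoring='accuracy',
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cost of the search is controlled by `n_iter` rather than by the size of the grid, which is why it scales better to large search spaces.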

In [74]:
model = LogisticRegression(multi_class='ovr', max_iter=200)

In [75]:
params = {
    'C': [0.1, 1, 10, 100],
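    # Note: 'elasticnet' needs solver='saga' together with an l1_ratio value; since none
    # is given here, those candidates are expected to fail and be scored as NaN
    'penalty': ['l1', 'l2', 'elasticnet'],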
    'solver': ['liblinear', 'saga']
}

In [76]:
grid = GridSearchCV(estimator=model, param_grid=params, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
grid.fit(X_train_scaled, y_train)
grid.best_params_

Fitting 5 folds for each of 24 candidates, totalling 120 fits


{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

In [77]:
grid

## Model prediction
To make predictions we pass the scaled test features (X_test_scaled) to the trained model.
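
A self-contained sketch of the prediction step, assuming a plain LogisticRegression without the grid search. It shows the difference between `predict`, which returns the class, and `predict_proba`, which returns one probability per class. The names `clf`, `scaler_demo`, `proba` and `pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=32)

scaler_demo = StandardScaler().fit(X_tr)   # fit on training data only
clf = LogisticRegression(max_iter=200).fit(scaler_demo.transform(X_tr), y_tr)

X_te_scaled = scaler_demo.transform(X_te)
proba = clf.predict_proba(X_te_scaled)     # one probability per class, per row
pred = clf.predict(X_te_scaled)            # most probable class, per row

print(proba[:3].round(3))
print(pred[:3])
```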

## Model Evaluation
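Model evaluation is a technique where we evaluate the model's performance by comparing its predictions with the y_test data. Since this is a classification problem, we use classification metrics such as accuracy, precision, recall and F1-score (summarized by the classification report) and the confusion matrix. Metrics such as MSE (Mean Squared Error), MAE (Mean Absolute Error), RMSE (Root Mean Squared Error) and r2_score are meant for regression models.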
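
To show where the numbers in a classification report come from, the sketch below derives accuracy, precision and recall from a confusion matrix. The labels are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six samples.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_hat)       # rows: true class, columns: predicted class
accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / actually in that class

print(cm)
print(accuracy, precision, recall)
```

Here one sample of class 1 is predicted as class 2, so the recall of class 1 drops to 0.5 and the precision of class 2 drops to 2/3, while accuracy is 5/6.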

In [78]:
y_pred = grid.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc*100:.2f}%")
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)    


Accuracy: 98.33%
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        25
           1       0.95      1.00      0.97        19
           2       1.00      0.94      0.97        16

    accuracy                           0.98        60
   macro avg       0.98      0.98      0.98        60
weighted avg       0.98      0.98      0.98        60

