# <font color = blue >1. What is Logistic Regression?</font>

Logistic Regression is a Machine Learning algorithm used for binary classification. 
**Binary classification** means that **the dependent (target) variable 'y' contains only 2 classes**.

In Binary Classification, the value of the dependent variable (y) is discrete: it takes one of 2 levels, class 0 or class 1 
(hence binary classification). Logistic Regression passes a linear combination of the independent variables (x) through the
Sigmoid function, which squeezes any value into the range 0 to 1. The output saturates towards 0 or 1 at the extremes,
which gives the S-shaped curve, and it can be read as the probability of class 1.

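The entire concept of Logistic Regression revolves around the odds (often loosely called the odds ratio). The odds are the ratio of 
the probability of success to the probability of failure, in other words P/(1-P). Logistic Regression models the logarithm
of the odds as a linear function of x.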
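
As a minimal sketch of how the odds and the Sigmoid function relate (the scores in `z` are made-up values, and the helper names `sigmoid` and `odds` are illustrative, not part of any library):

```python
import numpy as np

def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    """Odds of success: P / (1 - P)."""
    return p / (1.0 - p)

z = np.array([-4.0, 0.0, 4.0])   # hypothetical linear scores
p = sigmoid(z)                   # probabilities of class 1
log_odds = np.log(odds(p))       # taking the log of the odds recovers z

print(p)
print(log_odds)
```

Large negative scores give probabilities near 0, large positive scores give probabilities near 1, and a score of 0 gives exactly 0.5.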


#### Logistic Regression graph
<img src = "images/Sigmoid_curve.png">

#### Derivation of Odds ratio
<img src = "images/LogicReg_OddsRatio.jpg">

# <font color = blue >2. What are the steps involved in Logistic Regression?</font>

The steps used in Logistic Regression are similar to those in Linear Regression. The major differences lie in 
preprocessing & evaluating the model. For preprocessing, we convert categorical columns to numbers and scale the
columns as well (so that columns with large values do not dominate). For evaluation, we use the Confusion matrix, accuracy score & recall value to check
whether the values are correctly classified. 

#### Logistic Regression steps
<img src = "images/LogicReg_Steps.jpg" width = 600>

# <font color = blue >3. What is a Probability matrix?</font>

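A Probability matrix works on the concept of a threshold. For each observation the model predicts the probability of
belonging to class 1, and this probability is compared with the threshold.
If the predicted probability is less than the threshold, the observation is assigned to class 0.
If the predicted probability is more than the threshold, the observation is assigned to class 1.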
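
A small sketch of the threshold rule, using made-up probabilities. The cut-off of 0.5 and the choice to send a probability exactly equal to the threshold to class 1 are assumptions for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for five observations
probs = np.array([0.10, 0.45, 0.50, 0.62, 0.93])
threshold = 0.5  # assumed cut-off; 0.5 is a common default

# Probabilities at or above the threshold are assigned to class 1
predicted_class = (probs >= threshold).astype(int)
print(predicted_class)  # [0 0 1 1 1]
```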



#### Probability matrix concept
<img src = "images/LogicReg_ProbMatrix.jpg" width = 300>

# <font color = blue >4. What is a Confusion matrix?</font>

In Machine Learning, a model does not always predict accurately. At times it places values from Class 0 into 
Class 1 and vice versa. This confusion is depicted by the Confusion matrix as shown below:

<img src = "images/LogicReg_ConfusionMatrix.jpg" >

Let us first understand the terms TN, TP, FN & FP.

To understand them, first remember:
Class 0 ---> -ve & Class 1 ----> +ve

**TN - True Negative - are the values that belong to the negative group (i.e. Class 0) and are correctly classified as negative.**

**TP - True Positive - are the values that belong to the positive group (i.e. Class 1) and are correctly classified as positive.**

**FN - False Negative - are the values that belong to the positive group but are falsely classified into the negative group.**

**FP - False Positive - are the values that belong to the negative group but are falsely classified into the positive group.**

In the example shown above, the diagonal values 7017 & 1030 are TN & TP respectively. They are correctly classified.
However, the values 1316 & 406 (FN & FP) are wrongly classified. 

1316 values belong to the +ve group (Class 1) but are misclassified into the -ve group (Class 0) ---> FN,

406 values belong to the -ve group (Class 0) but are misclassified into the +ve group (Class 1) ---> FP.
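
To make the four terms concrete, here is a sketch that counts them by hand on made-up labels, using the usual convention that "true/false" says whether the prediction was right and "positive/negative" names the predicted class:

```python
import numpy as np

# Hypothetical actual and predicted labels (0 = negative, 1 = positive)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actual 0, predicted 0
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actual 0, predicted 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actual 1, predicted 0
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actual 1, predicted 1

print(tn, fp, fn, tp)  # 3 1 2 2
```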

# <font color = blue >5. What is Accuracy score & Recall value?</font>

A Logistic Regression model is commonly evaluated using the Accuracy score & Recall value.

Accuracy score is the ratio of correct predictions to total predictions in the matrix.

***Accuracy score = (TN + TP)/(TN + FN + TP + FP)***

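The lower the number of false predictions, the better the accuracy.

Recall value is the accuracy within an individual class: the fraction of the values actually belonging to that class
that are correctly classified into it.

So, if a particular class has a low recall value, one can adjust the threshold from the probability matrix to improve it,
usually at the cost of the recall of the other class.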

Recall value for individual classes is given by the formula:

***Recall value for Class 0 – TN/(TN + FP)***

***Recall value for Class 1 – TP/(TP + FN)***
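
As a worked example, the formulas above can be applied to the counts quoted in section 4 (TN = 7017, TP = 1030, FN = 1316, FP = 406). This is only arithmetic on those quoted numbers, not a rerun of any model:

```python
# Counts quoted from the example confusion matrix in section 4
tn, tp, fn, fp = 7017, 1030, 1316, 406

accuracy = (tn + tp) / (tn + fn + tp + fp)
recall_class_0 = tn / (tn + fp)
recall_class_1 = tp / (tp + fn)

print(f"accuracy       = {accuracy:.4f}")
print(f"recall class 0 = {recall_class_0:.4f}")
print(f"recall class 1 = {recall_class_1:.4f}")
```

A high overall accuracy can hide a weak class: here class 0 is recalled far better than class 1.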

#### Accuracy Score, Recall value 
<img src = "images/LogicReg_AccuracyScore.jpg" width = 600>

---

# <font color = red>Code</font>

## <font color = green >1. Importing the libraries & reading the data</font>

In [4]:
#Importing pandas, numpy, matplotlib & seaborn libraries and 
#assigning them the aliases pd, np, plt & sns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns               

In [5]:
# Reading the csv file and storing it in a variable named "data"

data = pd.read_csv("adult_new.csv")

In [6]:
data

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
1,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
2,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K
3,65,Private,184454,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,6418,0,40,United-States,>50K
4,48,Private,279724,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,3103,0,48,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27409,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
27410,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
27411,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
27412,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


As one can clearly see, the dependent variable here is "income", which is divided into the classes >50K and <=50K.
So this is an example of Classification, and since it has only 2 classes in it, it is called Binary Classification.

An important thing to remember here is that Logistic Regression in its basic form is meant for Binary Classification.

If the Classification is Multi-level (more than 2 classes), the basic form does not apply directly. Multi-class extensions of
Logistic Regression exist, but they are not covered here.

## <font color = green >2. Preprocessing: Check for null values & special characters & treating them (if necessary)</font>


In [7]:
data.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

So we have got no null values, let us check for special characters then.

In [8]:
for i in data.columns:
    print(f"Unique value for column: {i}\n\n{data[i].unique()}\n") 

Unique value for column: age

[28 44 63 65 48 43 40 34 45 46 36 22 42 41 55 33 59 29 27 31 58 54 37 56
 39 51 50 52 47 24 64 53 35 32 30 68 57 66 49 38 61 73 26 23 25 62 90 60
 67 69 72 85 71 77 70 19 21 74 75 20 81 76 80 88 78 79 84 17 18 82 83 86
 87]

Unique value for column: workclass

['Local-gov' 'Private' 'Self-emp-not-inc' 'State-gov' 'Self-emp-inc' '?'
 'Federal-gov' 'Without-pay' 'Never-worked']

Unique value for column: fnlwgt

[336951 160323 104626 ...  84661 310152 257302]

Unique value for column: education

['Assoc-acdm' 'Some-college' 'Prof-school' 'HS-grad' 'Masters' 'Doctorate'
 'Bachelors' 'Assoc-voc' '9th' '10th' '7th-8th' '11th' '5th-6th' '1st-4th'
 '12th' 'Preschool']

Unique value for column: educational-num

[12 10 15  9 14 16 13 11  5  6  4  7  3  2  8  1]

Unique value for column: marital-status

['Married-civ-spouse' 'Never-married' 'Divorced' 'Separated'
 'Married-AF-spouse' 'Widowed' 'Married-spouse-absent']

Unique value for column: occupation

['Protectiv

We have got the special character '?' in the data. Let us replace it with np.nan so that these entries are counted
as missing values. I am storing the substituted data in a new variable 'data_new'.

In [9]:
data_new = data.replace("?",np.nan)

In [10]:
data_new.isnull().sum()

age                   0
workclass          1283
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         1288
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      508
income                0
dtype: int64

Replacing '?' with np.nan has made the missing entries visible: 1283 missing values for workclass, 1288 for occupation & 508
for native-country.

In [11]:
data_new.shape

(27414, 15)

As a rule of thumb, null values in a column are imputed only if they account for less than 30 % of the rows.

In this case the maximum null value % (for 'occupation') is around: 

(1288/27414)*100 

In [12]:
(1288/27414)*100

4.6983293207850005

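Since about 4.7 % is well below 30 %, one can treat the null values with the mean/median (for numeric columns) or the mode (for categorical columns). All three affected columns are categorical, so the mode is used here.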
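
The same percentage can be computed for every column at once. The sketch below uses a small made-up frame (`toy`) in place of `data_new`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_new
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "age": [28, 44, 63, 65],
})

# isnull().mean() is the fraction of missing rows per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

On the real data the equivalent call would be `data_new.isnull().mean() * 100`.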

In [13]:
for value in ["workclass","occupation","native-country"]:
    # Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
    data_new[value] = data_new[value].fillna(data_new[value].mode()[0])

In the code above, we are using a for loop to fill the not assigned (na) values of each column with the first mode of
that column. The filled column is assigned back to data_new, so the changes are stored in the dataset itself.

In [14]:
data_new.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

## <font color = green >3. Preprocessing: Highlighting the Categorical columns and label encoding them</font>


Categorical columns hold data in dtype 'object', and before passing them to an ML algorithm it is important that we 
convert the object dtype into numbers ('int64') by label encoding them.

Label encoding assigns a number to each distinct string. Before moving on to label encoding, we first need to
pick out the categorical columns as shown below:

In [16]:
colname=[]
for i in data_new.columns:
    if data_new[i].dtypes== "object":
        colname.append(i)
colname   

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'gender',
 'native-country',
 'income']

The above loop picks out all the categorical columns from data_new: for every column (i) in
data_new.columns it checks whether the data type (dtype) of the column is "object" and, if so, appends the column name
to the initially empty list colname.
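
An alternative to the explicit loop is pandas' `select_dtypes`. The sketch below uses a made-up frame and selects every non-numeric column. That is an assumption which matches this dataset, where all non-numeric columns are categorical:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [28, 44],
    "workclass": ["Local-gov", "Private"],
    "income": [">50K", "<=50K"],
})

# Keep every column that is not numeric
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['workclass', 'income']
```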

In [17]:
# Label encoding the categorical data

from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()

for x in colname:
    data_new[x]=le.fit_transform(data_new[x])   
    
    le_name_mapping = dict(zip(le.classes_,le.transform(le.classes_))) 
   
    print("Feature",x)
    print("mapping", le_name_mapping)

Feature workclass
mapping {'Federal-gov': 0, 'Local-gov': 1, 'Never-worked': 2, 'Private': 3, 'Self-emp-inc': 4, 'Self-emp-not-inc': 5, 'State-gov': 6, 'Without-pay': 7}
Feature education
mapping {'10th': 0, '11th': 1, '12th': 2, '1st-4th': 3, '5th-6th': 4, '7th-8th': 5, '9th': 6, 'Assoc-acdm': 7, 'Assoc-voc': 8, 'Bachelors': 9, 'Doctorate': 10, 'HS-grad': 11, 'Masters': 12, 'Preschool': 13, 'Prof-school': 14, 'Some-college': 15}
Feature marital-status
mapping {'Divorced': 0, 'Married-AF-spouse': 1, 'Married-civ-spouse': 2, 'Married-spouse-absent': 3, 'Never-married': 4, 'Separated': 5, 'Widowed': 6}
Feature occupation
mapping {'Adm-clerical': 0, 'Armed-Forces': 1, 'Craft-repair': 2, 'Exec-managerial': 3, 'Farming-fishing': 4, 'Handlers-cleaners': 5, 'Machine-op-inspct': 6, 'Other-service': 7, 'Priv-house-serv': 8, 'Prof-specialty': 9, 'Protective-serv': 10, 'Sales': 11, 'Tech-support': 12, 'Transport-moving': 13}
Feature relationship
mapping {'Husband': 0, 'Not-in-family': 1, 'Other-r

Label encoding is done using the LabelEncoder class from the preprocessing module of the sklearn library. 

A LabelEncoder object is stored in the variable 'le' & fitted afresh on every column (x) in the colname list.

Remember, the colname list holds the names of all the categorical columns. So, in effect, we are applying 
label encoding using le.fit_transform on all the categorical columns. This gets us the label encoded columns.

To view how the label encoding has turned out, we can use 'dict(zip(le.classes_,le.transform(le.classes_)))'. This 
expression builds a dictionary [key:value] that shows each class and its corresponding encoded number.

To give an example, in the mapping for native-country, Mexico is replaced with the number 25.
Also, in the mapping for the feature 'income' we get the values <=50K: 0, >50K: 1.

To view the mapped columns we can use data_new.head().


In [18]:
data_new.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,28,1,336951,7,12,2,10,0,4,1,0,0,40,38,1
1,44,3,160323,15,10,2,6,0,2,1,7688,0,40,38,1
2,63,5,104626,14,15,2,9,0,4,1,3103,0,32,38,1
3,65,3,184454,11,9,2,6,0,4,1,6418,0,40,38,1
4,48,3,279724,11,9,2,6,0,4,1,3103,0,48,38,1
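
One thing the loop above does not keep is the fitted encoder of each column, because the single `le` object is refitted every time. If the original labels need to be recovered later, a possible variation (a sketch on made-up data, with `encoders` as a hypothetical helper dictionary) is to keep one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "income": [">50K", "<=50K", ">50K"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["gender", "income"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
# Codes follow the sorted order of the labels, e.g. '<=50K' -> 0, '>50K' -> 1
print(encoders["income"].inverse_transform([0, 1]))
```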


## <font color = green >4. Putting the label encoded values in X and Y</font>



In [19]:
X= data_new.values[:,0:-1] 

Y= data_new.values[:,-1]

For X we are providing a slice: ':' selects all rows, whereas 0:-1 selects the columns from column 0 up to the second-last column
(column 'age' till 'native-country').

For Y we are providing just the index of the last column 'income', hence only -1.
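
The same slicing on a tiny made-up array shows what each expression selects:

```python
import numpy as np

# Hypothetical 3x4 array: three feature columns followed by one target column
values = np.array([
    [28, 1, 40, 1],
    [44, 3, 40, 1],
    [58, 3, 20, 0],
])

X_demo = values[:, 0:-1]  # all rows, every column except the last
Y_demo = values[:, -1]    # all rows, last column only

print(X_demo.shape, Y_demo.shape)  # (3, 3) (3,)
```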

Let us see what X & Y look like after splitting.

In [21]:
X

array([[    28,      1, 336951, ...,      0,     40,     38],
       [    44,      3, 160323, ...,      0,     40,     38],
       [    63,      5, 104626, ...,      0,     32,     38],
       ...,
       [    58,      3, 151910, ...,      0,     40,     38],
       [    22,      3, 201490, ...,      0,     20,     38],
       [    52,      4, 287927, ...,      0,     40,     38]], dtype=int64)

In [22]:
Y

array([1, 1, 1, ..., 0, 0, 1], dtype=int64)

In [23]:
X.shape

(27414, 14)

In [24]:
Y.shape

(27414,)

## <font color = green >5. Scaling the high numeric columns from X</font>

High numeric columns are those columns whose values are much larger in magnitude than the rest (e.g. column 'fnlwgt').

Normalization is done using MinMaxScaler(), which rescales the values of each column into the range 0 to 1.

Standardization is done using StandardScaler(), which rescales every column to mean 0 and standard deviation 1, so that most values typically fall between about -3 and 3.
It is done to keep the algorithm from being biased towards a high numeric column. So, without 
scaling, the Machine Learning algorithm can be dominated by a high numeric column.
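
A quick sketch comparing the two scalers on made-up numbers (the values in `raw` are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns on very different scales (e.g. age and fnlwgt)
raw = np.array([[28.0, 336951.0],
                [44.0, 160323.0],
                [63.0, 104626.0],
                [65.0, 184454.0]])

standardized = StandardScaler().fit_transform(raw)
normalized = MinMaxScaler().fit_transform(raw)

print(standardized.mean(axis=0).round(6))  # approximately 0 per column
print(standardized.std(axis=0).round(6))   # approximately 1 per column
print(normalized.min(axis=0), normalized.max(axis=0))  # 0 and 1 per column
```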

In [25]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

scaler.fit(X)
X= scaler.transform(X)

In [26]:
print(X)

[[-0.91080072 -1.82178236  1.41055538 ... -0.24549699 -0.13195044
   0.2536764 ]
 [ 0.30347105 -0.08150981 -0.27597356 ... -0.24549699 -0.13195044
   0.2536764 ]
 [ 1.74541877  1.65876274 -0.80779525 ... -0.24549699 -0.78409618
   0.2536764 ]
 ...
 [ 1.36595884 -0.08150981 -0.35630492 ... -0.24549699 -0.13195044
   0.2536764 ]
 [-1.36615263 -0.08150981  0.11710872 ... -0.24549699 -1.76231478
   0.2536764 ]
 [ 0.91060693  0.78862647  0.94245069 ... -0.24549699 -0.13195044
   0.2536764 ]]


In the code above we use the StandardScaler class from the preprocessing module of the sklearn library. We store the
scaler object in the variable scaler. One thing to note here is that we are scaling only the independent columns, and hence we fit
the scaler on X.

## <font color = green >6. Performing Train-test split</font>

Similar to Linear Regression we perform Train-test split here as well. This time we have kept test size as 30 % of the
data which means the train size would be 70%.

In [27]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3, random_state=10)

In [28]:
print(X_train.shape)
print(Y_train.shape)
print()
print(X_test.shape)
print(Y_test.shape)

(19189, 14)
(19189,)

(8225, 14)
(8225,)
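
In this notebook the scaler was fitted on all of X before the split. A commonly recommended variation, which is not what the code above does, is to split first and fit the scaler on the training part only, so that no statistics from the test rows leak into training. The sketch below uses made-up data, and the `stratify` argument is an optional extra:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical features
y_raw = np.array([0] * 60 + [1] * 40)                    # hypothetical labels

# stratify keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=10, stratify=y_raw
)

scaler_demo = StandardScaler().fit(X_tr)   # statistics from training rows only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)  # (70, 3) (30, 3)
```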


## <font color = green >7. Training the model on LogisticRegression() function</font>

Here we train the model using the LogisticRegression class, which we import from the linear_model module of the sklearn library.
We store the model object in a variable called 'classifier' and fit it on the training data X_train & Y_train.

In [29]:
from sklearn.linear_model import LogisticRegression

# Create a model object
classifier = LogisticRegression()

# Train the model object
classifier.fit(X_train,Y_train)
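
To connect the fitted model back to the Sigmoid curve from section 1, the sketch below checks, on synthetic data from `make_classification` (a stand-in, not the income dataset), that the class 1 probability reported by a binary LogisticRegression equals the Sigmoid of the linear score built from `coef_` and `intercept_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=10)

model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = X.w + b, then the sigmoid turns it into P(class 1)
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_proba, model.predict_proba(X_demo)[:, 1]))  # True
```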

## <font color = green >8. Predicting Y_pred based on X_test data</font>

In the code block below we are making predictions on the test data and printing the predictions Y_pred.

Also, we are using list with zip to show the (actual, predicted) pairs, which are either: 
    
**classified correctly (e.g. (1,1), (0,0))** 

**or wrongly classified (e.g. (1,0), (0,1))**

In [30]:
Y_pred = classifier.predict(X_test)
print(Y_pred)
print(list(zip(Y_test,Y_pred))) 

[1 1 1 ... 1 1 0]
[(1, 1), (1, 1), (1, 1), (1, 0), (1, 1), (1, 1), (0, 1), (0, 0), (1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (1, 0), (0, 0), (1, 1), (1, 0), (1, 1), (0, 0), (1, 1), (1, 0), (1, 0), (0, 0), (0, 0), (0, 0), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0), (1, 0), (0, 1), (0, 0), (1, 1), (1, 1), (0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (1, 0), (0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (0, 0), (1, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (0, 0), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (0, 0), (1, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 1), (0, 0), (0, 0), (0, 0), (1, 1), (0, 0), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0), (1, 1), (0, 0), (0, 0), (0, 0), (0, 0

## <font color = green >9. Evaluating the data using Confusion matrix</font>


In the code block below we import confusion_matrix, accuracy_score & classification_report from the metrics module
of the sklearn library. 

**An important thing to remember is that all these metrics are evaluated on 'Y_test & Y_pred'.**

In [31]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm = confusion_matrix(Y_test, Y_pred)
print(cfm)

print("Classification report: ")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("accuracy of the model:",acc)

[[3909  798]
 [1096 2422]]
Classification report: 
              precision    recall  f1-score   support

           0       0.78      0.83      0.80      4707
           1       0.75      0.69      0.72      3518

    accuracy                           0.77      8225
   macro avg       0.77      0.76      0.76      8225
weighted avg       0.77      0.77      0.77      8225

accuracy of the model: 0.769726443768997


So in the confusion matrix generated, 3909 values (TN) and 2422 values (TP) are correctly classified, whereas 1096 (FN) and
798 (FP) are wrongly classified.

Accuracy of the model is 0.7697 or 76.97%.

Recall, or so to say the individual accuracy, of class 0 is 0.83 (83 %) and that of class 1 is 0.69 (69 %).

This indicates that the performance on class 1 is still poor.
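
These figures can be reproduced by hand from the printed confusion matrix. In sklearn's layout the rows are the actual classes and the columns the predicted classes, so flattening the matrix gives TN, FP, FN, TP in that order:

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cfm_demo = np.array([[3909, 798],
                     [1096, 2422]])

tn, fp, fn, tp = cfm_demo.ravel()

accuracy = float((tn + tp) / cfm_demo.sum())
recall_0 = float(tn / (tn + fp))
recall_1 = float(tp / (tp + fn))

print(round(accuracy, 4), round(recall_0, 2), round(recall_1, 2))  # 0.7697 0.83 0.69
```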

One likely reason behind this discrepancy is the concept of "Class Imbalance".
Class Imbalance happens when the value counts of the classes differ, making the ML model favour one class over the
other.

Let us check the value count of the dependent column y to make it more clear.

In [33]:
data_new['income'].value_counts()

income
0    15727
1    11687
Name: count, dtype: int64

So the above count shows a moderate Class Imbalance: we have a lower count in Class 1, which is consistent with its lower recall
value.
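
As mentioned in section 5, one way to raise the recall of a weak class is to move the threshold. The sketch below is only an illustration on synthetic data: the threshold 0.4 is an arbitrary value, and the metrics are computed on the training data for brevity. Lowering the threshold can never reduce the recall of class 1, but it typically lowers the recall of class 0.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, mildly imbalanced stand-in data (about 60% class 0, 40% class 1)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.6, 0.4], random_state=10
)
model = LogisticRegression().fit(X_demo, y_demo)
proba_1 = model.predict_proba(X_demo)[:, 1]  # predicted P(class 1)

recalls = {}
for threshold in (0.5, 0.4):  # 0.4 is an arbitrary illustrative value
    pred = (proba_1 >= threshold).astype(int)
    recalls[threshold] = recall_score(y_demo, pred)  # recall of class 1
    print(threshold, round(recalls[threshold], 3))
```

Another option offered by sklearn is `LogisticRegression(class_weight="balanced")`, which reweights the classes during training.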

## <font color = green >10. What are Type 1 & Type 2 errors? </font>


Now that we have learned the algorithm well, the question arises whether some errors are more serious than others when it comes to classification. The simple answer is 'yes'.

Type 1 error is a false positive (FP). On its occurrence the purpose of the model is usually still fulfilled.

Type 2 error is a false negative (FN). On its occurrence the purpose of the model is NOT fulfilled.
In scenarios like the one below, a Type 2 error is the fatal one.

To give an example of Type 1 & Type 2 errors, consider the Covid-19 scenario.
Say people not affected with Covid-19 are shown positive; such an error doesn't affect 
the society as a whole. This is a Type 1 error.

However, if people who are actually affected with Covid-19 are shown negative, then such an error is something that
one cannot neglect. This is a Type 2 error.
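
For the model trained in this notebook, the two error rates can be read off the confusion matrix from section 9. This is plain arithmetic on the printed counts:

```python
# Counts taken from the confusion matrix in section 9
tn, fp, fn, tp = 3909, 798, 1096, 2422

type_1_rate = fp / (fp + tn)  # false positive rate
type_2_rate = fn / (fn + tp)  # false negative rate

print(f"Type 1 (false positive) rate: {type_1_rate:.3f}")
print(f"Type 2 (false negative) rate: {type_2_rate:.3f}")
```

The Type 2 rate is simply 1 minus the recall of class 1, which is why a low class 1 recall matters most when false negatives are the costly error.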