# **Scikit-Learn: Logistic Regression**

---



## An overview of Logistic Regression



Whereas linear regression deals with continuous dependent variables such as house prices, logistic regression estimates the probability of an outcome that is categorical by nature, e.g. whether someone has diabetes or not. This makes it a method for solving classification problems.

A logistic regression function *p(X)* is found such that the predicted outcome *p(Xi)* is as close as possible to the actual outcome *Yi* for each observation.

First, the log-odds of the outcome, or *logit*, is expressed as a linear function of the independent variables:


$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n $


where *X1, ..., Xn* are the independent variables and *β0, ..., βn* are the coefficients, or *estimators*, that will be calculated.

The linear function is then substituted into the general sigmoid equation:


$ p = \frac{1}{1 + e^{-y}} $


to form $ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} $

Whereas linear regression is solved using Ordinary Least Squares, logistic regression is solved using Maximum Likelihood Estimation: the coefficients are chosen to maximise the probability of observing the actual outcomes.
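
As a rough illustration of what maximum likelihood estimation optimises, the sketch below evaluates the log-likelihood of a one-variable logistic model for two candidate coefficient pairs. The toy data, the function names `sigmoid` and `log_likelihood`, and both coefficient pairs are assumptions made up for this example; a real solver searches for the pair with the highest value rather than comparing two by hand.

```python
import numpy as np


def sigmoid(y):
    """Map a real-valued logit y to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))


def log_likelihood(beta0, beta1, x, outcomes):
    """Log-likelihood of binary outcomes under a one-variable logistic model."""
    p = sigmoid(beta0 + beta1 * x)
    return np.sum(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))


# Hypothetical toy data: small x values are labelled 0, large ones 1.
x_toy = np.arange(10)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Two hand-picked candidate coefficient pairs (assumptions, not fitted values).
ll_flat = log_likelihood(0.0, 0.0, x_toy, y_toy)    # predicts 0.5 everywhere
ll_slope = log_likelihood(-1.0, 0.5, x_toy, y_toy)  # probability rises with x

# The candidate whose probabilities follow the labels scores higher.
print(ll_flat, ll_slope)
```

A higher (less negative) log-likelihood means the candidate coefficients explain the observed outcomes better.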

The sigmoid function can be plotted as a sigmoid curve like so:

![Sigmoid Curve](https://miro.medium.com/max/1280/1*OUOB_YF41M-O4GgZH_F2rw.png)


where all outputs lie between 0, meaning "No", and 1, meaning "Yes", with 0.5 being the default threshold value. A sigmoid function is chosen because it maps any real-valued input to a value between 0 and 1, and inputs far from zero give outputs close to 0 or 1.
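
The behaviour of the curve and the 0.5 threshold can be sketched in a few lines of NumPy. The logit values below are hypothetical and were chosen only to show outputs on both sides of the threshold.

```python
import numpy as np


def sigmoid(y):
    """Map a real-valued logit y to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))


# Hypothetical logits, chosen only to illustrate the shape of the curve.
logits = np.array([-6.0, -2.0, -0.5, 0.5, 2.0, 6.0])
probabilities = sigmoid(logits)

# Apply the 0.5 threshold: above it is read as "Yes" (1), otherwise "No" (0).
labels = (probabilities > 0.5).astype(int)

print(probabilities)
print(labels)
```

Logits far from zero produce probabilities close to 0 or 1, while logits near zero stay close to the threshold.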



## An example of a single-variate binary classification problem

 1. Import all necessary packages



```
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score 
from sklearn.model_selection import train_test_split
import pandas as pd 
import seaborn as sns
import pickle
```



2. Generate data 



```
x = np.arange(10).reshape(-1, 1)
y = np.array([0,0,0,0,1,1,1,1,1,1])
```



The .reshape() method is used because the model expects a two-dimensional array with one row per observation and one column per feature. The argument -1 lets NumPy work out the needed number of rows, and 1 gives a single column of data.
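
The effect of the reshape can be checked by comparing array shapes before and after. The variable names `flat` and `column` are placeholders for this sketch only.

```python
import numpy as np

flat = np.arange(10)           # one-dimensional array of 10 values
column = flat.reshape(-1, 1)   # two-dimensional: 10 rows, 1 column

print(flat.shape)
print(column.shape)
```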

3. Create and train the model 



```
model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(x, y)
```



Note that we use the "liblinear" solver and a random state of 0 so that the same sequence of random numbers is generated each time the model is trained. .fit() is used to calculate the coefficients, *βn*, in the linear function.
The model instance is returned:



```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
```



The intercept and coefficient of the linear function can also be found:



```
intercept = model.intercept_ 
coefficient = model.coef_
```





```
array([-1.04608067])
array([[0.51491375]])
```



4. Evaluate the model

We can check the matrix of probabilities using 

```
model.predict_proba(x)
```
which gives the output:


```
array([[0.74002157, 0.25997843],
       [0.62975524, 0.37024476],
       [0.5040632 , 0.4959368 ],
       [0.37785549, 0.62214451],
       [0.26628093, 0.73371907],
       [0.17821501, 0.82178499],
       [0.11472079, 0.88527921],
       [0.07186982, 0.92813018],
       [0.04422513, 0.95577487],
       [0.02690569, 0.97309431]])
```
Each row is an observation. The first column is the probability of the output being 0 and the second column is the probability that the output is 1. 
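
As a sanity check, the two columns can be rebuilt by hand from the fitted intercept and coefficient. The sketch below refits the same toy model so that it runs on its own; the names `logit`, `p_one`, and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data and model settings as in the example above.
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver="liblinear", random_state=0).fit(x, y)

# Linear function (logit) for every observation, flattened to 1-D.
logit = (x @ model.coef_.T + model.intercept_).ravel()

# Sigmoid of the logit gives P(output = 1); its complement gives P(output = 0).
p_one = 1.0 / (1.0 + np.exp(-logit))
manual = np.column_stack([1.0 - p_one, p_one])

# Compare the hand-built matrix with scikit-learn's own output.
print(np.allclose(manual, model.predict_proba(x)))
```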



Using 
```
model.predict(x) 
```
gives the actual prediction results 


```
[0 0 0 1 1 1 1 1 1 1]
```




```
results = model.predict(x)                  # predicted class for each observation
prediction_matrix = model.predict_proba(x)
p_x = prediction_matrix[:, 1]               # probability that the output is 1
x_val = np.arange(10)
# Linear function evaluated at each x, as a flat array for plotting
y_linear = coefficient[0, 0] * x_val + intercept[0]
plt.figure(figsize=(8, 6))
plt.grid(True)
plt.title("Visual illustration of the results")
plt.plot(x_val, p_x, label="Sigmoid Function")
plt.plot(x_val, results, "g.", markersize=20, label="Predictions")
plt.plot(3, 1, "rx", markersize=20, label="Incorrect Prediction")
plt.plot(3, 0, "b.", markersize=20, label="Correct Prediction")
plt.plot(x_val, y_linear, label="Linear Function")
plt.axhline(y=0.5, color="y", linestyle="--")
plt.ylim(-0.1, 1.1)
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```

![Example Illustration](https://raw.githubusercontent.com/KaiSun19/LogisticRegression/figures/example_illustration.png)

The above figure shows the results of the binary classification problem. As you can see, there was one false positive, shown by the red cross, which should have been a negative result, shown by the blue dot. Also, slightly above the x value of 2 is where the linear function gives an output of 0, and thus where the sigmoid function crosses the 0.5 threshold shown by the yellow line, separating the outcomes into 0s and 1s.
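
The crossing point described above can also be computed directly: the linear function is zero where x equals minus the intercept divided by the coefficient. The sketch below refits the toy model so it is self-contained; the name `boundary` is an assumption for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data and model settings as in the example above.
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver="liblinear", random_state=0).fit(x, y)

# Solve intercept + coefficient * x = 0 for x.
boundary = -model.intercept_[0] / model.coef_[0, 0]
print(boundary)

# Observations to the right of the boundary are predicted as 1.
right_of_boundary = (x.ravel() > boundary).astype(int)
print(np.array_equal(right_of_boundary, model.predict(x)))
```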

## Real-life use of binary classification

**Problem**: 

The dataset given contains medical information, e.g. number of previous pregnancies, insulin level, blood glucose level, and BMI. Each observation also has the dependent variable "Outcome", which is either "1", meaning the patient has diabetes, or "0", meaning they do not.

The task is to use logistic regression to accurately predict whether a patient has diabetes given their medical information.

**Step 1:** Load the dataset



```
url = "https://raw.githubusercontent.com/KaiSun19/LogisticRegression/data/diabetes.csv"
dataset = pd.read_csv(url)
```



**Step 2:** Analyze the dataset to find features and target variables






```
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(dataset.head(10))
```





```
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
5            5      116             74              0        0  25.6                     0.201   30        0
6            3       78             50             32       88  31.0                     0.248   26        1
7           10      115              0              0        0  35.3                     0.134   29        0
8            2      197             70             45      543  30.5                     0.158   53        1
9            8      125             96              0        0   0.0                     0.232   54        1
```



As we can see, the feature variables are quantitative and the "Outcome" column is a binary target variable.



```
features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
target = "Outcome"
```



We then assign our features to the variable "X" and the corresponding target to "y".



```
X = dataset[features]
y = dataset[target]
```



**Step 3:** Training the Logistic Regression Model

The dataset is split into a training set to fit the model, and a test set to see how well the model works on unseen data. The test_size parameter is set to 0.25, meaning 25% of the dataset will be the test set. Observations are assigned to each set at random, and random_state is set to 0 so that the same split is produced on every run.


```
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
```
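
The roles of test_size and random_state can be seen on a small stand-in array. The data below is hypothetical (100 observations with 2 features each) and is not the diabetes dataset; the `_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 observations with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Two separate calls with the same random_state.
first = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(X_train_demo.shape, X_test_demo.shape)

# The same random_state yields an identical split on every call.
print(np.array_equal(first[1], second[1]))
```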


The model variable is used to create an instance of the Logistic Regression classifier. 

We use the "liblinear" solver as the optimization algorithm; C = 0.05, the inverse of the regularization strength, where a smaller value means stronger regularization to prevent overfitting; and a random_state of 0 so that any randomness in the training process is reproducible.

We then fit the model to the training data using the .fit() method and predict outputs from the features in the test set.



```
model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr',
                           random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
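
To get a feel for what C does, the sketch below fits two models on synthetic data, one with C = 0.05 and one with a much larger C, and compares the size of their coefficients. The data comes from scikit-learn's make_classification helper and is an assumption for this illustration, not the diabetes dataset; the value C = 100 is likewise arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C: strong regularization. Large C: weak regularization.
strong = LogisticRegression(solver="liblinear", C=0.05, random_state=0)
weak = LogisticRegression(solver="liblinear", C=100.0, random_state=0)
strong.fit(X_demo, y_demo)
weak.fit(X_demo, y_demo)

# A stronger penalty shrinks the coefficients towards zero.
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```

Smaller coefficients make the model less sensitive to any single feature, which is how regularization guards against overfitting.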



**Step 4:** Scoring the data

A confusion matrix shows predictions as true negatives (correct 0s), false positives (incorrect 1s), true positives (correct 1s), and false negatives (incorrect 0s). As the problem involves binary classification, the matrix is a 2 × 2 matrix:



```
cm = confusion_matrix(y_test, y_pred)
print(cm)
```


```
[[115  15]
 [ 32  30]]
```
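
The layout of the matrix is easier to follow on labels small enough to count by hand. In the sketch below the two label lists are hypothetical; rows of the matrix are actual classes and columns are predicted classes, so flattening it gives the four counts in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, small enough to count by hand.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = (int(v) for v in cm_demo.ravel())

print(cm_demo)
print(tn, fp, fn, tp)
```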




A visualisation of the confusion matrix can make this clearer:



```
class_names=[0,1] 
plt.xticks(class_names)
plt.yticks(class_names)
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
```



![Confusion Matrix](https://raw.githubusercontent.com/KaiSun19/LogisticRegression/figures/workshop_cm.png)



*   The accuracy_score() function shows the proportion of all predictions that were correct.
*   The precision_score() function shows the proportion of predicted positives that were actually positive.
*   The recall_score() function shows the proportion of actual positives that the model correctly identified.





```
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```




```
Accuracy: 0.7552083333333334
Precision: 0.6666666666666666
Recall: 0.4838709677419355
```
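
These three scores can be recomputed by hand from the four counts in the confusion matrix shown earlier (115 true negatives, 15 false positives, 32 false negatives, 30 true positives). The `_manual` variable names are placeholders for this sketch.

```python
# Counts taken from the confusion matrix shown earlier.
tn, fp, fn, tp = 115, 15, 32, 30

# Accuracy: correct predictions out of all predictions.
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

# Precision: predicted positives that were actually positive.
precision_manual = tp / (tp + fp)

# Recall: actual positives that the model identified.
recall_manual = tp / (tp + fn)

print("Accuracy:", accuracy_manual)
print("Precision:", precision_manual)
print("Recall:", recall_manual)
```

The low recall relative to accuracy reflects the 32 patients with diabetes that the model labelled as 0.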



**Step 5:** Storing the classifier

It is often useful to store the model in a file via pickling so that it can be used later without having to train the model again. First, a new file is opened for writing in binary mode via the open("filename", "wb") function, and the resulting file object is saved as the variable "f". The model is then dumped into the file.



```
f = open("logisticregression.pickle", "wb")
pickle.dump(model, f)
f.close()
```



Next, we use open("filename", "rb") to open the file for reading and save the file object as the variable "pickle_in". We can then load the classifier and use it for predictions again without retraining.


```
f = open("logisticregression.pickle", "wb")
pickle.dump(model,f)
f.close()
pickle_in = open("logisticregression.pickle", "rb")
classifier = pickle.load(pickle_in)
pickle_in.close()
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:" + " " + str(accuracy))
```





```
Accuracy: 0.7552083333333334

```
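
An alternative way to write the same round trip is with `with` blocks, which close the file automatically even if an error occurs. The sketch below is only one possible arrangement: it trains a small stand-in model on toy data and writes to a temporary directory, both of which are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on toy data (not the diabetes classifier).
x_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model_demo = LogisticRegression(solver="liblinear", random_state=0)
model_demo.fit(x_demo, y_demo)

# Hypothetical file location inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logisticregression.pickle")

    # Each file is closed automatically when its "with" block ends.
    with open(path, "wb") as f:
        pickle.dump(model_demo, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored classifier predicts exactly as the original does.
same_predictions = np.array_equal(
    restored.predict(x_demo), model_demo.predict(x_demo)
)
print(same_predictions)
```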


