# First logistic regression

# Accessing the drive

In [1]:
from google.colab import drive
drive.mount('/content/Drive')

Mounted at /content/Drive


In [2]:
%cd '/content/Drive/MyDrive/Colab Notebooks/MachineLearning/Models/LogisticRegression/LogisticExampleAdvanced'
!ls

/content/Drive/MyDrive/Colab Notebooks/MachineLearning/Models/LogisticRegression/LogisticExampleAdvanced
 Admitance.csv			     'LogisticExampleAdvanced - 3.ipynb'
'LogisticExampleAdvanced - 1.ipynb'   LogisticExample.ipynb
'LogisticExampleAdvanced - 2.ipynb'   Test.csv


# Importing packages

In [3]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# Creating DataFrame

In [4]:
data = pd.read_csv('Admitance.csv')
data.head()

Unnamed: 0,SAT,Admitted
0,1363,No
1,1792,Yes
2,1954,Yes
3,1653,No
4,1593,No


In [5]:
data['Admitted'] = data['Admitted'].map({'Yes': 1, 'No': 0})
data.head()

Unnamed: 0,SAT,Admitted
0,1363,0
1,1792,1
2,1954,1
3,1653,0
4,1593,0


# Regression

## Declaring variables and starting regression

In [6]:
# "Optimization terminated successfully" means the optimizer converged to a maximum-likelihood solution.
# By default statsmodels allows at most 35 iterations; if the fit has not converged by then, it stops and issues a warning.

y = data['Admitted']
x = data['SAT']

X = sm.add_constant(x)
logistic = sm.Logit(y, X)
reg = logistic.fit()
reg

Optimization terminated successfully.
         Current function value: 0.137766
         Iterations 10


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x7f6116c53850>

## Viewing the regression

In [7]:
print(reg.summary())

                           Logit Regression Results                           
Dep. Variable:               Admitted   No. Observations:                  168
Model:                          Logit   Df Residuals:                      166
Method:                           MLE   Df Model:                            1
Date:                Sun, 05 Mar 2023   Pseudo R-squ.:                  0.7992
Time:                        05:26:31   Log-Likelihood:                -23.145
converged:                       True   LL-Null:                       -115.26
Covariance Type:            nonrobust   LLR p-value:                 5.805e-42
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -69.9128     15.737     -4.443      0.000    -100.756     -39.070
SAT            0.0420      0.009      4.454      0.000       0.024       0.060

Possibly complete quasi-separation: A fraction 0.27

## Interpretation of statistics

### MLE (Maximum likelihood estimation)
This is the method used to estimate the coefficients: it chooses the values that maximize the likelihood of the observed data.

```
Method: MLE
```

### Log likelihood
This is the value that the estimation maximizes, so it indicates how well the fitted model explains the data. It is usually negative, and higher is better: for example, -3 is better than -6.
```
Log Likelihood: -23.145
```
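
As a quick check, the `Current function value` printed by `fit()` appears to be the negative log-likelihood averaged over the observations, so multiplying it by the number of observations should recover the log-likelihood up to rounding. A minimal sketch, using only numbers copied from the outputs above:

```python
# Values copied from the fit() output and the summary table.
current_function_value = 0.137766  # "Current function value"
n_observations = 168               # "No. Observations"

# Assumption: the optimizer reports the negative mean log-likelihood.
log_likelihood = -current_function_value * n_observations
print(round(log_likelihood, 3))
```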

### LL-Null (Log-likelihood null)
This statistic is the **log-likelihood** of a model without independent variables, that is, a model with only the intercept.

```
LL-Null: -115.26
```

### LLR (Log-likelihood ratio) p-value

This is the counterpart of the **F-test** in linear regression. It indicates the overall significance of the model, so it is a hypothesis test:

- Null hypothesis: the model is no better than a model without independent variables.

- Alternative hypothesis: the model is significant.

In this case the p-value is far below 0.05, so there is statistically significant evidence that the model is significant.

```
LLR p-value: 5.805e-42
format(5.805e-42, '.45f')
0.00000000000000000000000000000000000000005805
```
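
The p-value can be approximately reproduced from the two log-likelihoods in the summary. The sketch below assumes the usual likelihood-ratio test: the statistic is twice the difference between the log-likelihoods and is compared against a chi-squared distribution with 1 degree of freedom (one independent variable). It uses `math.erfc` for the chi-squared tail so that only the standard library is needed; the inputs are rounded, so the result is close to, not identical to, the reported value.

```python
import math

log_likelihood = -23.145  # "Log-Likelihood" in the summary
ll_null = -115.26         # "LL-Null" in the summary

# Likelihood-ratio statistic: twice the gain over the intercept-only model.
llr = 2 * (log_likelihood - ll_null)

# Upper tail of a chi-squared distribution with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(llr / 2))
print(llr, p_value)
```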

### Pseudo R-squared (McFadden's R-squared)
This statistic **plays a role similar to, but not the same as, R-squared in linear regression**. Logistic regression has several goodness-of-fit measures, such as **AIC**, **BIC** and **McFadden's R-squared**; the last is the one reported by **statsmodels**.

For **McFadden's R-squared**, values between 0.2 and 0.4 are commonly cited as a good fit, but above all it is a reference value for comparing models.

```
Pseudo R-square: 0.7992
```
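
McFadden's R-squared is defined as one minus the ratio between the model's log-likelihood and the null log-likelihood. A short sketch with the two values from the summary shows how the reported figure is obtained:

```python
log_likelihood = -23.145  # "Log-Likelihood" in the summary
ll_null = -115.26         # "LL-Null" in the summary

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null).
pseudo_r2 = 1 - log_likelihood / ll_null
print(round(pseudo_r2, 4))
```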

## Interpretation of coefficients

```
=====================
             coef
---------------------
const    -69.9128
SAT        0.0420
=====================
```

Coefficients within logistic regression work differently than coefficients within linear regression. It is easier to study these using the **sklearn** package.
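
One common way to read a logistic coefficient is as a change in the log-odds, so its exponential is an odds ratio. The sketch below applies that reading to the SAT coefficient from the summary; the 10-point step is only an illustrative choice.

```python
import math

sat_coef = 0.0420  # "SAT" coefficient in the summary

# exp(coef) is the factor by which the odds of admission change per extra SAT point.
odds_ratio_per_point = math.exp(sat_coef)

# For a 10-point increase the log-odds change by 10 * coef.
odds_ratio_per_10_points = math.exp(10 * sat_coef)
print(odds_ratio_per_point, odds_ratio_per_10_points)
```

Under this reading, each additional SAT point multiplies the odds of admission by roughly 1.04.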

## Stats remaining

The standard error measures the uncertainty of each estimated coefficient, and its purpose is to calculate the **z** value: this **z** value is the **coefficient divided by its standard error**. The **p** value is then calculated using a standard normal distribution, and the last two columns give the 95% confidence interval of the coefficient.

```
==================================================================
          std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------
const      15.737     -4.443      0.000    -100.756     -39.070
SAT         0.009      4.454      0.000       0.024       0.060
==================================================================
```
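
These columns can be approximately reproduced by hand. The sketch below uses the `const` row, because the standard error of `SAT` is rounded too coarsely in the table to recover its z value. It assumes a two-sided test and the usual 1.96 normal quantile for the 95% interval.

```python
import math

coef, std_err = -69.9128, 15.737  # "const" row of the summary

# z value: coefficient divided by its standard error.
z = coef / std_err

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

# Approximate 95% confidence interval.
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err
print(round(z, 3), p_value, round(ci_low, 3), round(ci_high, 3))
```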

In [8]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(data[['SAT']], data['Admitted'])

print('Intercept:', logistic.intercept_)
print('Coefficients:', logistic.coef_)
# score() returns the mean accuracy of the classifier, not an R-squared.
print('Accuracy:', logistic.score(data[['SAT']], data['Admitted']))

Intercept: [-69.90656965]
Coefficients: [[0.04200113]]
Accuracy: 0.9166666666666666


In [9]:
# Logistic regression models a probability, following an S-shaped curve that approaches 1 asymptotically.
# The intercept and the coefficient form a linear equation whose result is the log-odds (logit) of admission.

linear = logistic.intercept_ + (logistic.coef_[0]*1389)
linear

array([-11.56699954])

In [10]:
# The logistic function is based on the exponential, whose inverse is the natural logarithm.
# The exact logit (log-odds) is log(p) - log(1 - p). Here 1 - p is almost 1, so log(1 - p) is almost 0,
# and the sum of the two log-probabilities below is approximately log(p), close to the linear value from the previous cell.

test = {'SAT': [1389], 'Admitted': [0]}

testdf = pd.DataFrame(test)

logProba = logistic.predict_log_proba(testdf[['SAT']])

log = logProba[0][0] + logProba[0][1]
log

-11.567018484317622

In [11]:
# The predict_proba method returns the probability of each class.
# The second value is the probability of class 1 (admitted); the first is its complement, the probability of class 0.

logistic.predict_proba(testdf[['SAT']])

array([[9.99990526e-01, 9.47352956e-06]])
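
The probability of class 1 can also be obtained directly from the linear equation by applying the logistic (sigmoid) function, and the logit recovers the linear value exactly. A minimal sketch using the intercept and coefficient printed by sklearn above; since those printed values are rounded, the results match the notebook only to a few decimals.

```python
import math

intercept, coef = -69.90656965, 0.04200113  # values printed by sklearn above
sat = 1389

# Linear equation: the log-odds of admission.
log_odds = intercept + coef * sat

# The sigmoid turns log-odds into a probability.
probability = 1 / (1 + math.exp(-log_odds))

# The logit is the inverse: log(p) - log(1 - p) gives back the log-odds.
recovered = math.log(probability) - math.log1p(-probability)
print(log_odds, probability, recovered)
```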

In [12]:
# Note that taking the exponential of the logarithm gives approximately the same probability.

np.exp(log)

9.473439810843002e-06

In [13]:
# The probability is very low, so the predicted class is 0.

format(np.exp(log), '.10f')

'0.0000094734'

In [14]:
logistic.predict(testdf[['SAT']])

array([0])

## Confusion Matrix
To assess the reliability of the logistic regression model, it is possible to use the so-called **confusion matrix**.

In [15]:
# Note that the model was able to predict 154 correctly and failed 14 times. The total is 168.

from tabulate import tabulate

# pred_table()[i, j] counts the observations with actual class i that were predicted as class j.
conf = reg.pred_table()

# Each column holds the predictions for one class; each row is an actual class.
matrix = {'Prediction 0': conf[:, 0], 'Prediction 1': conf[:, 1]}

print(tabulate(matrix, tablefmt='fancy_grid', headers=matrix.keys(), showindex=['Actual 0', 'Actual 1']))

╒══════════╤════════════════╤════════════════╕
│          │   Prediction 0 │   Prediction 1 │
╞══════════╪════════════════╪════════════════╡
│ Actual 0 │             67 │              7 │
├──────────┼────────────────┼────────────────┤
│ Actual 1 │              7 │             87 │
╘══════════╧════════════════╧════════════════╛


### Statistics

#### Accuracy
This is one of the main metrics: the proportion of predictions that the model got right. It is found as follows:

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total}}$$

In the case of the example it is equal to:

$$\frac{154}{168} = 0.9167$$

#### Misclassification
This is another of the main metrics: the proportion of predictions that the model got wrong. It follows this formula:

$$\text{Misclassification} = \frac{\text{Wrong Predictions}}{\text{Total}}$$

In the case of the example it is equal to:

$$\frac{14}{168} = 0.0833$$
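
Both metrics can be computed directly from the confusion matrix. The sketch below copies the four counts from the table above; the correct predictions are the ones on the diagonal.

```python
# Counts copied from the confusion matrix above.
conf = [[67, 7],
        [7, 87]]

# Correct predictions sit on the diagonal.
correct = conf[0][0] + conf[1][1]
total = sum(sum(row) for row in conf)

accuracy = correct / total
misclassification = 1 - accuracy
print(round(accuracy, 4), round(misclassification, 4))
```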

#### Other statistics

In [16]:
# It is possible to obtain the most common statistics using sklearn's classification_report.

from sklearn.metrics import classification_report
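
A possible way to use `classification_report` is sketched below. Instead of the original data, it rebuilds the labels from the confusion matrix counts (67, 7, 7, 87), assuming that the rows of `pred_table()` are the actual classes and the columns are the predicted classes. The names `y_true` and `y_pred` are illustrative.

```python
from sklearn.metrics import classification_report

# 74 actual zeros (67 predicted as 0, 7 as 1) and 94 actual ones (7 predicted as 0, 87 as 1).
y_true = [0] * 74 + [1] * 94
y_pred = [0] * 67 + [1] * 7 + [0] * 7 + [1] * 87

# Text report with precision, recall and f1-score per class.
print(classification_report(y_true, y_pred))

# The same figures as a dictionary, convenient for further calculations.
report = classification_report(y_true, y_pred, output_dict=True)
```

With the real data, the same call would take `data['Admitted']` and the predictions of the fitted model.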

## Threshold

There is an important detail when classifying the data: the model turns each probability into a class using a **threshold**. This is usually set at 50%, but it is possible to use other thresholds. For more detail, see https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/.
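
A minimal sketch of applying a custom threshold is shown below. The probabilities are hypothetical values standing in for the second column of `predict_proba`, and the 0.7 threshold is only an illustrative choice.

```python
import numpy as np

# Hypothetical probabilities of class 1, e.g. predict_proba(...)[:, 1].
proba = np.array([0.05, 0.35, 0.55, 0.65, 0.95])

# Usual 50% threshold.
default_pred = (proba >= 0.5).astype(int)

# A stricter threshold classifies fewer observations as 1.
strict_pred = (proba >= 0.7).astype(int)
print(default_pred, strict_pred)
```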