# **Logistic Regression Model Theory**


## Theory
Logistic regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables, where the dependent variable is categorical (binary or multi-class). It predicts the probability of an event occurring by applying a sigmoid function to a linear model. The model function for logistic regression is:

$$ f_{w,b}(x) = \sigma(w^T x + b) $$

Here:
- $f_{w,b}(x)$ is the predicted probability of the positive class.
- $w$ is the vector of coefficients (weights).
- $b$ is the bias.
- $x$ is the input feature vector.
- $\sigma(z)$ is the sigmoid function defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Logistic regression uses a probabilistic approach to model the data and aims to maximize the likelihood of the observed data.

## Model Training

### Forward Pass

The forward pass in logistic regression computes the predicted probability of the positive class using the current parameters ($w, b$). The predicted probability is:

$$ f_{w,b}(x) = \sigma(w^T x + b) $$

### Cost Function

The cost function in logistic regression is based on cross-entropy loss, which measures the difference between the predicted probabilities and the actual class labels. The cost function is:

$$ J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(f_{w,b}(x^{(i)})) + (1 - y^{(i)}) \log(1 - f_{w,b}(x^{(i)})) \Big] $$

Where:
- $J(w,b)$ is the cost function.
- $m$ is the number of training examples.
- $x^{(i)}$ and $y^{(i)}$ are the input and actual output for the $i$-th example.
- $f_{w,b}(x^{(i)})$ is the predicted probability for the $i$-th example.

### Backward Pass (Gradient Computation)

The backward pass computes the gradients of the cost function with respect to the weights and bias. The gradient formulas are as follows:

$$ \frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \Big(f_{w,b}(x^{(i)}) - y^{(i)}\Big) $$

$$ \frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \Big(f_{w,b}(x^{(i)}) - y^{(i)}\Big) x_j^{(i)} $$

Where:
- The first term is the gradient for the cost function with respect to the bias.
- The second term is the gradient for the cost function with respect to the weights.

## Training Process

The training process involves iteratively updating the weights and biases to minimize the cost function, which is cross-entropy loss. This is typically done through an optimization algorithm like gradient descent. The update equations for parameters are:

$$ w_j \leftarrow w_j - \alpha \frac{\partial J}{\partial w_j} $$

$$ b \leftarrow b - \alpha \frac{\partial J}{\partial b} $$

Here, $\alpha$ is the learning rate, which controls the step size during parameter updates.

By iteratively performing the forward pass, computing the cost, performing the backward pass, and updating the parameters, the model learns to make better predictions and fits the data optimally.






## **Model Evaluation**

### 1. Accuracy

**Formula:**
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$

**Description:**
- **Accuracy** measures the proportion of correctly classified instances out of the total instances.
- It is a straightforward and commonly used metric for classification models.

**Interpretation:**
- Higher accuracy indicates that the model is correctly predicting more samples.
- **Limitations:**
  - Accuracy is not reliable when dealing with imbalanced datasets, as it can be high even if the model predicts only the majority class.

---

### 2. Precision

**Formula:**
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

**Description:**
- **Precision** focuses on the proportion of correctly predicted positive observations out of all observations predicted as positive.
- It answers the question: *Of all the predicted positive cases, how many are truly positive?*

**Interpretation:**
- High precision indicates fewer false positives.
- **Use Case:**
  - Precision is important in applications where false positives are costly (e.g., fraud detection).

---

### 3. Recall (Sensitivity/True Positive Rate)

**Formula:**
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

**Description:**
- **Recall** measures the proportion of actual positives that are correctly identified.
- It answers the question: *Of all the true positive cases, how many did the model identify?*

**Interpretation:**
- High recall indicates fewer false negatives.
- **Use Case:**
  - Recall is crucial in applications where missing positive cases is costly (e.g., disease diagnosis).

---

### 4. F1-Score

**Formula:**
$$
\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Description:**
- **F1-Score** is the harmonic mean of precision and recall, balancing the two metrics.
- It is especially useful when the dataset is imbalanced.

**Interpretation:**
- An F1-Score closer to 1 indicates a good balance between precision and recall.
- **Limitations:**
  - It does not differentiate between the costs of false positives and false negatives.

---

### 5. Log Loss (Logarithmic Loss)

**Formula:**
$$
\text{Log Loss} = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]
$$
where \(y_i\) is the actual label (0 or 1) and \(p_i\) is the predicted probability of the positive class.

**Description:**
- **Log Loss** quantifies the accuracy of predicted probabilities rather than class labels.
- It penalizes incorrect predictions with a heavier weight for confident but incorrect predictions.

**Interpretation:**
- Lower log loss indicates better model performance.
- **Use Case:**
  - It is often used in probabilistic classification problems and is sensitive to incorrect confidence levels.

---

### 6. Receiver Operating Characteristic (ROC) and Area Under Curve (AUC)

**ROC Curve:**
- A plot of the True Positive Rate (Recall) against the False Positive Rate at various threshold values.

**AUC:**
$$
\text{AUC} = \int_{0}^{1} \text{ROC Curve}
$$

**Description:**
- **AUC-ROC** represents the degree of separability between the positive and negative classes.
- AUC values range from 0 to 1, where 1 indicates a perfect model and 0.5 indicates random guessing.

**Interpretation:**
- A higher AUC indicates better performance in distinguishing between classes.
- **Limitations:**
  - It may not provide actionable insights in highly imbalanced datasets.

---

### 7. Confusion Matrix

**Structure:**
|                | Predicted Positive | Predicted Negative |
|----------------|--------------------|--------------------|
| Actual Positive| True Positive (TP) | False Negative (FN)|
| Actual Negative| False Positive (FP)| True Negative (TN) |

**Description:**
- A **confusion matrix** provides a summary of prediction results by showing counts of TP, TN, FP, and FN.

**Interpretation:**
- It helps in understanding the types of errors the model makes and aids in calculating other metrics like accuracy, precision, and recall.


## sklearn template [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

### class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='deprecated', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

| **Parameter**          | **Description**                                                                                                                                       | **Default**      |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|
| `penalty`              | Specify the norm of the penalty: ‘l1’, ‘l2’, ‘elasticnet’, None.                                                                                      | `l2`             |
| `dual`                 | Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver.                                                | `False`          |
| `tol`                  | Tolerance for stopping criteria.                                                                                                                      | `0.0001`         |
| `C`                    | Inverse of regularization strength; must be positive. Smaller values specify stronger regularization.                                                 | `1.0`            |
| `fit_intercept`        | Whether to include a constant (intercept) in the decision function.                                                                                   | `True`           |
| `intercept_scaling`    | Used when solver ‘liblinear’ is used and fit_intercept=True to adjust intercept scaling.                                                             | `1`              |
| `class_weight`         | Weights associated with classes, can be a dictionary or ‘balanced’.                                                                                  | `None`           |
| `random_state`         | Used for shuffling when solver='sag', 'saga', or 'liblinear'.                                                                                         | `None`           |
| `solver`               | Algorithm to use in the optimization problem.                                                                                                         | `lbfgs`          |
| `max_iter`             | Maximum number of iterations taken for solvers to converge.                                                                                          | `100`            |
| `multi_class`          | Strategy for multiclass classification: ‘ovr’, ‘multinomial’.                                                                                         | `auto`           |
| `verbose`              | Verbosity level for the liblinear and lbfgs solvers.                                                                                                 | `0`              |
| `warm_start`           | Whether to reuse the solution of the previous call to fit as initialization.                                                                          | `False`          |
| `n_jobs`               | Number of CPU cores to use when parallelizing over classes in multi_class='ovr'.                                                                     | `None`           |
| `l1_ratio`             | Elastic-Net mixing parameter, used only if penalty='elasticnet'.                                                                                     | `None`           |

-

| **Attribute**          | **Description**                                                                                                                                       |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `classes_`             | A list of class labels known to the classifier.                                                                                                       |
| `coef_`                | Coefficient of the features in the decision function.                                                                                                 |
| `intercept_`           | Intercept added to the decision function.                                                                                                             |
| `n_features_in_`       | The number of features seen during fit.                                                                                                               |
| `feature_names_in_`    | Names of features seen during fit, defined only when X has feature names.                                                                            |
| `n_iter_`              | Actual number of iterations for all classes.                                                                                                         |

-

| **Method**             | **Description**                                                                                                                                       |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `fit(X, y)`            | Fits the logistic regression model to the data X (input) and y (output).                                                                             |
| `predict(X)`           | Predicts the output using input data X based on the fitted model.                                                                                    |
| `predict_proba(X)`     | Returns the probability estimates for the classes.                                                                                                   |
| `predict_log_proba(X)` | Returns the log-probability estimates for the classes.                                                                                              |
| `score(X, y)`          | Returns the mean accuracy of the model.                                                                                                               |
| `get_params()`         | Gets the parameters of the logistic regression model.                                                                                               |
| `set_params(**params)` | Sets the parameters of the logistic regression model.                                                                                               |
| `decision_function(X)` | Returns confidence scores for each sample.                                                                                                           |
| `densify()`            | Converts the coefficient matrix to dense array format.                                                                                              |


# LogisticRegression - Example

## Data loading

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

%matplotlib inline

data = '/home/petar-ubuntu/Learning/ML_Theory/ML_Models/Supervised Learning/Classification models/LogisticRegresson/data/weatherAUS.csv'

df = pd.read_csv(data)

df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


##  Data processing / EDA

### Explore problems within categorical variables

In [4]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)

df[categorical].head()

There are 7 categorical variables

The categorical variables are : ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']


Unnamed: 0,Date,Location,WindGustDir,WindDir9am,WindDir3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,W,W,WNW,No,No
1,2008-12-02,Albury,WNW,NNW,WSW,No,No
2,2008-12-03,Albury,WSW,W,WSW,No,No
3,2008-12-04,Albury,NE,SE,E,No,No
4,2008-12-05,Albury,W,ENE,NW,No,No


In [6]:
# check missing values in categorical variables

df[categorical].isnull().sum()

Date                0
Location            0
WindGustDir     10326
WindDir9am      10566
WindDir3pm       4228
RainToday        3261
RainTomorrow     3267
dtype: int64

In [7]:
# print categorical variables containing missing values

cat1 = [var for var in categorical if df[var].isnull().sum()!=0]

print(df[cat1].isnull().sum())


WindGustDir     10326
WindDir9am      10566
WindDir3pm       4228
RainToday        3261
RainTomorrow     3267
dtype: int64


In [8]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

Date  contains  3436  labels
Location  contains  49  labels
WindGustDir  contains  17  labels
WindDir9am  contains  17  labels
WindDir3pm  contains  17  labels
RainToday  contains  3  labels
RainTomorrow  contains  3  labels


In [9]:
# parse the dates, currently coded as strings, into datetime format

df['Date'] = pd.to_datetime(df['Date'])

# extract year / month / day from date

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop the original Date variable

df.drop('Date', axis=1, inplace = True)

# preview the dataset again

df.head()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,...,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,...,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,...,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,...,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,...,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5


In [10]:
# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)

There are 6 categorical variables

The categorical variables are : ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']


In [None]:
# let's do One Hot Encoding of Location variable
# get k-1 dummy variables after One Hot Encoding 
# preview the dataset with head() method

pd.get_dummies(df.Location, drop_first=True).head()

In [None]:
# sum the number of 1s per boolean variable over the rows of the dataset
# it will tell us how many observations we have for each category

pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)

In [None]:
# sum the number of 1s per boolean variable over the rows of the dataset
# it will tell us how many observations we have for each category

pd.get_dummies(df.WindDir9am, drop_first=True, dummy_na=True).sum(axis=0)

In [None]:
# sum the number of 1s per boolean variable over the rows of the dataset
# it will tell us how many observations we have for each category

pd.get_dummies(df.WindDir3pm, drop_first=True, dummy_na=True).sum(axis=0)

In [None]:
# sum the number of 1s per boolean variable over the rows of the dataset
# it will tell us how many observations we have for each category

pd.get_dummies(df.RainToday, drop_first=True, dummy_na=True).sum(axis=0)

### Explore problems within numerical variables

In [14]:
# find numerical variables

numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

# view the numerical variables

df[numerical].head()

There are 19 numerical variables

The numerical variables are : ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day']


Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,Year,Month,Day
0,13.4,22.9,0.6,,,44.0,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,2008,12,1
1,7.4,25.1,0.0,,,44.0,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,2008,12,2
2,12.9,25.7,0.0,,,46.0,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,2008,12,3
3,9.2,28.0,0.0,,,24.0,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,2008,12,4
4,17.5,32.3,1.0,,,41.0,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,2008,12,5


In [15]:
# check missing values in numerical variables

df[numerical].isnull().sum()

MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
Year                 0
Month                0
Day                  0
dtype: int64

In [16]:
# view summary statistics in numerical variables

print(round(df[numerical].describe()),2)

        MinTemp   MaxTemp  Rainfall  Evaporation  Sunshine  WindGustSpeed  \
count  143975.0  144199.0  142199.0      82670.0   75625.0       135197.0   
mean       12.0      23.0       2.0          5.0       8.0           40.0   
std         6.0       7.0       8.0          4.0       4.0           14.0   
min        -8.0      -5.0       0.0          0.0       0.0            6.0   
25%         8.0      18.0       0.0          3.0       5.0           31.0   
50%        12.0      23.0       0.0          5.0       8.0           39.0   
75%        17.0      28.0       1.0          7.0      11.0           48.0   
max        34.0      48.0     371.0        145.0      14.0          135.0   

       WindSpeed9am  WindSpeed3pm  Humidity9am  Humidity3pm  Pressure9am  \
count      143693.0      142398.0     142806.0     140953.0     130395.0   
mean           14.0          19.0         69.0         52.0       1018.0   
std             9.0           9.0         19.0         21.0          7.0   
mi

In [18]:
X = df.drop(['RainTomorrow'], axis=1)

y = df['RainTomorrow']

# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [19]:
# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

In [20]:
# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm',
 'Year',
 'Month',
 'Day']

In [22]:
# impute missing values in X_train and X_test with respective column median in X_train

for df1 in [X_train, X_test]:
    for col in numerical:
        col_median=X_train[col].median()
        df1[col].fillna(col_median, inplace=True)   

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df1[col].fillna(col_median, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df1[col].fillna(col_median, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always be

In [23]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

WindGustDir 0.07106764746322013
WindDir9am 0.07259727760208992
WindDir3pm 0.028951258077822083
RainToday 0.02248900041248453


In [24]:
# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    df2['WindGustDir'].fillna(X_train['WindGustDir'].mode()[0], inplace=True)
    df2['WindDir9am'].fillna(X_train['WindDir9am'].mode()[0], inplace=True)
    df2['WindDir3pm'].fillna(X_train['WindDir3pm'].mode()[0], inplace=True)
    df2['RainToday'].fillna(X_train['RainToday'].mode()[0], inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df2['WindGustDir'].fillna(X_train['WindGustDir'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df2['WindDir9am'].fillna(X_train['WindDir9am'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because th

In [25]:
def max_value(df3, variable, top):
    return np.where(df3[variable]>top, top, df3[variable])

for df3 in [X_train, X_test]:
    df3['Rainfall'] = max_value(df3, 'Rainfall', 3.2)
    df3['Evaporation'] = max_value(df3, 'Evaporation', 21.8)
    df3['WindSpeed9am'] = max_value(df3, 'WindSpeed9am', 55)
    df3['WindSpeed3pm'] = max_value(df3, 'WindSpeed3pm', 57)

In [None]:
# encode RainToday variable

import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['RainToday'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train = pd.concat([X_train[numerical], X_train[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_train.Location), 
                     pd.get_dummies(X_train.WindGustDir),
                     pd.get_dummies(X_train.WindDir9am),
                     pd.get_dummies(X_train.WindDir3pm)], axis=1)

X_test = pd.concat([X_test[numerical], X_test[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_test.Location), 
                     pd.get_dummies(X_test.WindGustDir),
                     pd.get_dummies(X_test.WindDir9am),
                     pd.get_dummies(X_test.WindDir3pm)], axis=1)



In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

## Plotting data

In [None]:
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization

# draw boxplots to visualize outliers

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.boxplot(column='Rainfall')
fig.set_title('')
fig.set_ylabel('Rainfall')


plt.subplot(2, 2, 2)
fig = df.boxplot(column='Evaporation')
fig.set_title('')
fig.set_ylabel('Evaporation')


plt.subplot(2, 2, 3)
fig = df.boxplot(column='WindSpeed9am')
fig.set_title('')
fig.set_ylabel('WindSpeed9am')


plt.subplot(2, 2, 4)
fig = df.boxplot(column='WindSpeed3pm')
fig.set_title('')
fig.set_ylabel('WindSpeed3pm')

In [None]:
# plot histogram to check distribution

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.Rainfall.hist(bins=10)
fig.set_xlabel('Rainfall')
fig.set_ylabel('RainTomorrow')


plt.subplot(2, 2, 2)
fig = df.Evaporation.hist(bins=10)
fig.set_xlabel('Evaporation')
fig.set_ylabel('RainTomorrow')


plt.subplot(2, 2, 3)
fig = df.WindSpeed9am.hist(bins=10)
fig.set_xlabel('WindSpeed9am')
fig.set_ylabel('RainTomorrow')


plt.subplot(2, 2, 4)
fig = df.WindSpeed3pm.hist(bins=10)
fig.set_xlabel('WindSpeed3pm')
fig.set_ylabel('RainTomorrow')

## Model definition

In [None]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression


# instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=0)


# fit the model
logreg.fit(X_train, y_train)


y_pred_test = logreg.predict(X_test)

## Model evaulation

In [None]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_test)))

y_pred_train = logreg.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))


In [None]:
# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_test)

print('Confusion matrix\n\n', cm)

print('\nTrue Positives(TP) = ', cm[0,0])

print('\nTrue Negatives(TN) = ', cm[1,1])

print('\nFalse Positives(FP) = ', cm[0,1])

print('\nFalse Negatives(FN) = ', cm[1,0])

# visualize confusion matrix with seaborn heatmap

cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_test))

TP = cm[0,0]
TN = cm[1,1]
FP = cm[0,1]
FN = cm[1,0]

# print classification accuracy

classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))

# print classification error

classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))

# print precision score

precision = TP / float(TP + FP)


print('Precision : {0:0.4f}'.format(precision))

#Recall

recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))

#True positive rate

true_positive_rate = TP / float(TP + FN)


print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))

# False Positive rate

false_positive_rate = FP / float(FP + TN)


print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))

#Specificity

specificity = TN / (TN + FP)

print('Specificity : {0:0.4f}'.format(specificity))

In [None]:
# Adjusting the threshold level

# print the first 10 predicted probabilities of two classes- 0 and 1

y_pred_prob = logreg.predict_proba(X_test)[0:10]

y_pred_prob

In [None]:
# store the probabilities in dataframe

y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - No rain tomorrow (0)', 'Prob of - Rain tomorrow (1)'])

y_pred_prob_df

In [31]:
# print the first 10 predicted probabilities for class 1 -- Probability of rain

logreg.predict_proba(X_test)[0:10, 1]

NameError: name 'logreg' is not defined

In [None]:
# store the predicted probabilities for class 1 - Probability of rain

y_pred1 = logreg.predict_proba(X_test)[:, 1]

In [None]:
# plot histogram of predicted probabilities


# adjust the font size 
plt.rcParams['font.size'] = 12


# plot histogram with 10 bins
plt.hist(y_pred1, bins = 10)


# set the title of predicted probabilities
plt.title('Histogram of predicted probabilities of rain')


# set the x-axis limit
plt.xlim(0,1)


# set the title
plt.xlabel('Predicted probabilities of rain')
plt.ylabel('Frequency')

In [None]:
from sklearn.preprocessing import binarize

for i in range(1,5):
    
    cm1=0
    
    y_pred1 = logreg.predict_proba(X_test)[:,1]
    
    y_pred1 = y_pred1.reshape(-1,1)
    
    y_pred2 = binarize(y_pred1, i/10)
    
    y_pred2 = np.where(y_pred2 == 1, 'Yes', 'No')
    
    cm1 = confusion_matrix(y_test, y_pred2)
        
    print ('With',i/10,'threshold the Confusion Matrix is ','\n\n',cm1,'\n\n',
           
            'with',cm1[0,0]+cm1[1,1],'correct predictions, ', '\n\n', 
           
            cm1[0,1],'Type I errors( False Positives), ','\n\n',
           
            cm1[1,0],'Type II errors( False Negatives), ','\n\n',
           
           'Accuracy score: ', (accuracy_score(y_test, y_pred2)), '\n\n',
           
           'Sensitivity: ',cm1[1,1]/(float(cm1[1,1]+cm1[1,0])), '\n\n',
           
           'Specificity: ',cm1[0,0]/(float(cm1[0,0]+cm1[0,1])),'\n\n',
          
            '====================================================', '\n\n')

In [None]:
# ROC - AUC

# plot ROC Curve

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred1, pos_label = 'Yes')

plt.figure(figsize=(6,4))

plt.plot(fpr, tpr, linewidth=2)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12

plt.title('ROC curve for RainTomorrow classifier')

plt.xlabel('False Positive Rate (1 - Specificity)')

plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# compute ROC AUC

from sklearn.metrics import roc_auc_score

ROC_AUC = roc_auc_score(y_test, y_pred1)

print('ROC AUC : {:.4f}'.format(ROC_AUC))

In [None]:
# calculate cross-validated ROC AUC 

from sklearn.model_selection import cross_val_score

Cross_validated_ROC_AUC = cross_val_score(logreg, X_train, y_train, cv=5, scoring='roc_auc').mean()

print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))

# k-Fold Cross Validation 

In [None]:
# Applying 5-Fold Cross Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(logreg, X_train, y_train, cv = 5, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

In [30]:
#