### Logistic Regression with Scikit-Learn

In this notebook, we will revisit **Logistic Regression**, this time leveraging the implementation provided by the **Scikit-Learn** library. 

**Scikit-Learn** is one of the most popular and widely used Python libraries for machine learning. It provides a rich collection of machine learning models and algorithms, and it also offers extensive utilities for data preprocessing, model selection, evaluation metrics, hyperparameter tuning, and much more.

Using Scikit-Learn simplifies the process of applying machine learning techniques, as it provides well-tested, efficient, and easy-to-use implementations of many commonly used algorithms. When using Scikit-Learn, refer to the **official documentation** for the exact parameters of each class and for guidance on customizing them for your needs.

In addition to models, Scikit-Learn includes powerful modules like:
- **Preprocessing**: Tools to scale, normalize, and transform your data for better model performance.
- **Model Selection**: Methods for splitting datasets into training and testing sets, performing cross-validation, and tuning hyperparameters.
- **Metrics**: A variety of metrics to evaluate the performance of models.

By leveraging Scikit-Learn, we can quickly build, train, and evaluate machine learning models with minimal effort, allowing us to focus on problem-solving and gaining insights from our data.
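
As a rough illustration of how these three modules fit together, here is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, and `demo_pipeline` are illustrative assumptions and are not used elsewhere in this notebook, and the data comes from `make_classification`, not from the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Preprocessing + model chained together, so the scaler is re-fit inside each fold
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Model selection + metrics: 5-fold cross-validated accuracy
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {cv_scores.mean():.2f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps the scaling statistics from leaking across cross-validation folds. The rest of this notebook performs the same steps one at a time.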


In [None]:
# Uncomment the line below the first time you run this notebook to install scikit-learn inside ml_venv
# !pip install scikit-learn

In [25]:
import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split # model selection module
from sklearn.preprocessing import StandardScaler, RobustScaler # preprocessing
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix # evaluation metrics

In [None]:
# Load the diabetes dataset
diabetes = pd.read_csv('./datasets/diabetes.csv')

# Print dataset stats
print(diabetes.describe())
print("\n-----------------------------------------------\n")
print(diabetes.columns)

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

In [27]:
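# Shuffle all rows so that any ordering in the CSV file does not bias the split
# (no random_state is set here, so the row order changes on every run)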
diabetes = diabetes.sample(frac=1).reset_index(drop=True)

In [28]:
# Select features and target variable
selected_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                     'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes[selected_features].values
y = diabetes['Outcome'].values

In [29]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
# Standardize the features using StandardScaler (z-score normalization)
scaler = StandardScaler()
# Alternative that is more robust to outliers:
# scaler = RobustScaler()

# Fit the scaler on the training set only, then apply the same transformation to the test set
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

In [31]:
# Create a logistic regression model
logistic_model = LogisticRegression(random_state=42)

# Train the logistic regression model on the standardized training data
logistic_model.fit(X_train_std, y_train)

In [32]:
# Make predictions on the standardized test data
y_pred = logistic_model.predict(X_test_std)

To evaluate how well our Logistic Regression model performs, we need to use **classification metrics**. The most common metric is **Accuracy**, which measures the proportion of correctly predicted instances. However, depending on the specific problem and what you aim to measure, other metrics might be more appropriate.

For example:
- **Recall**: The fraction of actual positives that the model identifies. Useful when you want to minimize false negatives, such as in medical diagnoses where missing a positive case can have serious consequences.
- **Precision**: The fraction of predicted positives that are actually positive. Important when false positives are more costly than false negatives, such as in email spam detection.
- **F1-Score**: The harmonic mean of Precision and Recall, particularly helpful when dealing with imbalanced datasets.
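
For reference, these metrics can be written in terms of the counts of true positives ($TP$), true negatives ($TN$), false positives ($FP$), and false negatives ($FN$) for the class treated as positive. These are the standard textbook definitions; the symbols are introduced here only for illustration:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$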

Additionally, Scikit-Learn provides tools like `classification_report`, which gives a comprehensive summary of key metrics such as precision, recall, F1-score, and support for each class. This report helps you better understand your model’s performance across all classes, enabling a more informed evaluation of its strengths and weaknesses.


In [33]:
# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', classification_report_str)
print('Confusion Matrix:\n', conf_matrix)

Accuracy: 0.80
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.91      0.85        99
           1       0.79      0.60      0.68        55

    accuracy                           0.80       154
   macro avg       0.79      0.75      0.77       154
weighted avg       0.80      0.80      0.79       154

Confusion Matrix:
 [[90  9]
 [22 33]]
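
To connect the confusion matrix to the report, the sketch below recomputes the class-1 metrics by hand from the four counts printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names (`tn`, `fp`, `fn`, `tp`, and the `*_manual` / `*_pos` results) are illustrative.

```python
# Counts copied from the confusion matrix printed above.
# Rows are true classes, columns are predicted classes.
tn, fp = 90, 9    # true class 0: predicted 0, predicted 1
fn, tp = 22, 33   # true class 1: predicted 0, predicted 1

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of all actual positives, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"Accuracy:            {accuracy_manual:.2f}")
print(f"Precision (class 1): {precision_pos:.2f}")
print(f"Recall (class 1):    {recall_pos:.2f}")
print(f"F1-score (class 1):  {f1_pos:.2f}")
```

The rounded values (0.80, 0.79, 0.60, 0.68) agree with the accuracy line and the class-1 row of the classification report. The recall of 0.60 means that 22 of the 55 positive cases in the test set were missed, which matters more than the overall accuracy in a diagnostic setting.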
