# Identifying disease risk
Predicting the presence of a health condition based on patient data.

## Scikit-learn (https://scikit-learn.org/dev/index.html)
Open-source library for machine learning
- Preprocessing
- Machine Learning Models
- Evaluation Metrics


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Mock disease dataset

A mock dataset that contains patient data such as age, BMI, and blood pressure.

The data is available at: https://github.com/MRCIEU/python_and_health_ds_training/tree/main/day-3/data

Save data to the following path:

`./data/mock_disease_data.csv`

In [None]:
# Load the dataset
df = pd.read_csv("./data/mock_disease_data.csv")
df.info()

In [None]:
# Display the first few lines of the DataFrame
df.head()

## Exploratory Data Analysis (EDA)
EDA is the process of analysing a dataset and summarising its main characteristics. It helps to:
- understand data patterns
- spot outliers and errors
- identify relationships between variables
- prepare data for further analysis or modelling

## Common Approaches for EDA
- Descriptive Statistics
- Handling Missing Values
- Correlation Analysis: Measures how strongly two variables are related, often using a correlation matrix. This helps identify features with high correlation to the target variable or high correlation with each other.
- Outlier Detection
- Feature Engineering: Selecting features, scaling, etc.
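
Outlier detection is listed above without an example. One common rule of thumb is the interquartile range (IQR) rule, sketched below on a made-up column. The column name `bmi`, its values, and the 1.5 multiplier are illustrative assumptions, not taken from the mock dataset.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are made up for illustration.
toy = pd.DataFrame({"bmi": [21.0, 23.5, 24.1, 22.8, 25.3, 26.0, 58.0]})

# Interquartile range: the spread of the middle 50% of the values
q1 = toy["bmi"].quantile(0.25)
q3 = toy["bmi"].quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["bmi"] < lower) | (toy["bmi"] > upper)]
print(outliers)
```

Whether a flagged value is a data error or a genuine extreme measurement still needs a domain judgement.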

In [None]:
# Display basic statistics
eda_summary = df.describe()
eda_summary

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values

## Missing Values
- Remove rows (or columns) that contain missing values
- Impute missing values
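
As an alternative to removal, imputation can be sketched with scikit-learn's `SimpleImputer`. The toy columns and the choice of the median strategy are assumptions for illustration; this notebook itself drops the incomplete rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps; column names and values are made up.
toy = pd.DataFrame({
    "age": [50.0, 60.0, np.nan, 70.0],
    "bmi": [22.0, np.nan, 30.0, 26.0],
})

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

In a modelling workflow, the imputer would normally be fitted on the training set only and then applied to the test set.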

In [None]:
# Drop rows with any missing values
df = df.dropna()
df.info()

In [None]:
# Compute pairwise correlations between the numeric columns to understand relationships
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix

In [None]:
# Plotting Correlation Matrix using matplotlib
plt.figure(figsize=(8, 6))
plt.matshow(correlation_matrix, fignum=1, cmap="coolwarm")
plt.colorbar()
plt.title("Correlation Matrix", pad=15)
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=45)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.show()

## Selecting Features
Select features based on their correlation with `target`.

Keep the features whose absolute correlation with `target` is larger than 0.2.

In [None]:
# Rank features by the absolute value of their correlation with 'target'
selected_features = correlation_matrix['target'].drop('target').abs().sort_values(ascending=False)
selected_features = selected_features[selected_features > 0.2]  # Keep only features above the 0.2 threshold
selected_feature_names = selected_features.index.tolist()
selected_feature_names

## Split Training and Test Sets
Splitting the data into a training set and a test set helps build and evaluate a machine learning model.

- Evaluate Model Performance: The test set acts as new data, giving a realistic view of how the model might perform in real-world situations.
- Compare Different Models
- Detect Overfitting: A model that scores much better on the training set than on the test set is likely overfitting.

Scikit-Learn: 

```train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)```

https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html
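
The `stratify` parameter in the signature above is useful when one class is rare, as is often the case with disease labels. The sketch below uses made-up labels (80 negatives, 20 positives) to show that a stratified split keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```

Without `stratify`, a random split of a small dataset can leave the test set with very few positive cases.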


In [None]:
# Splitting the data into training and test sets
X = df[selected_feature_names]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training Set Size:', len(X_train))
print('Test Set Size:', len(X_test))

## Data Scaling
Transforming data to a standard range or distribution

Common Types of Scaling:
- Standardisation: https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.StandardScaler.html#standardscaler
- Normalisation: https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.normalize.html

Many algorithms are sensitive to the scale of the input features, for example:
- Linear regression and logistic regression (especially when regularised)
- Support Vector Machines (SVMs)
- k-Nearest Neighbors (KNN)
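
The two scaling approaches linked above behave differently: `StandardScaler` rescales each feature (column) to zero mean and unit variance, while `normalize` rescales each sample (row) to unit norm. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy matrix: rows are patients, columns are two features
X_toy = np.array([[50.0, 20.0], [60.0, 25.0], [70.0, 30.0]])

# Standardisation works column-wise: each feature gets mean 0 and std 1
X_std = StandardScaler().fit_transform(X_toy)

# Normalisation works row-wise: each sample gets unit L2 norm (the default)
X_norm = normalize(X_toy)

print("Column means after standardisation:", X_std.mean(axis=0))
print("Row norms after normalisation:", np.linalg.norm(X_norm, axis=1))
```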

In [None]:
# Standardising the features (important for linear models)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit the scaler on the training set only
X_test = scaler.transform(X_test)  # Reuse the training mean and std to avoid data leakage

## Training a model
- Select a machine learning algorithm (e.g. a linear classifier)
- Fit the data

Scikit-Learn:

```LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='deprecated', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)```

https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

Despite its name, logistic regression is implemented in scikit-learn as a linear model for classification rather than regression.
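
Because it is a classifier, a fitted `LogisticRegression` can return class probabilities through `predict_proba` as well as hard labels through `predict`. The sketch below uses a made-up one-feature dataset to show how the two relate for a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one already-scaled feature and a binary label
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=42).fit(X_toy, y_toy)

# One row per sample; columns hold P(class 0) and P(class 1)
proba = clf.predict_proba(X_toy)
# predict() returns the class with the higher probability
labels = clf.predict(X_toy)

print(proba.round(3))
print(labels)
```

In a screening setting, thresholding the second column at a value lower than 0.5 is one way to trade precision for recall.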

In [None]:
# Applying Logistic Regression for Linear Classification
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

In [None]:
# Making predictions on the test set
y_pred = model.predict(X_test)
y_pred

## Model Evaluation
Assessing the performance of a machine learning model to ensure that it performs well not only on training data but also on unseen test data.

### Classification Metrics
  
**Confusion Matrix**

|                     | Predicted Positive  | Predicted Negative  |
|---------------------|---------------------|---------------------|
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |

- **Accuracy**: The ratio of correctly predicted samples to the total samples.
  
  Accuracy = (True Positives + True Negatives) / Total Samples

- **Precision**: Measures the accuracy of positive predictions. Useful when false positives are costly (e.g. prescribing medicine).

  Precision = True Positives / (True Positives + False Positives)

- **Recall (Sensitivity)**: Measures the ability of the model to find all positive samples. Useful when false negatives are costly (e.g. detecting the early stage of a disease).

  Recall = True Positives / (True Positives + False Negatives)

- **F1 Score**: The harmonic mean of precision and recall, balancing the two.

  F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
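
The formulas above can be checked by hand against scikit-learn on a small made-up example. Note that for labels 0 and 1, `confusion_matrix` orders its output as `[[TN, FP], [FN, TP]]`, so the negative class comes first, unlike the table above.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels: 4 actual positives and 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print("By hand:", accuracy, precision, recall, f1)
print("scikit-learn:", accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```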

### Regression Metrics
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R^2)
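
These regression metrics are not used in this classification notebook, but for reference they can be computed with `sklearn.metrics` as sketched below. The values are made up, and RMSE is taken here as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical continuous targets and predictions
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true_reg, y_pred_reg)
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)  # Same units as the target variable
r2 = r2_score(y_true_reg, y_pred_reg)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```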

### Scikit-learn: 

```classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')```

https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.classification_report.html

In [None]:
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

In [None]:
# Displaying Evaluation Metrics
print("Selected Features for Classification:", selected_feature_names)
print("\nModel Accuracy: {:.3f}%".format(accuracy*100))
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

# Plotting Confusion Matrix
plt.figure(figsize=(6, 5))
plt.matshow(conf_matrix, fignum=1, cmap="Blues")
plt.colorbar()
plt.title("Confusion Matrix", pad=15)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0, 1], ["No Condition", "Condition"])
plt.yticks([0, 1], ["No Condition", "Condition"])
plt.show()