
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Diabetes Dataset

df = pd.read_csv('diabetes.csv')
print("Dataset Information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# **2. Dataset Preparation**

The dataset has 768 entries and nine columns: Pregnancies, Glucose Levels, Blood Pressure, Skin Thickness, Insulin Levels, BMI, Diabetes Pedigree Function, Age, and Outcome. The dependent (target) variable is "Outcome", a binary variable in which 1 indicates the presence of diabetes and 0 its absence. The remaining eight columns are the independent variables. The aim is to predict whether a patient has diabetes using linear and logistic regression. The dataset contains no missing values, but we do not know whether it was cleaned or corrected before we received it. The workflow is: inspect the data, create new features if needed, select and train the models, evaluate them, interpret the results, and verify that the analysis is correct.
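
The claim about missing data can be checked directly. The sketch below uses only the standard library and a small, made-up CSV sample; the column names and values are hypothetical, not rows from `diabetes.csv`. It counts empty fields per column and, separately, zeros in measurement columns, on the assumption that a zero glucose or BMI reading may be a placeholder for a missing value rather than a real measurement.

```python
import csv
import io

# Hypothetical sample; column names and values are illustrative only.
SAMPLE_CSV = """Glucose,BMI,Age,Outcome
148,33.6,50,1
0,26.6,31,0
183,,32,1
89,0,21,0
"""

def count_empty_and_zero(csv_text, zero_check_columns):
    """Return (empty_counts, zero_counts) per column for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    empty_counts = {name: 0 for name in reader.fieldnames}
    zero_counts = {name: 0 for name in zero_check_columns}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() == "":
                empty_counts[name] += 1          # field is missing outright
            elif name in zero_counts and float(value) == 0.0:
                zero_counts[name] += 1           # present, but suspiciously zero
    return empty_counts, zero_counts

empty_counts, zero_counts = count_empty_and_zero(SAMPLE_CSV, ["Glucose", "BMI"])
print("Empty fields:", empty_counts)
print("Zero readings:", zero_counts)
```

On the loaded frame, the analogous pandas checks would be `df.isnull().sum()` for nulls and `(df[cols] == 0).sum()` for zeros, where `cols` is whichever list of measurement columns you choose to inspect.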

In [None]:
# 3. Exploratory Data Analysis (EDA)

# Summary statistics
summary_stats = df.describe()
print("Summary Statistics:")
print(summary_stats)

In [None]:
# Empirical probability distributions (relative frequency of each distinct value, per column)
print("\nProbability Distributions:")
for column in df.columns:
    print(f"\n{column}:")
    print(df[column].value_counts(normalize=True))

In [None]:
# Visualization - Histograms
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

In [None]:
# Visualization - Scatter plots
pd.plotting.scatter_matrix(df, figsize=(12, 12))
plt.show()

In [None]:
# 4. Linear Regression Model

# Select independent variables (features) and the target variable (Outcome)
X = df.drop(columns=['Outcome'])
y = df['Outcome']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
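
With `test_size=0.2` and 768 rows, the test set is rounded up to 154 rows, leaving 614 for training. The sketch below reproduces that arithmetic and shows a seeded shuffle split in plain Python. It illustrates the idea only: the helper `shuffle_split` is hypothetical, and its shuffle will not select the same rows as `train_test_split`.

```python
import math
import random

def shuffle_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    n_test = math.ceil(test_size * n_samples)   # test set size is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(n_samples=768, test_size=0.2, seed=42)
print("Train rows:", len(train_idx))
print("Test rows:", len(test_idx))
```

Fixing the seed is what makes the reported metrics reproducible from run to run. If the class balance between the two sets matters, `train_test_split` also accepts a `stratify` argument.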

In [None]:
# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

print("Coefficients:", coefficients)
print("Intercept:", intercept)

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

In [None]:
# Evaluation metrics
r_squared = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("R-squared:", r_squared)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
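
To make these metrics concrete, the sketch below computes R-squared, MSE, and RMSE by hand in plain Python on four made-up predictions; the values are hypothetical, not output from this notebook. It also shows why a linear model is an awkward fit for a 0/1 target: its predictions are unbounded scores that need an explicit cutoff, assumed here to be 0.5, before they can be read as class labels.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r_squared, mse, rmse) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return 1 - ss_res / ss_tot, mse, math.sqrt(mse)

# Hypothetical binary targets and linear-model scores.
# Note that 1.2 and -0.1 fall outside the [0, 1] range of a probability.
y_true = [0, 1, 1, 0]
y_scores = [0.2, 0.7, 1.2, -0.1]

r_squared, mse, rmse = regression_metrics(y_true, y_scores)
labels = [1 if score >= 0.5 else 0 for score in y_scores]  # assumed 0.5 cutoff

print("R-squared:", round(r_squared, 4))
print("MSE:", round(mse, 4), "RMSE:", round(rmse, 4))
print("Thresholded labels:", labels)
```

RMSE is always the square root of MSE, so the two convey the same information; RMSE is simply expressed in the units of the target.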

In [None]:
# 5. Logistic Regression Model

# Build the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Coefficients and intercept
coefficients = model.coef_[0]
intercept = model.intercept_[0]

print("Coefficients:", coefficients)
print("Intercept:", intercept)
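
The coefficients and intercept define the model's prediction rule: a weighted sum of the features gives the log-odds, and the sigmoid function maps that to a probability. The sketch below shows the calculation in plain Python. The coefficient, intercept, and feature values are hypothetical placeholders, not the fitted values from this notebook.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one sample."""
    log_odds = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(log_odds)

# Hypothetical two-feature model, for illustration only.
coefficients = [0.03, 0.1]
intercept = -7.0
patient = [150.0, 30.0]

probability = predict_probability(patient, coefficients, intercept)
# exp(coefficient) is the multiplicative change in the odds per one-unit increase in that feature.
odds_ratios = [math.exp(w) for w in coefficients]

print("P(Outcome = 1):", round(probability, 4))
print("Odds ratios:", [round(r, 4) for r in odds_ratios])
```

For a binary target, `predict` assigns class 1 when this probability exceeds 0.5, while `predict_proba` returns the probability itself.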

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

In [None]:
# Model evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC:", roc_auc)

In [None]:
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, label='ROC Curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
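
The ROC curve comes from sweeping a decision threshold over the predicted probabilities and recording the false and true positive rates at each step; the AUC is the area under those points. The sketch below does this in plain Python on a four-sample toy example with hypothetical labels and scores. The helper names `roc_points` and `trapezoid_auc` are illustrative, not scikit-learn functions.

```python
def roc_points(y_true, scores):
    """Return [(fpr, tpr), ...] from the strictest to the loosest threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = trapezoid_auc(points)
print("ROC points:", points)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 to a model that ranks every positive above every negative.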

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
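
Accuracy, precision, recall, and F1 all follow from the four cells of the confusion matrix. The sketch below derives them in plain Python. The counts used (78 true negatives, 21 false positives, 18 false negatives, 37 true positives) were back-solved from the metrics reported in this notebook for its 154 test rows; they are an inference, not the printed output of the cell above.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts back-solved from the reported metrics (an assumption, see the text above).
# Layout matches scikit-learn's confusion_matrix: rows are actual, columns are predicted.
conf_matrix = [[78, 21],
               [18, 37]]
(tn, fp), (fn, tp) = conf_matrix

metrics = classification_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reading the matrix this way shows where the errors fall: under these counts, 18 of the 55 patients with diabetes in the test set would be missed, which is what the recall figure summarizes.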

In [None]:
# 6. Model Comparison and Selection

# Performance metrics for Linear Regression
linear_regression_metrics = {
    "R-squared": r_squared,
    "Mean Squared Error (MSE)": mse,
    "Root Mean Squared Error (RMSE)": rmse
}

# Performance metrics for Logistic Regression
logistic_regression_metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1 Score": f1,
    "ROC AUC": roc_auc
}

In [None]:
# Print performance metrics for both models
print("Linear Regression Metrics:")
for metric, value in linear_regression_metrics.items():
    print(f"{metric}: {value}")

print("\nLogistic Regression Metrics:")
for metric, value in logistic_regression_metrics.items():
    print(f"{metric}: {value}")

# Decision-making process
print("\nDecision-making Process:")
print("Consider Linear Regression for predicting continuous outcomes, such as disease progression, where the relationship between independent and dependent variables is linear.")
print("Consider Logistic Regression for binary classification problems, like predicting disease presence or absence, where the target variable is categorical and the relationship between features and outcome is non-linear.")
print("Choose the model based on the problem at hand, considering factors like model accuracy, interpretability, and assumptions. Linear Regression is suitable for predicting continuous outcomes, while Logistic Regression is more appropriate for classification tasks with binary outcomes.")

Linear Regression Metrics:
R-squared: 0.25500281176741757
Mean Squared Error (MSE): 0.17104527280850104
Root Mean Squared Error (RMSE): 0.4135761995189049

Logistic Regression Metrics:
Accuracy: 0.7467532467532467
Precision: 0.6379310344827587
Recall: 0.6727272727272727
F1 Score: 0.6548672566371682
ROC AUC: 0.8150596877869605

Decision-making Process:
Consider Linear Regression for predicting continuous outcomes, such as disease progression, where the relationship between independent and dependent variables is linear.
Consider Logistic Regression for binary classification problems, like predicting disease presence or absence, where the target variable is categorical and the relationship between features and outcome is non-linear.
Choose the model based on the problem at hand, considering factors like model accuracy, interpretability, and assumptions. Linear Regression is suitable for predicting continuous outcomes, while Logistic Regression is more appropriate for classification tasks with binary outcomes.

# **7. Conclusion and Insights**

In this case study, we applied statistical and machine learning methods to the diabetes dataset to support data-driven decisions. On the held-out test set, the Logistic Regression model predicted the presence or absence of diabetes with an accuracy of about 0.75, a precision of 0.64, a recall of 0.67, and a ROC AUC of 0.82. The Linear Regression model, fit to the same binary target, reached an R-squared of only about 0.26, which is consistent with it being better suited to continuous outcomes than to classification. This highlights the importance of matching the modeling technique to the prediction task.

These models apply beyond healthcare, in fields such as finance, marketing, and the social sciences, but their assumptions and limitations must be understood. Logistic Regression assumes a linear relationship between the features and the log-odds of the outcome, while Linear Regression assumes a linear relationship between the independent and dependent variables. Keeping these assumptions in mind helps in interpreting model outputs accurately and avoiding biased decisions. Overall, the case study shows the value of statistical and machine learning tools for deriving actionable insights from data, provided that model constraints and uncertainties are taken into account.

# **8. References**

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LinearRegression

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix

from sklearn.metrics import mean_squared_error, r2_score