# Logistic Regression Machine Learning Project

This project demonstrates the application of the Logistic Regression model on a real-world dataset. In this example, we use the Breast Cancer Wisconsin (Diagnostic) dataset to classify tumors as malignant or benign.

## Introduction

In this project, our objective is to apply Logistic Regression to predict whether a breast tumor is malignant or benign. Logistic Regression is a fundamental classification algorithm that models the probability of a binary outcome using the sigmoid (logistic) function. Its simplicity and interpretability make it a popular choice for medical diagnosis and many other applications.

## Dataset Description & Exploratory Data Analysis (EDA)

We use the Breast Cancer Wisconsin (Diagnostic) dataset available in scikit-learn. The dataset contains 569 samples with 30 numeric features computed from digitized images of a fine needle aspirate (FNA) of a breast mass, along with a target variable that indicates whether the tumor is malignant (0) or benign (1). Note that scikit-learn encodes benign as 1, so benign is the "positive" class unless stated otherwise.

In the following section, we load the data, display summary statistics, plot key visualizations, and uncover potential patterns in the data.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Ignore warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import scikit-learn's breast cancer dataset
from sklearn.datasets import load_breast_cancer

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

In [None]:
# Display the first few rows of the dataset
print("First five rows of the dataset:")
display(df.head())

# Summary statistics
print("\nSummary statistics:")
display(df.describe())

# Distribution of the target variable
print("\nTarget variable distribution:")
display(df['target'].value_counts())
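
As a quick check on how the target is encoded, the sketch below maps each integer label to its class name using the dataset's `target_names` attribute. The variable name `bc` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()  # illustrative name, separate from `data` above

# Count how many samples carry each integer label
labels, counts = np.unique(bc.target, return_counts=True)

for label, count in zip(labels, counts):
    # target_names is indexed by the integer label
    print(f"{label} -> {bc.target_names[label]}: {count} samples")
```

In scikit-learn's copy of the dataset, 0 corresponds to malignant and 1 to benign, and the benign class is the larger of the two.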

In [None]:
# Correlation matrix heatmap
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
# Histograms of a few selected features
features_to_plot = ['mean radius', 'mean texture', 'mean perimeter', 'mean area']
df[features_to_plot].hist(bins=20, figsize=(10,8))
plt.suptitle('Histograms of Selected Features')
plt.show()

## Data Preprocessing

In the preprocessing step, we first check for missing values (this dataset contains none) and then separate the dataset into features and target. Next, we split the data into training and testing sets. Finally, we standardize the features with StandardScaler, fitting the scaler on the training set only and reusing its statistics to transform the test set, so that no information from the test set leaks into training.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Check for missing values
print("Missing values in the dataset:")
print(df.isnull().sum())

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']
features = X.columns  # Store feature names for later reference

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
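
To illustrate what fitting the scaler on the training set alone implies, here is a minimal sketch that repeats the split and scaling on fresh copies of the data. The names `X_demo`, `X_tr`, `X_te`, and `scaler_demo` are hypothetical and chosen so they do not overwrite the variables above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn per-feature mean and standard deviation from the training set only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)  # reuses the training statistics

print("train mean (first 3 features):", X_tr_s.mean(axis=0)[:3].round(3))
print("train std  (first 3 features):", X_tr_s.std(axis=0)[:3].round(3))
print("test mean  (first 3 features):", X_te_s.mean(axis=0)[:3].round(3))
```

The training features end up with mean 0 and standard deviation 1 by construction. The test features are only approximately standardized, which is expected because their statistics were never used.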


## Mathematical Explanation

Logistic Regression estimates the probability that a given input x belongs to a particular class (e.g., malignant or benign) using the sigmoid function. The key equations are:

**Sigmoid function:**

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where $z = w^Tx + b$.

For binary classification, we model the probability:

$P(y=1|x) = \sigma(w^Tx + b)$
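
The two equations above can be sketched in a few lines of NumPy. The weights, bias, and input below are made-up values for illustration only, not parameters learned from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single input with three features
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

z = w @ x + b     # linear score w^T x + b
p = sigmoid(z)    # modeled P(y=1 | x)

print(f"z = {z:.2f}, P(y=1|x) = {p:.4f}")
```

A score of zero maps to a probability of exactly 0.5, which is why the default decision threshold corresponds to the hyperplane $w^Tx + b = 0$. For very large negative scores, `scipy.special.expit` is a numerically safer alternative to this naive formula.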

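The corresponding cost (log-loss) function, averaged over the $m$ training examples, is:

$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\sigma(z^{(i)})) + (1-y^{(i)}) \log(1-\sigma(z^{(i)})) \right]$

where $z^{(i)} = w^Tx^{(i)} + b$ is the score of the $i$-th example.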
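
As a small worked example, the sketch below evaluates the log-loss for four hand-picked predictions and averages the per-example terms over the samples. The labels and probabilities are invented for illustration, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def log_loss_mean(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy over all samples."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logs stay finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 1])           # made-up labels
p_pred = np.array([0.9, 0.2, 0.6, 0.4])   # made-up predicted P(y=1)

loss = log_loss_mean(y_true, p_pred)
print(f"log-loss = {loss:.4f}")  # -> log-loss = 0.4389
```

The last sample contributes the most to the loss: its true label is 1 but the predicted probability is only 0.4, so confident mistakes are penalized more heavily than hesitant ones.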

The parameters $w$ and $b$ are typically optimized using gradient descent or other numerical solvers.
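
To make the optimization step concrete, here is a minimal batch gradient descent sketch on synthetic data. It is not how scikit-learn fits the model (its `LogisticRegression` defaults to the L-BFGS solver with L2 regularization). The learning rate, iteration count, and toy data are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Plain batch gradient descent on the average log-loss (no regularization)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y   # predicted probability minus label
        w -= lr * (X.T @ error) / m      # gradient of J with respect to w
        b -= lr * error.mean()           # gradient of J with respect to b
    return w, b

# Synthetic, linearly separable toy data (not the breast cancer dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2.0 * X[:, 1] > 0).astype(float)

w, b = fit_logistic_gd(X, y)
pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
acc = (pred == y).mean()
print(f"learned w = {w.round(2)}, b = {b:.2f}, training accuracy = {acc:.3f}")
```

The update rule follows from the gradient of the log-loss, which for logistic regression reduces to the prediction error multiplied by the input features.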

## Model Training & Evaluation

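We now train a Logistic Regression model using scikit-learn on the standardized training set and evaluate it on the held-out test set using accuracy, precision, recall, F1-score, and ROC-AUC. By default, scikit-learn treats label 1 (benign in this dataset) as the positive class, so the precision, recall, and F1 values printed below describe how well the model identifies benign tumors.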

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, roc_auc_score, classification_report,
)

# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}\n")
print("Classification Report:")
# Show class names instead of the raw 0/1 labels
print(classification_report(y_test, y_pred, target_names=data.target_names))
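
In a diagnostic setting, the class of interest is usually malignant, which scikit-learn encodes as 0 in this dataset. The sketch below refits the same model on fresh copies of the data and derives precision and recall for the malignant class directly from the confusion matrix. All variable names here (`X_all`, `clf`, `cm_demo`, and so on) are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
sc = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=10000, random_state=42).fit(sc.transform(X_tr), y_tr)
pred = clf.predict(sc.transform(X_te))

# labels=[0, 1] fixes the row/column order: index 0 = malignant, 1 = benign
cm_demo = confusion_matrix(y_te, pred, labels=[0, 1])
tp = cm_demo[0, 0]  # malignant tumors predicted as malignant
fn = cm_demo[0, 1]  # malignant tumors predicted as benign (missed cases)
fp = cm_demo[1, 0]  # benign tumors predicted as malignant

recall_malignant = tp / (tp + fn)
precision_malignant = tp / (tp + fp)
print(f"Recall (malignant):    {recall_malignant:.4f}")
print(f"Precision (malignant): {precision_malignant:.4f}")
```

The same values can be obtained with `recall_score(y_te, pred, pos_label=0)` and `precision_score(y_te, pred, pos_label=0)`. Recall on the malignant class is often the metric to watch, since a false negative means a missed cancer.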

In [None]:
# Plot the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
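# Label the axes with class names (row/column 0 = malignant, 1 = benign)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=data.target_names, yticklabels=data.target_names)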
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Plot the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, marker='.', label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
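
The metrics and curves above come from a single 80/20 split, so they can shift with the choice of `random_state`. One way to get a more stable estimate, sketched here as an optional extension, is stratified k-fold cross-validation with the scaler wrapped in a pipeline. The fold count and the names `pipe`, `cv`, and `auc_scores` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is refitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Stratification keeps the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_all, y_all, cv=cv, scoring="roc_auc")

print(f"ROC-AUC per fold: {auc_scores.round(4)}")
print(f"Mean: {auc_scores.mean():.4f}, std: {auc_scores.std():.4f}")
```

Keeping the scaler inside the pipeline mirrors the preprocessing rule used earlier: statistics are always learned from training data only.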

## Model Analysis & Visualization

In this section, we analyze the model's coefficients to understand the impact of each feature. In addition, we reduce the feature space to two dimensions using PCA to visualize the decision boundaries of the Logistic Regression model.

In [None]:
# Create a DataFrame to display model coefficients
coef_df = pd.DataFrame({
    'Feature': features,  # Using stored feature names
    'Coefficient': model.coef_[0]
})
print("Model Coefficients:")
display(coef_df.sort_values(by='Coefficient', key=abs, ascending=False))

# Reduce data to 2 dimensions using PCA for visualization
from sklearn.decomposition import PCA

# Fit PCA on the training data only, then apply the same projection to the test data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Retrain Logistic Regression on PCA-transformed data
model_pca = LogisticRegression(max_iter=10000, random_state=42)
model_pca.fit(X_train_pca, y_train)

# Create a mesh grid for plotting the decision boundary
x_min, x_max = X_train_pca[:, 0].min() - 1, X_train_pca[:, 0].max() + 1
y_min, y_max = X_train_pca[:, 1].min() - 1, X_train_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

# Predict on the mesh grid
Z = model_pca.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
scatter = plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=y_test, s=50, cmap='RdBu', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Decision Boundary on PCA-transformed Data')
plt.legend(*scatter.legend_elements(), title="Classes")
plt.show()
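
A two-component projection discards part of the information in the 30 original features, so the plotted boundary belongs to the reduced model rather than the full one. The sketch below shows one way to check how much variance the two components retain. For brevity it fits PCA on the whole standardized dataset rather than on the training split, and the names `X_std` and `pca_demo` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_all, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_all)

pca_demo = PCA(n_components=2).fit(X_std)
ratios = pca_demo.explained_variance_ratio_  # fraction of variance per component

print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total: {ratios.sum():.1%}")
```

If the retained share is well below 100%, the two-dimensional decision boundary should be read as a visual aid only.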

## Discussion

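The Logistic Regression model provides an interpretable baseline on the Breast Cancer dataset, and the metrics and ROC curve reported above quantify its performance on the held-out test set. Because the features were standardized, coefficient magnitudes are comparable across features, and the sign of each coefficient shows whether a feature pushes the prediction toward class 1 (benign in scikit-learn's encoding) or class 0 (malignant). Two caveats apply. First, many features are strongly correlated (for example radius, perimeter, and area), so individual coefficients should be interpreted with care. Second, the model is linear in the features, so it cannot capture nonlinear relationships without additional feature engineering, and it may be sensitive to outliers. Future work might involve exploring more flexible models or additional feature engineering to improve performance.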
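
One common way to read logistic regression coefficients is as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of class 1 for a one-standard-deviation increase in that feature, holding the others fixed. The sketch below illustrates the idea. It fits on the full standardized dataset for simplicity, so its numbers will differ from the model trained above, and the names `bc`, `clf`, and `odds` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_std = StandardScaler().fit_transform(bc.data)
clf = LogisticRegression(max_iter=10000).fit(X_std, bc.target)

odds = pd.DataFrame({
    "feature": bc.feature_names,
    "coefficient": clf.coef_[0],
    # exp(coefficient) = change in odds of class 1 (benign) per +1 std dev
    "odds_ratio": np.exp(clf.coef_[0]),
})

# Show the five features with the largest absolute coefficients
top = odds.sort_values("coefficient", key=np.abs, ascending=False).head(5)
print(top.to_string(index=False))
```

An odds ratio below 1 means that larger values of the feature lower the odds of a benign prediction. Because of the feature correlations noted above, these ratios are best treated as descriptive rather than causal.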

## Conclusion

This project demonstrated the process of developing a Logistic Regression model for binary classification using a real-world dataset, covering data exploration, preprocessing, model training, and evaluation. The visualizations and evaluation metrics indicate that Logistic Regression can be an effective baseline model for classification tasks, while also highlighting areas for future improvement.

## References

- Breast Cancer Wisconsin (Diagnostic) Data Set, available via scikit-learn.
- Scikit-learn Documentation: https://scikit-learn.org/
- Various online resources on Logistic Regression and machine learning best practices.