# Mini Project: Logistic Regression

This mini-project introduces **Logistic Regression**, a fundamental machine learning algorithm for binary classification. It is a statistical method that uses the logistic function to model the probability of a binary dependent variable, which makes it well suited to predicting outcomes with two possible values (e.g., yes/no, 0/1, true/false).

## What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for binary classification. Despite its name, it is used for classification rather than for predicting continuous values: it models the probability of the positive class, and a threshold on that probability gives the class label. The key idea is that the logistic (sigmoid) function transforms the output of a linear equation into a probability between 0 and 1.
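
To make the transformation concrete, here is a minimal sketch of a linear score being passed through the sigmoid. The weights, bias, and input values are hypothetical and chosen only for illustration; they do not come from the model trained later in this notebook.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias, and a single two-feature example
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b   # linear score: 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2
p = sigmoid(z)  # estimated probability of the positive class

print(f"linear score z = {z:.2f}")
print(f"probability  p = {p:.4f}")
```

A score of 0 maps to a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.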

## Key Concepts

1. **Sigmoid Function**: The logistic function that maps any real-valued number into a value between 0 and 1
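2. **Decision Boundary**: The set of inputs where the predicted probability equals the classification threshold (typically 0.5); for logistic regression this boundary is linear in the features
3. **Cost Function**: Log loss (binary cross-entropy), which measures how well the predicted probabilities match the true labels and penalizes confident wrong predictions heavily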
4. **Regularization**: Techniques to prevent overfitting (L1/L2 regularization)
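
The cost function and the 0.5 threshold from the list above can be sketched in a few lines. The labels and predicted probabilities below are made-up values for illustration; `log_loss` from scikit-learn is used only to cross-check the manual formula.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.4])  # hypothetical predicted P(y = 1)

# Binary cross-entropy (log loss), averaged over the examples
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Turn probabilities into class labels with a 0.5 threshold
y_pred = (p_hat >= 0.5).astype(int)

print(f"manual log loss:  {manual:.4f}")
print(f"sklearn log loss: {log_loss(y_true, p_hat):.4f}")
print(f"labels at 0.5 threshold: {y_pred}")
```

The last example has true label 1 but a predicted probability of 0.4, so it is misclassified at the 0.5 threshold and contributes the largest term to the loss.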

## What You'll Learn

In this mini-project, we will:
1. Load and explore the Breast Cancer Wisconsin dataset
2. Implement logistic regression using scikit-learn
3. Evaluate model performance using various metrics
4. Understand feature importance and interpretability
5. Apply regularization techniques
6. Visualize results and decision boundaries

## Task 1: Import Libraries and Load Data

For this mini-project we'll be using the [Breast Cancer Wisconsin (Diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html). First, let's import all the libraries we'll be using.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
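import warnings

# Silence library warnings to keep the output compact.
# Note: this also hides solver convergence warnings.
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Plot styling ('seaborn-v0_8' requires matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")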

## Task 2: Load and Explore the Dataset

Here are your tasks:

1. Use [load_breast_cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) to load the Breast Cancer Wisconsin dataset into a Pandas DataFrame.
2. Display the first five rows of data and make sure everything looks OK.
3. Conduct some basic exploratory data analysis (EDA).
4. Split the dataset into training and test sets (covered in Task 3).

In [3]:
# Load the Breast Cancer Wisconsin dataset
cancer_data = load_breast_cancer()

# Convert the dataset into a DataFrame for easier handling
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

# Add the target column to the DataFrame
df['target'] = cancer_data.target

# Display first 5 rows
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Let's examine the dataset structure and basic information:

In [5]:
# Check dataset shape (rows, columns)
print(f"Dataset shape: {df.shape}")

Dataset shape: (569, 31)

In [7]:
# Check target distribution (0 = malignant, 1 = benign)
print("Target distribution:")
df['target'].value_counts()

Target distribution:


target
1    357
0    212
Name: count, dtype: int64
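
As a small EDA aid, the integer labels can be mapped to the class names that ship with the dataset, and missing values can be counted. This is a sketch that reloads the data with `as_frame=True`; the variable names `df_check`, `label_names`, and `n_missing` are illustrative and not part of the notebook above.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds the features plus 'target'
cancer = load_breast_cancer(as_frame=True)
df_check = cancer.frame

# Map integer labels to class names (0 = malignant, 1 = benign)
label_names = {i: str(name) for i, name in enumerate(cancer.target_names)}
counts = df_check["target"].map(label_names).value_counts()
print(counts)

# Count missing values across all columns
n_missing = int(df_check.isna().sum().sum())
print(f"Missing values: {n_missing}")
```

Reading the counts with class names makes it clear that the majority class (label 1) is benign, which matters when interpreting precision and recall later.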

## Task 3: Data Preprocessing and Splitting

Now let's prepare our data for modeling by splitting it into features and target, then into training and testing sets.

In [9]:
# Split the dataset into training and testing sets
X = df.drop('target', axis=1)  # Features
y = df['target']  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
print(f"Training target distribution: {np.bincount(y_train)}")
print(f"Testing target distribution: {np.bincount(y_test)}")

Training set shape: (455, 30)
Testing set shape: (114, 30)
Training target distribution: [169 286]
Testing target distribution: [43 71]
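
The split above happens to preserve the class ratio fairly well, but that is not guaranteed with a purely random split. One option is to pass `stratify=y` to `train_test_split`. This is a sketch of that alternative, not the split used in the rest of this notebook, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the class ratio (about 63% benign) in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

print(f"Benign share, full data: {y_all.mean():.3f}")
print(f"Benign share, train:     {y_tr.mean():.3f}")
print(f"Benign share, test:      {y_te.mean():.3f}")
```

Stratification matters most for small or heavily imbalanced datasets, where a random split can leave the test set with too few examples of the minority class.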


## Task 4: Feature Scaling

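Logistic Regression is sensitive to the scale of features: scikit-learn's default L2 penalty acts on the coefficients, so features measured on different scales are penalized unevenly, and the solver tends to converge more slowly on unscaled data. Let's standardize our features to zero mean and unit variance. The scaler is fit on the training set only, so no information from the test set leaks into preprocessing.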

In [11]:
# Scale the features using StandardScaler
scaler = StandardScaler()

# Fit the scaler on training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features have been scaled successfully!")
print(f"Training data shape: {X_train_scaled.shape}")
print(f"Test data shape: {X_test_scaled.shape}")

Features have been scaled successfully!
Training data shape: (455, 30)
Test data shape: (114, 30)
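
A quick sanity check of what `StandardScaler` does is sketched below. It rebuilds the same split in a self-contained way (variable names are illustrative) and confirms that the training features end up with mean 0 and standard deviation 1, while the test features only come close because they reuse the training statistics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, _, _ = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

scaler_check = StandardScaler()
X_tr_s = scaler_check.fit_transform(X_tr)  # statistics learned from training data only
X_te_s = scaler_check.transform(X_te)      # same statistics reused on the test data

train_means_ok = np.allclose(X_tr_s.mean(axis=0), 0.0, atol=1e-8)
train_stds_ok = np.allclose(X_tr_s.std(axis=0), 1.0, atol=1e-8)
max_test_mean = np.abs(X_te_s.mean(axis=0)).max()

print(f"Training means are ~0: {train_means_ok}")
print(f"Training stds are ~1:  {train_stds_ok}")
print(f"Largest absolute test-set mean: {max_test_mean:.3f}")
```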


## Task 5: Build and Train Logistic Regression Model

Now let's create our first logistic regression model and train it on our data.

In [13]:
# Create the logistic regression model (scikit-learn's defaults apply L2 regularization with C=1.0)
log_reg = LogisticRegression(random_state=42, max_iter=1000)

# Train the model
log_reg.fit(X_train_scaled, y_train)

print("Logistic Regression model trained successfully!")
log_reg

Logistic Regression model trained successfully!
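
The model above uses the default regularization strength. One way to explore regularization is to search over `C` (smaller values mean stronger L2 regularization) with `GridSearchCV`. The sketch below is self-contained; the grid values are an illustrative assumption, and scaling is placed inside a pipeline so that it is refit within each cross-validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger L2 regularization; this grid is an illustrative choice
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["logisticregression__C"])
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy with best C: {search.score(X_te, y_te):.4f}")
```

The step name `logisticregression` is generated by `make_pipeline` from the lowercased class name, which is why the grid key uses the `logisticregression__C` form.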


## Task 6: Make Predictions and Evaluate Model Performance

Let's see how well our model performs on both training and test data.

In [15]:
# Make predictions
y_train_pred = log_reg.predict(X_train_scaled)
y_test_pred = log_reg.predict(X_test_scaled)

# Calculate accuracy scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")

Training Accuracy: 0.9868
Testing Accuracy: 0.9737
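
Accuracy is computed from hard labels, but the model also exposes class probabilities through `predict_proba`. The sketch below retrains an equivalent model in a self-contained way (variable names are illustrative) and shows how the labels relate to a 0.5 threshold. The 0.9 threshold is a purely illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

scaler_demo = StandardScaler().fit(X_tr)
model = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_tr), y_tr)

X_te_s = scaler_demo.transform(X_te)
proba = model.predict_proba(X_te_s)[:, 1]  # P(class 1), i.e. benign
labels = model.predict(X_te_s)

# Thresholding the probability at 0.5 reproduces predict() (barring exact ties)
agreement = np.mean((proba > 0.5).astype(int) == labels)
print(f"Agreement with predict(): {agreement:.3f}")

# A stricter threshold labels fewer cases as benign
strict = (proba > 0.9).astype(int)
print(f"Predicted benign at 0.5: {labels.sum()}, at 0.9: {strict.sum()}")
```

Raising the threshold for the benign class trades some benign recall for fewer malignant cases being labeled benign, which can be a reasonable trade in a screening setting.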


## Task 7: Detailed Model Evaluation

Let's get a more comprehensive view of our model's performance using various metrics.

In [17]:
# Generate detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_test_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
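
The report can be complemented with a confusion matrix and ROC AUC, both of which were imported at the top of this notebook. The following is a self-contained sketch that retrains an equivalent model; variable names are illustrative, and note that scikit-learn treats class 1 (benign) as the positive class here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

scaler_demo = StandardScaler().fit(X_tr)
model = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_tr), y_tr)
X_te_s = scaler_demo.transform(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te_s))
tn, fp, fn, tp = cm.ravel()  # "positive" is class 1 (benign)

# ROC AUC is computed from probabilities, not from hard labels
auc = roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])

print("Confusion matrix:")
print(cm)
print(f"Malignant cases predicted as benign (fp): {fp}")
print(f"Benign cases predicted as malignant (fn): {fn}")
print(f"ROC AUC: {auc:.4f}")
```

Because malignant is encoded as 0, the clinically costly error (a malignant tumor predicted as benign) appears as a false positive in this layout, not a false negative.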



## Task 8: Feature Importance Analysis

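Let's examine which features carry the most weight in our logistic regression model. Because the features were standardized, the absolute value of each coefficient is a rough measure of importance. Treat the ranking with caution: several features in this dataset are strongly correlated (for example radius, perimeter, and area), which makes individual coefficients less stable.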

In [19]:
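# Use the absolute value of each coefficient as a rough importance score
# (comparable across features because the inputs were standardized)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': np.abs(log_reg.coef_[0])
})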

# Sort by absolute coefficient values
feature_importance = feature_importance.sort_values('coefficient', ascending=False)

print("Top 10 Most Important Features:")
for idx, row in feature_importance.head(10).iterrows():
    print(f"{row['feature']}: {row['coefficient']:.4f}")

Top 10 Most Important Features:
worst texture: 1.3506
radius error: 1.2682
worst symmetry: 1.2082
mean concave points: 1.1198
worst concavity: 0.9431
area error: 0.9072
worst radius: 0.8798
worst area: 0.8418
mean concavity: 0.8015
worst concave points: 0.7782
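
Absolute values hide the direction of each effect. A sketch of a signed view is shown below: a positive coefficient raises the odds of class 1 (benign), a negative one raises the odds of class 0 (malignant), and `exp(coefficient)` is the multiplicative change in odds per one-standard-deviation increase in the feature. The code retrains an equivalent model so that it is self-contained, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_all, y_all = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

X_tr_s = StandardScaler().fit_transform(X_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)

coefs = pd.DataFrame({
    "feature": X_all.columns,
    "coefficient": model.coef_[0],         # sign shows the direction of the effect
    "odds_ratio": np.exp(model.coef_[0]),  # per one-standard-deviation increase
})

# Rank by magnitude but keep the sign visible
order = coefs["coefficient"].abs().sort_values(ascending=False).index
top10 = coefs.loc[order].head(10)
print(top10.to_string(index=False))
```

An odds ratio below 1 means that larger values of the feature make a benign prediction less likely, holding the other standardized features fixed.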


## Task 9: Cross-Validation

Let's use cross-validation to get a more robust estimate of our model's performance.

In [21]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

Cross-validation scores: [0.97802198 0.96703297 1.         0.97802198 0.94505495]
Mean CV accuracy: 0.9736 (+/- 0.0357)
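
In the cell above, the scaler was fit on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. The effect is usually small, but a pipeline avoids it by refitting the scaler inside every fold. This is a self-contained sketch of that variant with illustrative variable names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# The scaler is refit on the training portion of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)  # unscaled data goes in

print(f"Fold accuracies: {pipe_scores.round(4)}")
print(f"Mean CV accuracy: {pipe_scores.mean():.4f} (+/- {pipe_scores.std() * 2:.4f})")
```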


## Task 10: Model Interpretation and Conclusion

Let's summarize what we've learned and interpret our results.

In [23]:
# Summary of results
print("Model Performance Summary:")
print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Testing Accuracy: {test_accuracy*100:.2f}%")
print(f"Cross-validation Accuracy: {cv_scores.mean()*100:.2f}%")
print()
print("Key Findings:")
print("1. Training and test accuracy are close, suggesting minimal overfitting")
print("2. The largest coefficients belong to texture, size, and shape features")
print("3. Precision and recall are at least 0.95 for both classes")
print("4. Cross-validation accuracy is consistent with the test accuracy")

Model Performance Summary:
Training Accuracy: 98.68%
Testing Accuracy: 97.37%
Cross-validation Accuracy: 97.36%

Key Findings:
1. Training and test accuracy are close, suggesting minimal overfitting
2. The largest coefficients belong to texture, size, and shape features
3. Precision and recall are at least 0.95 for both classes
4. Cross-validation accuracy is consistent with the test accuracy
