# Logistic Regression and KNN in Practice

This notebook shows how to implement Logistic Regression and K-Nearest Neighbors (KNN) classifiers with scikit-learn. We'll work through a complete machine learning workflow, from data loading to model evaluation.

## 1. Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve, average_precision_score
)
from sklearn.datasets import load_breast_cancer
from imblearn.over_sampling import SMOTE

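## 2. Load and Explore the Dataset

We'll use the Breast Cancer Wisconsin dataset, a classic binary classification dataset with 569 samples and 30 numeric features. In scikit-learn's copy, target 0 means malignant and target 1 means benign, so the "positive" class in the metrics below is benign.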

In [None]:
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Display basic information
print(f"Dataset shape: {X.shape}")
print("\nFirst 5 rows of the dataset:")
display(X.head())

# Check class distribution
print("\nClass distribution:")
print(y.value_counts())

# Visualize feature distributions
plt.figure(figsize=(12, 6))
sns.boxplot(data=X.iloc[:, :5])
plt.xticks(rotation=45)
plt.title('Feature Distributions')
plt.show()

## 3. Data Preprocessing

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Oversample the minority class with SMOTE (training data only, so the test set stays untouched)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Original training set shape: {X_train.shape}")
print(f"Resampled training set shape: {X_train_resampled.shape}")
print("\nClass distribution after resampling:")
print(pd.Series(y_train_resampled).value_counts())
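
The reason for fitting the scaler on the training split only can be checked numerically. The sketch below repeats the split and scaling with illustrative names (`X_demo`, `scaler_demo`, and so on are not part of the notebook's state) and shows that the training features end up with mean 0 and standard deviation 1, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative names only; this cell does not touch the notebook's variables.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The mean and standard deviation are estimated from the training split only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Largest deviation from the ideal (mean 0, std 1) across all features.
train_mean_abs = np.abs(X_tr_s.mean(axis=0)).max()
train_std_dev = np.abs(X_tr_s.std(axis=0) - 1).max()
test_mean_abs = np.abs(X_te_s.mean(axis=0)).max()

print(f"Train: max |mean| = {train_mean_abs:.2e}, max |std - 1| = {train_std_dev:.2e}")
print(f"Test:  max |mean| = {test_mean_abs:.2e}")
```

A nonzero test-set mean is expected here and is not a problem: the test data is treated as unseen, so it must not influence the scaling parameters.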

## 4. Logistic Regression

In [None]:
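# Initialize the model. The SMOTE-resampled classes are already balanced,
# so class_weight='balanced' has no further effect on this training set.
log_reg = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)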

# Train the model
log_reg.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba_log = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluate the model
print("Logistic Regression Results:")
print("=" * 30)
print(f"Accuracy: {accuracy_score(y_test, y_pred_log):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_log))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log))

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_log)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
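
To make the link between the fitted coefficients and `predict_proba` concrete, here is a minimal sketch that recomputes the probabilities by hand with the sigmoid function. It fits its own small model under illustrative names (`clf`, `X_demo`) and, purely for brevity, scales all rows at once, which is acceptable only because nothing is being evaluated here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative setup: scaling all rows together is fine for this demonstration only.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps it to P(y = 1 | x).
z = X_demo @ clf.coef_[0] + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = clf.predict_proba(X_demo)[:, 1]

# The default decision rule (probability above 0.5) is the same as z > 0.
manual_labels = (z > 0).astype(int)

print("Max probability difference:", np.abs(manual_proba - sklearn_proba).max())
print("Label agreement:", (manual_labels == clf.predict(X_demo)).mean())
```

Because the decision boundary sits at a probability of 0.5, moving that threshold trades false positives against false negatives, which is exactly what the ROC curve above traces out.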

## 5. K-Nearest Neighbors (KNN)

In [None]:
# Find the optimal k using cross-validation
k_values = list(range(1, 31, 2))
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot accuracy vs k
plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Cross-Validated Accuracy')
plt.title('KNN: Accuracy vs. k')
plt.grid(True)
plt.show()

# Train the model with optimal k
optimal_k = k_values[np.argmax(cv_scores)]
print(f"Optimal number of neighbors: {optimal_k}")

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test_scaled)
y_pred_proba_knn = knn.predict_proba(X_test_scaled)[:, 1]

# Evaluate the model
print("\nKNN Results:")
print("=" * 30)
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn))
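
What `KNeighborsClassifier` does at prediction time can be sketched in a few lines of NumPy: compute the distance from each query point to every training point, take the `k` closest, and let them vote. The helper `knn_predict` below is hypothetical and written only for illustration; it assumes binary 0/1 labels, Euclidean distance, and an odd `k` so that votes cannot tie.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_predict(X_fit, y_fit, X_query, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_fit).
    diffs = X_query[:, None, :] - X_fit[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for each query row.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # Labels are 0/1, so a mean above 0.5 means class 1 wins the vote.
    return (y_fit[nearest].mean(axis=1) > 0.5).astype(int)


X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

manual_pred = knn_predict(X_tr_s, y_tr, X_te_s, k=5)
sklearn_pred = KNeighborsClassifier(n_neighbors=5).fit(X_tr_s, y_tr).predict(X_te_s)

agreement = (manual_pred == sklearn_pred).mean()
print(f"Agreement with scikit-learn: {agreement:.3f}")
```

The distance matrix has one entry per (query, training point) pair, which is why prediction cost grows with the size of the training set.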

## 6. Model Comparison

In [None]:
# Compare ROC curves
fpr_log, tpr_log, _ = roc_curve(y_test, y_pred_proba_log)
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_pred_proba_knn)
roc_auc_log = auc(fpr_log, tpr_log)
roc_auc_knn = auc(fpr_knn, tpr_knn)

plt.figure(figsize=(10, 8))
plt.plot(fpr_log, tpr_log, color='blue', lw=2, label=f'Logistic Regression (AUC = {roc_auc_log:.2f})')
plt.plot(fpr_knn, tpr_knn, color='green', lw=2, label=f'KNN (AUC = {roc_auc_knn:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
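
The import cell brings in `precision_recall_curve` and `average_precision_score`, which complement the ROC comparison. As a rough sketch, the cell below refits both model types inside scikit-learn pipelines (illustrative names, default settings, `k=5` chosen arbitrarily, no SMOTE) and reports ROC AUC next to average precision for each.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score, precision_recall_curve, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Illustrative models; hyperparameters are assumptions, not the notebook's tuned values.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

summary = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]  # probability of class 1 (benign)
    precision, recall, _ = precision_recall_curve(y_te, scores)
    summary[name] = {
        "roc_auc": roc_auc_score(y_te, scores),
        "avg_precision": average_precision_score(y_te, scores),
        "n_pr_points": len(precision),
    }
    print(f"{name}: ROC AUC = {summary[name]['roc_auc']:.3f}, "
          f"average precision = {summary[name]['avg_precision']:.3f}")
```

The `precision` and `recall` arrays can be passed to `plt.plot(recall, precision)` to draw the curve in the same style as the ROC plot.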

## 7. Feature Importance (Logistic Regression)

In [None]:
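# Use absolute coefficient size as a rough importance measure
# (comparable across features only because they were standardized)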
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(log_reg.coef_[0])
}).sort_values('Importance', ascending=False)

# Plot top 10 most important features
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()
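
Absolute values hide the direction of each effect. One common way to read the coefficients, sketched here with an illustrative model fitted on standardized features, is to keep the sign and convert each coefficient to an odds ratio: `exp(coef)` is the factor by which the odds of class 1 (benign) change when that feature rises by one standard deviation, with the other features held fixed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data_demo = load_breast_cancer()
# Illustrative setup: scaling all rows together is fine since nothing is evaluated here.
X_demo = StandardScaler().fit_transform(data_demo.data)
clf = LogisticRegression(max_iter=1000).fit(X_demo, data_demo.target)

coef = clf.coef_[0]
odds_ratio = np.exp(coef)  # multiplicative change in odds per one standard deviation

# Rank by magnitude, but keep the signed coefficient and its odds ratio.
table = sorted(
    zip(data_demo.feature_names, coef, odds_ratio),
    key=lambda row: abs(row[1]),
    reverse=True,
)

for feature, c, ratio in table[:5]:
    direction = "raises" if c > 0 else "lowers"
    print(f"{feature:<25} coef = {c:+.3f}  odds ratio = {ratio:.3f}  ({direction} odds of benign)")
```

Many of these features are strongly correlated with each other, so individual coefficients should be read with some caution.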

## 8. Hyperparameter Tuning (Optional)

In [None]:
# Example: Tuning Logistic Regression
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy: {:.4f}".format(grid_search.best_score_))
print("Test set accuracy: {:.4f}".format(grid_search.score(X_test_scaled, y_test)))
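
One possible refinement, shown as a sketch rather than a replacement for the cell above, is to put the scaler and the classifier in a `Pipeline` and tune that. The scaler is then refit on the training portion of every cross-validation fold, so the validation fold never influences the scaling statistics. The step names `scaler` and `clf` and the small `C` grid are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid_demo = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)  # raw features go in; scaling happens inside each fold

test_acc = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test set accuracy: {test_acc:.4f}")
```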

## 9. Conclusion

In this notebook, we've covered:
1. Data loading and exploration
2. Data preprocessing and handling class imbalance
3. Implementing and evaluating Logistic Regression
4. Implementing and evaluating KNN
5. Comparing model performance
6. Feature importance analysis
7. Hyperparameter tuning

### Key Takeaways:
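- Logistic Regression works well when the classes are approximately linearly separable, and its coefficients are interpretable: on standardized features, each one is the change in log-odds per standard deviation.
- KNN is simple and has no real training step, which suits smaller datasets, but prediction can be computationally expensive because each query is compared against the stored training points.
- Both models benefit from feature scaling: KNN because its distances are otherwise dominated by large-valued features, and Logistic Regression because regularization and solver convergence depend on feature scale.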
- Model evaluation should go beyond accuracy, especially with imbalanced datasets.

### Next Steps:
1. Try different classification algorithms (e.g., Random Forest, SVM)
2. Experiment with feature selection techniques
3. Explore more advanced evaluation metrics
4. Deploy the model as a web application