In [1]:
# Common imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression, make_classification, load_wine

# --- Generate Data for Problem 1: Linear Regression ---
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# --- Generate Data for Problem 2: Imbalanced Classification ---
# Create an imbalanced dataset (e.g., 2% fraud cases)
X_imb, y_imb = make_classification(n_samples=1000, n_features=10, n_informative=5,
                                   n_redundant=0, n_classes=2, weights=[0.98, 0.02],
                                   flip_y=0, random_state=42)
X_imb_train, X_imb_test, y_imb_train, y_imb_test = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)


# --- Generate Data for Problem 3 & 4: Standard Classification ---
X_class, y_class = make_classification(n_samples=500, n_features=10, n_informative=5,
                                       n_redundant=2, n_classes=3, random_state=42)
# Make features have different scales for Problem 4
X_class[:, 0] *= 100
X_class[:, 3] *= 50
X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(X_class, y_class, test_size=0.2, random_state=42)

print("Sample data generated successfully!")

Sample data generated successfully!


**Problem 1: Linear Regression**

A model to predict monthly sales revenue

In [2]:
# Training and evaluation

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# a) Train a Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_reg_train, y_reg_train)
print(f"a) Model Coefficient: {lin_reg.coef_[0]:.4f}")

# Make predictions on the test data
y_pred = lin_reg.predict(X_reg_test)

# b) Evaluate the model
r2 = r2_score(y_reg_test, y_pred)
mae = mean_absolute_error(y_reg_test, y_pred)
print(f"b) R^2 Score: {r2:.4f}")
print(f"b) Mean Absolute Error (MAE): {mae:.4f}")

a) Model Coefficient: 86.5115
b) R^2 Score: 0.9450
b) Mean Absolute Error (MAE): 16.0405


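If a model performs well on the training data but poorly on the test data, it is **overfitting**. (The single-feature model fitted above reaches R^2 = 0.9450 on the test set, so the scenario is hypothetical here.) To improve generalization, you can: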
1.  **Use Regularization:** Introduce a penalty for large coefficients to prevent the model from becoming too complex.
    * **L1 (Lasso) Regularization:** Can shrink some coefficients to exactly zero, effectively performing feature selection.
    * **L2 (Ridge) Regularization:** Shrinks coefficients but doesn't set them to zero. It's a good default choice.
2.  **Get More Training Data:** A larger, more diverse dataset can help the model learn the underlying patterns better instead of memorizing noise.
3.  **Feature Selection:** Remove irrelevant or redundant features that might be confusing the model.
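
The regularization options above can be sketched on a dataset that is more prone to overfitting than the one-feature example. This is a minimal illustration, not part of the original exercise: the 40-feature dataset, the `alpha=1.0` values, and the `results` dictionary are assumptions chosen for demonstration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Hypothetical overfitting-prone setup: 40 features, only 5 informative, few samples.
X, y = make_regression(n_samples=60, n_features=40, n_informative=5,
                       noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "OLS": LinearRegression(),                  # no penalty
    "Ridge": Ridge(alpha=1.0),                  # L2 penalty
    "Lasso": Lasso(alpha=1.0, max_iter=10000),  # L1 penalty
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        # Count coefficients that were driven exactly to zero.
        "zero_coefs": int(np.sum(model.coef_ == 0)),
    }
    print(name, results[name])
```

Comparing `train_r2` with `test_r2` for each model shows the size of the generalization gap, and `zero_coefs` shows whether a penalty removed any features outright. In practice `alpha` would be tuned with cross-validation rather than fixed.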

In [3]:
# Problem 2: Fraud Detection

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# a) Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_imb_train, y_imb_train)
log_reg_pred = log_reg.predict(X_imb_test)

# a) Train Linear SVM
linear_svm = SVC(kernel='linear', random_state=42)
linear_svm.fit(X_imb_train, y_imb_train)
svm_pred = linear_svm.predict(X_imb_test)

# b) Compare accuracy
log_reg_acc = accuracy_score(y_imb_test, log_reg_pred)
svm_acc = accuracy_score(y_imb_test, svm_pred)

print(f"b) Logistic Regression Accuracy: {log_reg_acc:.4f}")
print(f"b) Linear SVM Accuracy: {svm_acc:.4f}")

b) Logistic Regression Accuracy: 0.9800
b) Linear SVM Accuracy: 0.9800


With only 2% of cases being fraudulent, **accuracy is a poor metric**. A model that always predicts "not fraudulent" would have 98% accuracy but be useless, and both models above score 0.9800, the same as that trivial baseline. To improve performance, you should:
1.  **Change the Evaluation Metric:** Use metrics that are better for imbalanced data, such as **Precision**, **Recall**, **F1-Score**, or the **AUC-ROC curve**.
2.  **Use Resampling Techniques:**
    * **Oversampling:** Increase the number of minority class (fraud) samples. A common method is **SMOTE** (Synthetic Minority Over-sampling Technique), which creates new synthetic samples.
    * **Undersampling:** Reduce the number of majority class (legitimate) samples.
3.  **Set Class Weights:** Many models (including `LogisticRegression` and `SVC`) have a `class_weight='balanced'` parameter. This tells the model to pay more attention to the minority class during training by adjusting the loss function.
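
As a rough sketch of points 1 and 3, the snippet below rebuilds the same imbalanced dataset and compares a majority-class baseline, a default `LogisticRegression`, and one with `class_weight='balanced'` using precision, recall, and F1 on the fraud class. The `DummyClassifier` baseline, the `max_iter=1000` setting, and the `scores` dictionary are additions for illustration; results depend on the split.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same imbalanced data as in the setup cell (about 2% positives).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, weights=[0.98, 0.02],
                           flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

classifiers = {
    "always_majority": DummyClassifier(strategy="most_frequent"),
    "logreg_default": LogisticRegression(max_iter=1000, random_state=42),
    "logreg_balanced": LogisticRegression(class_weight="balanced",
                                          max_iter=1000, random_state=42),
}

scores = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # zero_division=0 avoids warnings when a model never predicts fraud.
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(name, scores[name])
```

The baseline never flags a fraud case, so its recall is 0 regardless of its accuracy, which is the reason to look beyond accuracy. Class weighting typically trades some precision for higher recall on the minority class.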

In [4]:
# Problem 3: KNN

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# a) Train a KNeighbors Classifier with k=3
knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_3.fit(X_class_train, y_class_train)
y_pred_3 = knn_3.predict(X_class_test)
acc_3 = accuracy_score(y_class_test, y_pred_3)
print(f"a) Accuracy with k=3: {acc_3:.4f}")

# b) Increase k to 50
knn_50 = KNeighborsClassifier(n_neighbors=50)
knn_50.fit(X_class_train, y_class_train)
y_pred_50 = knn_50.predict(X_class_test)
acc_50 = accuracy_score(y_class_test, y_pred_50)
print(f"b) Accuracy with k=50: {acc_50:.4f}")

a) Accuracy with k=3: 0.4300
b) Accuracy with k=50: 0.4900


A significant drop in test accuracy compared to training accuracy indicates **overfitting**.
* **Cause:** In KNN, a small `k` (like 3) makes the model highly sensitive to noise in the training data, creating a very complex decision boundary. It essentially "memorizes" the training set.
* **How to fix:**
    1.  **Increase `k`:** A larger `k` considers more neighbors, which smooths out the decision boundary and makes the model less complex and more generalized (here, test accuracy rises from 0.4300 at k=3 to 0.4900 at k=50).
    2.  **Use Cross-Validation:** Instead of guessing, use a tool like `GridSearchCV` to systematically test a range of `k` values and find the one that performs best on validation data.
    3.  **Feature Scaling:** Ensure all features are on a similar scale. This is crucial for distance-based algorithms like KNN (see Problem 4).
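
The cross-validation and scaling fixes can be combined in one search. The sketch below is one possible setup, not the notebook's own: the candidate `k` values, the 5-fold setting, and the pipeline step names `scaler` and `knn` are assumptions. Putting `StandardScaler` inside the `Pipeline` means it is refit on each training fold, so no validation data leaks into the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same 3-class data as in the setup cell, including the two rescaled features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=3, random_state=42)
X[:, 0] *= 100
X[:, 3] *= 50
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbor counts (illustrative choice).
param_grid = {"knn__n_neighbors": [1, 3, 5, 11, 21, 51]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_test, y_test)  # refit best pipeline, scored on held-out data
print(f"Best k: {best_k}, CV accuracy: {search.best_score_:.4f}, "
      f"test accuracy: {test_acc:.4f}")
```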

In [5]:
# Problem 4: Feature scaling

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# a) Train KNN on the original (unscaled) dataset
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_class_train, y_class_train)
y_pred_unscaled = knn_unscaled.predict(X_class_test)
acc_unscaled = accuracy_score(y_class_test, y_pred_unscaled)
print(f"a) Accuracy before normalization: {acc_unscaled:.4f}")

# b) Apply StandardScaler to normalize the dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_class_train)
X_test_scaled = scaler.transform(X_class_test) # Use the SAME scaler from training data

# Retrain the model on scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_class_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_class_test, y_pred_scaled)
print(f"b) Accuracy after normalization: {acc_scaled:.4f}")

a) Accuracy before normalization: 0.3600
b) Accuracy after normalization: 0.8100


No, **Decision Trees would not benefit from feature scaling**.
* **Why:** Decision trees work by splitting data based on feature thresholds (e.g., `if feature_A > 5.3`). The scale of the feature does not matter; only the relative ordering of the values within that feature matters. Scaling a feature by a positive constant (e.g., multiplying it by 10) or standardizing it does not change the order of its values, so the tree selects the same partition of the samples; only the numeric threshold is rescaled. The algorithm is not based on distance calculations like KNN or SVM.
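
This claim can be checked empirically. The sketch below, which assumes the same synthetic data as the setup cell and a fixed `random_state=0` for the trees, fits one `DecisionTreeClassifier` on raw features and one on standardized features, then measures how often their test predictions agree. The variable `agreement` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same 3-class data as in the setup cell, including the two rescaled features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=3, random_state=42)
X[:, 0] *= 100
X[:, 3] *= 50
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on training data only, then transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Identical random_state so both trees break ties the same way.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train_s, y_train)

# Fraction of test samples on which the two trees predict the same class.
agreement = np.mean(tree_raw.predict(X_test) == tree_scaled.predict(X_test_s))
print(f"Prediction agreement, raw vs. scaled: {agreement:.4f}")
```

The two trees should agree on essentially every test sample; any rare disagreement would come from floating-point rounding near a threshold rather than from the scale of the features.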

In [6]:
# Problem 5: Naive Bayes vs Logistic Regression

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# a) Load data and train Gaussian Naïve Bayes
wine = load_wine()
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_wine_train, y_wine_train)
gnb_pred = gnb.predict(X_wine_test)
gnb_acc = accuracy_score(y_wine_test, gnb_pred)
print(f"a) Gaussian Naïve Bayes Test Accuracy: {gnb_acc:.4f}")

# b) Train Logistic Regression
log_reg_wine = LogisticRegression(max_iter=10000, random_state=42) # Increased max_iter for convergence
log_reg_wine.fit(X_wine_train, y_wine_train)
log_reg_wine_pred = log_reg_wine.predict(X_wine_test)
log_reg_wine_acc = accuracy_score(y_wine_test, log_reg_wine_pred)  # separate name so the Problem 2 result is not overwritten
print(f"b) Logistic Regression Test Accuracy: {log_reg_wine_acc:.4f}")

a) Gaussian Naïve Bayes Test Accuracy: 1.0000
b) Logistic Regression Test Accuracy: 1.0000


One model might perform better than the other due to their underlying assumptions:
* **Gaussian Naïve Bayes** assumes that all features are **conditionally independent** given the class. This is a "naïve" and often incorrect assumption. If the features in the Wine dataset are correlated (e.g., color intensity and alcohol content), Naïve Bayes may not perform as well. However, it can work very well even when the assumption is violated, especially with limited data.
* **Logistic Regression** does not assume feature independence. It learns a linear relationship between the features and the log-odds of the outcome. If the decision boundary between the wine classes is roughly linear, Logistic Regression is likely to perform very well. In this run, both models reach a test accuracy of 1.0000, so this single small test split does not separate them.
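
Because one small test split cannot distinguish the two models, a cross-validated comparison may be more informative. The following is a sketch under assumptions that are not in the original notebook: 5-fold cross-validation, and a `StandardScaler` placed in front of `LogisticRegression` to help convergence.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

candidates = {
    "GaussianNB": GaussianNB(),
    # Scaling is an added assumption; the original cell fits on raw features.
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=10000)),
}

# 5-fold stratified cross-validation accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, fold_scores in cv_scores.items():
    print(f"{name}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```

If the means differ by less than the fold-to-fold standard deviation, the data do not give strong evidence that either model is better.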

**Problem 6: LR vs KNN for hiring candidates**

a) For predicting if a candidate will be hired, **Logistic Regression would generally be a better choice**.
* **Interpretability:** Logistic Regression is highly interpretable. You can examine the model's coefficients to understand how much each feature (work experience, education) contributes to the hiring decision, which is valuable for business insights.
* **Performance:** It's computationally fast, both for training and prediction.
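* **Simplicity:** It's less sensitive to irrelevant features than KNN and usually works reasonably with default settings, although its regularization strength `C` can still be tuned. KNN, by contrast, depends heavily on the choice of `k`.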

KNN is less suitable because it can be slow with large datasets, requires feature scaling, and doesn't provide a clear model of *why* a prediction was made.
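
The interpretability argument can be illustrated with a small sketch. Everything in it is hypothetical: the synthetic hiring data, the feature names `years_experience` and `education_level`, and the coefficients used to generate the labels are invented for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical candidates: experience in years, education coded 1 to 4.
rng = np.random.default_rng(0)
n = 500
experience = rng.uniform(0, 15, n)
education = rng.integers(1, 5, n)

# Invented ground truth: both features raise the log-odds of being hired.
logit = -5 + 0.4 * experience + 0.8 * education
hired = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([experience, education])
feature_names = ["years_experience", "education_level"]

# Standardizing first makes the coefficients comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, hired)

coefs = model.named_steps["logisticregression"].coef_[0]
# exp(coefficient) = multiplicative change in odds per one standard deviation.
odds_ratios = dict(zip(feature_names, np.exp(coefs)))
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per 1 SD = {ratio:.2f}")
```

An odds ratio above 1 means the feature increases the odds of a "hired" prediction. KNN offers no comparable per-feature summary, since its predictions come from the labels of nearby training points.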