### **Chunk 3: First Supervised Learning Models**

#### **1. Concept Introduction**

We are now moving into the core of supervised learning. You will learn the two workhorse models that form the basis for many more advanced techniques:

1.  **`LogisticRegression` for Classification**: Despite its name, Logistic Regression is a **classification** algorithm. It estimates the probability that an input belongs to a class by passing a weighted sum of the features through the sigmoid function. Geometrically, its decision boundary is a line (or a plane/hyperplane in higher dimensions) that separates the classes as well as possible. It's fast, highly interpretable, and a fantastic baseline model.

2.  **`LinearRegression` for Regression**: This is the classic statistical model for **regression** tasks (predicting a continuous value like price or temperature). It finds the best-fitting linear relationship between the features and the target. The goal is to find the coefficients (weights) for each feature that minimize the sum of squared differences between the predicted and actual values.
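
To make the fitting goal concrete: ordinary least squares picks the weights that minimize the sum of squared residuals. The sketch below uses synthetic data (the true weights 3.0 and -2.0, the intercept 5.0, and all variable names are illustrative assumptions, not taken from the datasets in this chunk) and checks that `LinearRegression` arrives at the same solution as NumPy's generic least-squares solver.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 5 + 3*x1 - 2*x2 + small noise (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn's estimator
model = LinearRegression().fit(X, y)

# Same problem solved directly: append a column of ones for the intercept
X_with_bias = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

print("sklearn weights:", model.coef_, "intercept:", model.intercept_)
print("lstsq weights  :", solution[:2], "intercept:", solution[2])
```

Both lines should print weights close to the values used to generate the data, which is a quick way to convince yourself that the estimator is solving a plain least-squares problem.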

A key advantage of both models is their **interpretability**. After training, you can inspect the model's learned `coef_` attribute. Each coefficient tells you how much a one-unit increase in a feature shifts the model's output, holding all other features constant: the predicted value for `LinearRegression`, and the log-odds of the positive class for `LogisticRegression`. A large positive coefficient means the feature strongly increases the predicted value or class probability, while a large negative coefficient means it strongly decreases it.
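
For `LogisticRegression`, the "one-unit increase" reading applies to the odds rather than to the probability directly. A minimal sketch with made-up numbers (the weight `w`, the intercept `b`, and the inputs are hypothetical) shows that raising a feature by one unit multiplies the odds by `exp(w)`:

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical one-feature model: log-odds = b + w * x
w, b = 0.8, -1.0

p_before = sigmoid(b + w * 2.0)   # feature value x = 2
p_after = sigmoid(b + w * 3.0)    # x increased by one unit

# The odds are multiplied by exp(w) regardless of the starting point,
# while the change in probability depends on where x started.
odds_ratio = odds(p_after) / odds(p_before)
print(f"p(x=2) = {p_before:.3f}, p(x=3) = {p_after:.3f}")
print(f"odds ratio = {odds_ratio:.3f}, exp(w) = {math.exp(w):.3f}")
```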

#### **2. Dataset EDA: Breast Cancer Wisconsin Dataset (Classification)**

This is another classic, clean dataset from `sklearn.datasets`. The goal is to predict whether a breast tumor is malignant (cancerous) or benign (not cancerous) based on 30 numeric features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Set plot style
sns.set_style("whitegrid")


In [None]:
# Load Data
cancer  = load_breast_cancer()
df      = pd.DataFrame(data = cancer.data,
                       columns=cancer.feature_names)
df['target'] = cancer.target

df.head()

In [None]:
df.info()

In [None]:
# Basic Statistics
# Notice the different scales again (e.g., 'mean area' vs 'mean smoothness')
pd.set_option('display.max_columns', None) # Show all columns
df.describe()

In [None]:
df.isnull().sum() 

In [None]:
# Target Variable Distribution
print(df['target'].value_counts())
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df)
plt.title('Distribution of Target (0=Malignant, 1=Benign)')
plt.show()


In [None]:
# Feature Distributions (Histograms)
# Let's look at the 'mean' features for brevity
mean_feature = [col for col in df.columns if 'mean' in col]
df[mean_feature].hist(figsize=(15, 12),
                      bins=30, edgecolor='black')
plt.suptitle('Histograms of "Mean" Feature Distributions', y=0.92)
plt.show()

In [None]:
# Correlation Matrix Heatmap
plt.figure(figsize=(20, 15))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=False,
            cmap='viridis') # Annot=False due to high number of features
plt.title('Correlation Matrix of Breast Cancer Features') 
plt.show()

#### **3. Minimal Working Example (Classification)**

Let's build a `LogisticRegression` model. We'll follow the exact same pattern as before:
**Split --> Scale --> Train --> Predict --> Evaluate**

In [None]:
# Imports, Data, Splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import accuracy_score

X,y = cancer.data, cancer.target

X_train, X_test, y_train, y_test  = \
                    train_test_split(
                        X,y,
                        test_size=0.2,
                        random_state=42,
                        stratify=y
                    )


In [None]:
# Scale the Data
scaler = StandardScaler()
X_train_scaled  =  scaler.fit_transform(X_train)
X_test_scaled   =  scaler.transform(X_test)



In [None]:
# Train 
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

In [None]:
y_pred = log_reg.predict(X_test_scaled)
accuracy = accuracy_score(y_pred=y_pred,
                          y_true=y_test)
print(f"Logistic Regression Accuracy: {accuracy * 100:.2f}%")
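
The same Split, Scale, Train, Predict, Evaluate pattern can also be bundled into a single estimator. The sketch below is one possible arrangement using scikit-learn's `make_pipeline`; it is an alternative to the step-by-step cells, not a replacement for them, and `max_iter=1000` is an assumption chosen here to give the solver room to converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation automatically before every call to predict.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(X_train, y_train)

pipeline_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Pipeline accuracy: {pipeline_accuracy * 100:.2f}%")
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on the test set, which is the main reason to prefer this form once the individual steps are understood.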

In [None]:
# Interpreting Coefficients
# Create a DataFrame to view the coefficients alongside their feature names
coefficients = pd.DataFrame(
    data=log_reg.coef_.T, # Transpose to make it a column
     index=cancer.feature_names,
      columns=['coefficient']  
)

# Sort by the absolute value to see the most impactful features
coefficients['abs_coefficient'] = coefficients['coefficient'].abs()
coefficients = coefficients.sort_values('abs_coefficient', ascending=False)

coefficients.head()

# NEGATIVE COEFFICIENT -> increases chance of being class 0 (Malignant)
# POSITIVE COEFFICIENT -> increases chance of being class 1 (Benign)

---


# Regression

#### **4. Dataset EDA: Boston Housing Dataset (Regression)**

The goal is to predict the median value of owner-occupied homes (MEDV) in the Boston area using various features about the suburbs.

**NOTE:** The original `load_boston` function was deprecated in scikit-learn 1.0 and removed in 1.2 because of ethical concerns with the dataset. We will fetch a version from OpenML, which is the modern way to access many classic datasets.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml

# Set plot style
sns.set_style('whitegrid')




In [None]:
boston = fetch_openml(name="boston",
                      version=1,
                      as_frame=True,
                      parser='auto')
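# boston.frame already contains the target column 'MEDV'. Build the DataFrame
# from boston.data (features only) so the target appears exactly once;
# otherwise 'MEDV' would stay among the features and leak the answer.
df = boston.data.copy()
df['target'] = boston.target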

df.head()

**Basic Info of the Dataset**

In [None]:
df.info()

**Basic Statistics**

In [None]:
df.describe()

In [None]:
# Target Variable Distribution
plt.figure(figsize=(8,6))
sns.histplot(df['target'],
             kde=True,
             bins=30)
plt.title('Distribution of House Prices (MEDV)')
plt.xlabel('Median Value')
plt.show()

In [None]:
# Correlation with target
# Find features most correlated with the target price
corr_with_target  =  df.corr()['target'].sort_values(ascending=False)
plt.figure(figsize=(10, 8))
corr_with_target.drop('target').plot(kind='bar')
plt.title('Correlation of Features with House Price')
plt.show()

#### **5. Minimal Working Example (Regression)**

Let's build a `LinearRegression` model. The workflow is identical; only the model and the evaluation metric change.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Use .values to get numpy arrays
X = df.drop('target', axis=1).values
y = df['target'].values
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(f"X_Train shape : {X_train.shape}")
print(f"X_Test shape : {X_test.shape}")
print(f"y_Train shape : {y_train.shape}")
print(f"y_Test shape : {y_test.shape}")


**Scale The Data**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Train the Linear Regression Model**

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

**Predict and Evaluate**

In [None]:
y_pred = lin_reg.predict(X_test_scaled)

# We'll use Root Mean Squared Error (RMSE), which is in the same units as the target.
mse  =   mean_squared_error(y_test, y_pred)
rmse  = np.sqrt(mse)
print(f"Linear Regression RMSE: ${rmse * 1000:.2f}") # Multiply by 1000 since target is in $1000s

# OR we can also do
rmse = np.sqrt(np.mean(np.square(y_test - y_pred)))
print(f"Linear Regression RMSE: ${rmse * 1000:.2f}")

# OR, with scikit-learn 1.4 or newer, use the dedicated function
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_true=y_test, y_pred=y_pred)
print(f"Linear Regression RMSE: ${rmse * 1000:.2f}")

# The last option is probably the cleanest. The other two are shown so you can see
# that the same value can be computed in more than one way.

**Interpreting Coefficients**
> Creating a DataFrame to view the coefficients

In [None]:
feature_names = df.drop('target', axis=1).columns
coeffs = pd.DataFrame(
    data=lin_reg.coef_,
    index = feature_names,
    columns=['coefficient']
).sort_values('coefficient', ascending=False)
print('Feature Coefficients :')
coeffs

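Coefficients indicate how much **each feature affects the house price (MEDV)**, assuming all others remain constant. Because the model was trained on standardized features, each coefficient is the change in the predicted price (in $1000s) for a one-standard-deviation increase in that feature.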

---

## **Features Explained (with Interpretation of Coefficients)**

**Reading these values:** `MEDV` is the target, not a feature. The values below come from a fit in which `MEDV` was still among the feature columns (`boston.frame` already contains `MEDV`, so a DataFrame built from it keeps that column when only `target` is dropped). The model then predicts the price from the price itself, which is why one coefficient carries all the weight and the rest sit at floating-point noise (around 1e-15). None of the near-zero values says anything about the real effect of a feature; they become meaningful only once `MEDV` is excluded from `X`.

| Feature | Meaning | Coefficient | Interpretation |
| --- | --- | --- | --- |
| **MEDV** | Target variable: median home value ($1000s) | **9.388523e+00** | Not the intercept (that is stored in `intercept_`). This is the leaked target. Because the features are standardized, a perfect fit gives it a coefficient equal to the training-set standard deviation of `MEDV`, here about 9.39 (roughly $9,390). |
| **RAD** | Index of accessibility to radial highways | **1.15e-14** | Numerically 0 because of the leak; no conclusion about highway access can be drawn. |
| **AGE** | Proportion of owner-occupied units built before 1940 | **1.03e-14** | Numerically 0 because of the leak; says nothing about the effect of older housing stock. |
| **INDUS** | Proportion of non-retail business acres per town | **7.66e-15** | Numerically 0 because of the leak; the positive sign is noise. |
| **LSTAT** | % of lower-status population | **6.99e-15** | Numerically 0 because of the leak. Higher `LSTAT` normally goes with lower prices, so a negative coefficient is expected in a clean fit. |
| **TAX** | Property tax rate per $10,000 | **6.33e-15** | Numerically 0 because of the leak. Higher taxes usually go with lower prices. |
| **CRIM** | Per capita crime rate by town | **5.65e-15** | Numerically 0 because of the leak; no conclusion about crime rate can be drawn. |
| **PTRATIO** | Pupil-teacher ratio by town | **4.99e-15** | Numerically 0 because of the leak. A lower `PTRATIO` usually goes with higher prices. |
| **NOX** | Nitric oxide concentration (pollution level) | **1.11e-15** | Numerically 0 because of the leak. Higher `NOX` typically goes with lower prices. |
| **CHAS** | Charles River dummy variable (1 if tract bounds river) | **6.80e-16** | Numerically 0 because of the leak. Bordering the river usually goes with higher prices. |
| **B** | Proportion of Black population (historical, coded feature) | **-3.99e-15** | Numerically 0 because of the leak. (Note: this feature is outdated and ethically inappropriate in modern datasets.) |
| **RM** | Average number of rooms per dwelling | **-4.16e-15** | Numerically 0 because of the leak; the negative sign is noise, not a finding. More rooms normally goes with higher prices. |
| **DIS** | Weighted distances to employment centers | **-6.44e-15** | Numerically 0 because of the leak; no conclusion about distance can be drawn. |
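
A pattern of one dominant coefficient with every other coefficient around 1e-15 is what ordinary least squares produces when a copy of the target sits among the features. The sketch below reproduces the effect on synthetic data (the column names `a`, `b`, and `target_copy` are hypothetical) so the symptom is easy to recognize in other projects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
target = 2.0 * features["a"] - 1.0 * features["b"] + rng.normal(scale=0.5, size=300)

# Leaky design matrix: the target is accidentally kept as a feature column
leaky = features.assign(target_copy=target)

def fit_coefficients(frame, y):
    """Standardize the columns, fit OLS, and return coefficients by column name."""
    X_scaled = StandardScaler().fit_transform(frame)
    model = LinearRegression().fit(X_scaled, y)
    return pd.Series(model.coef_, index=frame.columns)

clean_coefs = fit_coefficients(features, target)
leaky_coefs = fit_coefficients(leaky, target)

print("Without leakage:\n", clean_coefs, sep="")
print("With leakage:\n", leaky_coefs, sep="")
print("Std of target:", target.std(ddof=0))
```

In the leaky fit, the `target_copy` coefficient matches the standard deviation of the target and the real features collapse to numerical zero, which mirrors the `MEDV` row in the table.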


#### **6. Common Pitfalls**

1.  **Interpreting Coefficients on Unscaled Data**: A coefficient is expressed per unit of its feature. Without scaling, a feature measured in large units (like `TAX` in Boston, with values in the hundreds) can get a tiny coefficient, while a feature measured in small units (like `NOX`, with values below 1) gets a huge one, which misleads any comparison of their importance. **Compare coefficient magnitudes only on scaled data.**
2.  **Assuming Causation**: The coefficients show **association**, not **causation**. A large negative coefficient for `LSTAT` (lower status of population) doesn't *cause* low prices on its own; it's correlated with a complex mix of factors that lead to lower prices.
3.  **Using the Wrong Model for the Task**: `LinearRegression` will fit the Breast Cancer (classification) dataset without complaint, but it returns unbounded continuous values rather than class labels or probabilities. The inverse fails outright: `LogisticRegression` raises a `ValueError` when fitted on a continuous target such as `MEDV`.
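
Pitfall 1 can be seen on a small synthetic example. In the sketch below (the feature names `big_scale` and `small_scale` and all numbers are made up for illustration), both features move the target by about the same amount per standard deviation, yet the unscaled fit gives them coefficients that differ by roughly a factor of 1000.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
big_scale = rng.normal(loc=400.0, scale=100.0, size=n)   # values in the hundreds
small_scale = rng.normal(loc=0.5, scale=0.1, size=n)     # values below one

# Each feature moves the target by about 1.0 per standard deviation
y = 0.01 * big_scale + 10.0 * small_scale + rng.normal(scale=0.1, size=n)
X = np.column_stack([big_scale, small_scale])

raw_coefs = LinearRegression().fit(X, y).coef_
scaled_coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

print("Unscaled coefficients:", raw_coefs)     # per raw unit of each feature
print("Scaled coefficients  :", scaled_coefs)  # per one standard deviation
```

The unscaled coefficients are still valid per-unit effects; what breaks is ranking features by coefficient size, because the units differ.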

#### **7. Quick Win**

Incredible progress. You have now successfully built, evaluated, and interpreted models for the two primary types of supervised machine learning problems. You can:
-   Solve a classification problem with `LogisticRegression`.
-   Solve a regression problem with `LinearRegression`.
-   Scale data to ensure model performance and valid interpretation.
-   Inspect model coefficients (`.coef_`) to understand what drives the model's predictions.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Use .values to get numpy arrays
X_new = df.drop('target', axis=1).values
y_new = df['target'].values
print(X_new.shape, y_new.shape)
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=42)
print(f"X_Train shape : {X_train.shape}")
print(f"X_Test shape : {X_test.shape}")
print(f"y_Train shape : {y_train.shape}")
print(f"y_Test shape : {y_test.shape}")


In [None]:
lin_reg_on_not_Scaled_data = LinearRegression()
lin_reg_on_not_Scaled_data.fit(X_train, y_train)

y_pred = lin_reg_on_not_Scaled_data.predict(X_test)

# We'll use Root Mean Squared Error (RMSE), which is in the same units as the target.
mse  =   mean_squared_error(y_test, y_pred)
rmse  = np.sqrt(mse)
print(f"Linear Regression on not scaled data RMSE: ${rmse * 1000:.2f}") # Multiply by 1000 since target is in $1000s

In [None]:
feature_names = df.drop('target', axis=1).columns
coeffs_not_scaled = pd.DataFrame(
    data=lin_reg_on_not_Scaled_data.coef_,
    index = feature_names,
    columns=['coefficient']
).sort_values('coefficient', ascending=False)
print('Feature Coefficients :')
coeffs_not_scaled

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

mse  =   mean_squared_error(y_test, y_pred)
rmse  = np.sqrt(mse)
print(f"Random Forest RMSE: ${rmse * 1000:.2f}")


In [None]:
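# Random forests expose feature importances, not coefficients:
# they are non-negative, sum to 1, and carry no sign or direction.
importances_forest = pd.DataFrame(
    data=rf.feature_importances_,
    index = feature_names,
    columns=['importance']
).sort_values('importance', ascending=False)
print('Feature Importances :')
importances_forest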