# Class 11 – Advanced Scikit-Learn
## Lesson Objective
By the end of this lesson, you will understand advanced techniques in Scikit-Learn, including:
- Feature Scaling
- Splitting Datasets
- Implementing Multiple Machine Learning Models
- Evaluating Model Performance using Accuracy, Precision, Recall, F1-Score

## 1. Feature Scaling and Preprocessing
### Why Feature Scaling is Important
Many machine learning algorithms (such as SVM and Logistic Regression) are sensitive to the scale of the input features. For example, if one feature has values in the thousands and another in decimals, the larger-valued feature can dominate distance calculations and gradient updates, so the model effectively weights the features unequally.

Feature scaling puts all features on a comparable range so that no feature dominates simply because of its units.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:5]

array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])

### Explanation:
- `StandardScaler` subtracts each feature's mean and divides by its standard deviation, so every column ends up with mean 0 and unit variance.
- This helps algorithms that work best when features are centered around zero and share a similar scale.
- We display the first 5 scaled rows as a quick check.
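
To make the first bullet concrete, here is a minimal sketch that reproduces the `StandardScaler` result by hand. The variable names `X_manual` and `X_sklearn` are only illustrative and are not part of the lesson's notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

# The two results should agree up to floating-point error.
print(np.allclose(X_manual, X_sklearn))
```

If the comparison prints `True`, the scaler is doing nothing more than this per-column arithmetic.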

In [2]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the data to a range of 0 to 1
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:5]

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667]])

### Explanation:
- `MinMaxScaler` rescales each feature to the range 0 to 1 (the default), using that feature's minimum and maximum.
- Useful for data with known bounds, such as image pixel values. Because it depends on the minimum and maximum, it is sensitive to outliers.
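
The min-max formula can be checked the same way. The sketch below, with illustrative variable names, computes `(x - min) / (max - min)` per column and compares it with the scaler's output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Per-column minimum and maximum of the raw features.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Manual min-max scaling to the range [0, 1].
X_manual = (X - col_min) / (col_max - col_min)

X_sklearn = MinMaxScaler().fit_transform(X)

# The two results should agree up to floating-point error.
print(np.allclose(X_manual, X_sklearn))
```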

## 2. Splitting Datasets
### Why Split Data?
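We split the dataset so that we can evaluate the model on data it did not see during training. This helps us detect overfitting: a model that scores well on the training set but poorly on the test set has memorized rather than generalized.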

In [3]:
from sklearn.model_selection import train_test_split

# Split dataset into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, iris.target, test_size=0.2, random_state=42)

### Explanation:
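- `train_test_split` randomly shuffles the data and splits it into training and testing sets.
- `random_state=42` fixes the shuffle so the split is reproducible.
- Here `X_scaled` was scaled using statistics from the full dataset before splitting. In practice the scaler is usually fit on the training set only, so that no information from the test set leaks into training.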
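
As a variation on the cell above, the following sketch splits first and then fits the scaler on the training portion only. It also passes `stratify`, which the lesson's cell does not use, to keep class proportions equal in both sets. Treat it as one reasonable arrangement, not the lesson's own method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Split the raw features first; stratify keeps the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Learn the mean and standard deviation from the training set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...and reuse those same statistics to transform the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Note that the test set is transformed with `transform`, not `fit_transform`, so its own statistics never influence the scaling.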

## 3. Implementing Multiple Machine Learning Models

In [None]:
# Regression and Classification

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)  # proportion of the target's variance explained by the model
print(f"R-squared Score: {r2:.2f}")
mse = mean_squared_error(y_test, y_pred)  # average squared difference between predicted and actual values
print(f"Mean Squared Error: {mse:.2f}")

R-squared Score: 0.45
Mean Squared Error: 2900.19


In [None]:
# Mean Squared Error (MSE): lower is better (0 means perfect predictions)
# R-squared (R²): higher is better (1.0 means a perfect fit)
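
Both regression metrics can be computed by hand, which may help in reading them. The sketch below uses a small made-up set of values (`y_true` and `y_hat` are illustrative, not taken from the diabetes data) and compares the manual results with scikit-learn's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f} (sklearn: {mean_squared_error(y_true, y_hat):.3f})")
print(f"R²:  {r2_manual:.3f} (sklearn: {r2_score(y_true, y_hat):.3f})")
```

Because MSE is expressed in squared units of the target, its size depends on the target's scale, while R² is unitless.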



In [None]:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

MSE: 4976.80
R² Score: 0.06
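
A low test score from a decision tree often points to overfitting. One way to check, sketched below, is to compare training and test R² for the same model. The `max_depth=3` tree is an arbitrary choice made here for comparison and is not part of the lesson.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unrestricted tree keeps splitting until it fits the training data almost exactly.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# A depth-limited tree (max_depth=3 is an arbitrary, illustrative value).
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

scores = {}
for name, tree in [("Unrestricted tree", full_tree), ("max_depth=3 tree", shallow_tree)]:
    train_r2 = r2_score(y_train, tree.predict(X_train))
    test_r2 = r2_score(y_test, tree.predict(X_test))
    scores[name] = (train_r2, test_r2)
    print(f"{name}: train R² = {train_r2:.2f}, test R² = {test_r2:.2f}")
```

A large gap between the training and test scores suggests the model has memorized the training set. Limiting the depth may narrow that gap, though the best value depends on the data.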


In [7]:
from sklearn.svm import SVR

model = SVR(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")


Mean Squared Error: 5190.39
R² Score: 0.02


## 4. Evaluating Model Performance

In [16]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")

ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

In [17]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

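ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

Both cells above fail because `y_pred` still holds the continuous predictions from the `SVR` regressor trained on the diabetes data. `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score` require discrete class labels, so they apply to classifiers such as those in Section 5, not to regression output.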
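
The error message refers to the target types that scikit-learn infers from each array. The sketch below uses `type_of_target` from `sklearn.utils.multiclass` on two small made-up arrays to show the distinction, then reproduces the error in a controlled way.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import type_of_target

# Made-up arrays, for illustration only.
y_labels = np.array([0, 1, 2, 1, 0])                   # discrete class labels
y_continuous = np.array([0.2, 1.7, 2.1, 0.9, 0.4])     # regression-style output

print(type_of_target(y_labels))       # multiclass
print(type_of_target(y_continuous))   # continuous

# Mixing the two kinds of target triggers the ValueError.
raised = False
try:
    confusion_matrix(y_labels, y_continuous)
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```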

## 5. Activity: Implement and Evaluate Logistic Regression, Decision Tree, and SVM

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Reload data and scale
iris = load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()  # rescale each feature to mean 0 and standard deviation 1
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Logistic Regression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

# SVM
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

# Evaluate all models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Use `name` for the label so the loop does not overwrite the earlier `model` variable
for name, y_pred in [("Logistic Regression", y_pred_lr),
                     ("Decision Tree", y_pred_dt),
                     ("SVM", y_pred_svm)]:
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
    recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)
    print(f"Model: {name}")
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n")

Model: Logistic Regression
Accuracy: 100.00%
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Model: Decision Tree
Accuracy: 100.00%
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Model: SVM
Accuracy: 96.67%
Precision: 0.97
Recall: 0.96
F1 Score: 0.97
Confusion Matrix:
[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]
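
The macro-averaged scores reported for the SVM can be recovered directly from its confusion matrix. The sketch below hard-codes that matrix (rows are true classes, columns are predicted classes) and computes each metric with NumPy.

```python
import numpy as np

# SVM confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[10, 0, 0],
               [0, 8, 1],
               [0, 0, 11]])

tp = np.diag(cm)  # correctly classified samples per class

# Precision: correct predictions for a class / everything predicted as that class (column sum).
precision_per_class = tp / cm.sum(axis=0)

# Recall: correct predictions for a class / everything that truly is that class (row sum).
recall_per_class = tp / cm.sum(axis=1)

# F1: harmonic mean of precision and recall, per class.
f1_per_class = 2 * precision_per_class * recall_per_class / (precision_per_class + recall_per_class)

accuracy = tp.sum() / cm.sum()

# average='macro' is the unweighted mean of the per-class values.
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision_per_class.mean():.2f}")
print(f"Recall: {recall_per_class.mean():.2f}")
print(f"F1 Score: {f1_per_class.mean():.2f}")
```

The single misclassified sample lowers recall for the second class (8 of 9 found) and precision for the third class (11 of 12 predictions correct), which is why the macro averages fall just below 1.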



## 6. Wrap-Up and Homework
### Recap:
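- Feature Scaling: Puts features with different ranges on a comparable scale
- Data Splitting: Separate training and testing sets for a fair evaluation
- ML Models: Linear Regression, Decision Tree, and SVM for regression; Logistic Regression, Decision Tree, and SVM for classification
- Evaluation Metrics: MSE and R² for regression; Accuracy, Precision, Recall, F1-Score for classification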

### Homework:
- Try another model (like Naive Bayes or k-NN) on a dataset of your choice.
- Try different scalers (e.g., RobustScaler) and check how accuracy changes.