Here we apply all five learning models discussed earlier to a prediction task.

**Using the built-in Iris dataset from scikit-learn**

- This dataset contains 150 samples of iris flowers.

- Each sample has 4 features:
    - Sepal length (cm)
    - Sepal width (cm)
    - Petal length (cm)
    - Petal width (cm)

- Each sample belongs to one of 3 species:
    - Setosa (label = 0)
    - Versicolor (label = 1)
    - Virginica (label = 2)
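
As a quick sanity check of the description above, the following sketch inspects the shape of the feature matrix, the class names, and the number of samples per class. It uses only `load_iris` and NumPy; the variable name `class_counts` is just illustrative.

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)               # (150, 4)

# Class names in label order (index 0, 1, 2)
print(iris.target_names.tolist())    # ['setosa', 'versicolor', 'virginica']

# Number of samples for each class label
class_counts = np.bincount(iris.target)
print(class_counts)                  # [50 50 50]
```

The classes are perfectly balanced, which is why plain accuracy is a reasonable first metric for this dataset.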

In [129]:
from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
iris = load_iris()

# Create DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.head()) # View dataset

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


# Preprocess the data

In [130]:
# Split features and labels
X = df.drop('target', axis=1)  # Feature data
y = df['target']               # Labels

# Split into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30)
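
A plain random split does not guarantee that every class keeps its share in the test set; the classification reports in this section show 12, 9, and 9 test samples per class. As a possible alternative (an assumption, not what the notebook does), `train_test_split` accepts `stratify=y` to preserve the class proportions. A minimal sketch comparing the two:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)

# Plain random split: class counts in the test set can be uneven
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=30)
print(np.bincount(y_test_plain, minlength=3))

# Stratified split: each class keeps its 1/3 share in train and test
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=30, stratify=y)
print(np.bincount(y_test_strat, minlength=3))   # [10 10 10]
```

With only 30 test samples, each misclassified flower moves a per-class score by roughly ten percentage points, so balanced test classes make the per-class numbers easier to compare.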

# Code to Train and Predict

## 1. Linear Regression

In [116]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Create the model
LR_model = LinearRegression()

# Train the model on training data
LR_model.fit(X_train, y_train)

# Predict labels for the test set
y_pred_lin = LR_model.predict(X_test)

# Since its output is continuous, we round each prediction to the nearest class label
y_pred_round = np.round(y_pred_lin).astype(int)

# Clip the predictions to stay within valid label range [0, 2]
y_pred_clipped = np.clip(y_pred_round, 0, 2)

### Accuracy of the model

In [117]:
acc_lin = accuracy_score(y_test, y_pred_clipped)
print("Accuracy of Linear-Regression :", acc_lin)

# See a per-class breakdown
print(classification_report(y_test, y_pred_clipped, target_names=iris.target_names))

Accuracy of Linear-Regression : 0.9333333333333333
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       0.89      0.89      0.89         9
   virginica       0.89      0.89      0.89         9

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30



### Evaluation

Accuracy: 93.3%
Predicted labels: Mostly correct, but not perfect.
Evaluation: Misclassifies two of the 30 test samples (one versicolor and one virginica).

Linear regression is meant for predicting continuous values, not classes. It treats the labels 0, 1, 2 as points on a number line, and we round its output to the nearest label.
That's why:
- It works for a class that is well separated from the others (setosa).
- It struggles near the boundary between versicolor and virginica, where the classes overlap.

**It shouldn't be used for classification. It worked here only because the problem is simple and setosa is well separated.**
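
One way to see why rounding a regression output is fragile: the result depends on which integer each class happens to be assigned. The sketch below repeats the round-and-clip approach twice, once with the original encoding and once with a hypothetical re-encoding that swaps the codes of versicolor and virginica. The helper `rounded_regression_accuracy` and the `swap` array are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

def rounded_regression_accuracy(labels):
    """Fit LinearRegression on integer labels, then round and clip predictions."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=30)
    raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    pred = np.clip(np.round(raw), 0, 2).astype(int)
    return accuracy_score(y_te, pred)

# Original encoding: setosa=0, versicolor=1, virginica=2
acc_original = rounded_regression_accuracy(y)

# Hypothetical re-encoding: setosa=0, versicolor=2, virginica=1
swap = np.array([0, 2, 1])
acc_swapped = rounded_regression_accuracy(swap[y])

print("original encoding:", acc_original)
print("swapped encoding :", acc_swapped)
```

A true classifier is unaffected by how the classes are numbered; the regression approach is not, because it assumes the labels are ordered quantities.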

## 2. Logistic Regression

In [118]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create the model
log_reg = LogisticRegression(
    max_iter=200,      # allow enough iterations to converge
)

# Train (fit) on the training data
log_reg.fit(X_train, y_train)

# Predict labels for the test set
y_pred_log = log_reg.predict(X_test)

In [119]:
# Evaluate
acc_log = accuracy_score(y_test, y_pred_log)
print("Accuracy of Logistic-Regression :", acc_log)

# See a per-class breakdown
print(classification_report(y_test, y_pred_log, target_names=iris.target_names))

Accuracy of Logistic-Regression : 0.9666666666666667
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       1.00      0.89      0.94         9
   virginica       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.96        30
weighted avg       0.97      0.97      0.97        30



### Evaluation

Accuracy: 96.7%
Slightly lower recall for versicolor (0.89): one versicolor sample is predicted as virginica.

**Why this performance?**
Logistic regression draws linear boundaries between classes.

It works very well when classes are linearly separable, and Iris almost is (except for the overlap between versicolor and virginica).

**What does this tell us about the data?**
The dataset is mostly linearly separable, apart from a few cases between versicolor and virginica.
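
To see exactly which classes get confused, a confusion matrix is a compact complement to the classification report. The sketch below refits the same model on the same split; `confusion_matrix` is from `sklearn.metrics`, and the variable `cm` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=30)

log_reg = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes (label order 0, 1, 2)
cm = confusion_matrix(y_test, log_reg.predict(X_test), labels=[0, 1, 2])
print(iris.target_names.tolist())
print(cm)
```

Off-diagonal entries are the misclassifications; for Iris they are expected to fall in the versicolor/virginica cells rather than in the setosa row or column.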

## 3. Decision Tree

In [131]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

# Create the model
dec_tree = DecisionTreeClassifier(
    criterion="gini",         # or "entropy"
    max_depth=3,              # prevent overfitting
    min_samples_split=4,      # don't split a node with fewer than 4 samples
    min_samples_leaf=2,       # each leaf must have at least 2 samples
    random_state=30           # reproducible splits
)

# Fit (train) on the training data
dec_tree.fit(X_train, y_train)

# Predict labels for the test set
y_pred_dt = dec_tree.predict(X_test)

In [132]:
# Evaluate
acc_dt = accuracy_score(y_test, y_pred_dt)
print("Accuracy of Decision-Tree:", acc_dt)
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))

Accuracy of Decision-Tree: 0.9333333333333333
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       0.89      0.89      0.89         9
   virginica       0.89      0.89      0.89         9

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30



### Evaluation of this model

Accuracy: 93.3%

**Why this performance?**
Trees split on thresholds of feature values (for example, petal length < x).
In Iris, a few splits are enough to classify most points correctly, but they can still make errors on overlapping samples.

**What it says about the data:**

Decision boundaries based on specific feature thresholds are enough to distinguish most classes.
But the overlap between versicolor and virginica (classes 1 and 2) causes misclassifications.
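
The thresholds a tree has learned can be printed directly. The sketch below refits the same tree on the same split and dumps its rules with `export_text` from `sklearn.tree`; the variable `rules` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=30)

dec_tree = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_split=4,
    min_samples_leaf=2, random_state=30).fit(X_train, y_train)

# Text dump of the learned rules: one line per split threshold or leaf
rules = export_text(dec_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a test of the form `feature <= threshold`, and each `class:` line is a leaf, so the whole model can be read as a handful of if/else rules.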


## 4. Support Vector Machine

In [136]:
from sklearn.svm import SVC

# Create the model
svm = SVC(kernel='linear', C=1.0, random_state=30)

# Train the model
svm.fit(X_train, y_train)

# Predict on the test data
y_pred_svm = svm.predict(X_test)

In [137]:
# Evaluate
acc_svm = accuracy_score(y_test, y_pred_svm)
print("Accuracy of SVM :", acc_svm)
print(classification_report(y_test, y_pred_svm, target_names=iris.target_names))

Accuracy of SVM : 0.9666666666666667
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       1.00      0.89      0.94         9
   virginica       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.96        30
weighted avg       0.97      0.97      0.97        30



### Evaluation

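Accuracy: 96.7%, the same as logistic regression, with an identical per-class report.

The interpretation is the same: an SVM with a linear kernel also draws linear boundaries, so it separates setosa cleanly and misclassifies one sample in the versicolor/virginica overlap.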
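
One SVM-specific way to look at class overlap is to count support vectors, the training points that lie on or inside the margin. The sketch below refits the same linear SVM and prints the fitted `n_support_` attribute; the expectation that overlapping classes need more support vectors is a rule of thumb, not a guarantee.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=30)

svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# n_support_ holds the number of support vectors per class, in label order
for name, count in zip(iris.target_names, svm.n_support_):
    print(f"{name:<12} {count}")
```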

## 5. k-Nearest Neighbors

In [138]:
from sklearn.neighbors import KNeighborsClassifier

# Create the model
k_nn = KNeighborsClassifier(n_neighbors=5)  # 5 nearest neighbors

# Train the model (just memorizes training data)
k_nn.fit(X_train, y_train)

# Predict on test set
y_pred_k_nn = k_nn.predict(X_test)

In [139]:
# Evaluate
acc_k_nn = accuracy_score(y_test, y_pred_k_nn)
print("Accuracy of k-Nearest Neighbors :", acc_k_nn)
print(classification_report(y_test, y_pred_k_nn, target_names=iris.target_names))

Accuracy of k-Nearest Neighbors : 0.9333333333333333
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       1.00      0.78      0.88         9
   virginica       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.92        30
weighted avg       0.95      0.93      0.93        30



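### Evaluation

Accuracy: 93.3%

**What this tells us about the data:**

- The local neighborhood of setosa is clean, so it is classified perfectly.
- Versicolor and virginica overlap in feature space, so their neighborhoods are mixed: two versicolor samples are predicted as virginica (versicolor recall 0.78).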
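
The choice `n_neighbors=5` is a default rather than a tuned value. As a rough sketch of how one might compare a few values of k, the code below uses 5-fold cross-validation on the full dataset; the candidate values and the `cv_scores` dictionary are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a few illustrative values of k
cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

for k, score in cv_scores.items():
    print(f"k={k:<2} mean accuracy={score:.3f}")
```

Averaging over several folds gives a steadier estimate than a single 30-sample test set, where one flower is worth about 3.3 percentage points of accuracy.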

# Regression Problem

Similar to the previous section, but here we solve a regression problem instead of classification.

We use another built-in, real-world regression dataset: California housing (`fetch_california_housing`).

In [126]:
# Combined code for training four algorithms
 
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

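# Standardize features (important for distance-based models such as SVR and KNN)
# Note: the scaler is fit on all rows before the split, so test-set statistics influence the scaling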
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=30),
    "KNN Regressor": KNeighborsRegressor(n_neighbors=5),
    "SVR (RBF Kernel)": SVR(kernel='rbf')
}

# Train and evaluate
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results.append({
        "Model": name,
        "MSE": round(mse, 2),
        "R² Score": round(r2, 2)
    })

# Show results
df_results = pd.DataFrame(results)
print(df_results)

               Model   MSE  R² Score
0  Linear Regression  0.56      0.58
1      Decision Tree  0.50      0.62
2      KNN Regressor  0.43      0.67
3   SVR (RBF Kernel)  0.36      0.73
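
The cell above fits `StandardScaler` on the full dataset before splitting, so the test rows contribute to the scaling statistics. A common way to avoid this is to split first and wrap the scaler and model in a pipeline. The sketch below shows the pattern on synthetic stand-in data from `make_regression` (an assumption chosen to keep the example self-contained; it is not the housing dataset).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the California housing data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Split first, so the test rows stay unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_train only, then reuses it when predicting
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))   # True
print("R^2 on held-out data:", pipe.score(X_test, y_test))
```

On a dataset as large as California housing the effect on the scores is probably small, but the pipeline pattern keeps the evaluation clean and carries over unchanged to cross-validation.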


# Evaluation

1. SVR (RBF Kernel) – Best MSE = 0.36

- Most accurate model overall (R² = 0.73).

- The target is the median house value in units of $100,000, so an MSE of 0.36 corresponds to an RMSE of 0.60, or a typical error of roughly $60,000.

- The RBF kernel captures nonlinear relationships smoothly.

- It does well when the data has complex, curved, or indirect patterns.


**The data likely has nonlinear dependencies.**

2. KNN Regressor – MSE = 0.43

- Second-best performance (R² = 0.67), behind SVR.

- Predicts values based on averages of nearby training samples.

**The dataset has some local structure or clusters where nearby samples tend to have similar outputs.**

3. Decision Tree – MSE = 0.50

- Less precise than KNN or SVR.

- Predicts using stepwise thresholds, which might miss finer variations.

- Can create jumps in predictions instead of smooth transitions.

**The data has some natural thresholds or decision points. Not a strong result, but at least better than Linear Regression.**

4. Linear Regression – MSE = 0.56

- Worst performer here.

- Assumes a straight-line relationship between inputs and target.

- MSE is higher because it cannot handle curved or complex patterns.

- Still not terrible: an R² of 0.58 means a linear trend explains a good part of the variance, just not all of it.


**The model is too simple for this dataset — linear assumptions don't capture the full pattern.**
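
MSE is in squared target units, which makes it hard to interpret. Since the California housing target is the median house value expressed in units of $100,000, taking the square root gives an RMSE that can be read as a rough typical error in dollars. The sketch below does this arithmetic on the MSE values copied from the results table; the dictionary names are illustrative.

```python
import math

# MSE values copied from the results table above
mse_by_model = {
    "Linear Regression": 0.56,
    "Decision Tree": 0.50,
    "KNN Regressor": 0.43,
    "SVR (RBF Kernel)": 0.36,
}

# Target unit is $100,000, so RMSE * 100_000 is an error size in dollars
rmse_dollars = {
    name: math.sqrt(mse) * 100_000 for name, mse in mse_by_model.items()
}

for name, value in rmse_dollars.items():
    print(f"{name:<18} ~${value:,.0f}")
```

Seen this way, the gap between the best and worst model is on the order of $15,000 per prediction, which is easier to judge than the difference between 0.36 and 0.56.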