# Model Building

## Understand the ML Workflow
- **Data Preprocessing**: Preparing the data for modeling (e.g., handling missing values, scaling, splitting).
- **Model Training**: Fitting the model to the training data.
- **Model Evaluation**: Assessing the model's performance using appropriate metrics.
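
The three stages above can be sketched end to end on a small synthetic dataset. This is a minimal illustration, not part of the notebook's own analysis: the data, the variable names (`X_demo`, `y_demo`, `workflow`), and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Blank out roughly 5% of the feature values to simulate missing data
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Preprocessing, part 1: split first so the test set never influences
# the statistics learned by the imputer or the scaler
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Preprocessing, part 2 + training: impute, scale, then fit the model
workflow = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)
workflow.fit(Xd_train, yd_train)

# Evaluation: score on data the model has not seen
demo_mse = mean_squared_error(yd_test, workflow.predict(Xd_test))
print(f"Test MSE: {demo_mse:.4f}")
```

Wrapping the preprocessing steps in a pipeline means they are fitted on the training data only and then reused unchanged on the test data.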

## Apply Linear Regression on a Dataset (e.g., California Housing)
- Load the dataset.
- Split the data into training and test sets.
- Train a Linear Regression model.
- Evaluate the model using Mean Squared Error and R-squared score.


In [3]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Step 2: Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Step 3: Understand the data
print("Dataset shape:", df.shape)
print(df.head())

# Step 4: Define features (X) and target (y)
X = df.drop(columns='MedHouseVal')  # all features
y = df['MedHouseVal']               # target: median house value

# Step 5: Split data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 7: Predict on test data
y_pred = model.predict(X_test)

# Step 8: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Performance:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R²) Score: {r2:.4f}")


Dataset shape: (20640, 9)
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  

Model Performance:
Mean Squared Error (MSE): 0.5559
R-squared (R²) Score: 0.5758


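#### ✅ The R² score is the proportion of variance in the target that the model explains; closer to 1 is better. Here R² ≈ 0.58, so the model explains about 58% of the variance in median house value on the test set.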
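
To make the two metrics concrete, here is a small worked example that computes MSE and R² by hand with NumPy. The arrays `y_true` and `y_hat` are hypothetical values chosen only so the arithmetic is easy to follow; they are not taken from the housing data.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

mse_manual = ss_res / len(y_true)   # 2 / 4 = 0.5
rmse_manual = np.sqrt(mse_manual)   # same units as the target
r2_manual = 1 - ss_res / ss_tot     # 1 - 2/20 = 0.9

print(f"MSE:  {mse_manual:.4f}")
print(f"RMSE: {rmse_manual:.4f}")
print(f"R²:   {r2_manual:.4f}")
```

Taking the square root of the MSE gives an error in the same units as the target. For the MSE of 0.5559 reported above, that is about 0.746; since `MedHouseVal` is expressed in hundreds of thousands of dollars, this corresponds to a typical error of roughly $75,000.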

## Apply Logistic Regression on the Iris Dataset

### Steps:
1. **Load the Iris dataset** using `sklearn.datasets`.
2. **Preprocess the data** (e.g., convert the target to binary, train-test split).
3. **Train a Logistic Regression model** using `sklearn.linear_model.LogisticRegression`.
4. **Evaluate the model** using accuracy score, confusion matrix, or classification report.


In [4]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Binary classification example: class 0 (setosa) vs. the other two classes.
# Iris has three classes; for simplicity, relabel classes 1 and 2 as 1.
y_binary = np.where(y == 0, 0, 1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Train model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict
y_pred = logreg.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        20

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
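
The steps above also mention the confusion matrix, which the cell does not compute. A minimal sketch follows, assuming the same binary setup and split as above; the names `iris_cm`, `y_bin`, `clf`, and `cm` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Rebuild the binary problem: class 0 vs. the rest
iris_cm = load_iris()
y_bin = np.where(iris_cm.target == 0, 0, 1)

# Same split parameters as the cell above
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    iris_cm.data, y_bin, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(Xc_train, yc_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yc_test, clf.predict(Xc_test))
print(cm)
```

Diagonal entries count correct predictions and off-diagonal entries count mistakes. Given the accuracy of 1.0 reported above, the off-diagonal entries should be zero for this split. A perfect score is plausible here because class 0 (setosa) is well separated from the other two species.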

