# Scikit-learn

**Scikit-learn (sklearn)** is one of the most widely used Python libraries for classical machine learning. It provides a simple, unified interface for a wide range of tasks, from classification and regression to clustering and preprocessing.

Why Scikit-learn?

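**1. Unified Interface**: Every algorithm (e.g., Linear Regression, Decision Tree, K-Means) follows the same estimator pattern built around a small set of methods: `fit()` to learn from data, `predict()` and `score()` on predictive models, and `transform()` on preprocessing objects and other transformers. Not every estimator implements all four.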

**2. Breadth of Tools**: It covers nearly every traditional machine learning model you'll need.

**3. Integration**: It works seamlessly with NumPy arrays and Pandas DataFrames.
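
The unified interface from point 1 can be sketched as follows. This is a minimal illustration on a tiny synthetic dataset (the arrays and variable names are assumptions made up for the example): a transformer is used through `fit()` and `transform()`, and two different predictors are used through the same `fit()`, `predict()`, and `score()` calls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Tiny synthetic dataset (illustrative only): y = 3x + 2
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([5.0, 8.0, 11.0, 14.0])

# Transformers learn from data with fit() and change it with transform()
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Predictors share fit(), predict(), and score(), whatever the algorithm
scores = {}
for estimator in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    estimator.fit(X_scaled, y_demo)
    predictions = estimator.predict(X_scaled)
    scores[type(estimator).__name__] = estimator.score(X_scaled, y_demo)

print(scores)
```

Swapping one model for another only changes the class that is instantiated; the surrounding code stays the same.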

| Area | Purpose | Key Algorithms / Functions |
| :--- | :--- | :--- |
| Preprocessing | Preparing raw data for modeling. | `StandardScaler`, `MinMaxScaler`, `train_test_split` |
| Regression | Predicting a continuous numerical value (e.g., price, temperature). | `LinearRegression`, `DecisionTreeRegressor` |
| Classification | Predicting a categorical label (e.g., spam/not spam, dog/cat). | `LogisticRegression`, `KNeighborsClassifier` |
| Model Evaluation | Assessing how well your model performs. | `mean_squared_error`, `accuracy_score` |

## Scikit-learn Pipeline for Regression 
In the notation used below, y is the target variable and x1, x2, x3, ..., xn are the features. m1, m2, m3, ..., mn are the coefficients of the features x1, x2, x3, ..., xn respectively.

We will follow the same five-step process used for all Scikit-learn models:
### Model Creation Steps
- Step 1: Data Preparation and Splitting
- Step 2: Choose and Instantiate the Model
- Step 3: Train the Model (The fit Method)
- Step 4: Make Predictions (The predict Method)
- Step 5: Evaluate the Model (The score Method)
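
The five steps above can be sketched end to end in a few lines. This is a minimal outline on synthetic data from `make_regression` (the dataset, its size, and the variable names are assumptions for illustration, not the housing data used later).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: prepare and split the data (synthetic stand-in for a real dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Step 2: choose and instantiate the model
reg = LinearRegression()

# Step 3: train the model
reg.fit(X_tr, y_tr)

# Step 4: make predictions on data the model has not seen
y_hat = reg.predict(X_te)

# Step 5: evaluate (for regressors, score() returns R^2)
r2_test = reg.score(X_te, y_te)
print(f"Test R^2: {r2_test:.3f}")
```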

## 1. Linear Regression: The Foundation of Prediction 

- Linear Regression is one of the most basic and widely used supervised learning algorithms for **predicting a continuous value** (like price, age, or score).
- It finds the **best-fit straight line** through your data points to predict a continuous target variable ( y ) based on one or more feature variables ( X ).

### i. The Core Idea: Finding the Best Line

Linear Regression operates on the assumption that the relationship between your input features ( **X** ) and your output target ( **y** ) is essentially **linear** (a straight line or a flat plane).

* **Goal:** To find the mathematical equation for the straight line that best fits the existing data points. This line is called the **Regression Line**. 

### ii. The Formula (The Model)

The model is defined by its equation, which comes in a simple (one-feature) form and a multiple (several-feature) form:

| Type | Formula | Interpretation |
| :--- | :--- | :--- |
| **Simple LR** (1 Feature) | $\hat{y} = \beta_0 + \beta_1 x_1$ | $\beta_1$ is the **slope** ($m$), and $\beta_0$ is the **y-intercept** ($c$). |
| **Multiple LR** ($\ge$ 2 Features) | $\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ | The model finds a separate **coefficient ($\beta_i$)** for every input feature. |
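
As a worked example of the two formulas in the table, the snippet below evaluates them by hand with NumPy. The coefficient and feature values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 1.5                         # intercept
betas = np.array([2.0, -0.5, 4.0])   # beta_1, beta_2, beta_3

# One observation with three feature values x_1, x_2, x_3
x = np.array([3.0, 10.0, 0.25])

# Multiple LR: y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat_multiple = beta_0 + betas @ x
print(y_hat_multiple)  # 1.5 + 6.0 - 5.0 + 1.0 = 3.5

# Simple LR is the one-feature special case: y_hat = beta_0 + beta_1*x_1
y_hat_simple = beta_0 + betas[0] * x[0]
print(y_hat_simple)    # 1.5 + 6.0 = 7.5
```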

### iii. Training the Model (OLS)

The process of training the model ( **model.fit()** ) is about determining the best values for the $\beta$ coefficients ($\beta_0, \beta_1, \dots$).

* **Method:** Scikit-learn's `LinearRegression` uses the **Ordinary Least Squares (OLS)** method.
* **What it does:** OLS finds the line that minimizes the **Sum of Squared Errors (SSE)**, $\sum_i (y_i - \hat{y}_i)^2$. Each error is the vertical distance between an actual data point and the regression line. Because the errors are squared, large mistakes are penalised much more heavily than small ones.
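
To make the OLS idea concrete, here is a small sketch that solves the least-squares problem directly with NumPy's generic solver `np.linalg.lstsq` and compares the result with `LinearRegression`. The synthetic data and the helper `sse` are assumptions introduced for this example; the point is only that both routes arrive at the same coefficients and that nudging those coefficients increases the SSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 3*x1 - 2*x2 + small noise
rng = np.random.default_rng(0)
X_ols = rng.normal(size=(50, 2))
y_ols = 4.0 + 3.0 * X_ols[:, 0] - 2.0 * X_ols[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is estimated as beta_0
A = np.column_stack([np.ones(len(X_ols)), X_ols])

# Least-squares solution: the beta that minimizes ||A @ beta - y||^2 (the SSE)
beta_hat, *_ = np.linalg.lstsq(A, y_ols, rcond=None)

sk_model = LinearRegression().fit(X_ols, y_ols)

def sse(beta):
    """Sum of squared errors for a given coefficient vector (hypothetical helper)."""
    residuals = y_ols - A @ beta
    return float(residuals @ residuals)

print("lstsq  :", beta_hat)
print("sklearn:", sk_model.intercept_, sk_model.coef_)
print("SSE at optimum vs. shifted:", sse(beta_hat), sse(beta_hat + 0.05))
```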

### iv. Interpretation is Key

The values learned by the model are highly interpretable:

* **Coefficient ($\beta_i$):** This is the **most important** piece of information. It tells you how much the predicted target ($\hat{y}$) is expected to change for every one-unit increase in the feature ($\mathbf{x}_i$), with all other features held constant.
* **Intercept ($\beta_0$):** This is the baseline predicted value of $\hat{y}$ when all input features ( **X** ) are zero.
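
Both interpretations can be checked numerically. The sketch below fits a model on noise-free synthetic data (an assumption made so the numbers come out clean), then confirms that raising one feature by one unit changes the prediction by that feature's coefficient, and that the prediction at all-zero features equals the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 10 + 2*x1 - 3*x2
rng = np.random.default_rng(1)
X_int = rng.uniform(0, 5, size=(30, 2))
y_int = 10.0 + 2.0 * X_int[:, 0] - 3.0 * X_int[:, 1]

lin = LinearRegression().fit(X_int, y_int)

base = np.array([[1.0, 1.0]])
bumped = base + np.array([[1.0, 0.0]])  # x1 goes up by one unit, x2 is held fixed

# Change in prediction caused by the one-unit increase in x1
change = lin.predict(bumped)[0] - lin.predict(base)[0]

# Prediction when every feature is zero
at_zero = lin.predict(np.zeros((1, 2)))[0]

print(change, lin.coef_[0])      # both approximately 2
print(at_zero, lin.intercept_)   # both approximately 10
```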

### Simple Linear Regression (SLR)
**Formula**: y = mx + c

- Variables: One predictor (x) and one target (y).
- Coefficients: Only two parameters are learned: the intercept c and the slope m.
- Purpose: Used when you believe a single factor drives the variation in the target, or for quick exploratory analysis and visualisation.
- Interpretation: The slope m measures how much y changes for a one-unit change in x.
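
A minimal sketch of simple linear regression in Scikit-learn follows. The five data points lie exactly on y = 2x + 1 (an assumption chosen so the learned m and c are easy to verify). Note that a single feature still has to be passed as a two-dimensional array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_vals = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_vals = 2.0 * x_vals + 1.0   # exact line: m = 2, c = 1

# scikit-learn expects features with shape (n_samples, n_features)
X_single = x_vals.reshape(-1, 1)

slr = LinearRegression().fit(X_single, y_vals)
m = slr.coef_[0]       # slope
c = slr.intercept_     # y-intercept
print(f"m = {m:.2f}, c = {c:.2f}")  # m = 2.00, c = 1.00
```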

-----
### Multiple Linear Regression (MLR)
**Formula**: y = m1x1 + m2x2 + ... + mnxn + c

- Variables: y is the target (dependent variable); x1, x2, x3, ..., xn are the features.
- Coefficients: Learns an intercept ( c ) and a coefficient ( m1, m2, ..., mn ) for every single feature.
- Purpose: Predict a continuous target using multiple inputs.
- Interpretation: Each coefficient ( mi ) shows the change in y for a one-unit change in the feature xi, assuming all other features are held constant (the ceteris paribus assumption). The intercept ( c ) is the baseline value of y when all features are zero.

In [1]:
```python
import numpy as np  # Needed for np.sqrt when computing RMSE
import pandas as pd
import warnings

# Suppress warnings for clean output
warnings.filterwarnings("ignore")
```

### Step 1: Data Preparation and Splitting
A. Data Loading and Preparation (Pandas Integration)

In [2]:
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# --- A. Data Loading and Preparation (Pandas Integration) ---

# Load the California Housing Dataset
cal_housing = fetch_california_housing(as_frame=True)
df_housing = cal_housing.frame

# Define Features (X) and Target (y)
# X: All feature columns (e.g., MedInc, HouseAge, etc.),
# obtained by dropping the target column 'MedHouseVal'
X = df_housing.drop(columns=['MedHouseVal'])

# y: The column we want to predict (Median House Value)
y = df_housing['MedHouseVal']

# --- B. Train-Test Split (The Crucial Step) ---

# Split the data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42  # Ensures reproducible results
)
```
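
The effect of `test_size` and `random_state` can be seen on a much smaller table. The sketch below uses a hypothetical ten-row DataFrame (its column names are made up for the example) to show that the split sizes follow `test_size`, that the same `random_state` reproduces the same split, and that feature rows and target values stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in DataFrame (hypothetical columns, not the housing data)
df_small = pd.DataFrame({
    "feature_a": np.arange(10, dtype=float),
    "feature_b": np.arange(10, dtype=float) * 2,
    "target": np.arange(10, dtype=float) * 3,
})
X_small = df_small.drop(columns=["target"])
y_small = df_small["target"]

# Two splits with the same seed
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_small, y_small, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_small, y_small, test_size=0.2, random_state=42)

print(X_tr1.shape, X_te1.shape)          # (8, 2) (2, 2)
print(X_te1.index.equals(X_te2.index))   # True: same seed, same split
print(X_tr1.index.equals(y_tr1.index))   # True: X and y rows stay aligned
```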

In [3]:
```python
df_housing
```

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


### Step 2: Choose and Instantiate the Model

In [4]:
```python
# --- C. Model Training (Fitting the Hyperplane) ---

# 1. Instantiate the Linear Regression model
model = LinearRegression()
```
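
Instantiation is also where a model's options are set. As a small sketch, the example below inspects the default `fit_intercept` setting with `get_params()` and shows what changes when it is turned off. The three data points are an assumption chosen to lie on y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Options are fixed when the object is created and can be inspected
default_model = LinearRegression()
print(default_model.get_params()["fit_intercept"])  # True

X_pts = np.array([[1.0], [2.0], [3.0]])
y_pts = np.array([3.0, 5.0, 7.0])   # y = 2x + 1

with_intercept = LinearRegression().fit(X_pts, y_pts)
through_origin = LinearRegression(fit_intercept=False).fit(X_pts, y_pts)

# With an intercept the line is recovered; without one it is forced through (0, 0)
print(with_intercept.intercept_, with_intercept.coef_)
print(through_origin.intercept_, through_origin.coef_)
```

With `fit_intercept=False` the slope is no longer 2, because the model has to compensate for the missing intercept.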

### Step 3: Train the Model (The fit Method)

In [5]:
```python
# 2. Fit the model to the training data
model.fit(X_train, y_train)
```
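
What `fit()` changes can be illustrated on a toy array. In this sketch (data and variable names are assumptions for the example), the learned attributes `coef_` and `intercept_` do not exist before fitting, calling `predict()` too early raises `NotFittedError`, and `fit()` returns the estimator itself.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

X_fit = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y_fit = np.array([1.0, 2.0, 4.0, 8.0])

est = LinearRegression()
has_coef_before = hasattr(est, "coef_")   # learned attributes are not set yet

# Predicting before fitting is an error
try:
    est.predict(X_fit)
    raised = False
except NotFittedError:
    raised = True

returned = est.fit(X_fit, y_fit)   # fit() returns the estimator itself

print(has_coef_before, raised)     # False True
print(returned is est)             # True
print(est.coef_.shape)             # (2,): one coefficient per feature
```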

### Step 4: Make Predictions (The predict Method)

In [6]:
```python
# 3. Predict on the unseen test data
y_pred = model.predict(X_test)
```
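
For a fitted linear model, `predict()` simply applies the regression equation to each new row. The sketch below checks this on synthetic data (an assumption for illustration) by comparing `predict()` with the intercept plus the dot product of the features and the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X_known = rng.normal(size=(40, 3))
y_known = 0.5 + X_known @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.2, size=40)
X_new = rng.normal(size=(5, 3))   # rows the model has not seen

fitted = LinearRegression().fit(X_known, y_known)

via_predict = fitted.predict(X_new)
via_formula = fitted.intercept_ + X_new @ fitted.coef_   # y_hat = c + m1*x1 + ... + mn*xn

print(via_predict.shape)                      # (5,): one prediction per row
print(np.allclose(via_predict, via_formula))  # True
```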

### Step 5: Evaluate the Model (The score Method)

In [7]:
```python
# --- 1. Calculate Metrics First ---
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# --- 2. Print Performance Summary ---
print("   LINEAR REGRESSION RESULTS   ")
print(f"R² Score : {r2:.4f}  (Closer to 1.0 is better)")
print(f"RMSE     : {rmse:.4f}  (Average error in prediction)")
print(f"MSE      : {mse:.4f}")
```

```
   LINEAR REGRESSION RESULTS   
R² Score : 0.5758  (Closer to 1.0 is better)
RMSE     : 0.7456  (Average error in prediction)
MSE      : 0.5559
```
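
The three metrics are short formulas, and writing them out helps when reading the numbers. The sketch below computes MSE, RMSE, and R² by hand on synthetic data (an assumption for illustration) and checks them against `mean_squared_error`, `r2_score`, and the model's own `score()` method, which returns R² for regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(3)
X_ev = rng.normal(size=(60, 2))
y_ev = 1.0 + X_ev @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=60)

# Train on the first 40 rows, hold out the last 20 for evaluation
ev_model = LinearRegression().fit(X_ev[:40], y_ev[:40])
X_hold, y_hold = X_ev[40:], y_ev[40:]
pred = ev_model.predict(X_hold)

# MSE: mean of squared residuals; RMSE: its square root, in the target's units
mse_manual = np.mean((y_hold - pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_hold - pred) ** 2)
ss_tot = np.sum((y_hold - y_hold.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_hold, pred)))   # True
print(np.isclose(r2_manual, r2_score(y_hold, pred)))              # True
print(np.isclose(r2_manual, ev_model.score(X_hold, y_hold)))      # True
```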


In [None]:
```python
# --- 3. Interpret Model Parameters ---
print("   FEATURE INTERPRETATION (Weights)   ")
print(f"Intercept (Baseline): {model.intercept_:.4f}")
```

In [None]:
```python
# Create a clean DataFrame for coefficients
coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Weight': model.coef_
})

# Sort by weight so the largest positive weights come first
# (large negative weights end up at the bottom of the table)
coef_df = coef_df.sort_values(by='Weight', ascending=False)

# Print the table without the index numbers for a cleaner look
print(coef_df.to_string(index=False))
```
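
One caveat when ranking features by weight: a raw coefficient depends on the units of its feature, so a feature measured on a large scale can have a tiny weight and still drive most of the prediction. The sketch below illustrates this on synthetic data with two hypothetical features, `small_scale` and `large_scale`. Standardizing the features with `StandardScaler` before fitting and sorting by absolute value is one common way to compare weights, shown here as an option rather than as part of the workflow above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_sc = pd.DataFrame({
    "small_scale": rng.normal(0, 1, 200),      # standard deviation around 1
    "large_scale": rng.normal(0, 1000, 200),   # standard deviation around 1000
})
# large_scale has a tiny raw weight but causes most of the variation in y
y_sc = 1.0 * X_sc["small_scale"] - 0.01 * X_sc["large_scale"]

raw = LinearRegression().fit(X_sc, y_sc)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_sc, y_sc)

weights = pd.DataFrame({
    "Feature": X_sc.columns,
    "RawWeight": raw.coef_,
    "StandardizedWeight": scaled[-1].coef_,   # last pipeline step is the regressor
})

# Rank by magnitude so strong negative effects are not pushed to the bottom
weights = weights.sort_values(by="StandardizedWeight", key=np.abs, ascending=False)
print(weights.to_string(index=False))
```

In this toy case the raw weights rank `small_scale` first, while the standardized weights rank `large_scale` first.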