In [2]:
import pandas as pd

In [4]:
import numpy as np

In [6]:
# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [8]:
train_data

Unnamed: 0,Id,LotArea,OverallQual,YearBuilt,GrLivArea,FullBath,BedroomAbvGr,SalePrice
0,1,8450,7,2003,1710,2,3,208500
1,2,9600,6,1976,1262,2,3,181500
2,3,11250,7,2001,1786,2,3,223500
3,4,9550,7,1915,1717,1,3,140000
4,5,14260,8,2000,2198,2,4,250000
5,6,14115,5,1993,1362,1,2,143000
6,7,10084,5,2004,1442,2,3,307000
7,8,10382,7,1973,1575,2,3,200000
8,9,6120,6,1931,1629,1,2,129900
9,10,7420,4,1939,1040,1,2,118000


In [10]:
test_data

Unnamed: 0,Id,LotArea,OverallQual,YearBuilt,GrLivArea,FullBath,BedroomAbvGr
0,1461,11622,6,1961,896,1,2
1,1462,14267,5,1958,1329,1,2
2,1463,13830,7,1997,1629,2,3
3,1464,9978,7,1998,1604,2,3
4,1465,5005,5,1992,812,1,2
5,1466,10000,5,1970,1280,1,2
6,1467,7980,6,1998,1400,2,3
7,1468,11250,7,1997,1494,2,3
8,1469,8480,6,1970,1316,1,2
9,1470,8533,6,1970,1118,1,2


In [16]:
# Explore the data
print("Train data shape:", train_data.shape)
print("Test data shape:", test_data.shape)
print("Missing values in the training data:", train_data.isnull().sum().sort_values(ascending=False))

Train data shape: (10, 8)
Test data shape: (10, 7)
Missing values in the training data: Id              0
LotArea         0
OverallQual     0
YearBuilt       0
GrLivArea       0
FullBath        0
BedroomAbvGr    0
SalePrice       0
dtype: int64


In [28]:
# Drop irrelevant features (e.g., 'Id') if they exist
if 'Id' in train_data.columns:
    train_data.drop(columns=['Id'], inplace=True)

# Save test data IDs for final submission if 'Id' exists
if 'Id' in test_data.columns:
    test_data_ids = test_data['Id']  # Store test IDs separately
    test_data.drop(columns=['Id'], inplace=True)
else:
    test_data_ids = None  # Handle case where 'Id' is missing


In [30]:
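# Fill missing values
# For numerical columns, use each frame's own median
# (numeric_only=True avoids errors if non-numeric columns are present)
train_data.fillna(train_data.median(numeric_only=True), inplace=True)
test_data.fillna(test_data.median(numeric_only=True), inplace=True)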
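A minimal sketch of median imputation on toy frames (the column values are made up, not taken from the dataset). It assumes one common variant: the medians are computed on the training frame only and reused for the test frame, whereas the cell above fills each frame with its own medians.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical values (not the notebook's data)
toy_train = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 12000.0],
    "Street": ["Pave", "Grvl", "Pave"],
})
toy_test = pd.DataFrame({
    "LotArea": [np.nan, 9000.0],
    "Street": ["Pave", "Pave"],
})

# numeric_only=True skips the string column instead of raising an error
train_medians = toy_train.median(numeric_only=True)

# fillna with a Series fills each column using the value with the matching label
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)  # reuse the training medians

print(toy_train_filled["LotArea"].tolist())
print(toy_test_filled["LotArea"].tolist())
```

Reusing the training medians keeps the test set from influencing preprocessing, which mirrors how the scaler is handled later in the notebook.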

In [44]:
# One-hot encode categorical features
train_data = pd.get_dummies(train_data, drop_first=True)
test_data = pd.get_dummies(test_data, drop_first=True)

In [46]:
# Align train and test data so that test_data has the same columns as train_data.
# join='left' keeps train_data's columns; any column missing from test_data
# (including SalePrice) is added to it as NaN.
train_data, test_data = train_data.align(test_data, join='left', axis=1)

# This ensures that both datasets have the same columns (features) after one-hot encoding.
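To see what the alignment does, here is a small sketch on toy frames (hypothetical values; `Street` is used only as an example categorical column). It also passes `fill_value=0` to `align`, which is an assumption added here so that dummy columns missing from the test frame become 0 instead of NaN.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "Street": ["Grvl", "Grvl"],  # only one category appears in the test frame
})

toy_train = pd.get_dummies(toy_train, drop_first=True)
toy_test = pd.get_dummies(toy_test, drop_first=True)
print(list(toy_test.columns))  # the Street dummy is missing from the test frame

# Left join on columns: the test frame gains every column the train frame has
aligned_train, aligned_test = toy_train.align(
    toy_test, join="left", axis=1, fill_value=0
)

# The target is not a feature, so remove it from the test frame again
aligned_test = aligned_test.drop(columns=["SalePrice"], errors="ignore")
print(list(aligned_train.columns))
print(list(aligned_test.columns))
```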

In [48]:
# Ensure the target variable 'SalePrice' is not part of the test data
test_data = test_data.drop(columns=['SalePrice'], errors='ignore')

In [58]:
# Separate features (X) and target variable (y)
x = train_data.drop(columns=['SalePrice'])
y = train_data['SalePrice']

In [67]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)
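As a quick check of what this split produces on a 10-row dataset like the one loaded above, the sketch below uses placeholder arrays (`X_demo` and `y_demo` are hypothetical, not the notebook's data) and prints the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 10 rows, matching the row count of train_data
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("training rows:", X_tr.shape[0])
print("validation rows:", X_va.shape[0])
```

With only two validation rows, any metric computed on the validation set is very noisy, which is worth keeping in mind when comparing models later.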

In [None]:
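# Feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# The split above created x_train / x_val (lowercase), so those names are used here
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
test_data = scaler.transform(test_data)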

### **Feature Scaling with `StandardScaler`**
```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
test_data = scaler.transform(test_data)
```
This code **scales the numerical features** in `X_train`, `X_val`, and `test_data` using **Standardization (Z-score normalization)**.

---

### **Why Feature Scaling?**
Models that are trained with gradient descent, use regularization, or rely on distances (for example **neural networks, SVMs, k-nearest neighbors, and regularized linear or logistic regression**) usually perform better when numerical features have **similar scales**. Otherwise, features with larger values can dominate learning. Tree-based models and ordinary least-squares `LinearRegression` are largely insensitive to feature scale.  

**StandardScaler** ensures that:
- Each feature has a **mean of 0** and a **standard deviation of 1** on the training data.
- No feature dominates simply because it is measured in **larger units**.

---

### **Breaking Down the Code**
#### **Step 1: Create a `StandardScaler` object**
```python
scaler = StandardScaler()
```
- `StandardScaler()` is from `sklearn.preprocessing`.
- It computes the **mean** and **standard deviation** of each feature.

#### **Step 2: Fit and Transform `X_train`**
```python
X_train = scaler.fit_transform(X_train)
```
- `fit_transform(X_train)` does two things:
  1. **Computes mean and standard deviation** of `X_train`.
  2. **Applies transformation**:  
     $$
     X_{\text{scaled}} = \frac{X - \mu}{\sigma}
     $$
     Where:
     - $X$ = original feature value  
     - $\mu$ = mean of feature  
     - $\sigma$ = standard deviation  

#### **Step 3: Transform `X_val` and `test_data`**
```python
X_val = scaler.transform(X_val)
test_data = scaler.transform(test_data)
```
- We **only transform** `X_val` and `test_data`, using the **same mean and standard deviation from `X_train`**.
- We **do not use `.fit_transform()`** here because:
  - **Fitting on validation/test data would cause data leakage.**
  - We need to apply the same scaling as `X_train` for consistency.
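The fit/transform split described above can be sketched on a tiny array (the numbers are made up for illustration). The validation row is scaled with the mean and standard deviation learned from the training rows, not with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_train_demo = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])
X_val_demo = np.array([[4000.0, 5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std, then scale
X_val_scaled = scaler.transform(X_val_demo)          # reuse the training mean/std

print("training means:", scaler.mean_)
print("training std devs:", scaler.scale_)  # population std (ddof=0)
print("scaled train mean:", X_train_scaled.mean(axis=0))
print("scaled train std:", X_train_scaled.std(axis=0))
print("scaled validation row:", X_val_scaled)
```

The scaled validation row falls outside the range of the scaled training data, which is expected: it is measured against the training distribution.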

---

### **Example Before and After Scaling**
#### **Original Data (`X_train`)**
| Feature  | Value |
|----------|-------|
| Age      | 50    |
| Salary   | 100000 |
| Height   | 175   |

#### **After Standard Scaling (`X_train`)**
| Feature  | Scaled Value |
|----------|-------------|
| Age      | 0.12        |
| Salary   | 1.56        |
| Height   | -0.43       |

Now, all features are on a **comparable scale** (the scaled values shown are illustrative).

---

### **Key Takeaways**
✅ **Feature scaling is important** for ML models that rely on gradient-based optimization.  
✅ `StandardScaler()` **scales numerical features** by making them have a mean of `0` and standard deviation of `1`.  
✅ `fit_transform(X_train)` **computes and applies scaling** based on training data.  
✅ `transform(X_val)` and `transform(test_data)` **apply the same scaling** without recomputing, preventing data leakage.  

For scale-sensitive models, this step can **improve model performance** and **speed up convergence**. 🚀

In [76]:
# Train a Linear Regression model

from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(x_train, y_train)

In [None]:
# Train a Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_model.fit(x_train, y_train)  # lowercase x_train, as created by train_test_split


### **Training a Decision Tree Regressor**
```python
dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
```
This code trains a **Decision Tree Regressor** to predict a target variable `y_train` based on features in `X_train`.

---

### **Step-by-Step Explanation**
#### **1️⃣ Create a Decision Tree Regressor**
```python
dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
```
- `DecisionTreeRegressor()` is from `sklearn.tree`.
- It creates a decision tree model that learns patterns in the data to make predictions.

**Parameters:**
- `max_depth=5`:  
  ✅ Limits the depth of the tree to **5 levels** to prevent overfitting.  
  ✅ A deeper tree captures more details but may overfit.  
  ✅ A shallow tree generalizes better but might underfit.
  
- `random_state=42`:  
  ✅ Ensures reproducibility by setting a fixed random seed.
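To make the effect of `max_depth` concrete, the sketch below fits a shallow and an unrestricted tree on synthetic data (a noisy sine curve, chosen here purely for illustration) and compares their training error. The data and variable names are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(80, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=80)

shallow = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)
deep = DecisionTreeRegressor(max_depth=None, random_state=42).fit(X_demo, y_demo)

shallow_mse = mean_squared_error(y_demo, shallow.predict(X_demo))
deep_mse = mean_squared_error(y_demo, deep.predict(X_demo))

print("shallow depth:", shallow.get_depth(), "train MSE:", round(shallow_mse, 4))
print("deep depth:", deep.get_depth(), "train MSE:", round(deep_mse, 4))
```

The unrestricted tree reproduces the training targets, noise included, which is the memorization that a depth limit is meant to curb. Whether a given depth generalizes well still has to be checked on held-out data.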

---

#### **2️⃣ Train the Model**
```python
dt_model.fit(X_train, y_train)
```
- **Fitting** means the model learns patterns from `X_train` (features) and `y_train` (target).
- It recursively splits data into **smaller regions** based on the best feature-value splits.
- The model minimizes the error (e.g., Mean Squared Error) at each split.

---

### **How a Decision Tree Works**
1. **Select the Best Feature to Split On**  
   - The model picks a feature that **best separates the data** (based on minimizing variance).
   
2. **Split Data Recursively**  
   - The dataset is split into branches based on feature values.
   - This process continues until a **stopping criterion** (like `max_depth=5`) is met.

3. **Make Predictions**  
   - Each leaf node contains the **average target value** of the samples in that region.

---

### **Example: Predicting House Prices 🏡**
#### **Training Data (`X_train` & `y_train`)**
| Square Feet | Bedrooms | Price (y_train) |
|------------|---------|----------------|
| 1200       | 2       | 250,000        |
| 1800       | 3       | 350,000        |
| 2200       | 4       | 450,000        |

#### **How the Decision Tree Learns**
1. **Splits data** at `Square Feet ≤ 1500`
   - Left branch: Small houses → Average Price = 250,000
   - Right branch: Larger houses → Further splits

2. **Splits further** based on `Bedrooms`
   - More branches refine price predictions.

#### **Final Tree (depth 2, well within `max_depth=5`)**
```
         Square Feet ≤ 1500?
        /                  \
   Yes (250K)         No (Bedrooms ≤ 3?)
                     /            \
                Yes (350K)     No (450K)
```
Now, if we give a **new house (2000 sq ft, 3 beds)**, the tree predicts **350,000**.
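The toy example can be checked directly with scikit-learn. The sketch below fits a tree on the three rows from the table and predicts the new house. Because several candidate splits tie on such a small dataset, the thresholds printed by `export_text` may differ from the hand-drawn tree, but the prediction for the new house comes out the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# The three training rows from the table: [square feet, bedrooms]
X_toy = np.array([[1200, 2], [1800, 3], [2200, 4]])
y_toy = np.array([250_000, 350_000, 450_000])

toy_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
toy_tree.fit(X_toy, y_toy)

# Text rendering of the learned splits
print(export_text(toy_tree, feature_names=["SquareFeet", "Bedrooms"]))

# New house: 2000 sq ft, 3 bedrooms
new_house = np.array([[2000, 3]])
prediction = toy_tree.predict(new_house)
print("predicted price:", prediction[0])
```

With three rows the tree needs only two levels to isolate every sample, so `max_depth=5` never becomes the limiting factor here.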

---

### **Key Takeaways**
✅ `DecisionTreeRegressor` **learns patterns** by splitting data based on feature values.  
✅ `max_depth=5` **limits the complexity**, which reduces the risk of overfitting.  
✅ `.fit(X_train, y_train)` trains the model to make predictions.  
✅ **Used for regression problems**, like predicting house prices, stock values, or sales revenue.

🚀 Now, you can use `dt_model.predict(X_val)` to test the model!

In [89]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Train Decision Tree Regressor
dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_model.fit(x_train, y_train)

# Define evaluation function
def evaluate_model(model, x, y_true, model_name):
    y_pred = model.predict(x)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - MSE: {mse:.2f}, R²: {r2:.2f}")
    return y_pred

# Evaluate models
print("\nModel Evaluation on Validation Set:")
lr_predictions = evaluate_model(lr_model, x_val, y_val, "Linear Regression")
dt_predictions = evaluate_model(dt_model, x_val, y_val, "Decision Tree")



Model Evaluation on Validation Set:
Linear Regression - MSE: 1280734363.59, R²: -0.92
Decision Tree - MSE: 7960930000.00, R²: -10.96


Let's break down the code **line by line** to understand how it works:

---

## **Importing Required Libraries**
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
```
- `DecisionTreeRegressor` → Creates and trains a **Decision Tree model** for regression tasks.
- `mean_squared_error` → Measures how far the predicted values are from the actual values.
- `r2_score` → Evaluates how well the model explains the variance in the data.
- `StandardScaler` → Used for **feature scaling** (though it's not used in this code snippet).

---

## **Training a Decision Tree Regressor**
```python
dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_model.fit(x_train, y_train)
```
- Creates a **Decision Tree Regressor model**.
- `max_depth=5` → Limits the depth of the tree to **reduce overfitting**.
- `random_state=42` → Ensures **reproducibility** by fixing the seed used when features are shuffled and ties between equally good splits are broken.
- `.fit(x_train, y_train)` → Trains the model using the **training features (`x_train`)** and **target variable (`y_train`)**.

---

## **Defining an Evaluation Function**
```python
def evaluate_model(model, x, y_true, model_name):
    y_pred = model.predict(x)  # Predict target values
    mse = mean_squared_error(y_true, y_pred)  # Compute Mean Squared Error
    r2 = r2_score(y_true, y_pred)  # Compute R² Score
    
    print(f"{model_name} - MSE: {mse:.2f}, R²: {r2:.2f}")  # Print evaluation metrics
    
    return y_pred  # Return predicted values
```
- **Purpose:** Evaluates a trained model on a given dataset (`x`).
- `model.predict(x)` → Uses the trained model to make predictions.
- `mean_squared_error(y_true, y_pred)` → Calculates the **Mean Squared Error (MSE)**:
  - Measures the **average squared difference** between actual and predicted values.
  - Lower MSE = **Better model**.
- `r2_score(y_true, y_pred)` → Calculates the **R² score**:
  - Ranges from `-∞ to 1` (closer to `1` is better).
  - Represents the proportion of variance explained by the model.
- `print(f"{model_name} - MSE: {mse:.2f}, R²: {r2:.2f}")` → Displays **MSE and R² score**.
- `return y_pred` → Returns predicted values for further use.
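Both metrics are easy to reproduce by hand, which also shows how R² can become negative. The sketch below uses made-up targets and predictions (`y_good` and `y_bad` are hypothetical) and compares manual formulas against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_good = np.array([110.0, 190.0, 310.0])  # close to the targets
y_bad = np.array([300.0, 100.0, 0.0])     # worse than predicting the mean

def mse_by_hand(y, y_hat):
    """Average of the squared residuals."""
    return np.mean((y - y_hat) ** 2)

def r2_by_hand(y, y_hat):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for name, y_hat in [("good", y_good), ("bad", y_bad)]:
    print(name, "MSE:", mse_by_hand(y_true, y_hat), "R2:", r2_by_hand(y_true, y_hat))
    print(name, "sklearn MSE:", mean_squared_error(y_true, y_hat),
          "sklearn R2:", r2_score(y_true, y_hat))
```

A negative R² means the residual sum of squares exceeds the total sum of squares, so the model predicts worse than a constant equal to the mean of `y_true`.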

---

## **Evaluating Models**
```python
print("\nModel Evaluation on Validation Set:")
lr_predictions = evaluate_model(lr_model, x_val, y_val, "Linear Regression")
dt_predictions = evaluate_model(dt_model, x_val, y_val, "Decision Tree")
```
- **Prints a heading** to indicate model evaluation.
- Calls `evaluate_model()` **for Linear Regression (`lr_model`)**:
  - `lr_model` must already be defined; here it was trained in the Linear Regression cell earlier. Without that cell, this call raises a `NameError`.
- Calls `evaluate_model()` **for Decision Tree (`dt_model`)**:
  - Evaluates the trained Decision Tree on **validation data (`x_val, y_val`)**.
  - Prints **MSE and R² score** for performance comparison.

---

### **Example Output (Hypothetical)**
```
Model Evaluation on Validation Set:
Linear Regression - MSE: 5000.23, R²: 0.85
Decision Tree - MSE: 4500.76, R²: 0.89
```
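- **Lower MSE is better** (in this hypothetical output, the Decision Tree performs slightly better).
- **Higher R² is better** (here the Decision Tree explains 89% of variance, better than Linear Regression).
- The actual run above is different: both models have a **negative R²** (-0.92 and -10.96), meaning they predict worse than a constant mean, and Linear Regression has the lower MSE. With 10 rows split 80/20, only 2 rows are used for validation, so these scores are very noisy.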

---

### **Key Fixes & Next Steps**
✅ Ensure `lr_model` is trained before evaluation:
```python
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
```
✅ Make sure `x_train`, `x_val`, `y_train`, and `y_val` **exist and are correctly preprocessed**.

🚀 **Now your code should run correctly!**

In [92]:
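# Predict house prices for the test set with the Decision Tree.
# Note: on the validation set above, Linear Regression actually scored better
# (lower MSE, higher R²), although with only 2 validation rows the comparison is unreliable.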
final_predictions = dt_model.predict(test_data)

In [94]:

# Save the predictions to a CSV file
submission = pd.DataFrame({'Id': test_data_ids, 'SalePrice': final_predictions})
submission.to_csv('submission.csv', index=False)

print("\nPredictions saved to 'submission.csv'")


Predictions saved to 'submission.csv'


In [96]:
import pickle
with