# In this file, we train some baseline models without hyperparameter tuning

## **Data Preprocessing Plan for House Price Prediction**

#### **1. One-Hot Encoding for Categorical Columns**
We will apply **one-hot encoding** to the following categorical columns to avoid introducing artificial ordinal relationships:
- **`sector`** (Neighborhood identifier)
- **`balcony`** (Number of balconies, treated categorically)
- **`agePossession`** (Property age category)
- **`furnishing_type`** (Furnishing level)
- **`luxury_category`** (Luxury classification)
- **`floor_category`** (Floor level category)


---

#### **2. Feature Scaling**
Numerical features (`bedRoom`, `bathroom`, `built_up_area`, etc.) will be standardized using **StandardScaler** to ensure equal contribution to the regression model.
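
A small sketch of what standardization does, using made-up values for two numerical columns (the numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (e.g. room count vs. area), made-up values
toy = np.array([
    [2.0, 850.0],
    [3.0, 1226.0],
    [2.0, 1000.0],
    [3.0, 1615.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

This matters most for scale-sensitive models such as SVR; tree-based models are largely unaffected by it.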

---

#### **3. Target Variable Transformation**
The `price` column (in crores) is right-skewed. We will apply:
```python
df["log_price"] = np.log1p(df["price"])
```
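This reduces the skew and brings the target closer to a normal distribution, which usually helps regression models. Predictions can be mapped back to the original scale with `np.expm1`.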
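
To see the effect, here is a minimal sketch on synthetic right-skewed data. The log-normal sample is an assumption standing in for real prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal sample, not the real dataset)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

log_prices = np.log1p(prices)

# Skewness drops after the transform
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# expm1 is the exact inverse of log1p, so the original values are recoverable
recovered = np.expm1(log_prices)
```

`log1p` computes `log(1 + x)`, which stays well defined at `x = 0`, and `expm1` undoes it.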



In [19]:
import numpy as np
import pandas as pd

In [20]:
df = pd.read_csv('gurgaon_properties_post_feature_selection.csv')

In [21]:
df.head()

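| Unnamed: 0 | property_type | sector | bedRoom | bathroom | balcony | agePossession | built_up_area | servant room | store room | furnishing_type | luxury_category | floor_category | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | 3.0 | 2.0 | 2 | 1 | 850.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.82 |
| 1 | 0 | 95 | 2.0 | 2.0 | 2 | 1 | 1226.0 | 1.0 | 0.0 | 0.0 | 1 | 2 | 0.95 |
| 2 | 0 | 103 | 2.0 | 2.0 | 1 | 1 | 1000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0.32 |
| 3 | 0 | 99 | 3.0 | 4.0 | 4 | 3 | 1615.0 | 1.0 | 0.0 | 1.0 | 0 | 2 | 1.6 |
| 4 | 0 | 5 | 2.0 | 2.0 | 1 | 3 | 582.0 | 0.0 | 1.0 | 0.0 | 0 | 2 | 0.48 |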


In [22]:
X = df.drop(columns=['price'])
y = df['price']

In [23]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR

In [24]:
columns_to_encode = ['sector', 'balcony', 'agePossession', 'furnishing_type', 'luxury_category', 'floor_category']

Since the `price` column is right-skewed, we apply a log transformation to it. The same transformation can be applied to other columns as well if they are right-skewed.

In [25]:
# Applying the log1p transformation to the target variable
y_transformed = np.log1p(y)

### **What is ColumnTransformer?**  

**`ColumnTransformer`** is a **scikit-learn** class that allows you to apply different preprocessing steps to different columns of your dataset **in a single step**. It is especially useful when:  
- Your dataset contains **mixed data types** (numerical + categorical).  
- Different columns require **different transformations** (e.g., scaling for numbers, one-hot encoding for categories).  
- You want to **avoid manual column handling** and ensure consistency in train-test splits.  

---

### **How ColumnTransformer Works**  
It takes a list of **transformers** (e.g., `StandardScaler`, `OneHotEncoder`) and applies each to **specific columns**.  

#### **Key Parameters**  
| Parameter | Description | Example |
|-----------|------------|---------|
| `transformers` | List of (`name`, `transformer`, `columns`) tuples | `('num', StandardScaler(), ['age', 'income'])` |
| `remainder` | What to do with columns not specified (`'passthrough'`, `'drop'`) | `remainder='passthrough'` |
| `sparse_threshold` | Control output sparsity (default: `0.3`) | `sparse_threshold=0` (dense output) |
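
The effect of `remainder` and `sparse_threshold` can be sketched on a toy frame. The column names and the `build` helper below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: 'row_id' is deliberately not listed in any transformer
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "city": ["A", "B", "A"],
    "row_id": [0, 1, 2],
})

def build(remainder):
    """Hypothetical helper: same transformers, different remainder policy."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age"]),
            ("cat", OneHotEncoder(), ["city"]),
        ],
        remainder=remainder,
        sparse_threshold=0,  # always return a dense NumPy array
    )

dropped = build("drop").fit_transform(toy)        # unlisted columns are discarded
kept = build("passthrough").fit_transform(toy)    # unlisted columns are appended unchanged

print(dropped.shape)  # (3, 3): 1 scaled column + 2 one-hot columns
print(kept.shape)     # (3, 4): the same, plus 'row_id'
```

The default is `remainder='drop'`, so with `'passthrough'` it is worth checking that no identifier-like column slips into the features.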

---

### **Example Breakdown**
#### **1. Without ColumnTransformer (Manual Approach)**
```python
# Separate numerical and categorical columns
num_cols = ['age', 'income']
cat_cols = ['gender', 'city']

# Apply transformations manually
scaler = StandardScaler()
X_num = scaler.fit_transform(X[num_cols])

encoder = OneHotEncoder()
X_cat = encoder.fit_transform(X[cat_cols])

# Combine results (requires careful alignment)
X_processed = np.hstack([X_num, X_cat.toarray()])
```

#### **2. With ColumnTransformer (Clean & Automated)**
```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ],
    remainder='passthrough'
)

# All transformations applied in one step
X_processed = preprocessor.fit_transform(X)
```

---

### **Why Use ColumnTransformer?**  
✅ **Helps avoid data leakage** (inside a `Pipeline`, transformations are refit on the training folds only during cross-validation).  
✅ **Cleaner code** (no manual column splitting/joining).  
✅ **Works seamlessly with Pipelines** (critical for production ML).  

---

### **Common Use Cases**  
1. **Mixed Data Types**:  
   - Scale numerical features + encode categorical ones.  
2. **Feature Selection**:  
   - Apply transformations only to relevant columns.  
3. **Custom Transformations**:  
   - Combine different preprocessing steps (e.g., log-transform some numerical columns).  
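
For the custom-transformation case, one possible sketch uses `FunctionTransformer` to log-transform a single column while scaling another. The toy values and the name `preprocessor_sketch` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame with made-up values
toy = pd.DataFrame({
    "built_up_area": [850.0, 1226.0, 1000.0, 1615.0],
    "bedRoom": [3.0, 2.0, 2.0, 3.0],
})

preprocessor_sketch = ColumnTransformer(
    transformers=[
        # Wrap a plain NumPy function so it behaves like a transformer
        ("log", FunctionTransformer(np.log1p), ["built_up_area"]),
        ("num", StandardScaler(), ["bedRoom"]),
    ]
)

out = preprocessor_sketch.fit_transform(toy)
print(out)  # column 0: log1p(built_up_area), column 1: standardized bedRoom
```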

---

### **Integration with Pipeline**  
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

pipeline = Pipeline([
    ('preprocessor', preprocessor),       # Preprocessing (the ColumnTransformer defined above)
    ('model', RandomForestRegressor())    # Model (a regressor, since price is continuous)
])
```

---

### **Key Takeaways**  
- **ColumnTransformer = Different preprocessing for different columns**.  
- **Safer than manual preprocessing** (especially with train-test splits).  
- **A common building block for production-ready ML pipelines**.  


In [26]:
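# Creating a column transformer for preprocessing
# Note: remainder='passthrough' keeps every column not listed below unchanged,
# which here includes the 'Unnamed: 0' index column left over from the CSV.
preprocessor = ColumnTransformer(
    transformers = [
        ('num', StandardScaler(), ['property_type', 'bedRoom', 'bathroom', 'built_up_area', 'servant room', 'store room']),
        ('cat', OneHotEncoder(drop='first'), columns_to_encode)
    ],
    remainder = 'passthrough'
)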

### **What is a Pipeline in Machine Learning?**  

A **Pipeline** in `scikit-learn` is a way to **chain multiple data processing steps and a model together** into a single object. It ensures that all steps (e.g., preprocessing, feature scaling, model training) are applied **automatically and in the correct order** when fitting or predicting.  

---

### **Why Use a Pipeline?**  
✅ **Avoids Data Leakage** – Prevents test data from influencing preprocessing steps (e.g., scaling is fit only on training data).  
✅ **Simplifies Code** – Combines preprocessing + modeling into one step.  
✅ **Ensures Consistency** – Same transformations are applied during training and prediction.  
✅ **Works with Cross-Validation** – Correctly handles preprocessing in each fold.  
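
The leakage point can be checked directly on a tiny example: after fitting, the scaler inside the pipeline holds statistics from the training rows only. The data below is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the test point is far from the training range on purpose
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([1.0, 2.0, 3.0])
X_test = np.array([[100.0]])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The scaler's mean comes from the training rows alone
print(pipe.named_steps["scaler"].mean_)  # [2.]

# predict() reuses the stored mean/scale; it does not refit on the test data
pred = pipe.predict(X_test)
print(pred)
```

`cross_val_score` clones and refits the whole pipeline on each training fold, so the same guarantee holds per fold.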

---

### **How Your Pipeline Works**  
Your example:  
```python
pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Step 1: Apply preprocessing (scaling, encoding)
    ('regressor', SVR(kernel='rbf')) # Step 2: Train an SVR model
])
```

#### **Step-by-Step Execution:**  
1. **`preprocessor`** (ColumnTransformer):  
   - Scales numerical features (`StandardScaler`).  
   - One-hot encodes categorical features (`OneHotEncoder`).  
2. **`regressor`** (SVR):  
   - Fits a Support Vector Regression model with an RBF kernel.  

When you call:  
```python
pipeline.fit(X_train, y_train)  # Applies preprocessing THEN trains SVR
pipeline.predict(X_test)        # Applies same preprocessing THEN predicts
```

---

### **Key Benefits of Your Pipeline**  
1. **No Manual Steps** – No need to separately preprocess data before fitting the model.  
2. **Prevents Bugs** – Guarantees that the same transformations are applied to new data.  
3. **Deployment-Ready** – Export the entire pipeline (preprocessing + model) as one object.  
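
As a sketch of the export step, a fitted pipeline can be saved and reloaded with `joblib`. The synthetic data and the file name `pipeline.joblib` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for the real features and target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=50)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", SVR(kernel="rbf")),
]).fit(X_toy, y_toy)

# Save preprocessing + model as one object, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions
same = np.allclose(pipe.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that saved them.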

---

### **Example Without vs. With Pipeline**  

#### **❌ Without Pipeline (Error-Prone)**  
```python
# Manually preprocess training data (scaler, encoder, num_cols, cat_cols as defined earlier)
X_train_scaled = scaler.fit_transform(X_train[num_cols])
X_train_encoded = encoder.fit_transform(X_train[cat_cols])
# OneHotEncoder returns a sparse matrix by default, so densify before stacking
X_train_processed = np.hstack([X_train_scaled, X_train_encoded.toarray()])

# Train model
model = SVR(kernel='rbf').fit(X_train_processed, y_train)

# Preprocess test data (must repeat steps!)
X_test_scaled = scaler.transform(X_test[num_cols])  # Risk: calling fit_transform() here by mistake
X_test_encoded = encoder.transform(X_test[cat_cols])
X_test_processed = np.hstack([X_test_scaled, X_test_encoded.toarray()])

# Predict
y_pred = model.predict(X_test_processed)
```

#### **✅ With Pipeline (Clean & Safe)**  
```python
pipeline.fit(X_train, y_train)  # Preprocessing + training in one line
y_pred = pipeline.predict(X_test)  # Automatic preprocessing
```

---

### **When to Use Pipelines?**  
- **Any ML workflow** (classification/regression).  
- **Complex preprocessing** (e.g., imputation + scaling + encoding).  
- **Hyperparameter tuning** (use `GridSearchCV` on the pipeline).  

---

### **Advanced: Tuning Hyperparameters in a Pipeline**  
```python
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler

params = {
    'preprocessor__num': [StandardScaler(), MinMaxScaler()],  # Swap the transformer registered as 'num'
    'regressor__C': [0.1, 1, 10]  # Tune SVR's regularization
}

grid_search = GridSearchCV(pipeline, params, cv=5)
grid_search.fit(X_train, y_train)
```
This searches over different scalers **and** SVR parameters while keeping preprocessing intact!
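
Grid keys must match the parameter names the pipeline exposes, and a quick way to check them is `get_params()`. The sketch below assumes a small pipeline with steps named `preprocessor` and `regressor`; the column names are placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Unfitted pipeline; the column names are placeholders and are not used until fit()
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["built_up_area"]),
    ("cat", OneHotEncoder(), ["sector"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", SVR(kernel="rbf")),
])

# Every key listed here is a valid name for a GridSearchCV parameter grid
tunable = sorted(pipeline.get_params().keys())
for name in tunable:
    if name.startswith(("preprocessor__num", "regressor__C")):
        print(name)
```

Names are built as `<step>__<parameter>`, with one more `__` level for each nested object.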

---

### **Summary**  
Your `Pipeline`:  
1. **Preprocesses data** (scaling + encoding).  
2. **Trains an SVR model** in one seamless flow.  
3. **Ensures reproducibility** and avoids data leakage.  


## Linear Regression

In [37]:
# Creating a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

In [38]:
# K-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y_transformed, cv=kfold, scoring='r2')

In [39]:
scores.mean()

np.float64(0.8558123475287653)

The mean R² score is about 0.86, which is a promising baseline. Since this is without any hyperparameter tuning, there is room for the score to improve further.

In [40]:
scores.std()

np.float64(0.015558381492447235)

The variation across folds is also small (standard deviation ≈ 0.016).

In [41]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y_transformed,test_size=0.2,random_state=42)

In [42]:
pipeline.fit(X_train,y_train)

In [43]:
y_pred = pipeline.predict(X_test)

In [55]:
# Invert the log1p transform: expm1(log1p(y)) = y, so predictions are back on the original price scale
y_pred = np.expm1(y_pred)

In [45]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(np.expm1(y_test),y_pred)

0.6483879307359236

This means that, on average, the model's predictions are off by about 0.65 crore (roughly 65 lakhs).
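
An alternative to transforming the target by hand is scikit-learn's `TransformedTargetRegressor`, which applies `log1p` before fitting and `expm1` after predicting. This is not what the notebook does; the sketch below uses synthetic data only to show that the two routes agree.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic positive, right-skewed target (not the real dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.exp(0.5 * X_toy[:, 0] + 0.05 * rng.normal(size=200))

# Wrapper: fits on log1p(y), returns predictions already mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_toy, y_toy)
pred = model.predict(X_toy)

# Manual route, as in the cells above: transform, fit, predict, invert
manual = LinearRegression().fit(X_toy, np.log1p(y_toy))
manual_pred = np.expm1(manual.predict(X_toy))

# Both give the same predictions, so the MAE is on the original scale either way
print(mean_absolute_error(y_toy, pred))
print(np.allclose(pred, manual_pred))
```

With the wrapper, there is no separate inverse step to forget before computing the error.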

## SVR

In [46]:
# Creating a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel='rbf'))
])

In [47]:
# K-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y_transformed, cv=kfold, scoring='r2')

In [48]:
scores.mean()

np.float64(0.8845360715052788)

In [49]:
scores.std()

np.float64(0.014784881452419891)

In [50]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y_transformed,test_size=0.2,random_state=42)

In [51]:
pipeline.fit(X_train,y_train)

In [52]:
y_pred = pipeline.predict(X_test)

In [53]:
y_pred = np.expm1(y_pred)

In [54]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(np.expm1(y_test),y_pred)

0.5324591082613244

With SVR, the mean R² score increases (0.856 → 0.885) and the mean absolute error decreases (0.648 → 0.532).
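
Because every model goes through the same steps, the comparison can also be written as a loop. This is a minimal sketch on synthetic data from `make_regression`, with a plain scaler standing in for the full preprocessor; the names `candidates` and `results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (not the housing data)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "svr": SVR(kernel="rbf"),
}

# Same folds for every model, so the scores are comparable
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("regressor", model)])
    scores = cross_val_score(pipe, X_toy, y_toy, cv=kfold, scoring="r2")
    results[name] = (scores.mean(), scores.std())

for name, (mean_r2, std_r2) in results.items():
    print(f"{name}: R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

The numbers printed here say nothing about the housing data; only the structure carries over.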

## Random Forest

In [94]:
from sklearn.ensemble import RandomForestRegressor

In [95]:
rf = RandomForestRegressor()

In [96]:
# Creating a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', rf)
])

In [97]:
# K-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y_transformed, cv=kfold, scoring='r2')

In [98]:
scores.mean()

np.float64(0.8706136164619934)

In [99]:
scores.std()

np.float64(0.023847950061136373)

In [100]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y_transformed,test_size=0.2,random_state=42)

In [101]:
pipeline.fit(X_train,y_train)

In [102]:
y_pred = pipeline.predict(X_test)
y_pred = np.expm1(y_pred)

In [103]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(np.expm1(y_test),y_pred)

0.5365761800105637

## XGBoost

In [103]:
from xgboost import XGBRegressor

In [None]:
# Creating a pipeline (use the XGBoost regressor here, not the random forest from the previous section)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor())
])

In [None]:
# K-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y_transformed, cv=kfold, scoring='r2')

In [None]:
scores.mean()

In [None]:
scores.std()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y_transformed,test_size=0.2,random_state=42)

In [None]:
pipeline.fit(X_train,y_train)

In [None]:
y_pred = pipeline.predict(X_test)
y_pred = np.expm1(y_pred)

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(np.expm1(y_test),y_pred)