<a href="https://colab.research.google.com/github/J878-commits/-Task-1-Text-Summarization-with-Transformers-Gradio-/blob/main/%E2%80%9CCodTech_Internship_Task_3_%E2%80%93_Rental_Price_Prediction_API_with_Scikit_Learn_and_Flask%E2%80%9D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🎯 Task 3 Objective and Goals
🔹 Objective
The primary objective of Task 3 was to design and implement a complete end-to-end data science workflow—starting from raw data preprocessing and culminating in a deployed machine learning model. This task focused on transforming real-world rental data into a functional prediction system that could be accessed via an API or interactive web app.

In [1]:
# Step 1: Upload ZIP file
from google.colab import files
uploaded = files.upload()  # Upload archive1.zip

# Step 2: Unzip the archive
import zipfile
import os

zip_path = next(iter(uploaded))  # Get uploaded file name
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("rent_data")

# Step 3: List extracted files
os.listdir("rent_data")


Saving archive (1).zip to archive (1).zip


['House_Rent_Dataset.csv', 'Dataset Glossary.txt']

📊 Step 3: Load the Dataset

📥 Step 1: Extract ZIP and Load CSV

In [4]:
import zipfile
import pandas as pd

# Extract the ZIP file
with zipfile.ZipFile("/content/archive (1).zip", 'r') as zip_ref:
    zip_ref.extractall("/content/rent_data")

# Load the CSV file
df = pd.read_csv("/content/rent_data/House_Rent_Dataset.csv")
df.head()


Unnamed: 0,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner


🧼 Step 2: Data Cleaning & Overview

In [5]:
# Check basic info
df.info()

# Drop irrelevant columns
df.drop(columns=['Posted On'], inplace=True)  # Optional: timestamp not needed for prediction

# Check for missing values
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64 
 2   Rent               4746 non-null   int64 
 3   Size               4746 non-null   int64 
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64 
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB


Unnamed: 0,0
BHK,0
Rent,0
Size,0
Floor,0
Area Type,0
Area Locality,0
City,0
Furnishing Status,0
Tenant Preferred,0
Bathroom,0


🔄 Step 3: Encode Categorical Features

In [6]:
from sklearn.preprocessing import LabelEncoder

# List of categorical columns
categorical_cols = ['Area Type', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact']

# Apply Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])


📊 Step 4: Feature Selection & Scaling

In [8]:
# Step 1: Split 'Floor' into two new columns
df[['Floor Level', 'Total Floors']] = df['Floor'].str.extract(r'(\w+)\s*out of\s*(\d+)')

# Step 2: Convert 'Floor Level' to numeric
floor_map = {
    'Ground': 0,
    'Lower Basement': -2,
    'Upper Basement': -1,
    '1st': 1, '2nd': 2, '3rd': 3, '4th': 4, '5th': 5,
    '6th': 6, '7th': 7, '8th': 8, '9th': 9, '10th': 10,
    '11th': 11, '12th': 12, '13th': 13, '14th': 14, '15th': 15,
    '16th': 16, '17th': 17, '18th': 18, '19th': 19, '20th': 20
}
df['Floor Level'] = df['Floor Level'].map(floor_map)
df['Total Floors'] = pd.to_numeric(df['Total Floors'], errors='coerce')

# Step 3: Drop original 'Floor' column
df.drop(columns=['Floor'], inplace=True)

# Step 4: Drop any remaining rows with missing values
df.dropna(inplace=True)



In [11]:
# Check all column names
df.columns.tolist()




['BHK',
 'Rent',
 'Size',
 'Area Type',
 'Area Locality',
 'City',
 'Furnishing Status',
 'Tenant Preferred',
 'Bathroom',
 'Point of Contact',
 'Floor Level',
 'Total Floors']

In [12]:
from sklearn.preprocessing import LabelEncoder

# Corrected list of categorical columns
categorical_cols = ['Area Type', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact', 'Area Locality']

# Apply Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])


In [13]:
# Define features and target
X = df.drop(columns=['Rent'])  # Features
y = df['Rent']                 # Target

# Scale features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


🧠 Step 5: Model Training

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


🧪 Step 6: Model Evaluation

In [16]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"✅ RMSE: {rmse:.2f}")
print(f"✅ R² Score: {r2:.2f}")



✅ RMSE: 18315.67
✅ R² Score: 0.29


💾 Step 7: Save Model & Scaler for Deployment

In [17]:
import joblib

# Save model and scaler
joblib.dump(model, 'rental_price_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

print("✅ Model and scaler saved successfully.")


✅ Model and scaler saved successfully.


🌐 Step 9: Flask API Packaging

In [18]:
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)

# Load model and scaler
model = joblib.load('rental_price_model.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.json['features']  # Expecting a list of feature values
        scaled = scaler.transform([data])
        prediction = model.predict(scaled)[0]
        return jsonify({'predicted_rent': round(prediction, 2)})
    except Exception as e:
        return jsonify({'error': str(e)})

if __name__ == '__main__':
    app.run(debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:[33mPress CTRL+C to quit[0m
INFO:werkzeug: * Restarting with stat


flask
numpy
scikit-learn
joblib


web: gunicorn app:app


🧪 Step 10: Test Your API

🔟 Step 10: Model Deployment and Interface Integration
🔹 Overview
Step 10 focused on deploying the trained rental price prediction model to make it accessible for real-time use. While traditional deployment methods like Heroku or command-line Gunicorn servers were not used in this phase, the model was successfully integrated into a Gradio interface—a powerful and user-friendly alternative for showcasing machine learning models.

🔹 Approach
Instead of deploying via gunicorn app:app in a terminal or cloud CLI, the model was wrapped in a Gradio app and launched directly within Google Colab.

This method allowed for:

Instant testing and sharing via public Gradio links

Interactive input fields for real-time predictions

A clean, visual interface suitable for both technical and non-technical users

✅ Outcome
The model is now accessible through a Gradio-powered web interface, fulfilling the core requirement of Task 3: to demonstrate the model’s functionality in a deployed environment.

This approach also enables easy transition to platforms like Hugging Face Spaces, where the same Gradio app can be hosted publicly without needing CLI deployment.

🧠 Reflection
While command-line deployment using Flask and Gunicorn is a valuable skill, Gradio offers a lightweight, flexible, and visually engaging alternative—especially useful for prototyping, educational demos, and portfolio projects. It aligns well with the goals of Task 3 by making the model usable, testable, and shareable.

💻 Gradio App Code Cell

In [22]:
import gradio as gr
import joblib

# Load model and scaler
model = joblib.load("rental_price_model.pkl")
scaler = joblib.load("scaler.pkl")

# Define prediction function
def predict_rent(bedrooms, area, bathrooms, age, floor, furnishing, parking, tier, locality, amenities, proximity):
    features = [bedrooms, area, bathrooms, age, floor, furnishing, parking, tier, locality, amenities, proximity]
    scaled = scaler.transform([features])
    prediction = model.predict(scaled)
    return f"Estimated Monthly Rent: ₹{round(prediction[0], 2)}"

# Create Gradio interface
demo = gr.Interface(
    fn=predict_rent,
    inputs=[
        gr.Number(label="Bedrooms"),
        gr.Number(label="Area (sq ft)"),
        gr.Number(label="Bathrooms"),
        gr.Number(label="Age of Property"),
        gr.Number(label="Floor"),
        gr.Number(label="Furnishing (0=Unfurnished, 1=Furnished)"),
        gr.Number(label="Parking Spaces"),
        gr.Number(label="City Tier"),
        gr.Number(label="Locality Score"),
        gr.Number(label="Amenities Count"),
        gr.Number(label="Proximity Index")
    ],
    outputs="text",
    title="🏠 Rental Price Predictor",
    description="Enter property details to estimate monthly rent."
)

# Launch the app
demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f473bdbd30b19cfe88.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




📄 Step 12: Final Project Structure for Heroku

rental-price-predictor/
├── app.py
├── rental_price_model.pkl
├── scaler.pkl
├── requirements.txt
├── Procfile


web: gunicorn app:app
flask
gunicorn
numpy
scikit-learn
joblib


I can summarize Task 3 and provide insights and a conclusion.

Task 3 involved encoding categorical features in the dataset. This was done using LabelEncoder from sklearn.preprocessing on columns like 'Area Type', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact', and 'Area Locality'.

Insight: Encoding categorical features is a crucial step in preparing data for machine learning models. Models typically work with numerical data, so converting categorical variables into a numerical format is necessary. Label encoding assigns a unique integer to each category.

Conclusion: The categorical features have been successfully encoded, making the dataset suitable for subsequent steps like model training. However, it's worth noting that Label Encoding assumes an ordinal relationship between categories, which might not always be true for the columns encoded here. For nominal categorical features, one-hot encoding might be a more appropriate technique to avoid introducing unintended ordinality.

📝 Task 3 Summary: Categorical Feature Encoding for Rental Price Prediction
🔹 Objective
As part of CodTech Internship Task 3, the goal was to preprocess the rental price dataset by encoding categorical features to make it suitable for machine learning model training and deployment.

🔹 Approach
Applied Label Encoding using sklearn.preprocessing.LabelEncoder to transform categorical columns into numerical format.

Targeted columns included:

Area Type

City

Furnishing Status

Tenant Preferred

Point of Contact

Area Locality

🔍 Insight
Encoding categorical features is a critical preprocessing step in any data science pipeline. Since most machine learning models require numerical input, categorical variables must be converted appropriately.

Label Encoding assigns a unique integer to each category.

This method is simple and efficient, but it implies an ordinal relationship between categories—which may not reflect the true nature of the data.

✅ Outcome
All categorical features were successfully encoded.

The dataset is now compatible with regression models and ready for training and deployment.

⚠️ Reflection
While Label Encoding was used for simplicity, it’s important to recognize its limitations:

For nominal features (e.g., City, Point of Contact), One-Hot Encoding may be more appropriate to avoid introducing false ordinal relationships.

Future iterations could explore target encoding or embedding layers for more nuanced handling of high-cardinality categorical data.