# Data-Driven Decisions: Predicting Goa House Prices for Smart Investments

This project develops a machine learning model to predict real estate prices in Goa. By analyzing key features, the model aims to give a clearer picture of the Goa housing market and to support better, data-driven decisions.

## Overview 
The model uses location features (longitude, latitude, ocean_proximity), housing characteristics (age, total rooms, total bedrooms), and demographic and economic indicators (population, households, median income) to estimate the median house value. The goal is a data-driven framework for estimating prices in the Goa housing market, supporting informed decisions by buyers, sellers, and investors.


### 1. Importing required libraries 

In [1]:
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

In [2]:
# Files where the trained model and fitted pipeline are saved, so they can be reused on new or unseen data
MODEL_FILE = "Model.pkl"
PIPELINE_FILE = "pipeline.pkl"
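
As a minimal sketch of how such files are typically used, the snippet below saves a fitted object with `joblib.dump` and restores it with `joblib.load`. The toy `StandardScaler`, the temporary directory, and the file name `scaler.pkl` are assumptions made for illustration; they are not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a fitted preprocessing step (assumption for illustration).
scaler = StandardScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")  # hypothetical file name
    joblib.dump(scaler, path)      # persist the fitted object
    restored = joblib.load(path)   # reload it, as a later session would

# The restored object carries the same fitted statistics.
same_mean = bool(np.allclose(scaler.mean_, restored.mean_))
same_output = bool(
    np.allclose(scaler.transform([[5.0]]), restored.transform([[5.0]]))
)
print(same_mean, same_output)
```

Saving the fitted preprocessing object next to the model matters because the model only makes sense on features transformed with the statistics learned during training.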

### 2. Pipeline 

In [3]:
def build_pipeline(num_variables, cat_variables):
    """Build the preprocessing transformer for numeric and categorical columns."""
    # Numeric preprocessing: fill missing values with the median, then standardize
    num_pipeline1 = Pipeline([
        ("Imputer", SimpleImputer(strategy="median")),
        ("Scaling", StandardScaler())
    ])

    # Categorical preprocessing: fill missing values with the most frequent
    # category, one-hot encode, then standardize the encoded columns
    cat_pipeline2 = Pipeline([
        ("Imputer", SimpleImputer(strategy="most_frequent")),
        ("encoding", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
        ("Scaling", StandardScaler())
    ])

    # Apply each sub-pipeline to its own columns and concatenate the results
    full_pipeline = ColumnTransformer([
        ("numeric", num_pipeline1, num_variables),
        ("category", cat_pipeline2, cat_variables),
    ])
    return full_pipeline
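
To see what a combined transformer of this kind produces, here is a small, self-contained sketch on a toy frame. The DataFrame, its values, the category labels, and the simplified `toy_preprocessor` are assumptions made for illustration. It mirrors the structure above (median imputation and scaling for numeric columns, most-frequent imputation and one-hot encoding for the categorical column) but leaves the encoded columns unscaled so the output is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numeric value (made-up values and labels).
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, np.nan, 6.0],
    "households": [100.0, 200.0, 300.0, 400.0],
    "ocean_proximity": ["COAST", "INLAND", "COAST", "INLAND"],
})
num_cols = ["median_income", "households"]
cat_cols = ["ocean_proximity"]

toy_preprocessor = ColumnTransformer(
    [
        ("numeric", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), num_cols),
        ("category", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Two scaled numeric columns plus two one-hot columns.
features = toy_preprocessor.fit_transform(toy)
print(features.shape)

# A category never seen during fitting is encoded as all zeros.
unseen = pd.DataFrame({
    "median_income": [4.0],
    "households": [250.0],
    "ocean_proximity": ["ISLAND"],
})
unseen_features = toy_preprocessor.transform(unseen)
print(unseen_features)
```

With `handle_unknown="ignore"`, an unseen category does not raise an error at prediction time, which is useful when new data contains labels absent from the training set.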


### 3. Main Logic within an IF-ELSE clause

In [7]:
# Train when no saved model exists; otherwise fall through to inference.

if not os.path.exists(MODEL_FILE):

    # 1) Load the data
    data = pd.read_csv("housing.csv")

    # 2) Stratified train/test split on binned median income
    data["income_cat"] = pd.cut(
        data["median_income"],
        bins=[0, 1.9, 3.8, 5.7, 7.6, np.inf],
        labels=[1, 2, 3, 4, 5],
    )

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

    # split() yields positional indices, so rows are selected with iloc
    for train_index, test_index in split.split(data, data["income_cat"]):
        strat_train_data = data.iloc[train_index].drop("income_cat", axis=1)
        # Store the held-out rows in a file for testing; index=False avoids
        # writing the row index as an extra unnamed column
        data.iloc[test_index].drop("income_cat", axis=1).to_csv("input.csv", index=False)

    # 3) Work on a copy so the stratified training split stays untouched
    train_data = strat_train_data.copy()

    # 4) Separate features and labels from the training data
    labels_data = train_data["median_house_value"].copy()
    features_data = train_data.drop("median_house_value", axis=1)

    # 5) Separate numerical and categorical features; each type gets its own pipeline steps
    num_cols = features_data.drop("ocean_proximity", axis=1).columns.tolist()
    cat_cols = ["ocean_proximity"]

    # 6) Fit the preprocessing pipeline and transform the training features
    pipeline = build_pipeline(num_cols, cat_cols)
    preprocessed_data = pipeline.fit_transform(features_data)

    # 7) Train the ML model (choose the model based on measured performance)
    random_reg = RandomForestRegressor(random_state=42)
    random_reg.fit(preprocessed_data, labels_data)

    # 8) Save the model and the fitted pipeline
    joblib.dump(pipeline, PIPELINE_FILE)
    joblib.dump(random_reg, MODEL_FILE)

    print("Model training complete.")
else:
    # Inference: predict on new, unseen data
    model = joblib.load(MODEL_FILE)
    pipeline = joblib.load(PIPELINE_FILE)

    new_data = pd.read_csv("input.csv").drop(["median_house_value"], axis=1)

    # Use transform, not fit_transform: the saved pipeline must reuse the
    # statistics learned from the training data instead of refitting
    transformed_data = pipeline.transform(new_data)
    predictions = model.predict(transformed_data)
    new_data["median_house_values"] = predictions

    # Save the predictions to disk
    new_data.to_csv("Final Project/predicted_data.csv", index=False)
    print("Inference complete. Results saved to predicted_data.csv")



Inference complete. Results saved to predicted_data.csv
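
A detail worth keeping in mind when applying a saved preprocessing pipeline to new data: `transform` reuses the statistics learned during training, whereas `fit_transform` recomputes them from the new data. The sketch below illustrates the difference with a single `StandardScaler` and toy numbers (assumed values, not project data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])   # toy training feature: mean 5, std 5
new = np.array([[10.0], [20.0]])    # toy unseen data

scaler = StandardScaler().fit(train)

# transform: scales the new data with the training mean and std
reused = scaler.transform(new)                 # values 1 and 3

# fit_transform on a fresh scaler: statistics come from the new data itself
refit = StandardScaler().fit_transform(new)    # values -1 and 1

print(reused.ravel())
print(refit.ravel())
```

The same raw values map to different features in the two cases, so a model trained on training-scaled features would see shifted inputs if the preprocessing were refit at inference time.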

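The effect of stratifying the split on income categories can be checked on synthetic data. The sketch below is an illustration under assumptions: the incomes are randomly generated rather than read from `housing.csv`, while the bin edges are the ones used in the training branch. It compares the share of each income category in the full data with its share in the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for median_income (assumption: not real data).
toy = pd.DataFrame({"median_income": rng.uniform(0.5, 10.0, size=1000)})
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0, 1.9, 3.8, 5.7, 7.6, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["income_cat"]))

# Share of each income category overall and in the test split.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = (
    toy.iloc[test_idx]["income_cat"].value_counts(normalize=True).sort_index()
)

# Largest difference in category share between the two.
max_gap = float(np.abs(overall.to_numpy() - in_test.to_numpy()).max())
print(pd.DataFrame({"overall": overall, "test": in_test}).round(3))
print("largest gap:", round(max_gap, 4))
```

Because every category keeps roughly the same share in both sets, an evaluation on the held-out rows is less likely to be skewed by an unrepresentative income mix.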