<a href="https://colab.research.google.com/github/JEBAKUMAR12/HOUSE-PRICE-PREDICTION/blob/main/house_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color='#ff0000'> **House Price Prediction** </font>

The dataset contains information about houses in Bengaluru, India, with the following features:

**Categorical Features:**

*   `area_type`: Type of area (e.g., Super built-up area, Plot area, Built-up area, Carpet area).
*   `availability`: Date or status of availability.
*   `location`: Neighborhood or area of the house.
*   `size`: Number of bedrooms or BHK (e.g., 2 BHK, 3 Bedroom).
*   `society`: Name of the housing society or apartment complex.

**Numerical Features:**

*   `total_sqft`: Total area in square feet (can be a single value or a range).
*   `bath`: Number of bathrooms.
*   `balcony`: Number of balconies.
*   `price`: Price of the house in the Indian currency unit (Lakhs).

**Key Characteristics:**

*   **Mix of Area Types:** The dataset includes various types of areas: Super built-up, Plot, Built-up, and Carpet areas.
*   **Availability Patterns:** Properties are available at various times, with "Ready To Move" as the most frequent status. Some are marked with specific dates.
*   **Varied Locations:** The dataset covers a diverse range of locations across Bengaluru.
*   **Different Property Sizes:** The dataset contains properties from 1 RK to 10+ bedroom houses.
*   **Presence of Missing Data:** Several features contain missing values.
*   **Inconsistencies in `size` values:** The `size` feature mixes the labels 'Bedroom' and 'BHK' (e.g., "2 BHK", "3 Bedroom").
*   **Inconsistencies in `total_sqft` values:** Some entries are ranges rather than single numbers (e.g., "2100-2850").
*   **Price units:** Prices are recorded in lakhs (1 lakh = 100,000 rupees), not in rupees.

The dataset appears to represent a mix of various property types (apartments, houses) with a wide variety of size, location, and price, presenting an interesting challenge for analysis.


### **Import Libraries**

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

### **Import Dataset**

In [None]:
df = pd.read_csv('../data/Bengaluru_House_Data.csv')
df.head(50)

### <font color="#00CEFF"> **Exploratory data analysis (EDA)** </font>

##### **Summary of Data**

In [None]:
df.info()

#### Number of Rows and Columns

In [None]:
df.shape

##### **Statistical Summary of Data**

In [None]:
df.describe() # only Numeric Columns


**1.  Count:**

*   `bath`: 13,247
*   `balcony`: 12,711
*   `price`: 13,320

This is the number of non-missing values for each feature. Of the three numeric features, `balcony` has the most missing values, with only 12,711 non-missing entries.

**2. Mean:**

*   `bath`:  2.69
*  `balcony`: 1.58
*   `price`: 112.57

This is the average value of each feature: on average, a property has 2.69 bathrooms and 1.58 balconies and is priced at approximately 112.57 lakhs.

**3. Standard Deviation (std):**

*   `bath`: 1.34
*   `balcony`: 0.82
*   `price`: 148.97

Standard deviation measures the spread of the data; higher values indicate a wider spread. `price` is far more spread out than `bath` or `balcony`: its standard deviation (148.97) even exceeds its mean (112.57).

**4. Minimum (min):**

*   `bath`: 1.00
*   `balcony`: 0.00
*   `price`: 8.00

These are the smallest values of each feature in the dataset: 1 bathroom, 0 balconies, and a price of 8 lakhs.

**5. 25th Percentile:**

*   `bath`: 2.00
*   `balcony`: 1.00
*  `price`: 50.00

This means that 25% of the houses have at most 2 bathrooms, at most 1 balcony, and a price less than or equal to 50 lakhs.

**6. 50th Percentile (Median):**

*   `bath`: 2.00
*   `balcony`: 2.00
*   `price`: 72.00

This is the middle value of each feature: half of the houses have at most 2 bathrooms, at most 2 balconies, and a price less than or equal to 72 lakhs.

**7. 75th Percentile:**

*   `bath`: 3.00
*   `balcony`: 2.00
*   `price`: 120.00

This shows that 75% of the properties have at most 3 bathrooms, at most 2 balconies, and a price less than or equal to 120 lakhs.

**8. Maximum (max):**

*   `bath`: 40.00
*   `balcony`: 3.00
*   `price`: 3600.00

These are the largest values of each feature in the dataset. The maxima of 40 bathrooms and a price of 3600 lakhs lie far above the 75th percentiles (3 and 120), which points to outliers.
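
The gap between the mean price (112.57) and the median price (72.00) is the pattern one would expect when a few very large values pull the mean upward. The sketch below illustrates this on made-up numbers; `toy_prices` and the 1.5 × IQR rule of thumb are assumptions for illustration and are not taken from the dataset or from the notebook's own outlier handling.

```python
import pandas as pd

# Hypothetical prices in lakhs; the last value plays the role of an outlier.
toy_prices = pd.Series([40, 50, 60, 70, 80, 3600])

print("mean  :", toy_prices.mean())    # pulled up strongly by the outlier
print("median:", toy_prices.median())  # barely affected by it

# A common rule of thumb flags values above Q3 + 1.5 * IQR.
q1, q3 = toy_prices.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = toy_prices[toy_prices > upper_fence]
print(flagged)
```
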



In [None]:
# df.describe(include=['object'])
df.describe(exclude=['number']).T

#### Check missing values

In [None]:
df.isnull().sum()

#### Drop unwanted Features

In [None]:
df1 = df.drop(['area_type','availability','society','balcony'], axis=1)
df1.head()

In [None]:
df1.isnull().sum()

<p>Only a small share of rows contain missing values compared with the total size of the dataset, so those rows are dropped.</p>

In [None]:
df2 = df1.dropna().copy()  # .copy() makes df2 independent of df1, so adding columns later is unambiguous
df2.isnull().sum()
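
One way to back up the decision to drop rows is to look at the share of missing values per column and at how many rows survive `dropna()`. This is a minimal sketch on a small made-up frame (`toy` is hypothetical and stands in for `df1`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 with a couple of missing entries.
toy = pd.DataFrame({
    "location": ["A", "B", None, "D", "E"],
    "bath": [2.0, np.nan, 3.0, 1.0, 2.0],
    "price": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()
print(missing_share)

# Rows that survive dropna() versus the original row count.
kept = toy.dropna()
print(f"kept {len(kept)} of {len(toy)} rows")
```
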

#### Analyse Data using groupby()

In [None]:
df2.groupby('size')['total_sqft'].agg('count').sort_values()

#### Apply Naming Convention

In [None]:
df2['size'].unique()

In [None]:
df2['bhk']=df2['size'].apply(lambda x : int(x.split(' ')[0]))
df2.head()

In [None]:
df2['bhk'].unique()

In [None]:
df2[df2['bhk']>20]

Find problematic entries in a column that is expected to contain numeric values. Typical causes are:
1. Non-numeric strings like "2000 - 3000", "N/A", or "unknown".
2. Mixed data types in the column.
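
As an alternative way to surface such entries, `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN`. The sketch below uses a made-up Series (`raw`) rather than the real `total_sqft` column:

```python
import pandas as pd

# Hypothetical column mixing clean numbers, a range, and a unit-suffixed value.
raw = pd.Series(["1056", "2100 - 2850", "34.46Sq. Meter", "1200"])

# errors="coerce" replaces unparseable entries with NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

# The entries that failed to parse are the problematic ones.
problem_entries = raw[parsed.isna()]
print(problem_entries)
```
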

In [None]:
df2['total_sqft'].dtype

In [None]:
def is_float(x):
    """Return True if x can be converted to a float, False otherwise."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True


##### ~ (Tilde Operator):
<p>Applied to a boolean Series, ~ inverts every element: True becomes False and vice versa. So, ~df2['total_sqft'].apply(is_float) identifies the rows where total_sqft cannot be converted to a float.</p>
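
A small illustration of the inversion on made-up data (`values` and `is_numeric` are hypothetical names used only here):

```python
import pandas as pd

values = pd.Series(["1200", "1000 - 1500", "950"])
is_numeric = pd.Series([True, False, True])  # e.g. the result of an is_float-style check

# ~ flips each element of the boolean mask.
print((~is_numeric).tolist())

# Indexing with the inverted mask keeps only the rows that failed the check.
not_numeric = values[~is_numeric]
print(not_numeric)
```
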

In [None]:
df2[~df2['total_sqft'].apply(is_float)].head(10)

In [None]:
df2.loc[410]

In [None]:
# df2 = df2.drop(['size'],axis=1)
# df2.head()

In [None]:
# def convert_sqrt_num(x):
#     tokens = x.split('-')
#     if len(tokens) == 2:
#         return (float(tokens[0])+float(tokens[1]))/2
#     try:
#         numeric_part = ''.join(c for c in x if c.isdigit() or c == '.')
#         return float(numeric_part)
#         # return float(x)
#     except:
#         return None


In [None]:
def convert_sqrt_num(x):
    try:
        # Handle ranges (e.g., '1015 - 1540')
        if '-' in x:
            tokens = x.split('-')
            return (float(tokens[0].strip()) + float(tokens[1].strip())) / 2

        # Strip unit labels (e.g., 'Sq. Meter', 'Sq. Yards'). Note: the number is kept as is
        # and is NOT converted to square feet.
        for unit in ['Sq. Meter', 'Sq. Yards', 'Perch', 'Acres', 'Guntha', 'Cents', 'Grounds']:
            x = x.replace(unit, '')

        # Convert cleaned string to float
        return float(x.strip())
    except Exception as e:
        print(f"Error processing '{x}': {e}")
        return None
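
The function above strips unit labels but keeps the number unchanged, so a value given in square meters ends up being read as square feet. If converting the units is preferred, one possible approach is sketched below. `to_sqft` and `SQFT_PER_UNIT` are hypothetical names, and only units with well-known conversion factors are included; the remaining units would need factors confirmed separately.

```python
# Square feet per unit; only units with well-known factors are listed.
SQFT_PER_UNIT = {
    "Sq. Meter": 10.7639,
    "Sq. Yards": 9.0,
    "Acres": 43560.0,
    "Cents": 435.6,  # 1 cent = 1/100 acre
}

def to_sqft(text):
    """Hypothetical helper: parse a total_sqft string and convert it to square feet."""
    text = text.strip()
    # Unit-suffixed values: strip the label and apply the conversion factor.
    for unit, factor in SQFT_PER_UNIT.items():
        if text.endswith(unit):
            return float(text[: -len(unit)].strip()) * factor
    # Ranges such as '1015 - 1540': use the midpoint.
    if "-" in text:
        low, high = text.split("-")
        return (float(low) + float(high)) / 2
    return float(text)

print(to_sqft("1200"))
print(to_sqft("1015 - 1540"))
print(to_sqft("100Sq. Yards"))
```

Unlike the notebook's function, this sketch raises on input it cannot parse instead of returning `None`; which behaviour is preferable depends on how the failures should be handled downstream.
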


In [None]:
df3 = df2.copy()
# print(df3.loc[410])
# print(df3.loc[188])
# print(df3.loc[648])
# print("---------")

df3['total_sqft']= df3['total_sqft'].apply(convert_sqrt_num)
# df3.head(3)


# print(df3.loc[410])
# print(df3.loc[188])
# print(df3.loc[648])
# print(df3['total_sqft'].dtype)


In [None]:
df3[~df3['total_sqft'].apply(is_float)].head(10)

#### Feature Engineering

In [None]:
df4 = df3.copy()

##### Explore the Location column

In [None]:
print(df4.location.unique())
print(len(df4.location.unique()))

With 1304 unique values, one-hot encoding `location` would create far too many columns. This is known as the curse of dimensionality, or the high-dimensionality problem. Grouping rarely occurring locations under a single 'other' label reduces the number of categories.
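
The idea of collapsing rare categories can be sketched on a tiny made-up Series. The names `toy_locations`, `rare`, and `grouped`, and the threshold of 1, are chosen only for this example:

```python
import pandas as pd

toy_locations = pd.Series(
    ["Whitefield", "Whitefield", "Whitefield", "Hebbal", "Hebbal", "Yelahanka"]
)

# Count how often each location occurs.
counts = toy_locations.value_counts()

# Locations at or below the threshold are considered rare (threshold is illustrative).
rare = counts[counts <= 1].index

# Keep frequent locations as they are and replace rare ones with 'other'.
grouped = toy_locations.where(~toy_locations.isin(rare), "other")
print(grouped.value_counts())
```
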

**Strip extra space from the location**

In [None]:
df4.location = df4.location.apply(lambda x : x.strip())
len(df4.location.unique())

In [None]:
location_stats = df4.groupby('location')['location'].agg('count').sort_values(ascending=False)
location_stats

In [None]:
# Despite the name, this also includes locations with exactly 10 listings (<= 10)
location_stats_less_than_10 = location_stats[location_stats<=10]
len(location_stats_less_than_10)

In [None]:
df4['location'] = df4.location.apply(lambda x : 'other' if x in location_stats_less_than_10 else x)
len(df4.location.unique())

##### **Find per sqft value**

In [None]:
df4['price_per_sqft'] = df4['price']*100000/df4['total_sqft']
df4.head()
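
Since `price` is recorded in lakhs and 1 lakh equals 100,000 rupees, multiplying by 100000 gives the price in rupees before dividing by the area. A worked example with round, made-up numbers (`price_per_sqft` here is a hypothetical helper, not part of the notebook):

```python
def price_per_sqft(price_lakhs, total_sqft):
    """Hypothetical helper: price per square foot in rupees."""
    return price_lakhs * 100000 / total_sqft

# A 1000 sqft house priced at 50 lakhs costs 5,000,000 rupees in total.
example = price_per_sqft(50, 1000)
print(example)
```
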

#### Outlier Detection and Removal

Compare the total square footage with the number of bedrooms (`bhk`). Rows with less than 300 sqft per bedroom, listed below, are treated as outliers.

In [None]:
df4[df4.total_sqft/df4.bhk<300].head()

In [None]:
print("Before Removing outliers",df4.shape)
df5 = df4[~(df4.total_sqft/df4.bhk<300)]
print("After Removing Outliers",df5.shape)

In [None]:
df5.price_per_sqft.describe()

In [None]:
def remove_pps_outliers(df):
    """Per location, keep rows whose price_per_sqft lies within one standard deviation of the mean."""
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft <=(m+st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

df6 = remove_pps_outliers(df5)
df6.shape
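
The same per-location filter can also be written without an explicit loop by using `groupby(...).transform`. This is a sketch of an alternative, not the notebook's method; `filter_within_one_std` and the `toy` frame are hypothetical. Row order follows the input frame, so it can differ from the loop version, which concatenates the groups location by location.

```python
import pandas as pd

def filter_within_one_std(df):
    """Keep rows whose price_per_sqft lies in (mean - std, mean + std] for their location."""
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    # ddof=0 gives the population standard deviation, which is also np.std's default.
    std = grouped.transform(lambda s: s.std(ddof=0))
    keep = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    return df[keep].reset_index(drop=True)

# Made-up data: location A contains one extreme value.
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 3,
    "price_per_sqft": [4000, 4100, 4200, 4300, 9000, 5000, 5100, 5200],
})
result = filter_within_one_std(toy)
print(result)
```

Note that a one-standard-deviation band is strict: in location B of the toy data only the middle value survives, even though none of the three values looks extreme.
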





In [None]:
import matplotlib.pyplot as plt


def plot_scatter_chart(df, location):
    """Scatter price per sqft against total sqft for 2 BHK and 3 BHK homes in one location."""
    bhk2 = df[(df.location==location)& (df.bhk==2)]
    bhk3 = df[(df.location==location)& (df.bhk==3)]
    plt.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price_per_sqft, color = 'blue', label = '2 BHK', s =50)
    plt.scatter(bhk3.total_sqft,bhk3.price_per_sqft, marker= '+',color = 'green', label = '3 BHK', s =50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price per Square Feet")
    plt.title(location)
    plt.legend()
    plt.show()

plot_scatter_chart(df6, "Rajaji Nagar")

In [None]:
df6.shape

In [None]:
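def remove_bhk_outliers(df):
    """Within each location, drop n-BHK rows whose price per sqft is below the mean of the
    (n-1)-BHK rows there, provided more than 5 such (n-1)-BHK rows exist."""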
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean' : np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count' : bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')

df7 = remove_bhk_outliers(df6)
df7.shape




In [None]:
plot_scatter_chart(df7, 'Hebbal')

In [None]:
plt.rcParams['figure.figsize'] = (20,10)
plt.hist(df7.price_per_sqft, rwidth=0.8)
plt.xlabel("Price per Square Feet")
plt.ylabel("Count")

In [None]:
df7.bath.unique()

In [None]:
df7[df7.bath>10]

In [None]:
plt.hist(df7.bath, rwidth=0.8)
plt.xlabel("Number of Bathrooms")
plt.ylabel("Count")

In [None]:
df7[df7.bath>df7.bhk+2]

In [None]:
df8 = df7[df7.bath<df7.bhk+2]
df8.shape

In [None]:
df9 = df8.drop(['size', 'price_per_sqft'],axis = 'columns')
df9.head()


In [None]:
dummies = pd.get_dummies(df9['location']).astype(int)
dummies.head(3)

In [None]:
df10 = pd.concat([df9,dummies.drop('other', axis=1)], axis=1)
df10.head(3)

In [None]:
df11  = df10.drop('location', axis=1)
df11.head()

In [None]:
X = df11.drop('price', axis=1 )
X.head()

In [None]:
y = df11.price
y.head()

In [None]:
from sklearn.model_selection import train_test_split
XTrain,XTest,YTrain,yTrueTest = train_test_split(X,y,test_size=0.2, random_state=42)


#### Fit to the Model

In [None]:
from sklearn.linear_model import LinearRegression
lr_modelObj = LinearRegression()
lr_modelObj.fit(XTrain, YTrain)
print(lr_modelObj.score(XTest,yTrueTest))
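
For regressors in scikit-learn, `score` returns the coefficient of determination R², so the printed value can be read as the share of variance in the test prices that the model explains. The sketch below checks this equivalence on synthetic data; `X_demo`, `y_demo`, and `demo_model` are made-up names and are unrelated to the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 4 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, hold out the last 20.
demo_model = LinearRegression().fit(X_demo[:80], y_demo[:80])
X_holdout, y_holdout = X_demo[80:], y_demo[80:]

score = demo_model.score(X_holdout, y_holdout)
manual = r2_score(y_holdout, demo_model.predict(X_holdout))
print(score, manual)  # the two values agree
```
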


In [None]:
yPredict = lr_modelObj.predict(XTest)
yPredict

#### Cross Validation

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5,test_size=0.2,random_state=42)

cross_val_score(LinearRegression(),X,y,cv=cv)
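
`cross_val_score` returns one score per split (R² for a regressor by default), so it is often convenient to summarise the array by its mean and spread. A minimal sketch on synthetic data, with `X_toy`, `y_toy`, and `toy_cv` as made-up names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression problem with a small amount of noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Five random 80/20 splits, mirroring the settings used above.
toy_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=toy_cv)

print("R^2 per split:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A small standard deviation across splits suggests that the score does not depend much on which rows happen to land in the test set.
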


#### Grid Search CV

In [None]:
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

def find_best_model(X, y):
    algorithms = {
        'linear_regression': {
            'model': Pipeline([
                ('scaler', StandardScaler()),  # Standardize features (zero mean, unit variance)
                ('regressor', LinearRegression())
            ]),
            'params': {
                'regressor__fit_intercept': [True, False],
                'regressor__positive': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['squared_error', 'friedman_mse'],  # 'mse' was renamed to 'squared_error' and removed in scikit-learn 1.2
                'splitter': ['best', 'random']
            }
        }
    }

    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
    for algor_name, config in algorithms.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algor_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

# Example usage (X and y must already be defined). A separate name is used so that
# the original dataset `df` is not overwritten.
model_scores = find_best_model(X, y)
model_scores


In [None]:
import pickle
with open('bangalore_home_prices_model.pickle','wb') as f:
    pickle.dump(lr_modelObj,f)

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open('columns.json','w') as f:
    f.write(json.dumps(columns))
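
The two saved artifacts are meant to be used together: the column list tells a consumer where each feature, including the one-hot location, sits in the input row. The sketch below shows one possible way to load them and predict. Everything in it is an assumption for illustration: `predict_price` is a hypothetical helper, and a tiny made-up model and column list stand in for the real files so that the example is self-contained.

```python
import json
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-ins for the saved artifacts: a tiny model and its column list.
toy_columns = ["total_sqft", "bath", "bhk", "hebbal", "whitefield"]
X_toy = np.array([
    [1000, 2, 2, 1, 0],
    [1500, 2, 3, 0, 1],
    [1200, 2, 2, 0, 0],
    [1800, 3, 3, 1, 0],
    [900, 1, 1, 0, 1],
    [2000, 3, 4, 0, 0],
], dtype=float)
y_toy = np.array([60, 95, 55, 120, 45, 110], dtype=float)
model_bytes = pickle.dumps(LinearRegression().fit(X_toy, y_toy))
columns_json = json.dumps({"data_columns": toy_columns})

def predict_price(model, data_columns, location, sqft, bath, bhk):
    """Hypothetical helper: build one feature row and return the predicted price."""
    row = np.zeros(len(data_columns))
    row[data_columns.index("total_sqft")] = sqft
    row[data_columns.index("bath")] = bath
    row[data_columns.index("bhk")] = bhk
    # Locations missing from the column list (including 'other') stay all-zero.
    if location.lower() in data_columns:
        row[data_columns.index(location.lower())] = 1
    return float(model.predict(row.reshape(1, -1))[0])

# Load the artifacts back and make a prediction.
loaded_model = pickle.loads(model_bytes)
loaded_columns = json.loads(columns_json)["data_columns"]
prediction = predict_price(loaded_model, loaded_columns, "Hebbal", 1000, 2, 2)
print(prediction)
```

With the real files, `pickle.load` and `json.load` on the opened files would replace the in-memory stand-ins. A model fitted on a DataFrame may warn about missing feature names when given a plain array; building a one-row DataFrame with the saved column names is one way around that.
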