# House Price Prediction

## Define the Problem
Objective: Clearly define the objective of your prediction model (e.g., predicting house prices, classifying emails as spam).
Output: Determine what you need to predict (e.g., a continuous value for regression, a class label for classification).

In [32]:
# !pip install git+https://github.com/ArinGujarati/ArinLib.git

# !pip install E:/Github/ArinLib


In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ArinLib as al

## Data Collection
Gather the data required for your problem. Ensure the data is relevant, accurate, and sufficient.

In [34]:
df = pd.read_csv('Housing.csv')

df.head()

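| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |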


## Data Preprocessing
Handle Missing Values: Remove rows/columns with missing values, or impute them using the mean, median, mode, or more advanced techniques.

Remove Duplicates: Check for and remove duplicate records.

Correct Errors: Identify and correct errors or inconsistencies in the data.

Normalize/Scale Data: Standardize features by removing the mean and scaling to unit variance.

Encode Categorical Variables: Convert categorical variables into numerical format using one-hot encoding or label encoding.

Split Data: Divide the data into training and testing sets (e.g., 80/20 split).
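
The encoding, splitting, and scaling steps listed above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the preprocessing this notebook actually runs: the `toy` values are hypothetical, and the choice of `pd.get_dummies` plus `StandardScaler` is an assumption. The scaler is fitted on the training rows only so that no information from the test rows leaks into it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same kinds of columns as Housing.csv
toy = pd.DataFrame({
    "area": [3000, 4500, 6000, 7500, 9000],
    "bedrooms": [2, 3, 3, 4, 4],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished",
                         "furnished", "unfurnished"],
    "price": [3_000_000, 4_200_000, 5_500_000, 7_000_000, 8_100_000],
})

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], dtype=int)

X = encoded.drop(columns="price")
y = encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the numeric columns: fit on train, reuse the same scaler on test
numeric_cols = ["area", "bedrooms"]
scaler = StandardScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[numeric_cols]),
    columns=numeric_cols, index=X_train.index,
)
test_scaled = pd.DataFrame(
    scaler.transform(X_test[numeric_cols]),
    columns=numeric_cols, index=X_test.index,
)
print(train_scaled.round(3))
```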

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


In [36]:
df.isnull().sum().sum()

0

In [37]:
df.duplicated().sum()

0

In [38]:
# Convert object (string) columns to integer codes
df = al.ObjectToInt(df)

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         545 non-null    int64
 4   stories           545 non-null    int64
 5   mainroad          545 non-null    int32
 6   guestroom         545 non-null    int32
 7   basement          545 non-null    int32
 8   hotwaterheating   545 non-null    int32
 9   airconditioning   545 non-null    int32
 10  parking           545 non-null    int64
 11  prefarea          545 non-null    int32
 12  furnishingstatus  545 non-null    int32
dtypes: int32(7), int64(6)
memory usage: 40.6 KB
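
`ObjectToInt` comes from the ArinLib package and its implementation is not shown in this notebook. The encoded values that appear later (`no`/`yes` as 0/1, `furnished` as 0, `semi-furnished` as 1) are consistent with alphabetical label encoding, so a roughly equivalent step using only pandas might look like the sketch below. The helper name `object_to_int` and the toy frame are hypothetical.

```python
import pandas as pd

def object_to_int(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for al.ObjectToInt: label-encode non-numeric columns."""
    out = frame.copy()
    for col in out.select_dtypes(exclude="number").columns:
        # Categories are sorted alphabetically, so 'no' -> 0 and 'yes' -> 1
        out[col] = out[col].astype("category").cat.codes.astype("int32")
    return out

toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
    "parking": [2, 0, 1],
})
encoded = object_to_int(toy)
print(encoded)
```

Label encoding imposes an order on `furnishingstatus` (0 < 1 < 2), which a linear model will treat as a numeric scale; one-hot encoding avoids that assumption at the cost of extra columns.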


In [40]:
from sklearn.model_selection import train_test_split

x = df.drop(columns='price')  # Feature columns (everything except the target)
y = df['price']               # Target column

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


print('x_train: ',x_train.shape)
print('x_test: ',x_test.shape)
print('y_train: ',y_train.shape)
print('y_test: ',y_test.shape)

x_train:  (436, 12)
x_test:  (109, 12)
y_train:  (436,)
y_test:  (109,)


## Exploratory Data Analysis (EDA)
Summary Statistics: Compute basic statistics (mean, median, mode, standard deviation).

Histograms: Understand the distribution of numerical features.

Box Plots: Identify outliers.

Scatter Plots: Explore relationships between features.

Heatmaps: Examine correlations between features.

Pattern Identification: Look for trends, seasonality, or patterns in the data.

Outliers: Detect and consider handling outliers.
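
As a small illustration of the summary-statistics and correlation steps above, the sketch below uses only pandas on a made-up frame (the `toy` values are hypothetical). On the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data; replace with df in the notebook
toy = pd.DataFrame({
    "price": [3_000_000, 4_200_000, 5_500_000, 7_000_000, 8_100_000, 4_000_000],
    "area": [3000, 4500, 6000, 7500, 9000, 5200],
    "bedrooms": [2, 3, 3, 4, 4, 2],
    "parking": [0, 1, 2, 1, 2, 0],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
summary = toy.describe()
print(summary.loc[["mean", "50%", "std"]])

# Pearson correlation of each feature with the target, strongest first
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)

# With seaborn imported as sns, sns.heatmap(toy.corr(), annot=True)
# would draw the full correlation matrix as a heatmap.
```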

In [41]:
df.head()

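| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 0 |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 1 |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | 0 |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 0 |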


In [42]:
al.Outliers_Visualization(df)

[Interactive widget output: a "Column:" dropdown (price, area, bedrooms, bathrooms, …)]

In [43]:
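def remove_outliers_iqr(data, outliers_columns):
    """Drop rows that fall outside the 1.5 * IQR fences of any listed column."""
    # Start from a copy so filters accumulate across columns
    cleaned_df = data.copy()

    for column in outliers_columns:
        # Ensure the column exists
        if column not in data.columns:
            raise ValueError(f"Column '{column}' does not exist in the DataFrame.")

        # Compute the IQR fences from the original data
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Filter the running result, not `data`, so earlier columns' filters are kept
        cleaned_df = cleaned_df[~((cleaned_df[column] < lower_bound) | (cleaned_df[column] > upper_bound))]

    return cleaned_df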

outliers_columns = ['price','bedrooms','area']

cleaned_df_iqr = remove_outliers_iqr(df, outliers_columns)

al.Outliers_Visualization(cleaned_df_iqr)

[Interactive widget output: a "Column:" dropdown (price, area, bedrooms, bathrooms, …)]

## Feature Engineering
Create New Features: Derive new features from existing ones (e.g., creating a feature for the day of the week from a date).

Feature Selection: Use techniques like correlation analysis, feature importance from models, or recursive feature elimination.

Dimensionality Reduction: Apply techniques like PCA (Principal Component Analysis) if needed to reduce feature space.
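
A sketch of the first two ideas, under stated assumptions: the data is synthetic, the derived column `area_per_bedroom` is a hypothetical example of a new feature, and univariate selection with `SelectKBest` and `f_regression` is only one of several possible selection techniques.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 500

# Synthetic data: price depends on area and bedrooms, not on `noise`
toy = pd.DataFrame({
    "area": rng.uniform(2000, 10000, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(size=n),
})
toy["price"] = (
    500 * toy["area"] + 200_000 * toy["bedrooms"] + rng.normal(0, 100_000, n)
)

# Create a new feature from existing ones (hypothetical example)
toy["area_per_bedroom"] = toy["area"] / toy["bedrooms"]

X = toy.drop(columns="price")
y = toy["price"]

# Keep the two features with the strongest univariate relationship to price
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```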

## Model Selection
Algorithm Choice:

Regression: Linear regression, decision trees, random forest, gradient boosting, etc.

Classification: Logistic regression, SVM, decision trees, random forest, gradient boosting, neural networks, etc.


Suitability:

Choose a model that matches the structure of your data (e.g., linear models when the target depends roughly linearly on the features, more flexible models such as tree ensembles when the relationship is non-linear).
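
To illustrate the suitability point, the sketch below fits a linear model and a random forest to a deliberately non-linear synthetic target and compares their R² on held-out data. The data and the two candidate models are assumptions chosen for the illustration; this is not a comparison run on the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
# Clearly non-linear relationship between the single feature and the target
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# .score() returns R² on the held-out data for scikit-learn regressors
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, r2 in scores.items():
    print(f"{name}: R² = {r2:.3f}")
```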

In [44]:
from sklearn.linear_model import LinearRegression
import joblib

lr = LinearRegression()

lr.fit(x_train,y_train)

# Save the model
joblib.dump(lr, 'linear_regression_model.pkl')



['linear_regression_model.pkl']

## Model Training
Training the Model: Fit the model to the training data.

Cross-Validation: Use techniques like k-fold cross-validation to tune hyperparameters and to detect overfitting before touching the test set.
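
A minimal sketch of 5-fold cross-validation with scikit-learn, using synthetic data so that it runs on its own. In this notebook the same call would take `x_train` and `y_train` in place of the synthetic arrays; the fold count and the `r2` scoring choice are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a linear signal plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

# Shuffle before splitting so folds do not depend on row order
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", scores.round(3))
print("Mean R²:", round(scores.mean(), 3))
```

A large spread between folds, or a mean well below the training score, suggests the model is sensitive to which rows it sees.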

In [49]:
# Load the model
lr = joblib.load('linear_regression_model.pkl')

# Use the model to make predictions
y_pred = lr.predict(x_test)


## Model Evaluation
Performance Metrics:


Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² score.

Classification: Accuracy, Precision, Recall, F1 Score, ROC-AUC.


Evaluate on Test Data:
Measure the model’s performance on the test dataset to check for generalization.

In [46]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.6494754192267804
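
R² is only one of the regression metrics listed above. The sketch below computes MSE, RMSE, MAE, and R² together on small made-up arrays (the values are hypothetical); in this notebook the same functions would be called with `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
mae = mean_absolute_error(y_true, y_hat)  # mean of absolute errors
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
# prints: MSE=0.375  RMSE=0.612  MAE=0.500  R²=0.925
```

RMSE and MAE are expressed in the target's units, which makes them easier to interpret for prices than R² alone.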

## Load Model

In [47]:
import joblib

# Load the model
lr = joblib.load('linear_regression_model.pkl')

# Use the model to make predictions
predictions = lr.predict(x_test)
