<a href="https://colab.research.google.com/github/Sayed-Hossein-Hosseini/Linear_Regression_for_Predicting_House_Prices/blob/master/Linear_Regression_for_Predicting_House_Prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Linear Regression for Predicting House Prices**

## **Libraries**

In [12]:
import gdown
import pandas as pd
import numpy as np
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## **Download Dataset**

In [2]:
gdown.download("https://drive.google.com/uc?id=1bAJnZWRMPuRF0yRotnLG4Nlr-pRgqWIt", "House_Prices.csv", quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1bAJnZWRMPuRF0yRotnLG4Nlr-pRgqWIt
To: /content/House_Prices.csv
100%|██████████| 3.48k/3.48k [00:00<00:00, 7.97MB/s]


'House_Prices.csv'

## **Preprocessing**

### **Impute NULL Data**

In [13]:
# Load the dataset
df = pd.read_csv('House_Prices.csv')

# Display initial dataset information.
# df.info() prints its report directly and returns None, so it is called on its own line.
print("Dataset Info:")
df.info()
print("\nStatistical Description:\n", df.describe())

# Display the number of NULL values in each column
print("\nNumber of NULL values per column:\n", df.isna().sum())

# Replace NULL values: median for numeric columns, mode for non-numeric columns
def impute_missing_values(df):
    for column in df.columns:
        if df[column].isna().any():  # Check for NULL values in the column
            if pd.api.types.is_numeric_dtype(df[column]):
                # Replace with the median, truncated to an integer
                df[column] = df[column].fillna(int(df[column].median()))
            else:
                # Replace with the most frequent value (for non-numeric data)
                df[column] = df[column].fillna(df[column].mode()[0])
    return df

# Execute replacement (modifies df in place; df_replace refers to the same object)
df_replace = impute_missing_values(df)

# Recheck NULL values
print("\nNumber of NULL values after replacement:\n", df_replace.isna().sum())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Price         128 non-null    int64  
 1   SqFt          126 non-null    float64
 2   Bedrooms      125 non-null    float64
 3   Bathrooms     127 non-null    float64
 4   Offers        126 non-null    float64
 5   Brick         128 non-null    object 
 6   Neighborhood  128 non-null    object 
dtypes: float64(4), int64(1), object(2)
memory usage: 7.1+ KB

Statistical Description:
                Price         SqFt    Bedrooms   Bathrooms      Offers
count     128.000000   126.000000  125.000000  127.000000  126.000000
mean   130427.343750  2001.666667    3.032000    2.448819    2.563492
std     26868.770371   212.387382    0.728852    0.514992    1.069550
min     69100.000000  1450.000000    2.000000    2.000000    1.000000
25%    111325.000000  1882.500000    3.000000    2
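
The median/mode imputation used in this section can be checked on a small frame before running it on the real data. The sketch below is illustrative only: the toy values and the helper name `impute_copy` are hypothetical, and unlike the notebook's function it returns a copy and does not truncate the median to an integer.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the dataset.
toy = pd.DataFrame({
    "SqFt": [1800.0, None, 2100.0, 2000.0],
    "Brick": ["No", "Yes", None, "No"],
})

def impute_copy(frame):
    """Return a copy with numeric NaNs filled by the median and other NaNs by the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

toy_filled = impute_copy(toy)
print(toy_filled)
```

Because the helper works on a copy, `toy` still contains its missing values afterwards, which makes it easy to compare the frame before and after imputation.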

### **Remove Outliers**
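
This section filters outliers with manually chosen thresholds rather than an automatic IQR rule. For comparison, an IQR-based filter might look like the sketch below. The function name `remove_iqr_outliers`, the multiplier `k=1.5`, and the toy prices are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends; rows with NaN in col are dropped
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Toy data with one obviously extreme price (made-up values)
toy_prices = pd.DataFrame({"Price": [100000, 110000, 120000, 130000, 140000, 500000]})
filtered = remove_iqr_outliers(toy_prices, ["Price"])
print(filtered)
```

The fences here are computed from the data, so they shift whenever the dataset changes. Fixed thresholds keep the cut-offs explicit and stable across runs, at the cost of having to be chosen by hand.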

In [14]:
# Display initial dataset statistics
print("Initial row count:", df_replace.shape[0])
print("\nInitial descriptive statistics:\n", df_replace.describe())

def remove_manual_outliers(df):
    """
    Remove outliers based on manually defined thresholds for each relevant column.
    This provides more control than automatic IQR-based removal.
    """
    clean_df = df.copy()

    # Price thresholds - remove extremely cheap or expensive houses
    clean_df = clean_df[(clean_df['Price'] >= 70000) & (clean_df['Price'] <= 200000)]

    # Square footage thresholds - remove very small or very large houses
    clean_df = clean_df[(clean_df['SqFt'] >= 1500) & (clean_df['SqFt'] <= 2500)]

    # Bedrooms threshold - remove houses with more than 4 bedrooms
    clean_df = clean_df[clean_df['Bedrooms'] <= 4]

    # Offers threshold - remove houses with more than 5 offers
    clean_df = clean_df[clean_df['Offers'] <= 5]

    return clean_df

# Apply the manual outlier removal
df_clean = remove_manual_outliers(df_replace)

# Display results after cleaning
print("\nRow count after outlier removal:", df_clean.shape[0])
print("Number of rows removed:", df_replace.shape[0] - df_clean.shape[0])
print("\nDescriptive statistics after cleaning:\n", df_clean.describe())

# Show sample of removed outliers for inspection
outliers_removed = df_replace[~df_replace.index.isin(df_clean.index)]
print("\nSample of removed outliers:")
print(outliers_removed[['Price', 'SqFt', 'Bedrooms', 'Bathrooms', 'Offers']].head())

# Save the cleaned dataset
df_clean.to_csv('House_Prices_Manually_Cleaned.csv', index=False)
print("\nCleaned dataset saved to 'House_Prices_Manually_Cleaned.csv'")

Initial row count: 128

Initial descriptive statistics:
                Price         SqFt    Bedrooms   Bathrooms      Offers
count     128.000000   128.000000  128.000000  128.000000  128.000000
mean   130427.343750  2001.640625    3.031250    2.445312    2.570312
std     26868.770371   210.708506    0.720209    0.514492    1.062486
min     69100.000000  1450.000000    2.000000    2.000000    1.000000
25%    111325.000000  1887.500000    3.000000    2.000000    2.000000
50%    125950.000000  2000.000000    3.000000    2.000000    3.000000
75%    148250.000000  2140.000000    3.000000    3.000000    3.000000
max    211200.000000  2590.000000    5.000000    4.000000    6.000000

Row count after outlier removal: 120
Number of rows removed: 8

Descriptive statistics after cleaning:
                Price         SqFt    Bedrooms   Bathrooms      Offers
count     120.000000   120.000000  120.000000  120.000000  120.000000
mean   129585.000000  1991.500000    2.991667    2.425000    2.52500