## Objective
Build a system on the Ames Housing dataset that predicts house prices and identifies which factors affect price the most. The project also compares several regression models and examines where predictions break down, especially for very expensive or very cheap houses.

## Project Overview
Housing prices depend on many factors, such as location, house size, build quality, and the surrounding area. An accurate price estimate helps buyers, sellers, and real estate agents make better decisions.

In this project, the goal is not only to predict house prices but also to understand which factors affect the price the most. The project also focuses on finding where the model works well and where it tends to fail, so the results can be trusted in real-life situations.

In [3]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np 

In [4]:
dataset = pd.read_csv("AmesHousing.csv")

In [5]:
dataset.columns

Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
      

In [6]:
dataset['Sale Condition'].unique()

array(['Normal', 'Partial', 'Family', 'Abnorml', 'Alloca', 'AdjLand'],
      dtype=object)

In [7]:
dataset[['Lot Area','Sale Condition','SalePrice']].sample(3)

      Lot Area Sale Condition  SalePrice
2001     10800         Normal     159500
48        7658         Normal     319900
2225      1782         Normal     123900


## Problem Framing & Key Questions

The aim of this project is to predict house prices using past data and understand which factors affect prices the most. The focus is not only on prediction, but also on learning price patterns.

A successful model should give reasonable predictions on new data and clearly show which features influence the price. It should generalize to most houses, not just those in the training data.

The main risk is an inaccurate price prediction, which can mislead buyers or sellers. An overfitted or underfitted model can lead to financial loss, so model reliability is important.
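One common way to spot overfitting or underfitting is to compare the error on the training rows with the error on rows the model never saw. The sketch below is only an illustration under stated assumptions: it uses synthetic data and a plain least-squares fit as a stand-in model, and the helper names `rmse` and `train_holdout_rmse` are hypothetical, not part of this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def train_holdout_rmse(X, y, holdout_fraction=0.2, seed=0):
    """Fit ordinary least squares on a random training split and
    return (train RMSE, holdout RMSE)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)
    n_holdout = int(round(n * holdout_fraction))
    holdout, train = order[:n_holdout], order[n_holdout:]
    # Prepend a column of ones so the fit includes an intercept.
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
    return (rmse(y[train], design[train] @ coef),
            rmse(y[holdout], design[holdout] @ coef))

# Synthetic stand-in data (NOT the Ames dataset): price grows with area plus noise.
rng = np.random.default_rng(42)
area = rng.uniform(500, 3000, size=200)
price = 50_000 + 100 * area + rng.normal(0, 10_000, size=200)

train_rmse, holdout_rmse = train_holdout_rmse(area.reshape(-1, 1), price)
print(f"train RMSE:   {train_rmse:,.0f}")
print(f"holdout RMSE: {holdout_rmse:,.0f}")
```

A holdout error far above the training error suggests overfitting, while two similar but large errors suggest underfitting.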

The model may not perform well for very expensive houses, very old properties, or areas with limited data. These cases are harder to predict because there are fewer similar examples in the dataset.
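To check whether errors really concentrate at the extremes, predictions can be grouped into price bands and the error summarized per band. The following is a minimal sketch, not the project's evaluation code: `error_by_price_band` is a hypothetical helper, the prices are synthetic, and the "model" is an artificial one that pulls every prediction toward the mean price.

```python
import numpy as np
import pandas as pd

def error_by_price_band(actual, predicted, n_bands=4):
    """Summarize absolute percentage error within quantile bands of the actual price."""
    frame = pd.DataFrame({
        "actual": np.asarray(actual, dtype=float),
        "predicted": np.asarray(predicted, dtype=float),
    })
    frame["abs_pct_error"] = (frame["predicted"] - frame["actual"]).abs() / frame["actual"]
    # Band 0 holds the cheapest houses, the highest band the most expensive.
    frame["band"] = pd.qcut(frame["actual"], q=n_bands, labels=False, duplicates="drop")
    return frame.groupby("band")["abs_pct_error"].agg(["count", "mean", "median"])

# Synthetic prices (NOT the Ames data) and a stand-in model that shrinks
# every prediction 20% of the way toward the mean price.
actual = np.linspace(50_000, 500_000, 40)
predicted = actual.mean() + 0.8 * (actual - actual.mean())

summary = error_by_price_band(actual, predicted)
print(summary)
```

In this toy setup the cheapest and most expensive bands show larger relative errors than the middle bands, which is the pattern to look for in a real model.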


In [8]:
dataset.shape

(2930, 82)

In [10]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   
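The non-null counts above already show gaps: `Lot Frontage` has 2440 of 2930 values (490 missing, about 17%) and `Alley` has only 198 (2732 missing, about 93%). A compact per-column view can be sketched as follows; `missing_summary` is a hypothetical helper and the small frame is toy data, used only so the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return missing count and share for every column that has missing values,
    sorted from most to least missing."""
    missing = frame.isnull().sum()
    summary = pd.DataFrame({"missing": missing, "share": missing / len(frame)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame reusing a few column names from the dataset (values are illustrative).
toy = pd.DataFrame({
    "Lot Area": [10800, 7658, 1782, 9600],
    "Lot Frontage": [80.0, np.nan, 21.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, the same call would be `missing_summary(dataset)`.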

In [17]:
dataset['Pool QC'].isnull().sum()

2917

In [21]:
dataset['Pool QC'].unique()

array([nan, 'Ex', 'Gd', 'TA', 'Fa'], dtype=object)
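With 2917 of 2930 values missing (about 99.6%), `Pool QC` is almost entirely empty. If those missing entries mean "the house has no pool" rather than "quality unknown" (an assumption that should be checked against the dataset's data dictionary), one option is to turn them into an explicit category instead of dropping the column. The sketch below uses a toy series, and both the helper `fill_absent_feature` and the label `"NoPool"` are illustrative choices.

```python
import numpy as np
import pandas as pd

def fill_absent_feature(series, label="NoPool"):
    """Replace missing values with an explicit category label.
    Assumes a missing value means the feature is absent, not unrecorded."""
    return series.fillna(label)

# Toy series with the same categories as the output above.
toy = pd.Series([np.nan, "Ex", np.nan, "Gd", "TA", "Fa"], name="Pool QC")

filled = fill_absent_feature(toy)
print(filled.value_counts())
```

On the real data this would read `dataset['Pool QC'] = fill_absent_feature(dataset['Pool QC'])`.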