# USA House Prices Analysis & Prediction
The goal of this project is to explore and understand the factors influencing the real estate market in the United States and build machine learning models to accurately predict house prices.

During exploratory data analysis (EDA), we aim to:

+ Identify key factors that influence house prices (e.g., location, size, number of rooms, year built).
+ Detect outliers and anomalies in the dataset.
+ Analyze the distribution of house prices and uncover trends.
+ Explore correlations between property features and prices.

> Source data: https://www.kaggle.com/datasets/farhankarim1/usa-house-prices

## Import libraries

In [2]:
import pandas as pd
import seaborn as sns

## Import raw data

In [3]:
import os
import sys
from google.colab import drive

# Mount Google Drive so the project folder is visible from Colab
drive.mount('/content/drive')
project_path = "/content/drive/MyDrive/Pytorch pet projects/ML - Projects/ML - USA HOUSE PRICES"
# Make the project's src/ folder importable
sys.path.append(os.path.join(project_path, "src"))

# Load the raw dataset from the project's data folder
df = pd.read_csv(os.path.join(project_path, "data", "raw", "USA_Housing.csv"))


Mounted at /content/drive


## Discover data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB


In [5]:
df.describe()

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price |
|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 68583.108984 | 5.977222 | 6.987792 | 3.98133 | 36163.516039 | 1232073.0 |
| std | 10657.991214 | 0.991456 | 1.005833 | 1.234137 | 9925.650114 | 353117.6 |
| min | 17796.63119 | 2.644304 | 3.236194 | 2.0 | 172.610686 | 15938.66 |
| 25% | 61480.562388 | 5.322283 | 6.29925 | 3.14 | 29403.928702 | 997577.1 |
| 50% | 68804.286404 | 5.970429 | 7.002902 | 4.05 | 36199.406689 | 1232669.0 |
| 75% | 75783.338666 | 6.650808 | 7.665871 | 4.49 | 42861.290769 | 1471210.0 |
| max | 107701.748378 | 9.519088 | 10.759588 | 6.5 | 69621.713378 | 2469066.0 |
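
The summary above hints at a wide price range (the minimum is far below the first quartile), which ties back to two of the EDA goals: detecting outliers and exploring correlations with `Price`. The following is a minimal sketch of both steps, not the notebook's own method. It runs on a small synthetic frame named `toy` (an assumption, so the block is self-contained), and the helper `iqr_outliers` is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's `df`; these values are NOT from the dataset.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(68_000, 10_000, n)
toy = pd.DataFrame({
    "Avg. Area Income": income,
    "Area Population": rng.normal(36_000, 10_000, n),
    # Price is constructed to depend on income so a correlation is visible.
    "Price": 20 * income + rng.normal(0, 50_000, n),
})
toy.loc[0, "Price"] = 10_000_000  # one planted outlier


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Flag price outliers with the IQR rule.
mask = iqr_outliers(toy["Price"])
print(mask.sum(), "rows flagged as price outliers")

# Rank numeric features by their Pearson correlation with Price.
corr_with_price = (
    toy.select_dtypes("number").corr()["Price"].sort_values(ascending=False)
)
print(corr_with_price)
```

On the real data, the same two calls could be applied to `df` directly; `select_dtypes("number")` keeps the text column `Address` out of the correlation matrix.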


## Generating New Features for Better Predictions
+ In its raw form, the `Address` column is a single text string and is not directly usable as a feature.

Create new features based on `Address`.

In [21]:
# How many of the addresses are military addresses?
df["Military_address"] = df["Address"].str.extract(r"\b(APO|FPO|DPO)\b", expand=False)
mil_add = df["Military_address"].notna().sum()
print(f"{mil_add} addresses contain 'APO', 'FPO', or 'DPO', which use a specific set of ZIP Codes assigned to military locations.")

514 addresses contain 'APO', 'FPO', or 'DPO', which use a specific set of ZIP Codes assigned to military locations.
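
To make the pattern's behavior concrete, here is a small sketch of how `str.extract` with the word-boundary regex flags military addresses. Apart from the first entry, which is reassembled from the dataset's first row, the addresses are hypothetical examples written in the dataset's "street, newline, city line" layout.

```python
import pandas as pd

# Hypothetical sample addresses (only the first one mirrors a real row).
addresses = pd.Series([
    "208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101",
    "USS Example\nFPO AP 12345",
    "Unit 0000 Box 0000\nDPO AE 54321",
    "12 APOLLO Way\nSpringfield, CA 90000",  # "APOLLO" must not match "APO"
])

# With one capture group and expand=False, str.extract returns a Series:
# the matched code ("APO", "FPO", "DPO") or NaN when there is no match.
military_type = addresses.str.extract(r"\b(APO|FPO|DPO)\b", expand=False)
is_military = military_type.notna()

print(pd.DataFrame({"type": military_type, "is_military": is_military}))
```

The `\b` anchors matter: without them, a street name such as "APOLLO" would be counted as an `APO` address.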


In [25]:
# Remove the military addresses
df.drop(df[df['Military_address'].notna()].index, inplace=True)

# Split the Address into Street and City+State+Zip
df[["Street", "CityStateZip"]] = df["Address"].str.split("\n", expand=True)
df[["City", "StateZip"]] = df["CityStateZip"].str.split(", ", expand=True)
df[["State", "Zip"]] = df["StateZip"].str.split(" ", n=1, expand=True)

# Drop the original and helper columns that are no longer needed
df.drop(columns=["Address", "CityStateZip", "StateZip", "Military_address"], inplace=True)
df.head(5)

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Street | City | State | Zip |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1059034.0 | 208 Michael Ferry Apt. 674 | Laurabury | NE | 37010-5101 |
| 1 | 79248.642455 | 6.0029 | 6.730821 | 3.09 | 40173.072174 | 1505891.0 | 188 Johnson Views Suite 079 | Lake Kathleen | CA | 48958 |
| 2 | 61287.067179 | 5.86589 | 8.512727 | 5.13 | 36882.1594 | 1058988.0 | 9127 Elizabeth Stravenue | Danieltown | WI | 06482-3489 |
| 5 | 80175.754159 | 4.988408 | 6.104512 | 4.04 | 26748.428425 | 1068138.0 | 06039 Jennifer Islands Apt. 443 | Tracyport | KS | 16077 |
| 6 | 64698.463428 | 6.025336 | 8.14776 | 3.41 | 60828.249085 | 1502056.0 | 4759 Daniel Shoals Suite 442 | Nguyenburgh | CO | 20247 |
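
As an alternative to the chained `str.split` calls, the same four fields could be pulled out in one pass with a named-group regular expression. This is a sketch of a different technique, not the approach used in this notebook. The two sample addresses are reassembled from the first two rows shown above, and the pattern assumes every civilian address follows the `Street\nCity, ST ZIP` layout.

```python
import pandas as pd

# Addresses reassembled from the first two rows of the output above.
sample = pd.DataFrame({
    "Address": [
        "208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101",
        "188 Johnson Views Suite 079\nLake Kathleen, CA 48958",
    ]
})

# Named groups become column names; Zip accepts 5 digits or ZIP+4.
pattern = (
    r"^(?P<Street>.+)\n"
    r"(?P<City>.+), (?P<State>[A-Z]{2}) (?P<Zip>\d{5}(?:-\d{4})?)$"
)

# Rows that do not fit the pattern yield NaN in every extracted column.
parts = sample["Address"].str.extract(pattern)
print(parts)
```

One possible advantage of this variant is that an address with an unexpected layout shows up as a row of missing values that can be inspected, whereas a chained split may produce a different number of pieces than the target columns expect.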
