### DATA103 Final Project <br>
Submitted by **ALDECOA**, Renzel; **LLANES**, Arlan; **OPALLA**, Rijan - S11

---
#### Statement of the Problem

The <a href="https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset">USA Real Estate Dataset</a> contains real estate listings from multiple states across the USA. According to the author, the data was collected from <a href="https://www.realtor.com/">realtor.com</a>, a real estate listing website operated by *Move, Inc.*, a company based in Santa Clara, California.

The dataset contains **2,226,382** property listings with **12** features. These features are the following:
- ```brokered_by``` - an encoded identification number for an agency/broker
- ```status``` - housing status; one of ```for_sale```, ```sold```, or ```ready_to_build```
- ```price``` - housing price; either current listing price or recently sold price
- ```bed``` - number of bedrooms in the property
- ```bath``` - number of bathrooms in the property
- ```acre_lot``` - total property/lot area, in acres
- ```street``` - encoded street address
- ```city``` - city where the property is located
- ```state``` - state where the property is located
- ```zip_code``` - postal code of the area
- ```house_size``` - the size of the property, in square feet
- ```prev_sold_date``` - date of the previous sale

The project is framed as a binary **classification** problem.

Main features:
- ```price```
- ```bed``` (number of bedrooms)
- ```bath``` (number of bathrooms)
- ```acre_lot``` (area of lot in acres)
- ```house_size``` (size of house in square feet)

Label:
- ```Worth [1]``` / ```Not worth [0]``` (binary) - determines if the house, with its amenities, is worth the price


---
#### Importing necessary libraries

In [6]:
import pandas as pd
import numpy as np
import plotly.express as px

---
#### Extracting dataset from compressed file

The dataset is **large**, taking up **178.86** MB on disk. It is pushed to the repository in compressed form so that *Git LFS* (Large File Storage) is not required.

In [9]:
import zipfile

# Extract the archive contents into the same folder as the zip file
with zipfile.ZipFile('USA-Real-Estate/archive.zip', 'r') as zip_ref:
    zip_ref.extractall('USA-Real-Estate/')

In [10]:
# low_memory=False parses the file in one pass so each column gets a single inferred dtype
df = pd.read_csv("USA-Real-Estate/realtor-data.zip.csv", low_memory=False)
df.shape

(2226382, 12)
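
As an alternative to extracting the archive first, the CSV could be read straight from the zip file. The sketch below is only an illustration: the helper name ```read_csv_from_zip``` is hypothetical, and it assumes the archive holds at least one ```.csv``` member (such as ```realtor-data.zip.csv```).

```python
import zipfile

import pandas as pd


def read_csv_from_zip(zip_path, member=None, **read_csv_kwargs):
    """Load one CSV member of a zip archive into a DataFrame without extracting it.

    Hypothetical helper: if `member` is None, the first *.csv entry is used.
    """
    with zipfile.ZipFile(zip_path, "r") as zf:
        if member is None:
            csv_members = [n for n in zf.namelist() if n.lower().endswith(".csv")]
            if not csv_members:
                raise ValueError(f"no CSV file found in {zip_path}")
            member = csv_members[0]
        # zf.open returns a binary file-like object that pandas can parse directly
        with zf.open(member) as fh:
            return pd.read_csv(fh, **read_csv_kwargs)
```

With the paths used in this notebook, a call might look like ```read_csv_from_zip('USA-Real-Estate/archive.zip', low_memory=False)```. This avoids leaving the uncompressed file on disk, at the cost of decompressing on every load.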

---
#### Exploring the dataset

In [12]:
df.head(3)

,brokered_by,status,price,bed,bath,acre_lot,street,city,state,zip_code,house_size,prev_sold_date
0,103378.0,for_sale,105000.0,3.0,2.0,0.12,1962661.0,Adjuntas,Puerto Rico,601.0,920.0,
1,52707.0,for_sale,80000.0,4.0,2.0,0.08,1902874.0,Adjuntas,Puerto Rico,601.0,1527.0,
2,103379.0,for_sale,67000.0,2.0,1.0,0.15,1404990.0,Juana Diaz,Puerto Rico,795.0,748.0,


The features include the following:

In [14]:
df.columns

Index(['brokered_by', 'status', 'price', 'bed', 'bath', 'acre_lot', 'street',
       'city', 'state', 'zip_code', 'house_size', 'prev_sold_date'],
      dtype='object')

In [15]:
df.dtypes

brokered_by       float64
status             object
price             float64
bed               float64
bath              float64
acre_lot          float64
street            float64
city               object
state              object
zip_code          float64
house_size        float64
prev_sold_date     object
dtype: object
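
Several of these columns are identifiers or dates that pandas has read as ```float64``` or ```object```. The sketch below shows one possible cleanup, not a step taken in this notebook: the helper name ```tidy_dtypes``` is hypothetical, and it assumes the identifier columns hold whole numbers and that ZIP codes should be five-digit strings (for example, ```601.0``` becomes ```00601```).

```python
import pandas as pd

# Columns that are numeric codes rather than measurements (assumption based on the feature list)
ID_COLUMNS = ["brokered_by", "street", "zip_code"]


def tidy_dtypes(frame):
    """Return a copy with identifier, date, and category columns given more fitting dtypes."""
    out = frame.copy()
    for col in ID_COLUMNS:
        # Nullable Int64 drops the trailing ".0" while keeping missing values as <NA>
        out[col] = out[col].astype("Int64")
    # Restore leading zeros that were lost when ZIP codes were parsed as numbers
    out["zip_code"] = out["zip_code"].astype("string").str.zfill(5)
    # Unparseable or missing dates become NaT instead of raising
    out["prev_sold_date"] = pd.to_datetime(out["prev_sold_date"], errors="coerce")
    out["status"] = out["status"].astype("category")
    return out
```

Calling ```tidy_dtypes(df)``` returns a new DataFrame and leaves ```df``` unchanged.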

Brief statistical description of the ```float``` features that hold actual measurements (**not** numeric codes used for categorical encoding):

In [44]:
df[["price", "bed", "bath", "acre_lot", "house_size"]].describe().T

,count,mean,std,min,25%,50%,75%,max
price,2224841.0,524195.519291,2138893.0,0.0,165000.0,325000.0,550000.0,2147484000.0
bed,1745065.0,3.275841,1.567274,1.0,3.0,3.0,4.0,473.0
bath,1714611.0,2.49644,1.652573,1.0,2.0,2.0,3.0,830.0
acre_lot,1900793.0,15.223027,762.8238,0.0,0.15,0.26,0.98,100000.0
house_size,1657898.0,2714.471335,808163.5,4.0,1300.0,1760.0,2413.0,1040400000.0
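
The counts in this table fall short of the 2,226,382 rows (```bed``` has 1,745,065 non-null values, so roughly 21.6% are missing), and maxima such as 473 bedrooms or 830 bathrooms suggest outliers. A minimal sketch of how both could be inspected follows; the helper names and the 1st/99th percentile cut-offs are illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd

NUMERIC_COLS = ["price", "bed", "bath", "acre_lot", "house_size"]


def missing_share(frame, cols=NUMERIC_COLS):
    """Fraction of missing values per column (0.0 means fully populated)."""
    return frame[cols].isna().mean()


def outside_quantiles(frame, col, lower=0.01, upper=0.99):
    """Boolean mask of rows whose value in `col` lies outside the given quantile range."""
    low_cut, high_cut = frame[col].quantile([lower, upper])
    # Comparisons with NaN are False, so missing values are never flagged here
    return (frame[col] < low_cut) | (frame[col] > high_cut)
```

For example, ```df[outside_quantiles(df, "bed")]``` would list the rows with unusually low or high bedroom counts for manual review.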


Exploring the ```status``` feature:

In [32]:
df["status"].unique()

array(['for_sale', 'ready_to_build', 'sold'], dtype=object)

In [34]:
df["status"].value_counts()

status
for_sale          1389306
sold               812009
ready_to_build      25067
Name: count, dtype: int64
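
To put these counts in proportion, the short sketch below converts them to shares of the whole dataset. The counts are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
status_counts = pd.Series(
    {"for_sale": 1389306, "sold": 812009, "ready_to_build": 25067},
    name="count",
)

# Share of each status, rounded to four decimal places
status_share = (status_counts / status_counts.sum()).round(4)
print(status_share)
```

About 62.4% of listings are ```for_sale```, 36.5% are ```sold```, and only about 1.1% are ```ready_to_build```. On the full DataFrame, ```df["status"].value_counts(normalize=True)``` gives the same shares directly.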