# Milestone 1 - House Sales
### Reeya Patel & Megan Mohr
## Dataset Description: 
The USA House Sales Dataset contains information about residential property listings across different cities and states in the United States. Each row represents a single property and includes details such as its price, location, size, number of bedrooms and bathrooms, year built, and how long it has been on the market. The dataset also includes additional listing information, such as the agent responsible for selling the property and the listing’s current status (e.g., For Sale, Sold, Pending).

## Attribute Description: 
1. Price-
    The price of a listed property.
    Type- REAL
2. Address-
    The address of the listed property.
    Type- TEXT
3. City-
    The city in which the listed property is located.
    Type- TEXT
4. Zipcode-
    The ZIP code of the listed property.
    Type- INTEGER
5. State-
    The state the property is listed in.
    Type- TEXT
6. Bedrooms-
    The number of bedrooms in the listed property.
    Type- INTEGER
7. Bathrooms-
    The number of bathrooms in the listed property.
    Type- REAL
8. Area (Sqft)-
    The area of the listed property, in square feet.
    Type- INTEGER
9. Lot Size-
    The size of the listed property's lot, in square feet.
    Type- TEXT
10. Year Built-
    The year the listed property was built.
    Type- INTEGER
11. Days on Market-
    The number of days the listed property has been on the market.
    Type- INTEGER
12. Property Type-
    The type of listed property.
    Type- TEXT
13. MLS ID-
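    A unique identifier assigned to the listing in the Multiple Listing Service (MLS). In the sample rows, the same agent appears with different MLS IDs, so the value identifies the listing rather than the agent or office.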
    Type- TEXT
14. Listing Agent-
    The agent that has been assigned to the listed property.
    Type- TEXT
15. Status-
    The current sale status of the listed property (e.g., For Sale, Pending, Sold).
    Type- TEXT
16. Listing URL-
    The URL for the listing page of the listed property.
    Type- TEXT

## Data Source: 
https://www.kaggle.com/datasets/abdulwadood11220/usa-house-sales-data

## Dataset Display:

    import pandas as pd

    # Load the raw Kaggle CSV and preview the first five rows
    df = pd.read_csv("usa_house_sales.csv")
    df.head()

| Unnamed: 0 | Price | Address | City | Zipcode | State | Bedrooms | Bathrooms | Area (Sqft) | Lot Size | Year Built | Days on Market | Property Type | MLS ID | Listing Agent | Status | Listing URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \$554,217 | 5926 Oak Ave, San Diego, CA 65383 | San Diego | 65383 | CA | 1 bds | 3 ba | 772 sqft | 4757 sqft | 1959 | 101 | Townhouse | Z104635 | Alex Johnson - Compass | For Sale | https://www.zillow.com/homedetails/80374762_zpid/ |
| 1 | \$164,454 | 9583 Oak Ave, Fresno, IL 79339 | Fresno | 79339 | IL | 1 bds | 1 ba | 2348 sqft | 3615 sqft | 1969 | 46 | Apartment | Z535721 | Emily Davis - Century 21 | Sold | https://www.zillow.com/homedetails/86143665_zpid/ |
| 2 | \$1,249,331 | 8224 Oak Ave, Sacramento, TX 87393 | Sacramento | 87393 | TX | 6 bds | 1 ba | 3630 sqft | 9369 sqft | 1990 | 59 | Townhouse | Z900458 | Mike Lee - Coldwell Banker | For Sale | https://www.zillow.com/homedetails/37082403_zpid/ |
| 3 | \$189,267 | 232 Oak Ave, Fresno, TX 38666 | Fresno | 38666 | TX | 2 bds | 1 ba | 605 sqft | 8804 sqft | 1958 | 119 | Apartment | Z318589 | John Doe - RE/MAX | Pending | https://www.zillow.com/homedetails/39318132_zpid/ |
| 4 | \$465,778 | 5446 Pine Rd, Los Angeles, CA 23989 | Los Angeles | 23989 | CA | 3 bds | 2 ba | 1711 sqft | 9260 sqft | 2020 | 26 | Townhouse | Z899716 | John Doe - RE/MAX | Pending | https://www.zillow.com/homedetails/22454634_zpid/ |
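
The preview shows that several numeric attributes arrive as text (`$554,217`, `1 bds`, `772 sqft`), so they would need cleaning before they match the numeric types given in the attribute description. The following is a minimal sketch of that step, assuming the column names shown in the preview. The helper name `clean_listing_columns` is hypothetical, and the two sample rows are copied from the preview so the example runs without the full CSV.

```python
import pandas as pd

# Two rows copied from the preview; the notebook itself would start from
# pd.read_csv("usa_house_sales.csv") instead of this hand-built frame.
raw = pd.DataFrame({
    "Price": ["$554,217", "$1,249,331"],
    "Bedrooms": ["1 bds", "6 bds"],
    "Bathrooms": ["3 ba", "1 ba"],
    "Area (Sqft)": ["772 sqft", "3630 sqft"],
    "Lot Size": ["4757 sqft", "9369 sqft"],
})


def clean_listing_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: strip symbols and unit labels, then cast to numbers."""
    out = df.copy()
    # "$554,217" -> 554217.0
    out["Price"] = out["Price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # "1 bds" -> 1
    out["Bedrooms"] = out["Bedrooms"].str.extract(r"(\d+)", expand=False).astype(int)
    # "3 ba" -> 3.0 (float, so that half baths such as 2.5 would also fit)
    out["Bathrooms"] = (
        out["Bathrooms"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
    )
    # "772 sqft" -> 772
    for col in ["Area (Sqft)", "Lot Size"]:
        out[col] = out[col].str.extract(r"(\d+)", expand=False).astype(int)
    return out


clean = clean_listing_columns(raw)
print(clean.dtypes)
```

Whether the full dataset contains values that break these patterns (missing prices, lot sizes in acres) is not known from the preview, so the regular expressions should be treated as a starting point.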


## Relational Schema Design: 
**Entities:** 
1. **Property**(property_id PK, city_id FK → City.city_id)  
2. **Agent**(agent_id PK)  
3. **Status**(status_id PK)  
4. **City**(city_id PK)  
5. **Listing**(listing_id PK, property_id FK → Property.property_id, agent_id FK → Agent.agent_id, status_id FK → Status.status_id)

## Tables
**City Table**

    CREATE TABLE City (
        city_id INT NOT NULL AUTO_INCREMENT,
        city VARCHAR(100),
        state VARCHAR(50),
        zipcode VARCHAR(20),
        PRIMARY KEY (city_id)
    );

**Property Table**

    CREATE TABLE Property (
        property_id INT NOT NULL AUTO_INCREMENT,
        address VARCHAR(255),
        city_id INT,
        price DECIMAL(15,2),
        bedrooms INT,
        bathrooms DECIMAL(3,1),
        area_sqft INT,
        lot_size DECIMAL(10,2),
        year_built INT,
        days_on_market INT,
        property_type VARCHAR(50),
        PRIMARY KEY (property_id),
        FOREIGN KEY (city_id) REFERENCES City(city_id)
    );

**Agent Table**

    CREATE TABLE Agent (
        agent_id INT NOT NULL AUTO_INCREMENT,
        listing_agent VARCHAR(100),
        PRIMARY KEY (agent_id)
    );

**Status Table**

    CREATE TABLE Status (
        status_id INT NOT NULL AUTO_INCREMENT,
        status VARCHAR(50),
        PRIMARY KEY (status_id)
    );

**Listing Table**

    CREATE TABLE Listing (
        listing_id INT NOT NULL AUTO_INCREMENT,
        property_id INT,
        agent_id INT,
        status_id INT,
        listing_url VARCHAR(255),
        PRIMARY KEY (listing_id),
        FOREIGN KEY (property_id) REFERENCES Property(property_id),
        FOREIGN KEY (agent_id) REFERENCES Agent(agent_id),
        FOREIGN KEY (status_id) REFERENCES Status(status_id)
    );

- City – Stores unique city, state, and zipcode combinations.
- Property – Each property’s details (price, bedrooms, bathrooms, area, etc.).
- Agent – Real estate agents responsible for listings.
- Status – Listing state (For Sale, Sold, Pending, etc.).
- Listing – Links each property to its agent, status, and listing URL.
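
One way to derive these tables from the flat CSV can be sketched in pandas. This is a minimal illustration rather than the project's actual loading code: the helper `make_lookup` is a hypothetical name, and the five sample rows are taken from the dataset preview.

```python
import pandas as pd

# Five rows from the dataset preview, limited to the columns that repeat.
df = pd.DataFrame({
    "City": ["San Diego", "Fresno", "Sacramento", "Fresno", "Los Angeles"],
    "State": ["CA", "IL", "TX", "TX", "CA"],
    "Zipcode": [65383, 79339, 87393, 38666, 23989],
    "Listing Agent": [
        "Alex Johnson - Compass", "Emily Davis - Century 21",
        "Mike Lee - Coldwell Banker", "John Doe - RE/MAX", "John Doe - RE/MAX",
    ],
    "Status": ["For Sale", "Sold", "For Sale", "Pending", "Pending"],
})


def make_lookup(frame: pd.DataFrame, cols: list, id_name: str) -> pd.DataFrame:
    """Hypothetical helper: one row per distinct combination of `cols`,
    numbered from 1 to stand in for an AUTO_INCREMENT key."""
    lookup = frame[cols].drop_duplicates().reset_index(drop=True)
    lookup.insert(0, id_name, list(range(1, len(lookup) + 1)))
    return lookup


city = make_lookup(df, ["City", "State", "Zipcode"], "city_id")
agent = make_lookup(df, ["Listing Agent"], "agent_id")
status = make_lookup(df, ["Status"], "status_id")

# Swap the repeated text for foreign keys. city_id would go on Property;
# agent_id and status_id would go on Listing.
keyed = (
    df.merge(city, on=["City", "State", "Zipcode"], how="left")
      .merge(agent, on="Listing Agent", how="left")
      .merge(status, on="Status", how="left")
)[["city_id", "agent_id", "status_id"]]

print(keyed)
```

In the sample, the two listings by the same agent share one `agent_id`, which is the redundancy the Agent table is meant to remove.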

## Joined Tables 
**Property City** connects each property to the city it’s located in.

    CREATE TABLE PropertyCity (
        property_id INT NOT NULL,
        city_id INT NOT NULL,
        PRIMARY KEY (property_id, city_id),
        FOREIGN KEY (property_id) REFERENCES Property(property_id),
        FOREIGN KEY (city_id) REFERENCES City(city_id)
    );

**Property Agent** connects each property to the agent(s) responsible for selling or listing it.

    CREATE TABLE PropertyAgent (
        property_id INT NOT NULL,
        agent_id INT NOT NULL,
        PRIMARY KEY (property_id, agent_id),
        FOREIGN KEY (property_id) REFERENCES Property(property_id),
        FOREIGN KEY (agent_id) REFERENCES Agent(agent_id)
    );

**Property Status** connects each property to its current or historical listing status (e.g., For Sale, Pending, Sold).

    CREATE TABLE PropertyStatus (
        property_id INT NOT NULL,
        status_id INT NOT NULL,
        PRIMARY KEY (property_id, status_id),
        FOREIGN KEY (property_id) REFERENCES Property(property_id),
        FOREIGN KEY (status_id) REFERENCES Status(status_id)
    );

**Listing** connects a property, its listing agent, and its current status.

    CREATE TABLE Listing (
        listing_id INT NOT NULL AUTO_INCREMENT,
        property_id INT NOT NULL,
        agent_id INT NOT NULL,
        status_id INT NOT NULL,
        listing_url VARCHAR(255),
        PRIMARY KEY (listing_id),
        FOREIGN KEY (property_id) REFERENCES Property(property_id),
        FOREIGN KEY (agent_id) REFERENCES Agent(agent_id),
        FOREIGN KEY (status_id) REFERENCES Status(status_id)
    );

## Entity-Relationship (ER) Diagram
<p align="center">
  <img src="ER.png" width="700">
</p>

## Relational Model (RM) Diagram
<p align="center">
    <img src="RM.png" width="700">
</p>

## Normalization
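The original CSV file of the House Sales dataset stores everything in one flat table, so the same city, state, agent, and status values are repeated on many rows. Several columns also pack more than one fact into a single value: Address embeds the city, state, and zipcode, and fields such as Bedrooms ("1 bds") and Area ("772 sqft") mix a number with a unit label. Repetition across rows does not by itself violate 1NF, but it invites update anomalies, while the packed values conflict with the 1NF requirement that every attribute value be atomic. To fix this, we moved the repeating attributes into separate relations and store the numeric fields as single typed values, so that each field holds only one value.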

We created separate tables for City, Property, Agent, Status, and Listing to organize the data. Instead of storing city and state repeatedly for each property, a City table was created, and each property references it using a foreign key (city_id). Similarly, agent and status information were separated and connected through the Listing table.

Each of these tables has a single-column primary key, so no partial dependencies can exist: every non-prime attribute depends on the whole key of its table. The design therefore satisfies 2NF and improves data consistency.
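
Dependency claims like these can be spot-checked against the data. The sketch below tests whether one set of columns determines another, using the five rows from the dataset preview. The function name `fd_holds` is hypothetical, and a check on sample rows can only refute a dependency, not prove that it holds for the full dataset.

```python
import pandas as pd

# Location columns from the five preview rows.
rows = pd.DataFrame({
    "City": ["San Diego", "Fresno", "Sacramento", "Fresno", "Los Angeles"],
    "State": ["CA", "IL", "TX", "TX", "CA"],
    "Zipcode": [65383, 79339, 87393, 38666, 23989],
})


def fd_holds(frame: pd.DataFrame, determinant: list, dependent: list) -> bool:
    """Hypothetical helper: True if every distinct `determinant` value
    appears with exactly one combination of `dependent` values."""
    distinct = frame[determinant + dependent].drop_duplicates()
    # A determinant value that is still duplicated here maps to 2+ dependents.
    return not distinct.duplicated(subset=determinant).any()


# City alone does not determine State: Fresno appears with both IL and TX.
print(fd_holds(rows, ["City"], ["State"]))

# In this sample, each Zipcode maps to a single (City, State) pair.
print(fd_holds(rows, ["Zipcode"], ["City", "State"]))
```

The first result is consistent with the City table storing the full city, state, and zipcode combination instead of treating the city name as a key.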

## Mockups 
1. Average house price by state (bar chart)
<p align="center">
    <img src="mockup1.png" width="700">
</p>

2. Average days on market per state (bar chart)
<p align="center">
    <img src="mockup2.png" width="700">
</p>

3. Top 10 cities with the highest average price (column chart)
<p align="center">
    <img src="mockup3.png" width="700">
</p>

4. Price trend by year built (line chart)
<p align="center">
    <img src="mockup4.png" width="700">
</p>

5. Top agents by number of listings (pie chart)
<p align="center">
    <img src="mockup5.png" width="700">
</p>
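
The aggregations behind the five mockups can be sketched in pandas as follows. This assumes the price has already been converted to a number and uses the column names from the dataset preview. The variable names are illustrative, and the five sample rows stand in for the full CSV.

```python
import pandas as pd

# Five preview rows with Price already converted to a number.
df = pd.DataFrame({
    "Price": [554217.0, 164454.0, 1249331.0, 189267.0, 465778.0],
    "City": ["San Diego", "Fresno", "Sacramento", "Fresno", "Los Angeles"],
    "State": ["CA", "IL", "TX", "TX", "CA"],
    "Year Built": [1959, 1969, 1990, 1958, 2020],
    "Days on Market": [101, 46, 59, 119, 26],
    "Listing Agent": [
        "Alex Johnson - Compass", "Emily Davis - Century 21",
        "Mike Lee - Coldwell Banker", "John Doe - RE/MAX", "John Doe - RE/MAX",
    ],
})

# 1. Average house price by state
avg_price_by_state = df.groupby("State")["Price"].mean()

# 2. Average days on market per state
avg_days_by_state = df.groupby("State")["Days on Market"].mean()

# 3. Top 10 cities with the highest average price
top_cities = df.groupby("City")["Price"].mean().nlargest(10)

# 4. Average price by year built, ordered by year for a line chart
price_by_year = df.groupby("Year Built")["Price"].mean().sort_index()

# 5. Number of listings per agent, most active first
listings_per_agent = df["Listing Agent"].value_counts()

print(avg_price_by_state)
```

Each result is a pandas Series indexed by the grouping column, so it can be passed directly to a plotting library to produce the corresponding chart.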
