# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/tonygordonjr/zillow-real-estate-data?utm_source=chatgpt.com&select=listing_mortgage_info.csv

property_listings.csv

listing_subtype.csv

listing_nearby_homes.csv



Import the necessary libraries and create your dataframe(s).

In [1]:
import pandas as pd
import numpy as np

BASE_PATH = "../Data/Final Project Datasets/"

property_df = pd.read_csv(BASE_PATH + "property_listings.csv")
subtype_df = pd.read_csv(BASE_PATH + "listing_subtype.csv")
nearby_df = pd.read_csv(BASE_PATH + "listing_nearby_homes.csv")

property_df.head()

|   | zpid | price | homeStatus | homeType | datePosted | streetAddress | city | state | zipcode | county | ... | rentZestimate | bathrooms | bedrooms | pageViewCount | favoriteCount | propertyTaxRate | timeOnZillow | dateSold | url | lastUpdated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 32107262.0 | 750000.0 | Recently Sold | Multi Family | 2024-03-19 | 7417 87th Rd | Jamaica | NY | 11421.0 | Queens County | ... | 2930.0 | 2.0 | NaN | 20.0 | 0.0 | 0.86 | 9 hours | 2024-11-24 | https://www.zillow.com/homedetails/7417-87th-R... | 2024-11-25 09:04:11.007468 UTC |
| 1 | 20503342.0 | 3995.0 | Recently Sold | Apartment | 2024-09-24 | 1300 Midvale Ave APT 510 | Los Angeles | CA | 90024.0 | Los Angeles County | ... | 3867.0 | 2.0 | 2.0 | 187.0 | 5.0 | 1.16 | 9 hours | 2024-11-24 | https://www.zillow.com/homedetails/1300-Midval... | 2024-11-25 09:04:11.007468 UTC |
| 2 | 20183958.0 | 820000.0 | Recently Sold | Single Family | 2024-10-27 | 8300 Capps Ave | Northridge | CA | 91324.0 | Los Angeles County | ... | 4540.0 | 2.0 | 3.0 | 21.0 | 0.0 | 1.16 | 9 hours | 2024-11-24 | https://www.zillow.com/homedetails/8300-Capps-... | 2024-11-25 09:04:11.007468 UTC |
| 3 | 32332472.0 | 550000.0 | Recently Sold | Single Family | 2024-07-09 | 433 Hamden Ave | Staten Island | NY | 10306.0 | Richmond County | ... | 2668.0 | 1.0 | 2.0 | 96.0 | 0.0 | 0.89 | 9 hours | 2024-11-24 | https://www.zillow.com/homedetails/433-Hamden-... | 2024-11-25 09:04:11.007468 UTC |
| 4 | 352427429.0 | 703478.0 | Recently Sold | Single Family | 2024-06-19 | 504 Edwin St #8 | Nashville | TN | 37207.0 | Davidson County | ... | 3599.0 | 4.0 | 4.0 | 7.0 | 0.0 | 0.57 | 9 hours | 2024-11-24 | https://www.zillow.com/homedetails/504-Edwin-S... | 2024-11-25 09:04:11.007468 UTC |


In [None]:
# Creating a working copy of the main dataset for cleaning
# This ensures the original data remains unchanged

clean_df = property_df.copy()


In [None]:
# Converting datePosted to datetime to support time-based analysis

clean_df["datePosted"] = pd.to_datetime(clean_df["datePosted"], errors="coerce")
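As a small illustration of what `errors="coerce"` does in the conversion above, here is a sketch on made-up values (the sample strings are hypothetical and not taken from the dataset). Entries that cannot be parsed become `NaT`, so they show up in the same missing-value counts as any other gap.

```python
import pandas as pd

# Hypothetical sample values; the last two are deliberately unusable.
sample = pd.Series(["2024-03-19", "2024-09-24", "not a date", None])

# errors="coerce" converts anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(sample, errors="coerce")

# NaT counts as missing, so isna() can report how many values were lost.
n_bad = parsed.isna().sum()
print(parsed)
print("Unparseable or missing dates:", n_bad)
```

Counting `NaT` values right after the conversion is one way to confirm that coercion did not silently discard a large share of the column.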


In [35]:
clean_df.info()
clean_df.describe()
clean_df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
Index: 18607 entries, 0 to 18777
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   zpid             18607 non-null  float64       
 1   price            18607 non-null  float64       
 2   homeStatus       18607 non-null  object        
 3   homeType         18607 non-null  object        
 4   datePosted       18607 non-null  datetime64[ns]
 5   streetAddress    18607 non-null  object        
 6   city             18607 non-null  object        
 7   state            18607 non-null  object        
 8   zipcode          18607 non-null  float64       
 9   county           18607 non-null  object        
 10  yearBuilt        18471 non-null  float64       
 11  livingArea       18607 non-null  float64       
 12  livingAreaUnits  18607 non-null  object        
 13  rentZestimate    18607 non-null  float64       
 14  bathrooms        18607 non-null  float64   

zpid                 0
price                0
homeStatus           0
homeType             0
datePosted           0
streetAddress        0
city                 0
state                0
zipcode              0
county               0
yearBuilt          136
livingArea           0
livingAreaUnits      0
rentZestimate        0
bathrooms            0
bedrooms             0
pageViewCount        0
favoriteCount        0
propertyTaxRate      0
timeOnZillow         0
dateSold             0
url                  0
lastUpdated          0
dtype: int64

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [84]:
# Checking for missing values in the working dataset

clean_df.isna().sum()



zpid                 0
price                0
homeStatus           0
homeType             0
datePosted           0
city                 0
state                0
zipcode              0
county               0
yearBuilt          136
livingArea           0
livingAreaUnits      0
rentZestimate        0
bathrooms            0
bedrooms             0
pageViewCount        0
favoriteCount        0
propertyTaxRate      0
timeOnZillow         0
dateSold             0
lastUpdated          0
dtype: int64

In [85]:
# Dropping rows missing critical fields needed for my analysis

clean_df = clean_df.dropna(subset=["price", "datePosted"])

In [61]:
# Missing subtype values are filled with 'Unknown' since they are categorical.

subtype_df = subtype_df.fillna("Unknown")
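One thing to keep in mind is that calling `fillna("Unknown")` on a whole DataFrame also writes the text into numeric columns that have gaps. The sketch below shows one possible way to restrict the fill to non-numeric columns. The toy frame and its column names are assumptions for illustration, not the actual layout of `listing_subtype.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for subtype_df; column names are illustrative only.
toy = pd.DataFrame({
    "zpid": [1.0, 2.0, 3.0],
    "subtype": ["Foreclosure", None, "New Construction"],
    "score": [0.5, np.nan, 0.9],
})

# Pick only the non-numeric columns so numeric columns keep their dtype.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
toy[text_cols] = toy[text_cols].fillna("Unknown")

print(toy)
print(toy.dtypes)
```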


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [62]:
# I am removing listings with invalid prices. A price of 0 indicates missing or placeholder data
# and would distort the analysis. In my EDA, .describe() showed a minimum price of 0, which is
# not a real listing price.

clean_df = clean_df[clean_df["price"] > 0]


In [63]:
# I replaced invalid "yearBuilt" values with missing values. Years less than or equal to
# 0 or later than 2024 are not valid. For example, in my EDA the "yearBuilt"
# column showed values like 0 and 9999, which are not realistic.

clean_df.loc[clean_df["yearBuilt"] <= 0, "yearBuilt"] = pd.NA
clean_df.loc[clean_df["yearBuilt"] > 2024, "yearBuilt"] = pd.NA


I had several outliers in my EDA, but I chose to keep them because those values are needed
to show the appreciation in property values over the 2022-2024 timeframe, regardless of
how large it is.
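If the outliers are kept, it can still help to flag them so they can be separated in later charts. The sketch below uses the common 1.5 × IQR rule on made-up prices. Both the rule and the sample values are assumptions for illustration, not something this notebook applies to the real data.

```python
import pandas as pd

# Hypothetical prices; the last value mimics an extreme luxury listing.
prices = pd.Series(
    [258000, 400000, 695000, 320000, 510000, 475000000], name="price"
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rather than drop, so every row stays available for analysis.
is_outlier = (prices < lower) | (prices > upper)
print(pd.DataFrame({"price": prices, "is_outlier": is_outlier}))
```

A boolean flag like this leaves the data intact while making it easy to plot typical listings and extreme listings separately.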

In [64]:
# Viewing my dataset after cleaning.

clean_df.describe()


|   | zpid | price | datePosted | zipcode | yearBuilt | livingArea | rentZestimate | bathrooms | bedrooms | pageViewCount | favoriteCount | propertyTaxRate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 18607.0 | 18607.0 | 18607 | 18607.0 | 18471.0 | 18607.0 | 18607.0 | 18607.0 | 18607.0 | 18607.0 | 18607.0 | 18607.0 |
| mean | 263401600.0 | 678552.1 | 2024-11-08 01:39:26.786693376 | 51527.167195 | 1976.001191 | 1865.257269 | 2780.808405 | 4.687859 | 6.135057 | 203.279357 | 12.485624 | 1.032197 |
| min | 1069739.0 | 1.0 | 2022-02-22 00:00:00 | 2108.0 | 1800.0 | 0.0 | 673.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 26139770.0 | 258000.0 | 2024-11-19 00:00:00 | 30045.0 | 1950.0 | 1210.0 | 1984.0 | 2.0 | 3.0 | 51.0 | 2.0 | 0.67 |
| 50% | 53153540.0 | 400000.0 | 2024-11-22 00:00:00 | 44113.0 | 1980.0 | 1633.0 | 2338.0 | 2.0 | 3.0 | 121.0 | 6.0 | 0.86 |
| 75% | 318351800.0 | 695000.0 | 2024-11-25 00:00:00 | 78249.0 | 2007.0 | 2208.0 | 2849.5 | 3.0 | 4.0 | 234.0 | 14.0 | 1.47 |
| max | 2146974000.0 | 475000000.0 | 2024-12-26 00:00:00 | 98199.0 | 2024.0 | 49087.0 | 108970.0 | 38451.0 | 51268.0 | 18880.0 | 1518.0 | 2.0 |
| std | 528143800.0 | 3646904.0 | NaN | 28102.590335 | 37.432473 | 1311.602788 | 2534.261116 | 282.277738 | 376.132878 | 396.86281 | 27.537829 | 0.490886 |
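The summary above still shows a minimum price of 1.0 after the `price > 0` filter, which may be another placeholder value. A minimal sketch of how such rows could be pulled out for review instead of being dropped outright is shown below. The cutoff of 1000 and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; the 1.0 price mirrors the minimum in the summary above.
toy = pd.DataFrame({
    "zpid": [1, 2, 3, 4],
    "price": [1.0, 3995.0, 400000.0, 695000.0],
    "homeStatus": ["For Sale", "For Rent", "For Sale", "Recently Sold"],
})

# Assumed cutoff for illustration; rentals can legitimately have low prices,
# so they are excluded from the check.
SUSPECT_BELOW = 1000
suspect = toy[(toy["price"] < SUSPECT_BELOW) & (toy["homeStatus"] != "For Rent")]
print(suspect)
```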



## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [66]:
# Dropping columns not used in analysis. Could have been more but
# starting with what I know for sure I don't need.

clean_df = clean_df.drop(columns=["url", "streetAddress"])


In [67]:
# I will keep what supports my analysis and 
# drop what is not useful for our project goals.

df = property_df_clean.copy()


In [68]:
# Checking for full duplicate rows
df.duplicated().sum()


np.int64(0)

In [69]:
# Removing exact duplicate rows 
df = df.drop_duplicates()


In [70]:
# Checked for duplicate listings by unique identifier (zpid)
# If duplicates exist for the same zpid, keep the most recently updated record.
if "lastUpdated" in df.columns:
    df["lastUpdated"] = pd.to_datetime(df["lastUpdated"], errors="coerce")

# Count duplicated zpids
df["zpid"].duplicated().sum()


np.int64(363)

In [71]:
# Since multiple rows share the same zpid, keep only the most recently updated listing
# to avoid double-counting the same property.

df = (
    df.sort_values("lastUpdated", ascending=False)
      .drop_duplicates(subset="zpid", keep="first")
)


In [72]:
# Keeping only the most recently updated records on each property. This will help 
# avoid counting the same listing multiple times. 

df = df.sort_values("lastUpdated").drop_duplicates(subset=["zpid"], keep="last")
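To make the keep-the-latest-record logic concrete, here is a sketch on a made-up frame (the values are hypothetical). Sorting by `lastUpdated` and keeping the last row per `zpid` leaves exactly one row per property, and that row is the most recent one.

```python
import pandas as pd

# Hypothetical listings: zpid 101 appears twice with different update times.
toy = pd.DataFrame({
    "zpid": [101, 101, 202],
    "price": [500000, 495000, 300000],
    "lastUpdated": pd.to_datetime(["2024-11-20", "2024-11-25", "2024-11-22"]),
})

# Sort oldest to newest, then keep the last (newest) row for each zpid.
latest = (
    toy.sort_values("lastUpdated")
       .drop_duplicates(subset="zpid", keep="last")
       .sort_values("zpid")
       .reset_index(drop=True)
)
print(latest)
```

Sorting descending with `keep="first"` gives the same result, so only one of the two forms is needed in practice.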


In [73]:
# Verifying the cleaning steps completed so far.

df.shape
df.head()
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 18415 entries, 17097 to 19
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   zpid             18415 non-null  float64            
 1   price            18415 non-null  float64            
 2   homeStatus       18415 non-null  object             
 3   homeType         18415 non-null  object             
 4   datePosted       18415 non-null  object             
 5   streetAddress    18415 non-null  object             
 6   city             18415 non-null  object             
 7   state            18415 non-null  object             
 8   zipcode          18415 non-null  float64            
 9   county           18415 non-null  object             
 10  yearBuilt        18415 non-null  float64            
 11  livingArea       18415 non-null  float64            
 12  livingAreaUnits  18415 non-null  object             
 13  rentZestimate    184

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [86]:
# Removing listings with invalid prices (0 or negative)

clean_df = clean_df[clean_df["price"] > 0]

# Replacing invalid yearBuilt values with missing

clean_df.loc[clean_df["yearBuilt"] <= 0, "yearBuilt"] = pd.NA
clean_df.loc[clean_df["yearBuilt"] > 2024, "yearBuilt"] = pd.NA

In [88]:
# Standardizing categorical/text columns to fix inconsistent formatting
# This ensures grouping and filtering behave consistently

df["homeStatus"] = df["homeStatus"].astype(str).str.strip().str.upper()
df["homeType"]   = df["homeType"].astype(str).str.strip().str.upper()

df["city"]   = df["city"].astype(str).str.strip().str.title()
df["county"] = df["county"].astype(str).str.strip().str.title()

df["state"] = df["state"].astype(str).str.strip().str.upper()
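A quick way to see the effect of this kind of standardization is to compare the number of distinct labels before and after. The sketch below uses made-up labels (they are hypothetical, not values observed in the dataset).

```python
import pandas as pd

# Hypothetical labels that differ only in case and surrounding whitespace.
raw = pd.Series([" for sale", "For Sale ", "FOR SALE", "Pending"])

# Strip whitespace and upper-case so equivalent labels collapse together.
cleaned = raw.str.strip().str.upper()

print("Distinct labels before:", raw.nunique())
print("Distinct labels after: ", cleaned.nunique())
print(cleaned.value_counts())
```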





In [76]:
# Zipcodes are stored as floats (for example 11421.0), which adds a trailing ".0" and drops
# leading zeros. They should be treated as text to preserve formatting and avoid inconsistencies.

if "zipcode" in df.columns:
    # Int64 removes the ".0"; zfill restores leading zeros on 5-digit codes
    df["zipcode"] = df["zipcode"].astype("Int64").astype("string").str.zfill(5)
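To illustrate why the float-to-text path matters: `zipcode` is `float64` in the `info()` output, so a plain string cast yields text like `11421.0` and cannot restore the leading zero of a code such as 02108. The sketch below compares the two routes on made-up values. The sample zip codes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical float zip codes, including one that lost its leading zero.
zips = pd.Series([11421.0, 2108.0, np.nan])

# A direct string cast keeps the ".0" suffix.
naive = zips.astype(str)

# Nullable integer first, then text, then pad back to five digits.
as_text = zips.astype("Int64").astype("string").str.zfill(5)

print(naive.tolist())
print(as_text.tolist())
```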



In [77]:
# Numeric fields can sometimes be stored as strings, especially after cleaning.

numeric_cols = [
    "price", "livingArea", "rentZestimate",
    "bathrooms", "bedrooms",
    "pageViewCount", "favoriteCount",
    "propertyTaxRate", "timeOnZillow", "yearBuilt"
]

for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
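One caveat with this list: `timeOnZillow` holds text such as `9 hours` in the `head()` preview, and `pd.to_numeric(..., errors="coerce")` cannot read the unit, so every value becomes NaN. That is consistent with the 18415 missing values reported for this column later in the notebook. A possible alternative, sketched on made-up values, is to parse the column as a duration. Whether every real value follows the "number plus unit" pattern is an assumption.

```python
import pandas as pd

# Hypothetical values; only "9 hours" is confirmed by the preview.
time_on_zillow = pd.Series(["9 hours", "2 days", "unknown"])

# to_numeric cannot interpret the unit text, so everything is coerced to NaN.
as_number = pd.to_numeric(time_on_zillow, errors="coerce")

# to_timedelta understands unit words; anything else becomes NaT.
as_delta = pd.to_timedelta(time_on_zillow, errors="coerce")
hours = as_delta.dt.total_seconds() / 3600

print(as_number.tolist())
print(hours.tolist())
```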


In [78]:
# Some values are inconsistent because they are impossible or unrealistic.
# Negative values are not valid for these fields, so I convert them to NaN
# and allow my missing-data strategy to handle them.

nonnegative_cols = [
    "price", "livingArea", "rentZestimate",
    "bathrooms", "bedrooms"
]

for col in nonnegative_cols:
    if col in df.columns:
        df.loc[df[col] < 0, col] = np.nan

# Bedrooms and bathrooms should be within reasonable limits
if "bedrooms" in df.columns:
    df.loc[df["bedrooms"] > 20, "bedrooms"] = np.nan

if "bathrooms" in df.columns:
    df.loc[df["bathrooms"] > 20, "bathrooms"] = np.nan



In [79]:
# Dates may appear in multiple formats; converting to datetime improves analysis.
# Invalid dates will become NaT (missing).

date_cols = ["datePosted", "dateSold", "lastUpdated"]

for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")




In [80]:

# Reviewing unique values helps me spot unexpected labels like "FORSALE" vs "FOR SALE".

for col in ["homeStatus", "homeType", "state"]:
    if col in df.columns:
        display(df[col].value_counts(dropna=False).head(15))


homeStatus
FOR SALE           14652
RECENTLY SOLD       3707
PENDING               26
FOR RENT              13
FORECLOSED             8
PRE FORECLOSURE        8
OTHER                  1
Name: count, dtype: int64

homeType
SINGLE FAMILY    11128
CONDO             3157
TOWNHOUSE         2195
MULTI FAMILY      1161
LOT                554
MANUFACTURED       150
APARTMENT           70
Name: count, dtype: int64

state
TX    3277
FL    1804
GA    1547
NY    1537
CA    1466
NV    1199
PA    1108
IL    1032
OH     798
TN     621
AZ     560
NC     547
IN     523
MD     466
KY     423
Name: count, dtype: int64

In [81]:
# Conversions and invalid-value cleaning may introduce new NaNs, so I fill
# the numeric columns with each column's median.

numeric_cols_in_df = df.select_dtypes(include=["int64", "float64"]).columns
df[numeric_cols_in_df] = df[numeric_cols_in_df].fillna(df[numeric_cols_in_df].median())

df.isna().sum().sort_values(ascending=False)


timeOnZillow       18415
dateSold           14734
datePosted            25
homeStatus             0
homeType               0
price                  0
zpid                   0
city                   0
streetAddress          0
state                  0
zipcode                0
livingArea             0
livingAreaUnits        0
county                 0
yearBuilt              0
bathrooms              0
rentZestimate          0
bedrooms               0
pageViewCount          0
propertyTaxRate        0
favoriteCount          0
url                    0
lastUpdated            0
dtype: int64
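Two details of a blanket median fill are worth noting. It also targets identifier columns such as `zpid`, and a column that is entirely missing has no median, so it stays missing (as `timeOnZillow` does above). The sketch below shows one possible variant that skips identifiers. The toy values and the choice of `id_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one identifier, one partly missing column,
# and one column that is entirely missing.
toy = pd.DataFrame({
    "zpid": [1.0, 2.0, 3.0, 4.0],
    "bedrooms": [2.0, np.nan, 3.0, 4.0],
    "timeOnZillow": [np.nan] * 4,
})

# Assumption: identifier columns should never be imputed.
id_cols = ["zpid"]
fill_cols = toy.select_dtypes(include="number").columns.difference(id_cols)

# Median of an all-NaN column is NaN, so that column is left unchanged.
toy[fill_cols] = toy[fill_cols].fillna(toy[fill_cols].median())
print(toy)
```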

In [82]:
# Saving the cleaned working DataFrame (df) so the cleaning steps above are included in the export
df.to_csv("property_listings_clean.csv", index=False)
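CSV files do not store data types, so text zip codes and parsed dates revert to numbers and plain strings when the file is read back. A minimal round-trip sketch is shown below. It writes to an in-memory buffer rather than a real file, and the toy values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical cleaned frame with a text zip code and a parsed date.
toy = pd.DataFrame({
    "zipcode": ["02108", "11421"],
    "datePosted": pd.to_datetime(["2024-03-19", "2024-09-24"]),
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended types explicitly when reading back.
restored = pd.read_csv(buffer, dtype={"zipcode": str}, parse_dates=["datePosted"])
print(restored.dtypes)
```

Passing `dtype` and `parse_dates` when loading the cleaned file in a later notebook keeps the types consistent with what was cleaned here.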


In [83]:
import os
os.getcwd()


'c:\\Users\\peele\\OneDrive\\LaunchCode\\data-analysis-projects\\cleaning-data-checkpoint'

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset? Yes

Missing data: Several numeric fields such as price, living area, rentZestimate, and yearBuilt contained missing values.

Irregular data (outliers): Extreme values were present in columns like price, living area, page view count, and favorite count.

Unnecessary data: High-cardinality, low-value columns such as full street addresses and URLs were identified and removed.

Inconsistent data: Inconsistencies were found in text formatting (case and spacing), zip code formats, and date fields.

2. Did the process of cleaning your data give you new insights into your dataset? Yes

Cleaning the data revealed patterns that were not immediately obvious in the raw datasets. For example, the extremely high prices turned out to reflect luxury and higher-end homes rather than data errors.

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

Real estate values vary widely and are sometimes extreme, so it is important to keep outliers instead of blindly
removing them. Also, keeping data types consistent is important when it comes to visualization.

