# Phase 1: Data Cleaning and Preprocessing

**Project**: House Price Prediction and Analysis Using King County Housing Data

**Team**: Ashwin, Ashwath, Namrata Mane

**Course**: DA 591 - Final Semester Project

---

In this notebook, we will clean and prepare the King County housing dataset for analysis. The steps include:
1. Loading the data
2. Understanding the data structure
3. Checking for missing values
4. Handling duplicates
5. Converting data types
6. Handling outliers
7. Feature engineering
8. Saving the cleaned data

## Step 1: Import Required Libraries

We will use pandas for data manipulation and numpy for numerical operations.

In [50]:
# Importing necessary libraries
import pandas as pd
import numpy as np

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.2.3
Numpy version: 2.0.2


## Step 2: Load the Dataset

Loading the King County housing dataset from the CSV file.

In [51]:
# Load the dataset
df = pd.read_csv('kc_house_data.csv')

# Check how many rows and columns we have
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("Dataset loaded successfully!")

Number of rows: 21613
Number of columns: 21
Dataset loaded successfully!


## Step 3: First Look at the Data

Let's see what the data looks like - first few rows and the column names.

In [52]:
# Display first 5 rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


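|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |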


In [53]:
# Display last 5 rows to make sure data is complete
print("Last 5 rows of the dataset:")
df.tail()

Last 5 rows of the dataset:


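|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21608 | 263000018 | 20140521T000000 | 360000.0 | 3 | 2.5 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000.0 | 4 | 2.5 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000.0 | 3 | 2.5 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21612 | 1523300157 | 20141015T000000 | 325000.0 | 2 | 0.75 | 1020 | 1076 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2008 | 0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |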


In [54]:
# Show all column names
print("Column names in the dataset:")
print(df.columns.tolist())

Column names in the dataset:
['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


## Step 4: Understanding Data Types

Let's check the data type of each column and see if any conversions are needed.

In [55]:
# Check data types of all columns
print("Data types of each column:")
print(df.dtypes)

Data types of each column:
id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
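
The listing above shows `id` and `zipcode` stored as `int64`, although both are labels rather than quantities. The sketch below shows one way such columns could be recast so they are not accidentally averaged or scaled. It runs on a small illustrative frame (`toy` is a hypothetical name, and the rows are only loosely based on the preview above). Whether to apply this is a modeling choice; the notebook itself keeps the integer types.

```python
import pandas as pd

# Hypothetical toy frame mirroring three of the dataset's columns
toy = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400],
    "zipcode": [98178, 98125, 98178],
    "price": [221900.0, 538000.0, 180000.0],
})

# Treat identifiers as strings and zip codes as a categorical label
toy["id"] = toy["id"].astype(str)
toy["zipcode"] = toy["zipcode"].astype("category")

print(toy.dtypes)
print(toy["zipcode"].cat.categories.tolist())
```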


In [56]:
# Get more detailed info about the dataset
print("Detailed information about the dataset:")
df.info()

Detailed information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
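 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)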

## Step 5: Statistical Summary

Let's look at the basic statistics of numerical columns to understand the data distribution.

In [57]:
# Statistical summary of numerical columns
print("Statistical summary of the dataset:")
df.describe()

Statistical summary of the dataset:


|  | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 4580302000.0 | 540088.1 | 3.370842 | 2.114757 | 2079.899736 | 15106.97 | 1.494309 | 0.007542 | 0.234303 | 3.40943 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 2876566000.0 | 367127.2 | 0.930062 | 0.770163 | 918.440897 | 41420.51 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.67924 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 1000102.0 | 75000.0 | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 290.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.1559 | -122.519 | 399.0 | 651.0 |
| 25% | 2123049000.0 | 321950.0 | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1190.0 | 0.0 | 1951.0 | 0.0 | 98033.0 | 47.471 | -122.328 | 1490.0 | 5100.0 |
| 50% | 3904930000.0 | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | 3.0 | 7.0 | 1560.0 | 0.0 | 1975.0 | 0.0 | 98065.0 | 47.5718 | -122.23 | 1840.0 | 7620.0 |
| 75% | 7308900000.0 | 645000.0 | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2210.0 | 560.0 | 1997.0 | 0.0 | 98118.0 | 47.678 | -122.125 | 2360.0 | 10083.0 |
| max | 9900000000.0 | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 4.0 | 5.0 | 13.0 | 9410.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.7776 | -121.315 | 6210.0 | 871200.0 |


In [58]:
# Let's look at the price column specifically since it's our target variable
print("Price column statistics:")
print(f"Minimum price: ${df['price'].min():,.2f}")
print(f"Maximum price: ${df['price'].max():,.2f}")
print(f"Average price: ${df['price'].mean():,.2f}")
print(f"Median price: ${df['price'].median():,.2f}")

Price column statistics:
Minimum price: $75,000.00
Maximum price: $7,700,000.00
Average price: $540,088.14
Median price: $450,000.00
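
The mean ($540,088) sits well above the median ($450,000), which suggests a right-skewed price distribution. As a rough sketch, the gap could be quantified with the sample skewness. The prices below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative prices: mostly mid-range homes plus one expensive property
prices = pd.Series(
    [200_000, 300_000, 400_000, 450_000, 500_000, 2_500_000],
    dtype="float64",
)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive values indicate a long right tail

print(f"Mean:     ${mean_price:,.2f}")
print(f"Median:   ${median_price:,.2f}")
print(f"Skewness: {skewness:.2f}")
```

A clearly positive skewness on the real `price` column would be one argument for trying a log transform of the target in the modeling phase.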


## Step 6: Check for Missing Values

Missing values can affect our analysis and model performance. Let's check if there are any.

In [59]:
# Check for missing values in each column
print("Missing values in each column:")
missing_values = df.isnull().sum()
print(missing_values)

Missing values in each column:
id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64


In [61]:
# Total missing values in the entire dataset
total_missing = df.isnull().sum().sum()
print(f"\nTotal missing values in the dataset: {total_missing}")

if total_missing == 0:
    print("Great! No missing values found in the dataset.")
else:
    print("We need to handle these missing values.")


Total missing values in the dataset: 0
Great! No missing values found in the dataset.
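
This dataset has no missing values, so the `else` branch above never runs. For a dataset where it does, a small report of counts and percentages per column is a common first step. The helper below is a sketch only: `missing_report` is a hypothetical name, and the frame with gaps is invented for the example.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Invented frame with gaps, for illustration only
toy = pd.DataFrame({
    "price": [221900.0, np.nan, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "yr_renovated": [np.nan, 1991, np.nan, 0],
})

report = missing_report(toy)
print(report)
```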


## Step 7: Check for Duplicate Records

Duplicate records can skew our analysis. Let's check if any house is listed more than once.

In [62]:
# Check for duplicate rows (entire row is same)
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 0


In [63]:
# Check for duplicate house IDs
# The same house can appear more than once if it was sold multiple times
duplicate_ids = df['id'].duplicated().sum()
print(f"Number of duplicate house IDs: {duplicate_ids}")

if duplicate_ids > 0:
    print(f"\n{duplicate_ids} records are repeat sales of a house already in the dataset.")
    print("We will keep only the latest sale for each house.")

Number of duplicate house IDs: 177

177 records are repeat sales of a house already in the dataset.
We will keep only the latest sale for each house.


## Step 8: Handle Duplicates

For houses sold multiple times, we will keep only the most recent sale since it reflects the current market value better.

In [65]:
# Store original count
original_count = len(df)
print(f"Original number of records: {original_count}")

# Sort by date in descending order so the latest sale comes first
# (the raw 'YYYYMMDDT000000' strings sort chronologically, so no parsing is needed yet)
df_sorted = df.sort_values('date', ascending=False)

# Keep only the first occurrence (latest sale) for each house ID
df_cleaned = df_sorted.drop_duplicates(subset='id', keep='first')

# Reset the index
df_cleaned = df_cleaned.reset_index(drop=True)

print(f"Number of records after removing duplicates: {len(df_cleaned)}")
print(f"Records removed: {original_count - len(df_cleaned)}")

Original number of records: 21613
Number of records after removing duplicates: 21436
Records removed: 177


In [66]:
# Verify no more duplicate IDs
remaining_duplicates = df_cleaned['id'].duplicated().sum()
print(f"Remaining duplicate IDs: {remaining_duplicates}")

if remaining_duplicates == 0:
    print("All duplicates have been removed successfully!")

Remaining duplicate IDs: 0
All duplicates have been removed successfully!
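
The sort-then-drop pattern used above can be checked on a small example. The sketch below uses an invented frame (IDs, dates, and prices are hypothetical) and parses the dates first so the ordering does not depend on the string layout. It also shows that `duplicated().sum()` counts extra records, which can be larger than the number of distinct houses that were resold.

```python
import pandas as pd

# Invented sales: house 101 sold three times, 102 twice, 103 once
toy = pd.DataFrame({
    "id": [101, 102, 101, 103, 102, 101],
    "date": ["20140502T000000", "20140610T000000", "20150115T000000",
             "20140820T000000", "20150301T000000", "20140915T000000"],
    "price": [300000.0, 410000.0, 335000.0, 520000.0, 450000.0, 320000.0],
})

# Parse first so the sort is chronological regardless of string layout
toy["date"] = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S")

# Latest sale first, then keep the first row seen for each id
latest = (
    toy.sort_values("date", ascending=False)
       .drop_duplicates(subset="id", keep="first")
       .sort_values("id")
       .reset_index(drop=True)
)

repeat_records = toy["id"].duplicated().sum()          # rows beyond the first per id
houses_resold = (toy["id"].value_counts() > 1).sum()   # distinct ids seen more than once

print(latest)
print(f"Repeat-sale records: {repeat_records}, houses resold: {houses_resold}")
```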


## Step 9: Convert Date Column

The date column is currently stored as a string. Let's convert it to a proper datetime format and extract useful features.

In [67]:
# Check current format of the date column
print("Sample date values:")
print(df_cleaned['date'].head())
print(f"\nCurrent data type: {df_cleaned['date'].dtype}")

Sample date values:
0    20150527T000000
1    20150524T000000
2    20150515T000000
3    20150514T000000
4    20150514T000000
Name: date, dtype: object

Current data type: object


In [68]:
# Convert date column to datetime
# The format is like '20141013T000000'
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%dT%H%M%S')

print("Date column converted to datetime format:")
print(df_cleaned['date'].head())
print(f"\nNew data type: {df_cleaned['date'].dtype}")

Date column converted to datetime format:
0   2015-05-27
1   2015-05-24
2   2015-05-15
3   2015-05-14
4   2015-05-14
Name: date, dtype: datetime64[ns]

New data type: datetime64[ns]
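
Passing an explicit `format` makes the conversion strict, so a malformed string raises an error. If malformed values are expected, `errors="coerce"` is one option: it converts them to `NaT` so they can be counted and inspected. A minimal sketch, where the third value is deliberately invalid and not from the dataset:

```python
import pandas as pd

raw = pd.Series(["20141013T000000", "20150225T000000", "not-a-date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y%m%dT%H%M%S", errors="coerce")

bad_rows = parsed.isna().sum()
print(parsed)
print(f"Unparseable dates: {bad_rows}")
```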


In [69]:
# Extract useful features from the date
df_cleaned['sale_year'] = df_cleaned['date'].dt.year
df_cleaned['sale_month'] = df_cleaned['date'].dt.month

print("New date features created:")
print(df_cleaned[['date', 'sale_year', 'sale_month']].head())

New date features created:
        date  sale_year  sale_month
0 2015-05-27       2015           5
1 2015-05-24       2015           5
2 2015-05-15       2015           5
3 2015-05-14       2015           5
4 2015-05-14       2015           5


In [70]:
# Check the date range in our dataset
print("Date range of the data:")
print(f"Earliest sale: {df_cleaned['date'].min()}")
print(f"Latest sale: {df_cleaned['date'].max()}")

Date range of the data:
Earliest sale: 2014-05-02 00:00:00
Latest sale: 2015-05-27 00:00:00


## Step 10: Check for Unusual Values

Let's check if there are any unusual or incorrect values in the data that don't make sense.

In [71]:
# Check bedrooms - can a house have 0 or very high number of bedrooms?
print("Bedroom distribution:")
print(f"Minimum bedrooms: {df_cleaned['bedrooms'].min()}")
print(f"Maximum bedrooms: {df_cleaned['bedrooms'].max()}")
print("\nValue counts:")
print(df_cleaned['bedrooms'].value_counts().sort_index())

Bedroom distribution:
Minimum bedrooms: 0
Maximum bedrooms: 33

Value counts:
bedrooms
0       13
1      194
2     2736
3     9731
4     6849
5     1586
6      265
7       38
8       13
9        6
10       3
11       1
33       1
Name: count, dtype: int64


In [72]:
# Houses with 0 bedrooms - might be studios or data errors
zero_bedrooms = df_cleaned[df_cleaned['bedrooms'] == 0]
print(f"Number of houses with 0 bedrooms: {len(zero_bedrooms)}")

if len(zero_bedrooms) > 0:
    print("\nThese might be studios or data errors. Let's look at them:")
    print(zero_bedrooms[['id', 'bedrooms', 'bathrooms', 'sqft_living', 'price']].head())

Number of houses with 0 bedrooms: 13

These might be studios or data errors. Let's look at them:
              id  bedrooms  bathrooms  sqft_living     price
729   3374500520         0        0.0         2460  355000.0
2058  9543000205         0        0.0          844  139950.0
5352  7849202299         0        2.5         1490  320000.0
5812  3918400017         0        0.0         1470  380000.0
7203  7849202190         0        0.0         1470  235000.0


In [73]:
# Houses with very high bedrooms (more than 10)
high_bedrooms = df_cleaned[df_cleaned['bedrooms'] > 10]
print(f"Number of houses with more than 10 bedrooms: {len(high_bedrooms)}")

if len(high_bedrooms) > 0:
    print("\nThese are unusual. Let's examine them:")
    print(high_bedrooms[['id', 'bedrooms', 'bathrooms', 'sqft_living', 'price']])

Number of houses with more than 10 bedrooms: 2

These are unusual. Let's examine them:
               id  bedrooms  bathrooms  sqft_living     price
14017  1773100755        11       3.00         3000  520000.0
17961  2402100895        33       1.75         1620  640000.0


In [74]:
# The house with 33 bedrooms looks like a data entry error:
# 33 bedrooms in only 1620 sqft is not plausible
# Let's check the sqft per bedroom ratio

MIN_SQFT_PER_BEDROOM = 100  # one threshold for both the message and the check

if len(high_bedrooms) > 0:
    print("Checking sqft per bedroom for unusual entries:")
    for idx, row in high_bedrooms.iterrows():
        sqft_per_bedroom = row['sqft_living'] / row['bedrooms']
        print(f"House ID {row['id']}: {row['bedrooms']} bedrooms, {row['sqft_living']} sqft")
        print(f"  -> {sqft_per_bedroom:.2f} sqft per bedroom (should be at least {MIN_SQFT_PER_BEDROOM})")

        if sqft_per_bedroom < MIN_SQFT_PER_BEDROOM:
            print("  -> This looks like a DATA ERROR!")

Checking sqft per bedroom for unusual entries:
House ID 1773100755: 11 bedrooms, 3000 sqft
  -> 272.73 sqft per bedroom (should be at least 100)
House ID 2402100895: 33 bedrooms, 1620 sqft
  -> 49.09 sqft per bedroom (should be at least 100)
  -> This looks like a DATA ERROR!


In [75]:
# Fix the obvious data error - house with 33 bedrooms should probably be 3
# Looking at the data: 33 bedrooms, 1.75 bathrooms, 1620 sqft -> clearly an error

error_mask = (df_cleaned['bedrooms'] == 33) & (df_cleaned['sqft_living'] < 2000)
if error_mask.sum() > 0:
    print("Fixing data entry error: 33 bedrooms -> 3 bedrooms")
    df_cleaned.loc[error_mask, 'bedrooms'] = 3
    print("Fixed!")

Fixing data entry error: 33 bedrooms -> 3 bedrooms
Fixed!
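
The same plausibility check can be applied to every row at once instead of looping over a filtered subset. The sketch below assumes the 100 sqft per bedroom threshold mentioned in the printed message. The rows are copied from outputs shown earlier in this notebook, and the column names `sqft_per_bedroom` and `suspect` are hypothetical.

```python
import pandas as pd

MIN_SQFT_PER_BEDROOM = 100  # assumed plausibility threshold

toy = pd.DataFrame({
    "id": [1773100755, 2402100895, 7129300520, 3374500520],
    "bedrooms": [11, 33, 3, 0],
    "sqft_living": [3000, 1620, 1180, 2460],
})

# Avoid dividing by zero: rows with 0 bedrooms become NaN and are not flagged
bedrooms = toy["bedrooms"].where(toy["bedrooms"] > 0)
toy["sqft_per_bedroom"] = toy["sqft_living"] / bedrooms
toy["suspect"] = toy["sqft_per_bedroom"] < MIN_SQFT_PER_BEDROOM

print(toy)
```

Only the 33-bedroom record is flagged here. Zero-bedroom rows need a separate decision, since the ratio is undefined for them.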


In [76]:
# Check for houses with 0 bathrooms
zero_bathrooms = df_cleaned[df_cleaned['bathrooms'] == 0]
print(f"Number of houses with 0 bathrooms: {len(zero_bathrooms)}")

if len(zero_bathrooms) > 0:
    print("\n0 bathrooms is unusual for a house. Let's see:")
    print(zero_bathrooms[['id', 'bedrooms', 'bathrooms', 'sqft_living', 'price']].head())

Number of houses with 0 bathrooms: 10

0 bathrooms is unusual for a house. Let's see:
              id  bedrooms  bathrooms  sqft_living     price
729   3374500520         0        0.0         2460  355000.0
2058  9543000205         0        0.0          844  139950.0
5418  3421079032         1        0.0          670   75000.0
5812  3918400017         0        0.0         1470  380000.0
7203  7849202190         0        0.0         1470  235000.0


In [77]:
# Check sqft_living - should not be 0 or negative
print("Sqft_living check:")
print(f"Minimum: {df_cleaned['sqft_living'].min()}")
print(f"Maximum: {df_cleaned['sqft_living'].max()}")

if df_cleaned['sqft_living'].min() <= 0:
    print("\nWARNING: Found houses with 0 or negative living space!")
else:
    print("\nAll houses have valid living space values.")

Sqft_living check:
Minimum: 290
Maximum: 13540

All houses have valid living space values.
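
Beyond range checks, relationships between columns can be validated too. In the rows previewed earlier, `sqft_living` equals `sqft_above + sqft_basement`. Whether that holds for every record is an assumption worth testing rather than taking for granted. A sketch of the check, using three rows copied from the `head()` output:

```python
import pandas as pd

# Three rows copied from the first preview of the dataset
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 1960],
    "sqft_above": [1180, 2170, 1050],
    "sqft_basement": [0, 400, 910],
})

# Flag rows where the parts do not add up to the total
mismatch = toy["sqft_living"] != toy["sqft_above"] + toy["sqft_basement"]
n_mismatch = int(mismatch.sum())

print(f"Rows where living area != above + basement: {n_mismatch}")
```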


## Step 11: Handle Outliers in Price

Extreme outliers can affect our model performance. Let's identify and handle them using the IQR (Interquartile Range) method.

In [78]:
# Calculate IQR for price
Q1 = df_cleaned['price'].quantile(0.25)
Q3 = df_cleaned['price'].quantile(0.75)
IQR = Q3 - Q1

print("Price distribution:")
print(f"Q1 (25th percentile): ${Q1:,.2f}")
print(f"Q3 (75th percentile): ${Q3:,.2f}")
print(f"IQR: ${IQR:,.2f}")

Price distribution:
Q1 (25th percentile): $324,866.00
Q3 (75th percentile): $645,000.00
IQR: $320,134.00


In [79]:
# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\nOutlier boundaries:")
print(f"Lower bound: ${lower_bound:,.2f}")
print(f"Upper bound: ${upper_bound:,.2f}")


Outlier boundaries:
Lower bound: $-155,335.00
Upper bound: $1,125,201.00


In [80]:
# Count outliers
price_outliers = df_cleaned[(df_cleaned['price'] < lower_bound) | (df_cleaned['price'] > upper_bound)]
print(f"Number of price outliers: {len(price_outliers)}")
print(f"Percentage of data: {(len(price_outliers)/len(df_cleaned))*100:.2f}%")

Number of price outliers: 1143
Percentage of data: 5.33%


In [81]:
# Let's see the distribution of outliers
low_outliers = df_cleaned[df_cleaned['price'] < lower_bound]
high_outliers = df_cleaned[df_cleaned['price'] > upper_bound]

print(f"Houses below lower bound (< ${lower_bound:,.0f}): {len(low_outliers)}")
print(f"Houses above upper bound (> ${upper_bound:,.0f}): {len(high_outliers)}")

Houses below lower bound (< $-155,335): 0
Houses above upper bound (> $1,125,201): 1143
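
The lower bound is negative because the IQR is large relative to Q1, so on this right-skewed column the rule can only flag high prices. The calculation from the cells above can be wrapped in a small reusable helper. This is a sketch: `iqr_bounds` is a hypothetical name and the numbers are invented so the quartiles are easy to verify by hand.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented values: Q1 = 300, Q3 = 700, so IQR = 400
prices = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 5000.0])

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(f"Outliers: {outliers.tolist()}")
```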


In [82]:
# For this analysis, we want to keep most of the data
# Only extreme values (prices above $3 million or below $50k) are candidates for removal
# Luxury homes are still valid data points, so first count how many extreme values exist

extreme_high = df_cleaned[df_cleaned['price'] > 3000000]
extreme_low = df_cleaned[df_cleaned['price'] < 50000]

print("Extreme outliers:")
print(f"Price > $3,000,000: {len(extreme_high)}")
print(f"Price < $50,000: {len(extreme_low)}")

Extreme outliers:
Price > $3,000,000: 45
Price < $50,000: 0


In [83]:
# Decision: We will keep all data for now
# Extreme luxury homes are still valid - they represent luxury market
# Very cheap homes might be land or special cases

print("Decision: Keeping all price data for analysis.")
print("Reason: Extreme values represent real market segments (luxury homes, land, etc.)")
print("\nNote: If model performance is poor, we can revisit this decision.")

Decision: Keeping all price data for analysis.
Reason: Extreme values represent real market segments (luxury homes, land, etc.)

Note: If model performance is poor, we can revisit this decision.


## Step 12: Feature Engineering

Let's create new features that might be useful for our analysis and modeling.

In [84]:
# Create 'house_age' feature
# Using the sale year to calculate how old the house was when sold
df_cleaned['house_age'] = df_cleaned['sale_year'] - df_cleaned['yr_built']

print("House age feature created:")
print(f"Youngest house age at sale: {df_cleaned['house_age'].min()} years")
print(f"Oldest house age at sale: {df_cleaned['house_age'].max()} years")
print(f"Average house age at sale: {df_cleaned['house_age'].mean():.1f} years")

House age feature created:
Youngest house age at sale: -1 years
Oldest house age at sale: 115 years
Average house age at sale: 43.2 years
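
The minimum age of -1 means at least one record has a sale year earlier than `yr_built`. That could be a sale completed before construction finished, or a data-entry issue. The notebook does not resolve this. One possible treatment, sketched on hypothetical rows, is to inspect those records and clip the age at zero. The column name `house_age_clipped` is an assumption for the example.

```python
import pandas as pd

# Hypothetical rows: the first one is sold a year before its build year
toy = pd.DataFrame({
    "sale_year": [2014, 2015, 2014],
    "yr_built": [2015, 1955, 2014],
})
toy["house_age"] = toy["sale_year"] - toy["yr_built"]

# Inspect the affected records before changing anything
sold_before_built = toy[toy["house_age"] < 0]
print(f"Records sold before the build year: {len(sold_before_built)}")

# Treat such homes as brand new (age 0) in a separate column
toy["house_age_clipped"] = toy["house_age"].clip(lower=0)
print(toy)
```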


In [85]:
# Create 'renovated' binary feature
# 1 if the house was ever renovated, 0 otherwise
df_cleaned['renovated'] = (df_cleaned['yr_renovated'] > 0).astype(int)

print("Renovated feature created:")
print(df_cleaned['renovated'].value_counts())
print(f"\nPercentage of renovated houses: {df_cleaned['renovated'].mean()*100:.2f}%")

Renovated feature created:
renovated
0    20526
1      910
Name: count, dtype: int64

Percentage of renovated houses: 4.25%


In [86]:
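# Create 'price_per_sqft' feature
# Note: this is derived from the target (price), so it is suitable for EDA
# but would leak the target if used as a model input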
df_cleaned['price_per_sqft'] = df_cleaned['price'] / df_cleaned['sqft_living']

print("Price per sqft feature created:")
print(f"Minimum: ${df_cleaned['price_per_sqft'].min():.2f}/sqft")
print(f"Maximum: ${df_cleaned['price_per_sqft'].max():.2f}/sqft")
print(f"Average: ${df_cleaned['price_per_sqft'].mean():.2f}/sqft")

Price per sqft feature created:
Minimum: $87.59/sqft
Maximum: $810.14/sqft
Average: $264.72/sqft


In [87]:
# Create 'has_basement' binary feature
df_cleaned['has_basement'] = (df_cleaned['sqft_basement'] > 0).astype(int)

print("Has basement feature created:")
print(df_cleaned['has_basement'].value_counts())
print(f"\nPercentage of houses with basement: {df_cleaned['has_basement'].mean()*100:.2f}%")

Has basement feature created:
has_basement
0    13015
1     8421
Name: count, dtype: int64

Percentage of houses with basement: 39.28%


In [88]:
# Create 'total_rooms' feature (bedrooms + bathrooms gives a rough idea)
df_cleaned['total_rooms'] = df_cleaned['bedrooms'] + df_cleaned['bathrooms']

print("Total rooms feature created:")
print(f"Minimum total rooms: {df_cleaned['total_rooms'].min()}")
print(f"Maximum total rooms: {df_cleaned['total_rooms'].max()}")
print(f"Average total rooms: {df_cleaned['total_rooms'].mean():.1f}")

Total rooms feature created:
Minimum total rooms: 0.0
Maximum total rooms: 16.5
Average total rooms: 5.5


## Step 13: Final Data Check

Let's verify our cleaned dataset is ready for analysis.

In [89]:
# Final shape of the dataset
print("Final dataset summary:")
print(f"Number of rows: {df_cleaned.shape[0]}")
print(f"Number of columns: {df_cleaned.shape[1]}")

Final dataset summary:
Number of rows: 21436
Number of columns: 28


In [90]:
# List all columns in the cleaned dataset
print("All columns in cleaned dataset:")
for i, col in enumerate(df_cleaned.columns, 1):
    print(f"{i}. {col}")

All columns in cleaned dataset:
1. id
2. date
3. price
4. bedrooms
5. bathrooms
6. sqft_living
7. sqft_lot
8. floors
9. waterfront
10. view
11. condition
12. grade
13. sqft_above
14. sqft_basement
15. yr_built
16. yr_renovated
17. zipcode
18. lat
19. long
20. sqft_living15
21. sqft_lot15
22. sale_year
23. sale_month
24. house_age
25. renovated
26. price_per_sqft
27. has_basement
28. total_rooms


In [91]:
# Final check for missing values
print("Missing values check:")
missing = df_cleaned.isnull().sum()
if missing.sum() == 0:
    print("No missing values in the cleaned dataset!")
else:
    print(missing[missing > 0])

Missing values check:
No missing values in the cleaned dataset!


In [92]:
# Final check for duplicates
print("Duplicate check:")
if df_cleaned['id'].duplicated().sum() == 0:
    print("No duplicate house IDs!")
else:
    print(f"Warning: {df_cleaned['id'].duplicated().sum()} duplicates found")

Duplicate check:
No duplicate house IDs!


In [93]:
# Display final data types
print("Data types in cleaned dataset:")
print(df_cleaned.dtypes)

Data types in cleaned dataset:
id                         int64
date              datetime64[ns]
price                    float64
bedrooms                   int64
bathrooms                float64
sqft_living                int64
sqft_lot                   int64
floors                   float64
waterfront                 int64
view                       int64
condition                  int64
grade                      int64
sqft_above                 int64
sqft_basement              int64
yr_built                   int64
yr_renovated               int64
zipcode                    int64
lat                      float64
long                     float64
sqft_living15              int64
sqft_lot15                 int64
sale_year                  int32
sale_month                 int32
house_age                  int64
renovated                  int64
price_per_sqft           float64
has_basement               int64
total_rooms              float64
dtype: object


In [94]:
# Preview of the cleaned dataset
print("Preview of cleaned dataset (first 5 rows):")
df_cleaned.head()

Preview of cleaned dataset (first 5 rows):


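|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | long | sqft_living15 | sqft_lot15 | sale_year | sale_month | house_age | renovated | price_per_sqft | has_basement | total_rooms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9106000005 | 2015-05-27 | 1310000.0 | 4 | 2.25 | 3750 | 5000 | 2.0 | 0 | 0 | ... | -122.303 | 2170 | 4590 | 2015 | 5 | 91 | 0 | 349.333333 | 1 | 6.25 |
| 1 | 5101400871 | 2015-05-24 | 445500.0 | 2 | 1.75 | 1390 | 6670 | 1.0 | 0 | 0 | ... | -122.308 | 920 | 6380 | 2015 | 5 | 74 | 0 | 320.503597 | 1 | 3.75 |
| 2 | 7923600250 | 2015-05-15 | 450000.0 | 5 | 2.0 | 1870 | 7344 | 1.5 | 0 | 0 | ... | -122.144 | 1870 | 7650 | 2015 | 5 | 55 | 0 | 240.641711 | 0 | 7.0 |
| 3 | 8730000270 | 2015-05-14 | 359000.0 | 2 | 2.75 | 1370 | 1140 | 2.0 | 0 | 0 | ... | -122.343 | 1370 | 1090 | 2015 | 5 | 6 | 0 | 262.043796 | 1 | 4.75 |
| 4 | 9178601660 | 2015-05-14 | 1695000.0 | 5 | 3.0 | 3320 | 5354 | 2.0 | 0 | 0 | ... | -122.331 | 2330 | 4040 | 2015 | 5 | 11 | 0 | 510.542169 | 0 | 8.0 |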


## Step 14: Save the Cleaned Dataset

Save the cleaned data to a new CSV file for use in the next phases.

In [95]:
# Save the cleaned dataset
output_file = 'cleaned_house_data.csv'
df_cleaned.to_csv(output_file, index=False)

print(f"Cleaned dataset saved to: {output_file}")
print(f"Total records: {len(df_cleaned)}")
print(f"Total columns: {len(df_cleaned.columns)}")

Cleaned dataset saved to: cleaned_house_data.csv
Total records: 21436
Total columns: 28


In [96]:
# Verify the saved file
df_verify = pd.read_csv(output_file)
print("\nVerification - File loaded successfully!")
print(f"Rows: {len(df_verify)}, Columns: {len(df_verify.columns)}")


Verification - File loaded successfully!
Rows: 21436, Columns: 28
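
The row and column counts survive the round trip, but CSV stores no type information, so a datetime column is read back as text unless the reader is told otherwise. The sketch below illustrates this with an in-memory buffer and two rows copied from the final preview. Using `parse_dates` in later phases is a suggestion, not something the notebook does.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "id": [9106000005, 5101400871],
    "date": pd.to_datetime(["2015-05-27", "2015-05-24"]),
    "price": [1310000.0, 445500.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                          # date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])   # date restored as datetime

print(f"Without parse_dates: {plain['date'].dtype}")
print(f"With parse_dates:    {parsed['date'].dtype}")
```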


## Summary of Data Cleaning

### What we did:
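1. **Loaded** the King County housing dataset (21,613 records, 21 columns)
2. **Checked for missing values** - None found
3. **Removed duplicate records** - Dropped 177 repeat-sale records, keeping only the latest sale for each house (21,436 records remain)
4. **Converted date column** - Changed from string to datetime format
5. **Fixed data errors** - Corrected one obvious typo (33 bedrooms -> 3); records with 0 bedrooms or 0 bathrooms were inspected but left unchanged
6. **Reviewed price outliers** - 1,143 prices exceed the IQR upper bound and 45 exceed $3,000,000; all were kept
7. **Created new features**:
   - `sale_year` and `sale_month` from date
   - `house_age` (age when sold; the minimum is -1, so at least one sale precedes `yr_built`)
   - `renovated` (binary: 0 or 1)
   - `price_per_sqft`
   - `has_basement` (binary: 0 or 1)
   - `total_rooms`
8. **Saved** the cleaned data to `cleaned_house_data.csv` (21,436 records, 28 columns)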

### Next Steps:
- Phase 2: Exploratory Data Analysis (EDA)
- Phase 3: Build Linear Regression model for price prediction