In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [14]:
df = pd.read_csv("realtor-data.csv")

df.head(5)

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,,67000.0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,,,65000.0


In [15]:
df.describe()

Unnamed: 0,bed,bath,acre_lot,zip_code,house_size,price
count,904735.0,925066.0,729378.0,1048367.0,706491.0,1048503.0
mean,3.327341,2.503352,15.593482,7017.637,2154.038,903654.7
std,2.086783,1.956613,910.810132,3807.651,3002.978,2849184.0
min,1.0,1.0,0.0,601.0,100.0,0.0
25%,2.0,1.0,0.11,3585.0,1114.0,275000.0
50%,3.0,2.0,0.28,7103.0,1650.0,495000.0
75%,4.0,3.0,1.11,10028.0,2500.0,850000.0
max,123.0,198.0,100000.0,99999.0,1450112.0,875000000.0


In [22]:
df.shape

(924929, 9)

In [16]:
print(df.duplicated().sum())

926465


#### Regression Model and Duplicate Data

Duplicate rows can skew a regression model by giving repeated observations extra weight, so they need to be interpreted in the context of the data source. In this case, duplicate entries do not necessarily indicate redundant data; they likely reflect the fluctuating nature of the housing market.

Deleting these duplicates could remove useful information about how house prices move. We therefore retain them in the dataset to keep the analysis of housing market trends as complete as possible.
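
Before deciding to keep duplicates, it can help to look at what they actually are. The sketch below is only an illustration on a small, made-up frame (the `toy` rows and the choice of `city` and `bed` as identifying columns are assumptions, not part of the real data). It shows the difference between rows that repeat exactly and rows that reappear with a different price.

```python
import pandas as pd

# Hypothetical toy listings; values are made up for illustration only.
toy = pd.DataFrame({
    "city": ["Ponce", "Ponce", "Adjuntas", "Ponce"],
    "bed": [4.0, 4.0, 3.0, 4.0],
    "price": [145000.0, 145000.0, 105000.0, 150000.0],
})

# duplicated() flags each row that repeats an earlier, fully identical row.
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, including the first.
dup_rows = toy[toy.duplicated(keep=False)]

# Comparing on a subset of columns (price excluded) surfaces listings
# that appear again with a different price.
relisted = toy[toy.duplicated(subset=["city", "bed"], keep=False)]

print(n_repeats)
print(dup_rows)
print(relisted)
```

Exact repeats carry no new price information, while rows that match on everything except price are the ones that could reflect market movement.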


In [17]:
print(df.isnull().sum())

status                 0
bed               143840
bath              123509
acre_lot          319197
city                  77
state                  0
zip_code             208
house_size        342084
prev_sold_date    517985
price                 72
dtype: int64


#### Handling Missing Values in the Dataset

The dataset has many missing values across several columns. A missing bed count may be expected for some properties, such as single-room housing, but missing values for city, zip code, price, and bathrooms are critical omissions.

Given the size of the dataset, I've opted to drop rows with missing values for city, zip code, price, or bathrooms, as this information is essential for our analysis.

The remaining missing values (in `bed`, `acre_lot`, and `house_size`) are replaced with 0. This keeps the dataset free of nulls, with the caveat that a 0 in these columns means "not specified" rather than a measured value.

The `prev_sold_date` column is not used in this study and is dropped. It may be relevant in other contexts, but it does not bear on the objectives of this analysis.
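
To judge which columns can be dropped on and which are better filled, it can help to see missing values as a share of all rows rather than as raw counts. The following is a minimal sketch on a made-up frame; `missing_summary` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration only.
toy = pd.DataFrame({
    "bed": [3.0, np.nan, 2.0, 4.0],
    "bath": [2.0, 2.0, np.nan, 2.0],
    "price": [105000.0, 80000.0, 67000.0, 145000.0],
})

def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})

summary = missing_summary(toy)
print(summary)
```

Applied to the real data, `missing_summary(df)` would put the counts printed above on a common scale, which makes it easier to see how many rows a `dropna` on a given column would cost.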

In [18]:
# Drop rows with missing values in the bath, city, zip_code, or price columns
df = df.dropna(subset=['bath', 'city', 'zip_code', 'price'])

# Drop "prev_sold_date" column
df = df.drop(columns=['prev_sold_date'])

# Fill the remaining missing values (bed, acre_lot, house_size) with 0
df = df.fillna(0)

In [19]:
df.head(5)

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,67000.0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,0.0,65000.0


In [20]:
df.isnull().sum()

status        0
bed           0
bath          0
acre_lot      0
city          0
state         0
zip_code      0
house_size    0
price         0
dtype: int64

In [23]:
df.shape

(924929, 9)