In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [None]:
install.packages("ggplot2")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
#Libraries
library(dplyr)
library(data.table)
library(ggplot2)
library(caret)
library(corrplot)


In [None]:
# Paste the copied path here
dataset_path <- "/kaggle/input/usa-real-estate-dataset/realtor-data.zip.csv"

# Read the dataset into a data frame
df <- read.csv(dataset_path)

In [None]:
# Preview the first rows of the dataset
head(df)

In [None]:
#check summary
summary(df)

In [None]:
# Check column names
colnames(df)

In [None]:
#check for duplicates 
duplicates <- df%>%duplicated()
dupli_df <- duplicates%>%table()
print(dupli_df)

In [None]:
# Remove exact duplicate rows and compare the row counts before and after
df_no_dup <- df %>% distinct()
print(nrow(df))
print(nrow(df_no_dup))

Although duplicates in a regression model can bias the final results and their interpretation, it is important to consider the data source. In the present case, "duplicated" rows are not necessarily redundant data. Instead, they may simply mirror current house market prices. It might therefore be better to keep these duplicates in our dataset rather than delete them.
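
One way to judge whether duplicated rows look like genuine repeat listings is to group identical rows and count them. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, with just `price` and `bed`) instead of the real dataset, and the column name `n_listings` is an assumption.

```r
library(dplyr)

# Made-up listings: two identical rows at 100k and three identical rows at 300k
toy <- data.frame(
  price = c(100000, 100000, 200000, 300000, 300000, 300000),
  bed   = c(2, 2, 3, 4, 4, 4)
)

# Number of rows that repeat an earlier row
n_repeats <- sum(duplicated(toy))

# Group identical rows and keep only the groups that occur more than once
dup_groups <- toy %>%
  count(price, bed, name = "n_listings") %>%
  filter(n_listings > 1)

print(n_repeats)
print(dup_groups)
```

Looking at the largest groups first can help decide whether they are plausible repeat listings or an artifact of how the data was collected.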

In [None]:
#Check for null values
df%>%summarise_all(~sum(is.na(.)))

In [None]:
# Delete rows where the house price or the zip code is missing
df <- df %>% filter(!is.na(price) & !is.na(zip_code))

In [None]:
#Check for null values
df%>%summarise_all(~sum(is.na(.)))

In [None]:
# Replace the remaining missing values in the numeric columns with 0
df <- df %>% mutate_at(vars(bed, bath, acre_lot, house_size), ~ifelse(is.na(.), 0, .))
df%>%summarise_all(~sum(is.na(.)))

In [None]:
# Create interquartile range plot for the house price
ggplot(df, aes(x = "", y = price)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "Price", title = "Interquartile Range Plot of price outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )


There are so many outliers in our dataset that the box itself is impossible to see. This is not a surprise, since housing prices span such a large range: depending on the house size and/or location, the price may vary widely. I have decided to adjust the graph slightly to get a better view of the data.

In [None]:
# Increase plot size
options(repr.plot.width=20, repr.plot.height=7)

# Restrict the y-axis to the central 90% of prices so that the box becomes visible
y_min <- quantile(df$price, 0.05)  # Set lower limit to the 5th percentile
y_max <- quantile(df$price, 0.95)  # Set upper limit to the 95th percentile

# Create interquartile range plot with larger size, horizontal orientation, and adjusted y-axis scale
ggplot(df, aes(x = "", y = price)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "Price", title = "Interquartile Range Plot of price outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    ylim(y_min, y_max) +  # Set the limits of the y-axis
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )


After adjusting the graph's options, we can now see that there is a large price range across the listed houses. The graph also tells us that most houses fall in a price range between $100,000 and $1.25M, with a typical (median) house price close to $375,000.

The many red dots in the plot above show how many outliers are present in our dataset. These data points demonstrate that the housing market includes many luxurious houses.
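
One caveat worth checking when reading these plots: in ggplot2, `ylim()` discards the values outside the limits before the box plot statistics are computed, whereas `coord_cartesian(ylim = ...)` only zooms the view. The sketch below illustrates the difference on simulated, skewed prices (the data frame `toy` and its parameters are made up for the example); `layer_data()` is used to read back the statistics that ggplot2 actually draws.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed prices; NOT the real dataset
toy <- data.frame(price = rlnorm(10000, meanlog = 12.8, sdlog = 0.9))
lims <- unname(quantile(toy$price, c(0.05, 0.95)))

# ylim() drops out-of-range values (with a warning) before computing the box
p_trimmed <- ggplot(toy, aes(x = "", y = price)) +
  geom_boxplot() +
  ylim(lims[1], lims[2])

# coord_cartesian() keeps every value and only zooms the visible area
p_zoomed <- ggplot(toy, aes(x = "", y = price)) +
  geom_boxplot() +
  coord_cartesian(ylim = lims)

box_trimmed <- layer_data(p_trimmed)[, c("lower", "middle", "upper")]
box_zoomed  <- layer_data(p_zoomed)[, c("lower", "middle", "upper")]

print(box_trimmed)
print(box_zoomed)
```

With a horizontal layout, the same zoom can be requested through `coord_flip(ylim = ...)`. Whether trimming or zooming is preferable depends on whether the quartiles should describe all listings or only the central ones.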

In [None]:
# Increase plot size
options(repr.plot.width=20, repr.plot.height=7)


# Create interquartile range plot with larger size, horizontal orientation, and adjusted y-axis scale
ggplot(df, aes(x = "", y = bed)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "Bed", title = "Interquartile Range Plot of Bed outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    ylim(0, 20) +  # Set the limits of the y-axis
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )

Just like the price variable, the number of bedrooms per house contains many outliers. The initial assumption is that these outliers correspond to the most luxurious houses in the dataset, with 8 to 20+ bedrooms per house. It is equally interesting to notice that some houses have no bedrooms at all; keep in mind that missing values were replaced with 0 earlier, so some of these zeros may simply be missing data. While the typical house has 3 bedrooms, houses with up to 7 bedrooms are considered within the normal range.

In [None]:
# Increase plot size
options(repr.plot.width=20, repr.plot.height=7)


# Create interquartile range plot with larger size, horizontal orientation, and adjusted y-axis scale
ggplot(df, aes(x = "", y = bath)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "Bath", title = "Interquartile Range Plot of Bath outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    ylim(0, 20) +  # Set the limits of the y-axis
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )

It seems that the number of bathrooms follows the same pattern as the number of bedrooms: there are many outliers, with some houses having 7 to 20+ bathrooms and others having none. This variable differs in its center, with a typical value of 2 bathrooms per house and an upper range of 6 bathrooms.

In [None]:
# Increase plot size
options(repr.plot.width=20, repr.plot.height=7)

y_min <- quantile(df$acre_lot, 0.10)
y_max <- quantile(df$acre_lot, 0.90)

# Create interquartile range plot with larger size, horizontal orientation, and adjusted y-axis scale
ggplot(df, aes(x = "", y = acre_lot)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "Acre Lot", title = "Interquartile Range Plot of Acre Lot outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    ylim(y_min, y_max) +  # Set the limits of the y-axis
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )

The lot size in acres shows a similar outlier pattern to the house price, with many outliers starting at a lot size of 0.8 acres and above. The values range between 0 and 3 acres, with a typical lot size of about 0.175 acres.

In [None]:
# Increase plot size
options(repr.plot.width=20, repr.plot.height=7)

y_min <- quantile(df$house_size, 0.02)
y_max <- quantile(df$house_size, 0.98)

# Create interquartile range plot with larger size, horizontal orientation, and adjusted y-axis scale
ggplot(df, aes(x = "", y = house_size)) +
    geom_boxplot(fill = "skyblue", color = "black", outlier.color = "red") + 
    theme_minimal() +  
    labs(x = "", y = "house_size", title = "Interquartile Range Plot of House Size outliers") +
    coord_flip() +  # Flip the coordinates to make the plot horizontal
    ylim(y_min, y_max) +  # Set the limits of the y-axis
    theme(
        plot.title = element_text(size = 25),  # Adjust title size
        plot.margin = margin(50, 50, 50, 50)  # Adjust plot margins
    )

The box plot above shows that the regular house size ranges between 0 and 2,000 square feet, with an upper range (whisker) reaching just above 5,100 square feet. The typical house size is approximately 1,250 square feet. Following the same pattern as all of the previous variables, the house size also includes a considerable number of outliers.

After looking at all of the outliers present in our dataset, it is clear that there are many luxurious (outside the average range) houses on the market. For the sake of our model analysis, we will focus on the "normal", average houses and exclude the luxurious and otherwise atypical ones.

To favor our multiple linear regression model and the creation of our house price calculator, I have decided to focus on a certain range of our dataset to push for more accurate results. We will remove all of the outliers, which represent luxurious and/or uncommon houses.
To remove most of the outliers, we will rely on domain knowledge and on visual inspection of the upper range of the box plots produced earlier.
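
As a cross-check on cut-offs chosen by eye, the conventional box plot rule (values beyond 1.5 times the interquartile range from the quartiles) can be computed directly. This is only a sketch of that standard rule, not the method used in this notebook; the helper name `iqr_fences` and the `toy` vector are made up for illustration.

```r
# Hypothetical helper: lower and upper fences using the 1.5 * IQR rule
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]  # interquartile range
  c(lower_bound = q[1] - k * spread, upper_bound = q[2] + k * spread)
}

# Made-up bedroom counts with one extreme value
toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)
fences <- iqr_fences(toy)
print(fences)

# Values outside the fences would be flagged as outliers
flagged <- toy[toy < fences["lower_bound"] | toy > fences["upper_bound"]]
print(flagged)
```

Comparing such fences with the visually chosen bounds can show how strict or lenient the manual thresholds are for each variable.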

In [None]:
# Determine threshold for outlier detection for each variable
thresholds <- list(
  bed = c(lower_bound = 1, upper_bound = 7),  # Example thresholds, adjust as needed
  bath = c(lower_bound = 1, upper_bound = 6),  
  acre_lot = c(lower_bound = 0, upper_bound = 0.75),  
  house_size = c(lower_bound = 0, upper_bound = 5000),  
  price = c(lower_bound = 0, upper_bound = 1250000)  
)

# Filter out outliers for each variable
df_filtered <- df
for (variable in names(thresholds)) {
  lower_bound <- thresholds[[variable]]["lower_bound"]
  upper_bound <- thresholds[[variable]]["upper_bound"]
  df_filtered <- df_filtered[df_filtered[[variable]] >= lower_bound & df_filtered[[variable]] <= upper_bound, ]
}


In this project I chose to start with the box plots, as a way to clean up the outliers and reduce noise before continuing the data exploration with further visualizations.

In [None]:
# Check the new df summary
summary(df_filtered)

After filtering out some of the noise present in our data frame, the total number of rows has been reduced from the initial 2,501,666 to 1,535,357. The dataset is still fairly large.
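
To see which thresholds account for the removed rows, one option is to record the row count after each filtering step. The sketch below mirrors the threshold loop on a tiny made-up data frame (`toy`) with made-up bounds; the names `bounds`, `kept` and `removed` are assumptions for illustration.

```r
# Made-up listings and bounds; NOT the real dataset or thresholds
toy <- data.frame(
  bed   = c(0, 2, 3, 9),
  bath  = c(1, 2, 7, 2),
  price = c(100000, 200000, 300000, 400000)
)
bounds <- list(bed = c(1, 7), bath = c(1, 6))

kept <- toy
removed <- integer(0)
for (variable in names(bounds)) {
  before <- nrow(kept)
  in_range <- kept[[variable]] >= bounds[[variable]][1] &
    kept[[variable]] <= bounds[[variable]][2]
  kept <- kept[in_range, , drop = FALSE]
  # Rows dropped by this variable's bounds, given the earlier filters
  removed[variable] <- before - nrow(kept)
}

print(removed)
print(nrow(kept))
```

Because the filters are applied in sequence, each count depends on the order of the variables.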


In [None]:
#Histogram of Bed count distribution
hist_bed <- ggplot(df_filtered, aes(x = bed)) +
  geom_histogram(fill = "green", color = "white", bins = 20) +
  labs(title = "Histogram of Bed Counts",
       x = "Number of Beds",
       y = "Count") +
  theme_minimal() +
  #coord_cartesian(xlim = c(1, 7)) +
  scale_y_continuous(labels = scales::comma)

print(hist_bed)

In [None]:
#Histogram of Bath count distribution
hist_bath <- ggplot(df_filtered, aes(x = bath)) +
  geom_histogram(fill = "blue3", color = "white", bins = 20) +
  labs(title = "Histogram of Bath Counts",
       x = "Number of Baths",
       y = "Count") +
  theme_minimal() +
  #coord_cartesian(xlim = c(1, 5)) +
  scale_y_continuous(labels = scales::comma)

print(hist_bath)

In [None]:
# Create sample df for scatterplot analysis
set.seed(123)  # arbitrary seed (an assumption) so the random sample is reproducible
sample_size <- 1000

sample_df <- df_filtered %>% sample_n(sample_size)

In [None]:
#Change the size of the plot
options(repr.plot.width=20, repr.plot.height=15)

# scatterplot Price vs house size
scatter_size_price <- ggplot(sample_df, aes(x = house_size, y = price)) +
  geom_point() +
  labs(title = "Scatterplot Price vs house_size",
       x = "House Size",
       y = "House Price") +
  theme_minimal()

print(scatter_size_price)

Before moving on to the creation of my multiple linear regression model, I will create a correlation matrix to evaluate the importance of the different variables.

In [None]:
#Create dummy variables to change categorical var. into numerical
dummy <- model.matrix(~ city + state + zip_code - 1, data = df_filtered)

#merge datasets
df <- cbind(df_filtered, dummy)

In [None]:
# Keep only the columns of interest from our filtered df (city and state are still text columns here)
df_filt_num <- subset(df_filtered, select = c("price", "bed", "bath", "acre_lot", "house_size", "city", "state", "zip_code"))

# Check summary statistics of the new dataframe
summary(df_filt_num)

To be able to use some of the categorical variables present in the dataset, we have to convert them into numerical variables.
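
As a small illustration of what this conversion produces, `model.matrix()` creates one 0/1 column per category level when the intercept is removed from a single-factor formula. The data frame `toy` below is made up for the example.

```r
# Made-up listings with a single categorical column
toy <- data.frame(
  state = c("Puerto Rico", "New York", "Texas", "New York"),
  price = c(105000, 450000, 300000, 520000)
)

# "- 1" removes the intercept so that every level gets its own indicator column
dummies <- model.matrix(~ state - 1, data = toy)

print(colnames(dummies))
print(dummies)
```

Since the number of columns grows with the number of distinct levels, a high-cardinality column such as `city` can produce a very wide matrix on a dataset of this size. A column that is read as a number (as `zip_code` may be) is kept as a single numeric column unless it is wrapped in `factor()`.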

In [None]:
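# Create a correlation matrix to find the variables we will use in our model
# cor() only accepts numeric input, so non-numeric columns (city, state) are dropped first
correlation_matrix <- cor(df_filt_num %>% select(where(is.numeric)))
corrplot(correlation_matrix, method = "color")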

After creating the above correlation matrix, it is interesting to see that there are no negative correlations between the variables. However, some correlations are stronger than others. Based on this information, our multiple linear regression model will use the price as the response (dependent) variable, and house_size and the numbers of baths and beds as predictors. We will not use the lot size in acres, since its correlation is not strong enough.
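
Written out, a model of this form can be sketched as the following equation, where the coefficient symbols are generic notation rather than fitted values: $\beta_0$ is the intercept and $\beta_1, \beta_2, \beta_3$ are the slopes estimated from the data.

```latex
\widehat{\text{price}} = \beta_0
  + \beta_1 \cdot \text{bed}
  + \beta_2 \cdot \text{bath}
  + \beta_3 \cdot \text{house\_size}
```

Each slope is read as the expected change in price for a one-unit increase in that predictor while the other two are held fixed.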

In [None]:
# Multiple linear regression model
model <- lm(price ~ bed + bath + house_size, data = df_filt_num)

# Check regression results
summary(model)

In [None]:
#plot model results
plot(model)

In [None]:
#Check model

In [None]:
#Create a calculator to calculate house prices based on the results of my model
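
A price calculator of this kind could be sketched as a thin wrapper around `predict()`. The example below is only an illustration under stated assumptions: it fits a model of the same form on simulated data (`toy`, with invented coefficients), and the function name `estimate_price` is hypothetical. In the notebook, the fitted `model` object would be passed in instead of `toy_model`.

```r
set.seed(42)
n <- 200

# Simulated houses with invented relationships; NOT the real dataset
toy <- data.frame(
  bed        = sample(1:7, n, replace = TRUE),
  bath       = sample(1:6, n, replace = TRUE),
  house_size = runif(n, min = 500, max = 5000)
)
toy$price <- 50000 + 10000 * toy$bed + 15000 * toy$bath +
  120 * toy$house_size + rnorm(n, sd = 20000)

toy_model <- lm(price ~ bed + bath + house_size, data = toy)

# Hypothetical helper: predicted price for one house from a fitted lm
estimate_price <- function(model, bed, bath, house_size) {
  new_house <- data.frame(bed = bed, bath = bath, house_size = house_size)
  unname(predict(model, newdata = new_house))
}

small_house <- estimate_price(toy_model, bed = 3, bath = 2, house_size = 1500)
large_house <- estimate_price(toy_model, bed = 3, bath = 2, house_size = 3000)
print(c(small = small_house, large = large_house))
```

Estimates are only meaningful for inputs inside the ranges the model was trained on, which here would be the thresholds used to filter the data.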