# Home.LLC - Assessment

# Problem Statement

Q1. Find publicly available data for key *supply-demand* factors that influence US home prices *nationally*. Then, build a data science model that explains how these factors impacted home prices over the last 20 years.
Use the S&P Case-Shiller Home Price Index as a proxy for home prices: fred.stlouisfed.org/series/CSUSHPISA.

Based on the problem statement, I identified features that are likely to affect US home prices nationally. Some of these features are:
1) US GDP Growth
2) US Inflation
3) Crude Oil Prices
4) Property Taxes
5) Consumer Confidence
6) Construction material cost
7) Price to rent ratio
8) US Unemployment rate
9) US GDP per capita
10) US Economic growth
11) Housing Market Indices
12) National Home Ownership Rate
13) Population growth
14) Poverty Rate

After deciding on the features, I downloaded the datasets for them from the following websites:

1) https://fred.stlouisfed.org/series/CUUR0000SEHA
2) https://data.world/finance/home-construction-price-index
3) https://www.nar.realtor/research-and-statistics
4) https://www.census.gov/construction/nrc/
5) https://www.bea.gov/
6) https://www.bls.gov/
7) https://www.census.gov/construction/nrc/
8) https://www.zillow.com/research/data/

# Data Collection and Pre-Processing

In [1]:
# import the necessary libraries

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [56]:
# Loading Target Feature Index into a dataframe
df_target = pd.read_csv('CSUSHPISA (Target).csv')

# Changing dtype of the DATE column
df_target['DATE'] = pd.to_datetime(df_target['DATE'])

# Resetting the index (drop=True discards the old index instead of adding an 'index' column)
df_target.reset_index(drop=True, inplace=True)

# Creating "Year" and "Month" columns
df_target['Year'] = df_target['DATE'].dt.year
df_target['Month'] = df_target['DATE'].dt.month
print(df_target.shape)

df_target.head() # to get the first 5 rows

(276, 4)


Unnamed: 0,DATE,CSUSHPISA,Year,Month
0,2000-01-01,100.551,2000,1
1,2000-02-01,101.339,2000,2
2,2000-03-01,102.126,2000,3
3,2000-04-01,102.922,2000,4
4,2000-05-01,103.677,2000,5


In [57]:
# Loading Crude Oil Data into a dataframe
df_crud = pd.read_csv('Crude Oil.csv')

print(df_crud.shape)
df_crud.head()

(276, 2)


Unnamed: 0,DATE,Crude oil (WTISPLC)
0,01-01-2000,27.18
1,01-02-2000,29.35
2,01-03-2000,29.89
3,01-04-2000,25.74
4,01-05-2000,28.78


In [58]:
# Loading Construction material cost Data into a dataframe
df_con_mat = pd.read_csv('COMPUTSA(construction_mat_cost).csv')

print(df_con_mat.shape)
df_con_mat.head()

(276, 2)


Unnamed: 0,DATE,COMPUTSA(Construction materials cost)
0,01-01-2000,1574
1,01-02-2000,1677
2,01-03-2000,1704
3,01-04-2000,1610
4,01-05-2000,1682


In [59]:
# Loading Real Estate Market Indices Data into a dataframe
df_Remi = pd.read_excel('Real Estate Market Indices.xlsx')

# Rename the "Date" column to "DATE"
df_Remi.rename(columns={'Date': 'DATE'}, inplace=True)

print(df_Remi.shape)
df_Remi.head()

(57, 3)


Unnamed: 0,DATE,Q4TR771BIS(Real Estate Market Index,Unnamed: 2
0,2008-10-01,-0.6027,
1,2009-01-01,-1.2372,
2,2009-04-01,-1.1438,DATE
3,2009-07-01,-0.0078,
4,2009-10-01,1.6525,


In [6]:
# The 'Unnamed: 2' column is almost entirely NaN, so drop it
df_Remi.drop(columns=['Unnamed: 2'], inplace=True)

In [60]:
# to cross check whether the column is dropped or not
df_Remi.head()

Unnamed: 0,DATE,Q4TR771BIS(Real Estate Market Index,Unnamed: 2
0,2008-10-01,-0.6027,
1,2009-01-01,-1.2372,
2,2009-04-01,-1.1438,DATE
3,2009-07-01,-0.0078,
4,2009-10-01,1.6525,


In [61]:
# Loading 'US Consumer Confidence.xlsx' Data into a dataframe
df_cus_conf = pd.read_excel('US Consumer Confidence.xlsx')

# Rename the "Date" column to "DATE"
df_cus_conf.rename(columns={'Date': 'DATE'}, inplace=True)

print(df_cus_conf.shape)
df_cus_conf.head()

(276, 3)


Unnamed: 0,LOCATION,DATE,Value(Consumer Confidence)
0,USA,2000-01,102.8276
1,USA,2000-02,102.881
2,USA,2000-03,102.7992
3,USA,2000-04,102.7805
4,USA,2000-05,102.755


In [63]:
# Loading 'US Economic Growth.xlsx' Data into a dataframe
df_eco_growth = pd.read_excel('US Economic Growth.xlsx')

# Renaming the "Unnamed: 0" column, which holds the date values, to "DATE"
df_eco_growth.rename(columns={'Unnamed: 0': 'DATE'}, inplace=True)


print(df_eco_growth.shape)
df_eco_growth.head()

(77, 4)


Unnamed: 0,DATE,GDP,Per Capita,Growth Rate
0,date,Billions of US $,US $,Real GDP QoQ %
1,2000-03-01 00:00:00,10002.857,35558.88647,1.45
2,2000-06-01 00:00:00,10247.679,36339.02951,7.32
3,2000-09-01 00:00:00,10319.825,36495.60242,0.53
4,2000-12-01 00:00:00,10439.025,36819.61992,2.49


In [10]:
# Convert 'DATE' column to datetime with error handling
df_eco_growth['DATE'] = pd.to_datetime(df_eco_growth['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Filter out rows with invalid date values
df_eco_growth = df_eco_growth.dropna(subset=['DATE'])


# Extract the date part and overwrite the 'DATE' column
df_eco_growth['DATE'] = df_eco_growth['DATE'].dt.strftime('%Y-%m-%d')

In [62]:
df_eco_growth.head()

Unnamed: 0,DATE,GDP,Per Capita,Growth Rate
1,2000-03-01,10002.857,35558.88647,1.45
2,2000-06-01,10247.679,36339.02951,7.32
3,2000-09-01,10319.825,36495.60242,0.53
4,2000-12-01,10439.025,36819.61992,2.49
5,2001-03-01,10472.879,36854.40354,-1.14
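
The same three steps (coerce to datetime, drop unparseable rows, format as a date string) recur for several of the Excel files in this notebook. As a minimal sketch, they could be wrapped in one function; the name `clean_date_column` and the toy frame below are hypothetical and only mimic the layout seen above, where row 0 is a leftover header row.

```python
import pandas as pd

def clean_date_column(frame: pd.DataFrame, column: str = "DATE") -> pd.DataFrame:
    """Drop rows whose date cannot be parsed and keep dates as 'YYYY-MM-DD' strings."""
    out = frame.copy()  # work on a copy so the caller's frame is untouched
    out[column] = pd.to_datetime(out[column], format="%Y-%m-%d %H:%M:%S", errors="coerce")
    out = out.dropna(subset=[column])  # removes the header-like row (NaT after coercion)
    out[column] = out[column].dt.strftime("%Y-%m-%d")
    return out

# Toy frame (hypothetical) mimicking the Excel layout shown above.
raw = pd.DataFrame({
    "DATE": ["date", "2000-03-01 00:00:00", "2000-06-01 00:00:00"],
    "GDP": ["Billions of US $", 10002.857, 10247.679],
})
cleaned = clean_date_column(raw)
print(cleaned)
```

Returning a new frame rather than modifying in place also makes the result independent of the order in which notebook cells are re-run.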


In [64]:
# Loading 'US GDP Growth Rate.xlsx' Data into a dataframe
df_GDP_growth_rate = pd.read_excel('US GDP Growth Rate.xlsx')

# Rename the "date" column to "DATE"
df_GDP_growth_rate.rename(columns={'date': 'DATE'}, inplace=True)

print(df_GDP_growth_rate.shape)
df_GDP_growth_rate.head()

(23, 3)


Unnamed: 0,DATE,GDP Growth (%),Annual Change
0,2000-12-31,4.0772,-0.72
1,2001-12-31,0.9543,-3.12
2,2002-12-31,1.6959,0.74
3,2003-12-31,2.7962,1.1
4,2004-12-31,3.8526,1.06


In [65]:
# Loading 'US GDP per Capita.xlsx' Data into a dataframe
df_GDP_per_capita = pd.read_excel('US GDP per Capita.xlsx')

# Rename the "Unnamed: 0" column, which holds the dates, to "DATE"
df_GDP_per_capita.rename(columns={'Unnamed: 0': 'DATE'}, inplace=True)

# Give the blank-named columns from the Excel file descriptive names
df_GDP_per_capita.rename(columns={' ': 'GDP Per Capita(US $)'}, inplace=True)

df_GDP_per_capita.rename(columns={' .1': 'Annual Growth Rate'}, inplace=True)


print(df_GDP_per_capita.shape)
df_GDP_per_capita.head()

(24, 4)


Unnamed: 0,DATE,GDP Per Capita(US $),Annual Growth Rate,.2
0,date,GDP Per Capita (US $),Annual Growth Rate (%),
1,2000-12-31 00:00:00,36329.9561,5.26,
2,2001-12-31 00:00:00,37133.6231,2.21,
3,2002-12-31 00:00:00,37997.7597,2.33,
4,2003-12-31 00:00:00,39490.275,3.93,


In [14]:
# Convert 'DATE' column to datetime with error handling
df_GDP_per_capita['DATE'] = pd.to_datetime(df_GDP_per_capita['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Filter out rows with invalid date values
df_GDP_per_capita = df_GDP_per_capita.dropna(subset=['DATE'])


# Extract the date part and overwrite the 'DATE' column
df_GDP_per_capita['DATE'] = df_GDP_per_capita['DATE'].dt.strftime('%Y-%m-%d')

In [66]:
df_GDP_per_capita.head()

Unnamed: 0,DATE,GDP Per Capita(US $),Annual Growth Rate,.2
0,date,GDP Per Capita (US $),Annual Growth Rate (%),
1,2000-12-31 00:00:00,36329.9561,5.26,
2,2001-12-31 00:00:00,37133.6231,2.21,
3,2002-12-31 00:00:00,37997.7597,2.33,
4,2003-12-31 00:00:00,39490.275,3.93,


In [67]:
# The ' .2' column contains only NaN values, so drop it
df_GDP_per_capita.drop(columns=[' .2'], inplace=True)

In [68]:
# to cross check
df_GDP_per_capita.head()

Unnamed: 0,DATE,GDP Per Capita(US $),Annual Growth Rate
0,date,GDP Per Capita (US $),Annual Growth Rate (%)
1,2000-12-31 00:00:00,36329.9561,5.26
2,2001-12-31 00:00:00,37133.6231,2.21
3,2002-12-31 00:00:00,37997.7597,2.33
4,2003-12-31 00:00:00,39490.275,3.93


In [69]:
# Loading 'US Inflation data.xlsx' Data into a dataframe
df_inflation = pd.read_excel('US Inflation data.xlsx')

# Rename the "Unnamed: 0" column, which holds the dates, to "DATE"
df_inflation.rename(columns={'Unnamed: 0': 'DATE'}, inplace=True)

df_inflation.rename(columns={' ': 'Inflation Rate(%)'}, inplace=True)

df_inflation.rename(columns={' .1': 'Annual Change(Inflation Rate)'}, inplace=True)

print(df_inflation.shape)
df_inflation.head()

(24, 4)


Unnamed: 0,DATE,Inflation Rate(%),Annual Change(Inflation Rate),.2
0,date,Inflation Rate (%),Annual Change(Inflation Rate),
1,2000-12-31 00:00:00,3.3769,1.19,
2,2001-12-31 00:00:00,2.8262,-0.55,
3,2002-12-31 00:00:00,1.586,-1.24,
4,2003-12-31 00:00:00,2.2701,0.68,


In [19]:
# Convert 'DATE' column to datetime with error handling
df_inflation['DATE'] = pd.to_datetime(df_inflation['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Filter out rows with invalid date values
df_inflation = df_inflation.dropna(subset=['DATE'])


# Extract the date part and overwrite the 'DATE' column
df_inflation['DATE'] = df_inflation['DATE'].dt.strftime('%Y-%m-%d')

In [70]:
df_inflation.head()

Unnamed: 0,DATE,Inflation Rate(%),Annual Change(Inflation Rate),.2
0,date,Inflation Rate (%),Annual Change(Inflation Rate),
1,2000-12-31 00:00:00,3.3769,1.19,
2,2001-12-31 00:00:00,2.8262,-0.55,
3,2002-12-31 00:00:00,1.586,-1.24,
4,2003-12-31 00:00:00,2.2701,0.68,


In [21]:
# to drop the NaN value column
df_inflation.drop(columns=[' .2'], inplace=True)

In [71]:
# to cross check
df_inflation.head()

Unnamed: 0,DATE,Inflation Rate(%),Annual Change(Inflation Rate),.2
0,date,Inflation Rate (%),Annual Change(Inflation Rate),
1,2000-12-31 00:00:00,3.3769,1.19,
2,2001-12-31 00:00:00,2.8262,-0.55,
3,2002-12-31 00:00:00,1.586,-1.24,
4,2003-12-31 00:00:00,2.2701,0.68,


In [72]:
# Loading 'US National Home Ownership rate.xlsx' Data into a dataframe
df_NH_owner = pd.read_excel('US National Home Ownership rate.xlsx')

print(df_NH_owner.shape)
df_NH_owner.head()

(92, 2)


Unnamed: 0,DATE,RSAHORUSQ156S(US national home ownership)
0,2000-01-01,67.1
1,2000-04-01,67.3
2,2000-07-01,67.5
3,2000-10-01,67.5
4,2001-01-01,67.6


In [73]:
# Loading 'US Population.xlsx' Data into a dataframe
df_population = pd.read_excel('US Population.xlsx')

# Rename the "date" column to "DATE"
df_population.rename(columns={'date': 'DATE'}, inplace=True)

print(df_population.shape)
df_population.head()

(23, 3)


Unnamed: 0,DATE,Population,Annual Growth Rate(Population)
0,2000-12-31,282398554,1.15
1,2001-12-31,285470493,1.09
2,2002-12-31,288350252,1.01
3,2003-12-31,291109820,0.96
4,2004-12-31,293947885,0.97


In [74]:
# Loading 'US Poverty Rate.xlsx' Data into a dataframe
df_poverty_rate = pd.read_excel('US Poverty Rate.xlsx')

# Rename the "Unnamed: 0" column, which holds the dates, to "DATE"
df_poverty_rate.rename(columns={'Unnamed: 0': 'DATE'}, inplace=True)

df_poverty_rate.rename(columns={' ': 'Poverty Rate % Under US $5.50 Per Day'}, inplace=True)

df_poverty_rate.rename(columns={' .1': 'Change(Poverty Rate)'}, inplace=True)



print(df_poverty_rate.shape)
df_poverty_rate.head()

(22, 4)


Unnamed: 0,DATE,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),.2
0,date,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),
1,2000-12-31 00:00:00,1.5,0,
2,2001-12-31 00:00:00,1.7,0.2,
3,2002-12-31 00:00:00,1.7,0,
4,2003-12-31 00:00:00,2,0.3,


In [26]:
# Convert 'DATE' column to datetime with error handling
df_poverty_rate['DATE'] = pd.to_datetime(df_poverty_rate['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Filter out rows with invalid date values
df_poverty_rate = df_poverty_rate.dropna(subset=['DATE'])


# Extract the date part and overwrite the 'DATE' column
df_poverty_rate['DATE'] = df_poverty_rate['DATE'].dt.strftime('%Y-%m-%d')

In [75]:
df_poverty_rate.head()

Unnamed: 0,DATE,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),.2
0,date,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),
1,2000-12-31 00:00:00,1.5,0,
2,2001-12-31 00:00:00,1.7,0.2,
3,2002-12-31 00:00:00,1.7,0,
4,2003-12-31 00:00:00,2,0.3,


In [28]:
# to drop the unwanted column
df_poverty_rate.drop(columns=[' .2'], inplace=True)

In [76]:
# to cross check
df_poverty_rate.head()

Unnamed: 0,DATE,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),.2
0,date,Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),
1,2000-12-31 00:00:00,1.5,0,
2,2001-12-31 00:00:00,1.7,0.2,
3,2002-12-31 00:00:00,1.7,0,
4,2003-12-31 00:00:00,2,0.3,


In [77]:
# Loading 'US price to rent.xlsx' Data into a dataframe
df_price_to_rent = pd.read_excel('US price to rent.xlsx')

print(df_price_to_rent.shape)
df_price_to_rent.head()

(276, 2)


Unnamed: 0,DATE,CUUR0000SEHA(Price to Rent)
0,2000-01-01,181.1
1,2000-02-01,181.5
2,2000-03-01,182.0
3,2000-04-01,182.3
4,2000-05-01,182.7


In [78]:
# Loading 'US Unemployement Rate.xlsx' Data into a dataframe
df_unemp_rate = pd.read_excel('US Unemployement Rate.xlsx')

# Rename the "Unnamed: 0" column, which holds the dates, to "DATE"
df_unemp_rate.rename(columns={'Unnamed: 0': 'DATE'}, inplace=True)

df_unemp_rate.rename(columns={' ': 'Unemployement Rate %'}, inplace=True)

df_unemp_rate.rename(columns={' .1': 'Annual Change(US unemployment rate)'}, inplace=True)


print(df_unemp_rate.shape)
df_unemp_rate.head()

(24, 4)


Unnamed: 0,DATE,Unemployement Rate %,Annual Change(US unemployment rate),.2
0,date,Unemployment Rate (%),Annual Change(US unemployment rate),
1,2000-12-31 00:00:00,3.99,-0.23,
2,2001-12-31 00:00:00,4.73,0.74,
3,2002-12-31 00:00:00,5.78,1.05,
4,2003-12-31 00:00:00,5.99,0.21,


In [79]:
# to drop the NaN containing column
df_unemp_rate.drop(columns=[' .2'], inplace=True)

In [80]:
# To cross check
df_unemp_rate.head()

Unnamed: 0,DATE,Unemployement Rate %,Annual Change(US unemployment rate)
0,date,Unemployment Rate (%),Annual Change(US unemployment rate)
1,2000-12-31 00:00:00,3.99,-0.23
2,2001-12-31 00:00:00,4.73,0.74
3,2002-12-31 00:00:00,5.78,1.05
4,2003-12-31 00:00:00,5.99,0.21


In [36]:
# Loading 'US Property tax.xlsx' Data into a dataframe
df_property_tax = pd.read_excel('US Property tax.xlsx')

print(df_property_tax.shape)
df_property_tax.tail()

(92, 2)


Unnamed: 0,DATE,BOGZ1FL513178005Q(US Property Tax)
87,2021-10-01,10269
88,2022-01-01,11273
89,2022-04-01,8710
90,2022-07-01,8203
91,2022-10-01,10504


In [37]:
# Concatenating all the dataframes (monthly, quarterly and annual) on the DATE index to create one dataframe
df = pd.DataFrame()
df_bymonth = [df_target, df_crud, df_con_mat, df_cus_conf, df_price_to_rent, df_Remi, df_eco_growth,
              df_GDP_growth_rate, df_GDP_per_capita, df_inflation, df_NH_owner, df_population, df_poverty_rate,
              df_unemp_rate, df_property_tax]
for df1 in df_bymonth:
    df1["DATE"] = pd.to_datetime(df1["DATE"])
    df1 = df1.set_index("DATE")
    df = pd.concat([df,df1], axis = 1)
print(df.shape)
df.head()

(552, 26)


Unnamed: 0_level_0,CSUSHPISA,Year,Month,Crude oil (WTISPLC),COMPUTSA(Construction materials cost),LOCATION,Value(Consumer Confidence),CUUR0000SEHA(Price to Rent),Q4TR771BIS(Real Estate Market Index,GDP,...,Inflation Rate(%),Annual Change(Inflation Rate),RSAHORUSQ156S(US national home ownership),Population,Annual Growth Rate(Population),Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),Unemployement Rate %,Annual Change(US unemployment rate),BOGZ1FL513178005Q(US Property Tax)
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2000-01-01,100.551,2000.0,1.0,27.18,1574.0,USA,102.8276,181.1,,,...,,,67.1,,,,,,,9621.0
2000-01-02,,,,29.35,1677.0,,,,,,...,,,,,,,,,,
2000-01-03,,,,29.89,1704.0,,,,,,...,,,,,,,,,,
2000-01-04,,,,25.74,1610.0,,,,,,...,,,,,,,,,,
2000-01-05,,,,28.78,1682.0,,,,,,...,,,,,,,,,,
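
The combined frame has 552 rows although each monthly file has 276. The preview hints at the cause: crude oil and construction material dates such as `01-02-2000` end up on 2000-01-02, that is, they are read month-first and no longer line up with the first-of-month dates of the target. The sketch below assumes those two files store dates as DD-MM-YYYY (one row per month) and shows one way to parse them with an explicit format; it is an illustration, not a re-run of the notebook.

```python
import pandas as pd

# Two sample values copied from the crude oil preview above.
dates = pd.Series(["01-01-2000", "01-02-2000"])

# Assumption: the file uses day-first dates (DD-MM-YYYY).
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# With the explicit format, the second row becomes 1 February 2000, not 2 January 2000.
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

If that assumption holds, parsing these two files this way before the concatenation should keep every monthly series on the same 276 dates.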


In [38]:
# to get the basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 552 entries, 2000-01-01 to 2022-12-31
Data columns (total 26 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   CSUSHPISA                                  276 non-null    float64
 1   Year                                       276 non-null    float64
 2   Month                                      276 non-null    float64
 3   Crude oil (WTISPLC)                        276 non-null    float64
 4   COMPUTSA(Construction materials cost)      276 non-null    float64
 5   LOCATION                                   276 non-null    object 
 6   Value(Consumer Confidence)                 276 non-null    float64
 7   CUUR0000SEHA(Price to Rent)                276 non-null    float64
 8   Q4TR771BIS(Real Estate Market Index        57 non-null     float64
 9    GDP                                       76 non-null     object 
 10   Per Ca

In [81]:
# to get the statistical details
df.describe()

Unnamed: 0,CSUSHPISA,Year,Month,Crude oil (WTISPLC),COMPUTSA(Construction materials cost),Value(Consumer Confidence),CUUR0000SEHA(Price to Rent),Q4TR771BIS(Real Estate Market Index,GDP,Per Capita,...,Inflation Rate(%),Annual Change(Inflation Rate),RSAHORUSQ156S(US national home ownership),Population,Annual Growth Rate(Population),Poverty Rate % Under US $5.50 Per Day,Change(Poverty Rate),Unemployement Rate %,Annual Change(US unemployment rate),BOGZ1FL513178005Q(US Property Tax)
count,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0,...,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0,552.0
mean,170.388513,2011.019928,4.218297,61.214909,1249.425725,99.886569,262.58382,1.502449,14939.905197,48464.566329,...,2.493209,0.253478,66.397917,313035700.0,0.841446,1.847619,-0.014286,5.854391,-0.026522,8864.797101
std,43.728477,6.638484,3.509907,24.678488,418.383125,1.502286,52.081856,1.315686,1114.890437,2837.811746,...,0.332513,0.323072,1.788661,16565890.0,0.188673,0.044116,0.04439,0.36505,0.301361,1656.343
min,100.551,2000.0,1.0,16.55,520.0,96.12276,181.1,-1.5914,10002.857,35558.88647,...,-0.3555,-4.19,63.1,282398600.0,0.31,1.2,-0.5,3.611,-2.7,5135.0
25%,141.66325,2005.0,1.479167,41.504615,917.096154,98.6688,217.875,1.192661,14939.905197,48464.566329,...,2.493209,0.253478,64.898214,298995200.0,0.797396,1.847619,-0.014286,5.854391,-0.026522,7709.988095
50%,163.862,2011.0,1.958333,58.507308,1262.115385,100.360388,251.817583,1.502449,14939.905197,48464.566329,...,2.493209,0.253478,66.5,313035700.0,0.87,1.847619,-0.014286,5.854391,-0.026522,8700.785714
75%,187.024875,2017.0,7.0,80.088462,1603.25,101.051519,303.7925,2.129375,14939.905197,48464.566329,...,2.493209,0.253478,67.9,327882300.0,0.976771,1.847619,-0.014286,5.854391,-0.026522,9783.964286
max,304.817,2022.0,12.0,133.93,2245.0,102.881,385.649,5.5134,20891.367,63647.20309,...,8.0028,3.46,69.4,338289900.0,1.15,2.2,0.3,9.63,4.38,12919.0


In [39]:
# to check if the data set has any null values
df.isnull().sum()

CSUSHPISA                                    276
Year                                         276
Month                                        276
Crude oil (WTISPLC)                          276
COMPUTSA(Construction materials cost)        276
LOCATION                                     276
Value(Consumer Confidence)                   276
CUUR0000SEHA(Price to Rent)                  276
Q4TR771BIS(Real Estate Market Index          495
 GDP                                         476
 Per Capita                                  476
 Growth Rate                                 476
 GDP Growth (%)                              529
 Annual Change                               529
GDP Per Capita(US $)                         529
Annual Growth Rate                           529
Inflation Rate(%)                            529
Annual Change(Inflation Rate)                529
RSAHORUSQ156S(US national home ownership)    460
 Population                                  529
 Annual Growth Rate(

The output shows that the dataset has null values. This is expected, since series of different frequencies were concatenated on one date index. We impute the null values in three steps: forward fill, linear interpolation, and finally the column mean.
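
To make the difference between forward fill and linear interpolation concrete, here is a small sketch on a toy series. The three quarterly values are taken from the home ownership preview above; the monthly index and variable names are made up for illustration.

```python
import pandas as pd

# Monthly target index and a quarterly series placed on it.
monthly_index = pd.date_range("2000-01-01", periods=7, freq="MS")
quarterly = pd.Series(
    [67.1, 67.3, 67.5],
    index=pd.to_datetime(["2000-01-01", "2000-04-01", "2000-07-01"]),
)
on_monthly = quarterly.reindex(monthly_index)  # months without a value become NaN

filled_forward = on_monthly.ffill()                      # repeat the last known value
filled_linear = on_monthly.interpolate(method="linear")  # straight line between known values

print(pd.DataFrame({"ffill": filled_forward, "linear": filled_linear}))
```

Forward fill keeps each quarterly value constant until the next release, while linear interpolation spreads the change evenly over the months in between.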

In [44]:
# Forward fill missing values in the 'LOCATION' column
df['LOCATION'] = df['LOCATION'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df['LOCATION'].isnull().sum()

0

In [46]:
# Interpolate missing values (linear interpolation)
df.interpolate(method='linear', inplace=True)


In [47]:
# to cross check
df.isnull().sum()

CSUSHPISA                                      0
Year                                           0
Month                                          0
Crude oil (WTISPLC)                            0
COMPUTSA(Construction materials cost)          0
LOCATION                                       0
Value(Consumer Confidence)                     0
CUUR0000SEHA(Price to Rent)                    0
Q4TR771BIS(Real Estate Market Index          212
 GDP                                         476
 Per Capita                                  476
 Growth Rate                                 476
 GDP Growth (%)                               23
 Annual Change                                23
GDP Per Capita(US $)                         529
Annual Growth Rate                           529
Inflation Rate(%)                            529
Annual Change(Inflation Rate)                529
RSAHORUSQ156S(US national home ownership)      0
 Population                                   23
 Annual Growth Rate(

Some columns still contain NaN values, so the remaining gaps are filled with the column mean in the next step.
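
Two likely reasons why interpolation left gaps are worth noting. First, `df.info()` lists columns such as GDP with dtype `object`, and interpolation is only applied to numeric columns. Second, linear interpolation by default does not fill NaN values that come before the first valid observation. The sketch below illustrates both points on a toy column; the values and variable names are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy column stored as text, with a gap at the start and one in the middle.
col = pd.Series([np.nan, "10002.857", np.nan, "10247.679"], dtype="object")

numeric = pd.to_numeric(col, errors="coerce")  # object -> float64

# Default direction is forward: the leading NaN stays.
default_fill = numeric.interpolate(method="linear")

# limit_direction="both" also fills the leading NaN with the nearest valid value.
both_fill = numeric.interpolate(method="linear", limit_direction="both")

print(default_fill.isna().sum(), both_fill.isna().sum())
```

Converting the object columns to numeric before interpolating would leave fewer values to be replaced by the overall mean, which flattens any trend in the series.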

In [49]:
# Fill missing numerical values with the mean
df.fillna(df.mean(), inplace=True)


In [50]:
# to cross check
df.isnull().sum()

CSUSHPISA                                    0
Year                                         0
Month                                        0
Crude oil (WTISPLC)                          0
COMPUTSA(Construction materials cost)        0
LOCATION                                     0
Value(Consumer Confidence)                   0
CUUR0000SEHA(Price to Rent)                  0
Q4TR771BIS(Real Estate Market Index          0
 GDP                                         0
 Per Capita                                  0
 Growth Rate                                 0
 GDP Growth (%)                              0
 Annual Change                               0
GDP Per Capita(US $)                         0
Annual Growth Rate                           0
Inflation Rate(%)                            0
Annual Change(Inflation Rate)                0
RSAHORUSQ156S(US national home ownership)    0
 Population                                  0
 Annual Growth Rate(Population)              0
Poverty Rate 

Now the dataset is free of null values. We have successfully collected the required data for this problem statement and pre-processed it for the next steps.

In [51]:
# to get the shape
df.shape

(552, 26)

In [54]:
# to save the file in csv format
df.to_csv("home_llc_preprocessed_dataset.csv")