# COGS 108 - Data Checkpoint

# Names

- Gavin Zhao
- Simon Zheng
- Tunan Li
- Yufei Zhang

<a id='research_question'></a>
# Research Question

How do domestic economic differences contribute to the migration of American citizens among different states? Specifically, how do in-state median salaries, Gross Domestic Product, and housing prices relate to changes in state populations in the US from 2010 to 2019, and are there any significant trends we can observe?

# Dataset(s)


**Dataset 1 - US Migration Flow in 10 Years**
- Dataset Name: Migration_Flows_from_2010_to_2019.csv
- Link to the dataset: https://www.kaggle.com/datasets/finnegannguyen/statetostate-migration-flows-from-2010-to-2019
- Number of observations: 16,848
- Description: This is our main population dataset. It covers migration flows in the United States from 2010 to 2019 with an adequate sample size (the number of states being evaluated). The source is the US Census Bureau's American Community Survey, available at https://www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html. According to the US Census Bureau, the estimates are subject to sampling variability, which is reported as margins of error at the 90% confidence level, in addition to potential nonsampling error.
The data set includes the following variables:
1. The state in which people reside in the year of measurement (the current state)
2. The year of measurement
3. The population of the current state in the year of measurement
4. The number of residents who remained in the same house as the previous year
5. The number of residents who remained in the same state as the previous year
6. The total number of people who migrated to the current state from other states
7. The number of people who migrated to the current state from abroad
8. The place of origin from which people migrated to the current state
9. The number of people who migrated to the current state from that place of origin
<br>
<br>

**Dataset 2 - Median Household Income**

- Dataset Name: Median household income, by state: Selected years, 1990 through 2019
- Link to the dataset: https://nces.ed.gov/programs/digest/d20/tables/dt20_102.30.asp
- Number of observations: 520

This dataset includes 520 observations of median annual household income by state for selected years from 1990 through 2019. We will combine it with Dataset 1 above, matching on state and year. These data will help us answer how median household income influences migration between states.
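
As a rough sketch of how that combination could work, the snippet below joins two small stand-in tables on state and year. The column names (`current_state`, `state`, `median_hhi`) follow the ones used in the cleaning steps of this notebook, and the four rows are copied from the outputs shown there. The `how="left"` and `validate` choices are our assumptions, not a settled design.

```python
import pandas as pd

# Stand-in for the cleaned migration table (values taken from the notebook outputs).
migration = pd.DataFrame({
    "current_state": ["Alabama", "Alaska", "Washington", "Wisconsin"],
    "year": [2010, 2010, 2019, 2019],
    "from_different_state_Total": [108723, 36326, 231956, 107973],
})

# Stand-in for the cleaned household income table.
income = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Washington", "Wisconsin"],
    "year": [2010, 2010, 2019, 2019],
    "median_hhi": [47600, 75900, 78700, 64200],
})

# Align the key names, then join on state and year.
# validate="many_to_one" raises if the income table has duplicate (state, year) keys.
merged = migration.merge(
    income.rename(columns={"state": "current_state"}),
    on=["current_state", "year"],
    how="left",
    validate="many_to_one",
)
print(merged)
```

A left join keeps every migration row even when no income value matches, so missing matches show up as NaN in `median_hhi` instead of silently disappearing.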

<br>
<br>

**Dataset 3 - US GDP by State 1997-2020**
- Dataset Name: US GDP by State 1997 - 2020 (How much does each state produce?)
- Link to the dataset: https://www.kaggle.com/datasets/davidbroberts/us-gdp-by-state-19972020
- Number of observations: 143,702
- Description: This dataset contains the official data from the Bureau of Economic Analysis on the GDP of individual US states, as well as overall US GDP, from 1997 to 2020. It includes GDP breakdowns by industry for each state, covering industries such as agriculture, mining, and oil.
<br>
<br>


**Dataset 4 - Average Housing Price**

- Dataset Name: Average housing price - economic research
- Link to the dataset: https://fred.stlouisfed.org/searchresults?st=housing+price&pageID=2
- Number of observations: 16,520
- Description: Each state's data link contains about 280 rows of housing prices for that state from 1963 to 2022. The two columns are the time (by quarter) and the housing price index. We will merge the per-state tables into one CSV file that contains all the information we need to work with the price index, and eventually examine the relationship between housing prices and migration flow. The states of interest (the states that will be included in the main table) will be selected based on the results of the GDP and average salary analyses. For example, if California has a high inflow of people that is strongly related to the GDP and average salary trends in that state, we will go back to the California housing price index and analyze whether it plays a crucial role.
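
The merge of per-state tables described above might look roughly like the following. This is only a sketch: the column names (`DATE`, `CASTHPI`, `WASTHPI`), the output column name `housing_price_index`, and all numeric values are placeholders for illustration, since we have not yet downloaded or selected the actual tables.

```python
import io
import pandas as pd

# Placeholder per-state extracts: a date column plus one index column per state.
# Column names and values are made up for illustration only.
state_csvs = {
    "California": "DATE,CASTHPI\n2010-01-01,100.0\n2010-04-01,101.5\n",
    "Washington": "DATE,WASTHPI\n2010-01-01,200.0\n2010-04-01,198.0\n",
}

frames = []
for state, text in state_csvs.items():
    df = pd.read_csv(io.StringIO(text), parse_dates=["DATE"])
    # Each table names its value column differently, so give it a shared name.
    df = df.rename(columns={df.columns[1]: "housing_price_index"})
    # Record which state the rows belong to before stacking.
    df.insert(0, "state", state)
    frames.append(df)

# Stack all states into one long table and serialize it as CSV text.
housing = pd.concat(frames, ignore_index=True)
csv_text = housing.to_csv(index=False)
print(housing)
```

Storing the state as a column (a long table) rather than one column per state should make it easier to later join on state and year.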

# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import patsy
import statsmodels.api as sm
import datetime



# Data Cleaning

### For Dataset 1, we will be...

*  Removing unnecessary columns and rows
*  Renaming column name for clarification
*  Dropping null values in the table

First, we will import the dataset:

In [18]:
US_migration = pd.read_csv("Migration_Flows_from_2010_to_2019.csv")
US_migration

Unnamed: 0,current_state,year,population,same_house,same_state,from_different_state_Total,abroad_Total,from,number_of_people
0,Alabama,2010,4729509,3987155,620465,108723,13166,Alabama,0
1,Alaska,2010,702974,565031,95878,36326,5739,Alabama,477
2,Arizona,2010,6332786,5069002,1001991,222725,39068,Alabama,416
3,Arkansas,2010,2888304,2387806,412997,79127,8374,Alabama,1405
4,California,2010,36907897,30790221,5413287,444749,259640,Alabama,3364
...,...,...,...,...,...,...,...,...,...
28075,Washington,2019,7527366,6253469,977928,231956,64013,abroad_ForeignCountry,62230
28076,West Virginia,2019,1773280,1563611,164739,39548,5382,abroad_ForeignCountry,4776
28077,Wisconsin,2019,5760481,5001140,634732,107973,16636,abroad_ForeignCountry,16123
28078,Wyoming,2019,572884,473128,68127,30247,1382,abroad_ForeignCountry,1382


Removing unnecessary columns:
- Specifically, we removed the number of people who migrated to the current state from abroad and the number of residents who remained in the same house, because our study is limited to domestic migration within the US and we do not examine how people changed housing within a state.

In [19]:
col_names = list(US_migration.columns)
print(col_names)

['current_state', 'year', 'population', 'same_house', 'same_state', 'from_different_state_Total', 'abroad_Total', 'from', 'number_of_people']


In [20]:
# removed the column "abroad_Total" and "same_house"
US_migration = US_migration[['current_state', 'year', 'population', 'same_state', 'from_different_state_Total', 'from', 'number_of_people']]
US_migration

Unnamed: 0,current_state,year,population,same_state,from_different_state_Total,from,number_of_people
0,Alabama,2010,4729509,620465,108723,Alabama,0
1,Alaska,2010,702974,95878,36326,Alabama,477
2,Arizona,2010,6332786,1001991,222725,Alabama,416
3,Arkansas,2010,2888304,412997,79127,Alabama,1405
4,California,2010,36907897,5413287,444749,Alabama,3364
...,...,...,...,...,...,...,...
28075,Washington,2019,7527366,977928,231956,abroad_ForeignCountry,62230
28076,West Virginia,2019,1773280,164739,39548,abroad_ForeignCountry,4776
28077,Wisconsin,2019,5760481,634732,107973,abroad_ForeignCountry,16123
28078,Wyoming,2019,572884,68127,30247,abroad_ForeignCountry,1382


Removing unnecessary rows:
For the same reason, we filter on the "from" column to remove the rows that record people who migrated to the current state from a foreign country.

In [21]:
US_migration = US_migration.loc[US_migration['from'] != 'abroad_ForeignCountry']
US_migration

Unnamed: 0,current_state,year,population,same_state,from_different_state_Total,from,number_of_people
0,Alabama,2010,4729509,620465,108723,Alabama,0
1,Alaska,2010,702974,95878,36326,Alabama,477
2,Arizona,2010,6332786,1001991,222725,Alabama,416
3,Arkansas,2010,2888304,412997,79127,Alabama,1405
4,California,2010,36907897,5413287,444749,Alabama,3364
...,...,...,...,...,...,...,...
27555,Washington,2019,7527366,977928,231956,abroad_USIslandArea,1465
27556,West Virginia,2019,1773280,164739,39548,abroad_USIslandArea,0
27557,Wisconsin,2019,5760481,634732,107973,abroad_USIslandArea,0
27558,Wyoming,2019,572884,68127,30247,abroad_USIslandArea,0
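
Note that the output above still contains rows whose origin is `abroad_USIslandArea`. Whether those count as domestic is a judgment call for the analysis. If we decide to exclude every origin labeled `abroad_`, a prefix filter is one possible approach. The sketch below uses a toy table with three rows copied from the outputs in this notebook; treating the `abroad_` prefix as the marker for all non-state origins is our assumption.

```python
import pandas as pd

# Toy rows copied from the notebook outputs above.
toy = pd.DataFrame({
    "current_state": ["Alaska", "Washington", "Washington"],
    "from": ["Alabama", "abroad_USIslandArea", "abroad_ForeignCountry"],
    "number_of_people": [477, 1465, 62230],
})

# Flag every origin whose label starts with "abroad_" and keep the rest.
is_abroad = toy["from"].str.startswith("abroad_")
# .copy() gives an independent frame, so later in-place edits do not warn.
domestic = toy.loc[~is_abroad].copy()
print(domestic)
```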


Then, we rename the columns for clarity:

In [26]:
US_migration = US_migration.rename(columns={'from': 'migrated_from', 'number_of_people': 'number_from_given_state'})
US_migration.head()


Unnamed: 0,current_state,year,population,same_state,from_different_state_Total,migrated_from,number_from_given_state
0,Alabama,2010,4729509,620465,108723,Alabama,0
1,Alaska,2010,702974,95878,36326,Alabama,477
2,Arizona,2010,6332786,1001991,222725,Alabama,416
3,Arkansas,2010,2888304,412997,79127,Alabama,1405
4,California,2010,36907897,5413287,444749,Alabama,3364


Lastly, we check whether the table contains any null values.
- According to the result below, there are no null values, so no rows need to be dropped.

In [31]:
US_migration.isna().any()
US_migration.isnull().sum().sum()

0

Importing Datasets:

In [2]:
US_GDP = pd.read_csv('US_GDP_by State_1997_to_2020.csv', skipinitialspace=True)
Household_income = pd.read_csv('Household_income.csv')
US_GDP
Household_income

Unnamed: 0,column_index,column_level,column_level_1,column_level_1_ref_note,digest_table_sub_id,is_total,location,location_type,row_index,row_level,row_level_1,value,year,deflator,row_level_2,standard_error
0,A,1990,1990,Based on 1989 incomes collected in the 1990 ce...,A,True,United States,country,1,United States,United States,60000,1990,Consumer Price Index Research Series Using Cur...,,
1,A,1990,1990,Based on 1989 incomes collected in the 1990 ce...,A,False,Alabama,State,2,United States:::Alabama,United States,47100,1990,Consumer Price Index Research Series Using Cur...,Alabama,
2,A,1990,1990,Based on 1989 incomes collected in the 1990 ce...,A,False,Alaska,State,3,United States:::Alaska,United States,82700,1990,Consumer Price Index Research Series Using Cur...,Alaska,
3,A,1990,1990,Based on 1989 incomes collected in the 1990 ce...,A,False,Arizona,State,4,United States:::Arizona,United States,55000,1990,Consumer Price Index Research Series Using Cur...,Arizona,
4,A,1990,1990,Based on 1989 incomes collected in the 1990 ce...,A,False,Arkansas,State,5,United States:::Arkansas,United States,42200,1990,Consumer Price Index Research Series Using Cur...,Arkansas,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515,Q,2019,2019,,A,False,Virginia,State,48,United States:::Virginia,United States,76500,2019,Consumer Price Index Research Series Using Cur...,Virginia,510.0
516,Q,2019,2019,,A,False,Washington,State,49,United States:::Washington,United States,78700,2019,Consumer Price Index Research Series Using Cur...,Washington,560.0
517,Q,2019,2019,,A,False,West Virginia,State,50,United States:::West Virginia,United States,48900,2019,Consumer Price Index Research Series Using Cur...,West Virginia,690.0
518,Q,2019,2019,,A,False,Wisconsin,State,51,United States:::Wisconsin,United States,64200,2019,Consumer Price Index Research Series Using Cur...,Wisconsin,360.0


### For Dataset 2, we will be...

*  Removing unnecessary columns and rows
*  Selecting observations from 2010-2019
*  Renaming columns for clarity
*  Dropping null values


#### Selecting observations, removing columns and rows

In [3]:
Household_income = Household_income.loc[Household_income['year'] >= 2010]
Household_income = Household_income.loc[Household_income['location'] != 'United States']
Household_income = Household_income[['location', 'value', 'year']].reset_index(drop=True)
Household_income

Unnamed: 0,location,value,year
0,Alabama,47600,2010
1,Alaska,75900,2010
2,Arizona,55000,2010
3,Arkansas,45000,2010
4,California,67800,2010
...,...,...,...
352,Virginia,76500,2019
353,Washington,78700,2019
354,West Virginia,48900,2019
355,Wisconsin,64200,2019


#### Renaming column name for clarification

In [4]:
Household_income = Household_income.rename(columns={'value': 'median_hhi', 'location': 'state'})
Household_income

Unnamed: 0,state,median_hhi,year
0,Alabama,47600,2010
1,Alaska,75900,2010
2,Arizona,55000,2010
3,Arkansas,45000,2010
4,California,67800,2010
...,...,...,...
352,Virginia,76500,2019
353,Washington,78700,2019
354,West Virginia,48900,2019
355,Wisconsin,64200,2019


#### Dropping null values

In [12]:
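# dropna() returns a new DataFrame, so assign the result back to keep it
Household_income = Household_income.dropna()
Household_income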

Unnamed: 0,state,median_hhi,year
0,Alabama,47600,2010
1,Alaska,75900,2010
2,Arizona,55000,2010
3,Arkansas,45000,2010
4,California,67800,2010
...,...,...,...
352,Virginia,76500,2019
353,Washington,78700,2019
354,West Virginia,48900,2019
355,Wisconsin,64200,2019


### For Dataset 3, we will be...

*  Removing unnecessary columns
*  Renaming some columns
*  Examining and converting GDP values to floats
*  Dropping null values


#### Removing columns

In [5]:
column_names = list(US_GDP.columns)
print(column_names)

['GeoFIPS', 'GeoName', 'Region', 'TableName', 'LineCode', 'IndustryClassification', 'Description', 'Unit', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']


In [6]:
US_GDP = US_GDP[['GeoName', 'Description', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']]
US_GDP

Unnamed: 0,GeoName,Description,1997,1998,1999,2000,2001,2002,2003,2004,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,United States *,All industry total,8577552.0,9062817.0,9630663.0,10252347.0,10581822.0,10936418.0,11458246.0,12213730.0,...,15542582.0,16197007.0,16784851.0,17527258.0,18238301.0,18745075.0,19542980.0,20611861.0,21433226.0,20936558.0
1,United States *,Private industries,7431992.0,7871500.0,8378315.0,8929320.0,9188887.0,9462020.0,9905899.0,10582459.0,...,13405520.0,14037519.0,14572341.0,15255889.0,15898859.0,16360179.0,17094245.0,18062184.0,18793750.0,18290860.0
2,United States *,"Agriculture, forestry, fishing and hunting",108637.0,99756.0,92590.0,98312.0,99836.0,95629.0,113953.0,142945.0,...,180945.0,179573.0,215601.0,201003.0,182283.0,166571.0,176625.0,178569.0,175373.0,175802.0
3,United States *,Farms,88136.0,79030.0,70934.0,76043.0,78093.0,74033.0,91105.0,119356.0,...,152249.0,148939.0,184621.0,168147.0,147384.0,130639.0,140053.0,140271.0,136080.0,(NA)
4,United States *,"Forestry, fishing, and related activities",20501.0,20726.0,21656.0,22269.0,21743.0,21596.0,22848.0,23589.0,...,28696.0,30634.0,30980.0,32856.0,34899.0,35932.0,36571.0,38298.0,39293.0,(NA)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5523,Far West,Private services-providing industries 3/,985596.4,1064211.3,1160608.6,1244783.3,1289426.5,1358322.0,1432417.9,1523846.3,...,1976062.0,2075109.3,2181136.7,2313170.7,2486666.3,2618453.8,2789660.2,2954243.6,3138895.7,3104513.5
5524,,,,,,,,,,,...,,,,,,,,,,
5525,,,,,,,,,,,...,,,,,,,,,,
5526,,,,,,,,,,,...,,,,,,,,,,


#### Renaming columns

In [7]:
US_GDP = US_GDP.rename(columns={'Description': 'Industry'})
US_GDP

Unnamed: 0,GeoName,Industry,1997,1998,1999,2000,2001,2002,2003,2004,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,United States *,All industry total,8577552.0,9062817.0,9630663.0,10252347.0,10581822.0,10936418.0,11458246.0,12213730.0,...,15542582.0,16197007.0,16784851.0,17527258.0,18238301.0,18745075.0,19542980.0,20611861.0,21433226.0,20936558.0
1,United States *,Private industries,7431992.0,7871500.0,8378315.0,8929320.0,9188887.0,9462020.0,9905899.0,10582459.0,...,13405520.0,14037519.0,14572341.0,15255889.0,15898859.0,16360179.0,17094245.0,18062184.0,18793750.0,18290860.0
2,United States *,"Agriculture, forestry, fishing and hunting",108637.0,99756.0,92590.0,98312.0,99836.0,95629.0,113953.0,142945.0,...,180945.0,179573.0,215601.0,201003.0,182283.0,166571.0,176625.0,178569.0,175373.0,175802.0
3,United States *,Farms,88136.0,79030.0,70934.0,76043.0,78093.0,74033.0,91105.0,119356.0,...,152249.0,148939.0,184621.0,168147.0,147384.0,130639.0,140053.0,140271.0,136080.0,(NA)
4,United States *,"Forestry, fishing, and related activities",20501.0,20726.0,21656.0,22269.0,21743.0,21596.0,22848.0,23589.0,...,28696.0,30634.0,30980.0,32856.0,34899.0,35932.0,36571.0,38298.0,39293.0,(NA)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5523,Far West,Private services-providing industries 3/,985596.4,1064211.3,1160608.6,1244783.3,1289426.5,1358322.0,1432417.9,1523846.3,...,1976062.0,2075109.3,2181136.7,2313170.7,2486666.3,2618453.8,2789660.2,2954243.6,3138895.7,3104513.5
5524,,,,,,,,,,,...,,,,,,,,,,
5525,,,,,,,,,,,...,,,,,,,,,,
5526,,,,,,,,,,,...,,,,,,,,,,


#### Converting GDP values to floats

In [8]:
US_GDP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 26 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   GeoName   5524 non-null   object 
 1   Industry  5524 non-null   object 
 2   1997      5524 non-null   object 
 3   1998      5524 non-null   object 
 4   1999      5524 non-null   object 
 5   2000      5524 non-null   object 
 6   2001      5524 non-null   object 
 7   2002      5524 non-null   object 
 8   2003      5524 non-null   object 
 9   2004      5524 non-null   object 
 10  2005      5524 non-null   object 
 11  2006      5524 non-null   object 
 12  2007      5524 non-null   object 
 13  2008      5524 non-null   object 
 14  2009      5524 non-null   object 
 15  2010      5524 non-null   object 
 16  2011      5524 non-null   object 
 17  2012      5524 non-null   object 
 18  2013      5524 non-null   object 
 19  2014      5524 non-null   object 
 20  2015      5524 non-null   obje

In [9]:
# The year columns were read as strings because of entries such as '(NA)';
# errors='coerce' turns any non-numeric entry into NaN.
years = [str(y) for y in range(1997, 2021)]
US_GDP[years] = US_GDP[years].apply(pd.to_numeric, errors='coerce')
US_GDP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 26 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   GeoName   5524 non-null   object 
 1   Industry  5524 non-null   object 
 2   1997      5522 non-null   float64
 3   1998      5521 non-null   float64
 4   1999      5521 non-null   float64
 5   2000      5521 non-null   float64
 6   2001      5520 non-null   float64
 7   2002      5522 non-null   float64
 8   2003      5522 non-null   float64
 9   2004      5522 non-null   float64
 10  2005      5521 non-null   float64
 11  2006      5521 non-null   float64
 12  2007      5521 non-null   float64
 13  2008      5521 non-null   float64
 14  2009      5520 non-null   float64
 15  2010      5521 non-null   float64
 16  2011      5520 non-null   float64
 17  2012      5522 non-null   float64
 18  2013      5523 non-null   float64
 19  2014      5523 non-null   float64
 20  2015      5523 non-null   floa

#### Dropping null values

In [10]:
US_GDP.isna().any()

GeoName     True
Industry    True
1997        True
1998        True
1999        True
2000        True
2001        True
2002        True
2003        True
2004        True
2005        True
2006        True
2007        True
2008        True
2009        True
2010        True
2011        True
2012        True
2013        True
2014        True
2015        True
2016        True
2017        True
2018        True
2019        True
2020        True
dtype: bool

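Every column contains at least one null value. The nulls in the year columns include entries such as `(NA)` that were coerced to NaN in the previous step, and we keep those rows. We only want to drop the last few rows, whose GeoName is null because they are empty, and retain all other rows.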

In [11]:
US_GDP = US_GDP[US_GDP['GeoName'].notna()]
US_GDP

Unnamed: 0,GeoName,Industry,1997,1998,1999,2000,2001,2002,2003,2004,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,United States *,All industry total,8577552.0,9062817.0,9630663.0,10252347.0,10581822.0,10936418.0,11458246.0,12213730.0,...,15542582.0,16197007.0,16784851.0,17527258.0,18238301.0,18745075.0,19542980.0,20611861.0,21433226.0,20936558.0
1,United States *,Private industries,7431992.0,7871500.0,8378315.0,8929320.0,9188887.0,9462020.0,9905899.0,10582459.0,...,13405520.0,14037519.0,14572341.0,15255889.0,15898859.0,16360179.0,17094245.0,18062184.0,18793750.0,18290860.0
2,United States *,"Agriculture, forestry, fishing and hunting",108637.0,99756.0,92590.0,98312.0,99836.0,95629.0,113953.0,142945.0,...,180945.0,179573.0,215601.0,201003.0,182283.0,166571.0,176625.0,178569.0,175373.0,175802.0
3,United States *,Farms,88136.0,79030.0,70934.0,76043.0,78093.0,74033.0,91105.0,119356.0,...,152249.0,148939.0,184621.0,168147.0,147384.0,130639.0,140053.0,140271.0,136080.0,
4,United States *,"Forestry, fishing, and related activities",20501.0,20726.0,21656.0,22269.0,21743.0,21596.0,22848.0,23589.0,...,28696.0,30634.0,30980.0,32856.0,34899.0,35932.0,36571.0,38298.0,39293.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5519,Far West,Trade,194006.4,209026.9,218719.4,234837.1,238374.6,243523.8,255942.6,272344.2,...,322777.6,336736.0,355227.0,374147.0,400509.8,403823.2,423391.3,435665.6,462623.3,464777.6
5520,Far West,Transportation and utilities,68505.6,70529.0,73433.9,77359.2,73970.8,78855.7,81984.4,89126.6,...,118476.5,120508.7,125869.3,132602.4,142351.0,151508.6,161271.6,172189.2,189817.3,170679.6
5521,Far West,Manufacturing and information,296955.4,319341.4,361094.4,383468.2,357476.4,353194.7,373908.0,398769.7,...,515643.0,535530.9,586212.2,618307.5,671138.4,699937.9,753013.2,801925.5,839734.0,849532.8
5522,Far West,Private goods-producing industries 2/,308331.2,319019.0,342259.9,387417.5,355880.3,344168.7,374210.9,414142.9,...,493509.5,515674.2,542474.8,566819.1,584587.0,590830.7,629723.0,670682.2,684935.3,666697.4
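
Because the other datasets are keyed by state and year, the wide GDP table (one column per year) will probably need to be reshaped before it can be joined with them. The following is a minimal sketch of one way to do that with `melt`. It uses two rows and two year columns copied from the output above; keeping only `All industry total` and restricting to 2010 through 2019 are our assumptions about what the later analysis will need.

```python
import pandas as pd

# Two rows and two year columns copied from the cleaned GDP output above.
wide = pd.DataFrame({
    "GeoName": ["United States *", "United States *"],
    "Industry": ["All industry total", "Private industries"],
    "2018": [20611861.0, 18062184.0],
    "2019": [21433226.0, 18793750.0],
})

# Keep the overall GDP rows, then turn the year columns into rows.
totals = wide[wide["Industry"] == "All industry total"]
gdp_long = totals.melt(id_vars=["GeoName", "Industry"],
                       var_name="year", value_name="gdp")

# Year labels are strings in the wide table; convert them for numeric filtering.
gdp_long["year"] = gdp_long["year"].astype(int)
# Strip the trailing footnote marker so names can match the other datasets.
gdp_long["GeoName"] = gdp_long["GeoName"].str.rstrip(" *")
# Restrict to the study period.
gdp_long = gdp_long[gdp_long["year"].between(2010, 2019)]
print(gdp_long)
```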


### For Dataset 4...

The data are mostly clean: one column is the time and the other is the price index as a float. We first need to delete some extraneous information at the top of each table.


In addition, we have to pay attention to missing values. One way to deal with them is to drop rows with empty values using `dropna()`, so that the remaining data are reliable. We also need to restrict the data to 2010 through 2019, the period we are interested in, for example by slicing the rows with `iloc` or by filtering on the date column.

Lastly, we will rename the housing price column from its original abbreviated name to `housing_CPI` using `df.rename()`, so that the column is self-explanatory. Other than that, no further cleaning is needed: both columns are useful, every table is in the correct form (price as a float, date as a string), and there is no need to combine columns, since the time column and the price column are independent.

Again, we are not including the cleaning code for Dataset 4 here because our selection of tables (from https://fred.stlouisfed.org/searchresults?st=housing+price&pageID=2) will depend on the results of the analysis of the previous tables, so we cannot decide on it yet.
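
Although the final tables are not chosen yet, the cleaning steps described for Dataset 4 could be sketched as follows. Everything about the input here is an assumption made for illustration: the two junk lines at the top, the `DATE` and `XXSTHPI` column names, the use of `.` to mark a missing value, and all of the numbers.

```python
import io
import pandas as pd

# Made-up extract: two junk lines on top, then quarterly rows (placeholder values).
raw = """Title: hypothetical series
Source: placeholder
DATE,XXSTHPI
2009-10-01,99.0
2010-01-01,100.0
2010-04-01,.
2019-10-01,150.0
2020-01-01,152.0
"""

# Skip the junk lines and read "." as a missing value.
df = pd.read_csv(io.StringIO(raw), skiprows=2, na_values=".", parse_dates=["DATE"])

# Rename the abbreviated series column to the name used in this project.
df = df.rename(columns={"XXSTHPI": "housing_CPI"})

# Drop rows with a missing index value.
df = df.dropna(subset=["housing_CPI"])

# Keep only the study period, 2010 through 2019.
in_window = df["DATE"].dt.year.between(2010, 2019)
df = df.loc[in_window].reset_index(drop=True)
print(df)
```

Filtering on the parsed year rather than on row positions keeps the slice correct even if different state tables start in different quarters.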