DS312 Homework 2: Linear Regression with PCA
============================================
### Calvin Henggeler
### Fall 2023

Use the data county_info.csv under Files > data.

1. Compute the annual population growth rate (%) over the time period.
2. Compute the annual house value growth rate (%) over the time period.
3. Compute the annual birth rate over the measured time period.
4. Use these features: pop_foreign_born, adult obesity rate, pop/sq mi, pop pct urban, poverty rate, median income, cost of living, annual pop growth rate,
   annual house value growth rate. Drop any row that is missing a value for any of these features (but do not drop rows that have values for all of them).
   Adjust all dollar figures to 2022/03/01 using Python's CPI package.
5. Use linear regression to model the median house value in 2000 as a function of the features. Don't forget to split the data into training, validation, and test sets.
6. Report the coefficient for each feature. Are there any which make you suspect multicollinearity?
7. Investigate multicollinearity among the features by plotting the correlation heatmap.
8. Normalize the data, then use PCA to transform the features into a new linearly independent feature set.
9. Describe the principal (first) component as a linear combination of the given features.
10. For k = 1, ..., n, train a linear model on the first k components, and score it on the validation set.
11. Graph the score vs number of components.
12. Include on the same graph variance explained vs number of components.
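
Steps 8 to 12 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the assignment's solution: it runs on synthetic correlated features rather than county_info.csv, uses a 60/20/20 split, and relies on scikit-learn's `StandardScaler`, `PCA`, and `LinearRegression`. All variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine county features (assumption: NOT the real data).
rng = np.random.default_rng(0)
n_samples, n_features = 500, 9
latent = rng.normal(size=(n_samples, 3))
mixing = rng.normal(size=(3, n_features))
X = latent @ mixing + 0.3 * rng.normal(size=(n_samples, n_features))  # correlated columns
y = latent @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.normal(size=n_samples)

# 60/20/20 split into training, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 8: normalize, then PCA. Both are fit on the training set only.
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_val = pca.transform(scaler.transform(X_val))

# Step 9: the first component is a unit-length weight vector over the scaled features.
first_component_weights = pca.components_[0]

# Step 10: train on the first k components and score (R^2) on the validation set.
val_scores = []
for k in range(1, n_features + 1):
    model = LinearRegression().fit(Z_train[:, :k], y_train)
    val_scores.append(model.score(Z_val[:, :k], y_val))

# Steps 11 and 12: validation score and cumulative variance explained on one graph.
cum_var = np.cumsum(pca.explained_variance_ratio_)
ks = np.arange(1, n_features + 1)
fig, ax = plt.subplots()
ax.plot(ks, val_scores, marker="o", label="validation R^2")
ax.plot(ks, cum_var, marker="s", label="cumulative variance explained")
ax.set_xlabel("number of components")
ax.legend()
```

Fitting the scaler and PCA on the training split alone keeps validation and test information out of the transform; the test set is left untouched here so it can be used once for the final model.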

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

In [20]:
df = pd.read_csv('county_info.csv')

In [4]:
df.head()

Unnamed: 0,fips,state,county,pop_in_later_year,pop_ref_later_year,pop_f,pop_m,pop_in_2000,median_age,median_age_f,...,births_to_yr_int2,pop_foreign_born,land_area_km2,land_area_mi2,water_area_km2,water_area_mi2,total_area_km2,total_area_mi2,latitude,longitude
0,1001,AL,Autauga,55308.0,2017,28306.0,27002.0,43671.0,38.0,39.2,...,2006.0,1170.0,1539.582,594.436,25.776,9.952,1565.358,604.388,32.536382,-86.64449
1,1003,AL,Baldwin,212628.0,2017,107930.0,104698.0,140415.0,42.6,44.3,...,2006.0,10881.0,4117.522,1589.784,1133.19,437.527,5250.712,2027.311,30.659218,-87.746067
2,1005,AL,Barbour,26330.0,2017,12301.0,14029.0,29038.0,39.9,43.3,...,2006.0,701.0,2291.819,884.876,50.865,19.639,2342.684,904.515,31.87067,-85.405456
3,1007,AL,Bibb,22691.0,2017,10393.0,12298.0,20826.0,40.0,43.6,...,2006.0,232.0,1612.481,622.582,9.289,3.587,1621.77,626.169,33.015893,-87.127148
4,1009,AL,Blount,57952.0,2017,29352.0,28600.0,51024.0,41.1,42.6,...,2006.0,2638.0,1669.962,644.776,15.157,5.852,1685.119,650.628,33.977448,-86.567246


In [5]:
df.describe()

Unnamed: 0,fips,pop_in_later_year,pop_ref_later_year,pop_f,pop_m,pop_in_2000,median_age,median_age_f,median_age_m,median_house_income_2017,...,births_to_yr_int2,pop_foreign_born,land_area_km2,land_area_mi2,water_area_km2,water_area_mi2,total_area_km2,total_area_mi2,latitude,longitude
count,3144.0,3143.0,3144.0,3143.0,3143.0,3140.0,3144.0,3138.0,3138.0,3144.0,...,3142.0,3143.0,3144.0,3144.0,3142.0,3142.0,3144.0,3144.0,3144.0,3144.0
mean,30426.81584,104632.8,2016.538486,53110.72,51522.09,90289.27,41.477195,42.799968,40.241714,51610.303753,...,2006.0,14359.47,2909.230355,1123.260163,218.300708,84.286379,3127.392193,1207.492925,38.446272,-92.255665
std,15162.977553,333950.3,0.657162,169927.8,164061.7,293415.2,5.389288,5.474568,5.463137,13653.386549,...,0.0,93466.99,9352.259303,3610.927669,1217.424055,470.050078,9912.130123,3827.09501,5.292499,12.937532
min,1001.0,79.0,2005.0,38.0,41.0,67.0,21.7,22.6,21.3,20025.0,...,2006.0,0.0,5.177,1.999,0.003,0.001,5.177,1.999,19.597764,-178.338813
25%,18182.5,11191.0,2016.0,5555.0,5646.0,11274.0,38.2,39.5,36.9,42632.75,...,2006.0,192.0,1115.376,430.64925,7.156,2.76325,1154.60775,445.79675,34.699969,-98.217051
50%,29182.0,26262.0,2017.0,13107.0,13136.0,24752.0,41.5,43.0,40.1,49810.0,...,2006.0,668.0,1594.3895,615.5975,19.373,7.48,1686.8765,651.307,38.363697,-90.360508
75%,45087.5,67870.5,2017.0,34141.5,34125.0,62138.25,44.5,46.0,43.3,57890.25,...,2006.0,2961.5,2392.93225,923.916,60.647,23.416,2552.592,985.56175,41.808187,-83.417609
max,56045.0,10163510.0,2017.0,5153936.0,5009571.0,9519338.0,67.6,67.8,67.2,135842.0,...,2006.0,3482367.0,376855.656,145504.789,25190.644,9726.162,382812.22,147804.631,69.449343,-67.609354


In [21]:
df.shape

(3144, 50)

In [9]:
# 1) Population growth rate percentage
# Compound Annual Growth Rate = ( (EV/BV)^(1/n) - 1 ) * 100 where
# BV: Beginning value
# EV: Ending value
# n : Number of years
# https://www.investopedia.com/terms/c/cagr.asp#:~:text=Divide%20the%20value%20of%20an,the%20answer%20into%20a%20percentage.
# NOTE: the line below computes the total percent change, (EV - BV) / BV * 100, over the whole period.
# It does not yet apply the 1/n exponent from the CAGR formula, so the values are not annualized.
df['annual_pop_growth_rate'] = ((df['pop_in_later_year'] - df['pop_in_2000']) / df['pop_in_2000']) * 100
df['annual_pop_growth_rate']

0       26.646974
1       51.428266
2       -9.325711
3        8.955152
4       13.577924
          ...    
3139    18.379284
3140    23.160375
3141     5.151454
3142     0.096513
3143     7.299819
Name: growth_rate_perc, Length: 3144, dtype: float64
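
The compound annual growth rate formula quoted in the cell above can be applied roughly as follows. This is a sketch, assuming the starting year is 2000 (as the column name `pop_in_2000` suggests) and that `pop_ref_later_year` holds the ending year; the helper name `annual_growth_rate_pct` is hypothetical. The two toy rows reuse the Autauga and Barbour figures from the `df.head()` output.

```python
import pandas as pd

def annual_growth_rate_pct(begin, end, n_years):
    """Compound annual growth rate in percent: ((end / begin) ** (1 / n_years) - 1) * 100."""
    return ((end / begin) ** (1.0 / n_years) - 1.0) * 100.0

# Two rows taken from the df.head() output (Autauga and Barbour counties).
toy = pd.DataFrame({
    "pop_in_2000": [43671.0, 29038.0],
    "pop_in_later_year": [55308.0, 26330.0],
    "pop_ref_later_year": [2017, 2017],
})

# Assumption: the period starts in 2000 and ends in pop_ref_later_year.
n_years = toy["pop_ref_later_year"] - 2000
toy["annual_pop_growth_rate"] = annual_growth_rate_pct(
    toy["pop_in_2000"], toy["pop_in_later_year"], n_years
)
```

Because `pop_ref_later_year` varies by county (its minimum in `df.describe()` is 2005), computing `n_years` per row matters; a fixed 17-year period would misstate the rate for those counties.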

In [17]:
# 2) House value growth rate percentage
# NOTE: this is the total percent change from 2000 to 2017, not an annualized rate.
df['annual_house_value_growth_rate'] = ((df['median_house_value_2017'] - df['median_house_value_2000']) / df['median_house_value_2000']) * 100
df['annual_house_value_growth_rate'].head()

0    85.404848
1    89.743590
2    66.991259
3    96.597561
4    69.975293
Name: med_house_value_growth_rate_perc, dtype: float64

In [18]:
# 3) Annual birth rate over measured period
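
One possible way to fill in this step is a time-weighted average of the two birth-rate intervals. The sketch below rests on assumptions that the source does not confirm: that `birth_per_1000_int1` and `birth_per_1000_int2` are already annual rates per 1,000 residents, and that each interval runs from its `from` year to its `to` year inclusive. The toy values are invented for illustration.

```python
import pandas as pd

# Hypothetical values; only the column names come from county_info.csv.
toy = pd.DataFrame({
    "birth_per_1000_int1": [14.0, 12.0],
    "births_from_yr_int1": [1990, 1990],
    "births_to_yr_int1": [1999, 1999],
    "birth_per_1000_int2": [13.0, 11.0],
    "births_from_yr_int2": [2000, 2000],
    "births_to_yr_int2": [2006, 2006],
})

# Assumption: interval endpoints are inclusive, so 1990..1999 spans 10 years.
years1 = toy["births_to_yr_int1"] - toy["births_from_yr_int1"] + 1
years2 = toy["births_to_yr_int2"] - toy["births_from_yr_int2"] + 1

# Weight each interval's rate by its length in years.
toy["annual_birth_rate"] = (
    toy["birth_per_1000_int1"] * years1 + toy["birth_per_1000_int2"] * years2
) / (years1 + years2)
```

If the per-1,000 figures turn out to be totals over each interval rather than annual rates, the numerator should be the plain sum of the two figures instead.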

In [None]:
# 4) Feature columns (names match df.columns plus the two growth-rate columns created above)
features = ['pop_foreign_born', 'adult_obes_rate', 'pop_per_sq_mi', 'pop_percent_urban', 'poverty_pct', 'median_house_income_2017',
            'cost_of_living_usd', 'annual_pop_growth_rate', 'annual_house_value_growth_rate']
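
The row-dropping rule from task 4 maps onto `DataFrame.dropna` with its `subset` argument, which only considers the listed columns when deciding what to drop. A small sketch on a toy frame (the rows are invented; the column names are borrowed from `df.columns`):

```python
import numpy as np
import pandas as pd

# Toy frame: "avg_farm_size" is NOT one of the features, so its NaN must not cause a drop.
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "poverty_pct": [12.0, np.nan, 15.5],
    "adult_obes_rate": [30.1, 28.4, 33.0],
    "avg_farm_size": [np.nan, 200.0, 150.0],
})
toy_features = ["poverty_pct", "adult_obes_rate"]

# Drop rows missing any feature value; NaNs in other columns are ignored.
clean = toy.dropna(subset=toy_features)
```

Here county B is dropped for its missing `poverty_pct`, while county A survives despite the missing `avg_farm_size`. In practice the target column would probably belong in `subset` as well, since a row without a target cannot be used for training.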

In [19]:
df.columns

Index(['fips', 'state', 'county', 'pop_in_later_year', 'pop_ref_later_year',
       'pop_f', 'pop_m', 'pop_in_2000', 'median_age', 'median_age_f',
       'median_age_m', 'median_house_income_2017',
       'median_house_income_ref_val', 'median_house_income_ref_yr',
       'median_house_value_2017', 'median_house_value_2000',
       'avg_household_size', 'mar_coup_w_children', 'cost_of_living_usd',
       'cost_of_living_yr', 'poverty_pct', 'adult_obes_rate',
       'presch_obes_rate', 'commute_minutes', 'pop_per_sq_mi',
       'pop_percent_urban', 'unemploy_rate', 'unemploy_date', 'rent_1br_usd',
       'rent_2br_usd', 'rent_3br_usd', 'avg_farm_size', 'avg_farm_sales_usd',
       'pct_farms_fam_op', 'avg_farm_mach_val_usd', 'birth_per_1000_int1',
       'births_from_yr_int1', 'births_from_yr_int2', 'birth_per_1000_int2',
       'births_to_yr_int1', 'births_to_yr_int2', 'pop_foreign_born',
       'land_area_km2', 'land_area_mi2', 'water_area_km2', 'water_area_mi2',
       'total_area_km