# Web Scraping MIT site: Alameda County Cost of Living

<a id='guide'></a>
# Guide
### 1. Import libraries
### 2. [Scrape data from MIT cost of living site](#scrape)
### 3. [Create cost of living table](#livingwage)
 - a. [Import MIT living wage data by year (html files)](#importmit)
 - b. [Clean 2019 table (different format than following years)](#clean2019)
 - c. Concatenate 2020 - current year tables and then add 2019 table
 - d. [Extract number of adults and children](#extractnumerical)
 - e. [Calculate costs by person and category](#personcosts)
 - f. [Create circumstance cost and disposable unit parameter levels](#parameterlevels)
 - g. [Create the cost dataframe](#costdf)
 - h. Export cost table file


<br>**Columns/Questions needed on application**
<br>Name
<br>Parent/Contact Name
<br>Did participant apply?
<br>Household income before taxes (include income from all contributing household members)
<br>Number of children supported on this income
<br>Number of adults supported on this income
<br>
<br>If you would like to disclose any other factors that we should consider, please check all that apply and add info as needed. 
<br>**Supporting additional family should be reported above as additional child or adult household members**
<br>
<br>Medical: medical expenses and cost of caring for a sick family member
<br>Divorce: divorce or separation in past year
<br>Education: college/education expenses and tuition
<br>Employment: unemployed or recent loss of income
<br>Housing: recently moved or had significant change in housing
<br>Disability: you or a household member has a disability
<br>Immigration: costs associated with immigration, visas, necessary foreign travel
<br>Other: _______________________

**Modeling Resources**
<br>http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/
<br>https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
<br>
<br> **MIT California living wage calculations:**
<br>https://livingwage.mit.edu/states/06
<br>
<br>**base_income** = basic needs income for one adult with no child in Alameda County
<br>**add_adult** = cost per each additional adult
<br>**add_child** = cost per each additional child
<br>**single_cost** = additional cost per child for single parent
<br>**circumstance_cost** = cost of each additional circumstance (medical, tuition, etc.)
<br>**ex_inc** = disposable income above basic needs. When a family has more than 10k in disposable income per adult AND more than 20k in disposable income per child, the scholarship award falls to zero.
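
One way to read these definitions is as an additive basic-needs estimate. The sketch below is an interpretation, not a confirmed formula from this project: the helper name `basic_needs_income`, its arguments, and the choice to apply `single_cost` per child only when there is one adult are all assumptions. The cost values are the 2023 figures computed later in this notebook, rounded to cents, and the household income of 150000 is made up for illustration.

```python
def basic_needs_income(n_adults, n_children, n_circumstances, costs):
    """Estimate required pre-tax income for a household (hypothetical helper)."""
    total = costs["base_income"]                        # one adult, no children
    total += costs["add_adult"] * max(n_adults - 1, 0)  # each additional adult
    total += costs["add_child"] * n_children            # each child
    if n_adults == 1:
        # assumption: the single-parent surcharge applies per child
        total += costs["single_cost"] * n_children
    total += costs["circumstance_cost"] * n_circumstances
    return total

# 2023 values from the cost table built later in this notebook (rounded)
costs_2023 = {
    "base_income": 46488,
    "add_adult": 4494.67,
    "add_child": 35656.0,
    "single_cost": 8753.33,
    "circumstance_cost": 2000,  # the 'low' circumstance level
}

needs = basic_needs_income(2, 2, 1, costs_2023)
disposable = 150000 - needs  # hypothetical household income minus basic needs
print(round(needs, 2), round(disposable, 2))
```

Under this reading, disposable income is simply household income minus the estimated basic-needs income; how it maps to an award is not shown here.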

# 1. Import necessary libraries

In [1]:
# Import the libraries
import requests
import pandas as pd
import numpy as np
import re

# 2. Scrape data from MIT cost of living site 

## scrape past years using the Internet Archive Wayback Machine and store the files

In [2]:
# 2019 url for mit living wage, Alameda County, California
#URL19 = "http://web.archive.org/web/20190605210135/https://livingwage.mit.edu/counties/06001"
#page = requests.get(URL19)
#print(page.text)

#Read HTML tables into a list of DataFrame objects
#df_list19 = pd.read_html(URL19)
#req_inc_19 = df_list19[1]
#req_inc_19['year'] = '2019'
#print(req_inc_19['year'][0])

#export html of web page for 2019 to local file
#page = requests.get(URL19)
#with open('/LOCAL PATH/MIT_alameda_2019.html', 'wb+') as f:
#    f.write(page.content)


# 2020 url for mit living wage, Alameda County, California
#URL20 = "http://web.archive.org/web/20200727024043/https://livingwage.mit.edu/counties/06001"
#page = requests.get(URL20)
#print(page.text)

#Read HTML tables into a list of DataFrame objects
#df_list20 = pd.read_html(URL20)
#req_inc_20 = df_list20[1]
#req_inc_20['year'] = '2020'
#print(req_inc_20['year'][0])

#export html of web page for 2020 to local file
#page = requests.get(URL20)
#with open('/LOCAL PATH/MIT_alameda_2020.html', 'wb+') as f:
#    f.write(page.content)


# 2021 url for mit living wage, Alameda County, California
#URL21 = "http://web.archive.org/web/20211214073524/https://livingwage.mit.edu/counties/06001"
#page = requests.get(URL21)
#print(page.text)

#Read HTML tables into a list of DataFrame objects
#df_list21 = pd.read_html(URL21)
#req_inc_21 = df_list21[1]
#req_inc_21['year'] = '2021'
#print(req_inc_21['year'][0])

#export html of web page for 2021 to local file
#page = requests.get(URL21)
#with open('/LOCAL PATH/MIT_alameda_2021.html', 'wb+') as f:
#    f.write(page.content)

#URL22 = "https://livingwage.mit.edu/counties/06001"
#page = requests.get(URL22)

#print(page.text)

#Read HTML tables into a list of DataFrame objects
#df_list22 = pd.read_html(URL22)
#req_inc_22 = df_list22[1]
#req_inc_22['year'] = '2022'
#print(req_inc_22['year'][0])

#export html of web page for 2022 to local file
#page = requests.get(URL22)
#with open('/LOCAL PATH/MIT_alameda_2022.html', 'wb+') as f:
#    f.write(page.content)

#url for mit living wage, Alameda County, California
#URL23 = "https://livingwage.mit.edu/counties/06001"
#page = requests.get(URL23)

#print(page.text)

#Read HTML tables into a list of DataFrame objects
#df_list23 = pd.read_html(URL23)
#req_inc_23 = df_list23[1]
#req_inc_23['year'] = '2023'
#print(req_inc_23['year'][0])

#export html of web page for 2023 to local file
#page = requests.get(URL23)
#with open('/Users/sandidge/Desktop/Python_Projects/BGS/MIT_livingwage_html/MIT_alameda_2023.html', 'wb+') as f:
#    f.write(page.content)
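
The commented-out blocks above repeat the same fetch-and-save steps for each year. A minimal sketch of how that could be folded into one helper follows. The names `SNAPSHOTS`, `wayback_url`, and `save_snapshot` are hypothetical, the capture timestamps are the ones in the archived URLs above, and the standard-library `urllib` is used here in place of `requests` only to keep the sketch dependency-free.

```python
from urllib.request import urlopen

MIT_URL = "https://livingwage.mit.edu/counties/06001"

# Wayback Machine capture timestamps taken from the archived URLs above
SNAPSHOTS = {
    "2019": "20190605210135",
    "2020": "20200727024043",
    "2021": "20211214073524",
}

def wayback_url(timestamp, target=MIT_URL):
    """Build the archived-page URL for a capture timestamp."""
    return f"http://web.archive.org/web/{timestamp}/{target}"

def save_snapshot(year, out_dir="."):
    """Download one archived page and write it to MIT_alameda_<year>.html."""
    out_path = f"{out_dir}/MIT_alameda_{year}.html"
    with urlopen(wayback_url(SNAPSHOTS[year])) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path

# Downloading needs network access, for example:
# for year in SNAPSHOTS:
#     save_snapshot(year, out_dir="MIT_livingwage_html")
```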



## scrape current year living wage data from MIT website

In [3]:
#url for mit living wage, Alameda County, California
URL23 = "https://livingwage.mit.edu/counties/06001"
page = requests.get(URL23)

print(page.text)

#Read HTML tables into a list of DataFrame objects
df_list23 = pd.read_html(URL23)
req_inc_23 = df_list23[1]
req_inc_23['year'] = '2023'
print(req_inc_23['year'][0])

#export html of web page for 2023 to local file (reuses the response fetched above)
with open('/Users/sandidge/Desktop/Python_Projects/BGS/MIT_livingwage_html/MIT_alameda_2023.html', 'wb') as f:
    f.write(page.content)


<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1"/>
    <title>
      Living Wage Calculator
       - Living Wage Calculation for Alameda County, California
    </title>
    <link href='https://fonts.googleapis.com/css?family=Merriweather:400,700,900' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" media="all" href="/assets/application-3a1563aa8478f951cc94e2ed65955ba926f11b81e40ff0eaffe25a1c6f852f97.css" />
    <script src="/assets/application-39fdf2eb0b29d1b212982a9694cf845fc3387c753591a313f1c56434319c3d2f.js"></script>
    <meta name="csrf-param" content="authenticity_token" />
<meta name="csrf-token" content="udGp9UltjPPrChtxPE2dOru5M07sMd+P0Ud++XH1OmQlVCuMM9u0d8c30vQxAYkUNnOiJp4DQyiR6qB8pZpycQ==" />
    <link rel="shortcut icon" type="image/x-icon" href="/assets/favicon/favicon-cc2ffe9c2a813278798e8d13dca12d98a3916d26107ce5afb4caab6c60ce2405.ico" />
  </head>
  <body class="displaying_resul

2023


In [4]:
req_inc_23

Unnamed: 0_level_0,Unnamed: 0_level_0,1 ADULT,1 ADULT,1 ADULT,1 ADULT,2 ADULTS(1 WORKING),2 ADULTS(1 WORKING),2 ADULTS(1 WORKING),2 ADULTS(1 WORKING),2 ADULTS(BOTH WORKING),2 ADULTS(BOTH WORKING),2 ADULTS(BOTH WORKING),2 ADULTS(BOTH WORKING),year
Unnamed: 0_level_1,Unnamed: 0_level_1.1,0 Children,1 Child,2 Children,3 Children,0 Children,1 Child,2 Children,3 Children,0 Children,1 Child,2 Children,3 Children,Unnamed: 14_level_1
0,Food,"$4,686","$6,916","$10,392","$13,774","$8,591","$10,702","$13,802","$16,795","$8,591","$10,702","$13,802","$16,795",2023
1,Child Care,$0,"$15,894","$31,789","$47,683",$0,$0,$0,$0,$0,"$15,894","$31,789","$47,683",2023
2,Medical,"$3,136","$9,476","$9,486","$9,411","$7,018","$9,486","$9,411","$9,564","$7,018","$9,486","$9,411","$9,564",2023
3,Housing,"$18,947","$28,014","$28,014","$37,031","$22,840","$28,014","$28,014","$37,031","$22,840","$28,014","$28,014","$37,031",2023
4,Transportation,"$5,316","$9,561","$11,691","$14,058","$9,561","$11,691","$14,058","$15,073","$9,561","$11,691","$14,058","$15,073",2023
5,Civic,"$2,920","$5,801","$6,480","$8,835","$5,801","$6,480","$8,835","$7,025","$5,801","$6,480","$8,835","$7,025",2023
6,Other,"$4,596","$8,020","$9,463","$10,386","$8,020","$9,463","$10,386","$11,617","$8,020","$9,463","$10,386","$11,617",2023
7,Required annual income after taxes,"$39,733","$83,814","$107,446","$141,310","$61,962","$75,967","$84,637","$97,237","$61,962","$91,861","$116,426","$144,920",2023
8,Annual taxes,"$6,755","$16,335","$24,971","$38,406","$9,369","$12,269","$13,963","$16,783","$9,369","$16,229","$22,949","$33,380",2023
9,Required annual income before taxes,"$46,488","$100,148","$132,417","$179,716","$71,332","$88,236","$98,601","$114,020","$71,332","$108,090","$139,375","$178,300",2023


## import an html file stored locally


In [5]:
#import/read html file downloaded from web (read_html opens the path itself)
df_list = pd.read_html('/Users/sandidge/Desktop/Python_Projects/BGS/MIT_livingwage_html/MIT_alameda_2023.html')


In [6]:
df_list

[  Unnamed: 0_level_0    1 ADULT                                \
   Unnamed: 0_level_1 0 Children 1 Child 2 Children 3 Children   
 0        Living Wage     $22.35  $48.15     $63.66     $86.40   
 1       Poverty Wage      $6.53   $8.80     $11.07     $13.34   
 2       Minimum Wage     $15.50  $15.50     $15.50     $15.50   
 
   2 ADULTS(1 WORKING)                               2 ADULTS(BOTH WORKING)  \
            0 Children 1 Child 2 Children 3 Children             0 Children   
 0              $34.29  $42.42     $47.40     $54.82                 $17.15   
 1               $8.80  $11.07     $13.34     $15.61                  $4.40   
 2              $15.50  $15.50     $15.50     $15.50                 $15.50   
 
                                  
   1 Child 2 Children 3 Children  
 0  $25.98     $33.50     $42.86  
 1   $5.54      $6.67      $7.81  
 2  $15.50     $15.50     $15.50  ,
                     Unnamed: 0_level_0    1 ADULT                       \
                    

# 3. Create cost of living from stored files
**Living wage files by year are stored at:**
https://github.com/Floydworks/Scholarship_Allocation_Tool/tree/main/MIT_cost_of_living_by_year

<a id='importmit'></a>
______________________
## a. Import cost of living HTML, 2019 to present, from GitHub

In [7]:
# import from GitHub:Floydworks
base_url = ('https://raw.githubusercontent.com/Floydworks/Scholarship_Allocation_Tool/main/'
            'MIT_cost_of_living_by_year/MIT_alameda_{}.html')
file_years = ['2019', '2020', '2021', '2022', '2023']

req_inc = {}
for file_year in file_years:
    #Read HTML tables into a list of DataFrame objects; table 1 holds the annual expenses
    year_tables = pd.read_html(base_url.format(file_year))
    req_inc[file_year] = year_tables[1]
    req_inc[file_year]['year'] = file_year
    print(req_inc[file_year]['year'][0])

#keep the per-year names used in the cells below
req_inc_19, req_inc_20, req_inc_21, req_inc_22, req_inc_23 = (req_inc[y] for y in file_years)


2019
2020
2021
2022
2023


In [8]:
# format for year 2019 is different with no multi level indexing in column names
#req_inc_22
#req_inc_19

<a id='clean2019'></a>
## b. Clean 2019 table (different format than following years)

In [9]:
#get just the table with 'required annual income before taxes'
full_income_table =  req_inc_19         #  df_list[1]
#transpose table 
table_19 = full_income_table.T
#make first row into column headers
table_19.columns = table_19.iloc[0]
#drop first row of text that is now column headers
table_19 = table_19.iloc[1: , :]
#reset the index, move fam size to column
table_19 = table_19.reset_index()

table_19

Annual Expenses,index,Food,Child Care,Medical,Housing,Transportation,Other,Required annual income after taxes,Annual taxes,Required annual income before taxes
0,1 Adult,"$3,573",$0,"$2,121","$18,480","$4,206","$2,976","$31,356","$4,975","$36,331"
1,1 Adult 1 Child,"$5,267","$8,311","$6,965","$27,948","$7,664","$4,951","$61,105","$10,965","$72,070"
2,1 Adult 2 Children,"$7,929","$13,997","$6,622","$27,948","$9,011","$5,375","$70,882","$13,238","$84,121"
3,1 Adult 3 Children,"$10,517","$19,683","$6,704","$38,628","$10,425","$6,256","$92,214","$17,911","$110,124"
4,2 Adults (1 Working),"$6,551",$0,"$5,271","$22,260","$7,664","$4,951","$46,696","$7,893","$54,589"
5,2 Adults (1 Working) 1 Child,"$8,154",$0,"$6,622","$27,948","$9,011","$5,375","$57,110","$10,060","$67,170"
6,2 Adults (1 Working) 2 Children,"$10,529",$0,"$6,704","$27,948","$10,425","$6,256","$61,863","$11,141","$73,004"
7,2 Adults (1 Working) 3 Children,"$12,820",$0,"$6,423","$38,628","$10,307","$6,121","$74,300","$13,746","$88,046"
8,2 Adults (1 Working Part Time) 1 Child*,,,,,,,,,"$75,760"
9,2 Adults,"$6,551",$0,"$5,271","$22,260","$7,664","$4,951","$46,696","$7,893","$54,589"


In [10]:
print(table_19['index'][1])

#Remove unicode '\xa0' from pandas column if present
table_19['index'] = table_19['index'].str.split().str.join(' ')

print(table_19['index'][1])

1 Adult 1 Child
1 Adult 1 Child
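
The two printed labels look identical because a non-breaking space renders like an ordinary space, and in this run the label may not have contained one at all. A small plain-Python sketch of what the split-and-join step does when `'\xa0'` is present, using a made-up label:

```python
# hypothetical scraped label containing non-breaking spaces
raw_label = "1\xa0Adult 1\xa0Child"

# str.split() with no arguments splits on any Unicode whitespace, including '\xa0'
clean_label = " ".join(raw_label.split())

print(repr(raw_label))
print(repr(clean_label))
```

The pandas call `str.split().str.join(' ')` applies the same idea to every value in the column.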


In [11]:
#rename index column to hold count of adults in household
table_19 = table_19.rename(columns={"index": "adults"}, errors="raise")
#make duplicate adults column that will become count for children in household
table_19['children'] = table_19.loc[:, 'adults']
#add year column
table_19['year'] = '2019'
req_inc_19 = table_19[['adults', 'children','Required annual income before taxes', 'year']]
#drop row that has year in it
req_inc_19 = req_inc_19[req_inc_19['adults'].str.contains('year')==False]
#drop row that doesn't match other year formats and is not needed
req_inc_19 = req_inc_19.drop(labels=8, axis=0)


In [12]:
req_inc_19

Annual Expenses,adults,children,Required annual income before taxes,year
0,1 Adult,1 Adult,"$36,331",2019
1,1 Adult 1 Child,1 Adult 1 Child,"$72,070",2019
2,1 Adult 2 Children,1 Adult 2 Children,"$84,121",2019
3,1 Adult 3 Children,1 Adult 3 Children,"$110,124",2019
4,2 Adults (1 Working),2 Adults (1 Working),"$54,589",2019
5,2 Adults (1 Working) 1 Child,2 Adults (1 Working) 1 Child,"$67,170",2019
6,2 Adults (1 Working) 2 Children,2 Adults (1 Working) 2 Children,"$73,004",2019
7,2 Adults (1 Working) 3 Children,2 Adults (1 Working) 3 Children,"$88,046",2019
9,2 Adults,2 Adults,"$54,589",2019
10,2 Adults 1 Child,2 Adults 1 Child,"$77,389",2019


## c. Concatenate 2020 - current year tables and then add 2019 table

In [13]:
tables = [req_inc_20, req_inc_21, req_inc_22, req_inc_23] #req_inc_19 has a different format and is added later
yearly_tables = []

for t in tables:
    #get the year
    year = t['year'][0]
    t = t.drop(['year'],axis=1)
    #transpose table so each household type becomes a row
    df_table = t.T
    #make first row into column headers
    df_table.columns = df_table.iloc[0]
    #drop first row of text that is now column headers (copy so the year column can be added safely)
    df_table = df_table.iloc[1: , :].copy()
    df_table['year'] = year
    yearly_tables.append(df_table)

#concatenate once instead of growing a DataFrame inside the loop
tables_df = pd.concat(yearly_tables)

tables_df.head()


Unnamed: 0,"(Unnamed: 0_level_0, Unnamed: 0_level_1)",Food,Child Care,Medical,Housing,Transportation,Other,Required annual income after taxes,Annual taxes,Required annual income before taxes,year,Civic
1 ADULT,0 Children,"$3,592",$0,"$2,211","$16,908","$4,094","$2,734","$29,540","$4,748","$34,288",2020,
1 ADULT,1 Child,"$5,306","$8,448","$7,364","$25,512","$7,982","$4,558","$59,170","$10,856","$70,026",2020,
1 ADULT,2 Children,"$7,976","$14,228","$7,076","$25,512","$10,126","$4,732","$69,650","$13,293","$82,942",2020,
1 ADULT,3 Children,"$10,578","$20,007","$7,196","$35,100","$11,032","$5,953","$89,865","$17,844","$107,709",2020,
2 ADULTS(1 WORKING),0 Children,"$6,586",$0,"$5,455","$20,472","$7,982","$4,558","$45,053","$7,733","$52,786",2020,


In [14]:
# Get only the columns needed
df_income_table = pd.DataFrame(tables_df[['Required annual income before taxes', 'year']])
# move index values to columns
df_income_table = df_income_table.reset_index()
df_income_table.columns = ['adults', 'children', 'Required annual income before taxes', 'year']

print(df_income_table.columns)
#df_income_table.head()


Index(['adults', 'children', 'Required annual income before taxes', 'year'], dtype='object')


In [15]:
#concatenate 2019
df_income_table = pd.concat([df_income_table,req_inc_19]).reset_index(drop=True)


In [16]:
df_income_table

Unnamed: 0,adults,children,Required annual income before taxes,year
0,1 ADULT,0 Children,"$34,288",2020
1,1 ADULT,1 Child,"$70,026",2020
2,1 ADULT,2 Children,"$82,942",2020
3,1 ADULT,3 Children,"$107,709",2020
4,2 ADULTS(1 WORKING),0 Children,"$52,786",2020
5,2 ADULTS(1 WORKING),1 Child,"$65,688",2020
6,2 ADULTS(1 WORKING),2 Children,"$71,396",2020
7,2 ADULTS(1 WORKING),3 Children,"$86,442",2020
8,2 ADULTS(BOTH WORKING),0 Children,"$52,786",2020
9,2 ADULTS(BOTH WORKING),1 Child,"$76,105",2020


<a id='extractnumerical'></a>
## d. Extract number of adults and children

In [17]:
# define a function to clean up characters, case, and spacing
def text_cleaning(txt):

  # Replace non-ascii characters (e.g. '\xa0') with '?', which is stripped below
  txt = txt.encode('ascii', 'replace').decode('ascii')

  # Convert text to lower case
  txt = txt.lower()

  # Replace special characters with spaces ('-' and '[' are escaped inside the character class)
  txt = re.sub(r'[!"#$%&()*+,\-./:;?@\[\]^_`{|}~]', ' ', txt)

  # Remove all whitespace so each count sits directly before its word, e.g. '2adults1working'
  return re.sub(r'\s+', '', txt)

In [18]:
#make empty lists for storing numbers of family members
n_adults = []
n_children = []

# extract number of adults and number of children
for r in df_income_table['adults']:
    text = text_cleaning(str(r))
    #text = re.sub('\s+','',str(r).lower())
    #print(text)
    find_a = text.find("a")
    a = text[find_a - 1]
    #n_adults.append(int(a))
    n_adults.append(a)

for r in df_income_table['children']:
    text = text_cleaning(str(r))
    find_c = text.find("c")
    c = text[find_c - 1]
    #print(a, c)
    #n_children.append(int(c))
    n_children.append(c)

#add the extracted counts and the numeric income as new columns
df_income_table['num_adults'] = n_adults
df_income_table['num_children'] = n_children
#strip '$' and ',' by joining the digit groups, e.g. '$34,288' -> 34288
df_income_table['req_income_pretax'] = df_income_table['Required annual income before taxes'].astype('str')\
                                       .str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)

#get rid of the letters in 2019 in place of 0 children
#define a function that converts letters to '0' in the num_children column
def convert_numb(x, chil):
    if x.isnumeric() != True:
        chil.append('0')
        #print("Not Numeric")
    else:
        chil.append(x)
        #print("numeric")

#get num_children column data in list form
kids = list(df_income_table['num_children'])
#create empty list to store returned values
chld = []
#loop through the list and convert letters to '0'
for k in kids:
    convert_numb(k, chld)

#assign corrected values to the num_children column
df_income_table['num_children'] = chld


In [19]:
df_income_table

Unnamed: 0,adults,children,Required annual income before taxes,year,num_adults,num_children,req_income_pretax
0,1 ADULT,0 Children,"$34,288",2020,1,0,34288
1,1 ADULT,1 Child,"$70,026",2020,1,1,70026
2,1 ADULT,2 Children,"$82,942",2020,1,2,82942
3,1 ADULT,3 Children,"$107,709",2020,1,3,107709
4,2 ADULTS(1 WORKING),0 Children,"$52,786",2020,2,0,52786
5,2 ADULTS(1 WORKING),1 Child,"$65,688",2020,2,1,65688
6,2 ADULTS(1 WORKING),2 Children,"$71,396",2020,2,2,71396
7,2 ADULTS(1 WORKING),3 Children,"$86,442",2020,2,3,86442
8,2 ADULTS(BOTH WORKING),0 Children,"$52,786",2020,2,0,52786
9,2 ADULTS(BOTH WORKING),1 Child,"$76,105",2020,2,1,76105
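
The extraction above relies on the position of the first "a" or "c" in the cleaned label. As an alternative, here is a hedged sketch that uses regular expressions instead; `household_counts` and `parse_dollars` are hypothetical helpers, not functions from this notebook, and they assume labels shaped like the ones in the tables above.

```python
import re

def household_counts(label):
    """Return (adults, children) parsed from a household label; missing counts become 0."""
    adults = re.search(r"(\d+)\s*adults?", label, flags=re.IGNORECASE)
    children = re.search(r"(\d+)\s*child(?:ren)?", label, flags=re.IGNORECASE)
    return (int(adults.group(1)) if adults else 0,
            int(children.group(1)) if children else 0)

def parse_dollars(amount):
    """Convert a string such as '$34,288' to the integer 34288."""
    return int(re.sub(r"\D", "", amount))

# 2019-style label (adults and children in one string)
print(household_counts("2 Adults (1 Working) 1 Child"))
# 2020+ style (adults and children come from separate columns)
print(household_counts("2 ADULTS(BOTH WORKING)" + " " + "3 Children"))
print(parse_dollars("$107,709"))
```

Because a missing count falls back to 0, the separate letters-to-'0' correction step would not be needed with this approach.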


## Create table of family structures and required incomes

In [23]:
mit_income_table = df_income_table
#drop the '2 ADULTS(1 WORKING)' rows; regex=False matches the parentheses literally,
#case=False covers both the 2019 '(1 Working)' and the later '(1 WORKING)' spellings
mit_income_table = mit_income_table[~mit_income_table['adults'].str.contains('(1 working)', case=False, regex=False)]
#mit_income_table


In [24]:
mit_income_table = mit_income_table[['year','num_adults','num_children','req_income_pretax']].sort_values(by=['year', 'num_adults', 'num_children'])
#mit_income_table.head(60)


In [25]:
#export file as csv and manually check for accuracy in calculations
#mit_income_table.to_csv('/Users/sandidge/Desktop/Python_projects/BGS/exports_imports/MIT_income_table.csv')


<a id='personcosts'></a>
## e. Calculate costs by person and category

In [20]:
#get list of years in df
years = list(df_income_table['year'].unique())

#create empty lists for variable storage
yr = []
base = []
add_child = []
add_adult = []
single_cost = []

for y in years:
    #get df for particular year
    df_annual_income_table = df_income_table[df_income_table['year']==y].reset_index()
    
    #pull base income for one adult no kids for that year
    base_annual = df_annual_income_table['req_income_pretax'][0]
    
    # calculate the cost of an additional child with one adult in household
    one_child1_annual = df_annual_income_table['req_income_pretax'][1] - df_annual_income_table['req_income_pretax'][0]
    two_child1_annual = df_annual_income_table['req_income_pretax'][2] - df_annual_income_table['req_income_pretax'][1]
    three_child1_annual = df_annual_income_table['req_income_pretax'][3] - df_annual_income_table['req_income_pretax'][2]
    avg_child1_annual = (one_child1_annual + two_child1_annual + three_child1_annual)/3
    
    # calculate the cost of an additional child with two adults in household
    # uses values for both adults working which estimates required income to be higher
    one_child2_annual = df_annual_income_table['req_income_pretax'][9] - df_annual_income_table['req_income_pretax'][8]
    two_child2_annual = df_annual_income_table['req_income_pretax'][10] - df_annual_income_table['req_income_pretax'][9]
    three_child2_annual = df_annual_income_table['req_income_pretax'][11] - df_annual_income_table['req_income_pretax'][10]
    avg_child2_annual = (one_child2_annual+two_child2_annual+three_child2_annual)/3
    
    # calculate the additional cost per child for a single parent
    single_cost_annual = avg_child1_annual - avg_child2_annual
    
    # calculate cost of additional adult (both adults working)
    add_adult1_annual = df_annual_income_table['req_income_pretax'][9] - df_annual_income_table['req_income_pretax'][1]
    add_adult2_annual = df_annual_income_table['req_income_pretax'][10] - df_annual_income_table['req_income_pretax'][2]
    add_adult3_annual = df_annual_income_table['req_income_pretax'][11] - df_annual_income_table['req_income_pretax'][3]
    avg_adult_annual = (add_adult1_annual+add_adult2_annual+add_adult3_annual)/3
    
    # set values for additional household members
    add_adult_annual = avg_adult_annual
    add_child_annual = avg_child2_annual
    
    #print some things as a sanity check
    print(y, ': ', base_annual, add_adult_annual.round(2), add_child_annual.round(2),
          single_cost_annual.round(2), len(df_annual_income_table['req_income_pretax']))
    
    #do all the appendings
    yr.append(y)
    base.append(base_annual)
    add_adult.append(add_adult_annual)
    add_child.append(add_child_annual)
    single_cost.append(single_cost_annual)
    

2020 :  34288 5153.33 19438.0 5035.67 12
2021 :  45520 5407.0 29891.0 8265.67 12
2022 :  50463 7671.33 33949.67 7704.67 12
2023 :  46488 4494.67 35656.0 8753.33 12
2019 :  36331 4544.67 19238.67 5359.0 12
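
To make the arithmetic in the loop concrete, the 2023 line can be reproduced by hand from the "Required annual income before taxes" row of the 2023 table shown earlier. This is a standalone check with hypothetical variable names, not part of the pipeline.

```python
# Required annual income before taxes, 2023, by number of children (0 to 3)
one_adult = [46488, 100148, 132417, 179716]
two_adults_both_working = [71332, 108090, 139375, 178300]

def avg_step(values):
    """Average increase between consecutive entries."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps)

base_income = one_adult[0]
# cost of an additional child, taken from the two-adult household
add_child = avg_step(two_adults_both_working)
# extra cost per child for a single parent
single_cost = avg_step(one_adult) - add_child
# cost of a second adult, averaged over households with 1, 2, and 3 children
add_adult = sum(two - one for one, two in zip(one_adult[1:], two_adults_both_working[1:])) / 3

print(base_income, round(add_adult, 2), round(add_child, 2), round(single_cost, 2))
# → 46488 4494.67 35656.0 8753.33
```

These values agree with the 2023 line printed by the loop.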


<a id='parameterlevels'></a>
## f. Create circumstance cost and disposable unit parameter levels

In [26]:
# set the circumstance cost for 2019, 2020, 2021, 2022, 2023
circumstance_cost_dict_low = {'2019':1000, '2020':1000, '2021':2000, '2022':2000, '2023':2000}
circumstance_cost_dict_mid = {'2019':2000, '2020':2000, '2021':3000, '2022':3000, '2023':3000}    

# set value for disposable income unit for 2019, 2020, 2021, 2022, 2023
disposable_unit_dict_low = {'2019':3000, '2020':3000, '2021':5000, '2022':5000, '2023':5000}      
disposable_unit_dict_mid = {'2019':4000, '2020':4000, '2021':6500, '2022':6500, '2023':6500}
disposable_unit_dict_hi = {'2019':5000, '2020':5000, '2021':7000, '2022':7000, '2023':7000}


<a id='costdf'></a>
## g. Create the cost dataframe

In [27]:
#create the cost dataframe
cost_df = pd.DataFrame(columns = ['year'])
cost_df['year'] = yr
cost_df['base_income'] = base
cost_df['add_adult'] = add_adult
cost_df['add_child'] = add_child
cost_df['single_cost'] = single_cost
cost_df = cost_df.sort_values(by = 'year').reset_index(drop=True)

#add circumstances and disposable units to cost_df, the cost dataframe
cost_df['circumstance_cost_low'] = cost_df['year'].map(circumstance_cost_dict_low)
cost_df['circumstance_cost_mid'] = cost_df['year'].map(circumstance_cost_dict_mid)

cost_df['disposable_unit_low'] = cost_df['year'].map(disposable_unit_dict_low)
cost_df['disposable_unit_mid'] = cost_df['year'].map(disposable_unit_dict_mid)
cost_df['disposable_unit_hi'] = cost_df['year'].map(disposable_unit_dict_hi)

#print(len(cost_df))
#print(cost_df.columns)
cost_df.head()

Unnamed: 0,year,base_income,add_adult,add_child,single_cost,circumstance_cost_low,circumstance_cost_mid,disposable_unit_low,disposable_unit_mid,disposable_unit_hi
0,2019,36331,4544.666667,19238.666667,5359.0,1000,2000,3000,4000,5000
1,2020,34288,5153.333333,19438.0,5035.666667,1000,2000,3000,4000,5000
2,2021,45520,5407.0,29891.0,8265.666667,2000,3000,5000,6500,7000
3,2022,50463,7671.333333,33949.666667,7704.666667,2000,3000,5000,6500,7000
4,2023,46488,4494.666667,35656.0,8753.333333,2000,3000,5000,6500,7000


In [28]:
#create long version of cost dataframe, cost_df
#long version of disposable_unit
#cost_df_long2 = pd.melt(cost_df, id_vars=['year', 'base_income', 'add_adult', 'add_child', 'single_cost',
#                                         'circumstance_cost_mid', 'circumstance_cost_low', ], 
#                       value_vars=['disposable_unit_low',
#                                   'disposable_unit_mid', 'disposable_unit_hi'],
#                       var_name=['disposable_level'],
#                       value_name='disposable_value'
#                      )
#print(len(cost_df_long2))
#print(cost_df_long2.columns)
#cost_df_long2.head(5)

#add circumstance levels to long version
#cost_df_long = pd.melt(cost_df_long2, id_vars=['year', 'base_income', 'add_adult', 'add_child', 'single_cost',
#                                               'disposable_level', 'disposable_value'], 
#                       value_vars=['circumstance_cost_mid', 'circumstance_cost_low',],
#                       var_name=['circumstance_level'],
#                       value_name='circumstance_value'
#                      )

#print(len(cost_df_long))
#print(cost_df_long.columns)

#check that long version has correct length and format
#print(len(cost_df))
#print((len(cost_df_long2)), 'value_vars= disposable_unit_low, disposable_unit_mid, disposable_unit_hi')
#print((len(cost_df_long)), 'value_vars= circumstance_cost_mid, circumstance_cost_low')
#cost_df_long.head(26)
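
The commented-out cell reshapes the wide cost table into a long one with `pd.melt`. A minimal runnable sketch of the first melt follows, on a cut-down table; `cost_small` is a hypothetical two-row frame with values copied from the cost table above, not the full `cost_df`.

```python
import pandas as pd

# Two rows copied from the cost table above, keeping only a few columns
cost_small = pd.DataFrame({
    "year": ["2022", "2023"],
    "base_income": [50463, 46488],
    "disposable_unit_low": [5000, 5000],
    "disposable_unit_mid": [6500, 6500],
    "disposable_unit_hi": [7000, 7000],
})

# One output row per (year, disposable level) pair
cost_long = pd.melt(
    cost_small,
    id_vars=["year", "base_income"],
    value_vars=["disposable_unit_low", "disposable_unit_mid", "disposable_unit_hi"],
    var_name="disposable_level",
    value_name="disposable_value",
)

print(len(cost_small), len(cost_long))
print(cost_long)
```

Each input row is repeated once per melted column, so the long table has three times as many rows as the wide one. Comparing lengths this way is the same check the commented-out cell performs.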

## h. Export cost table file 

In [29]:
#export file as csv and manually check for accuracy in calculations
#cost_df.to_csv('/Users/sandidge/Desktop/Python_projects/BGS/exports_imports/cost_df.csv')


[Go to Top](#guide)

# Create a table for a single year

In [None]:
#get just the table with 'required annual income before taxes'
#full_income_table = df_list[1]
#transpose table 
#df_table = full_income_table.T
#make first row into column headers
#df_table.columns = df_table.iloc[0]
#drop first row of text that is now column headers
#df_table = df_table.iloc[1: , :]

#df_table

In [None]:
#get table of adults, children and required annual income, save as a dataframe
#df_income_table = pd.DataFrame(df_table['Required annual income before taxes'])
#move index values to columns
#df_income_table = df_income_table.reset_index()
#df_income_table.columns = ['num_adults_str', 'num_children_str', 'req_income_str']

#get integer values from string columns
#df_income_table['num_adults'] = df_income_table['num_adults_str'].astype('str').str.extract('(\d+)')
#df_income_table['num_children'] = df_income_table['num_children_str'].astype('str').str.extractall('(\d+)').unstack().fillna('').sum(axis=1).astype(int)
#df_income_table['req_income_pretax'] = df_income_table['req_income_str'].astype('str').str.extractall('(\d+)').unstack().fillna('').sum(axis=1).astype(int)

#print('2 ADULTS(1 WORKING) assumes second adult cares for children, it is not used in further calculations')
#df_income_table
