# Understanding Homelessness Rates

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
### Key Homelessness Issues
- What is homelessness and definition challenges
- Causes of homelessness
- Homelessness classifications including chronic, sheltered/un

### Point in Time Counts
- History of counts
- Methodological issues
- Rationale for category inclusions

### Data Background
- The original plan was to use a dataset from Kaggle, but the data was instead pulled straight from the HUD CoC site.
- Used [2007 - 2017 PIT Counts by State](https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/) and converted it into a single dataset

(borrowed from: https://www.kaggle.com/bltxr9/eda-of-total-homeless-population)
This dataset was generated by CoC and provided to HUD. Note: HUD did not conduct a full data quality review on the data submitted by each CoC.

What is the [Continuum of Care (CoC) Program](https://www.hudexchange.info/programs/coc/)?

Original Data: [PIT and HIC Data Since 2007](https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/)

CoC-HUD Summary Reports: [CoC Homeless Populations and Subpopulations Reports](https://www.hudexchange.info/programs/coc/coc-homeless-populations-and-subpopulations-reports/)

**Other Resources**

[Funding Awards](https://www.hudexchange.info/programs/coc/awards-by-component/)

[CoC Dashboard Reports](https://www.hudexchange.info/programs/coc/coc-dashboard-reports/)

[CoC Housing Inventory Count Reports](https://www.hudexchange.info/programs/coc/coc-housing-inventory-count-reports/)

<a id='wrangling'></a>
## Data Wrangling

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline

### Inspect Data

In [2]:
df = pd.read_csv('homeless-pit-by-state.csv')
df.head()

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
0,AK,2017,2,1845,1551,294,1354,1060,294,491,...,0,0,0,0,22,22,0,39,39,0
1,AL,2017,8,3793,2656,1137,2985,1950,1035,808,...,3,6,6,0,23,20,3,39,35,4
2,AR,2017,6,2467,1273,1194,2068,937,1131,399,...,0,0,0,0,10,10,0,13,13,0
3,AZ,2017,3,8947,5781,3166,6488,3423,3065,2459,...,0,0,0,0,81,81,0,112,112,0
4,CA,2017,43,134278,42636,91642,112756,25022,87734,21522,...,234,16,11,5,874,645,229,1058,782,276


#### Check Data Types and Missing Data

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 45 columns):
State                                                          605 non-null object
Year                                                           605 non-null int64
Number of CoCs                                                 605 non-null int64
Total Homeless                                                 605 non-null object
Sheltered Homeless                                             605 non-null object
Unsheltered Homeless                                           605 non-null object
Homeless Individuals                                           605 non-null object
Sheltered Homeless Individuals                                 605 non-null object
Unsheltered Homeless Individuals                               605 non-null object
Homeless People in Families                                    605 non-null object
Sheltered Homeless People in Families                          605 

_Observations_
- Missing data is consistent across groups of categories. Visual inspection of the data confirms that this is due to additional categories being added in later years.
- The count columns are all stored as `object` and will need to be converted to `float`.
- Not all categories are mutually exclusive, so it needs to be confirmed how the data is summed.

Visual inspection of the data confirmed that all columns are available from 2015 onwards, that 2011 - 2014 contain columns up to `Unsheltered Homeless Veterans`, and that 2007 - 2010 contain columns up to `Unsheltered Chronically Homeless Individuals`.

#### Check unique values

In [10]:
df[['State', 'Year']].nunique()

State    56
Year     11
dtype: int64

_Observations_
- There are 11 years of data contained in the set (2007 - 2017)
- Need to confirm which states are covered in the `State` column
- Not worried about unique values for the other columns

In [13]:
df['State'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD',
       'ME', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
       'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC',
       'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY',
       'KS*'], dtype=object)

- It appears that the list includes some US territories and DC. I am less familiar with state abbreviations, so state names will be needed for reference.
- One of the states is `KS*`. A note in the dataset says: The number of CoCs in 2017 was 399. However, MO-604 merged in 2016 and covers territory in both MO and KS, contributing to the PIT count in both states. This will need to be inspected individually.

### Clean Data
#### Convert Columns to `float`

In [64]:
df.columns

Index(['State', 'Year', 'Number of CoCs', 'Total Homeless',
       'Sheltered Homeless', 'Unsheltered Homeless', 'Homeless Individuals',
       'Sheltered Homeless Individuals', 'Unsheltered Homeless Individuals',
       'Homeless People in Families', 'Sheltered Homeless People in Families',
       'Unsheltered Homeless People in Families', 'Chronically Homeless',
       'Sheltered Chronically Homeless', 'Unsheltered Chronically Homeless',
       'Chronically Homeless Individuals',
       'Sheltered Chronically Homeless Individuals',
       'Unsheltered Chronically Homeless Individuals',
       'Chronically Homeless People in Families',
       'Sheltered Chronically Homeless People in Families',
       'Unsheltered Chronically Homeless People in Families',
       'Homeless Veterans', 'Sheltered Homeless Veterans',
       'Unsheltered Homeless Veterans',
       'Homeless Unaccompanied Youth (Under 25)',
       'Sheltered Homeless Unaccompanied Youth (Under 25)',
       'Unsheltered Home

It was discovered that missing data for states was reported as `.`, for example:

In [97]:
df.query('State == "MP"')[:5]

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
26,MP,2017,1,672,24,648,208,11,197,464,...,0,0,0,0,0,0,0,0,0,0
81,MP,2016,0,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
136,MP,2015,0,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
191,MP,2014,0,.,.,.,.,.,.,.,...,,,,,,,,,,
246,MP,2013,0,.,.,.,.,.,.,.,...,,,,,,,,,,


These needed to be converted to NaN to allow for further data transformation.

In [3]:
df.replace('.', np.nan, inplace=True)  # np.NaN alias was removed in NumPy 2.0
df.query('State == "MP"')[:5]

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
26,MP,2017,1,672.0,24.0,648.0,208.0,11.0,197.0,464.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81,MP,2016,0,,,,,,,,...,,,,,,,,,,
136,MP,2015,0,,,,,,,,...,,,,,,,,,,
191,MP,2014,0,,,,,,,,...,,,,,,,,,,
246,MP,2013,0,,,,,,,,...,,,,,,,,,,


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 45 columns):
State                                                          605 non-null object
Year                                                           605 non-null int64
Number of CoCs                                                 605 non-null int64
Total Homeless                                                 595 non-null object
Sheltered Homeless                                             595 non-null object
Unsheltered Homeless                                           595 non-null object
Homeless Individuals                                           595 non-null object
Sheltered Homeless Individuals                                 595 non-null object
Unsheltered Homeless Individuals                               595 non-null object
Homeless People in Families                                    595 non-null object
Sheltered Homeless People in Families                          595 

This increased the amount of missing data in the affected columns, so the patterns of missing data need further examination. A year-by-year review of what is missing is likely to be the most useful approach.
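A year-by-year review of this kind could be sketched as follows. This is a minimal example on a toy frame, not the real data: the column names come from the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the PIT data; values are illustrative only.
toy = pd.DataFrame({
    'State': ['AK', 'AL', 'AK', 'AL'],
    'Year': [2017, 2017, 2007, 2007],
    'Total Homeless': [1845.0, 3793.0, 1642.0, np.nan],
    'Homeless Veterans': [124.0, 269.0, np.nan, np.nan],
})

# Count missing values per column, grouped by year.
missing_by_year = (
    toy.drop(columns=['State', 'Year'])
       .isna()
       .groupby(toy['Year'])
       .sum()
)
print(missing_by_year)
```

Each row of the result is a year and each cell is the number of states with no value for that column, which makes blocks of categories introduced in later years easy to spot.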

All commas also needed to be removed from the string values to allow conversion to `float`, while leaving the `NaN` values intact.

In [4]:
df[df.columns] = df[df.columns].replace({',':''}, regex = True)

In [5]:
df[df.columns[3:]] = df[df.columns[3:]].astype(float)
df.head()

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
0,AK,2017,2,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0
1,AL,2017,8,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,3.0,6.0,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0
2,AR,2017,6,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0
3,AZ,2017,3,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0
4,CA,2017,43,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,234.0,16.0,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 45 columns):
State                                                          605 non-null object
Year                                                           605 non-null int64
Number of CoCs                                                 605 non-null int64
Total Homeless                                                 595 non-null float64
Sheltered Homeless                                             595 non-null float64
Unsheltered Homeless                                           595 non-null float64
Homeless Individuals                                           595 non-null float64
Sheltered Homeless Individuals                                 595 non-null float64
Unsheltered Homeless Individuals                               595 non-null float64
Homeless People in Families                                    595 non-null float64
Sheltered Homeless People in Families                       

This confirms that all data has been adjusted as desired.
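As an alternative sketch, both clean-up steps (treating `.` as missing and stripping thousands separators) can be requested at load time through the `na_values` and `thousands` parameters of `pd.read_csv`. The inline CSV below is a small made-up sample in the same shape as the source file, not the real data.

```python
import io
import pandas as pd

# Small illustrative sample: '.' marks missing data and counts use
# comma thousands separators, as in the source file.
raw = io.StringIO(
    'State,Year,Number of CoCs,Total Homeless\n'
    'CA,2017,43,"134,278"\n'
    'MP,2016,0,.\n'
)

# na_values turns '.' into NaN; thousands strips the commas while parsing.
sample = pd.read_csv(raw, na_values='.', thousands=',')
print(sample.dtypes)
```

This avoids the separate `replace` and `astype` steps, at the cost of applying the same rules to every column in the file.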

#### Confirm Column Configurations
To confirm how each of the individual columns are grouped, the values were summed and then tested.

In [94]:
test = df.iloc[:, 3:].sum()
test

total                                                       6.57365e+06
sheltered                                                   4.27227e+06
unsheltered                                                 2.30138e+06
individual                                                  4.14040e+06
sh_ind                                                      2.24844e+06
uns_ind                                                     1.89196e+06
family                                                      2.43324e+06
sh_fam                                                      2.02383e+06
uns_fam                                                          409413
chronic                                                     1.13412e+06
sh_chronic                                                       434523
uns_chronic                                                      699598
chronic_ind                                                      875860
sh_chronic_ind                                                  

A series of tests were completed to confirm the expected combinations:

`Total Homeless` = `Sheltered Homeless` + `Unsheltered Homeless`

In [40]:
(test['Sheltered Homeless'] + test['Unsheltered Homeless']) == test['Total Homeless']

True

`Homeless Individuals` = `Sheltered Homeless Individuals` + `Unsheltered Homeless Individuals`

In [41]:
(test['Sheltered Homeless Individuals'] + test['Unsheltered Homeless Individuals']) == test['Homeless Individuals']

True

`Homeless People in Families` = `Sheltered Homeless People in Families` + `Unsheltered Homeless People in Families`

In [42]:
(test['Sheltered Homeless People in Families'] + test['Unsheltered Homeless People in Families']) == test['Homeless People in Families']

True

`Total Homeless` = `Homeless Individuals` + `Homeless People in Families`

In [43]:
(test['Homeless Individuals'] + test['Homeless People in Families']) == test['Total Homeless']

True

`Chronically Homeless` = `Sheltered Chronically Homeless` + `Unsheltered Chronically Homeless`

In [44]:
(test['Sheltered Chronically Homeless'] + test['Unsheltered Chronically Homeless']) == test['Chronically Homeless']

True

It is also expected that `Chronically Homeless` is a subset of `Total Homeless` and so the values for `Chronically Homeless` should be less than that of the total.

In [45]:
test['Chronically Homeless'] < test['Total Homeless']

True

As it is not captured, a column for `Not Chronically Homeless` should be added.
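A complement column of this kind could be derived by subtraction. The sketch below uses a toy frame with illustrative values; the new column name is taken from the note above.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    'Total Homeless': [1845.0, 3793.0, np.nan],
    'Chronically Homeless': [200.0, 400.0, np.nan],
})

# Complement of the chronic count; NaN propagates where either input is missing.
toy['Not Chronically Homeless'] = (
    toy['Total Homeless'] - toy['Chronically Homeless']
)
print(toy)
```

The same pattern would apply to any other category that is a strict subset of a total.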

`Sheltered Chronically Homeless` = `Sheltered Chronically Homeless Individuals` + `Sheltered Chronically Homeless People in Families`

In [47]:
(test['Sheltered Chronically Homeless Individuals'] + test['Sheltered Chronically Homeless People in Families']) == test['Sheltered Chronically Homeless']

False

In [48]:
print(test['Sheltered Chronically Homeless Individuals'] + test['Sheltered Chronically Homeless People in Families'])
print(test['Sheltered Chronically Homeless'])

428331.0
439453.0


This is an unexpected result. Need to look into where the inequalities lie.

In [53]:
not_equal = (df['Sheltered Chronically Homeless Individuals'] + df['Sheltered Chronically Homeless People in Families']) != df['Sheltered Chronically Homeless']
df[['Year', 'State', 'Sheltered Chronically Homeless Individuals', 'Sheltered Chronically Homeless People in Families', 'Sheltered Chronically Homeless']][not_equal]

Unnamed: 0,Year,State,Sheltered Chronically Homeless Individuals,Sheltered Chronically Homeless People in Families,Sheltered Chronically Homeless
81,2016,MP,,,
136,2015,MP,,,
191,2014,MP,,,
246,2013,MP,,,
301,2012,MP,,,
356,2011,MP,,,
385,2010,AK,270.0,,119.0
386,2010,AL,911.0,,502.0
387,2010,AR,247.0,,145.0
388,2010,AZ,1052.0,,544.0


In all cases where the values do not match, nothing has been reported for `Sheltered Chronically Homeless People in Families`. This suggests that the CoCs in those states were not able to identify people as part of a family. The number of `Sheltered Chronically Homeless Individuals` would be expected to be no greater than `Sheltered Chronically Homeless`, but in some cases it is.
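Because `NaN != NaN` evaluates to `True` in pandas, the `not_equal` mask above flags rows with missing values as well as rows with a real discrepancy. A minimal sketch that separates the two cases is shown below. The short column names and the `XX` row are hypothetical; the `AK` values are taken from the table above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'State': ['MP', 'AK', 'XX'],
    'ind': [np.nan, 270.0, 100.0],   # Sheltered Chronically Homeless Individuals
    'fam': [np.nan, np.nan, 20.0],   # Sheltered Chronically Homeless People in Families
    'all': [np.nan, 119.0, 120.0],   # Sheltered Chronically Homeless
})

parts = toy[['ind', 'fam']]
# Rows where at least one input is missing cannot be checked reliably.
incomplete = parts.isna().any(axis=1) | toy['all'].isna()
# Rows with complete data whose parts do not add up to the total.
true_mismatch = ~incomplete & (parts.sum(axis=1) != toy['all'])
# Rows where individuals alone exceed the reported total.
ind_exceeds = toy['ind'] > toy['all']

print(toy.assign(incomplete=incomplete,
                 true_mismatch=true_mismatch,
                 ind_exceeds=ind_exceeds))
```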

In [61]:
ind_greater = df['Sheltered Chronically Homeless Individuals'] > df['Sheltered Chronically Homeless']
results_greater = df[['Year', 'State', 'Sheltered Chronically Homeless Individuals', 'Sheltered Chronically Homeless']][ind_greater]
results_greater

Unnamed: 0,Year,State,Sheltered Chronically Homeless Individuals,Sheltered Chronically Homeless
385,2010,AK,270.0,119.0
386,2010,AL,911.0,502.0
387,2010,AR,247.0,145.0
388,2010,AZ,1052.0,544.0
390,2010,CO,853.0,510.0
396,2010,GU,5.0,1.0
397,2010,HI,176.0,96.0
398,2010,IA,207.0,182.0
399,2010,ID,128.0,37.0
401,2010,IN,633.0,478.0


All of these results occurred between 2007 and 2010. This suggests that there may have been errors in the counting methodology in some CoCs, which appear to have been corrected from 2011 onwards. As a result, `Sheltered Chronically Homeless` figures from before 2011 may not be accurate. It may be worth investigating how CoC/HUD interpreted the results for these years from the applicable states.

The next step is to check how many states are involved in each year.

In [58]:
print(results_greater.query('Year == 2007').shape[0])
print(results_greater.query('Year == 2008').shape[0])
print(results_greater.query('Year == 2009').shape[0])
print(results_greater.query('Year == 2010').shape[0])

19
17
21
28


The number of states where this occurred is not consistent over the years, but may also reflect states correcting their methodology and additional states reporting. 
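The per-year counts could also be produced in a single step with `groupby`, rather than one `query` per year. A minimal sketch, using a toy stand-in for `results_greater` with illustrative rows:

```python
import pandas as pd

# Toy stand-in for `results_greater`; rows are illustrative only.
toy_results = pd.DataFrame({
    'Year': [2010, 2010, 2009, 2008, 2010],
    'State': ['AK', 'AL', 'ID', 'CO', 'AZ'],
})

# Number of distinct flagged states per year, in chronological order.
per_year = toy_results.groupby('Year')['State'].nunique()
print(per_year)
```

This scales to any number of years and does not require the years to be known in advance.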

`Unsheltered Chronically Homeless` = `Unsheltered Chronically Homeless Individuals` + `Unsheltered Chronically Homeless People in Families`

In [59]:
(test['Unsheltered Chronically Homeless Individuals'] + test['Unsheltered Chronically Homeless People in Families']) == test['Unsheltered Chronically Homeless']

False

It appears that a similar issue to what was happening with the `Sheltered Chronically Homeless` numbers is happening with the `Unsheltered Chronically Homeless` numbers.

In [60]:
ind_greater = df['Unsheltered Chronically Homeless Individuals'] > df['Unsheltered Chronically Homeless']
results_greater = df[['Year', 'State', 'Unsheltered Chronically Homeless Individuals', 'Unsheltered Chronically Homeless People in Families', 'Unsheltered Chronically Homeless']][ind_greater]
results_greater

Unnamed: 0,Year,State,Unsheltered Chronically Homeless Individuals,Unsheltered Chronically Homeless People in Families,Unsheltered Chronically Homeless
399,2010,ID,111.0,,74.0
402,2010,KS,51.0,,42.0
413,2010,MT,106.0,,49.0
415,2010,ND,5.0,,3.0
438,2010,WV,183.0,,158.0
454,2009,ID,76.0,,53.0
457,2009,KS,60.0,,42.0
468,2009,MT,78.0,,53.0
500,2008,CO,693.0,,633.0
509,2008,ID,53.0,,35.0


Interestingly, there are far fewer instances where the number of individuals is greater than the total count for `Unsheltered Chronically Homeless` than for `Sheltered Chronically Homeless`. Again, the discrepancies only occur between 2007 and 2010.

Will need to examine further to decide how to address these numbers.

`Homeless Veterans` = `Sheltered Homeless Veterans` + `Unsheltered Homeless Veterans`

In [62]:
(test['Sheltered Homeless Veterans'] + test['Unsheltered Homeless Veterans']) == test['Homeless Veterans']

True

It is expected that `Homeless Veterans` is a subset of `Total Homeless` and should be lower.

In [63]:
test['Homeless Veterans'] < test['Total Homeless']

True

As it is not captured, a column for `Homeless Non-Veteran` should be added.

`Homeless Unaccompanied Youth (Under 25)` = `Sheltered Homeless Unaccompanied Youth (Under 25)` + `Unsheltered Homeless Unaccompanied Youth (Under 25)`

In [64]:
(test['Sheltered Homeless Unaccompanied Youth (Under 25)'] + test['Unsheltered Homeless Unaccompanied Youth (Under 25)']) == test['Homeless Unaccompanied Youth (Under 25)']

True

It is expected that `Homeless Unaccompanied Youth (Under 25)` is a subset of `Homeless Individuals` and should be lower.

In [65]:
test['Homeless Unaccompanied Youth (Under 25)'] < test['Homeless Individuals']

True

As it is not captured, a column for `Homeless Adult` should be added.

`Homeless Unaccompanied Youth (Under 25)` = `Homeless Unaccompanied Children (Under 18)` + `Homeless Unaccompanied Young Adults (Age 18-24)`

In [66]:
(test['Homeless Unaccompanied Children (Under 18)'] + test['Homeless Unaccompanied Young Adults (Age 18-24)']) == test['Homeless Unaccompanied Youth (Under 25)']

True

`Homeless Unaccompanied Children (Under 18)` = `Sheltered Homeless Unaccompanied Children (Under 18)` + `Unsheltered Homeless Unaccompanied Children (Under 18)`

In [67]:
(test['Sheltered Homeless Unaccompanied Children (Under 18)'] + test['Unsheltered Homeless Unaccompanied Children (Under 18)']) == test['Homeless Unaccompanied Children (Under 18)']

True

`Homeless Unaccompanied Young Adults (Age 18-24)` = `Sheltered Homeless Unaccompanied Young Adults (Age 18-24)` + `Unsheltered Homeless Unaccompanied Young Adults (Age 18-24)`

In [68]:
(test['Sheltered Homeless Unaccompanied Young Adults (Age 18-24)'] + test['Unsheltered Homeless Unaccompanied Young Adults (Age 18-24)']) == test['Homeless Unaccompanied Young Adults (Age 18-24)']

True

`Parenting Youth (Under 25)` = `Sheltered Parenting Youth (Under 25)` + `Unsheltered Parenting Youth (Under 25)`

In [69]:
(test['Sheltered Parenting Youth (Under 25)'] + test['Unsheltered Parenting Youth (Under 25)']) == test['Parenting Youth (Under 25)']

True

`Parenting Youth (Under 25)` = `Parenting Youth Under 18` + `Parenting Youth Age 18-24`

In [71]:
(test['Parenting Youth Under 18'] + test['Parenting Youth Age 18-24']) == test['Parenting Youth (Under 25)']

True

`Parenting Youth Age 18-24` = `Sheltered Parenting Youth Age 18-24` + `Unsheltered Parenting Youth Age 18-24`

In [73]:
(test['Sheltered Parenting Youth Age 18-24'] + test['Unsheltered Parenting Youth Age 18-24']) == test['Parenting Youth Age 18-24']

True

`Parenting Youth Under 18` = `Sheltered Parenting Youth Under 18` + `Unsheltered Parenting Youth Under 18`

In [74]:
(test['Sheltered Parenting Youth Under 18'] + test['Unsheltered Parenting Youth Under 18']) == test['Parenting Youth Under 18']

True

`Children of Parenting Youth` = `Sheltered Children of Parenting Youth` + `Unsheltered Children of Parenting Youth`

In [75]:
(test['Sheltered Children of Parenting Youth'] + test['Unsheltered Children of Parenting Youth']) == test['Children of Parenting Youth']

True

Parenting youth and their children form a subset of `Homeless People in Families`, so their combined count should be less than that total.

In [76]:
(test['Parenting Youth (Under 25)'] + test['Children of Parenting Youth']) < test['Homeless People in Families']

True

These comparisons are considered sufficient to confirm that the data is structured as intended. In all cases except those involving chronic homelessness, the data configuration aligned with expectations.
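The checks above compare column totals, which could hide row-level errors that cancel each other out. A row-level version could be sketched as below. The helper name `count_row_mismatches` is hypothetical, and the last row of the toy frame is deliberately inconsistent for illustration.

```python
import numpy as np
import pandas as pd

def count_row_mismatches(frame, total_col, part_cols):
    """Count fully reported rows where the parts do not sum to the total."""
    part_cols = list(part_cols)
    # Drop rows with any missing input so NaN is not counted as a mismatch.
    complete = frame[[total_col] + part_cols].dropna()
    diff = complete[part_cols].sum(axis=1) - complete[total_col]
    return int((diff != 0).sum())

# Toy data: first two rows are consistent, the last row is not.
toy = pd.DataFrame({
    'Total Homeless': [1845.0, 3793.0, np.nan, 100.0],
    'Sheltered Homeless': [1551.0, 2656.0, np.nan, 60.0],
    'Unsheltered Homeless': [294.0, 1137.0, np.nan, 30.0],
})

n_bad = count_row_mismatches(
    toy, 'Total Homeless', ['Sheltered Homeless', 'Unsheltered Homeless']
)
print(n_bad)
```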

It is noted that the data can be divided in a number of different ways: by sheltering type (sheltered or unsheltered), homelessness type (chronic or not), family status (individual or family), veteran status (veteran or not), and age (under 25 or not). However, not all segmentations are carried across each category.

The groupings are as follows.

**By Homelessness Type**

In [19]:
%%html
<img src="img/data_by_type.JPG" width="700" height="700">

**By Family Status**

In [15]:
%%html
<img src="img/data_by_family.JPG" width="900" height="900">

**By Age**

In [16]:
%%html
<img src="img/data_by_age.JPG" width="700" height="700">

**By Veteran Status**

In [20]:
%%html
<img src="img/data_by_veteran.JPG" width="500" height="500">

Given the various combinations, the plan is to construct multiple databases that appropriately group the data to allow examination of different questions.

#### Add State Name info

In [22]:
df_state = pd.read_csv("states.csv")
df_state.head()

Unnamed: 0,abbr,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona
3,AR,Arkansas
4,CA,California


In [28]:
df = df.merge(df_state, left_on='State', right_on='abbr', how='left')
df.head()

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth,abbr,name
0,AK,2017,2,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0,AK,Alaska
1,AL,2017,8,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0,AL,Alabama
2,AR,2017,6,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0,AR,Arkansas
3,AZ,2017,3,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0,AZ,Arizona
4,CA,2017,43,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0,CA,California


In [30]:
df.drop("abbr", axis=1, inplace=True)
df.shape

(605, 46)

In [31]:
df.to_csv("homeless-clean.csv", index=False)

Change the column names so that they are easier to reference.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 605 entries, 0 to 604
Data columns (total 46 columns):
State                                                          605 non-null object
Year                                                           605 non-null int64
Number of CoCs                                                 605 non-null int64
Total Homeless                                                 595 non-null float64
Sheltered Homeless                                             595 non-null float64
Unsheltered Homeless                                           595 non-null float64
Homeless Individuals                                           595 non-null float64
Sheltered Homeless Individuals                                 595 non-null float64
Unsheltered Homeless Individuals                               595 non-null float64
Homeless People in Families                                    595 non-null float64
Sheltered Homeless People in Families                       

In [37]:
labels = ['state', 'year', 'cocs', 'total', 'sheltered', 'unsheltered', 'individual', 'sh_ind', 'uns_ind', 'family', 'sh_fam', 'uns_fam', 'chronic', 'sh_chronic', 'uns_chronic', 'chronic_ind', 'sh_chronic_ind', 'uns_chronic_ind', 'chronic_fam', 'sh_chronic_fam', 'uns_chronic_fam', 'veteran', 'sh_veteran', 'uns_veteran', 'youth', 'sh_youth', 'uns_youth', 'child', 'sh_child', 'uns_child', 'yadult', 'sh_yadult', 'uns_yadult', 'yparent', 'sh_yparent', 'uns_yparent', 'yparent_u18', 'sh_yparent_u18', 'uns_yparent_u18', 'yparent_18to24', 'sh_yparent_18to24', 'uns_yparent_18to24', 'ypchild', 'sh_ypchild', 'uns_ypchild', 'state_name']
df = pd.read_csv('homeless-clean.csv', header=0, names=labels)  # file written by the earlier to_csv call
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 46 columns):
state                 605 non-null object
year                  605 non-null int64
cocs                  605 non-null int64
total                 595 non-null float64
sheltered             595 non-null float64
unsheltered           595 non-null float64
individual            595 non-null float64
sh_ind                595 non-null float64
uns_ind               595 non-null float64
family                595 non-null float64
sh_fam                595 non-null float64
uns_fam               595 non-null float64
chronic               595 non-null float64
sh_chronic            595 non-null float64
uns_chronic           595 non-null float64
chronic_ind           595 non-null float64
sh_chronic_ind        595 non-null float64
uns_chronic_ind       595 non-null float64
chronic_fam           379 non-null float64
sh_chronic_fam        379 non-null float64
uns_chronic_fam       379 non-null float

In [38]:
df.to_csv("homeless-rename.csv", index=False)
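The renaming could also be done in memory, without re-reading the CSV, by passing an explicit mapping to `DataFrame.rename`. The sketch below covers only the first four columns on a toy frame; the full mapping would pair each original column with its entry in `labels`.

```python
import pandas as pd

# Toy frame with a few of the original column names.
toy = pd.DataFrame({
    'State': ['AK'],
    'Year': [2017],
    'Number of CoCs': [2],
    'Total Homeless': [1845.0],
})

# Explicit old-to-new mapping; columns not listed would keep their names.
short_names = {
    'State': 'state',
    'Year': 'year',
    'Number of CoCs': 'cocs',
    'Total Homeless': 'total',
}
toy = toy.rename(columns=short_names)
print(list(toy.columns))
```

An explicit mapping does not depend on column order, so a reordered or extended source file is less likely to end up with mislabelled columns.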

#### Address Missing Values

In [39]:
df = pd.read_csv("homeless-rename.csv")
df.head()

Unnamed: 0,state,year,cocs,total,sheltered,unsheltered,individual,sh_ind,uns_ind,family,...,yparent_u18,sh_yparent_u18,uns_yparent_u18,yparent_18to24,sh_yparent_18to24,uns_yparent_18to24,ypchild,sh_ypchild,uns_ypchild,state_name
0,AK,2017,2,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0,Alaska
1,AL,2017,8,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,6.0,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0,Alabama
2,AR,2017,6,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0,Arkansas
3,AZ,2017,3,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0,Arizona
4,CA,2017,43,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,16.0,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0,California


**State Names**

Remembered that there are still some abbreviations that don't have state names.

In [45]:
df[['state', 'state_name']][df['state_name'].isnull()].groupby('state').count()

Unnamed: 0_level_0,state_name
state,Unnamed: 1_level_1
DC,0
GU,0
KS*,0
MP,0
PR,0
VI,0


According to [this site](http://www.stateabbreviations.us/), the additional details are:
- DC = Washington DC
- GU = Guam
- KS* = Kansas* (as outlined above)
- MP = Northern Mariana Islands
- PR = Puerto Rico
- VI = Virgin Islands

In [47]:
d = {'state': ['DC', 'GU', 'KS*', 'MP', 'PR', 'VI'], 'addstate_name': ['Washington DC', 'Guam', 'Kansas*', 'Northern Mariana Islands', 'Puerto Rico', 'Virgin Islands']}
state_df = pd.DataFrame(data=d)

In [48]:
df = df.merge(state_df, on='state', how='left')

Unnamed: 0,state,year,cocs,total,sheltered,unsheltered,individual,sh_ind,uns_ind,family,...,sh_yparent_u18,uns_yparent_u18,yparent_18to24,sh_yparent_18to24,uns_yparent_18to24,ypchild,sh_ypchild,uns_ypchild,state_name,addstate_name
0,AK,2017,2,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0,Alaska,
1,AL,2017,8,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0,Alabama,
2,AR,2017,6,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0,Arkansas,
3,AZ,2017,3,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0,Arizona,
4,CA,2017,43,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0,California,


In [50]:
df['state_name'] = df['state_name'].fillna(df['addstate_name'])  # assign back; inplace fillna on a selected column is unreliable in recent pandas
df.drop('addstate_name', axis=1, inplace=True)
df.head()

Unnamed: 0,state,year,cocs,total,sheltered,unsheltered,individual,sh_ind,uns_ind,family,...,yparent_u18,sh_yparent_u18,uns_yparent_u18,yparent_18to24,sh_yparent_18to24,uns_yparent_18to24,ypchild,sh_ypchild,uns_ypchild,state_name
0,AK,2017,2,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0,Alaska
1,AL,2017,8,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,6.0,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0,Alabama
2,AR,2017,6,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0,Arkansas
3,AZ,2017,3,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0,Arizona
4,CA,2017,43,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,16.0,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0,California


In [52]:
df.state_name.isnull().sum()

0
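The merge, fill, and drop steps above could also be collapsed into a single lookup. The following is a minimal sketch on a toy frame (the `toy` frame and the `extra_names` name are illustrative assumptions, not part of the notebook); it uses `Series.map` so that codes without an entry are left untouched.

```python
import pandas as pd

# Toy stand-in for the notebook's df; only the two relevant columns are included.
toy = pd.DataFrame({
    "state": ["AK", "DC", "GU", "KS*"],
    "state_name": ["Alaska", None, None, None],
})

# Same abbreviations as listed above, expressed as a code -> name lookup.
extra_names = {
    "DC": "Washington DC",
    "GU": "Guam",
    "KS*": "Kansas*",
    "MP": "Northern Mariana Islands",
    "PR": "Puerto Rico",
    "VI": "Virgin Islands",
}

# map() yields NaN for codes that are not in the lookup, so names that are
# already present are kept and only the gaps are filled.
toy["state_name"] = toy["state_name"].fillna(toy["state"].map(extra_names))
```

This avoids creating and then dropping a temporary `addstate_name` column, at the cost of keeping the lookup in a dictionary rather than a DataFrame.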

After further consideration, it was decided to keep only the 50 states plus Washington DC (including the additional Kansas info) so that comparisons are limited to US states. The four territories (GU, MP, PR, VI) are dropped.

In [58]:
df.shape

(605, 46)

In [65]:
drop_rows = df.query('state == "GU" or state == "MP" or state == "PR" or state == "VI"').index
drop_rows

Int64Index([ 11,  26,  41,  49,  66,  81,  96, 104, 121, 136, 151, 159, 176,
            191, 206, 214, 231, 246, 261, 269, 286, 301, 316, 324, 341, 356,
            371, 379, 396, 411, 426, 434, 451, 466, 481, 489, 506, 521, 536,
            544, 561, 576, 591, 599],
           dtype='int64')

In [66]:
df.drop(drop_rows, inplace=True)
df.shape

(561, 46)
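An equivalent way to express the same filter is a boolean mask built with `isin`, which returns a new frame instead of dropping rows in place. This is only a sketch on toy data; the `territories` list mirrors the four codes in the query above, and the `reset_index` call is an optional assumption that the notebook itself does not make.

```python
import pandas as pd

# Toy stand-in for the notebook's df.
toy = pd.DataFrame({
    "state": ["AK", "GU", "CA", "PR", "DC", "VI", "MP"],
    "total": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# The four territory codes removed in the notebook.
territories = ["GU", "MP", "PR", "VI"]

# Keep every row whose state code is NOT a territory.
states_only = toy[~toy["state"].isin(territories)].reset_index(drop=True)
```

Keeping the codes in a list makes it easy to adjust which areas are excluded without editing a query string.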

**Missing Data**

In [76]:
pd.isnull(df.query('year == 2007')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam           51
sh_chronic_fam        51
uns_chronic_fam       51
veteran               51
sh_veteran            51
uns_veteran           51
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [78]:
pd.isnull(df.query('year == 2008')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam           51
sh_chronic_fam        51
uns_chronic_fam       51
veteran               51
sh_veteran            51
uns_veteran           51
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [79]:
pd.isnull(df.query('year == 2009')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam           51
sh_chronic_fam        51
uns_chronic_fam       51
veteran               51
sh_veteran            51
uns_veteran           51
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [80]:
pd.isnull(df.query('year == 2010')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam           51
sh_chronic_fam        51
uns_chronic_fam       51
veteran               51
sh_veteran            51
uns_veteran           51
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


For 2007 to 2010, no data is reported for columns `chronic_fam` to `uns_ypchild`. There is no other missing row data.

In [81]:
pd.isnull(df.query('year == 2011')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam            0
sh_chronic_fam         0
uns_chronic_fam        0
veteran                0
sh_veteran             0
uns_veteran            0
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [82]:
pd.isnull(df.query('year == 2012')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam            0
sh_chronic_fam         0
uns_chronic_fam        0
veteran                0
sh_veteran             0
uns_veteran            0
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [83]:
pd.isnull(df.query('year == 2013')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam            0
sh_chronic_fam         0
uns_chronic_fam        0
veteran                0
sh_veteran             0
uns_veteran            0
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


In [88]:
pd.isnull(df.query('year == 2014')).sum()

state                  0
year                   0
cocs                   0
total                  0
sheltered              0
unsheltered            0
individual             0
sh_ind                 0
uns_ind                0
family                 0
sh_fam                 0
uns_fam                0
chronic                0
sh_chronic             0
uns_chronic            0
chronic_ind            0
sh_chronic_ind         0
uns_chronic_ind        0
chronic_fam            0
sh_chronic_fam         0
uns_chronic_fam        0
veteran                0
sh_veteran             0
uns_veteran            0
youth                 51
sh_youth              51
uns_youth             51
child                 51
sh_child              51
uns_child             51
yadult                51
sh_yadult             51
uns_yadult            51
yparent               51
sh_yparent            51
uns_yparent           51
yparent_u18           51
sh_yparent_u18        51
uns_yparent_u18       51
yparent_18to24        51


For 2011 to 2014, no data is reported for columns `youth` to `uns_ypchild`, but there is no other missing row data.

In [85]:
pd.isnull(df.query('year == 2015')).sum()

state                 0
year                  0
cocs                  0
total                 0
sheltered             0
unsheltered           0
individual            0
sh_ind                0
uns_ind               0
family                0
sh_fam                0
uns_fam               0
chronic               0
sh_chronic            0
uns_chronic           0
chronic_ind           0
sh_chronic_ind        0
uns_chronic_ind       0
chronic_fam           0
sh_chronic_fam        0
uns_chronic_fam       0
veteran               0
sh_veteran            0
uns_veteran           0
youth                 0
sh_youth              0
uns_youth             0
child                 0
sh_child              0
uns_child             0
yadult                0
sh_yadult             0
uns_yadult            0
yparent               0
sh_yparent            0
uns_yparent           0
yparent_u18           0
sh_yparent_u18        0
uns_yparent_u18       0
yparent_18to24        0
sh_yparent_18to24     0
uns_yparent_18to

In [89]:
pd.isnull(df.query('year == 2016')).sum()

state                 0
year                  0
cocs                  0
total                 0
sheltered             0
unsheltered           0
individual            0
sh_ind                0
uns_ind               0
family                0
sh_fam                0
uns_fam               0
chronic               0
sh_chronic            0
uns_chronic           0
chronic_ind           0
sh_chronic_ind        0
uns_chronic_ind       0
chronic_fam           0
sh_chronic_fam        0
uns_chronic_fam       0
veteran               0
sh_veteran            0
uns_veteran           0
youth                 0
sh_youth              0
uns_youth             0
child                 0
sh_child              0
uns_child             0
yadult                0
sh_yadult             0
uns_yadult            0
yparent               0
sh_yparent            0
uns_yparent           0
yparent_u18           0
sh_yparent_u18        0
uns_yparent_u18       0
yparent_18to24        0
sh_yparent_18to24     0
uns_yparent_18to

In [90]:
pd.isnull(df.query('year == 2017')).sum()

state                 0
year                  0
cocs                  0
total                 0
sheltered             0
unsheltered           0
individual            0
sh_ind                0
uns_ind               0
family                0
sh_fam                0
uns_fam               0
chronic               0
sh_chronic            0
uns_chronic           0
chronic_ind           0
sh_chronic_ind        0
uns_chronic_ind       0
chronic_fam           0
sh_chronic_fam        0
uns_chronic_fam       0
veteran               0
sh_veteran            0
uns_veteran           0
youth                 0
sh_youth              0
uns_youth             0
child                 0
sh_child              0
uns_child             0
yadult                0
sh_yadult             0
uns_yadult            0
yparent               0
sh_yparent            0
uns_yparent           0
yparent_u18           0
sh_yparent_u18        0
uns_yparent_u18       0
yparent_18to24        0
sh_yparent_18to24     0
uns_yparent_18to

There is no missing data for 2015 to 2017.

The missing data is not missing in the truest sense of the word. Instead, it reflects additional categories that were added to the counts over the years. Therefore, it does not need to be imputed or dropped.

The columns missing from 2007 to 2010 also indicate a changed or improved methodology after 2010 that resulted in appropriately combining the counts for individuals identified as chronically homeless. That is, beyond 2010, all data configurations are as expected.
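The year-by-year null counts above can also be produced in one table. The sketch below assumes a frame shaped like `df` (a `year` column plus count columns) and uses made-up toy values; grouping the boolean null mask by year and summing gives the number of missing values per column per year.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: two states, three years, categories added over time.
toy = pd.DataFrame({
    "state": ["AK", "AL", "AK", "AL", "AK", "AL"],
    "year": [2010, 2010, 2014, 2014, 2017, 2017],
    "total": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "veteran": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "youth": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],
})

# isnull() gives a True/False mask; summing it per year counts the nulls.
missing_by_year = toy.drop(columns="year").isnull().groupby(toy["year"]).sum()
```

Each row of `missing_by_year` is one year and each column is one category, so the point at which a category starts being reported shows up as the first row with a zero.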

In [91]:
df.to_csv("homeless-final.csv", index=False)

### Group Data

As described above, there are a number of different configurations of data that can be made based on classifications. It was determined that data would be grouped into different datasets based on classification for ease of analysis. Additional classifications are possible but these can all be drawn from the main dataset.
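The grouped datasets in this section share one pattern: select the key columns plus a classification, then derive the remainder by subtraction. A minimal sketch of that pattern as a reusable function follows; `subset_with_remainder` is a hypothetical helper introduced here for illustration, and the two toy rows reuse the 2017 Alaska and Alabama totals and veteran counts shown in this notebook.

```python
import pandas as pd


def subset_with_remainder(frame, keys, whole, part, remainder):
    """Return keys + [whole, part] with an extra column whole - part."""
    # .copy() gives an independent frame, so adding a column is safe.
    out = frame[keys + [whole, part]].copy()
    out[remainder] = out[whole] - out[part]
    return out


# Toy stand-in for df, using two rows of 2017 values from the notebook.
toy = pd.DataFrame({
    "state": ["AK", "AL"],
    "state_name": ["Alaska", "Alabama"],
    "year": [2017, 2017],
    "total": [1845.0, 3793.0],
    "veteran": [124.0, 269.0],
})

result = subset_with_remainder(
    toy, ["state", "state_name", "year"], "total", "veteran", "non_veteran"
)
```

Because the helper copies the selection before adding a column, the original frame is left unchanged.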

The `test` series (column totals across all remaining rows) was reconstructed so that any additional assumptions could be tested.

In [106]:
test = df.iloc[:, 3:-1].sum()

#### By Sheltering Type

In [92]:
df_shelter = df[['state', 'state_name', 'year', 'total', 'sheltered', 'unsheltered']]
df_shelter.head()

Unnamed: 0,state,state_name,year,total,sheltered,unsheltered
0,AK,Alaska,2017,1845.0,1551.0,294.0
1,AL,Alabama,2017,3793.0,2656.0,1137.0
2,AR,Arkansas,2017,2467.0,1273.0,1194.0
3,AZ,Arizona,2017,8947.0,5781.0,3166.0
4,CA,California,2017,134278.0,42636.0,91642.0


In [93]:
df_shelter.to_csv('homeless-shelter.csv', index=False)

#### By Family Status

Confirm assumption about composition of `family` data: `yparent` and `ypchild` should be a subset of `family`

In [96]:
test['family'] > (test['yparent'] + test['ypchild'])

True
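The check above compares column totals. Since `sum()` skips missing values, `family` is totalled over every year while `yparent` and `ypchild` are totalled only over the years in which they are reported. A stricter variant, sketched here on toy data, compares row by row and only on rows where the young-parent columns are present. The 2017 rows use the Alaska and Alabama values shown in this notebook; the 2010 row is a made-up placeholder.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df.
toy = pd.DataFrame({
    "year": [2010, 2017, 2017],
    "family": [500.0, 491.0, 808.0],
    "yparent": [np.nan, 22.0, 29.0],
    "ypchild": [np.nan, 39.0, 39.0],
})

# Only rows where both young-parent columns are reported can be checked.
reported = toy.dropna(subset=["yparent", "ypchild"])

# A violation is a row where the supposed subset exceeds the family count.
violations = reported[reported["family"] < reported["yparent"] + reported["ypchild"]]
subset_holds = violations.empty
```

If `violations` is not empty, its rows identify the state and year combinations where the assumption fails.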

In [111]:
# .copy() makes df_family an independent frame, so adding columns does not raise SettingWithCopyWarning
df_family = df[['state', 'state_name', 'year', 'total', 'individual', 'family', 'yparent', 'ypchild']].copy()
# People in families who are neither young parents nor their children
df_family['non_ypfam'] = df_family.family - (df_family.yparent + df_family.ypchild)
df_family = df_family.join(df[['sh_ind', 'uns_ind', 'sh_fam', 'uns_fam', 'sh_yparent', 'uns_yparent', 'sh_ypchild', 'uns_ypchild']])
df_family['sh_non_ypfam'] = df_family.sh_fam - (df_family.sh_yparent + df_family.sh_ypchild)
df_family['uns_non_ypfam'] = df_family.uns_fam - (df_family.uns_yparent + df_family.uns_ypchild)
df_family.head()


Unnamed: 0,state,state_name,year,total,individual,family,yparent,ypchild,non_ypfam,sh_ind,uns_ind,sh_fam,uns_fam,sh_yparent,uns_yparent,sh_ypchild,uns_ypchild,sh_non_ypfam,uns_non_ypfam
0,AK,Alaska,2017,1845.0,1354.0,491.0,22.0,39.0,430.0,1060.0,294.0,491.0,0.0,22.0,0.0,39.0,0.0,430.0,0.0
1,AL,Alabama,2017,3793.0,2985.0,808.0,29.0,39.0,740.0,1950.0,1035.0,706.0,102.0,26.0,3.0,35.0,4.0,645.0,95.0
2,AR,Arkansas,2017,2467.0,2068.0,399.0,10.0,13.0,376.0,937.0,1131.0,336.0,63.0,10.0,0.0,13.0,0.0,313.0,63.0
3,AZ,Arizona,2017,8947.0,6488.0,2459.0,81.0,112.0,2266.0,3423.0,3065.0,2358.0,101.0,81.0,0.0,112.0,0.0,2165.0,101.0
4,CA,California,2017,134278.0,112756.0,21522.0,890.0,1058.0,19574.0,25022.0,87734.0,17614.0,3908.0,656.0,234.0,782.0,276.0,16176.0,3398.0


In [112]:
df_family.to_csv('homeless-family.csv', index=False)

#### By Homeless Type

In [113]:
# .copy() makes df_type an independent frame, so adding columns does not raise SettingWithCopyWarning
df_type = df[['state', 'state_name', 'year', 'total', 'chronic', 'sh_chronic', 'uns_chronic']].copy()
# Everyone counted who is not classified as chronically homeless
df_type['non_chronic'] = df_type.total - df_type.chronic
df_type = df_type.join(df[['sh_chronic_ind', 'uns_chronic_ind', 'sh_chronic_fam', 'uns_chronic_fam']])
df_type.head()


Unnamed: 0,state,state_name,year,total,chronic,sh_chronic,uns_chronic,non_chronic,sh_chronic_ind,uns_chronic_ind,sh_chronic_fam,uns_chronic_fam
0,AK,Alaska,2017,1845.0,257.0,158.0,99.0,1588.0,117.0,99.0,41.0,0.0
1,AL,Alabama,2017,3793.0,363.0,171.0,192.0,3430.0,152.0,192.0,19.0,0.0
2,AR,Arkansas,2017,2467.0,473.0,147.0,326.0,1994.0,138.0,312.0,9.0,14.0
3,AZ,Arizona,2017,8947.0,1552.0,546.0,1006.0,7395.0,492.0,971.0,54.0,35.0
4,CA,California,2017,134278.0,37360.0,5235.0,32125.0,96918.0,4430.0,31368.0,805.0,757.0


In [114]:
df_type.to_csv('homeless-type.csv', index=False)

#### By Veteran Status

In [116]:
# .copy() makes df_veteran an independent frame, so adding columns does not raise SettingWithCopyWarning
df_veteran = df[['state', 'state_name', 'year', 'total', 'veteran']].copy()
# Everyone counted who is not a veteran
df_veteran['non_veteran'] = df_veteran.total - df_veteran.veteran
df_veteran = df_veteran.join(df[['sh_veteran', 'uns_veteran']])
df_veteran.head()


Unnamed: 0,state,state_name,year,total,veteran,non_veteran,sh_veteran,uns_veteran
0,AK,Alaska,2017,1845.0,124.0,1721.0,95.0,29.0
1,AL,Alabama,2017,3793.0,269.0,3524.0,202.0,67.0
2,AR,Arkansas,2017,2467.0,239.0,2228.0,130.0,109.0
3,AZ,Arizona,2017,8947.0,970.0,7977.0,641.0,329.0
4,CA,California,2017,134278.0,11472.0,122806.0,3815.0,7657.0


In [117]:
df_veteran.to_csv('homeless-veteran.csv', index=False)

#### By Age

Young people comprise unaccompanied youth (`youth`, made up of `child` and `yadult`) and young parents (`yparent`, across both age bands) together with their children (`ypchild`). However, the remaining people are not all adults: families not led by young parents also include children.

Therefore, only unaccompanied youth can be subtracted from individuals to determine the number of individual adults.

In [119]:
test['individual'] > test['youth']

True

In [120]:
test['sh_ind'] > test['sh_youth']

True

In [121]:
test['uns_ind'] > test['uns_youth']

True
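The 2017 rows displayed in this notebook are consistent with two composition rules: `youth` equals `child` plus `yadult`, and `yparent` equals `yparent_u18` plus `yparent_18to24`. Whether these hold for every state and year is an assumption; the sketch below shows one way to test it row by row, using the Alaska, Alabama, and California values from the output as toy data.

```python
import pandas as pd

# Toy stand-in for df, using 2017 values for AK, AL, and CA from the notebook.
toy = pd.DataFrame({
    "state": ["AK", "AL", "CA"],
    "youth": [162.0, 294.0, 15458.0],
    "child": [15.0, 36.0, 1649.0],
    "yadult": [147.0, 258.0, 13809.0],
    "yparent": [22.0, 29.0, 890.0],
    "yparent_u18": [0.0, 6.0, 16.0],
    "yparent_18to24": [22.0, 23.0, 874.0],
})

# Row-wise identity checks; each result is a boolean Series, one entry per row.
youth_ok = (toy["child"] + toy["yadult"]).eq(toy["youth"])
yparent_ok = (toy["yparent_u18"] + toy["yparent_18to24"]).eq(toy["yparent"])

# Rows that break either rule, for inspection.
mismatches = toy[~(youth_ok & yparent_ok)]
```

On the full dataset, rows from years in which these columns are not reported would need to be excluded first, since a comparison involving NaN evaluates to False.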

In [122]:
# .copy() makes df_age an independent frame, so adding columns does not raise SettingWithCopyWarning
df_age = df[['state', 'state_name', 'year', 'total', 'youth', 'child', 'yadult', 'yparent', 'yparent_u18', 'yparent_18to24', 'ypchild']].copy()
# Individual adults: individuals minus unaccompanied youth
df_age['adult'] = df.individual - df.youth
df_age['non_ypfam'] = df_family.family - (df_family.yparent + df_family.ypchild)
df_age = df_age.join(df[[
                        'sh_youth', 'uns_youth', 'sh_child', 'uns_child', 'sh_yadult', 'uns_yadult', 'sh_yparent', 'uns_yparent', 
                        'sh_yparent_u18', 'uns_yparent_u18', 'sh_yparent_18to24', 'uns_yparent_18to24', 'sh_ypchild', 'uns_ypchild'
                        ]])
df_age['sh_adult'] = df.sh_ind - df.sh_youth
df_age['uns_adult'] = df.uns_ind - df.uns_youth
df_age['sh_non_ypfam'] = df_family.sh_fam - (df_family.sh_yparent + df_family.sh_ypchild)
df_age['uns_non_ypfam'] = df_family.uns_fam - (df_family.uns_yparent + df_family.uns_ypchild)
df_age.head()


Unnamed: 0,state,state_name,year,total,youth,child,yadult,yparent,yparent_u18,yparent_18to24,...,sh_yparent_u18,uns_yparent_u18,sh_yparent_18to24,uns_yparent_18to24,sh_ypchild,uns_ypchild,sh_adult,uns_adult,sh_non_ypfam,uns_non_ypfam
0,AK,Alaska,2017,1845.0,162.0,15.0,147.0,22.0,0.0,22.0,...,0.0,0.0,22.0,0.0,39.0,0.0,918.0,274.0,430.0,0.0
1,AL,Alabama,2017,3793.0,294.0,36.0,258.0,29.0,6.0,23.0,...,6.0,0.0,20.0,3.0,35.0,4.0,1768.0,923.0,645.0,95.0
2,AR,Arkansas,2017,2467.0,208.0,17.0,191.0,10.0,0.0,10.0,...,0.0,0.0,10.0,0.0,13.0,0.0,850.0,1010.0,313.0,63.0
3,AZ,Arizona,2017,8947.0,578.0,55.0,523.0,81.0,0.0,81.0,...,0.0,0.0,81.0,0.0,112.0,0.0,3078.0,2832.0,2165.0,101.0
4,CA,California,2017,134278.0,15458.0,1649.0,13809.0,890.0,16.0,874.0,...,11.0,5.0,645.0,229.0,782.0,276.0,22313.0,74985.0,16176.0,3398.0


In [123]:
df_age.to_csv('homeless-age.csv', index=False)

<a id='eda'></a>
## Exploratory Data Analysis