# Comparing Median Income by County in Texas, 2014-2015

## Overview

This notebook outlines the steps needed to assemble a dataset of median household incomes across Texas counties for 2014-2015. The data was extracted from the 2014-2015 Public Health Data records.

Below are step-by-step instructions for extracting this data.

### ***Getting Started***



**Starting with Pandas**

Pandas is the package that will help you sort through the dataset.


Add a code cell containing `import pandas as pd` and `import numpy as np`, then run it.


In [1]:
import pandas as pd
import numpy as np

**Reading the File in**

Upload the CSV file named `CountyHealthData_2014-2015.csv`: open the Files section on the left side of your notebook, select Upload file, and choose the file.

To read the file into the notebook, run `df=pd.read_csv("CountyHealthData_2014-2015.csv")`. The file was provided by Professor Gotzler and is linked in the Files section of this notebook.

**Important:**

Before running the code below, make sure the CSV is in your files folder.

In [2]:
df=pd.read_csv("CountyHealthData_2014-2015.csv")
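If the file has not been uploaded yet, `pd.read_csv` stops with a `FileNotFoundError`. A minimal sketch of checking for the file first is shown here. The helper name `load_county_data` is hypothetical and is not part of the original notebook; it only assumes the file name used above.

```python
from pathlib import Path

import pandas as pd


def load_county_data(path="CountyHealthData_2014-2015.csv"):
    """Read the CSV if it exists; otherwise say what is missing.

    Hypothetical helper for illustration, not part of the notebook.
    """
    csv_path = Path(path)
    if not csv_path.exists():
        # Fail early with a hint about what to do next.
        raise FileNotFoundError(
            f"{csv_path} was not found. Upload it to the notebook's files first."
        )
    return pd.read_csv(csv_path)
```

With the file uploaded, `df = load_county_data()` should give the same DataFrame as the cell above.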


### ***Understanding the Size of the Original Dataset***

Use `df.shape` and `df.size` to check the size of the dataset. `df.shape` returns a (rows, columns) tuple, and `df.size` returns the total number of cells, which is rows multiplied by columns.

In [3]:
df.shape


(6109, 64)

In [4]:
df.size

390976
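The two numbers above are related: 6109 rows times 64 columns gives 390976 cells. A small sketch of the same relationship on a toy DataFrame (the toy values are made up for illustration and are not from the dataset):

```python
import pandas as pd

# Toy data standing in for the real file (values are made up).
toy = pd.DataFrame({"State": ["TX", "TX", "OK"], "Year": [2014, 2015, 2014]})

rows, cols = toy.shape          # shape is a (rows, columns) tuple
print(rows, cols)               # 3 2
print(toy.size)                 # 6, the total number of cells
print(toy.size == rows * cols)  # True
```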

Use `df.columns` to list every column in the file.

In [5]:
df.columns


Index(['State', 'Region', 'Division', 'County', 'FIPS', 'GEOID', 'SMS Region',
       'Year', 'Premature death', 'Poor or fair health',
       'Poor physical health days', 'Poor mental health days',
       'Low birthweight', 'Adult smoking', 'Adult obesity',
       'Food environment index', 'Physical inactivity',
       'Access to exercise opportunities', 'Excessive drinking',
       'Alcohol-impaired driving deaths', 'Sexually transmitted infections',
       'Teen births', 'Uninsured', 'Primary care physicians', 'Dentists',
       'Mental health providers', 'Preventable hospital stays',
       'Diabetic screening', 'Mammography screening', 'High school graduation',
       'Some college', 'Unemployment', 'Children in poverty',
       'Income inequality', 'Children in single-parent households',
       'Social associations', 'Violent crime', 'Injury deaths',
       'Air pollution - particulate matter', 'Drinking water violations',
       'Severe housing problems', 'Driving alone to work'
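Column names must match exactly, including spaces and capitalization, or pandas raises a `KeyError` when you select them. One way to check the names you plan to use before selecting them, sketched on a toy DataFrame (the toy frame is an assumption for illustration; with the real data you would check against `df.columns`):

```python
import pandas as pd

# Toy frame with a few of the column names from this dataset.
toy = pd.DataFrame(
    {
        "State": ["TX"],
        "County": ["Anderson County"],
        "Median household income": [41530],
    }
)

wanted = ["Median household income", "County", "Year"]

# Collect any wanted names that are not present in the frame.
missing = [name for name in wanted if name not in toy.columns]
print(missing)  # ['Year'] because the toy frame has no "Year" column
```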

### ***Creating the Texas Subset***

Once you have chosen a topic, filter the data with `df[df["State"] == "TX"]` to keep only the relevant rows. I chose to look at Texas.

In [6]:
df[df["State"] == "TX"]

|  | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4914 | TX | South | West South Central | Anderson County | 48001 | 48001 | Region 10 | 1/1/2014 | 10526.0 | 0.213 | ... | 15.62 | 0.317 | 0.146 | 10329.0 | 0.178 | 24.0 | 41530 | 0.521 | 6.70 | 0.303 |
| 4915 | TX | South | West South Central | Anderson County | 48001 | 48001 | Region 10 | 1/1/2015 | 10106.0 | 0.213 | ... | 14.02 | 0.300 | 0.145 | 10784.0 | 0.178 | 28.0 | 41279 | 0.534 | 7.00 |  |
| 4916 | TX | South | West South Central | Andrews County | 48003 | 48003 | Region 10 | 1/1/2014 | 7970.0 | 0.282 | ... |  | 0.315 | 0.188 | 9089.0 |  | 37.0 | 57690 | 0.200 |  |  |
| 4917 | TX | South | West South Central | Andrews County | 48003 | 48003 | Region 10 | 1/1/2015 | 8960.0 | 0.282 | ... |  | 0.290 | 0.162 | 9262.0 |  | 54.0 | 62781 | 0.173 |  |  |
| 4918 | TX | South | West South Central | Angelina County | 48005 | 48005 | Region 10 | 1/1/2014 | 8965.0 | 0.259 | ... | 7.11 | 0.309 | 0.134 | 11316.0 | 0.250 | 40.0 | 39747 | 0.588 | 5.76 | 0.257 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5378 | TX | South | West South Central | Young County | 48503 | 48503 | Region 10 | 1/1/2015 | 9829.0 | 0.251 | ... | 24.16 | 0.311 | 0.169 | 12149.0 |  | 33.0 | 44372 | 0.457 |  |  |
| 5379 | TX | South | West South Central | Zapata County | 48505 | 48505 | Region 10 | 1/1/2014 | 7267.0 | 0.275 | ... |  | 0.466 | 0.172 | 12973.0 |  | 14.0 | 33462 | 0.707 |  |  |
| 5380 | TX | South | West South Central | Zapata County | 48505 | 48505 | Region 10 | 1/1/2015 | 6366.0 | 0.275 | ... |  | 0.431 | 0.155 | 11651.0 |  | 21.0 | 34506 | 0.671 |  |  |
| 5381 | TX | South | West South Central | Zavala County | 48507 | 48507 | Region 10 | 1/1/2014 | 7854.0 |  | ... |  | 0.376 | 0.130 | 10271.0 |  | 50.0 | 25157 | 0.732 |  |  |
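The filter works in two steps: `df["State"] == "TX"` builds a column of True/False values, and putting that column inside `df[...]` keeps only the rows marked True. A minimal sketch on toy data (the toy rows are made up for illustration):

```python
import pandas as pd

# Toy rows standing in for the real dataset.
toy = pd.DataFrame(
    {
        "State": ["TX", "OK", "TX"],
        "County": ["Anderson County", "Example County", "Andrews County"],
    }
)

mask = toy["State"] == "TX"   # one True/False value per row
print(mask.tolist())          # [True, False, True]

tx_rows = toy[mask]           # keep only the rows where mask is True
print(len(tx_rows))           # 2
```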


Save the Texas rows from the larger dataset as a subset called `TX_subset` so they are easier to access. Run `TX_subset=df[df["State"] == "TX"].copy()` to create the subset, then type `TX_subset` on the next line to display it.

In [7]:
TX_subset=df[df["State"] == "TX"].copy()


TX_subset



|  | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4914 | TX | South | West South Central | Anderson County | 48001 | 48001 | Region 10 | 1/1/2014 | 10526.0 | 0.213 | ... | 15.62 | 0.317 | 0.146 | 10329.0 | 0.178 | 24.0 | 41530 | 0.521 | 6.70 | 0.303 |
| 4915 | TX | South | West South Central | Anderson County | 48001 | 48001 | Region 10 | 1/1/2015 | 10106.0 | 0.213 | ... | 14.02 | 0.300 | 0.145 | 10784.0 | 0.178 | 28.0 | 41279 | 0.534 | 7.00 |  |
| 4916 | TX | South | West South Central | Andrews County | 48003 | 48003 | Region 10 | 1/1/2014 | 7970.0 | 0.282 | ... |  | 0.315 | 0.188 | 9089.0 |  | 37.0 | 57690 | 0.200 |  |  |
| 4917 | TX | South | West South Central | Andrews County | 48003 | 48003 | Region 10 | 1/1/2015 | 8960.0 | 0.282 | ... |  | 0.290 | 0.162 | 9262.0 |  | 54.0 | 62781 | 0.173 |  |  |
| 4918 | TX | South | West South Central | Angelina County | 48005 | 48005 | Region 10 | 1/1/2014 | 8965.0 | 0.259 | ... | 7.11 | 0.309 | 0.134 | 11316.0 | 0.250 | 40.0 | 39747 | 0.588 | 5.76 | 0.257 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5378 | TX | South | West South Central | Young County | 48503 | 48503 | Region 10 | 1/1/2015 | 9829.0 | 0.251 | ... | 24.16 | 0.311 | 0.169 | 12149.0 |  | 33.0 | 44372 | 0.457 |  |  |
| 5379 | TX | South | West South Central | Zapata County | 48505 | 48505 | Region 10 | 1/1/2014 | 7267.0 | 0.275 | ... |  | 0.466 | 0.172 | 12973.0 |  | 14.0 | 33462 | 0.707 |  |  |
| 5380 | TX | South | West South Central | Zapata County | 48505 | 48505 | Region 10 | 1/1/2015 | 6366.0 | 0.275 | ... |  | 0.431 | 0.155 | 11651.0 |  | 21.0 | 34506 | 0.671 |  |  |
| 5381 | TX | South | West South Central | Zavala County | 48507 | 48507 | Region 10 | 1/1/2014 | 7854.0 |  | ... |  | 0.376 | 0.130 | 10271.0 |  | 50.0 | 25157 | 0.732 |  |  |
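The `.copy()` at the end makes `TX_subset` its own independent DataFrame. Depending on your pandas version, editing a filtered subset without `.copy()` can trigger a `SettingWithCopyWarning`. A small sketch, on made-up toy data, showing that changes to the copy leave the original untouched:

```python
import pandas as pd

# Toy rows standing in for the real dataset (the OK value is made up).
toy = pd.DataFrame(
    {"State": ["TX", "OK"], "Median household income": [41530, 50000]}
)

tx_copy = toy[toy["State"] == "TX"].copy()
tx_copy["Median household income"] = 0         # edit only the copy

print(toy.loc[0, "Median household income"])   # 41530, the original is unchanged
```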


**This subset is still too big and needs to be narrowed down to what we actually want to focus on: median incomes.**

Within `TX_subset`, you can select any of the columns listed by `df.columns` above. I selected the "Median household income", "County", and "Year" columns to narrow the data to the specific set I want.

There are different ways to preview the data. Adding `.head()` or `.tail()` to the end of the line of code shows the first or last five rows.

In [8]:
TX_subset.loc[:,["Median household income","County","Year"]].head()


|  | Median household income | County | Year |
| --- | --- | --- | --- |
| 4914 | 41530 | Anderson County | 1/1/2014 |
| 4915 | 41279 | Anderson County | 1/1/2015 |
| 4916 | 57690 | Andrews County | 1/1/2014 |
| 4917 | 62781 | Andrews County | 1/1/2015 |
| 4918 | 39747 | Angelina County | 1/1/2014 |


In [9]:
TX_subset.loc[:,["Median household income","County","Year"]].tail()

|  | Median household income | County | Year |
| --- | --- | --- | --- |
| 5378 | 44372 | Young County | 1/1/2015 |
| 5379 | 33462 | Zapata County | 1/1/2014 |
| 5380 | 34506 | Zapata County | 1/1/2015 |
| 5381 | 25157 | Zavala County | 1/1/2014 |
| 5382 | 25291 | Zavala County | 1/1/2015 |
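In `.loc[:, [...]]`, the `:` before the comma means "all rows" and the list after the comma names the columns to keep, in the order you list them. For a plain column selection like this one, `TX_subset[[...]]` should give the same result. A sketch on toy data (the "Uninsured" values are made up for illustration):

```python
import pandas as pd

# Toy rows with one extra column that we do not want to keep.
toy = pd.DataFrame(
    {
        "County": ["Anderson County", "Andrews County"],
        "Year": ["1/1/2014", "1/1/2014"],
        "Median household income": [41530, 57690],
        "Uninsured": [0.1, 0.2],
    }
)

cols = ["Median household income", "County", "Year"]

by_loc = toy.loc[:, cols]   # all rows, only the listed columns
by_brackets = toy[cols]     # shorthand for the same selection

print(by_loc.columns.tolist())
print(by_loc.equals(by_brackets))  # True
```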


Both views are useful, but they only show the rows at the very start and end of the subset. To see wider trends, we need a larger, more random sample.

Another way to look at the data is a random sample. Instead of `.head()` or `.tail()`, use `.sample(n=x)`, where x is the number of rows you want in the sample.

In [10]:
TX_subset.loc[:,["Median household income","County","Year"]].sample(n=50)


|  | Median household income | County | Year |
| --- | --- | --- | --- |
| 5251 | 49792 | Panola County | 1/1/2015 |
| 4990 | 38866 | Coke County | 1/1/2015 |
| 4959 | 43561 | Burleson County | 1/1/2014 |
| 5199 | 33330 | Marion County | 1/1/2014 |
| 5359 | 43791 | Wichita County | 1/1/2014 |
| 5342 | 43220 | Van Zandt County | 1/1/2015 |
| 4970 | 33905 | Cameron County | 1/1/2015 |
| 5368 | 60844 | Wilson County | 1/1/2015 |
| 5117 | 37610 | Hill County | 1/1/2014 |
| 5159 | 40919 | Kerr County | 1/1/2014 |
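Because `.sample()` picks rows at random, running the cell again gives a different set of rows. If you want the same sample every time, `.sample()` also accepts a `random_state` argument. The notebook above does not use it, so treat this as an optional sketch on made-up toy data:

```python
import pandas as pd

# Toy data: 20 made-up rows.
toy = pd.DataFrame(
    {"County": [f"County {i}" for i in range(20)], "Income": range(20)}
)

# The same random_state gives the same rows on every run.
first = toy.sample(n=5, random_state=0)
second = toy.sample(n=5, random_state=0)

print(len(first))            # 5
print(first.equals(second))  # True
```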


From the `TX_subset` we created earlier, make a smaller subset containing only the median household income data by county. This will be your final data.

Choose a name for this final data; I chose `Medianincometx`. To create it, use the code
`Medianincometx= TX_subset.loc[:,["Median household income","County","Year"]].sample(n=50)`

Then, on a new line, type `Medianincometx` to display the finished data. Because `.sample()` picks rows at random, your rows will differ from the ones shown below.

In [11]:
Medianincometx= TX_subset.loc[:,["Median household income","County","Year"]].sample(n=50)

Medianincometx


|  | Median household income | County | Year |
| --- | --- | --- | --- |
| 4963 | 42963 | Caldwell County | 1/1/2014 |
| 5194 | 46223 | Lubbock County | 1/1/2015 |
| 5192 | 45669 | Llano County | 1/1/2015 |
| 5295 | 36030 | San Saba County | 1/1/2015 |
| 5257 | 43766 | Pecos County | 1/1/2015 |
| 5084 | 46501 | Gregg County | 1/1/2015 |
| 5160 | 42456 | Kerr County | 1/1/2015 |
| 5262 | 32861 | Presidio County | 1/1/2014 |
| 5315 | 37076 | Swisher County | 1/1/2014 |
| 4947 | 40574 | Bowie County | 1/1/2014 |


### ***Exporting the Data***

Finally, export your final data with the line of code `Medianincometx.to_csv('Medianincometx.csv')`.

In [12]:
Medianincometx.to_csv('Medianincometx.csv')

With that code entered, you should have a new file called `Medianincometx.csv` in your notebook's files.

Once you see this file, you can download it to your computer for whatever use you may have for it.
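One thing to know about the exported file: by default `to_csv` also writes the row labels (4914, 4915, and so on) as an unnamed first column, which pandas names `Unnamed: 0` when the file is read back in. If you do not want that column, you can pass `index=False`. This is an optional variation, not what the cell above does. A sketch using a temporary folder and a two-row toy DataFrame:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the Texas data, with their original row labels.
toy = pd.DataFrame(
    {
        "Median household income": [41530, 41279],
        "County": ["Anderson County", "Anderson County"],
        "Year": ["1/1/2014", "1/1/2015"],
    },
    index=[4914, 4915],
)

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    toy.to_csv(with_index)                  # default: row labels are written
    toy.to_csv(without_index, index=False)  # row labels are left out

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # starts with 'Unnamed: 0'
print(cols_without)  # only the three data columns
```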