## Computational Data Instructions

This section gives instructions for using computational methods to create a new subset of the County Health Data. It covers the programs used, the code for exploring the data, and the process of creating the new subset.

### Tools

The program used for this data set was Jupyter Lab, which allows the use of Python. The Pandas package is used to manipulate the data.

1. Create a folder on your computer for easy access to the files you will use. Your notebook, data, and other related files should be saved here.
2. Download the [County Health Data_2014-2015](https://uncch.instructure.com/courses/11001/files/1951171?wrap=1) and save it to the folder you created on your computer. This particular data set is a .csv file.


### Get Started

The next steps explain how to create the notebook and access the data.

3. Launch your program from the folder you created. These instructions use `Jupyter Lab`.
4. Create a new notebook using the graphical user interface.

The Pandas package needs to be imported in order to manipulate the data set. The command to import Pandas is `import pandas as pd`.

In [23]:
import pandas as pd

Once Pandas has been imported, the next step is to access your data. The County Health Data should be saved in the same folder as your notebook. If the .csv file is already in that directory, the function `pd.read_csv()` is all that is needed to create a dataframe.

- The dataframe needs a name so it can be referred to later. Here we'll assign it to the variable `rawdata`.

5. To retrieve this data use the command: `rawdata=pd.read_csv("CountyHealthData_2014-2015.csv")`

In [24]:
rawdata=pd.read_csv("CountyHealthData_2014-2015.csv")
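
If the file name or folder is wrong, `pd.read_csv()` stops with a `FileNotFoundError`. The sketch below is one optional way to check for the file first and report which folder was searched. The helper name `load_county_data` is an assumption made for illustration; it is not part of Pandas or of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_county_data(csv_path="CountyHealthData_2014-2015.csv"):
    """Read the County Health Data file into a dataframe.

    `load_county_data` is a hypothetical helper, not a Pandas function.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Report the folder that was searched so a misplaced file is easy to spot.
        raise FileNotFoundError(
            f"{path.name} was not found in {path.resolve().parent}"
        )
    return pd.read_csv(path)
```

In the notebook this could be called as `rawdata = load_county_data()`, which gives the same dataframe as step 5 when the file is in place.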

### Explore the Data

This portion introduces the commands that will help explore and understand the data in our set. 
- One thing to note about this particular data set is that the "State" column is organized in alphabetical order.

6. The `rawdata.shape` attribute gives the number of rows and columns in the data set.
7. The `rawdata.size` attribute gives the total number of cells in the data set.

In [11]:
rawdata.shape

(6109, 64)

In [12]:
rawdata.size

390976
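
The two values are related: `size` is the number of rows multiplied by the number of columns, which for this data set is 6109 × 64 = 390976. A minimal sketch of that relationship follows, using a small dataframe named `toy` whose values are made up for illustration and are not taken from the County Health Data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "State": ["NC", "NC", "VA"],
        "County": ["County A", "County B", "County C"],
        "Diabetes": [0.10, 0.12, 0.11],
    }
)

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)       # 3 3
print(toy.size)             # 9, the same as n_rows * n_cols
```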

There are two different attributes that retrieve the column names.

- The attribute `rawdata.columns` provides the column names one after the other.
- The attribute `rawdata.dtypes` provides the column names alongside the data type of each. This type of list may be easier to read at a glance.

In [14]:
rawdata.columns

Index(['State', 'Region', 'Division', 'County', 'FIPS', 'GEOID', 'SMS Region',
       'Year', 'Premature death', 'Poor or fair health',
       'Poor physical health days', 'Poor mental health days',
       'Low birthweight', 'Adult smoking', 'Adult obesity',
       'Food environment index', 'Physical inactivity',
       'Access to exercise opportunities', 'Excessive drinking',
       'Alcohol-impaired driving deaths', 'Sexually transmitted infections',
       'Teen births', 'Uninsured', 'Primary care physicians', 'Dentists',
       'Mental health providers', 'Preventable hospital stays',
       'Diabetic screening', 'Mammography screening', 'High school graduation',
       'Some college', 'Unemployment', 'Children in poverty',
       'Income inequality', 'Children in single-parent households',
       'Social associations', 'Violent crime', 'Injury deaths',
       'Air pollution - particulate matter', 'Drinking water violations',
       'Severe housing problems', 'Driving alone to work'

In [32]:
rawdata.dtypes 

State                                object
Region                               object
Division                             object
County                               object
FIPS                                  int64
                                     ...   
Other primary care providers        float64
Median household income               int64
Children eligible for free lunch    float64
Homicide rate                       float64
Inadequate social support           float64
Length: 64, dtype: object
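
The `...` in the output above is Pandas shortening a long listing. One way to see every column is to lift the row limit temporarily with `pd.option_context`. The sketch below uses a small dataframe named `toy` with made-up values; in the notebook the same pattern would apply to `rawdata`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "State": ["NC", "VA"],
        "FIPS": [1, 2],
        "Diabetes": [0.10, 0.12],
    }
)

# Column names as a plain Python list.
column_names = toy.columns.tolist()
print(column_names)

# Column names next to their data types, with no rows hidden.
with pd.option_context("display.max_rows", None):
    print(toy.dtypes)
```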

### Creating the New Subset

The next section focuses on manipulating the data set to create digestible data tables. The tables focus on the prevalence of diabetes by state and region.

- The `.value_counts()` function counts how many times each value appears in a specific column. It is useful for gauging how often each value in the "Region" and "State" columns occurs in our data set.

8. To get the frequency of each value in the "Region" column, use `rawdata.Region.value_counts()`

9. To get the frequency of each value in the "State" column, use `rawdata.State.value_counts()`

In [6]:
rawdata.Region.value_counts()

South        2803
Midwest      2038
West          834
Northeast     434
Name: Region, dtype: int64

In [30]:
rawdata.State.value_counts()

TX    469
GA    318
VA    266
KY    240
MO    229
IL    204
NC    200
KS    199
IA    198
TN    190
IN    184
OH    176
MN    174
MI    164
MS    163
NE    157
OK    154
AR    150
WI    144
FL    134
PA    134
AL    134
LA    128
NY    124
CO    119
SD    117
CA    114
WV    110
ND     92
MT     92
SC     92
ID     84
WA     78
OR     67
NM     64
UT     54
MD     48
AK     46
WY     46
NJ     42
NV     32
ME     32
AZ     30
MA     28
VT     28
NH     20
CT     16
RI     10
HI      8
DE      6
DC      1
Name: State, dtype: int64
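
The counts above can also be expressed as shares of the whole data set by passing `normalize=True`. The sketch below shows both forms on a small dataframe named `toy` with made-up values. It uses the bracket form `toy["Region"]`, which selects the same column as `toy.Region` and also works for column names that contain spaces.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Region": ["South", "South", "Midwest", "West"]})

counts = toy["Region"].value_counts()                # number of rows per region
shares = toy["Region"].value_counts(normalize=True)  # fraction of rows per region

print(counts)
print(shares)
```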

The data that will be looked at more closely is for the state of North Carolina. After finding the first rows with NC in the "State" column, the command `["State"][ : ]` was used to confirm that the row numbers matched. This kind of slice stops before its end value, so `[3244:3253]` returns the nine rows 3244 through 3252.

10. Use the command `rawdata["State"][3244:3253]` to confirm the state and dtype.


In [35]:
rawdata["State"][3244:3253]

3244    NC
3245    NC
3246    NC
3247    NC
3248    NC
3249    NC
3250    NC
3251    NC
3252    NC
Name: State, dtype: object
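
A bracket slice such as `[3244:3253]` stops before its end value, which is why the output above lists nine rows. A `.loc` slice, by contrast, includes its end label. The sketch below shows the difference on a small dataframe named `toy` with made-up values and the default row numbering.

```python
import pandas as pd

# Made-up rows for illustration only; the index runs 0, 1, 2, 3, 4.
toy = pd.DataFrame({"State": ["NC", "NC", "NC", "ND", "ND"]})

by_position = toy["State"][1:3]   # end-exclusive: rows 1 and 2
by_label = toy.loc[1:3, "State"]  # end-inclusive: rows 1, 2, and 3

print(len(by_position))  # 2
print(len(by_label))     # 3
```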

The following commands display the data as a table with the columns "State", "Region", "County", and "Diabetes". The table shows the prevalence of diabetes in North Carolina alongside the county name. Specific rows and columns can be selected with `.loc`. Unlike the slice in step 10, `.loc` includes its end label, so rows 3244 through 3253 are all returned.

11. Use the command `rawdata.loc[3244:3253,["State", "Region", "County", "Diabetes"]]` to create a table.

In [36]:
rawdata.loc[3244:3253,["State", "Region", "County", "Diabetes"]]

| | State | Region | County | Diabetes |
|---|---|---|---|---|
| 3244 | NC | South | Alamance County | 0.122 |
| 3245 | NC | South | Alexander County | 0.108 |
| 3246 | NC | South | Alexander County | 0.106 |
| 3247 | NC | South | Alleghany County | 0.114 |
| 3248 | NC | South | Alleghany County | 0.113 |
| 3249 | NC | South | Anson County | 0.133 |
| 3250 | NC | South | Anson County | 0.136 |
| 3251 | NC | South | Ashe County | 0.096 |
| 3252 | NC | South | Ashe County | 0.1 |
| 3253 | NC | South | Avery County | 0.105 |


- The "Region" column is included in this table to stay consistent with the next one.
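
Row numbers such as 3244 have to be looked up by hand. As an alternative, `.loc` also accepts a condition, so the North Carolina rows can be selected by their value in the "State" column. This is a sketch of that approach on a small dataframe named `toy` with made-up values; it is not the method used in the original notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "State": ["NC", "NC", "VA"],
        "Region": ["South", "South", "South"],
        "County": ["County A", "County B", "County C"],
        "Diabetes": [0.10, 0.12, 0.11],
    }
)

columns = ["State", "Region", "County", "Diabetes"]

# Keep only the rows where State is NC, whatever their row numbers are.
nc_rows = toy.loc[toy["State"] == "NC", columns]
print(nc_rows.head(10))
```

With the real data, replacing `toy` with `rawdata` would return every North Carolina row, and `.head(10)` would limit the display to the first ten.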

The last step takes a random sample using the same columns as the table above. In steps 8 and 9, `.value_counts()` helped gauge how many times each value appeared in the data set. That information is relevant when analyzing the random sample in the following table, since states and regions with more rows are more likely to be drawn. The last table again uses `.loc`, along with `.sample(n= )` to draw a random sample of rows from the data set.

12. Use the command `rawdata.loc[:,["State", "Region", "County", "Diabetes"]].sample(n=10)` to generate the table with a random sample from the data.

In [31]:
rawdata.loc[:,["State","Region","County","Diabetes"]].sample(n=10)

| | State | Region | County | Diabetes |
|---|---|---|---|---|
| 3853 | NY | Northeast | Allegany County | 0.109 |
| 4066 | OH | Midwest | Lorain County | 0.116 |
| 5613 | VA | South | Norton city | 0.099 |
| 5691 | VA | South | Westmoreland County | 0.128 |
| 1900 | KS | Midwest | Riley County | 0.06 |
| 2572 | MI | Midwest | St. Clair County | 0.101 |
| 4978 | TX | South | Castro County | 0.103 |
| 4162 | OK | South | Bryan County | 0.118 |
| 1503 | IL | Midwest | Perry County | 0.1 |
| 3550 | NE | Midwest | Burt County | 0.109 |
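
Because the sample is random, running the cell again returns different rows. If the same rows are needed each time, `.sample()` accepts a `random_state` seed. The sketch below uses a small dataframe named `toy` with made-up values, and the seed 42 is an arbitrary choice.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "State": ["NC", "VA", "TX", "OH", "NY", "KS"],
        "Diabetes": [0.10, 0.12, 0.11, 0.09, 0.13, 0.08],
    }
)

# The same seed returns the same rows each time the cell is run.
first = toy.sample(n=3, random_state=42)
second = toy.sample(n=3, random_state=42)
print(first.equals(second))  # True
```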


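Finally, save the dataframe to a new .csv file with `.to_csv()`. The second call adds `index=False`, which leaves the row numbers out of the saved file.

In [50]: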
rawdata.to_csv("rawdata.csv")

In [52]:
rawdata.to_csv("rawdata.csv", index=False)
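
To see what `index=False` changes, the sketch below writes a small dataframe both ways and reads each result back. The dataframe `toy` holds made-up values, and the CSV text is kept in memory rather than written to disk so the example leaves no files behind.

```python
import io

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"State": ["NC", "VA"], "Diabetes": [0.10, 0.12]})

with_index = toy.to_csv()                # no path given, so the CSV text is returned
without_index = toy.to_csv(index=False)

# Reading the text back shows the extra column that a saved index creates.
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'State', 'Diabetes']
print(cols_without_index)  # ['State', 'Diabetes']
```

A file saved with its index gains an extra "Unnamed: 0" column when it is read again, so `index=False` is usually the better choice when the row numbers carry no meaning.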