# Health Data Set for the Southern Region of the United States

The purpose of the Southern Health Data Set is to examine several relevant health care variables across the Southern region of the United States. Specifically, it compiles the average number of poor mental health days, the percentage of people with poor or fair health, and health care costs across different counties. It also shows which states the counties are in to aid in comparing outcomes across the region. 

### Overview of Tutorial

This notebook will show you the steps to reproduce `SO_HealthDataSet.csv`. Its data comes from `CountyHealthData_2014-2015.csv`, which provides county-based health care statistics across the United States.

There are several major steps to recreate the Southern Health Data Set. Over the tutorial, you will:
1. Create a data frame from `CountyHealthData_2014-2015.csv`
2. Isolate the data from the South
3. Extract only the required variables
4. Write that data to a new CSV file

### System Requirements 

Besides access to JupyterLab, you will need [Pandas](https://pandas.pydata.org/docs/getting_started/install.html) and [NumPy](https://numpy.org/install/) installed to reproduce this data set.

You also need to download the [CountyHealthData_2014-2015.csv](https://uncch.instructure.com/courses/4844/assignments/96397?module_item_id=188006).

### 1. Import NumPy and Pandas

First, to reproduce the Southern Health Data Set, you must import both NumPy and Pandas.

In [1]:
import numpy as np
import pandas as pd

Now you should be able to use the necessary features from NumPy and Pandas. Using `as` assigns each library a short alias (`np` and `pd`), so you do not have to write the full library name every time you call one of its functions.

Pandas is built on top of NumPy, which handles its numerical computations.
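As a quick check that both imports worked, the sketch below prints the installed versions and shows that a Pandas column can be handed to NumPy as an array. The three cost values and the names `costs`, `values`, and `mean_cost` are illustrative only and are not part of the tutorial's data set.

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm both libraries imported correctly.
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

# Illustrative values only (not the real data set).
costs = pd.Series([10219.0, 9939.0, 9624.0])

# A Pandas Series can be converted to a NumPy array.
values = costs.to_numpy()
print(type(values))

# NumPy functions work directly on that array.
mean_cost = np.mean(values)
print(mean_cost)
```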

### 2. Create the Data Frame

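The Southern Health Data Set comes from `CountyHealthData_2014-2015.csv`. To create the data frame, use the command `pd.read_csv()`, which reads comma-separated values (CSV) files. In the parentheses, enter the file name as a quoted string: `"CountyHealthData_2014-2015.csv"`. The file must be in the same folder as the notebook, or you must give its full path.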

In [33]:
df=pd.read_csv("CountyHealthData_2014-2015.csv")
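After loading a file, it can help to confirm that it was read as expected. The sketch below uses a small in-memory stand-in instead of the real file, so the rows, the reduced set of columns, and the names `sample_csv` and `sample_df` are all hypothetical. The same inspection calls should work on `df`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three made-up rows.
sample_csv = io.StringIO(
    "State,County,Region,Health care costs\n"
    "AL,Autauga County,South,10219.0\n"
    "NY,Albany County,Northeast,8500.0\n"
    "WV,Wood County,South,10707.0\n"
)

# read_csv accepts a file path or a file-like object.
sample_df = pd.read_csv(sample_csv)

# Number of rows and columns.
print(sample_df.shape)

# Column names, useful for checking spelling before selecting columns.
print(sample_df.columns.tolist())

# First few rows.
print(sample_df.head())
```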

### 3. Create a Data Subset for the South

Now that we have the data for all regions, we must isolate the data for the South. Essentially, in this step we are creating a subset by extracting a smaller data frame from the already existing larger data frame. 

First we must name the new subset using `SO_subset =`. 

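Then, we use the code `df[df["Region"] == "South"]`. The inner expression `df["Region"] == "South"` checks each row's region, and the outer `df[...]` keeps only the rows where that check is true. The new subset therefore comes from the data frame we made in step two but contains only the entries whose region matches the value after `==`.

The final part of the code, `.copy()`, makes the subset an independent data frame, so later changes to `SO_subset` do not affect the original `df`.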

In [26]:
SO_subset = df[df["Region"] == "South"].copy() 
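To see what the filter and `.copy()` each do, here is a minimal sketch on a toy data frame. The rows and the names `toy`, `mask`, and `south` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy data frame with made-up rows.
toy = pd.DataFrame({
    "County": ["Autauga County", "Albany County", "Wood County"],
    "Region": ["South", "Northeast", "South"],
})

# The comparison produces one True/False value per row.
mask = toy["Region"] == "South"
print(mask.tolist())

# Indexing with the mask keeps only the True rows; copy() makes it independent.
south = toy[mask].copy()
print(south)

# Changing the copy leaves the original data frame untouched.
south["Region"] = "S"
print(toy["Region"].tolist())
```

Note that the subset keeps the original row labels (here 0 and 2), which is why the index in the tutorial's output does not start at zero.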

### 4. Extract the Relevant Health Data

This is an important step because it will:
1. Name the final data set
2. Extract our chosen variables

The purpose of the Southern Health Data Set is to analyze only some of the available variables, not all of them. We also want to compare those variables across different states. So, we must narrow the data further.

Similar to the last step, we are creating another subset. In the last step, we selected the cases (rows) from which we will pull our data. Begin the code by naming the new subset using `SO_HealthSubset=`.

Then, we must state from where we will pull our variables, in this case "SO_subset". So, right after `=`, enter `SO_subset`.

Since we have our cases now, we will extract the variables (columns) we want in our final product. `.loc` lets us select specific columns by name while keeping their headers. After `.loc`, use `[:, ["State", "County", "Poor or fair health", "Poor mental health days", "Health care costs"]]`. The `:` before the comma keeps every row, and the list after the comma tells pandas which columns we want.

In [27]:
SO_HealthSubset=SO_subset.loc[:,["State","County", "Poor or fair health", "Poor mental health days", "Health care costs"]]
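The same `.loc` pattern can be tried on a small data frame first. In this sketch the two rows and the names `toy` and `selected` are hypothetical, and only three columns are selected to keep it short.

```python
import pandas as pd

# Toy data frame with made-up rows and non-consecutive row labels.
toy = pd.DataFrame(
    {
        "State": ["AL", "WV"],
        "County": ["Autauga County", "Wood County"],
        "Region": ["South", "South"],
        "Health care costs": [10219.0, 10707.0],
    },
    index=[46, 6059],
)

# ":" keeps all rows; the list names the columns to keep, in order.
selected = toy.loc[:, ["State", "County", "Health care costs"]]
print(selected)

# Columns not in the list (here "Region") are dropped.
print(selected.columns.tolist())
```

If a name in the list does not match a column exactly, including spaces and capitalization, pandas raises a `KeyError`, so it is worth checking the column names before running the selection.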

In [28]:
SO_HealthSubset

|  | State | County | Poor or fair health | Poor mental health days | Health care costs |
|---|---|---|---|---|---|
| 46 | AL | Autauga County | 0.228 | 3.6 | 10219.0 |
| 47 | AL | Autauga County | 0.228 | 3.6 | 9939.0 |
| 48 | AL | Baldwin County | 0.127 | 3.8 | 9624.0 |
| 49 | AL | Baldwin County | 0.127 | 3.8 | 9502.0 |
| 50 | AL | Barbour County | 0.234 | 4.3 | 10809.0 |
| ... | ... | ... | ... | ... | ... |
| 6058 | WV | Wirt County | 0.161 | 2.5 | 10459.0 |
| 6059 | WV | Wood County | 0.205 | 5.1 | 10707.0 |
| 6060 | WV | Wood County | 0.205 | 5.1 | 10355.0 |
| 6061 | WV | Wyoming County | 0.348 | 7.7 | 10662.0 |

### 5. Create a New CSV from the Data Subset

For this step, we write `SO_HealthSubset` to a CSV file using the command `.to_csv("", index=False)`. Before the period, enter `SO_HealthSubset` to tell pandas which data frame to write. In the parentheses, put the name of the new file, `"SO_HealthDataSet.csv"`, including the `.csv` extension so the file matches the name given in the overview.

We must also use `index=False` because, by default, pandas writes the row index as an extra first column in the file. This option leaves that column out.

In [7]:
SO_HealthSubset.to_csv("SO_HealthDataSet.csv", index=False)
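One way to confirm the export worked is to read the file back in. The sketch below does this with a toy data frame written to a temporary folder. The file name `toy_subset.csv`, the rows, and the names `toy`, `header`, and `reloaded` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy data frame with made-up rows and non-consecutive row labels.
toy = pd.DataFrame(
    {"State": ["AL", "WV"], "Health care costs": [10219.0, 10707.0]},
    index=[46, 6059],
)

# Write to a temporary folder so nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_subset.csv")
    toy.to_csv(path, index=False)

    # The first line of the file holds only the column names, no index column.
    with open(path) as f:
        header = f.readline().strip()

    # Reading the file back gives a fresh default index.
    reloaded = pd.read_csv(path)

print(header)
print(reloaded)
```

Because the old row labels were not written, the reloaded data frame is numbered from zero again.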