# **Comparing the Rate of Teen Births in North Carolina With the Rate in Georgia From 2014 to 2015**

## Overview of Notebook:
### *This notebook documents the steps needed to compile a dataset comparing the rate of teen births in North Carolina with the rate in Georgia by culling data from the 2014-2015 National Public Health Data reports.*
### *Follow the step-by-step instructions below in order to reproduce these datasets.*

## Coding Jargon Used in the Notebook:

**Run** = (Shift + Enter)

**Cell** = A box in the notebook that holds either code or text

**Dataframe model** = df[ ]

### **Warning**: *Do **not** add spacing unless directed otherwise*

------

# **Getting Started**

## Running Pandas
*To get started, first import the pandas package in order to store your data in dataframes with rows and columns.*
1. Type **import pandas as pd**
2. Press enter
3. Type **import numpy as np**
4. Run the cell


In [4]:
import pandas as pd
import numpy as np

## Uploading the National Public Health Data
*In order to use the National Public Health Data, you first must upload it to this notebook.*
1. Upload the csv file titled "CountyHealthData_2014-2015" to the content folder on the left-hand side of this notebook
2. Define the dataframe (**df**) by typing and running the following command:
  - df=pd.read_csv("CountyHealthData_2014-2015.csv")

In [6]:
df=pd.read_csv("CountyHealthData_2014-2015.csv")

### The **CountyHealthData_2014-2015** file should now appear in your content folder.
### *Make sure you check before continuing!*
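
If you would rather confirm the upload in code, a quick look at the dataframe's shape and columns works too. The sketch below is only an illustration: `sample_csv` and `sample_df` are made-up names, and the inline text stands in for the real file, using four rows and three columns copied from outputs shown later in this notebook. On the real data you would run the same three checks on **df**.

```python
import io
import pandas as pd

# Stand-in for CountyHealthData_2014-2015.csv: four rows copied from the
# outputs later in this notebook, with only three of the real columns.
sample_csv = io.StringIO(
    "State,Year,Teen births\n"
    "GA,1/1/2014,91.6\n"
    "GA,1/1/2015,90.2\n"
    "NC,1/1/2014,44.2\n"
    "NC,1/1/2015,42.1\n"
)

sample_df = pd.read_csv(sample_csv)

# Quick checks that the load worked: size, column names, first rows.
print(sample_df.shape)
print(sample_df.columns.tolist())
print(sample_df.head())
```

If the shape or column names look wrong, the upload is the first thing to recheck.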

###*Now you can filter the dataset by separating the relevant information from the other data.*

-----

## Finding Value Counts
*It may be valuable to see how many rows the dataset has for each state. Each row is one county in one year.*

1. Type and run the following command in the cell to pull up value counts for each state in the dataset:

In [7]:
df.State.value_counts()

TX    469
GA    318
VA    266
KY    240
MO    229
IL    204
NC    200
KS    199
IA    198
TN    190
IN    184
OH    176
MN    174
MI    164
MS    163
NE    157
OK    154
AR    150
WI    144
FL    134
PA    134
AL    134
LA    128
NY    124
CO    119
SD    117
CA    114
WV    110
ND     92
MT     92
SC     92
ID     84
WA     78
OR     67
NM     64
UT     54
MD     48
AK     46
WY     46
NJ     42
NV     32
ME     32
AZ     30
MA     28
VT     28
NH     20
CT     16
RI     10
HI      8
DE      6
DC      1
Name: State, dtype: int64
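
The counts for NC (200) and GA (318) are worth noting, because a complete subset for either state should have that many rows. As a small sketch of pulling out just those two counts, the example below builds a made-up `sample_df` with a handful of state codes (not the real data) and looks up NC and GA in the result. It also uses bracket notation, `sample_df["State"]`, which selects the same column as the dot notation in the command above.

```python
import pandas as pd

# Made-up stand-in for df: only the State column, with a few rows.
sample_df = pd.DataFrame({"State": ["GA", "GA", "GA", "NC", "NC", "TX"]})

# Count how many rows each state has.
counts = sample_df["State"].value_counts()

# Look up only the two states this notebook compares.
print(counts.loc[["NC", "GA"]])
```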

---

# **North Carolina Data Subsets**

## Filtering North Carolina Data
*Follow these steps to separate North Carolina data from the other states.*
1. Start with the dataframe model **df[]**
2. Run the command **df[df["State"] == "NC"]** to pull up every row of data pertaining to NC, with all of its columns.

In [8]:
df[df["State"] == "NC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176


### *Good job! You have created your first table containing NC data.*
#### If you got an error message, go back and make sure the command was typed correctly.

#### *Still getting an error? Make sure the csv file was uploaded correctly by going back to the "Uploading the National Public Health Data" instructions at the beginning of the notebook.*
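
To see why this command works, it can help to look at the inner part, `df["State"] == "NC"`, by itself. It produces one True or False value per row, and the outer `df[...]` keeps only the rows marked True. The sketch below shows this on a made-up four-row `sample_df` (values copied from outputs elsewhere in this notebook); `is_nc` and `nc_rows` are illustrative names.

```python
import pandas as pd

# Made-up stand-in for df with two GA rows and two NC rows.
sample_df = pd.DataFrame({
    "State": ["GA", "GA", "NC", "NC"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [91.6, 90.2, 44.2, 42.1],
})

# The comparison alone gives one True/False value per row.
is_nc = sample_df["State"] == "NC"
print(is_nc.tolist())

# Passing those values back to the dataframe keeps only the True rows.
nc_rows = sample_df[is_nc]
print(nc_rows)
```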

## Narrowing Down NC Columns
*Now narrow the data down further to see the rate of teen births in NC.*
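1. Specify which rows you want to pull data from:
  - This notebook uses the row range **3245** to **3443**. The first number is included and the last is not, so the range ends at row **3442**
  - Note that the filtered table above shows NC data starting at row **3243**, so this range leaves out the first two NC rows (Alamance County) and returns 198 of the 200 NC rows. To include them, start the range at **3243** instead; the outputs shown below reflect the **3245** start
2. Type the command **df[["State", "Year", "Teen births"]][3245:3443]**
  - This will pull the State, Year, and Teen births columns for the specified rows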
3. Run the cell

In [9]:
df[["State", "Year", "Teen births"]][3245:3443]

Unnamed: 0,State,Year,Teen births
3245,NC,1/1/2014,44.2
3246,NC,1/1/2015,42.1
3247,NC,1/1/2014,53.8
3248,NC,1/1/2015,57.5
3249,NC,1/1/2014,60.7
...,...,...,...
3438,NC,1/1/2015,57.3
3439,NC,1/1/2014,48.8
3440,NC,1/1/2015,46.8
3441,NC,1/1/2014,40.2


### *Do you see a table with three columns: State, Year, and Teen births?*
###*Is it revealing only **North Carolina** data?*
###***Perfect!***
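
Typing row numbers by hand works, but it is easy to be off by a row or two. One alternative, sketched below under the assumption that the State column is filled in for every row, is to let pandas find the rows for you with **.loc**, which takes a row filter and a list of columns together. The example uses a made-up four-row `sample_df` (values copied from this notebook's outputs); `columns`, `nc_by_filter`, and `nc_by_slice` are illustrative names.

```python
import pandas as pd

# Made-up stand-in for df with two GA rows and two NC rows.
sample_df = pd.DataFrame({
    "State": ["GA", "GA", "NC", "NC"],
    "County": ["Atkinson County", "Atkinson County",
               "Alexander County", "Alexander County"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [91.6, 90.2, 44.2, 42.1],
})

columns = ["State", "Year", "Teen births"]

# Row filter and column list in one step: no row numbers needed.
nc_by_filter = sample_df.loc[sample_df["State"] == "NC", columns]

# The row-number approach: rows 2 and 3 (the end of the range, 4, is excluded).
nc_by_slice = sample_df[columns][2:4]

# Check that both approaches return the same table for this sample.
print(nc_by_filter.equals(nc_by_slice))

# Check that the subset has as many rows as value_counts reports for NC.
print(len(nc_by_filter) == sample_df["State"].value_counts()["NC"])
```

On the real data, the equivalent command would be `df.loc[df["State"] == "NC", ["State", "Year", "Teen births"]]`, and comparing its length with the NC value count is a quick way to confirm no rows were missed.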

## Defining the North Carolina Data
*Follow these steps to define this dataset as NCbirths in order to recall it easily later.*
1. Type **NCbirths=df[["State", "Year", "Teen births"]][3245:3443]**
2. Run the cell

In [10]:
NCbirths=df[["State", "Year", "Teen births"]][3245:3443]

*Now this data can be recalled.*
1. Type **print(NCbirths)**
2. Run the cell

In [11]:
print(NCbirths)

     State      Year  Teen births
3245    NC  1/1/2014         44.2
3246    NC  1/1/2015         42.1
3247    NC  1/1/2014         53.8
3248    NC  1/1/2015         57.5
3249    NC  1/1/2014         60.7
...    ...       ...          ...
3438    NC  1/1/2015         57.3
3439    NC  1/1/2014         48.8
3440    NC  1/1/2015         46.8
3441    NC  1/1/2014         40.2
3442    NC  1/1/2015         41.7

[198 rows x 3 columns]


### *You should see a table similar to the one from the previous step. It should include the State, Year, and Teen births data for **NC**, only without the gridlines.*
###*If you see the error **NameError**, go back to the "Defining the North Carolina Data" step to make sure everything is typed correctly.*

##***Now it's time to move on to culling the Georgia data.***

---

# **Georgia Data Subsets**

## Filtering Georgia Data
*Follow these steps to separate Georgia data from the other states.*
1. Start with the dataframe model **df[]**
2. Run the command **df[df["State"] == "GA"]** to pull up every row of data pertaining to GA, with all of its columns.

In [12]:
df[df["State"] == "GA"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
750,GA,South,South Atlantic,Appling County,13001,13001,Region 15,1/1/2014,11116.0,0.239,...,,0.305,0.128,9763.0,0.242,71.0,35757,0.588,9.54,
751,GA,South,South Atlantic,Appling County,13001,13001,Region 15,1/1/2015,11232.0,0.239,...,8.66,0.302,0.104,10563.0,0.242,76.0,36915,0.601,12.60,
752,GA,South,South Atlantic,Atkinson County,13003,13003,Region 24,1/1/2014,10394.0,0.296,...,,0.416,0.183,10290.0,,12.0,28693,0.674,,
753,GA,South,South Atlantic,Atkinson County,13003,13003,Region 24,1/1/2015,11413.0,0.296,...,,0.383,0.141,10454.0,,12.0,30883,0.743,,
754,GA,South,South Atlantic,Bacon County,13005,13005,Region 1,1/1/2014,10792.0,0.132,...,,0.328,0.125,12040.0,,36.0,32460,0.529,,0.153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1063,GA,South,South Atlantic,Wilkes County,13317,13317,Insuff Data,1/1/2015,10365.0,,...,,0.250,0.078,10019.0,0.153,10.0,33662,0.663,,
1064,GA,South,South Atlantic,Wilkinson County,13319,13319,Region 24,1/1/2014,10247.0,0.230,...,,0.245,0.104,9525.0,,42.0,36219,0.746,,
1065,GA,South,South Atlantic,Wilkinson County,13319,13319,Region 24,1/1/2015,10280.0,0.230,...,,0.238,0.089,10054.0,,32.0,35745,0.763,,
1066,GA,South,South Atlantic,Worth County,13321,13321,Region 24,1/1/2014,7936.0,0.349,...,,0.285,0.095,9232.0,0.288,28.0,35815,0.616,8.54,


### *Good! You have created your first table containing GA data.*

## Narrowing Down GA Columns
*Now narrow the data down further to see the rate of teen births in GA.*
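1. Specify which rows you want to pull data from:
  - This notebook uses the row range **752** to **1068**. The first number is included and the last is not, so the range ends at row **1067**
  - Note that the filtered table above shows GA data starting at row **750**, so this range leaves out the first two GA rows (Appling County) and returns 316 of the 318 GA rows. To include them, start the range at **750** instead; the outputs shown below reflect the **752** start
2. Type the command **df[["State", "Year", "Teen births"]][752:1068]**
  - This will pull the State, Year, and Teen births columns for the specified rows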
3. Run the cell

In [13]:
df[["State", "Year", "Teen births"]][752:1068]

Unnamed: 0,State,Year,Teen births
752,GA,1/1/2014,91.6
753,GA,1/1/2015,90.2
754,GA,1/1/2014,91.6
755,GA,1/1/2015,93.3
756,GA,1/1/2014,41.3
...,...,...,...
1063,GA,1/1/2015,49.2
1064,GA,1/1/2014,61.5
1065,GA,1/1/2015,53.7
1066,GA,1/1/2014,48.4


### *Do you see a table with three columns: State, Year, and Teen births?*
###*Is it revealing only **Georgia** data?*
###***Great!***

## Defining the Georgia Data
*Follow these steps to define this dataset as GAbirths in order to recall it easily later.*
1. Type **GAbirths=df[["State", "Year", "Teen births"]][752:1068]**
2. Run the cell

In [14]:
GAbirths=df[["State", "Year", "Teen births"]][752:1068]

*Now this data can be recalled.*
1. Type **print(GAbirths)**
2. Run the cell

In [15]:
print(GAbirths)

     State      Year  Teen births
752     GA  1/1/2014         91.6
753     GA  1/1/2015         90.2
754     GA  1/1/2014         91.6
755     GA  1/1/2015         93.3
756     GA  1/1/2014         41.3
...    ...       ...          ...
1063    GA  1/1/2015         49.2
1064    GA  1/1/2014         61.5
1065    GA  1/1/2015         53.7
1066    GA  1/1/2014         48.4
1067    GA  1/1/2015         49.2

[316 rows x 3 columns]


### *You should see a table similar to the one from the previous step. It should include the State, Year, and Teen births data for **GA**, only without the gridlines.*
###*If you see the error **NameError**, go back to the "Defining the Georgia Data" step to make sure everything is typed correctly.*
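
Because the NC and GA steps are the same apart from the state code, they could also be wrapped in a small reusable function. The sketch below is one way to do that; `state_births` is a hypothetical helper written for this example, not part of pandas, and `sample_df` is a made-up four-row stand-in for **df** with values copied from this notebook's outputs.

```python
import pandas as pd


def state_births(frame, state):
    """Return the State, Year, and Teen births columns for one state."""
    columns = ["State", "Year", "Teen births"]
    return frame.loc[frame["State"] == state, columns]


# Made-up stand-in for df with two GA rows and two NC rows.
sample_df = pd.DataFrame({
    "State": ["GA", "GA", "NC", "NC"],
    "County": ["Atkinson County", "Atkinson County",
               "Alexander County", "Alexander County"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [91.6, 90.2, 44.2, 42.1],
})

# One call per state replaces the separate filter and slice steps.
ga_births = state_births(sample_df, "GA")
nc_births = state_births(sample_df, "NC")
print(ga_births)
print(nc_births)
```

A function like this picks rows by state code, so it does not depend on knowing where each state's rows begin and end.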

----

## Linking NC and GA Teen Birth Data
*Since these two dataframes have the same three columns, you can combine them with the **pd.concat** function, which concatenates, or links, dataframes.*
  - This is where your newly defined data subsets come in handy: **GAbirths** and **NCbirths**
1. Type **pd.concat([GAbirths,NCbirths],axis=0,ignore_index=True,sort=False)**
  - **axis=0** stacks the NC rows below the GA rows, and **ignore_index=True** renumbers the rows starting from 0
2. Run the cell

In [16]:
pd.concat([GAbirths,NCbirths],axis=0,ignore_index=True,sort=False)

Unnamed: 0,State,Year,Teen births
0,GA,1/1/2014,91.6
1,GA,1/1/2015,90.2
2,GA,1/1/2014,91.6
3,GA,1/1/2015,93.3
4,GA,1/1/2014,41.3
...,...,...,...
509,NC,1/1/2015,57.3
510,NC,1/1/2014,48.8
511,NC,1/1/2015,46.8
512,NC,1/1/2014,40.2


### *Great job! You should now see the Georgia and North Carolina teen birth rates in a single table, with the GA rows first and the NC rows below them.*
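
Once the two subsets are stacked, a summary can make the comparison easier to read than scrolling through several hundred rows. The sketch below averages the county values for each state and year with **groupby**. It uses made-up `ga_sample` and `nc_sample` dataframes holding only the first four rows of each subset shown above, and `combined` and `mean_rates` are illustrative names. Keep in mind that a plain average of county rates treats every county equally regardless of population, so it is a rough comparison and not an official statewide rate.

```python
import pandas as pd

# First four rows of each subset, copied from the outputs in this notebook.
ga_sample = pd.DataFrame({
    "State": ["GA"] * 4,
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [91.6, 90.2, 91.6, 93.3],
})
nc_sample = pd.DataFrame({
    "State": ["NC"] * 4,
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [44.2, 42.1, 53.8, 57.5],
})

# Stack GA on top of NC and renumber the rows from 0.
combined = pd.concat([ga_sample, nc_sample], axis=0, ignore_index=True, sort=False)

# Average the county values within each state and year.
mean_rates = combined.groupby(["State", "Year"])["Teen births"].mean()
print(mean_rates)
```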

---

## Separated by Year
*For a more in-depth comparison of the data, you can also separate the data by year...*
1. Repeat the previous step by typing **pd.concat([GAbirths,NCbirths],axis=0,ignore_index=True,sort=False)** and running the cell.
2. Click on the **"Convert this dataframe to an interactive table"** button on the top right corner of the data
  - The button should look like a calculator
3. Click on the **"Filter"** button
4. Type either **2014** or **2015** in the **"Year"** box
  - You can also filter the data by state by typing **NC** or **GA** in the **"State"** box

In [17]:
pd.concat([GAbirths,NCbirths],axis=0,ignore_index=True,sort=False)

Unnamed: 0,State,Year,Teen births
0,GA,1/1/2014,91.6
1,GA,1/1/2015,90.2
2,GA,1/1/2014,91.6
3,GA,1/1/2015,93.3
4,GA,1/1/2014,41.3
...,...,...,...
509,NC,1/1/2015,57.3
510,NC,1/1/2014,48.8
511,NC,1/1/2015,46.8
512,NC,1/1/2014,40.2
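
The same filtering can be done in code, which is handy in notebook tools that do not offer the interactive table. The sketch below assumes the Year column holds text such as `1/1/2014`, as it appears in the outputs above, and checks how each value ends. `combined`, `births_2014`, and `nc_2015` are illustrative names, and the four rows are a small sample copied from this notebook's outputs.

```python
import pandas as pd

# Small sample of the combined table: two GA rows and two NC rows.
combined = pd.DataFrame({
    "State": ["GA", "GA", "NC", "NC"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014", "1/1/2015"],
    "Teen births": [91.6, 90.2, 44.2, 42.1],
})

# Keep only the rows whose Year text ends in 2014.
births_2014 = combined[combined["Year"].str.endswith("2014")]
print(births_2014)

# Combine two conditions with & to get one state in one year.
nc_2015 = combined[(combined["State"] == "NC") & combined["Year"].str.endswith("2015")]
print(nc_2015)
```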


---

#***Well done! Now you have all the tools you need to use these datasets for further research.***

#***Best of luck!***