# **Creating a Subset from Raw Data**


## **Overview**
- This notebook is a guide to creating a subset of a dataset using Python.
- The finished product will provide information on children's health and education in North Carolina.


## **Getting Started**

1. Download the raw data as a .csv file to your computer.
2. Create a folder in your Google Drive and upload your data to that folder.
3. Mount this data from Google Drive to Colab using this code:

    from google.colab import drive
    drive.mount('/content/gdrive')

**NOTE** The path in the parentheses is the mount point: the location in Colab where your Google Drive will appear. The files you uploaded are then found under `/content/gdrive/My Drive/`.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
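
The relationship between the mount point and the path of the uploaded file can be sketched as follows. This only builds the path as text and does not mount anything; `mount_point` and `csv_path` are illustrative names, and the file name is the one used later in this notebook.

```python
from pathlib import PurePosixPath

# Where Colab mounts Google Drive (the argument passed to drive.mount).
mount_point = PurePosixPath("/content/gdrive")

# Files uploaded to the top level of your Drive appear under "My Drive".
csv_path = mount_point / "My Drive" / "CountyHealthData_2014-2015.csv"

print(csv_path)
```

If the file sits inside a folder in your Drive, that folder's name goes between `My Drive` and the file name.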


## **Importing Necessary Packages**

Import the pandas package as "pd" for convenience when coding, so that every pandas function can be called with a short prefix:
          
          i.e. pd.(something)

pandas relies on the numpy package for numerical operations, so import that as well (use "np" as an abbreviation).


In [None]:
import pandas as pd
import numpy as np
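
As a small illustration of how the two packages work together, the sketch below builds a pandas Series that contains a missing value written with numpy. The two numbers are "Teen births" values that appear later in this notebook; the missing entry and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# np.nan is numpy's marker for a missing value; pandas uses it for empty cells.
teen_births = pd.Series([42.4, 40.3, np.nan])

# pandas skips missing values by default when computing statistics.
mean_value = teen_births.mean()
missing_count = teen_births.isna().sum()

print(mean_value, missing_count)
```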


## **Create a DataFrame**

Our DataFrame will be named "df". We create it by passing the path of the .csv file to the "pd.read_csv" function, like this:

        df = pd.read_csv(file)

In [66]:
df=pd.read_csv('/content/gdrive/My Drive/CountyHealthData_2014-2015.csv')
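
To try `pd.read_csv` without the real file, the sketch below reads a two-row stand-in from a string. The rows copy a few values from the dataset shown below; `csv_text` and `sample_df` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# A tiny stand-in for the real file; the empty field is a missing value.
csv_text = (
    "State,County,Year,Premature death\n"
    "AK,Aleutians West Census Area,1/1/2014,\n"
    "AK,Anchorage Borough,1/1/2014,6827.0\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
```

Empty fields in the file are read as NaN, which is why NaN appears in some cells of the real data.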

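We can now view our data by entering the name of the DataFrame. For a large dataset, the display is shortened to the first and last rows and columns, with "..." marking what is hidden.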

In [67]:
df

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.250,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.160
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.60,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.200,169.0,41722,0.668,12.77,0.477
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6104,WY,West,Mountain,Uinta County,56041,56041,Insuff Data,1/1/2015,7436.0,0.135,...,18.66,0.192,0.090,7600.0,0.123,47.0,60953,0.273,,
6105,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2014,6580.0,0.106,...,,0.225,0.086,8202.0,0.099,47.0,49533,0.328,,0.133
6106,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2015,7572.0,0.106,...,,0.226,0.101,7940.0,0.099,47.0,50740,0.309,,
6107,WY,West,Mountain,Weston County,56045,56045,Insuff Data,1/1/2014,5633.0,0.162,...,,0.201,0.084,6906.0,0.130,28.0,53665,0.232,,0.171


## **Exploring the data**

To decide which part of the data you want to use, first look at the size of the dataset and what it contains. You can do this in several ways.

To see how many rows and columns there are in the dataset, use the "df.shape" attribute. It returns the pair (rows, columns); because it is an attribute rather than a function, no parentheses are needed.

In [None]:
df.shape

(6109, 64)

To see how many cells there are in the dataset (rows multiplied by columns), use the "df.size" attribute

In [None]:
df.size

390976
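
The link between the two results can be checked directly. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical) and then confirms the arithmetic for the numbers reported above.

```python
import pandas as pd

# A made-up table with 3 rows and 2 columns.
toy_df = pd.DataFrame(
    {"State": ["AK", "NC", "WY"], "Teen births": [50.0, 42.4, 38.1]}
)

# shape is a (rows, columns) tuple; size is the total number of cells.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols, toy_df.size)

# The same relationship holds for the real dataset: 6109 rows x 64 columns.
print(6109 * 64)
```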

To see the names of the columns in the dataset, use the "df.columns" attribute

In [None]:
df.columns

Index(['State', 'Region', 'Division', 'County', 'FIPS', 'GEOID', 'SMS Region',
       'Year', 'Premature death', 'Poor or fair health',
       'Poor physical health days', 'Poor mental health days',
       'Low birthweight', 'Adult smoking', 'Adult obesity',
       'Food environment index', 'Physical inactivity',
       'Access to exercise opportunities', 'Excessive drinking',
       'Alcohol-impaired driving deaths', 'Sexually transmitted infections',
       'Teen births', 'Uninsured', 'Primary care physicians', 'Dentists',
       'Mental health providers', 'Preventable hospital stays',
       'Diabetic screening', 'Mammography screening', 'High school graduation',
       'Some college', 'Unemployment', 'Children in poverty',
       'Income inequality', 'Children in single-parent households',
       'Social associations', 'Violent crime', 'Injury deaths',
       'Air pollution - particulate matter', 'Drinking water violations',
       'Severe housing problems', 'Driving alone to work'

You can also use the "df.head()" function to view the columns in the dataset. This function shows the first 5 rows of your data, which can be a helpful preview.

In [None]:
df.head()

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.25,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.16
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.6,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.2,169.0,41722,0.668,12.77,0.477
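
`head()` also accepts the number of rows to show, and `tail()` does the same from the end of the table. A minimal sketch on a made-up ten-row DataFrame (`toy_df` and `row_id` are hypothetical names):

```python
import pandas as pd

# Ten rows numbered 0 to 9.
toy_df = pd.DataFrame({"row_id": range(10)})

first_three = toy_df.head(3)   # first 3 rows
last_two = toy_df.tail(2)      # last 2 rows
default_preview = toy_df.head()  # 5 rows when no number is given

print(first_three)
print(last_two)
```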


## **Filtering the data**

Since the dataset is so large, we want to narrow it down to specific rows and columns, and we can do that by filtering our data.
 To create our subset, we will
 1. define a new DataFrame, named 'southern_children', that keeps only the rows for North Carolina
 2. use the '.loc' indexer to select the columns that we want


For the first step, we will use this code to create our filtered DataFrame:

    southern_children = df[df["State"] == "NC"]

In [None]:
southern_children = df[df["State"] == "NC"]
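
The filter works in two stages: the comparison produces a True/False value for every row, and the DataFrame keeps the rows marked True. The sketch below separates the two stages on a made-up table (`toy_df`, `is_nc`, `nc_rows` and the county names are hypothetical).

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "State": ["AK", "NC", "NC", "WY"],
        "County": ["County A", "County B", "County C", "County D"],
    }
)

# Stage 1: compare every value in the column, giving a boolean Series.
is_nc = toy_df["State"] == "NC"

# Stage 2: keep only the rows where the mask is True.
nc_rows = toy_df[is_nc]

print(is_nc.tolist())
print(nc_rows)
```

The kept rows retain their original index labels, which is why the North Carolina rows in the real data are numbered from 3243 rather than from 0.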

Now, for the second step, we will create our subset containing only the following variables:


*   **Teen births**
*   **High school graduation**
*   **Some college**
*   **Infant mortality**
*   **Child mortality**
*   **Children eligible for free lunch**

Use this code to do so:

    south_subset = southern_children.loc[:,["Teen births", 'High school graduation', 'Some college', 'Infant mortality', 'Child mortality', 'Children eligible for free lunch']]



In [None]:
south_subset = southern_children.loc[:,["Teen births", 'High school graduation', 'Some college', 'Infant mortality', 'Child mortality', 'Children eligible for free lunch']]

In [None]:
south_subset

Unnamed: 0,Teen births,High school graduation,Some college,Infant mortality,Child mortality,Children eligible for free lunch
3243,42.4,0.763,0.578,8.3,62.7,0.444
3244,40.3,0.758,0.575,7.7,57.7,0.455
3245,44.2,0.770,0.419,8.6,50.2,0.417
3246,42.1,0.850,0.433,8.0,33.1,0.449
3247,53.8,0.825,0.464,,,0.523
...,...,...,...,...,...,...
3438,57.3,0.730,0.515,11.4,61.4,0.556
3439,48.8,0.830,0.474,9.5,69.0,0.422
3440,46.8,0.820,0.492,9.7,53.9,0.455
3441,40.2,0.775,0.510,,,0.477
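
As an alternative sketch, not the method used above, `.loc` can take the row filter and the column list together, which produces the same kind of subset in one step. The table and its values below are made up, and `toy_df`, `wanted` and `toy_subset` are hypothetical names.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "State": ["AK", "NC", "NC"],
        "Teen births": [50.0, 42.4, 40.3],
        "Some college": [0.600, 0.578, 0.575],
        "Adult obesity": [0.300, 0.310, 0.290],
    }
)

# Columns to keep, in the order they should appear.
wanted = ["Teen births", "Some college"]

# .loc[rows, columns]: a boolean mask for the rows, a list of names for the columns.
toy_subset = toy_df.loc[toy_df["State"] == "NC", wanted]

print(toy_subset)
```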


## **Exporting the data**

Now that we have created a new subset, we can export the data as a .csv file using the '.to_csv' function. Your code should look like this:

    south_subset.to_csv("CountyHealthData_2014-2015.csv", index=False)

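**NOTE** Make sure to specify 'index=False' (with a capital F, since Python is case-sensitive) so that the file won't contain the default index numbers. Because the file name is a relative path, the file is written to Colab's current working directory, not to the Google Drive folder that holds the raw data of the same name.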

In [68]:
south_subset.to_csv("CountyHealthData_2014-2015.csv", index=False)
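
To see what 'index=False' changes, the sketch below writes a two-row table to an in-memory buffer instead of a file and reads it back, once with the index and once without. The values copy the first two rows of the subset above; `toy_subset` and `round_trip` are hypothetical names used only for this illustration.

```python
import io

import pandas as pd

toy_subset = pd.DataFrame(
    {"Teen births": [42.4, 40.3], "Some college": [0.578, 0.575]},
    index=[3243, 3244],
)


def round_trip(frame, **to_csv_kwargs):
    """Write a DataFrame as CSV text, then read that text back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(toy_subset)                  # default keeps the index
without_index = round_trip(toy_subset, index=False)  # index left out

print(list(with_index.columns))
print(list(without_index.columns))
```

When the index is kept, it comes back as an extra column named "Unnamed: 0"; with 'index=False' only the data columns are saved.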