# Lesson I 

## DataFrames and Series

The goal of exploratory data analysis is to use data to answer questions and guide decision making.

## Using data to answer questions

Consider a question like: what is the average birth weight of babies in the United States? Answering it involves several steps:

* Find appropriate data, or collect it
* Read the data into your development environment
* Clean and validate the data

### National Survey of Family Growth (NSFG)

For this question we'll use data from the National Survey of Family Growth, which is available from the National Center for Health Statistics. The 2013-2015 dataset includes information about a representative sample of women in the U.S. and their children.

> [NSFG](https://www.cdc.gov/nchs/nsfg/index.htm) data, from the National Center for Health Statistics, is nationally representative of women 15-44 years of age in the ... United States. It has "Information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and general and reproductive health."


## Reading Data

Once the data is read into a Pandas DataFrame, ``head()`` shows its first 5 rows. The DataFrame contains one row for each pregnancy for each of the women who participated in the survey, and one column for each variable.
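As a minimal sketch of what ``head()`` returns, the example below builds a small stand-in DataFrame. The name ``toy`` and all of its values are made up for illustration; only the column names are borrowed from the NSFG data.

```python
import pandas as pd

# Hypothetical stand-in for the NSFG data: one row per pregnancy,
# one column per variable. The values are invented for illustration.
toy = pd.DataFrame({
    'caseid': [1, 1, 2, 2, 3, 3, 4],
    'birthwgt_lb1': [5.0, 4.0, 5.0, None, 8.0, 7.0, 6.0],
    'birthwgt_oz1': [4.0, 12.0, 4.0, None, 13.0, 0.0, 9.0],
})

# head() returns the first 5 rows by default
first_rows = toy.head()

# Pass a number to get a different count of rows
first_two = toy.head(2)

print(first_rows)
```

Note that ``head()`` returns a new, smaller DataFrame and leaves the original unchanged.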

### Columns and rows

The DataFrame has an *attribute* called ``shape``, a tuple that holds the number of rows and the number of columns. There are *9358* rows in this dataset, one for each pregnancy, and *10* columns, one for each variable.

The DataFrame also has an attribute called ``columns``, which is an Index. That's another Pandas data structure, similar to a list; in this case it's a list of variable names, which are strings.
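Here is a short sketch of both attributes on a made-up DataFrame (``toy`` and its values are hypothetical, not NSFG data):

```python
import pandas as pd

# Hypothetical three-row DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'birthwgt_lb1': [5.0, 4.0, 8.0],
    'birthwgt_oz1': [4.0, 12.0, 13.0],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy.shape

# columns is an Index; convert it to a plain list if one is needed
names = list(toy.columns)

print(n_rows, n_cols)
print(names)
```

Both are attributes rather than methods, so they are read without parentheses.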

<img src='pictures/birthweight.jpg' />

This figure shows an entry from the codebook for ``birthwgt_lb1``, which is the weight in pounds of the first baby from this pregnancy, for cases of live birth.

#### Each Column is a Series

In many ways a DataFrame is like a Python dictionary, where the variable names are the keys and the columns are the values. You can select a column from a DataFrame using the bracket operator ``[]``, with a *string* as the key. The result is a ``Series``, which is another Pandas data structure.
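The dictionary analogy can be sketched on a small made-up DataFrame (``toy`` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'birthwgt_lb1': [5.0, 4.0, 8.0],
    'birthwgt_oz1': [4.0, 12.0, 13.0],
})

# Like a dictionary, a DataFrame can be tested for a key (a column name)
has_ounces = 'birthwgt_oz1' in toy

# Bracket selection with a string key returns that column as a Series
ounces = toy['birthwgt_oz1']

print(has_ounces)
print(type(ounces).__name__)
print(ounces.name, ounces.dtype)
```

The selected Series keeps the column name as its ``name`` attribute.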

``.head()`` shows the first five values in the *series*, the **name** of the series, and the **datatype**; ``float64`` means that these values are 64-bit floating-point numbers.

Some of the values may be ``NaN``, which stands for *"Not a Number"*. ``NaN`` is a special value that can indicate invalid or missing data.
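As a rough sketch of how ``NaN`` behaves, the example below builds a Series by hand from the five ounce values printed in the exercise output. Counting missing values with ``isna()`` is one common approach; it is not a step taken in this lesson.

```python
import numpy as np
import pandas as pd

# Series built by hand from the five values shown in the exercise output
ounces = pd.Series([4.0, 12.0, 4.0, np.nan, 13.0], name='birthwgt_oz1')

# NaN is not equal to anything, including itself, so use isna() to find it
n_missing = ounces.isna().sum()

# By default, Pandas reductions such as mean() skip NaN values
mean_ounces = ounces.mean()

print(n_missing, mean_ounces)
```

Because ``NaN`` is itself a floating-point value, a column that contains it is stored as ``float64`` even when every valid entry is a whole number.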

## Exercise

### Exploring the NSFG data

To get the number of rows and columns in a DataFrame, you can read its ``shape`` attribute.

To get the column names, you can read the ``columns`` attribute. 
The result is an Index, which is a Pandas data structure that is similar to a list. Let's begin exploring the NSFG data!

```python
# Import pandas
import pandas as pd

# Read the NSFG data
filename = 'datasets/nsfg.hdf5'
nsfg = pd.read_hdf(filename)
```

```python
# Display the number of rows and columns
nsfg.shape
```

```
(9358, 10)
```

```python
# Display the names of the columns
nsfg.columns
```

```
Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth',
       'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'],
      dtype='object')
```

```python
# Select the column birthwgt_oz1 (ounces)
ounces = nsfg['birthwgt_oz1']

# Print the first 5 elements of ounces
print(ounces.head())
```

```
0     4.0
1    12.0
2     4.0
3     NaN
4    13.0
Name: birthwgt_oz1, dtype: float64
```
