## Lab 1: Introduction to Pandas

Pandas is a widely used open source library for data manipulation and analysis. The goal of this lab is to gain familiarity with some of the basics of the Pandas library, such as:
 
 * Dataframes
 * Indexes
 * Series
 * Filtering

### Question 1

In order to use the Pandas library, you first need to import it. In the cell below, import Pandas using its common abbreviation `pd`. 

In [17]:
import pandas as pd

## Creating a DataFrame

Dataframes are the primary Pandas data structure. A dataframe organizes data as a table that has one index for its rows and one for its columns. While entries in the row index do not have to be unique, entries in the column index should be unique.
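
To make the row and column indexes concrete, here is a minimal sketch on a small made-up table. The names `toy`, `name`, and `count` are hypothetical and are not part of the lab's data file.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the lab's CSV file).
toy = pd.DataFrame(
    {"name": ["Ava", "Ben", "Cal"], "count": [10, 7, 7]},
    index=["a", "b", "a"],  # row labels are allowed to repeat
)

print(toy.index.is_unique)    # False: the row label "a" appears twice
print(toy.columns.is_unique)  # True: the column labels are distinct
print(len(toy.loc["a"]))      # 2: selecting a repeated label returns both rows
```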

### Question 2(a)

A CSV (comma-separated values) file stores tabular data in plain text: each line of the file is a data entry, and commas separate the fields within a line.
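
As a rough illustration of the format, the sketch below parses a two-row CSV held in a string instead of a file on disk. The text in `csv_text` is made up for this example.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,count\nAva,10\nBen,7\n"

# read_csv accepts a path or a file-like object.
# By default, the first line supplies the column labels.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (2, 2)
print(list(toy.columns))  # ['name', 'count']
```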

Today you will be examining popular baby names by sex and ethnic group from the city of New York. According to Data.gov, the data was collected through civil birth registration; each entry represents the ranking of a baby name in order of frequency. You will need to use this data file: `'Popular_Baby_Names.csv'`.

Read in the above file as a dataframe.

In [18]:
babynames = pd.read_csv('Popular_Baby_Names.csv')

### Question 2(b)
Use the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to view only the first 10 rows of the `babynames` dataframe. The `babynames` dataframe actually has 19,418 rows, which would be unwieldy to display at one time.

In [19]:
babynames.head(10)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Olivia,172.0,1.0
1,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Chloe,112.0,2.0
2,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Sophia,104.0,3.0
3,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emily,99.0,4.0
4,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emma,99.0,4.0
5,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Mia,79.0,5.0
6,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Charlotte,,
7,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Sarah,57.0,7.0
8,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Isabella,56.0,8.0
9,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Hannah,56.0,8.0


## Series

### Question 3(a)
A series is a one-dimensional, labeled array of data. A dataframe is actually a set of series, where each column is a series. There are multiple ways to access the series that make up a dataframe.

Using the [`.loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) indexer, set `years` to the `Year of Birth` series from `babynames`.

In [20]:
years = babynames.loc[:, 'Year of Birth']

### Question 3(b)

Suppose you want just the names of the babies. Use [`.iloc`](https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.DataFrame.iloc.html) to set `names` to the `Child's First Name` series from `babynames`.

In [21]:
names = babynames.iloc[:, 3]
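
The label-based and position-based selections from Questions 3(a) and 3(b) can be compared side by side on a small made-up table. This is only a sketch; `toy` and its columns are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2015, 2016], "name": ["Ava", "Ben"]})

by_label = toy.loc[:, "name"]   # select a column by its label
by_position = toy.iloc[:, 1]    # select a column by its 0-based position
by_bracket = toy["name"]        # bracket shorthand for a single column

print(type(by_label).__name__)       # Series
print(by_label.equals(by_position))  # True
print(by_label.equals(by_bracket))   # True
```

Selecting by position is more fragile than selecting by label, since the position changes if columns are added or reordered.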

### Question 3(c)

It's good to inspect your new data structure after you create it.
Set `years_size` to the length of the `years` series.
Use a series attribute or method; do not just hard-code a number.

In [22]:
years_size = years.size
years_size

19418

Notice that this number is the same as the number of rows in `babynames`. 
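
There are several ways to measure a series, and they do not all count the same thing. The sketch below uses a made-up series (`s` is a hypothetical name) that contains one missing value.

```python
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([1.0, None, 3.0])

print(s.size)      # 3: total number of entries, including NaN
print(len(s))      # 3: same as size for a series
print(s.count())   # 2: number of non-missing entries only
```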

### Question 3(d)
Create a boolean array using `years` where an entry is `True` if the year is equal to 2016.

In [23]:
is_2016 = years == 2016
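
Comparing a series to a single value applies the comparison to every entry. A minimal sketch on made-up years (`years_toy` and `mask` are hypothetical names):

```python
import pandas as pd

years_toy = pd.Series([2015, 2016, 2016, 2014])
mask = years_toy == 2016   # element-wise comparison, returns a boolean series

print(mask.tolist())   # [False, True, True, False]
print(mask.dtype)      # bool
print(mask.sum())      # 2, because each True counts as 1
```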

## Basic Manipulations

### Question 4(a)
Suppose for some reason you were not interested in the ranking of the names but rather the year in which the names appeared. Change the row index of `babynames` to the `Year of Birth` column.

In [24]:
babynames_by_year = babynames.set_index('Year of Birth')
babynames_by_year.head(5)

Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Olivia,172.0,1.0
2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Chloe,112.0,2.0
2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Sophia,104.0,3.0
2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emily,99.0,4.0
2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emma,99.0,4.0


Note that changing the index did not alter the original `babynames` dataframe. Instead it created a new dataframe with the index we wanted which we then assigned to the variable `babynames_by_year`.
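
The same behavior can be checked on a small made-up table. In this sketch, `toy` and `toy_by_year` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2015, 2016], "name": ["Ava", "Ben"]})
toy_by_year = toy.set_index("year")   # returns a new dataframe

print(list(toy.columns))          # ['year', 'name'], the original is unchanged
print(list(toy_by_year.columns))  # ['name'], 'year' moved into the index
print(toy_by_year.index.name)     # year
```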

### Question 4(b)

Suppose you're not interested in the ethnicity related to the baby names. Drop the `Ethnicity` column from `babynames`. Set `without_ethnicity` to this new dataframe.

In [25]:
without_ethnicity = babynames.drop('Ethnicity', axis=1)
without_ethnicity.head()

Unnamed: 0,Year of Birth,Gender,Child's First Name,Count,Rank
0,2016,FEMALE,Olivia,172.0,1.0
1,2016,FEMALE,Chloe,112.0,2.0
2,2016,FEMALE,Sophia,104.0,3.0
3,2016,FEMALE,Emily,99.0,4.0
4,2016,FEMALE,Emma,99.0,4.0
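
As a side note, `drop` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. A minimal sketch on made-up data (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({"year": [2015], "name": ["Ava"], "count": [10]})

with_axis = toy.drop("count", axis=1)      # drop by label along the column axis
with_keyword = toy.drop(columns="count")   # equivalent keyword form

print(with_axis.equals(with_keyword))   # True
print("count" in toy.columns)           # True: the original still has the column
```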


## Filtering

Filtering data is part of the data cleaning process. We filter data to focus on the specific parts that interest us and remove unwanted data.

### Question 5(a)
Let's say we are only interested in baby names from the year 2016. Filter `babynames` to only include rows where the `Year of Birth` is 2016.

In [26]:
babynames_only_2016 = babynames[is_2016]

In [27]:
# Check that every remaining row really has Year of Birth equal to 2016
(babynames_only_2016['Year of Birth'] == 2016).all()

True
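
Boolean masks can also be combined to filter on more than one condition. The following is a sketch on made-up data; `toy` and `subset` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2015, 2016, 2016],
    "name": ["Ava", "Ben", "Cal"],
    "count": [10, 7, 12],
})

# Each condition needs its own parentheses; & is the element-wise "and".
subset = toy[(toy["year"] == 2016) & (toy["count"] > 10)]

print(subset["name"].tolist())         # ['Cal']
print((subset["year"] == 2016).all())  # True: every kept row is from 2016
```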

### Question 5(b)
Datasets often have missing values. Ideally, values are missing at random and the missingness is not correlated with other variables. There are multiple ways to deal with missing values. For now, we will drop them from our table.

Drop all rows where `Count` is NaN.

In [29]:
# Keep only the rows where Count is not missing
babynames = babynames[babynames['Count'].notna()]
babynames.head(10)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Olivia,172.0,1.0
1,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Chloe,112.0,2.0
2,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Sophia,104.0,3.0
3,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emily,99.0,4.0
4,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Emma,99.0,4.0
5,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Mia,79.0,5.0
7,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Sarah,57.0,7.0
8,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Isabella,56.0,8.0
9,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Hannah,56.0,8.0
10,2016,FEMALE,ASIAN AND PACIFIC ISLANDER,Grace,54.0,9.0


When you look at the beginning of the dataframe now, notice that the row for `Charlotte`, which had index `6`, is no longer there.
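
Pandas also provides `dropna`, which can express the same filter. The sketch below uses made-up data (`toy`, `via_mask`, and `via_dropna` are hypothetical names) and shows that the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"name": ["Ava", "Ben", "Cal"], "count": [10.0, np.nan, 7.0]})

via_mask = toy[toy["count"].notna()]        # boolean-mask approach
via_dropna = toy.dropna(subset=["count"])   # dropna restricted to one column

print(via_mask.equals(via_dropna))   # True
print(via_dropna.index.tolist())     # [0, 2]: labels are kept, not renumbered
```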

### Congratulations! You've finished your first lab!