# Introduction to Pandas Data Structures

This lesson introduces [Pandas](https://pandas.pydata.org/), a library for data analysis and manipulation in the Python programming language. Pandas provides a set of data structures, including *DataFrames* and *Series*, for representing and manipulating data quickly and efficiently.

## Learning Objectives

After completing this lesson you will:

- Know the three Pandas data structures: Series, DataFrames, and Indexes.
- Understand what kinds of data can be represented in each of the Pandas data structures.
- Be able to create Pandas data structures from vanilla Python data structures.


## Data used in this lesson

While this lesson does not use any external data files, there will be data represented as Markdown tables in this lesson. Be ready to do some typing because you will need to copy these data tables into your code. Don't worry, the data tables are not large. 😊

This lesson uses a snapshot of data from the [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf). The information represented in these data includes neighborhood name, the population, and the area in square miles.


## Loading the Pandas library

Run the code cell below to load the Pandas library into memory. Note the `as pd`: this loads Pandas with an *alias*, "pd", that is faster to write than the word "pandas." This alias is a common convention in the Python data science community.

In [None]:
# import pandas
import pandas as pd

## Introduction to Pandas Data Structures

To understand Pandas, which takes time, it is helpful to start with the data structures it uses to represent data in memory:
* Series - For one dimensional data (think lists) 
* DataFrame - For two dimensional data (think spreadsheets)
* Index - For naming, selecting, and transforming data within a Pandas Series or DataFrame (think column and row names)

The Pandas data structures are built on top of data structures from [Numpy](https://numpy.org/), a powerful numeric computing library for Python. We won't cover Numpy in this lesson, but it is helpful to know which library underpins Pandas.

## Series

The first data structure you will encounter is the *Series*. A Pandas Series is:
* A one-dimensional array of indexed data
* A blend of Python lists and dictionaries
* A basic building block of the DataFrame

## How to create a Series

To create a Series you must use the `Series()` function and pass it your list-like data as an argument. Run the code cell below to create a Pandas Series from a Python List.

In [None]:
# Create a regular Python list of floating point numbers
my_list = [0.25, 0.5, 0.75, 1.0]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data in the series
data

#### Task - Parts of a Series

1. Look at the output above. You can see the data from the list, but you can also see some additional information. Specifically, there is a set of numbers (0-3) and a line that says `dtype`.
    - What do you think these additional pieces of information represent?
2. In the code cell below, copy the code above and modify the variable `my_list` to be a list of integers instead of floating point numbers.
    - Look at the output, what has changed? What is the `dtype` or *data type* of the Series now?

In [None]:
# Create a series of integer numbers



#### Answer - Parts of a Series

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
# Create a regular Python list of integers
my_list = [25, 5, 75, 1]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data in the series
data

#### Task - Series Data Types

1. Copy and modify the code above to make `my_list` a mixture of integers and floating point numbers.
    - What happens to the data and the `dtype` of the Series?

In [None]:
# Your code here


#### Answer - Series Data Types

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
# Create a regular Python list mixing integers and floating point numbers
my_list = [2.5, 5, 7.5, 1]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data in the series
data

Unlike Python lists, a Pandas Series stores all of its values as a single data type. When you mix integers and floating point numbers, Pandas converts every value to one data type that can represent them all, in this case `float64`.
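As a minimal sketch of this conversion, the cell below builds three small Series and prints each `dtype`. The variable names are hypothetical and are not used elsewhere in this lesson.

```python
import pandas as pd

# Hypothetical example Series, built from the lists used in the tasks above
only_ints = pd.Series([25, 5, 75, 1])
only_floats = pd.Series([0.25, 0.5, 0.75, 1.0])
mixed = pd.Series([2.5, 5, 7.5, 1])

# A list of integers keeps an integer dtype, a list of floats a float dtype
print(only_ints.dtype)
print(only_floats.dtype)

# Mixing the two converts every value to a floating point number
print(mixed.dtype)
print(mixed.tolist())
```

Printing `mixed.tolist()` makes the conversion visible: the integers `5` and `1` come back as `5.0` and `1.0`.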

#### Task - Creating a Series

| Neighborhood              | Population | Area |
| ------------------------- | ---------- | ---- |
| East Liberty              | 5869       | 0.58 |
| Greenfield                | 7294       | 0.78 |
| Squirrel Hill North       | 11363      | 1.22 |
| Bloomfield                | 8442       | 0.70 |
| Central Business District | 3629       | 0.65 |
|Data source [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf)|


1. From the data table above, create a Python dictionary representing the population of a few Pittsburgh neighborhoods. The dictionary values should be the population data and the keys should be the neighborhood names.
2. Use the `Series()` function to create a Series called `neighborhood_series`, using the Python dictionary as the data parameter.
3. Look at the resulting output and consider: how has the data from the dictionary, a structure of keys and values, been transformed into an ordered, one-dimensional Series?

In [None]:
# Your code here





#### Answer - Creating a Series

Click on the ellipsis (...) below to see the answer.

In [None]:
# Answer

neighborhood_dictionary = {
    "East Liberty":5869,
    "Greenfield":7294,
    "Squirrel Hill North":11363,
    "Bloomfield":8442,
    "Central Business District":3629
}

neighborhood_series = pd.Series(neighborhood_dictionary)
neighborhood_series
    

## Components of a Series

A Series has several different parts for storing both data and metadata about the data being represented by the series. There are four components of a Series that are most relevant for this lesson:
1. A sequence of ordered values.
2. An explicit named index for each value.
3. An implicit numerical index for each value.
4. A data type of all the values.

So if you look at the results of the previous task above you will see:
1. A sequence of ordered values representing neighborhood populations.
2. An explicit named index indicating the name of the neighborhood.
3. An implicit numerical index, which isn't shown, but is there if we want to use it.
4. A data type, `int64`, indicating the population values are integers.
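The four components can be inspected directly. The sketch below uses a hypothetical two-neighborhood Series named `example` with values from the data table; it relies on `iloc`, which is covered later in this lesson, to reach a value by position.

```python
import pandas as pd

# Hypothetical two-neighborhood Series using values from the data table
example = pd.Series({"East Liberty": 5869, "Greenfield": 7294})

# 1. The ordered values, returned as a NumPy array
print(example.to_numpy())

# 2. The explicit named index
print(example.index)

# 3. The implicit position of each value, reached through iloc
print(example.iloc[0])

# 4. The data type shared by all the values
print(example.dtype)
```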

## Selecting Values from a Series

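You can index into a Series using the same syntax as with Python lists. Using Python's *indexing* and *slicing* notation you can extract a specific value or a set of values from a Series, just as you would with a list. Note that recent versions of Pandas (2.1 and later) emit a `FutureWarning` when a single integer position is used this way on a Series with a named index; `iloc`, introduced later in this lesson, is the explicit way to select by position.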

#### Task - Indexing and Slicing a Series

1. From the `neighborhood_series` data you created in the previous task, use Python's indexing syntax to extract the first element of the Series, the population of the first neighborhood.

In [None]:
# put your code here



#### Answer - Indexing and Slicing a Series

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
neighborhood_series[0]

#### Task - Indexing and Slicing a Series Again

1. From the `neighborhood_series`, use numerical indexing to select the value for the neighborhood "Squirrel Hill North" by its *numerical position* in the Series.

In [None]:
# put your code here



#### Answer - Indexing and Slicing a Series Again

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
neighborhood_series[2]

#### Task - Slicing a Series
* From the `neighborhood_series`, use slicing syntax to create a subset of the 2nd through 4th elements.

In [None]:
# put your code here



#### Answer - Slicing a Series

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer: positions 1, 2, and 3 hold the 2nd through 4th elements
neighborhood_series[1:4]

#### Task - Indexing the End of a Series

1. If you didn't know how many values were in the Series, how would you use indexing to get the last item?

In [None]:
# put your code here



#### Answer - Indexing the End of a Series

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
neighborhood_series[-1]

### The Named Index 

Pandas series can also behave like a Python dictionary, that is, you can look up values by their names rather than position in the sequence.  

#### Tasks - Extracting Values by Name

* Extract the value for "Greenfield" from the series `neighborhood_series`

In [None]:
# your code here



#### Answer - Extracting Values by Name

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
neighborhood_series["Greenfield"]

#### Task - Extract a Slice

1. Extract a subset of values, that is a slice, from "Greenfield" to "Central Business District", but use the names, not the numerical positions.

In [None]:
# your code here



#### Answer - Extract a Slice

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
neighborhood_series["Greenfield":"Central Business District"]

Note how with named indexing the first and last names are both inclusive, but with numerical indexing the first number is inclusive and the second number is exclusive.
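The difference is easiest to see by counting the values each slice returns. This is a small sketch with a hypothetical Series named `populations`; the positional slice uses `iloc`, covered later in this lesson, so that it is unambiguous about selecting by position.

```python
import pandas as pd

# Hypothetical Series built from the population column of the data table
populations = pd.Series({
    "East Liberty": 5869,
    "Greenfield": 7294,
    "Squirrel Hill North": 11363,
    "Bloomfield": 8442,
    "Central Business District": 3629,
})

# Label slice: both endpoints are included, so four values come back
by_label = populations["Greenfield":"Central Business District"]

# Positional slice: position 1 is included and position 4 is excluded,
# so only three values come back
by_position = populations.iloc[1:4]

print(len(by_label))
print(len(by_position))
```

To get the same four values by position, the slice has to end one past the last wanted position, that is `populations.iloc[1:5]`.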

## DataFrames

DataFrames are two-dimensional data structures, just like an Excel spreadsheet. They have rows and columns. Each column in a DataFrame is a Series under the hood.

Let's start with the Pittsburgh neighborhood data table.

| Neighborhood              | Population | Area |
| ------------------------- | ---------- | ---- |
| East Liberty              | 5869       | 0.58 |
| Greenfield                | 7294       | 0.78 |
| Squirrel Hill North       | 11363      | 1.22 |
| Bloomfield                | 8442       | 0.70 |
| Central Business District | 3629       | 0.65 |
|Data source [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf)|

And now let's create three lists, one from each of the columns in the data table above.

In [None]:
# create the first list of neighborhoods
neighborhoods = ["East Liberty", "Greenfield", "Squirrel Hill North", "Bloomfield", "Central Business District"]
neighborhoods

#### Task - Create Lists of Data

1. Create two more lists, one for the population values (called `population`) and one for the area values (called `area`). Make sure the order of the values matches the `neighborhoods` list above!


In [None]:
# your code here


#### Answer - Create Lists of Data

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer
population = [5869, 7294, 11363, 8442, 3629]
area = [0.58, 0.78, 1.22, 0.70, 0.65]

### Creating DataFrames using Python data structures

Creating a DataFrame from Python means you need to compose your data into a two dimensional structure using a combination of Python lists and Dictionaries.

In [None]:
# Create a DataFrame from a dictionary of lists

# create a dictionary with column names as keys and lists as values
data = {"neighborhoods":neighborhoods,
        "population": population,
        "area": area}

pgh_neighborhood_info = pd.DataFrame(data)
pgh_neighborhood_info

Look! We have re-created the table above, but now it is loaded as a Pandas DataFrame. Notice how the DataFrame has been rendered as a pretty HTML table; that is a benefit of Jupyter Notebooks.

There are other ways to create DataFrames. For example, maybe your data is more row-centric and you have a list of lists.

```python
[['East Liberty', 5869, 0.58],
 ['Greenfield', 7294, 0.78],
 ['Squirrel Hill North', 11363, 1.22],
 ['Bloomfield', 8442, 0.7],
 ['Central Business District', 3629, 0.65]]
```

#### Task - Creating a DataFrame from a list of lists

1. Copy the list of lists in the code above and save it to a variable called "data".
2. Create a new list with three string values: "neighborhood", "population", and "area" and save it to a variable called "column_names".
3. Recreate the `pgh_neighborhood_info` DataFrame using the `pd.DataFrame()` function. Put the `data` variable as the first positional argument and `column_names` as the value for the `columns` keyword argument.
4. Display the DataFrame and see if it is the same as the output above.

In [None]:
# your code here


#### Answer - Creating a DataFrame from a list of lists

Click on the ellipsis (...) below to see the answer.

In [None]:
# answer

# create the list of lists
data = [['East Liberty', 5869, 0.58],
 ['Greenfield', 7294, 0.78],
 ['Squirrel Hill North', 11363, 1.22],
 ['Bloomfield', 8442, 0.7],
 ['Central Business District', 3629, 0.65]]

# create a list of column names
column_names = ["neighborhood", "population", "area"]

# create the dataframe with positional and keyword argument
pgh_neighborhood_info = pd.DataFrame(data, columns=column_names)
pgh_neighborhood_info

### Slicing a DataFrame

You can use indexing notation to select individual or groups of columns from a DataFrame. 

In [None]:
pgh_neighborhood_info['area']

Notice the output is not a list, but a Series. When we created the DataFrame from the list of lists, each inner list became a row, and Pandas stored each resulting column as a Series.
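One way to check this is to ask Python for the type of a selected column. The sketch below assumes a hypothetical two-row DataFrame named `example` built the same way as `pgh_neighborhood_info`.

```python
import pandas as pd

# Hypothetical two-row DataFrame built from a list of lists
example = pd.DataFrame(
    [["East Liberty", 5869, 0.58], ["Greenfield", 7294, 0.78]],
    columns=["neighborhood", "population", "area"],
)

# Selecting one column returns a Series, not a list
column = example["area"]
print(type(column))

# The Series remembers which column it came from
print(column.name)

# Each column keeps its own data type
print(example.dtypes)
```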

#### Task - Selecting Columns of a DataFrame

1. Slice out the `population` column from the `pgh_neighborhood_info` DataFrame.
2. Try to slice both the `neighborhood` and `population` columns. What happens?
3. Create a Python list with two strings representing the names of the columns and save that to a variable (you pick the name).
4. Pass your newly created variable to the DataFrame within the slicing notation (that is, put it in the square brackets). What happens now?

In [None]:
# your code here


#### Answer - Selecting Columns of a DataFrame

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 1
pgh_neighborhood_info['population']

In [None]:
# answer 2
pgh_neighborhood_info['neighborhood','population']

In [None]:
# answer 3 & 4
foo = ["neighborhood", "population"]
pgh_neighborhood_info[foo]

Why do we get a key error when we try to slice two column names?
If we don't wrap our values in a list, Pandas will try to find a single column with the compound name `('neighborhood', 'population')`, which is a valid column name, rather than two columns with the names "neighborhood" and "population."

This wart is an artifact of the way column names are treated in Pandas: as an Index!
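Both behaviors can be reproduced side by side. This sketch uses a hypothetical two-row DataFrame named `example`; the `try`/`except` block is only there so the cell keeps running after the error.

```python
import pandas as pd

# Hypothetical two-row DataFrame using values from the data table
example = pd.DataFrame({
    "neighborhood": ["East Liberty", "Greenfield"],
    "population": [5869, 7294],
    "area": [0.58, 0.78],
})

# Two names separated by a comma form a tuple,
# which Pandas looks up as ONE column label
try:
    example["neighborhood", "population"]
except KeyError as error:
    print("KeyError:", error)

# A list of names asks for several columns and returns a DataFrame
subset = example[["neighborhood", "population"]]
print(type(subset))
print(list(subset.columns))
```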

## Index

* Pandas `Series` and `DataFrames` are containers for data
* The Index (and Indexing) is the mechanism to make that data retrievable
* In a `Series` the index is the key to each value in the list
* In a `DataFrame` the index is the column names, but there is also an index for each row
* Indexing allows you to merge or join disparate datasets together

Let's start with the Pittsburgh neighborhood data table.

| Neighborhood              | Population | Area |
| ------------------------- | ---------- | ---- |
| East Liberty              | 5869       | 0.58 |
| Greenfield                | 7294       | 0.78 |
| Squirrel Hill North       | 11363      | 1.22 |
| Bloomfield                | 8442       | 0.70 |
| Central Business District | 3629       | 0.65 |
|Data source [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf)|


#### Task - Create a Population Series

1. Create a Python dictionary with neighborhood names as keys and population as values
2. Use the `pd.Series` function to create a Pandas Series called `population_series`.

In [None]:
# your code here


#### Answer - Create a Population Series

Click on the ellipses (...) below to see the answers.

In [None]:
# Answer

# create a dictionary of population
population_dictionary = {
    "East Liberty":5869,
    "Greenfield":7294,
    "Squirrel Hill North":11363,
    "Bloomfield":8442,
    "Central Business District":3629
}

population_series = pd.Series(population_dictionary)
population_series
    

#### Task - Create an Area Series

1. Create a Python dictionary with neighborhood names as keys and area as values. This time, write the neighborhoods in a different order. Additionally, add the neighborhood "Beechview", which has an area of 1.46 square miles, to your dictionary.
2. Use the `pd.Series` function to create a Pandas Series called `area_series`.

In [None]:
# your code here


#### Answer - Create an Area Series

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# create a dictionary with beechview added
area_dictionary = {
    "Central Business District":0.65,
    "East Liberty":0.58,
    "Squirrel Hill North":1.22,
    "Greenfield":0.78,
    "Bloomfield":0.70,
    "Beechview":1.46
}

area_series = pd.Series(area_dictionary)
area_series
    

#### Task - Create a DataFrame from Differently Ordered Series

1. Create a DataFrame called `neighborhoods` with two columns, `population` and `area`, using the two Series you created as the data. Display your DataFrame.
2. Check to make sure the populations and area line up with the data table above.
3. What is the population value for "Beechview"?

In [None]:
# your code here


#### Answer - Create a DataFrame from Differently Ordered Series

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# create a Dataframe for population and area
neighborhoods = pd.DataFrame({"population":population_series, 
                              "area": area_series})

neighborhoods

Note how the data have been aligned and ordered based on the neighborhood names. Because population values for Beechview were missing, Pandas fills in a missing value `NaN` (which will be discussed in a later lesson).
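The alignment can be checked on a smaller scale. The sketch below uses two hypothetical, shortened Series written in different orders, with one neighborhood missing from the population data.

```python
import pandas as pd

# Hypothetical shortened Series; Beechview has an area but no population
population = pd.Series({"East Liberty": 5869, "Greenfield": 7294})
area = pd.Series({"Greenfield": 0.78, "Beechview": 1.46, "East Liberty": 0.58})

aligned = pd.DataFrame({"population": population, "area": area})
print(aligned)

# Values are matched by label, not by the order they were written in
print(aligned["area"]["Greenfield"])

# The missing population is stored as NaN,
# which turns the whole column into floating point numbers
print(aligned["population"].isna())
print(aligned["population"].dtype)
```

Note that the `population` column is no longer `int64`: `NaN` is a floating point value, so Pandas converts the whole column to hold it.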

#### Task - Selecting Rows of a DataFrame

1. Try using the indexing syntax to select the second row of the `neighborhoods` DataFrame, first by the neighborhood's name and then by its position.
2. Why do you think you get an error? What kind of error do you get?

In [None]:
# your code here


#### Answer - Selecting Rows of a DataFrame

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# try to select the second row of the dataframe by name
neighborhoods['Bloomfield']

In [None]:
# answer

# try to select the second row of the dataframe by position
neighborhoods[1]

Indexing a DataFrame will always select a column (or columns). To select rows we must use the `loc`[(docs)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and `iloc`[(docs)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) properties of a Pandas DataFrame. 

#### Task - Selecting Rows of a DataFrame using `loc`

1. Run the code cell below to select the row for the Bloomfield neighborhood
2. What data structure do you get back?
3. What do the index values correspond to?

In [None]:
# using loc to index rows by name
neighborhoods.loc["Bloomfield"]

#### Answer - Selecting Rows of a DataFrame using `loc`

Click on the ellipses (...) below to see the answers.

You get back a Pandas Series representing one row of the DataFrame. The column names become the named index for the values. Also note the Series has a name (shown as `Name` in the output), another metadata field of a Series, that corresponds to the named index value passed to `loc`.
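These three observations can be confirmed in code. The sketch assumes a hypothetical two-row DataFrame named `example` that is indexed by neighborhood name.

```python
import pandas as pd

# Hypothetical two-row DataFrame indexed by neighborhood name
example = pd.DataFrame(
    {"population": [8442, 7294], "area": [0.70, 0.78]},
    index=["Bloomfield", "Greenfield"],
)

row = example.loc["Bloomfield"]

# The row comes back as a Series
print(type(row))

# The column names become the index of that Series
print(list(row.index))

# The row label is stored in the name attribute
print(row.name)
```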

#### Task - Selecting Rows of a DataFrame using `iloc`

1. Use the `iloc` property to select the same row as the previous task, but using the numeric index instead of the named index.

In [None]:
# your code here


#### Answer - Selecting Rows of a DataFrame using `iloc`

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# select the row at index position 1 from the neighborhoods dataframe
neighborhoods.iloc[1]

## Selecting Rows and Columns

The `loc` and `iloc` properties allow for the selection of rows and columns using the following syntax:
```python
df.loc[row_indexer,column_indexer]
```

Each `indexer` can be a single value or a slice. Let's try it with our `neighborhoods` DataFrame. Run the code cells below to see how this works.

In [None]:
# display the neighborhoods dataframe
neighborhoods

In [None]:
# select the row for Greenfield and the column population
neighborhoods.loc["Greenfield", "population"]

In [None]:
# select the population for several neighborhoods
neighborhoods.loc["Bloomfield":"Greenfield", "population"]

#### Task - Select with `iloc`

1. Recreate the results above using the positional, numeric index with `iloc` instead of label-based index with `loc`.
2. Pay close attention to the inclusive and exclusive values in the range.

In [None]:
# your code here


#### Answer - Select with `iloc`

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
neighborhoods.iloc[1:5, 0]

#### Tasks - Slicing and Dicing a DataFrame

* Using `iloc` and the slicing syntax, select specific values from the example DataFrame based on the highlighted blocks in the image.
* It is helpful to think of the slicing syntax to grab the rows THEN think of the slicing syntax for the columns.
* Put the row slices *before* the comma and the column slices *after* the comma.

In [None]:
# This is our example Dataframe
indexing_example = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
indexing_example

![first slice exercise](images/indexing1.png)
* Select the second two columns of the first two rows.

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing2.png)
* Select the third row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing3.png)
* Select the first two columns

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing4.png)
* Select the first two columns of the second row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


#### Answers - Slicing and Dicing a DataFrame

* Using `iloc` and the slicing syntax, select specific values from the example DataFrame based on the highlighted blocks in the image.
* First think of the slicing syntax to grab just the rows you want, THEN think of the slicing syntax for the columns you want.
* Put the row slices *before* the comma and the column slices *after* the comma.

![first slice exercise](images/indexing1.png)
* Select the second two columns of the first two rows.

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[:2, 1:]

![first slice exercise](images/indexing2.png)
* Select the third row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[2, :]

![first slice exercise](images/indexing3.png)
* Select the first two columns

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[:,:2]

![first slice exercise](images/indexing4.png)
* Select the first two columns of the second row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[1,:2]

## Why `loc` and `iloc`?

Because Pandas data structures have both explicit named and implicit numeric indexes, it can sometimes be ambiguous which index is being referenced. This is especially true if you are using numbers in the explicit named index!

#### Task - Which One is 1?

1. Run the code below to create a Series with an explicit numeric index.
2. Run the code `which_index[1]` and look at the result.
3. Which index is being used: the implicit, numeric positional index or the explicit, named/label index?

In [None]:
neighborhoods_list = ["East Liberty", 
                 "Greenfield", 
                 "Squirrel Hill North", 
                 "Bloomfield", 
                 "Central Business District"]
which_index = pd.Series(neighborhoods_list, index=[1,2,3,4,5])
which_index

In [None]:
# Which index is being called?
which_index[1]

#### Answer - Which One is 1?

Click on the ellipses (...) below to see the answers.

When indexing a Series without using `loc` or `iloc`, Pandas will use the named or label-based index, which is equivalent to using `loc`. Here `which_index[1]` returns "East Liberty", the value labeled 1, not the value at position 1.
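Writing `loc` or `iloc` explicitly removes the ambiguity. This sketch uses a hypothetical three-item Series named `labeled` whose labels are the integers 1 to 3, similar to `which_index`.

```python
import pandas as pd

# Hypothetical Series whose labels are the integers 1 to 3
labeled = pd.Series(["East Liberty", "Greenfield", "Bloomfield"], index=[1, 2, 3])

# Plain brackets look up the label 1
print(labeled[1])       # East Liberty

# loc always uses labels
print(labeled.loc[1])   # East Liberty

# iloc always uses positions, counted from 0
print(labeled.iloc[1])  # Greenfield
```

Spelling out `loc` or `iloc` makes the intent clear to anyone reading the code, whatever kind of index the Series has.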