## Introduction to Pandas Data Structures

* To understand Pandas, which is hard, it helps to start with the data structures it adds to Python:
    * Series - For one-dimensional data (lists)
    * DataFrame - For two-dimensional data (spreadsheets)
    * Index - For naming, selecting, and transforming data within a Pandas Series or DataFrame (column and row names)
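
As a rough preview, here is a minimal sketch of all three structures side by side. The variable names and values are made up for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one-dimensional data with an index (made-up example values)
temperatures = pd.Series([21.5, 23.0, 19.8], index=["Mon", "Tue", "Wed"])

# A DataFrame: two-dimensional data with row and column labels
weather = pd.DataFrame(
    {"temperature": [21.5, 23.0, 19.8], "rain": [False, True, False]},
    index=["Mon", "Tue", "Wed"],
)

# An Index: the labels attached to a Series or DataFrame
print(type(temperatures.index))
print(list(weather.index))    # row labels
print(list(weather.columns))  # column labels
```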

In [1]:
# import pandas
import pandas as pd

---

## Series

* A one-dimensional array of indexed data
* Kind of like a blend of a Python list and dictionary
* You can create them from a Python list


In [2]:
# creating a series using the top-level pandas function
# Put the cursor inside the parentheses and hit shift-enter
pd.Series()

Series([], dtype: float64)

In [3]:
# Create a regular Python list
my_list = [0.25, 0.5, 0.75, 1.0]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data in the series
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

* A Series is a list-like structure, which means it is *ordered* 
* You can use indexing to grab items in a Series, just like a list
* The numbers in the left-hand column of the output are the *index* of the Series
* It is best to use the `iloc` indexer to grab elements by their position in the Series
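
One reason to prefer `iloc` for positions is that the index labels do not have to match the positions. The sketch below uses a hypothetical Series, `scores`, whose integer labels run in reverse order, so looking up the label 0 and looking up position 0 give different answers.

```python
import pandas as pd

# Hypothetical Series whose integer labels are in reverse order
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = scores.loc[0]      # the value whose label is 0
by_position = scores.iloc[0]  # the value in the first position

print(by_label)     # 30
print(by_position)  # 10
```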

In [4]:
# grab the first element
data[0]

0.25

In [5]:
# grab the first element using iloc
data.iloc[0]

0.25

In [6]:
# grab the 4th element
data.iloc[3]

1.0

#### Quick Exercise
* How might we grab the *last* element if we didn't know the length of the list?

In [None]:
# hint: think small
data.iloc[???]


* Also, like lists, you can use *slicing* notation to grab sub-lists
* Again, it is best to use the `.iloc` indexer

#### Quick Exercise

* Use slices to grab the 2nd and 3rd elements of this series

In [None]:
# hint: the 2nd & 3rd elements are 0.50 and 0.75
# your code below
data.iloc[???]


### Index by name

* Series also act like Python dictionaries, specifically *ordered* Python dictionaries
* This means you can grab things by name in addition to location

In [7]:
# Create a regular Python Dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Transform that dictionary into a Series 
population = pd.Series(population_dict)

# Display the data
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
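
Because a Series behaves like a dictionary, some familiar dictionary idioms should carry over. The sketch below rebuilds a shortened copy of the population data (three states only) and tries a few of them; the variable names `has_texas`, `has_ohio`, `labels`, and `pairs` are just for illustration.

```python
import pandas as pd

# Same figures as the population Series above, shortened to three states
population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})

# Membership tests check the index labels (the "keys"), not the values
has_texas = "Texas" in population
has_ohio = "Ohio" in population

# keys() returns the index; items() yields (label, value) pairs
labels = list(population.keys())
pairs = list(population.items())

print(has_texas, has_ohio)
print(labels)
print(pairs[0])
```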

* You can also create a labeled Series from two lists: one for the values and one for the index

In [8]:
# create two ordered lists
population_list = [38332521, 26448193, 19651127, 19552860, 12882135]
states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']

# Create a Series from those two lists
population = pd.Series(population_list, index=states)

# display the data
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

* You can use indexing and slicing like above, but now with keys instead of numbers!
* It is best to use the `.loc` indexer when looking up things by name instead of by number


In [9]:
population['California']

38332521

In [10]:
# select the data value with the name "California"
population.loc['California']

38332521

In [11]:
# What happens if you try to use a name when iloc wants a number?
population.iloc['California']

TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [California] of <class 'str'>

* Like a Python dictionary, a Series is a list of key/value pairs
* But these are *ordered*, which means you can do slicing

#### Quick Exercise

* Try slicing this series, but with keys instead of numbers!
* Select a subset of the data using the Python slicing notation
* Don't forget, use `loc`!

In [14]:
# Hint: Use the same : notation, but use the state names listed above
# Your code here:

population.loc["California" : "New York"]

California    38332521
Texas         26448193
New York      19651127
dtype: int64

In [15]:
# Try some numeric slicing if you'd like

population.iloc[0:3]

California    38332521
Texas         26448193
New York      19651127
dtype: int64
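
The two slices above return the same three rows even though one ends at `"New York"` and the other at `3`. That is because a `loc` slice includes its end label, while an `iloc` slice stops before its end position, just like a list. Here is a small sketch of the difference, using a made-up Series called `letters`.

```python
import pandas as pd

letters = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

by_label = letters.loc["a":"c"]   # includes the end label "c"
by_position = letters.iloc[0:2]   # excludes position 2

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```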

---

## DataFrame

* `DataFrames` are the real workhorse of Pandas and Python Data Science
* We will be spending a lot of time with data inside of Dataframes, so buckle up!
* `DataFrames` contain two-dimensional data, just like an Excel spreadsheet
* In practice, a `DataFrame` is a bunch of `Series` lined up next to each other
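
To make the "bunch of `Series`" idea concrete, here is a minimal sketch with made-up data: two Series go in as columns, and pulling one column back out gives a Series again.

```python
import pandas as pd

# Made-up values for illustration
height = pd.Series([1.6, 1.8], index=["Ann", "Raj"])
weight = pd.Series([55, 80], index=["Ann", "Raj"])

people = pd.DataFrame({"height": height, "weight": weight})

# Pulling a single column back out gives a Series again
column = people["height"]
print(type(column))
print(list(column.index))
print(list(column))
```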

In [16]:
# Start with our population Series
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [17]:
# Then create another Series for the area
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area

Illinois      149995
California    423967
Texas         695662
Florida       170312
New York      141297
dtype: int64

In [18]:
# Create a dictionary with a key:value for each column
state_info_dictionary = {'population': population,
                       'area': area}

# Now mash them together into a DataFrame
states = pd.DataFrame(state_info_dictionary)
# Display the data
states

            population    area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662


* Pandas automatically lines everything up because they have shared index values
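
A quick way to see this alignment at work is to combine Series that share only some of their labels. In this sketch (made-up Series `a` and `b`), a label missing from one Series shows up as `NaN` in that column instead of shifting the other values out of place.

```python
import pandas as pd

# Made-up Series that share only some of their labels
a = pd.Series({"x": 1, "y": 2})
b = pd.Series({"y": 20, "z": 30})

# Rows are matched by label; gaps are filled with NaN
combined = pd.DataFrame({"a": a, "b": b})
print(combined)
```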

In [19]:
# create a list of dictionaries that contain our data.
# one dictionary per observation/row
dead_people = [
    {"ssn":1, "first_name": "Bob", "last_name": "Jones", "age": 200},
    {"ssn":2, "first_name": "Jane", "last_name": "Jones", "age": 199},
    {"ssn":3, "first_name": "Ethel", "last_name": "Jones", "age": 180},
    {"ssn":4, "first_name": "Hortense", "last_name": "Jones", "age": 178},
    {"ssn":5, "first_name": "Vern", "last_name": "Jones", "age": 178}
]

# create a Dataframe from a list of dictionaries
pd.DataFrame(dead_people)

   age first_name last_name  ssn
0  200        Bob     Jones    1
1  199       Jane     Jones    2
2  180      Ethel     Jones    3
3  178   Hortense     Jones    4
4  178       Vern     Jones    5


In [20]:
# create a list of lists, each sub-list is an observation/row
dead_people = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

# make a Dataframe with column names specified separately
pd.DataFrame(dead_people, columns=column_names)

   ssn first_name last_name  age
0    1        Bob     Jones  200
1    2       Jane     Jones  199
2    3      Ethel     Jones  180
3    4   Hortense     Jones  178
4    5       Vern     Jones  178


In [None]:
# create a list of lists, each sub-list is an observation/row
dead_people = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

# specify custom row labels (the index), one per row
row_ids = [123,3452,3235,4345,563463]

# make a DataFrame with column names and row labels specified separately
dead_dataframe = pd.DataFrame(dead_people, columns=column_names, index=row_ids)
dead_dataframe
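
With custom integer row labels like these, a row's label and its position are different numbers. The sketch below rebuilds just the first two rows (under the hypothetical name `people`) to show that `loc` uses the label while `iloc` uses the position. Asking `iloc` for position 3452 would fail here, since there are only two rows.

```python
import pandas as pd

# Two of the rows from the cell above, with the same custom row labels
rows = [[1, "Bob", "Jones", 200],
        [2, "Jane", "Jones", 199]]
people = pd.DataFrame(rows,
                      columns=["ssn", "first_name", "last_name", "age"],
                      index=[123, 3452])

by_label = people.loc[3452, "first_name"]  # the row labelled 3452
by_position = people.iloc[1, 1]            # second row, second column

print(by_label, by_position)
```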

---

## Index

* Pandas `Series` and `DataFrames` are containers for data
* The Index (and Indexing) is the mechanism to make that data retrievable
* In a `Series` the index is the key to each value in the list
* In a `DataFrame` the column names act as one index, and there is a second index that labels each row
* Indexing allows you to merge or join disparate datasets together
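
As one possible illustration of joining on the index, the sketch below builds two small made-up tables whose rows are in different orders and combines them with `DataFrame.join`. The table names and contents are assumptions for the example only.

```python
import pandas as pd

# Made-up tables that share row labels but arrive in different orders
capitals = pd.DataFrame({"capital": ["Austin", "Albany"]},
                        index=["Texas", "New York"])
nicknames = pd.DataFrame({"nickname": ["Empire State", "Lone Star State"]},
                         index=["New York", "Texas"])

# join() matches rows on the index labels, not on row order
joined = capitals.join(nicknames)
print(joined)
```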

In [None]:
states

* You can programmatically access the column and row labels through the following attributes

In [None]:
# get the column labels as a list-like data structure
states.columns

In [None]:
# get the row labels as a list-like data structure
states.index

* The `loc` indexer introduced above allows us to select specific rows and columns *by name*.
* Use the syntax `[<row>,<column>]` with index values

In [None]:
# Get the value of the population column from Illinois
states.loc["Illinois", "population"]

* We can also be tricky and use more advanced syntax to do more advanced queries.
* You can do any kind of list slicing in place of `<row>` or `<column>` to slice rows and columns

In [None]:
# Get the area for states from Florida to Texas
# this is two dimensional slicing
states.loc["Florida":"Texas", "area"]

In [None]:
# Get the area for Florida and Texas
# Use a list to select multiple specific values
states.loc[["Florida", "Texas"], "area"]

In [None]:
# Get area and population for Florida and Texas
# use a ":" to specify "all columns"
states.loc[["Florida", "Texas"], :]

In [None]:
# select all the rows and columns
states.loc[:,:]

* In the Florida and Texas example above, we pass a list of names for the rows and use the colon ":" to say "all columns"
* We can do the same thing with row and column numbers using `iloc`
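
If you are unsure which position a label corresponds to, `Index.get_loc` can translate one into the other. The sketch below rebuilds a three-state version of the `states` table (figures copied from above) and checks that the `loc` and `iloc` lookups land on the same cell; `row_pos` and `col_pos` are illustrative names.

```python
import pandas as pd

# Rebuild a small version of the states table (figures from above)
states = pd.DataFrame(
    {"population": [38332521, 19552860, 26448193],
     "area": [423967, 170312, 695662]},
    index=["California", "Florida", "Texas"],
)

# Index.get_loc translates a label into its integer position
row_pos = states.index.get_loc("Florida")
col_pos = states.columns.get_loc("area")

# Both lookups land on the same cell
print(states.loc["Florida", "area"])
print(states.iloc[row_pos, col_pos])
```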

In [None]:
# Get the area for states from Florida to Texas
# this is two dimensional slicing
states.iloc[1:, 1]

In [None]:
# Get the area for Florida and Texas
# Use a list to select multiple specific values
states.iloc[[1, 4], 1]

In [None]:
# Get the area for Florida and Texas
# Negative positions count from the end, so -1 is the last row (Texas)
states.iloc[[1, -1], 1]

In [None]:
# Get area and population for Florida and Texas
# use a ":" to specify "all columns"
states.iloc[[1, -1], :]

---

## Exercise

* Using `iloc` and slicing syntax, slice the following DataFrame based on the highlighted blocks in each image
* First think of the slicing syntax to grab just the rows you want, THEN think of the slicing syntax for the columns you want
* Put the row slices *before* the comma and the column slices *after* the comma

In [None]:
# This is our example Dataframe
indexing_example = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
indexing_example

![first slice exercise](images/indexing1.png)
* Select the second two columns of the first two rows.

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing2.png)
* Select the third row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing3.png)
* Select the first two columns

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


![first slice exercise](images/indexing4.png)
* Select the first two columns of the second row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[???]

In [None]:
# scratch space


## Exercise Solutions

* Using `iloc` and slicing syntax, slice the following DataFrame based on the highlighted blocks in each image
* First think of the slicing syntax to grab just the rows you want, THEN think of the slicing syntax for the columns you want
* Put the row slices *before* the comma and the column slices *after* the comma

In [None]:
# This is our example Dataframe
indexing_example = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
indexing_example

![first slice exercise](images/indexing1.png)
* Select the second two columns of the first two rows.

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[:2, 1:]

In [None]:
# scratch space


![first slice exercise](images/indexing2.png)
* Select the third row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[2, :]

In [None]:
# scratch space


![first slice exercise](images/indexing3.png)
* Select the first two columns

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[:,:2]

In [None]:
# scratch space


![first slice exercise](images/indexing4.png)
* Select the first two columns of the second row

In [None]:
# Put the slicing syntax in your answer here
indexing_example.iloc[1,:2]

In [None]:
# scratch space


---