## The Basics of Using Pandas 

This notebook demonstrates some of the features of working with [Pandas](https://pandas.pydata.org) DataFrames and Series. The goals are:

- Learn how to get started using `pandas`
- Describe some of its nomenclature
- Load a data set with `pandas`
- Describe how to answer simple questions about a data set
- Show some common `pandas` operations on data sets

In [1]:
%pip install pandas
%pip install faker

Note: you may need to restart the kernel to use updated packages.


## Importing Pandas

By convention, `pandas`, `numpy`, and other scientific libraries are imported under a short alias. The idea is to avoid importing `*` into your program's namespace while not having to type the (possibly long) package name over and over. When looking at other people's code, you will often see the following:
```
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
```

It's up to you to write your code however it makes sense to you, but I wanted to prepare you for when you look at other people's code. 

In [2]:
import pandas as pd

## Demo Data

Our demo data set is constructed using [Faker](https://github.com/joke2k/faker), which allows us to create very large and repeatable data sets without having to find a place to permanently house them. The `csv_data` function returns a string, and we wrap it in an `io.StringIO` object to make it `file`-like and give `pandas.read_csv` something to read from. We'll explore the data set below, so I'm not going to bother describing it here.

In [3]:
import customers
import io

customer_data = io.StringIO(customers.csv_data())
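The `customers` module isn't shown in this notebook, so here is a minimal sketch of the same idea, using a hand-written CSV string as a stand-in for whatever `customers.csv_data()` returns. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for customers.csv_data(): a CSV document held in a string.
csv_text = "Order Number,Unit Price,Units\n90-6790,35.06,10\n66-7342,722.75,4\n"

# StringIO gives the string a file-like interface that read_csv can consume.
buffer = io.StringIO(csv_text)
small_df = pd.read_csv(buffer)

print(small_df.shape)
```

One thing to watch for: a `StringIO` buffer keeps a read position just like a file does, so if you want to read it a second time you'll generally need to call `buffer.seek(0)` first.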

## Reading the Data

`pandas` supports an astounding variety of data formats. Here we take advantage of the `pandas.read_csv` function, which returns a `pandas.DataFrame` initialized with the contents of the supplied file. There are quite a few optional arguments to `pandas.read_csv`, but often you don't need to modify them to get a reasonable result.

In [4]:
df = pd.read_csv(customer_data)
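When the defaults aren't quite enough, one optional argument that tends to come in handy is `parse_dates`. The sketch below assumes a tiny made-up CSV (not the notebook's Faker data) and asks `read_csv` to convert two columns to real datetimes rather than leaving them as strings.

```python
import io

import pandas as pd

# Made-up sample rows; the second order has no ship date yet.
csv_text = (
    "Order Number,Order Date,Ship Date\n"
    "90-6790,2020-03-03,2020-04-18\n"
    "66-7342,2020-03-08,\n"
)

# parse_dates lists the columns that should be converted to datetimes.
orders = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
)

print(orders.dtypes)
```

Missing dates show up as `NaT` ("not a time"), the datetime counterpart of `NaN`.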

## Inspecting a DataFrame

A `pandas.DataFrame` has rows and columns, which makes it look like a 2-dimensional array or an Excel spreadsheet. In practice, however, a `DataFrame` behaves more like a collection of named columns (each of which is a `pandas.Series` object) than like a matrix.

In the output below, we have the column names across the top, the index on the far left (0 through 9), and the data for each row.

In [5]:
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-11
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-20
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-04
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-22
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-08
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-11
7,62-0581,2020-03-06,JG47-BF7,372.58,1,2020-05-08,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-07
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-18
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29
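To see the "collection of columns" idea in action, here's a small sketch with a made-up two-column DataFrame. Pulling out one column by name hands back a `pandas.Series`, and each column carries its own data type.

```python
import pandas as pd

# Made-up data with one float column and one integer column.
demo = pd.DataFrame({"Unit Price": [35.06, 722.75], "Units": [10, 4]})

# Selecting a single column by name returns a Series.
price = demo["Unit Price"]
print(type(price).__name__)  # Series

# Each column keeps its own dtype, unlike a homogeneous matrix.
print(demo.dtypes)
```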


## DataFrame Columns

The `columns` property of a `pandas.DataFrame` is a `list`-like object (a `pandas.Index`) which contains the column names in left-to-right order.

In [6]:
df.columns

Index(['Order Number', 'Order Date', 'Inventory Number', 'Unit Price', 'Units',
       'Ship Date', 'Name', 'Address', 'Email', 'Phone Number',
       'Date of Birth'],
      dtype='object')

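The `columns` property is both readable and writable. Be careful, though: assigning to it relabels the columns and leaves the data where it is; it does not move the data around. For instance, here we take a copy of our source DataFrame, `rdf`, reverse the list of column names, and assign it to `rdf.columns`. Notice in the output that the headers are reversed but the values underneath them are not, so 'Date of Birth' now sits above the order numbers.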

In [7]:
rdf = df.copy()
rdf.columns = reversed(list(rdf.columns))
rdf

Unnamed: 0,Date of Birth,Phone Number,Email,Address,Name,Ship Date,Units,Unit Price,Inventory Number,Order Date,Order Number
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-11
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-20
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-04
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-22
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-08
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-11
7,62-0581,2020-03-06,JG47-BF7,372.58,1,2020-05-08,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-07
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-18
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29
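To make the difference between relabeling and reordering concrete, here's a small sketch on a made-up one-row DataFrame. Assigning to `columns` only swaps the labels, while selecting with a list of names moves the labels and the data together.

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Ronald Baker"], "Units": [10]})
reversed_names = list(demo.columns[::-1])  # ['Units', 'Name']

# Relabel: the data stays put, only the headers change.
relabeled = demo.copy()
relabeled.columns = reversed_names

# Reorder: select the columns in the order we want them.
reordered = demo[reversed_names]

print(relabeled)
print(reordered)
```

In `relabeled`, the 'Units' header ends up above the customer's name, which is probably not what anyone wants. `reordered` keeps each header attached to its own data.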


## DataFrame Index

The index property is the other important DataFrame property for locating and describing your data. The index can be quite complicated, but in the case of our demo data, it's a simple `RangeIndex` of monotonically increasing integers from 0 through 9 (note that `stop=10` is exclusive). Later we'll see some more interesting ways to set the index to explore our data set.

In [8]:
df.index

RangeIndex(start=0, stop=10, step=1)
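As a small taste of what a non-default index looks like, here's a sketch that promotes a column to the index with `set_index`. The two-row DataFrame is made up, and this is just one of several ways to go about it.

```python
import pandas as pd

demo = pd.DataFrame({"Order Number": ["90-6790", "66-7342"], "Units": [10, 4]})

# set_index returns a new DataFrame whose row labels come from the column.
by_order = demo.set_index("Order Number")

# Rows can now be looked up by order number with the label-based .loc accessor.
units = by_order.loc["66-7342", "Units"]
print(units)
```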

# Common DataFrame Interrogative Functions - Describe

The `describe` method computes some simple summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns found in the data set. In our case, two columns hold numerical data: Unit Price and Units.

In [9]:
df.describe()

Unnamed: 0,Unit Price,Units
count,10.0,10.0
mean,456.831,6.2
std,365.743293,3.047768
min,21.67,1.0
25%,103.2,4.25
50%,421.36,5.5
75%,839.9975,9.25
max,888.51,10.0
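By default the text columns are skipped. If you'd like them summarized too, `describe` accepts an `include` argument. Here's a sketch on made-up data that compares the default with `include="all"`.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Name": ["Ronald Baker", "Amanda Prince", "Ronald Baker"],
        "Unit Price": [35.06, 722.75, 470.14],
    }
)

# Default: numerical columns only.
numeric_summary = demo.describe()

# include="all" adds rows such as 'unique', 'top', and 'freq' for text columns.
full_summary = demo.describe(include="all")

print(full_summary)
```

Statistics that don't apply to a column (the mean of a name, say) are filled in with `NaN`.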


# Common DataFrame Interrogative Functions - Head & Tail


The `head` and `tail` functions return the first **N** or last **N** rows of a DataFrame. 


In [10]:
df.head(2)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-11


In [11]:
df.tail(2)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-18
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29
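If you leave **N** out, both functions fall back to five rows. A quick sketch with a made-up ten-row DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({"Units": range(10)})

first_five = demo.head()   # no argument: the first 5 rows
last_three = demo.tail(3)  # explicit N: the last 3 rows

print(len(first_five), last_three["Units"].tolist())
```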


# Common DataFrame Interrogative Functions - Sample

The `sample` function is a quick way to get a random sample of the rows in your DataFrame. This can be quite helpful for getting a sense of the contents of very large data sets. 



In [12]:
df.sample(5)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-22
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-11
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-08
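Because the sample is random, you'll get different rows each time you run the cell. If you need the same rows every time (for a notebook you plan to share, for example), `sample` takes a `random_state` seed. It can also take a fraction of the rows instead of a count. A sketch with made-up data:

```python
import pandas as pd

demo = pd.DataFrame({"Units": range(10)})

# The same seed gives the same rows on every run.
first = demo.sample(5, random_state=42)
second = demo.sample(5, random_state=42)
print(first.equals(second))  # True

# frac asks for a fraction of the rows rather than a fixed count.
thirty_percent = demo.sample(frac=0.3, random_state=0)
print(len(thirty_percent))  # 3
```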


# Viewing a Subset of a DataFrame's Columns

Often our data sets are "long" with many rows and "wide" with many columns, and they'd be easier to comprehend if we could view only some of the columns. The DataFrame index operator `[]` takes a column name or a list of column names as an argument. A single name returns that column as a `pandas.Series`, while a list of names returns a new DataFrame narrowed to the columns specified. The index operator can also take a boolean array as input, which we'll explore further down.


In [13]:
df[['Order Number', 'Order Date', 'Ship Date']]

Unnamed: 0,Order Number,Order Date,Ship Date
0,90-6790,2020-03-03,2020-04-18
1,66-7342,2020-03-08,
2,47-6588,2020-03-02,
3,31-6105,2020-03-04,2020-05-21
4,90-0956,2020-03-04,2020-03-17
5,96-1537,2020-03-02,
6,89-9680,2020-03-02,2020-06-07
7,62-0581,2020-03-06,2020-05-08
8,32-6389,2020-03-06,
9,97-6482,2020-03-06,
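As a quick preview of the boolean array form, here's a sketch on made-up data. Comparing a column to a value produces a Series of `True`/`False` values, and passing that Series to `[]` keeps only the rows where it is `True`.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Order Number": ["90-6790", "66-7342", "47-6588"], "Units": [10, 4, 5]}
)

# A comparison on a column yields one boolean per row.
mask = demo["Units"] > 4

# Indexing with the mask selects rows, not columns.
big_orders = demo[mask]
print(big_orders)
```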


# Viewing a Subset of a DataFrame's Rows

Here's where things get weird. Iterating over a DataFrame doesn't do what we might expect: instead of rows, it yields the column names.

In [14]:
for row_maybe in df:
    print(type(row_maybe), row_maybe)

<class 'str'> Order Number
<class 'str'> Order Date
<class 'str'> Inventory Number
<class 'str'> Unit Price
<class 'str'> Units
<class 'str'> Ship Date
<class 'str'> Name
<class 'str'> Address
<class 'str'> Email
<class 'str'> Phone Number
<class 'str'> Date of Birth

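If you really do want to walk through the rows one at a time, `iterrows` and `itertuples` are the usual tools. Here's a sketch using a made-up DataFrame whose column names are valid Python identifiers, which keeps the `itertuples` attribute access simple.

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Ronald Baker", "Amanda Prince"], "Units": [10, 4]})

# iterrows yields (index label, row as a Series) pairs.
names = [row["Name"] for _, row in demo.iterrows()]

# itertuples yields one namedtuple per row; index=False leaves the index out.
units = [record.Units for record in demo.itertuples(index=False)]

print(names, units)
```

Column names containing spaces, like most of the ones in our demo data, aren't valid attribute names, so `itertuples` falls back to positional names for them. Row-by-row loops also tend to be much slower than whole-column operations, so they're best kept for cases where nothing else fits.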

# Viewing DataFrame Rows using iloc

The `iloc` property (not function) is an indexer that selects rows based on their integer position, counting from 0, regardless of what the index labels are. The property is addressed using square brackets and accepts a variety of selectors: integers, slices, lists of integers, boolean arrays, and functions. Finally, the selector can be a tuple that addresses rows first, columns second!

In [15]:
df.iloc[1] # single rows are returned as pandas.Series

Order Number                                         66-7342
Order Date                                        2020-03-08
Inventory Number                                    NP82-IX1
Unit Price                                            722.75
Units                                                      4
Ship Date                                                NaN
Name                                           Amanda Prince
Address             2673 Gay Garden\nSouth Gabriel, VT 18295
Email                             amanda72garcia-dickson.net
Phone Number                              105-121-9772x76875
Date of Birth                                     2003-01-11
Name: 1, dtype: object

In [16]:
df.iloc[2:4] # slices (end exclusive) are returned as a pandas.DataFrame

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-20
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-04


The `iloc` accessor is pretty powerful! 

In [17]:
df.iloc[:,[9,0,5,2]] # all rows, columns 9, 0, 5, and 2

Unnamed: 0,Phone Number,Order Number,Ship Date,Inventory Number
0,+1-181-269-5921x1723,90-6790,2020-04-18,HI77-BR4
1,105-121-9772x76875,66-7342,,NP82-IX1
2,+1-867-834-9510x36966,47-6588,,BK35-VD9
3,001-399-790-6359x3895,31-6105,2020-05-21,NG57-OK6
4,(309)219-5240,90-0956,2020-03-17,GP92-RI2
5,+1-205-572-0324x3336,96-1537,,WR26-BR7
6,009-038-0324,89-9680,2020-06-07,RJ01-NJ0
7,576-422-6883,62-0581,2020-05-08,JG47-BF7
8,001-884-971-6167x3607,32-6389,,BM63-SE1
9,318-745-5865,97-6482,,ZY23-PM7
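Here's a sketch that runs through a few of those selector types in one place. The DataFrame is made up, and I've deliberately given it an index that doesn't start at 0 to show that `iloc` counts positions while `loc` looks up labels.

```python
import pandas as pd

demo = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=[40, 30, 20, 10],
)

first_row = demo.iloc[0]                        # position 0, whose label is 40
cell = demo.iloc[0, 1]                          # row position 0, column position 1
block = demo.iloc[1:3, 0:2]                     # rows 1-2, columns 'a' and 'b'
picked = demo.iloc[[True, False, True, False]]  # boolean list: rows 0 and 2
by_label = demo.loc[10]                         # label 10, which is the last row

print(cell, block.shape)
```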


# Creating New DataFrame Columns

Sometimes adding new columns to a DataFrame is helpful when working with our data sets. It turns out that the DataFrame will create a new column for us when we assign to a column name that doesn't exist yet. This example also shows how we can perform an operation on the entire contents of a column at once, rather than iterating through each value and applying the operation as we would with a "traditional" Python data structure like a `list` or `dict`.


In [18]:
df['Total Invoice'] = df['Unit Price'] * df['Units']
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth,Total Invoice
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25,350.6
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-11,2891.0
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-20,2350.7
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-04,5327.04
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-22,8885.1
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-08,1607.34
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-11,610.6
7,62-0581,2020-03-06,JG47-BF7,372.58,1,2020-05-08,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-07,372.58
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-18,86.68
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29,4395.4
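For comparison, here's a sketch of the same calculation done both ways on a made-up two-row DataFrame: first with an explicit Python loop, then with a single column-wise expression.

```python
import pandas as pd

demo = pd.DataFrame({"Unit Price": [35.06, 722.75], "Units": [10, 4]})

# The "traditional" way: visit each pair of values and build a list.
looped = [price * units for price, units in zip(demo["Unit Price"], demo["Units"])]

# The pandas way: multiply whole columns and assign to a new column name.
demo["Total Invoice"] = demo["Unit Price"] * demo["Units"]

print(demo)
```

Both give the same numbers. The column-wise version is shorter and usually much faster on large data sets.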


# Removing Columns from a DataFrame

Removing a column is accomplished with the `pandas.DataFrame.drop` method, which has a number of optional arguments that can be intimidating. In our example, let's remove the 'Date of Birth' column, since HR has decided that we won't be sending our customers birthday cards anymore.

In [19]:
df.drop(columns='Date of Birth')

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-06,JG47-BF7,372.58,1,2020-05-08,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4


# But Was The Column Removed?

The 'Date of Birth' column is missing in the output above, but we're about to learn something interesting about the `pandas` way of handling data. Many, but not all, `pandas` operations return a new object by default and leave the original untouched. If we go back and inspect our DataFrame, we'll be surprised to see that the 'Date of Birth' column is still there!

In [20]:
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth,Total Invoice
0,90-6790,2020-03-03,HI77-BR4,35.06,10,2020-04-18,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-25,350.6
1,66-7342,2020-03-08,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-11,2891.0
2,47-6588,2020-03-02,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-20,2350.7
3,31-6105,2020-03-04,NG57-OK6,887.84,6,2020-05-21,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-04,5327.04
4,90-0956,2020-03-04,GP92-RI2,888.51,10,2020-03-17,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-22,8885.1
5,96-1537,2020-03-02,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-08,1607.34
6,89-9680,2020-03-02,RJ01-NJ0,61.06,10,2020-06-07,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-11,610.6
7,62-0581,2020-03-06,JG47-BF7,372.58,1,2020-05-08,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-07,372.58
8,32-6389,2020-03-06,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-18,86.68
9,97-6482,2020-03-06,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-06-29,4395.4


# inplace=True

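Many `pandas` methods take a boolean keyword argument called `inplace`, whose default is `False`. With `inplace=True`, the method modifies the object it is called on and returns `None` instead of a new object. If we wanted to make sure that the 'Date of Birth' column is dropped in the source DataFrame, we could do it this way: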

In [None]:
df.drop(columns='Date of Birth', inplace=True)
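Here's a sketch that puts the two behaviors side by side on a made-up one-row DataFrame. It also shows a common alternative to `inplace=True`: keep the default and assign the result to a name.

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Ronald Baker"], "Date of Birth": ["1951-08-25"]})

# Default behavior: drop returns a new DataFrame and leaves demo alone.
trimmed = demo.drop(columns="Date of Birth")
print("Date of Birth" in demo.columns)  # True

# inplace=True modifies the DataFrame itself and returns None.
in_place = demo.copy()
result = in_place.drop(columns="Date of Birth", inplace=True)
print(result)  # None
```

Which style you use is largely a matter of taste. Assigning the result (for example `df = df.drop(...)`) lets you chain further method calls, which `inplace=True` doesn't since it returns `None`.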