## The Basics of Using Pandas 

This notebook demonstrates some of the features of working with [Pandas](https://pandas.pydata.org) DataFrames and Series. The goals are:

- Learn how to get started using `pandas`
- Describe some of its nomenclature
- Load a data set with `pandas`
- Describe how to answer simple questions about a data set
- Show some common `pandas` operations on data sets

Pandas is a very complicated and powerful framework, and the goal of this notebook is to expose the bare minimum to get you started working on your own data sets. Let's get to it!

In [1]:
%pip install pandas
%pip install faker

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Importing Pandas

Traditionally, `pandas`, `numpy`, and other scientific libraries are imported under a short alias. The idea is to avoid importing `*` into your program's namespace while not having to continually type the (possibly long) package name. When looking at other people's code, you will often see the following:
```
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
```

Write your code however it makes sense to you; these aliases are simply the convention you will meet most often.

In [1]:
import pandas as pd

## Demo Data

Our demo data set is constructed using [Faker](https://github.com/joke2k/faker), which lets us create very large and repeatable data sets without having to permanently store them. The `csv_data` function returns a CSV-formatted string of fake customer order data. We wrap it in an `io.StringIO` object to make it appear more `file`-like and give `pandas.read_csv` something to work on. We'll explore the data set below.

In [2]:
import customers
import io

customer_data = io.StringIO(customers.csv_data())

## Reading the Data

`pandas` supports a wide range of data formats. Here we use the function `pandas.read_csv` to build a `pandas.DataFrame` initialized with the contents of the supplied file. There are quite a few optional arguments to `pandas.read_csv`, but often they aren't required to get a very useful DataFrame. As you learn more about your data set, you can refine these arguments to reduce the amount of "cleaning" you need to apply to it.

In [3]:
df = pd.read_csv(customer_data)
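
As a rough illustration of refining those arguments, the sketch below reads a tiny hand-written CSV string (a made-up stand-in for `customers.csv_data()`, with invented column values) twice: once with defaults and once with `parse_dates`, which asks `read_csv` to convert the named columns to datetimes while loading.

```python
import io

import pandas as pd

# Hypothetical stand-in for customers.csv_data(): a tiny CSV string.
csv_text = (
    "Order Number,Order Date,Unit Price,Units\n"
    "10-0001,2020-03-06,35.06,10\n"
    "10-0002,2020-03-15,722.75,4\n"
)

# Default read: the date column is left as plain strings.
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named columns to datetimes during loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order Date"])

print(plain.dtypes)
print(parsed.dtypes)
```

Whether to parse at load time or convert afterwards is a matter of taste; this notebook converts afterwards so each step is visible.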

## Inspecting a DataFrame

A `pandas.DataFrame` has rows and columns, which makes it look like a 2-dimensional array or an Excel spreadsheet. In practice, however, a `DataFrame` is more like a collection of columns than a matrix. Each column in a DataFrame is a `pandas.Series` object, which is a 1-dimensional array with axis labels.

In the output below, we have the column names across the top, the index on the far left (0 through 9), and the data for each row.

In [4]:
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-26,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-09-02
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-19
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-28
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-29,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-12
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-25,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-30
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-16
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-15,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-19
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-16,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-15
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-26
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-07-07


## DataFrame Columns

The columns property of a `pandas.DataFrame` is a `list`-like object which contains the column names in left to right order. 

In [6]:
df.columns

Index(['Order Number', 'Order Date', 'Inventory Number', 'Unit Price', 'Units',
       'Ship Date', 'Name', 'Address', 'Email', 'Phone Number',
       'Date of Birth'],
      dtype='object')

The columns property is readable and writable. Assigning to it relabels the columns; it does not move the underlying data. For instance, here we take a copy of our source DataFrame, `rdf`, reverse the list of column names, and assign it to `rdf.columns`. Notice in the output that the headers are reversed while the values underneath stay where they were.

In [7]:
rdf = df.copy()
rdf.columns = reversed(list(rdf.columns))
rdf

Unnamed: 0,Date of Birth,Phone Number,Email,Address,Name,Ship Date,Units,Unit Price,Inventory Number,Order Date,Order Number
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-30
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-16
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-25
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-09
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-27
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-13
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-16
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-12
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-23
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-07-04
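
In the output above the headers are reversed, but each value is still in its original position. The following sketch, using a made-up two-column frame, contrasts relabeling through `.columns` with actually reordering the columns by selecting them with a list of labels.

```python
import pandas as pd

# Made-up two-column frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Assigning to .columns only renames: the data stays where it was.
relabeled = df.copy()
relabeled.columns = ["b", "a"]

# Selecting with a reordered list of labels moves each column with its label.
reordered = df[["b", "a"]]

print(relabeled)
print(reordered)
```

After relabeling, the column called `b` holds the values that used to be under `a`; after reordering, every label still sits over its own data.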


## DataFrame Index

The index property is the other important DataFrame property for locating and describing your data. The index can be quite complicated, but in the case of our demo data, it's a simple `RangeIndex` of integers from 0 to 9. Later we'll see some more interesting ways to set the index to explore our data set.

In [8]:
df.index

RangeIndex(start=0, stop=10, step=1)

# Common DataFrame Interrogative Functions - Describe

The `describe` function will apply some simple statistical functions to numerical columns found in the data set. In our case, the data has two columns which have numerical data: Unit Price and Units.

In [9]:
df.describe()

Unnamed: 0,Unit Price,Units
count,10.0,10.0
mean,456.831,6.2
std,365.743293,3.047768
min,21.67,1.0
25%,103.2,4.25
50%,421.36,5.5
75%,839.9975,9.25
max,888.51,10.0
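
By default `describe` skips non-numeric columns. As a small sketch on made-up data, passing `include="all"` asks it to summarize text columns as well, adding rows such as `unique`, `top`, and `freq`.

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
df = pd.DataFrame({"Units": [1, 4, 10], "Name": ["Ann", "Bob", "Ann"]})

numeric = df.describe()                  # numeric columns only
everything = df.describe(include="all")  # text columns summarized too

print(numeric)
print(everything)
```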


# Common DataFrame Interrogative Functions - Info

The `info` function prints a concise summary of a DataFrame's component columns. 

In [10]:
df.info(verbose=True, memory_usage=True, show_counts=True)  # show_counts replaced null_counts in pandas 1.2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order Number      10 non-null     object 
 1   Order Date        10 non-null     object 
 2   Inventory Number  10 non-null     object 
 3   Unit Price        10 non-null     float64
 4   Units             10 non-null     int64  
 5   Ship Date         5 non-null      object 
 6   Name              10 non-null     object 
 7   Address           10 non-null     object 
 8   Email             10 non-null     object 
 9   Phone Number      10 non-null     object 
 10  Date of Birth     10 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 1008.0+ bytes


# DataFrame Property - Shape

The `shape` property is a tuple whose members are the length of the DataFrame in rows and the width in columns. 

In [11]:
df.shape

(10, 11)

# Common DataFrame Interrogative Functions - Head & Tail


The `head` and `tail` functions return the first **N** or last **N** rows of a DataFrame. 


In [12]:
df.head(2)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-30
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-16


In [13]:
df.tail(2)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-23
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-07-04


# Common DataFrame Interrogative Functions - Sample

The `sample` function is a quick way to get a random sample of the rows in your DataFrame. It's helpful for spot checking values you've updated in a DataFrame without suffering from confirmation bias induced by only checking the `head` or `tail` of a very long DataFrame.



In [14]:
df.sample(5)

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-16
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-12
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-30
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-27
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-23
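
Because `sample` is random, the rows change from run to run. A minimal sketch, on a made-up frame, of two commonly used options: `random_state` to make a sample repeatable and `frac` to draw a fraction of the rows rather than a fixed count.

```python
import pandas as pd

# Made-up frame with 100 rows.
df = pd.DataFrame({"n": range(100)})

# random_state seeds the sampler, so the same rows come back each run.
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)

# frac draws a fraction of the rows instead of a fixed count.
tenth = df.sample(frac=0.1, random_state=0)

print(first.index.tolist())
print(len(tenth))
```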


# Viewing a Subset of a DataFrame's Columns

Oftentimes our data sets are "long" with many rows and "wide" with many columns, and they'd be easier to comprehend if we could view only some columns. The DataFrame index operator `[]` takes a column name or a list of column names as an argument. A single name returns that column as a Series; a list of names returns a new DataFrame narrowed to the columns specified. The index operator can also take a boolean array as input, which we'll explore further down.


In [15]:
df[['Order Number', 'Order Date', 'Ship Date']]

Unnamed: 0,Order Number,Order Date,Ship Date
0,90-6790,2020-03-06,2020-04-23
1,66-7342,2020-03-15,
2,47-6588,2020-03-04,
3,31-6105,2020-03-07,2020-05-26
4,90-0956,2020-03-08,2020-03-22
5,96-1537,2020-03-04,
6,89-9680,2020-03-04,2020-06-12
7,62-0581,2020-03-12,2020-05-13
8,32-6389,2020-03-12,
9,97-6482,2020-03-11,
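
The type returned by `[]` depends on what goes inside the brackets. This short sketch, with made-up data, shows the difference between a single label and a list of labels.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Units": [3, 7], "Unit Price": [1.5, 2.0]}
)

one_column = df["Units"]             # a single label returns a Series
one_column_frame = df[["Units"]]     # a one-element list returns a DataFrame
two_columns = df[["Name", "Units"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```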


# Viewing a Subset of a DataFrame's Rows

Here's where things get weird. Iterating over a DataFrame doesn't yield rows as we might expect; it yields the column labels:

In [16]:
for row_maybe in df:
    print(type(row_maybe), row_maybe)

<class 'str'> Order Number
<class 'str'> Order Date
<class 'str'> Inventory Number
<class 'str'> Unit Price
<class 'str'> Units
<class 'str'> Ship Date
<class 'str'> Name
<class 'str'> Address
<class 'str'> Email
<class 'str'> Phone Number
<class 'str'> Date of Birth

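The loop above yields column labels, not rows. When row-by-row iteration really is needed, `iterrows` and `itertuples` are the usual tools; the sketch below uses a made-up frame whose column names are chosen to be valid Python identifiers.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"customer": ["Ann", "Bob"], "units": [3, 7]})

# iterrows yields (index label, row as a Series) pairs.
for label, row in df.iterrows():
    print(label, row["customer"], row["units"])

# itertuples yields one namedtuple per row and is usually faster.
for record in df.itertuples(index=False):
    print(record.customer, record.units)

customers_seen = [record.customer for record in df.itertuples(index=False)]
```

Explicit loops are rarely the fastest option in `pandas`; whole-column operations are usually preferable where they apply.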

# Viewing DataFrame Rows using iloc

The `iloc` property (not function) is an indexer that selects rows and columns by integer position. It is addressed with square brackets and accepts a variety of selectors: integers, slices, lists of integers, boolean arrays, and functions. The selector can also be a tuple that addresses rows first, columns second!

In [17]:
df.iloc[1] # single rows are returned as pandas.Series

Order Number                                         66-7342
Order Date                                        2020-03-15
Inventory Number                                    NP82-IX1
Unit Price                                            722.75
Units                                                      4
Ship Date                                                NaN
Name                                           Amanda Prince
Address             2673 Gay Garden\nSouth Gabriel, VT 18295
Email                             amanda72garcia-dickson.net
Phone Number                              105-121-9772x76875
Date of Birth                                     2003-01-16
Name: 1, dtype: object

In [18]:
df.iloc[2:4] # multiple rows are returned as a pandas.DataFrame

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-25
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-09


In [19]:
df.iloc[:,[9,0,5,2]] # all rows, columns 9, 0, 5, and 2

Unnamed: 0,Phone Number,Order Number,Ship Date,Inventory Number
0,+1-181-269-5921x1723,90-6790,2020-04-23,HI77-BR4
1,105-121-9772x76875,66-7342,,NP82-IX1
2,+1-867-834-9510x36966,47-6588,,BK35-VD9
3,001-399-790-6359x3895,31-6105,2020-05-26,NG57-OK6
4,(309)219-5240,90-0956,2020-03-22,GP92-RI2
5,+1-205-572-0324x3336,96-1537,,WR26-BR7
6,009-038-0324,89-9680,2020-06-12,RJ01-NJ0
7,576-422-6883,62-0581,2020-05-13,JG47-BF7
8,001-884-971-6167x3607,32-6389,,BM63-SE1
9,318-745-5865,97-6482,,ZY23-PM7
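
For reference, here is a compact sketch of each selector kind that `iloc` accepts, run against a small made-up frame so the results are easy to check by eye.

```python
import pandas as pd

# Made-up frame with four rows.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

single = df.iloc[0]                           # integer: one row as a Series
window = df.iloc[1:3]                         # slice: rows 1 and 2 (end excluded)
picked = df.iloc[[3, 0]]                      # list: rows in the order given
masked = df.iloc[[True, False, True, False]]  # boolean list: rows 0 and 2
# callable: receives the frame and returns any of the selectors above
evens = df.iloc[lambda frame: list(range(0, len(frame), 2))]
cell = df.iloc[2, 1]                          # (row, column): a single value

print(window)
print(cell)
```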


# Viewing DataFrame Rows with loc

The `loc` property (not function) is an indexer that selects rows and columns by label or by boolean array. It works much like `iloc`, except that rows and columns are referenced by their labels rather than their integer positions. In this case, our DataFrame's index holds integers, so those are the "labels" that `loc` expects. Later examples will show how to re-index a DataFrame, and this method of access will make more sense.

In [20]:
df.loc[[1,3,5,7], ['Name','Address']]

Unnamed: 0,Name,Address
1,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295"
3,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552"
5,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...
7,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157"
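
The distinction between labels and positions is easier to see when the index is not made of integers. The sketch below uses a made-up frame indexed by invented order numbers; note that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import pandas as pd

# Made-up frame; the index values are invented order numbers used as labels.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Units": [3, 7, 1]},
    index=["A-1", "B-2", "C-3"],
)

by_label = df.loc["B-2", "Units"]   # value at a row label and column label
label_slice = df.loc["A-1":"B-2"]   # label slice: end label is included
position_slice = df.iloc[0:1]       # position slice: end position is excluded

print(by_label)
print(label_slice)
print(position_slice)
```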


# Creating New DataFrame Columns

Sometimes adding new columns to a DataFrame can be helpful when working with data sets. The DataFrame will create a new column for us when we assign to a column name that does not yet exist. In this example, we also show how an operation can be applied to the whole contents of a column at once, rather than iterating through each value as we would with a "traditional" Python data structure like a `list` or `dict`, or with an explicit `map`.


In [21]:
df['Total Invoice'] = df['Unit Price'] * df['Units']
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-30,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-16,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-25,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-09,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-27,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-13,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-16,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-12,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-23,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-07-04,4395.4
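
To make the contrast concrete, this sketch computes the same totals with an explicit Python loop and with a whole-column expression, on a made-up two-row frame. It also shows `assign`, which returns a new DataFrame with the extra column instead of modifying the original.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"Unit Price": [35.06, 722.75], "Units": [10, 4]})

# Element-by-element loop over plain Python values.
looped = [price * units for price, units in zip(df["Unit Price"], df["Units"])]

# Vectorized: one expression over whole columns.
vectorized = df["Unit Price"] * df["Units"]

# assign returns a new DataFrame with the extra column; df is left untouched.
with_total = df.assign(**{"Total Invoice": vectorized})

print(with_total)
```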


# Removing Columns from a DataFrame

Removing a column is accomplished with the function `pandas.DataFrame.drop` which has a huge number of optional arguments that can be intimidating. In our example, let's remove the column 'Date of Birth' since HR has decided that we won't be sending our customers birthday cards any more.

In [22]:
df.drop(columns='Date of Birth')

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4


# But Was The Column Removed?

The column 'Date of Birth' is missing in the output above, but we're about to learn something interesting about the `pandas` way of handling data. Many, but not all, `pandas` operations return a new object by default and leave the original untouched. If we go back and inspect our DataFrame, we'll be surprised to see that the 'Date of Birth' column is still there!

In [23]:
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Date of Birth,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,1951-08-30,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2003-01-16,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2014-09-25,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,1910-09-09,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,1958-01-27,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1987-11-13,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,1998-04-16,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,1996-11-12,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,1966-08-23,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,1985-07-04,4395.4


# inplace=True

Many `pandas` functions take a boolean keyword option called 'inplace' whose default is False. If we wanted to make sure that the 'Date of Birth' column is dropped in the source DataFrame, we would do it this way:

In [24]:
df.drop(columns='Date of Birth', inplace=True)
df

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4
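
An alternative to `inplace=True`, sketched here on a made-up frame, is to rebind the variable to the DataFrame that `drop` returns. The `errors="ignore"` argument is shown as well, since re-running a notebook cell that drops an already-dropped column would otherwise raise a `KeyError`.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame(
    {"Name": ["Ann"], "Date of Birth": ["1990-01-01"], "Units": [3]}
)

# drop returns a new DataFrame; rebinding the name keeps the result.
df = df.drop(columns="Date of Birth")

# errors="ignore" makes a repeated drop a no-op instead of raising KeyError.
df = df.drop(columns="Date of Birth", errors="ignore")

print(df.columns.tolist())
```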


# Data Cleaning - Handling Dates

Our data set has a number of columns whose content looks like a date. However, if you recall the output of the `info` function, those columns' data type is 'object', which generally means 'string'.

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order Number      10 non-null     object 
 1   Order Date        10 non-null     object 
 2   Inventory Number  10 non-null     object 
 3   Unit Price        10 non-null     float64
 4   Units             10 non-null     int64  
 5   Ship Date         5 non-null      object 
 6   Name              10 non-null     object 
 7   Address           10 non-null     object 
 8   Email             10 non-null     object 
 9   Phone Number      10 non-null     object 
 10  Total Invoice     10 non-null     float64
dtypes: float64(2), int64(1), object(8)
memory usage: 1008.0+ bytes


Since the **Order Date** column has no null values, let's work with it first. Converting that column to a more programmatically friendly datatype is pretty easy using `pandas.to_datetime`, which takes a Series as an argument.

In [26]:
pd.to_datetime(df['Order Date'])

0   2020-03-06
1   2020-03-15
2   2020-03-04
3   2020-03-07
4   2020-03-08
5   2020-03-04
6   2020-03-04
7   2020-03-12
8   2020-03-12
9   2020-03-11
Name: Order Date, dtype: datetime64[ns]
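
Real data is not always this clean. As a hedged sketch on a made-up Series, `errors="coerce"` tells `to_datetime` to turn unparseable values into `NaT` rather than raising, and the `.dt` accessor exposes date parts once the conversion is done.

```python
import pandas as pd

# Made-up Series with one value that is not a date.
raw = pd.Series(["2020-03-06", "2020-03-15", "not a date"])

# errors="coerce" turns values that cannot be parsed into NaT.
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Once converted, the .dt accessor exposes date parts.
print(dates.dt.day_name())

# Date arithmetic works too; max and min skip the missing value.
span = dates.max() - dates.min()
print(span)
```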

It turns out we can replace a column in a DataFrame as long as the replacement has the same length. In this case we are replacing a single column of 10 values (a Series of shape `(10,)`). So to convert the **Order Date** column to _datetime64_ we would write:

In [27]:
df['Order Date'] = pd.to_datetime(df['Order Date'])
print(df.info())  # info() prints its own report and returns None, hence the trailing "None"
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order Number      10 non-null     object        
 1   Order Date        10 non-null     datetime64[ns]
 2   Inventory Number  10 non-null     object        
 3   Unit Price        10 non-null     float64       
 4   Units             10 non-null     int64         
 5   Ship Date         5 non-null      object        
 6   Name              10 non-null     object        
 7   Address           10 non-null     object        
 8   Email             10 non-null     object        
 9   Phone Number      10 non-null     object        
 10  Total Invoice     10 non-null     float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 1008.0+ bytes
None


Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4


# Data Cleaning - What the Heck is NaN/NaT

Sometimes our data sets have missing values, and they show up as `NaN` (`numpy.nan`) and `NaT` (`pandas.NaT`) values in our DataFrames. `NaN` is short for "Not a Number", while `NaT` is short for "Not a Time". Looking at the **Ship Date** column, you can see we have five `NaN`s lurking in there. According to the documentation for `pandas.read_csv`, blank CSV fields are interpreted as `NaN`, so that's where they came from.

Let's first convert that column to datetime64 like we did for **Order Date**.

In [28]:
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
print(df.info())  # again, the trailing "None" is the return value of info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order Number      10 non-null     object        
 1   Order Date        10 non-null     datetime64[ns]
 2   Inventory Number  10 non-null     object        
 3   Unit Price        10 non-null     float64       
 4   Units             10 non-null     int64         
 5   Ship Date         5 non-null      datetime64[ns]
 6   Name              10 non-null     object        
 7   Address           10 non-null     object        
 8   Email             10 non-null     object        
 9   Phone Number      10 non-null     object        
 10  Total Invoice     10 non-null     float64       
dtypes: datetime64[ns](2), float64(2), int64(1), object(6)
memory usage: 1008.0+ bytes
None


Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,2020-04-23,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,NaT,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,NaT,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,2020-05-26,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,2020-03-22,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,NaT,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,2020-06-12,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,2020-05-13,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,NaT,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,NaT,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4


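Now we've converted **Ship Date** from object to datetime, but we've traded `NaN` for `NaT`. It's a step in the right direction. Since the ship date is missing, we can infer that these rows are unfulfilled orders. Be careful with `dropna`, though: with `axis=1`, as in the cell below, it drops every *column* that contains a missing value, so the result loses **Ship Date** but keeps all ten rows. It does not filter the data down to shipped orders.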

In [29]:
shipped = df.dropna(axis=1)
shipped

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Name,Address,Email,Phone Number,Total Invoice
0,90-6790,2020-03-06,HI77-BR4,35.06,10,Ronald Baker,"7379 Brandi Fords\nPort Amandamouth, ND 59565",richardsmegan5walker.com,+1-181-269-5921x1723,350.6
1,66-7342,2020-03-15,NP82-IX1,722.75,4,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
3,31-6105,2020-03-07,NG57-OK6,887.84,6,Taylor Castaneda,"5629 Le Centers\nCopelandtown, KS 97552",catherineallenhawkins.com,001-399-790-6359x3895,5327.04
4,90-0956,2020-03-08,GP92-RI2,888.51,10,Sean Mccarthy,"256 Cooper Overpass Apt. 316\nBlakehaven, FL 5...",twardgmail.com,(309)219-5240,8885.1
5,96-1537,2020-03-04,WR26-BR7,229.62,7,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
6,89-9680,2020-03-04,RJ01-NJ0,61.06,10,Christopher Park,"67751 Jon Common\nEast Reneeburgh, NE 38218",dpacheco9yahoo.com,009-038-0324,610.6
7,62-0581,2020-03-12,JG47-BF7,372.58,1,John Hill,"9862 Cisneros Run Apt. 070\nTaraport, RI 11157",david052yahoo.com,576-422-6883,372.58
8,32-6389,2020-03-12,BM63-SE1,21.67,4,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4
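
The output above still has all ten rows and no **Ship Date** column, because `axis=1` removes columns. To keep only the rows that have a ship date, one option is `dropna` with the `subset` argument, sketched here on a made-up three-row frame.

```python
import pandas as pd

# Made-up frame; the middle order has no ship date.
df = pd.DataFrame(
    {
        "Order Number": ["A-1", "B-2", "C-3"],
        "Ship Date": pd.to_datetime(["2020-04-23", None, "2020-05-26"]),
    }
)

dropped_columns = df.dropna(axis=1)        # removes the Ship Date column
shipped = df.dropna(subset=["Ship Date"])  # removes rows lacking a ship date

print(dropped_columns)
print(shipped)
```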


# Selecting Data out of DataFrames with Boolean Series


In [30]:
df['Ship Date'] == pd.NaT

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Ship Date, dtype: bool

In [31]:
df['Ship Date'].fillna(0)

0    2020-04-23 00:00:00
1                      0
2                      0
3    2020-05-26 00:00:00
4    2020-03-22 00:00:00
5                      0
6    2020-06-12 00:00:00
7    2020-05-13 00:00:00
8                      0
9                      0
Name: Ship Date, dtype: object

In [32]:
df['Ship Date'].fillna(0) == 0

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
8     True
9     True
Name: Ship Date, dtype: bool

In [33]:
df[df['Ship Date'].fillna(0) == 0]

Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
1,66-7342,2020-03-15,NP82-IX1,722.75,4,NaT,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,NaT,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
5,96-1537,2020-03-04,WR26-BR7,229.62,7,NaT,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
8,32-6389,2020-03-12,BM63-SE1,21.67,4,NaT,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,NaT,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4


In [34]:
unshipped = df[df['Ship Date'].fillna(1010) == 1010]
unshipped


Unnamed: 0,Order Number,Order Date,Inventory Number,Unit Price,Units,Ship Date,Name,Address,Email,Phone Number,Total Invoice
1,66-7342,2020-03-15,NP82-IX1,722.75,4,NaT,Amanda Prince,"2673 Gay Garden\nSouth Gabriel, VT 18295",amanda72garcia-dickson.net,105-121-9772x76875,2891.0
2,47-6588,2020-03-04,BK35-VD9,470.14,5,NaT,Thomas Wang,USCGC Mitchell\nFPO AP 19018,kristinodom8vargas.org,+1-867-834-9510x36966,2350.7
5,96-1537,2020-03-04,WR26-BR7,229.62,7,NaT,Jay Fowler,158 Franklin Mountain Apt. 263\nNew Thomasmout...,ddean6ross.biz,+1-205-572-0324x3336,1607.34
8,32-6389,2020-03-12,BM63-SE1,21.67,4,NaT,Matthew Richardson,"42076 Adam Ramp\nKimberlyhaven, OK 60700",vvancemartinez.net,001-884-971-6167x3607,86.68
9,97-6482,2020-03-11,ZY23-PM7,879.08,5,NaT,James Martin,"0026 Parker Spring\nSouth Bradberg, ND 60017",welchmichael1gmail.com,318-745-5865,4395.4
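
The cells above work around the fact that `NaT` never compares equal to anything, itself included, by filling the gaps with a sentinel value first. A more direct route, sketched here on a made-up Series, is to build the boolean mask with `isna` (or its opposite, `notna`).

```python
import pandas as pd

# Made-up Series with two missing ship dates.
ship = pd.Series(
    pd.to_datetime(["2020-04-23", None, "2020-05-26", None]),
    name="Ship Date",
)

# Comparing with == finds nothing, because NaT is not equal to NaT.
print((ship == pd.NaT).any())

missing = ship.isna()      # True where the value is NaT
unshipped = ship[missing]  # boolean mask selects the matching entries
shipped = ship[~missing]   # ~ inverts the mask; ship[ship.notna()] is equivalent

print(unshipped)
print(shipped)
```

The same masks can be used on a whole DataFrame, for example `df[df['Ship Date'].isna()]`.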
