#  Exploratory Data Analysis With Pandas

### Overview

In this lesson, students will begin using Pandas for exploratory data analysis. This will include filtering and sorting data to generate insights.

### Learning Objectives

* Use Pandas to read in a data set
* Use DataFrame attributes and methods to investigate a data set's integrity
* Apply filters and sorting to DataFrames

## Meet Pandas
 

#### What is Pandas?

In [1]:
# What is Pandas?
import pandas as pd

print(f"{pd.__package__} version {pd.__version__}")
print(pd.__doc__)

pandas version 1.0.1

pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply 

###  Reading a Data Set

In [2]:
# set some variables
file_address = 'data/chipotle.tsv'
delimiter_character = '\t'

In [3]:
# read in a file
data_frame = pd.read_csv(file_address, sep=delimiter_character)
data_frame[:5]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [4]:
data_frame[:10:2]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
2,1,1,Nantucket Nectar,[Apple],$3.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
6,3,1,Side of Chips,,$1.69
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
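
The cells above depend on `data/chipotle.tsv` being present. As a minimal sketch of the same `read_csv` call without the file, the snippet below parses a small tab-separated string. The three rows and the names `tsv_text` and `sample_frame` are illustrative assumptions, not part of the lesson data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/chipotle.tsv
tsv_text = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tIzze\t$3.39\n"
    "1\t1\tNantucket Nectar\t$3.39\n"
    "2\t2\tChicken Bowl\t$16.98\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas
sample_frame = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_frame.shape)

# The same slice syntax as above: every second row, starting from the first
every_other_row = sample_frame[::2]
print(every_other_row)
```

Passing a file path instead of the `StringIO` object works the same way, which is what the lesson cell does.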


###  Series vs. DataFrames

In [6]:
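# Note: without parentheses, this displays the bound method (including the
# frame's repr) instead of calling it. Call data_frame.info() to print the
# column dtypes and non-null counts.
data_frame.info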

<bound method DataFrame.info of       order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description item_price  
0                                                   NaN     $2.39   
1                        

In [11]:
data_frame.shape

(4622, 5)
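
`shape` is an attribute, while `info` is a method that has to be called. The sketch below shows the difference on a small made-up frame. The frame contents and the names `frame`, `buffer`, and `summary` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value
frame = pd.DataFrame(
    {"quantity": [1, 2, 1], "item_name": ["Izze", "Chicken Bowl", None]}
)

# Attribute: read it without parentheses
n_rows, n_cols = frame.shape

# Method: it must be called. info() prints to stdout by default;
# passing buf captures the text so it can be inspected.
buffer = io.StringIO()
frame.info(buf=buffer)
summary = buffer.getvalue()

print(n_rows, n_cols)
print(summary)
```

The non-null counts in the `info()` summary are a quick way to spot columns with missing data.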

In [13]:
data_frame.sample(15).sort_index()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
850,350,3,Canned Soft Drink,[Sprite],$3.75
1082,446,1,Chips,,$2.15
1669,675,1,Barbacoa Soft Tacos,"[Tomatillo Red Chili Salsa, [Rice, Cheese, Sou...",$9.25
1960,792,1,Chicken Soft Tacos,"[Roasted Chili Corn Salsa, [Rice, Cheese]]",$8.75
2193,884,1,Chips,,$2.15
2301,924,1,Chips and Roasted Chili-Corn Salsa,,$2.39
2534,1006,1,Chicken Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Chees...",$8.75
2943,1170,1,Steak Burrito,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$8.99
3363,1349,1,Chicken Salad,"[Fresh Tomato Salsa (Mild), [Pinto Beans, Rice...",$8.49
3754,1502,2,Steak Bowl,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$23.50


In [14]:
my_data = data_frame['item_name']

In [15]:
# A Series: a single column, selected with one pair of brackets
my_data

0                Chips and Fresh Tomato Salsa
1                                        Izze
2                            Nantucket Nectar
3       Chips and Tomatillo-Green Chili Salsa
4                                Chicken Bowl
                        ...                  
4617                            Steak Burrito
4618                            Steak Burrito
4619                       Chicken Salad Bowl
4620                       Chicken Salad Bowl
4621                       Chicken Salad Bowl
Name: item_name, Length: 4622, dtype: object

In [16]:
type(my_data)

pandas.core.series.Series

In [17]:
len(my_data)

4622

In [None]:
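# Not executed here: Series.info() was only added in pandas 1.4, so under
# the pandas 1.0.1 used above this call raises AttributeError.
my_data.info()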

In [20]:
print(type(my_data))
print(my_data)

<class 'pandas.core.series.Series'>
0                Chips and Fresh Tomato Salsa
1                                        Izze
2                            Nantucket Nectar
3       Chips and Tomatillo-Green Chili Salsa
4                                Chicken Bowl
                        ...                  
4617                            Steak Burrito
4618                            Steak Burrito
4619                       Chicken Salad Bowl
4620                       Chicken Salad Bowl
4621                       Chicken Salad Bowl
Name: item_name, Length: 4622, dtype: object


In [21]:
# tolist() returns a plain Python list (not an array), despite the variable name
my_data_array = my_data.tolist()

print(my_data_array[:10])

['Chips and Fresh Tomato Salsa', 'Izze', 'Nantucket Nectar', 'Chips and Tomatillo-Green Chili Salsa', 'Chicken Bowl', 'Chicken Bowl', 'Side of Chips', 'Steak Burrito', 'Steak Soft Tacos', 'Steak Burrito']


In [23]:
# A DataFrame: pass a list of column names inside the brackets
new_df = data_frame[['item_name','item_price']]

In [24]:
new_df

Unnamed: 0,item_name,item_price
0,Chips and Fresh Tomato Salsa,$2.39
1,Izze,$3.39
2,Nantucket Nectar,$3.39
3,Chips and Tomatillo-Green Chili Salsa,$2.39
4,Chicken Bowl,$16.98
...,...,...
4617,Steak Burrito,$11.75
4618,Steak Burrito,$11.75
4619,Chicken Salad Bowl,$11.25
4620,Chicken Salad Bowl,$8.75


In [25]:
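# Note: without parentheses, this displays the bound method rather than the
# summary statistics. Call new_df.describe() to compute them.
new_df.describe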

<bound method NDFrame.describe of                                   item_name item_price
0              Chips and Fresh Tomato Salsa     $2.39 
1                                      Izze     $3.39 
2                          Nantucket Nectar     $3.39 
3     Chips and Tomatillo-Green Chili Salsa     $2.39 
4                              Chicken Bowl    $16.98 
...                                     ...        ...
4617                          Steak Burrito    $11.75 
4618                          Steak Burrito    $11.75 
4619                     Chicken Salad Bowl    $11.25 
4620                     Chicken Salad Bowl     $8.75 
4621                     Chicken Salad Bowl     $8.75 

[4622 rows x 2 columns]>
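
To make the Series and DataFrame distinction concrete, here is a small sketch on a made-up three-row frame (the data and variable names are assumptions). It also calls `describe()` with parentheses, which on text columns reports `count`, `unique`, `top`, and `freq`.

```python
import pandas as pd

# Hypothetical miniature of the orders data
orders = pd.DataFrame(
    {
        "item_name": ["Izze", "Chips", "Chips"],
        "item_price": ["$3.39", "$2.15", "$2.15"],
    }
)

one_column = orders["item_name"]                   # single brackets -> Series
two_columns = orders[["item_name", "item_price"]]  # list of names -> DataFrame
still_a_frame = orders[["item_name"]]              # one-element list -> DataFrame

# Calling describe() (with parentheses) computes the summary
summary = two_columns.describe()

print(type(one_column).__name__, type(still_a_frame).__name__)
print(summary)
```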

###  Accessing and Modifying the Index

In [26]:
new_df.index

RangeIndex(start=0, stop=4622, step=1)

In [27]:
new_df.loc[0:4]

Unnamed: 0,item_name,item_price
0,Chips and Fresh Tomato Salsa,$2.39
1,Izze,$3.39
2,Nantucket Nectar,$3.39
3,Chips and Tomatillo-Green Chili Salsa,$2.39
4,Chicken Bowl,$16.98


In [28]:
new_df.loc[5:12:2]

Unnamed: 0,item_name,item_price
5,Chicken Bowl,$10.98
7,Steak Burrito,$11.75
9,Steak Burrito,$9.25
11,Chicken Crispy Tacos,$8.75


In [31]:
new_df.iloc[10]

item_name     Chips and Guacamole
item_price                 $4.45 
Name: 10, dtype: object

In [30]:
new_df.iloc[[10,11]]

Unnamed: 0,item_name,item_price
10,Chips and Guacamole,$4.45
11,Chicken Crispy Tacos,$8.75
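
With the default `RangeIndex`, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a made-up frame whose labels start at 10 (an assumption for illustration) to show that `loc` slices by label and includes the end, while `iloc` slices by position and excludes it.

```python
import pandas as pd

# Hypothetical prices with index labels that differ from positions
prices = pd.DataFrame(
    {"item_price": [2.39, 3.39, 3.39, 2.39, 16.98]},
    index=[10, 11, 12, 13, 14],
)

by_label = prices.loc[10:12]    # labels 10, 11, 12: the end label is included
by_position = prices.iloc[0:2]  # positions 0 and 1: the end position is excluded

single_row = prices.iloc[0]     # one integer -> Series
one_row_frame = prices.iloc[[0]]  # list of integers -> DataFrame

print(by_label)
print(by_position)
```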


In [32]:
column_name = 'order_id'
data_frame.set_index(column_name, inplace=True)
data_frame

order_id,quantity,item_name,choice_description,item_price
1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,Izze,[Clementine],$3.39
1,1,Nantucket Nectar,[Apple],$3.39
1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
...,...,...,...,...
1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75
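
As an alternative to `inplace=True`, `set_index` can return a new frame and leave the original untouched. The following sketch, on made-up data with assumed variable names, also shows that a repeated index label selects every matching row and that `reset_index` moves the index back into a column.

```python
import pandas as pd

# Hypothetical orders: order 1 contains two items
orders = pd.DataFrame(
    {"order_id": [1, 1, 2], "item_name": ["Izze", "Chips", "Chicken Bowl"]}
)

# Without inplace=True, a new frame is returned and orders is unchanged
indexed = orders.set_index("order_id")

# Index labels need not be unique: label 1 matches two rows
first_order = indexed.loc[1]

# reset_index() turns the index back into an ordinary column
restored = indexed.reset_index()

print(first_order)
print(restored.columns.tolist())
```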


###  Columns and Data Types

In [33]:
data_frame.columns # Prints all the column names

Index(['quantity', 'item_name', 'choice_description', 'item_price'], dtype='object')

In [34]:
data_frame.dtypes # Prints all the data types, but is hard to read!

quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [39]:
# Easy-to-read DataFrame of the data types: 
pd.DataFrame(data_frame.dtypes, columns=['DataType'])

Unnamed: 0,DataType
quantity,int64
item_name,object
choice_description,object
item_price,object
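
The `object` dtype on `item_price` means the prices are stored as text such as `$2.39`, so numeric summaries will not work on that column as it stands. One possible cleanup, shown here as a sketch on three sample values rather than as the lesson's own method, strips the dollar sign and converts to `float`.

```python
import pandas as pd

# Three sample price strings in the same style as the item_price column
prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "], name="item_price")

# Remove the literal "$", trim surrounding whitespace, then convert to float
numeric_prices = (
    prices.str.replace("$", "", regex=False).str.strip().astype(float)
)

print(prices.dtype, "->", numeric_prices.dtype)
print(numeric_prices.sum())
```

`regex=False` matters here because `$` is a special character in regular expressions.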


###  Renaming Columns

In [45]:
# Read the mtcars data and show its first five rows with head()
file_address = 'data/mtcars.csv'
delimiter_character = ','
df = pd.read_csv(file_address, sep=delimiter_character)
df.head()

Unnamed: 0.1,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [46]:
# Rename the unnamed first column to 'model'
df.rename(columns={'Unnamed: 0': 'model'}, inplace=True)
df.head(3)

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
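
`rename` also works without `inplace=True`, in which case it returns a renamed copy. The sketch below uses a two-row stand-in for the cars data (the frame and variable names are assumptions). It also shows that, by default, a name in the mapping that matches no column is ignored rather than raising an error, so typos can go unnoticed.

```python
import pandas as pd

# Hypothetical two-row stand-in for the mtcars frame
cars = pd.DataFrame(
    {"Unnamed: 0": ["Mazda RX4", "Datsun 710"], "mpg": [21.0, 22.8]}
)

# Returns a renamed copy; cars keeps its original column names
renamed = cars.rename(columns={"Unnamed: 0": "model"})

# A key that matches no column is silently ignored by default
unchanged = cars.rename(columns={"no_such_column": "x"})

print(renamed.columns.tolist())
print(unchanged.columns.tolist())
```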


###  Common Column Operations

In [47]:
df['mpg'].describe()

count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000
Name: mpg, dtype: float64

In [48]:
df['cyl'].value_counts()

8    14
4    11
6     7
Name: cyl, dtype: int64

In [54]:
df['model'].unique()

array(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
       'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D',
       'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL',
       'Merc 450SLC', 'Cadillac Fleetwood', 'Lincoln Continental',
       'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla',
       'Toyota Corona', 'Dodge Challenger', 'AMC Javelin', 'Camaro Z28',
       'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa',
       'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
      dtype=object)

In [55]:
df['model'].nunique()

32
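
The column methods above can be tried on a short made-up Series. In this sketch the values and names are assumptions. `value_counts(normalize=True)` is a variation that returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical cylinder counts for eight cars
cyl = pd.Series([6, 6, 4, 6, 8, 6, 8, 4], name="cyl")

counts = cyl.value_counts()                # occurrences, most frequent first
shares = cyl.value_counts(normalize=True)  # the same counts as proportions

distinct_values = sorted(cyl.unique())     # the distinct values themselves
n_distinct = cyl.nunique()                 # how many distinct values there are

print(counts)
print(shares)
print(distinct_values, n_distinct)
```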

## Filtering and Sorting Data
 

###  The Boolean Mask

In [56]:
# Comparing a column to a value returns a Boolean Series: the mask
df['cyl'] == 6

0      True
1      True
2     False
3      True
4     False
5      True
6     False
7     False
8     False
9      True
10     True
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29     True
30    False
31    False
Name: cyl, dtype: bool

In [57]:
# Indexing with the mask keeps only the rows where it is True
df[df['cyl'] == 6].tail(3)

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
10,Merc 280C,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
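
A mask is an ordinary Series, so it can be stored in a variable, negated, and counted. The following sketch uses four rows taken from the cars data. The variable names are assumptions.

```python
import pandas as pd

# Four rows from the cars data
cars = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Datsun 710", "Valiant", "Duster 360"],
        "cyl": [6, 4, 6, 8],
    }
)

is_six = cars["cyl"] == 6        # Boolean Series, one value per row
six_cylinder = cars[is_six]      # rows where the mask is True
other_cars = cars[~is_six]       # ~ inverts the mask
n_six = int(is_six.sum())        # True counts as 1, so sum() counts matches

print(six_cylinder)
print(n_six)
```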


###  DataFrame Syntax Chaining

In [59]:
# Access the mpg column of all results passing the filter:
df[df['gear'] == 4]['mpg']

0     21.0
1     21.0
2     22.8
7     24.4
8     22.8
9     19.2
10    17.8
17    32.4
18    30.4
19    33.9
25    27.3
31    21.4
Name: mpg, dtype: float64

In [64]:
# Gives numerical summaries based only on results passing the filter
df[df['gear'] == 4]['mpg'].describe()

count    12.000000
mean     24.533333
std       5.276764
min      17.800000
25%      21.000000
50%      22.800000
75%      28.075000
max      33.900000
Name: mpg, dtype: float64
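
The chained form filters rows first and then selects a column. The same result can be written as a single `.loc` call that takes the row mask and the column name together. The sketch below compares the two on four rows from the cars data. The variable names are assumptions.

```python
import pandas as pd

# Four rows from the cars data
cars = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"],
        "mpg": [21.0, 21.0, 22.8, 21.4],
        "gear": [4, 4, 4, 3],
    }
)

# Chained: filter rows, then pick a column
chained = cars[cars["gear"] == 4]["mpg"]

# Single step: row mask and column label in one .loc call
single_step = cars.loc[cars["gear"] == 4, "mpg"]

print(chained.equals(single_step))
print(single_step.mean())
```

For reading data the two forms are equivalent. For assigning values, the single `.loc` call is the reliable choice, because chained indexing may operate on a temporary copy.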

###  Filtering by Multiple Conditions

In [65]:
# Combine conditions with & (and) or | (or).
# Each condition in parentheses is itself a Boolean mask.
df[(df['gear'] == 5) & (df['mpg'] > 20)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


In [67]:
# The parentheses here are essential: & and | bind more tightly than
# comparison operators such as == and >, so leaving them out usually raises an error.
df[(df['cyl'] == 8) | (df['mpg'] > 32)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
11,Merc 450SE,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
12,Merc 450SL,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
13,Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3
14,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
15,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
16,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
17,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
19,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
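
The sketch below repeats both filters on five rows from the cars data and then deliberately omits the parentheses to show the failure. The variable names are assumptions, and the exact exception type can depend on the column dtypes and the pandas version, so the example catches both `TypeError` and `ValueError`.

```python
import pandas as pd

# Five rows from the cars data
cars = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Hornet Sportabout", "Fiat 128",
                  "Porsche 914-2", "Ferrari Dino"],
        "mpg": [21.0, 18.7, 32.4, 26.0, 19.7],
        "cyl": [6, 8, 4, 4, 6],
        "gear": [4, 3, 4, 5, 5],
    }
)

both = cars[(cars["gear"] == 5) & (cars["mpg"] > 20)]    # and
either = cars[(cars["cyl"] == 8) | (cars["mpg"] > 32)]   # or

# Without parentheses, Python evaluates 5 & cars["mpg"] first,
# which is not the intended comparison.
try:
    cars[cars["gear"] == 5 & cars["mpg"] > 20]
    raised = False
except (TypeError, ValueError) as error:
    raised = True
    print(type(error).__name__)

print(both["model"].tolist())
print(either["model"].tolist())
```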


###  Sorting

In [68]:
# For a Series object, no need to specify column: There's only one!
df['mpg'].sort_values().head()

15    10.4
14    10.4
23    13.3
6     14.3
16    14.7
Name: mpg, dtype: float64

In [69]:
# For a DataFrame, sort_values needs the column to sort on (by=...); use sort_index() to sort by index.
df.sort_values(by='mpg', ascending=True).head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
15,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
14,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
23,Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
16,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
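
`sort_values` also accepts `ascending=False` and a list of columns. This sketch, on five rows from the cars data with assumed variable names, sorts in descending order and then by two keys. It returns new frames and leaves the original order intact.

```python
import pandas as pd

# Five rows from the cars data
cars = pd.DataFrame(
    {
        "model": ["Lincoln Continental", "Cadillac Fleetwood", "Fiat 128",
                  "Toyota Corolla", "Mazda RX4"],
        "mpg": [10.4, 10.4, 32.4, 33.9, 21.0],
        "cyl": [8, 8, 4, 4, 6],
    }
)

# Highest mpg first
by_mpg_desc = cars.sort_values(by="mpg", ascending=False)

# Fewest cylinders first; within each cylinder count, highest mpg first
by_cyl_then_mpg = cars.sort_values(by=["cyl", "mpg"], ascending=[True, False])

# sort_index() restores the original row order
back_in_order = by_mpg_desc.sort_index()

print(by_mpg_desc["model"].tolist())
print(by_cyl_then_mpg["model"].tolist())
```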


###  Accessing an Individual Row

In [70]:
# iloc selects rows by integer position; a list of positions returns a DataFrame
df.sort_values(by="mpg").iloc[[0]]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
15,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4


In [71]:
# A single integer (one pair of brackets) returns the row as a Series
df.sort_values(by="mpg").iloc[0]

model    Lincoln Continental
mpg                     10.4
cyl                        8
disp                     460
hp                       215
drat                       3
wt                     5.424
qsec                   17.82
vs                         0
am                         0
gear                       3
carb                       4
Name: 15, dtype: object
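
After sorting, position and index label no longer agree, which is why the row above is reached with `iloc[0]` but is named `15`. The sketch below shows the same effect on three rows from the cars data. The variable names are assumptions.

```python
import pandas as pd

# Three rows from the cars data
cars = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Lincoln Continental", "Fiat 128"],
        "mpg": [21.0, 10.4, 32.4],
    }
)

ranked = cars.sort_values(by="mpg")

lowest_row = ranked.iloc[0]      # first row by position, as a Series
lowest_frame = ranked.iloc[[0]]  # the same row as a one-row DataFrame
label_zero = ranked.loc[0]       # the row whose index label is 0

print(lowest_row["model"], lowest_row.name)
print(label_zero["model"])
```

The `name` of the Series returned by `iloc[0]` is the row's original index label, not its new position.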