## Pandas: Data Structures and Accessing the Data

First, we need to import pandas:

In [2]:
# Import libraries
import pandas as pd # import pandas library

Pandas has two core data structures:
- Series: a labeled 1D array with native support for many data operations that NumPy arrays lack.
- DataFrame: tabular data with various tabular manipulation operations. Individual columns/rows are pandas Series.

#### Pandas Series

We have data on the highest number of cars that a few famous people have owned. 

| Person | Max number of Cars |
| --- | --- | 
| Muammar Qaddafi | 25000 |
| Mohandas Gandhi | 0 |
| Saddam Hussein | 4500 |
| Kevin Bacon | 2 |
| Billy Bob Thornton | 8 |

Let's represent this as a series.

In [3]:
pd.Series([25000,0,4500,2,8],
          index = ['Muammar Qaddafi', 'Mohandas Gandhi', 'Saddam Hussein', 'Kevin Bacon', 'Billy Bob Thornton'], 
          name = 'Max Number Cars Owned')

Muammar Qaddafi       25000
Mohandas Gandhi           0
Saddam Hussein         4500
Kevin Bacon               2
Billy Bob Thornton        8
Name: Max Number Cars Owned, dtype: int64

In [4]:
# A Series can be built more naturally from a dict.
car_dict = {'Muammar Qaddafi': 25000, 'Mohandas Gandhi': 0, 
            'Saddam Hussein': 4500, 'Kevin Bacon': 2, 'Billy Bob Thornton': 8}

car_owner_series = pd.Series(car_dict)
car_owner_series

Muammar Qaddafi       25000
Mohandas Gandhi           0
Saddam Hussein         4500
Kevin Bacon               2
Billy Bob Thornton        8
dtype: int64

Why use a pandas Series?

Combines:
- Dictionary style fast lookup.
- Numpy style vectorized operations on the values.


In [5]:
# indexed on sensible keys. 
car_owner_series['Billy Bob Thornton']

8

In [6]:
# can slice on these keys (both endpoints are included)
car_owner_series["Mohandas Gandhi":"Kevin Bacon"]

Mohandas Gandhi       0
Saddam Hussein     4500
Kevin Bacon           2
dtype: int64

In [7]:
# can do fast computation like a NumPy array

# A new set of values. Kevin Bacon bought an extra car and Billy Bob bought two more. 
delta_cars = {'Mohandas Gandhi': 0, 'Billy Bob Thornton': 2, 
              'Saddam Hussein': 0, 'Kevin Bacon': 1, 'Muammar Qaddafi': 0}

delta_cars_series = pd.Series(delta_cars)

In [8]:
print(delta_cars_series)

Mohandas Gandhi       0
Billy Bob Thornton    2
Saddam Hussein        0
Kevin Bacon           1
Muammar Qaddafi       0
dtype: int64


In [9]:
print(car_owner_series)

Muammar Qaddafi       25000
Mohandas Gandhi           0
Saddam Hussein         4500
Kevin Bacon               2
Billy Bob Thornton        8
dtype: int64


We want to update the counts, but the two Series are not in the same order.

No problem for pandas: arithmetic between two Series aligns on the index labels, not on position.

In [10]:
new_car_series = car_owner_series + delta_cars_series
print(new_car_series)

Billy Bob Thornton       10
Kevin Bacon               3
Mohandas Gandhi           0
Muammar Qaddafi       25000
Saddam Hussein         4500
dtype: int64
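The addition above works because pandas matches up the index labels before adding. A minimal sketch of what alignment does when a label is missing from one side, using made-up labels and values (not the car data above):

```python
import pandas as pd

# Hypothetical toy data: label 'c' appears only in the first Series.
owned = pd.Series({'a': 1, 'b': 2, 'c': 3})
delta = pd.Series({'b': 10, 'a': 20})

# Plain + aligns on labels; a label missing from either side gives NaN.
aligned_sum = owned + delta

# .add() with fill_value treats a missing label as 0 instead.
filled_sum = owned.add(delta, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

If every label is present in both Series, as in the car example, the two forms give the same result.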


#### Some important Series attributes

- The Series.index attribute: the index labels (keys), returned as a pandas Index object

In [11]:
new_car_series.index

Index(['Billy Bob Thornton', 'Kevin Bacon', 'Mohandas Gandhi',
       'Muammar Qaddafi', 'Saddam Hussein'],
      dtype='object')

- The Series.values attribute: the Series values, returned as a NumPy array

In [12]:
new_car_series.values

array([   10,     3,     0, 25000,  4500], dtype=int64)

- The Series.name attribute: the name of the series

In [13]:
new_car_series.name = 'Max cars owned'
print(new_car_series)

Billy Bob Thornton       10
Kevin Bacon               3
Mohandas Gandhi           0
Muammar Qaddafi       25000
Saddam Hussein         4500
Name: Max cars owned, dtype: int64


In [14]:
new_car_series.name

'Max cars owned'

- The Series.dtype attribute: the data type of the Series values

In [15]:
new_car_series.dtype

dtype('int64')

Series also come with a variety of built-in methods.

Example: sorting by max cars in descending order:

In [16]:
new_car_series.sort_values(ascending = False)

Muammar Qaddafi       25000
Saddam Hussein         4500
Billy Bob Thornton       10
Kevin Bacon               3
Mohandas Gandhi           0
Name: Max cars owned, dtype: int64
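Note that `sort_values` returns a new Series rather than reordering the original. A minimal sketch with made-up labels and values (illustrative only), which also shows the related `sort_index` method:

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order.
toy_series = pd.Series({'b': 3, 'c': 10, 'a': 0})

by_value = toy_series.sort_values(ascending=False)  # largest value first
by_label = toy_series.sort_index()                  # alphabetical by label

print(by_value.index.tolist())    # ['c', 'b', 'a']
print(by_label.index.tolist())    # ['a', 'b', 'c']
print(toy_series.index.tolist())  # ['b', 'c', 'a'] (original unchanged)
```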

Series also have:
- native methods for handling time series data
- a whole host of other nice methods.

We will see these later.

#### Pandas DataFrames

We saw these before with the heart disease dataset. A DataFrame is a tabular data structure.

Let's take a new dataset that has data about various breakfast cereals.

In [17]:
cereal_df = pd.read_csv('Data/cereal.csv', index_col = 'name')
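The `index_col` argument tells `read_csv` which column to use as the row labels. A small sketch of its effect, assuming an in-memory CSV string with made-up rows in place of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a file on disk.
csv_text = "name,mfr,calories\nCereal A,N,70\nCereal B,Q,120\n"

# Without index_col, rows get a default integer index 0, 1, ...
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'name' column becomes the row labels.
named_df = pd.read_csv(io.StringIO(csv_text), index_col='name')

print(default_df.index.tolist())  # [0, 1]
print(named_df.index.tolist())    # ['Cereal A', 'Cereal B']
```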

We often want a quick view of the first few entries in the table.

The .head() method:

In [18]:
cereal_df.head(2) # first 2 rows; with no argument, .head() returns the first 5

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |


Less commonly, we take a look at the end:

The .tail() method:

In [19]:
cereal_df.tail()

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Triples | G | C | 110 | 2 | 1 | 250 | 0.0 | 21.0 | 3 | 60 | 25 | 3 | 1.0 | 0.75 | 39.106174 |
| Trix | G | C | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 25 | 25 | 2 | 1.0 | 1.0 | 27.753301 |
| Wheat Chex | R | C | 100 | 3 | 1 | 230 | 3.0 | 17.0 | 3 | 115 | 25 | 1 | 1.0 | 0.67 | 49.787445 |
| Wheaties | G | C | 100 | 3 | 1 | 200 | 3.0 | 17.0 | 3 | 110 | 25 | 1 | 1.0 | 1.0 | 51.592193 |
| Wheaties Honey Gold | G | C | 110 | 2 | 1 | 200 | 1.0 | 16.0 | 8 | 60 | 25 | 1 | 1.0 | 0.75 | 36.187559 |


Good common practice:

Start by looking at some metadata and descriptive statistics for the DataFrame.

- .info() method: column data types and non-null counts. Any nulls?
- .describe() method: summary statistics for each numeric column

In [20]:
cereal_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, 100% Bran to Wheaties Honey Gold
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   mfr       77 non-null     object 
 1   type      77 non-null     object 
 2   calories  77 non-null     int64  
 3   protein   77 non-null     int64  
 4   fat       77 non-null     int64  
 5   sodium    77 non-null     int64  
 6   fiber     77 non-null     float64
 7   carbo     77 non-null     float64
 8   sugars    77 non-null     int64  
 9   potass    77 non-null     int64  
 10  vitamins  77 non-null     int64  
 11  shelf     77 non-null     int64  
 12  weight    77 non-null     float64
 13  cups      77 non-null     float64
 14  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(2)
memory usage: 9.6+ KB


In [21]:
cereal_df.describe()

|  | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 |
| mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.02961 | 0.821039 | 42.665705 |
| std | 19.484119 | 1.09479 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
| min | 50.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | 1.0 | 0.5 | 0.25 | 18.042851 |
| 25% | 100.0 | 2.0 | 0.0 | 130.0 | 1.0 | 12.0 | 3.0 | 40.0 | 25.0 | 1.0 | 1.0 | 0.67 | 33.174094 |
| 50% | 110.0 | 3.0 | 1.0 | 180.0 | 2.0 | 14.0 | 7.0 | 90.0 | 25.0 | 2.0 | 1.0 | 0.75 | 40.400208 |
| 75% | 110.0 | 3.0 | 2.0 | 210.0 | 3.0 | 17.0 | 11.0 | 120.0 | 25.0 | 3.0 | 1.0 | 1.0 | 50.828392 |
| max | 160.0 | 6.0 | 5.0 | 320.0 | 14.0 | 23.0 | 15.0 | 330.0 | 100.0 | 3.0 | 1.5 | 1.5 | 93.704912 |


Important basic DataFrame attributes:

- DataFrame.index: the row labels (a pandas Index object)
- DataFrame.columns: the column names (a pandas Index object)
- DataFrame.shape: returns a (number of rows, number of columns) tuple.


In [22]:
cereal_df.columns

Index(['mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo',
       'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'],
      dtype='object')

In [23]:
cereal_df.index[0:10]

Index(['100% Bran', '100% Natural Bran', 'All-Bran',
       'All-Bran with Extra Fiber', 'Almond Delight',
       'Apple Cinnamon Cheerios', 'Apple Jacks', 'Basic 4', 'Bran Chex',
       'Bran Flakes'],
      dtype='object', name='name')

In [24]:
cereal_df.shape

(77, 15)

#### Accessing data in a DataFrame

Accessing data in a Series by named index is easy. Remember:

In [25]:
new_car_series['Billy Bob Thornton']

10

With DataFrames, we can access entire **columns** in a similar way. For example, access the calories column:

In [26]:
cereal_df['calories']

name
100% Bran                     70
100% Natural Bran            120
All-Bran                      70
All-Bran with Extra Fiber     50
Almond Delight               110
                            ... 
Triples                      110
Trix                         110
Wheat Chex                   100
Wheaties                     100
Wheaties Honey Gold          110
Name: calories, Length: 77, dtype: int64

In [27]:
cereal_df.calories # equivalent to cereal_df['calories']

name
100% Bran                     70
100% Natural Bran            120
All-Bran                      70
All-Bran with Extra Fiber     50
Almond Delight               110
                            ... 
Triples                      110
Trix                         110
Wheat Chex                   100
Wheaties                     100
Wheaties Honey Gold          110
Name: calories, Length: 77, dtype: int64

Wait a minute...this is returning a Series with name "calories"!

Individual columns and rows are extracted from a DataFrame as pandas Series.
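A quick way to see this is to check the type and name of an extracted column and row. A minimal sketch on a made-up two-row table (the labels and values are illustrative, not from the cereal data):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame(
    {'calories': [70, 120], 'fat': [1, 5]},
    index=['Cereal A', 'Cereal B'],
)

column = toy_df['calories']   # one column -> Series named after the column
row = toy_df.loc['Cereal A']  # one row -> Series named after the row label

print(type(column).__name__, column.name)  # Series calories
print(type(row).__name__, row.name)        # Series Cereal A
```

The row Series is indexed by the column names, which is why a single row prints vertically.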

We can also extract a subset of the columns by passing in a list of column names.

DataFrame[list of column names in subset]: returns a DataFrame

In [28]:
col_list = ['calories', 'fat', 'sugars']
cereal_df[col_list]

| name | calories | fat | sugars |
| --- | --- | --- | --- |
| 100% Bran | 70 | 1 | 6 |
| 100% Natural Bran | 120 | 5 | 8 |
| All-Bran | 70 | 1 | 5 |
| All-Bran with Extra Fiber | 50 | 0 | 0 |
| Almond Delight | 110 | 2 | 8 |
| ... | ... | ... | ... |
| Triples | 110 | 1 | 3 |
| Trix | 110 | 1 | 12 |
| Wheat Chex | 100 | 1 | 3 |
| Wheaties | 100 | 1 | 3 |


This is a new DataFrame containing only the columns in the list. We can access a single value at a particular row and column as follows:

DataFrame[column_name][row_name]

In [29]:
cereal_df['sugars']['Fruity Pebbles']

12
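The lookup above runs in two steps: first the column is extracted as a Series, then the row label is looked up in that Series. The same value can be fetched in a single call with `.loc[row, column]` or with the scalar accessor `.at[row, column]`. A minimal sketch on a made-up table (labels and values are illustrative):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame(
    {'sugars': [6, 12], 'fat': [1, 0]},
    index=['Cereal A', 'Cereal B'],
)

chained = toy_df['sugars']['Cereal B']       # column first, then row
with_loc = toy_df.loc['Cereal B', 'sugars']  # row and column in one call
with_at = toy_df.at['Cereal B', 'sugars']    # scalar-only accessor

print(chained, with_loc, with_at)  # 12 12 12
```

Note the argument order: the chained form is column then row, while `.loc[]` and `.at[]` take row then column.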

#### The .loc[] accessor:

- Access a single row by named index
- Complex selections: slicing across both rows and columns, etc.
- Really important to use when assigning values to a selection.

1. DataFrame.loc[row_accessor]
2. DataFrame.loc[row_accessor, column_accessor]
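Because `.loc[]` takes the row and column selection in a single call, it is also the form to use when assigning values. A minimal sketch on a made-up table (the labels and numbers are illustrative, not from the cereal data):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame(
    {'sugars': [6, 12, 14], 'shelf': [3, 2, 2]},
    index=['Cereal A', 'Cereal B', 'Cereal C'],
)

# Single-cell assignment: row label and column label in one .loc call.
toy_df.loc['Cereal A', 'sugars'] = 7

# Assign one value to several rows of a column at once.
toy_df.loc[['Cereal B', 'Cereal C'], 'shelf'] = 1

print(toy_df)
```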


Accessing a single row with .loc[]

In [30]:
cereal_df.head(8)

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |
| Basic 4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |


In [31]:
cereal_df.loc['All-Bran']

mfr                 K
type                C
calories           70
protein             4
fat                 1
sodium            260
fiber             9.0
carbo             7.0
sugars              5
potass            320
vitamins           25
shelf               3
weight            1.0
cups             0.33
rating      59.425505
Name: All-Bran, dtype: object

Accessing multiple rows:

In [32]:
cereal_df.head(8)

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |
| Basic 4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |


In [33]:
# select rows by list of index names
row_list = ['All-Bran', 'Almond Delight', 'Apple Jacks']
cereal_df.loc[row_list]

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |


In [34]:
#slice rows by name
cereal_df.loc['All-Bran':'Apple Jacks']

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |


Note: with .loc[], the final entry *is included* in the slice.

Accessing multiple columns:

In [35]:
cereal_df.head(8)

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |
| Basic 4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |


In [36]:
# select columns by list
listcol = ["calories", "protein", "fat", "sodium"]
cereal_df.loc["All-Bran", listcol]

calories     70
protein       4
fat           1
sodium      260
Name: All-Bran, dtype: object

In [37]:
# slice on columns by name
cereal_df.loc["All-Bran", 
              "calories":"sodium"]


calories     70
protein       4
fat           1
sodium      260
Name: All-Bran, dtype: object

Putting it all together (selections on rows and columns):

In [38]:
cereal_df.head(8)

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |
| Basic 4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |


In [39]:
# slicing on rows AND columns
cereal_df.loc["All-Bran":"Almond Delight", 
              "calories":"sodium"]

| name | calories | protein | fat | sodium |
| --- | --- | --- | --- | --- |
| All-Bran | 70 | 4 | 1 | 260 |
| All-Bran with Extra Fiber | 50 | 4 | 0 | 140 |
| Almond Delight | 110 | 2 | 2 | 200 |


In [40]:
# accessing all rows and a column subset 
# with .loc accessor 
cereal_df.loc[:, ['protein', 'fat']]

| name | protein | fat |
| --- | --- | --- |
| 100% Bran | 4 | 1 |
| 100% Natural Bran | 3 | 5 |
| All-Bran | 4 | 1 |
| All-Bran with Extra Fiber | 4 | 0 |
| Almond Delight | 2 | 2 |
| ... | ... | ... |
| Triples | 2 | 1 |
| Trix | 1 | 1 |
| Wheat Chex | 3 | 1 |
| Wheaties | 3 | 1 |


In [41]:
cereal_df[['calories','protein']]

| name | calories | protein |
| --- | --- | --- |
| 100% Bran | 70 | 4 |
| 100% Natural Bran | 120 | 3 |
| All-Bran | 70 | 4 |
| All-Bran with Extra Fiber | 50 | 4 |
| Almond Delight | 110 | 2 |
| ... | ... | ... |
| Triples | 110 | 2 |
| Trix | 110 | 1 |
| Wheat Chex | 100 | 3 |
| Wheaties | 100 | 3 |


Passing a list of column names works the same way with or without .loc[]. The only difference arises when slicing on columns:
- We really need to use the .loc[] accessor for this.

In [42]:
cereal_df.loc[:, 'calories':'sodium']

| name | calories | protein | fat | sodium |
| --- | --- | --- | --- | --- |
| 100% Bran | 70 | 4 | 1 | 130 |
| 100% Natural Bran | 120 | 3 | 5 | 15 |
| All-Bran | 70 | 4 | 1 | 260 |
| All-Bran with Extra Fiber | 50 | 4 | 0 | 140 |
| Almond Delight | 110 | 2 | 2 | 200 |
| ... | ... | ... | ... | ... |
| Triples | 110 | 2 | 1 | 250 |
| Trix | 110 | 1 | 1 | 140 |
| Wheat Chex | 100 | 3 | 1 | 230 |
| Wheaties | 100 | 3 | 1 | 200 |


In [43]:
cereal_df['calories':'sodium']

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

The result is empty: with plain `[]`, a slice is applied to the rows, not the columns, and no row label falls between 'calories' and 'sodium'.


#### The .iloc[] accessor:

- Access rows and columns by their integer position instead of named index.
- Everything else is pretty much the same as .loc[]

In [44]:
cereal_df.head(5)

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |


In [45]:
cereal_df.iloc[1:4, 2:6]

| name | calories | protein | fat | sodium |
| --- | --- | --- | --- | --- |
| 100% Natural Bran | 120 | 3 | 5 | 15 |
| All-Bran | 70 | 4 | 1 | 260 |
| All-Bran with Extra Fiber | 50 | 4 | 0 | 140 |


Note: with an .iloc[] slice, the last position is *NOT included* in the slice.
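The endpoint rules for the two accessors can be compared side by side. A small sketch with a made-up four-row table (labels and values are illustrative):

```python
import pandas as pd

# Hypothetical toy table with labels A-D at positions 0-3.
toy_df = pd.DataFrame(
    {'calories': [70, 120, 70, 50]},
    index=['A', 'B', 'C', 'D'],
)

by_label = toy_df.loc['B':'C']   # label slice: both endpoints included
by_position = toy_df.iloc[1:3]   # position slice: stop position excluded

print(by_label.index.tolist())     # ['B', 'C']
print(by_position.index.tolist())  # ['B', 'C']
```

Both selections return the same two rows, even though the label slice names its last row explicitly and the position slice stops one past it.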

## Pandas Selection by Logical Filtering

Pandas provides an easy way of making selections based on a given condition. This relies on two steps:
- Constructing a Boolean mask based on a set of logical conditions
- Applying the mask to the data in our selection/accessor

We will see that Boolean masking makes it possible to build complex data selections.

#### Constructing the 'Boolean Mask'

We use Boolean testing conditions on pandas Series/DataFrames. This will return a Boolean series specifying whether a given row satisfies the condition. Let's walk through a few common examples:

Let's look at a statistical summary of our data:

In [46]:
cereal_df.describe()

|  | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 |
| mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.02961 | 0.821039 | 42.665705 |
| std | 19.484119 | 1.09479 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
| min | 50.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | 1.0 | 0.5 | 0.25 | 18.042851 |
| 25% | 100.0 | 2.0 | 0.0 | 130.0 | 1.0 | 12.0 | 3.0 | 40.0 | 25.0 | 1.0 | 1.0 | 0.67 | 33.174094 |
| 50% | 110.0 | 3.0 | 1.0 | 180.0 | 2.0 | 14.0 | 7.0 | 90.0 | 25.0 | 2.0 | 1.0 | 0.75 | 40.400208 |
| 75% | 110.0 | 3.0 | 2.0 | 210.0 | 3.0 | 17.0 | 11.0 | 120.0 | 25.0 | 3.0 | 1.0 | 1.0 | 50.828392 |
| max | 160.0 | 6.0 | 5.0 | 320.0 | 14.0 | 23.0 | 15.0 | 330.0 | 100.0 | 3.0 | 1.5 | 1.5 | 93.704912 |


Maybe we want to examine which cereals contain excessive amounts of sugar. Inspecting the summary above, let's say that we make a judgment call that everything at or above the upper quartile (Q3, which is 11 for sugars) is a high sugar content cereal. First, let's see how to create a Boolean mask that reflects whether a row matches the criterion:

Does the cereal have sugar content in the top 25% bracket?

In [47]:
cereal_df['sugars'] >= 11

name
100% Bran                    False
100% Natural Bran            False
All-Bran                     False
All-Bran with Extra Fiber    False
Almond Delight               False
                             ...  
Triples                      False
Trix                          True
Wheat Chex                   False
Wheaties                     False
Wheaties Honey Gold          False
Name: sugars, Length: 77, dtype: bool

Clearly Trix should not be for kids...

- Tests condition for all rows. This Pandas Boolean comparison is typically much quicker than looping through a list or creating a list comprehension with the corresponding conditional logic. 
- Returns a pandas Series with Boolean values.
    - Is the result a True or False for given row?
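To make the comparison with a plain Python loop concrete, here is a minimal sketch on a made-up three-item Series (the labels and values are illustrative). It only shows that the two approaches agree, not how much faster the vectorized form is:

```python
import pandas as pd

# Hypothetical toy data.
sugars = pd.Series([6, 12, 14], index=['Cereal A', 'Cereal B', 'Cereal C'])

# Vectorized comparison: one expression, evaluated for every row.
mask = sugars >= 11

# Equivalent element-by-element version with a list comprehension.
looped = [value >= 11 for value in sugars]

print(mask.tolist())  # [False, True, True]
```

The vectorized version also keeps the index labels, which is what lets the mask be applied back to the data later.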


You can, of course, create more complex Boolean conditions with | (OR) and & (AND):

In [None]:
(cereal_df['sugars'] >= 11) & (cereal_df['potass'] <= 20)

name
100% Bran                    False
100% Natural Bran            False
All-Bran                     False
All-Bran with Extra Fiber    False
Almond Delight               False
                             ...  
Triples                      False
Trix                         False
Wheat Chex                   False
Wheaties                     False
Wheaties Honey Gold          False
Length: 77, dtype: bool

Test whether a cereal is manufactured by General Mills or Kellogg's:

In [None]:
(cereal_df['mfr'] == 'G')|(cereal_df['mfr'] == 'K')

name
100% Bran                    False
100% Natural Bran            False
All-Bran                      True
All-Bran with Extra Fiber     True
Almond Delight               False
                             ...  
Triples                       True
Trix                          True
Wheat Chex                   False
Wheaties                      True
Wheaties Honey Gold           True
Name: mfr, Length: 77, dtype: bool
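Each comparison above is wrapped in parentheses because Python's `&` and `|` operators bind more tightly than comparisons such as `>=` and `==`. One way to keep compound conditions readable is to name each mask first. A sketch on a made-up three-row table (labels and values are illustrative):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame(
    {'sugars': [6, 12, 14], 'potass': [280, 20, 30]},
    index=['Cereal A', 'Cereal B', 'Cereal C'],
)

high_sugar = toy_df['sugars'] >= 11
low_potass = toy_df['potass'] <= 20

both = high_sugar & low_potass    # True only where both conditions hold
either = high_sugar | low_potass  # True where at least one condition holds

# The same AND condition written inline needs the parentheses.
inline = (toy_df['sugars'] >= 11) & (toy_df['potass'] <= 20)

print(both.tolist())    # [False, True, False]
print(either.tolist())  # [False, True, True]
```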

#### Other useful pandas built-in Boolean operations:

- .isin()
- .str.contains()
- The ~ operator.

Check whether a cereal is manufactured by Nabisco, General Mills, or Kellogg's:

In [None]:
mfr_list = ['N', 'G', 'K']
cereal_df['mfr'].isin(mfr_list)

name
100% Bran                     True
100% Natural Bran            False
All-Bran                      True
All-Bran with Extra Fiber     True
Almond Delight               False
                             ...  
Triples                       True
Trix                          True
Wheat Chex                   False
Wheaties                      True
Wheaties Honey Gold           True
Name: mfr, Length: 77, dtype: bool

In [None]:
# ~ negates logical statement
~cereal_df['mfr'].isin(mfr_list)

name
100% Bran                    False
100% Natural Bran             True
All-Bran                     False
All-Bran with Extra Fiber    False
Almond Delight                True
                             ...  
Triples                      False
Trix                         False
Wheat Chex                    True
Wheaties                     False
Wheaties Honey Gold          False
Name: mfr, Length: 77, dtype: bool

The .str accessor provides built-in *vectorized* string operations that execute very quickly across Series or pandas Index objects containing strings. They follow the pattern
*Series.str.function()*. The functions are typically ones that you have already encountered when wrangling strings using the base Python `str` methods.

An example of such a vectorized method that can be used for Boolean masking:

- Series.str.contains(...) checks whether each string entry in a Series contains a given substring.

We can apply this to a given column or to the index:

In [None]:
cereal_df.index

Index(['100% Bran', '100% Natural Bran', 'All-Bran',
       'All-Bran with Extra Fiber', 'Almond Delight',
       'Apple Cinnamon Cheerios', 'Apple Jacks', 'Basic 4', 'Bran Chex',
       'Bran Flakes', 'Cap'n'Crunch', 'Cheerios', 'Cinnamon Toast Crunch',
       'Clusters', 'Cocoa Puffs', 'Corn Chex', 'Corn Flakes', 'Corn Pops',
       'Count Chocula', 'Cracklin' Oat Bran', 'Cream of Wheat (Quick)',
       'Crispix', 'Crispy Wheat & Raisins', 'Double Chex', 'Froot Loops',
       'Frosted Flakes', 'Frosted Mini-Wheats',
       'Fruit & Fibre Dates; Walnuts; and Oats', 'Fruitful Bran',
       'Fruity Pebbles', 'Golden Crisp', 'Golden Grahams', 'Grape Nuts Flakes',
       'Grape-Nuts', 'Great Grains Pecan', 'Honey Graham Ohs',
       'Honey Nut Cheerios', 'Honey-comb', 'Just Right Crunchy  Nuggets',
       'Just Right Fruit & Nut', 'Kix', 'Life', 'Lucky Charms', 'Maypo',
       'Muesli Raisins; Dates; & Almonds', 'Muesli Raisins; Peaches; & Pecans',
       'Mueslix Crispy Blend', 'Multi-Gr

In [None]:
cereal_df.index.str.contains('Bran')

array([ True,  True,  True,  True, False, False, False, False,  True,
        True, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False,  True,  True, False, False, False,
       False,  True, False, False, False, False, False,  True, False,
       False, False, False, False, False])
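By default, `str.contains` is case-sensitive and treats its pattern as a regular expression. A minimal sketch of the `case` and `regex` parameters, using a short made-up list of names (the third entry is invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; 'bran bites (made up)' is not a real entry.
names = pd.Series(['100% Bran', 'Post Nat. Raisin Bran',
                   'bran bites (made up)', 'Trix'])

# Default: case-sensitive match.
default_match = names.str.contains('Bran')

# case=False ignores upper/lower case.
ignore_case = names.str.contains('bran', case=False)

# regex=False treats the pattern literally, so '.' means a period
# rather than "any character".
literal_dot = names.str.contains('.', regex=False)

print(default_match.tolist())  # [True, True, False, False]
print(ignore_case.tolist())    # [True, True, True, False]
print(literal_dot.tolist())    # [False, True, False, False]
```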

Now that we have gone through several ways of constructing a Boolean mask, let's see how to apply one.

#### Applying the Boolean mask

- Can select on our True/False series.
- Selector takes in Boolean series.
- Indices for which condition is True get selected.

In [None]:
# inputting Boolean mask into df. 
cereal_df[cereal_df['sugars'] >= 11].head()

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |
| Cap'n'Crunch | Q | C | 120 | 1 | 2 | 220 | 0.0 | 12.0 | 12 | 35 | 25 | 2 | 1.0 | 0.75 | 18.042851 |
| Cocoa Puffs | G | C | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 55 | 25 | 2 | 1.0 | 1.0 | 22.736446 |
| Corn Pops | K | C | 110 | 1 | 0 | 90 | 1.0 | 13.0 | 12 | 20 | 25 | 2 | 1.0 | 1.0 | 35.782791 |
| Count Chocula | G | C | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 65 | 25 | 2 | 1.0 | 1.0 | 22.396513 |


Yeah that makes sense.

Now, select cereals manufactured by Nabisco, General Mills, or Kellogg's.

In [None]:
cereal_df[cereal_df['mfr'].isin(mfr_list)].head()

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
| Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.0 | 33.174094 |


Now, select all cereals with the word 'Bran' in the name.

In [None]:
cereal_df[cereal_df.index.str.contains('Bran')]

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Bran Chex | R | C | 90 | 2 | 1 | 200 | 4.0 | 15.0 | 6 | 125 | 25 | 1 | 1.0 | 0.67 | 49.120253 |
| Bran Flakes | P | C | 90 | 3 | 0 | 210 | 5.0 | 13.0 | 5 | 190 | 25 | 3 | 1.0 | 0.67 | 53.313813 |
| Cracklin' Oat Bran | K | C | 110 | 3 | 3 | 140 | 4.0 | 10.0 | 7 | 160 | 25 | 3 | 1.0 | 0.5 | 40.448772 |
| Fruitful Bran | K | C | 120 | 3 | 0 | 240 | 5.0 | 14.0 | 12 | 190 | 25 | 3 | 1.33 | 0.67 | 41.015492 |
| Post Nat. Raisin Bran | P | C | 120 | 3 | 1 | 200 | 6.0 | 11.0 | 14 | 260 | 25 | 3 | 1.33 | 0.67 | 37.840594 |
| Raisin Bran | K | C | 120 | 3 | 1 | 210 | 5.0 | 14.0 | 12 | 240 | 25 | 2 | 1.33 | 0.75 | 39.259197 |


#### Combining masks with the .loc[] accessor

- Combine boolean selections + column selections
- Complex selections start to get easier
- DataFrame.loc[row_filtering, columns]

Let's find all cereals with 'Bran' in the name and high sugar content (here, strictly more than 11). We want the sodium, fiber, carbohydrate, and sugar info only. The row accessor takes the Boolean mask here and we slice on columns:

In [None]:
cereal_df.loc[(cereal_df['sugars'] > 11) & cereal_df.index.str.contains('Bran'), 'sodium':'sugars']

| name | sodium | fiber | carbo | sugars |
| --- | --- | --- | --- | --- |
| Fruitful Bran | 240 | 5.0 | 14.0 | 12 |
| Post Nat. Raisin Bran | 200 | 6.0 | 11.0 | 14 |
| Raisin Bran | 210 | 5.0 | 14.0 | 12 |
| Total Raisin Bran | 190 | 4.0 | 15.0 | 14 |
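Long selections like the one above can be easier to read when each mask gets its own name before being passed to `.loc[]`. A minimal sketch on a made-up three-row table (the values are illustrative, not taken from the cereal file):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame(
    {'sodium': [130, 240, 140],
     'fiber': [10.0, 5.0, 0.0],
     'sugars': [6, 12, 12]},
    index=['100% Bran', 'Fruitful Bran', 'Trix'],
)

high_sugar = toy_df['sugars'] > 11           # Boolean Series
is_bran = toy_df.index.str.contains('Bran')  # Boolean NumPy array

# Rows: both masks combined. Columns: a label slice.
selection = toy_df.loc[high_sugar & is_bran, 'sodium':'sugars']

print(selection.index.tolist())  # ['Fruitful Bran']
```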


## Pandas Series and DataFrames: changing attributes and values

There are a few ways in which you will want to modify a DataFrame you have imported or Series you have created:

- Cleaning and altering column/index names
- Creating and removing columns/rows
- Altering values 
- Changing datatypes

Let's walk through how to do all of these:


Let's take a look at the column names with the `.columns` DataFrame attribute:

In [None]:
cereal_df.columns 

Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')

Some of these names (e.g. "mfr") are abbreviated. Let's say we want to rename some of these columns to their full name:

#### Renaming columns 

- DataFrame.rename(columns = ___)
- columns takes in a dict that maps column names.

In [None]:
cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '})

Unnamed: 0,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.00,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.50,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
73,Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.00,27.753301
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.00,51.592193


In [None]:
cereal_df.head(2)

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679


Column names are still the same. What gives?

By default, the DataFrame.rename() method returns a new dataframe and leaves the original unchanged. To keep the new names, assign the result back:

In [None]:
cereal_df = cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '})

Passing `inplace = True` instead modifies the dataframe directly, which is equivalent to the reassignment above:

In [None]:
cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '}, inplace = True)

In [None]:
cereal_df.head(2)

Unnamed: 0,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
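
As a minimal sketch of this behaviour, the toy DataFrame below (hypothetical data, not the cereal file) shows that `.rename()` hands back a new object and leaves the original alone unless the result is reassigned.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
toy_df = pd.DataFrame({"mfr": ["N", "Q"], "carbo": [5.0, 8.0]})

# rename() returns a new DataFrame; toy_df itself is not modified.
renamed = toy_df.rename(columns={"mfr": "manufacturer"})

print(list(toy_df.columns))   # original names
print(list(renamed.columns))  # renamed copy
```

Keeping the result in a new variable, as here, is often easier to follow in a notebook than `inplace = True`, because re-running a cell does not silently change what earlier cells see.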


Let's take a look at the potassium column.

In [None]:
cereal_df['potassium']

KeyError: 'potassium'

What happened?

In [None]:
cereal_df.columns

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium  ', 'vitamins', 'shelf',
       'weight', 'cups', 'rating'],
      dtype='object')

We accidentally introduced trailing whitespace into the column name. Many imports from files have this problem.

What string method do we need to trim whitespace?

The way you have seen before, with a list comprehension:

In [None]:
[col.strip() for col in cereal_df.columns]

['name',
 'manufacturer',
 'type',
 'calories',
 'protein',
 'fat',
 'sodium',
 'fiber',
 'carbohydate',
 'sugars',
 'potassium',
 'vitamins',
 'shelf',
 'weight',
 'cups',
 'rating']

It's generally better to use pandas' native vectorized `.str` methods, which are faster and need less code:

In [None]:
cereal_df.columns = cereal_df.columns.str.strip()
print(cereal_df.columns)

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium', 'vitamins', 'shelf',
       'weight', 'cups', 'rating'],
      dtype='object')


Now look at the potassium column:

In [None]:
cereal_df['potassium'].head(3)

0    280
1    135
2    320
Name: potassium, dtype: int64
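
Vectorized `.str` methods can also be chained. The sketch below uses a small made-up DataFrame with messy headers (the column names are assumptions for illustration) and tidies them in one expression: trim whitespace, lower-case, then replace inner spaces with underscores.

```python
import pandas as pd

# Hypothetical frame with leading/trailing spaces and mixed case in headers.
messy_df = pd.DataFrame({" Name": ["A"], "Potassium  ": [280], "Total Fat": [1]})

# Each .str call returns a new Index, so the calls can be chained.
messy_df.columns = (
    messy_df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(list(messy_df.columns))
```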

#### Removing Columns

The `shelf` column records which shelf the cereal sits on in the cereal aisle of a particular grocery store.

- Let's say we don't care about this column. We drop it:

In [None]:
cereal_df.drop(columns = ['shelf'], inplace = True)
cereal_df.columns

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium', 'vitamins', 'weight',
       'cups', 'rating'],
      dtype='object')

#### Creating new columns

The 'type' column has only two unique entries 'C' and 'H' (cold or hot cereal?):

- Can use Boolean condition to create a series (Pandas magic).

Is this a hot cereal? Convert Boolean to integer.

- False = 0
- True = 1

In [None]:
is_hot = (cereal_df.type == 'H').astype('int')


print(is_hot)
print(is_hot.value_counts())

0     0
1     0
2     0
3     0
4     0
     ..
72    0
73    0
74    0
75    0
76    0
Name: type, Length: 77, dtype: int32
type
0    74
1     3
Name: count, dtype: int64


Store this as a new column by:
- DataFrame[new_column_name] = series

In [None]:
cereal_df['is_hot'] = is_hot
cereal_df.head()

Unnamed: 0,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0


#### Dealing with the index and rows:
- Clearly, the 'name' column should be our index.
- .set_index(col_name) will set that column to the row index.
- .set_index() can also take in a list or an index object.

In [None]:
cereal_df.set_index('name', inplace = True)
cereal_df.head()

Unnamed: 0_level_0,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0


In [49]:
cereal_df.index

Index(['100% Bran', '100% Natural Bran', 'All-Bran',
       'All-Bran with Extra Fiber', 'Almond Delight',
       'Apple Cinnamon Cheerios', 'Apple Jacks', 'Basic 4', 'Bran Chex',
       'Bran Flakes', 'Cap'n'Crunch', 'Cheerios', 'Cinnamon Toast Crunch',
       'Clusters', 'Cocoa Puffs', 'Corn Chex', 'Corn Flakes', 'Corn Pops',
       'Count Chocula', 'Cracklin' Oat Bran', 'Cream of Wheat (Quick)',
       'Crispix', 'Crispy Wheat & Raisins', 'Double Chex', 'Froot Loops',
       'Frosted Flakes', 'Frosted Mini-Wheats',
       'Fruit & Fibre Dates; Walnuts; and Oats', 'Fruitful Bran',
       'Fruity Pebbles', 'Golden Crisp', 'Golden Grahams', 'Grape Nuts Flakes',
       'Grape-Nuts', 'Great Grains Pecan', 'Honey Graham Ohs',
       'Honey Nut Cheerios', 'Honey-comb', 'Just Right Crunchy  Nuggets',
       'Just Right Fruit & Nut', 'Kix', 'Life', 'Lucky Charms', 'Maypo',
       'Muesli Raisins; Dates; & Almonds', 'Muesli Raisins; Peaches; & Pecans',
       'Mueslix Crispy Blend', 'Multi-Gr

- Sometimes we want to reset the index.
- This moves the index back into a regular column.
- The DataFrame then gets a default integer index.

In [53]:
cereal_df.reset_index(inplace = True)
cereal_df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [54]:
cereal_df.index

RangeIndex(start=0, stop=77, step=1)
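
The round trip between a labelled index and the default integer index can be sketched on a tiny frame. The data here is hypothetical, and non-inplace calls are used so that each intermediate object can be inspected.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
toy_df = pd.DataFrame({"name": ["A", "B"], "calories": [70, 120]})

indexed = toy_df.set_index("name")   # 'name' leaves the columns, becomes the index
restored = indexed.reset_index()     # 'name' returns as a column, index is 0..n-1

print(indexed.index)
print(restored.index)
```

If the old index is not worth keeping, `reset_index(drop = True)` discards it instead of turning it into a column.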

Dropping rows by index name:

In [51]:
cereal_df.set_index('name', inplace = True) # set the index back to name
allbran_dropped = cereal_df.drop('All-Bran') # can also take a list of index names or an index object
allbran_dropped.head(4)

Unnamed: 0_level_0,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [None]:
two_dropped = cereal_df.drop(['100% Bran', 'Almond Delight'])
two_dropped.head()

Unnamed: 0_level_0,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1.0,0.75,29.509541,0
Apple Jacks,K,C,110,2,0,125,1.0,11.0,14,30,25,1.0,1.0,33.174094,0


#### Altering dataframe/series values

- It's really important to use the .loc[] accessor when assigning data to dataframe/series selections.
- Here's why:

Select all cold cereals and look at their rating:

In [None]:
cereal_df[cereal_df["type"] == 'C']["rating"] 

name
100% Bran                    68.402973
100% Natural Bran            33.983679
All-Bran                     59.425505
All-Bran with Extra Fiber    93.704912
Almond Delight               34.384843
                               ...    
Triples                      39.106174
Trix                         27.753301
Wheat Chex                   49.787445
Wheaties                     51.592193
Wheaties Honey Gold          36.187559
Name: rating, Length: 74, dtype: float64

Now, add 5 to this selection.

In [None]:
cereal_df[cereal_df["type"] == 'C']["rating"] + 5

name
100% Bran                    73.402973
100% Natural Bran            38.983679
All-Bran                     64.425505
All-Bran with Extra Fiber    98.704912
Almond Delight               39.384843
                               ...    
Triples                      44.106174
Trix                         32.753301
Wheat Chex                   54.787445
Wheaties                     56.592193
Wheaties Honey Gold          41.187559
Name: rating, Length: 74, dtype: float64

Assign this modification to our original selection:

In [None]:
cereal_df[cereal_df["type"] == 'C']["rating"] = cereal_df[cereal_df["type"] == 'C']["rating"] + 5 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cereal_df[cereal_df["type"] == 'C']["rating"] = cereal_df[cereal_df["type"] == 'C']["rating"] + 5


Uh...oh. A warning was issued. Let's see what our assignment did:

In [None]:
cereal_df[cereal_df["type"] == 'C']["rating"]

name
100% Bran                    68.402973
100% Natural Bran            33.983679
All-Bran                     59.425505
All-Bran with Extra Fiber    93.704912
Almond Delight               34.384843
                               ...    
Triples                      39.106174
Trix                         27.753301
Wheat Chex                   49.787445
Wheaties                     51.592193
Wheaties Honey Gold          36.187559
Name: rating, Length: 74, dtype: float64

No change was made to original dataframe.

Chained indexing runs as two separate steps: `cereal_df[mask]` first returns a new DataFrame (a copy of the selected rows), and `["rating"] = ...` then assigns into that temporary copy, which is immediately discarded. The original dataframe is never touched. A single `.loc[]` call performs the row and column selection in one step on the original dataframe, so the assignment sticks:

In [None]:
cereal_df.loc[cereal_df["type"] == 'C', "rating"] += 5
cereal_df.loc[cereal_df["type"] == 'C', "rating"]

name
100% Bran                    73.402973
100% Natural Bran            38.983679
All-Bran                     64.425505
All-Bran with Extra Fiber    98.704912
Almond Delight               39.384843
                               ...    
Triples                      44.106174
Trix                         32.753301
Wheat Chex                   54.787445
Wheaties                     56.592193
Wheaties Honey Gold          41.187559
Name: rating, Length: 74, dtype: float64
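
The difference can be reproduced without triggering any warning by making the copy explicit. This is a minimal sketch on hypothetical data: editing the copied selection leaves the original alone, while a single `.loc[]` assignment changes it.

```python
import pandas as pd

# Hypothetical ratings table for illustration.
toy_df = pd.DataFrame(
    {"type": ["C", "H", "C"], "rating": [50.0, 60.0, 40.0]},
    index=["Cereal A", "Cereal B", "Cereal C"],
)
is_cold = toy_df["type"] == "C"

# Boolean selection gives a new DataFrame; editing it does not touch toy_df.
cold_copy = toy_df[is_cold].copy()
cold_copy["rating"] += 5
rating_after_copy_edit = toy_df["rating"].tolist()

# One .loc call selects rows and column together and writes into toy_df itself.
toy_df.loc[is_cold, "rating"] += 5
rating_after_loc_edit = toy_df["rating"].tolist()

print(rating_after_copy_edit)
print(rating_after_loc_edit)
```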

#### Datetime indices
- Pandas supports datetime types
- Series/DataFrame index: special operations/functionality for datetimes

Load the MTA turnstile maintenance dataset to see pandas datetimes in action!

In [None]:
turnstile_df = pd.read_csv('Data/turnstile_180901.txt')
turnstile_df.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188


In [None]:
turnstile_df['DATE']

0         08/25/2018
1         08/25/2018
2         08/25/2018
3         08/25/2018
4         08/25/2018
             ...    
197620    08/31/2018
197621    08/31/2018
197622    08/31/2018
197623    08/31/2018
197624    08/31/2018
Name: DATE, Length: 197625, dtype: object

In [None]:
turnstile_df['TIME']

0         00:00:00
1         04:00:00
2         08:00:00
3         12:00:00
4         16:00:00
            ...   
197620    05:00:00
197621    09:00:00
197622    13:00:00
197623    17:00:00
197624    21:00:00
Name: TIME, Length: 197625, dtype: object

Both in string format.

- Join date and time.
- Assign to new column.

In [None]:
turnstile_df['DATETIME'] = turnstile_df['DATE'] + ' ' + turnstile_df['TIME']
turnstile_df.drop(columns = ['DATE', 'TIME'], inplace = True)
turnstile_df['DATETIME'].head()

0    08/25/2018 00:00:00
1    08/25/2018 04:00:00
2    08/25/2018 08:00:00
3    08/25/2018 12:00:00
4    08/25/2018 16:00:00
Name: DATETIME, dtype: object

In [None]:
turnstile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197625 entries, 0 to 197624
Data columns (total 10 columns):
 #   Column                                                                Non-Null Count   Dtype 
---  ------                                                                --------------   ----- 
 0   C/A                                                                   197625 non-null  object
 1   UNIT                                                                  197625 non-null  object
 2   SCP                                                                   197625 non-null  object
 3   STATION                                                               197625 non-null  object
 4   LINENAME                                                              197625 non-null  object
 5   DIVISION                                                              197625 non-null  object
 6   DESC                                                                  197625 non-null  objec

- Convert string to datetime type.
- pd.to_datetime(): can intelligently parse various common datetime string formats.
- Our strings follow the %m/%d/%Y %H:%M:%S pattern (e.g. 08/25/2018 04:00:00), which is parsed correctly here.

In [None]:
turnstile_df['DATETIME'] = pd.to_datetime(turnstile_df['DATETIME'])

In [None]:
turnstile_df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736067,2283184,2018-08-25 00:00:00
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736087,2283188,2018-08-25 04:00:00
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736105,2283229,2018-08-25 08:00:00
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736180,2283314,2018-08-25 12:00:00
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736349,2283384,2018-08-25 16:00:00


In [None]:
turnstile_df['DATETIME']

0        2018-08-25 00:00:00
1        2018-08-25 04:00:00
2        2018-08-25 08:00:00
3        2018-08-25 12:00:00
4        2018-08-25 16:00:00
                 ...        
197620   2018-08-31 05:00:00
197621   2018-08-31 09:00:00
197622   2018-08-31 13:00:00
197623   2018-08-31 17:00:00
197624   2018-08-31 21:00:00
Name: DATETIME, Length: 197625, dtype: datetime64[ns]
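
Automatic parsing works here, but a date like 03/04/2018 could be read as either March 4 or April 3. A hedged sketch of the more explicit route is to pass the `format` argument to `pd.to_datetime()`. The two strings below are copied from the pattern seen in the turnstile data; the variable names are just for illustration.

```python
import pandas as pd

# Strings in the same month/day/year layout as the DATETIME column.
raw = pd.Series(["08/25/2018 00:00:00", "08/31/2018 21:00:00"])

# Spelling out the layout removes any month/day ambiguity.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")
print(parsed)
```

With an explicit format, a string that does not match the layout raises an error instead of being guessed at.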

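This is a datetime series. Datetime series have very useful vectorized methods and attributes, available through the `.dt` accessor:
- Round each timestamp to the nearest multiple of 7 days. The multiples are counted from the Unix epoch (Thursday, 1970-01-01), so the results fall on Thursdays rather than on the start of a calendar week.
- Get the named day of the week for each date.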

In [None]:
turnstile_df['DATETIME'].dt.round('7D')

0        2018-08-23
1        2018-08-23
2        2018-08-23
3        2018-08-23
4        2018-08-23
            ...    
197620   2018-08-30
197621   2018-08-30
197622   2018-08-30
197623   2018-08-30
197624   2018-08-30
Name: DATETIME, Length: 197625, dtype: datetime64[ns]

In [None]:
turnstile_df['DATETIME'].dt.day_name()

0         Saturday
1         Saturday
2         Saturday
3         Saturday
4         Saturday
            ...   
197620      Friday
197621      Friday
197622      Friday
197623      Friday
197624      Friday
Name: DATETIME, Length: 197625, dtype: object
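
Because `.dt.round('7D')` snaps to seven-day multiples counted from the Unix epoch, the rounded dates above are Thursdays. If the goal is the Monday that starts each calendar week, one possible approach (a sketch, not the only way) is to strip the time of day and subtract the number of days elapsed since Monday. The two timestamps are sample values in the style of the turnstile data.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2018-08-25 04:00:00", "2018-08-31 21:00:00"]))

# dayofweek is 0 for Monday through 6 for Sunday.
days_since_monday = pd.to_timedelta(stamps.dt.dayofweek, unit="D")

# normalize() sets the time to midnight; subtracting lands on that week's Monday.
week_start = stamps.dt.normalize() - days_since_monday
print(week_start)
```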

Set the `DATETIME` column as the index:

In [None]:
turnstile_df.set_index('DATETIME', inplace = True)

We now have a datetime index.

Let's sort the index so that the data is ordered sequentially. This can be accomplished with the following method:

In [None]:
turnstile_df.sort_index(inplace=True)

Once the datetimes are ordered, you can slice your data by datetime index to get data that falls within a certain time range:

In [None]:
turnstile_df['2018-08-25':'2018-08-27']

Unnamed: 0_level_0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS
DATETIME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2018-08-25 00:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736067,2283184
2018-08-25 00:00:00,R728,R226,00-00-01,GUN HILL RD,5,IRT,REGULAR,2530477,585174
2018-08-25 00:00:00,R606,R225,00-00-04,HOYT ST,23,IRT,REGULAR,1552288,3855944
2018-08-25 00:00:00,N535,R220,00-00-01,CARROLL ST,FG,IND,REGULAR,3815302,2414507
2018-08-25 00:00:00,N535,R220,00-00-02,CARROLL ST,FG,IND,REGULAR,16152865,3262980
...,...,...,...,...,...,...,...,...,...
2018-08-27 23:57:22,PTH07,R550,00-01-07,CITY / BUS,1,PTH,REGULAR,119052,223384
2018-08-27 23:58:30,PTH20,R549,03-00-06,NEWARK HM HE,1,PTH,REGULAR,25579,160841
2018-08-27 23:58:35,PTH17,R541,01-00-06,THIRTY THIRD ST,1,PTH,REGULAR,869309,591851
2018-08-27 23:59:26,PTH19,R549,02-01-07,NEWARK C,1,PTH,REGULAR,319369,250131
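
Note that the slice keeps rows up to 23:59 on the end date: a date-only string used as a slice endpoint covers that whole day. The sketch below shows this on a tiny hypothetical frame, using `.loc[]` to make it explicit that rows are being selected by label.

```python
import pandas as pd

# Hypothetical readings, deliberately out of order.
idx = pd.to_datetime(
    ["2018-08-27 23:59:26", "2018-08-25 00:00:00",
     "2018-08-28 00:00:00", "2018-08-26 12:00:00"]
)
toy_df = pd.DataFrame({"ENTRIES": [4, 1, 5, 2]}, index=idx).sort_index()

# Both endpoints are inclusive, and '2018-08-27' means all of that day.
window = toy_df.loc["2018-08-25":"2018-08-27"]
print(window)
```

The row stamped midnight on 2018-08-28 falls just outside the window, while the 23:59:26 row on 2018-08-27 is kept.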


Pandas datetime indexes provide some powerful capabilities for cleaning and transforming time series data.

## Pandas: Descriptive statistics and applying functions


We often want to:
- Calculate summary statistics for a series or a subset of a DataFrame.
- Apply a function to a pandas series or DataFrame.

Let's look at the iconic Titanic passenger dataset.

<center>Can data science tell us whether Jack dies? </center>
<div align="center">
<center><img src="Images/jack_titanic.jpg" width="500"/></center>
</div>

In [None]:
df = pd.read_csv('Data/titanic.csv', index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [None]:
df.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Calculating Column Statistics
- We often are interested in calculating a particular statistic for one or a few columns.
- DataFrames and Series objects have many built-in methods that can calculate these summary statistics.

First, we can use the .select_dtypes() method to select the columns of a DataFrame that have a particular data type. Here we will insist that the columns we extract are numeric only. This can be done by passing the string 'number' to the `include` keyword argument.


In [None]:
df_numeric = df.select_dtypes(include='number')
df_numeric.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Age       714 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 48.7 KB


The general pattern for calculating a statistic across many columns of a DataFrame or for a particular column (i.e. a Series) goes as follows:

- DataFrame.particular_statistic()
- Series.particular_statistic()

Let's do this for the mean. To calculate the mean of each of these columns:

In [None]:
df_numeric.mean()

Survived     0.383838
Pclass       2.308642
Age         29.699118
SibSp        0.523008
Parch        0.381594
Fare        32.204208
dtype: float64

### Mean

Calculating the mean for a particular column (i.e. a Series) uses the same syntax:

In [None]:
df_numeric['Fare'].mean()

32.204207968574636

The pattern is the same for statistics like the median, calculating percentiles, or the standard deviation:
- this should all remind you of numpy syntax.

### Median:

In [None]:
df_numeric.median()

Survived     0.0000
Pclass       3.0000
Age         28.0000
SibSp        0.0000
Parch        0.0000
Fare        14.4542
dtype: float64

In [None]:
df_numeric['Age'].median()

28.0

### Percentiles/Quantiles:
- The .quantile() method is similar to numpy's np.quantile function (np.percentile performs the same calculation on a 0-100 scale).
- It takes a decimal value between 0 and 1, where e.g. 0.1 corresponds to the tenth percentile.
- The following calculates the tenth percentile:

In [None]:
# applied to a DataFrame -- calculates across many columns
df_numeric.quantile(.1)

Survived     0.00
Pclass       1.00
Age         14.00
SibSp        0.00
Parch        0.00
Fare         7.55
Name: 0.1, dtype: float64

In [None]:
# applied to a Series

df_numeric['Age'].quantile(.1)

14.0
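
To make the link with numpy concrete, here is a small sketch on made-up ages. It assumes the default (linear) interpolation in both libraries, and it also shows that `.quantile()` accepts a list when several quantiles are wanted at once.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration.
ages = pd.Series([14.0, 22.0, 28.0, 35.0, 50.0])

q10 = ages.quantile(0.1)                   # pandas: fraction between 0 and 1
p10 = np.percentile(ages.to_numpy(), 10)   # numpy: percentage between 0 and 100

# A list of fractions returns a Series indexed by those fractions.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(q10, p10)
print(quartiles)
```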

### Standard Deviation:

In [None]:
df_numeric.std(ddof=1)

Survived     0.486592
Pclass       0.836071
Age         14.526497
SibSp        1.102743
Parch        0.806057
Fare        49.693429
dtype: float64

In [None]:
df_numeric['Age'].std(ddof=1)

14.526497332334044
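
The `ddof=1` argument selects the sample standard deviation (dividing by n - 1), which is also the pandas default. numpy's `np.std` defaults to `ddof=0` (dividing by n), so the two libraries can disagree on the same data. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                # pandas default: ddof=1
population_std = values.std(ddof=0)      # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())    # numpy default: ddof=0

print(sample_std, population_std, numpy_std)
```

The gap shrinks as the number of observations grows, but it is worth checking which convention is in use when comparing results across tools.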

#### Other important statistics:

- .mode() -- the mode of the column
- .count() -- the number of non-missing entries in a column
- .var() -- the variance for the column
- .sum() -- the sum of all values in the column
- .cumsum() -- the cumulative sum, where each entry holds the sum of all values up to and including that position.
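
Two of these are easy to misread when a column has gaps, as `Age` does. The sketch below uses a short hypothetical Series with one missing value to show what `.count()` and `.cumsum()` return in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, np.nan, 26.0, 35.0])

n_rows = len(ages)              # counts every row, missing or not
n_present = ages.count()        # counts only the non-missing values
running_total = ages.cumsum()   # NaN stays NaN; the running sum carries on after it

print(n_rows, n_present)
print(running_total)
```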

#### Useful summary methods for categorical columns
- Unique entries in categorical column
- Counts for each unique entry

'Embarked' is a categorical column in our Titanic dataset:

In [None]:
df['Embarked'].head()

PassengerId
1    S
2    C
3    S
4    S
5    S
Name: Embarked, dtype: object

We often want to know the unique values that a categorical column can take on:
- Series.unique()

In [None]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

It'd be good to know the breakdown of the counts in each class:
- Series.value_counts()

In [None]:
df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

Note that `nan` is not a class: it stands for 'Not a Number' and marks an empty or missing value in pandas. By default, value_counts() drops missing values (`dropna=True`), which is why `nan` does not appear in the counts.
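
To see the missing entries alongside the class counts, `value_counts()` can be called with `dropna=False`. A small sketch on hypothetical port codes:

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes with one missing value.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

default_counts = ports.value_counts()             # missing value is left out
with_missing = ports.value_counts(dropna=False)   # missing value gets its own row
n_missing = ports.isna().sum()                    # direct count of missing values

print(default_counts)
print(with_missing)
```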

#### Transforming Series and DataFrames

Sometimes we want to do more than simple pre-defined aggregations. 
Want to implement our own functions on pandas data structures.

- Series.map()
- Series.apply()
- DataFrame.apply()


#### Series.map()
Used for substituting each value in a Series with another value, that may be derived from:
- a dict
- a mapping function

Returns a new Series with the same index. The original Series is left unchanged.

Let's create a new Series to see how this works:

In [None]:
orig_series = pd.Series([2,5,8], index = ['A', 'B', 'C'])
orig_series

A    2
B    5
C    8
dtype: int64

Creating the mapping dictionary and then applying it to our Series:

In [None]:
dict_mapper = {2: 2.5, 5: 7.2, 8: 3.9}
orig_series.map(dict_mapper) 

A    2.5
B    7.2
C    3.9
dtype: float64

In [None]:
orig_series

A    2
B    5
C    8
dtype: int64
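
One thing to watch with dict mappers: any value that has no matching key comes back as `NaN` rather than being passed through. A minimal sketch, using a deliberately incomplete (hypothetical) mapper:

```python
import pandas as pd

orig = pd.Series([2, 5, 8], index=["A", "B", "C"])

# The key 8 is deliberately left out of this mapper.
partial_mapper = {2: 2.5, 5: 7.2}
mapped = orig.map(partial_mapper)

print(mapped)   # the entry for 'C' has no match in the dict
```

If unmatched values should keep their original value, one option is `mapped.fillna(orig)`.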

We can also define a function that applies a numeric transformation and use the .map() method to transform each value in the Series accordingly.

In [None]:
def func(x):
    return 2*x
orig_series.map(func)

A     4
B    10
C    16
dtype: int64

In general, you would reserve the .map() method for more complex mappings. A simple arithmetic operation like this is better done with pandas' built-in vectorization:
- i.e. just multiply the Series directly by a constant.

In [None]:
2*orig_series

A     4
B    10
C    16
dtype: int64

A more interesting use case is creating mappings that bin continuous data using conditional logic. Here we define a function that checks which range a paid Fare falls in and returns a string naming the bin that the Fare belongs to:

In [None]:
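def bin_fare(fare):
    # Bin a single fare value; each bin includes its upper edge.
    if fare <= 10:
        return 'Low Fare'
    elif fare <= 40:
        return 'Medium Fare'
    elif fare > 40:
        return 'High Fare'
    # Falls through (returns None) only for missing values such as NaN.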

Applying this function to our Fare data in the titanic dataset yields the binning:

In [None]:
df_numeric['Fare'].map(bin_fare)

PassengerId
1         Low Fare
2        High Fare
3         Low Fare
4        High Fare
5         Low Fare
          ...     
887    Medium Fare
888    Medium Fare
889    Medium Fare
890    Medium Fare
891       Low Fare
Name: Fare, Length: 891, dtype: object
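
For plain range binning like this, pandas also offers `pd.cut()`, which avoids writing the conditional logic by hand. The sketch below is an alternative to the `.map()` approach, not what the cell above does. The fares are a few sample values, and the bin edges mirror the 10 and 40 cut-offs used above.

```python
import numpy as np
import pandas as pd

# A few sample fares, including values exactly on the bin edges.
fares = pd.Series([7.25, 71.2833, 10.0, 40.0, 13.0])

# Bins are right-inclusive by default: (-inf, 10], (10, 40], (40, inf).
fare_bins = pd.cut(
    fares,
    bins=[-np.inf, 10, 40, np.inf],
    labels=["Low Fare", "Medium Fare", "High Fare"],
)
print(fare_bins)
```

The result is a categorical Series, which keeps the bins in order (low, medium, high) for later sorting or grouping.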

In [None]:
df_numeric

Unnamed: 0_level_0,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0,3,22.0,1,0,7.2500
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.9250
4,1,1,35.0,1,0,53.1000
5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...
887,0,2,27.0,0,0,13.0000
888,1,1,19.0,0,0,30.0000
889,0,3,,1,2,23.4500
890,1,1,26.0,0,0,30.0000


#### Series.apply():

Applies a function elementwise to a Series:

- Only applies functions.
- Can do what map does with functions, but can also do more.
- The function can return a Series for each element, in which case the combined result is a DataFrame.


In [None]:
orig_series

A    2
B    5
C    8
dtype: int64

In [None]:
# did this with map. map is faster for this.
orig_series.apply(func)

A     4
B    10
C    16
dtype: int64

Series.apply() really shines when you want to do something a little more complicated:

- For each element in original Series, compute powers of element up to some order. Return these powers as a series.

In [None]:
# this returns a series [x, x**2, x**3,...up to highest order] for each value
def new_func(x, highest_order = 5):
    
    polynomial_series = pd.Series({'Order ' + str(n+1): x**(n+1) for n in range(highest_order)})
    
    return polynomial_series

Using the .apply() method with new_func:

Pandas will run through each element of the original series:
- for each element, return a Series of the powers from order 1 to order 5
- combine the series of powers for each element into a DataFrame.


In [None]:
orig_series

A    2
B    5
C    8
dtype: int64

In [None]:
# start with a Series. End up with a DataFrame.
orig_series.apply(new_func)

Unnamed: 0,Order 1,Order 2,Order 3,Order 4,Order 5
A,2,4,8,16,32
B,5,25,125,625,3125
C,8,64,512,4096,32768


#### DataFrame.apply()

- Takes in entire columns (or rows) of dataframe into function and applies transformations:
    - Transformations can be applied column-wise or row-wise.
    - The function can return a number, a Series, or a Dataframe.
    
As such the DataFrame.apply() method is pretty useful, flexible, and is capable of complex transformations. 

Let's see an example. Suppose, for some reason, we want the sum of the squares of all entries in the Age and Fare columns respectively. Then we want to log transform the result for each column.

The function below breaks this down into steps. It is designed to accept a Series (representing a given column). We square each entry in the column and then take the sum of these squares. Finally, we compute the log of one plus this sum of squares. Thus, if the function takes in the data in the `Age` column (a Series), it returns a single value: the log of the sum of squares.

In [None]:
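import numpy as np # needed for np.sum and np.log

def log_sq_trans(x):
    x_sq = x**2                            # square each entry
    summed_sq = np.sum(x_sq)               # sum of squares (missing values are skipped for a Series)
    logsquaredsum = np.log(1 + summed_sq)  # log of (1 + sum of squares)
    
    return logsquaredsum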

Now we subset the DataFrame on Age and Fare and apply this function to the DataFrame. The .apply() method then takes in each column (`Age` and `Fare`) and computes the log of the sum of squares for each. The `axis = 0` argument (the default) ensures that the Series passed to the function are *columns*.

In [None]:
# axis = 0: applies function to each column 
df[['Age', 'Fare']].apply(log_sq_trans, axis = 0)

Age     13.567347
Fare    14.953941
dtype: float64

If for some reason you wanted the function to accept rows as its argument (i.e. add the squares of Age and Fare for each row and log transform the result), you could just change to `axis = 1`.

In [None]:
df[['Age', 'Fare']].apply(log_sq_trans, axis = 1)

PassengerId
1      6.287045
2      8.783597
3      6.606387
4      8.305388
5      7.163019
         ...   
887    6.801283
888    7.140453
889    6.311558
890    7.363280
891    6.989393
Length: 891, dtype: float64
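
For a transformation this simple, the same numbers can also be obtained without `.apply()` by using vectorized operations on the whole DataFrame. The sketch below checks this on a small hypothetical frame (three rows, one missing Age); `np.log1p(v)` computes `log(1 + v)`.

```python
import numpy as np
import pandas as pd

# Hypothetical subset with one missing Age.
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 23.45, 7.925]})

def log_sq_trans(x):
    # log(1 + sum of squares); the sum skips missing values for a Series
    return np.log(1 + np.sum(x**2))

by_column = toy_df.apply(log_sq_trans, axis=0)
by_row = toy_df.apply(log_sq_trans, axis=1)

# Vectorized equivalents: square everything, sum along an axis, then log1p.
vectorized_column = np.log1p((toy_df**2).sum(axis=0))
vectorized_row = np.log1p((toy_df**2).sum(axis=1))

print(by_row)
print(vectorized_row)
```

On large DataFrames the vectorized form is usually much faster, particularly compared with `axis = 1`, where `.apply()` calls the Python function once per row.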