<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0"> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Transforming Excel Analysis into pandas Data Models</h1>
<h1>Pandas DataFrames</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

import watermark

%load_ext watermark
%matplotlib inline

We start by printing the versions of the libraries we're using, for future reference

In [2]:
%watermark -n -v -m -g -iv

watermark  2.0.2
json       2.0.9
autopep8   1.5
numpy      1.18.1
pandas     1.0.1
matplotlib 3.1.3
Thu Sep 03 2020 

CPython 3.7.3
IPython 6.2.1

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.6.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : 645675cceecd8c43030193fc309fa409552f669e


Set the default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')

## DataFrames and Series

Series and DataFrames can be thought of as dictionaries associating keys to lists of values

In [5]:
data = {"id": [23, 42, 12, 86], "Name": ["Bob", "Karen", "Kate", "Bill"]}

In [6]:
data

{'Name': ['Bob', 'Karen', 'Kate', 'Bill'], 'id': [23, 42, 12, 86]}

A Series corresponds to just a single list of values

In [7]:
series = pd.Series(data["id"])

In [8]:
series

0    23
1    42
2    12
3    86
dtype: int64

While a DataFrame can hold multiple columns, each with its own list of values

In [9]:
df = pd.DataFrame(data)

In [10]:
df

,id,Name
0,23,Bob
1,42,Karen
2,12,Kate
3,86,Bill


Another way of looking at it is that DataFrames are essentially groups of individual Series. Each Series can have its own data type (**dtype**)

In [11]:
df.dtypes

id       int64
Name    object
dtype: object

We can get general information about how the DataFrame is being stored by calling __info()__

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      4 non-null      int64 
 1   Name    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes


Subsetting a DataFrame by column name retrieves the underlying Series

In [13]:
type(df['id'])

pandas.core.series.Series
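The kind of object we get back depends on how we ask for the column. A minimal sketch, rebuilding the toy DataFrame from above so that it runs on its own (the names `ids` and `ids_frame` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [23, 42, 12, 86], "Name": ["Bob", "Karen", "Kate", "Bill"]})

# A single label returns the column as a Series
ids = df["id"]

# A list of labels (even with one element) returns a DataFrame
ids_frame = df[["id"]]

print(type(ids).__name__, ids.shape)
print(type(ids_frame).__name__, ids_frame.shape)
```

The double-bracket form is handy when a later step expects a two-dimensional object.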

Both columns and index values have types and possibly names

In [14]:
df.columns

Index(['id', 'Name'], dtype='object')

In [15]:
df.index

RangeIndex(start=0, stop=4, step=1)

And we can query the shape and number of dimensions of the DataFrame easily

In [16]:
df.shape

(4, 2)

In [17]:
df.ndim

2

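We can also relabel both the index and the columns. Individual rows can then be retrieved by label with __loc__ or by position with __iloc__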

In [18]:
df.index = ["row" + str(i) for i in range(4)]
df.columns = ['ID', 'First Name']

In [20]:
df.loc['row1']

ID               42
First Name    Karen
Name: row1, dtype: object

In [21]:
df.iloc[1]

ID               42
First Name    Karen
Name: row1, dtype: object
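The two lookups above return the same row because label "row1" happens to sit at position 1. A small sketch of how the two diverge once the rows are reordered, using the same relabeled toy data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": [23, 42, 12, 86], "First Name": ["Bob", "Karen", "Kate", "Bill"]},
    index=["row" + str(i) for i in range(4)],
)

by_label = df.loc["row1"]     # label-based lookup
by_position = df.iloc[1]      # position-based lookup (0-indexed)
print(by_label.equals(by_position))

# After sorting, labels travel with their rows but positions do not
sorted_df = df.sort_values("ID")
print(sorted_df.loc["row1", "First Name"])   # still the row labeled row1
print(sorted_df.iloc[1]["First Name"])       # whichever row is now second
```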

## Importing and exporting data

### Read csv files

Files can be compressed: pandas infers gzip compression from the .gz extension

In [22]:
green = pd.read_csv(
    'data/green_tripdata_2014-04.csv.gz',
    parse_dates=['lpep_pickup_datetime', 'Lpep_dropoff_datetime'],
    nrows=1000,
    index_col='VendorID',
    dtype={'RateCodeID': 'str', 'Trip_type': 'str'},
)

We read only 1000 rows as expected

In [23]:
green.shape

(1000, 21)

And the requested dtypes were applied to each column

In [24]:
green.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 2 to 2
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   lpep_pickup_datetime   1000 non-null   datetime64[ns]
 1   Lpep_dropoff_datetime  1000 non-null   datetime64[ns]
 2   Store_and_fwd_flag     1000 non-null   object        
 3   RateCodeID             1000 non-null   object        
 4   Pickup_longitude       1000 non-null   float64       
 5   Pickup_latitude        1000 non-null   float64       
 6   Dropoff_longitude      1000 non-null   float64       
 7   Dropoff_latitude       1000 non-null   float64       
 8   Passenger_count        1000 non-null   int64         
 9   Trip_distance          1000 non-null   float64       
 10  Fare_amount            1000 non-null   float64       
 11  Extra                  1000 non-null   float64       
 12  MTA_tax                1000 non-null   float64       
 13  Tip_am
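To see what each read_csv argument does without needing the taxi file, here is a minimal sketch on a tiny gzipped CSV built in memory. The column names and values are made up for illustration; only the arguments mirror the call above. For an in-memory buffer there is no file extension to inspect, so we pass `compression="gzip"` explicitly.

```python
import gzip
import io

import pandas as pd

# Hypothetical miniature of a trip file, compressed in memory
raw = (
    "VendorID,pickup,RateCodeID,Fare_amount\n"
    "2,2014-04-01 00:00:00,1,6.5\n"
    "2,2014-04-01 00:05:00,1,9.0\n"
    "1,2014-04-01 00:07:00,5,20.0\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

sample = pd.read_csv(
    buffer,
    compression="gzip",            # needed here because a buffer has no extension
    parse_dates=["pickup"],        # convert this column to datetimes
    dtype={"RateCodeID": "str"},   # keep codes as text rather than numbers
    index_col="VendorID",
    nrows=2,                       # stop after two data rows
)

print(sample.shape)
print(sample.dtypes)
```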

### Read excel spreadsheets

By default, __read_excel__ reads only the first worksheet

In [27]:
movies = pd.read_excel('data/movies.xlsx', index_col='Title')

In [28]:
movies.head()

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,D.W. Griffith,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,Harry F. Millarde,...,2,2,0.0,4,0,1,5,1,1.0,4.8
The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,King Vidor,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,Fritz Lang,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,Georg Wilhelm Pabst,...,426,20,3.0,455,926,1,7431,84,71.0,8.0


But this preserves no information about the sheet name. An alternative is to pass __sheet_name=None__

In [29]:
dfs = pd.read_excel('data/movies.xlsx', sheet_name=None)

This returns a dictionary with all the available sheets, keyed by name

In [30]:
len(dfs)

4

In [31]:
dfs.keys()

dict_keys(['1900s', '2000s', '2010s', '3000s'])

In [32]:
movies_1900 = dfs['1900s']

In [33]:
movies_1900.head()

,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
1,Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,...,2,2,0.0,4,0,1,5,1,1.0,4.8
2,The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
3,Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
4,Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,...,426,20,3.0,455,926,1,7431,84,71.0,8.0
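Because the sheets arrive as a dictionary of DataFrames, the sheet name can be kept as data when combining them. A minimal sketch, using a hand-built dictionary as a stand-in for what read_excel returns (the titles and years are taken from the outputs in this notebook; `sheets` and `combined` are illustrative names):

```python
import pandas as pd

# Stand-in for the dict returned by read_excel(..., sheet_name=None)
sheets = {
    "1900s": pd.DataFrame({"Title": ["Metropolis", "The Big Parade"], "Year": [1927, 1925]}),
    "2000s": pd.DataFrame({"Title": ["28 Days", "Aberdeen"], "Year": [2000, 2000]}),
}

# Passing the dict to concat uses its keys as the outer index level
combined = pd.concat(sheets, names=["sheet", "row"])

# Turn the sheet label into a regular column and renumber the rows
combined = combined.reset_index(level="sheet").reset_index(drop=True)
print(combined)
```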


Or we can select a specific worksheet directly by name

In [35]:
movies_2000 = pd.read_excel('data/movies.xlsx', sheet_name='2000s')

In [36]:
movies_2000.head()

,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,102 Dalmatians,2000,Adventure|Comedy|Family,English,USA,G,100.0,1.85,85000000.0,66941559.0,...,2000.0,795.0,439.0,4182,372,1,26413,77.0,84.0,4.8
1,28 Days,2000,Comedy|Drama,English,USA,PG-13,103.0,1.37,43000000.0,37035515.0,...,12000.0,10000.0,664.0,23864,0,1,34597,194.0,116.0,6.0
2,3 Strikes,2000,Comedy,English,USA,R,82.0,1.85,6000000.0,9821335.0,...,939.0,706.0,585.0,3354,118,1,1415,10.0,22.0,4.0
3,Aberdeen,2000,Drama,English,UK,,106.0,1.85,6500000.0,64148.0,...,844.0,2.0,0.0,846,260,0,2601,35.0,28.0,7.3
4,All the Pretty Horses,2000,Drama|Romance|Western,English,USA,PG-13,220.0,2.35,57000000.0,15527125.0,...,13000.0,861.0,820.0,15006,652,2,11388,183.0,85.0,5.8


If the cells contain formulas, pandas simply returns the current values (the current output of each formula)

In [37]:
mortgage = pd.read_excel('data/excel-mortgage-calculator.xlsx', skiprows=15)

In [38]:
mortgage

,Unnamed: 0,PMT NO,PAYMENT DATE,BEGINNING BALANCE,SCHEDULED PAYMENT,EXTRA PAYMENT,TOTAL PAYMENT,PRINCIPAL,INTEREST,ENDING BALANCE,CUMULATIVE INTEREST
0,,1,2020-09-01,200000.000000,1073.643246,100,1173.643246,340.309913,833.333333,199659.690087,833.333333
1,,2,2020-10-01,199659.690087,1073.643246,100,1173.643246,341.727871,831.915375,199317.962217,1665.248709
2,,3,2020-11-01,199317.962217,1073.643246,100,1173.643246,343.151737,830.491509,198974.810480,2495.740218
3,,4,2020-12-01,198974.810480,1073.643246,100,1173.643246,344.581536,829.061710,198630.228944,3324.801928
4,,5,2021-01-01,198630.228944,1073.643246,100,1173.643246,346.017292,827.625954,198284.211652,4152.427882
...,...,...,...,...,...,...,...,...,...,...,...
355,,356,2050-04-01,0.000000,1073.643246,0,0.000000,0.000000,0.000000,0.000000,149442.538588
356,,357,2050-05-01,0.000000,1073.643246,0,0.000000,0.000000,0.000000,0.000000,149442.538588
357,,358,2050-06-01,0.000000,1073.643246,0,0.000000,0.000000,0.000000,0.000000,149442.538588
358,,359,2050-07-01,0.000000,1073.643246,0,0.000000,0.000000,0.000000,0.000000,149442.538588


Wherever possible, dtypes are chosen according to the Excel cell format

In [39]:
mortgage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Unnamed: 0           0 non-null      float64       
 1   PMT NO               360 non-null    int64         
 2   PAYMENT DATE         360 non-null    datetime64[ns]
 3   BEGINNING BALANCE    360 non-null    float64       
 4   SCHEDULED PAYMENT    360 non-null    float64       
 5   EXTRA PAYMENT        360 non-null    int64         
 6   TOTAL PAYMENT        360 non-null    float64       
 7   PRINCIPAL            360 non-null    float64       
 8   INTEREST             360 non-null    float64       
 9   ENDING BALANCE       360 non-null    float64       
 10  CUMULATIVE INTEREST  360 non-null    float64       
dtypes: datetime64[ns](1), float64(8), int64(2)
memory usage: 31.1 KB


### ExcelFile

In [53]:
book = pd.ExcelFile('data/movies.xlsx')

We can easily get a list of all worksheets

In [54]:
book.sheet_names

['1900s', '2000s', '2010s', '3000s']

We can easily parse a specific sheet and convert it to a DataFrame

In [42]:
df3 = book.parse('2000s')

In [43]:
df3.head()

,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,102 Dalmatians,2000,Adventure|Comedy|Family,English,USA,G,100.0,1.85,85000000.0,66941559.0,...,2000.0,795.0,439.0,4182,372,1,26413,77.0,84.0,4.8
1,28 Days,2000,Comedy|Drama,English,USA,PG-13,103.0,1.37,43000000.0,37035515.0,...,12000.0,10000.0,664.0,23864,0,1,34597,194.0,116.0,6.0
2,3 Strikes,2000,Comedy,English,USA,R,82.0,1.85,6000000.0,9821335.0,...,939.0,706.0,585.0,3354,118,1,1415,10.0,22.0,4.0
3,Aberdeen,2000,Drama,English,UK,,106.0,1.85,6500000.0,64148.0,...,844.0,2.0,0.0,846,260,0,2601,35.0,28.0,7.3
4,All the Pretty Horses,2000,Drama|Romance|Western,English,USA,PG-13,220.0,2.35,57000000.0,15527125.0,...,13000.0,861.0,820.0,15006,652,2,11388,183.0,85.0,5.8


__parse__ supports most of the parameters available for read_excel

In [44]:
df4 = book.parse('2000s', index_col=0, usecols=['Title', 'Year', 'Director', 'Budget'])

In [45]:
df4.head()

Title,Year,Budget,Director
102 Dalmatians,2000,85000000.0,Kevin Lima
28 Days,2000,43000000.0,Betty Thomas
3 Strikes,2000,6000000.0,DJ Pooh
Aberdeen,2000,6500000.0,Hans Petter Moland
All the Pretty Horses,2000,57000000.0,Billy Bob Thornton


### Web pages

We're going to use the Wikipedia page with the current number of COVID-19 cases

In [46]:
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory'

All we have to do is to provide the url

In [47]:
dfs = pd.read_html(url)

Which retrieves all the tables in the page, in the order they appear

In [48]:
len(dfs)

32

So the first one is the infobox in the top right-hand corner of the page

In [49]:
dfs[0]

,COVID-19 pandemic,COVID-19 pandemic.1
0,"Confirmed cases per 100,000 population as of 2...","Confirmed cases per 100,000 population as of 2..."
1,Disease,Coronavirus disease 2019 (COVID-19)
2,Virus strain,Severe acute respiratory syndromecoronavirus 2...
3,Source,"Probably bats, possibly via pangolins[1][2]"
4,Location,Worldwide
5,First outbreak,Mainland China[3]
6,Index case,"Wuhan, Hubei, China30°37′11″N 114°15′28″E﻿ / ﻿..."
7,Date,1 December 2019[3]–present(9 months and 2 days)
8,Confirmed cases,"26,062,946[4]"
9,Active cases,"7,882,999[4]"


And the second one is the number of cases and deaths per country

In [50]:
dfs[1]

Unnamed: 0_level_0,Location[a],Location[a],Cases[b],Deaths[c],Recov.[d],Ref.
Unnamed: 0_level_1,Unnamed: 0_level_1,World[e],"26,062,946","863,741","17,316,206",[4]
0,,United States[f],6206224,188590,3280930,[13]
1,,Brazil[g],4001422,123899,3210405,[16][17]
2,,India,3853406,67376,2970492,[18]
3,,Russia[h],1009995,17528,826935,[19]
4,,Peru,663437,29259,480177,[20][21]
...,...,...,...,...,...,...
225,,Anguilla,3,0,3,[317]
226,,Sint Eustatius,2,0,2,[318]
227,,Tanzania[bd],No data,No data,No data,[320][321]
228,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...


Due to the formatting of the table, pandas interpreted the first two rows to be the headers. We can fix this by explicitly telling it to just use the first row for the column headers

In [51]:
dfs = pd.read_html(url, header=0)

In [52]:
dfs[1]

,Location[a],Location[a].1,Cases[b],Deaths[c],Recov.[d],Ref.
0,,World[e],26062946,863741,17316206,[4]
1,,United States[f],6206224,188590,3280930,[13]
2,,Brazil[g],4001422,123899,3210405,[16][17]
3,,India,3853406,67376,2970492,[18]
4,,Russia[h],1009995,17528,826935,[19]
...,...,...,...,...,...,...
226,,Anguilla,3,0,3,[317]
227,,Sint Eustatius,2,0,2,[318]
228,,Tanzania[bd],No data,No data,No data,[320][321]
229,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...,As of 3 September 2020 (UTC) · History of case...
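Scraped tables usually need some cleaning before analysis: here the headers and country names carry footnote markers, and some counts read "No data". A minimal sketch of one possible cleanup, run on a hand-built stand-in for a few rows of the table (values copied from the output above) so that it does not depend on the live page:

```python
import pandas as pd

# Hand-built stand-in for a few rows of the scraped table
cases = pd.DataFrame({
    "Location[a].1": ["United States[f]", "India", "Tanzania[bd]"],
    "Cases[b]": ["6206224", "3853406", "No data"],
    "Deaths[c]": ["188590", "67376", "No data"],
})

# Strip bracketed footnote markers such as "[a]" from headers and locations
cases.columns = cases.columns.str.replace(r"\[.*?\]", "", regex=True)
cases["Location.1"] = cases["Location.1"].str.replace(r"\[.*?\]", "", regex=True)

# Convert counts to numbers; entries like "No data" become NaN
for col in ["Cases", "Deaths"]:
    cases[col] = pd.to_numeric(cases[col], errors="coerce")

print(cases)
```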


## Subsetting

The first or last N rows are easy to access

In [55]:
movies.head(10)

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,D.W. Griffith,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,Harry F. Millarde,...,2,2,0.0,4,0,1,5,1,1.0,4.8
The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,King Vidor,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,Fritz Lang,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,Georg Wilhelm Pabst,...,426,20,3.0,455,926,1,7431,84,71.0,8.0
The Broadway Melody,1929,Musical|Romance,English,USA,Passed,100,1.37,379000.0,2808000.0,Harry Beaumont,...,77,28,4.0,109,167,8,4546,71,36.0,6.3
Hell's Angels,1930,Drama|War,English,USA,Passed,96,1.2,3950000.0,,Howard Hughes,...,431,12,4.0,457,279,1,3753,53,35.0,7.8
A Farewell to Arms,1932,Drama|Romance|War,English,USA,Unrated,79,1.37,800000.0,,Frank Borzage,...,998,164,99.0,1284,213,1,3519,46,42.0,6.6
42nd Street,1933,Comedy|Musical|Romance,English,USA,Unrated,89,1.37,439000.0,2300000.0,Lloyd Bacon,...,610,105,45.0,995,439,2,7921,97,65.0,7.7
She Done Him Wrong,1933,Comedy|Drama|History|Musical|Romance,English,USA,Approved,66,1.37,200000.0,,Lowell Sherman,...,418,85,28.0,583,328,1,4152,59,35.0,6.5


In [56]:
movies.tail(2)

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
Wild Wild West,1999,Action|Comedy|Sci-Fi|Western,English,USA,PG-13,106,1.85,170000000.0,113745408.0,Barry Sonnenfeld,...,10000,4000,582.0,15870,0,2,129601,648,85.0,4.8
Wing Commander,1999,Action|Adventure|Sci-Fi,English,USA,PG-13,100,2.35,30000000.0,11576087.0,Chris Roberts,...,811,586,362.0,2497,858,3,14747,338,85.0,4.1


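Individual rows can be indexed by label, here the movie title. Note that the titles in this file end with a non-breaking space (\xa0), which has to be included in the label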

In [57]:
movies.loc["Wild Wild West\xa0"]

Year                                                   1999
Genres                         Action|Comedy|Sci-Fi|Western
Language                                            English
Country                                                 USA
Content Rating                                        PG-13
Duration                                                106
Aspect Ratio                                           1.85
Budget                                              1.7e+08
Gross Earnings                                  1.13745e+08
Director                                   Barry Sonnenfeld
Actor 1                                          Will Smith
Actor 2                                         Salma Hayek
Actor 3                                            Bai Ling
Facebook Likes - Director                               188
Facebook Likes - Actor 1                              10000
Facebook Likes - Actor 2                               4000
Facebook Likes - Actor 3                

Or by position

In [58]:
movies.iloc[1336]

Year                                                   1999
Genres                         Action|Comedy|Sci-Fi|Western
Language                                            English
Country                                                 USA
Content Rating                                        PG-13
Duration                                                106
Aspect Ratio                                           1.85
Budget                                              1.7e+08
Gross Earnings                                  1.13745e+08
Director                                   Barry Sonnenfeld
Actor 1                                          Will Smith
Actor 2                                         Salma Hayek
Actor 3                                            Bai Ling
Facebook Likes - Director                               188
Facebook Likes - Actor 1                              10000
Facebook Likes - Actor 2                               4000
Facebook Likes - Actor 3                

Rows behave like named tuples, so you can access individual elements by position:

In [59]:
movies.iloc[1336, 10]

'Will Smith'

Or by name

In [60]:
movies.iloc[1336].Budget

170000000.0

In [61]:
movies.loc['Wild Wild West\xa0', 'Director']

'Barry Sonnenfeld'

Ranges can also be used with iloc

In [64]:
movies.iloc[1:4]

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,Harry F. Millarde,...,2,2,0.0,4,0,1,5,1,1.0,4.8
The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,King Vidor,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,Fritz Lang,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3


The same works with __loc__, with the important difference that loc __includes__ the last value of the range, while iloc does not

In [63]:
movies.loc["Over the Hill to the Poorhouse\xa0":"Metropolis\xa0"]

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,Harry F. Millarde,...,2,2,0.0,4,0,1,5,1,1.0,4.8
The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,King Vidor,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,Fritz Lang,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
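The endpoint rule is easy to check on a toy frame. In this sketch (single-letter labels chosen only for illustration) the positional slice 1:4 and the label slice "b":"d" select the same three rows, because the label slice includes its end while the positional one stops before it:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Year": [1916, 1920, 1925, 1927, 1929]},
    index=["a", "b", "c", "d", "e"],
)

by_position = toy.iloc[1:4]    # positions 1, 2, 3 (end excluded)
by_label = toy.loc["b":"d"]    # labels b, c, d (end included)

print(by_position.equals(by_label))
print(len(toy.iloc[1:3]), len(toy.loc["b":"d"]))
```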


Since each column is a Series backed by a NumPy array, we can easily manipulate the values and create new columns

In [65]:
movies['Budget2'] = movies.Budget+45

In [66]:
movies[['Budget', 'Budget2']]

Title,Budget,Budget2
Intolerance: Love's Struggle Throughout the Ages,385907.0,385952.0
Over the Hill to the Poorhouse,100000.0,100045.0
The Big Parade,245000.0,245045.0
Metropolis,6000000.0,6000045.0
Pandora's Box,,
...,...,...
Twin Falls Idaho,500000.0,500045.0
Universal Soldier: The Return,24000000.0,24000045.0
Varsity Blues,16000000.0,16000045.0
Wild Wild West,170000000.0,170000045.0
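The same element-wise arithmetic works between two columns. A short sketch on three rows copied from the outputs above; the `Profit` column is a hypothetical example, not part of the original data:

```python
import numpy as np
import pandas as pd

films = pd.DataFrame(
    {
        "Budget": [100000.0, 6000000.0, np.nan],
        "Gross Earnings": [3000000.0, 26435.0, 9950.0],
    },
    index=["Over the Hill to the Poorhouse", "Metropolis", "Pandora's Box"],
)

# Element-wise arithmetic between columns; no explicit loop needed
films["Profit"] = films["Gross Earnings"] - films["Budget"]

# Missing inputs propagate: Pandora's Box has no Budget, so its Profit is NaN
print(films)
```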


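We can also append new rows to the DataFrame. Since the sheets all follow the same format, we can just stack them together using __concat__. Note that movies uses Title as its index while movies_2000 keeps Title as a regular column, so the combined result has a mixed index and a partly empty Title column: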

In [67]:
df2 = pd.concat([movies, movies_2000])

In [68]:
df2

,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score,Budget2,Title
Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123.0,1.33,385907.0,,D.W. Griffith,...,9.0,481,691,1,10718,88.0,69.0,8.0,385952.0,
Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110.0,1.33,100000.0,3000000.0,Harry F. Millarde,...,0.0,4,0,1,5,1.0,1.0,4.8,100045.0,
The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151.0,1.33,245000.0,,King Vidor,...,6.0,108,226,0,4849,45.0,48.0,8.3,245045.0,
Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,Fritz Lang,...,18.0,203,12000,1,111841,413.0,260.0,8.3,6000045.0,
Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110.0,1.33,,9950.0,Georg Wilhelm Pabst,...,3.0,455,926,1,7431,84.0,71.0,8.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2095,2009,Action|Adventure|Fantasy|Sci-Fi|Thriller,English,USA,PG-13,119.0,2.35,150000000.0,179883016.0,Gavin Hood,...,2000.0,40054,0,0,361924,641.0,350.0,6.7,,X-Men Origins: Wolverine
2096,2009,Adventure|Comedy,English,USA,PG-13,100.0,1.85,60000000.0,43337279.0,Harold Ramis,...,485.0,12258,0,2,76770,203.0,170.0,4.9,,Year One
2097,2009,Comedy|Drama|Romance,English,USA,R,90.0,1.85,18000000.0,15281286.0,Miguel Arteta,...,729.0,16004,0,2,64646,105.0,192.0,6.5,,Youth in Revolt
2098,2009,Comedy|Horror|Sci-Fi,English,USA,R,89.0,1.85,500000.0,,Kevin Hamedani,...,23.0,292,0,0,3650,39.0,64.0,5.1,,ZMD: Zombies of Mass Destruction
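In the output above the first rows are labeled by title and the later ones by integers, and the Title column is empty for the first block. That happens whenever one frame carries Title in its index and the other carries it as a column. A minimal sketch of one way to line them up before stacking, on toy stand-ins (`indexed` and `flat` are illustrative names):

```python
import pandas as pd

# Toy stand-ins: one frame indexed by Title, one with Title as a column
indexed = pd.DataFrame({"Year": [1927]}, index=pd.Index(["Metropolis"], name="Title"))
flat = pd.DataFrame({"Title": ["28 Days"], "Year": [2000]})

# Stacking them directly mixes title labels with integer labels
mixed = pd.concat([indexed, flat])

# Moving Title back into a column first gives a consistent result
aligned = pd.concat([indexed.reset_index(), flat], ignore_index=True)
print(aligned)
```

`ignore_index=True` renumbers the rows from 0, which is usually what we want when the original row labels carry no meaning.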


__concat__ can also be used to place two DataFrames side by side

In [71]:
pd.concat([movies, movies], axis=1).shape

(1338, 50)

## Time Series

Apple stock information from https://finance.yahoo.com/quote/AAPL/history

We can automatically convert the Date column using __pd.read_csv__:

In [72]:
data = pd.read_csv('data/AAPL.csv', parse_dates=['Date'])

In [73]:
data.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume              float64
dtype: object

If we now set the Date column to be the index, we effectively create our first Time Series

In [74]:
data.set_index('Date', inplace=True)

We see that pandas automatically generated a "DatetimeIndex" object that allows us to take advantage of the fact that we are dealing with dates

In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9887 entries, 1980-12-12 to 2020-02-28
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       9886 non-null   float64
 1   High       9886 non-null   float64
 2   Low        9886 non-null   float64
 3   Close      9886 non-null   float64
 4   Adj Close  9886 non-null   float64
 5   Volume     9886 non-null   float64
dtypes: float64(6)
memory usage: 540.7 KB


We can easily access parts of the date object

In [76]:
data

Date,Open,High,Low,Close,Adj Close,Volume
1980-12-12,0.513393,0.515625,0.513393,0.513393,0.406782,117258400.0
1980-12-15,0.488839,0.488839,0.486607,0.486607,0.385558,43971200.0
1980-12-16,0.453125,0.453125,0.450893,0.450893,0.357260,26432000.0
1980-12-17,0.462054,0.464286,0.462054,0.462054,0.366103,21610400.0
1980-12-18,0.475446,0.477679,0.475446,0.475446,0.376715,18362400.0
...,...,...,...,...,...,...
2020-02-24,297.260010,304.179993,289.230011,298.179993,298.179993,55548800.0
2020-02-25,300.950012,302.529999,286.130005,288.079987,288.079987,57668400.0
2020-02-26,286.529999,297.880005,286.500000,292.649994,292.649994,49513700.0
2020-02-27,281.100006,286.000000,272.959991,273.519989,273.519989,80151400.0


In [77]:
data.index.month

Int64Index([12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
            ...
             2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
           dtype='int64', name='Date', length=9887)

In [78]:
data.index.year

Int64Index([1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980,
            ...
            2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
           dtype='int64', name='Date', length=9887)

In [79]:
data.index.day

Int64Index([12, 15, 16, 17, 18, 19, 22, 23, 24, 26,
            ...
            14, 18, 19, 20, 21, 24, 25, 26, 27, 28],
           dtype='int64', name='Date', length=9887)

And slice the DataFrame by date

In [96]:
data.loc['2010':'2010-06-10'].round(2)

Date,Open,High,Low,Close,Adj Close,Volume
2010-01-04,30.49,30.64,30.34,30.57,26.54,123432400.0
2010-01-05,30.66,30.80,30.46,30.63,26.58,150476200.0
2010-01-06,30.63,30.75,30.11,30.14,26.16,138040000.0
2010-01-07,30.25,30.29,29.86,30.08,26.11,119282800.0
2010-01-08,30.04,30.29,29.87,30.28,26.29,111902700.0
...,...,...,...,...,...,...
2010-06-04,36.89,37.41,36.38,36.57,31.74,189576100.0
2010-06-07,36.90,37.02,35.79,35.85,31.12,221735500.0
2010-06-08,36.18,36.26,35.09,35.62,30.92,250192600.0
2010-06-09,35.92,35.99,34.64,34.74,30.16,213657500.0


As before, we note that the last value is also included
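Date strings can be as coarse or as fine as needed: a year, a month, or a full date. A minimal sketch on a synthetic business-day series (the prices are just a counter, made up for illustration) showing a whole-month selection and an inclusive date range:

```python
import numpy as np
import pandas as pd

# Synthetic business-day series; the values are placeholders
dates = pd.date_range("2010-01-01", "2010-12-31", freq="B")
prices = pd.DataFrame({"Close": np.arange(len(dates), dtype=float)}, index=dates)

# A partial string such as "2010-03" selects the whole month
march = prices.loc["2010-03"]

# Slices by date string include both endpoints
window = prices.loc["2010-06-07":"2010-06-10"]

print(march.index.min().date(), march.index.max().date())
print(window.index.day.tolist())
```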

## DataFrame Manipulations

Map allows us to easily apply a function to each element of a Series.

In [81]:
movies['Director'].head()

Title
Intolerance: Love's Struggle Throughout the Ages           D.W. Griffith
Over the Hill to the Poorhouse                         Harry F. Millarde
The Big Parade                                                King Vidor
Metropolis                                                    Fritz Lang
Pandora's Box                                        Georg Wilhelm Pabst
Name: Director, dtype: object

In [84]:
movies['Director'].map(lambda x: x.lower())

Title
Intolerance: Love's Struggle Throughout the Ages           d.w. griffith
Over the Hill to the Poorhouse                         harry f. millarde
The Big Parade                                                king vidor
Metropolis                                                    fritz lang
Pandora's Box                                        georg wilhelm pabst
                                                            ...         
Twin Falls Idaho                                          michael polish
Universal Soldier: The Return                                mic rodgers
Varsity Blues                                              brian robbins
Wild Wild West                                          barry sonnenfeld
Wing Commander                                             chris roberts
Name: Director, Length: 1338, dtype: object
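The lambda above works because the Director column has no missing values; a NaN is a float and has no `lower` method, so the same call would raise an error on a column with gaps. A small sketch of two ways around that, on a toy Series with one missing entry:

```python
import numpy as np
import pandas as pd

directors = pd.Series(["Fritz Lang", np.nan, "King Vidor"])

# The vectorized string accessor skips missing values instead of raising
lowered = directors.str.lower()

# With map, na_action="ignore" gives the same protection for arbitrary functions
initials = directors.map(lambda name: name[0], na_action="ignore")

print(lowered.tolist())
print(initials.tolist())
```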

For a DataFrame we can use transform

In [85]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1338 entries, Intolerance: Love's Struggle Throughout the Ages  to Wing Commander 
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Year                         1338 non-null   int64  
 1   Genres                       1338 non-null   object 
 2   Language                     1334 non-null   object 
 3   Country                      1338 non-null   object 
 4   Content Rating               1316 non-null   object 
 5   Duration                     1338 non-null   int64  
 6   Aspect Ratio                 1308 non-null   float64
 7   Budget                       1281 non-null   float64
 8   Gross Earnings               1086 non-null   float64
 9   Director                     1338 non-null   object 
 10  Actor 1                      1338 non-null   object 
 11  Actor 2                      1338 non-null   object 
 12  Actor 3               

In [86]:
movies[['Duration', 'Budget']].transform(lambda x: x * 2)

Title,Duration,Budget
Intolerance: Love's Struggle Throughout the Ages,246,771814.0
Over the Hill to the Poorhouse,220,200000.0
The Big Parade,302,490000.0
Metropolis,290,12000000.0
Pandora's Box,220,
...,...,...
Twin Falls Idaho,222,1000000.0
Universal Soldier: The Return,166,48000000.0
Varsity Blues,212,32000000.0
Wild Wild West,212,340000000.0


Or apply, which passes each column (axis=0) or each row (axis=1) to the function

In [87]:
movies[['Duration', 'Budget']].apply(np.sum, axis=0)

Duration    1.512890e+05
Budget      3.474154e+10
dtype: float64

In [88]:
movies[['Duration', 'Budget']].apply(np.sum, axis=1)

Title
Intolerance: Love's Struggle Throughout the Ages        386030.0
Over the Hill to the Poorhouse                          100110.0
The Big Parade                                          245151.0
Metropolis                                             6000145.0
Pandora's Box                                              110.0
                                                        ...     
Twin Falls Idaho                                        500111.0
Universal Soldier: The Return                         24000083.0
Varsity Blues                                         16000106.0
Wild Wild West                                       170000106.0
Wing Commander                                        30000100.0
Length: 1338, dtype: float64
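The axis argument is easier to follow on a frame small enough to check by hand. This sketch uses the first few Duration and Budget values from the outputs above, including one missing Budget, and compares apply with the built-in `sum` reduction:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Duration": [123, 110, 110],
    "Budget": [385907.0, 100000.0, np.nan],
})

col_totals = toy.apply(np.sum, axis=0)   # one value per column
row_totals = toy.apply(np.sum, axis=1)   # one value per row

# The built-in reduction gives the same column totals without apply
builtin_totals = toy.sum(axis=0)

print(col_totals)
print(row_totals)
```

As in the movie output, the missing Budget is skipped rather than turning the whole sum into NaN.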

## Merge and Join

Define 2 toy DataFrames

In [89]:
A = pd.DataFrame({"lkey":["foo", "bar", "baz", "foo"], "value":[1,2,3,4]})
B = pd.DataFrame({"rkey":["foo", "bar", "qux", "bar"], "value":[5,6,7,8]})

Merge allows us to join them by specifying an arbitrary column on each of them

In [93]:
A.merge(B, left_on="lkey", right_on="rkey", how="inner")

,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,4,foo,5
2,bar,2,bar,6
3,bar,2,bar,8
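The `how` argument decides which keys survive the merge. A short sketch on the same two toy frames, comparing an inner join with a left join (the names `inner` and `left` are illustrative):

```python
import pandas as pd

A = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4]})
B = pd.DataFrame({"rkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8]})

inner = A.merge(B, left_on="lkey", right_on="rkey", how="inner")
left = A.merge(B, left_on="lkey", right_on="rkey", how="left")

# "baz" has no match in B: dropped by the inner join, kept with NaN by the left join
print(len(inner), len(left))
print(left[left["rkey"].isna()])
```

Since "bar" appears twice in B, each matching row of A is repeated once per match in both variants.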


On the other hand, join performs the join using the respective Indices

In [94]:
A.set_index('lkey', inplace=True)
B.set_index('rkey', inplace=True)

In [95]:
A.join(B, lsuffix="_l", rsuffix="_r", how="inner")

,value_l,value_r
bar,2,6
bar,2,8
foo,1,5
foo,4,5
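For comparison, the same index-based join can be written with merge by setting `left_index` and `right_index`. A minimal sketch that rebuilds the two toy frames and checks that both routes give the same rows (row order can differ between the two, so we compare sorted pairs):

```python
import pandas as pd

A = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4]}).set_index("lkey")
B = pd.DataFrame({"rkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8]}).set_index("rkey")

joined = A.join(B, lsuffix="_l", rsuffix="_r", how="inner")
merged = A.merge(B, left_index=True, right_index=True, how="inner", suffixes=("_l", "_r"))

# Compare the (value_l, value_r) pairs regardless of row order
joined_pairs = sorted(map(tuple, joined.values.tolist()))
merged_pairs = sorted(map(tuple, merged.values.tolist()))
print(joined_pairs == merged_pairs)
```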


<div style="width: 100%; overflow: hidden;">
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</div>