<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Transforming Excel Analysis into pandas Data Models</h1>
<h1>Excel Pitfalls</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np
import numpy_financial as npf

import matplotlib
import matplotlib.pyplot as plt 

import watermark

%load_ext watermark
%matplotlib inline

We start by printing out the versions of the libraries we're using for future reference

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Git hash: 8cd8c4969f6d4b5f0dbf5d55746efe085989b1e3

pandas         : 1.1.3
numpy_financial: 1.0.0
json           : 2.0.9
watermark      : 2.1.0
numpy          : 1.19.2
matplotlib     : 3.3.2



Load default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')

## Large file

DataFrames are limited only by available memory and have no fixed limit on the number of rows or columns

In [4]:
taxis = pd.read_csv('data/green_tripdata_2014-04.csv.gz', 
        parse_dates=['lpep_pickup_datetime', 'Lpep_dropoff_datetime']
                   )

In [5]:
!ls -lh 'data/green_tripdata_2014-04.csv.gz'

-rw-r--r--@ 1 bgoncalves  staff    51M Sep  1  2020 data/green_tripdata_2014-04.csv.gz


Even relatively small files can have a large number of rows

In [6]:
taxis.shape

(1309155, 22)
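
To make the memory point concrete, here is a minimal sketch that builds a synthetic table with more rows than fit in a single Excel worksheet (1,048,576) and measures its footprint. The column names, the row count, and the random values are assumptions for illustration only; this is not the taxi data.

```python
import numpy as np
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet

# Synthetic stand-in for a large table (hypothetical columns and values)
n_rows = 1_200_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'vendor': rng.integers(1, 3, size=n_rows),  # int64 values in {1, 2}
    'fare': rng.random(n_rows) * 50,            # float64 values in [0, 50)
})

# Per-column memory usage in bytes (the index is reported as well)
bytes_per_column = df.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1e6

print(f'{len(df):,} rows use about {total_mb:.1f} MB')
```

Two 8-byte numeric columns cost roughly 16 bytes per row, so the practical ceiling comes from available RAM rather than from a fixed row count.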

And we can be sure that each column has a single, consistent data type, without any unexpected changes in formatting

In [7]:
taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309155 entries, 0 to 1309154
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   VendorID               1309155 non-null  int64         
 1   lpep_pickup_datetime   1309155 non-null  datetime64[ns]
 2   Lpep_dropoff_datetime  1309155 non-null  datetime64[ns]
 3   Store_and_fwd_flag     1309155 non-null  object        
 4   RateCodeID             1309155 non-null  int64         
 5   Pickup_longitude       1309155 non-null  float64       
 6   Pickup_latitude        1309155 non-null  float64       
 7   Dropoff_longitude      1309155 non-null  float64       
 8   Dropoff_latitude       1309155 non-null  float64       
 9   Passenger_count        1309155 non-null  int64         
 10  Trip_distance          1309155 non-null  float64       
 11  Fare_amount            1309155 non-null  float64       
 12  Extra                  13091
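
As a rough illustration of why a single dtype per column matters, the sketch below uses a hypothetical column in which one entry was typed as text, as can happen in a spreadsheet. The values are made up; `pd.to_numeric` with `errors='coerce'` is one way (not the only one) to locate such entries.

```python
import pandas as pd

# Hypothetical raw column: one entry is not a number
raw = pd.Series(['12.5', '7', 'n/a', '3.25'])

# Entries that cannot be parsed become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

# The original values that failed to parse
bad_rows = raw[numeric.isna()]

print(numeric.dtype)
print(bad_rows)
```

Once the offending entries are found and fixed, the whole column carries one numeric dtype and every computation on it behaves the same way for every row.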

Computations are column-based and written in a compact form

In [8]:
taxis['Trip_type'].unique()

array([ 1., nan,  2.])

In [9]:
taxis['Fare_amount'].mean()

12.339962212266684
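
The same column-based style extends to derived columns and grouped summaries. The sketch below reuses column names from the taxi data, but the three rows are invented for illustration, and `Tip_fraction` is a hypothetical column name.

```python
import pandas as pd

# Tiny made-up sample using column names from the taxi data
trips = pd.DataFrame({
    'Fare_amount': [10.0, 20.0, 8.0],
    'Tip_amount': [2.0, 0.0, 2.0],
    'Payment_type': [1, 2, 1],
})

# One expression operates on the entire column at once
trips['Tip_fraction'] = trips['Tip_amount'] / trips['Fare_amount']

# Grouped summary: mean fare for each payment type
mean_fare_by_payment = trips.groupby('Payment_type')['Fare_amount'].mean()

print(trips)
print(mean_fare_by_payment)
```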

Easily index any part of the full DataFrame

In [10]:
taxis.iloc[1000:1020]

Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type,Unnamed: 20,Unnamed: 21
1000,2,2014-04-01 01:00:44,2014-04-01 01:13:30,N,1,-73.89183,40.747021,-73.917595,40.834263,5,...,0.5,0.5,0.0,5.33,,28.33,2,1.0,,
1001,1,2014-04-01 01:00:45,2014-04-01 01:08:54,N,1,-73.918846,40.759144,-73.944969,40.753735,1,...,0.5,0.5,2.0,0.0,,11.5,1,,,
1002,2,2014-04-01 01:00:46,2014-04-01 01:13:03,N,5,-73.919205,40.8153,-73.946342,40.797829,1,...,0.0,0.0,0.0,0.0,,0.12,2,2.0,,
1003,2,2014-04-01 01:00:59,2014-04-01 01:18:16,N,1,-73.937576,40.758251,-73.958313,40.731274,1,...,0.5,0.5,2.0,0.0,,21.5,1,1.0,,
1004,2,2014-04-01 01:01:07,2014-04-01 01:01:24,N,5,-73.871689,40.852695,-73.871452,40.846371,1,...,0.0,0.0,4.0,0.0,,24.0,1,2.0,,
1005,2,2014-04-01 01:01:07,2014-04-01 01:06:28,N,1,-73.947624,40.71125,-73.934898,40.703197,1,...,0.5,0.5,0.0,0.0,,7.5,2,1.0,,
1006,1,2014-04-01 01:01:08,2014-04-01 01:07:29,N,1,-73.997536,40.594048,-73.998772,40.577499,1,...,0.5,0.5,0.0,0.0,,9.0,2,,,
1007,1,2014-04-01 01:01:15,2014-04-01 01:21:49,N,1,-73.975716,40.687069,-73.974289,40.646133,1,...,0.5,0.5,0.0,0.0,,19.0,2,,,
1008,2,2014-04-01 01:01:19,2014-04-01 01:10:20,N,1,-73.844254,40.721008,-73.806267,40.678631,1,...,0.5,0.5,0.0,0.0,,14.5,1,1.0,,
1009,2,2014-04-01 01:01:22,2014-04-01 01:03:55,N,1,-73.931755,40.765072,-73.918663,40.758862,1,...,0.5,0.5,0.0,0.0,,5.0,2,1.0,,
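
Besides positional slices with `iloc`, rows can be selected by label or by condition. A minimal sketch on a hypothetical four-row frame (the values and the index labels are assumptions chosen to make the difference visible):

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions
trips = pd.DataFrame(
    {'Passenger_count': [5, 1, 1, 2],
     'Total_amount': [28.33, 11.5, 0.12, 21.5]},
    index=[1000, 1001, 1002, 1003],
)

by_position = trips.iloc[1:3]                  # positions 1 and 2 (end exclusive)
by_label = trips.loc[1001:1002]                # labels 1001 through 1002 (end inclusive)
expensive = trips[trips['Total_amount'] > 20]  # boolean mask on a column

print(by_position)
print(by_label)
print(expensive)
```

Note that `iloc` slices exclude the end position, while `loc` slices include the end label.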


## Mortgage Calculator

The logic underlying any computation is always clear, as each cell displays the code instead of just the computed values

In [12]:
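price = 1_000_000
down = .2 * price          # 20% down payment
amount = price - down      # amount financed
years = 30
months = 12*years          # number of monthly payments
rate = .03/12              # monthly rate from a 3% annual rate

In [13]:
payment = npf.pmt(rate, months, -amount)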
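
The payment can be cross-checked by hand. Assuming payments are made at the end of each period (the default for `npf.pmt`), the standard annuity formula is payment = r·A / (1 − (1 + r)^(−n)), where A is the amount financed, r the monthly rate, and n the number of payments. A minimal sketch in plain Python, with variable names chosen here for illustration:

```python
# Inputs matching the values used in this section
loan = 1_000_000 - 0.2 * 1_000_000   # amount financed after a 20% down payment
monthly_rate = 0.03 / 12             # 3% annual rate, compounded monthly
n_payments = 12 * 30                 # 30 years of monthly payments

# Closed-form annuity payment (end-of-period payments assumed)
payment_check = monthly_rate * loan / (1 - (1 + monthly_rate) ** -n_payments)

print(round(payment_check, 2))
```

This agrees with the scheduled payment of about 3372.83 shown in the amortization table.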

In [14]:
# First-of-month payment dates, starting on 2020-09-01
dates = pd.date_range('2020-09-01', periods=months, freq='MS')

With only these few lines of code, we can easily audit how the computation is performed and spot any bugs

In [15]:
rows = []

balance = price - down 
extra_payment = 0

for month in dates:
    row = [month]
    row.append(balance)
    row.append(payment)
    row.append(extra_payment)
    row.append(payment + extra_payment)
    
    interest = balance*rate
    principal = payment-interest
    
    row.append(principal+extra_payment)
    row.append(interest)
    
    balance -= principal+extra_payment
    
    row.append(balance)
    rows.append(row)
    
    if balance <= 0:
        break
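
Because the loop above is ordinary code, it is straightforward to wrap the same logic in a function and explore scenarios. The sketch below is one possible variant, not the notebook's implementation: the function name `amortize`, the cap on the final principal payment, and the sub-cent tolerance are all assumptions made here.

```python
def amortize(balance, rate, payment, extra_payment=0.0, max_months=360):
    """Return (months_paid, total_interest) for a fixed-rate loan."""
    total_interest = 0.0
    months_paid = 0
    for _ in range(max_months):
        interest = balance * rate
        # Never pay down more principal than is still owed
        principal = min(payment - interest + extra_payment, balance)
        balance -= principal
        total_interest += interest
        months_paid += 1
        if balance <= 1e-6:  # treat sub-cent residue as fully paid
            break
    return months_paid, total_interest


loan, rate, n = 800_000, 0.03 / 12, 360
payment = rate * loan / (1 - (1 + rate) ** -n)  # standard annuity payment

base = amortize(loan, rate, payment)
faster = amortize(loan, rate, payment, extra_payment=500)

print('scheduled payments only:', base)
print('with 500 extra per month:', faster)
```

Comparing the two tuples shows how a fixed extra payment shortens the loan and lowers the total interest paid.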

We then convert the rows into a compact DataFrame that can be used to perform further computations

In [16]:
mortgage = pd.DataFrame(rows, columns=['Date', 'Beginning Balance', 'Scheduled Payment', 'Extra Payment', 
                                       'Total Payment', 'Principal', 'Interest', 'Ending Balance'])
mortgage['Cumulative Interest'] = mortgage.Interest.cumsum()

In [17]:
mortgage

Unnamed: 0,Date,Beginning Balance,Scheduled Payment,Extra Payment,Total Payment,Principal,Interest,Ending Balance,Cumulative Interest
0,2020-09-01,800000.000000,3372.83227,0,3372.83227,1372.832270,2000.000000,7.986272e+05,2000.000000
1,2020-10-01,798627.167730,3372.83227,0,3372.83227,1376.264351,1996.567919,7.972509e+05,3996.567919
2,2020-11-01,797250.903380,3372.83227,0,3372.83227,1379.705011,1993.127258,7.958712e+05,5989.695178
3,2020-12-01,795871.198368,3372.83227,0,3372.83227,1383.154274,1989.677996,7.944880e+05,7979.373174
4,2021-01-01,794488.044094,3372.83227,0,3372.83227,1386.612160,1986.220110,7.931014e+05,9965.593284
...,...,...,...,...,...,...,...,...,...
355,2050-04-01,16738.414274,3372.83227,0,3372.83227,3330.986234,41.846036,1.340743e+04,414135.716101
356,2050-05-01,13407.428039,3372.83227,0,3372.83227,3339.313700,33.518570,1.006811e+04,414169.234671
357,2050-06-01,10068.114340,3372.83227,0,3372.83227,3347.661984,25.170286,6.720452e+03,414194.404957
358,2050-07-01,6720.452356,3372.83227,0,3372.83227,3356.031139,16.801131,3.364421e+03,414211.206088
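
As an example of a further computation, a schedule like this can be summarized per calendar year. The sketch below uses a hypothetical three-row schedule with invented values so that it runs on its own; the same `groupby` pattern should apply to a full schedule that has a datetime `Date` column.

```python
import pandas as pd

# Hypothetical three-month schedule; the values are illustrative only
schedule = pd.DataFrame({
    'Date': pd.to_datetime(['2020-12-01', '2021-01-01', '2021-02-01']),
    'Interest': [100.0, 90.0, 80.0],
    'Principal': [400.0, 410.0, 420.0],
})

# Running total of interest paid so far
schedule['Cumulative Interest'] = schedule['Interest'].cumsum()

# Interest and principal paid in each calendar year
yearly = schedule.groupby(schedule['Date'].dt.year)[['Interest', 'Principal']].sum()

print(schedule)
print(yearly)
```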


## Non-standardized data

In [18]:
movies = pd.read_excel('data/movies.xlsx')

In [19]:
movies.head()

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
1,Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,...,2,2,0.0,4,0,1,5,1,1.0,4.8
2,The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
3,Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
4,Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,...,426,20,3.0,455,926,1,7431,84,71.0,8.0


Unexpected whitespace in the data, such as a trailing non-breaking space (`\xa0`)

In [20]:
movies['Title'].iloc[10]

'It Happened One Night\xa0'

This can easily be cleaned up

In [21]:
movies['Title'] = movies['Title'].str.strip()

In [22]:
movies['Title'].iloc[10]

'It Happened One Night'
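
Stripping only removes whitespace at the ends of a string. If non-breaking spaces may also appear inside values, one possible approach is to replace them before stripping. A minimal sketch on made-up titles (the sample values, including the missing entry, are assumptions):

```python
import pandas as pd

# Made-up titles with leading, trailing, and embedded odd whitespace
titles = pd.Series([
    'It Happened One Night\xa0',
    '  Metropolis',
    'The\xa0Big Parade ',
    None,
])

# Replace non-breaking spaces with regular spaces, then strip both ends;
# missing values stay missing
cleaned = titles.str.replace('\xa0', ' ', regex=False).str.strip()

# Flag the rows that were actually modified
changed = titles.notna() & (titles != cleaned)

print(cleaned)
print(changed)
```

Keeping a `changed` mask makes the cleanup auditable, since it records exactly which rows were touched.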

<div style="width: 100%; overflow: hidden;">
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</div>