<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0"> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Transforming Excel Analysis into pandas Data Models</h1>
<h1>Excel Pitfalls</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np
import numpy_financial as npf

import matplotlib
import matplotlib.pyplot as plt 

import watermark

%load_ext watermark
%matplotlib inline

We start by printing out the versions of the libraries we're using for future reference

In [2]:
%watermark -n -v -m -g -iv

autopep8        1.5
watermark       2.0.2
numpy_financial 1.0.0
numpy           1.18.1
pandas          1.0.1
matplotlib      3.1.3
json            2.0.9
Wed Sep 02 2020 

CPython 3.7.3
IPython 6.2.1

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.6.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : 00fa09128f296bb2b7d4858d993e4db3654af99c


Load default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')

## Large file

DataFrames are limited only by available memory and have no fixed limit on the number of rows or columns, unlike an Excel worksheet, which is capped at 1,048,576 rows and 16,384 columns

In [4]:
taxis = pd.read_csv(
    'data/green_tripdata_2014-04.csv.gz',
    parse_dates=['lpep_pickup_datetime', 'Lpep_dropoff_datetime'],
)

Even relatively small files can have a large number of rows

In [5]:
taxis.shape

(1309155, 22)
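
Since memory is the binding constraint, it helps to know how much of it a table actually uses. The following is a minimal sketch on synthetic data (the column names and sizes are made up for illustration, not taken from the taxi file) showing how `memory_usage` reports the footprint and how smaller dtypes can reduce it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large table (hypothetical column names)
n_rows = 100_000
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    'vendor_id': rng.integers(1, 3, size=n_rows),   # 64-bit integers by default
    'fare': rng.uniform(2.5, 60.0, size=n_rows),    # 64-bit floats
    'flag': rng.choice(['Y', 'N'], size=n_rows),    # strings
})

# deep=True also counts the memory held by the string values themselves
bytes_per_column = trips.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024**2

# Smaller integer types and categoricals usually shrink the footprint
compact = trips.assign(
    vendor_id=trips['vendor_id'].astype('int8'),
    flag=trips['flag'].astype('category'),
)
compact_mb = compact.memory_usage(deep=True).sum() / 1024**2

print(f'{total_mb:.2f} MB -> {compact_mb:.2f} MB')
```

The exact savings depend on the data, so treat the numbers printed here as indicative only.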

We can also be sure that each column has a single, consistent data type, without any unexpected changes in formatting

In [6]:
taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309155 entries, 0 to 1309154
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   VendorID               1309155 non-null  int64         
 1   lpep_pickup_datetime   1309155 non-null  datetime64[ns]
 2   Lpep_dropoff_datetime  1309155 non-null  datetime64[ns]
 3   Store_and_fwd_flag     1309155 non-null  object        
 4   RateCodeID             1309155 non-null  int64         
 5   Pickup_longitude       1309155 non-null  float64       
 6   Pickup_latitude        1309155 non-null  float64       
 7   Dropoff_longitude      1309155 non-null  float64       
 8   Dropoff_latitude       1309155 non-null  float64       
 9   Passenger_count        1309155 non-null  int64         
 10  Trip_distance          1309155 non-null  float64       
 11  Fare_amount            1309155 non-null  float64       
 12  Extra                  13091
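
To illustrate why checking the dtypes is useful, here is a small sketch with a hypothetical `fare` column in which one cell was entered as text, as often happens in a spreadsheet. The stray value shows up immediately in the column's dtype, and `pd.to_numeric` is one possible way to repair it.

```python
import pandas as pd

# Hypothetical column where one cell was typed as text
raw = pd.DataFrame({'fare': [12.5, 8.0, 'n/a', 20.25]})

# A single stray string forces the whole column to the generic object dtype
print(raw.dtypes)

# errors='coerce' turns unparseable entries into NaN and restores a numeric dtype
clean = raw.assign(fare=pd.to_numeric(raw['fare'], errors='coerce'))
print(clean.dtypes)
print(clean['fare'].isna().sum())
```

Coercing silently discards the offending values, so it is worth counting the resulting `NaN` entries before moving on.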

Computations are column based and written in a compact form

In [7]:
taxis['Trip_type'].unique()

array([ 1., nan,  2.])

In [8]:
taxis['Fare_amount'].mean()

12.339962212266684
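
The `nan` in the output of `unique()` is worth a closer look. As a rough sketch on toy data (hypothetical column names and values), reductions such as `mean` skip missing values by default, while element-wise arithmetic propagates them.

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical)
df = pd.DataFrame({
    'trip_type': [1.0, np.nan, 2.0, 1.0],
    'fare': [10.0, 12.0, np.nan, 20.0],
})

# Reductions skip NaN by default (skipna=True): (10 + 12 + 20) / 3
mean_fare = df['fare'].mean()

# Element-wise arithmetic applies to the whole column at once; NaN stays NaN
df['fare_with_tip'] = df['fare'] * 1.2

# dropna=False makes the missing entries visible in the counts
counts = df['trip_type'].value_counts(dropna=False)
print(mean_fare)
print(counts)
```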

Easily index any part of the full DataFrame

In [9]:
taxis.tail()

Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type,Unnamed: 20,Unnamed: 21
1309150,2,2014-04-30 23:59:47,2014-05-01 00:11:07,N,1,-73.939415,40.805447,-73.952591,40.824951,1,...,0.5,0.5,2.62,0.0,,13.62,1,1.0,,
1309151,1,2014-04-30 23:59:58,2014-05-01 00:21:50,N,1,-73.957474,40.717976,-73.908546,40.765022,1,...,0.5,0.5,5.85,0.0,,25.35,1,1.0,,
1309152,2,2014-04-30 23:59:58,2014-05-01 00:07:19,N,1,-73.95565,40.721756,-73.942276,40.708183,1,...,0.5,0.5,2.0,0.0,,11.0,1,1.0,,
1309153,2,2014-04-30 23:59:59,2014-05-01 00:09:18,N,1,-73.933823,40.802601,-73.918701,40.818592,1,...,0.5,0.5,0.0,0.0,,11.0,2,1.0,,
1309154,1,2014-04-30 23:59:59,2014-05-01 00:26:09,N,1,-73.995766,40.691593,-73.834732,40.786205,2,...,0.5,0.5,8.6,0.0,,51.6,1,1.0,,
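
Beyond `head` and `tail`, rows and columns can be selected by position, by label, or by condition. The sketch below uses a small made-up table (the values and index labels are illustrative only) to show the most common forms.

```python
import pandas as pd

# Small stand-in table (hypothetical values)
df = pd.DataFrame(
    {'fare': [5.0, 25.0, 11.0, 51.6], 'passengers': [1, 1, 2, 2]},
    index=[100, 101, 102, 103],
)

last_two = df.iloc[-2:]                            # by position, like tail(2)
one_cell = df.loc[101, 'fare']                     # by label: row 101, column 'fare'
expensive = df[df['fare'] > 20]                    # boolean mask on a column
subset = df.loc[df['passengers'] == 2, ['fare']]   # filter rows and pick columns together
```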


## Mortgage Calculator

The logic underlying any computation is always clear, as each cell displays the code instead of just the computed values

In [10]:
price = 1_000_000
down = .2 * price
amount = price-down
years = 30
months = 12*years
rate = .03/12

In [11]:
payment = npf.pmt(rate, months, -amount)
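
As a cross-check, the level payment can also be computed by hand from the standard annuity formula $P = \frac{rA}{1 - (1 + r)^{-n}}$, where $A$ is the amount financed, $r$ the monthly rate, and $n$ the number of payments. This is a sketch in plain Python, assuming the same loan parameters as above; it should come out at roughly 3372.83, in line with the scheduled payment used in the rest of this section.

```python
# Loan parameters taken from the cells above
principal = 800_000          # price minus the 20% down payment
rate = 0.03 / 12             # monthly interest rate
months = 12 * 30             # number of monthly payments

# Level payment of an ordinary annuity: P = r * A / (1 - (1 + r) ** -n)
payment = rate * principal / (1 - (1 + rate) ** -months)

print(round(payment, 2))
```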

In [12]:
dates = pd.date_range('09/01/2020', periods=months, freq='MS')

With the whole calculation contained in these few lines of code, we can easily audit each step of the computation and spot any bugs

In [13]:
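rows = []

# Start from the financed amount and add a fixed extra payment each month
balance = price - down 
extra_payment = 100

for month in dates:
    # Record the state at the start of the month
    row = [month]
    row.append(balance)
    row.append(payment)
    row.append(extra_payment)
    row.append(payment+extra_payment)
    
    # Interest accrues on the outstanding balance; the rest of the payment is principal
    interest = balance*rate
    principal = payment-interest
    
    row.append(principal+extra_payment)
    row.append(interest)
    
    # The extra payment goes entirely towards principal
    balance -= principal+extra_payment
    
    row.append(balance)
    rows.append(row)
    
    # Stop as soon as the loan is paid off
    if balance <= 0:
        break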
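
Auditing the loop shows one detail worth noting: the full payment is applied even in the final month, so the last ending balance can dip below zero. One possible refinement, sketched here in plain Python with illustrative variable names, caps the final principal payment at the amount still owed.

```python
# Same loan as above; all names here are illustrative
balance = 800_000.0
rate = 0.03 / 12
months = 12 * 30
payment = rate * balance / (1 - (1 + rate) ** -months)
extra_payment = 100

schedule = []
while balance > 0:
    interest = balance * rate
    # Never pay down more principal than is still owed
    principal = min(payment + extra_payment - interest, balance)
    balance -= principal
    schedule.append((principal + interest, principal, interest, balance))

last_total, _, _, last_balance = schedule[-1]
print(len(schedule), round(last_total, 2), last_balance)
```

With the cap in place the final payment is smaller than the regular one and the schedule ends at a balance of exactly zero.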

We then convert the rows into a compact DataFrame that can be used to perform further computations

In [14]:
mortgage = pd.DataFrame(rows, columns=['Date', 'Beginning Balance', 'Scheduled Payment', 'Extra Payment', 
                                       'Total Payment', 'Principal', 'Interest', 'Ending Balance'])
mortgage['Cumulative Interest'] = mortgage.Interest.cumsum()

In [15]:
mortgage

Unnamed: 0,Date,Beginning Balance,Scheduled Payment,Extra Payment,Total Payment,Principal,Interest,Ending Balance,Cumulative Interest
0,2020-09-01,800000.000000,3372.83227,100,3472.83227,1472.832270,2000.000000,798527.167730,2000.000000
1,2020-10-01,798527.167730,3372.83227,100,3472.83227,1476.514351,1996.317919,797050.653380,3996.317919
2,2020-11-01,797050.653380,3372.83227,100,3472.83227,1480.205636,1992.626633,795570.447743,5988.944553
3,2020-12-01,795570.447743,3372.83227,100,3472.83227,1483.906150,1988.926119,794086.541593,7977.870672
4,2021-01-01,794086.541593,3372.83227,100,3472.83227,1487.615916,1985.216354,792598.925677,9963.087026
...,...,...,...,...,...,...,...,...,...
339,2048-12-01,15664.923674,3372.83227,100,3472.83227,3433.669961,39.162309,12231.253713,392994.225458
340,2049-01-01,12231.253713,3372.83227,100,3472.83227,3442.254136,30.578134,8788.999578,393024.803592
341,2049-02-01,8788.999578,3372.83227,100,3472.83227,3450.859771,21.972499,5338.139807,393046.776091
342,2049-03-01,5338.139807,3372.83227,100,3472.83227,3459.486920,13.345350,1878.652887,393060.121440
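
As an example of such a further computation, the schedule can be summarized per calendar year. The sketch below uses a tiny stand-in table with rounded, made-up figures and the same column names, so that it runs on its own.

```python
import pandas as pd

# Tiny stand-in for the schedule (hypothetical figures, same column names)
sched = pd.DataFrame({
    'Date': pd.to_datetime(['2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01']),
    'Principal': [1480.0, 1484.0, 1488.0, 1491.0],
    'Interest': [1993.0, 1989.0, 1985.0, 1981.0],
})

# Aggregate payments per calendar year
yearly = sched.groupby(sched['Date'].dt.year)[['Principal', 'Interest']].sum()

# Share of each year's payments that went to interest
yearly['Interest Share'] = yearly['Interest'] / (yearly['Principal'] + yearly['Interest'])
print(yearly)
```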


## Non-standardized data

In [16]:
movies = pd.read_excel('data/movies.xlsx')

Spreadsheet data often contains unexpected whitespace, such as the trailing non-breaking space (`\xa0`) in this title

In [17]:
movies['Title'].iloc[0]

"Intolerance: Love's Struggle Throughout the Ages\xa0"

Stray whitespace like this can easily be stripped

In [18]:
movies['Title'] = movies['Title'].str.strip()

In [19]:
movies['Title'].iloc[0]

"Intolerance: Love's Struggle Throughout the Ages"

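Stripping only handles whitespace at the ends of a string. A slightly more thorough cleanup, sketched here on made-up titles rather than the movies file, also replaces non-breaking spaces inside the text and leaves missing values untouched.

```python
import pandas as pd

# Hypothetical titles with the kinds of stray whitespace found in spreadsheets
titles = pd.Series(['Metropolis\xa0', '  Modern\xa0Times ', None])

cleaned = (
    titles
    .str.replace('\xa0', ' ', regex=False)   # non-breaking space -> regular space
    .str.strip()                             # drop leading/trailing whitespace
)
print(cleaned.tolist())
```

The `.str` accessor skips missing entries, so no separate check for empty cells is needed.
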
<div style="width: 100%; overflow: hidden;">
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</div>