In [25]:
import pandas as pd
import numpy as np
import time 

df = pd.read_csv('congress-terms.csv', on_bad_lines='skip')


Now that we have our DataFrame loaded, let's take a look at how we can actually use the data. In this example, we want to loop through the DataFrame and build a list of strings from the first and last names of each representative.

As a note, I'll be using the %%timeit cell magic, which tells the notebook to time the cell over r runs (-r), executing the code n times in each run (-n).
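Outside a notebook, roughly the same measurement can be sketched with the standard library's timeit module. The function build_names below is a made-up stand-in for a cell body, not part of the original notebook.

```python
import timeit

def build_names():
    # Toy workload standing in for the body of a notebook cell (hypothetical)
    pairs = [["john", "smith"], ["jane", "doe"]]
    return [first + " " + last for first, last in pairs]

# Rough equivalent of `%%timeit -r 3 -n 10`: 3 runs, 10 executions per run
run_totals = timeit.repeat(build_names, repeat=3, number=10)

# Each entry is the total time of one run, so divide by 10 for a per-loop time
per_loop = [total / 10 for total in run_totals]
print(len(run_totals), "runs; best per-loop time:", min(per_loop), "seconds")
```

Note that %%timeit reports the mean and standard deviation across runs, while this sketch just prints the best run.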

Remember that there are three common ways to loop through a dataset: index-based iteration, iterators, and list comprehensions. Let's start with the first one, since iloc and loc fall into that category.

In [26]:
a=[['john','smith'],['jane','doe']]
total=[]
for i in range(len(a)):
    total.append(a[i][0]+' '+a[i][1])
print(total)

['john smith', 'jane doe']


In [27]:
%%timeit -r 3 -n 10
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df.iloc[index, 3] + ' ' + df.iloc[index, 5])

1 s ± 39 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [28]:
%%timeit -r 3 -n 10
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df.loc[index, 'firstname'] + ' ' + df.loc[index, 'lastname'])

400 ms ± 9.6 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
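One thing worth keeping in mind: the .loc loop works with range(len(df)) only because the DataFrame has the default 0, 1, 2, ... index, so labels and positions happen to match. Here is a small sketch on a made-up two-row frame (the name people and its index values are assumptions for illustration) showing where the two diverge.

```python
import pandas as pd

# Toy frame with a non-default index, so labels (10, 20) differ from positions (0, 1)
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]},
    index=[10, 20],
)

# .iloc addresses by position, .loc addresses by label
by_position = people.iloc[0, 0] + " " + people.iloc[0, 1]
by_label = people.loc[10, "firstname"] + " " + people.loc[10, "lastname"]
print(by_position, "|", by_label)

# Label 0 does not exist here, so a range(len(people)) loop with .loc would fail
try:
    people.loc[0, "firstname"]
    missing_label = False
except KeyError:
    missing_label = True
print("label 0 missing:", missing_label)
```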


We can also use iat and at, which are like iloc and loc respectively, but are meant to access only single values (scalars) from the DataFrame.
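As a quick sketch of what "single values only" means, here is a toy frame (the name people is an assumption, not the congress data) accessed with at and iat.

```python
import pandas as pd

people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# .at takes a row label and a column label, .iat takes integer positions
first = people.at[0, "firstname"]
last = people.iat[0, 1]
print(first, last)

# Both accessors can also set a single value in place
people.at[1, "lastname"] = "roe"

# Unlike .iloc, .iat rejects slices because it only handles scalars
try:
    people.iat[0:2, 1]
    slice_allowed = True
except Exception:
    slice_allowed = False
print("slice allowed:", slice_allowed)
```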

In [29]:
# df.iat[0:5, 12]  # not allowed: .iat takes one integer row and one integer column, not a slice

In [30]:
%%timeit -r 3 -n 10
# For loop with .iat
total = []
for index in range(len(df)):
    total.append(df.iat[index, 3] + ' ' + df.iat[index, 5])

726 ms ± 12.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [31]:
%%timeit -r 3 -n 10
# For loop with .at
total = []
for index in range(len(df)):
    total.append(df.at[index, 'firstname'] + ' ' + df.at[index, 'lastname'])

193 ms ± 4.56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


The next way to loop through a list is with built-in iterators. This approach is often more readable and can even be faster than indexing if you need to access the value multiple times per loop. However, it's important to note that, unlike the previous examples, reassigning the loop variable does not alter the data you're iterating over. Let's take a look.

In [32]:
a=[['john','smith'],['jane','doe']]
total=[]
for i in a:
    total.append(i[0]+' '+i[1])
print(total)

['john smith', 'jane doe']
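To make the point about not altering data concrete, here is a minimal sketch on a plain list of strings (toy data, not from the dataset). Reassigning the loop variable leaves the list alone, while assigning through an index changes it.

```python
names = ["john", "jane"]

# Iterator style: `name` is just a local label, so rebinding it leaves the list untouched
for name in names:
    name = name.upper()
after_iterator_loop = list(names)
print(after_iterator_loop)

# Index style: assigning through the index replaces the element in the list
for i in range(len(names)):
    names[i] = names[i].upper()
print(names)
```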


In the same vein, pandas has a built-in method called iterrows, which yields (index, Series) pairs that we can iterate through.
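Before timing it, here is a small sketch of what iterrows hands back, using a made-up numeric frame (the column names count and ratio are assumptions). Each row is packaged into a new Series, and because a Series has a single dtype, mixed columns get upcast along the way.

```python
import pandas as pd

frame = pd.DataFrame({"count": [1, 2], "ratio": [1.5, 2.5]})

# iterrows is a generator of (index label, row as Series) pairs
index, row = next(frame.iterrows())
print(index, type(row).__name__)

# The column is int64, but inside the row Series the value has become a float
column_dtype = str(frame["count"].dtype)
row_value_dtype = str(row["count"].dtype)
print(column_dtype, "->", row_value_dtype)
```

Building a Series for every row is part of why this method tends to be slow.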

In [33]:
%%timeit -r 3 -n 10
# Iterrows
total=[]
for index,row in df.iterrows():
    total.append(row['firstname']+' '+row['lastname'])

995 ms ± 52.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


This is typically one of the slowest ways to loop through a DataFrame. However, there is an alternative called itertuples! Similar to iterrows, it yields one object per row, but each row is a lightweight namedtuple rather than a Series, which makes it much faster.

In [34]:
%%timeit -r 10 -n 10
total = []
for row in df.itertuples():
    total.append(row.firstname+' '+row.lastname)

36.4 ms ± 2.08 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
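Here is a short sketch of what itertuples yields, again on a toy frame (people is an assumed name). By default each row is a namedtuple with the index as its first field; passing index=False and name=None gives plain tuples instead.

```python
import pandas as pd

people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# Default: a namedtuple per row, with the index stored in the `Index` field
first_row = next(people.itertuples())
print(first_row.Index, first_row.firstname, first_row.lastname)

# index=False drops the index field, name=None returns regular tuples
plain_rows = list(people.itertuples(index=False, name=None))
print(plain_rows)
```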


In Python, the size of an object is often correlated with its processing speed. There are two reasons for this. The first is that there is physically more memory and information to manipulate when dealing with larger objects. The second is that large objects in Python typically support more potential actions, and thus require much more C code to represent, along with more runtime error checking. To get a sense of this, let's consider a simple list.

Because Python lists can hold any data type, the C code behind them has to work with generic object references and check types at runtime rather than assume a fixed layout. In this manner, as objects gain complexity, they often (but not always) become slower to work with.
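As a rough way to check sizes yourself, sys.getsizeof reports the memory taken by the container object. The exact byte counts depend on your Python version and platform, so treat the printed numbers as illustrative only.

```python
import sys

as_list = [1, 2, 3]
as_dict = {0: 1, 1: 2, 2: 3}

# getsizeof measures the container itself, not the integers it refers to
list_size = sys.getsizeof(as_list)
dict_size = sys.getsizeof(as_dict)
print("list:", list_size, "bytes")
print("dict:", dict_size, "bytes")
```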

In [36]:
%%timeit -r 100 -n 1000
a=[1,2,3] #size of a is 120 bytes and code is 3458 lines long
for i in a:
    pass

172 ns ± 44.9 ns per loop (mean ± std. dev. of 100 runs, 1,000 loops each)


In [37]:
%%timeit -r 100 -n 1000
a={0: 1, 1: 2, 2: 3} #size of a is 232 bytes and code is 5813 lines long
for i in a:
    pass

285 ns ± 60.1 ns per loop (mean ± std. dev. of 100 runs, 1,000 loops each)


With this in mind, let's return to iterrows and itertuples

In [38]:
import sys

print(type(df.iterrows()), sys.getsizeof(df.iterrows()), "bytes")
for index, row in df.iterrows():
    print(type(row),sys.getsizeof(row), "bytes")
    print(type(index),sys.getsizeof(index), "bytes")
    break

<class 'generator'> 112 bytes
<class 'pandas.core.series.Series'> 2140 bytes
<class 'int'> 24 bytes


In [39]:
print(type(df.itertuples()),sys.getsizeof(df.itertuples()),"bytes")

for row in df.itertuples():
    print(type(row), (sys.getsizeof(row)), "bytes")
    break


<class 'map'> 48 bytes
<class 'pandas.core.frame.Pandas'> 152 bytes


Great! Now we can move on to an even smaller syntax: list comprehensions!

In [40]:
a=[['john','smith'],['jane','doe']]
total=[x[0]+" "+x[1] for x in a]
print(total)

['john smith', 'jane doe']


In [41]:
%%timeit -r 10 -n 100
# List comprehension
total=[x+' '+y for x,y in zip(df['firstname'], df['lastname'])]


7.37 ms ± 256 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


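One of the largest factors in this speedup is not having to call .append() on every loop. Function calls have overhead, and over enough iterations that overhead becomes significant. It also helps that zip walks the two columns directly instead of looking up each value through the DataFrame.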
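If you want to see the .append() overhead in isolation, here is a sketch that times both styles on a plain Python list, with no pandas involved. The data and function names are made up for illustration, and the timings will vary from machine to machine.

```python
import timeit

# Toy data: 2,000 [first, last] pairs
pairs = [["john", "smith"], ["jane", "doe"]] * 1000

def with_append():
    total = []
    for first, last in pairs:
        total.append(first + " " + last)  # one method call per element
    return total

def with_comprehension():
    # Same result, but the list is built without an explicit .append() call
    return [first + " " + last for first, last in pairs]

append_time = min(timeit.repeat(with_append, repeat=3, number=100))
comprehension_time = min(timeit.repeat(with_comprehension, repeat=3, number=100))
print("append loop:  ", append_time, "seconds per 100 calls")
print("comprehension:", comprehension_time, "seconds per 100 calls")
```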

Continuing on, we reach our final speedup: ditching the loop. In the previous examples, we iterated over the DataFrame sequentially, but we can do much better than that by using something called vectorization. This is the process of taking an operation that was previously defined on scalars (such as adding 1 and 2, or joining "john" and "smith") and describing it as a vector/matrix operation. While there are a variety of benefits to this (matrix math can often reduce the number of required computations), the main one for our purposes is that the result of df.iloc[1, 3] + df.iloc[1, 5] does not depend on df.iloc[0, 3] + df.iloc[0, 5]. Because every row is independent, pandas can hand the entire column to optimized compiled code in a single call instead of stepping through the Python interpreter row by row, and for numeric data the hardware can often process several values at once.
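As a small sketch of the idea, here is the same name-joining task written as a loop and as whole-column operations on a toy frame (people is an assumed name). All three versions give the same list.

```python
import pandas as pd

people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# Loop version: one Python-level concatenation per row
looped = [f + " " + l for f, l in zip(people["firstname"], people["lastname"])]

# Vectorized version: a single expression over whole columns
vectorized = (people["firstname"] + " " + people["lastname"]).tolist()

# The string accessor offers an equivalent column-wise join
with_str_cat = people["firstname"].str.cat(people["lastname"], sep=" ").tolist()

print(looped)
print(vectorized == looped, with_str_cat == looped)
```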

In [71]:
%%timeit -r 10 -n 100
#total=(df['firstname']+' '+df['lastname'])
total=(df['firstname']+' '+df['lastname']).tolist()

2.4 ms ± 263 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [46]:
%%timeit -r 10 -n 100
total=(df['firstname'].to_numpy()+ ' ' + df['lastname'].to_numpy()).tolist()

4.14 ms ± 254 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


The next is a built-in method called apply. We will talk more about this in a later lecture; for now, all you need to know is that it maps a function over some part of the DataFrame!

In [64]:
%%timeit -r 3 -n 10
# Apply

df.apply(lambda row: row['firstname'] +' '+ row['lastname'], axis=1).to_list()

184 ms ± 2.27 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
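To see what apply is doing with axis=1, here is a sketch on a toy frame using a named function in place of the lambda. The names people and full_name are assumptions for illustration. With axis=1, the function receives one row at a time as a Series, which is why the timing lands closer to the row loops than to the vectorized version.

```python
import pandas as pd

people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

def full_name(row):
    # `row` is a Series holding one row's values, keyed by column name
    return row["firstname"] + " " + row["lastname"]

# axis=1 means "call the function once per row"
names = people.apply(full_name, axis=1).tolist()
print(names)
```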


As a practice problem, try to find the mean age of the dataset! Hint: not all of the templates above will be applicable.