Now that we have looked at accessing DataFrames, let's look at using them!

In [23]:
import pandas as pd
import numpy as np
import time 

df = pd.read_csv('congress-terms.csv', on_bad_lines='skip')


Now that we have our DataFrame loaded in, let's take a look at how we can actually use the data. In this example, we want to loop through the dataframe and make a list of strings from the first and last names of each representative.

Remember that there are three common ways to loop through a dataset: index-based loops, iterators, and list comprehensions. Let's start with the first one, since iloc and loc both fall into that category.

In [24]:
a=[['john','smith'],['jane','doe']]
total=[]
for i in range(len(a)):
    total.append(a[i][0]+' '+a[i][1])
print(total)

['john smith', 'jane doe']
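Before timing .iloc and .loc on the real data, here is a small sketch of how the two differ. The toy frame `people` and its row labels 10 and 20 are made up for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with row labels that are deliberately not 0 and 1.
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]},
    index=[10, 20],
)

# .iloc is positional: row 0 is the first row, whatever its label is.
by_position = []
for i in range(len(people)):
    by_position.append(people.iloc[i, 0] + " " + people.iloc[i, 1])

# .loc is label based: here we must use the labels 10 and 20.
by_label = []
for label in people.index:
    by_label.append(people.loc[label, "firstname"] + " " + people.loc[label, "lastname"])

print(by_position)
print(by_label)
```

Looping over `range(len(df))` with .loc, as in the cells below, relies on the dataframe having the default 0, 1, 2, ... index.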


In [25]:
%%timeit -r 3 -n 10
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df.iloc[index, 3] + ' ' + df.iloc[index, 5])

625 ms ± 8.91 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [26]:
%%timeit -r 3 -n 10
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df.loc[index, 'firstname'] + ' ' + df.loc[index, 'lastname'])

266 ms ± 2.44 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


We can also use iat and at, which are like iloc and loc respectively, but are meant to access only single values (aka scalars) from the dataframe.

In [27]:
# .iat only accepts single integer positions, so a slice like this raises an error:
#df.iat[0:5,12]
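To make the scalar-only restriction concrete, here is a minimal sketch on a made-up frame (the `people` data and the `age` values are assumptions for illustration, not the congress data).

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"], "age": [41, 37]}
)

# .iat takes integer positions, .at takes labels; both return one scalar.
first = people.iat[0, 0]
age = people.at[1, "age"]

# .iloc (like .loc) also accepts slices and then returns a Series.
ages = people.iloc[0:2, 2]

# Asking .iat for a slice is rejected.
try:
    people.iat[0:2, 2]
    slice_allowed = True
except (ValueError, TypeError, KeyError, IndexError):
    slice_allowed = False

print(first, age, list(ages), slice_allowed)
```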

In [28]:
%%timeit -r 3 -n 10
# For loop with .iat
total = []
for index in range(len(df)):
    total.append(df.iat[index, 3] + ' ' + df.iat[index, 5])

487 ms ± 16 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [29]:
%%timeit -r 3 -n 10
# For loop with .at
total = []
for index in range(len(df)):
    total.append(df.at[index, 'firstname'] + ' ' + df.at[index, 'lastname'])

122 ms ± 1.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


The next way we can loop through a list is via built-in iterators. This way is often more readable and can even be faster than indices if you need to access the value multiple times per loop. However, it's important to note that, unlike the last examples, you can't alter the data you're iterating over. Let's take a look.

In [30]:
a=[['john','smith'],['jane','doe']]
total=[]
for i in a:
    total.append(i[0]+' '+i[1])
print(total)

['john smith', 'jane doe']


In this vein, Pandas has a built-in method called iterrows, which yields each row as an (index, Series) pair that we can iterate through.

In [31]:
%%timeit -r 3 -n 10
# Iterrows
total=[]
for index,row in df.iterrows():
    total.append(row['firstname']+' '+row['lastname'])

617 ms ± 20.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
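The earlier point about not being able to alter the data while iterating can be sketched with iterrows. This is a toy example on made-up data (the `people` frame is an assumption): with mixed column types, each `row` is a copy, so writing to it leaves the dataframe untouched.

```python
import pandas as pd

# Hypothetical toy data with mixed column types (strings and integers).
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"], "age": [41, 37]}
)

# Try to upper-case every first name through the row object.
for index, row in people.iterrows():
    row["firstname"] = row["firstname"].upper()  # changes the copy only

after_iterrows = people["firstname"].tolist()

# Writing through the dataframe itself does change the data.
for index in people.index:
    people.at[index, "firstname"] = people.at[index, "firstname"].upper()

after_at = people["firstname"].tolist()

print(after_iterrows)
print(after_at)
```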


This is typically one of the slowest ways to loop through a dataframe. However, there is an alternative called itertuples! Similar to iterrows, it gives us the rows one at a time, but as named tuples instead of Series, which is much faster.

In [32]:
%%timeit -r 10 -n 10
total = []
for row in df.itertuples():
    total.append(row.firstname+' '+row.lastname)

22.6 ms ± 624 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
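Here is a small sketch of what itertuples hands back, again on a made-up `people` frame. The `index` and `name` arguments are real itertuples parameters; the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# By default each row is a namedtuple whose first field is the index.
first_row = next(people.itertuples())
print(first_row)

# index=False drops the index and name=None yields plain tuples,
# which we can unpack directly in the loop.
total = []
for first, last in people.itertuples(index=False, name=None):
    total.append(first + " " + last)
print(total)
```

Attribute access such as `row.firstname` only works when the column name is a valid Python identifier, so the plain-tuple form can be handy for awkward column names.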


But why is this method over 20x faster? It has to do with which objects are actually in play: iterrows builds a full Pandas Series for every row, while itertuples builds a lightweight named tuple. In Python, the size of an object is often correlated with how long it takes to process, for two reasons. The first is that there is physically more memory and information to manipulate when dealing with larger objects. The second is that large objects in Python typically support more operations and thus require much more C code to represent. To get a sense of this, let's look at a simple list.

In [33]:
a=[1,"a",.111111]

Because Python lists can hold any data type, the C code that instantiates one must be prepared to deal with allocating memory not only by size but also by type. In this manner, as objects gain more complexity, they often (but not always) become slower to work with.

In [34]:
%%timeit -r 100 -n 1000
a=[1,2,3] #size of a is 120 bytes
for i in a:
    pass

96.7 ns ± 14.9 ns per loop (mean ± std. dev. of 100 runs, 1,000 loops each)


In [35]:
%%timeit -r 100 -n 1000
a={0: 1, 1: 2, 2: 3} #size of a is 232 bytes
for i in a:
    pass

164 ns ± 32.5 ns per loop (mean ± std. dev. of 100 runs, 1,000 loops each)


In [36]:
import sys  # needed for sys.getsizeof

print(type(df.iterrows()), sys.getsizeof(df.iterrows()), "bytes")
for index, row in df.iterrows():
    print(type(row),sys.getsizeof(row), "bytes")
    print(type(index),sys.getsizeof(index), "bytes")
    break

<class 'generator'> 112 bytes
<class 'pandas.core.series.Series'> 2140 bytes
<class 'int'> 24 bytes


In [37]:
print(type(df.itertuples()),sys.getsizeof(df.itertuples()),"bytes")

for row in df.itertuples():
    print(type(row), (sys.getsizeof(row)), "bytes")
    break


<class 'map'> 48 bytes
<class 'pandas.core.frame.Pandas'> 152 bytes


Great! Now we can move on to an even more compact syntax: the list comprehension!

In [41]:
a=[['john','smith'],['jane','doe']]
total=[x[0]+" "+x[1] for x in a]
print(total)

['john smith', 'jane doe']


We can use the same syntax on the dataframe by zipping the two name columns together. (Pandas also has a built-in method called apply; we will talk more about it in a later lecture, but for now all you need to know is that it maps a function over some part of the dataframe!)

In [39]:
%%timeit -r 10 -n 100
# List comprehension
total=[x+' '+y for x,y in zip(df['firstname'], df['lastname'])]


4.66 ms ± 404 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
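To see what zip is doing in that comprehension, here is a minimal sketch on a made-up `people` frame (the data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# zip pairs up the two columns element by element.
pairs = list(zip(people["firstname"], people["lastname"]))
print(pairs)

# The comprehension then unpacks each pair and joins it with a space.
total = [first + " " + last for first, last in pairs]
print(total)
```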


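Vectorization! In the previous one we looped over the two columns one element at a time in Python. With vectorization, we apply the operation to the whole columns at once and let Pandas do the looping for us.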

In [None]:
%%timeit -r 10 -n 100
total=(df['firstname']+' '+df['lastname']).tolist()

1.09 ms ± 122 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [None]:
%%timeit -r 10 -n 100
total=(df['firstname'].to_numpy()+" "+ df['lastname'].to_numpy()).tolist()

1.02 ms ± 101 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
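As a sanity check that the vectorized versions build the same strings as the loops, here is a small sketch on a made-up `people` frame. `Series.str.cat` is shown as one more vectorized option; it is not used elsewhere in this notebook, and the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame(
    {"firstname": ["john", "jane"], "lastname": ["smith", "doe"]}
)

# Column + string + column: pandas applies the + to every row at once.
joined_plus = (people["firstname"] + " " + people["lastname"]).tolist()

# str.cat does the same join with an explicit separator.
joined_cat = people["firstname"].str.cat(people["lastname"], sep=" ").tolist()

# The same idea on the underlying NumPy arrays (object dtype holds Python strings).
first = people["firstname"].to_numpy(dtype=object)
last = people["lastname"].to_numpy(dtype=object)
joined_numpy = (first + " " + last).tolist()

print(joined_plus)
```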


Finally, here is apply used row by row to collect every age, which we then average.

In [None]:
%%timeit -r 10 -n 100
# Apply
total=0
for i in df.apply(lambda row: row['age'], axis=1).to_list():
    total+=i
total/len(df)
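For comparison, the same average can be computed without a row-wise apply. This is a sketch on made-up data (the `people` frame and its ages are assumptions), showing that the vectorized `mean` gives the same answer as the apply-and-loop version.

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame(
    {"firstname": ["john", "jane", "mary"], "age": [40, 38, 51]}
)

# Row-wise apply: the lambda is called once per row, then we average by hand.
ages = people.apply(lambda row: row["age"], axis=1).to_list()
total = 0
for age in ages:
    total += age
mean_apply = total / len(people)

# Vectorized: one call on the whole column.
mean_vectorized = people["age"].mean()

print(mean_apply, mean_vectorized)
```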