# cuDF vs Pandas speed comparison 

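This notebook times three common operations (reading a CSV file, a group-by aggregation, and a linear regression) in Pandas and in cuDF, using about 6.2 million rows of popular baby names.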

*The full baby name data, broken down by state, is provided by the SSA. The source data is available [here](https://www.ssa.gov/oact/babynames/limits.html).*

---
### Download data

In [34]:
%%capture
%%bash
wget "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
unzip namesbystate.zip -d namesbystate && rm namesbystate.zip
cat namesbystate/*.TXT > namesbystate.csv
rm -r namesbystate
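
The concatenation step in the shell cell above can also be sketched in pure Python, which may help where `cat` is unavailable. This is a minimal sketch rather than the notebook's own method: the function name `concat_state_files` is hypothetical, and it assumes the archive has already been extracted into a directory of `.TXT` files. The output file is opened in write mode, so rerunning the function does not append duplicate rows.

```python
from pathlib import Path


def concat_state_files(src_dir, out_path):
    """Concatenate every .TXT file in src_dir into one CSV; return the line count."""
    lines_written = 0
    # Mode "w" truncates any previous output, so reruns never duplicate rows.
    with open(out_path, "w", encoding="utf-8") as out:
        for txt in sorted(Path(src_dir).glob("*.TXT")):
            with open(txt, encoding="utf-8") as src:
                for line in src:
                    # Guard against a file whose last line lacks a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    lines_written += 1
    return lines_written
```

A call such as `concat_state_files("namesbystate", "namesbystate.csv")` would produce the file that the later cells read.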

### Import libraries

In [35]:
import time
import cudf as cd
import pandas as pd
import cupy as cp
cd.DataFrame({'a': [0]}); # initialize cudf 

----
### Read data using Pandas

In [36]:
startTime = time.time()
pdf = pd.read_csv('namesbystate.csv', names=["state", "sex", "year", "name", "rank"])
time.time() - startTime

2.5403804779052734

### Read data using cuDF

In [38]:
startTime = time.time()
cdf = cd.read_csv('namesbystate.csv', names=["state", "sex", "year", "name", "rank"])
time.time() - startTime

0.21918463706970215
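
Single `time.time()` measurements like the two above can be skewed by one-off costs such as disk caching or library initialization. One hedged alternative is to repeat each call and keep the best wall-clock time, measured with `time.perf_counter()`. The helper name `best_of` and the default repeat count are assumptions for illustration and are not part of the original notebook.

```python
import time


def best_of(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result
```

For example, `seconds, pdf = best_of(pd.read_csv, 'namesbystate.csv', names=["state", "sex", "year", "name", "rank"])` would time the Pandas read, and the same call with `cd.read_csv` would time the cuDF read.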

---
### Aggregate data with Pandas

In [39]:
startTime = time.time()
print(pdf.groupby(["year", "name", "sex"]).count())
time.time() - startTime

                  state  rank
year name    sex             
1910 Aaron   M       13    13
     Abbie   F        4     4
     Abe     M        4     4
     Abner   M        2     2
     Abraham M        6     6
...                 ...   ...
2020 Zymere  M        1     1
     Zymir   M        8     8
     Zyon    M       14    14
     Zyra    F        4     4
     Zyrah   F        1     1

[642570 rows x 2 columns]


1.847792625427246

### Aggregate data with cuDF

In [40]:
startTime = time.time()
print(cdf.groupby(["year", "name", "sex"]).count())
time.time() - startTime

                   state  rank
year name     sex             
1920 Fern     F       32    32
1992 Jamila   F       18    18
1999 Shannan  F        3     3
1994 Tyree    M       21    21
1945 Signe    F        1     1
...                  ...   ...
1999 Destin   M       19    19
2012 Precious F       11    11
1990 Tyresha  F        1     1
     Shilpa   F        1     1
1923 Lon      M        6     6

[642570 rows x 2 columns]


0.12881231307983398
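
The two outputs above contain the same 642,570 groups in different orders: Pandas sorts group keys by default, whereas cuDF's `groupby` leaves them unsorted unless `sort=True` is passed. A minimal Pandas-only sketch, using a made-up four-row frame rather than the real data, shows that sorting the index afterwards recovers the ordered result, which is one way to make the two outputs directly comparable.

```python
import pandas as pd

# Toy stand-in for the baby-name data (values are illustrative only).
toy = pd.DataFrame({
    "state": ["CA", "NY", "CA", "TX"],
    "year": [2020, 2020, 1910, 2020],
    "name": ["Zyra", "Zyra", "Aaron", "Abe"],
})

keys = ["year", "name"]
sorted_counts = toy.groupby(keys).count()                # pandas default: sort=True
unsorted_counts = toy.groupby(keys, sort=False).count()  # keys in first-seen order

# Sorting the index afterwards puts the groups back in key order.
resorted = unsorted_counts.sort_index()
print(resorted)
```

On the cuDF side, the analogous options would be `cdf.groupby([...], sort=True)` or calling `.sort_index()` on the result, at some extra cost in run time.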

---
### Simple linear model with Pandas and scikit-learn

In [46]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
startTime = time.time()
X = pdf[['year']]
y = pdf['name'].str.len()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
lr = LinearRegression(fit_intercept=True)  # `normalize` was removed in scikit-learn 1.2; False was its default
model = lr.fit(x_train, y_train)
lr.predict(x_test)
time.time() - startTime

2.6151363849639893

### Simple linear model with cuDF and cuML

In [48]:
from cuml import train_test_split
from cuml import LinearRegression
startTime = time.time()
X = cdf[['year']].astype('float32')
y = cdf['name'].str.len()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
lr = LinearRegression(fit_intercept=True, normalize=False, algorithm='svd')
model = lr.fit(x_train, y_train)
lr.predict(x_test)
time.time() - startTime

0.13907885551452637

### Summary

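As of December 2021, the dataset contained `6,215,834` records. Grouping by year, name, and sex (that is, aggregating across states) reduced these to `642,570` rows. The timings below are the single-run values printed by the cells above, rounded to two decimals.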

Operation|Input rows|Output rows|Pandas|cuDF|Speedup
---|---:|---:|---:|---:|---:
Reading|6,215,834|6,215,834|2.54s|0.22s|**11.6x**
Aggregating|6,215,834|642,570|1.85s|0.13s|**14.3x**
Regression|6,215,834|6,215,834|2.62s|0.14s|**18.8x**
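
The speedup for each operation is the Pandas time divided by the cuDF time. The short calculation below recomputes it from the unrounded timings printed by the cells above; the dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
# Wall-clock seconds copied from the cell outputs above: (pandas, cuDF).
timings = {
    "Reading": (2.5403804779052734, 0.21918463706970215),
    "Aggregating": (1.847792625427246, 0.12881231307983398),
    "Regression": (2.6151363849639893, 0.13907885551452637),
}

# Speedup = CPU (Pandas) time / GPU (cuDF) time.
speedups = {step: cpu / gpu for step, (cpu, gpu) in timings.items()}

for step, factor in speedups.items():
    print(f"{step}: {factor:.1f}x")
# Reading: 11.6x
# Aggregating: 14.3x
# Regression: 18.8x
```

Because each figure comes from a single run, the ratios are best read as rough indications rather than precise benchmarks.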