# Objective: Understand interconversion of data structures and their caveats

In this notebook, you will work through a simple example of a pandas DataFrame and see how to convert subsets of it to arrays, lists, and sets. You will learn firsthand that each structure has different strengths and purposes.

In [1]:
import pandas as pd
import numpy as np
import requests as r
from io import StringIO

### First define a function that will allow us to pull data from Tier 2 storage (S3)

In [2]:
def load_s3_csv(url: str) -> pd.DataFrame:
    """Utility to load S3 csvs into pandas DataFrames.

    Args:
        url (str): S3 url (https)

    Returns:
        pd.DataFrame: containing csv at provided url.
    """
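    # Fetch with requests to get around a pandas SSL error when reading the URL directly
    res = r.get(url)
    # Raise explicitly rather than assert, since asserts are skipped under `python -O`
    if res.status_code != 200:
        raise RuntimeError(f'Failed to read {url}')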
    csv_str = res.text
    df = pd.read_csv(StringIO(csv_str))
    return df
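
The parsing step inside `load_s3_csv` can be tried without network access. The sketch below is only an illustration: the CSV text is made up and stands in for the body of an HTTP response.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for `res.text`; not the real barley data.
csv_str = "trial,location,year\nA,Arlington,2016\nB,Crookston,2015\n"

# StringIO wraps the string in a file-like object that read_csv accepts.
demo_df = pd.read_csv(StringIO(csv_str))
print(demo_df.shape)
```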


### Now grab some barley field trial data

In [3]:
trial_data_url = 'https://s3.msi.umn.edu/gems-pyenvirotyping-test-files/barley_trial_data.csv'

In [4]:
trial_data = load_s3_csv(trial_data_url)
trial_metadata = pd.DataFrame({
    'trial': ['2015_SPY4_S2TP_CR15', 'S2_MET_AWI16', 'S2_MET_CRM16'],
    'location': ['Crookston', 'Arlington', 'Crookston'],
    'year': [2015, 2016, 2016],
    'environment': ['CRM15', 'AWI16', 'CRM16'],
    'type': ['spy', 'spy', 'spy'],
    'population': ['s2tp', 's2met', 's2met'],
    'project1': ['Breeding', 'Breeding', 'Breeding'],
    'project2': ['S2MET', 'S2MET', 'S2MET'],
    'project3': [None, None, None],
    'planting_date': [20150416, 20160425, 20160504],
    'harvest_date': [20150831, 20160831, 20160831],
    't3_trial_name': ['S2TP_2015_Crookston', 'S2MET_2016_Arlington', 'S2MET_2016_Crookston'],
    'plot_dim': [None, 4.64515, 1.48645],
    'lat': [47.818536, 43.32724, 47.818536],
    'lon': [-96.613366, -89.334503, -96.613366]
})
trial_metadata = trial_metadata.loc[trial_metadata.trial.isin(trial_data.trial.unique())]

### It's always good to take a quick peek at your data contents and dimensions

In [5]:
trial_metadata

,trial,location,year,environment,type,population,project1,project2,project3,planting_date,harvest_date,t3_trial_name,plot_dim,lat,lon
0,2015_SPY4_S2TP_CR15,Crookston,2015,CRM15,spy,s2tp,Breeding,S2MET,,20150416,20150831,S2TP_2015_Crookston,,47.818536,-96.613366
1,S2_MET_AWI16,Arlington,2016,AWI16,spy,s2met,Breeding,S2MET,,20160425,20160831,S2MET_2016_Arlington,4.64515,43.32724,-89.334503
2,S2_MET_CRM16,Crookston,2016,CRM16,spy,s2met,Breeding,S2MET,,20160504,20160831,S2MET_2016_Crookston,1.48645,47.818536,-96.613366


In [6]:
type(trial_data)

pandas.core.frame.DataFrame

In [7]:
trial_data

,trial,environment,location,year,trait,line_name,value,std_error
0,S2_MET_AWI16,AWI16,Arlington,2016,GrainYield,06AB-08,4593.467000,603.427979
1,S2_MET_AWI16,AWI16,Arlington,2016,GrainYield,06AB-32,5103.679000,603.427979
2,S2_MET_AWI16,AWI16,Arlington,2016,GrainYield,06MT-93,4582.339608,593.525373
3,S2_MET_AWI16,AWI16,Arlington,2016,GrainYield,06N2-02,5403.786683,586.067733
4,S2_MET_AWI16,AWI16,Arlington,2016,GrainYield,06N2-14,6347.156608,593.525373
...,...,...,...,...,...,...,...,...
636,S2_MET_CRM16,CRM16,Crookston,2016,GrainYield,2MS14_3342-013,9290.591678,462.817157
637,S2_MET_CRM16,CRM16,Crookston,2016,GrainYield,2MS14_3342-018,8873.490531,462.817157
638,S2_MET_CRM16,CRM16,Crookston,2016,GrainYield,2MS14_3342-022,9613.508695,462.817157
639,S2_MET_CRM16,CRM16,Crookston,2016,GrainYield,2MS14_3345-013,8516.936325,462.817157


### Now suppose I want to look at the unique traits that were included

In [8]:
trial_data['trait'].unique()

array(['GrainYield'], dtype=object)

### And how about locations?...

In [9]:
%%time
trial_data['location'].unique()

CPU times: user 273 µs, sys: 88 µs, total: 361 µs
Wall time: 332 µs


array(['Arlington', 'Crookston'], dtype=object)

### Alternatively, we could just convert to a set. Is that faster?

In [10]:
%%time
set(trial_data['location'])

CPU times: user 70 µs, sys: 22 µs, total: 92 µs
Wall time: 96.3 µs


{'Arlington', 'Crookston'}

### Is it generally faster to convert to a set? What about study years and line names (varieties)?

In [11]:
%%time
trial_data['year'].unique()

CPU times: user 218 µs, sys: 71 µs, total: 289 µs
Wall time: 291 µs


array([2016, 2015])

In [12]:
%%time
set(trial_data['year'])

CPU times: user 77 µs, sys: 25 µs, total: 102 µs
Wall time: 106 µs


{2015, 2016}
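
Note that the two results above list the years in a different order. `unique()` returns values in order of first appearance, while a set has no defined order and cannot be indexed by position. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values, chosen so the first value seen is not the smallest.
years = pd.Series([2016, 2016, 2015, 2016, 2015])

# Series.unique() returns a NumPy array in order of first appearance.
unique_years = years.unique()

# set() keeps the distinct values but makes no promise about their order.
year_set = set(years)

# sorted() turns either result into a list with a deterministic order.
print(unique_years.tolist())
print(sorted(year_set))
```

If downstream code depends on order, sorting explicitly is safer than relying on how a set happens to print.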

In [13]:
%%time
len(trial_data['line_name'].unique())

CPU times: user 0 ns, sys: 386 µs, total: 386 µs
Wall time: 358 µs


232

In [14]:
%%time
len(set(trial_data['line_name']))

CPU times: user 80 µs, sys: 26 µs, total: 106 µs
Wall time: 109 µs


232
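
A single `%%time` run at the microsecond scale is noisy, so the differences above should be read with caution. One way to get steadier numbers is to repeat each call many times with the standard-library `timeit` module. The sketch below uses a made-up column of names, and `best_time` is a hypothetical helper, not part of the original notebook.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for a small column of line names (640 rows, 232 distinct).
names = pd.Series([f"line_{i % 232}" for i in range(640)])


def best_time(func, number=200, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number


t_unique = best_time(lambda: names.unique())
t_set = best_time(lambda: set(names))

print(f"unique(): {t_unique * 1e6:.1f} µs per call")
print(f"set():    {t_set * 1e6:.1f} µs per call")
```

Taking the minimum over repeats reduces the influence of background activity on the machine; the absolute numbers will still vary between systems.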

### Not so fast: those were tiny datasets. What if we had 10M entries?

In [15]:
import random
big_df = pd.DataFrame([int(100*random.random()) for _ in range(10000000)], columns=['Observations'])
big_df.head()

,Observations
0,44
1,50
2,28
3,25
4,33
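
Building the frame with a Python list comprehension is itself slow at this size. As an alternative sketch, the same kind of column can be generated in one vectorized call. This assumes NumPy's `Generator` API in place of `random.random()`, so the values will differ from those shown above, and it uses a smaller row count to stay quick.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded NumPy Generator replaces random.random().
rng = np.random.default_rng(seed=0)
n_rows = 1_000_000  # smaller than 10M to keep the sketch quick

# integers(0, 100) draws whole numbers from 0 to 99 (upper bound is exclusive).
big_df_np = pd.DataFrame({"Observations": rng.integers(0, 100, size=n_rows)})

print(big_df_np.shape)
print(big_df_np["Observations"].nunique())
```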


In [16]:
%%time
len(big_df['Observations'].unique())

CPU times: user 35.6 ms, sys: 3.73 ms, total: 39.3 ms
Wall time: 38.4 ms


100

In [17]:
%%time
len(set(big_df['Observations']))

CPU times: user 420 ms, sys: 971 µs, total: 421 ms
Wall time: 422 ms


100
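
At 10M rows the result flips: `unique()` is roughly ten times faster than `set()`. A likely explanation is that `unique()` works on the column's underlying array in compiled code, while `set()` has to pull every value out as a Python object first. On tiny inputs that per-element cost is negligible, so fixed overhead dominates instead.

Speed is only one consideration. The sketch below, using a made-up column, shows the conversions named in the objective and what each structure is suited for.

```python
import pandas as pd

# Made-up column standing in for a subset of a DataFrame.
obs = pd.Series([44, 50, 28, 50, 44], name="Observations")

as_array = obs.to_numpy()  # NumPy array: vectorized numeric operations
as_list = obs.tolist()     # Python list: ordered, keeps duplicates
as_set = set(as_list)      # Python set: distinct values, fast membership tests

print(as_array.mean())     # arithmetic over the whole array at once
print(as_list.count(50))   # duplicates are still present in the list
print(28 in as_set)        # membership test on the set

# Converting back: sort the set first so the resulting order is deterministic.
back = pd.Series(sorted(as_set), name="Observations")
print(back.tolist())
```

The set drops both duplicates and order, so converting to a set and back does not reproduce the original column.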