# Objective: Understand interconversion of data structures and their caveats

In this notebook, you will work with a simple pandas DataFrame and convert subsets of it to arrays, lists, and sets. You will see firsthand that each structure has different strengths and purposes.

In [None]:
import pandas as pd
import numpy as np
import requests as r
from io import StringIO

### First define a function that will allow us to pull data from Tier 2 storage (S3)

In [None]:
def load_s3_csv(url: str) -> pd.DataFrame:
    """Utility to load S3 csvs into pandas DataFrames.

    Args:
        url (str): S3 url (https)

    Returns:
        pd.DataFrame: containing csv at provided url.
    """
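    # Fetch with requests to work around the pandas SSL error when reading the URL directly.
    res = r.get(url)
    # Raise explicitly rather than assert, because asserts are skipped when Python runs with -O.
    if res.status_code != 200:
        raise RuntimeError(f'Failed to read {url} (HTTP status {res.status_code})')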
    csv_str = res.text
    df = pd.read_csv(StringIO(csv_str))
    return df
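
The parsing step inside `load_s3_csv` can be tried without network access. The sketch below feeds a small, made-up CSV string through `StringIO` into `pd.read_csv`; the column names and values are illustrative assumptions, not taken from the barley file.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for the body of an HTTP response (res.text).
csv_str = "trial,trait,value\nT1,GrainYield,5.2\nT1,HeadingDate,180\nT2,GrainYield,4.8\n"

# StringIO wraps the string in a file-like object that read_csv can consume.
demo_df = pd.read_csv(StringIO(csv_str))

print(demo_df.shape)
print(demo_df.dtypes)
```

Because `read_csv` only needs a file-like object, the same pattern should work for any text you already hold in memory.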


### Now grab some barley field trial data

In [None]:
trial_data_url = 'https://s3.msi.umn.edu/hpc4ag/barley_trial_data.csv'

In [None]:
trial_data = load_s3_csv(trial_data_url)
trial_metadata = pd.DataFrame({
    'trial': ['2015_SPY4_S2TP_CR15', 'S2_MET_AWI16', 'S2_MET_CRM16'],
    'location': ['Crookston', 'Arlington', 'Crookston'],
    'year': [2015, 2016, 2016],
    'environment': ['CRM15', 'AWI16', 'CRM16'],
    'type': ['spy', 'spy', 'spy'],
    'population': ['s2tp', 's2met', 's2met'],
    'project1': ['Breeding', 'Breeding', 'Breeding'],
    'project2': ['S2MET', 'S2MET', 'S2MET'],
    'project3': [None, None, None],
    'planting_date': [20150416, 20160425, 20160504],
    'harvest_date': [20150831, 20160831, 20160831],
    't3_trial_name': ['S2TP_2015_Crookston', 'S2MET_2016_Arlington', 'S2MET_2016_Crookston'],
    'plot_dim': [None, 4.64515, 1.48645],
    'lat': [47.818536, 43.32724, 47.818536],
    'lon': [-96.613366, -89.334503, -96.613366]
})
trial_metadata = trial_metadata.loc[trial_metadata.trial.isin(trial_data.trial.unique())]
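
The last line of the cell above keeps only the metadata rows whose trial also appears in the trial data. Here is a minimal sketch of that filter on two tiny, hypothetical frames (`data` and `metadata` are stand-ins, not the real tables):

```python
import pandas as pd

# Hypothetical stand-ins for trial_data and trial_metadata.
data = pd.DataFrame({"trial": ["A", "A", "C"], "value": [1.0, 2.0, 3.0]})
metadata = pd.DataFrame({"trial": ["A", "B", "C"], "location": ["X", "Y", "Z"]})

# unique() gives the distinct trial names present in the data...
present = data["trial"].unique()

# ...and isin() builds a boolean mask that keeps only matching metadata rows.
mask = metadata["trial"].isin(present)
filtered = metadata.loc[mask]

print(filtered)
```

Trial `B` has no rows in `data`, so its metadata row is dropped.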

### It's always good to take a quick peek at your data contents and dimensions

In [None]:
trial_metadata

In [None]:
type(trial_data)

In [None]:
trial_data

### Now suppose I want to look at the unique traits that were included

In [None]:
trial_data['trait'].unique()

### And how about locations?...

In [None]:
%%time
trial_data['location'].unique()

### Alternatively we could just convert to a set. Is that faster?

In [None]:
%%time
set(trial_data['location'])

### Is it generally faster to convert to a set? What about study years and line names (varieties)?

In [None]:
%%time
trial_data['year'].unique()

In [None]:
%%time
set(trial_data['year'])
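
Beyond speed, the two calls hand back different container types. The sketch below uses a made-up column of years to show how the same values look as a NumPy array, a list, and a set:

```python
import numpy as np
import pandas as pd

# Made-up years; the real column may hold different values.
years = pd.Series([2015, 2016, 2016, 2015, 2017])

as_array = years.unique()     # NumPy array, in order of first appearance
as_list = as_array.tolist()   # plain Python list, same order
as_set = set(years)           # Python set, no guaranteed order

print(type(as_array).__name__, as_array)
print(as_list)
print(sorted(as_set))         # sorted only to get a stable display
```

If you need to index or slice the result, the array or list is the better fit; if you need fast membership tests or set algebra (union, intersection), the set is.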

In [None]:
%%time
trial_data['line_name'].nunique()

In [None]:
%%time
len(set(trial_data['line_name']))
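
One caveat when comparing these two counts: they can disagree when the column has missing values. The sketch below uses a made-up float column (an assumption, the barley data may have no gaps) to show how `nunique()`, `unique()`, and `set()` each treat `NaN`:

```python
import numpy as np
import pandas as pd

# Made-up float column with two missing values.
s = pd.Series([1.0, np.nan, 2.0, np.nan, 1.0])

print(s.nunique())              # missing values are ignored by default
print(s.nunique(dropna=False))  # missing values counted as one distinct value
print(len(s.unique()))          # unique() keeps a single NaN
print(len(set(s)))              # each NaN ends up as a separate set element
```

The set is larger because `NaN` never compares equal to itself, so Python cannot tell that two `NaN` floats are duplicates.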

### Not so fast: those were tiny datasets. What if we had 10M entries?

In [None]:
import random
big_df = pd.DataFrame([int(100*random.random()) for _ in range(10000000)], columns=['Observations'])
big_df.head()

In [None]:
%%time
big_df['Observations'].nunique()

In [None]:
%%time
len(set(big_df['Observations']))
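
`%%time` only works inside IPython or Jupyter. As a rough sketch of the same comparison in plain Python, the block below times both approaches with `time.perf_counter` on a smaller series; the size of 200,000 rows and the seed are arbitrary choices, and the timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

# Smaller than the 10M rows above so that it runs quickly.
rng = np.random.default_rng(0)
obs = pd.Series(rng.integers(0, 100, size=200_000), name="Observations")

start = time.perf_counter()
n_pandas = obs.nunique()
t_pandas = time.perf_counter() - start

start = time.perf_counter()
n_set = len(set(obs))
t_set = time.perf_counter() - start

print(f"nunique: {n_pandas} distinct values in {t_pandas:.4f} s")
print(f"set:     {n_set} distinct values in {t_set:.4f} s")
```

A single timed run is noisy, so repeat the measurement a few times before drawing conclusions.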

### Note the subtleties even within data structures
Pandas allows the columns of a DataFrame to be [categorical](https://pandas.pydata.org/docs/user_guide/categorical.html), which can change our timings. Notice below that converting the `Observations` column to categorical lets the conversion to a set run in about half the time here, but does not affect the `nunique()` method!

In [None]:
big_df['Observations'] = big_df['Observations'].astype("category")

In [None]:
%%time
big_df['Observations'].nunique()

In [None]:
%%time
len(set(big_df['Observations']))
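
To see what the categorical conversion changes under the hood, the sketch below builds a tiny made-up series and inspects its categories and codes through the `.cat` accessor. It only illustrates the storage layout and is not meant to explain the timings above.

```python
import pandas as pd

# Made-up observations with repeated values.
s = pd.Series([30, 10, 30, 20, 10])
cat = s.astype("category")

print(cat.cat.categories.tolist())   # the distinct values, stored once
print(cat.cat.codes.tolist())        # one small integer code per row
print(cat.nunique(), len(set(cat)))  # both counts still agree
```

Each row now holds a code that points into the short list of categories, which is why categorical columns can be compact when values repeat often.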