# Pandas and List of Dictionaries
This notebook contains practice problems on tabular data in two formats: a list of dictionaries and a pandas `DataFrame`. The data is the 'barley' dataset from Vega's free library of sample datasets. Problems marked 'a' use the list-of-dictionaries format, and problems marked 'b' use the `DataFrame` format.

In [1]:
# DO NOT modify this cell!
import pandas as pd

barley_df = pd.read_csv('data/barley.csv')
barley_list = barley_df.to_dict('records')

barley_df.head()

Unnamed: 0,yield,variety,year,site
0,27.0,Manchuria,1931,University Farm
1,48.86667,Manchuria,1931,Waseca
2,27.43334,Manchuria,1931,Morris
3,39.93333,Manchuria,1931,Crookston
4,32.96667,Manchuria,1931,Grand Rapids
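
For reference, `to_dict('records')` produces one dictionary per row, keyed by column name. Below is a minimal sketch that uses the first three rows displayed above instead of the full CSV. The names `sample_df` and `sample_list` are made up for illustration, and the `Unnamed: 0` column is left out.

```python
import pandas as pd

# Hypothetical mini-table built from the first three rows displayed above
sample_df = pd.DataFrame({
    'yield': [27.0, 48.86667, 27.43334],
    'variety': ['Manchuria', 'Manchuria', 'Manchuria'],
    'year': [1931, 1931, 1931],
    'site': ['University Farm', 'Waseca', 'Morris'],
})

# One dictionary per row; the keys are the column names
sample_list = sample_df.to_dict('records')

first_row = sample_list[0]
print(first_row['site'])      # list-of-dictionaries access: row first, then key
print(sample_df['site'][0])   # DataFrame access: column first, then row label
```

The 'a' problems work with data shaped like `sample_list`, and the 'b' problems work with data shaped like `sample_df`.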


## Problem #1a
### Alex
Write a function called `unique_sites` that will take in the `barley` dataset as a list of dictionaries as a parameter and return a `List` of all the unique sites at which barley was gathered.

In [None]:
def unique_sites(data):
    unique = []  # a set would also work, since the test ignores order
    for row in data:
        if row['site'] not in unique:
            unique.append(row['site'])
    return unique
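
A set-based variant is sketched below as one possible alternative. The function name `unique_sites_with_set` and the sample rows are hypothetical. A set stores each site once, so no explicit membership check is needed, but the order in which sites first appear is not preserved.

```python
def unique_sites_with_set(data):
    # A set keeps a single copy of each site name
    seen = set()
    for row in data:
        seen.add(row['site'])
    # Convert back to a list because the problem asks for a list
    return list(seen)

# Hypothetical sample rows, for illustration only
sample_rows = [
    {'site': 'Waseca'},
    {'site': 'Morris'},
    {'site': 'Waseca'},
]

print(sorted(unique_sites_with_set(sample_rows)))
```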


In [None]:
# Testing cell, run to check code
assert set(unique_sites(barley_list)) == set(['University Farm', 'Waseca', 'Morris', 'Crookston', 'Grand Rapids', 'Duluth'])

## Problem #2a
### Alex
Write a function called `site_average` that will take in the `barley` dataset as a list of dictionaries and a `String` representing a particular site (e.g. Duluth) as parameters and return the average barley yield at that site. You can assume that the `site` parameter will be a valid site in the dataset.

In [None]:
def site_average(data, site):
    yield_count = 0
    yield_total = 0
    for row in data:
        if row['site'] == site:
            yield_count += 1
            yield_total += row['yield']
    return yield_total / yield_count

In [None]:
assert abs(site_average(barley_list, 'Morris') - 35.4000005) < 0.001
assert abs(site_average(barley_list, 'University Farm') - 32.666668) < 0.001
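
When a float is compared against an expected value, the difference should be checked in both directions. The sketch below uses the standard library's `math.isclose` for this. The helper name `approximately_equal` and the numbers passed to it are made up for illustration.

```python
import math

def approximately_equal(actual, expected, tolerance=0.001):
    # True only when actual is within `tolerance` of expected, in either direction
    return math.isclose(actual, expected, abs_tol=tolerance)

print(approximately_equal(35.4003, 35.4))  # within the tolerance
print(approximately_equal(10.0, 35.4))     # far below the expected value
print(10.0 - 35.4 < 0.001)                 # a one-sided check accepts it anyway
```

A one-sided comparison such as `actual - expected < 0.001` passes for any result that is too small, which is why the absolute difference is the safer check.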

## Problem #3a
### Alex
Write a function called `max_yield` that will take the `barley` dataset as a list of dictionaries as a parameter and return the greatest value in the `yield` column.

In [None]:
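def max_yield(data):
    # Named `highest` to avoid shadowing the built-in max();
    # yields are non-negative, so 0 is a safe starting value
    highest = 0
    for row in data:
        if row['yield'] > highest:
            highest = row['yield']
    return highest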

In [None]:
assert abs(max_yield(barley_list) - 65.7667) < 0.001

## Problem #4a
### Alex
Write a function called `max_yield_info` that will take the `barley` dataset as a list of dictionaries as a parameter and return a `tuple` of the variety and site at which the greatest yield of barley occurred.

In [None]:
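def max_yield_info(data):
    # Named `highest` to avoid shadowing the built-in max()
    highest = 0
    max_site = ''
    max_variety = ''
    for row in data:
        if row['yield'] > highest:
            # Remember where the new highest yield occurred
            max_site = row['site']
            max_variety = row['variety']
            highest = row['yield']
    return (max_variety, max_site)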

In [None]:
assert max_yield_info(barley_list) == ('No. 462', 'Waseca')

## Problem #5a
### Alex
Write a function called `max_yield_difference` that takes the `barley` dataset as a list of dictionaries and a `String` representing the variety of barley as parameters and returns the difference between the highest yield of that variety of barley and the lowest yield of that variety.

In [None]:
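def max_yield_difference(data, variety):
    # Start from infinities so the first matching row sets both bounds;
    # starting `lowest` at 0 would never update it for positive yields
    lowest = float('inf')
    highest = float('-inf')
    for row in data:
        if row['variety'] == variety:
            # Two separate checks: one row can be both the lowest and highest so far
            if row['yield'] < lowest:
                lowest = row['yield']
            if row['yield'] > highest:
                highest = row['yield']
    return highest - lowest

In [None]:
# Expected values are computed directly from the matching rows
for variety in ['Wisconsin No. 38', 'Manchuria']:
    yields = [row['yield'] for row in barley_list if row['variety'] == variety]
    assert abs(max_yield_difference(barley_list, variety) - (max(yields) - min(yields))) < 0.001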

## Problem #1b
### Alex
Write a function called `unique_sites_pandas` that will take in the barley dataset as a `DataFrame` as a parameter and return a List of all the unique sites at which barley was gathered.
(Remember - do not use loops in your solution)


In [None]:
def unique_sites_pandas(data):
    return list(data['site'].unique())

In [None]:
assert set(unique_sites_pandas(barley_df)) == set(['University Farm', 'Waseca', 'Morris', 'Crookston', 'Grand Rapids', 'Duluth'])

## Problem #2b
### Alex
Write a function called `site_average_pandas` that will take in the `barley` dataset as a `DataFrame` and a `String` representing a particular site (e.g. Duluth) as parameters and return the average barley yield at that site. You can assume that the `site` parameter will be a valid site in the dataset.

In [None]:
def site_average_pandas(data, site):
    is_site = data['site'] == site
    data = data[is_site]
    return data['yield'].mean()
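
The boolean mask above filters one site at a time. As an aside, `groupby` can compute the average for every site in one call. The sketch below uses a small made-up table; `sample_df` and its values are hypothetical and are not taken from the barley data.

```python
import pandas as pd

# Hypothetical data, chosen so the averages are easy to check by hand
sample_df = pd.DataFrame({
    'site': ['Waseca', 'Waseca', 'Morris', 'Morris'],
    'yield': [40.0, 50.0, 20.0, 30.0],
})

# Filtering approach: keep only the rows for one site, then average
waseca_mean = sample_df[sample_df['site'] == 'Waseca']['yield'].mean()

# Grouping approach: one average per site, indexed by site name
site_means = sample_df.groupby('site')['yield'].mean()

print(waseca_mean)
print(site_means['Morris'])
```

Both approaches avoid explicit loops. The filtering form fits this problem because only one site is requested.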

In [None]:
assert abs(site_average_pandas(barley_df, 'Morris') - 35.4000005) < 0.001
assert abs(site_average_pandas(barley_df, 'University Farm') - 32.666668) < 0.001

## Problem #3b
### Alex
Write a function called `max_yield_pandas` that will take the `barley` dataset as a `DataFrame` as a parameter and return the greatest value in the `yield` column.


In [None]:
def max_yield_pandas(data):
    return data['yield'].max()

In [None]:
assert abs(max_yield_pandas(barley_df) - 65.7667) < 0.001

## Problem #4b
### Alex
Write a function called `max_yield_info_pandas` that will take the `barley` dataset as a `DataFrame` as a parameter and return a `tuple` of the variety and site at which the greatest yield of barley occurred.

In [None]:
def max_yield_info_pandas(data):
    return tuple(data.loc[data['yield'].idxmax(), ['variety', 'site']])
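
The one-line solution above combines two steps: `idxmax` finds the index label of the largest yield, and `.loc` picks the requested columns from that row. The sketch below separates the steps on a small made-up table; `sample_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample_df = pd.DataFrame({
    'yield': [10.0, 30.0, 20.0],
    'variety': ['A', 'B', 'C'],
    'site': ['X', 'Y', 'Z'],
})

# Step 1: index label of the row holding the largest yield
top_label = sample_df['yield'].idxmax()

# Step 2: select that row and only the two columns of interest
top_row = sample_df.loc[top_label, ['variety', 'site']]

# Step 3: convert the resulting Series to a plain tuple
result = tuple(top_row)
print(top_label, result)
```

If several rows tie for the largest value, `idxmax` returns the label of the first one.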

In [None]:
assert max_yield_info_pandas(barley_df) == ('No. 462', 'Waseca')

## Problem #5b
### Alex
Write a function called `max_yield_difference_pandas` that takes the `barley` dataset as a `DataFrame` and a `String` representing the variety of barley as parameters and returns the difference between the highest yield of that variety of barley and the lowest yield of that variety.

In [None]:
def max_yield_difference_pandas(data, variety):
    is_variety = data['variety'] == variety
    data = data[is_variety]
    return data['yield'].max() - data['yield'].min()

In [None]:
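# Expected values are computed from the list-of-dictionaries form of the same data
for variety in ['Wisconsin No. 38', 'Manchuria']:
    yields = [row['yield'] for row in barley_list if row['variety'] == variety]
    assert abs(max_yield_difference_pandas(barley_df, variety) - (max(yields) - min(yields))) < 0.001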