# Crop reallocation algorithm: toy model
This notebook translates Ishan's Matlab code setting up a toy version of our crop reallocation model into Python

## Questions for Ishan
- Confirm that my overall understanding is correct: that the data we're operating on is region-level (IR or geolev1). It seems like maybe we should be operating at the raster pixel level, but then how do we know how much yield is coming from that pixel?
- Is the way we want to calculate calories actually what’s done in Ishan’s current code? Seems like this assumes calories per acre are uniform across the whole adm1 unit which doesn’t seem right
- How will we break ties in the min yield planted/max yield empty calculation? This could matter especially if the locations have different climate draws/climate sensitivity
- How are we incorporating switching costs, again? Is there some condition where we are forcing costs to weakly exceed benefits?
- discuss correct way to iterate over gamma
- implementation is slow even with this toy example. For this to be feasible, we're going to need to think about how to speed things up significantly.

## Things to follow up on
- Probably want to change data storage so it works with arrays. Will be faster, better able to read projection system outputs, and generally easier to integrate with impact-calculations (need to check all this with James/Brewster)
- Seems useful to develop a suite of visualizations that helps us understand the steps in the algorithm (and where there's possibility for improvement)

## Immediate next steps
- Build in gamma iteration and elasticity iteration

In [67]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Set up parameters
Here I'm just taking the values from Ishan's code, but attempting to structure them in a way that will work for more data

In [68]:
# the list of crops we're working with
crops = ['soy', 'rice']

# Initial conditions
yields = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'yield': [10, 20, 15, 20, 10, 15]
})

calories = pd.DataFrame({
    'crop': ['soy', 'rice'],
    'calories': [25, 15]
})

acreage = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'acres_planted': [40, 70, 0, 60, 30, 0],
    'total_acres': [100, 100, 100, 100, 100, 100]
})

# may want to turn the lines that return the merged dataframe into a function that
# standardizes datasets--tbd
present_yields = (yields
          .merge(calories, how='left', on='crop')
          .merge(acreage, how='left', on=['crop', 'geo1_id', 'geo0_id'])
         )
    
present_yields['calorie_yield'] = present_yields['yield'] * present_yields['calories']

In [69]:
# Climate shocks
# This will eventually be the real projections from the ag sector.
# Before then, we'll use random draws.
# Just use Ishan's numbers for now.

yield_shocks = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'yield_shock': [0.5, 0.8, 1, 0.9, 0.6, 1]
})

future_yields = (yield_shocks
                 .merge(present_yields, how='outer', on=['geo0_id', 'geo1_id', 'crop'])
                )
future_yields['future_yield'] = future_yields['yield'] * future_yields['yield_shock']

# replace the yield column with the future yield column and re-calculate moments
future_yields['yield'] = future_yields['future_yield']
future_yields.drop('future_yield', inplace=True, axis=1)

In [70]:
present_yields

Unnamed: 0,geo0_id,geo1_id,crop,yield,calories,acres_planted,total_acres,calorie_yield
0,1,1,soy,10,25,40,100,250
1,1,2,soy,20,25,70,100,500
2,1,3,soy,15,25,0,100,375
3,1,1,rice,20,15,60,100,300
4,1,2,rice,10,15,30,100,150
5,1,3,rice,15,15,0,100,225


# 2. Set up functions for calculating moments

### Moment 1: Gamma

In [97]:
# update docstring when this is stable
def calculate_gamma(df):
    '''
    Calculate 'gamma', the ratio of total calories produced to possible 
    calories produced.
    
    Parameters:
    -----------
    df: DataFrame
        ***Description here***
        
    '''

    total_cal = sum(df['calorie_yield'] * df['acres_planted'])
    
    # not sure this potential calorie calculation is what we actually want
    potential_cal = (df
        .groupby('geo1_id')
        .agg({'calorie_yield': 'max', 'acres_planted': 'sum'})
    )
    
    potential_cal = sum(potential_cal['calorie_yield'] * potential_cal['acres_planted'])
    
    return  total_cal / potential_cal

In [98]:
calculate_gamma(present_yields)

0.84375

### Moment 2: Phi

In [72]:
def analyze_empty_acreage(df, crop):
    '''
    Returns the yield and plot id for the plot with the highest yield that currently has 
    empty space, as well as the yield and plot id for the plot with the lowest yield that 
    is currently occupied.
    These are the conditions that will be calculated in each iteration of the loop in
    `calculate_phi`
    
   Parameters:
    -----------
    df: DataFrame
        ***Description here***
    
    crop: str
        The name of the crop you want to calculate phi for
    
    '''
    # set up initial conditions
    total_acres = (df[['geo0_id', 'geo1_id', 'acres_planted']]
                   .copy()
                   .groupby(['geo0_id', 'geo1_id'])
                   .sum()
                   .rename(columns={'acres_planted':'total_acres_planted'})
        )
    df = df.merge(total_acres, how='left', on=['geo0_id', 'geo1_id'])
    df['empty_acres'] = df['total_acres'] - df['total_acres_planted']

    assert all(df['empty_acres'] >= 0)

    empty_max_yield = max(
        df.loc[
            (df['empty_acres'] > 0) & (df['crop'] == crop), 'yield'
        ]
    )
    empty_max_id = df.loc[
        (df['yield'] == empty_max_yield) & (df['crop'] == crop), ['geo0_id', 'geo1_id']
    ]
    used_min_yield = min(
        df.loc[
            (df['acres_planted'] > 0) & (df['crop'] == crop), 'yield'
        ]
    )
    used_min_id = df.loc[
        (df['yield'] == used_min_yield) & (df['crop'] == crop), ['geo0_id', 'geo1_id']
    ]
 
    return [empty_max_yield, used_min_yield, empty_max_id, used_min_id]

In [73]:
def reallocate_crops(df, crop, empty_max_yield, used_min_yield, empty_max_id, used_min_id):
    '''
    A loop to reallocate crop area, moving one acre at a time from the lowest-yielding parcel
    of the crop to the highest-yielding unoccupied parcel
    
    Parameters:
    -----------
    df: DataFrame
        ***Description here***
    
    crop: str
        The name of the crop you want to calculate for
        
    cond: bool
        The condition for your while loop
    '''
    df = df.copy()
    
    min_cond = (
        (df['geo0_id'] == used_min_id['geo0_id'].values[0]) &
        (df['geo1_id'] == used_min_id['geo1_id'].values[0]) &
        (df['crop'] == crop)
    )

    val = df.loc[min_cond, 'acres_planted']
    df.loc[min_cond, 'acres_planted'] = val - 1

    # add one acre to the highest-yielding plot that is currently empty
    max_cond = (
        (df['geo0_id'] == empty_max_id['geo0_id'].values[0]) &
        (df['geo1_id'] == empty_max_id['geo1_id'].values[0]) &
        (df['crop'] == crop)
    )
    val = df.loc[max_cond, 'acres_planted']
    df.loc[max_cond, 'acres_planted'] = val + 1
        
    return df

In [88]:
def calculate_phi(df, crop):
    '''
    Calculate 'phi', the ratio of actual yields to the maximum possible
    yield that would be realized in a perfectly frictionless scenario
    with optimal acreage placement in a country. 
    
    Parameters:
    -----------
    df: DataFrame
        ***Description here***
    
    crop: str
        The name of the crop you want to calculate phi for
    '''

    # don't modify the original data
    df = df.copy()
    
    # calculate actual yields
    data = df[df.crop == crop].copy()
    data['total_yield'] = data['acres_planted'] * data['yield']
    actual_yield = sum(data['total_yield'])
    
    # calculate the potential yield
    # set up initial conditions    
    empty_max_yield, used_min_yield, empty_max_id, used_min_id = (
        analyze_empty_acreage(df, crop)
    )
    
    # reallocate crops from lowest used yield to highest empty yield
    while empty_max_yield > used_min_yield:
        df = reallocate_crops(df, 
                              crop, 
                              empty_max_yield, 
                              used_min_yield, 
                              empty_max_id, 
                              used_min_id)
        
        # recalculate empty acreage
        empty_max_yield, used_min_yield, empty_max_id, used_min_id = (
            analyze_empty_acreage(df, crop)
        )
    
    # calculate potential yields using the reallocated data
    data = df[df.crop == crop].copy()
    data['total_yield'] = data['acres_planted'] * data['yield']
    potential_yield = sum(data['total_yield'])
    
    return actual_yield/potential_yield

In [89]:
calculate_phi(present_yields, 'soy')

0.9

## Combine the moments calculation into a single function
Not convinced this step is necessary at this stage. Move on and come back to this if it turns out to be needed.

In [75]:
# note: try to do this in a crop-agnostic way
def calculate_moments(
    soy_yields,
    rice_yields,
    soy_acreage,
    rice_acreage,
    acreage,
    soy_calories_per_bushel,
    rice_calories_per_bushel
    ):
    
    
    return # want to return gamma and crop-specific phis

# 3. Write functions that incorporate a climate shock and match moments

Just doing this inline for now, will be functionalized eventually.

In [76]:
future_yields

Unnamed: 0,geo0_id,geo1_id,crop,yield_shock,yield,calories,acres_planted,total_acres,calorie_yield
0,1,1,soy,0.5,5.0,25,40,100,250
1,1,2,soy,0.8,16.0,25,70,100,500
2,1,3,soy,1.0,15.0,25,0,100,375
3,1,1,rice,0.9,18.0,15,60,100,300
4,1,2,rice,0.6,6.0,15,30,100,150
5,1,3,rice,1.0,15.0,15,0,100,225


In [77]:
def calculate_distances(present, future, crops):
    present_moments = [calculate_gamma(present)] + [calculate_phi(present, c) for c in crops]
    future_moments = [calculate_gamma(future)] + [calculate_phi(future, c) for c in crops]

    distances = [p - f for p, f in zip(present_moments, future_moments)]
    return distances

In [78]:
def match_moments(present, future, crops):
    distances = calculate_distances(present, future, crops)
    while any(d > 0 for d in distances):
        # print(distances)
        if distances[0] > 0:
        # add gamma iteration here
            raise NotImplementedError('gamma iteration not yet implemented')
            
        # reallocate crops simultaneously.
        # note that this means a plot of land could fill up. 
        # will need to tinker with this possibility
        reallocation_info = [analyze_empty_acreage(future, c) for c in crops]
        for i in range(len(crops)):
            if distances[i + 1] > 0:
                future = reallocate_crops(
                    future, 
                    crops[i],
                    reallocation_info[i][0], 
                    reallocation_info[i][1], 
                    reallocation_info[i][2], 
                    reallocation_info[i][3]
                )

            # recalculate distances
            distances = calculate_distances(present, future, crops)
    
    return(future)

In [91]:
match_moments(present_yields, future_yields, ['soy', 'rice'])

Unnamed: 0,geo0_id,geo1_id,crop,yield_shock,yield,calories,acres_planted,total_acres,calorie_yield
0,1,1,soy,0.5,5.0,25,17,100,250
1,1,2,soy,0.8,16.0,25,87,100,500
2,1,3,soy,1.0,15.0,25,6,100,375
3,1,1,rice,0.9,18.0,15,76,100,300
4,1,2,rice,0.6,6.0,15,13,100,150
5,1,3,rice,1.0,15.0,15,1,100,225


### Do some profiling

In [86]:
import cProfile
import timeit
from numba import jit

In [None]:
any([d > 0 for d in distances])


In [None]:
# creat a numba version to see if it's any faster
@jit
def match_moments_faster(present, future, crops):
    distances = calculate_distances(present, future, crops)
    while any([d > 0 for d in distances]):
        # print(distances)
        if distances[0] > 0:
        # add gamma iteration here
            raise NotImplementedError('gamma iteration not yet implemented')
            
        # reallocate crops simultaneously.
        # note that this means a plot of land could fill up. 
        # will need to tinker with this possibility
        reallocation_info = [analyze_empty_acreage(future, c) for c in crops]
        for i in range(len(crops)):
            if distances[i + 1] > 0:
                future = reallocate_crops(
                    future, 
                    crops[i],
                    reallocation_info[i][0], 
                    reallocation_info[i][1], 
                    reallocation_info[i][2], 
                    reallocation_info[i][3]
                )

            # recalculate distances
            distances = calculate_distances(present, future, crops)
    
    return(future)

In [84]:
# time the numba and non-numba versions
def wrapper():
    return match_moments(present_yields, future_yields, ['soy', 'rice'])

In [None]:
def wrapper():
    return match_moments_faster(present_yields, future_yields, ['soy', 'rice'])

timeit.timeit(wrapper, number=1)

In [87]:
cProfile.run(
    "match_moments(present=present_yields, future=future_yields, crops=['soy', 'rice'])"
)

         246889109 function calls (244530806 primitive calls) in 151.456 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    35250    0.024    0.000    0.222    0.000 <__array_function__ internals>:2(all)
    11750    0.013    0.000    0.741    0.000 <__array_function__ internals>:2(allclose)
    11174    0.016    0.000    0.133    0.000 <__array_function__ internals>:2(any)
    12314    0.011    0.000    0.156    0.000 <__array_function__ internals>:2(append)
     5781    0.007    0.000    0.049    0.000 <__array_function__ internals>:2(argsort)
    11174    0.017    0.000    0.179    0.000 <__array_function__ internals>:2(array_equal)
    23112    0.027    0.000    0.163    0.000 <__array_function__ internals>:2(atleast_2d)
     6157    0.007    0.000    0.016    0.000 <__array_function__ internals>:2(bincount)
     5781    0.007    0.000    0.016    0.000 <__array_function__ internals>:2(can_cast)
    53533    0.053    0

   529064    0.396    0.000    2.020    0.000 common.py:577(is_timedelta64_dtype)
   785528    0.366    0.000    2.896    0.000 common.py:608(is_period_dtype)
   808890    0.356    0.000    2.615    0.000 common.py:642(is_interval_dtype)
  1203450    0.737    0.000    4.801    0.000 common.py:678(is_categorical_dtype)
   155568    0.077    0.000    0.410    0.000 common.py:711(is_string_dtype)
   155568    0.077    0.000    0.100    0.000 common.py:741(condition)
    29469    0.037    0.000    0.408    0.000 common.py:814(is_datetimelike)
   321337    0.229    0.000    0.562    0.000 common.py:862(is_dtype_equal)
   292725    0.253    0.000    0.789    0.000 common.py:951(is_integer_dtype)
   194602    0.424    0.000    2.029    0.000 common.py:99(is_bool_indexer)
    25521    0.014    0.000    0.014    0.000 concat.py:117(__init__)
    25521    0.121    0.000    0.270    0.000 concat.py:130(needs_filling)
    25521    0.031    0.000    0.311    0.000 concat.py:139(dtype)
    25521    

    23488    0.043    0.000    0.081    0.000 indexing.py:2352(convert_to_index_sliceable)
    45848    0.131    0.000    2.312    0.000 indexing.py:2377(check_bool_indexer)
    34298    0.051    0.000    0.152    0.000 indexing.py:242(_is_nested_tuple_indexer)
   102894    0.032    0.000    0.046    0.000 indexing.py:243(<genexpr>)
    11174    0.058    0.000    7.437    0.001 indexing.py:247(_convert_tuple)
    16955    0.008    0.000    0.020    0.000 indexing.py:2475(is_nested_tuple)
    68596    0.058    0.000    0.192    0.000 indexing.py:2488(is_label_like)
    11174    0.037    0.000    0.183    0.000 indexing.py:2565(_can_do_equal_len)
    50865    0.116    0.000    0.475    0.000 indexing.py:270(_convert_scalar_indexer)
    11174    0.002    0.000    0.002    0.000 indexing.py:281(_has_valid_setitem_indexer)
    11174    0.264    0.000    8.271    0.001 indexing.py:313(_setitem_with_indexer)
    11174    0.071    0.000    4.097    0.000 indexing.py:472(setter)
    11174    0.

    11562    0.017    0.000    0.017    0.000 {method 'sort' of 'numpy.ndarray' objects}
   245828    0.064    0.000    0.064    0.000 {method 'split' of 'str' objects}
   490019    0.120    0.000    0.120    0.000 {method 'startswith' of 'str' objects}
    17343    0.017    0.000    0.168    0.000 {method 'sum' of 'numpy.ndarray' objects}
     5781    0.005    0.000    0.005    0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
   109721    0.139    0.000    0.139    0.000 {method 'take' of 'numpy.ndarray' objects}
     5781    0.004    0.000    0.004    0.000 {method 'tolist' of 'numpy.ndarray' objects}
    12314    0.009    0.000    0.009    0.000 {method 'transpose' of 'numpy.ndarray' objects}
   225587    0.068    0.000    0.068    0.000 {method 'update' of 'dict' objects}
    17437    0.005    0.000    0.005    0.000 {method 'upper' of 'str' objects}
       94    0.000    0.000    0.000    0.000 {method 'values' of 'collections.OrderedDict' objects}
    37177    0.007    0.000 

In [None]:
cProfile.run(
    "analyze_empty_acreage(future_yields, 'soy')"
)

# 4. Test these new functions out on Ishan's examples

In [None]:
# the list of crops we're working with
crops = ['soy', 'rice']

# Initial conditions
yields = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'yield': [10, 20, 15, 20, 10, 15]
})

calories = pd.DataFrame({
    'crop': ['soy', 'rice'],
    'calories': [25, 15]
})

acreage = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'acres_planted': [40, 70, 0, 60, 30, 0],
    'total_acres': [100, 100, 100, 100, 100, 100]
})

# may want to turn the lines that return the merged dataframe into a function that
# standardizes datasets--tbd
present_yields = (yields
          .merge(calories, how='left', on='crop')
          .merge(acreage, how='left', on=['crop', 'geo1_id', 'geo0_id'])
         )
    
present_yields['calorie_yield'] = present_yields['yield'] * present_yields['calories']

In [None]:
# Climate shocks
# This will eventually be the real projections from the ag sector.
# Before then, we'll use random draws.
# Just use Ishan's numbers for now.

yield_shocks = pd.DataFrame({
    'geo0_id': [1, 1, 1, 1, 1, 1],
    'geo1_id': [1, 2, 3, 1, 2, 3],
    'crop': ['soy', 'soy', 'soy', 'rice', 'rice', 'rice'],
    'yield_shock': [0.5, 0.8, 1, 0.9, 0.6, 1]
})

future_yields = (yield_shocks
                 .merge(present_yields, how='outer', on=['geo0_id', 'geo1_id', 'crop'])
                )
future_yields['future_yield'] = future_yields['yield'] * future_yields['yield_shock']

# replace the yield column with the future yield column and re-calculate moments
future_yields['yield'] = future_yields['future_yield']
future_yields.drop('future_yield', inplace=True, axis=1)

In [None]:
present_yields

In [None]:
present_yields.loc[present_yields['crop'] == 'soy', 'acres_planted']

In [None]:
x

In [None]:
# function for making barplots of raw data
def make_barplot(col, ylabel, df=present_yields):
    labels = [1, 2, 3]
    x = np.arange(len(labels))
    width = 0.35
    
    fig, ax = plt.subplots()
    bar1 = ax.bar(
        x + width/2,
        df.loc[df['crop'] == 'rice', col],
        width,
        label = 'rice'
    )

    bar2 = ax.bar(
        x - width/2, 
        df.loc[df['crop'] == 'soy', col],
        width,
        label = 'soy'
    )

    ax.set_ylabel(ylabel)
    ax.set_xlabel('Field')
    ax.set_xticks(x)
    ax.legend()
    return plt.show()

In [None]:
# plot total current acreage
make_barplot('acres_planted', 'Acres planted')

In [None]:
# plot current yields
make_barplot('yield', 'Yield')

In [None]:
# plot calories per acre
make_barplot('calorie_yield', 'Calories')

Given the above raw data, our expected values for phi are 0.9 for soy and 0.909 for rice. Does my code give us this?

In [None]:
calculate_phi(present_yields, 'soy')

In [None]:
calculate_phi(present_yields, 'rice')

In [None]:
# plot climate shock ratios
make_barplot(df=future_yields, col='yield_shock', ylabel='Future Yield/Current Yield')

# Benchmark

In [5]:
import timeit

In [21]:
# gamma calculation
timeit.timeit("lambda: calculate_gamma(present_yields)", number=10000)

0.0006674389987892937

In [24]:
# analyze_empty_acreage
timeit.timeit("lambda: analyze_empty_acreage(present_yields, 'soy')")

0.0628344280012243

In [32]:
# reallocate_crops
empty_max_yield, used_min_yield, empty_max_id, used_min_id = (
    analyze_empty_acreage(present_yields, 'soy'))

timeit.timeit("labmda: reallocate_crops(present_yields, 'soy', empty_max_yield, used_min_yield, empty_max_id, used_min_id)", number=10000)

8.828700083540753e-05

In [45]:
timeit.timeit("lambda: calculate_phi(present_yields, 'soy')")

0.062093491998894024

In [62]:
calculate_distances(present_yields, future_yields, ['soy', 'rice'])

[0.0, 0.1325581395348837, 0.08556149732620322]

In [81]:
match_moments(present_yields, future_yields, ['soy', 'rice'])

Unnamed: 0,geo0_id,geo1_id,crop,yield_shock,yield,calories,acres_planted,total_acres,calorie_yield
0,1,1,soy,0.5,5.0,25,17,100,250
1,1,2,soy,0.8,16.0,25,87,100,500
2,1,3,soy,1.0,15.0,25,6,100,375
3,1,1,rice,0.9,18.0,15,76,100,300
4,1,2,rice,0.6,6.0,15,13,100,150
5,1,3,rice,1.0,15.0,15,1,100,225


In [82]:
%timeit -n 1 match_moments(present_yields, future_yields, ['soy', 'rice'])

1min 38s ± 2.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [90]:
%timeit -n 100000 calculate_phi(present_yields, 'soy')

KeyboardInterrupt: 