# Data Integration
### Table of Contents
- [Requirements](#requirements)
- [Structuring Historical Yield Data](#structuring-historical-yield-data)
- [Structuring Historical Price Received Data](#structuring-historical-price-received-data)
- [Structuring Historical Weather Data](#structuring-historical-weather-data)
- [Integrating Data](#integrating-data)

## Requirements

In [1]:
import pandas as pd

In [2]:
## state names and month abbreviations, used below to filter rows and rename columns
states = ['ILLINOIS', 'INDIANA', 'IOWA', 'MINNESOTA', 'MISSOURI', 'NEBRASKA']
months = ['APR', 'AUG', 'DEC', 'FEB', 'JAN', 'JUL', 'JUN', 'MAR', 'MAY', 'NOV', 'OCT', 'SEP']

## reading raw data CSVs
yield_raw = pd.read_csv('../../data/raw/yield_raw.csv') # file path appears as `data/raw/yield_raw.csv` in `integration.py`
price_received_raw = pd.read_csv('../../data/raw/price_received_raw.csv') # file path appears as `data/raw/price_received_raw.csv` in `integration.py`
weather_raw = pd.read_csv('../../data/raw/weather_raw.csv') # file path appears as `data/raw/weather_raw.csv` in `integration.py`

## Structuring Historical Yield Data

In [3]:
## keeping only the states of interest and only annual ('YEAR') records,
## then dropping duplicates so that the first year/state/utilization-practice record is kept
yield_raw = (
    yield_raw[(yield_raw['state_name'].isin(states)) & (yield_raw['reference_period_desc'] == 'YEAR')]
    .drop_duplicates(subset=['year', 'state_name', 'util_practice_desc'])
)

## pivoting the dataframe so each utilization practice becomes its own column
yield_raw = yield_raw.pivot(
    index=['year', 'state_name'],
    columns='util_practice_desc',
    values='Value'
).reset_index()

In [4]:
yield_raw['state_name'].value_counts()

state_name
ILLINOIS     159
INDIANA      159
IOWA         159
MINNESOTA    159
MISSOURI     159
NEBRASKA     159
Name: count, dtype: int64
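
The duplicate drop before the pivot matters because `DataFrame.pivot` raises a `ValueError` when an index/column pair occurs more than once. A minimal sketch on a toy frame, assuming made-up values and hypothetical `GRAIN`/`SILAGE` practice labels (the real contents of `yield_raw` are not shown here):

```python
import pandas as pd

# Hypothetical long-format records; the numbers are invented for illustration.
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021],
    'state_name': ['IOWA'] * 5,
    'util_practice_desc': ['GRAIN', 'GRAIN', 'SILAGE', 'GRAIN', 'SILAGE'],
    'Value': [178.0, 177.0, 20.5, 204.0, 21.0],
})
keys = ['year', 'state_name', 'util_practice_desc']

# The 2020/IOWA/GRAIN pair appears twice, so pivoting directly fails.
try:
    toy.pivot(index=['year', 'state_name'], columns='util_practice_desc', values='Value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Keeping the first record per key makes the reshape well defined.
toy_wide = (
    toy.drop_duplicates(subset=keys)
    .pivot(index=['year', 'state_name'], columns='util_practice_desc', values='Value')
    .reset_index()
)
print(toy_wide)
```

Keeping the first occurrence mirrors the notebook's choice. Whether the first record is the right one to keep depends on how the raw file is ordered.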

## Structuring Historical Price Received Data

In [5]:
## pivoting the dataframe so each reference period becomes its own column
price_received_raw = price_received_raw.pivot(
    index=['year', 'state_name'],
    columns='reference_period_desc',
    values='Value'
).reset_index()
## appending `_preceived` to the month columns; all other columns keep their names
price_received_raw = price_received_raw.rename(columns={m: m + '_preceived' for m in months})
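
To illustrate the suffixing step, here is a small sketch on a toy frame. The values and the `MARKETING YEAR` column are hypothetical stand-ins for a non-month reference period; the point is only that a mapping-based `rename` touches the month columns and leaves everything else alone.

```python
import pandas as pd

months = ['JAN', 'FEB', 'MAR']  # shortened list for the example

# Hypothetical wide-format prices; 'MARKETING YEAR' is an assumed non-month period.
toy_prices = pd.DataFrame({
    'year': [2020],
    'state_name': ['IOWA'],
    'JAN': [3.8],
    'FEB': [3.7],
    'MARKETING YEAR': [3.6],
})

# Keys missing from the frame ('MAR' here) are ignored by rename() by default.
renamed = toy_prices.rename(columns={m: f'{m}_preceived' for m in months})
print(list(renamed.columns))
```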

## Structuring Historical Weather Data

In [7]:
## renaming `weather_raw` columns to match the `year` and `state_name` keys of `yield_raw` and `price_received_raw`, which simplifies merging
weather_raw = weather_raw.rename(columns={'Date':'year', 'state':'state_name'})
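
Matching column names is necessary for the merge, but the key values also have to agree in type and spelling. The sketch below assumes, purely for illustration, that the weather file stores years as strings and state names in mixed case. The actual format of `weather_raw` is not shown in this notebook, and `normalize_keys` and `tavg` are hypothetical names.

```python
import pandas as pd

def normalize_keys(df):
    """Return a copy with integer years and upper-case, trimmed state names."""
    out = df.copy()
    out['year'] = out['year'].astype(int)
    out['state_name'] = out['state_name'].str.strip().str.upper()
    return out

# Hypothetical weather records with inconsistent key formatting.
toy_weather = pd.DataFrame({
    'Date': ['2020', '2021'],
    'state': ['Iowa', ' Iowa'],
    'tavg': [48.1, 49.0],
})

toy_weather = normalize_keys(
    toy_weather.rename(columns={'Date': 'year', 'state': 'state_name'})
)
print(toy_weather)
```

If the keys already match, this step is a no-op and can be skipped.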

## Integrating Data

In [9]:
## merge `yield_raw` and `price_received_raw`
temp = yield_raw.merge(price_received_raw, on=['year', 'state_name'], how='outer')

## merge the result (`temp`) with `weather_raw`
integrated = temp.merge(
    weather_raw,
    on=['year', 'state_name'],
    how='outer'
)

## rename the final dataframe's columns to all lowercase
integrated = integrated.rename(columns=str.lower)

## save the integrated dataframe as a local CSV
integrated.to_csv('../../data/raw/integrated.csv', index=False) # file path appears as `data/raw/integrated.csv` in `integration.py`
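
Because both merges use `how='outer'`, year/state pairs present in only one source survive with missing values in the other source's columns. One way to inspect this, sketched on toy frames with invented values, is pandas' `indicator` and `validate` options for `merge`. This is an optional check, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical stand-ins for two of the structured tables.
left = pd.DataFrame({
    'year': [2020, 2021],
    'state_name': ['IOWA', 'IOWA'],
    'grain': [178.0, 204.0],
})
right = pd.DataFrame({
    'year': [2021, 2022],
    'state_name': ['IOWA', 'IOWA'],
    'jan_preceived': [4.2, 5.5],
})

checked = left.merge(
    right,
    on=['year', 'state_name'],
    how='outer',
    validate='one_to_one',  # raises MergeError if a key repeats on either side
    indicator=True,         # adds a `_merge` column: left_only / right_only / both
)

# Count how many year/state pairs came from each side.
print(checked['_merge'].value_counts())
```

Dropping the `_merge` column before saving keeps the output schema unchanged.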