# YOUR PROJECT TITLE

For this assignment, I use two data sets on agriculture and weather in the United States, specifically North Dakota, one of the largest wheat-producing states in the US. The first data set contains the yearly wheat yield per acre, from around 1900 to 2022, and comes from the USDA National Agricultural Statistics Service (NASS). The second data set contains the yearly average temperature, precipitation, and the temperature at the first freeze from a central weather station in North Dakota, covering a similar period. It is sourced from the National Oceanic and Atmospheric Administration (NOAA). By combining these data sets, I hope to investigate the relationship between weather patterns and wheat production in North Dakota over the past century.

Imports and set magics:

In [62]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

To begin, let's read the first data set on wheat yield and inspect it:

In [63]:
NDyield = pd.read_csv('ND_yield.csv')

print(NDyield.head())


  Program  Year Period  Week Ending Geo Level         State  State ANSI  \
0  SURVEY  2022   YEAR          NaN     STATE  NORTH DAKOTA          38   
1  SURVEY  2021   YEAR          NaN     STATE  NORTH DAKOTA          38   
2  SURVEY  2020   YEAR          NaN     STATE  NORTH DAKOTA          38   
3  SURVEY  2019   YEAR          NaN     STATE  NORTH DAKOTA          38   
4  SURVEY  2018   YEAR          NaN     STATE  NORTH DAKOTA          38   

   Ag District  Ag District Code  County  ...  Zip Code  Region  \
0          NaN               NaN     NaN  ...       NaN     NaN   
1          NaN               NaN     NaN  ...       NaN     NaN   
2          NaN               NaN     NaN  ...       NaN     NaN   
3          NaN               NaN     NaN  ...       NaN     NaN   
4          NaN               NaN     NaN  ...       NaN     NaN   

   watershed_code  Watershed  Commodity                             Data Item  \
0               0        NaN      WHEAT  WHEAT - YIELD, MEASURED 

As can be seen, the data set contains a lot of unnecessary information that should be dropped to make it easier to manage. Let us narrow the data set by dropping everything but the year and the average yield per acre. Most other variables have the same value in every row anyway, since all the data is from North Dakota.

In [64]:
drop_variables_yield = ['Program', 'Period', 'Week Ending','Geo Level','State','State ANSI','Ag District','Ag District Code', 'County', 'County ANSI', 'Zip Code', 'Region', 'watershed_code', 'Watershed','Commodity','Data Item', 'Domain', 'Domain Category', 'CV (%)']

NDyield = NDyield.drop(columns = drop_variables_yield)

print(NDyield.head())

   Year  Value
0  2022   48.9
1  2021   32.2
2  2020   47.6
3  2019   48.4
4  2018   47.6
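
An alternative to listing every column to drop is to select only the columns to keep. The sketch below illustrates this on a small hand-made frame that mimics a few of the columns shown above. The frame `sample` and the new column name `wheat_yield` are assumptions made for illustration and are not part of the project's code.

```python
import pandas as pd

# Hand-made stand-in for the raw NASS file (values copied from the output above)
sample = pd.DataFrame({
    'Program': ['SURVEY', 'SURVEY', 'SURVEY'],
    'Year': [2022, 2021, 2020],
    'State': ['NORTH DAKOTA'] * 3,
    'Value': [48.9, 32.2, 47.6],
})

# Keep only the columns of interest and give 'Value' a more descriptive name
slim = sample[['Year', 'Value']].rename(columns={'Value': 'wheat_yield'})
print(slim)
```

Selecting columns this way keeps working if the source file gains extra columns, whereas a drop list would have to be extended.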


Now we are left with the information relevant to this project. Let's continue by importing the weather data:

In [65]:
NDweather = pd.read_csv('ND_weather.csv')

print(NDweather.head(5))

       STATION              NAME  DATE  FZF0   PRCP  TAVG
0  USC00325710  MC CLUSKY, ND US  1917   NaN    NaN   NaN
1  USC00325710  MC CLUSKY, ND US  1919   NaN  13.26   NaN
2  USC00325710  MC CLUSKY, ND US  1920   NaN  14.22   NaN
3  USC00325710  MC CLUSKY, ND US  1921   NaN  20.16   NaN
4  USC00325710  MC CLUSKY, ND US  1923  29.0    NaN  40.9


Again, some variables can be dropped, namely 'NAME' and 'STATION', which are the same in every row.

In [66]:
drop_variables_weather = ['STATION', 'NAME']
NDweather = NDweather.drop(columns=drop_variables_weather)
print(NDweather.head(20))

    DATE  FZF0   PRCP  TAVG
0   1917   NaN    NaN   NaN
1   1919   NaN  13.26   NaN
2   1920   NaN  14.22   NaN
3   1921   NaN  20.16   NaN
4   1923  29.0    NaN  40.9
5   1924  32.0    NaN  37.8
6   1925  32.0    NaN  38.2
7   1926  31.0    NaN  38.9
8   1927  27.0    NaN  36.8
9   1928  30.0    NaN  39.9
10  1929  30.0    NaN  36.4
11  1930   NaN    NaN   NaN
12  1931  32.0    NaN  44.5
13  1932  32.0    NaN  39.8
14  1933  32.0  15.63  40.5
15  1934  31.0  10.31  42.7
16  1935  32.0  21.88  39.3
17  1936  30.0   8.81  38.1
18  1937  32.0  19.56  38.4
19  1938  31.0  15.62  41.7
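
Before deciding which rows to discard, it can help to count the missing values per column and to check whether any years are absent from the file altogether. The following is a minimal sketch on the first five rows printed above; the names `weather_sample`, `missing_per_column` and `absent_years` are illustrative.

```python
import numpy as np
import pandas as pd

# First rows of the weather data, copied from the output above
weather_sample = pd.DataFrame({
    'DATE': [1917, 1919, 1920, 1921, 1923],
    'FZF0': [np.nan, np.nan, np.nan, np.nan, 29.0],
    'PRCP': [np.nan, 13.26, 14.22, 20.16, np.nan],
    'TAVG': [np.nan, np.nan, np.nan, np.nan, 40.9],
})

# Number of missing values in each column
missing_per_column = weather_sample.isna().sum()
print(missing_per_column)

# Years with no row at all between the first and last year in the sample
first_year = int(weather_sample['DATE'].min())
last_year = int(weather_sample['DATE'].max())
absent_years = sorted(set(range(first_year, last_year + 1)) - set(weather_sample['DATE']))
print(absent_years)
```

In this sample, 1918 and 1922 have no row at all, so row positions and calendar years do not line up one to one.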


We also have some rows with NaN values. To deal with these, let's keep only the years from 1954 onward, since the earlier rows contain a lot of NaNs. Note that the rows are selected by year rather than by position: the weather data skips some years (1918 and 1922, for example) and the yield data is sorted from newest to oldest, so the same row positions do not correspond to the same years in the two data sets:

In [67]:
NDweather = NDweather[NDweather['DATE'] >= 1954]
NDyield   = NDyield[NDyield['Year'] >= 1954]
NDyield   = NDyield.sort_values('Year')  # oldest year first, like the weather data
NDweather = NDweather.reset_index(drop=True)
NDyield   = NDyield.reset_index(drop=True)
print(NDweather.head(100))


    DATE  FZF0   PRCP  TAVG
0   1954  27.0  22.83  41.3
1   1955  28.0  20.47  39.4
2   1956  30.0  19.78  39.6
3   1957  28.0  16.99  40.5
4   1958  27.0  11.83  41.2
..   ...   ...    ...   ...
64  2019  32.0  31.46  40.4
65  2020  29.0  11.60  44.9
66  2021  30.0  15.90  46.3
67  2022  31.0  16.62  39.6
68  2023   NaN    NaN   NaN

[69 rows x 4 columns]


Finally, the last row of the weather data, 2023, also has NaN values, so it is dropped too. The yield data ends in 2022, so it needs no corresponding change:

In [74]:
# drop() returns a new DataFrame, so it is assigned back without inplace=True
NDweather = NDweather.drop(NDweather.tail(1).index)
NDweather = NDweather.reset_index(drop=True)
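
One pitfall worth noting when dropping rows: with `inplace=True`, `DataFrame.drop` modifies the frame and returns `None`, so assigning the result back to the same name overwrites the frame with `None`, and the next method call raises an `AttributeError`. A minimal sketch on a toy frame (the names `df`, `df2` and `result` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'DATE': [2021, 2022, 2023], 'TAVG': [46.3, 39.6, None]})

# With inplace=True the frame is changed in place and the call returns None
result = df.drop(df.tail(1).index, inplace=True)
print(result)
print(len(df))

# Without inplace, a new frame is returned and can safely be assigned back
df2 = pd.DataFrame({'DATE': [2021, 2022, 2023], 'TAVG': [46.3, 39.6, None]})
df2 = df2.drop(df2.tail(1).index).reset_index(drop=True)
print(df2)
```

Either style works on its own; the problem only appears when the two are combined.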

## Explore each data set

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

**Interactive plot** :

In [68]:
def plot_func():
    # Function that operates on data set
    pass

widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
); 


interactive(children=(Output(),), _dom_classes=('widget-interact',))

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of a (inner) **merge**:

In [69]:
from matplotlib_venn import venn2  # requires the matplotlib-venn package

plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 
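
To make explicit which observations an inner merge throws away, one option is to run an outer merge with `indicator=True` first and inspect the `_merge` column. The sketch below uses a handful of rows copied from the outputs above rather than the full data sets; the names `yield_sample`, `weather_sample`, `check` and `merged` are illustrative, and merging on the year is an assumption about how the two data sets would be combined.

```python
import pandas as pd

# A few rows copied from the outputs above
yield_sample = pd.DataFrame({'Year': [2022, 2021, 2020, 2019, 2018],
                             'Value': [48.9, 32.2, 47.6, 48.4, 47.6]})
weather_sample = pd.DataFrame({'DATE': [2019, 2020, 2021, 2022, 2023],
                               'PRCP': [31.46, 11.60, 15.90, 16.62, None],
                               'TAVG': [40.4, 44.9, 46.3, 39.6, None]})

# Give both frames the same key name
weather_sample = weather_sample.rename(columns={'DATE': 'Year'})

# Outer merge with an indicator column: shows where each year comes from
check = pd.merge(yield_sample, weather_sample, on='Year', how='outer', indicator=True)
print(check[['Year', '_merge']])

# Inner merge: keeps only years present in both frames;
# validate raises an error if a year appears more than once on either side
merged = pd.merge(yield_sample, weather_sample, on='Year', how='inner',
                  validate='one_to_one')
print(merged.shape)
```

In this sample, 2018 exists only in the yield rows and 2023 only in the weather rows, so both are dropped by the inner merge.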

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
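
One possible aggregation is by decade. The sketch below groups a few weather rows copied from the outputs above, so the numbers it prints describe only this small sample and not the full data set; the column name `decade` and the frame names are illustrative.

```python
import pandas as pd

# Rows copied from the weather output above (not the full data set)
weather_sample = pd.DataFrame({
    'DATE': [1954, 1955, 1956, 1957, 1958, 2019, 2020, 2021, 2022],
    'PRCP': [22.83, 20.47, 19.78, 16.99, 11.83, 31.46, 11.60, 15.90, 16.62],
    'TAVG': [41.3, 39.4, 39.6, 40.5, 41.2, 40.4, 44.9, 46.3, 39.6],
})

# Assign each year to its decade, e.g. 1957 -> 1950
weather_sample['decade'] = weather_sample['DATE'] // 10 * 10

# Mean and number of observations per decade
by_decade = weather_sample.groupby('decade')[['PRCP', 'TAVG']].agg(['mean', 'count'])
print(by_decade)
```

Reporting the count next to the mean shows how many years each decade average is based on.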

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.