# ***FakeMind-ML4VA Project: Detecting Droughts in Virginia***

Team FakeMind is composed of three UVA students: Alex Fetea, Kamil Urbanowski, and Tyler Kim. FakeMind's goal is to predict droughts in Virginia using publicly available drought and climate datasets. This will help farmers take better care of their farms by preparing ahead of time for possible droughts.

The links to the datasets can be found below:

https://resilience.climate.gov/datasets/esri2::us-drought-by-state/explore

https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily


This notebook stores our code for the ML4VA project and is divided into 8 steps:

1. Big Picture & Setup
2. Getting the Data
3. Discovering and Visualizing the Data
4. Data Cleaning
5. Selecting and Training the Models
6. Fine Tuning the Model
7. Presentation
8. Launch


## **1-Big Picture & Setup**

In [None]:
# import the necessary libraries
import sklearn
import numpy as np
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd

np.random.seed(17)

## **2-Getting the Data**

In [None]:
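# loads the data and prints some information and statistics about the dataset
def load_data(filepath):
  data = pd.read_csv(filepath)

  # compute the summary from the file itself instead of hardcoding it
  # (values originally recorded here: 7.1+ MB, 62313 entries, 15 features,
  #  2 categorical features, missing values present)
  file_size = os.path.getsize(filepath) / 1e6
  num_entries = len(data)
  num_features = data.shape[1]
  num_categorical = len(data.select_dtypes(exclude = np.number).columns)
  missing_value_exists = data.isnull().values.any()

  # print the output
  print("File Size:", str(round(file_size, 1)) + " MB")
  print("Number of Entries:", str(num_entries))
  print("Number of Features:", str(num_features))
  print("Do Categorical variables exist:", "Yes" if num_categorical > 0 else "No", "(" + str(num_categorical) + ")")
  print("Do missing values exist:", "Yes" if missing_value_exists else "No")
  print("\n")

  # info() prints directly and returns None, so it is not wrapped in print()
  data.info()
  print(data.describe())

  return data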

In [None]:
# loads the data
drought_data = load_data("./datasets/USA_Drought_Intensity_2000_-_Present.csv")

In [None]:
%matplotlib inline
drought_data.hist(bins = 50, figsize = (20, 15))
plt.show()

In [None]:
# stratified test distribution by state abbreviation
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 17)
# split() yields positional indices, so select rows with iloc
for train_index, test_index in split.split(drought_data, drought_data['state_abbr']):
  strat_train_set = drought_data.iloc[train_index]
  strat_test_set = drought_data.iloc[test_index]

In [None]:
# test distributions
strat_test_set["state_abbr"].value_counts() / len(strat_test_set)

In [None]:
drought_data["state_abbr"].value_counts() / len(drought_data)
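The two cells above are easier to compare when the proportions sit side by side. The following is a minimal sketch on a toy frame (the three states and their row counts are made up, and `toy`, `comparison`, and `pct_error` are illustrative names, not part of the notebook) showing one way to measure how closely a stratified test split tracks the overall distribution.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# toy stand-in for drought_data: 3 hypothetical states with uneven row counts
toy = pd.DataFrame({"state_abbr": ["VA"] * 500 + ["MD"] * 300 + ["WV"] * 200})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=17)
train_idx, test_idx = next(split.split(toy, toy["state_abbr"]))
toy_test = toy.iloc[test_idx]  # positional indices, hence iloc

# proportions side by side, plus the relative error of the test split
comparison = pd.DataFrame({
    "overall": toy["state_abbr"].value_counts(normalize=True),
    "stratified_test": toy_test["state_abbr"].value_counts(normalize=True),
})
comparison["pct_error"] = 100 * (comparison["stratified_test"] / comparison["overall"] - 1)
print(comparison)
```

A `pct_error` close to zero for every state suggests the test set is representative with respect to `state_abbr`.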

## **3-Discovering and Visualizing the Data**

In [None]:
import seaborn as sns

plt.style.use("dark_background")

states_of_interest = ["VA", "MD", "WV", "TN", "KY", "NC", "DC"]

for state in states_of_interest:
  sub_df_train = strat_train_set.loc[strat_train_set["state_abbr"] == state]
  plt.title(state + " Drought Levels")

  plt.scatter(sub_df_train["period"], sub_df_train["nothing"], color = "lime", label = "Nothing")
  plt.scatter(sub_df_train["period"], sub_df_train["d0"], color = "green", label = "Abnormally Dry")
  plt.scatter(sub_df_train["period"], sub_df_train["d1"], color = "olive", label = "Moderate Drought")
  plt.scatter(sub_df_train["period"], sub_df_train["d2"], color = "yellow", label = "Severe Drought")
  plt.scatter(sub_df_train["period"], sub_df_train["d3"], color = "orange", label = "Extreme Drought")
  plt.scatter(sub_df_train["period"], sub_df_train["d4"], color = "red", label = "Exceptional Drought")
  plt.legend(bbox_to_anchor = (1.05, 1.0), loc = "upper left")
  plt.tight_layout()
  plt.show()


The scatterplots above show drought levels over time for VA and its neighboring states (plus DC). Many of these states are frequently abnormally dry but rarely experience anything worse.
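The visual impression above could also be checked numerically. Here is a rough sketch, assuming each `d0` to `d4` column holds the share of a state in that category for a given period; the toy values below are made up for illustration and `share_nonzero` is a hypothetical name.

```python
import pandas as pd

# toy stand-in for strat_train_set; all values are made up for illustration
toy = pd.DataFrame({
    "state_abbr": ["VA", "VA", "VA", "VA", "NC", "NC", "NC", "NC"],
    "d0": [12.0, 0.0, 30.5, 8.0, 0.0, 22.0, 15.0, 0.0],
    "d2": [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

levels = ["d0", "d2"]
# fraction of periods in which each level covers any area at all, per state
share_nonzero = toy[levels].gt(0).groupby(toy["state_abbr"]).mean()
print(share_nonzero)
```

On the real training set, a high share for `d0` combined with low shares for `d2` through `d4` would support the observation drawn from the plots.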

## **4-Data Cleaning**

We drop some features because several of them are redundant.
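One way to confirm that a column is redundant before dropping it is to test whether it can be rebuilt from other columns. The sketch below assumes, without confirmation from the dataset documentation, that a cumulative column such as `D1_D4` is the row-wise sum of `d1` through `d4`; `is_sum_of` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def is_sum_of(df, total_col, part_cols, atol=1e-6):
    """Hypothetical helper: True if total_col equals the row-wise sum of part_cols."""
    return bool(np.allclose(df[total_col], df[part_cols].sum(axis=1), atol=atol))

# toy rows; the cumulative relationship is an assumption, not a documented fact
toy = pd.DataFrame({
    "d1": [10.0, 0.0, 5.0],
    "d2": [4.0, 0.0, 1.0],
    "d3": [1.0, 0.0, 0.0],
    "d4": [0.0, 0.0, 0.0],
    "D1_D4": [15.0, 0.0, 6.0],   # equals d1 + d2 + d3 + d4 in every row
    "other": [7.0, 7.0, 7.0],    # unrelated column for contrast
})

parts = ["d1", "d2", "d3", "d4"]
print(is_sum_of(toy, "D1_D4", parts))
print(is_sum_of(toy, "other", parts))
```

If the check fails on the real data, the column carries information of its own and dropping it deserves a second look.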

In [None]:
# drop features
drop_list = ["D1_D4", "D2_D4", "D3_D4", "OBJECTID", "ddate", "state_abbr", "admin_fips", "nothing", "d0", "d1", "d2", "d3", "d4", "D0_D4"]
drought_data = drought_data.drop(drop_list, axis = 1)

print(drought_data.head())

In [None]:
output_list = ["nothing", "d0", "d1", "d2", "d3", "d4", "D0_D4"]

y_strat_train_set = strat_train_set[output_list]
X_strat_train_set = strat_train_set.drop(drop_list, axis = 1)

print(y_strat_train_set.head())
print(X_strat_train_set.head())


In [None]:
# preprocessing pipeline for numeric and categorical features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_attribs = list(drought_data.select_dtypes(include = np.number))
cat_attribs = list(drought_data.select_dtypes(exclude = np.number))

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "constant", fill_value = 0)),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

In [None]:
drought_train_data_prepared = full_pipeline.fit_transform(X_strat_train_set)
print(drought_train_data_prepared.shape)
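To see what this kind of pipeline does to individual values, it can help to run the same steps on a tiny frame. This is a minimal sketch with made-up data; `toy`, `toy_pipeline`, and the column names are illustrative, and `sparse_threshold=0` is added only so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame with one missing numeric value and one categorical column (made up)
toy = pd.DataFrame({
    "value": [1.0, np.nan, 3.0, 4.0],
    "region": ["east", "west", "east", "west"],
})

num_cols = list(toy.select_dtypes(include=np.number))
cat_cols = list(toy.select_dtypes(exclude=np.number))

toy_pipeline = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
        ("std_scaler", StandardScaler()),
    ]), num_cols),
    ("cat", OneHotEncoder(), cat_cols),
], sparse_threshold=0)  # force a dense array so it prints cleanly

prepared = toy_pipeline.fit_transform(toy)
print(prepared.shape)
print(prepared.round(2))
```

Note that the missing entry is filled with a literal 0 before scaling, so it is treated as a real measurement of zero rather than as an unknown value.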

## **5-Selecting and Training the Models**

In [None]:
import tensorflow as tf
from tensorflow import keras

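# placeholder dimensions used to smoke-test the model wiring
num_samples = 1000
timesteps = 10
num_features = 15
n_outputs = 7

# NOTE: random placeholder data; this overwrites the pipeline output from the previous section
drought_train_data_prepared = np.random.random((num_samples, timesteps, num_features))

# random one-hot labels, also placeholders; num_classes keeps the width fixed at n_outputs
y_strat_train_set = tf.keras.utils.to_categorical(np.random.randint(0, n_outputs, size=(num_samples,)), num_classes=n_outputs)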

def build_simple_rnn_model(input_shape, n_neurons=(20, 20), n_outputs=7):
    model = keras.models.Sequential()
    for i, n_neuron in enumerate(n_neurons):
        # every recurrent layer except the last must return the full sequence
        is_last = i == len(n_neurons) - 1
        # only the first layer needs the input shape
        kwargs = {"input_shape": input_shape} if i == 0 else {}
        model.add(keras.layers.SimpleRNN(n_neuron, return_sequences=not is_last, **kwargs))
    model.add(keras.layers.Dense(n_outputs, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


In [None]:
input_shape = (timesteps, num_features)
simple_rnn_model = build_simple_rnn_model(input_shape, n_neurons=[20, 20], n_outputs=n_outputs)

# Train the model
history = simple_rnn_model.fit(
    drought_train_data_prepared,
    y_strat_train_set,
    epochs=30,
    validation_split=0.1
)
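The recurrent model expects 3D input of shape `(samples, timesteps, features)`, while a preprocessing pipeline such as `full_pipeline` returns a 2D matrix. One possible way to bridge the two is a sliding window over time-ordered rows. The sketch below uses made-up data, and `make_windows` is a hypothetical helper rather than part of the notebook.

```python
import numpy as np

def make_windows(features, targets, timesteps):
    """Hypothetical helper: slice a (time, features) matrix into RNN-ready windows.

    Window i covers rows i .. i + timesteps - 1, and its target is the
    label of the row that immediately follows the window.
    """
    n_windows = len(features) - timesteps
    X = np.stack([features[i:i + timesteps] for i in range(n_windows)])
    y = targets[timesteps:timesteps + n_windows]
    return X, y

rng = np.random.default_rng(17)
toy_features = rng.random((50, 15))         # 50 time-ordered rows, 15 features (toy values)
toy_targets = rng.integers(0, 7, size=50)   # one class label per row (toy values)

X_windows, y_windows = make_windows(toy_features, toy_targets, timesteps=10)
print(X_windows.shape, y_windows.shape)  # (40, 10, 15) (40,)
```

For this to be meaningful on the drought data, rows would need to be sorted by `period` and windowed separately per state, so that no window mixes observations from different states.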

In [None]:
keras.backend.clear_session()
np.random.seed(42)

## **6-Fine Tuning the Model**

## **7-Presentation**

## **8-Launch**