# This Notebook Outlines the Models used in our Irma Tweets Research
We tried multiple models and approaches in each area; this notebook shows only the best-performing ones.

## User Model - GradientBoosting

In [1]:
import pandas as pd



In [2]:
users = pd.read_csv("twitterUsers.csv.zip", index_col="screen_name")
users = users[['followers_count', 'friends_count', 'statuses_count', 'verified']]

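Balance the users dataframe between the two classes: keep every verified user, and keep each unverified user with probability equal to the verified-to-unverified ratio.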

In [3]:
from numpy.random import binomial

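# Ratio of verified (1) to unverified (0) users. It is used as the
# keep-probability for unverified rows so both classes end up similar in size.
counts = users.verified.value_counts()
sampling_ratio = counts[1] / float(counts[0])

to_keep = []
for row in users.itertuples():
    if row.verified == 1:
        # Always keep verified users.
        to_keep.append(True)
    else:
        # Keep an unverified user with probability sampling_ratio.
        to_keep.append(binomial(1, sampling_ratio) == 1)
users = users[to_keep]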
users.head()

| screen_name | followers_count | friends_count | statuses_count | verified |
|---|---|---|---|---|
| PabloAcostaTuc | 1090 | 216 | 2371 | 0 |
| joeflech | 3658 | 1224 | 11410 | 1 |
| DLasAmericas | 112154 | 12510 | 407453 | 1 |
| wimmer_pa | 248 | 84 | 1292 | 0 |
| rebeccavallas | 6292 | 1736 | 16352 | 1 |
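
The row-by-row downsampling above can also be expressed without an explicit loop. The following is a minimal sketch of the same idea, not the code used in the study: the `toy` dataframe and its values are made up for illustration, and the fixed seed is an assumption added so the draw is repeatable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users table; the values are made up for illustration.
toy = pd.DataFrame({
    "followers_count": [10, 2500, 40, 90000, 15, 300, 7, 52],
    "verified":        [0,  1,    0,  1,     0,  0,   0, 0],
})

rng = np.random.default_rng(0)  # assumed seed, only for repeatability

# Ratio of verified to unverified rows (here 2 / 6).
counts = toy["verified"].value_counts()
sampling_ratio = counts.loc[1] / counts.loc[0]

# Keep every verified row; keep each unverified row with
# probability sampling_ratio.
is_verified = (toy["verified"] == 1).to_numpy()
keep = is_verified | (rng.random(len(toy)) < sampling_ratio)
balanced = toy[keep]

print(balanced["verified"].value_counts())
```

Because the unverified rows are kept at random, the two classes come out only approximately equal in size, and the exact counts vary from run to run unless a seed is fixed.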


In [4]:
from sklearn.model_selection import train_test_split
x = users.iloc[:, :-1]
y = users.iloc[:, -1]
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3)

The model originally went through hyperparameter optimization, feature engineering, feature selection, and cross-validation on the `x_tr, y_tr` set. After it proved to be better than Random Forest, Logistic Regression, and other models, we train it on the `x_tr, y_tr` set and test it on the `x_te, y_te` set.
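
For readers unfamiliar with the tuning step mentioned above, here is a minimal sketch of a cross-validated hyperparameter search restricted to the training split. It is an illustration only: the synthetic data from `make_classification`, the parameter grid, and the fold count are all assumptions, and the search actually used in the study is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the three user features; not the study's data.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grid, chosen only to keep the example fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
# Only the training split is passed in, so the test split stays unseen.
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated ROC-AUC:", search.best_score_)
```

Keeping the search on the training split means the test-set metrics remain an honest estimate of performance on unseen users.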

In [5]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

gb = GradientBoostingClassifier(n_estimators=50)
cv_results = cross_validate(gb, x_tr, y_tr, scoring=["roc_auc"], cv=10)
print("Cross-Validation ROC-AUC-Score: ", cv_results["test_roc_auc"])
gb.fit(x_tr, y_tr)
y_pred = gb.predict(x_te)
y_pred_proba = gb.predict_proba(x_te)

print("")
print("Test-Set Evaluation:")
print("F1-Score:", f1_score(y_te, y_pred))
print("Precision-Score:", precision_score(y_te, y_pred))
print("Recall-Score:", recall_score(y_te, y_pred))
print("ROC-AUC-Score:", roc_auc_score(y_te, y_pred_proba[:, 1]))

Cross-Validation ROC-AUC-Score:  [0.96410981 0.97718383 0.96059684 0.96780385 0.93872229 0.97282291
 0.94812975 0.95517972 0.97289795 0.97019505]

Test-Set Evaluation:
F1-Score: 0.9137440758293839
Precision-Score: 0.909433962264151
Recall-Score: 0.9180952380952381
ROC-AUC-Score: 0.9683239283239284


## GeoSpatial Model
This is not a machine learning model but a mathematical one, selected from a set of 27 candidate mathematical models. It is based on the distance between the tweet's location and the hurricane at the time the tweet was published, combined with weather data for that location at that moment.
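
The distance component can be illustrated with a great-circle calculation. This is a minimal sketch under stated assumptions: the notebook does not say which distance formula the selected model uses, so the haversine formula is swapped in here as a common choice, and the function name and coordinates below are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# Hypothetical coordinates, for illustration only.
tweet_location = (25.76, -80.19)  # roughly Miami
storm_center = (24.50, -81.50)    # made-up storm position at the tweet's timestamp

distance = haversine_km(*tweet_location, *storm_center)
print("Distance from storm center (km):", round(distance, 1))
```

In practice the storm position would have to be looked up for each tweet's timestamp, which this sketch leaves out.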

## Images Model - InceptionV3 Tuned
- For binary classification (level 1), we used Inception v3 from Google's TensorFlow; [here](https://github.com/tensorflow/tpu/blob/master/models/experimental/inception/inception_v3.py).
- For multi-label classification (level 2), we used a version of Inception v3 modified for multi-label output; [here](https://github.com/BartyzalRadek/Multi-label-Inception-net).
- Our images can be provided upon request, but are not uploaded to the repo due to size limits.
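
The difference between the level-1 and level-2 setups listed above comes down to how the network's final scores are turned into predictions. The sketch below assumes the usual approach, where a single-label head uses softmax and a multi-label head applies an independent sigmoid to each label; this is a general illustration, not code from either linked repository, and the logits and the 0.5 threshold are hypothetical.

```python
from math import exp


def softmax(logits):
    """Single-label case: scores compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label case: each score is an independent probability."""
    return [1.0 / (1.0 + exp(-z)) for z in logits]


logits = [2.0, 1.5, -3.0]  # hypothetical final-layer outputs for three labels

single_label = softmax(logits)
# 0.5 is an assumed decision threshold, applied to each label separately.
multi_label = [p >= 0.5 for p in sigmoid(logits)]

print("Softmax probabilities:", single_label)
print("Labels switched on:", multi_label)
```

With softmax, only one label can dominate, whereas the per-label sigmoid lets an image carry several labels at once.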