# This Notebook Outlines the Models used in our Irma Tweets Research
We tried multiple models and approaches in each area; this notebook shows only the best-performing one for each.

## User Model - GradientBoosting

In [2]:
import pandas as pd

In [8]:
users = pd.read_csv("twitterUsers.csv.zip", index_col="screen_name")
users = users[['followers_count', 'friends_count', 'statuses_count', 'verified']]
users.head()

screen_name,followers_count,friends_count,statuses_count,verified
Lysette_P,420,401,4787,0
Gboarders,386,1021,5804,0
PabloAcostaTuc,1090,216,2371,0
patigalafassi,169,170,5386,0
tricky_dickie87,320,1344,663,0


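Balance the `users` dataframe between classes by randomly downsampling the unverified users, so that both classes end up roughly the same size.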

In [9]:
from numpy.random import binomial

# Ratio of verified (minority) to unverified (majority) users.
class_counts = users.verified.value_counts()
sampling_ratio = class_counts[1] / class_counts[0]

# Keep every verified user; keep each unverified user with
# probability sampling_ratio.
to_keep = [
    row.verified == 1 or binomial(1, sampling_ratio) == 1
    for row in users.itertuples()
]
users = users[to_keep]
users.head()

screen_name,followers_count,friends_count,statuses_count,verified
joeflech,3658,1224,11410,1
DLasAmericas,112154,12510,407453,1
rebeccavallas,6292,1736,16352,1
JRichardsonx,715,717,7001,0
PHRESHER_DGYGZ,4397,102,10708,1
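
The balancing step above draws one Bernoulli sample per unverified user, so the two classes come out only approximately equal in size. As an alternative sketch (not the method used in this notebook), the majority class can be downsampled to exactly the minority count with `DataFrame.sample`. The helper name `downsample_majority`, the seed, and the synthetic `toy` frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def downsample_majority(df, label_col="verified", seed=0):
    """Keep all minority-class rows plus an equally sized random subset of the rest."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label].sample(
        n=n_minority, random_state=seed
    )
    # Shuffle so the two classes are interleaved in the result.
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)


# Synthetic stand-in for the users frame (hypothetical data, roughly 10% verified).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "followers_count": rng.integers(0, 10_000, size=1_000),
        "verified": (rng.random(1_000) < 0.1).astype(int),
    }
)

balanced = downsample_majority(toy)
print(balanced["verified"].value_counts())
```

Fixing the seed makes the subsample reproducible, which the per-row binomial draw does not guarantee unless NumPy's global seed is set beforehand.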


In [11]:
from sklearn.model_selection import train_test_split
x = users.iloc[:, :-1]
y = users.iloc[:, -1]
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3)

The model originally went through hyperparameter optimization and cross-validation on the `x_tr, y_tr` set. After it proved to outperform Random Forest, Logistic Regression, and other models, we train it on the `x_tr, y_tr` set and test it on the `x_te, y_te` set.

In [30]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

gb = GradientBoostingClassifier(n_estimators=50)
cv_results = cross_validate(gb, x_tr, y_tr, scoring=["roc_auc"], cv=10)
print("Cross-Validation ROC-AUC-Score: ", cv_results["test_roc_auc"])
gb.fit(x_tr, y_tr)
y_pred = gb.predict(x_te)
y_pred_proba = gb.predict_proba(x_te)

print("")
print("Test-Set Evaluation:")
print("F1-Score:", f1_score(y_te, y_pred))
print("Precision-Score:", precision_score(y_te, y_pred))
print("Recall-Score:", recall_score(y_te, y_pred))
print("ROC-AUC-Score:", roc_auc_score(y_te, y_pred_proba[:, 1]))

Cross-Validation ROC-AUC-Score:  [0.96312816 0.94394886 0.98197294 0.97498513 0.97610021 0.96914957
 0.96703092 0.95905205 0.9640768  0.97967602]

Test-Set Evaluation:
F1-Score: 0.9189704480457579
Precision-Score: 0.8975791433891993
Recall-Score: 0.94140625
ROC-AUC-Score: 0.9636296452702703
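
The hyperparameter optimization mentioned earlier is not reproduced in this notebook. A minimal sketch of how such a search could look with scikit-learn's `GridSearchCV` follows. The parameter grid, the fold count, and the synthetic data from `make_classification` are assumptions for illustration; they are not the values or data used in the research.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with four features; not the Twitter users data.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grid: the values actually searched are not recorded here.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric as the cross-validation cell above
    cv=5,
)
search.fit(X_tr, y_tr)  # the search only ever sees the training split

print("Best parameters:", search.best_params_)
print("Best CV ROC-AUC:", search.best_score_)
```

Running the search on the training split alone keeps the test split untouched, so the test-set scores remain an unbiased estimate for the selected model.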


## GeoSpatial Model

## Images Model - InceptionV3 Tuned