# Data EDA and Modelling

* [Imports](#Imports)<br>
* [Preparing the Data](#Preparing-the-Data)<br> 
    * [Constructing the Corpus](#Constructing-the-Corpus)<br>
    * [Corpus Cleaning](#Corpus-Cleaning)<br>
* [Model Preparation](#Model-Preparation)<br>
* [Limitations & Conclusions](#Limitations-&-Conclusions)<br>

### Imports

In [1]:
import pandas as pd
import regex as re
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore',category = FutureWarning)



### Preparing the Data

##### Constructing the Corpus

First, we read in the historic tweets from the `BayCountTMC`, `fl511_panhandl`, and `WJHG_TV` accounts, stored in `account_tweets.csv`, and the unfiltered tweets from `searched_tweets.csv`, scraped using the GetOldTweets3 library.

In [2]:
account = pd.read_csv('../datasets/account_tweets.csv')
searched = pd.read_csv('../datasets/searched_tweets.csv')

The `account` dataframe contains the tweets that we will label as traffic-incident or emergency-related tweets (the positive class).

In [3]:
account.head()

| Unnamed: 0 | id | username | date | text | hashtags | geo | traffic |
|---|---|---|---|---|---|---|---|
| 0 | 1051600805755846657 | WJHG_TV | 2018-10-14 22:29:04+00:00 | A lot of disaster assistance info... as well a... | | | 1 |
| 1 | 1051595380369096704 | fl511_panhandl | 2018-10-14 22:07:31+00:00 | Cleared: Traffic congestion in Bay on US-231 s... | | | 1 |
| 2 | 1051580211702243333 | WJHG_TV | 2018-10-14 21:07:14+00:00 | Jessica and Ryan are about to handle our storm... | #wjhgmichael | | 1 |
| 3 | 1051575221671682054 | fl511_panhandl | 2018-10-14 20:47:24+00:00 | Cleared: Object on roadway in Okaloosa on I-10... | | | 1 |
| 4 | 1051574997964255232 | fl511_panhandl | 2018-10-14 20:46:31+00:00 | New: Object on roadway in Okaloosa on I-10 wes... | #fl511 | | 1 |


The `searched` dataframe contains tweets that have been queried using the same search words but without particular accounts specified. Tweets from here will represent our negative class.

In [4]:
searched.head()

| Unnamed: 0 | id | username | date | text | hashtags | geo | traffic |
|---|---|---|---|---|---|---|---|
| 0 | 1051623671951974400 | GulfPower | 2018-10-14 23:59:56+00:00 | “We are pleased to be making steady progress a... | #HurricaneMichael | | 0 |
| 1 | 1051623669363965952 | Postcards4Potus | 2018-10-14 23:59:55+00:00 | @realDonaldTrump really doesn't care! Seriousl... | #HurricaneMichael | | 0 |
| 2 | 1051623651320184832 | LauraHKByrne | 2018-10-14 23:59:51+00:00 | This is an excellent point. The best and easie... | #HurricaneMichael | | 0 |
| 3 | 1051623649197875201 | SupplierCom | 2018-10-14 23:59:50+00:00 | For those affected by #HurricaneMichael member... | #HurricaneMichael | | 0 |
| 4 | 1051623615911927808 | Heart_to_Heart | 2018-10-14 23:59:42+00:00 | The devastation from #HurricaneMichael is hard... | #HurricaneMichael #PanamaCity #Florida | | 0 |


Now that both of the scrapes have been read in, they will be concatenated into one dataframe containing both the positive and negative classes so that it can be cleaned further and used in modeling.

In [5]:
df = pd.concat([account,searched],axis = 0, ignore_index= True)

After importing the two files into a new notebook it is important to check once again for any blank rows that have been converted to `NaN` values since exporting the scrapes into `.csv` files. 

In [6]:
df.isnull().sum()

id             0
username       0
date           0
text           4
hashtags    1394
geo         4348
traffic        0
dtype: int64

Four null values are present in the `text` column after the export. Because only four rows are affected and a tweet with no text carries no information for the model, these rows can safely be dropped.

In [7]:
# Drop the rows whose text is missing
df.dropna(subset=['text'], inplace=True)

In [8]:
df.isnull().sum()

id             0
username       0
date           0
text           0
hashtags    1390
geo         4344
traffic        0
dtype: int64

Now that there are no null values left in the text column and the traffic column, which will be our corpus and our classification variable respectively, we can move on to the next steps of the data cleaning process. 

##### Corpus Cleaning

In order to feed the model the most meaningful words, the raw text data of the tweets will need to be trimmed down. First, links will need to be removed. 

In [9]:
# Test tweet with https link in it
print(df['text'][0])

A lot of disaster assistance info... as well as shelters, food/water PODS, and curfew information can be found here... https://www.floridadisaster.org/info/


In [10]:
# Remove http(s) and www links (raw string, escaped dot, explicit regex=True)
df['text'] = df['text'].str.replace(r'http\S+|www\.\S+', ' ', case=False, regex=True)

In [11]:
# link has been removed
print(df['text'][0])

A lot of disaster assistance info... as well as shelters, food/water PODS, and curfew information can be found here...  
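
To make the link pattern easier to check in isolation, here is a minimal sketch that applies it to a plain string with the standard-library `re` module instead of a pandas column. The helper name `strip_links` and the second sample URL are illustrative assumptions and are not part of the notebook.

```python
import re

# Matches anything starting with "http" or "www." up to the next whitespace.
LINK_PATTERN = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)

def strip_links(text):
    """Replace every link in `text` with a single space (hypothetical helper)."""
    return LINK_PATTERN.sub(' ', text)

# The second URL is made up to show the case-insensitive "www." branch.
sample = ('curfew information can be found here... '
          'https://www.floridadisaster.org/info/ or WWW.example.com today')
cleaned = strip_links(sample)
print(cleaned)
```

Escaping the dot in `www\.` keeps the pattern from matching ordinary words that merely begin with "www" followed by any character.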


Then, all tweets will undergo some general cleaning: removing non-alphabetic characters, converting all letters to lowercase, and removing stopwords, which typically carry little meaning for a model.

In [12]:
def tweet_cleaning(raw):
    # 1. Replace every non-letter character with a space.
    letters_only = re.sub('[^a-zA-Z]', ' ', raw)
    
    # 2. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 3. Join all the stopwords as a string with " ", remove "'" from the stopwords,
    #    and split into a set for fast membership checks.
    stops = set(" ".join(stopwords.words('english')).replace("'", "").split())
    
    # 4. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 5. Join the words back into one string separated by space and return the result.
    return " ".join(meaningful_words)

In [13]:
df['text'] = df['text'].apply(tweet_cleaning)

In [14]:
df['text'][0]

'lot disaster assistance info well shelters food water pods curfew information found'
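
The cleaning steps can be tried without downloading the nltk stopword corpus. The sketch below follows the same three steps but uses a tiny hand-written stopword set, which is an assumption made purely for illustration; `clean_text` and `TOY_STOPS` are hypothetical names.

```python
import re

# Tiny stand-in stopword set (assumption); the notebook uses nltk's English list.
TOY_STOPS = {'a', 'of', 'as', 'and', 'can', 'be', 'here'}

def clean_text(raw, stops=TOY_STOPS):
    """Keep letters only, lower-case, tokenize, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', raw)   # punctuation and digits become spaces
    words = letters_only.lower().split()           # lower-case and split on whitespace
    return ' '.join(w for w in words if w not in stops)

example = clean_text('A lot of disaster assistance info... food/water PODS can be found here!')
print(example)
```

Note that replacing punctuation with spaces splits "food/water" into two tokens, which matches the cleaned tweet shown above.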

In [15]:
def lemmatizing(tweet):
    
    # 1. Split into individual words.
    words = tweet.split()
    
    # 2. Remove stopwords (already removed by tweet_cleaning; repeated here as a safeguard).
    stops = set(" ".join(stopwords.words('english')).replace("'", "").split())
    meaningful_words = [w for w in words if w not in stops]
    
    # 3. Lemmatize each remaining word.
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(w) for w in meaningful_words]
    
    # 4. Join the words back into one string separated by space and return the result.
    return " ".join(lemmas)

In [16]:
df['text'] = df['text'].apply(lemmatizing)

In [17]:
df['text'][0]

'lot disaster assistance info well shelter food water pod curfew information found'

### Model Preparation

Prior to modeling, it is important to establish a baseline against which to compare our models' accuracy scores. This is done below.

In [18]:
# baseline accuracy score
df['traffic'].value_counts(normalize = True)

0    0.8407
1    0.1593
Name: traffic, dtype: float64

The 84% baseline shows a strong class imbalance in the data: the account-specific scrape returned far fewer tweets than the unfiltered search stored in `searched_tweets.csv`. Therefore, bootstrapping the data will be necessary in our models, and the train/test split below is stratified on the label so that both sets keep the same class proportions.
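
As a reminder of what the baseline means, the sketch below computes the accuracy of always predicting the most common label. The label list is toy data chosen to mimic the imbalance above, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy labels with roughly the same imbalance as the notebook (assumed data).
toy_labels = [0] * 84 + [1] * 16
label, baseline = majority_baseline(toy_labels)
print(label, baseline)
```

Any model whose accuracy is close to this number may simply be predicting the negative class for every tweet.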

In [19]:
y = df['traffic'] # classifier variable
X = df['text'] # corpus

X_train,X_test, y_train,y_test = train_test_split(X,y,stratify = y, random_state = 42)

We will be using the `traffic` column as the classification variable and the `text` column as the corpus. The following grid search tests whether the `CountVectorizer` or the `TfidfVectorizer` is better suited to our classification model across three classifiers: a random forest, an adaboost classifier, and a support vector machine for classification (SVC).

The vectorizer parameters include sklearn's built-in English `stop_words` list in order to account for any differences from the nltk stopword list used in the cleaning functions. Additionally, unigram, bigram, and trigram ranges are searched in order to capture specific combinations of words such as "hurricane michael".

The random forest and adaboost classifiers were selected because both are ensembles of decision trees that resample or reweight the training data internally: the random forest fits each tree on a bootstrap sample, and adaboost reweights the observations at each boosting round. The support vector machine is included because any word vectorizer turns each unique word into an independent feature. It is safe to assume that the corpus of tweets contains numerous unique words, so the dataset will have a large number of features and therefore high dimensionality, a setting in which support vector machines are often effective.
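
To illustrate what widening the n-gram range adds to the feature set, here is a simplified sketch that enumerates word n-grams for an inclusive range. It splits on whitespace only, so it is not a reimplementation of sklearn's tokenizer; `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

grams = word_ngrams('hurricane michael damage', ngram_range=(1, 3))
print(grams)
```

With a range of `(1, 3)`, the phrase "hurricane michael" becomes its own feature alongside the individual words, at the cost of a much larger vocabulary.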

In [20]:
# Instantiating Pipelines for potential models
pipe_rf_cvec = Pipeline([('cvec',CountVectorizer()),                         
                         ('rf', RandomForestClassifier(n_estimators= 100))])
pipe_rf_tfidf = Pipeline([('tfidf', TfidfVectorizer()),                          
                          ('rf', RandomForestClassifier(n_estimators=100))])
pipe_ada_cvec = Pipeline([('cvec', CountVectorizer()),                          
                          ('ada', AdaBoostClassifier(n_estimators=100))])
pipe_ada_tfidf = Pipeline([('tfidf', TfidfVectorizer()),                           
                           ('ada', AdaBoostClassifier(n_estimators = 100))])
pipe_svc_cvec = Pipeline([('cvec', CountVectorizer()),
                           ('svc', SVC(gamma = 'auto',
                                       random_state = 42))])
pipe_svc_tfidf = Pipeline([('tfidf', TfidfVectorizer()),                    
                           ('svc', SVC(gamma = 'auto',
                                       random_state = 42))])
# Instantiating vectorizer parameters
cvec_params = {'cvec__stop_words':[None,'english'],
               'cvec__ngram_range':[(1,1),(1,2),(1,3)]}
tfidf_params = {'tfidf__stop_words': [None, 'english'],
                'tfidf__ngram_range': [(1,1),(1,2),(1,3)]}

# Random Forest GridSearches
grid_rf_cvec = GridSearchCV(pipe_rf_cvec, cvec_params,cv = 5)
grid_rf_tfidf = GridSearchCV(pipe_rf_tfidf, tfidf_params, cv = 5)


# Adaboost GridSearches
grid_ada_cvec = GridSearchCV(pipe_ada_cvec, cvec_params, cv = 5)
grid_ada_tfidf = GridSearchCV(pipe_ada_tfidf, tfidf_params, cv = 5)

# SVC GridSearch
grid_svc_cvec = GridSearchCV(pipe_svc_cvec,cvec_params, cv = 5)
grid_svc_tfidf = GridSearchCV(pipe_svc_tfidf, tfidf_params, cv = 5)


In [21]:
models = [grid_rf_cvec, grid_rf_tfidf, grid_ada_cvec, 
          grid_ada_tfidf, grid_svc_cvec, grid_svc_tfidf]
model_names = ['CountVectorized Random Forest','TFIDF Random Forest',
               'CountVectorized Adaboost','TFIDF Adaboost', 'CountVectorized SVC',
               'TFIDF SVC']

# loops through each gridsearch and prints out accuracy scores and parameters for the best estimator
for (model, model_name) in zip(models, model_names):
    model.fit(X_train,y_train)
    print(f'{model_name}')
    print(f'best params: {model.best_params_}')
    print(f'best estimator train score: {model.best_estimator_.score(X_train,y_train)}')
    print(f'best estimator test score: {model.best_estimator_.score(X_test,y_test)}')
    print()

CountVectorized Random Forest
best params: {'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english'}
best estimator train score: 1.0
best estimator test score: 0.9475138121546961

TFIDF Random Forest
best params: {'tfidf__ngram_range': (1, 1), 'tfidf__stop_words': 'english'}
best estimator train score: 1.0
best estimator test score: 0.9410681399631676

CountVectorized Adaboost
best params: {'cvec__ngram_range': (1, 1), 'cvec__stop_words': None}
best estimator train score: 0.9904849600982197
best estimator test score: 0.9732965009208103

TFIDF Adaboost
best params: {'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': None}
best estimator train score: 0.994475138121547
best estimator test score: 0.9650092081031307

CountVectorized SVC
best params: {'cvec__ngram_range': (1, 1), 'cvec__stop_words': None}
best estimator train score: 0.8406998158379374
best estimator test score: 0.8406998158379374

TFIDF SVC
best params: {'tfidf__ngram_range': (1, 1), 'tfidf__stop_words': None}
best estima

All of the tree-based models performed better than the baseline. The random forests scored 1.0 on the training data and roughly 0.94 to 0.95 on the test data, which indicates some overfitting, while the adaboost models show only a small gap between training and test scores. The support vector machine for classification was the exception: the CountVectorized SVC scored 0.8407 on both the training and test data, which is the baseline itself and suggests that the model predicts the majority class for every tweet. The parameters used for the models were generally the default values, and the SVC score could likely be improved by adjusting the regularization, choosing a better kernel method, and fine-tuning the rest of the parameters. The CountVectorized adaboost model achieved the best test score (0.973) with only a small drop from its training score (0.990). The support vector machine had the smallest difference between training and test scores for both the CountVectorized set and the TfidfVectorized set, but only as a consequence of the `gamma` setting used here (`'auto'`). Below I will test the effect on the training and test scores of the SVC when `gamma` is changed to `'scale'`, which allows the model to better capture the complexity of the data by taking into account the variance of the independent variables as opposed to just the inverse of the number of features.
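
The two `gamma` settings can be compared by hand. The sketch below applies the formulas given in the scikit-learn documentation, `1 / n_features` for `'auto'` and `1 / (n_features * X.var())` for `'scale'`, to a toy count matrix. The matrix and the function names are assumptions for illustration, and this is not sklearn's internal code.

```python
from statistics import pvariance

def gamma_auto(X):
    """gamma='auto': 1 / n_features."""
    return 1.0 / len(X[0])

def gamma_scale(X):
    """gamma='scale': 1 / (n_features * variance over all entries of X)."""
    flat = [value for row in X for value in row]
    return 1.0 / (len(X[0]) * pvariance(flat))

# Toy document-term count matrix: 3 documents, 4 terms (assumed data).
X_toy = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]

print(gamma_auto(X_toy), gamma_scale(X_toy))
```

Bag-of-words matrices are mostly zeros, so their overall variance is usually well below 1, and `'scale'` then yields a larger `gamma` than `'auto'`. With thousands of features, `'auto'` can make `gamma` so small that the RBF kernel barely distinguishes one tweet from another, which is one plausible explanation for the baseline-level score.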

In [22]:
# Instantiating Pipelines for potential models
pipe_svc_cvec = Pipeline([('cvec', CountVectorizer()),
                           ('svc', SVC(gamma = 'scale',
                                       random_state = 42))])
pipe_svc_tfidf = Pipeline([('tfidf', TfidfVectorizer()),                    
                           ('svc', SVC(gamma = 'scale',
                                       random_state = 42))])
# Instantiating vectorizer parameters
cvec_params = {'cvec__stop_words':[None,'english'],
               'cvec__ngram_range':[(1,1),(1,2),(1,3)]}
tfidf_params = {'tfidf__stop_words': [None, 'english'],
                'tfidf__ngram_range': [(1,1),(1,2),(1,3)]}

# SVC GridSearch
grid_svc_cvec = GridSearchCV(pipe_svc_cvec,cvec_params, cv = 5)
grid_svc_tfidf = GridSearchCV(pipe_svc_tfidf, tfidf_params, cv = 5)


In [23]:
models = [grid_svc_cvec, grid_svc_tfidf] # trimmed down to only run SVC grid searches
model_names = ['CountVectorized SVC', 'TFIDF SVC'] # trimmed down to only run SVC grid searches

for (model, model_name) in zip(models, model_names):
    
    model.fit(X_train,y_train)
    print(f'{model_name}')
    print(f'best params: {model.best_params_}')
    print(f'best estimator train score: {model.best_estimator_.score(X_train,y_train)}')
    print(f'best estimator test score: {model.best_estimator_.score(X_test,y_test)}')
    print()

CountVectorized SVC
best params: {'cvec__ngram_range': (1, 1), 'cvec__stop_words': None}
best estimator train score: 0.9953959484346224
best estimator test score: 0.9511970534069981

TFIDF SVC
best params: {'tfidf__ngram_range': (1, 1), 'tfidf__stop_words': 'english'}
best estimator train score: 0.9978514426028238
best estimator test score: 0.9429097605893186



After changing `gamma` to `'scale'`, both the CountVectorized SVC and the TFIDF SVC improved beyond the baseline. However, the adaboost models still yielded the highest accuracy scores on the test data and showed a smaller gap between training and test scores, and were therefore a better fit to the data than the support vector machines.
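
The comparison can be made concrete by computing the train-test gap for each model. The scores below are copied from the grid search outputs in this notebook; only the dictionary layout and variable names are assumptions.

```python
# (train score, test score) copied from the grid search outputs above.
scores = {
    'CountVectorized Adaboost': (0.9904849600982197, 0.9732965009208103),
    'TFIDF Adaboost': (0.994475138121547, 0.9650092081031307),
    'CountVectorized SVC': (0.9953959484346224, 0.9511970534069981),
    'TFIDF SVC': (0.9978514426028238, 0.9429097605893186),
}

# A larger gap means the model generalizes less well from training to test data.
gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name}: train-test gap = {gap:.4f}')
```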

In [24]:
# Confusion Matrix for Count Vectorized Adaboost
ada_test_preds = grid_ada_cvec.best_estimator_.predict(X_test)

tn,fp,fn,tp = confusion_matrix(y_test,ada_test_preds).ravel()
sens = tp/(tp+fn)
spec = tn/(tn+fp)
acc = (tp+tn)/(tn+tp+fp+fn)
false_neg_rate = fn/(tp+fn)  # share of actual positives the model missed
print(f'false pos:{fp}, false neg:{fn}')
print(f'true pos: {tp}, true neg: {tn}')
print(f'sensitivity: {sens}, specificity: {spec}')
print(f'false_neg_rate: {false_neg_rate}')

false pos:4, false neg:25
true pos: 148, true neg: 909
sensitivity: 0.8554913294797688, specificity: 0.9956188389923329
false_neg_rate: 0.14450867052023122


In [25]:
# Confusion Matrix for Tfidf Vectorized Adaboost
ada_test_preds = grid_ada_tfidf.best_estimator_.predict(X_test)

tn,fp,fn,tp = confusion_matrix(y_test,ada_test_preds).ravel()
sens = tp/(tp+fn)
spec = tn/(tn+fp)
acc = (tp+tn)/(tn+tp+fp+fn)
false_neg_rate = fn/(tp+fn)  # share of actual positives the model missed
print(f'false pos:{fp}, false neg:{fn}')
print(f'true pos: {tp}, true neg: {tn}')
print(f'sensitivity: {sens}, specificity: {spec}')
print(f'false_neg_rate: {false_neg_rate}')

false pos:8, false neg:30
true pos: 143, true neg: 905
sensitivity: 0.8265895953757225, specificity: 0.9912376779846659
false_neg_rate: 0.17341040462427745


Sensitivity (0.86 and 0.83) and specificity (above 0.99 in both cases) are also reasonable for the CountVectorized adaboost and the TfidfVectorized adaboost. Upon comparing them to the sensitivity and specificity of the support vector machine...

In [26]:
# Confusion Matrix for Count Vectorized SVC
svc_test_preds = grid_svc_cvec.best_estimator_.predict(X_test)

tn,fp,fn,tp = confusion_matrix(y_test,svc_test_preds).ravel()
sens = tp/(tp+fn)
spec = tn/(tn+fp)
acc = (tp+tn)/(tn+tp+fp+fn)
false_neg_rate = fn/(tp+fn)  # share of actual positives the model missed
print(f'false pos:{fp}, false neg:{fn}')
print(f'true pos: {tp}, true neg: {tn}')
print(f'sensitivity: {sens}, specificity: {spec}')
print(f'false_neg_rate: {false_neg_rate}')

false pos:1, false neg:52
true pos: 121, true neg: 912
sensitivity: 0.6994219653179191, specificity: 0.9989047097480832
false_neg_rate: 0.30057803468208094


In [27]:
# Confusion Matrix for Tfidf SVC
svc_test_preds = grid_svc_tfidf.best_estimator_.predict(X_test)

tn,fp,fn,tp = confusion_matrix(y_test,svc_test_preds).ravel()
sens = tp/(tp+fn)
spec = tn/(tn+fp)
acc = (tp+tn)/(tn+tp+fp+fn)
false_neg_rate = fn/(tp+fn)  # share of actual positives the model missed
print(f'false pos:{fp}, false neg:{fn}')
print(f'true pos: {tp}, true neg: {tn}')
print(f'sensitivity: {sens}, specificity: {spec}')
print(f'false_neg_rate: {false_neg_rate}')

false pos:0, false neg:62
true pos: 111, true neg: 913
sensitivity: 0.6416184971098265, specificity: 1.0
false_neg_rate: 0.3583815028901734


... it is clear that the CountVectorized adaboost model is better for the classification: its false negative rate of 0.14 is less than half that of the CountVectorized support vector machine (0.30), so it will be better at properly classifying traffic incidents and emergencies and at minimizing false negatives. Therefore, the CountVectorized adaboost model will be the one used to classify tweets when using live twitter data.
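
The four confusion-matrix cells in this section repeat the same calculations, so they could be collected in one small helper. The sketch below uses a hypothetical function name, `binary_metrics`, and checks it against the counts reported for the CountVectorized adaboost model.

```python
def binary_metrics(tn, fp, fn, tp):
    """Summary rates from the four cells of a binary confusion matrix."""
    return {
        'sensitivity': tp / (tp + fn),       # true positive rate
        'specificity': tn / (tn + fp),       # true negative rate
        'false_neg_rate': fn / (tp + fn),    # equals 1 - sensitivity
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
    }

# Counts reported above for the CountVectorized adaboost model.
ada_cvec = binary_metrics(tn=909, fp=4, fn=25, tp=148)
print(ada_cvec)
```

The argument order matches `confusion_matrix(y_test, preds).ravel()`, so the helper could be called as `binary_metrics(*confusion_matrix(y_test, preds).ravel())`.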

### Limitations & Conclusions

Because we assumed that the positive class is any tweet scraped from the BayCountTMC, fl511_panhandl, and WJHG_TV accounts, the model is technically built to distinguish tweets from those accounts from tweets written by anyone else. Therefore, the model may not perform well on live tweets, which will come from all sorts of accounts and will not share the specific language of the preformatted announcements and tweets from the accounts we deemed the positive class.<br>
Despite this, the model that will be used to classify the live tweets for this project is an adaboost model with 100 estimators (`n_estimators=100`), fit on a CountVectorized corpus of unigrams after cleaning, lemmatization, and removal of the nltk stopwords.