## Predicting County-Level Mask Wearing

This model is an initial prototype that attempts to forecast mask wearing based on the interaction of static determinants of health behaviors, such as demographic, economic, and social indiacotrs, with information exposure from news and social media. The long-term goal would be to develop a model that is responsive to a changing information environment. Specifically, can we measure how changing information in news and media influences peoples' decision to wear a mask or not at the county-level?

The secondary goal of this model is to develop a methodology for integrating realtime information flows with contextual information to predict and/or forecast how different groups of people might change their attitudes, beliefs, and/or behaviors based on an evolving information ecosystem. This type of work could be useful to quickly detect any potential changes in human behavior and help, for example, public health practioners to better allocated resources, design more targeted health communication campaigns, etc. 

This MVP uses as its feature set numeric inputs from the CDC Social Vulnerability Index, Measure of America Youth Disconnection Index, and Apple mobility data at the county-level, combined with county-level geolocated tweets and state geotagged Covid-related news. 




- The target for this model is binary classification about whether more or less than 50 percent of the population for each county wear's a mask. It's derived from the New York Times July 2020 survey into mask wearing.


In [124]:
#!pip install tensorflow-hub
import tensorflow_hub as hub



from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM
from keras.layers import GlobalMaxPooling1D
from keras.models import Model
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input
from keras.layers.merge import Concatenate

import pandas as pd
import numpy as np
import re
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn import preprocessing
from keras.losses import BinaryCrossentropy

tf.keras.backend.set_floatx('float64')

pd.set_option('max_columns', 500)

#!pip install talos

#!pip install spacy

#!pip install scikeras

from scikeras.wrappers import KerasClassifier

In [3]:
#read in and process data
df = pd.read_csv('/home/aschharwood/notebooks/covid/feature_target_v1_county_nyt_cdc_moa_tweets_gdelt_apple.csv')

df.drop('Unnamed: 0', axis=1, inplace=True)

#df.info(max_cols=500)

mean = df._get_numeric_data().mean()

#fill empty mobility data with column mean
df.fillna(mean, inplace=True)

#df.head()

df['COUNTYFP'] = df['COUNTYFP'].astype('string')

#df.describe()

import numpy as np

#create binary target
df['always_binary'] = np.where(df['ALWAYS']>.50, 1, 0)

df['always_binary'].value_counts()
df.fillna(df._get_numeric_data().mean(), inplace=True)

In [4]:
#clean up text data
df['text_tokens_str'] = df['text_tokens_str'].replace(r'\n',' ', regex=True)
df['High'] = df['High'].replace(r'\n',' ', regex=True)
df['Low'] = df['Low'].replace(r'\n',' ', regex=True)

In [5]:
#discard unneeded mask wearing columns
df.drop(['NEVER', 'RARELY', 'SOMETIMES', 'FREQUENTLY', 'ALWAYS', 'FIPS'], axis=1, inplace=True)

In [6]:
y = df['always_binary']
X = df.drop('always_binary', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [10]:
#grab spacy's language model
#!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz


In [8]:
#import and load spacy language model
import en_core_web_sm

In [9]:
nlp = en_core_web_sm.load(disable=['tagger', 'parser', 'ner'])

In [63]:
#help text tokenization function
def tokenizer(text, nlp):

    token_list = []
    doc = nlp(text)
    for token in doc:
        if token.is_stop == False and token.is_punct==False and token.like_url==False:
            if token.text != ' ':
                token_list.append((token.lemma_).lower())
    str_tokens = ' '.join(token_list)
    return str_tokens

## Googles News Stacked



In [136]:
y = df['always_binary']
X = df.drop('always_binary', axis=1)

In [137]:
cats = pd.get_dummies(X[['COUNTYFP', 'ST_ABBR']])

In [138]:
X = pd.concat([X, cats], axis=1)

In [139]:
X['High'] = X['High'].apply(lambda x: tokenizer(x, nlp))

In [140]:
X['Low'] = X['Low'].apply(lambda x: tokenizer(x, nlp))
X['text_tokens_str'] = X['text_tokens_str'].apply(lambda x: tokenizer(x, nlp))

In [195]:
X.shape

(3142, 220)

In [196]:
y.shape

(3142,)

In [200]:
pd.concat([X, y], axis=1).to_csv('feature_target_text_processed_1_20_20.csv')

In [15]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

X2_test = X_test.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

#X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR'], axis=1)
scaler = MinMaxScaler()
X2_train = scaler.fit_transform(X2_train)


X2_test = scaler.transform(X2_test)

#download and set word vector pretrained model
#embedding = "https://tfhub.dev/google/nnlm-en-dim128/2"
#hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)

embedding = 'https://tfhub.dev/tensorflow/cord-19/swivel-128d/3'

print('Google news model downlownded')

X_News_High_train = X_train['High']
X_News_High_test = X_test['High']

X_News_Low_train = X_train['Low']
X_News_Low_test = X_test['Low']

X_tweet_train = X_train['text_tokens_str']
X_tweet_test = X_test['text_tokens_str']

In [33]:


#numeric input
num_input = Input(shape=(214,))
dense_layer_1_num = Dense(10, activation='relu')(num_input)
batch_out_num = tf.keras.layers.BatchNormalization()(dense_layer_1_num) 
num_output = Dense(10, activation='relu')(batch_out_num)

#tweets low
tweets = Input(shape=[], dtype=tf.string)
hub_layer_tw = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(tweets)
dense_1_tw = Dense(16, activation='relu')(hub_layer_tw)
dense_2_tw = Dense(8, activation='relu')(dense_1_tw)
batch_out_tweets = tf.keras.layers.BatchNormalization()(dense_2_tw) 
tw_output = Dense(4, activation='relu')(batch_out_tweets)
#tw_output = Dropout(0.5)(dense_3_tw)

#news low
news_low = Input(shape=[], dtype=tf.string)
hub_layer_nl = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(news_low)
dense_1_nl = Dense(16, activation='relu')(hub_layer_nl)
dense_2_nl = Dense(8, activation='relu')(dense_1_nl)
batch_out_nl = tf.keras.layers.BatchNormalization()(dense_2_nl) 
nl_output = Dense(4, activation='relu')(batch_out_nl)
#nl_output = Dropout(0.5)(dense_3_nl)

#news high
news_high = Input(shape=[], dtype=tf.string)
hub_layer_news_high = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(news_high)
dense_1_news_high = Dense(16, activation='relu')(hub_layer_news_high)
dense_2_nh = Dense(8, activation='relu')(dense_1_news_high)
batch_out_nh = tf.keras.layers.BatchNormalization()(dense_2_nh) 
nh_output = Dense(4, activation='relu')(batch_out_nh)
#nh_output = Dropout(0.5)(dense_3_nh)


#concat layer takes output layers from tweet and num models, which can be passed to other models
concat_layer = Concatenate()([num_output, tw_output, nl_output, nh_output])
# dense_1_cl = Dense(100, activation='relu')(concat_layer)
# dense_2_cl = Dense(80, activation='relu')(dense_1_cl)
# dense_3_cl = Dense(40, activation='relu')(dense_2_cl)
# dense_4_cl = Dense(40, activation='relu')(dense_3_cl)
output = Dense(1, activation='sigmoid')(concat_layer)
model = Model(inputs=[num_input, tweets, news_low, news_high], outputs=output)

optimizer = tf.keras.optimizers.Adam(lr=0.01)
model.compile(optimizer=optimizer,
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
print('model defined and compiled')



validation_data=([X2_test, X_tweet_test, X_News_Low_test, X_News_High_test], y_test)

print('training model')
history = model.fit(x=[X2_train, X_tweet_train, X_News_Low_train, X_News_High_train], y=y_train, epochs=5, verbose=1, validation_data=validation_data)
print('model training complete')

# KFold Validation

In [143]:
from sklearn.model_selection import StratifiedKFold

In [141]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

X2_test = X_test.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

#X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR'], axis=1)
scaler = MinMaxScaler()
X2_train = scaler.fit_transform(X2_train)


X2_test = scaler.transform(X2_test)

#download and set word vector pretrained model
#embedding = "https://tfhub.dev/google/nnlm-en-dim128/2"
#hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)

embedding = 'https://tfhub.dev/tensorflow/cord-19/swivel-128d/3'

print('Google news model downlownded')

X_News_High_train = X_train['High']
X_News_High_test = X_test['High']

X_News_Low_train = X_train['Low']
X_News_Low_test = X_test['Low']

X_tweet_train = X_train['text_tokens_str']
X_tweet_test = X_test['text_tokens_str']

Google news model downlownded


In [150]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [152]:
callbacks = [EarlyStopping(monitor='val_loss', patience=2)]

In [147]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)


In [155]:
cvscores = []


In [156]:
kfold = StratifiedKFold(n_splits=3, shuffle=True)
cvscores = []
for train, test in kfold.split(X, y):
    X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

    X2_test = X_test.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

    #X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR'], axis=1)
    scaler = MinMaxScaler()
    X2_train = scaler.fit_transform(X2_train)


    X2_test = scaler.transform(X2_test)

    #download and set word vector pretrained model
    #embedding = "https://tfhub.dev/google/nnlm-en-dim128/2"
    #hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)

    embedding = 'https://tfhub.dev/tensorflow/cord-19/swivel-128d/3'

    print('Google news model downlownded')

    X_News_High_train = X_train['High']
    X_News_High_test = X_test['High']

    X_News_Low_train = X_train['Low']
    X_News_Low_test = X_test['Low']

    X_tweet_train = X_train['text_tokens_str']
    X_tweet_test = X_test['text_tokens_str']
    
    #numeric input
    num_input = Input(shape=(214,))
    dense_layer_1_num = Dense(10, activation='relu')(num_input)
    batch_out_num = tf.keras.layers.BatchNormalization()(dense_layer_1_num) 
    num_output = Dense(10, activation='relu')(batch_out_num)

    #tweets low
    tweets = Input(shape=[], dtype=tf.string)
    hub_layer_tw = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(tweets)
    dense_1_tw = Dense(16, activation='relu')(hub_layer_tw)
    dense_2_tw = Dense(8, activation='relu')(dense_1_tw)
    batch_out_tweets = tf.keras.layers.BatchNormalization()(dense_2_tw) 
    tw_output = Dense(4, activation='relu')(batch_out_tweets)
    #tw_output = Dropout(0.5)(dense_3_tw)

    #news low
    news_low = Input(shape=[], dtype=tf.string)
    hub_layer_nl = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(news_low)
    dense_1_nl = Dense(16, activation='relu')(hub_layer_nl)
    dense_2_nl = Dense(8, activation='relu')(dense_1_nl)
    batch_out_nl = tf.keras.layers.BatchNormalization()(dense_2_nl) 
    nl_output = Dense(4, activation='relu')(batch_out_nl)
    #nl_output = Dropout(0.5)(dense_3_nl)

    #news high
    news_high = Input(shape=[], dtype=tf.string)
    hub_layer_news_high = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(news_high)
    dense_1_news_high = Dense(16, activation='relu')(hub_layer_news_high)
    dense_2_nh = Dense(8, activation='relu')(dense_1_news_high)
    batch_out_nh = tf.keras.layers.BatchNormalization()(dense_2_nh) 
    nh_output = Dense(4, activation='relu')(batch_out_nh)
    #nh_output = Dropout(0.5)(dense_3_nh)


    #concat layer takes output layers from tweet and num models, which can be passed to other models
    concat_layer = Concatenate()([num_output, tw_output, nl_output, nh_output])
    # dense_1_cl = Dense(100, activation='relu')(concat_layer)
    # dense_2_cl = Dense(80, activation='relu')(dense_1_cl)
    # dense_3_cl = Dense(40, activation='relu')(dense_2_cl)
    # dense_4_cl = Dense(40, activation='relu')(dense_3_cl)
    output = Dense(1, activation='sigmoid')(concat_layer)
    model = Model(inputs=[num_input, tweets, news_low, news_high], outputs=output)

    optimizer = tf.keras.optimizers.Adam(lr=0.01)
    model.compile(optimizer=optimizer,
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                 metrics=['accuracy'])
    model.fit(x=[X2_train, X_tweet_train, X_News_Low_train, X_News_High_train], y=y_train, epochs=3, callbacks=callbacks, verbose=1)
    validation_data=([X2_test, X_tweet_test, X_News_Low_test, X_News_High_test], y_test)
    scores = model.evaluate([X2_test, X_tweet_test, X_News_Low_test, X_News_High_test], y_test)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1]*100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

Google news model downlownded
Epoch 1/3
Epoch 2/3
Epoch 3/3
accuracy: 79.33%
Google news model downlownded
Epoch 1/3
Epoch 2/3
Epoch 3/3
accuracy: 75.68%
Google news model downlownded
Epoch 1/3
Epoch 2/3
Epoch 3/3
accuracy: 80.13%


NameError: name 'numpy' is not defined

In [157]:
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

78.38% (+/- 1.94%)


# Hyperparamter Tuning with Keras Tuner and Grid Search

In [163]:
X_num = X.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)
y = y

In [176]:
scaler = MinMaxScaler()
X_num= scaler.fit_transform(X_num)

In [199]:
X_num

array([[6.77821685e-03, 1.78738771e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [6.34264212e-03, 2.02634647e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.79317429e-03, 4.07447947e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.42862392e-02, 2.03347661e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.53644232e-02, 7.97585497e-04, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.64589410e-02, 6.95683898e-04, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])

In [159]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV


In [177]:
def create_model(optimizer='rmsprop'):
    num_input = Input(shape=(214,))
    dense_layer_1_num = Dense(10, activation='relu')(num_input)
    num_output = Dense(1, activation='sigmoid')(dense_layer_1_num)
    model = Model(inputs=[num_input], outputs=num_output)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr=.01),
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                 metrics=['accuracy'])
    return model

In [178]:
neural_network = KerasClassifier(build_fn=create_model, verbose=1)
# Create hyperparameter space
epochs = [5, 7]
batches = [None, 100, 500]
#optimizers = [tf.keras.optimizers.Adam(lr=.01), tf.keras.optimizers.Adam(lr=.001)]

# Create hyperparameter options
hyperparameters = dict(epochs=epochs, batch_size=batches)

In [179]:
# Create grid search
grid = GridSearchCV(estimator=neural_network, cv=3, param_grid=hyperparameters)

# Fit grid search
grid_result = grid.fit(X_num, y)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7


In [180]:
grid_result.best_params_

{'batch_size': None, 'epochs': 7}

# Model - Num Only

- this model performs reasonably well with just the static socioeconomic and demographic data. 

In [193]:
num_input = Input(shape=(214,))
dense_layer_1_num = Dense(10, activation='relu')(num_input)
num_output = Dense(1, activation='sigmoid')(dense_layer_1_num)
model = Model(inputs=[num_input], outputs=num_output)


optimizer = tf.keras.optimizers.Adam(lr=.01)
model.compile(optimizer=optimizer,
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
print('model defined and compiled')


validation_data=([X2_test], y_test)
x=[X2_train]

print('training model')
history = model.fit(x=X_num, y=y, epochs=15, verbose=1, validation_split=0.2)
print('model training complete')

model defined and compiled
training model
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
model training complete


In [70]:
model.summary()

Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 214)]             0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2150      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11        
Total params: 2,161
Trainable params: 2,161
Non-trainable params: 0
_________________________________________________________________


# Model Debug - 


In [37]:
embedding = 'https://tfhub.dev/tensorflow/cord-19/swivel-128d/3'
tweets = Input(shape=[], dtype=tf.string)
hub_layer_tw = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(tweets)
dense_1_tw = Dense(16, activation='relu')(hub_layer_tw)
dense_2_tw = Dense(8, activation='relu')(dense_1_tw)
tw_output = Dense(1, activation='sigmoid')(dense_2_tw)

model = Model(inputs=[tweets], outputs=tw_output)


optimizer = tf.keras.optimizers.Adam(lr=.01)
model.compile(optimizer=optimizer,
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
print('model defined and compiled')


validation_data=([X_tweet_test], y_test)
x=[X_tweet_train]

print('training model')
history = model.fit(x=x, y=y_train, epochs=10, verbose=1, validation_data=validation_data)
print('model training complete')









model defined and compiled
training model
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
model training complete


## Model debug - stacke

In [40]:
embedding = 'https://tfhub.dev/tensorflow/cord-19/swivel-128d/3'


#numeric input
num_input = Input(shape=(214,))
dense_layer_1_num = Dense(10, activation='relu')(num_input)
batch_out_num = tf.keras.layers.BatchNormalization()(dense_layer_1_num) 
num_output = Dense(10, activation='relu')(batch_out_num)

#tweets low
tweets = Input(shape=[], dtype=tf.string)
hub_layer_tw = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)(tweets)
dense_1_tw = Dense(100, activation='relu')(hub_layer_tw)
batch_out_tw = tf.keras.layers.BatchNormalization()(dense_1_tw) 
tw_output = Dense(32, activation='relu')(batch_out_tw)
#tw_output = Dropout(0.5)(dense_3_tw)


#concat layer takes output layers from tweet and num models, which can be passed to other models
concat_layer = Concatenate()([num_output, tw_output])

output = Dense(1, activation='sigmoid')(concat_layer)
model = Model(inputs=[num_input, tweets], outputs=output)



optimizer = tf.keras.optimizers.Adam(lr=.01)
model.compile(optimizer=optimizer,
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
print('model defined and compiled')


validation_data=([X2_test, X_tweet_test], y_test)
x=[X2_train, X_tweet_train]

print('training model')
history = model.fit(x=x, y=y_train, epochs=5, verbose=1, validation_data=validation_data)
print('model training complete')









model defined and compiled
training model
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
model training complete


In [44]:
#inital preprocessing for text using GloVe. Discarded in favor of a covid-pretrained dataset
def preprocess_text(sen):

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence



X1_train = []
sentences = list(X_train['text_tokens_str'])
for sen in sentences:
    X1_train.append(preprocess_text(sen))
    
X1_test = []
sentences = list(X_test["text_tokens_str"])
for sen in sentences:
    X1_test.append(preprocess_text(sen))
    
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X1_train)

X1_train = tokenizer.texts_to_sequences(X1_train)
X1_test = tokenizer.texts_to_sequences(X1_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X1_train = pad_sequences(X1_train, padding='post', maxlen=maxlen)
X1_test = pad_sequences(X1_test, padding='post', maxlen=maxlen)

from numpy import array
from numpy import asarray
from numpy import zeros

In [46]:
embeddings_dictionary = dict()

glove_file = open('/home/aschharwood/notebooks/covid/notebooks/glove_tweets/glove.twitter.27B.100d.txt')

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [47]:
X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

X2_test = X_test.drop(['COUNTYFP', 'ST_ABBR', 'text_tokens_str', 'High', 'Low'], axis=1)

#X2_train = X_train.drop(['COUNTYFP', 'ST_ABBR'], axis=1)
scaler = MinMaxScaler()
X2_train = scaler.fit_transform(X2_train)


X2_test = scaler.transform(X2_test)

In [48]:
tweet_input = Input(shape=(maxlen,))
num_input = Input(shape=(163,))

In [49]:
#text model
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(tweet_input)
LSTM_Layer_1 = LSTM(128)(embedding_layer)

#num model
dense_layer_1 = Dense(10, activation='relu')(num_input)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)

In [51]:
#concat layer takes output layers from tweet and num models, which can be passed to other models
concat_layer = Concatenate()([LSTM_Layer_1, dense_layer_2])
dense_layer_3 = Dense(10, activation='relu')(concat_layer)
output = Dense(1, activation='sigmoid')(dense_layer_3)
model = Model(inputs=[tweet_input, num_input], outputs=output)

In [52]:
model.compile(optimizer='adam',
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
print('model defined and compiled')

model defined and compiled


In [53]:
print('training model')
history = model.fit(x=[X1_train, X2_train], y=y_train, epochs=3, verbose=1, validation_data=([X1_test, X2_test], y_test))
print('model training complete')

training model
Epoch 1/3




ValueError: in user code:

    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:806 train_function  *
        return step_function(self, iterator)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:796 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2945 _call_for_each_replica
        return fn(*args, **kwargs)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:789 run_step  **
        outputs = model.train_step(data)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:747 train_step
        y_pred = self(x, training=True)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:985 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py:386 call
        inputs, training=training, mask=mask)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py:508 _run_internal_graph
        outputs = node.layer(*args, **kwargs)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:976 __call__
        self.name)
    /anaconda/envs/covid_env/lib/python3.7/site-packages/tensorflow/python/keras/engine/input_spec.py:216 assert_input_compatibility
        ' but received input with shape ' + str(shape))

    ValueError: Input 0 of layer dense_76 is incompatible with the layer: expected axis -1 of input shape to have value 163 but received input with shape [None, 214]
