# 1) Setting up the foundation
link:https://colab.research.google.com/drive/1Y-0iiPquMVccla4cc0ejlZb5spZk523g


In [1]:
#Importing standard packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import keras
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
#Importing keras tools needed for creating the network
from keras.layers import Dense
from keras.models import Sequential
model = Sequential()




In [3]:
#Setting up a basic model with two hidden layers
model = Sequential()
model.add(Dense(100, input_shape=(100,), activation='relu'))
model.add(Dense(100, activation='relu'))
#Output layer: the label is either true or false, so a single sigmoid unit is used.
model.add(Dense(1, activation='sigmoid'))
#Compiling the model with binary_crossentropy, since this is a binary classification problem.
#It's the same reason we are using sigmoid for the final activation.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())





Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 20,301
Trainable params: 20,301
Non-trainable params: 0
_________________________________________________________________
None
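
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input/output pair plus one bias per output unit. A minimal sketch of that arithmetic (the helper name `dense_params` is my own, not part of Keras):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer sizes taken from the model above: 100 inputs -> 100 -> 100 -> 1
layer_sizes = [100, 100, 100, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [10100, 10100, 101]
print(total)   # 20301
```

These match the `Param #` column and the `Total params` line printed by `model.summary()`.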


# 2) Preparing the data

In [0]:
#Loading the data
data=pd.read_csv('https://raw.githubusercontent.com/DeepLearnI/trump_tweet_classifier/master/code/tweet_labels.csv')
df=pd.DataFrame(data=data)

In [5]:
#Looking at the data
df.iloc[0,0]

'To every one of the HEROES we recognized today — THANK YOU and God Bless You All!pic.twitter.com/JWKwylpdiO'

I chose not to strip the links from the tweets, since a link could be something the real Trump tweets include and the fakes don't, or vice versa.

In [0]:
#Importing tools needed to preprocess the data to get it ready to vectorize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [0]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['tweet'])
sequences = tokenizer.texts_to_sequences(df['tweet'])

In [8]:
#Viewing the tweets, now converted to sequences of word indices
print(sequences)

[[2, 321, 100, 4, 1, 870, 16, 2915, 113, 1090, 77, 25, 3, 561, 816, 25, 30, 76, 38, 31, 9782], [16, 21, 48, 1503, 42, 9783, 7, 1647, 3, 9, 23, 22, 3240, 2, 13, 1, 106, 4, 14, 20, 2063, 1, 2205, 15, 1648, 134, 1, 204, 1319, 630, 16, 21, 48, 3437, 14, 704, 2364, 68, 28, 150, 81, 7, 1, 204, 3944, 95], [1, 162, 223, 55, 6, 35, 2460, 83, 2, 1, 5262, 4, 1, 2780, 89, 19, 11, 154, 1399, 2, 216, 1, 36, 186, 7, 678, 47, 9, 12, 72, 314, 345, 11, 282, 635], [1, 153, 4, 3945, 7, 1, 195, 109, 4, 85, 55, 6, 1649, 10, 2365, 83, 4, 1, 89, 42, 1341, 66, 7341, 4, 1, 229, 202, 19, 11, 286, 3, 19, 11, 1187, 2461, 28, 190, 158, 14, 39, 6, 82, 249, 51, 754, 239], [5, 622, 4, 61, 57, 6, 78, 689, 9, 10, 7342, 1, 936, 3, 2064, 7343, 15, 5, 548, 1541, 599, 250, 17, 1, 4265, 24, 237, 6, 521, 10, 272, 266, 18, 6, 88, 2, 735, 3, 449, 47, 7, 24, 390, 10, 1400, 22, 2, 1, 450, 89, 1874, 2, 1574], [1, 56, 159, 26, 2206, 62, 7, 1, 144, 347, 342, 347, 95, 4, 218, 2291, 68, 609, 150, 159, 7, 1, 163, 4, 14, 39, 16, 65, 716
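
To show roughly where these integers come from, here is a simplified stand-in for what the tokenizer does: count the words, then give the most frequent word index 1, the next index 2, and so on. This sketch only lowercases and splits on whitespace (the real `Tokenizer` also filters punctuation), and the toy tweets and function names are made up for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can serve as the padding value later
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_tweets = ["the wall the wall the best", "the wall best people"]
word_index = fit_word_index(toy_tweets)
toy_sequences = texts_to_ids(toy_tweets, word_index)

print(word_index)     # {'the': 1, 'wall': 2, 'best': 3, 'people': 4}
print(toy_sequences)  # [[1, 2, 1, 2, 1, 3], [1, 2, 3, 4]]
```

This is why small numbers such as 1, 2 and 4 show up so often in the output above: they stand for the most common words in the data set.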

In [0]:
#I pad out the vectors so they have the same length
X = pad_sequences(sequences, maxlen=100)
y = df['labels']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)

In [11]:
#Viewing a single padded vector in the training data
X_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,   30,   36,   44, 2093,   46,
       1561,    8,  547,  778,    4,  183,  914, 2368,  418,  190,   63,
        333,  348,    4,  114,  708,  144,  381,    1,  235,    4,    9,
        952], dtype=int32)
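
The zeros sit on the left because `pad_sequences` pads (and truncates) at the front by default. A small NumPy sketch of that behaviour, assuming those defaults; `pad_left` is a hypothetical helper, not the Keras implementation:

```python
import numpy as np

def pad_left(seqs, maxlen):
    """Left-pad each sequence with zeros; keep only the last maxlen tokens."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]  # sequences that are too long lose their first tokens
        if kept:
            out[row, maxlen - len(kept):] = kept
    return out

padded = pad_left([[30, 36, 44], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(padded)
# [[ 0  0 30 36 44]
#  [ 2  3  4  5  6]]
```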

# 3) Tuning the basic model

Before tuning the basic model, I quickly train it and check its accuracy.

In [0]:
#To prevent overfitting, I limit the number of epochs with early stopping and use ModelCheckpoint to save the best version
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stop = EarlyStopping(monitor='val_acc', patience=10)
modelCheckpointBasic = ModelCheckpoint('first_trump_model.hdf5', save_best_only=True)
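
The patience rule can be sketched in plain Python: training stops once the monitored value has gone `patience` epochs in a row without beating its best so far. This is a simplified illustration, assuming higher is better (as for accuracy); the function name and the toy history are made up.

```python
def early_stop_epoch(val_acc_history, patience=10):
    """Return the epoch (1-based) at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_acc_history, start=1):
        if acc > best:
            best = acc   # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Best value at epoch 2, followed by 10 epochs that never beat it
toy_history = [0.50, 0.60] + [0.55] * 10
stop_epoch = early_stop_epoch(toy_history, patience=10)
print(stop_epoch)  # 12
```

Note that `ModelCheckpoint` with `save_best_only=True` tracks `val_loss` by default, so the saved file holds the model with the lowest validation loss, which is not necessarily the epoch with the highest validation accuracy.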

In [13]:
model.fit(x=X_train,y=y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stop,modelCheckpointBasic])




Train on 12278 samples, validate on 3070 samples
Epoch 1/100





Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100


<keras.callbacks.History at 0x7fadedea5748>

In [14]:
from keras.models import load_model
load_model('first_trump_model.hdf5').evaluate(X_test, y_test)



[7.113448850650353, 0.5537459283581774]
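
`evaluate` returns the loss first and then the metrics given at compile time, so the two numbers are the binary cross-entropy and the accuracy. A loss this large next to an accuracy near 55% is what one would expect if many predictions are confidently wrong, since cross-entropy punishes those heavily. A minimal sketch of the loss in plain Python (the clipping value `eps` is an assumption for this example):

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a list of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```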

In [0]:
#Now we can try to improve the model, starting with the hidden layers (number of layers, activation function, and number of nodes).
#To do this we first define a function that builds the model from those parameters
def create_model(activation='relu', nl=1, nn=50):
  model = Sequential()
  model.add(Dense(nn, input_shape=(100,), activation=activation))
  for i in range(nl):
    model.add(Dense(nn, activation=activation))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  return model
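
One detail worth spelling out: `create_model` always adds a first `Dense(nn)` layer and then `nl` more in the loop, so `nl=1` gives two hidden layers, the same shape as the basic model. A small sketch of the resulting layer widths and parameter counts (the helper names are hypothetical):

```python
def hidden_layer_widths(nl=1, nn=50):
    """One Dense(nn) layer on the input, then nl more from the loop."""
    return [nn] * (1 + nl)

def count_params(nl=1, nn=50, n_inputs=100):
    """Total weights and biases, including the single sigmoid output unit."""
    sizes = [n_inputs] + hidden_layer_widths(nl, nn) + [1]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

widths = hidden_layer_widths(nl=2, nn=100)
n_params = count_params(nl=2, nn=100)
print(widths)    # [100, 100, 100]
print(n_params)  # 30401
```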

In [16]:
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier

model = KerasClassifier(build_fn=create_model)

# Define the parameters that will be tested. 50, 100 and 200 nodes are arbitrarily selected, and so is the number of layers.
# The search is kept this small so that it finishes in a reasonable time.
params = {'activation':['relu', 'tanh'], 'nl':[1,2,3,4], 
          'nn':[50, 100, 200],'epochs':[100]}

random_search = RandomizedSearchCV(model, param_distributions=params, cv=3)
random_search.fit(X_train,y_train,validation_data=(X_test, y_test), callbacks=[early_stop])

Train on 8185 samples, validate on 3070 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Train on 8185 samples, validate on 3070 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Train on 8186 samples, validate on 3070 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Train on 8185 samples, validate on 3070 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/1

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=<keras.wrappers.scikit_learn.KerasClassifier object at 0x7fade8770e48>,
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'activation': ['relu', 'tanh'],
                                        'epochs': [100], 'nl': [1, 2, 3, 4],
                                        'nn': [50, 100, 200]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=0)
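
To get a feel for how much work this search does: the parameter lists allow 24 combinations, the search draws `n_iter=10` of them, and each is trained once per fold. With 3 folds every fit sees about two thirds of the 12278 training rows, which is where the 8185 in the log comes from. The sketch below uses `random.sample` as a stand-in for the sampling step and does not train anything.

```python
import itertools
import random

search_space = {'activation': ['relu', 'tanh'], 'nl': [1, 2, 3, 4],
                'nn': [50, 100, 200], 'epochs': [100]}

# Every possible combination of the listed values
keys = sorted(search_space)
grid = [dict(zip(keys, values))
        for values in itertools.product(*(search_space[k] for k in keys))]

n_iter, cv = 10, 3
rng = random.Random(0)             # fixed seed only for this illustration
sampled = rng.sample(grid, n_iter)  # 10 distinct combinations

n_fits = n_iter * cv + 1           # +1 for the final refit on all training data
rows_per_fit = 12278 * (cv - 1) // cv

print(len(grid))     # 24
print(n_fits)        # 31
print(rows_per_fit)  # 8185
```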

In [20]:
#View the results of the model searching.
print("Best: %f using %s" % (random_search.best_score_, random_search.best_params_))
means = random_search.cv_results_['mean_test_score']
stds = random_search.cv_results_['std_test_score']
params = random_search.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.613374 using {'nn': 100, 'nl': 1, 'epochs': 100, 'activation': 'tanh'}
0.609708 (0.004240) with: {'nn': 200, 'nl': 2, 'epochs': 100, 'activation': 'tanh'}
0.602541 (0.005854) with: {'nn': 100, 'nl': 4, 'epochs': 100, 'activation': 'tanh'}
0.503014 (0.036833) with: {'nn': 50, 'nl': 1, 'epochs': 100, 'activation': 'relu'}
0.476706 (0.000665) with: {'nn': 200, 'nl': 4, 'epochs': 100, 'activation': 'relu'}
0.613374 (0.003049) with: {'nn': 100, 'nl': 1, 'epochs': 100, 'activation': 'tanh'}
0.613048 (0.004328) with: {'nn': 200, 'nl': 1, 'epochs': 100, 'activation': 'tanh'}
0.601157 (0.002623) with: {'nn': 100, 'nl': 3, 'epochs': 100, 'activation': 'tanh'}
0.492670 (0.022296) with: {'nn': 50, 'nl': 4, 'epochs': 100, 'activation': 'relu'}
0.476706 (0.000665) with: {'nn': 200, 'nl': 1, 'epochs': 100, 'activation': 'relu'}
0.610034 (0.002820) with: {'nn': 100, 'nl': 2, 'epochs': 100, 'activation': 'tanh'}


Models using the tanh activation have much higher accuracy, generally around 60%, while the ReLU models stay around 50%.

Models with 50 nodes or 3 hidden layers didn't make the top of the list.

More of the well-performing models have 100 nodes than any other size.

With 100 nodes and tanh, nl=1 and nl=2 score within roughly one standard deviation of each other (0.613 vs 0.610), and both beat nl=3 and nl=4.

I'm choosing a model with 100 nodes, since that size shows up most frequently among the good results, and tanh activation, since it gives much better results.

For the number of layers I go with nl=2, which is essentially tied with nl=1 given the other two chosen parameters.
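
Comparisons like these are easier to read off when the printed results are grouped. The sketch below copies the ten mean scores from the output above into a list by hand and averages them per activation, then ranks the 100-node tanh runs by number of layers.

```python
from collections import defaultdict
from statistics import mean

# (mean_test_score, nn, nl, activation), copied from the search output
results = [
    (0.609708, 200, 2, 'tanh'), (0.602541, 100, 4, 'tanh'),
    (0.503014, 50, 1, 'relu'),  (0.476706, 200, 4, 'relu'),
    (0.613374, 100, 1, 'tanh'), (0.613048, 200, 1, 'tanh'),
    (0.601157, 100, 3, 'tanh'), (0.492670, 50, 4, 'relu'),
    (0.476706, 200, 1, 'relu'), (0.610034, 100, 2, 'tanh'),
]

scores_by_activation = defaultdict(list)
for score, nn, nl, activation in results:
    scores_by_activation[activation].append(score)
mean_by_activation = {a: mean(s) for a, s in scores_by_activation.items()}

# Runs with 100 nodes and tanh, best first, as (nl, score) pairs
tanh_100 = sorted(((nl, score) for score, nn, nl, act in results
                   if nn == 100 and act == 'tanh'),
                  key=lambda pair: pair[1], reverse=True)

print({a: round(m, 4) for a, m in mean_by_activation.items()})
print(tanh_100)
```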


In [21]:
modelCheckpointTwo = ModelCheckpoint('second_trump_model.hdf5', save_best_only=True)
create_model(activation='tanh',nl=2,nn=100).fit(x=X_train,y=y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stop,modelCheckpointTwo])

Train on 12278 samples, validate on 3070 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100


<keras.callbacks.History at 0x7fade539bcc0>

In [22]:
load_model('second_trump_model.hdf5').evaluate(X_test, y_test)



[0.6435222195491729, 0.6247557002480721]

# 4) Adding more advanced layers
Our optimized baseline network got an accuracy of about 62%, and we can now start to extend it.

***Options***:

*   Using Convolutional neural networks (CNN)
*   Using Recurrent neural networks (LSTM)
*   Using Embedding layers

With convolution, the first layer slides a filter over small parts of the input at a time. This lets the network pick up on patterns in how words go together, and so gives more insight into the meaning of words based on their context.

Using LSTM networks should give similar results, since we would also be training a network to recognize words that go together.

I will be using a CNN and therefore not an LSTM.
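
As a rough picture of what the embedding and convolution steps compute, here is a NumPy sketch with random, untrained weights. Each word index is looked up in an embedding table, and each filter then looks at 3 neighbouring word vectors at a time. The toy vocabulary size and all variable names are assumptions for this example; it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim, seq_len = 50, 10, 100  # toy vocabulary for illustration
n_filters, kernel_size = 16, 3

embedding_table = rng.normal(size=(vocab_size, emb_dim))
kernels = rng.normal(size=(kernel_size, emb_dim, n_filters))
bias = np.zeros(n_filters)

def conv1d_relu(x, kernels, bias):
    """Slide each filter over the sequence (no padding, stride 1), then ReLU."""
    length = x.shape[0]
    k, _, n_out = kernels.shape
    out = np.empty((length - k + 1, n_out))
    for t in range(length - k + 1):
        window = x[t:t + k]  # k neighbouring word vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

token_ids = rng.integers(0, vocab_size, size=seq_len)  # one fake padded tweet
embedded = embedding_table[token_ids]                  # shape (100, 10)
features = conv1d_relu(embedded, kernels, bias)        # shape (98, 16)

print(embedded.shape, features.shape)
```

Each of the 98 output rows summarizes one window of three consecutive words, which is how the network gets to see words in their local context.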




In [0]:
from keras.layers import Embedding,Conv1D,MaxPool1D,Flatten,Dense



In [0]:
model=Sequential()
#10 embedding dims (arbitrarily set)
model.add(Embedding(30000,10,input_length=100))
#16 is the number of filters and 3 is the kernel_size
model.add(Conv1D(16,3,activation='relu'))
model.add(MaxPool1D())
model.add(Flatten())
model.add(Dense(100, activation='tanh'))
model.add(Dense(100, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
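
The shapes and parameter counts of this network can be worked out by hand. The sketch below assumes the Keras defaults for the layers above: no padding and stride 1 in `Conv1D`, and a pool size of 2 in `MaxPool1D`.

```python
seq_len, vocab_size, emb_dim = 100, 30000, 10
n_filters, kernel_size, pool_size = 16, 3, 2

conv_len = seq_len - kernel_size + 1  # positions the filter can take
pool_len = conv_len // pool_size      # after max pooling
flat_len = pool_len * n_filters       # input size of the first Dense layer

param_counts = {
    'embedding': vocab_size * emb_dim,
    'conv1d': kernel_size * emb_dim * n_filters + n_filters,
    'dense_1': flat_len * 100 + 100,
    'dense_2': 100 * 100 + 100,
    'output': 100 * 1 + 1,
}
total_params = sum(param_counts.values())

print(conv_len, pool_len, flat_len)  # 98 49 784
print(total_params)                  # 389197
```

Most of the parameters sit in the embedding table, so the vocabulary size of 30000 largely determines the size of the model.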

In [0]:
modelCheckpointFinal = ModelCheckpoint('Final_trump_model.hdf5', save_best_only=True)
model.fit(x=X_train,y=y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stop,modelCheckpointFinal])

In [0]:
load_model('Final_trump_model.hdf5').evaluate(X_test, y_test)