# Prediction example
1. Prediction is slow, so start early. In this notebook, scoring 7,769 texts took around 7 minutes. Colab may terminate a session automatically after a long run (possibly 6 or 12 hours), so split a large dataset into parts and predict on each part separately.
2. The main prediction steps are as follows:
   * Step 1: Load the CSV file.
   * Step 2: Preprocess the text (skip this step if the text is already processed). This step uses a `contractions.csv` file.
   * Step 3: Load the model and the tokenizer (2 files).
   * Step 4: Run the prediction and save the results.
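
Splitting the dataset, as suggested in point 1, could look like the sketch below. The helper name `chunk_bounds`, the chunk size of 2,000 rows, and the output file names are illustrative assumptions, not part of the original notebook. Each chunk would be scored and saved separately, so a terminated session loses at most one chunk.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row ranges that cover n_rows in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

# Hypothetical usage with a pandas DataFrame `df` (file names are placeholders):
# for i, (start, stop) in enumerate(chunk_bounds(len(df), 2000)):
#     part = df.iloc[start:stop].copy()
#     # ... run the prediction on `part` here ...
#     part.to_csv(f"scored_part_{i}.csv", index=False)

print(chunk_bounds(7769, 2000))
# [(0, 2000), (2000, 4000), (4000, 6000), (6000, 7769)]
```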


In [1]:
from tensorflow.keras.models import load_model
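# Import from tensorflow.keras throughout; mixing `keras` and `tensorflow.keras` can cause version conflicts.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences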
from sklearn.preprocessing import LabelEncoder

import pandas as pd

## Read csv

In [2]:
# Mount Google Drive. Make sure your files (model, CSV data) are stored in Google Drive.
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# load sample data to predict 
biden_sample_df = pd.read_csv("/content/gdrive/My Drive/Course/Colab Notebooks/ele_us2020_data/hashtag_joebiden_sample.csv",encoding = "ISO-8859-1", lineterminator='\n')

In [4]:
print("biden sample data set size: ", len(biden_sample_df))

biden sample data set size:  7769


## Preprocess the text (if needed)

In [15]:
import re
# Reading contractions.csv and storing it as a dict.
contractions = pd.read_csv('/content/gdrive/My Drive/Course/Colab Notebooks/contractions.csv', index_col='Contraction')
contractions.index = contractions.index.str.lower()
contractions.Meaning = contractions.Meaning.str.lower()
contractions_dict = contractions.to_dict()['Meaning']

# Defining regex patterns.
urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|(www\.)[^ ]*)"
userPattern       = r'@[^\s]+'
hashtagPattern    = r'#[^\s]+'
alphaPattern      = "[^a-z0-9<>]"
sequencePattern   = r"(.)\1\1+"
seqReplacePattern = r"\1\1"

# Defining regex for emojis
smileemoji        = r"[8:=;]['`\-]?[)d]+"
sademoji          = r"[8:=;]['`\-]?\(+"
neutralemoji      = r"[8:=;]['`\-]?[\/|l*]"
lolemoji          = r"[8:=;]['`\-]?p+"

def preprocess_apply(tweet):

    tweet = tweet.lower()

    # Replace all URLs with '<url>'.
    tweet = re.sub(urlPattern,'<url>',tweet)
    # Replace @USERNAME with '<user>'.
    tweet = re.sub(userPattern,'<user>', tweet)
    
    # Hashtag replacement is disabled: hashtags were not replaced during
    # training, so they are kept here as well.
    #tweet = re.sub(hashtagPattern,'<hashtag>', tweet)
    
    # Replace 3 or more consecutive identical characters with 2.
    tweet = re.sub(sequencePattern, seqReplacePattern, tweet)

    # Replace text emoticons with tokens.
    tweet = re.sub(r'<3', '<heart>', tweet)
    tweet = re.sub(smileemoji, '<smile>', tweet)
    tweet = re.sub(sademoji, '<sadface>', tweet)
    tweet = re.sub(neutralemoji, '<neutralface>', tweet)
    tweet = re.sub(lolemoji, '<lolface>', tweet)

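    # Expand contractions using the lookup table loaded above.
    for contraction, replacement in contractions_dict.items():
        tweet = tweet.replace(contraction, replacement)

    # Replace every character other than a-z, 0-9, '<' and '>' with a space.
    tweet = re.sub(alphaPattern, ' ', tweet)

    # Add a space on either side of '/' to separate words.
    # Note: alphaPattern above has already replaced '/' with a space,
    # so this substitution has no effect at this point.
    tweet = re.sub(r'/', ' / ', tweet)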
    return tweet

In [16]:
%%time
# Preprocess each tweet and store the result in a new DataFrame column.
biden_sample_df['processed_text'] = biden_sample_df.tweet.apply(preprocess_apply)

CPU times: user 709 ms, sys: 0 ns, total: 709 ms
Wall time: 713 ms


In [17]:
# have a look at processed text
print("Raw text: ")
print(biden_sample_df.tweet[15])
print("Processed text:")
print(biden_sample_df.processed_text[15])
print("Raw text: ")
print(biden_sample_df.tweet[19])
print("Processed text:")
print(biden_sample_df.processed_text[19])

Raw text: 
New York Post leak is probably from #Trump.  #Biden is one of the masterminds of Ukraine's slaughter of Maidan = #Clinton's takeover of #Ukraine, and the message is to be careful because he is trying to do the same in the United States in the wake of the US presidential election https://t.co/kSxRjxR5lb
Processed text:
new york post leak is probably from  trump    biden is one of the masterminds of ukraineis slaughter of maidan    clintonis takeover of  ukraine  and the message is to be careful because he is trying to do the same in the united states in the wake of the us presidential election <url>
Raw text: 
Duh, original disc was imaged (copied) on that date most likely. #Burisma @nypost #Biden #Ukraine #InfluencePeddling #UnofficialMeeting #JoeCantRemember @BillKristol https://t.co/3xBBBaz1VU
Processed text:
duh  original disc was imaged  copied  on that date most likely   burisma <user>  biden  ukraine  influencepeddling  unofficialmeeting  joecantremember <user> <url>


## Predict the sentiment
* Use the BiLSTM model on the processed tweet text and store the score in a new column, `predict_score_bi`.
* A score < 0.5 represents negative sentiment.
* A score >= 0.5 represents positive sentiment.
### Model:
* The BiLSTM model was trained on 950,000 tweets and tested on 50,000 tweets, reaching 0.839 accuracy.
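
The threshold rule above can be written as a small helper. The function name `score_to_label` and the string labels are illustrative assumptions; the notebook itself stores only the raw score.

```python
def score_to_label(score, threshold=0.5):
    """Map a score in [0, 1] to a sentiment label using the 0.5 cut-off."""
    return "positive" if score >= threshold else "negative"

print(score_to_label(0.4999))  # negative
print(score_to_label(0.5))     # positive
```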

In [18]:
# load model 
model = load_model("/content/gdrive/My Drive/Course/Colab Notebooks/BiLSTM_model/BiLSTM_gensim_0839_15epo_100wdataset.h5")

In [19]:
import pickle
# Load the pickled tokenizer.
with open('/content/gdrive/My Drive/Course/Colab Notebooks/BiLSTM_model/Tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

In [20]:
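# Predict the sentiment score for a single text.
def predict(text):
    
    # Tokenize the text and pad or truncate the sequence to 60 tokens.
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=60)
    # Predict; the result has one row per input text.
    score = model.predict(x_test)[0]
    # Decode sentiment (disabled; the raw score is returned instead).
    #label = -1 if score < 0.5 else 1
    # Take the single output value explicitly; calling float() on an
    # array is deprecated in recent NumPy versions.
    out_score = round(float(score[0]),4)

    return out_score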

In [21]:
%%time
# Predict on the sample dataset and time the run (CPU runtime type).
biden_sample_df["predict_score_bi"] = biden_sample_df.processed_text.apply(predict)
# Save the result.
biden_sample_df.to_csv("/content/gdrive/My Drive/Course/Colab Notebooks/ele_us2020_data/biden_sample_protext_scored.csv", index=False)

CPU times: user 6min 58s, sys: 6.77 s, total: 7min 5s
Wall time: 5min 54s
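
Calling `model.predict` once per row, as in the cell above, repeats the per-call overhead for every text. A possible alternative is to tokenize and pad all texts together and score them in one call, which is usually faster, although no timing was measured here. The sketch below is not the notebook's method: `predict_batch`, `batch_size=256`, and the `pad_fn` parameter are hypothetical, and the model is assumed to output one sigmoid value per text.

```python
def predict_batch(texts, tokenizer, model, maxlen=60, batch_size=256, pad_fn=None):
    """Score many texts with a single model.predict call.

    `tokenizer` and `model` are the objects loaded earlier in the notebook.
    `pad_fn` defaults to Keras pad_sequences; it is a parameter only so the
    function can be tried out without TensorFlow installed.
    """
    if pad_fn is None:
        from tensorflow.keras.preprocessing.sequence import pad_sequences as pad_fn
    # Convert every text to a sequence of token ids, then pad to a fixed length.
    sequences = tokenizer.texts_to_sequences(list(texts))
    x = pad_fn(sequences, maxlen=maxlen)
    # One call scores all rows; Keras splits the work into batches internally.
    scores = model.predict(x, batch_size=batch_size)
    # Assumption: the output has shape (n_texts, 1), one score per text.
    return [round(float(row[0]), 4) for row in scores]
```

With the objects loaded earlier, this could be used as `biden_sample_df["predict_score_bi"] = predict_batch(biden_sample_df.processed_text, tokenizer, model)`.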
