## Case Study 2 : Final functions

<br>
<li> This notebook has two functions. function1 takes one or more inputs and predicts the output of % change and it's sign for those points. </li>
<li> function2 takes three inputs X,Y1 and Y2. X is in in pandas dataframe format while Y1 & Y2 are in dict format specified in further part of this notebook and it returns mean deviation for Y1 while correct prediction ratio for Y2 with the pretrained model. </li>
<li> Pretrained Multichannel convolution network with loss weights for balancing imbalance in sign prediction was used in this notebook. </li>
<li> Model was saved in "Handling_Imbalanced_data.ipynb" while pickle files for scalars were saved at each instance they were used in "EDA_and_Preprocessing_of_combined_Data.ipynb". </li>

In [2]:
# importing libraries

import pandas as pd
import numpy as np
import tensorflow as tf
import os
# for some models GPU capacity was not enough hence trained those models by disabling GPU with below line
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pickle
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from tqdm import tqdm
import tensorflow_text as text
import pathlib
import tensorflow_addons as tfa
from keras.models import load_model

### 1. Defining function1 and function2

### 1.1 "function1"

In [3]:
def function1(X):
    '''
        1.  This function takes input in pandas dataframe format with following columns and predicts
            the % chnage in closing price and sign of change.
            
            ['ticker_symbol', 'post_date', 'body', 'comment_num', 'retweet_num','like_num']
        
        2.  The input can be of one date,company or multiple.
    '''
    # Extracting features from input
    
    # function to detect whether URL is present or not.
    def Find_url(string):  
        return ('https' in string or 'http' in string)

    X['URL_flag'] = X.body.apply(lambda x:1 if Find_url(x) else 0)
    
    # fuunction to know whether hashtags are present in tweet text or not
    def Find_hashtag(string):
        temp = re.search(r"#(\w+)", string)     
        return temp

    # Extracting hashtag_flag feature
    X['hastags_flag'] = X.body.apply(lambda x:1 if Find_hashtag(x) else 0)
    
    # referred and modified below link to extract hashtags from tweets
    # https://www.geeksforgeeks.org/python-extract-hashtags-from-text/
    def Find_mention(string):
        temp = re.search(r"@(\w+)", string)     
        return temp

    # extracting mention_flag features
    X['mention_flag'] = X.body.apply(lambda x:1 if Find_mention(x) else 0)
    
    # referred below link to extract hashtags from tweets
    # https://www.geeksforgeeks.org/python-extract-hashtags-from-text/

    def get_hashtag(string):
        hashtags  = re.findall(r"#(\w+)", string)
        return hashtags

    # extracting and storing hashtags in dataframe
    X['hashtags'] = X.body.apply(lambda x:','.join(get_hashtag(x)))
    
    # referred stopwords from below link
    # https://gist.github.com/sebleier/554280
    stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]



    # referred this cleaning function from Donor Choose assignments
    def preprocess(text):
        text = text.lower()
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r"#(\w+)", '', text)
        text = re.sub(r"@(\w+)", '', text)
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"n\'t", " not", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'m", " am", text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
        text = re.sub('[^A-Za-z0-9]+', ' ', text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
        text = ' '.join(e for e in text.split() if e.lower() not in stopwords)
        return text


    # cleaning tweet text and storing in 'tweet_cleaned' column
    X['tweet_cleaned'] = X.body.apply(lambda x:preprocess(x))
    
    sid = SentimentIntensityAnalyzer()

    senti_score_train = [sid.polarity_scores(x_body) for x_body in X['tweet_cleaned']]


    X['neg'] = [senti['neg'] for senti in senti_score_train]
    X['neu'] = [senti['neu'] for senti in senti_score_train]
    X['pos'] = [senti['pos'] for senti in senti_score_train]
    X['compound'] = [senti['compound'] for senti in senti_score_train]
    
    [retweet_num_scalar,comment_num_scalar,like_num_scalar] = pickle.load(open("scalars.pkl","rb"))

    X['retweet_num'] = retweet_num_scalar.fit_transform(X['retweet_num'].values.reshape(-1,1))
    X['comment_num'] = comment_num_scalar.fit_transform(X['comment_num'].values.reshape(-1,1))
    X['like_num'] = like_num_scalar.fit_transform(X['like_num'].values.reshape(-1,1))
    
    X['hashtags'] = X.hashtags.apply(lambda x:0 if len(x)<1 else x)
    
    keep_indices = []
    for i in range(X.shape[0]):
        if len(X.iloc[0]['tweet_cleaned'])>1:
            keep_indices.append(i)
    keep_indices = np.array(keep_indices)
    X = X.iloc[keep_indices]
    

    dates = X['post_date'].unique()
    companies = X['ticker_symbol'].unique()  

    # empty list to store combined tweet text
    tweets = {}

    for date in dates:
        tweet_data = X[X.post_date == date]
        # empty string to store and accumulate tweet text for a day
        all_tweets = ''
        # loop to iterate through all tweets on that day
        for tweet in tweet_data['tweet_cleaned']: 
            all_tweets += tweet

        # storing all tweets to tweets dictionary
        tweets[date] = all_tweets

    
    # referred this preprocess function from Donor Choose assignments
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"n\'t", " not", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'m", " am", text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
        text = re.sub('[^A-Za-z0-9]+', ' ', text)
        return text

    # Getting Glove vec dictionary
    with open('glove_vectors', 'rb') as f:
        glove_dict = pickle.load(f)

    # Function to get hashtag vectors
    def vec_hashtag(hashtags):
        vec = np.zeros(300)   # empty vector of dimension 300
        n_letters = 1         # counting letters in all hashtags
        if hashtags:
            hashtags = hashtags.split(',')         # getting all hashtags for current tweet
            for hashtag in hashtags:               # loop to iterate through all hashtags
                hashtag = preprocess(hashtag)      # preprocessing all hashtags
                hashtag = hashtag.replace(" ", "") # removing spaces in hashtags
                for letter in hashtag:           
                    vec += glove_dict[letter]
                    n_letters += 1
            vec /= n_letters                       # finding average of all letter vectors
        return np.array(vec)

    # referred array padding from below two links
    # https://numpy.org/doc/stable/reference/generated/numpy.pad.html
    # https://stackoverflow.com/questions/35751306/python-how-to-pad-numpy-array-with-zeros

    feature_mat = {}         # emtpy list to store all feature matrices
    req_dim = 500            # selected tweets for a day
    for date in dates: # loop to iterate through all dates
        vec_data = X[X.post_date == date]    # filtering data for current date

        # getting feature vector for current date with all other features
        temp = vec_data.drop(['body','ticker_symbol','hashtags','post_date','tweet_cleaned'],axis=1).values

        # empty list to store hashtag vector
        hash_vec = []

        # loop to iterate thorugh all tweet hashtags and getting their vector representation
        for tweet in vec_data['hashtags'].values:
            hash_vec.append(vec_hashtag(tweet))
        hash_vec = np.array(hash_vec)

        # combining hashtag vectors and other feature vectors for current date
        hash_vec = np.hstack((temp,hash_vec))

        try:       # padding matrix with 0's if number of tweet vectors are less than req_dim
            hash_vec = np.pad(hash_vec, [(0, req_dim-hash_vec.shape[0]), (0, 0)])
        except:    # else select first 500 tweet vectors
            hash_vec = hash_vec[:req_dim,:]

        # adding feature matrix to defined list
        # added one more dimension so that while training neural network, these arrrays will be treated as single channel images
        feature_mat[date] = np.expand_dims(hash_vec,axis=-1)
    
    structured_data = pd.DataFrame(columns=['post_date','tweet_text','company_name','feat_mat'])
    
    for company in companies:
        for date in dates:
            structured_data = structured_data.append({'post_date':date,
                                    'tweet_text':tweets[date],
                                    'company_name':company,
                                    'feat_mat':feature_mat[date]},ignore_index=True)
    
    x_text = structured_data['tweet_text'].values
    x_feat = list(structured_data['feat_mat'].values)
    x_feat = np.array(x_feat)
    x_company = structured_data['company_name'].values
    
    bert_tokenizer_params=dict(lower_case=True)
    reserved_tokens=[]

    bert_vocab_args = dict(
        # The target vocabulary size
        vocab_size = 8000,
        # Reserved tokens that must be included in the vocabulary
        reserved_tokens=reserved_tokens,
        # Arguments for `text.BertTokenizer`
        bert_tokenizer_params=bert_tokenizer_params,
        # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
        learn_params={},
    )

    # creating BertTokenizer object with vocab text file genrated as per reference link stated above
    en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
    vocab_size_text = len(pathlib.Path('en_vocab.txt').read_text().splitlines())+ 1
    
    # Using BertTokenizer to tokenize train data
    encoded_text = en_tokenizer.tokenize(x_text)
    encoded_text = encoded_text.merge_dims(-2,-1)   # reducing dimension of ragged tensor
    encoded_text = encoded_text.to_list()           # converting to list to pad the sequences
    max_length = 5000                               # max length of padding
    x_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
    
    tokenizer = pickle.load(open("tokenizer.pkl","rb"))
    # getting tokenized train data
    train_comp = np.array(tokenizer.texts_to_sequences(x_company))
    
    loaded_model = load_model("trained_model.h5")
    prediction = (x_company,loaded_model.predict([x_text,x_feat,train_comp]))
    
    prediction[1][0] = [i[0] for i in prediction[1][0]]
    prediction[1][1] = [1 if i[0]>=0.5 else 0 for i in prediction[1][1]]
    
    return prediction

### 1.2 "function2"

In [4]:
def function2(X,Y1,Y2):
    '''
        1.  This function takes input in pandas dataframe format with following columns and gives outputs
            as mean deviation in chnage prediction and ratio of correct signs predicted.
            
            ['ticker_symbol', 'post_date', 'body', 'comment_num', 'retweet_num','like_num']
        
        2.  The input can be of multiple dates and companies.
    '''
    
    def Find_url(string):  
        return ('https' in string or 'http' in string)

    X['URL_flag'] = X.body.apply(lambda x:1 if Find_url(x) else 0)
    
    # fuunction to know whether hashtags are present in tweet text or not
    def Find_hashtag(string):
        temp = re.search(r"#(\w+)", string)     
        return temp

    # Extracting hashtag_flag feature
    X['hastags_flag'] = X.body.apply(lambda x:1 if Find_hashtag(x) else 0)
    
    # referred and modified below link to extract hashtags from tweets
    # https://www.geeksforgeeks.org/python-extract-hashtags-from-text/
    def Find_mention(string):
        temp = re.search(r"@(\w+)", string)     
        return temp

    # extracting mention_flag features
    X['mention_flag'] = X.body.apply(lambda x:1 if Find_mention(x) else 0)
    
    # referred below link to extract hashtags from tweets
    # https://www.geeksforgeeks.org/python-extract-hashtags-from-text/

    def get_hashtag(string):
        hashtags  = re.findall(r"#(\w+)", string)
        return hashtags

    # extracting and storing hashtags in dataframe
    X['hashtags'] = X.body.apply(lambda x:','.join(get_hashtag(x)))
    
    stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]



    # referred this cleaning function from Donor Choose assignments
    def preprocess(text):
        text = text.lower()
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r"#(\w+)", '', text)
        text = re.sub(r"@(\w+)", '', text)
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"n\'t", " not", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'m", " am", text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
        text = re.sub('[^A-Za-z0-9]+', ' ', text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
            # https://gist.github.com/sebleier/554280
        text = ' '.join(e for e in text.split() if e.lower() not in stopwords)
        return text


    # cleaning tweet text and storing in 'tweet_cleaned' column
    X['tweet_cleaned'] = X.body.apply(lambda x:preprocess(x))
    
    sid = SentimentIntensityAnalyzer()

    senti_score_train = [sid.polarity_scores(x_body) for x_body in tqdm(X['tweet_cleaned'])]


    X['neg'] = [senti['neg'] for senti in senti_score_train]
    X['neu'] = [senti['neu'] for senti in senti_score_train]
    X['pos'] = [senti['pos'] for senti in senti_score_train]
    X['compound'] = [senti['compound'] for senti in senti_score_train]
    
    X = X[X.retweet_num > 1].copy()
    [retweet_num_scalar,comment_num_scalar,like_num_scalar] = pickle.load(open("scalars.pkl","rb"))

    X['retweet_num'] = retweet_num_scalar.transform(X['retweet_num'].values.reshape(-1,1))
    X['comment_num'] = comment_num_scalar.transform(X['comment_num'].values.reshape(-1,1))
    X['like_num'] = like_num_scalar.transform(X['like_num'].values.reshape(-1,1))
    
    X['hashtags'] = X.hashtags.apply(lambda x:0 if len(x)<1 else x)
    
    keep_indices = []
    for i in range(X.shape[0]):
        if len(X.iloc[0]['tweet_cleaned'])>1:
            keep_indices.append(i)
    keep_indices = np.array(keep_indices)
    X = X.iloc[keep_indices]
    

    dates = X['post_date'].unique()
    companies = X['ticker_symbol'].unique()  

    # empty list to store combined tweet text
    tweets = {}

    for date in tqdm(dates):
        tweet_data = X[X.post_date == date]
        # empty string to store and accumulate tweet text for a day
        all_tweets = ''
        # loop to iterate through all tweets on that day
        for tweet in tweet_data['tweet_cleaned']: 
            all_tweets += tweet

        # storing all tweets to tweets dictionary
        tweets[date] = all_tweets

    
    # referred this preprocess function from Donor Choose assignments
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"n\'t", " not", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'m", " am", text)
        text = text.replace('\\r', ' ')
        text = text.replace('\\n', ' ')
        text = text.replace('\\"', ' ')
        text = re.sub('[^A-Za-z0-9]+', ' ', text)
        return text

    # Getting Glove vec dictionary
    with open('glove_vectors', 'rb') as f:
        glove_dict = pickle.load(f)

    # Function to get hashtag vectors
    def vec_hashtag(hashtags):
        vec = np.zeros(300)   # empty vector of dimension 300
        n_letters = 1         # counting letters in all hashtags
        if hashtags:
            hashtags = hashtags.split(',')         # getting all hashtags for current tweet
            for hashtag in hashtags:               # loop to iterate through all hashtags
                hashtag = preprocess(hashtag)      # preprocessing all hashtags
                hashtag = hashtag.replace(" ", "") # removing spaces in hashtags
                for letter in hashtag:           
                    vec += glove_dict[letter]
                    n_letters += 1
            vec /= n_letters                       # finding average of all letter vectors
        return np.array(vec)

    # referred array padding from below two links
    # https://numpy.org/doc/stable/reference/generated/numpy.pad.html
    # https://stackoverflow.com/questions/35751306/python-how-to-pad-numpy-array-with-zeros

    feature_mat = {}         # emtpy list to store all feature matrices
    req_dim = 500            # selected tweets for a day
    for date in tqdm(dates): # loop to iterate through all dates
        vec_data = X[X.post_date == date]    # filtering data for current date

        # getting feature vector for current date with all other features
        temp = vec_data.drop(['body','ticker_symbol','hashtags','post_date','tweet_cleaned'],axis=1).values

        # empty list to store hashtag vector
        hash_vec = []

        # loop to iterate thorugh all tweet hashtags and getting their vector representation
        for tweet in vec_data['hashtags'].values:
            hash_vec.append(vec_hashtag(tweet))
        hash_vec = np.array(hash_vec)

        # combining hashtag vectors and other feature vectors for current date
        hash_vec = np.hstack((temp,hash_vec))

        try:       # padding matrix with 0's if number of tweet vectors are less than req_dim
            hash_vec = np.pad(hash_vec, [(0, req_dim-hash_vec.shape[0]), (0, 0)])
        except:    # else select first 500 tweet vectors
            hash_vec = hash_vec[:req_dim,:]

        # adding feature matrix to defined list
        # added one more dimension so that while training neural network, these arrrays will be treated as single channel images
        feature_mat[date] = np.expand_dims(hash_vec,axis=-1)
    
    structured_data = pd.DataFrame(columns=['post_date','tweet_text','company_name','feat_mat'])
    
    for company in companies:
        for date in tqdm(dates):
            structured_data = structured_data.append({'post_date':date,
                                    'tweet_text':tweets[date],
                                    'company_name':company,
                                    'feat_mat':feature_mat[date]},ignore_index=True)
    x_dates = structured_data['post_date'].values
    x_text = structured_data['tweet_text'].values
    x_feat = list(structured_data['feat_mat'].values)
    x_feat = np.array(x_feat)
    x_company = structured_data['company_name'].values
    
    
    bert_tokenizer_params=dict(lower_case=True)
    reserved_tokens=[]

    bert_vocab_args = dict(
        # The target vocabulary size
        vocab_size = 8000,
        # Reserved tokens that must be included in the vocabulary
        reserved_tokens=reserved_tokens,
        # Arguments for `text.BertTokenizer`
        bert_tokenizer_params=bert_tokenizer_params,
        # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
        learn_params={},
    )

    # creating BertTokenizer object with vocab text file genrated as per reference link stated above
    en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
    vocab_size_text = len(pathlib.Path('en_vocab.txt').read_text().splitlines())+ 1
    
    # Using BertTokenizer to tokenize train data
    encoded_text = en_tokenizer.tokenize(x_text)
    encoded_text = encoded_text.merge_dims(-2,-1)   # reducing dimension of ragged tensor
    encoded_text = encoded_text.to_list()           # converting to list to pad the sequences
    max_length = 5000                               # max length of padding
    x_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
    
    tokenizer = pickle.load(open("tokenizer.pkl","rb"))
    # getting tokenized train data
    train_comp = np.array(tokenizer.texts_to_sequences(x_company))
    
    loaded_model = load_model("trained_model.h5")
    prediction = (x_company,loaded_model.predict([x_text,x_feat,train_comp]))
    
    deviation = []

    count = 0
    sign_pred = [1 if i[0]>=0.5 else 0 for i in prediction[1][1]]
    for i,date,company in zip(range(len(x_dates)),x_dates,x_company):
        if Y1[date][company] == -1 or Y2[date][company] == -1:
            continue
        deviation.append(abs(prediction[1][0][i]-Y1[date][company]))
        if sign_pred[i] == Y2[date][company]:
            count += 1   
            
    print("Mean Deviation for % change in closing prices is : ",np.mean(deviation))
    print("Number of correct signs predicted out of {} are : {}".format(len(sign_pred),count))
    return [np.mean(deviation),count/len(sign_pred)]

## 2. Testing Functions

In [5]:
# reading the combined tweet data from all files 

data = pd.read_csv('final_data.csv')
data.head()

Unnamed: 0,tweet_id,ticker_symbol,writer,post_date,body,comment_num,retweet_num,like_num,close_value,volume,open_value,high_value,low_value,close_value_change,change_label,class_sign
0,550803612197457920,AAPL,SentiQuant,2015-01-01,#TOPTICKERTWEETS $AAPL $IMRS $BABA $EBAY $AMZN...,0,0,1,110.38,41304780,112.82,113.13,110.21,0.0,0.0,1
1,550803610825928706,AAPL,SentiQuant,2015-01-01,#SENTISHIFTUP $K $FB $GOOGL $GS $GOLD $T $AAPL...,0,0,1,110.38,41304780,112.82,113.13,110.21,0.0,0.0,1
2,550803225113157632,AAPL,MacHashNews,2015-01-01,Rumor Roundup: What to expect when you're expe...,0,0,0,110.38,41304780,112.82,113.13,110.21,0.0,0.0,1
3,550802957370159104,AAPL,WaltLightShed,2015-01-01,"An $AAPL store line in Sapporo Japan for the ""...",2,4,4,110.38,41304780,112.82,113.13,110.21,0.0,0.0,1
4,550802855129382912,AAPL,2waystrading,2015-01-01,$AAPL - Will $AAPL Give Second entry opportuni...,0,0,0,110.38,41304780,112.82,113.13,110.21,0.0,0.0,1


<li> We can see there are many columns in this combined data and are not required for model to predict. </li>
<li> What model requires is to predict the output based on tweets on a particular day. </li>
<li> Hence let us drop the non required columns. </li>

In [6]:
temp = data.drop(['tweet_id', 'writer','close_value', 'volume',
       'open_value', 'high_value','low_value','change_label','class_sign','close_value_change'],axis=1).copy()

temp.head()

Unnamed: 0,ticker_symbol,post_date,body,comment_num,retweet_num,like_num
0,AAPL,2015-01-01,#TOPTICKERTWEETS $AAPL $IMRS $BABA $EBAY $AMZN...,0,0,1
1,AAPL,2015-01-01,#SENTISHIFTUP $K $FB $GOOGL $GS $GOLD $T $AAPL...,0,0,1
2,AAPL,2015-01-01,Rumor Roundup: What to expect when you're expe...,0,0,0
3,AAPL,2015-01-01,"An $AAPL store line in Sapporo Japan for the ""...",2,4,4
4,AAPL,2015-01-01,$AAPL - Will $AAPL Give Second entry opportuni...,0,0,0


<li> The dataframe has ticker_symbol column where the company name for which we need to predict is present for all rows while other columns are related to tweet information. </li>
<li> Here we can see that only information related to tweets is present and does not contain any information related to prices. </li>

### 2.1 Testing "function1"

In [7]:
date = '2019-06-03'
company = 'AAPL'

X = temp[(temp.post_date == date) & (temp.ticker_symbol == company)].copy()
y_true_change = data[(data.post_date == date) & (data.ticker_symbol == company)]['change_label'].unique()[0]
y_true_sign = data[(data.post_date == date) & (data.ticker_symbol == company)]['class_sign'].unique()[0]

In [8]:
X.columns

Index(['ticker_symbol', 'post_date', 'body', 'comment_num', 'retweet_num',
       'like_num'],
      dtype='object')

In [9]:
# calling function1

pred = function1(X.copy())
print("For company {}, the % change in closing price on {} was {} and predicted as {} \n".format(company,date,y_true_change,pred[1][0][0]))
print("For company {}, the sign for change in closing price on {} was {} and predicted as {} \n".format(company,date,y_true_sign,pred[1][1][0]))

For company AAPL, the % change in closing price on 2019-06-03 was 1.0110241617638556 and predicted as 1.7048200368881226 

For company AAPL, the sign for change in closing price on 2019-06-03 was 0 and predicted as 0 



<li> Here 0 sign change means negative change while 1 sign change means positive change. </li>

### 2.1 Testing "function2"

In [10]:
data = pd.read_csv('final_data.csv')

In [11]:
# let us take test set as last 6 months of available data

data = data[data.post_date >= '2019-06-01']
data.head()

Unnamed: 0,tweet_id,ticker_symbol,writer,post_date,body,comment_num,retweet_num,like_num,close_value,volume,open_value,high_value,low_value,close_value_change,change_label,class_sign
3861206,1134610603715309569,AAPL,ArchieAndrews85,2019-06-01,The patent portfolio of those two companies co...,1,0,2,175.07,27043580,176.23,177.99,174.99,0.0,0.0,1
3861207,1134610842643836929,AAPL,PortfolioBuzz,2019-06-01,Having 10 different news tabs open for $AAPL $...,0,0,0,175.07,27043580,176.23,177.99,174.99,0.0,0.0,1
3861208,1134612405697335296,AAPL,ppolitics,2019-06-01,$AAPL has $245 billion in cash on hand - I thi...,0,0,1,175.07,27043580,176.23,177.99,174.99,0.0,0.0,1
3861209,1134612972020654083,AAPL,SusanLiTV,2019-06-01,Using #tariffs 2 battle illegal #immigration? ...,6,4,38,175.07,27043580,176.23,177.99,174.99,0.0,0.0,1
3861210,1134614110010826752,AAPL,TalkMarkets,2019-06-01,FANG Stocks Update: Leading The Marketwide Cha...,0,0,2,175.07,27043580,176.23,177.99,174.99,0.0,0.0,1


In [12]:
# extracting class labels in dict format from dataset in order to provide those as inputs
# the dict format is y_true[date][company name] = true value
# Hence it is nested dictionary

# getting unique dates and companies
dates = data['post_date'].unique()
companies = data['ticker_symbol'].unique()

# creating dictionaries to store class labels
y_change_label = {}
y_class_sign = {}
for date in tqdm(dates):
    # creating empty nested dictionaries
    y_change_label[date] = {}
    y_class_sign[date] = {}
    for company in companies:
        temp = data[(data.post_date == date) & (data.ticker_symbol == company)]

        try:
            y_change_label[date][company] = temp['change_label'].unique()[0]
            y_class_sign[date][company] = temp['class_sign'].unique()[0]
        except:
            # for missing values of class labels, -1 was assigned and this is handled in function2
            y_change_label[date][company] = -1
            y_class_sign[date][company] = -1

100%|████████████████████████████████████████████████████████████████████████████████| 214/214 [00:58<00:00,  3.69it/s]


In [13]:
# dropping non-required columns
X = data.drop(['tweet_id', 'writer','close_value', 'volume',
       'open_value', 'high_value','low_value','change_label','class_sign','close_value_change'],axis=1).copy()

X.head()

Unnamed: 0,ticker_symbol,post_date,body,comment_num,retweet_num,like_num
3861206,AAPL,2019-06-01,The patent portfolio of those two companies co...,1,0,2
3861207,AAPL,2019-06-01,Having 10 different news tabs open for $AAPL $...,0,0,0
3861208,AAPL,2019-06-01,$AAPL has $245 billion in cash on hand - I thi...,0,0,1
3861209,AAPL,2019-06-01,Using #tariffs 2 battle illegal #immigration? ...,6,4,38
3861210,AAPL,2019-06-01,FANG Stocks Update: Leading The Marketwide Cha...,0,0,2


#### Input data format is similar to function1 the change here is class label as inputs

In [14]:
# calling function2

pred = function2(X,y_change_label,y_class_sign)

100%|████████████████████████████████████████████████████████████████████████| 465826/465826 [01:27<00:00, 5347.40it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 295.18it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:02<00:00, 101.28it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 585.28it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 606.46it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 589.86it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 608.47it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 214/214 [00:00<00:00, 591.75it/s]
100%|███████████████████████████████████

Mean Deviation for % change in closing prices is :  0.8286616
Number of correct signs predicted out of 1284 are : 873


In [15]:
print("predicted chnage mean deviation is {} and correct prediction ratio is {}".format(pred[0],pred[1]))

predicted chnage mean deviation is 0.8286616206169128 and correct prediction ratio is 0.6799065420560748


#### These results states that tweeter does have an impact on stock markets but it is not the only factor affecting stock market. 