# Twitter Sentiment Extraction for Tesla Tweets from 2017 to 2020

- Creating a text classification model using LSTM deep learning networks. The model was trained on the Sentiment-140 dataset created by Alec Go, Richa Bhayani, and Lei Huang from Stanford University

- The dataset contains 1.6m tweets labelled as 0 (negative) or 4 (positive). The LSTM model learns to predict the sentiment (positive/negative) of a given tweet and provides a confidence score for each category.

- Using trained model to generate sentiment scores for Tesla Tweets

    - Cleaning Date Column. Creating Date column with YYYY-MM-DD format

    - Cleaning Tweet Content. Removing links, mentions, hashtags, and other special characters from tweets

    - Sentiment Extraction. Using the LSTM model to generate the sentiment score for each Tesla tweet

    - Aggregating sentiment scores for each day (mean) to obtain the overall daily sentiment
    
    - Creating final Twitter dataframe with data from all years of tweets and their daily sentiment scores

In [1]:
# Importing required libraries
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
import re
import matplotlib.pyplot as plt
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, BatchNormalization, LSTM, Bidirectional, Embedding, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from datetime import datetime

pd.set_option("display.max_colwidth" , 100, "display.max_columns", 20)
np.random.seed(50)
tf.random.set_seed(50)

## Loading and Cleansing Raw Data
- The Sentiment 140 dataset contains 1.6m tweets classified as positive or negative. Only the tweet text and the label are extracted from the dataset, and each tweet is cleaned before the text is vectorized.
- Only tweets between 100 and 300 characters long are used for training the LSTM text classifier
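
The length filter can be illustrated on a toy frame. This is a minimal sketch: the `toy` frame and its contents are made up for illustration, and only the strict bounds (more than 100 and fewer than 300 characters) mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data: three tweets of 50, 150 and 350 characters
toy = pd.DataFrame({
    "Label": ["Negative", "Positive", "Negative"],
    "Tweet": ["a" * 50, "b" * 150, "c" * 350],
})

# Character length of every tweet
tweet_len = toy["Tweet"].str.len()

# Keep tweets strictly longer than 100 and strictly shorter than 300 characters
filtered = toy[(tweet_len > 100) & (tweet_len < 300)]
print(filtered["Tweet"].str.len().tolist())  # [150]
```

Note that the bounds are measured in characters, not words, so the filter keeps only fairly long tweets.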

 

In [2]:
#Loading Sentiment 140 raw data
!wget https://www.dropbox.com/s/ht3rge16u835h24/Sent140.zip?dl=0
!unzip -o "/content/Sent140.zip?dl=0" 


In [3]:
# The Sentiment-140 dataset contains around 1.6 million tweets with labelled sentiments
# Note: pd.read_csv treats the first row of the file as a header by default;
# pass header=None if Sent140.csv has no header row
raw_data = pd.read_csv("/content/Sent140.csv", encoding='latin-1')
raw_data.columns = ['Label', 'ID', 'Date', 'Flag', 'User', 'Tweet']

# Extracting columns required for training (copy avoids SettingWithCopyWarning)
raw_data = raw_data[["Label", "Tweet"]].copy()

# Mapping the numeric labels (0 = negative, 4 = positive) to readable class names
raw_data['Label'] = raw_data['Label'].replace(4,'Positive')
raw_data['Label'] = raw_data['Label'].replace(0,'Negative')

# Removing tweets that are too short or too long: keeping only tweets
# with more than 100 and fewer than 300 characters
tweet_len = raw_data['Tweet'].str.len()
raw_data = raw_data[(tweet_len > 100) & (tweet_len < 300)]

# Extracting 100000 negative and 100000 positive tweets. This relies on the file
# being sorted by label, with negative tweets first and positive tweets last
set_neg = raw_data.iloc[:100000,:]
set_pos = raw_data.iloc[-100000:,:]
raw_data = pd.concat([set_neg,set_pos],axis = 0)

print(raw_data['Label'].value_counts())
print(raw_data.shape)
print(raw_data.head())

Negative    100000
Positive    100000
Name: Label, dtype: int64
(200000, 2)
       Label  \
0   Negative   
3   Negative   
13  Negative   
14  Negative   
20  Negative   

                                                                                                  Tweet  
0   is upset that he can't update his Facebook by texting it... and might cry as a result  School to...  
3   @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you a...  
13  @smarrison i would've been the first, but i didn't have a gun.    not really though, zac snyder'...  
14  @iamjazzyfizzle I wish I got to watch it with you!! I miss you and @iamlilnicki  how was the pre...  
20  one of my friend called me, and asked to meet with her at Mid Valley today...but i've no time *s...  


In [4]:
#Cleaning our labelled training data
def tweet_cleanse(text):

    #Removing hyperlinks (http and https URLs)
    text = re.sub(r'https?://\S+', '', text)

    #Removing cashtags: $ followed by letters or digits (e.g. $TSLA)
    text = re.sub(r'\$[A-Za-z0-9]+', '', text)

    #Removing the pattern "Read more:" and MSFT tokens
    text = re.sub(r'Read more:|MSFT', '', text)

    #Removing @mentions
    text = re.sub(r'@[A-Za-z0-9]+', '', text)

    #Removing the = symbol and all text after it, up to the end of the tweet
    text = re.sub(r'=[\s\S]+', '', text)

    #Replacing every run of other special characters with a single space
    #(letters, digits, ?, ! and apostrophes are kept)
    text = re.sub(r"[^A-Za-z0-9?!']+", ' ', text)

    return text

def gen_clean_tweets(input_df, col):
    input_df[col] = input_df[col].apply(tweet_cleanse)
    return input_df

#Generating clean tweets from current text
raw_data = gen_clean_tweets(raw_data, 'Tweet')
print(raw_data.shape)
print(raw_data.head())

(200000, 2)
       Label  \
0   Negative   
3   Negative   
13  Negative   
14  Negative   
20  Negative   

                                                                                                  Tweet  
0   is upset that he can't update his Facebook by texting it and might cry as a result School today ...  
3            no it's not behaving at all i'm mad why am i here? because I can't see you all over there   
13   i would've been the first but i didn't have a gun not really though zac snyder's just a douchec...  
14                            I wish I got to watch it with you!! I miss you and how was the premiere?!  
20     one of my friend called me and asked to meet with her at Mid Valley today but i've no time sigh   
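
To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. The function `clean` is a hypothetical stand-in that follows the same sequence of substitutions as `tweet_cleanse`, and the sample tweets are made up.

```python
import re

def clean(text):
    # Hypothetical helper following the same steps as tweet_cleanse
    text = re.sub(r'https?://\S+', '', text)        # links
    text = re.sub(r'\$[A-Za-z0-9]+', '', text)      # cashtags such as $TSLA
    text = re.sub(r'Read more:|MSFT', '', text)     # fixed tokens
    text = re.sub(r'@[A-Za-z0-9]+', '', text)       # mentions
    text = re.sub(r'=[\s\S]+', '', text)            # "=" and everything after it
    text = re.sub(r"[^A-Za-z0-9?!']+", ' ', text)   # remaining special characters
    return text

sample = "@elonmusk $TSLA to the moon!! https://t.co/abc #Tesla"
print(repr(clean(sample)))                       # ' to the moon!! Tesla'
print(repr(clean("Price target = $900 today")))  # 'Price target '
```

Two details are visible here: the `#` symbol is dropped but the hashtag word itself stays, and leading or trailing spaces are not stripped. The tokenizer ignores the extra whitespace, so this does not affect the model input.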


In [5]:
#Extracting Raw Tweets and Labels 
def extract_tweet_labels(clean_data):
    raw_tweet, raw_y = clean_data["Tweet"].values, clean_data["Label"].values
    return raw_tweet, raw_y

# Shuffling the raw data to mix the negative and positive tweets,
# so that the training and testing data both contain instances of both labels
raw_data = raw_data.sample(frac=1, random_state=50)

# Creating Raw X (tweets) and Raw Y (classified label)
raw_X, raw_y = extract_tweet_labels(raw_data)
print(raw_X)
print(raw_y)

['just got back from quot work quot and is already tired still going to lunch with madee! I need my friends '
 'okay my dear tweets! i have to get up early and hang out with the small chitlins again tomorrow! love yous guys! night!'
 'haha six flags was amazing wanted to go on king da ka but it was closed other than that awesome two thumbs up!!!!! '
 ...
 ' I saw the previews for land of the lost and thought Ummmm NO! LOL Thanks for confirming my thought on that '
 ' Exactly Numbers on twitter mean nothing unless you actually quot connect quot with your followers And you do! Smart man '
 'Have a listen to our music on Mixposure a great place to see some fantastic reviews ']
['Positive' 'Positive' 'Negative' ... 'Positive' 'Positive' 'Positive']


## Creating Training Data and Testing Data

In [6]:
# Vectorizing the tweets
# Limiting the vocabulary to the 50000 most frequent words
max_words = 50000

# Length (in tokens) that every tweet sequence is padded or truncated to
max_tweet_len = 300

#Embedding size for the vector embedding layer
embedding_size = 128

#Creating a tokenizer object and vectorizing the tweets  
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(raw_X)
X_tokenized = tokenizer.texts_to_sequences(raw_X)

#Padding (or truncating) the input sequences to the max tweet length
X = pad_sequences(X_tokenized, padding='post', maxlen=max_tweet_len)
print('Shape of X:', X.shape)
print(raw_X[0])
print(X[0])
print(raw_y[0])

Shape of X: (200000, 300)
just got back from quot work quot and is already tired still going to lunch with madee! I need my friends 
[   20    46    60    43    26    62    26     5    12   219   259    77
    52     3   441    23 47612     1    92     6   197     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0    
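
The output above shows word indices followed by a long run of zeros. The dependency-free sketch below mimics that behaviour in a simplified way: words are ranked by frequency starting at index 1, index 0 is reserved for padding, and zeros are appended after the tokens. The helpers `fit_vocab`, `encode` and `pad_post` are hypothetical, and they skip the punctuation filtering that the Keras `Tokenizer` also performs.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    # Rank words by frequency; the most frequent word gets index 1
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # Index 0 is reserved for padding; only indices below num_words are kept
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def encode(text, vocab):
    # Words outside the vocabulary are silently dropped
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad_post(seq, maxlen):
    # Keep the last maxlen tokens, then append zeros on the right
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

vocab = fit_vocab(["tesla stock is up", "tesla is great"], num_words=10)
padded = pad_post(encode("tesla is up", vocab), maxlen=6)
print(padded)  # [1, 2, 4, 0, 0, 0]
```

Since the training tweets have fewer than 300 characters, they contain far fewer than 300 tokens, which is why most of each padded row is zero.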

In [7]:
# One-hot encoding the labels; columns follow alphabetical order: [Negative, Positive]
Y = pd.get_dummies(raw_y, dtype=int).values
print('Shape Encoded Y: ', Y.shape)
print('\nFirst 5 labels of Y:\n', Y[0:5])


Shape Encoded Y:  (200000, 2)

First 5 labels of Y:
 [[0 1]
 [0 1]
 [1 0]
 [0 1]
 [0 1]]
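
The column order of the one-hot labels matters later, because it decides which softmax output means "negative". A minimal sketch on made-up labels shows that `pd.get_dummies` orders the columns alphabetically and how class names can be recovered from the encoded rows:

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration
labels = np.array(["Positive", "Positive", "Negative"])

encoded = pd.get_dummies(labels, dtype=int)
print(list(encoded.columns))    # ['Negative', 'Positive']
print(encoded.values.tolist())  # [[0, 1], [0, 1], [1, 0]]

# Recovering the class names from the one-hot rows
decoded = encoded.columns[encoded.values.argmax(axis=1)]
print(list(decoded))            # ['Positive', 'Positive', 'Negative']
```

So index 0 corresponds to Negative and index 1 to Positive, which matches the first five rows printed above.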


In [8]:
#Train Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.1, random_state = 42, shuffle=True)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(180000, 300) (180000, 2)
(20000, 300) (20000, 2)


## Sentiment Classification Model Training and Evaluation

In [9]:
# Model Definition
def build_model():
    
    model = Sequential()
    model.add(Embedding(max_words, embedding_size,input_length=X.shape[1]))

    model.add(Bidirectional(LSTM(units = 128)))
    model.add(Dropout(0.5))

    model.add(Dense(units = 16, activation='relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units = 2, activation='softmax'))
    model.summary()

    return model 

# Compiling Model
def compile_model(model):
    
    model.compile(loss = 'categorical_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics = ['accuracy'])
    return model

# Training Model
def train_model(model, X_tr, Y_tr):
    batch_size = 256
    epochs = 8
    history = model.fit(X_tr, Y_tr, batch_size = batch_size, epochs=epochs, verbose=1, validation_split = 0.1)
    return model, history

# Model Evaluation
def eval_model(m, test_X, test_Y):

    test_loss, test_accuracy = m.evaluate(test_X, test_Y, verbose=0)
            
    print('MODEL EVALUATION : \n')
    print("The testing set loss : ", test_loss)
    print("The testing set accuracy : ", test_accuracy)
       
    return None


model = build_model()
model = compile_model(model)
model, history = train_model(model, X_train, Y_train)


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 128)          6400000   
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               263168    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 16)                4112      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 34        
Total params: 6,667,314
Trainable params: 6,667,314
Non-trainable params: 0
______________________________________________
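
The parameter counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below is a sketch using the standard parameter formulas for embedding, LSTM and dense layers; the variable names are chosen for illustration.

```python
max_words, embedding_size = 50000, 128
lstm_units, dense_units, n_classes = 128, 16, 2

# Embedding: one vector of size 128 per vocabulary entry
embedding = max_words * embedding_size

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)
bidirectional = 2 * lstm_one_direction

# Dense layers: weights plus biases; the first one receives 2 * 128 features
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * n_classes + n_classes

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
# 6400000 263168 4112 34 6667314
```

About 96% of the parameters sit in the embedding layer, so the vocabulary size is the main driver of model size.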

In [10]:
# Plotting the model architecture and evaluating the model on the test set
tf.keras.utils.plot_model(model, to_file='Sentiment Analysis Model.png', show_shapes=True)
eval_model(model, X_test, Y_test)

MODEL EVALUATION : 

The testing set loss :  0.5739759802818298
The testing set accuracy :  0.7409499883651733
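
Accuracy alone does not show which class is misclassified more often. The sketch below builds a confusion matrix from one-hot labels and softmax outputs using NumPy only. The arrays are made up; in the notebook, `y_true` would presumably be `Y_test` and `y_prob` the result of `model.predict(X_test)`, and the `confusion_matrix` function imported from scikit-learn could be applied to the same class indices.

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs, columns = [Negative, Positive]
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_prob = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4]])

# Converting one-hot rows and probabilities to class indices
true_cls = y_true.argmax(axis=1)
pred_cls = y_prob.argmax(axis=1)

# Rows = true class, columns = predicted class
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (true_cls, pred_cls), 1)

accuracy = np.trace(cm) / cm.sum()
print(cm.tolist())  # [[1, 1], [1, 2]]
print(accuracy)     # 0.6
```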


In [11]:
#Using the model to generate a signed sentiment confidence score for a sample tweet
tweets = [["I pretty sure Tesla's stock price is going to double by this time next yr"]]

for sample_tweet in tweets:
    seq = tokenizer.texts_to_sequences(sample_tweet)
    padded = pad_sequences(seq, padding='post', maxlen=300)

    pred = model.predict(padded)
    print(pred)

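    # Class 0 = Negative, class 1 = Positive: the score is the confidence of the
    # winning class, negated when the prediction is negative
    if (np.argmax(pred[0]) == 0):
        sentiment_score = -1 * pred[0][0]
    else:
        sentiment_score = pred[0][1]

    print(sentiment_score)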


[[0.6517518  0.34824818]]
-0.6517518162727356
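
The scoring rule used above can be written as a small standalone function. This is a sketch; `signed_score` is a hypothetical name, and the first input reuses the probabilities printed above.

```python
import numpy as np

def signed_score(probs):
    """Map [P(negative), P(positive)] to a signed confidence score."""
    probs = np.asarray(probs, dtype=float)
    winner = int(np.argmax(probs))
    # Positive winner: +P(positive); negative winner: -P(negative)
    return float(probs[1]) if winner == 1 else -float(probs[0])

print(signed_score([0.6517518, 0.34824818]))  # -0.6517518
print(signed_score([0.2, 0.8]))               # 0.8
```

With two classes the winning probability is always at least 0.5, so a single tweet never receives a score strictly between -0.5 and 0.5. Values near zero appear only after averaging over many tweets.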


# Creating Sentiment Scores for Tesla Tweets
- Using the trained model from above to classify tweets related to Tesla

In [12]:
#Loading raw twitter data containing Date and Tweets pertaining to Tesla Stock
df_2017 = pd.read_csv("https://raw.githubusercontent.com/DDave94/Stock-Prediction-DL/main/datasets/raw/Tesla/tsla-tweets-2017.csv")
df_2018 = pd.read_csv("https://raw.githubusercontent.com/DDave94/Stock-Prediction-DL/main/datasets/raw/Tesla/tsla-tweets-2018.csv")
df_2019 = pd.read_csv("https://raw.githubusercontent.com/DDave94/Stock-Prediction-DL/main/datasets/raw/Tesla/tsla-tweets-2019.csv")
df_2020 = pd.read_csv("https://raw.githubusercontent.com/DDave94/Stock-Prediction-DL/main/datasets/raw/Tesla/tsla-tweets-2020.csv")

## Cleaning Tesla Tweets - Date Column
###### *Creating a Date column by removing the time component, since we want to aggregate sentiment scores on a daily basis*

In [13]:
def date_generation(date):
    # Parse timestamps of the form "YYYY-MM-DD HH:MM:SS+00:00" and drop the time component
    my_date = datetime.strptime(date, "%Y-%m-%d %H:%M:%S+00:00")
    return my_date.date()

def date_cleanse(input_df):
    input_df['Date'] = input_df['Datetime'].apply(date_generation)
    return input_df

# Creating clean date columns
df_2017 = date_cleanse(df_2017)
df_2018 = date_cleanse(df_2018)
df_2019 = date_cleanse(df_2019)
df_2020 = date_cleanse(df_2020)
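
As an alternative to parsing each timestamp with `strptime`, the whole column can be converted at once with `pd.to_datetime`. This is a sketch on made-up timestamps in the same format; it is not the notebook's method.

```python
import pandas as pd

# Made-up timestamps in the "YYYY-MM-DD HH:MM:SS+00:00" format
toy = pd.DataFrame({
    "Datetime": ["2017-01-01 23:59:59+00:00", "2017-01-02 00:00:01+00:00"],
})

# Parse the whole column as UTC timestamps, then keep only the calendar date
toy["Date"] = pd.to_datetime(toy["Datetime"], utc=True).dt.date
print([str(d) for d in toy["Date"]])  # ['2017-01-01', '2017-01-02']
```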

## Cleaning Tweet content
###### *Removing links, mentions, hashtags etc.*

In [14]:
#Generating clean tweets from current text
df_2017 = gen_clean_tweets(df_2017, 'Text')
df_2018 = gen_clean_tweets(df_2018, 'Text')
df_2019 = gen_clean_tweets(df_2019, 'Text')
df_2020 = gen_clean_tweets(df_2020, 'Text')

## Tweet Sentiment Calculations
###### *Generating the sentiment score for each tweet using the trained LSTM text classifier model*

In [15]:
def get_sentiment(tweet, sent_model): 
    
    #Vectorizing the tweet and padding it to the same length (300) used for training
    tweet_tokenized = tokenizer.texts_to_sequences([tweet])
    tweet_padded = pad_sequences(tweet_tokenized, padding='post', maxlen=max_tweet_len)

    #Generating prediction using trained model
    pred = sent_model.predict(tweet_padded, verbose=0)
    
    #Extracting the signed sentiment confidence score:
    #negative class (index 0) gives a negative score, positive class (index 1) a positive one
    if (np.argmax(pred[0]) == 0):
        sentiment_score = -1 * pred[0][0]
    else:
        sentiment_score = pred[0][1] 

    return sentiment_score

#Creates a sentiment score for each tweet using model and Tesla tweets
def sentiment_generation(sent_model, input_df): 

    input_df['TwitterSentiment'] = input_df['Text'].apply(get_sentiment, sent_model = sent_model)

    return input_df

df_2017 = sentiment_generation(model, df_2017)
df_2018 = sentiment_generation(model, df_2018)
df_2019 = sentiment_generation(model, df_2019)
df_2020 = sentiment_generation(model, df_2020)

print(df_2017[['Text', 'TwitterSentiment']].head(10))
df_2017.to_csv("/content/TweetsAndScores.csv")

                                                                                                  Text  \
0                                             Tesla's Stock Starts 2017 At A Critical Juncture Forbes    
1                                             Tesla s Stock Starts 2017 At A Critical Juncture Forbes    
2                             Tesla's Stock Starts 2017 At A Critical Juncture Forbes Google News Tech   
3                                     Tesla's Stock Starts 2017 At A Critical Juncture Forbes ggtechmy   
4                                    ggtechmy Tesla's Stock Starts 2017 At A Critical Juncture Forbes    
5                                             Tesla's Stock Starts 2017 At A Critical Juncture Forbes    
6  Tesla's Stock Starts 2017 At A Critical Juncture Forbes technology Tesla's Stock Starts 2017 At ...   
7                                   Tesla's Stock Starts 2017 At A Critical Juncture Forbes technology   
8                                   Tesla's St
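
Calling `predict` once per tweet is slow for large frames. A possible speed-up is to score all tweets in one call, as in the sketch below. The function `batch_signed_scores` and both stand-in callables are hypothetical: `encode_fn` plays the role of `tokenizer.texts_to_sequences` followed by `pad_sequences`, and `predict_fn` plays the role of `model.predict`.

```python
import numpy as np

def batch_signed_scores(texts, encode_fn, predict_fn):
    """Score all texts with a single predict call.

    encode_fn:  list of str -> 2-D integer array (hypothetical stand-in)
    predict_fn: 2-D array -> (n, 2) array of [P(negative), P(positive)]
    """
    probs = np.asarray(predict_fn(encode_fn(texts)), dtype=float)
    positive = probs.argmax(axis=1) == 1
    # Positive winner: +P(positive); negative winner: -P(negative)
    return np.where(positive, probs[:, 1], -probs[:, 0])

# Toy stand-ins so the sketch runs without a trained model
def fake_encode(texts):
    return np.array([[len(t)] for t in texts])

def fake_predict(x):
    # Longer texts get a higher positive probability, purely for illustration
    p_pos = np.clip(x[:, 0] / 10.0, 0.0, 1.0)
    return np.column_stack([1.0 - p_pos, p_pos])

scores = batch_signed_scores(["bad", "great day"], fake_encode, fake_predict)
print(scores)
```

With the toy stand-ins, the three-character text is scored about -0.7 and the nine-character text about 0.9. The result has one score per input text, in the same order, so it could be assigned directly to a dataframe column.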

## Sentiment Score Aggregation
###### *Aggregating the final sentiment scores to gather scores for each day*

In [16]:
#Creates an aggregated sentiment score for each day of tweets
def aggregated_df(input_df):
    # Mean sentiment score of all tweets posted on the same date
    agg_df = input_df.groupby("Date", as_index=False)['TwitterSentiment'].mean()
    return agg_df

final_2017 = aggregated_df(df_2017)
final_2018 = aggregated_df(df_2018)
final_2019 = aggregated_df(df_2019)
final_2020 = aggregated_df(df_2020)
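
The daily aggregation can be checked on a toy frame. The data below is made up; the `groupby` call follows the same pattern as `aggregated_df`.

```python
import pandas as pd

# Made-up scores: two tweets on the first day, one on the second
toy = pd.DataFrame({
    "Date": ["2017-01-01", "2017-01-01", "2017-01-02"],
    "TwitterSentiment": [-0.9, 0.7, 0.6],
})

# One row per date holding the mean score of that day's tweets
daily = toy.groupby("Date", as_index=False)["TwitterSentiment"].mean()
print(daily)
```

The first day averages to roughly -0.1: one strongly negative and one strongly positive tweet nearly cancel, so a daily score close to zero can mean either neutral or mixed sentiment.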

## Creating full input dataframe for further analysis and stock price predictions
###### *Combining Twitter data from 2017, 2018, 2019, and 2020*

In [17]:
#Creating full twitter sentiment data (each year's rows keep their own index)
twitter_data = pd.concat([final_2017, final_2018, final_2019, final_2020])
print(twitter_data.shape)
print(twitter_data.head())
print(twitter_data.tail())

twitter_data.to_csv('/content/tsla-tweet-sentiments-lstm.csv', index=False, encoding= 'utf-8-sig') 

(1456, 2)
         Date  TwitterSentiment
0  2017-01-01         -0.789756
1  2017-01-02          0.408966
2  2017-01-03          0.282842
3  2017-01-04         -0.129541
4  2017-01-05         -0.351872
           Date  TwitterSentiment
359  2020-12-25          0.014247
360  2020-12-26          0.068771
361  2020-12-27          0.088822
362  2020-12-28         -0.170788
363  2020-12-29          0.525323
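
The tail above shows index values 359 to 363 rather than values near 1455, because each yearly frame keeps its own index when the frames are combined. The sketch below, on made-up frames, contrasts that behaviour with `ignore_index=True`, which renumbers the rows.

```python
import pandas as pd

# Made-up yearly frames for illustration
a = pd.DataFrame({"Date": ["2017-01-01", "2017-01-02"], "TwitterSentiment": [0.1, 0.2]})
b = pd.DataFrame({"Date": ["2018-01-01"], "TwitterSentiment": [0.3]})

kept = pd.concat([a, b])                           # original indices are kept
renumbered = pd.concat([a, b], ignore_index=True)  # rows renumbered from 0

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Since the CSV is written with `index=False`, the duplicated index values do not reach the saved file; they only matter for label-based lookups on the in-memory frame.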
