# Sentiment Analysis on Stocktwits Tweets

## Objective

In this notebook, we aim to build a sentiment analysis model. The dataset was scraped from https://stocktwits.com/ and consists of tweets related to FAANG stocks. Each tweet's author labeled it as either Bearish or Bullish, which we take as the author's sentiment at the time of writing.

## Columns Description 


-    id : The tweet ID on the website

-    text: The tweet text body

-    time: The time the tweet was posted

-    sentiment: Bearish (-) or Bullish (+)

## Data Wrangling

### Importing Libraries

In [None]:
!pip install utils

Collecting utils
  Downloading utils-1.0.1-py2.py3-none-any.whl (21 kB)
Installing collected packages: utils
Successfully installed utils-1.0.1


In [None]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.3.4-py3-none-any.whl (235 kB)

In [None]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.21


In [None]:
import re

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import unidecode
from sklearn.metrics import classification_report, confusion_matrix

# Render plots inline in the notebook
%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Loading Data

In [None]:
df= pd.read_csv('/content/drive/MyDrive/data/df.csv')
data= df.copy()
df.head(5)

Unnamed: 0.1,Unnamed: 0,id,text,time,sentiment,m
0,0,434779043,$GOOG those puts puts rotting in your account📈,2022-02-09 15:05:45,Bullish,
1,1,434773104,$PYPL $FB $GOOG 👀,2022-02-09 14:56:17,,
2,2,434771897,$GOOG Strong like a bull. Gap close was the re...,2022-02-09 14:54:18,Bullish,
3,3,434768594,$GOOG this market is more bearish than usual,2022-02-09 14:48:40,,
4,4,434762084,*Update 2/2/22* \n \n$GOOG Bounce of the Wall...,2022-02-09 14:37:25,,


### NULL Values

In [None]:
df.drop(columns='m', inplace=True)

In [None]:
df=df.dropna()
df.shape

(139759, 5)

### Text Processing




#### Text Cleaning & Contractions Expanding

In [None]:
# Add apostrophe-free spellings (common in tweets) to the library's built-in contractions list:
# https://github.com/kootenpv/contractions/blob/master/contractions/data/contractions_dict.json
contractions.add('isnt', 'is not')
contractions.add('arent', 'are not')
contractions.add('doesnt', 'does not')
contractions.add('dont', 'do not')
contractions.add('didnt', 'did not')
contractions.add('cant', 'can not')
contractions.add('couldnt', 'could not')
contractions.add('hadnt', 'had not')
contractions.add('hasnt', 'has not')
contractions.add('havent', 'have not')
contractions.add('shouldnt', 'should not')
contractions.add('wasnt', 'was not')
contractions.add('werent', 'were not')
contractions.add('wont', 'will not')
contractions.add('wouldnt', 'would not')
contractions.add('cannot', 'can not')
contractions.add('can\'t', 'can not')
contractions.add( "can't've", "can not have")

In [None]:
def preprocess(doc):
    doc = unidecode.unidecode(doc) # transliterate Unicode text to its closest ASCII representation
    doc = contractions.fix(doc) # expand contractions
    doc = re.sub(r'[\t\n]', ' ', doc) # replace newlines and tabs with spaces
    doc = re.sub(r'@[A-Za-z0-9_]+', '', doc) # remove mentions
    doc = re.sub(r'#[A-Za-z0-9_]+', '', doc) # remove hashtags
    doc = re.sub(r'https?://[^ ]+', '', doc) # remove http(s) URLs
    doc = re.sub(r'www\.[^ ]+', '', doc) # remove www URLs (dot escaped so it matches a literal '.')
    doc = re.sub(r'[^A-Za-z]+', ' ', doc) # replace every run of non-letter characters with a space
    doc = re.sub(r' +', ' ', doc) # collapse repeated spaces into one
    doc = doc.strip().lower() # trim leading/trailing spaces and lowercase the text
    return doc
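
To make the effect of the regular-expression steps concrete, here is a small standard-library sketch applied to a made-up tweet. It deliberately leaves out the `unidecode` and `contractions` calls so that it runs without extra packages; the function name `clean_regex_only` and the sample text are illustrative assumptions, not part of the original pipeline.

```python
import re

def clean_regex_only(doc):
    # Same regex steps as the cleaning function, minus unidecode/contractions
    doc = re.sub(r'[\t\n]', ' ', doc)           # newlines and tabs -> spaces
    doc = re.sub(r'@[A-Za-z0-9_]+', '', doc)    # drop mentions
    doc = re.sub(r'#[A-Za-z0-9_]+', '', doc)    # drop hashtags
    doc = re.sub(r'https?://[^ ]+', '', doc)    # drop http(s) URLs
    doc = re.sub(r'www\.[^ ]+', '', doc)        # drop www URLs
    doc = re.sub(r'[^A-Za-z]+', ' ', doc)       # non-letters -> single space
    doc = re.sub(r' +', ' ', doc)               # collapse repeated spaces
    return doc.strip().lower()

# Hypothetical tweet, invented for illustration
sample = "$GOOG up 5% today!\n@trader1 see https://example.com #bullish"
cleaned = clean_regex_only(sample)
print(cleaned)  # goog up today see
```

Note that the cashtag `$GOOG` survives as the plain token `goog`, while digits, punctuation, mentions, hashtags, and links are removed.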

In [None]:
df['processed'] = df['text'].apply(preprocess)

In [None]:
df['segmented'] = df['processed'].apply(lambda x: x.split()) 

#### Stemming and lemmatization

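Lemmatization reduces inflected or derived word forms to a common dictionary form (the lemma), for example `puts` to `put` and `rotting` to `rot`, so that variants of the same word are handled and embedded as one. Note that the code below uses NLTK's WordNet lemmatizer guided by part-of-speech tags rather than a stemmer, although the resulting column is named `stemmed`.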

In [None]:
from nltk.corpus import wordnet
# Map the POS tag from NLTK to the character codes the WordNet lemmatizer accepts, so it knows each word's part of speech
def get_wordnet_pos(word): 
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
from nltk.stem import WordNetLemmatizer
# Lemmatize all words in a list of words using their POS
def lemmatizerHelper(words):
    lemmatizer = WordNetLemmatizer()
    l = []
    for w in words:
        l.append(lemmatizer.lemmatize(w , get_wordnet_pos(w)))
    return l

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
df['stemmed'] = df['segmented'].apply(lemmatizerHelper) # lemmatize each token using its POS tag


In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

#### Stop Words

In [None]:
from string import ascii_lowercase

stop_words = set(nltk.corpus.stopwords.words('english'))
exclude_words = set(("not", "no"))
new_stop_words = stop_words.difference(exclude_words)

# adding single characters to new_stop_words
for c in ascii_lowercase:
    new_stop_words.add(c)

In [None]:
df['stopRemoved'] = df['stemmed'].apply(lambda words: [word for word in words if word not in new_stop_words])
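
The stop-word step keeps "not" and "no" because they carry the negation that later steps rely on, and it also drops stray single letters left over from cleaning. A minimal sketch of that idea, using a tiny hand-written stop list as a stand-in for NLTK's English list (the list contents and variable names here are assumptions for illustration):

```python
from string import ascii_lowercase

# Hypothetical miniature stop list standing in for NLTK's English stop words
base_stop_words = {'the', 'is', 'a', 'in', 'not', 'no'}
keep = {'not', 'no'}

# Remove the negations from the stop list, then add every single letter
stop = (base_stop_words - keep) | set(ascii_lowercase)

tokens = ['the', 'stock', 'is', 'not', 'a', 'buy', 's']
filtered = [t for t in tokens if t not in stop]
print(filtered)  # ['stock', 'not', 'buy']
```

Had "not" been left in the stop list, the example would reduce to `['stock', 'buy']`, which reads as the opposite sentiment.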

#### Tokenization

In [None]:
negationWords = ['not', 'no', 'never']

# Replace each run of negation words in a tokenized list with 'not' concatenated to the next non-negation word (a bigram merged into one token)
# for example ['never', 'no', 'not', 'happy', 'journey'] becomes ['nothappy', 'journey']
def bigramNegationWords(words):
    l = []
    metNegation = False
    bigram = ''
    for w in words:
        if w in negationWords:
            if metNegation == False:
                bigram += 'not'
                metNegation = True
            else:
                continue
        else:
            if metNegation == True:
                bigram += w
                l.append(bigram)
                metNegation = False
                bigram = ''
            else:
                l.append(w)
    return l
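
The behaviour of the negation-merging step can be checked on a few small inputs. The sketch below is a compact restatement of the same logic under hypothetical names (`NEGATIONS`, `merge_negations`); it is meant for experimenting with edge cases, not as a replacement for the function above.

```python
NEGATIONS = {'not', 'no', 'never'}

def merge_negations(tokens):
    # Collapse a run of negation words and glue 'not' onto the next ordinary token
    merged, pending = [], False
    for tok in tokens:
        if tok in NEGATIONS:
            pending = True
        elif pending:
            merged.append('not' + tok)
            pending = False
        else:
            merged.append(tok)
    return merged

print(merge_negations(['never', 'no', 'not', 'happy', 'journey']))  # ['nothappy', 'journey']
print(merge_negations(['stock', 'not', 'good']))                    # ['stock', 'notgood']
print(merge_negations(['buy', 'never']))                            # ['buy']
```

The last case shows that a negation with no following word is dropped, because there is nothing to attach it to.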


In [None]:
df['negated'] = df['stopRemoved'].apply(bigramNegationWords)


In [None]:
df=df[df['negated'].map(lambda d: len(d)) > 1]

In [None]:
def convToDict(words):
    # Count how often each token occurs, skipping the ticker symbols amzn, fb, and goog
    freq = dict()
    for word in words:
        if word in ('amzn', 'fb', 'goog'):
            continue
        freq[word] = freq.get(word, 0) + 1
    return freq
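
For reference, the same word counting can be expressed with `collections.Counter` from the standard library. This is an alternative sketch with hypothetical names (`TICKERS`, `count_words`), shown on the tokens of the first tweet in the dataset:

```python
from collections import Counter

TICKERS = {'amzn', 'fb', 'goog'}  # ticker tokens excluded from the counts

def count_words(words):
    # Counter tallies the remaining tokens; dict() gives a plain dictionary back
    return dict(Counter(w for w in words if w not in TICKERS))

result = count_words(['goog', 'put', 'put', 'rot', 'account'])
print(result)  # {'put': 2, 'rot': 1, 'account': 1}
```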


In [None]:
df['words'] = df['negated'].apply(convToDict)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,text,time,sentiment,processed,segmented,stemmed,stopRemoved,negated,words
0,0,434779043,$GOOG those puts puts rotting in your account📈,2022-02-09 15:05:45,Bullish,goog those puts puts rotting in your account,"[goog, those, puts, puts, rotting, in, your, a...","[goog, those, put, put, rot, in, your, account]","[goog, put, put, rot, account]","[goog, put, put, rot, account]","{'put': 2, 'rot': 1, 'account': 1}"
2,2,434771897,$GOOG Strong like a bull. Gap close was the re...,2022-02-09 14:54:18,Bullish,goog strong like a bull gap close was the reas...,"[goog, strong, like, a, bull, gap, close, was,...","[goog, strong, like, a, bull, gap, close, be, ...","[goog, strong, like, bull, gap, close, reason,...","[goog, strong, like, bull, gap, close, reason,...","{'strong': 1, 'like': 1, 'bull': 1, 'gap': 1, ..."
8,8,434746357,$GOOG Pessimistic on its future. Respect all u...,2022-02-09 14:01:15,Bearish,goog pessimistic on its future respect all use...,"[goog, pessimistic, on, its, future, respect, ...","[goog, pessimistic, on, it, future, respect, a...","[goog, pessimistic, future, respect, user, res...","[goog, pessimistic, future, respect, user, res...","{'pessimistic': 1, 'future': 1, 'respect': 2, ..."
9,9,434743052,$SPY NEW ALL TIME HIGHS COMING WITHIN A FEW WE...,2022-02-09 13:51:53,Bullish,spy new all time highs coming within a few wee...,"[spy, new, all, time, highs, coming, within, a...","[spy, new, all, time, high, come, within, a, f...","[spy, new, time, high, come, within, week, vix...","[spy, new, time, high, come, within, week, vix...","{'spy': 2, 'new': 1, 'time': 1, 'high': 1, 'co..."
13,13,434725757,$GOOG Alphabet a screaming buy! Should be a Do...,2022-02-09 12:55:31,Bullish,goog alphabet a screaming buy should be a dow ...,"[goog, alphabet, a, screaming, buy, should, be...","[goog, alphabet, a, scream, buy, should, be, a...","[goog, alphabet, scream, buy, dow, component, ...","[goog, alphabet, scream, buy, dow, component, ...","{'alphabet': 1, 'scream': 1, 'buy': 1, 'dow': ..."


#### GloVe Embedding

Treating each word as its own feature (for example, with one-hot or bag-of-words encoding) yields a very high-dimensional, sparse representation that is costly in memory and ignores the words' context and their relations to one another. Pre-trained word embeddings such as GloVe address this by representing each word as a dense, low-dimensional, continuous vector. The objective of any word embedding model is to encode the context of a word and its relationship to other words in the corpus in that vector, so that semantically and/or syntactically similar words lie close to each other in the embedding space.
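
The contrast between one-feature-per-word encodings and dense embeddings can be illustrated with cosine similarity. In the sketch below the dense vectors are invented two-dimensional values chosen for illustration only; they are not real GloVe vectors, and the word choices are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors: every pair of distinct words is equally unrelated
one_hot = {'rally': [1, 0, 0], 'surge': [0, 1, 0], 'lawsuit': [0, 0, 1]}

# Hypothetical dense vectors (made up, not taken from GloVe)
dense = {'rally': [0.9, 0.1], 'surge': [0.8, 0.3], 'lawsuit': [-0.7, 0.2]}

sim_one_hot = cosine(one_hot['rally'], one_hot['surge'])
sim_close = cosine(dense['rally'], dense['surge'])
sim_far = cosine(dense['rally'], dense['lawsuit'])
print(sim_one_hot, round(sim_close, 3), round(sim_far, 3))
```

With one-hot vectors the similarity between any two different words is zero, whereas a dense space can place related words near each other and unrelated ones far apart.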



In [None]:
import gensim.downloader as api
wv = api.load('glove-twitter-200')



In [None]:
def wvContains(word):
    try:
        x = wv[word]
        return True
    except KeyError:
        return False

In [None]:
def doc2vec(x):
    # Frequency-weighted average of the GloVe vectors of the words in x (a word -> count dict)
    word_dict = x
    sv = np.zeros(200)
    s_freq = 0
    for word, freq in word_dict.items():

        if wvContains(word):
            sv += (wv[word] * freq)
            s_freq += freq
        else:
            # An out-of-vocabulary word may be one of our negation bigrams that begins with 'not'
            if word[0:3] == 'not' and word[0:7] != 'nothing':
                if wvContains(word[3:]):
                    sv += (wv[word[0:3]] + wv[word[3:]]) * freq
                    s_freq += 2 * freq
                else:
                    # Drop leading characters of the remainder until a known suffix is found;
                    # the length bound stops the loop if no suffix is in the vocabulary
                    end = 3
                    while (end < len(word)) and (not wvContains(word[end:])):
                        end += 1
                    if end < len(word):
                        sv += (wv[word[0:3]] + wv[word[end:]]) * freq
                        s_freq += 2 * freq
            else:
                # Or it can be an elongated word like
                # ummmm, loveee, omggg, ahhhhhhhhhhh
                # so we remove trailing characters until wv recognizes it or only one character is left
                end = len(word)-1
                while (end > 1) and (not wvContains(word[0:end])):
                    end -= 1

                if wvContains(word[0:end]):
                    sv += (wv[word[0:end]] * freq)
                    s_freq += freq
    if s_freq != 0:
        return (1/s_freq) * sv
    else:
        return np.zeros(200)
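
Two ideas in the function above are easier to see on toy data: the frequency-weighted average of word vectors, and the trimming of elongated words until a known word remains. The sketch below uses a hypothetical two-dimensional vocabulary (`toy_vectors`) and illustrative helper names; it does not use the real GloVe model.

```python
# Hypothetical 2-d "embeddings", invented for illustration
toy_vectors = {'put': [1.0, 0.0], 'rot': [0.0, 1.0], 'love': [0.5, 0.5]}

def weighted_average(word_counts, vectors, dim=2):
    # Sum each known word's vector times its count, then divide by the total count
    total = [0.0] * dim
    weight = 0
    for word, count in word_counts.items():
        if word in vectors:
            for i, value in enumerate(vectors[word]):
                total[i] += value * count
            weight += count
    return [t / weight for t in total] if weight else total

def trim_to_known(word, vocab):
    # Strip trailing characters until the prefix is in the vocabulary
    end = len(word) - 1
    while end > 1 and word[:end] not in vocab:
        end -= 1
    return word[:end] if word[:end] in vocab else None

avg = weighted_average({'put': 2, 'rot': 1, 'account': 1}, toy_vectors)
print(avg)                                      # roughly [0.667, 0.333]
print(trim_to_known('loveee', toy_vectors))     # love
print(trim_to_known('zzzz', toy_vectors))       # None
```

Here `account` is absent from the toy vocabulary, so it contributes nothing to either the sum or the weight, which mirrors how unknown words are skipped.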

In [None]:
df['Vec'] = df['words'].apply(doc2vec)

In [None]:
columns_names = []
for i in range(200):
    columns_names.append('v_' + str(i))


In [None]:
df

Unnamed: 0.1,Unnamed: 0,id,text,time,sentiment,processed,segmented,stemmed,stopRemoved,negated,words,Vec
0,0,434779043,$GOOG those puts puts rotting in your account📈,2022-02-09 15:05:45,Bullish,goog those puts puts rotting in your account,"[goog, those, puts, puts, rotting, in, your, a...","[goog, those, put, put, rot, in, your, account]","[goog, put, put, rot, account]","[goog, put, put, rot, account]","{'put': 2, 'rot': 1, 'account': 1}","[0.27986499667167664, 0.26837900839746, -0.147..."
2,2,434771897,$GOOG Strong like a bull. Gap close was the re...,2022-02-09 14:54:18,Bullish,goog strong like a bull gap close was the reas...,"[goog, strong, like, a, bull, gap, close, was,...","[goog, strong, like, a, bull, gap, close, be, ...","[goog, strong, like, bull, gap, close, reason,...","[goog, strong, like, bull, gap, close, reason,...","{'strong': 1, 'like': 1, 'bull': 1, 'gap': 1, ...","[0.014937801565974951, -0.12589470557868482, 0..."
8,8,434746357,$GOOG Pessimistic on its future. Respect all u...,2022-02-09 14:01:15,Bearish,goog pessimistic on its future respect all use...,"[goog, pessimistic, on, its, future, respect, ...","[goog, pessimistic, on, it, future, respect, a...","[goog, pessimistic, future, respect, user, res...","[goog, pessimistic, future, respect, user, res...","{'pessimistic': 1, 'future': 1, 'respect': 2, ...","[0.04207199811935425, 0.4953599959611893, -0.4..."
9,9,434743052,$SPY NEW ALL TIME HIGHS COMING WITHIN A FEW WE...,2022-02-09 13:51:53,Bullish,spy new all time highs coming within a few wee...,"[spy, new, all, time, highs, coming, within, a...","[spy, new, all, time, high, come, within, a, f...","[spy, new, time, high, come, within, week, vix...","[spy, new, time, high, come, within, week, vix...","{'spy': 2, 'new': 1, 'time': 1, 'high': 1, 'co...","[0.053019252000376584, 0.1266183504834771, 0.1..."
13,13,434725757,$GOOG Alphabet a screaming buy! Should be a Do...,2022-02-09 12:55:31,Bullish,goog alphabet a screaming buy should be a dow ...,"[goog, alphabet, a, screaming, buy, should, be...","[goog, alphabet, a, scream, buy, should, be, a...","[goog, alphabet, scream, buy, dow, component, ...","[goog, alphabet, scream, buy, dow, component, ...","{'alphabet': 1, 'scream': 1, 'buy': 1, 'dow': ...","[0.11021167288223901, 0.1597609973202149, 0.10..."
...,...,...,...,...,...,...,...,...,...,...,...,...
310158,310158,433186085,$GOOG $GOOGL $TSLA $AAPL watching for end of d...,2022-02-03 19:23:39,Bullish,goog googl tsla aapl watching for end of day r...,"[goog, googl, tsla, aapl, watching, for, end, ...","[goog, googl, tsla, aapl, watch, for, end, of,...","[goog, googl, tsla, aapl, watch, end, day, ral...","[goog, googl, tsla, aapl, watch, end, day, ral...","{'googl': 1, 'tsla': 1, 'aapl': 1, 'watch': 1,...","[0.05356875213328749, 0.11627999972552061, 0.1..."
310162,310162,433182485,$GOOG looks like oh yea they did beat,2022-02-03 19:15:04,Bullish,goog looks like oh yea they did beat,"[goog, looks, like, oh, yea, they, did, beat]","[goog, look, like, oh, yea, they, do, beat]","[goog, look, like, oh, yea, beat]","[goog, look, like, oh, yea, beat]","{'look': 1, 'like': 1, 'oh': 1, 'yea': 1, 'bea...","[0.10256560165435076, 0.048490796238183975, -0..."
310163,310163,433181187,$GOOG be like,2022-02-03 19:11:40,Bullish,goog be like,"[goog, be, like]","[goog, be, like]","[goog, like]","[goog, like]",{'like': 1},"[-0.015537000261247158, 0.1115799993276596, -0..."
310165,310165,433178927,"$HEXO Just a short attack, that&#39;s what thi...",2022-02-03 19:06:03,Bullish,hexo just a short attack that s what this whol...,"[hexo, just, a, short, attack, that, s, what, ...","[hexo, just, a, short, attack, that, s, what, ...","[hexo, short, attack, whole, drop, risky, go, ...","[hexo, short, attack, whole, drop, risky, go, ...","{'hexo': 2, 'short': 1, 'attack': 1, 'whole': ...","[0.10189482061700388, -0.022125458852811294, 0..."


In [None]:
ll = []
for i in range(len(df)):
    ll.append(df['Vec'].iloc[i])

In [None]:
dd = pd.DataFrame(ll, columns=columns_names)
dd.head()

Unnamed: 0,v_0,v_1,v_2,v_3,v_4,v_5,v_6,v_7,v_8,v_9,...,v_190,v_191,v_192,v_193,v_194,v_195,v_196,v_197,v_198,v_199
0,0.279865,0.268379,-0.147138,0.078502,0.293288,-0.30326,0.737762,-0.363812,0.405097,0.448958,...,-0.357808,-0.014092,0.015782,0.175955,0.31829,-0.035553,0.017645,0.183933,-0.217057,-0.157776
1,0.014938,-0.125895,0.14286,-0.060451,-0.081608,0.259958,0.548813,-0.096434,-0.18173,-0.204038,...,-0.079029,-0.205554,0.203361,0.006617,-0.085437,-0.11053,0.065429,0.076048,-0.042373,0.022726
2,0.042072,0.49536,-0.429579,0.368006,-0.131992,0.05398,0.618166,0.05438,-0.078632,0.253386,...,-0.415578,-0.027236,0.130077,-0.217058,0.13422,0.030662,0.341852,0.236134,-0.10142,-0.301104
3,0.053019,0.126618,0.177829,-0.065922,0.04001,-0.015653,0.450136,-0.11641,-0.022298,-0.153948,...,0.081481,0.069976,0.261568,0.025245,-0.168796,-0.022949,-0.027676,-0.009023,0.0497,-0.073593
4,0.110212,0.159761,0.104954,0.086427,-0.106692,-0.031113,0.471419,0.045153,0.209708,-0.222888,...,0.216143,0.099845,0.085075,0.117538,0.329022,-0.084232,0.071955,0.365174,0.093542,0.079625


### Label Encoding

In [None]:
l= df['sentiment'].to_list()

In [None]:
dd['label']=l

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(dd['label'])
dd['label']=le.transform(dd['label'])
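
`LabelEncoder` assigns integer codes in sorted order of the class names, so Bearish becomes 0 and Bullish becomes 1. The ordering rule can be reproduced with the standard library alone; the sketch below only illustrates that rule on the sentiments of the first five retained tweets and is not a substitute for the scikit-learn encoder.

```python
# Sentiments of the first five rows kept after preprocessing
labels = ['Bullish', 'Bullish', 'Bearish', 'Bullish', 'Bullish']

classes = sorted(set(labels))  # alphabetical order of the class names
code_of = {name: code for code, name in enumerate(classes)}
encoded = [code_of[name] for name in labels]

print(code_of)  # {'Bearish': 0, 'Bullish': 1}
print(encoded)  # [1, 1, 0, 1, 1]
```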

In [None]:
dd.head()

Unnamed: 0,v_0,v_1,v_2,v_3,v_4,v_5,v_6,v_7,v_8,v_9,...,v_191,v_192,v_193,v_194,v_195,v_196,v_197,v_198,v_199,label
0,0.279865,0.268379,-0.147138,0.078502,0.293288,-0.30326,0.737762,-0.363812,0.405097,0.448958,...,-0.014092,0.015782,0.175955,0.31829,-0.035553,0.017645,0.183933,-0.217057,-0.157776,1
1,0.014938,-0.125895,0.14286,-0.060451,-0.081608,0.259958,0.548813,-0.096434,-0.18173,-0.204038,...,-0.205554,0.203361,0.006617,-0.085437,-0.11053,0.065429,0.076048,-0.042373,0.022726,1
2,0.042072,0.49536,-0.429579,0.368006,-0.131992,0.05398,0.618166,0.05438,-0.078632,0.253386,...,-0.027236,0.130077,-0.217058,0.13422,0.030662,0.341852,0.236134,-0.10142,-0.301104,0
3,0.053019,0.126618,0.177829,-0.065922,0.04001,-0.015653,0.450136,-0.11641,-0.022298,-0.153948,...,0.069976,0.261568,0.025245,-0.168796,-0.022949,-0.027676,-0.009023,0.0497,-0.073593,1
4,0.110212,0.159761,0.104954,0.086427,-0.106692,-0.031113,0.471419,0.045153,0.209708,-0.222888,...,0.099845,0.085075,0.117538,0.329022,-0.084232,0.071955,0.365174,0.093542,0.079625,1


In [None]:
dd.iloc[:,:-1]

Unnamed: 0,v_0,v_1,v_2,v_3,v_4,v_5,v_6,v_7,v_8,v_9,...,v_190,v_191,v_192,v_193,v_194,v_195,v_196,v_197,v_198,v_199
0,0.279865,0.268379,-0.147138,0.078502,0.293288,-0.303260,0.737762,-0.363812,0.405097,0.448958,...,-0.357808,-0.014092,0.015782,0.175955,0.318290,-0.035553,0.017645,0.183933,-0.217057,-0.157776
1,0.014938,-0.125895,0.142860,-0.060451,-0.081608,0.259958,0.548813,-0.096434,-0.181730,-0.204038,...,-0.079029,-0.205554,0.203361,0.006617,-0.085437,-0.110530,0.065429,0.076048,-0.042373,0.022726
2,0.042072,0.495360,-0.429579,0.368006,-0.131992,0.053980,0.618166,0.054380,-0.078632,0.253386,...,-0.415578,-0.027236,0.130077,-0.217058,0.134220,0.030662,0.341852,0.236134,-0.101420,-0.301104
3,0.053019,0.126618,0.177829,-0.065922,0.040010,-0.015653,0.450136,-0.116410,-0.022298,-0.153948,...,0.081481,0.069976,0.261568,0.025245,-0.168796,-0.022949,-0.027676,-0.009023,0.049700,-0.073593
4,0.110212,0.159761,0.104954,0.086427,-0.106692,-0.031113,0.471419,0.045153,0.209708,-0.222888,...,0.216143,0.099845,0.085075,0.117538,0.329022,-0.084232,0.071955,0.365174,0.093542,0.079625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132455,0.053569,0.116280,0.182626,-0.025245,-0.120307,0.139814,0.464864,-0.386391,-0.114090,0.067444,...,0.168529,-0.067951,0.344485,0.114856,-0.179466,-0.085015,0.139586,0.029999,0.104841,0.176854
132456,0.102566,0.048491,-0.201700,0.432200,-0.220000,0.117360,0.353924,-0.113964,-0.021586,0.091249,...,-0.413358,0.027226,0.223704,-0.048636,-0.136261,-0.440366,0.137046,-0.156233,-0.083904,0.166464
132457,-0.015537,0.111580,-0.235990,0.758950,-0.454890,0.077948,0.735210,-0.380760,-0.333870,-0.345510,...,-0.265240,-0.158830,0.129980,-0.053881,-0.172170,-0.298530,0.390660,-0.037577,-0.135390,0.459650
132458,0.101895,-0.022125,0.112834,-0.007842,0.218090,0.005458,0.574503,-0.142385,0.241668,-0.106457,...,0.122567,0.125360,-0.020866,-0.034110,0.299185,-0.207684,0.188729,0.084678,-0.053094,-0.199585


In [None]:
dd.to_csv("ScrapedEmbed.csv")
X= dd.iloc[:,:-1]
y= dd['label']

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
y_test.value_counts()

1    17924
0     8568
Name: label, dtype: int64
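
The counts above show that the test set is imbalanced, so a useful reference point is the accuracy of a trivial classifier that always predicts the majority class. A minimal sketch using the printed counts (variable names are illustrative):

```python
# Test-set class counts copied from the value_counts() output
counts = {1: 17924, 0: 8568}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))  # 1 0.677
```

Accuracy figures for the trained models are best read against this baseline of roughly 68%.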

### Models Training

#### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
NV_clf = GaussianNB().fit(X_train, y_train)
NV_pred = NV_clf.predict(X_test)

#### Naive Bayes Evaluation

In [None]:
print(classification_report(y_test, NV_pred))

              precision    recall  f1-score   support

           0       0.41      0.37      0.39      8568
           1       0.71      0.75      0.73     17924

    accuracy                           0.63     26492
   macro avg       0.56      0.56      0.56     26492
weighted avg       0.62      0.63      0.62     26492



#### MLP Classifier

In [None]:
from sklearn.neural_network import MLPClassifier
NN_clf_stop = MLPClassifier(random_state=1, max_iter=30, hidden_layer_sizes=(16,16), tol=1e-5, early_stopping=True, learning_rate_init=0.01)
NN_clf_stop.fit(X_train, y_train)
NN_pred_stop = NN_clf_stop.predict(X_test)

  "X does not have valid feature names, but"

#### MLP Classifier Evaluation

In [None]:
print(classification_report(y_test, NN_pred_stop))

              precision    recall  f1-score   support

           0       0.69      0.51      0.59      8568
           1       0.79      0.89      0.84     17924

    accuracy                           0.77     26492
   macro avg       0.74      0.70      0.72     26492
weighted avg       0.76      0.77      0.76     26492
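
The f1-score column in both classification reports is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes the per-class values from the rounded precision and recall figures printed in the reports; because the inputs are already rounded, small differences in further derived averages are possible.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the two classification reports
naive_bayes = {0: (0.41, 0.37), 1: (0.71, 0.75)}
mlp = {0: (0.69, 0.51), 1: (0.79, 0.89)}

nb_f1 = {label: round(f1(p, r), 2) for label, (p, r) in naive_bayes.items()}
mlp_f1 = {label: round(f1(p, r), 2) for label, (p, r) in mlp.items()}

print(nb_f1)   # {0: 0.39, 1: 0.73}
print(mlp_f1)  # {0: 0.59, 1: 0.84}
```

In both models the Bearish class (label 0) scores well below the Bullish class, which is consistent with it being the minority class in the data.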



### Test

In [None]:
test= {'great': 1, 'good': 1, 'company':1}
x= doc2vec(test)
x= np.array(x)
out= NV_clf.predict([x])
out

  "X does not have valid feature names, but"


array([1])

In [None]:
import joblib
filename = 'SentAnalysis_model.sav'
joblib.dump(NN_clf_stop, filename)

['SentAnalysis_model.sav']

### Conclusion

Within the time available for scraping, we collected 139,759 labeled tweets and trained two basic sentiment classifiers: Gaussian Naive Bayes reached 63% accuracy on the held-out test set, and an MLP classifier, which works as a simple neural network, reached 77%.
The next step is to use the saved MLP model to predict tweet sentiment for the FAANG stocks over a desired period. These predictions will then serve as a feature, alongside each stock's open and close prices, for the LSTM time-series model we have created to forecast future stock prices.

## References:

- https://developer.twitter.com/en/docs/tutorials/how-to-analyze-the-sentiment-of-your-own-tweets
- https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-analysis-a-nlp-use-case-for-beginners/
- https://towardsdatascience.com/step-by-step-twitter-sentiment-analysis-in-python-d6f650ade58d