Each headline corresponds to a stock symbol (currently only US markets are tracked: companies in the NASDAQ and the DJIA) and a sentiment label. The sentiment labels are as follows:<br>

0 -> The stock went down by market close the day the article was published (negative)<br>

1 -> The stock went up by market close the day the article was published (positive)<br>

2 -> The stock stayed the same by market close the day the article was published (neutral)
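
The labelling rule above can be sketched as a small function. This is only an illustration: the dataset description does not say which reference price the close is compared against, so the `reference_price` argument (for example the day's open) and the function name `label_from_prices` are assumptions.

```python
def label_from_prices(reference_price: float, close_price: float) -> int:
    """Map a day's price move to the 0/1/2 label scheme described above."""
    if close_price < reference_price:
        return 0  # negative: the stock went down
    if close_price > reference_price:
        return 1  # positive: the stock went up
    return 2      # neutral: the stock stayed the same


# Made-up prices, one per label
print(label_from_prices(100.0, 98.5))
print(label_from_prices(100.0, 101.2))
print(label_from_prices(100.0, 100.0))
```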

**Goal: create a Word2Vec model that can infer market sentiment from a news feed provided by the user**

In [50]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [51]:
# show all rows & columns (not required here)
# pd.set_option("display.max_rows", None)
# pd.set_option("display.max_columns", None)

In [52]:
# import the DJIA & NASDAQ dataframes
df_dija=pd.read_csv('../input/news-sentiment-analysis-for-stock-data-by-company/djia_news copy.csv/djia_news copy.csv',encoding = "ISO-8859-1")
df_nasdaq=pd.read_csv('../input/news-sentiment-analysis-for-stock-data-by-company/nasdaq.csv/nasdaq.csv',encoding = "ISO-8859-1")

In [53]:
#combine the dataframes together
df=pd.concat([df_dija,df_nasdaq])

In the US stock market, the DJIA (Dow Jones Industrial Average) tracks 30 prominent companies, while the NASDAQ (National Association of Securities Dealers Automated Quotations) Composite covers the companies listed on the Nasdaq stock exchange. Nasdaq is also the name of that stock exchange.

Historically the DJIA was associated with industrial companies and the NASDAQ with tech and internet companies, but that split no longer holds.


**Hypothesis: combining the NASDAQ and the Dow gives a more holistic view of the market, so I've merged the DJIA and NASDAQ data (whether training them separately gives a better model still needs to be tested)**

In [54]:
df.head(5)

In [55]:
df.info()

To visualize the sentiment distribution I'm using the DJIA data (`df_dija`) because:
1. it is small, so patterns are easy to spot
2. it contains some of the market's most prominent companies, so it gives an overview of the economy and industry

In [56]:
# count the rows (headlines) per ticker; labels: 1 - positive, 0 - negative, 2 - neutral
df_dija.groupby(by='Ticker').count()

There are quite a lot of labelled headlines per ticker, so two questions need answering:
1. Are there any duplicate headlines?
2. If there are duplicates, does the market react consistently when it encounters them?
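
Both questions can be checked directly on the data. The sketch below uses a made-up toy frame with the same column names as the notebook (`Ticker`, `Headline`, `Label`); the rows and the names `toy`, `consistent` and `conflicting` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Made-up rows standing in for df_dija; only the column names follow the notebook.
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "Headline": ["shares jump", "shares jump", "ceo resigns", "ceo resigns", "new product"],
    "Label": [1, 1, 0, 1, 2],
})

# Question 1: keep every row whose headline occurs more than once.
dups = toy[toy.duplicated("Headline", keep=False)]

# Question 2: a duplicated headline is "consistent" if all its copies share one label.
labels_per_headline = dups.groupby("Headline")["Label"].nunique()
consistent = labels_per_headline[labels_per_headline == 1].index.tolist()
conflicting = labels_per_headline[labels_per_headline > 1].index.tolist()

print("duplicated rows:", len(dups))
print("consistent:", consistent)
print("conflicting:", conflicting)
```

Applied to `df_dija`, a long `conflicting` list would mean that the same headline was followed by different market moves.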

In [57]:
#separating unique & duplicate values , storing them in different dataframe
df_dija_dup=df_dija[df_dija.duplicated('Headline',keep=False)].sort_values('Headline')
df_dija_unique=df_dija[~df_dija.duplicated('Headline',keep=False)].sort_values('Headline')

In [58]:
# count the sentiments (+ve, -ve and neutral) for each DJIA stock
df_dija_unique.groupby(by=['Ticker','Label']).count()

In [59]:
# plot the label distribution of the unique headlines
fig,ax=plt.subplots(1,1,figsize=(20,10))
sns.countplot(x=df_dija_unique['Ticker'],hue=df_dija_unique['Label'])
labels = [item.get_text() for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
plt.show()

 Overall review
 -----------------------------------------------------------------------------------
**Most of the stocks show a predominantly -ve (bearish) trend**<br>
**NKE, CCEP - very volatile, i.e. the price keeps fluctuating**<br>
**AAAU, AIQ, ADME - strongly -ve sentiment**<br>
**Rest of the stocks - no strong hype, but news is circulating and is affecting buyer-seller sentiment at a moderate rate**
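
Reading trends off a count plot is easier when the counts are turned into shares per ticker. A minimal sketch, assuming a toy frame with the notebook's `Ticker` and `Label` columns; the rows and the 50 % threshold are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for df_dija_unique.
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "Label": [0, 0, 0, 1, 1, 2],
})

# Each row sums to 1, so tickers with different headline counts are comparable.
share = pd.crosstab(toy["Ticker"], toy["Label"], normalize="index")

# Hypothetical threshold: call a ticker "predominantly negative" above 50 % label 0.
mostly_negative = share.index[share[0] > 0.5].tolist()

print(share)
print(mostly_negative)
```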

In [60]:
# plot the label distribution of the duplicated headlines
fig,ax=plt.subplots(1,1,figsize=(20,10))
sns.countplot(x=df_dija_dup['Ticker'],hue=df_dija_dup['Label'])
labels = [item.get_text() for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
plt.show()

Now this is interesting.

What are the possible reasons for duplicate news/headlines about the same stock?
1. Different platforms are taking the news from the same media house (similar sentiment)
2. Different media houses are trying to create dissimilar sentiment for the same stock (price manipulation might be an intention, but we can't say for sure)

**Effect**:<br>
Market participants show more bullish sentiment here than in the previous case, where they received almost correct information (people buy more when they are not sure about the market position).<br>

One thing to remember: these headlines were taken at different timestamps, so the same stock may have behaved differently over the timespan.<br>

Overall review
---------------
**Market is volatile and slightly bullish<br>
ANQ, JNJ show a -ve trend (exceptions)<br>
News circulation is very limited, i.e. only a few people have this news of a somewhat sudden +ve trend / increase in market volatility**

In [61]:
# plot the label distribution of all DJIA headlines (unique + duplicated)
fig,ax=plt.subplots(1,1,figsize=(20,10))
sns.countplot(x=df_dija['Ticker'],hue=df_dija['Label'])
labels = [item.get_text() for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
plt.show()

If someone is looking for volatility, scalping or intraday trades, these are the opportunities:<br>
buy & sell both (scalping) - AXP, HD, HON, CCF, PG<br>
short sell (intraday) - AAAU, AIO, ADME<br>


In [62]:
# join all headlines for each (ticker, label) pair into a single string
df=df.groupby(by=['Ticker','Label'])['Headline'].apply(' '.join).reset_index()

In [63]:
from wordcloud import WordCloud, STOPWORDS

In [64]:
# let's check which words are causing havoc or creating opportunity
comment_words = ''
stopwords = set(STOPWORDS)

for val in df['Headline'].str.lower():
    tokens = val.split()
    comment_words += " ".join(tokens)+" "
    
wordcloud = WordCloud(width = 1000, height = 1000,background_color ='black',
                stopwords = stopwords,min_font_size = 10).generate(comment_words)
 
                     
plt.figure(figsize = (18, 18), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

In [65]:
# combine all headlines into up to 3 texts (-ve, +ve, neutral), one per label
def combine_text(data):
    # label -> list of lower-cased headlines, in the order 0 (-ve), 1 (+ve), 2 (neutral)
    headlines={0:[],1:[],2:[]}
    for i in range(len(data)):
        label=data.iloc[i,1]              # column 1: Label
        headline=data.iloc[i,2].lower()   # column 2: Headline
        if label in headlines:
            headlines[label].append(headline)

    # join each non-empty group with a space so words of neighbouring headlines do not fuse;
    # a label without headlines is skipped, which shifts the index of the later ones
    sentence_list=[]
    for label in (0,1,2):
        if len(headlines[label])>0:
            sentence_list.append(' '.join(headlines[label]))

    return sentence_list

In [66]:
whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')

In [67]:
# perform text cleaning
def tokenize_text(sentences):
    # build the stopword set once; rebuilding the list for every word is very slow
    stop = set(stopwords.words('english'))
    for i in range(len(sentences)):
        # remove every character that is not a letter or a space
        sentences[i] = ''.join(filter(whitelist.__contains__, sentences[i]))
        # split the text into sentences
        sentences[i]=nltk.sent_tokenize(sentences[i])
        # split each sentence into word tokens (no stemming or lemmatization happens here)
        sentences[i] = [nltk.word_tokenize(sentence) for sentence in sentences[i]]
        # remove stop words
        for j in range(len(sentences[i])):
            sentences[i][j] = [word for word in sentences[i][j] if word not in stop]
    return sentences
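
One side effect of the whitelist filter is worth seeing on a concrete headline: characters such as `-` and `&` are deleted rather than replaced, so the words around them fuse. The sketch below compares that with one possible alternative, replacing each run of non-letters with a space; the variable names `dropped` and `spaced` are illustrative only.

```python
import re

whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
headline = "Is the stock-market bottom in? What the pros say after S&P 500 tests 4,000."

# Approach used in the notebook: drop every character outside the whitelist.
dropped = ''.join(filter(whitelist.__contains__, headline))

# Alternative: replace each run of non-letters with a single space.
spaced = re.sub(r"[^a-zA-Z]+", " ", headline).strip()

print(dropped.split())  # contains the fused tokens 'stockmarket' and 'SP'
print(spaced.split())   # contains 'stock', 'market', 'S', 'P' as separate tokens
```

Neither variant is obviously better for this data: the first creates tokens such as `stockmarket` that rarely match other headlines, while the second produces single-letter fragments.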

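**Run the cell below before calling the tokenizer functions. It downloads the NLTK data they need; if the stopwords still cannot be found, run it a second time.**

In [68]:
import nltk
# download only what the notebook uses instead of opening the interactive downloader
nltk.download('punkt')      # tokenizer models for sent_tokenize / word_tokenize
nltk.download('stopwords')  # English stopword list
# note: newer NLTK releases may also require nltk.download('punkt_tab')
from nltk.corpus import stopwords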

In [69]:
from gensim.models import Word2Vec

In [70]:
#df['Headline'].replace("[^a-zA-Z]"," ",regex=True, inplace=True)
sentences=combine_text(df)
sentences=tokenize_text(sentences)
models=[] # store Word2Vec model
words=[] # store vocabulary corresponding to each model
# each iteration trains one model per sentiment label (-ve, +ve, neutral, in that order)
for i in range(len(sentences)):
    print(f"Running model : {i}")
    model=Word2Vec(sentences[i], min_count=1)
    models.append(model)
    # store vocab for this model
    words.append(list(model.wv.index_to_key))

In [71]:
models

In [72]:
#testing 
models[1].wv.doesnt_match("france england germany berlin".split())

Output interpretation, depending on the word returned:<br>
england - not on the European mainland, unlike the other 3<br>
berlin - a city (the capital of Germany), not a country<br>
france/germany - both are countries in mainland Europe
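
To make the result easier to interpret: as far as I understand it, `doesnt_match` returns the word whose vector is least similar to the mean of the unit-normalised vectors of all the words passed in. The sketch below reproduces that idea in plain Python. The 2-D vectors are invented and `odd_one_out` is a hypothetical helper, not a gensim function.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(vectors):
    """Return the key whose unit vector has the lowest dot product with the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[d] for u in units.values()) / len(units) for d in range(dim)]
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


# Invented toy vectors: three point in roughly the same direction, one does not.
toy_vectors = {
    "france": [1.0, 0.1],
    "england": [0.9, 0.2],
    "germany": [1.0, 0.0],
    "berlin": [0.2, 1.0],
}
print(odd_one_out(toy_vectors))
```

With real models the answer depends entirely on how the words co-occur in the training headlines, so a geographic interpretation may not apply to vectors learned from financial news.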

In [73]:
# testing - Goldman Sachs is part of the original data
similar = models[0].wv.most_similar('goldman')
print(similar)

In [74]:

# preprocessing for custom text
def custom_tokenizer(text):
    stop = set(stopwords.words('english'))  # build the stopword set once
    text = ''.join(filter(whitelist.__contains__, text))
    text = text.lower()
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    for i in range(len(sentences)):
        sentences[i] = [word for word in sentences[i] if word not in stop]
    return sentences

In [75]:
# show the similar / dissimilar words found for a news headline whose sentiment is unknown
def check_sentiment(text):
    sentences=custom_tokenizer(text)
    print(f"Tokenized text : {sentences}")
    for w in sentences:  # w is the token list of one sentence
        for i in range(len(models)):
            try:
                # most_similar fails if a token is missing from this model's vocabulary
                similar=models[i].wv.most_similar(w)
                print(f"Similarity Score for {i}-th trend :")
                for element in similar:
                    print(element)
                dissimilar= models[i].wv.doesnt_match(text.split())
                print(f"Dissimilarity Score for {i}-th trend : {dissimilar}")
            except Exception:
                print(f"similarity/Dissimilarity with {w} not found against model {i}")

In [76]:
text="Is the stock-market bottom in? What the pros say after S&P 500 tests 4,000."
check_sentiment(text)

Strongly -ve sentiment with a slight +ve shift (Dissimilarity Score for 0-th trend : say)

In [77]:
text="Wall Street closes higher boosted by strong Tesla earnings"
check_sentiment(text)

Based on the similarity scores we see both -ve sentiment (model 0) and +ve sentiment (model 1)<br>
There is a slight +ve sentiment among the -ve trends (Dissimilarity Score for 0-th trend : boosted)<br>
There is a slight -ve sentiment among the +ve trends (Dissimilarity Score for 1-th trend : closes)

In [81]:
text="Weekend reads: Prepare for a big week — a crucial Federal Reserve decision, tech-company earnings and a gdp report"
check_sentiment(text)

No similarity or dissimilarity found.
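
A likely explanation (an assumption, not verified against the actual models) is that `check_sentiment` passes the whole token list of the sentence to `most_similar`, so a single token missing from a model's vocabulary is enough to make the lookup fail. The sketch below shows how to list the missing tokens. The plain sets are invented stand-ins for each model's vocabulary (with gensim 4 the membership test would be `token in models[i].wv.key_to_index`), and `missing_tokens` is a hypothetical helper.

```python
def missing_tokens(tokens, vocabulary):
    """Return the tokens that are absent from a model's vocabulary, in order."""
    return [token for token in tokens if token not in vocabulary]


# Invented vocabularies standing in for the per-label models.
toy_vocabularies = [
    {"wall", "street", "closes", "higher", "earnings"},
    {"federal", "reserve", "earnings", "report"},
]
tokens = ["federal", "reserve", "decision", "earnings", "gdp", "report"]

for i, vocab in enumerate(toy_vocabularies):
    print(f"model {i} is missing: {missing_tokens(tokens, vocab)}")
```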

In [79]:
text="Apple to Pay $50 Million Settlement Over Butterfly Keyboard Lawsuit "
check_sentiment(text)

In [80]:
text="IBM, Apple, JNJ, and More of Tuesday's Stock Market Movers"
check_sentiment(text)

The accuracy is not great. BERT may produce better results; I will try to implement that later.