
## FINANCIAL DATA
MODULE 6 | LESSON 1


---



# **SENTIMENT ANALYSIS: JSON AND TWEETS**


|  |  |
|:---|:---|
|**Reading Time** |  25 minutes |
|**Prior Knowledge** | Basic Python, Sentiment Analysis (API)  |
|**Keywords** | Affect, JSON (JavaScript Object Notation), Application Programming Interface (API), NLTK (Natural Language Toolkit) |

---



*In this lesson, we register with Twitter as application developers to obtain the keys and access tokens needed to use the Twitter application programming interface (API). We can then search tweets from the past seven days that mention any query term we like, for example a company's name or ticker. Once we have this dataset of tweets, we can perform our sentiment analysis: How frequently are various basic emotions evidenced in the text of the tweets that mention the company we are interested in? Tweets and their associated data (country of origin, creation time, language) are stored as JSON objects. JSON is a very common data format, so we will invest some time in understanding its structure and syntax.*

## 1. The Twitter API
API stands for application programming interface. An API is how one computer or program "interfaces," or interacts, with another computer or program. By extension, it is also how we as users interact with those computers or programs. The Twitter API is how we, working from the Python notebooks on our computers, interface with Twitter's database of tweets.

* Go to Twitter.com and create a Twitter account if you don't have one yet.
* Sign in to the developer site [https://developer.twitter.com/en](https://developer.twitter.com/en) using your Twitter account.
* Register as an application (app) developer.
* Go to [https://apps.twitter.com/app/new](https://apps.twitter.com/app/new).
* Give your new app a name and a description (you can be creative).
* Enter anything for the website.
* Leave the callback URL blank.
* Read and accept the terms and conditions to create an account.
* Click on the "Keys and Access Tokens" tab.
* Copy the two consumer keys to a text file.
* Scroll down and click "Generate Access key and Secret."
* Copy the two access keys to the text file.

NOTE: With this free Twitter API v2, you will get an error unless your search dates are within the last seven days. Also, you can only download 500,000 tweets per month. For this reason, be conservative and restrict your start and end times, especially if you are investigating tweets for very well-known companies, like AAPL (Apple) and GOOG (Alphabet / Google). Just a 30-minute block of time (the difference between start time and end time is 30 minutes) is actually sufficient to do some exploratory analysis on such large companies. Much smaller companies may require several hours or even days. 
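As a minimal sketch of the advice above, the helper below builds a short UTC search window and checks that it falls within the last seven days. The function name `recent_window` and its parameters are hypothetical, introduced only for illustration; they are not part of Twarc or the Twitter API.

```python
from datetime import datetime, timedelta, timezone


def recent_window(minutes=30, hours_back=24, now=None):
    """Return (start_time, end_time) in UTC for a short block that ends
    `hours_back` hours before `now` (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    end_time = now - timedelta(hours=hours_back)
    start_time = end_time - timedelta(minutes=minutes)

    # The note above says searches outside the last seven days raise an error,
    # so we fail early with a clearer message.
    if now - start_time > timedelta(days=7):
        raise ValueError("start_time is more than seven days in the past")
    return start_time, end_time


# A 30-minute block that ended 24 hours ago
start_time, end_time = recent_window(minutes=30, hours_back=24)
print(start_time.isoformat(), end_time.isoformat())
```

For a smaller company, you might widen the block, for example `recent_window(minutes=6 * 60)`, while keeping an eye on the monthly tweet limit.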

Note too that tweets retrieved with the `client.sample()` method do not count against the 500K tweet limit. These tweets may not mention the company you are investigating, but they could still be interesting. Once downloaded, you can also filter these tweets for whatever interests you.

Twarc is a library that makes it much easier for us to search and access Twitter through its API. For more background on Twarc, see [this site](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/) and [this site](https://github.com/DocNow/twarc/blob/4bff06bb8f2e279ea0a2064d6bfaec3c9b4b3aff/docs/api/library.md).

Note from the documentation above that Twarc is a command line tool, but we have adapted it with [this code](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/f3b5b40d197ecaac133299eea46fefcced569d21/modules/6b-labs-code-standard-python.md) to run from a notebook as part of a function we define. 

You will notice the reference to JSON objects. Luckily, the `DataFrameConverter` from the `twarc_csv` package, which accompanies Twarc, converts these into a `DataFrame` for us. Still, in the next section, you will need to become more familiar with JSON objects given how prevalent they are as a data format.
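To make the JSON-to-`DataFrame` idea concrete, here is a small sketch that uses `pandas.json_normalize` on two invented tweet-like records. The field names loosely follow the shape of a tweet object, but the values are made up, and this is not a description of what the converter does internally.

```python
import json

import pandas as pd

# Invented, tweet-like JSON records (assumption: illustrative data only)
raw = """
[
  {"id": "1", "text": "AAPL looks strong today", "lang": "en",
   "public_metrics": {"like_count": 3, "retweet_count": 1}},
  {"id": "2", "text": "Worried about IBM earnings", "lang": "en",
   "public_metrics": {"like_count": 0, "retweet_count": 2}}
]
"""

tweets = json.loads(raw)  # JSON text -> list of Python dicts

# Flatten nested objects into dotted column names,
# e.g. public_metrics.like_count
df = pd.json_normalize(tweets)
print(df[["id", "text", "public_metrics.like_count"]])
```

Each JSON object becomes one row, and each nested key becomes its own column, which is what makes the data convenient to filter and aggregate.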

In [None]:
from IPython.display import VimeoVideo

VimeoVideo("714293481", h="e42c792a9e", width=600)

##### [Access video transcript here](https://drive.google.com/file/d/1i7aDRSZgaHg9w0i_wjOutoi87gjbovwA/view?usp=sharing)

## 2. Quantifying Affect 



The NRCLex library measures the emotion or "affect" in a piece of text, so we will use it for the actual sentiment analysis in this program.
[Find more background on the NRCLex library you will be using here.](https://pypi.org/project/NRCLex/)
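To give a feel for what "measuring affect" means, the sketch below counts emotion words using a tiny, hypothetical lexicon and converts the counts to frequencies. The lexicon `TOY_LEXICON` and the function `affect_frequencies` are invented for this illustration; NRCLex relies on a far larger emotion lexicon and its own implementation.

```python
from collections import Counter

# Hypothetical mini-lexicon: each word maps to the emotions it signals
TOY_LEXICON = {
    "gain": ["joy", "positive"],
    "love": ["joy", "positive"],
    "loss": ["sadness", "negative"],
    "fear": ["fear", "negative"],
    "crash": ["fear", "negative"],
}


def affect_frequencies(text):
    """Return each emotion's share of all emotion hits found in `text`."""
    counts = Counter()
    for word in text.lower().split():
        # Strip simple trailing punctuation before the lexicon lookup
        for emotion in TOY_LEXICON.get(word.strip(".,!?"), []):
            counts[emotion] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: n / total for emotion, n in counts.items()}


freqs = affect_frequencies("Investors love the gain, but fear a crash.")
print(freqs)
```

A dictionary of emotion frequencies like this one is the kind of result that the rest of the notebook turns into a bar chart.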

And [here is information about another similar approach to sentiment analysis.](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk)
After getting familiar with the code and information in this notebook, please use the code provided in the article in your own notebook to try it for yourself, and discuss any issues or insights you have with your fellow students on the forum. 


## 3. JSON

JSON is an important format for structuring data. <br>
The following pages are required reading for an introduction to JSON: <br>
https://www.geeksforgeeks.org/javascript-json/ <br>
https://www.geeksforgeeks.org/json-data-types/ <br>
<br>

For information about using JSON in Python, see this required reading:<br>
https://www.geeksforgeeks.org/python-json/?ref=lbp <br>
Please also read each of the following five pages, which continue from the one above (also required reading).<br>
https://www.geeksforgeeks.org/working-with-json-data-in-python/?ref=lbp <br>
https://www.geeksforgeeks.org/read-write-and-parse-json-using-python/?ref=lbp <br>
https://www.geeksforgeeks.org/append-to-json-file-using-python/?ref=lbp <br>
https://www.geeksforgeeks.org/serializing-json-data-in-python/?ref=lbp <br>
https://www.geeksforgeeks.org/deserialize-json-to-object-in-python/?ref=lbp <br>
<br>

See also these pages (required reading) for clarification of the JSON-related methods. <br>
https://www.geeksforgeeks.org/json-load-in-python/ <br>
https://www.geeksforgeeks.org/json-loads-in-python/ <br>
https://www.geeksforgeeks.org/json-dumps-in-python/ <br>
https://www.geeksforgeeks.org/json-dump-in-python/ <br>
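The four methods covered in the readings can be seen side by side in this short sketch. The tweet-like record is invented for illustration, and an in-memory `io.StringIO` buffer stands in for a real file so that nothing is written to disk.

```python
import io
import json

# Invented tweet-like record as JSON text (note: JSON uses lowercase false)
tweet_json = (
    '{"id": "123", "text": "IBM beats estimates", "lang": "en", '
    '"entities": {"hashtags": ["IBM", "earnings"]}, '
    '"possibly_sensitive": false}'
)

# json.loads: JSON string -> Python dict (false becomes False)
tweet = json.loads(tweet_json)
print(type(tweet).__name__, tweet["entities"]["hashtags"][0])

# json.dumps: Python dict -> JSON string, here pretty-printed
pretty = json.dumps(tweet, indent=2, sort_keys=True)
print(pretty)

# json.dump / json.load: the same conversions, but to and from a file-like object
buffer = io.StringIO()
json.dump(tweet, buffer)
buffer.seek(0)  # rewind so we can read what was just written
restored = json.load(buffer)
```

The versions ending in "s" work with strings, while `json.dump` and `json.load` work with file objects.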




In [None]:
from datetime import datetime, timezone

import nltk
import pandas as pd
import plotly.express as px
from nrclex import NRCLex
from twarc import Twarc2
from twarc_csv import DataFrameConverter

nltk.download("punkt")

In [None]:
Bearer_Token = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # Replace the Xs with your new bearer token from the Twitter development site.

In [None]:
def TweetSearch(query, start_time, end_time):
    client = Twarc2(bearer_token=Bearer_Token)
    converter = DataFrameConverter()

    # The search_recent method calls the recent search endpoint to get Tweets based on the query, start, and end times
    search_results = client.search_recent(
        query=query, start_time=start_time, end_time=end_time, max_results=50
    )

    # Twarc returns the matching Tweets for the criteria set above in pages;
    # the converter below processes all of them into a single DataFrame

    Tweets = converter.process(search_results)
    Tweets_str = Tweets.to_string()

    text_object = NRCLex(Tweets_str)
    data = text_object.affect_frequencies

    emotion_df = pd.DataFrame.from_dict(data, orient="index")
    emotion_df = emotion_df.reset_index()
    emotion_df = emotion_df.rename(
        columns={"index": "Emotion Classification", 0: "Emotion Count"}
    )
    emotion_df = emotion_df.sort_values(by=["Emotion Count"], ascending=False)

    emotion_df.drop(
        emotion_df[emotion_df["Emotion Classification"] == "anticip"].index,
        inplace=True,
    )  # This drops the truncated "anticip" row, a small cosmetic bug in the NRCLex output

    fig = px.bar(
        emotion_df,
        x="Emotion Count",
        y="Emotion Classification",
        color="Emotion Classification",
        orientation="h",
        width=800,
        height=400,
    )
    fig.show()

    # Return the emotion table so that callers can assign and display it
    return emotion_df

In [None]:
# Specify your query, like the stock ticker or name of the company you are investigating
query = "AAPL lang:en"

# Specify the start and end times in UTC for the time period you want Tweets from
# These dates and times must be within the past week!
start_time = datetime(2022, 4, 20, 0, 0, 0, 0, timezone.utc)
end_time = datetime(2022, 4, 20, 1, 0, 0, 0, timezone.utc)

scrape_new_data = False

if scrape_new_data:
    df_aapl = TweetSearch(query, start_time, end_time)
else:
    df_aapl = pd.read_csv("AAPL_lang_en_emotion.csv")

df_aapl

Now let's choose a different company.

In [None]:
# Specify your query, like the stock ticker or name of the company you are investigating
query = "IBM lang:en"

# By default, we reuse the same start and end times.
# This is intentional because we want a like-for-like comparison

if scrape_new_data:
    df_ibm = TweetSearch(query, start_time, end_time)
else:
    df_ibm = pd.read_csv("IBM_lang_en_emotion.csv")

df_ibm

A very simplistic approach to sentiment analysis might be to buy the stock with the more "positive" emotion counts.

Alternatively, subtract the negative value from the positive value to get a net sentiment score for each stock, then buy the stock with the higher score and sell the other one (assuming you are evaluating pairs of stocks).
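The net-score idea can be sketched as follows. The two small tables use the same column names as the emotion tables in this notebook, but the numbers are invented, and the helper `net_sentiment` is a hypothetical name introduced for this example.

```python
import pandas as pd


def net_sentiment(emotion_df):
    """Positive minus negative value from an emotion table (hypothetical helper)."""
    scores = emotion_df.set_index("Emotion Classification")["Emotion Count"]
    # Series.get returns the default when a label is missing from the table
    return scores.get("positive", 0.0) - scores.get("negative", 0.0)


# Invented values for two stocks, A and B (assumption: illustrative data only)
df_a = pd.DataFrame(
    {
        "Emotion Classification": ["positive", "negative", "trust"],
        "Emotion Count": [0.30, 0.10, 0.20],
    }
)
df_b = pd.DataFrame(
    {
        "Emotion Classification": ["positive", "negative", "trust"],
        "Emotion Count": [0.15, 0.25, 0.10],
    }
)

net_a = net_sentiment(df_a)
net_b = net_sentiment(df_b)

# Pairs idea: go long the higher net score and short the lower one
long_leg, short_leg = ("A", "B") if net_a > net_b else ("B", "A")
print(f"net A = {net_a:.2f}, net B = {net_b:.2f}, long {long_leg}, short {short_leg}")
```

This is only a toy decision rule; it ignores tweet volume, noise, and transaction costs.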
 
Or, you might do some more analysis, looking at historical levels of these values, and buy when there is a significant increase in a particular value.

Which value? And what constitutes significant? You will need to do a lot of experimentation, including testing and backtesting, to find your answer.
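One possible way to make "significant" concrete is a z-score: how many standard deviations the latest reading sits above its own history. The sketch below assumes a short, invented history of daily "positive" values and an arbitrary threshold of 2.0; both are placeholders to be replaced by your own data and backtesting.

```python
from statistics import mean, stdev


def z_score(history, latest):
    """Standard deviations between `latest` and the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return 0.0  # a flat history carries no information about scale
    return (latest - mu) / sigma


# Invented daily "positive" frequencies for one stock (assumption)
history = [0.20, 0.22, 0.19, 0.21, 0.20, 0.18, 0.21]
latest = 0.30

z = z_score(history, latest)
signal = z > 2.0  # hypothetical threshold, not a recommendation
print(f"z-score = {z:.2f}, buy signal = {signal}")
```

Whether 2.0 is a sensible cutoff, and whether "positive" is the right value to track, are exactly the questions that testing and backtesting need to answer.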


## 4. Conclusion

In this lesson, we learned how to download actual tweets and related data from Twitter.com. In order to do this, we needed developer credentials from Twitter to use in the Twitter API. Because tweets (like much of the data we will encounter as data scientists and financial engineers) are stored as JSON objects, we needed to thoroughly understand the JSON syntax and related methods.

Then, we focused our tweet search on companies that we were interested in by supplying the company name or ticker as a search term. Given the potential volume of tweets, we did not consider reading them; instead, we used a sentiment analysis library to show us the frequency of basic emotions in order to gauge how Twitter users (who potentially include investors, customers, etc.) feel about the company. We assumed that these feelings might help indicate how, say, the stock will perform in the near future. We also read about a more thorough and technical approach, which includes tokenizing, normalizing, and denoising the text data before using a Naive Bayes classifier to determine which words (and/or emoticons) are most informative about whether a tweet is emotionally positive or negative.

<b> References </b> 

* "Append to JSON file using Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/append-to-json-file-using-python/?ref=lbp.

* Bailey, Mark M. "NRCLex." *Python Package Index (PyPI)*. 2019. https://pypi.org/project/NRCLex/. 

* Daityari, Shaumik. "How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)." *Digital Ocean*. 26 Sep 2019. https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk.

* "Deserialize JSON to Object in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/deserialize-json-to-object-in-python/?ref=lbp.

* "JavaScript JSON." *Geeks for Geeks*. https://www.geeksforgeeks.org/javascript-json/.

* "JSON|Data Types." *Geeks for Geeks*. https://www.geeksforgeeks.org/json-data-types/.

* "json.load() in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/json-load-in-python/. 

* "json.loads() in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/json-loads-in-python/.

* "json.dumps() in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/json-dumps-in-python/. 

* "json.dump() in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/json-dump-in-python/.

* "Python JSON." *Geeks for Geeks*. https://www.geeksforgeeks.org/python-json/?ref=lbp.

* "Read, Write and Parse JSON using Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/read-write-and-parse-json-using-python/?ref=lbp.

* "Serializing JSON data in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/serializing-json-data-in-python/?ref=lbp.

* Summers, Ed, and Nick Ruest. "Twarc." *GitHub*. https://github.com/DocNow/twarc/blob/4bff06bb8f2e279ea0a2064d6bfaec3c9b4b3aff/docs/api/library.md.

* Summers, Ed, and Nick Ruest. "twarc." *twarc*. https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/.

* Twitter Developer Relations. "getting-started-with-the-twitter-api-v2-for-academic-research." *GitHub*. https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/f3b5b40d197ecaac133299eea46fefcced569d21/modules/6b-labs-code-standard-python.md.

* "Working With JSON Data in Python." *Geeks for Geeks*. https://www.geeksforgeeks.org/working-with-json-data-in-python/?ref=lbp.



---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
