# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use [News API](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum, then create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?
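
Each question above reduces to comparing one summary statistic (a mean or a maximum) of a score column across the two coins. The sketch below illustrates that comparison using only the standard library. The score values are made up for illustration and `highest` is a hypothetical helper; in the notebook itself the same numbers are read off `DataFrame.describe()`.

```python
from statistics import mean

# Made-up scores for illustration only; the real values come from the
# sentiment DataFrames built later in the notebook.
scores = {
    "bitcoin": {"positive": [0.10, 0.00, 0.30], "negative": [0.05, 0.20, 0.00]},
    "ethereum": {"positive": [0.15, 0.20, 0.10], "negative": [0.00, 0.10, 0.05]},
}


def highest(scores, column, stat):
    """Return the coin whose `column` has the largest value of `stat`."""
    return max(scores, key=lambda coin: stat(scores[coin][column]))


print(highest(scores, "positive", mean))  # coin with the highest mean positive score
print(highest(scores, "negative", max))   # coin with the highest single negative score
print(highest(scores, "positive", max))   # coin with the highest single positive score
```

Passing the statistic as a function keeps "highest mean" and "highest maximum" as the same comparison with a different `stat` argument.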

In [46]:
# Initial imports
import os
import pandas as pd
import nltk
from dotenv import load_dotenv
from newsapi import NewsApiClient
from nltk.sentiment.vader import SentimentIntensityAnalyzer
%matplotlib inline

# Load environment variables and download the VADER lexicon
load_dotenv()
nltk.download('vader_lexicon')

# Create the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\scott\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [47]:
# Read your api key environment variable
load_dotenv()
news_api_key = os.getenv("news_api")

In [48]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=news_api_key)

In [49]:
# Fetch the Bitcoin news articles
bitcoin_headlines = newsapi.get_everything(q='bitcoin', language='en', sort_by='relevancy')
print(f"Total articles about Bitcoin: {bitcoin_headlines['totalResults']}")
bitcoin_headlines["articles"][0]

Total articles about Bitcoin: 7047


{'source': {'id': 'wired', 'name': 'Wired'},
 'author': 'Gian M. Volpicelli',
 'title': 'The Rise and Fall of a Bitcoin Mining Sensation',
 'description': 'Compass Mining grew quickly during crypto’s halcyon days. Now, its customers and their thousands of mining machines are stuck.',
 'url': 'https://www.wired.com/story/compass-mining-bitcoin-russia/',
 'urlToImage': 'https://media.wired.com/photos/62e9c5e1d7368105da057de3/191:100/w_1280,c_limit/BitRiver-Mining-Center-Rise-And-Fall-Of-Bitcoin-Mining-Business-1184520941.jpg',
 'publishedAt': '2022-08-03T11:00:00Z',
 'content': "It was 8:45 in the morning of June 13 when Bill Stewart, the CEO of Maine-based bitcoin mining business Dynamics Mining, received a call from one of his employees. He's like, Every machine inside of … [+3472 chars]"}

In [50]:
# Fetch the Ethereum news articles
ethereum_headlines = newsapi.get_everything(q='ethereum', language='en', sort_by='relevancy')
print(f"Total articles about Ethereum: {ethereum_headlines['totalResults']}")
ethereum_headlines["articles"][0]

Total articles about Ethereum: 4869


{'source': {'id': 'wired', 'name': 'Wired'},
 'author': 'Gian M. Volpicelli',
 'title': "Ethereum's 'Merge' Is a Big Deal for Crypto—and the Planet",
 'description': 'One of the most influential cryptocurrency projects is set to finally ditch proof-of-work mining.',
 'url': 'https://www.wired.com/story/ethereum-merge-big-deal-crypto-environment/',
 'urlToImage': 'https://media.wired.com/photos/62fe63bcfd602ff2f11e6fbf/191:100/w_1280,c_limit/Ethereum-Ditches-Crypto-Business-1036181110.jpg',
 'publishedAt': '2022-08-18T16:09:33Z',
 'content': 'Cryptocurrencies are often criticized for being bad for the planet. Every year, bitcoin mining consumes more energy than Belgium, according to the University of Cambridges Bitcoin Electricity Consump… [+3829 chars]'}

In [51]:
# Create the Bitcoin sentiment scores DataFrame
bitcoin_sentiments = []

for article in bitcoin_headlines["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        bitcoin_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass
    
bitcoin_df = pd.DataFrame(bitcoin_sentiments)
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
bitcoin_df = bitcoin_df[cols]

bitcoin_df

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2022-08-03,It was 8:45 in the morning of June 13 when Bil...,0.5574,0.119,0.000,0.881
1,2022-08-02,"Tools to trace cryptocurrencies have, over jus...",0.0000,0.000,0.000,1.000
2,2022-08-18,Cryptocurrencies are often criticized for bein...,-0.5584,0.068,0.170,0.763
3,2022-08-02,Posted \r\nFrom Bitcoin highs to blockchain br...,-0.2960,0.000,0.086,0.914
4,2022-08-27,Aug 27 (Reuters) - Bitcoin was off 1.63% at $1...,0.0000,0.000,0.000,1.000
...,...,...,...,...,...,...
95,2022-08-11,August 2022\r\nA criteria for judging when a\r...,0.2023,0.057,0.000,0.943
96,2022-08-08,7 p.m. Soccer night. I drive to the field in S...,0.0000,0.000,0.000,1.000
97,2022-08-19,"Good Friday morning! Jordan Parker Erb here, r...",0.5010,0.103,0.000,0.897
98,2022-08-23,Facebook stunned the world last October when i...,-0.1027,0.000,0.038,0.962


In [52]:
# Create the Ethereum sentiment scores DataFrame
ethereum_sentiments = []

for article in ethereum_headlines["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        ethereum_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass
    
ethereum_df = pd.DataFrame(ethereum_sentiments)
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
ethereum_df = ethereum_df[cols]

ethereum_df

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2022-08-18,Cryptocurrencies are often criticized for bein...,-0.5584,0.068,0.170,0.763
1,2022-08-04,The non-fungible token\r\n (NFT) market has fa...,-0.0217,0.048,0.051,0.901
2,2022-08-02,"It's a day of the week ending in the letter ""y...",-0.2732,0.059,0.115,0.827
3,2022-08-11,Developers have picked a number of so-called t...,-0.6124,0.036,0.145,0.820
4,2022-08-08,"BANGKOK, Aug 8 (Reuters) - Crypto exchange Zip...",0.0000,0.000,0.000,1.000
...,...,...,...,...,...,...
95,2022-08-09,If you bought Ethereum(ETH -3.05%) back in Jul...,0.6486,0.187,0.058,0.754
96,2022-08-09,"In the crypto market, it can be much more diff...",-0.4927,0.000,0.104,0.896
97,2022-08-09,The New Consumer— by Dan Frommer\r\nA publicat...,0.7316,0.186,0.000,0.814
98,2022-08-02,Valuing cryptocurrencies is difficult because ...,0.3818,0.187,0.121,0.692
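
The Bitcoin and Ethereum loops above differ only in their input, so they could share one helper. The following is a minimal sketch under stated assumptions: `build_sentiment_rows` and `fake_score` are hypothetical names, and the scoring function is passed in so that the sketch runs without NLTK. It also skips articles with missing or empty content explicitly, where the notebook code relies on catching `AttributeError`.

```python
def build_sentiment_rows(articles, score_fn):
    """Build one row per article that has text, using score_fn for the scores.

    score_fn is assumed to return a dict with the keys
    "compound", "pos", "neu" and "neg", as VADER's polarity_scores does.
    """
    rows = []
    for article in articles:
        text = article.get("content")
        if not text:
            # Skip articles without content instead of relying on an exception
            continue
        sentiment = score_fn(text)
        rows.append({
            "date": article["publishedAt"][:10],
            "text": text,
            "compound": sentiment["compound"],
            "positive": sentiment["pos"],
            "negative": sentiment["neg"],
            "neutral": sentiment["neu"],
        })
    return rows


# Stand-in scorer so the sketch runs without NLTK; in the notebook this
# would be analyzer.polarity_scores.
def fake_score(text):
    return {"compound": 0.0, "pos": 0.0, "neu": 1.0, "neg": 0.0}


articles = [
    {"publishedAt": "2022-08-03T11:00:00Z", "content": "Some article text"},
    {"publishedAt": "2022-08-04T09:30:00Z", "content": None},
]
rows = build_sentiment_rows(articles, fake_score)
print(rows)
```

In the notebook, `pd.DataFrame(build_sentiment_rows(bitcoin_headlines["articles"], analyzer.polarity_scores))` would then stand in for the Bitcoin loop, and the same call with `ethereum_headlines` for the Ethereum loop.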


In [53]:
# Describe the Bitcoin Sentiment
bitcoin_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.028193,0.06184,0.05519,0.88296
std,0.452947,0.068611,0.071427,0.089061
min,-0.9081,0.0,0.0,0.628
25%,-0.296,0.0,0.0,0.82225
50%,0.0,0.057,0.0,0.8825
75%,0.3612,0.10225,0.1155,0.951
max,0.9246,0.372,0.307,1.0


In [54]:
# Describe the Ethereum Sentiment
ethereum_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.174679,0.07746,0.03879,0.88379
std,0.442409,0.071417,0.060184,0.086943
min,-0.8519,0.0,0.0,0.679
25%,0.0,0.0,0.0,0.81775
50%,0.2143,0.064,0.0,0.889
75%,0.528675,0.13125,0.05625,0.94625
max,0.8402,0.249,0.243,1.0


### Questions:

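Q: Which coin had the highest mean positive score?

A: Ethereum (mean positive 0.0775, versus 0.0618 for Bitcoin).

Q: Which coin had the highest compound score?

A: Bitcoin (maximum compound 0.9246, versus 0.8402 for Ethereum).

Q: Which coin had the highest positive score?

A: Bitcoin (maximum positive 0.372, versus 0.249 for Ethereum).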

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stopwords.
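
The three steps above can be sketched with the standard library alone. This is an illustration, not the notebook's tokenizer: `simple_tokenize` is a hypothetical name, and the tiny `STOPWORDS` set stands in for NLTK's English stopword list.

```python
import re

# Tiny stand-in stopword list; the notebook uses NLTK's English stopwords.
STOPWORDS = {"the", "a", "of", "is", "for", "to", "in"}


def simple_tokenize(text, stopwords=STOPWORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a letter or whitespace with a space
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [word for word in letters_only.split() if word not in stopwords]


print(simple_tokenize("The price of Bitcoin is rising, again!"))
# ['price', 'bitcoin', 'rising', 'again']
```

Replacing punctuation with a space, rather than deleting it, is a design choice: a hyphenated word such as "non-fungible" becomes two tokens instead of being fused into one.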

In [55]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [56]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
nltk.download('stopwords')
sw = set(stopwords.words('english'))
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Expand the default stopwords list if necessary
sw_addon = {'the'}

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\scott\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\scott\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\scott\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\scott\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [60]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text: strips non-letters, lowercases, and removes stopwords."""
    # Remove everything except letters and spaces (punctuation, digits, line breaks)
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)

    # Lemmatize words into root words
    # (this list is not used below, so the returned tokens are not lemmatized)
    lemmatize = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase and remove the stop words
    sw_addon = {'the'}
    stop_words = sw.union(sw_addon)
    tokens = [word.lower() for word in words if word.lower() not in stop_words]

    return tokens

In [61]:
# Create a new tokens column for Bitcoin
bitcoin_df['tokens'] = bitcoin_df['text'].apply(tokenizer)
bitcoin_df

Unnamed: 0,date,text,compound,positive,negative,neutral,tokens
0,2022-08-03,It was 8:45 in the morning of June 13 when Bil...,0.5574,0.119,0.000,0.881,"[morning, june, bill, stewart, ceo, mainebased..."
1,2022-08-02,"Tools to trace cryptocurrencies have, over jus...",0.0000,0.000,0.000,1.000,"[tools, trace, cryptocurrencies, last, several..."
2,2022-08-18,Cryptocurrencies are often criticized for bein...,-0.5584,0.068,0.170,0.763,"[cryptocurrencies, often, criticized, bad, pla..."
3,2022-08-02,Posted \r\nFrom Bitcoin highs to blockchain br...,-0.2960,0.000,0.086,0.914,"[posted, bitcoin, highs, blockchain, bridge, l..."
4,2022-08-27,Aug 27 (Reuters) - Bitcoin was off 1.63% at $1...,0.0000,0.000,0.000,1.000,"[aug, reuters, bitcoin, late, afternoon, europ..."
...,...,...,...,...,...,...,...
95,2022-08-11,August 2022\r\nA criteria for judging when a\r...,0.2023,0.057,0.000,0.943,"[august, criteria, judging, blockchain, applic..."
96,2022-08-08,7 p.m. Soccer night. I drive to the field in S...,0.0000,0.000,0.000,1.000,"[pm, soccer, night, drive, field, santa, monic..."
97,2022-08-19,"Good Friday morning! Jordan Parker Erb here, r...",0.5010,0.103,0.000,0.897,"[good, friday, morning, jordan, parker, erb, r..."
98,2022-08-23,Facebook stunned the world last October when i...,-0.1027,0.000,0.038,0.962,"[facebook, stunned, world, last, october, rebr..."


In [62]:
# Create a new tokens column for Ethereum
ethereum_df['tokens'] = ethereum_df['text'].apply(tokenizer)
ethereum_df

Unnamed: 0,date,text,compound,positive,negative,neutral,tokens
0,2022-08-18,Cryptocurrencies are often criticized for bein...,-0.5584,0.068,0.170,0.763,"[cryptocurrencies, often, criticized, bad, pla..."
1,2022-08-04,The non-fungible token\r\n (NFT) market has fa...,-0.0217,0.048,0.051,0.901,"[nonfungible, token, nft, market, fallen, clif..."
2,2022-08-02,"It's a day of the week ending in the letter ""y...",-0.2732,0.059,0.115,0.827,"[day, week, ending, letter, inevitably, means,..."
3,2022-08-11,Developers have picked a number of so-called t...,-0.6124,0.036,0.145,0.820,"[developers, picked, number, socalled, total, ..."
4,2022-08-08,"BANGKOK, Aug 8 (Reuters) - Crypto exchange Zip...",0.0000,0.000,0.000,1.000,"[bangkok, aug, reuters, crypto, exchange, zipm..."
...,...,...,...,...,...,...,...
95,2022-08-09,If you bought Ethereum(ETH -3.05%) back in Jul...,0.6486,0.187,0.058,0.754,"[bought, ethereumeth, back, july, first, launc..."
96,2022-08-09,"In the crypto market, it can be much more diff...",-0.4927,0.000,0.104,0.896,"[crypto, market, much, difficult, longterm, in..."
97,2022-08-09,The New Consumer— by Dan Frommer\r\nA publicat...,0.7316,0.186,0.000,0.814,"[new, consumer, dan, frommera, publication, pe..."
98,2022-08-02,Valuing cryptocurrencies is difficult because ...,0.3818,0.187,0.121,0.692,"[valuing, cryptocurrencies, difficult, fundame..."


---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequencies for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
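
For N = 2, an n-gram is simply each token paired with its successor. The sketch below builds bigrams with `zip` and counts both bigrams and single words with `Counter`. The token list is made up for illustration and `bigrams` is a hypothetical helper; `nltk.ngrams(tokens, 2)` should yield the same pairs.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


# Made-up tokens for illustration only
tokens = ["bitcoin", "mining", "uses", "energy", "bitcoin", "mining"]

bigram_counts = Counter(bigrams(tokens))
word_counts = Counter(tokens)

print(bigram_counts.most_common(1))  # most frequent bigram
print(word_counts.most_common(2))    # two most frequent single words
```

Note that the two counters answer different questions: `bigram_counts` ranks word pairs, while `word_counts` ranks individual words.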

In [63]:
from collections import Counter
from nltk import ngrams

In [64]:
# Generate the Bitcoin N-grams where N=2
bitcoin_text = ' '.join(bitcoin_df.text)
bitcoin_processed = tokenizer(bitcoin_text)
bitcoin_ngrams = Counter(ngrams(bitcoin_processed, n=2))
print(dict(bitcoin_ngrams.most_common(10)))

{('aug', 'reuters'): 27, ('chars', 'us'): 16, ('us', 'stocks'): 16, ('chars', 'aug'): 10, ('federal', 'reserve'): 6, ('worlds', 'biggest'): 5, ('chars', 'bitcoin'): 5, ('bitcoin', 'mining'): 4, ('chars', 'london'): 4, ('london', 'aug'): 4}


In [65]:
# Generate the Ethereum N-grams where N=2
ethereum_text = ' '.join(ethereum_df.text)
ethereum_processed = tokenizer(ethereum_text)
ethereum_ngrams = Counter(ngrams(ethereum_processed, n=2))
print(dict(ethereum_ngrams.most_common(10)))

{('aug', 'reuters'): 7, ('ethereum', 'blockchain'): 6, ('chars', 'us'): 5, ('chars', 'new'): 5, ('crypto', 'market'): 5, ('chars', 'crypto'): 4, ('tornado', 'cash'): 4, ('retail', 'investors'): 4, ('chars', 'nonfungible'): 3, ('september', 'ethereum'): 3}
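
Many of the bigrams above contain `chars`, which comes from the `[+N chars]` truncation marker at the end of each article's `content` rather than from the articles themselves. One possible cleanup, sketched here as an assumption rather than as part of the original analysis, is to strip that marker before tokenizing. `TRUNCATION_MARKER` and `strip_truncation_marker` are hypothetical names.

```python
import re

# Matches the trailing marker seen in the article content above,
# e.g. "… [+3472 chars]" (the ellipsis is optional in this pattern).
TRUNCATION_MARKER = re.compile(r"…?\s*\[\+\d+ chars\]")


def strip_truncation_marker(text):
    """Remove '[+N chars]' truncation markers from article content."""
    return TRUNCATION_MARKER.sub("", text).strip()


sample = "Every machine inside of … [+3472 chars]"
print(strip_truncation_marker(sample))
# Every machine inside of
```

Applying this to each article's text before joining would remove pairs such as `('chars', 'us')` from the counts, at the cost of changing the results shown above.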


In [66]:
# Function token_count returns the N most common items for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [67]:
# Use token_count to get the top 10 bigrams for Bitcoin (pass bitcoin_processed instead to rank single words)
top_10_bitcoin = token_count(bitcoin_ngrams, 10)
top_10_bitcoin

[(('aug', 'reuters'), 27),
 (('chars', 'us'), 16),
 (('us', 'stocks'), 16),
 (('chars', 'aug'), 10),
 (('federal', 'reserve'), 6),
 (('worlds', 'biggest'), 5),
 (('chars', 'bitcoin'), 5),
 (('bitcoin', 'mining'), 4),
 (('chars', 'london'), 4),
 (('london', 'aug'), 4)]

In [68]:
# Use token_count to get the top 10 bigrams for Ethereum (pass ethereum_processed instead to rank single words)
top_10_ethereum = token_count(ethereum_ngrams, 10)
top_10_ethereum

[(('aug', 'reuters'), 7),
 (('ethereum', 'blockchain'), 6),
 (('chars', 'us'), 5),
 (('chars', 'new'), 5),
 (('crypto', 'market'), 5),
 (('chars', 'crypto'), 4),
 (('tornado', 'cash'), 4),
 (('retail', 'investors'), 4),
 (('chars', 'nonfungible'), 3),
 (('september', 'ethereum'), 3)]

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [73]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

ImportError: cannot import name 'is_directory' from 'PIL._util' (c:\Users\scott\anaconda3\envs\pyvizenv\envs\pyvizenv2\lib\site-packages\PIL\_util.py)

In [72]:
# Generate the Bitcoin word cloud
bitcoin_cloud = WordCloud().generate(bitcoin_text)
plt.imshow(bitcoin_cloud)

NameError: name 'WordCloud' is not defined

In [None]:
# Generate the Ethereum word cloud
ethereum_cloud = WordCloud().generate(ethereum_text)
plt.imshow(ethereum_cloud)

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text, then visualize the tags using spaCy.

In [74]:
import spacy
from spacy import displacy

In [75]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [76]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [77]:
# Display the concatenated Bitcoin text (joined earlier for the n-gram analysis)
bitcoin_text

'It was 8:45 in the morning of June 13 when Bill Stewart, the CEO of Maine-based bitcoin mining business Dynamics Mining, received a call from one of his employees. He\'s like, Every machine inside of … [+3472 chars] Tools to trace cryptocurrencies have, over just the last several years, allowed law enforcement agencies to convict dark web black market administrators, recover millions in ransomware payments, seiz… [+3510 chars] Cryptocurrencies are often criticized for being bad for the planet. Every year, bitcoin mining consumes more energy than Belgium, according to the University of Cambridges Bitcoin Electricity Consump… [+3829 chars] Posted \r\nFrom Bitcoin highs to blockchain bridge lows, to why some of the worlds biggest technology companies are freezing jobs, we round up the weeks big stories in the world of virtual money. Krist… [+17 chars] Aug 27 (Reuters) - Bitcoin was off 1.63% at $19,920 by late afternoon in Europe on Saturday, down $330 from its previous close.\r\nBitcoin

In [78]:
# Run the NER processor on all of the text
bitcoin_doc = nlp(bitcoin_text)

# Add a title to the document
bitcoin_doc.user_data["title"] = "Bitcoin NER"

In [79]:
# Render the visualization
displacy.render(bitcoin_doc, style='ent')

In [80]:
# List all Entities
for ent in bitcoin_doc.ents:
    print(ent.text, ent.label_)

8:45 in the morning of June 13 TIME
Bill Stewart PERSON
Maine GPE
Dynamics Mining ORG
the last several years DATE
millions CARDINAL
Every year DATE
Belgium GPE
the University of Cambridges Bitcoin Electricity Consump ORG
the weeks DATE
Krist ORG
Reuters ORG
1.63% PERCENT
19,920 MONEY
late afternoon TIME
Europe LOC
Saturday DATE
330 MONEY
58 CARDINAL
24,000 MONEY
first ORDINAL
August DATE
US GPE
this week DATE
as much as 4% PERCENT
24,191 MONEY
Reuters ORG
a good month DATE
months DATE
more than 17% PERCENT
July DATE
October DATE
July 30 DATE
Reuters ORG
3.36% PERCENT
24,584.24 MONEY
GMT WORK_OF_ART
Saturday DATE
798.93 MONEY
39.7% PERCENT
the year DATE
Aug 19 DATE
Reuters ORG
Friday DATE
three-week DATE
Reuters ORG
the United States GPE
20,000 MONEY
Central African Republics ORG
week DATE
Reuters ORG
this month DATE
25,000 MONEY
first ORDINAL
one CARDINAL
Today DATE
MicroStrategy ORG
Michael Saylor PERSON
quarterly DATE
more than $900 million MONEY
21 million CARDINAL
19 million CARDIN
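
A long entity listing like the one above is easier to read once it is tallied. The sketch below counts labels and organization names with `Counter`, using a few (text, label) pairs copied from the listing. In the notebook, the pairs would come from `(ent.text, ent.label_) for ent in bitcoin_doc.ents`; the variable names here are illustrative.

```python
from collections import Counter

# A few (text, label) pairs copied from the listing above
entities = [
    ("Bill Stewart", "PERSON"),
    ("Maine", "GPE"),
    ("Dynamics Mining", "ORG"),
    ("Belgium", "GPE"),
    ("Reuters", "ORG"),
    ("1.63%", "PERCENT"),
    ("Reuters", "ORG"),
    ("Michael Saylor", "PERSON"),
]

# How often each entity type occurs
label_counts = Counter(label for _, label in entities)

# Which organizations are mentioned most often
org_names = Counter(text for text, label in entities if label == "ORG")

print(label_counts.most_common())
print(org_names.most_common(1))
```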

---

### Ethereum NER

In [81]:
# Concatenate all of the Ethereum text together
ethereum_text = ' '.join(ethereum_df.text)
ethereum_text

'Cryptocurrencies are often criticized for being bad for the planet. Every year, bitcoin mining consumes more energy than Belgium, according to the University of Cambridges Bitcoin Electricity Consump… [+3829 chars] The non-fungible token\r\n (NFT) market has fallen off a cliff\r\n, but that\'s not stopping Instagram from doubling down on digital collectibles. After a test launch in May\r\n, the app is expanding its NF… [+1097 chars] It\'s a day of the week ending in the letter "y," which inevitably means there\'s news of another\r\nmessy\r\nsaga\r\n in the cryptocurrency world. The Securities and Exchange Commission has charged 11 peopl… [+1855 chars] Developers have picked a number of so-called total terminal difficulty required of the final block mined in Ethereum before the network switches to new software. Figuring out the exact date range whe… [+1084 chars] BANGKOK, Aug 8 (Reuters) - Crypto exchange Zipmex will release Ethereum and Bitcoin tokens from this week, a spokesperson sa

In [83]:
# Run the NER processor on all of the text
ethereum_doc = nlp(ethereum_text)

# Add a title to the document
ethereum_doc.user_data["title"] = "Ethereum NER"

In [84]:
# Render the visualization
displacy.render(ethereum_doc, style='ent')

In [85]:
# List all Entities
for ent in ethereum_doc.ents:
    print(ent.text, ent.label_)

Every year DATE
Belgium GPE
the University of Cambridges Bitcoin Electricity Consump ORG
NFT ORG
Instagram ORG
May DATE
NF ORG
The Securities and Exchange Commission ORG
11 CARDINAL
Ethereum ORG
Reuters ORG
Ethereum and Bitcoin ORG
this week DATE
Monday DATE
60% PERCENT
15 September DATE
Vivaldi PERSON
Jon von Tetzchner PERSON
+3393 ORG
The US Treasury Department's ORG
Office of Foreign Asset Control ORG
Monday DATE
Tornado Cash PERSON
North Korean NORP
Ameri ORG
as much as 75% PERCENT
JPMorgan ORG
Monday DATE
36% PERCENT
102% PERCENT
mid-June DATE
second ORDINAL
more than 99% PERCENT
the next month DATE
ENS ORG
eth.link PERSON
CoinDesk PRODUCT
Europe LOC
one CARDINAL
millions of pounds QUANTITY
Mangnall PERSON
November 2021 DATE
Joe Hovde PERSON
New York GPE
July 30 DATE
Reuters ORG
3.36% PERCENT
24,584.24 MONEY
GMT WORK_OF_ART
Saturday DATE
798.93 MONEY
39.7% PERCENT
the year DATE
Reuters ORG
1.63% PERCENT
19,920 MONEY
late afternoon TIME
Europe LOC
Saturday DATE
330 MONEY
58 CARDINA

---