# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [131]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\61421\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [132]:
# Read your api key environment variable
load_dotenv()
from newsapi import NewsApiClient
api_key = os.getenv("news_api")
type(api_key)

str

In [133]:
# Create a newsapi client
newsapi = NewsApiClient(api_key)

In [134]:
# Fetch the Bitcoin news articles
btc_news_en = newsapi.get_everything(
    q="bitcoin",
    language="en"
)

btc_news_en["totalResults"]

7976

In [135]:
# Fetch the Ethereum news articles
eth_news_en = newsapi.get_everything(
    q="ethereum",
    language="en"
)

eth_news_en["totalResults"]

2748

In [136]:
# Create Bitcoin sentiment scores DataFrame
btc_sentiments = []

for article in btc_news_en["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        btc_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass
    
# Create DataFrame
btc_sentiments_df = pd.DataFrame(btc_sentiments)

# Reorder DataFrame columns
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
btc_sentiments_df = btc_sentiments_df[cols]

btc_sentiments_df.head()


Unnamed: 0,date,text,compound,positive,negative,neutral
0,2021-11-05,A similar hoax earlier this year tied Walmart ...,-0.2732,0.0,0.063,0.937
1,2021-10-10,"Specifically, why did someone make a massive p...",0.5461,0.121,0.0,0.879
2,2021-10-28,Theres a big new presence slurping up power fr...,0.3612,0.096,0.0,0.904
3,2021-10-08,"Last week, the Wall Street Journal ran a piece...",0.34,0.099,0.0,0.901
4,2021-10-26,"For all the talk of democratizing finance, the...",0.0,0.0,0.0,1.0


In [137]:
# Create the Ethereum sentiment scores DataFrame
eth_sentiments = []

for article in eth_news_en["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        eth_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass
    
# Create DataFrame
eth_sentiments_df = pd.DataFrame(eth_sentiments)

# Reorder DataFrame columns
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
eth_sentiments_df = eth_sentiments_df[cols]

eth_sentiments_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2021-10-16,A new cross-chain bridge is currently connecte...,0.0,0.0,0.0,1.0
1,2021-10-14,Mark Cuban has some advice for people who are ...,0.0,0.0,0.0,1.0
2,2021-11-05,Ethereum and bitcoin are the two biggest crypt...,0.4588,0.094,0.0,0.906
3,2021-11-01,Elon Musk\r\npicture alliance / Getty Images\r...,0.5267,0.093,0.0,0.907
4,2021-11-01,Cryptocurrency and business continuity line im...,0.4588,0.097,0.0,0.903


In [138]:
# Describe the Bitcoin Sentiment
btc_sentiments_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,0.22853,0.05925,0.00315,0.9376
std,0.284345,0.065847,0.014087,0.064354
min,-0.2732,0.0,0.0,0.801
25%,0.0,0.0,0.0,0.89375
50%,0.148,0.032,0.0,0.9365
75%,0.485175,0.10625,0.0,1.0
max,0.7558,0.199,0.063,1.0


In [139]:
# Describe the Ethereum Sentiment
eth_sentiments_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,0.20508,0.04855,0.00675,0.9447
std,0.298922,0.066368,0.021718,0.075454
min,-0.0258,0.0,0.0,0.792
25%,0.0,0.0,0.0,0.9025
50%,0.0,0.0,0.0,1.0
75%,0.475775,0.09475,0.0,1.0
max,0.8225,0.208,0.087,1.0


### Questions:

Q: Which coin had the highest mean positive score?

A: BTC has higher mean positive score

Q: Which coin had the highest compound score?

A: BTC has higher mean compound score

Q. Which coin had the highest positive score?

A: ETH has the highest maximum positive score

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.

In [140]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [141]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw_list = set(stopwords.words('english'))

# Expand the default stopwords list if necessary
sw_addons = {}

In [142]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    
    # Remove the punctuation from text
    regex = re.compile("[^a-zA-Z]")
    re_clean = regex.sub(' ', text)
   
    # Create a tokenized list of the words
    words = word_tokenize(re_clean)
    
    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]
   
    # Convert the words to lowercase and remove stop words
    sw = sw_list
    sw_add = sw_addons
    tokens = [word.lower() for word in words if word.lower() not in sw.union(sw_add)]
    
    return ' '.join(tokens)

In [143]:
# Create a new tokens column for Bitcoin
btc_sentiments_df['token_text'] = btc_sentiments_df.apply(lambda x: tokenizer(x['text']), axis = 1)

In [144]:
btc_sentiments_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral,token_text
0,2021-11-05,A similar hoax earlier this year tied Walmart ...,-0.2732,0.0,0.063,0.937,similar hoax earlier year tied walmart litecoi...
1,2021-10-10,"Specifically, why did someone make a massive p...",0.5461,0.121,0.0,0.879,specifically someone make massive purchase bil...
2,2021-10-28,Theres a big new presence slurping up power fr...,0.3612,0.096,0.0,0.904,theres big new presence slurping power u grid ...
3,2021-10-08,"Last week, the Wall Street Journal ran a piece...",0.34,0.099,0.0,0.901,last week wall street journal ran piece three ...
4,2021-10-26,"For all the talk of democratizing finance, the...",0.0,0.0,0.0,1.0,talk democratizing finance vast majority bitco...


In [145]:
# Create a new tokens column for Ethereum
eth_sentiments_df['token_text'] = eth_sentiments_df.apply(lambda x: tokenizer(x['text']), axis = 1)
eth_sentiments_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral,token_text
0,2021-10-16,A new cross-chain bridge is currently connecte...,0.0,0.0,0.0,1.0,new cross chain bridge currently connected eth...
1,2021-10-14,Mark Cuban has some advice for people who are ...,0.0,0.0,0.0,1.0,mark cuban advice people new investing cryptoc...
2,2021-11-05,Ethereum and bitcoin are the two biggest crypt...,0.4588,0.094,0.0,0.906,ethereum bitcoin two biggest cryptocurrencies ...
3,2021-11-01,Elon Musk\r\npicture alliance / Getty Images\r...,0.5267,0.093,0.0,0.907,elon musk picture alliance getty images crypto...
4,2021-11-01,Cryptocurrency and business continuity line im...,0.4588,0.097,0.0,0.903,cryptocurrency business continuity line image ...


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [146]:
from collections import Counter
from nltk import ngrams

In [147]:
# Generate the Bitcoin N-grams where N=2
# Join all tokenized text into one big string

btc_corpus = btc_sentiments_df['token_text'].str.cat(sep=', ')

btc_bigrams = ngrams(btc_corpus, n=2)

btc_corpus

'similar hoax earlier year tied walmart litecoin buy something verge link vox media may earn commission see ethics statement photo illustration thiago prudencio chars, specifically someone make massive purchase billion worth bitcoin wednesday couple minutes many see huge buy signal bullishness may chars, theres big new presence slurping power u grid growing bitcoin miners new research shows u overtaken china top global destination bitcoin mining chars, last week wall street journal ran piece three recent nuclear bitcoin deals may signal growing trend industry journal piece reflects small growing sense excitemen chars, talk democratizing finance vast majority bitcoin continues owned relative handful investors flagged bloomberg newly released data national bureau chars, representation cryptocurrency bitcoin placed pc motherboard illustration taken june reuters dado ruvic illustrationhong kong oct reuters bitcoin fell slightly chars, representation virtual cryptocurrency bitcoin seen pict

In [148]:
# Generate the Ethereum N-grams where N=2
eth_corpus = eth_sentiments_df['token_text'].str.cat(sep=', ')

eth_bigrams = ngrams(eth_corpus, n=2)

In [149]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [150]:
# Use token_count to get the top 10 words for Bitcoin
btc_top_10 = dict(Counter(btc_bigrams).most_common(10))
btc_top_10_df = pd.DataFrame(list(btc_top_10.items()), columns=['bigram', 'count'])
btc_top_10

{('i', 'n'): 65,
 ('n', ' '): 64,
 ('s', ' '): 62,
 ('r', 'e'): 61,
 ('e', ' '): 54,
 (' ', 'c'): 53,
 ('c', 'o'): 51,
 (' ', 'b'): 47,
 ('e', 'r'): 45,
 ('i', 't'): 43}

In [151]:
# Use token_count to get the top 10 words for Ethereum
# YOUR CODE HERE!

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [152]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [153]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [154]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [155]:
import spacy
from spacy import displacy

In [156]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [157]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [159]:
# Concatenate all of the Bitcoin text together
concat_btc = btc_sentiments_df['text'].str.cat(sep=', ')
concat_btc

'A similar hoax earlier this year tied Walmart to Litecoin\r\nIf you buy something from a Verge link, Vox Media may earn a commission. See our ethics statement.\r\nPhoto Illustration by Thiago Prudencio/S… [+1900 chars], Specifically, why did someone make a massive purchase of $1.6 billion worth of bitcoin on Wednesday in a couple of minutes?\r\nWhile many see this huge buy as a signal of bullishness, there may be more… [+8443 chars], Theres a big new presence slurping up power from the U.S. grid, and its growing: bitcoin miners. New research shows that the U.S. has overtaken China as the top global destination for bitcoin mining … [+3088 chars], Last week, the Wall Street Journal ran a piece on three recent nuclear-bitcoin deals that may signal a growing trend in the industry. The Journal piece reflects a small but growing sense of excitemen… [+9512 chars], For all the talk of democratizing finance, the vast majority of Bitcoin continues to be owned by a relative handful of investors.

In [160]:
# Run the NER processor on all of the text
btc_ner = nlp(concat_btc)

# Add a title to the document


In [161]:
# Render the visualization
displacy.render(btc_ner, style='ent')

In [164]:
# List all Entities
from pprint import pprint
pprint([(X.text) for X in btc_ner])

['A',
 'similar',
 'hoax',
 'earlier',
 'this',
 'year',
 'tied',
 'Walmart',
 'to',
 'Litecoin',
 '\r\n',
 'If',
 'you',
 'buy',
 'something',
 'from',
 'a',
 'Verge',
 'link',
 ',',
 'Vox',
 'Media',
 'may',
 'earn',
 'a',
 'commission',
 '.',
 'See',
 'our',
 'ethics',
 'statement',
 '.',
 '\r\n',
 'Photo',
 'Illustration',
 'by',
 'Thiago',
 'Prudencio',
 '/',
 'S',
 '…',
 '[',
 '+1900',
 'chars',
 ']',
 ',',
 'Specifically',
 ',',
 'why',
 'did',
 'someone',
 'make',
 'a',
 'massive',
 'purchase',
 'of',
 '$',
 '1.6',
 'billion',
 'worth',
 'of',
 'bitcoin',
 'on',
 'Wednesday',
 'in',
 'a',
 'couple',
 'of',
 'minutes',
 '?',
 '\r\n',
 'While',
 'many',
 'see',
 'this',
 'huge',
 'buy',
 'as',
 'a',
 'signal',
 'of',
 'bullishness',
 ',',
 'there',
 'may',
 'be',
 'more',
 '…',
 '[',
 '+8443',
 'chars',
 ']',
 ',',
 'There',
 's',
 'a',
 'big',
 'new',
 'presence',
 'slurping',
 'up',
 'power',
 'from',
 'the',
 'U.S.',
 'grid',
 ',',
 'and',
 'its',
 'growing',
 ':',
 'bitcoin',

---

### Ethereum NER

In [165]:
# Concatenate all of the Ethereum text together
concat_eth = eth_sentiments_df['text'].str.cat(sep=', ')
concat_eth

'A new cross-chain bridge is currently connected to Ethereum through a cross-chain bridge, with Cardano and other public chains to come in the future.\r\nNervos\xa0today announced that the Force Bridge is … [+3114 chars], Mark Cuban has some advice for people who are new to investing in cryptocurrency.\r\nAs an investment, I think ethereum has the most upside, he told CNBC Make It Wednesday. Bitcoin, he added, is better… [+1139 chars], Ethereum and bitcoin are the two biggest cryptocurrencies.\r\nJordan Mansfield /Getty Images\r\nCrypto investors should be holding ethereum rather than bitcoin as interest rates rise, JPMorgan said, beca… [+2957 chars], Elon Musk\r\npicture alliance / Getty Images\r\nA cryptocurrency named after Elon Musk has shot to the moon with a 3,780% gain in October. \r\nDogelon Mars traded at $0.00000229 on November 1, up from $0.0… [+1533 chars], Cryptocurrency and business continuity line image for business concept.\r\nGetty Images\r\nLittle-known altcoin mana s

In [166]:
# Run the NER processor on all of the text
eth_ner = nlp(concat_eth)

# Add a title to the document


In [167]:
# Render the visualization
displacy.render(btc_ner, style='ent')

In [168]:
# List all Entities
from pprint import pprint
pprint([(X.text) for X in eth_ner])

['A',
 'new',
 'cross',
 '-',
 'chain',
 'bridge',
 'is',
 'currently',
 'connected',
 'to',
 'Ethereum',
 'through',
 'a',
 'cross',
 '-',
 'chain',
 'bridge',
 ',',
 'with',
 'Cardano',
 'and',
 'other',
 'public',
 'chains',
 'to',
 'come',
 'in',
 'the',
 'future',
 '.',
 '\r\n',
 'Nervos',
 '\xa0',
 'today',
 'announced',
 'that',
 'the',
 'Force',
 'Bridge',
 'is',
 '…',
 '[',
 '+3114',
 'chars',
 ']',
 ',',
 'Mark',
 'Cuban',
 'has',
 'some',
 'advice',
 'for',
 'people',
 'who',
 'are',
 'new',
 'to',
 'investing',
 'in',
 'cryptocurrency',
 '.',
 '\r\n',
 'As',
 'an',
 'investment',
 ',',
 'I',
 'think',
 'ethereum',
 'has',
 'the',
 'most',
 'upside',
 ',',
 'he',
 'told',
 'CNBC',
 'Make',
 'It',
 'Wednesday',
 '.',
 'Bitcoin',
 ',',
 'he',
 'added',
 ',',
 'is',
 'better',
 '…',
 '[',
 '+1139',
 'chars',
 ']',
 ',',
 'Ethereum',
 'and',
 'bitcoin',
 'are',
 'the',
 'two',
 'biggest',
 'cryptocurrencies',
 '.',
 '\r\n',
 'Jordan',
 'Mansfield',
 '/Getty',
 'Images',
 '\r\n',

---