# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [33]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mattbuchanan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [34]:
# Find DotEnv
from dotenv import find_dotenv
print(find_dotenv())

/Users/mattbuchanan/Desktop/.env


In [35]:
# Read your api key environment variable
load_dotenv()

True

In [36]:
from newsapi import NewsApiClient

In [37]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=os.getenv("NEWS_API_KEY"))

In [38]:
# Fetch the Bitcoin news articles
btc_articles = newsapi.get_everything(q="bitcoin", language='en')

In [39]:
# Fetch the Ethereum news articles
eth_articles = newsapi.get_everything(q="ethereum", language='en')

In [40]:
# Total Number Btc
btc_articles["totalResults"]

6703

In [41]:
# Total Number Eth
eth_articles["totalResults"]

2604

In [42]:
# Create the Bitcoin sentiment scores DataFrame
btc_sentiments = []

for article in btc_articles["articles"]:
    try: 
        text = article["content"]
        date = article["publishedAt"]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        btc_sentiments.append({'Compound':compound, 'Negative':neg, 'Neutral':neu, 'Positive':pos, 'text':text})
        
    except AttributeError: 
        pass
btc_df= pd.DataFrame(btc_sentiments)

In [43]:
# Show Data Frame
btc_df.head(20)

Unnamed: 0,Compound,Negative,Neutral,Positive,text
0,0.5574,0.036,0.838,0.127,You won't have to stick to Bitcoin if you're d...
1,0.0,0.0,1.0,0.0,Four months after Twitter first introduced in-...
2,0.0,0.0,1.0,0.0,Bitcoin and similar blockchain-based cryptos e...
3,0.34,0.0,0.924,0.076,"<ul><li>Bitcoin, in terms of market value, ros..."
4,0.4939,0.0,0.781,0.219,How high are the chances of Bitcoin sustaining...
5,-0.2411,0.116,0.884,0.0,JPMorgan CEO Jamie Dimon is still not a Bitcoi...
6,0.1901,0.043,0.866,0.091,Elon Musk has performed a sudden U-turn on bit...
7,0.5461,0.0,0.879,0.121,"Specifically, why did someone make a massive p..."
8,0.0,0.0,1.0,0.0,Its the countrys latest crackdown on digital c...
9,0.34,0.0,0.901,0.099,"Last week, the Wall Street Journal ran a piece..."


In [44]:
# Create the Ethereum sentiment scores DataFrame
eth_sentiments = []

for article in eth_articles["articles"]:
    try: 
        text = article["content"]
        date = article["publishedAt"]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        btc_sentiments.append({'Compound':compound, 'Negative':neg, 'Neutral':neu, 'Positive':pos, 'text':text})
        
    except AttributeError: 
        pass
eth_df= pd.DataFrame(btc_sentiments)

In [45]:
# Show Data Frame
eth_df.head(20)

Unnamed: 0,Compound,Negative,Neutral,Positive,text
0,0.5574,0.036,0.838,0.127,You won't have to stick to Bitcoin if you're d...
1,0.0,0.0,1.0,0.0,Four months after Twitter first introduced in-...
2,0.0,0.0,1.0,0.0,Bitcoin and similar blockchain-based cryptos e...
3,0.34,0.0,0.924,0.076,"<ul><li>Bitcoin, in terms of market value, ros..."
4,0.4939,0.0,0.781,0.219,How high are the chances of Bitcoin sustaining...
5,-0.2411,0.116,0.884,0.0,JPMorgan CEO Jamie Dimon is still not a Bitcoi...
6,0.1901,0.043,0.866,0.091,Elon Musk has performed a sudden U-turn on bit...
7,0.5461,0.0,0.879,0.121,"Specifically, why did someone make a massive p..."
8,0.0,0.0,1.0,0.0,Its the countrys latest crackdown on digital c...
9,0.34,0.0,0.901,0.099,"Last week, the Wall Street Journal ran a piece..."


In [46]:
# Describe the Bitcoin Sentiment
btc_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,20.0,20.0,20.0,20.0
mean,0.225325,0.01905,0.90565,0.0754
std,0.311257,0.034413,0.065915,0.067524
min,-0.4404,0.0,0.781,0.0
25%,0.0,0.0,0.8645,0.0
50%,0.2777,0.0,0.8965,0.0755
75%,0.5021,0.0345,0.955,0.1195
max,0.7269,0.116,1.0,0.219


In [47]:
# Describe the Ethereum Sentiment
eth_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,40.0,40.0,40.0,40.0
mean,0.220928,0.017225,0.9111,0.071775
std,0.310052,0.034704,0.081399,0.071659
min,-0.4404,0.0,0.694,0.0
25%,0.0,0.0,0.85675,0.0
50%,0.1962,0.0,0.9115,0.0655
75%,0.467575,0.0085,1.0,0.1195
max,0.8765,0.126,1.0,0.245


### Questions:

Q: Which coin had the highest mean positive score?



Q: Which coin had the highest compound score?



Q. Which coin had the highest positive score?


A. Eth has highest score in each category

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.

In [48]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [49]:
# Instantiate the lemmatizer
ltzr = WordNetLemmatizer()
# Create a list of stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mattbuchanan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [50]:
# Expand default stopwords
stop = stopwords.words("english")
stop.append("it'")
stop.append("'s")
stop.append("t")
stop = set(stop)

In [51]:
# Complete the tokenizer function
def tokenizer(text):
    # Create a list of the words
    words = word_tokenize(text)
    # Convert the words to lowercase
    words = list(filter(lambda w: w.lower(), words))
    # Remove the punctuation
    words = list(filter(lambda t: t not in punctuation, words))
    # Remove the stop words
    words = list(filter(lambda t: t.lower() not in stop, words))
    # Lemmatize Words into root words
    tokens = [ltzr.lemmatize(word) for word in words]
    return tokens

In [52]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mattbuchanan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [53]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mattbuchanan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [54]:
# Create a new tokens column for Bitcoin
btc_df["tokens"] = btc_df.text.apply(tokenizer)
btc_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,text,tokens
0,0.5574,0.036,0.838,0.127,You won't have to stick to Bitcoin if you're d...,"[wo, n't, stick, Bitcoin, 're, determined, pay..."
1,0.0,0.0,1.0,0.0,Four months after Twitter first introduced in-...,"[Four, month, Twitter, first, introduced, in-a..."
2,0.0,0.0,1.0,0.0,Bitcoin and similar blockchain-based cryptos e...,"[Bitcoin, similar, blockchain-based, cryptos, ..."
3,0.34,0.0,0.924,0.076,"<ul><li>Bitcoin, in terms of market value, ros...","[ul, li, Bitcoin, term, market, value, rose, 4..."
4,0.4939,0.0,0.781,0.219,How high are the chances of Bitcoin sustaining...,"[high, chance, Bitcoin, sustaining, gain, push..."


In [55]:
# Create a new tokens column for Ethereum
eth_df["tokens"] = eth_df.text.apply(tokenizer)
eth_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,text,tokens
0,0.5574,0.036,0.838,0.127,You won't have to stick to Bitcoin if you're d...,"[wo, n't, stick, Bitcoin, 're, determined, pay..."
1,0.0,0.0,1.0,0.0,Four months after Twitter first introduced in-...,"[Four, month, Twitter, first, introduced, in-a..."
2,0.0,0.0,1.0,0.0,Bitcoin and similar blockchain-based cryptos e...,"[Bitcoin, similar, blockchain-based, cryptos, ..."
3,0.34,0.0,0.924,0.076,"<ul><li>Bitcoin, in terms of market value, ros...","[ul, li, Bitcoin, term, market, value, rose, 4..."
4,0.4939,0.0,0.781,0.219,How high are the chances of Bitcoin sustaining...,"[high, chance, Bitcoin, sustaining, gain, push..."


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [56]:
from collections import Counter
from nltk import ngrams

In [57]:
# Generate the Bitcoin N-grams where N=2
N = 2
btc_ngrams = ngrams(tokenizer(btc_df.text.str.cat()), N)
Counter(btc_ngrams).most_common(10)

[(('U.Today', 'http'), 3),
 (('li', 'Bitcoin'), 2),
 (('/li', 'li'), 2),
 (('digital', 'currency'), 2),
 (('central', 'bank'), 2),
 (('Securities', 'Exchange'), 2),
 (('Exchange', 'Commission'), 2),
 (('wo', "n't"), 1),
 (("n't", 'stick'), 1),
 (('stick', 'Bitcoin'), 1)]

In [58]:
# Generate the Ethereum N-grams where N=2
N = 2
eth_ngrams = ngrams(tokenizer(eth_df.text.str.cat()), N)
Counter(eth_ngrams).most_common(10)

[(('Vitalik', 'Buterin'), 5),
 (('digital', 'currency'), 4),
 (('U.Today', 'http'), 3),
 (('central', 'bank'), 3),
 (('illustration', 'taken'), 3),
 (('Reuters', 'Bitcoin'), 3),
 (('wo', "n't"), 2),
 (("n't", 'stick'), 2),
 (('stick', 'Bitcoin'), 2),
 (('Bitcoin', "'re"), 2)]

In [59]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [60]:
# Use token_count to get the top 10 words for Bitcoin
btc_top10 = token_count(btc_ngrams)
btc_top10

[]

In [61]:
# Use token_count to get the top 10 words for Ethereum
eth_top10 = token_count(eth_ngrams)
eth_top10

[]

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [62]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [63]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [64]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [65]:
import spacy
from spacy import displacy

In [66]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [73]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

---
### Bitcoin NER

In [70]:
# Concatenate all of the Bitcoin text together
btc_concat =[]
for text in btc_df.loc[:,"text"]:
    btc_concat.append(text)

In [72]:
# Run the NER processor on all of the text
btc_ner = nlp(btc_concat)

# Add a title to the document
btc_ner.user_data["title"] = "Bitcoin NER"

# Add a title to the document
# YOUR CODE HERE!

NameError: name 'nlp' is not defined

In [29]:
# Render the visualization
# YOUR CODE HERE!

In [30]:
# List all Entities
# YOUR CODE HERE!

---

### Ethereum NER

In [71]:
eth_concat =[]
for text in eth_df.loc[:,"text"]:
    eth_concat.append(text)

In [74]:
# Run the NER processor on all of the text
eth_ner = nlp(eth_concat)

# Add a title to the document
eth_ner.user_data["title"] = "Ethereum NER"

# Add a title to the document
# YOUR CODE HERE!

NameError: name 'nlp' is not defined

In [33]:
# Render the visualization
# YOUR CODE HERE!

In [34]:
# List all Entities
# YOUR CODE HERE!

---