# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [News API](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum, then create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\matth\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
# Read your api key environment variable
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")
print(type(api_key))

<class 'str'>


In [4]:
# Create a newsapi client
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=api_key)

In [5]:
# Fetch the Bitcoin news articles
bitcoin_news = newsapi.get_everything(q="bitcoin", language="en")
bitcoin_articles = bitcoin_news["articles"]
bitcoin_articles[:3]

[{'source': {'id': 'the-verge', 'name': 'The Verge'},
  'author': 'Richard Lawler',
  'title': 'A fake press release claiming Kroger accepts crypto reached the retailer’s own webpage',
  'description': 'A crypto hoax claimed Kroger is accepting Bitcoin Cash. The fake press release was similar to one targeting Walmart earlier this year. The retailer quickly confirmed it’s fake, but not before the cryptocurrency’s price spiked by $30.',
  'url': 'https://www.theverge.com/2021/11/5/22765098/kroger-bitcoin-cash-cryptocurrency-hoax-pump-dump',
  'urlToImage': 'https://cdn.vox-cdn.com/thumbor/CKp0YjnwF88--mWg1kfPmspvfzY=/0x358:5000x2976/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/22988084/1234440443.jpg',
  'publishedAt': '2021-11-05T13:32:14Z',
  'content': 'A similar hoax earlier this year tied Walmart to Litecoin\r\nIf you buy something from a Verge link, Vox Media may earn a commission. See our ethics statement.\r\nPhoto Illustration by Thiago Prudencio/S… [+1900 chars]'},


In [6]:
# Fetch the Ethereum news articles
ethereum_news = newsapi.get_everything(q="ethereum", language="en")
ethereum_articles = ethereum_news["articles"]
ethereum_articles[:3]

[{'source': {'id': None, 'name': 'Blogspot.com'},
  'author': 'noreply@blogger.com (Unknown)',
  'title': 'Nervos launches cross-chain bridge to connect Ethereum and Cardano',
  'description': 'A new cross-chain bridge is currently connected to Ethereum through a cross-chain bridge, with Cardano and other public chains to come in the future.Nervos\xa0today announced that the Force Bridge is now live on the mainnet. The Nervos Network is a collection of…',
  'url': 'https://techncruncher.blogspot.com/2021/10/nervos-launches-cross-chain-bridge-to.html',
  'urlToImage': 'https://blogger.googleusercontent.com/img/a/AVvXsEgPPOybYbMwmsXrgektLx2gAB_TxrtYlXuFMKC9_ufbyBE23UZ7meSKtNO9FgKdDh0FZf-ugBepgc9Iooy6XQ5s4NkDthhSo2pPF-X2A3Aa2mXtZ5KSkUA4QwB7tEzJ8y79T4iN0A7XC-Ac_RdFuEhCDUuirVAvxQH4b_LUtvyto6aM_sFaDt5v39HYnQ=w1200-h630-p-k-no-nu',
  'publishedAt': '2021-10-16T18:50:00Z',
  'content': 'A new cross-chain bridge is currently connected to Ethereum through a cross-chain bridge, with Cardano and o

In [7]:
# Create the Bitcoin sentiment scores DataFrame
btc_dict = {}
for counter, article in enumerate(bitcoin_articles):
    # Skip articles with missing content; polarity_scores needs a string
    if not article['content']:
        continue

    bitcoin_analysis = analyzer.polarity_scores(article['content'])
    btc_dict[counter] = {
        'Compound': bitcoin_analysis["compound"],
        'Negative': bitcoin_analysis["neg"],
        'Neutral': bitcoin_analysis["neu"],
        'Positive': bitcoin_analysis["pos"],
        'text': article["content"],
    }

btc_analysis_df = pd.DataFrame.from_dict(btc_dict, orient='index')
btc_analysis_df.head()

|   | Compound | Negative | Neutral | Positive | text |
|---|---|---|---|---|---|
| 0 | -0.2732 | 0.063 | 0.937 | 0.0 | A similar hoax earlier this year tied Walmart ... |
| 1 | 0.3612 | 0.0 | 0.904 | 0.096 | Theres a big new presence slurping up power fr... |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | For all the talk of democratizing finance, the... |
| 3 | 0.5719 | 0.0 | 0.847 | 0.153 | In keeping with a previous announcement, AMC t... |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | Representation of cryptocurrency Bitcoin is pl... |


In [8]:
# Create the Ethereum sentiment scores DataFrame
eth_dict = {}
for counter, article in enumerate(ethereum_articles):
    # Skip articles with missing content; polarity_scores needs a string
    if not article['content']:
        continue

    ethereum_analysis = analyzer.polarity_scores(article['content'])
    eth_dict[counter] = {
        'Compound': ethereum_analysis["compound"],
        'Negative': ethereum_analysis["neg"],
        'Neutral': ethereum_analysis["neu"],
        'Positive': ethereum_analysis["pos"],
        'text': article["content"],
    }

eth_analysis_df = pd.DataFrame.from_dict(eth_dict, orient='index')
eth_analysis_df.head()

|   | Compound | Negative | Neutral | Positive | text |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | A new cross-chain bridge is currently connecte... |
| 1 | 0.5719 | 0.0 | 0.847 | 0.153 | In keeping with a previous announcement, AMC t... |
| 2 | 0.4588 | 0.0 | 0.906 | 0.094 | Ethereum and bitcoin are the two biggest crypt... |
| 3 | 0.5267 | 0.0 | 0.907 | 0.093 | Elon Musk\r\npicture alliance / Getty Images\r... |
| 4 | 0.4588 | 0.0 | 0.903 | 0.097 | Cryptocurrency and business continuity line im... |


In [9]:
# Describe the Bitcoin Sentiment
btc_analysis_df.describe()

|   | Compound | Negative | Neutral | Positive |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.19552 | 0.00315 | 0.9466 | 0.05025 |
| std | 0.289429 | 0.014087 | 0.066127 | 0.067138 |
| min | -0.2732 | 0.0 | 0.801 | 0.0 |
| 25% | 0.0 | 0.0 | 0.8965 | 0.0 |
| 50% | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 0.4767 | 0.0 | 1.0 | 0.1035 |
| max | 0.7558 | 0.063 | 1.0 | 0.199 |


In [10]:
# Describe the Ethereum Sentiment
eth_analysis_df.describe()

|   | Compound | Negative | Neutral | Positive |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.280475 | 0.0 | 0.93685 | 0.06315 |
| std | 0.308037 | 0.0 | 0.074661 | 0.074661 |
| min | 0.0 | 0.0 | 0.779 | 0.0 |
| 25% | 0.0 | 0.0 | 0.9025 | 0.0 |
| 50% | 0.1806 | 0.0 | 0.964 | 0.036 |
| 75% | 0.5306 | 0.0 | 1.0 | 0.0975 |
| max | 0.8225 | 0.0 | 1.0 | 0.221 |


### Questions:

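Q: Which coin had the highest mean positive score?

A: Ethereum (0.06315, versus 0.05025 for Bitcoin)

Q: Which coin had the highest compound score?

A: Ethereum (maximum 0.8225, versus 0.7558 for Bitcoin)

Q: Which coin had the highest positive score?

A: Ethereum (maximum 0.221, versus 0.199 for Bitcoin)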
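
The three answers can be cross-checked against the `describe()` summaries. The sketch below is a minimal illustration, not part of the original analysis: it copies the relevant values by hand from the two summary tables, and the names `summary` and `leader` are hypothetical helpers introduced here.

```python
# Values copied by hand from the describe() outputs for each coin.
summary = {
    "Bitcoin": {"mean_positive": 0.05025, "max_compound": 0.7558,
                "max_positive": 0.199, "max_negative": 0.063},
    "Ethereum": {"mean_positive": 0.06315, "max_compound": 0.8225,
                 "max_positive": 0.221, "max_negative": 0.0},
}


def leader(metric):
    """Return the coin with the largest value for the given metric."""
    return max(summary, key=lambda coin: summary[coin][metric])


for metric in ("mean_positive", "max_compound", "max_positive", "max_negative"):
    print(f"{metric}: {leader(metric)}")
```

With these values, Ethereum leads on the mean positive, maximum compound, and maximum positive scores, while Bitcoin has the higher maximum negative score (0.063 versus 0.0).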

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stopwords.

In [11]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [12]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw = set(stopwords.words('english'))

# Expand the default stopwords list if necessary
sw_expanded = {''}

In [13]:
# Complete the tokenizer function
def tokenizer(text):
    """Strips non-letters, lowercases, removes stopwords, and lemmatizes the text."""
    # Combine the default and custom stopwords once, not once per word
    stop_words = sw.union(sw_expanded)
    # Remove punctuation, digits, and line breaks (anything but letters and spaces).
    # Deleting "\r\n" outright fuses the words on either side (e.g. "muskpicture").
    text_clean = re.sub("[^a-zA-Z ]", "", text)
    # Tokenize words
    words = word_tokenize(text_clean)
    # Convert the words to lowercase and remove stop words
    lower_tokenized = [word.lower() for word in words if word.lower() not in stop_words]
    # Lemmatize words into root words (uses the lemmatizer instantiated above)
    tokens = [lemmatizer.lemmatize(word) for word in lower_tokenized]
    return tokens

In [14]:
# Create a new tokens column for Bitcoin
bitcoin_tokens = [tokenizer(article) for article in btc_analysis_df['text']]
btc_analysis_df['tokens'] = bitcoin_tokens
btc_analysis_df.head()

|   | Compound | Negative | Neutral | Positive | text | tokens |
|---|---|---|---|---|---|---|
| 0 | -0.2732 | 0.063 | 0.937 | 0.0 | A similar hoax earlier this year tied Walmart ... | [similar, hoax, earlier, year, tied, walmart, ... |
| 1 | 0.3612 | 0.0 | 0.904 | 0.096 | Theres a big new presence slurping up power fr... | [there, big, new, presence, slurping, power, u... |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | For all the talk of democratizing finance, the... | [talk, democratizing, finance, vast, majority,... |
| 3 | 0.5719 | 0.0 | 0.847 | 0.153 | In keeping with a previous announcement, AMC t... | [keeping, previous, announcement, amc, theater... |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | Representation of cryptocurrency Bitcoin is pl... | [representation, cryptocurrency, bitcoin, plac... |


In [15]:
# Create a new tokens column for Ethereum
ethereum_tokens = [tokenizer(article) for article in eth_analysis_df['text']]
eth_analysis_df["tokens"] = ethereum_tokens
eth_analysis_df.head()

|   | Compound | Negative | Neutral | Positive | text | tokens |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | A new cross-chain bridge is currently connecte... | [new, crosschain, bridge, currently, connected... |
| 1 | 0.5719 | 0.0 | 0.847 | 0.153 | In keeping with a previous announcement, AMC t... | [keeping, previous, announcement, amc, theater... |
| 2 | 0.4588 | 0.0 | 0.906 | 0.094 | Ethereum and bitcoin are the two biggest crypt... | [ethereum, bitcoin, two, biggest, cryptocurren... |
| 3 | 0.5267 | 0.0 | 0.907 | 0.093 | Elon Musk\r\npicture alliance / Getty Images\r... | [elon, muskpicture, alliance, getty, imagesa, ... |
| 4 | 0.4588 | 0.0 | 0.903 | 0.097 | Cryptocurrency and business continuity line im... | [cryptocurrency, business, continuity, line, i... |
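
Row 3 of the Ethereum output contains fused tokens such as `muskpicture`, because the tokenizer deletes the `\r\n` between "Musk" and "picture" instead of replacing it with a space. The sketch below illustrates the difference under simplifying assumptions: it uses only the standard library, `simple_tokens` is a hypothetical stand-in for the notebook's `tokenizer`, and `STOPWORDS` is a tiny made-up list rather than NLTK's English stopwords.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"a", "and", "the", "of", "is"}


def simple_tokens(text, replacement):
    """Swap non-letters for `replacement`, lowercase, split, and drop stopwords."""
    cleaned = re.sub(r"[^a-zA-Z ]", replacement, text)
    return [word for word in cleaned.lower().split() if word not in STOPWORDS]


sample = "Elon Musk\r\npicture alliance / Getty Images"
fused = simple_tokens(sample, "")       # non-letters deleted outright
separated = simple_tokens(sample, " ")  # non-letters replaced by a space
print(fused)
print(separated)
```

Substituting a space keeps "musk" and "picture" as separate tokens, at the cost of also splitting hyphenated words such as "cross-chain" into two tokens.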


---

### N-grams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2.
2. List the top 10 words for each coin.

In [15]:
from collections import Counter
from nltk import ngrams

In [16]:
# Generate the Bitcoin N-grams where N=2
# YOUR CODE HERE!

In [17]:
# Generate the Ethereum N-grams where N=2
# YOUR CODE HERE!
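
One possible way to fill in the two n-gram cells is sketched below. It is an illustration under assumptions rather than the notebook's own solution: the token list is a short hypothetical sample, `bigram_counts` is a made-up helper name, and the bigrams are built with plain `zip` so the snippet runs without NLTK. Inside the notebook, `ngrams(tokens, 2)` from the import cell yields the same tuples.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be the flattened 'tokens' column.
tokens = ["bitcoin", "price", "hit", "record", "bitcoin", "price", "fell"]


def bigram_counts(tokens):
    """Count adjacent token pairs (n-grams with N=2)."""
    return Counter(zip(tokens, tokens[1:]))


counts = bigram_counts(tokens)
print(counts.most_common(3))
```

Flattening every article into one list also pairs the last token of one article with the first token of the next. Counting each article separately and summing the resulting `Counter` objects avoids those cross-article pairs.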

In [18]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [19]:
# Use token_count to get the top 10 words for Bitcoin
# YOUR CODE HERE!

In [20]:
# Use token_count to get the top 10 words for Ethereum
# YOUR CODE HERE!
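
A hedged sketch of how `token_count` might be applied to a `tokens` column follows. The per-article lists here are hypothetical sample data, and the call passes `N=10` explicitly so the result does not depend on the function's default.

```python
from collections import Counter


def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count."""
    return Counter(tokens).most_common(N)


# Hypothetical per-article token lists standing in for a 'tokens' column.
token_lists = [
    ["bitcoin", "price", "record"],
    ["bitcoin", "mining", "power"],
    ["crypto", "bitcoin", "price"],
]

# Flatten the list of lists into one list of words before counting.
all_tokens = [token for tokens in token_lists for token in tokens]
top_words = token_count(all_tokens, N=10)
print(top_words[:2])
```

In the notebook, `token_lists` would be replaced by `btc_analysis_df['tokens']` or `eth_analysis_df['tokens']`.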

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [21]:
from wordcloud import WordCloud
import matplotlib as mpl
import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [22]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [23]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!
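
The word cloud cells could be approached along the following lines. This is a sketch under assumptions, not the notebook's solution: `cloud_frequencies` and `show_word_cloud` are hypothetical helper names, the token lists are made-up sample data, and the 800x400 canvas size is arbitrary. The plotting imports sit inside `show_word_cloud` so that the counting step runs even where `wordcloud` and `matplotlib` are not installed.

```python
from collections import Counter


def cloud_frequencies(token_lists):
    """Flatten per-article token lists into a word -> count mapping."""
    return Counter(token for tokens in token_lists for token in tokens)


def show_word_cloud(frequencies, title):
    """Render a word cloud from word counts (needs wordcloud and matplotlib)."""
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    cloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()


# Hypothetical sample standing in for btc_analysis_df['tokens'].
sample_tokens = [["bitcoin", "price", "bitcoin"], ["crypto", "bitcoin"]]
frequencies = cloud_frequencies(sample_tokens)
print(frequencies.most_common(2))

# In the notebook: show_word_cloud(frequencies, "Bitcoin Word Cloud")
```

Building the cloud from token counts reuses the cleaned tokens, so stopwords and punctuation removed by the tokenizer stay out of the image.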

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text with a pretrained spaCy model, then visualize the entity tags using displaCy.

In [24]:
import spacy
from spacy import displacy

In [25]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
# YOUR CODE HERE!

In [28]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [29]:
# Render the visualization
# YOUR CODE HERE!

In [30]:
# List all Entities
# YOUR CODE HERE!
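
The four NER cells could be filled in roughly as sketched below. The helper names `combine_text`, `run_ner`, and `list_entities` are hypothetical, and the title string is only an example. `run_ner` takes the loaded pipeline (`nlp` from the cell above) as an argument, so the snippet itself does not import spaCy.

```python
def combine_text(texts):
    """Join article snippets into one string, skipping missing entries."""
    return " ".join(text for text in texts if text)


def run_ner(nlp, text, title):
    """Process text with a loaded spaCy pipeline and attach a display title."""
    doc = nlp(text)
    # displaCy reads the heading for the rendered page from user_data["title"]
    doc.user_data["title"] = title
    return doc


def list_entities(doc):
    """Return (text, label) pairs for every entity spaCy found."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Hypothetical snippets standing in for btc_analysis_df['text'].
all_text = combine_text(["Bitcoin hit a record.", None, "AMC accepts crypto."])
print(all_text)

# In the notebook (requires spaCy and the en_core_web_sm model):
# doc = run_ner(nlp, combine_text(btc_analysis_df["text"]), "Bitcoin NER")
# displacy.render(doc, style="ent")
# list_entities(doc)
```

The same three helpers apply unchanged to the Ethereum text by passing `eth_analysis_df["text"]` and a different title.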

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
# YOUR CODE HERE!

In [32]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [33]:
# Render the visualization
# YOUR CODE HERE!

In [34]:
# List all Entities
# YOUR CODE HERE!

---