# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/emilianomendez/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Read your api key environment variable
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")
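`os.getenv` returns `None` when the variable is missing, and the failure then surfaces later inside the API client. A minimal sketch of failing early instead; the `require_env` helper and the `DEMO_API_KEY` name are hypothetical and used only for illustration:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a throwaway variable so the sketch runs without a real key
os.environ["DEMO_API_KEY"] = "abc123"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the notebook, the same check could wrap `NEWS_API_KEY` right after `load_dotenv()`.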

In [3]:
# Create a newsapi client
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=api_key)

In [4]:
# Fetch the Bitcoin news articles
bitcoin_news = newsapi.get_everything(
    q="bitcoin",
    language="en",
    sort_by="relevancy"
)
print(f"Total articles about bitcoin: {bitcoin_news['totalResults']}")


Total articles about bitcoin: 9658


In [5]:
bitcoin_news

{'status': 'ok',
 'totalResults': 9658,
 'articles': [{'source': {'id': 'wired', 'name': 'Wired'},
   'author': 'Arielle Pardes',
   'title': 'Miami’s Bitcoin Conference Left a Trail of Harassment',
   'description': 'For some women, inappropriate conduct from other conference-goers continued to haunt them online.',
   'url': 'https://www.wired.com/story/bitcoin-2022-conference-harassment/',
   'urlToImage': 'https://media.wired.com/photos/627a89e3e37e715cb7d760d2/191:100/w_1280,c_limit/Bitcoin_Miami_Biz_GettyImages-1239817123.jpg',
   'publishedAt': '2022-05-10T16:59:46Z',
   'content': 'Now, even though there are a number of women-focused crypto spaces, Odeniran says women are still underrepresented. Ive been in spaces where Im the only Black person, or the only woman, or the only B… [+3828 chars]'},
  {'source': {'id': 'the-verge', 'name': 'The Verge'},
   'author': 'Justine Calma',
   'title': 'Why fossil fuel companies see green in Bitcoin mining projects',
   'description': 'Exxo

In [6]:
# Fetch the Ethereum news articles
ethereum_news = newsapi.get_everything(
    q="ethereum",
    language="en",
    sort_by="relevancy"
)
print(f"Total articles about ethereum: {ethereum_news['totalResults']}")

Total articles about ethereum: 4491


In [7]:
ethereum_news

{'status': 'ok',
 'totalResults': 4491,
 'articles': [{'source': {'id': 'engadget', 'name': 'Engadget'},
   'author': 'Jon Fingas',
   'title': "Here's what NFTs look like on Instagram",
   'description': "Meta has revealed more of how NFTs will work on Instagram. In the US-based test, you can show what you've bought or created for free by connecting your Instagram account to a compatible digital wallet and posting for the world to see. If you like, the social …",
   'url': 'https://www.engadget.com/instagram-nft-details-131020868.html',
   'urlToImage': 'https://s.yimg.com/os/creatr-uploaded-images/2022-05/2546c160-d05e-11ec-b75e-e45eaa8c5b2b',
   'publishedAt': '2022-05-10T13:10:20Z',
   'content': "Meta has revealed more of how NFTs will work on Instagram. In the US-based test, you can show what you've bought or created for free by connecting your Instagram account to a compatible digital walle… [+1223 chars]"},
  {'source': {'id': None, 'name': 'The Guardian'},
   'author': 'Alex H

In [8]:
# Download/Update the VADER Lexicon
nltk.download("vader_lexicon")
# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/emilianomendez/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [9]:
# Create the Bitcoin sentiment scores DataFrame
bitcoin_sentiment = []

for article in bitcoin_news["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        bitcoin_sentiment.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass
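The loop above stores four VADER scores per article. The `pos`, `neu`, and `neg` values are proportions of the text, while `compound` is a summed word valence squashed into the range -1 to 1. A minimal sketch of that squashing step, assuming the default `alpha = 15` used by the VADER reference implementation; the summed valence passed in is a made-up number, not one taken from an article:

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded summed valence onto the interval [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clip to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm))


# Hypothetical summed valences, for illustration only
for raw in (-100.0, -1.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize(raw), 4))
```

Because of this squashing, compound scores near 0 indicate nearly neutral text, and large sums of valence saturate close to -1 or 1.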

In [10]:
bitcoin_df = pd.DataFrame(bitcoin_sentiment)

In [11]:
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
bitcoin_df = bitcoin_df[cols]

bitcoin_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2022-05-10,"Now, even though there are a number of women-f...",0.0772,0.036,0.0,0.964
1,2022-05-04,A Bitcoin mining site powered by otherwise los...,-0.0516,0.056,0.061,0.882
2,2022-05-02,Warren Buffett has always been a bitcoin skept...,-0.3269,0.085,0.143,0.772
3,2022-05-16,"As a kid, I remember when my father tried to u...",0.3818,0.114,0.052,0.833
4,2022-05-09,"Image source, Getty Images\r\nThe value of Bit...",0.34,0.072,0.0,0.928


In [12]:
# Create the Ethereum sentiment scores DataFrame
ethereum_sentiment = []

for article in ethereum_news["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        
        ethereum_sentiment.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
            
        })
        
    except AttributeError:
        pass

In [13]:
ethereum_df = pd.DataFrame(ethereum_sentiment)

In [14]:
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
ethereum_df = ethereum_df[cols]

ethereum_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2022-05-10,Meta has revealed more of how NFTs will work o...,0.6486,0.135,0.0,0.865
1,2022-05-02,A multi-billion dollar cryptocurrency company ...,-0.2263,0.046,0.075,0.879
2,2022-05-04,When Bored Ape Yacht Club creators Yuga Labs a...,-0.2732,0.0,0.055,0.945
3,2022-04-26,April 26 (Reuters) - Ether has promised to do ...,0.5346,0.142,0.0,0.858
4,2022-04-26,Ethereum is preparing for an upgrade thats bee...,0.2716,0.065,0.0,0.935


In [15]:
# Describe the Bitcoin Sentiment
bitcoin_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,-0.09339,0.05945,0.08045,0.86005
std,0.389782,0.062439,0.07613,0.104336
min,-0.8593,0.0,0.0,0.557
25%,-0.36635,0.0,0.0535,0.827
50%,-0.1901,0.048,0.063,0.888
75%,0.152575,0.085,0.08425,0.93025
max,0.7506,0.202,0.3,0.964


In [16]:
# Describe the Ethereum Sentiment
ethereum_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,-0.02918,0.0459,0.04945,0.9048
std,0.402413,0.059923,0.043592,0.052498
min,-0.6908,0.0,0.0,0.822
25%,-0.28445,0.0,0.0,0.85875
50%,-0.1897,0.0,0.059,0.9255
75%,0.2887,0.073,0.069,0.937
max,0.6908,0.178,0.178,1.0


### Questions:

Q: Which coin had the highest mean positive score?

A: Bitcoin, with a mean positive score of 0.05945 against 0.0459 for Ethereum.

Q: Which coin had the highest compound score?

A: Bitcoin, with a maximum compound score of 0.7506 against 0.6908 for Ethereum.

Q: Which coin had the highest positive score?

A: Bitcoin, with a maximum positive score of 0.202 against 0.178 for Ethereum.
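The three answers can also be read off programmatically rather than by eye. A minimal sketch, assuming each coin's scores are held as a list of dictionaries shaped like `bitcoin_sentiment`; the `compare_sentiment` and `winners` helpers and the toy scores are hypothetical, not the values reported in this notebook:

```python
from statistics import mean


def compare_sentiment(scores_by_coin):
    """Summarize, per coin, the statistics the three questions ask about."""
    return {
        coin: {
            "mean_positive": mean(row["positive"] for row in rows),
            "max_compound": max(row["compound"] for row in rows),
            "max_positive": max(row["positive"] for row in rows),
        }
        for coin, rows in scores_by_coin.items()
    }


def winners(summary):
    """Return the coin with the highest value for each statistic."""
    stats = next(iter(summary.values())).keys()
    return {stat: max(summary, key=lambda coin: summary[coin][stat]) for stat in stats}


# Toy scores for illustration only
toy_scores = {
    "bitcoin": [{"compound": 0.5, "positive": 0.2}, {"compound": -0.2, "positive": 0.0}],
    "ethereum": [{"compound": 0.6, "positive": 0.1}, {"compound": 0.1, "positive": 0.05}],
}
summary = compare_sentiment(toy_scores)
best = winners(summary)
print(best)
```

With the notebook's data, the input would be `{"bitcoin": bitcoin_sentiment, "ethereum": ethereum_sentiment}`.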

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.
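The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the NLTK-based tokenizer the section asks for: `TOY_STOPWORDS` is a tiny hand-written stand-in for NLTK's much longer English stopword list, `simple_tokenize` is a hypothetical name, and splitting on whitespace replaces `word_tokenize`.

```python
import re

# Tiny stand-in stopword set; NLTK's English list is much longer
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}


def simple_tokenize(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip non-letters, split on whitespace, and drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter or space with a space
    letters_only = re.sub(r"[^a-z ]", " ", lowered)
    return [word for word in letters_only.split() if word not in stopwords]


tokens = simple_tokenize("The price of Bitcoin is rising, and Ethereum follows!")
print(tokens)
```

Lemmatization is left out here because it needs the WordNet data that NLTK downloads separately.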

In [17]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [18]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
def process_text(article):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', article)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return output

# Expand the default stopwords list if necessary


In [19]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    # Build the stop word set
    sw = set(stopwords.words('english'))

    # Remove the punctuation from text
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)

    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase and remove the stop words
    tokens = [word.lower() for word in lem if word.lower() not in sw]

    return tokens

In [20]:
# Create a new tokens column for Bitcoin
bitcoin_df["tokens"] = bitcoin_df["text"].apply(tokenizer)

In [21]:
# Create a new tokens column for Ethereum
ethereum_df["tokens"] = ethereum_df["text"].apply(tokenizer)

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequencies for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
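Both steps reduce to counting: a bigram is a pair of adjacent tokens, and the top words are the most common single tokens. A minimal standard-library sketch on a made-up token list; the `bigrams` helper is hypothetical, and NLTK's `ngrams` function produces the same kind of tuples:

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


# Made-up tokens for illustration only
tokens = ["bitcoin", "price", "falls", "bitcoin", "price", "rises"]

bigram_counts = Counter(bigrams(tokens))
top_bigram = bigram_counts.most_common(1)
top_words = Counter(tokens).most_common(2)

print(top_bigram)
print(top_words)
```

`Counter.most_common(n)` returns `(item, count)` pairs sorted by count, which is why the notebook's `token_count` helper is a one-liner.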

In [22]:
from collections import Counter
from nltk import ngrams

In [23]:
# Generate the Bitcoin N-grams where N=2
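bitcoin_tokens = process_text(" ".join(bitcoin_df["text"]))
bitcoin_bigrams = Counter(ngrams(bitcoin_tokens, n=2))
bitcoin_bigrams.most_common(10)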

In [24]:
from nltk.corpus import reuters, stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

In [25]:
lemmatizer = WordNetLemmatizer()

In [26]:
# The Reuters corpus has no "bitcoin" category, so read one of the fetched articles instead
article = bitcoin_df["text"].iloc[2]
print(article)

In [None]:
# Generate the Ethereum N-grams where N=2
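ethereum_tokens = process_text(" ".join(ethereum_df["text"]))
ethereum_bigrams = Counter(ngrams(ethereum_tokens, n=2))
ethereum_bigrams.most_common(10)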

In [None]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [None]:
# Use token_count to get the top 10 words for Bitcoin
bitcoin_tokens = process_text(" ".join(bitcoin_df["text"]))
token_count(bitcoin_tokens, N=10)


In [None]:
# The Reuters corpus has no "bitcoin" category, so count the fetched NewsAPI articles instead
bitcoin_corpus = [
    article["content"]
    for article in bitcoin_news["articles"]
    if article["content"]
]

print(f"Total number of news articles about bitcoin: {len(bitcoin_corpus)}")


In [27]:
# Use token_count to get the top 10 words for Ethereum
ethereum_tokens = process_text(" ".join(ethereum_df["text"]))
token_count(ethereum_tokens, N=10)

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [29]:
vectorizer = TfidfVectorizer(stop_words="english")
# Fit on the article text; iterating over the raw API response dict only yields its keys
X = vectorizer.fit_transform(bitcoin_df["text"])

In [30]:
# Creating a DataFrame Representation of the TF-IDF results
bitcoin_news_df = pd.DataFrame(
    list(zip(vectorizer.get_feature_names_out(), np.ravel(X.sum(axis=0)))),
    columns=["Word", "Frequency"],
)

# Order the DataFrame by summed TF-IDF weight in descending order
bitcoin_news_df = bitcoin_news_df.sort_values(by=["Frequency"], ascending=False)

# Print the top 10 words
bitcoin_news_df.head(10)
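The `Frequency` column above holds the sum of each word's TF-IDF weight across documents, not a raw count. A minimal standard-library sketch of the textbook weighting, term frequency times the log of inverse document frequency, on made-up token lists. The `tf_idf` helper is hypothetical, and scikit-learn's `TfidfVectorizer` additionally smooths the IDF term and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents for illustration only
docs = [
    ["bitcoin", "price", "falls"],
    ["bitcoin", "mining", "grows"],
    ["ethereum", "upgrade", "nears"],
]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in fewer documents, such as "price" here, gets a higher weight than one shared across documents, such as "bitcoin".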


In [31]:
# The Reuters corpus has no "ethereum" category, so count the fetched NewsAPI articles instead
ethereum_corpus = [
    article["content"]
    for article in ethereum_news["articles"]
    if article["content"]
]

print(f"Total number of news articles about ethereum: {len(ethereum_corpus)}")

In [32]:
vectorizer = TfidfVectorizer(stop_words="english")
# Fit on the article text; iterating over the raw API response dict only yields its keys
X = vectorizer.fit_transform(ethereum_df["text"])

In [33]:
# Creating a DataFrame Representation of the TF-IDF results
ethereum_news_df = pd.DataFrame(
    list(zip(vectorizer.get_feature_names_out(), np.ravel(X.sum(axis=0)))),
    columns=["Word", "Frequency"],
)

# Order the DataFrame by summed TF-IDF weight in descending order
ethereum_news_df = ethereum_news_df.sort_values(by=["Frequency"], ascending=False)

# Print the top 10 words
ethereum_news_df.head(10)


In [34]:
def process_text(doc):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return output

def bigram_counter(corpus): 
    # Combine all articles in corpus into one large string
    big_string = ' '.join(corpus)
    processed = process_text(big_string)
    bigrams = ngrams(processed, n=2)
    top_10 = dict(Counter(bigrams).most_common(10))
    return pd.DataFrame(list(top_10.items()), columns=['bigram', 'count'])

def word_counter(corpus): 
    # Combine all articles in corpus into one large string
    big_string = ' '.join(corpus)
    processed = process_text(big_string)
    top_10 = dict(Counter(processed).most_common(10))
    return pd.DataFrame(list(top_10.items()), columns=['word', 'count'])

In [35]:
# Generate the Bitcoin N-grams where N=2
# bitcoin_sentiment is a list of dicts, so take the text column of the DataFrame instead
corpus = bitcoin_df["text"]
bigram_counter(corpus)

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [36]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [37]:
# Generate the Bitcoin word cloud
from nltk.corpus import stopwords, reuters
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import re
import matplotlib.pyplot as plt

# Code to download wordnet corpora
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/emilianomendez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [38]:
# Generate the Ethereum word cloud
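ethereum_text = " ".join(process_text(" ".join(ethereum_df["text"])))
wc = WordCloud().generate(ethereum_text)
plt.imshow(wc)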

In [39]:
# Generate the Bitcoin word cloud
# The Reuters corpus has no "bitcoin" category, so build the corpus from the fetched articles
corpus = bitcoin_df["text"].tolist()

In [40]:
def process_text(doc):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return ' '.join(output)

In [41]:
big_string = ' '.join(corpus)
input_text = process_text(big_string)

In [42]:
wc = WordCloud().generate(input_text)
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text, then visualize the tags using spaCy.

In [43]:
import spacy
from spacy import displacy

In [44]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 12.7 MB/s eta 0:00:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [45]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [47]:
# Concatenate all of the Bitcoin text together
article = bitcoin_df["text"].str.cat(sep=" ")

In [48]:
# Run the NER processor on all of the text
doc = nlp(article)

# Add a title to the document
doc.user_data["title"] = "Bitcoin NER"

In [57]:
# Render the visualization
displacy.render(doc, style='ent')

In [58]:
# List all Entities
print([ent.text for ent in doc.ents])

---

### Ethereum NER

In [53]:
# Concatenate all of the Ethereum text together
article = ethereum_df["text"].str.cat(sep=" ")

In [54]:
# Run the NER processor on all of the text
doc = nlp(article)

# Add a title to the document
doc.user_data["title"] = "Ethereum NER"

In [55]:
# Render the visualization
displacy.render(doc, style='ent')

In [56]:
# List all Entities
print([ent.text for ent in doc.ents])

---