# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
Ethereum has the highest mean positive score.
2. Which coin had the highest negative score?
Bitcoin has the highest negative score.
3. Which coin had the highest positive score?
Ethereum has the highest positive score.

In [6]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

In [7]:
import nltk 
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/diopmouhamadoulamine/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [8]:
!pip install NLTK



In [24]:
# Read your api key environment variable
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")
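
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file would only surface later, when the client is created. A minimal guard, assuming the key is stored under `NEWS_API_KEY` as above; `require_env` is a hypothetical helper name and not part of `python-dotenv`:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that points at the likely cause
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value
```

In the notebook this could be called as `api_key = require_env("NEWS_API_KEY")` right after `load_dotenv()`.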

In [40]:
# Install the NewsAPI client; the PyPI package that provides NewsApiClient is newsapi-python
!pip install newsapi-python



In [None]:
# Create a newsapi client
from newsapi import NewsApiClient

In [None]:
newsapi = NewsApiClient(api_key=api_key)

In [None]:
# Fetch the Bitcoin news articles
BTC_headlines = newsapi.get_everything(
    q="bitcoin",
    language="en",
    sort_by="relevancy"
)

In [None]:
# Fetch the Ethereum news articles
ETH_headlines = newsapi.get_everything(
    q="ethereum",
    language="en",
    sort_by="relevancy"
)

In [None]:
# Create the Bitcoin sentiment scores DataFrame
sentiments = []

for articles in BTC_headlines["articles"]:
    try:
        text = articles["content"]
        results = analyzer.polarity_scores(text)
        compound = results["compound"]
        pos = results["pos"]
        neu = results["neu"]
        neg = results["neg"]

        sentiments.append({
            "text": text,
            "Compound": compound,
            "Positive": pos,
            "Negative": neg,
            "Neutral": neu,
        })
    except AttributeError:
        pass
    
BTC  = pd.DataFrame(sentiments)
BTC.head()

In [None]:
# Create the Ethereum sentiment scores DataFrame
sentiments = []

for articles in ETH_headlines["articles"]:
    try:
        text = articles["content"]
        results = analyzer.polarity_scores(text)
        compound = results["compound"]
        pos = results["pos"]
        neu = results["neu"]
        neg = results["neg"]

        sentiments.append({
            "text": text,
            "Compound": compound,
            "Positive": pos,
            "Negative": neg,
            "Neutral": neu,
        })
    except AttributeError:
        pass
    
ETH  = pd.DataFrame(sentiments)
ETH.head()
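
The Bitcoin and Ethereum cells differ only in which articles they loop over, so the scoring could be factored into one function. The following is a sketch under stated assumptions, not the notebook's own method: `build_sentiment_df`, `score_fn`, and `fake_scores` are hypothetical names, and the stand-in scorer exists only so the example runs without NLTK. It also skips missing content explicitly instead of catching `AttributeError`.

```python
import pandas as pd


def build_sentiment_df(articles, score_fn):
    """Score each article's content and collect the results in a DataFrame.

    `score_fn` is any callable that maps a string to a dict with the keys
    "compound", "pos", "neu" and "neg" (the shape VADER's polarity_scores returns).
    """
    rows = []
    for article in articles:
        text = article.get("content")
        if not text:
            # Skip articles whose content is missing or empty
            continue
        scores = score_fn(text)
        rows.append({
            "text": text,
            "Compound": scores["compound"],
            "Positive": scores["pos"],
            "Negative": scores["neg"],
            "Neutral": scores["neu"],
        })
    return pd.DataFrame(rows)


# Stand-in scorer so the sketch runs without NLTK; in the notebook,
# pass analyzer.polarity_scores instead.
def fake_scores(text):
    return {"compound": 0.0, "pos": 0.0, "neu": 1.0, "neg": 0.0}


sample_articles = [{"content": "Bitcoin rallies"}, {"content": None}]
sample_df = build_sentiment_df(sample_articles, fake_scores)
```

In the notebook, the equivalent calls would be `build_sentiment_df(BTC_headlines["articles"], analyzer.polarity_scores)` and the same with `ETH_headlines`.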

In [None]:
# Describe the Bitcoin Sentiment
BTC.describe()

In [None]:
# Describe the Ethereum Sentiment
ETH.describe()
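
Reading two `describe()` tables side by side is error-prone, so the statistics the questions ask about could be gathered into a single table. This is a sketch only: `compare_coins` is a hypothetical helper, and the two small DataFrames hold made-up scores for illustration, not real results.

```python
import pandas as pd

# Toy scores for illustration only; these are not real news sentiment results
btc_demo = pd.DataFrame({"Compound": [0.10, -0.40, 0.60], "Positive": [0.05, 0.00, 0.20]})
eth_demo = pd.DataFrame({"Compound": [0.30, 0.20, -0.10], "Positive": [0.10, 0.15, 0.05]})


def compare_coins(frames):
    """Summarize each coin's scores in one row so the coins can be compared directly."""
    summary = {
        name: {
            "mean_positive": df["Positive"].mean(),
            "max_compound": df["Compound"].max(),
            "max_positive": df["Positive"].max(),
        }
        for name, df in frames.items()
    }
    # Transpose so that each coin is a row and each statistic is a column
    return pd.DataFrame(summary).T


summary = compare_coins({"Bitcoin": btc_demo, "Ethereum": eth_demo})
# idxmax returns the row label (coin name) holding the largest value in each column
winners = summary.idxmax()
```

With the notebook's data, the call would be `compare_coins({"Bitcoin": BTC, "Ethereum": ETH})`.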

### Questions:

**Important note:** The sample answers may vary depending on when this code is run, since the news changes over time.

Q: Which coin had the highest mean positive score?

A: Ethereum had a slightly higher mean positive score

Q: Which coin had the highest compound score?

A: Bitcoin had the highest compound score

Q: Which coin had the highest positive score?

A: Bitcoin had the highest positive score

---

## 2. Natural Language Processing
---
### Tokenize

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stopwords.

In [None]:
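# These imports rely on NLTK data packages (for example 'punkt', 'stopwords',
# and 'wordnet'); download them with nltk.download() if they are missing
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re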

In [None]:
# Instantiate the lemmatizer
wnl = WordNetLemmatizer() 

# Create a list of stopwords
stop = stopwords.words('english')

# Expand the default stopwords list if necessary
stop.append("u")
stop.append("it'")
stop.append("'s")
stop.append("n't")
stop.append('…')
stop.append("`")
stop.append('``')
stop.append('char')
stop.append("''")
stop = set(stop)

In [None]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    
    # Create a list of the words
    words = word_tokenize(text)

    # Convert the words to lowercase (filter() would only select words, not transform them)
    words = [w.lower() for w in words]
    
    # Remove the punctuation
    words = list(filter(lambda t: t not in punctuation, words))
    
    # Remove the stopwords
    words = list(filter(lambda t: t.lower() not in stop, words))
    
    # Lemmatize Words into root words
    tokens = [wnl.lemmatize(word) for word in words]
    
    return tokens


In [None]:
# Create a new tokens column for Bitcoin
BTC["tokens"] = BTC.text.apply(tokenizer)
BTC.head()

In [None]:
# Create a new tokens column for Ethereum
ETH["tokens"] = ETH.text.apply(tokenizer)
ETH.head()

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
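
For N = 2, an n-gram is simply each token paired with the token that follows it. The sketch below shows that idea with the standard library only, as an illustration of the kind of output `nltk.ngrams` yields for N = 2; `bigrams` and the demo token list are hypothetical and not taken from the news data.

```python
from collections import Counter


def bigrams(tokens):
    """Pair every token with its successor, yielding len(tokens) - 1 tuples."""
    return list(zip(tokens, tokens[1:]))


demo_tokens = ["bitcoin", "price", "rise", "bitcoin", "price", "fall"]
demo_counts = Counter(bigrams(demo_tokens))
top_bigram = demo_counts.most_common(1)  # [(("bitcoin", "price"), 2)]
```

Counting the tuples with `Counter` is what turns the list of bigrams into a frequency ranking.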

In [None]:
from collections import Counter
from nltk import ngrams

In [None]:
# Generate the Bitcoin N-grams where N=2
N = 2
grams = ngrams(tokenizer(BTC.text.str.cat()), N)
Counter(grams).most_common(20)

In [None]:
# Generate the Ethereum N-grams where N=2
N = 2
grams = ngrams(tokenizer(ETH.text.str.cat()), N)
Counter(grams).most_common(20)

In [None]:
# Function token_count returns the top N words for a given coin (10 by default)
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [None]:
# Use token_count to get the top 10 words for Bitcoin
all_tokens = tokenizer(BTC.text.str.cat())
token_count(all_tokens, 10)

In [None]:
# Use token_count to get the top 10 words for Ethereum
all_tokens = tokenizer(ETH.text.str.cat())
token_count(all_tokens, 10)

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [None]:
def wordcloud(text, title=""):
    df_cloud = WordCloud(width=500, colormap='RdYlBu').generate(text)
    plt.imshow(df_cloud)
    plt.axis("off")
    fontdict = {"fontsize": 48, "fontweight" : "bold"}
    plt.title(title, fontdict=fontdict)
    plt.show()

In [None]:
wordcloud(BTC.text.str.cat(), title="Bitcoin Word Cloud")

In [None]:
wordcloud(ETH.text.str.cat(), title="Ethereum Word Cloud")

---

## 3. Named Entity Recognition

In this section, you will run named entity recognition on the text for both coins and visualize the tags using spaCy.

In [None]:
import spacy
from spacy import displacy

In [None]:
# Download the language model for spaCy if needed
# !python -m spacy download en_core_web_sm

In [None]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---

## Bitcoin NER

In [None]:
# Concatenate all of the Bitcoin text together
all_BTC_text = BTC.text.str.cat()
all_BTC_text

In [None]:
# Run the NER processor on all of the text
doc = nlp(all_BTC_text)

# Add a title to the document
doc.user_data["title"] = "Bitcoin NER"

In [None]:
# Render the visualization
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# List all Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

---

## Ethereum NER

In [None]:
# Concatenate all of the Ethereum text together
all_ETH_text = ETH.text.str.cat()
all_ETH_text

In [None]:
# Run the NER processor on all of the text
ETH_doc = nlp(all_ETH_text)

# Add a title to the document
ETH_doc.user_data["title"] = "Ethereum NER"

In [None]:
# Render the visualization
displacy.render(ETH_doc, style='ent', jupyter=True)

In [None]:
# List all Entities
for ent in ETH_doc.ents:
    print(ent.text, ent.label_)
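
A printed list of entities is hard to scan, so it may help to tally the entities by label. This is a sketch under the assumption that the entities have been collected as `(text, label)` pairs; the pairs below are hypothetical examples, not output from the notebook.

```python
from collections import Counter

# Hypothetical (text, label) pairs; in the notebook they could be built with
# [(ent.text, ent.label_) for ent in doc.ents]
demo_entities = [
    ("Bitcoin", "ORG"),
    ("Elon Musk", "PERSON"),
    ("Tesla", "ORG"),
    ("$1.5 billion", "MONEY"),
    ("Tesla", "ORG"),
]

# How many entities carry each label
label_counts = Counter(label for _, label in demo_entities)

# The most frequently mentioned entities of one label
org_counts = Counter(text for text, label in demo_entities if label == "ORG")
```

Calling `label_counts.most_common()` or `org_counts.most_common(5)` then gives a ranked summary for either coin's document.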