# Tales from Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
from newsapi import NewsApiClient

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Rog\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Read your api key environment variable
load_dotenv()
API = 'd05a7ec8bc9e4c359e75c4d9ed058570'

In [3]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=API)

In [4]:
# Fetch the Bitcoin news articles
all_btc_articles = newsapi.get_everything(q='bitcoin',
                                      language='en',
                                      )
#I am using the 'get everything' api endpoint because there are not that many etherium articles.

In [5]:
# Fetch the Ethereum news articles
all_eth_articles = newsapi.get_everything(q='etherium',
                                      language='en',
                                      )

In [6]:
# Create the Bitcoin sentiment scores DataFrame
btc_df = pd.DataFrame.from_dict(all_btc_articles["articles"])

btc_df.drop(columns=['source','author','url','urlToImage','publishedAt','content'], inplace=True)
btc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        20 non-null     object
 1   description  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


In [7]:
# Create the Ethereum sentiment scores DataFrame
eth_df = pd.DataFrame.from_dict(all_eth_articles["articles"])
eth_df.drop(columns=['source','author','url','urlToImage','publishedAt','content'], inplace=True)
eth_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        20 non-null     object
 1   description  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


In [8]:
# Instantiate SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [9]:
btc_df.head(5)

Unnamed: 0,title,description
0,New York passes a bill to limit bitcoin mining,New York lawmakers have passed a bill\r\n that...
1,Jay-Z and Jack Dorsey Launch Bitcoin Academy f...,"Rapper and entrepreneur Shawn Carter, better k..."
2,Bitcoin Wasn't as Decentralized or Anonymous a...,A new study on bitcoin calls into question whe...
3,Why the Central African Republic adopted Bitcoin,Some 90% of people in the Central African Repu...
4,Chipotle now accepts cryptocurrency payments,You can now reportedly pay for your burritos a...


## What do we want to learn?
Determine the valence of the article through reading the headline of the article. 
Valence, or hedonic tone, is the affective quality referring to the intrinsic attractiveness/"good"-ness or 
averseness/"bad"-ness of an event, object, or situation. The term also characterizes and categorizes specific emotions. Through sentiment analysis, we can
determine the valence of the article.


In [10]:
# Describe the Bitcoin Sentiment
btc_daily_sent = [] #analyze the sentiment for each article's headline
btc_sentiment = [] #collect the mean sentiment for bitcoin from each article

for title in btc_df['title']:
    btc_daily_sent.append(sid.polarity_scores(title)["compound"]) #find the compount polarity score to determine the sentiment.

btc_sentiment.append(sum(btc_daily_sent) / len(btc_daily_sent)) #find the mean sentiment


In [11]:
# Describe the Ethereum Sentiment
eth_daily_sent = [] #analyze the sentiment for each article's headline
eth_sentiment = [] #collect the mean sentiment for bitcoin from each article

for title in eth_df['title']:
    eth_daily_sent.append(sid.polarity_scores(title)["compound"]) #find the compount polarity score to determine the sentiment.

eth_sentiment.append(sum(eth_daily_sent) / len(eth_daily_sent)) #find the mean sentiment

In [12]:
#Make a dataframe to analyze the results
daily_sent_df = pd.DataFrame({'Btc':btc_daily_sent,'Eth':eth_daily_sent})
daily_sent_df.head(5)

Unnamed: 0,Btc,Eth
0,0.0,0.0
1,0.0,0.8555
2,0.0,-0.6249
3,0.0,-0.296
4,0.3182,-0.4019


In [13]:
#Find the highest daily score
highest_Btc_sent = daily_sent_df['Btc'].max()
highest_Eth_sent = daily_sent_df['Eth'].max()

#Find the lowest daily score
lowest_Btc_sent = daily_sent_df['Btc'].min()
lowest_Eth_sent = daily_sent_df['Eth'].min()

lowest_Eth_sent


-0.6249

In [14]:
print(f'The highest mean positive score is Etheriums at {eth_sentiment}.')
print(f'The highest daily sentiment score is from Etherium at {highest_Eth_sent}.')
print(f'The lowest daily sentiment score is from Bitcoin at {lowest_Btc_sent}.')

The highest mean positive score is Etheriums at [0.07633500000000001].
The highest daily sentiment score is from Etherium at 0.8555.
The lowest daily sentiment score is from Bitcoin at -0.6652.


### Questions:

Q: Which coin had the highest mean positive score?

A: The highest mean positive score is from Etherium [0.07793].

Q: Which coin had the highest negative score?

A: The lowest daily sentiment score is from Bitcoin at -0.6652.

Q. Which coin had the highest positive score?

A: The highest daily sentiment score is from Etherium at 0.6369.

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.

In [15]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re
regex = re.compile("[^a-zA-Z ]")

In [16]:
print(btc_df['description'][0])

New York lawmakers have passed a bill
 that would temporarily ban new bitcoin
 mining operations. Early on Friday, state senators voted 36-27 to pass the legislation. It's now bound for the desk of Governor Kathy Hochul, who will sign it into law or veto th…


In [17]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw = set(stopwords.words('english'))

# Expand the default stopwords list if necessary
# YOUR CODE HERE!

In [30]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
   
    # Remove the punctuation from text
    re_clean = regex.sub('', text)

    # Create a tokenized list of the sentences
    words = word_tokenize(text)
    
    # Lemmatize words into root words
    words = [lemmatizer.lemmatize(word) for word in words]
   
    # Convert the words to lowercase
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Remove the stop words
    tokens = [word.lower() for word in words if word.lower not in sw]
    
    return tokens

In [31]:
# Create a new tokens column for Bitcoin
btc_df['tokens'] = [tokenizer(description) for description in btc_df['description']]
btc_df.head()

Unnamed: 0,title,description,tokens
0,New York passes a bill to limit bitcoin mining,New York lawmakers have passed a bill\r\n that...,"[new, york, lawmaker, have, passed, a, bill, t..."
1,Jay-Z and Jack Dorsey Launch Bitcoin Academy f...,"Rapper and entrepreneur Shawn Carter, better k...","[rapper, and, entrepreneur, shawn, carter, ,, ..."
2,Bitcoin Wasn't as Decentralized or Anonymous a...,A new study on bitcoin calls into question whe...,"[a, new, study, on, bitcoin, call, into, quest..."
3,Why the Central African Republic adopted Bitcoin,Some 90% of people in the Central African Repu...,"[some, 90, %, of, people, in, the, central, af..."
4,Chipotle now accepts cryptocurrency payments,You can now reportedly pay for your burritos a...,"[you, can, now, reportedly, pay, for, your, bu..."


In [20]:
# Create a new tokens column for Ethereum
eth_df['tokens'] = [tokenizer(description) for description in eth_df['description']]
eth_df.head()

Unnamed: 0,title,description,tokens
0,50 cryptocurrency and NFT terms you need to know,Web3 Youtuber Aprilynne Alters shares a list o...,"[web3, youtuber, aprilynne, alters, share, a, ..."
1,Top Ten: Weekend reads: New ideas for retireme...,"Also, inflation, oil prices and energy stocks,...","[also, ,, inflation, ,, oil, price, and, energ..."
2,Crypto Bros Are Now On The Frontlines Of A Ful...,Financial crises can trigger serious mental he...,"[financial, crisis, can, trigger, serious, men..."
3,Chargesheet filed against cyber expert duo for...,The duo is alleged to have prepared forged scr...,"[the, duo, is, alleged, to, have, prepared, fo..."
4,From blood-on-the-streets to a full-blown bubb...,"​Terra (UST), a stablecoin designed to peg the...","[​terra, (, ust, ), ,, a, stablecoin, designed..."


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [21]:
from collections import Counter
from nltk import ngrams

In [22]:
btc_token_list = btc_df['tokens'].tolist() #create a nested list of all the tokens for BTC.

In [23]:
#flatten the nested list into one list so it can be read by the counter function.
flat_btc_token_list = []
for i in range(len(btc_token_list)):
    for j in range(len(btc_token_list[i])):
        flat_btc_token_list.append(btc_token_list[i][j])

In [24]:
# Generate the Bitcoin N-grams where N=2
bigram_counts = Counter(ngrams(flat_btc_token_list, n=2))
print(dict(bigram_counts))



In [25]:
# Generate the Ethereum N-grams where N=2

eth_token_list = eth_df['tokens'].tolist() #create a nested list of all the tokens for ETH.

In [26]:
#flatten the nested list into one list so it can be read by the counter function.
flat_eth_token_list = []
for i in range(len(eth_token_list)):
    for j in range(len(eth_token_list[i])):
        flat_eth_token_list.append(eth_token_list[i][j])

In [38]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    
    #create a nested list of all the tokens for the coin.
    token_list = tokens.tolist()
    
    #flatten the nested list into one list so it can be read by the counter function.
    flat_token_list = []
    for i in range(len(token_list)):
        for j in range(len(token_list[i])):
            flat_token_list.append(token_list[i][j])
            
    # Remove the punctuation from text
    re_clean = regex.sub('', flat_token_list)

    # Create a tokenized list of the sentences
    words = word_tokenize(flat_token_list)
    
    # Lemmatize words into root words
    words = [lemmatizer.lemmatize(word) for word in words]
   
    # Convert the words to lowercase
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Remove the stop words
    tokens = [word.lower() for word in words if word.lower not in sw]
    
    
    
    return Counter(tokens).most_common(N)

In [39]:
# Use token_count to get the top 10 words for Bitcoin
token_count(btc_df['tokens'], N=10)

TypeError: expected string or bytes-like object

In [50]:
# Use token_count to get the top 10 words for Ethereum
token_count(flat_eth_token_list, N=10)

[('the', 29),
 (',', 28),
 ('.', 18),
 ('of', 17),
 ('and', 17),
 ('a', 14),
 ('to', 13),
 ('in', 11),
 ('crypto', 7),
 ('is', 7)]

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [21]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [22]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [23]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [24]:
import spacy
from spacy import displacy

In [25]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
# YOUR CODE HERE!

In [28]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [29]:
# Render the visualization
# YOUR CODE HERE!

In [30]:
# List all Entities
# YOUR CODE HERE!

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
# YOUR CODE HERE!

In [32]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [33]:
# Render the visualization
# YOUR CODE HERE!

In [34]:
# List all Entities
# YOUR CODE HERE!

---