<a href="https://colab.research.google.com/github/RDeconomist/classes/blob/main/dataScience/DS4_ScraperCrypto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Science - Cryptocurrency scraper - Richard Davies**

Aim of the file:

1.   Discuss blocked scrapers and using headers.
2.   Extract Crypto prices
3.   Pull together into one DataFrame
4.   Clean this up



In [2]:
# ////////////////////////////////////////////////////////////////
# // 1.  Import packages that we might need:
# // Packages for data manipulation
import numpy as np
import pandas as pd
# // Web scraping: 
import requests
from bs4 import BeautifulSoup
# // OS. Sometimes need this for finding working directory:
import os
# ////////////////////////////////////////////////////////////////
# ////////////////////////////////////////////////////////////////

Step 2 - Pick the URL we want to scrape and extract the html content

In [3]:
# ////////////////////////////////////////////////////////////////
# /// 2.  Set the URL ////////////////////////////////////////////
# /// Notes: This could be a list of URLs

URL = "https://uk.investing.com/crypto/currencies"

# /// Do the html request:
html = requests.get(URL)
soup = BeautifulSoup(html.content, 'html.parser')

# ////////////////////////////////////////////////////////////////
# ////////////////////////////////////////////////////////////////

Let's see what we have:

In [None]:
soup

The site is blocking us because the default request looks like it comes from a bot. We need to request the page again with browser-like headers, here a User-Agent.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

r = Request("https://uk.investing.com/crypto/currencies", headers={"User-Agent": "Mozilla/5.0"})
c = urlopen(r).read()
soup = BeautifulSoup(c, "html.parser")
print(soup)
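
To see exactly what the header change does, here is a minimal offline sketch that builds the same kind of request without sending it. The helper name `build_request` is hypothetical, and the "Mozilla/5.0" value is simply the one used above; whether a given site accepts it is not guaranteed.

```python
from urllib.request import Request

URL = "https://uk.investing.com/crypto/currencies"

def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    return Request(url, headers={"User-Agent": user_agent})

req = build_request(URL)

# urllib stores header names capitalised, so the key becomes "User-agent"
print(req.get_header("User-agent"))
print(req.full_url)
```

The same idea works with the requests package imported earlier: `requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})` passes the header in the same way.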

Now we eyeball the soup to find where the data we want sits. The tr and td elements look promising...

In [None]:
soup_tr = soup.find_all("tr")

# /// Now print this out:
soup_tr

In [None]:
soup_td = soup.find_all("td", class_="price js-currency-price" )

# /// Now print this out:
soup_td

Now, check that I can get something useful from this:

In [9]:
soup_td[0].text

'45,703.3'
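
A note on the `class_` argument used in the cell above: when the string contains a space, BeautifulSoup compares it against the full class attribute, so the order of the classes matters. The sketch below uses a made-up two-row table (not the real page) to show the difference between that exact match, a single-class match, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the price table, NOT the real page
html_demo = """
<table>
  <tr><td class="price js-currency-price">45,703.3</td></tr>
  <tr><td class="js-currency-price price">3,060.93</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html_demo, "html.parser")

# Exact string match on the class attribute: class order matters
exact = soup_demo.find_all("td", class_="price js-currency-price")

# Single class: matches any td that has this class among others
single = soup_demo.find_all("td", class_="js-currency-price")

# CSS selector: requires both classes, in any order
css = soup_demo.select("td.price.js-currency-price")

print(len(exact), len(single), len(css))
```

If a site reorders or adds classes, the single-class or CSS selector form is likely to be the more robust choice.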

OK, this is getting better, so I look at the soup more closely to find the classes of the data that I want. I then pass these classes to the find_all function.

In [None]:
name = soup.find_all("td", class_="left bold elp name cryptoName first js-currency-name")
ticker = soup.find_all("td", class_="left noWrap elp symb js-currency-symbol")
price = soup.find_all("td", class_="price js-currency-price")
mktcap = soup.find_all("td", class_="js-market-cap")
volume = soup.find_all("td", class_="js-24h-volume")

data = [name, ticker, price, mktcap, volume]
data

The result of this is a very big and messy list of lists. I can isolate particular elements by indexing into it: the first index picks the field (name, ticker, price, ...) and the second index picks the coin.

Check that you understand the difference between these two blocks...

In [17]:
print(data[0][0].text)
print(data[0][1].text)
print(data[0][2].text)
print(data[0][3].text)
print(data[0][4].text)

Bitcoin
Ethereum
Binance Coin
Cardano
Tether


In [18]:
print(data[0][0].text)
print(data[1][0].text)
print(data[2][0].text)
print(data[3][0].text)
print(data[4][0].text)

Bitcoin
BTC
45,703.3
$859.72B
$33.92B
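
The difference between the two blocks can be reproduced with a small stand-in for `data`. This is only a sketch: `data_demo` is a hypothetical list of plain strings using the first three values printed above, whereas the real `data` holds BeautifulSoup tags.

```python
# Stand-in for the scraped fields: one inner list per field
data_demo = [
    ["Bitcoin", "Ethereum", "Binance Coin"],   # names
    ["BTC", "ETH", "BNB"],                     # tickers
    ["45,703.3", "3,060.93", "420.76"],        # prices
]

# Fix the field (first index), vary the coin: all the names
first_field = [data_demo[0][i] for i in range(3)]

# Fix the coin (second index), vary the field: everything about Bitcoin
first_coin = [data_demo[i][0] for i in range(3)]

# zip(*data_demo) regroups the values coin by coin
rows = list(zip(*data_demo))

print(first_field)
print(first_coin)
print(rows[1])
```

The `zip` step is one way to turn "one list per field" into "one tuple per coin", which is closer to how the rows of a table look.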


In [22]:
# // To get the length of our results:
length = len(name)
print(length)

# // Set up an empty array to fill ('S7' stores byte strings of up to 7 bytes):
results = np.empty(length, dtype='S7')


100


In [None]:
# // To get some clean results:
for i in range(0, length):
   results[i] = name[i].text

print(results)
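
One thing to watch with a fixed-width byte dtype such as 'S7': NumPy cuts off anything longer than the declared width without warning. The sketch below shows this on three of the names from above and, as an assumed alternative, lets NumPy choose a text dtype that is wide enough.

```python
import numpy as np

names_demo = ["Bitcoin", "Ethereum", "Binance Coin"]

# Fixed-width byte strings: at most 7 bytes per entry
fixed = np.empty(len(names_demo), dtype="S7")
for i, n in enumerate(names_demo):
    fixed[i] = n   # longer names are silently truncated

# Letting NumPy infer the dtype gives Unicode text of sufficient width
as_text = np.array(names_demo)

print(fixed)
print(as_text)
print(as_text.dtype.kind)
```

A plain Python list built with a list comprehension, for example `[tag.text for tag in name]`, avoids both the truncation and the byte strings.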

Now I want to do this in a more systematic way for all of the things we have found:

In [34]:
nameD = np.empty(length, dtype='S15')
tickerD = np.empty(length, dtype='S15')
priceD = np.empty(length, dtype='S15')
mktcapD = np.empty(length, dtype='S15')
volumeD = np.empty(length, dtype='S15')

# // To get some clean results:
for i in range(0, length):
   nameD[i] = name[i].text
   tickerD[i] = ticker[i].text
   priceD[i] = price[i].text
   mktcapD[i] = mktcap[i].text
   volumeD[i] = volume[i].text

print(tickerD)

[b'BTC' b'ETH' b'BNB' b'ADA' b'USDT' b'XRP' b'DOGE' b'USDC' b'DOT' b'SOL'
 b'UNI' b'LUNA' b'BCH' b'BUSD' b'LINK' b'LTC' b'ICP' b'WBTC' b'MATIC'
 b'XLM' b'ETC' b'VET' b'THETA' b'FIL' b'TRX' b'DAI' b'AAVE' b'AVAX' b'EOS'
 b'XMR' b'CAKE' b'KLAY' b'FTT' b'GRT' b'AXS' b'ATOM' b'NEO' b'CRO' b'MKR'
 b'BTCB' b'ALGO' b'SHIB' b'XTZ' b'BSV' b'LEO' b'MIOTA' b'BTT' b'EGLD'
 b'KSM' b'AMP' b'WAVES' b'COMP' b'HT' b'KICK' b'QNT' b'UST' b'HBAR'
 b'DASH' b'DCR' b'CHZ' b'RUNE' b'HNT' b'NEAR' b'ZEC' b'HOT' b'XEM'
 b'SUSHI' b'TFUEL' b'STX' b'CEL' b'MANA' b'YFI' b'SNX' b'VGX' b'TUSD'
 b'AUDIO' b'RVN' b'OKB' b'ENJ' b'QTUM' b'FLOW' b'FTM' b'ZIL' b'BAT' b'BTG'
 b'XEC' b'ONE' b'AR' b'NEXO' b'SAFEMOON' b'TEL' b'CHSB' b'PAX' b'BNT'
 b'DGB' b'REV' b'ONT' b'CELO' b'PERP' b'KCS']


Now convert this to a DataFrame

In [35]:
df = pd.DataFrame(nameD)

# // Name the column
df.columns = ['Name']

# // Add extra columns
df['Ticker'] = tickerD
df['Price'] = priceD
df['MarketCap'] = mktcapD
df['Volume'] = volumeD

print(df)

                  Name   Ticker        Price    MarketCap       Volume
0           b'Bitcoin'   b'BTC'  b'45,703.3'  b'$859.72B'   b'$33.92B'
1          b'Ethereum'   b'ETH'  b'3,060.93'  b'$358.78B'   b'$19.14B'
2      b'Binance Coin'   b'BNB'    b'420.76'   b'$70.74B'    b'$2.45B'
3           b'Cardano'   b'ADA'  b'2.137426'   b'$68.03B'    b'$5.32B'
4            b'Tether'  b'USDT'    b'1.0002'   b'$64.20B'   b'$76.46B'
..                 ...      ...          ...          ...          ...
95           b'Revain'   b'REV'   b'0.00998'  b'$881.15M'    b'$3.28M'
96         b'Ontology'   b'ONT'     b'1.003'  b'$877.93M'  b'$143.81M'
97             b'Celo'  b'CELO'    b'3.0459'  b'$872.89M'   b'$24.78M'
98  b'Perpetual Proto'  b'PERP'   b'19.6245'  b'$872.32M'  b'$178.64M'
99     b'KuCoin Token'   b'KCS'    b'10.863'  b'$872.15M'   b'$13.17M'

[100 rows x 5 columns]


Note that the values carry a b prefix, which means they are stored as bytes rather than text. We need to decode them:

In [36]:
df['Name'] = df['Name'].str.decode("utf-8")
df['Ticker'] = df['Ticker'].str.decode("utf-8")
#df['Price'] = df['Price'].str.decode("utf-8")
df['MarketCap'] = df['MarketCap'].str.decode("utf-8")
df['Volume'] = df['Volume'].str.decode("utf-8")

print(df)

               Name Ticker        Price MarketCap    Volume
0           Bitcoin    BTC  b'45,703.3'  $859.72B   $33.92B
1          Ethereum    ETH  b'3,060.93'  $358.78B   $19.14B
2      Binance Coin    BNB    b'420.76'   $70.74B    $2.45B
3           Cardano    ADA  b'2.137426'   $68.03B    $5.32B
4            Tether   USDT    b'1.0002'   $64.20B   $76.46B
..              ...    ...          ...       ...       ...
95           Revain    REV   b'0.00998'  $881.15M    $3.28M
96         Ontology    ONT     b'1.003'  $877.93M  $143.81M
97             Celo   CELO    b'3.0459'  $872.89M   $24.78M
98  Perpetual Proto   PERP   b'19.6245'  $872.32M  $178.64M
99     KuCoin Token    KCS    b'10.863'  $872.15M   $13.17M

[100 rows x 5 columns]


Problem: decoding the Price column raises an error, so that line is commented out above and the prices still show the b prefix...
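
As a possible next step for the clean-up, the sketch below converts the text values into numbers using plain Python. It does not diagnose the decode error; it assumes the values are already text (bytes would need `.decode("utf-8")` first). The function names and the suffix table are hypothetical: B and M appear in the output above, while K and T are assumed for completeness.

```python
# Multipliers for the suffixes on MarketCap and Volume
# (B and M appear in the scraped output; K and T are assumptions)
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def parse_price(text):
    """Turn a string such as '45,703.3' into a float."""
    return float(text.replace(",", ""))

def parse_money(text):
    """Turn a string such as '$859.72B' into a float number of dollars."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    if cleaned and cleaned[-1] in SUFFIXES:
        return float(cleaned[:-1]) * SUFFIXES[cleaned[-1]]
    return float(cleaned)

print(parse_price("45,703.3"))
print(parse_money("$859.72B"))
print(parse_money("$3.28M"))
```

On a DataFrame with text columns, these could be applied with something like `df['MarketCap'].map(parse_money)`.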