# Capstone Web Scraping using BeautifulSoup

This notebook contains guidance and tasks on the data processing for the application.

## Background

(Please insert the background here )

## Requesting the Data and Creating a BeautifulSoup

Let's begin by requesting the web page from the site with the `get` method.

In [1]:
import requests

url_get = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data?start_date=2022-01-01&end_date=2022-12-30#panel')
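
The request above encodes the date range directly in the URL's query string. As a minimal sketch of how that URL could be composed for other date ranges, the snippet below uses only the standard library. The helper name `build_history_url` and the constant `BASE_URL` are hypothetical and not part of the notebook, and the sketch assumes the site keeps accepting `start_date` and `end_date` in `YYYY-MM-DD` form.

```python
from urllib.parse import urlencode

# Base address taken from the request above; the helper below is illustrative only.
BASE_URL = "https://www.coingecko.com/en/coins/ethereum/historical_data"


def build_history_url(start_date: str, end_date: str) -> str:
    """Compose the historical-data URL for a date range given as YYYY-MM-DD strings."""
    query = urlencode({"start_date": start_date, "end_date": end_date})
    return f"{BASE_URL}?{query}#panel"


url = build_history_url("2022-01-01", "2022-12-30")
print(url)
```

The resulting string can be passed to `requests.get` in place of the hard-coded URL.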

In [2]:
# import pandas, numpy, and matplotlib (pyplot) for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# check the pandas version
pd.__version__

'1.5.2'

To see exactly what `requests.get` returned, we can use `.content`. Here I slice it so the screen is not filled with the page's HTML; remove the slicing if you want to see the full response.

In [4]:
# take bytes 1 through 499 of the content returned by the capstone URL (index 0, the leading '<', is skipped)
url_get.content[1:500]

b'!DOCTYPE html>\n<html lang="en">\n<head>\n<script src="/cdn-cgi/apps/head/gYtXOyllgyP3-Z2iKTP8rRWGBm4.js"></script><script async defer src="https://www.googleoptimize.com/optimize.js?id=GTM-W3CD992"></script>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<link rel="shortcut icon" href="/favicon.ico">\n<link type="application/opensearchdescription+xml" rel="search" href="/OpensearchDescription.xml" '

In [5]:
url_get

<Response [200]>

As we can see, we get very unstructured and complex HTML, which contains the code needed to render the page in your web browser. As humans, it is still hard to tell which parts of that code we can use and where they are, so this is where we use BeautifulSoup. The `BeautifulSoup` class produces a BeautifulSoup object, transforming a complex HTML document into a tree of Python objects.

Let's make a BeautifulSoup object; feel free to explore it here.

In [6]:
# import BeautifulSoup from the beautifulsoup4 package
from bs4 import BeautifulSoup 

soup = BeautifulSoup(url_get.content,"html.parser")
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [7]:
# check the BeautifulSoup result by printing the first 500 characters of the prettified HTML
print(soup.prettify()[:500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <script src="/cdn-cgi/apps/head/gYtXOyllgyP3-Z2iKTP8rRWGBm4.js">
  </script>
  <script async="" defer="" src="https://www.googleoptimize.com/optimize.js?id=GTM-W3CD992">
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="/favicon.ico" rel="shortcut icon"/>
  <link href="/OpensearchDescription.xml" rel="search" type="applica
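
To make the idea of a "tree of Python objects" more concrete, here is a small sketch that parses a made-up HTML snippet (not the real page) and navigates it by tag name and attribute. The sample markup, including its title, is an assumption for illustration; only `BeautifulSoup`, `find`, `find_all`, and `get_text` come from the library.

```python
from bs4 import BeautifulSoup

# A tiny, invented document that mimics the shape of the scraped page.
html = """
<html lang="en">
 <head><meta charset="utf-8"/><title>Sample page</title></head>
 <body><table><tbody>
  <tr><th scope="row">2022-12-30</th><td class="text-center">$1,201.54</td></tr>
 </tbody></table></body>
</html>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes of the soup object ...
title_text = sample_soup.title.text
# ... tag attributes behave like dictionary keys ...
meta_charset = sample_soup.find("meta")["charset"]
# ... and find / find_all search the tree by tag name and attributes.
first_header = sample_soup.find("th", attrs={"scope": "row"}).text
cells = [td.get_text(strip=True) for td in sample_soup.find_all("td", class_="text-center")]

print(title_text, meta_charset, first_header, cells)
```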


## Finding the right key to scrape the data & Extracting the right information

Find the key and pass it to `.find()`. Put all of your exploration for the right key in this cell. (please change this markdown with your explanation)

In [8]:
# use the tbody tag, found by inspecting the table on the page, to grab the table body
table = soup.find('tbody')
print(table.prettify()[1:500])

tbody>
 <tr>
  <th class="font-semibold text-center" scope="row">
   2022-12-30
  </th>
  <td class="text-center">
   $144,831,301,452
  </td>
  <td class="text-center">
   $4,174,715,684
  </td>
  <td class="text-center">
   $1,201.54
  </td>
  <td class="text-center">
   N/A
  </td>
 </tr>
 <tr>
  <th class="font-semibold text-center" scope="row">
   2022-12-29
  </th>
  <td class="text-center">
   $143,241,827,137
  </td>
  <td class="text-center">
   $5,177,421,363
  </td>
  <td class="text


In [25]:
# check the date cells of the table (first five row headers)
table.find_all('th', attrs={'class':'font-semibold text-center','scope':'row'})[:5]

[<th class="font-semibold text-center" scope="row">2022-12-30</th>,
 <th class="font-semibold text-center" scope="row">2022-12-29</th>,
 <th class="font-semibold text-center" scope="row">2022-12-28</th>,
 <th class="font-semibold text-center" scope="row">2022-12-27</th>,
 <th class="font-semibold text-center" scope="row">2022-12-26</th>]

In [24]:
# check that the first date cell holds 2022-12-30
table.find_all('th', attrs={'class':'font-semibold text-center','scope':'row'})[0].text

'2022-12-30'

Finding row length.

In [26]:
# get the number of rows by counting the date cells
row = table.find_all('th', attrs={'class':'font-semibold text-center','scope':'row'})
row_length = len(row)
row_length

60

In [13]:
# collect every tr element of the table
rows = table.find_all('tr')

In [14]:
rows

[<tr>
 <th class="font-semibold text-center" scope="row">2022-12-30</th>
 <td class="text-center">
 $144,831,301,452
 </td>
 <td class="text-center">
 $4,174,715,684
 </td>
 <td class="text-center">
 $1,201.54
 </td>
 <td class="text-center">
 N/A
 </td>
 </tr>,
 <tr>
 <th class="font-semibold text-center" scope="row">2022-12-29</th>
 <td class="text-center">
 $143,241,827,137
 </td>
 <td class="text-center">
 $5,177,421,363
 </td>
 <td class="text-center">
 $1,188.73
 </td>
 <td class="text-center">
 $1,201.54
 </td>
 </tr>,
 <tr>
 <th class="font-semibold text-center" scope="row">2022-12-28</th>
 <td class="text-center">
 $146,030,514,727
 </td>
 <td class="text-center">
 $4,221,450,707
 </td>
 <td class="text-center">
 $1,211.82
 </td>
 <td class="text-center">
 $1,188.73
 </td>
 </tr>,
 <tr>
 <th class="font-semibold text-center" scope="row">2022-12-27</th>
 <td class="text-center">
 $147,697,269,148
 </td>
 <td class="text-center">
 $3,071,221,734
 </td>
 <td class="text-center">


Do the scraping process here (please change this markdown with your explanation)

In [18]:
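# loop over the rows to collect the date, volume, and market cap
temp = [] # initiate an empty list that will hold one tuple per row

for i in range(0, row_length):

    # scraping process
    # get Date
    # NOTE: `{...} == '2022-12-30'` compares a dict with a string and evaluates to False,
    # so the intended attribute filter is lost
    Date_Ethereum = table.find_all('th', attrs={'scope':'row', 'class':'font-semibold text-center'}=='2022-12-30')[i].text
    
    # get volume
    # NOTE: this indexes the flat list of every <td> in the table by row number i,
    # so it walks across cells instead of down the volume column
    Volum = table.find_all('td', attrs={'class':'text-center'}=='Volume')[i].text
    Volum = Volum.strip() # to remove excess white space
    
    # get market cap (the lookup is effectively the same as above, so it returns the same cell)
    MarketCap = table.find_all('td', attrs={'class':'text-center'}=='MarketCap')[i].text
    MarketCap = MarketCap.strip() # to remove excess white space
    
    temp.append((Date_Ethereum,Volum,MarketCap)) 
    
temp 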

[('2022-12-30', '$144,831,301,452', '$144,831,301,452'),
 ('2022-12-29', '$4,174,715,684', '$4,174,715,684'),
 ('2022-12-28', '$1,201.54', '$1,201.54'),
 ('2022-12-27', 'N/A', 'N/A'),
 ('2022-12-26', '$143,241,827,137', '$143,241,827,137'),
 ('2022-12-25', '$5,177,421,363', '$5,177,421,363'),
 ('2022-12-24', '$1,188.73', '$1,188.73'),
 ('2022-12-23', '$1,201.54', '$1,201.54'),
 ('2022-12-22', '$146,030,514,727', '$146,030,514,727'),
 ('2022-12-21', '$4,221,450,707', '$4,221,450,707'),
 ('2022-12-20', '$1,211.82', '$1,211.82'),
 ('2022-12-19', '$1,188.73', '$1,188.73'),
 ('2022-12-18', '$147,697,269,148', '$147,697,269,148'),
 ('2022-12-17', '$3,071,221,734', '$3,071,221,734'),
 ('2022-12-16', '$1,226.25', '$1,226.25'),
 ('2022-12-15', '$1,211.82', '$1,211.82'),
 ('2022-12-14', '$146,840,669,269', '$146,840,669,269'),
 ('2022-12-13', '$3,694,243,213', '$3,694,243,213'),
 ('2022-12-12', '$1,219.29', '$1,219.29'),
 ('2022-12-11', '$1,226.25', '$1,226.25'),
 ('2022-12-10', '$147,173,116,81
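
The output above repeats the same value in both columns and cycles through market cap, volume, and price cells. One contributing factor is the `attrs={...} == 'Volume'` pattern: in plain Python, comparing a dict with a string yields `False`, so a boolean rather than a filter dict reaches `find_all`. The sketch below shows only that Python behaviour; the variable names are illustrative.

```python
# The filter the loop intended to pass to find_all.
attrs_filter = {"class": "text-center"}

# What the loop actually evaluates: a dict compared with a string.
comparison = attrs_filter == "Volume"

print(type(comparison).__name__, comparison)  # bool False
```

Passing `attrs_filter` on its own keeps the class filter, although selecting a specific column still requires working row by row.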

In [16]:
# second attempt at looping over the date and volume data, so that the volume values are displayed
temp = [] # initiate an empty list that will hold one tuple per row

for i in range(0, row_length):

    # scraping process
    # get Date
    # NOTE: comparing the attrs dict with a string still evaluates to False,
    # so changing the compared value does not change the result
    Date_Ethereum = table.find_all('th', attrs={'scope':'colum', 'class':'font-semibold text-center'}=='2022-12-30')[i].text
    
    # get volume
    Volum = table.find_all('td', attrs={'class':'text-center'}=='$4,174,715,684')[i].text
    Volum = Volum.strip() # to remove excess white space
    
    # get market cap
    MarketCap = table.find_all('td', attrs={'class':'text-center'}=='$144,831,301,452')[i].text
    MarketCap = MarketCap.strip() # to remove excess white space
    
    temp.append((Date_Ethereum,Volum,MarketCap)) 
    
temp

[('2022-12-30', '$144,831,301,452', '$144,831,301,452'),
 ('2022-12-29', '$4,174,715,684', '$4,174,715,684'),
 ('2022-12-28', '$1,201.54', '$1,201.54'),
 ('2022-12-27', 'N/A', 'N/A'),
 ('2022-12-26', '$143,241,827,137', '$143,241,827,137'),
 ('2022-12-25', '$5,177,421,363', '$5,177,421,363'),
 ('2022-12-24', '$1,188.73', '$1,188.73'),
 ('2022-12-23', '$1,201.54', '$1,201.54'),
 ('2022-12-22', '$146,030,514,727', '$146,030,514,727'),
 ('2022-12-21', '$4,221,450,707', '$4,221,450,707'),
 ('2022-12-20', '$1,211.82', '$1,211.82'),
 ('2022-12-19', '$1,188.73', '$1,188.73'),
 ('2022-12-18', '$147,697,269,148', '$147,697,269,148'),
 ('2022-12-17', '$3,071,221,734', '$3,071,221,734'),
 ('2022-12-16', '$1,226.25', '$1,226.25'),
 ('2022-12-15', '$1,211.82', '$1,211.82'),
 ('2022-12-14', '$146,840,669,269', '$146,840,669,269'),
 ('2022-12-13', '$3,694,243,213', '$3,694,243,213'),
 ('2022-12-12', '$1,219.29', '$1,219.29'),
 ('2022-12-11', '$1,226.25', '$1,226.25'),
 ('2022-12-10', '$147,173,116,81
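
Both attempts index a flat list of every `<td>` in the table, which is why dates end up paired with the wrong cells. A possible alternative, sketched here on two rows copied from the `rows` output earlier, is to iterate over each `<tr>` and read its cells by position. The sketch assumes the first `<td>` in a row is the market cap and the second is the volume, judging by their magnitudes; the names `sample_html`, `sample_table`, and `records` are illustrative.

```python
from bs4 import BeautifulSoup

# Two rows copied from the table output shown earlier in this notebook.
sample_html = """
<table><tbody>
 <tr>
  <th class="font-semibold text-center" scope="row">2022-12-30</th>
  <td class="text-center">$144,831,301,452</td>
  <td class="text-center">$4,174,715,684</td>
  <td class="text-center">$1,201.54</td>
  <td class="text-center">N/A</td>
 </tr>
 <tr>
  <th class="font-semibold text-center" scope="row">2022-12-29</th>
  <td class="text-center">$143,241,827,137</td>
  <td class="text-center">$5,177,421,363</td>
  <td class="text-center">$1,188.73</td>
  <td class="text-center">$1,201.54</td>
 </tr>
</tbody></table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("tbody")

records = []
for tr in sample_table.find_all("tr"):
    # The date lives in the row header cell.
    date = tr.find("th", attrs={"scope": "row"}).get_text(strip=True)
    # Collect this row's own data cells, in column order.
    cells = [td.get_text(strip=True) for td in tr.find_all("td", attrs={"class": "text-center"})]
    # Assumed column order: market cap first, volume second.
    market_cap, volume = cells[0], cells[1]
    records.append((date, volume, market_cap))

print(records)
```

With the real `table` object, the same loop would run over `table.find_all('tr')` instead of the sample.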

In [19]:
# reverse the loop results so they run in chronological order
temp = temp[::-1]
temp[:11]

[('2022-11-01', '$1,165.97', '$1,165.97'),
 ('2022-11-02', '$1,263.75', '$1,263.75'),
 ('2022-11-03', '$6,014,960,880', '$6,014,960,880'),
 ('2022-11-04', '$152,444,818,383', '$152,444,818,383'),
 ('2022-11-05', '$1,189.43', '$1,189.43'),
 ('2022-11-06', '$1,165.97', '$1,165.97'),
 ('2022-11-07', '$9,306,680,538', '$9,306,680,538'),
 ('2022-11-08', '$140,289,330,990', '$140,289,330,990'),
 ('2022-11-09', '$1,186.78', '$1,186.78'),
 ('2022-11-10', '$1,189.43', '$1,189.43'),
 ('2022-11-11', '$4,893,469,369', '$4,893,469,369')]

## Creating data frame & Data wrangling

Put the array into a dataframe.

In [20]:
# view the dataframe of date, volume, and market cap
import pandas as pd

df = pd.DataFrame(temp, columns = ('Date_Ethereum','Volum','MarketCap'))
df.head()

|   | Date_Ethereum | Volum | MarketCap |
|---|---|---|---|
| 0 | 2022-11-01 | $1,165.97 | $1,165.97 |
| 1 | 2022-11-02 | $1,263.75 | $1,263.75 |
| 2 | 2022-11-03 | $6,014,960,880 | $6,014,960,880 |
| 3 | 2022-11-04 | $152,444,818,383 | $152,444,818,383 |
| 4 | 2022-11-05 | $1,189.43 | $1,189.43 |


Do the data cleaning here (please change this markdown with your explanation of what you do for data wrangling)

In [21]:
# view the data types
df.dtypes


Date_Ethereum    object
Volum            object
MarketCap        object
dtype: object

In [52]:
# check the data type of the Volum column
# NOTE: this reassigns df to a new frame that holds only these sample Volum values
df = pd.DataFrame({'Volum': ['$1,165.97', '$1,263.75', '$6,014,960,880', '$152,444,818,383', '$1,189.43']})

df.dtypes

Volum    object
dtype: object

In [58]:
# converting only the Volum column from object to float raised an error, so I tried converting every object column to a numeric type instead
# NOTE: errors='coerce' turns any string that cannot be parsed as a number (such as '$1,165.97') into NaN
df = df.apply(pd.to_numeric, errors='coerce')

In [59]:
df.dtypes

Volum    float64
dtype: object
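
The `float64` dtype above can be misleading: with `errors='coerce'`, strings that still contain `$` or `,` are not parsed but replaced by `NaN`. The sketch below, which assumes the values are US-style currency strings as in the table, contrasts coercing the raw strings with stripping the symbols first. The names `raw`, `coerced`, `stripped`, and `cleaned` are illustrative.

```python
import pandas as pd

# Sample values in the same format as the scraped cells, including a missing value.
raw = pd.Series(["$1,165.97", "$6,014,960,880", "N/A"])

# Coercing the raw strings silently produces NaN for every entry.
coerced = pd.to_numeric(raw, errors="coerce")

# Removing the dollar sign and thousands separators first lets the numbers parse;
# only the genuinely missing "N/A" entry becomes NaN.
stripped = raw.str.replace(r"[$,]", "", regex=True)
cleaned = pd.to_numeric(stripped, errors="coerce")

print(coerced.isna().sum(), cleaned.tolist())
```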

In [60]:
# inspect and set the data type of the date column
import pandas as pd

# create a dataframe of date strings (stored with dtype object)
df = pd.DataFrame({'Date_Ethereum': ['2022-11-01', '2022-11-02', '2022-11-03', '2022-11-04']})

# show the data type of each column
print(df.dtypes)

# cast the column to str; it is already stored as object, so the dtype does not change
df['Date_Ethereum'] = df['Date_Ethereum'].astype(str)

# show the data type of each column after the cast
print(df.dtypes)

Date_Ethereum    object
dtype: object
Date_Ethereum    object
dtype: object


Data visualisation (please change this markdown with your explanation of what you do for data wrangling)

In [47]:
# tried again to convert the data types following the instructions in the video, but it raised an error
df['Volum'] = df['Volum'].str.replace(",",".")
df['Volum'] = df['Volum'].astype('float64')
df['MarketCap'] = df['MarketCap'].str.replace(",",".")
df['MarketCap'] = df['MarketCap'].astype('float64')
df['Date_Ethereum'] = df['Date_Ethereum'].astype('datetime64')

df.dtypes

KeyError: 'Volum'
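
A `KeyError: 'Volum'` means the frame had no column named `Volum` when the cell ran, so the cleaning needs a frame that still holds all three columns. Separately, replacing `,` with `.` would leave strings such as `$1.165.97`, which cannot be cast to float. The following is a minimal sketch of one way to do the whole conversion in a single pass, using two rows taken from the table output earlier. The name `frame` is illustrative, and the column order in `sample` assumes volume then market cap.

```python
import pandas as pd

# Two rows from the scraped table, already in chronological order.
sample = [
    ("2022-12-29", "$5,177,421,363", "$143,241,827,137"),
    ("2022-12-30", "$4,174,715,684", "$144,831,301,452"),
]
frame = pd.DataFrame(sample, columns=["Date_Ethereum", "Volum", "MarketCap"])

# Strip the dollar sign and thousands separators, then convert to numbers.
for column in ("Volum", "MarketCap"):
    frame[column] = pd.to_numeric(
        frame[column].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

# Parse the date strings and use them as the index for plotting.
frame["Date_Ethereum"] = pd.to_datetime(frame["Date_Ethereum"], format="%Y-%m-%d")
frame = frame.set_index("Date_Ethereum")

print(frame.dtypes)
```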

### Implementing your web scraping in the Flask dashboard

- Copy and paste all of your web scraping process into the desired position in `app.py`
- Change the title of the dashboard in `index.html`

## Finishing This Notebook with Your Analysis and Conclusion

First, you can start by making the data visualisation.

In [50]:
# make the date column the index
df = df.set_index('Date_Ethereum')


(Put your analysis and conclusion here.)

In [61]:
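# display the line plot of the scraped data
# NOTE: without parentheses, df.plot only returns the plot accessor; df.plot() draws the chart
df.plot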

<pandas.plotting._core.PlotAccessor object at 0x00000284654C5A20>
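
The output above is the plot accessor object rather than a chart, because `df.plot` was referenced without being called. As a rough sketch, calling `plot()` on a frame with a datetime index and numeric columns draws a line chart. The sample frame `prices`, the title, and the axis label below are assumptions for illustration, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Two cleaned volume values from the table, indexed by date.
prices = pd.DataFrame(
    {"Volum": [5177421363, 4174715684]},
    index=pd.to_datetime(["2022-12-29", "2022-12-30"]),
)

# Calling plot() returns the matplotlib Axes that holds the line chart.
ax = prices.plot(title="Ethereum daily volume (sample)")
ax.set_ylabel("Volume (USD)")
```

In the notebook itself, the chart appears below the cell once `plot()` is called.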

### Extra Challenge

This will not be included in the scoring.

- You can create additional analysis from the data.
- Implement it in the dashboard using `app.py` and `index.html`.