<a href="https://colab.research.google.com/github/Dsushmitha/Web-Scrapping-and-Data-Visualisation-with-Python/blob/main/Web_Scraping_with_Python_Starter_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Step 1: Loading modules**
Before we start scraping the target website, we need to import a few third-party libraries.
*   “requests” sends HTTP requests to websites, which is the core step of web scraping.
*   “bs4/BeautifulSoup” parses the downloaded HTML so that we can locate and extract the data we want.
*   “pandas” provides data-analysis tools that let us quickly manipulate and analyse the scraped data.
---


In [None]:
import requests 
from bs4 import BeautifulSoup 
import pandas as pd 
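
As a rough sketch of how the three modules divide the work, the snippet below parses a small hand-written HTML table and loads it into a DataFrame. The HTML string is made-up sample data standing in for a downloaded page, and `fetch_html` is a hypothetical helper that is not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text

# Made-up sample data standing in for a downloaded page.
sample_html = """
<table>
  <thead><tr><th>Symbol</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>10.5</td></tr>
    <tr><td>BBB</td><td>20.0</td></tr>
  </tbody>
</table>
"""

# BeautifulSoup locates the rows and cells; pandas holds the result.
soup = BeautifulSoup(sample_html, "html.parser")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find("tbody").find_all("tr")]
df_sample = pd.DataFrame(rows, columns=["Symbol", "Price"])
```

With network access, `fetch_html(some_url)` would replace the hard-coded string; everything after that stays the same.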

# **Step 2: Naïve Scraping Method (Scraping the Whole Page)**
We will now introduce the simplest way to scrape data from a website.
*   Add the URL of the target website to the code.
*   Look at the stock price table on Yahoo! Finance and identify the columns that will be useful. Then use Chrome's "Inspect" feature to view the underlying HTML.
*   Define a Python "list" for every column you identified in the stock price table.
*   Use a for-loop to extract the data that BeautifulSoup finds into those lists.


**Discussions**
1.   Discuss the advantages and disadvantages of the method above.
2.   If a column name in the underlying table on the website changes, does this method still work?
---






In [None]:
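# TODO: Use requests and BeautifulSoup (BS) to scrape website data
active_stocks_url = "https://in.finance.yahoo.com/most-active"
r = requests.get(active_stocks_url)
data = r.text
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")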

# TODO: Define python lists for every column and get the HTML table
symbols=[]
names=[]
prices=[]
changes=[]
percentage_changes=[]
volumes=[]
market_caps=[]
price_earning_ratios=[]


In [None]:
"""
Using a for-loop, find all the <tr> tags inside stocks_table.
Every <tr> tag represents one row of stock data (saved as listing).
For each listing, find the relevant <td> tags and append their text to the matching Python list.
"""
# TODO: Fill in the relevant HTML tag in each find / find_all call
stocks_table = soup.find('tbody')
for listing in stocks_table.find_all('tr'):
  symbol = listing.find('td',attrs={'aria-label':'Symbol'})
  symbols.append(symbol.text)

  name = listing.find('td',attrs={'aria-label':'Name'})
  names.append(name.text)
  
  price = listing.find('td',attrs={'aria-label':"Price (intraday)"})
  prices.append(price.text)
  
  change = listing.find('td',attrs={'aria-label':'Change'})
  changes.append(change.text)
  
  percentage_change = listing.find('td',attrs={'aria-label':'% change'})
  percentage_changes.append(percentage_change.text)
  
  volume = listing.find('td',attrs={'aria-label':'Volume'})
  volumes.append(volume.text)
  
  market_cap = listing.find('td',attrs={'aria-label':'Market cap'})
  market_caps.append(market_cap.text)
  
  price_earning_ratio = listing.find('td',attrs={'aria-label':'PE ratio (TTM)'})
  price_earning_ratios.append(price_earning_ratio.text)
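
The loop above assumes that every row contains a cell for every `aria-label`. If a label is renamed or missing, `find` returns `None` and accessing `.text` raises an `AttributeError`. A minimal sketch of a more defensive lookup is shown below; `cell_text` is a hypothetical helper name and the HTML is made-up sample data.

```python
from bs4 import BeautifulSoup

def cell_text(row, label):
    """Return the text of the <td> with the given aria-label, or None if absent."""
    cell = row.find("td", attrs={"aria-label": label})
    return cell.get_text(strip=True) if cell is not None else None

# Made-up sample: the second row has no "Volume" cell.
sample_html = """
<table><tbody>
  <tr><td aria-label="Symbol">AAA</td><td aria-label="Volume">1.2M</td></tr>
  <tr><td aria-label="Symbol">BBB</td></tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_symbols, sample_volumes = [], []
for listing in sample_soup.find("tbody").find_all("tr"):
    sample_symbols.append(cell_text(listing, "Symbol"))
    sample_volumes.append(cell_text(listing, "Volume"))  # None for the second row
```

Keeping a `None` placeholder also keeps all lists the same length, which `pd.DataFrame` requires when it is built from a dictionary of lists.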

In [None]:
"""
Use pandas to create a new DataFrame that aggregates all the Python lists into a single table.
You will need to know how to use a Python dictionary for this part.
"""
# TODO: Display the table using a pandas DataFrame
df = pd.DataFrame({ "Symbol":           symbols,
                    "Name":             names,
                    "Price (intraday)": prices,
                    "Change":           changes,
                    "% change":         percentage_changes,
                    "Volume":           volumes,
                    "Market cap":       market_caps,
                    "PE ratio (TTM)":   price_earning_ratios })
df

Unnamed: 0,Symbol,Name,Price (intraday),Change,% change,Volume,Market cap,PE ratio (TTM)
0,YESBANK.NS,Yes Bank Limited,13.85,0.5,+3.75%,300.538M,346.75B,
1,PNB.NS,Punjab National Bank,39.8,1.75,+4.60%,276.802M,438.815B,33.59
2,UVSL.NS,Uttam Value Steels Limited,0.2,0.0,0.00%,204.818M,1.652B,
3,SBIN.NS,State Bank of India,412.05,10.85,+2.70%,142.92M,3.677T,15.85
4,BANKBARODA.NS,Bank of Baroda,81.35,0.55,+0.68%,129.685M,420.69B,13.67
5,SAIL.NS,Steel Authority of India Limited,124.75,2.75,+2.25%,96.64M,515.494B,15.55
6,IDFCFIRSTB.NS,IDFC First Bank Limited,59.25,2.0,+3.49%,91.31M,367.339B,68.1
7,IDEA.NS,Vodafone Idea Limited,8.6,0.05,+0.58%,90.038M,247.702B,
8,IOC.NS,Indian Oil Corporation Limited,109.45,5.15,+4.94%,83.19M,1.03T,20.8
9,BHEL.NS,Bharat Heavy Electricals Limited,72.45,0.25,+0.35%,62.621M,252.275B,


# **Step 3: Naïve Scraping Method (Scraping Individual Rows)**
*   Copy and paste the Yahoo Finance link for currencies.
*   Use Chrome Inspector to inspect the HTML elements.

**Discussions**
1.   How does this method differ from the previous one in terms of execution efficiency?
2.   If the row header changes, does this method still work?
3.   When should we use whole-page scraping, and when should we use individual-row scraping?
---

In [None]:
# TODO: Scrape website data and extract info into relevant Python lists
currencies_url = "https://in.finance.yahoo.com/currencies"
r = requests.get(currencies_url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

symbols=[]
names=[]
last_prices=[]
changes=[]
percentage_changes=[]

# TODO: Find the starting and ending data-reactid, and the difference between consecutive rows
start, end, jump = 33, 449, 16
stocks_table = soup.find('tbody')
for i in range(start,end,jump):
  listing = stocks_table.find('tr',attrs={'data-reactid':i})

  symbol = listing.find('td',attrs={'data-reactid':i+1})
  symbols.append(symbol.text)

  name = listing.find('td',attrs={'data-reactid':i+3})
  names.append(name.text)
  
  last_price = listing.find('td',attrs={'data-reactid':i+5})
  # Append the cell found just above, not `price`, which is left over from Step 2
  last_prices.append(last_price.text)
  
  change = listing.find('td',attrs={'data-reactid':i+7})
  changes.append(change.text)
  
  percentage_change = listing.find('td',attrs={'data-reactid':i+9})
  percentage_changes.append(percentage_change.text)  


# TODO: Display the table using a Python dataframe
df = pd.DataFrame({ "Symbol":     symbols,
                    "Name":       names,
                    "Last price": last_prices,
                    "Change":     changes,
                    "% change":   percentage_changes })
df  
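
To make the `start, end, jump` arithmetic in the loop concrete, the short sketch below lists the `data-reactid` values that the loop visits. It only reuses the numbers from the cell above; the dictionary of column offsets is an illustrative restatement of the `i+1`, `i+3`, ... lookups, not new information about the page.

```python
start, end, jump = 33, 449, 16

# data-reactid of each <tr> visited by the loop (end is exclusive)
row_ids = list(range(start, end, jump))

# Offsets from the row id to each <td>, as used in the loop above
column_offsets = {"symbol": 1, "name": 3, "last_price": 5,
                  "change": 7, "percentage_change": 9}

# data-reactid values looked up for the first row
first_row_cells = {col: row_ids[0] + off for col, off in column_offsets.items()}

print(len(row_ids), row_ids[:3], row_ids[-1])  # 26 [33, 49, 65] 433
print(first_row_cells)
```

Because every lookup depends on these hard-coded numbers, any change to the page layout that shifts the ids would break the loop.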

Unnamed: 0,Symbol,Name,Last price,Change,% change
0,INR=X,USD/INR,211.5,0.143,+0.20%
1,EURINR=X,EUR/INR,211.5,0.3877,+0.44%
2,GBPINR=X,GBP/INR,211.5,-0.0084,-0.01%
3,AEDINR=X,AED/INR,211.5,0.015,+0.08%
4,INRJPY=X,INR/JPY,211.5,-0.0032,-0.21%
5,SGDINR=X,SGD/INR,211.5,0.187,+0.34%
6,USDIDR=X,USD/IDR,211.5,-20.0,-0.14%
7,USDTHB=X,USD/THB,211.5,-0.012,-0.04%
8,USDMYR=X,USD/MYR,211.5,0.005,+0.12%
9,USDZAR=X,USD/ZAR,211.5,0.0449,+0.32%


# **Step 4: Header Scraping Method**
This is a more advanced scraping method. The code reads the column headers from the page itself, so we no longer have to define a list for each column by hand, which makes the code simpler and cleaner.

*   Copy and paste the Yahoo Finance link of cryptocurrencies
*   Scrape the headers and put those into a python list
*   Put the relevant data into a Python dictionary
---

In [None]:
crypto_url = "https://in.finance.yahoo.com/cryptocurrencies"
r = requests.get(crypto_url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

# TODO: Use a Python list and a Python dictionary to scrape all the headers
raw_data = {}
headers =[]
header_rows = soup.find('thead')
for header in header_rows.find_all('th'):
  headers.append(header.text)
  raw_data[header.text]=[]

rows = soup.find('tbody')
for row in rows.find_all('tr'):
  for index,row_data in enumerate(row.find_all('td')):
    header_value = headers[index]
    raw_data[header_value].append(row_data.text)

# TODO: Display the table using a Python dataframe
pd.DataFrame(raw_data)

Unnamed: 0,Symbol,Name,Price (intraday),Change,% change,Market cap,Volume in currency (since 0:00 UTC),Volume in currency (24 hrs),Total volume all currencies (24 hrs),Circulating supply,52-week range,1-day chart
0,BTC-INR,Bitcoin INR,2629632.5,37675.25,+1.45%,49.21T,5.732T,5.732T,5.732T,18.714M,,
1,ETH-INR,Ethereum INR,163301.03,5703.7,+3.62%,18.944T,4.158T,4.158T,4.158T,116.004M,,
2,USDT-INR,Tether INR,73.04,0.05,+0.07%,4.362T,12.417T,12.417T,12.417T,59.73B,,
3,ADA-INR,Cardano INR,103.71,7.27,+7.54%,3.313T,686.066B,686.066B,686.066B,31.948B,,
4,BNB-INR,BinanceCoin INR,21053.82,1646.93,+8.49%,3.23T,426.128B,426.128B,426.128B,153.433M,,
5,DOGE-INR,Dogecoin INR,23.18,-0.53,-2.24%,3.008T,564.328B,564.328B,564.328B,129.735B,,
6,XRP-INR,XRP INR,60.91,1.79,+3.03%,2.81T,710.92B,710.92B,710.92B,46.135B,,
7,USDC-INR,USDCoin INR,72.94,0.02,+0.03%,1.504T,378.997B,378.997B,378.997B,20.615B,,
8,DOT1-INR,Polkadot INR,1409.29,-24.57,-1.71%,1.327T,450.054B,450.054B,450.054B,941.401M,,
9,BCH-INR,BitcoinCash INR,44613.44,1923.47,+4.51%,836.278B,459.853B,459.853B,459.853B,18.745M,,


# **Step 5: Making a generic scraping function**




We are going to turn the header method into a Python function. This function can also work for other types of financial products!

*   Define a good name for the function
*   Define the input parameters and the input values

---


In [None]:
# TODO: code a generic function scrape_table 
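
One possible shape for such a function is sketched below. It splits the work into `parse_table` (a hypothetical helper that handles only the HTML) and `scrape_table` (which adds the download), so the parsing logic can be tried on a hand-written string without network access. It assumes the page contains a `<thead>` and a `<tbody>`, as the pages in the earlier steps do; the sample HTML is made up.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def parse_table(html):
    """Turn the first <thead>/<tbody> pair found in `html` into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find("thead").find_all("th")]
    records = []
    for row in soup.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append(dict(zip(headers, cells)))  # zip stops at the shorter sequence
    return pd.DataFrame(records, columns=headers)

def scrape_table(url, timeout=10):
    """Download `url` and parse its table (needs network access)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse_table(response.text)

# Made-up sample data to exercise the parser offline.
sample_html = """
<table>
  <thead><tr><th>Symbol</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>Alpha</td></tr>
    <tr><td>BBB</td><td>Beta</td></tr>
  </tbody>
</table>
"""
sample_df = parse_table(sample_html)
```

Separating download from parsing is a design choice rather than a requirement; a single function that does both would work equally well for this exercise.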

# **Concept Challenge: Scrape other products**
Try using the generic function to scrape other products in Yahoo Finance!
*   Gainers
*   Losers
*   Top ETFs
---


In [None]:
# TODO: Try using the generic function to scrape other kinds of products (e.g. cryptocurrencies)
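
A small sketch of how one generic function could be applied to several products in a loop follows. The URL slugs (`gainers`, `losers`, `etfs`) are assumptions and should be checked in the browser first; `scrape_products` is a hypothetical helper that accepts any scraper function, so it can be dry-run here with a stand-in that simply returns the URL.

```python
BASE_URL = "https://in.finance.yahoo.com"

# Assumed URL slugs; verify each one in the browser before relying on it.
PRODUCT_SLUGS = {"Gainers": "gainers", "Losers": "losers", "Top ETFs": "etfs"}

def scrape_products(scraper, slugs=PRODUCT_SLUGS, base_url=BASE_URL):
    """Call `scraper(url)` for each product and return {product name: result}."""
    return {name: scraper(f"{base_url}/{slug}") for name, slug in slugs.items()}

# Dry run with a stand-in scraper that just echoes the URL it was given.
urls = scrape_products(lambda url: url)
print(urls)
```

Once your own `scrape_table` function is defined, `scrape_products(scrape_table)` would return one table per product.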

# **Step 6: Data Wrangling**

**Datatype Conversion**

This part uses the stock data collected by our web scraper. However, all the scraped values are stored as strings: the data is treated as text even when it represents a number. We need to convert each column to the right type before passing it to chart-plotting tools.

Steps in data conversion:
1.   Remove all the commas from numeric data, and convert the columns that contain numbers to floating point.
2.   Convert all columns that contain dates to datetime.
3.   Expand abbreviated numbers; for example, turn "1M" back into 1000000.

In [None]:
from datetime import datetime

def convert_column_to_float(df, columns):
  # TODO: code the logic for string to float
  pass

def convert_column_to_datetime(df, columns):
  # TODO: code the logic for string to datetime
  pass

def revert_scaled_number(number):
  # TODO: code the logic for converting the abbreviated strings back to numbers
  pass
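
One possible way to fill in these three helpers is sketched below, following the conversion steps listed above. Beyond those steps it makes a few assumptions that should be checked against the real data: the suffixes K, M, B and T mean thousand, million, billion and trillion; `+` and `%` signs are stripped along with commas; and values that cannot be parsed become `NaN`, `NaT` or `None` instead of raising an error. The sample DataFrame is made up.

```python
import pandas as pd

# Assumed meaning of the suffixes used in the scraped tables.
SCALE = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def revert_scaled_number(number):
    """Convert strings such as '1M' or '2.5K' to floats; return None if not parseable."""
    text = str(number).replace(",", "").strip()
    if not text:
        return None
    multiplier = SCALE.get(text[-1].upper(), 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before parsing the number
    try:
        return float(text) * multiplier
    except ValueError:
        return None

def convert_column_to_float(df, columns):
    """Strip ',', '%' and '+' from the given columns and convert them to floats in place."""
    for column in columns:
        cleaned = df[column].astype(str)
        for char in (",", "%", "+"):
            cleaned = cleaned.str.replace(char, "", regex=False)
        df[column] = pd.to_numeric(cleaned, errors="coerce")  # unparseable -> NaN
    return df

def convert_column_to_datetime(df, columns):
    """Convert the given columns to datetime in place; unparseable values become NaT."""
    for column in columns:
        df[column] = pd.to_datetime(df[column], errors="coerce")
    return df

# Made-up sample data.
sample = pd.DataFrame({"Price": ["1,234.5", "13.85"],
                       "% change": ["+3.75%", "-2.24%"]})
convert_column_to_float(sample, ["Price", "% change"])
```

A column of abbreviated values could then be expanded with, for example, `df["Volume"].map(revert_scaled_number)`.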

**Filtering dataframe**

- We can now scrape all the active stocks easily.
- Let's separate them into rising and losing stocks.

In [None]:
# TODO: first scrape the active stocks table using the web scraper function

# TODO: change the data type of the dataframe columns

# TODO: filter the dataframe by % Change (pos/neg)
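
The filtering step can be sketched with boolean masks, as below. The DataFrame is made-up sample data whose percentage column has already been converted to floats, and `change_column` is an illustrative variable that should match whatever header your scraper actually returns.

```python
import pandas as pd

# Made-up sample data with the percentage column already converted to float.
sample = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC"],
                       "% change": [3.75, -2.24, 0.0]})
change_column = "% change"  # adjust to match the header your scraper returns

# A boolean mask keeps only the rows where the condition is True.
sample_rising = sample[sample[change_column] > 0]
sample_losing = sample[sample[change_column] < 0]
```

Note that a stock with a change of exactly 0 lands in neither group under these two conditions.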


**Sorting dataframe**

- It's not quite clear which stock is the top gainer/loser
- We can sort the dataframe and see it clearly

In [None]:
rising = rising.sort_values(by=['% Change'], ascending=False)
# TODO: get the losing stocks
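
For the losing stocks, sorting in ascending order puts the largest drop first. A minimal sketch on made-up sample data (the column name is again an assumption about what the scraper returns):

```python
import pandas as pd

# Made-up sample data: three losing stocks with float percentage changes.
sample_losing = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC"],
                              "% change": [-0.5, -4.2, -1.3]})

# ascending=True places the most negative change at the top
sample_losing = sample_losing.sort_values(by=["% change"], ascending=True)
```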

Finally, if you prefer, you can add back the "+/-" sign and the percentage symbol, converting the values back to strings.

In [None]:
rising['% Change']='+' + rising['% Change'].astype(str) + '%'
losing['% Change']=losing['% Change'].astype(str) + '%'
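
As an alternative sketch, a format specification can add the sign and the percent symbol in one step for both rising and losing values. Unlike `astype(str)`, it also fixes the number of decimal places (two here, which is an arbitrary choice). The data is made up.

```python
import pandas as pd

# Made-up sample data with float percentage changes.
sample = pd.DataFrame({"% change": [3.75, -2.24, 0.0]})

# "+" in the format spec always writes a sign; ".2f" keeps two decimals
formatted = sample["% change"].map("{:+.2f}%".format)
print(formatted.tolist())  # ['+3.75%', '-2.24%', '+0.00%']
```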

In [None]:
rising