# Web Scraping

Web scraping is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.
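To make the idea of turning HTML into structured data concrete, here is a minimal sketch that converts a small HTML table into CSV text. It uses only Python's standard library so that it runs without a network connection; `SAMPLE_HTML` and the `TableParser` class are hypothetical names made up for this illustration, and the rest of this notebook uses BeautifulSoup and pandas for the same job.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>TD</th></tr>
  <tr><td>Patrick Mahomes</td><td>41</td></tr>
  <tr><td>Joe Burrow</td><td>35</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example indentation) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)

# Write the structured rows out as CSV text, a spreadsheet-friendly format.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The first row of `parser.rows` holds the column names and the remaining rows hold the data, which is the same shape the scraping code below builds before loading it into a DataFrame.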

Websites are of two types:

1. Static web pages: The content of these pages does not change over time. To change a static web page, the HTML file itself has to be edited according to the requirements. The server delivers the stored page directly to the browser.

2. Dynamic web pages: The content of these pages changes over time. To change what the page shows, we change the data in the backend database rather than the page itself.
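For scraping, the practical question is whether the data is already present in the HTML that the server returns or is filled in later by scripts running in the browser. The sketch below shows one rough way to check this. Both page sources are hypothetical, trimmed examples, and the function names are made up for this illustration; counting `<td>` cells is only a heuristic, not a reliable test.

```python
import re

# Hypothetical page sources, trimmed to the part that matters.
STATIC_SOURCE = """
<table>
  <tr><th>Player</th></tr>
  <tr><td>Tom Brady</td></tr>
</table>
"""

DYNAMIC_SOURCE = """
<table id="market"><tbody></tbody></table>
<script src="load-table.js"></script>
"""


def count_data_cells(html_source):
    """Return how many <td> cells appear in the raw HTML text."""
    return len(re.findall(r"<td\b", html_source, flags=re.IGNORECASE))


def data_is_in_source(html_source):
    """Heuristic: True when the raw HTML already contains table cells."""
    return count_data_cells(html_source) > 0


print(data_is_in_source(STATIC_SOURCE))   # cells are present in the source
print(data_is_in_source(DYNAMIC_SOURCE))  # table body is empty until scripts run
```

When the check fails on the raw source, a plain HTTP request is usually not enough and a real browser has to render the page first, which is why the second example in this notebook drives Chrome through Selenium.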

However, as a general rule only publicly available data should be scraped; scraping private data can be illegal.

#### Scraping a static webpage

Website for scraping - https://www.nfl.com/stats/player-stats/ (America’s National Football League data)

In [2]:
# Import libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.nfl.com/stats/player-stats/"

# Make a GET request to the URL (with a timeout so the call cannot hang forever)
request_object = requests.get(url, timeout=30)

# Stop early if the server returned an error status
request_object.raise_for_status()

# Get the page's HTML source from the response
source_code = request_object.text


In [3]:
soup = BeautifulSoup(source_code, "html.parser")

# Take the column headers from the <th> cells of the HTML source
headers_list = [header.get_text(strip=True) for header in soup.find_all('th')]

# Collect the text of every <td> cell, one list per table row (skipping the header row)
table_rows = []
for tr in soup.find_all('tr')[1:]:
    table_rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

# Build the DataFrame in one step, with the headers as column names
NFL_data = pd.DataFrame(table_rows, columns=headers_list)

# Number the rows from 1, as in the output below
NFL_data.index = range(1, len(NFL_data) + 1)

NFL_data

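|  | Player | Pass Yds | Yds/Att | Att | Cmp | Cmp % | TD | INT | Rate | 1st | 1st% | 20+ | 40+ | Lng | Sck | SckY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Patrick Mahomes | 5250 | 8.1 | 648 | 435 | 67.1 | 41 | 12 | 105.2 | 272 | 42.0 | 73 | 13 | 67 | 26 | 188 |
| 2 | Justin Herbert | 4739 | 6.8 | 699 | 477 | 68.2 | 25 | 10 | 93.2 | 228 | 32.6 | 50 | 7 | 55 | 38 | 206 |
| 3 | Tom Brady | 4694 | 6.4 | 733 | 490 | 66.8 | 25 | 9 | 90.7 | 237 | 32.3 | 50 | 8 | 63 | 22 | 160 |
| 4 | Kirk Cousins | 4547 | 7.1 | 643 | 424 | 65.9 | 29 | 14 | 92.5 | 230 | 35.8 | 47 | 10 | 66 | 46 | 329 |
| 5 | Joe Burrow | 4475 | 7.4 | 606 | 414 | 68.3 | 35 | 12 | 100.8 | 219 | 36.1 | 53 | 10 | 60 | 41 | 259 |
| 6 | Jared Goff | 4438 | 7.6 | 587 | 382 | 65.1 | 29 | 7 | 99.3 | 227 | 38.7 | 57 | 12 | 81 | 23 | 156 |
| 7 | Josh Allen | 4283 | 7.6 | 567 | 359 | 63.3 | 35 | 14 | 96.6 | 212 | 37.4 | 51 | 12 | 98 | 33 | 162 |
| 8 | Geno Smith | 4282 | 7.5 | 572 | 399 | 69.8 | 30 | 11 | 100.9 | 206 | 36.0 | 50 | 6 | 54 | 46 | 348 |
| 9 | Trevor Lawrence | 4113 | 7.0 | 584 | 387 | 66.3 | 25 | 8 | 95.2 | 206 | 35.3 | 55 | 3 | 59 | 27 | 184 |
| 10 | Jalen Hurts | 3701 | 8.0 | 460 | 306 | 66.5 | 22 | 6 | 101.6 | 165 | 35.9 | 52 | 11 | 68 | 38 | 231 |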
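Every value scraped this way is text, even when it looks like a number, so arithmetic on the columns needs a conversion step first. The sketch below shows the idea on the first two rows of the output above, restricted to four columns to keep it short. The helper name `to_number` is made up for this illustration and is not part of the notebook's code.

```python
# A few columns from the first two scraped rows; all values are strings.
headers = ["Player", "Pass Yds", "Yds/Att", "TD"]
rows = [
    ["Patrick Mahomes", "5250", "8.1", "41"],
    ["Justin Herbert", "4739", "6.8", "25"],
]


def to_number(text):
    """Return an int or float when the text is numeric, else the text itself."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# One dictionary per row, keyed by column name, with numeric values converted.
records = [dict(zip(headers, (to_number(cell) for cell in row))) for row in rows]

# With real numbers, column arithmetic works as expected.
total_yards = sum(record["Pass Yds"] for record in records)
print(records[0])
print(total_yards)
```

In pandas the same conversion is usually done per column with `pd.to_numeric`.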


#### Scraping a dynamic webpage

Website taken for scraping - https://www.igxindia.com/market-data/?product%5B%5D=Day+Ahead&product%5B%5D=Daily&product%5B%5D=Week+Day&product%5B%5D=Weekly&product%5B%5D=Fortnightly&product%5B%5D=Monthly&hub_name%5B%5D=Dabhol&hub_name%5B%5D=Dahej&hub_name%5B%5D=Hazira&hub_name%5B%5D=KG+Basin&hub_name%5B%5D=Oduru&DateType=1&interval=Daily&delivery_period=SR&DateFrom=01-01-2022&DateTo=31-01-2022&currency=&tab_active_hidden=&search=true (market data for the Indian Gas Exchange)
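The long URL above is easier to read, and to change, when its query string is built from named filters. The sketch below rebuilds the same URL with the standard library's `urlencode`. The variable names `BASE_URL` and `filters` are made up for this illustration, and the keys and values are simply copied from the URL; what each filter means on the site is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.igxindia.com/market-data/"

# Filters copied from the URL above; keys ending in [] carry several values.
filters = {
    "product[]": ["Day Ahead", "Daily", "Week Day", "Weekly", "Fortnightly", "Monthly"],
    "hub_name[]": ["Dabhol", "Dahej", "Hazira", "KG Basin", "Oduru"],
    "DateType": "1",
    "interval": "Daily",
    "delivery_period": "SR",
    "DateFrom": "01-01-2022",  # day-month-year
    "DateTo": "31-01-2022",
    "currency": "",
    "tab_active_hidden": "",
    "search": "true",
}

# doseq=True expands each list into repeated key=value pairs;
# brackets and spaces are percent-encoded automatically.
url = BASE_URL + "?" + urlencode(filters, doseq=True)
print(url)
```

Changing the date range or the list of hubs then only means editing the dictionary, not the encoded string.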

In [11]:
# Import libraries

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver


In [13]:
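# Extra Selenium helpers used in this cell (Selenium 4 style)
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.igxindia.com/market-data/?product%5B%5D=Day+Ahead&product%5B%5D=Daily&product%5B%5D=Week+Day&product%5B%5D=Weekly&product%5B%5D=Fortnightly&product%5B%5D=Monthly&hub_name%5B%5D=Dabhol&hub_name%5B%5D=Dahej&hub_name%5B%5D=Hazira&hub_name%5B%5D=KG+Basin&hub_name%5B%5D=Oduru&DateType=1&interval=Daily&delivery_period=SR&DateFrom=01-01-2022&DateTo=31-01-2022&currency=&tab_active_hidden=&search=true"

# Path of chromedriver for opening Chrome (raw string so the backslashes are kept literally)
driver = webdriver.Chrome(service=Service(r"C:\webdrivers\Chromedriver.exe"))

# Open the given URL in the Chrome window
driver.get(url)

# Wait up to 100 seconds until the table rows have been rendered on the page
WebDriverWait(driver, 100).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "tr[role='row']"))
)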


In [23]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Take the column headers from the <th> cells that carry tabindex="0"
headers = soup.find_all('th', tabindex="0")
headers_list = [header.text for header in headers]

# Create a DataFrame with the headers as column names
market_data = pd.DataFrame(columns=headers_list)

# Table rows (skipping the header row) and the footer cells below them
rows = soup.find_all('tr', role="row")[1:]
bottom_row = soup.find_all('th', rowspan="1")[13:]

# Iterate through the table rows and append each one to the DataFrame
for j in rows:
    row = [i.text for i in j.find_all('td')]
    market_data.loc[len(market_data)] = row

# Append the footer cells as a new final row
# (using the loop's last index here would overwrite the last data row)
bot = [i.text for i in bottom_row]
market_data.loc[len(market_data)] = bot

market_data

|  | Trade Date | Hub-State | Delivery Point | Product | Contract | Buy Bid Qty (MMBTU) | Sell Bid Qty (MMBTU) | Trade Price (Rs./MMBTU) | Trade Qty (MMBTU) | Scheduled Qty (MMBTU) | Best Buy Bid (Rs./MMBTU) | Best Sell Bid (Rs./MMBTU) | Delivery Days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Monthly | MN-DH-01/02/22-FEB-2022 | 420000 | 700000 | `\n \t\t\t \t\t\t1875 \t\t\t \t\t` | 420000 | - | - | - | 28 |
| 1 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Weekly | WK-DH-08/01/22-14/01/22 | - | 17500 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2050 | 7 |
| 2 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Daily | DL-DH-07/01/22-FRI | - | 2500 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2050 | 1 |
| 3 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Daily | DL-DH-06/01/22-THU | - | 2500 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2050 | 1 |
| 4 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Daily | DL-DH-05/01/22-WED | - | 2500 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2050 | 1 |
| 5 | `\n\t\t\t03-01-2022\t\t\t` | Western - Gujarat | Dahej | Day Ahead | DA-DH-04/01/22-TUE | - | 2500 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2050 | 1 |
| 6 | `\n\t\t\t04-01-2022\t\t\t` | Western - Gujarat | Dahej | Monthly | MN-DH-01/02/22-FEB-2022 | - | 280000 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 2550 | 28 |
| 7 | `\n\t\t\t04-01-2022\t\t\t` | Western - Gujarat | Dahej | Fortnightly | FN-DH-16/01/22-31/01/22 | - | 9600 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 1875 | 16 |
| 8 | `\n\t\t\t04-01-2022\t\t\t` | Western - Gujarat | Dahej | Weekly | WK-DH-08/01/22-14/01/22 | - | 21000 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 1600 | 7 |
| 9 | `\n\t\t\t04-01-2022\t\t\t` | Western - Gujarat | Dahej | Daily | DL-DH-07/01/22-FRI | - | 3000 | `\n \t\t\t \t\t\t- \t\t\t \t\t` | - | - | - | 1600 | 1 |


As the output above shows, the data is not yet in a usable format: several cells contain unnecessary newline and tab characters. To fix this, let's remove the unnecessary characters and replace '-' with NaN.
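Before applying this to whole columns, the rule can be sketched on single cells. The raw strings below are copied from the output above; the helper name `clean_cell` is made up for this illustration. Only a cell that consists of nothing but '-' is treated as missing, so hyphens inside contract codes are left alone.

```python
import math


def clean_cell(raw_text):
    """Strip surrounding whitespace; return NaN for the '-' placeholder."""
    text = raw_text.strip()
    if text == "-":
        return math.nan
    return text


# Raw cell values as they appear in the scraped table.
raw_date = "\n\t\t\t03-01-2022\t\t\t"
raw_price = "\n \t\t\t \t\t\t1875 \t\t\t \t\t"
raw_missing = "\n \t\t\t \t\t\t- \t\t\t \t\t"
contract = "MN-DH-01/02/22-FEB-2022"

print(repr(clean_cell(raw_date)))
print(repr(clean_cell(raw_price)))
print(clean_cell(raw_missing))
print(clean_cell(contract))
```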

In [24]:
# Strip the leading and trailing newlines, tabs and spaces left over from the HTML
market_data['Trade Date'] = market_data['Trade Date'].str.strip()
market_data['Trade Price (Rs./MMBTU)'] = market_data['Trade Price (Rs./MMBTU)'].str.strip()

# Convert the numeric columns; '-' (no value) cannot be parsed as a number and becomes NaN
numeric_columns = [
    'Buy Bid Qty (MMBTU)', 'Sell Bid Qty (MMBTU)', 'Trade Price (Rs./MMBTU)',
    'Trade Qty (MMBTU)', 'Scheduled Qty (MMBTU)', 'Best Buy Bid (Rs./MMBTU)',
    'Best Sell Bid (Rs./MMBTU)', 'Delivery Days',
]
for column in numeric_columns:
    market_data[column] = pd.to_numeric(market_data[column], errors='coerce')

market_data

|  | Trade Date | Hub-State | Delivery Point | Product | Contract | Buy Bid Qty (MMBTU) | Sell Bid Qty (MMBTU) | Trade Price (Rs./MMBTU) | Trade Qty (MMBTU) | Scheduled Qty (MMBTU) | Best Buy Bid (Rs./MMBTU) | Best Sell Bid (Rs./MMBTU) | Delivery Days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 03-01-2022 | Western - Gujarat | Dahej | Monthly | MN-DH-01/02/22-FEB-2022 | 420000.0 | 700000 | 1875.0 | 420000.0 | NaN | NaN | NaN | 28.0 |
| 1 | 03-01-2022 | Western - Gujarat | Dahej | Weekly | WK-DH-08/01/22-14/01/22 | NaN | 17500 | NaN | NaN | NaN | NaN | 2050.0 | 7.0 |
| 2 | 03-01-2022 | Western - Gujarat | Dahej | Daily | DL-DH-07/01/22-FRI | NaN | 2500 | NaN | NaN | NaN | NaN | 2050.0 | 1.0 |
| 3 | 03-01-2022 | Western - Gujarat | Dahej | Daily | DL-DH-06/01/22-THU | NaN | 2500 | NaN | NaN | NaN | NaN | 2050.0 | 1.0 |
| 4 | 03-01-2022 | Western - Gujarat | Dahej | Daily | DL-DH-05/01/22-WED | NaN | 2500 | NaN | NaN | NaN | NaN | 2050.0 | 1.0 |
| 5 | 03-01-2022 | Western - Gujarat | Dahej | Day Ahead | DA-DH-04/01/22-TUE | NaN | 2500 | NaN | NaN | NaN | NaN | 2050.0 | 1.0 |
| 6 | 04-01-2022 | Western - Gujarat | Dahej | Monthly | MN-DH-01/02/22-FEB-2022 | NaN | 280000 | NaN | NaN | NaN | NaN | 2550.0 | 28.0 |
| 7 | 04-01-2022 | Western - Gujarat | Dahej | Fortnightly | FN-DH-16/01/22-31/01/22 | NaN | 9600 | NaN | NaN | NaN | NaN | 1875.0 | 16.0 |
| 8 | 04-01-2022 | Western - Gujarat | Dahej | Weekly | WK-DH-08/01/22-14/01/22 | NaN | 21000 | NaN | NaN | NaN | NaN | 1600.0 | 7.0 |
| 9 | 04-01-2022 | Western - Gujarat | Dahej | Daily | DL-DH-07/01/22-FRI | NaN | 3000 | NaN | NaN | NaN | NaN | 1600.0 | 1.0 |
