# Web scraping

Web scraping is an automatic method to obtain large amounts of data from websites.

Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

There are many different ways to perform web scraping to obtain data from websites

These include using online services, particular API’s or even creating your code for web scraping from scratch

Many large websites, like Google, Twitter, Facebook, StackOverflow, etc. have API’s that allow you to access their data in a structured format.

This is the best option, but there are other sites that don’t allow users to access large amounts of data in a structured form or they are simply not that technologically advanced. In that situation, it’s best to use Web Scraping to scrape the website for data.

Web scraping requires two parts, namely the crawler and the scraper.

 The **crawler** is an artificial intelligence algorithm that browses the web to search for the particular data required by following the links across the internet.
  
The **scraper**, on the other hand, is a specific tool created to extract data from the website. The design of the scraper can vary greatly according to the complexity and scope of the project so that it can quickly and accurately extract the data.



In [1]:
# %pip install numpy
# %pip install matplotlib
# %pip install seaborn
# %pip install scikit-learn

In [2]:
# %pip install pandas

In [3]:
# pip install ipykernal

In [4]:
#pip install requests

In [5]:
#%pip install beautifulsoup4

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [7]:
url = "https://en.wikipedia.org/wiki/World_population"

response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")


In [8]:
print(soup.title)

<title>World population - Wikipedia</title>


In [None]:
print(soup.text)

In [10]:
print(soup.title.text)

World population - Wikipedia


In [11]:
# print(soup.prettify()) # print the whole html code

In [12]:
# find all the tables

tables = soup.find_all("table")

dataframe = [] # empty list

for i, table in enumerate(tables):
    rows = table.find_all("tr")[1:] # skip the first row
    data = [] # empty list
    for row in rows:
        cols = row.find_all("td")
        cols = [col.text.strip() for col in cols]
        data.append(cols)
    df = pd.DataFrame(data)
    dataframe.append(df)



In [13]:
dataframe[1].head()
# dataframe[1].to_csv("world_population.csv", index=False, header=True)

Unnamed: 0,0,1,2,3
0,Sub-Saharan Africa,"1,152 (14.51%)","1,401 (16.46%)","2,094 (21.62%)"
1,Northern Africa and Western Asia,549 (6.91%),617 (7.25%),771 (7.96%)
2,Central Asia and Southern Asia,"2,075 (26.13%)","2,248 (26.41%)","2,575 (26.58%)"
3,Eastern Asia and Southeastern Asia,"2,342 (29.49%)","2,372 (27.87%)","2,317 (23.92%)"
4,Europe and Northern America,"1,120 (14.10%)","1,129 (13.26%)","1,125 (11.61%)"


In [14]:
import requests
from bs4 import BeautifulSoup


req = requests.get("https://www.geeksforgeeks.org/")

soup = BeautifulSoup(req.content, 'html.parser')

# print(soup.title) #title tag
# print(soup.p) #p tag
# print(soup.text) #text

# print(soup.prettify())

soup.title.get_text()

'GeeksforGeeks | A computer science portal for geeks'

In [18]:

import requests
from bs4 import BeautifulSoup

# Send a GET request to the Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(req.content, 'html.parser')

# Find the table containing the population data
table = soup.find('table', {'class': 'wikitable sortable'})
# Find all the rows in the table
# Check if the table was found
if table:

    rows = table.find_all('tr')

# Loop through each row and extract the data
for row in rows:
    # Find all the cells in the row
    cells = row.find_all('td')
    
    # Extract the data from the cells
    if len(cells) > 1:
        rank = cells[0].text.strip()
        country = cells[1].text.strip()
        population = cells[2].text.strip()
        
    # Print the data
    print(f'{rank}\t{country}\t{population}')


    

In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(req.content, 'html.parser')

# Find the table containing the population data
table = soup.find('table', {'class': 'wikitable sortable'})
# Find all the rows in the table
# Check if the table was found
if table:

    for row in table.find_all('tr'):
        print(row.text)
   

# Create an empty DataFrame to store the data
    data = pd.DataFrame(columns=['Rank', 'Country', 'Population'])

# Loop through each row and extract the data
for row in rows:
    # Find all the cells in the row
    cells = row.find_all('td')
    
    
    # Extract the data from the cells
    if len(cells) > 1:
        rank = cells[0].text.strip()
        country_cell = cells[1].find('a')
        if country_cell is not None:
            country = country_cell.text.strip()
        else:
            country = cells[1].text.strip()
        population = cells[2].text.strip()
        
        # Append the data to the DataFrame
        data = pd.concat([data, pd.DataFrame({'Rank': [rank], 'Country': [country], 'Population': [population]})], ignore_index=True)

# Print the DataFrame
data





Location

Population

% ofworld

Date

Source (official or fromthe United Nations)




 –

 World

8,113,332,000
100%
18 Jun 2024

UN projection[3]



 1/2  [b]

 China

1,409,670,000
17.4%
31 Dec 2023

Official estimate[5]
[c]


 India

1,400,744,000
17.3%
1 Mar 2024

Official projection[6]
[d]


3

 United States

335,893,238
4.1%
1 Jan 2024

Official estimate[7]
[e]


4

 Indonesia

279,118,866
3.4%
1 Jul 2023

National annual projection[8]



5

 Pakistan

241,499,431
3.0%
1 Mar 2023

2023 census result[9]
[f]


6

 Nigeria

223,800,000
2.8%
1 Jul 2023

Official projection[10]



7

 Brazil

203,080,756
2.5%
1 Aug 2022

2022 census result[11]



8

 Bangladesh

169,828,911
2.1%
14 Jun 2022

2022 census result[12]



9

 Russia

146,150,789
1.8%
1 Jan 2024

Official estimate[13]
[g]


10

 Mexico

129,713,690
1.6%
31 Mar 2024

National quarterly estimate[14]



11

 Japan

123,930,000
1.5%
1 May 2024

Official estimate[15]



12

 Philippines

112,892,781
1.4%
1 Jul 2023

Officia

Unnamed: 0,Rank,Country,Population


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/World_population'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the tables on the webpage
tables = soup.find_all('table')

# For each table, create a DataFrame and print it
for i, table in enumerate(tables):
    # The [1:] is to skip the header row
    rows = table.find_all('tr')[1:]
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [col.text.strip() for col in cols]
        data.append(cols)
    df = pd.DataFrame(data)
    print(f"Table {i}:")
    print(df)
    print("\n\n")

Table 0:
          0     1     2     3     4     5     6     7     8     9
0      1804  1927  1960  1974  1987  1999  2011  2022  2037  2057
1  200,000+   123    33    14    13    12    12    11    15    20



Table 1:
                                    0               1               2  \
0                  Sub-Saharan Africa  1,152 (14.51%)  1,401 (16.46%)   
1    Northern Africa and Western Asia     549 (6.91%)     617 (7.25%)   
2      Central Asia and Southern Asia  2,075 (26.13%)  2,248 (26.41%)   
3  Eastern Asia and Southeastern Asia  2,342 (29.49%)  2,372 (27.87%)   
4         Europe and Northern America  1,120 (14.10%)  1,129 (13.26%)   
5     Latin America and the Caribbean     658 (8.29%)     695 (8.17%)   
6           Australia and New Zealand      31 (0.39%)      34 (0.40%)   
7                             Oceania      14 (0.18%)      15 (0.18%)   
8                               World           7,942           8,512   

                3  
0  2,094 (21.62%)  
1     771 

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/World_population'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the tables on the webpage
tables = soup.find_all('table')

dataframes = []  # List to store all dataframes

# For each table, create a DataFrame and append it to the list
for i, table in enumerate(tables):
    # The [1:] is to skip the header row
    rows = table.find_all('tr')[1:]
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [col.text.strip() for col in cols]
        data.append(cols)
    df = pd.DataFrame(data)
    dataframes.append(df)

# Now dataframes[0], dataframes[1], etc. each contain a different table from the page
dataframes[5]

Unnamed: 0,0,1,2,3,4
0,,Graphs are unavailable due to technical issues...,,,
1,1,China[B],1270.0,1376.0,1416.0
2,2,India,1053.0,1311.0,1528.0
3,3,United States,283.0,322.0,356.0
4,4,Indonesia,212.0,258.0,295.0
5,5,Pakistan,136.0,208.0,245.0
6,6,Brazil,176.0,206.0,228.0
7,7,Nigeria,123.0,182.0,263.0
8,8,Bangladesh,131.0,161.0,186.0
9,9,Russia,146.0,146.0,149.0


## Stock market data scrapping

**Why stock market data scrapping is important?**

Stock market data scraping is important because it allows traders and investors to gather large amounts of data from various sources, analyze it, and make informed decisions based on the insights gained. This can include tracking stock prices, analyzing market trends, and monitoring news and social media sentiment. Additionally, stock market data scraping can help traders and investors identify potential risks and opportunities, and make more informed decisions about when to buy or sell stocks.

There are several libraries that can be used to scrape stock market data in Python. Some popular ones include:

BeautifulSoup
Scrapy
Selenium
Pandas
Requests
Each of these libraries has its own strengths and weaknesses, and the choice of which one to use will depend on the specific requirements of your project. For example, if you need to scrape data from dynamic websites that require user interaction, you may want to use Selenium. If you need to scrape data from multiple pages or websites, Scrapy may be a good choice. If you need to manipulate and analyze the data after scraping it, Pandas may be the way to go.

In [None]:
# pip install yfinance

In [1]:
import yfinance as yf
import pandas as pd

In [8]:
# Define the ticker symbol
tickerSymbol = 'AAPL'
# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# Get the stock price history
tickerDf= tickerData.history(period='1d', start='2010-1-1', end='2024-1-1')

# Get the last closing price
lastPrice = tickerDf['Close'].iloc[-1]

# Print the last closing price
print(lastPrice)

192.28463745117188


# Web Scrapping vs. web crawling

Web scraping and web crawling are two related but distinct techniques for gathering data from the web.

Web scraping involves extracting specific data from web pages, typically using tools like BeautifulSoup or Scrapy in Python.

This data can be used for a variety of purposes, such as analyzing market trends, monitoring social media sentiment, or gathering product information for price comparison websites.
Web crawling, on the other hand, involves systematically exploring the web to gather data, typically using automated bots or spiders.

This data can be used to create search engine indexes, monitor website changes, or gather data for academic research.
In summary, web scraping is focused on extracting specific data from web pages, while web crawling is focused on systematically exploring the web to gather data.

# Challenges scrapping stockmarket data

There are several common challenges when scraping stock market data, including:

Dynamic websites: Many stock market websites use dynamic content that is generated by JavaScript, which can make it difficult to scrape the data using traditional web scraping techniques. In these cases, you may need to use a tool like Selenium to automate a web browser and interact with the website in order to scrape the data.

Anti-scraping measures: Some websites may have anti-scraping measures in place to prevent automated scraping. These measures can include CAPTCHAs, IP blocking, or user agent detection. To avoid these measures, you may need to use techniques like rotating IP addresses or user agents, or using a proxy server.

Data formatting: Stock market data can be presented in a variety of formats, including tables, charts, and graphs. Extracting the data from these formats can be challenging, and may require specialized tools or techniques.

Data quality: Stock market data can be noisy and contain errors or outliers. It's important to carefully clean and validate the data before using it for analysis or decision-making.

Legal and ethical considerations: Scraping stock market data can raise legal and ethical concerns, particularly if the data is used for insider trading or other illegal activities. It's important to ensure that your scraping activities are legal and ethical, and to obtain any necessary permissions or licenses before scraping data.

# Techniques to avoid anti-scraping measures when scraping stock market data?

Use a proxy server: A proxy server can be used to route your requests through a different IP address, which can help you avoid IP blocking. There are several free and paid proxy services available that you can use.

Rotate user agents: Some websites may block requests from certain user agents, so rotating your user agent can help you avoid detection. You can use a library like fake_useragent in Python to generate random user agents for each request.

Slow down your requests: Sending too many requests too quickly can trigger anti-scraping measures, so slowing down your requests can help you avoid detection. You can use a library like time in Python to add a delay between each request.

Use CAPTCHA solving services: Some websites may require you to solve a CAPTCHA in order to access the data. There are several CAPTCHA solving services available that you can use to automate this process.

Use headless browsers: Some websites may use JavaScript to generate content, which can make it difficult to scrape the data using traditional web scraping techniques. Using a headless browser like Selenium can help you automate the process of interacting with the website and scraping the data.

It's important to note that while these techniques can help you avoid anti-scraping measures, they may not be foolproof and may still result in your requests being blocked or your IP address being banned. It's always a good idea to check the website's terms of service and to be respectful of their policies when scraping data.

# What are some common data formatting issues when scraping stock market data?

There are several common data formatting issues when scraping stock market data, including:

Inconsistent data types: Stock market data can be presented in a variety of formats, including text, numbers, and dates. It's important to ensure that the data is consistently formatted and that the data types are correct before using it for analysis.

Missing data: Stock market data can be incomplete or missing, which can make it difficult to analyze. It's important to handle missing data appropriately, either by imputing missing values or by excluding them from the analysis.

Non-standard data formats: Some stock market data may be presented in non-standard formats, such as PDFs or images. Extracting data from these formats can be challenging and may require specialized tools or techniques.

Data normalization: Stock market data can be presented in different units or currencies, which can make it difficult to compare across different stocks or markets. It's important to normalize the data to a common unit or currency before using it for analysis.

Data cleaning: Stock market data can be noisy and contain errors or outliers. It's important to carefully clean and validate the data before using it for analysis or decision-making.

It's important to be aware of these formatting issues when scraping stock market data and to take steps to address them before using the data for analysis.

# Let's scrape some stock market data!

In [9]:
import yfinance as yf
import pandas as pd

# Define the ticker symbol
tickerSymbol = 'GOOGL'
# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# Get the stock price history
tickerDf= tickerData.history(period='1y')
# Print the last 5 rows of the DataFrame
print(tickerDf.tail())
print(tickerDf.head())
# Save the DataFrame to a CSV file
# tickerDf.to_csv('googl_stock_data.csv')

                                 Open        High         Low       Close  \
Date                                                                        
2024-06-12 00:00:00-04:00  178.250000  180.410004  176.110001  177.789993   
2024-06-13 00:00:00-04:00  176.110001  176.740005  174.880005  175.160004   
2024-06-14 00:00:00-04:00  174.220001  177.059998  174.149994  176.789993   
2024-06-17 00:00:00-04:00  175.460007  178.360001  174.809998  177.240005   
2024-06-18 00:00:00-04:00  177.139999  177.389999  174.100006  175.089996   

                             Volume  Dividends  Stock Splits  
Date                                                          
2024-06-12 00:00:00-04:00  27864700        0.0           0.0  
2024-06-13 00:00:00-04:00  20913300        0.0           0.0  
2024-06-14 00:00:00-04:00  18063600        0.0           0.0  
2024-06-17 00:00:00-04:00  19618500        0.0           0.0  
2024-06-18 00:00:00-04:00  21869900        0.0           0.0  
                   

In [10]:
import yfinance as yf
import pandas as pd

# define ticker symbol
tickerSymbol = 'TSLA'

# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# tickerData.info

# get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2010-1-1', end='2020-1-25')

# last closing price
tickerDf['Close'].iloc[-1]

print(tickerDf.head())
print(tickerDf.shape)

                               Open      High       Low     Close     Volume  \
Date                                                                           
2010-06-29 00:00:00-04:00  1.266667  1.666667  1.169333  1.592667  281494500   
2010-06-30 00:00:00-04:00  1.719333  2.028000  1.553333  1.588667  257806500   
2010-07-01 00:00:00-04:00  1.666667  1.728000  1.351333  1.464000  123282000   
2010-07-02 00:00:00-04:00  1.533333  1.540000  1.247333  1.280000   77097000   
2010-07-06 00:00:00-04:00  1.333333  1.333333  1.055333  1.074000  103003500   

                           Dividends  Stock Splits  
Date                                                
2010-06-29 00:00:00-04:00        0.0           0.0  
2010-06-30 00:00:00-04:00        0.0           0.0  
2010-07-01 00:00:00-04:00        0.0           0.0  
2010-07-02 00:00:00-04:00        0.0           0.0  
2010-07-06 00:00:00-04:00        0.0           0.0  
(2410, 7)


In [15]:
import pandas as pd
import yfinance as yf
import datetime as dt
from datetime import date, timedelta

today = date.today()
d1 = today.strftime("%Y-%m-%d")
d1
d2 = (today - timedelta(days=365)).strftime("%Y-%m-%d")
d2
start_date = d2
end_date = d1

# define ticker symbol
tickerSymbol = 'META'

# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# tickerData.info

# get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start=start_date, end=end_date)
tickerDf.head()

# tickerDf.to_csv('META.csv')

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2023-06-21 00:00:00-04:00,282.948547,283.417585,277.789136,281.062439,20556200,0.0,0.0
2023-06-22 00:00:00-04:00,278.507683,284.675032,277.22035,284.295807,17563100,0.0,0.0
2023-06-23 00:00:00-04:00,280.932714,289.075984,278.377966,288.137909,50988400,0.0,0.0
2023-06-26 00:00:00-04:00,288.107957,289.195718,277.030715,277.898926,24232700,0.0,0.0
2023-06-27 00:00:00-04:00,281.431692,288.756636,280.074465,286.461334,26108300,0.0,0.0
