<span style="color:yellow;font-weight:bold;text-decoration:underline; font-size:50px">Stock market data scraping</span>

## Why is stock market data scraping important?
Stock market data scraping is important because it allows traders and investors to gather large amounts of data from various sources, analyze it, and make informed decisions based on the insights gained. This can include tracking stock prices, analyzing market trends, and monitoring news and social media sentiment. Additionally, stock market data scraping can help traders and investors identify potential risks and opportunities, and make more informed decisions about when to buy or sell stocks.

Several Python libraries are commonly used to scrape and process stock market data. Some popular ones include:

1. BeautifulSoup
2. Scrapy
3. Selenium
4. Pandas
5. Requests

Each of these libraries has its own strengths and weaknesses, and the choice of which one to use will depend on the specific requirements of your project. For example, if you need to scrape data from dynamic websites that require user interaction, you may want to use Selenium. If you need to scrape data from multiple pages or websites, Scrapy may be a good choice. If you need to manipulate and analyze the data after scraping it, Pandas may be the way to go.
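To make the basic idea of scraping concrete, here is a minimal, dependency-free sketch that pulls a symbol/price table out of static HTML. It uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs without installing anything; BeautifulSoup would typically do the same job with less code. The HTML fragment, the `TableParser` class, and the `extract_quotes` helper are all hypothetical and only illustrate the approach.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real site's markup will differ.
SAMPLE_HTML = """
<table id="quotes">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAPL</td><td>189.50</td></tr>
  <tr><td>MSFT</td><td>410.25</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row being built, or None outside <tr>
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_quotes(html):
    """Return {symbol: price} from a two-column Symbol/Price table."""
    parser = TableParser()
    parser.feed(html)
    _header, *body = parser.rows  # skip the header row
    return {symbol: float(price) for symbol, price in body}


quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

On a real page the HTML would come from an HTTP request (for example with Requests) instead of a string literal, and the parsing step would stay the same.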

## Example Code for Scraping Stock Prices

Here's an example that uses yfinance, a library that retrieves market data from Yahoo Finance, to get the stock price history and last closing price of Apple (AAPL):



In [None]:
# pip install yfinance

In [1]:
import yfinance as yf

# Define the ticker symbol
tickerSymbol = 'AAPL'

# Get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# print(tickerData.info)

# Get the stock price history
tickerDf = tickerData.history(start='2021-01-01', end='2021-12-31') # period='1y' -- last 1 year

print(tickerDf.head())

# Get the last closing price
lastPrice = tickerDf['Close'].iloc[-1]

# Print the last closing price
print('---------------------------------------------------------')
print(lastPrice)

                                 Open        High         Low       Close  \
Date                                                                        
2021-01-04 00:00:00-05:00  130.101333  130.189025  123.514415  126.096565   
2021-01-05 00:00:00-05:00  125.589909  128.366944  125.141681  127.655624   
2021-01-06 00:00:00-05:00  124.449831  127.694571  123.144137  123.358505   
2021-01-07 00:00:00-05:00  125.073458  128.259737  124.586260  127.567909   
2021-01-08 00:00:00-05:00  129.039236  129.234127  126.895568  128.668976   

                              Volume  Dividends  Stock Splits  
Date                                                           
2021-01-04 00:00:00-05:00  143301900        0.0           0.0  
2021-01-05 00:00:00-05:00   97664900        0.0           0.0  
2021-01-06 00:00:00-05:00  155088000        0.0           0.0  
2021-01-07 00:00:00-05:00  109578200        0.0           0.0  
2021-01-08 00:00:00-05:00  105158200        0.0           0.0  
------------



This code uses the yfinance library to get data on the AAPL ticker, including its stock price history. It then extracts the last closing price from the history and prints it to the console. Note that you can adjust the `start` and `end` dates to get the stock price history for a different time period.
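When adjusting the time period, it can help to compute the `start` and `end` strings instead of typing them by hand. The helper below is a small sketch; `date_window` and its parameters are hypothetical names, not part of yfinance. Note that yfinance documents `end` as exclusive, so the `end` date itself is not included in the returned history.

```python
from datetime import date, timedelta


def date_window(days_back, today=None):
    """Return (start, end) as 'YYYY-MM-DD' strings covering the last `days_back` days."""
    today = today or date.today()
    start = today - timedelta(days=days_back)
    return start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")


# Fixed reference date so the result is reproducible.
start, end = date_window(90, today=date(2021, 12, 31))
print(start, end)
```

The two strings can then be passed as `history(start=start, end=end)`.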

<span style="color:yellow;font-weight:bold;text-decoration:underline; font-size:50px">Web scraping vs. web crawling</span>

`Web scraping` and `web crawling` are two related but distinct techniques for gathering data from the web.

`Web scraping` involves `extracting specific data from web pages`, typically using tools like `BeautifulSoup` or `Scrapy` in Python. 
- This data can be used for a variety of purposes, such as analyzing market trends, monitoring social media sentiment, or gathering product information for price comparison websites.

`Web crawling`, on the other hand, involves systematically exploring the web to gather data, typically using automated bots or spiders. 
- This data can be used to create search engine indexes, monitor website changes, or gather data for academic research.

In summary, `web scraping is focused on extracting specific data from web pages`, while `web crawling is focused on systematically exploring the web to gather data`.
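The difference can be sketched in a few lines. The example below runs against a hypothetical in-memory "site" (the `PAGES` dictionary) so that no network access is needed: `crawl` discovers pages by following links, while `scrape_price` extracts one specific value from one page. Regular expressions stand in for a real HTML parser only to keep the sketch short.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> HTML. No network access is involved.
PAGES = {
    "/": '<a href="/aapl">AAPL</a> <a href="/msft">MSFT</a>',
    "/aapl": '<span class="price">189.50</span> <a href="/">home</a>',
    "/msft": '<span class="price">410.25</span> <a href="/aapl">AAPL</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')
PRICE_RE = re.compile(r'<span class="price">([\d.]+)</span>')


def crawl(start):
    """Crawling: breadth-first discovery of every page reachable from `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINK_RE.findall(PAGES.get(url, "")):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def scrape_price(url):
    """Scraping: pull one specific field out of one page."""
    match = PRICE_RE.search(PAGES[url])
    return float(match.group(1)) if match else None


visited = crawl("/")
prices = {}
for url in visited:
    price = scrape_price(url)
    if price is not None:  # the home page has no price element
        prices[url] = price

print(visited)
print(prices)
```

In practice the two are often combined: a crawler finds the pages, and a scraper extracts the fields of interest from each one.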

<span style="color:orange;font-weight:bold;font-size:30px">Challenges of scraping stock market data</span>


There are several common challenges when scraping stock market data, including:

1. `Dynamic websites:` Many stock market websites use dynamic content that is generated by JavaScript, which can make it difficult to scrape the data using traditional web scraping techniques. In these cases, you may need to use a tool like Selenium to automate a web browser and interact with the website in order to scrape the data.

2. `Anti-scraping measures:` Some websites may have anti-scraping measures in place to prevent automated scraping. These measures can include CAPTCHAs, IP blocking, or user agent detection. To avoid these measures, you may need to use techniques like rotating IP addresses or user agents, or using a proxy server.

3. `Data formatting:` Stock market data can be presented in a variety of formats, including tables, charts, and graphs. Extracting the data from these formats can be challenging, and may require specialized tools or techniques.

4. `Data quality:` Stock market data can be noisy and contain errors or outliers. It's important to carefully clean and validate the data before using it for analysis or decision-making.

5. `Legal and ethical considerations:` Scraping stock market data can raise legal and ethical concerns, for example when it violates a website's terms of service or data licensing, or when the data is used for illegal activities. It's important to ensure that your scraping activities are legal and ethical, and to obtain any necessary permissions or licenses before scraping data.
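As an illustration of the data quality point above, the following sketch runs a few basic sanity checks on daily price records before they are used. The `validate_bar` function and the dictionary layout are assumptions made for this example; real pipelines usually need more checks, such as duplicate dates or gaps in the calendar.

```python
def validate_bar(bar):
    """Return a list of problems found in one OHLCV record (empty if none)."""
    problems = []
    open_, high, low, close = bar["open"], bar["high"], bar["low"], bar["close"]
    if min(open_, high, low, close) <= 0:
        problems.append("non-positive price")
    if high < low:
        problems.append("high below low")
    if high < max(open_, close):
        problems.append("high below open/close")
    if low > min(open_, close):
        problems.append("low above open/close")
    if bar["volume"] < 0:
        problems.append("negative volume")
    return problems


bars = [
    # A consistent record.
    {"open": 130.10, "high": 130.19, "low": 123.51, "close": 126.10, "volume": 143301900},
    # A deliberately broken record: the high is lower than the close.
    {"open": 125.59, "high": 127.00, "low": 125.14, "close": 127.66, "volume": 97664900},
]

report = [validate_bar(bar) for bar in bars]
print(report)
```

Records with a non-empty problem list can be dropped, corrected, or flagged for manual review, depending on the analysis.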

<span style="color:orange;font-weight:bold;font-size:30px">Techniques to avoid anti-scraping measures when scraping stock market data</span>

There are several techniques that can be used to avoid anti-scraping measures when scraping stock market data, including:

1. Use a proxy server: A proxy server can be used to route your requests through a different IP address, which can help you avoid IP blocking. There are several free and paid proxy services available that you can use.

2. Rotate user agents: Some websites may block requests from certain user agents, so rotating your user agent can help you avoid detection. You can use a library like `fake_useragent` in Python to generate random user agents for each request.

3. Slow down your requests: Sending too many requests too quickly can trigger anti-scraping measures, so slowing down your requests can help you avoid detection. You can use `time.sleep` from Python's standard-library `time` module to add a delay between each request.

4. Use CAPTCHA solving services: Some websites may require you to solve a CAPTCHA in order to access the data. There are several CAPTCHA solving services available that you can use to automate this process.

5. Use headless browsers: Some websites may use JavaScript to generate content, which can make it difficult to scrape the data using traditional web scraping techniques. Driving a headless browser with a tool like Selenium can help you automate the process of interacting with the website and scraping the rendered data.

It's important to note that while these techniques can help you avoid anti-scraping measures, they may not be foolproof and may still result in your requests being blocked or your IP address being banned. It's always a good idea to check the website's terms of service and to be respectful of their policies when scraping data.
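Two of the techniques above, slowing down requests and rotating user agents, can be sketched as follows. This is only an outline under stated assumptions: `polite_fetch_all`, `fake_fetch`, and the User-Agent strings are hypothetical, and the actual HTTP call is passed in as a function so the sketch runs without network access. In a real scraper, `fetch` could be a thin wrapper around `requests.get(url, headers=headers)`.

```python
import itertools
import time

# Illustrative User-Agent strings; substitute values appropriate for your client.
USER_AGENTS = [
    "example-research-bot/1.0 (contact: you@example.com)",
    "example-research-bot/1.0 (backup)",
]


def polite_fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing between requests."""
    agents = itertools.cycle(USER_AGENTS)  # rotate through the list endlessly
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        headers = {"User-Agent": next(agents)}
        results.append(fetch(url, headers))
    return results


# Stand-in fetcher so the sketch runs offline.
def fake_fetch(url, headers):
    return f"{url} via {headers['User-Agent']}"


pages = polite_fetch_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fake_fetch,
    delay_seconds=0.01,
)
for page in pages:
    print(page)
```

A fixed delay is the simplest choice; adding a small random jitter or backing off after errors are common refinements.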

<span style="color:orange;font-weight:bold;font-size:30px">What are some common data formatting issues when scraping stock market data?</span>

There are several common data formatting issues when scraping stock market data, including:

1. `Inconsistent data types:` Stock market data can be presented in a variety of formats, including text, numbers, and dates. It's important to ensure that the data is consistently formatted and that the data types are correct before using it for analysis.

2. `Missing data:` Stock market data can be incomplete or missing, which can make it difficult to analyze. It's important to handle missing data appropriately, either by imputing missing values or by excluding them from the analysis.

3. `Non-standard data formats:` Some stock market data may be presented in non-standard formats, such as PDFs or images. Extracting data from these formats can be challenging and may require specialized tools or techniques.

4. `Data normalization:` Stock market data can be presented in different units or currencies, which can make it difficult to compare across different stocks or markets. It's important to normalize the data to a common unit or currency before using it for analysis.

5. `Data cleaning:` Stock market data can be noisy and contain errors or outliers. It's important to carefully clean and validate the data before using it for analysis or decision-making.

It's important to be aware of these formatting issues when scraping stock market data and to take steps to address them before using the data for analysis.
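Several of these issues show up as soon as scraped text has to become numbers. The sketch below converts typical scraped strings (thousands separators, currency symbols, magnitude suffixes, percentages, and placeholders for missing values) into floats. The `parse_number` helper and the set of formats it accepts are assumptions for illustration; real sources may use other conventions, such as a comma as the decimal separator.

```python
def parse_number(text):
    """Convert a scraped numeric string to float, or None if it is missing."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").replace("$", "")
    if cleaned in ("", "-", "N/A", "n/a"):
        return None  # treat common placeholders as missing data
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    if suffix == "%":
        return float(cleaned[:-1]) / 100  # express percentages as fractions
    return float(cleaned)


raw = ["1,234.56", "$189.50", "12.5M", "N/A", "2.5%"]
values = [parse_number(item) for item in raw]
print(values)
```

Returning `None` for missing values keeps them distinguishable from a genuine zero, so they can be imputed or excluded later.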

# Let's scrape some stock market data!

Here's an example code snippet that uses yfinance to download the stock data of Google (GOOGL) for the last year from today and stores it in a Pandas DataFrame:



In [2]:
import yfinance as yf
import pandas as pd

# Define the ticker symbol
tickerSymbol = 'GOOGL'

# Get data on this ticker
tickerData = yf.Ticker(tickerSymbol)

# Get the stock price history
tickerDf = tickerData.history(period='1y') # last 1 year data

# Print the shape, then the last and first 5 rows of the DataFrame
print(tickerDf.shape)
print(tickerDf.tail())
print(tickerDf.head())

# Uncomment to save the DataFrame to a CSV file
# tickerDf.to_csv('googl_stock_data.csv')

(251, 7)
                                 Open        High         Low       Close  \
Date                                                                        
2025-09-24 00:00:00-04:00  251.660004  252.350006  246.440002  247.139999   
2025-09-25 00:00:00-04:00  244.399994  246.490005  240.740005  245.789993   
2025-09-26 00:00:00-04:00  247.070007  249.419998  245.970001  246.539993   
2025-09-29 00:00:00-04:00  247.850006  251.149994  242.770004  244.050003   
2025-09-30 00:00:00-04:00  242.982498  243.220093  239.244995  240.410004   

                             Volume  Dividends  Stock Splits  
Date                                                          
2025-09-24 00:00:00-04:00  28201000        0.0           0.0  
2025-09-25 00:00:00-04:00  31020400        0.0           0.0  
2025-09-26 00:00:00-04:00  18503200        0.0           0.0  
2025-09-29 00:00:00-04:00  32452000        0.0           0.0  
2025-09-30 00:00:00-04:00   9049289        0.0           0.0  
          

In [3]:
tickerDf.shape

(251, 7)



This code first defines the ticker symbol for Google (GOOGL) and uses yfinance to get data on this ticker. It then uses the `history` method with `period='1y'` to get the stock price history for the last year and stores it in a Pandas DataFrame. Finally, it prints the DataFrame's shape along with its last and first 5 rows. The `to_csv` line is commented out; uncomment it to save the DataFrame to a CSV file named `googl_stock_data.csv`.

In [4]:
import yfinance as yf
import pandas as pd

# define ticker symbol
tickerSymbol = 'TSLA'

# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# tickerData.info

# get the historical prices for this ticker
tickerDf = tickerData.history(start='2010-1-1', end='2020-1-25')
# tickerDf = tickerData.history(period='1d')  # last 1 day data

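# last closing price (assigned to a variable, since a bare expression in the middle of a cell is not displayed)
lastPrice = tickerDf['Close'].iloc[-1]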

print(tickerDf.head())
print(tickerDf.shape)

                               Open      High       Low     Close     Volume  \
Date                                                                           
2010-06-29 00:00:00-04:00  1.266667  1.666667  1.169333  1.592667  281494500   
2010-06-30 00:00:00-04:00  1.719333  2.028000  1.553333  1.588667  257806500   
2010-07-01 00:00:00-04:00  1.666667  1.728000  1.351333  1.464000  123282000   
2010-07-02 00:00:00-04:00  1.533333  1.540000  1.247333  1.280000   77097000   
2010-07-06 00:00:00-04:00  1.333333  1.333333  1.055333  1.074000  103003500   

                           Dividends  Stock Splits  
Date                                                
2010-06-29 00:00:00-04:00        0.0           0.0  
2010-06-30 00:00:00-04:00        0.0           0.0  
2010-07-01 00:00:00-04:00        0.0           0.0  
2010-07-02 00:00:00-04:00        0.0           0.0  
2010-07-06 00:00:00-04:00        0.0           0.0  
(2410, 7)


In [5]:
import yfinance as yf
from datetime import date, timedelta

today = date.today()
d1 = today.strftime("%Y-%m-%d")
print(d1)

d2 = (today - timedelta(days=365)).strftime("%Y-%m-%d")
print(d2)

start_date = d2
end_date = d1

# define ticker symbol
tickerSymbol = 'META'

# get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# tickerData.info

# get the historical prices for this ticker
tickerDf = tickerData.history(start=start_date, end=end_date)
tickerDf.head()

# tickerDf.to_csv('META.csv')

2025-09-30
2024-09-30


| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2024-09-30 00:00:00-04:00 | 565.920558 | 572.87866 | 563.029624 | 570.645691 | 12792300 | 0.0 | 0.0 |
| 2024-10-01 00:00:00-04:00 | 576.168302 | 581.212439 | 568.312997 | 574.663025 | 15259300 | 0.0 | 0.0 |
| 2024-10-02 00:00:00-04:00 | 573.05815 | 574.194591 | 567.555493 | 571.014587 | 6524700 | 0.0 | 0.0 |
| 2024-10-03 00:00:00-04:00 | 568.362861 | 581.531415 | 566.947268 | 580.943298 | 11581000 | 0.0 | 0.0 |
| 2024-10-04 00:00:00-04:00 | 581.900332 | 594.979203 | 579.607553 | 594.072083 | 14169500 | 0.0 | 0.0 |
