# Demonstration of Web Scraping using different Techniques

## 1. Demonstration of Web Scraping using BeautifulSoup Library

### Importing required libraries

In [1]:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import warnings
warnings.filterwarnings('ignore')

**In this step, we scrape data from the MarketWatch website using the BeautifulSoup library.**


**The data scraped is Canoo's financial statement data.**

In [2]:
url = "https://www.marketwatch.com/investing/stock/goev/financials?mod=mw_quote_tab"
res = requests.get(url)
res.raise_for_status()  # stop early on an HTTP error (for example, a blocked request)
soup = BeautifulSoup(res.text, 'lxml')

# Locate the financial statement table and its body rows
table = soup.find('table', class_='table table--overflow align--right')
if table is None:
    raise ValueError("Financials table not found; the page layout may have changed")

rows = table.find('tbody', class_='table__body row-hover').find_all('tr')

with open("canoo_financial_statement.csv", "w", newline="") as file:
    writer = csv.writer(file)
    # Write header
    writer.writerow(["Key_factors", "2018", "2019", "2020", "2021", "2022", "5_year"])

    for row in rows:
        data = []
        for cell in row.find_all('td'):
            # Each cell keeps its text in the first 'cell__content' div
            content = cell.find('div', class_="cell__content")
            data.append(content.get_text(strip=True) if content else "")
        writer.writerow(data)
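
Because the header row above is hard-coded to seven columns, it can help to confirm that every scraped row has the same width before analysing the file. The following is a minimal sketch using only the standard library; the helper name `find_ragged_rows` and the sample text are hypothetical and not part of the original notebook.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (line_number, column_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    expected = len(header)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)  # the header is line 1
        if len(row) != expected
    ]


# Hypothetical sample: the second data row is missing one value
sample = "Key_factors,2021,2022\nSales,1,2\nGross Income,3\n"
print(find_ragged_rows(sample))
```

To check the file written by the cell above, its contents could be read with `open("canoo_financial_statement.csv").read()` and passed to the helper.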

## 2. Data scraping using Pandas library

**The pandas read_html method reads the tables on a web page; each table can then be used as a pandas DataFrame or stored in a CSV file.**

**This method is useful for scraping tabular data that is present in the page's HTML.**

**Dynamic websites and multi-page sources need other methods.**

**For example, here is Canoo's insider activity (share purchases and sales):**

In [3]:
pd.read_html("https://markets.businessinsider.com/stocks/goev-stock?miRedirects=1#insider_activity")[5]

Unnamed: 0,Name,Date,shares traded,shares held,Price,type (sell/buy),option
0,Ruiz Hector M.,02/06/2024,745.0,283355.0,0.16,Sell,No
1,MURTHY RAMESH,01/23/2024,1217.0,283669.0,0.18,Sell,No
2,Ruiz Hector M.,01/03/2024,3444.0,284100.0,0.23,Sell,No
3,Ruiz Hector M.,01/01/2024,912.0,287544.0,0.25,Sell,No
4,MURTHY RAMESH,12/25/2023,205.0,281886.0,0.24,Sell,No
5,Ethridge Greg,12/19/2023,1500000.0,1947419.0,,Buy,No
6,Sheeran Josette,11/19/2023,22484.0,1313975.0,0.32,Sell,No
7,MURTHY RAMESH,11/19/2023,527.0,282091.0,0.32,Sell,No
8,von Storch Debra,11/08/2023,326051.0,419310.0,,Buy,No
9,Schmueckle Rainer,11/08/2023,326051.0,419310.0,,Buy,No


**Canoo's competitors**

In [4]:
pd.read_html("https://scripbox.com/us-stocks/goev-share-price")[2]

Unnamed: 0,Stock Name,Stock Price,52 Week High,52 Week Low,Capital
0,.css-1knbux5{display:-webkit-box;display:-webk...,$ 19.93,$ 71.50,$ 15.28,Large Cap
1,Proterra Inc,$ 4.20,$ 9.20,$ 3.48,Small Cap
2,Nikola Corporation,$ 1.51,$ 11.87,$ 1.51,Small Cap
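
The price columns in the output above come back as strings such as `$ 19.93`, so they need converting before any numeric comparison. A minimal sketch of that conversion follows; the helper name `parse_price` is hypothetical, and it assumes prices use a `$` prefix and optional thousands separators as in the rows shown.

```python
def parse_price(text):
    """Convert a string such as '$ 19.93' to a float, or return None if it is not a price."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # for example, a label such as 'Large Cap'


print(parse_price("$ 19.93"))
print(parse_price("Large Cap"))
```

If the table from the cell above were stored in a variable (say `df`, a name assumed here), the helper could be applied with `df["Stock Price"].map(parse_price)`.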


In [5]:
pd.read_html("https://www.cnbc.com/quotes/GOEV?tab=profile")[0]

Unnamed: 0,SYMBOL,LAST,CHG,%CHG
0,AYROAYRO Inc,1.79,-0.02,-1.10%
1,FFIEFaraday Future Intelligent Electric Inc,0.0945,0.0008,+0.8538%
2,FUVArcimoto Inc,0.6298,0.0099,+1.597%
3,MULNMullen Automotive Inc,8.07,-1.06,-11.61%
4,FSRFisker Inc,0.6486,-0.0826,-11.2965%


## 3. Internet Search by Query

**The data required for analysing Canoo and its peers was scraped using the methods above.**

**This section is a demonstration; there are also various advanced ways to scrape data using AI.**

### Importing required libraries

In [6]:
import csv
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

### Creating a function to search Google and get relevant URLs

In [7]:

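def google_search(query):
    """Return the absolute links found on the Google results page for `query`."""
    # Passing the query via `params` lets requests URL-encode spaces and punctuation
    response = requests.get("https://www.google.com/search", params={"q": query}, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only absolute links; relative hrefs are skipped
    urls = [a['href'] for a in soup.find_all('a', href=True) if a['href'].startswith("http")]
    return urls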
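
`google_search` keeps only hrefs that start with `http`, and the run output later in this notebook lists only Google's own preference and policy pages. One possible reason, stated here as an assumption about Google's markup rather than something the notebook confirms, is that result links are served as relative redirects of the form `/url?q=<target>&...`. The sketch below shows how such hrefs could be unwrapped; the helper name `extract_result_urls` and the sample hrefs are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_result_urls(hrefs):
    """Return external target URLs from a list of raw href values."""
    results = []
    for href in hrefs:
        # Assumed layout: result links look like "/url?q=<target>&sa=..."
        if href.startswith("/url?"):
            target = parse_qs(urlparse(href).query).get("q", [""])[0]
        else:
            target = href
        host = urlparse(target).netloc
        is_google = host == "google.com" or host.endswith(".google.com")
        # Keep absolute http(s) links that do not point back to Google itself
        if target.startswith("http") and not is_google:
            results.append(target)
    return results


sample_hrefs = [
    "/url?q=https://example.com/a&sa=U",
    "https://policies.google.com/privacy",
    "/search?q=x",
    "https://example.org/b",
]
print(extract_result_urls(sample_hrefs))
```

Inside `google_search`, the list of `a['href']` values could be passed through this helper in place of the `startswith("http")` filter.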


#### 1st method - Creating a function to scrape data from a URL using Scrapy

In [8]:

def scrape_with_scrapy(url):
    # Scrapy setup
    from scrapy.crawler import CrawlerProcess
    import scrapy

    data = []  # filled by the spider's parse callback

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            # Detailed extraction of tables or nested data could be written here
            # For demonstration, let's just extract the title of the page
            item = {
                'URL': response.url,
                'Title': response.css('title::text').get()
            }
            data.append(item)  # collect the item so the caller receives it
            yield item

    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_ENABLED': False  # Disable logging for clarity
    })

    # Run the spider; start() blocks until the crawl finishes.
    # Note: the underlying Twisted reactor cannot be restarted, so this
    # function can be called only once per Python process.
    process.crawl(MySpider)
    process.start()
    return data


#### 2nd method - Creating a function to scrape data from a URL using Selenium

In [9]:

def scrape_with_selenium(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Page elements could be located here to get the actual text data we want
        # For demonstration, let's just extract the page title
        title = driver.title
    finally:
        # Close the Selenium WebDriver even if loading the page fails
        driver.quit()

    return {'URL': url, 'Title': title}


#### 3rd method - Creating a function to scrape data from a URL using Beautiful Soup

In [10]:

def scrape_with_beautifulsoup(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # For demonstration, let's just extract the page title
    # (pages without a <title> tag give None instead of raising an error)
    title = soup.title.string if soup.title else None

    return {'URL': url, 'Title': title}


### Creating a function to save data to a CSV file

In [11]:

def save_to_csv(data_list, filename):
    if not data_list:
        # Nothing was scraped, so there are no field names to write
        return
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # Column names are taken from the first record
        fieldnames = list(data_list[0].keys())
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data_list)
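
`save_to_csv` takes its column names from the first record only, so `csv.DictWriter` would raise a `ValueError` if a later record carried an extra key. A minimal sketch of one way around this is to build the column list from the union of all keys. The helper name `rows_to_csv_text` and the `Status` field are hypothetical, and the sketch writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize a list of dicts to CSV text, using the union of all keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen order
    buffer = io.StringIO()
    # restval fills in keys that a given record does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records with differing keys
records = [{"URL": "a", "Title": "t"}, {"URL": "b", "Status": "200"}]
print(rows_to_csv_text(records))
```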
            

### Main function that scrapes the data and stores it in a CSV file

In [12]:

def main():
    queries = [
        "Gather information on Canoo's financial performance, including its revenue, profit margins, return on investment, and expense structure"
    ]
    all_data = []

    for query in queries:
        print(f"Searching for: {query}")
        urls = google_search(query)
        for url in urls:
            print(f"Scraping data from: {url}")
            # Choose one of the scraping methods (Scrapy, Selenium, Beautiful Soup)
            data = scrape_with_selenium(url)
            # Add scraped data to the list
            all_data.append(data)

    # Save the scraped data to a CSV file
    save_to_csv(all_data, "scraped_data.csv")
    print("Data saved to scraped_data.csv")

if __name__ == "__main__":
    main()
    

Searching for: Gather information on Canoo's financial performance, including its revenue, profit margins, return on investment, and expense structure
Scraping data from: https://www.google.com/preferences?hl=en-IN&fg=1&sa=X&ved=0ahUKEwiCxdT_t7qEAxXGzzgGHWGYDikQ5fUCCHs
Scraping data from: https://policies.google.com/privacy?hl=en-IN&fg=1
Scraping data from: https://policies.google.com/terms?hl=en-IN&fg=1
Data saved to scraped_data.csv
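
The loop in `main()` calls the scraper once per URL with no error handling, so a single page that fails to load would stop the whole run, and a URL that appears twice would be scraped twice. The following sketch shows one way to guard against both. The names `scrape_all` and `fake_scraper` are hypothetical; `fake_scraper` stands in for `scrape_with_selenium` so that the example runs offline.

```python
def scrape_all(urls, scraper):
    """Apply `scraper` to each distinct URL, skipping URLs that raise an exception."""
    results, seen = [], set()
    for url in urls:
        if url in seen:
            continue  # avoid scraping the same page twice
        seen.add(url)
        try:
            results.append(scraper(url))
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            print(f"Skipping {url}: {exc}")
    return results


# Hypothetical stand-in for scrape_with_selenium, so the sketch runs offline
def fake_scraper(url):
    if "broken" in url:
        raise RuntimeError("page failed to load")
    return {"URL": url, "Title": "demo"}


demo_urls = ["https://example.com", "https://example.com", "https://broken.example"]
print(scrape_all(demo_urls, fake_scraper))
```

In `main()`, the inner `for url in urls` loop could be replaced by `all_data.extend(scrape_all(urls, scrape_with_selenium))`.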


**After obtaining these URLs, we can use Excel's Power Query, tools such as Octoparse, and other AI tools to scrape large volumes of data.**