# Web scraper

This script scrapes all the links on "https://choicebroking.freshdesk.com/support/solutions", retrieves the articles, and stores them in separate documents that act as a knowledge base for our QnA app.

## Installing Libraries

In [2]:
!pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable


## Importing Libraries

In [3]:
import requests
from bs4 import BeautifulSoup

## Scraping for sections

Each section will be a file/document

In [4]:
import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = 'https://choicebroking.freshdesk.com/support/solutions'

# Send a request to fetch the HTML content of the webpage
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all div tags with class 'cs-s'
divs = soup.find_all('div', class_='cs-s')

# Initialize a list to store each section's name and URL
sections = []

# Iterate through each div to find h3 tags and then the anchor tags within them
for div in divs:
    h3 = div.find('h3', class_='heading')
    if h3:
        anchor = h3.find('a')
        if anchor:
            sections.append({
                "section": anchor.text, 
                "url": f"https://choicebroking.freshdesk.com{anchor['href']}"
            })

# Print the list of sections
print(sections)

[{'section': 'Updates & Releases', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000109390'}, {'section': 'General', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000017154'}, {'section': 'Stocks', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108822'}, {'section': 'Mutual Funds', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108824'}, {'section': 'Stratezy', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108825'}, {'section': 'Jiffy Global', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108826'}, {'section': 'Secured Products', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108827'}, {'section': 'Insurance', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000108899'}, {'section': 'Loan', 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000109076'}, {'section': 'Product', 'url': 'https://choicebroking.freshdesk.com/sup
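
The cells in this notebook call `requests.get` many times without a timeout or any error handling, so a single failed request stops the whole run. Below is a minimal sketch of one way to retry a failing request. The helper name `fetch_with_retries` and its parameters are hypothetical (they are not part of the original notebook), and the fetch function is passed in so the helper does not depend on any particular HTTP library.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying up to `retries` times when it raises.

    `fetch` is any callable that takes a URL and returns a response.
    This is a hypothetical helper, not part of the original notebook.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")

    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Possible usage with the notebook's `requests` import (assumption):
# response = fetch_with_retries(lambda u: requests.get(u, timeout=10), url)
```

Passing `timeout=10` to `requests.get` in the usage line is also an assumption; any value that suits the site would do.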

## Scraping for links in a section

Each link will be a QnA pair

In [5]:
sections[0]['url']

'https://choicebroking.freshdesk.com/support/solutions/22000109390'

In [6]:
for section in sections:
    section_url = section['url']

    # Send a get request to the url
    response = requests.get(section_url)

    # Parse the HTML document
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all anchor tags with href containing 'support/solutions/articles/'
    matching_links = soup.find_all('a', href=lambda href: href and 'support/solutions/articles/' in href)

    # Build a full URL from the href attribute of each matching anchor tag
    qna_links = []
    for link in matching_links:
        qna_links.append(f"https://choicebroking.freshdesk.com/{link['href']}")

    section['qna_links'] = qna_links

In [11]:
sections[0]

{'section': 'Updates & Releases',
 'url': 'https://choicebroking.freshdesk.com/support/solutions/22000109390',
 'qna_links': ['https://choicebroking.freshdesk.com//support/solutions/articles/22000286092-upcoming-features-releases',
  'https://choicebroking.freshdesk.com//support/solutions/articles/22000285853-what-s-new-on-finx-website-30th-april-24',
  'https://choicebroking.freshdesk.com//support/solutions/articles/22000286669-what-s-new-on-finx-website-20th-june-24']}
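
The links above contain a double slash (`.com//support/...`) because each `href` already starts with `/` and the f-string adds another one. A minimal sketch of one way to avoid this is to join the base URL and the `href` with `urllib.parse.urljoin` from the standard library. The names `BASE_URL` and `absolute_url` are hypothetical and introduced only for illustration.

```python
from urllib.parse import urljoin

# Base URL of the help centre, as used throughout this notebook
BASE_URL = "https://choicebroking.freshdesk.com"


def absolute_url(href, base=BASE_URL):
    """Join an href with the base URL without doubling the slash."""
    return urljoin(base + "/", href)


print(absolute_url("/support/solutions/articles/22000286092-upcoming-features-releases"))
```

`urljoin` also leaves an `href` that is already a full URL unchanged, so the same helper would work for both relative and absolute links.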

## Iterating through each section, scraping its content and storing it in a folder

In [13]:
for section in sections:
    print(section['url'])

    # Skip sections that have no article links
    if not section['qna_links']:
        continue

    document = ""
    for qna_link in section['qna_links']:
        print(qna_link)

        # Send a GET request and parse the HTML content using BeautifulSoup
        response = requests.get(qna_link)
        soup = BeautifulSoup(response.content, 'html.parser')

        # The question is the first h2 tag with class 'heading';
        # the answer is the first article tag with class 'article-body'
        h2_headings = soup.find_all('h2', class_='heading')
        article_bodies = soup.find_all('article', class_='article-body')

        # Skip pages that lack either element instead of raising an IndexError
        if not h2_headings or not article_bodies:
            continue

        # Remove the stray "Print" text from the heading
        header = h2_headings[0].text.strip().replace("Print", "").strip()
        answer = article_bodies[0].text.strip()

        document += f"Question: {header}\nAnswer: {answer}\n\n"

    # Write one text file per section, named after the section
    file_name = section['section']
    with open(f'/home/choice/Desktop/whatsapp-chatbot-choice/data/{file_name}.txt', 'w') as file:
        file.write(document)

https://choicebroking.freshdesk.com/support/solutions/22000109390
https://choicebroking.freshdesk.com//support/solutions/articles/22000286092-upcoming-features-releases
https://choicebroking.freshdesk.com//support/solutions/articles/22000285853-what-s-new-on-finx-website-30th-april-24
https://choicebroking.freshdesk.com//support/solutions/articles/22000286669-what-s-new-on-finx-website-20th-june-24
https://choicebroking.freshdesk.com/support/solutions/22000017154
https://choicebroking.freshdesk.com/support/solutions/22000108822
https://choicebroking.freshdesk.com//support/solutions/articles/22000275707-registration-sign-up
https://choicebroking.freshdesk.com//support/solutions/articles/22000275708-activate-your-account
https://choicebroking.freshdesk.com//support/solutions/articles/22000275709-sign-in
https://choicebroking.freshdesk.com//support/solutions/articles/22000275711-scrip-search
https://choicebroking.freshdesk.com//support/solutions/articles/22000275739-advanced-charting
http
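
The cell above uses the section title directly as the file name and writes to a hard-coded directory that must already exist. Titles such as "Updates & Releases" contain spaces and symbols, and a title containing `/` would break the path. The following is a minimal sketch of a more defensive write step, using only the standard library. The helper names `safe_file_name` and `write_section`, and the choice of underscore as the replacement character, are assumptions and not part of the original notebook.

```python
import re
from pathlib import Path


def safe_file_name(section_name):
    """Replace runs of characters that are awkward in file names with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", section_name.strip())
    return cleaned.strip("_") or "untitled"


def write_section(output_dir, section_name, document):
    """Write one section's text to <output_dir>/<safe name>.txt and return the path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path = output_dir / f"{safe_file_name(section_name)}.txt"
    path.write_text(document, encoding="utf-8")
    return path


print(safe_file_name("Updates & Releases"))  # Updates_Releases
```

In the loop, `write_section(data_dir, section['section'], document)` could then replace the `open(...)` call, where `data_dir` is whichever output folder is in use.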

In [9]:
# URL of the webpage to scrape
chunk_url = 'https://choicebroking.freshdesk.com//support/solutions/articles/22000273581-what-is-buyback-how-to-apply-for-buyback-'

# Send a request to fetch the HTML content of the webpage
response = requests.get(chunk_url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all h2 tags with the class name 'heading'
h2_headings = soup.find_all('h2', class_='heading')

# Find all article tags with the class name 'article-body'
article_bodies = soup.find_all('article', class_='article-body')

# Print the found h2 tags
print("h2 headings with class 'heading':")
for heading in h2_headings:
    header = heading.text.strip()
    header = header.replace("Print", "").strip()
    print(header)

# Print the found article tags
print("\narticle tags with class 'article-body':")
for article in article_bodies:
    print(article.text.strip())


h2 headings with class 'heading':
What is Buyback? How to Apply for Buyback?

article tags with class 'article-body':
A buyback is when a company buys its own shares from the stock market. It's like a company investing in itself by purchasing its own stock from investors who own it. This can happen for various reasons, like boosting the stock price or returning money to shareholders. Buyback is usually done at a price higher than the current market value. Below are the steps of How to apply for buyback from Website.1. Visit Dashboard and check Smart Investment Section and click on Buyback.2. Navigate Open Tab and Click on Apply on your preferred buyback company.3. Enter quantity you wish to sell and, then click on Apply to place your order.4.To view your order status, Navigate My Orders Tab.Points to be noted to apply for buyback:1. User should hold the shares on the record date of the buyback to be eligible to apply.2. User should have submitted POA/DDPI to their depository participan
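
In the output above, sentences and numbered steps run together ("Website.1. Visit ...") because `.text` concatenates the text of adjacent tags with no separator. A minimal sketch of one alternative is BeautifulSoup's `get_text` with a `separator` argument, which places a newline between text fragments. The HTML snippet below is made up for illustration and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics an article body with a paragraph and a list
html = (
    "<article class='article-body'>"
    "<p>Steps to apply:</p>"
    "<ol><li>Visit Dashboard.</li><li>Click on Buyback.</li></ol>"
    "</article>"
)

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", class_="article-body")

# .text joins fragments with nothing in between
plain = article.text.strip()

# get_text can strip each fragment and join them with a separator
separated = article.get_text(separator="\n", strip=True)

print(plain)
print(separated)
```

Whether newline-separated text works better for the QnA app depends on how the documents are chunked downstream, so this is a suggestion rather than a required change.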