# Automated Market Research Data Pipeline
This notebook collects, cleans, and annotates real-world market research data using web scraping and LLMs. The goal is to prepare training-ready datasets in a single automated flow.


# Setup and Imports
We load the libraries needed for scraping, cleaning, and annotation: `requests` for fetching pages, `BeautifulSoup` for parsing HTML, `pandas` for handling datasets, `python-dotenv` for reading API keys from a `.env` file, and the `OpenAI` client for LLM calls.


In [1]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI
import pandas as pd

In [2]:
load_dotenv(override=True)
api_key=os.getenv("OPENAI_API_KEY")

In [3]:
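# Only the key's prefix is checked; this does not confirm that the key works.
# The `api_key and ...` guard avoids a TypeError when the variable is unset.
if api_key and api_key.startswith("sk-proj-"):
    print("API KEY IS VALID")
else:
    print("INVALID API KEY")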

API KEY IS VALID


We set a browser-like `User-Agent` request header so that websites are less likely to reject the request as coming from a script.


In [4]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

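We define a `Website` class that fetches a page and retries when the request raises an exception, waiting longer after each failed attempt. It removes irrelevant tags (`script`, `style`, `img`, `input`) and stores the page title and text. HTTP error responses such as 403 do not raise an exception here, so their error pages are stored like any other page.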


In [5]:
import time

class Website:
    """Fetch a page and keep its title and visible body text."""

    def __init__(self, url, max_retries=3, backoff=2):
        self.url = url
        for attempt in range(max_retries):
            try:
                # Uses the module-level `headers` defined above
                response = requests.get(url, headers=headers, timeout=10)
                rawtext = BeautifulSoup(response.content, 'html.parser')
                self.title = rawtext.title.string if rawtext.title else "No Title Found"
                if rawtext.body:
                    # Drop tags that carry no readable text
                    for irrelevant in rawtext.body(["script", "style", "img", "input"]):
                        irrelevant.decompose()
                    self.text = rawtext.body.get_text(separator="\n", strip=True)
                else:
                    self.text = "No body found"
                break
            except Exception as e:
                print(f"Error fetching {url} (attempt {attempt+1}/{max_retries}): {e}")
                if attempt < max_retries - 1:
                    # Linear backoff: 2s, then 4s with the default arguments
                    time.sleep(backoff * (attempt + 1))
                else:
                    # Final attempt failed: record the error instead of raising
                    self.title = "Error"
                    self.text = str(e)
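
The retry loop in `Website.__init__` waits `backoff * (attempt + 1)` seconds between attempts, which is a linear backoff. The sketch below isolates that pattern so it can be tried without any network access. The helper name `call_with_retries`, the injectable `sleep` argument, and the stand-in `flaky_fetch` function are illustrative assumptions, not part of the notebook. Unlike `Website`, this version re-raises the last error instead of storing it.

```python
import time


def call_with_retries(func, max_retries=3, backoff=2, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller see the last error
            # Linear backoff, same formula as the Website class
            sleep(backoff * (attempt + 1))


# Stand-in fetcher (hypothetical) that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


# Record the requested delays instead of actually sleeping.
delays = []
result = call_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is only a convenience for experimenting: the delays can be inspected without waiting for them.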

Example usage:

In [9]:
rudra=Website("https://github.com/RudraDudhat2509/")
rudra.title

'RudraDudhat2509 (Rudra Dudhat) · GitHub'

In [10]:
display(Markdown(rudra.text))

...
RudraDudhat2509
Follow
Overview
Repositories
6
Projects
0
Packages
0
Stars
4
More
Overview
Repositories
Projects
Packages
Stars
RudraDudhat2509
Follow
Rudra Dudhat
RudraDudhat2509
Follow
👋 Hi, I'm Rudra Dudhat
🎓 B.Tech in DSAI@ IIT Bhilai  
📈 Aspiring Quant | Machine Learning Enthusiast | Competitive Kaggler
2
followers
·
1
following
Student @ IIT Bhilai
Navi Mumbai, Maharashtra, IN
https://rudradudhat.github.io
LinkedIn
in/rdudhat-iitbhilai
https://www.kaggle.com/rudrad7
Pinned
Loading
OptiQuant
OptiQuant
Public
Jupyter Notebook
Resume-
Resume-
Public
...

In [6]:
load_dotenv(override=True)
SerpApi_key=os.getenv("SERP_API_KEY")


In [7]:
from serpapi import GoogleSearch

In [54]:
search_query_prompt = """
    Generate 11 search queries for collecting industry insights about the given domain.
    Each query must be on its own line.
    Do not include brackets, quotes, numbering, or Python list syntax.
    Only output the queries as plain text, nothing else.

    The queries should cover:
    - Current market size
    - Leading companies and key players
    - Growth trends
    - Latest innovations
    - Industry challenges and opportunities
    - Emerging startups
    - Recent investments
    - Future forecasts
    - Mergers and acquisitions

    DO NOT INCLUDE ANY YEAR IN THE QUERIES.
    """

In [55]:
def create_search_queries(industry_name):
    """Ask the model for search queries about an industry and return them as a list."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant that outputs only clean search queries in the context of {industry_name}."},
            {"role": "user", "content": search_query_prompt}
        ],
        max_tokens=500,
        temperature=0.7
    )

    # Extract raw text (content can be None, so fall back to an empty string)
    queries_text = response.choices[0].message.content or ""

    # Split lines, strip whitespace, remove empties + duplicates (order preserved)
    queries = [q.strip() for q in queries_text.split("\n") if q.strip()]
    queries = list(dict.fromkeys(queries))

    return queries


In [56]:
create_search_queries('Healthcare')

['current market size of healthcare industry',
 'leading companies in healthcare industry',
 'healthcare industry growth trends',
 'latest innovations in healthcare industry',
 'challenges and opportunities in healthcare industry',
 'emerging healthcare startups',
 'recent investments in healthcare industry',
 'future forecasts for healthcare industry',
 'healthcare industry mergers and acquisitions',
 'key players in healthcare market',
 'healthcare industry market analysis']
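
The prompt asks the model for plain lines without numbering or quotes, but a model can ignore such instructions. The following is a minimal sketch of a more defensive parser, assuming the response might contain bullets, numbering, or surrounding quotes. The helper name `parse_queries` and the sample text are hypothetical and not part of the notebook.

```python
import re

# Matches a bullet ("-", "*", "•") or numbering ("1.", "2)") at the start of a line.
_LEADING_MARKER = re.compile(r"^\s*(?:[-*•]+|\d+[.)])\s+")


def parse_queries(raw_text):
    """Turn raw model output into a de-duplicated list of query strings."""
    queries = []
    for line in raw_text.splitlines():
        # Remove the list marker, then surrounding whitespace and quotes
        cleaned = _LEADING_MARKER.sub("", line).strip().strip("\"'")
        if cleaned:
            queries.append(cleaned)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(queries))


sample = (
    "1. healthcare market size\n"
    "- leading healthcare companies\n"
    "\n"
    "  healthcare market size  \n"
    '"emerging healthcare startups"'
)
parsed = parse_queries(sample)
print(parsed)
```

A query that merely starts with a digit, such as `5G market size`, is left alone because the pattern requires a `.` or `)` after the number.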

The `find_urls` function sends each generated query to Google through SerpAPI, requesting 10 results per query.
It collects the `link` field of every organic result and returns the unique URLs.


In [61]:
def find_urls(industry_name):
    """Return unique result URLs from SerpAPI Google searches for an industry."""
    urls = []

    search_queries = create_search_queries(industry_name)
    print(f"Generated {len(search_queries)} search queries for '{industry_name}'")

    for query in search_queries:
        params = {
            'q': query,
            'api_key': SerpApi_key,
            'engine': 'google',
            'num': 10
        }
        # Handle failures per query so one bad request does not discard
        # the URLs already collected from earlier queries.
        try:
            response = requests.get('https://serpapi.com/search', params=params, timeout=10)
            if response.status_code != 200:
                print(f"Error with query '{query}': {response.status_code}")
                continue
            data = response.json()
        except Exception as e:
            print(f"Error with query '{query}': {e}")
            continue
        for result in data.get('organic_results', []):
            if 'link' in result:
                urls.append(result['link'])

    # dict.fromkeys removes duplicates while keeping first-seen order,
    # unlike set(), whose ordering is arbitrary.
    unique_urls = list(dict.fromkeys(urls))
    print(f"Found {len(unique_urls)} unique URLs for '{industry_name}'")
    return unique_urls
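
`find_urls` removes exact duplicates only. Search results can also contain near-duplicates that differ in a trailing slash, a `#fragment`, or host capitalization. The sketch below shows one possible way to collapse those, assuming such differences do not matter for this pipeline. The helpers `normalize_url` and `dedupe_urls` and the `example.com` URLs are hypothetical and not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Build a comparison key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls):
    """Keep the first URL seen for each normalized key, preserving order."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())


candidates = [
    "https://Example.com/report/",
    "https://example.com/report#summary",
    "https://example.com/other",
]
deduped = dedupe_urls(candidates)
print(deduped)
```

The query string is kept in the key on purpose, since two URLs that differ only in their query parameters may point to different pages.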

In [62]:
def create_df(prompt):
    url_list = find_urls(prompt)
    data = []
    for url in url_list:
        temp=Website(url)
        data.append({"url": url, "title": temp.title, "text": temp.text})
    return pd.DataFrame(data)

In [63]:
prompt="Healthcare"

In [64]:
df=create_df(prompt)

Generated 11 search queries for 'Healthcare'
Found 91 unique URLs for 'Healthcare'
Error fetching https://www.nerdwallet.com/article/investing/best-performing-healthcare-stocks (attempt 1/3): ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Error fetching https://www.pwc.com/gx/en/services/deals/trends/health-industries.html (attempt 1/3): HTTPSConnectionPool(host='www.pwc.com', port=443): Read timed out. (read timeout=10)
Error fetching https://www.pwc.com/gx/en/services/deals/trends/health-industries.html (attempt 2/3): HTTPSConnectionPool(host='www.pwc.com', port=443): Read timed out. (read timeout=10)
Error fetching https://www.mckinsey.com/industries/healthcare/our-insights/how-the-healthcare-industry-can-weather-ongoing-challenges (attempt 1/3): HTTPSConnectionPool(host='www.mckinsey.com', port=443): Read timed out. (read timeout=10)
Error fetching https://www.mckinsey.com/industries/healthcar

In [65]:
df.head()

|   | url | title | text |
|---|-----|-------|------|
| 0 | https://www.netsuite.com/portal/resource/artic... | Access Denied | Access Denied\nYou don't have permission to ac... |
| 1 | https://www.pwc.com/us/en/industries/health-in... | Healthcare trends: PwC\n | Skip to content\nSkip to footer\nFeatured insi... |
| 2 | https://publichealth.tulane.edu/blog/healthcar... | Types of Healthcare Innovation Improving Patie... | Skip to main content\nRequest Info\nApply Now ... |
| 3 | https://www.fidelity.com/learning-center/tradi... | Access Denied | Access Denied\nYou don't have permission to ac... |
| 4 | https://www.modernhealthcare.com/mergers-acqui... | Mergers & Acquisitions - Modern Healthcare | Mergers & Acquisitions\nModern Healthcare repo... |


In [70]:
df.drop(index=df[df['title']=="Access Denied"].index, inplace=True)

In [71]:
df = df.dropna()

In [72]:
df.to_csv(f"rawdata_{prompt}.csv", index=False)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     87 non-null     object
 1   title   87 non-null     object
 2   text    87 non-null     object
dtypes: object(3)
memory usage: 2.2+ KB
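
The cleaning cells above drop rows titled "Access Denied" and rows with missing values. The `Website` class also stores the title "Error" when every retry fails, and those rows would pass that filter. Below is a minimal sketch of a stricter row check written with plain dictionaries, so it runs without pandas. The helper `is_usable`, the `min_chars=200` threshold, and the sample records are assumptions for illustration only.

```python
# Titles that mark a failed or blocked fetch. "Error" is what the Website
# class stores after its final failed attempt.
BLOCKED_TITLES = {"Access Denied", "Error"}


def is_usable(record, min_chars=200):
    """Return True if a scraped record looks like real page content."""
    title = record.get("title")
    text = record.get("text")
    if not title or not text:
        return False  # missing title or text
    if title.strip() in BLOCKED_TITLES:
        return False  # blocked or failed fetch
    return len(text) >= min_chars  # min_chars is an arbitrary, assumed cutoff


# Made-up records shaped like the rows built in create_df
records = [
    {"url": "https://example.com/a", "title": "Access Denied", "text": "Access Denied\nYou don't have permission"},
    {"url": "https://example.com/b", "title": "Error", "text": "Read timed out."},
    {"url": "https://example.com/c", "title": None, "text": "x" * 500},
    {"url": "https://example.com/d", "title": "Market report", "text": "x" * 500},
]
usable = [r for r in records if is_usable(r)]
print([r["url"] for r in usable])
```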
