# **Tech_Intel** (Web Scraped Text Dataset)

## Overview
Tech_Intel is a collection of categorized text scraped from public websites. It covers 20 categories spanning technology, finance, science, and related domains. Each category has its own folder, which holds the text extracted from relevant articles and blogs.

## Data Collection
The dataset was created with a Python web scraper built on the `requests` and `BeautifulSoup` libraries. For each predefined category, the scraper fetched the listed websites in order and extracted text from `<article>`, `<div>`, and `<section>` containers, pairing each container's first heading (`<h1>` to `<h3>`) with its first `<p>` or `<span>`. Only the first website per category that returned usable content was saved.
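
To make the extraction rule concrete, here is a minimal sketch that applies the same tag lookups to a small inline HTML string. The sample markup and the `extract_pairs` helper are hypothetical and only illustrate the behavior; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any scraped site
SAMPLE_HTML = """
<article><h2>First post</h2><p>Intro text.</p><p>More text.</p></article>
<section><h3>Second post</h3><span>Summary.</span></section>
<div><p>No heading here.</p></div>
"""

def extract_pairs(html):
    """Return (title, text) pairs using the scraper's container and tag choices."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for container in soup.find_all(["article", "div", "section"]):
        title = container.find(["h1", "h2", "h3"])   # first heading in the container
        content = container.find(["p", "span"])      # first paragraph or span
        if title and content:
            pairs.append((title.get_text(strip=True), content.get_text(strip=True)))
    return pairs

pairs = extract_pairs(SAMPLE_HTML)
```

In this sample, the second paragraph of the first article is not captured because only the first `<p>` or `<span>` is read, and the `<div>` is skipped because it has no heading.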

## Dataset Structure
- **Base Directory:** `tech_intel`
- **Category Folders:** Each category has a separate folder named after it, lowercased and with spaces replaced by underscores (e.g., `technology_news_&_trends`, `machine_learning_&_data_science`).
- **Data Files:** Inside each category folder, a `data.txt` file contains the extracted entries for that category, each stored as a `Title:` line followed by its text.
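
The layout above can be read back with a few lines of standard-library Python. This is a sketch under stated assumptions: `category_folder` and `load_entries` are hypothetical helper names, and the parser assumes every entry starts with a `Title: ` line, which is how the scraper writes them.

```python
import os
import tempfile

def category_folder(base_dir, category):
    """Map a category name to its folder, mirroring the notebook's naming rule."""
    return os.path.join(base_dir, category.replace(" ", "_").lower())

def load_entries(data_path):
    """Parse a data.txt file into (title, text) pairs.

    Assumption: a line beginning with "Title: " always starts a new entry.
    """
    entries = []
    with open(data_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("Title: "):
                entries.append([line[len("Title: "):], ""])
            elif entries and line:
                # Append continuation lines to the current entry's text
                entries[-1][1] = (entries[-1][1] + " " + line).strip()
    return [tuple(entry) for entry in entries]

# Small demonstration on a temporary directory with made-up content
with tempfile.TemporaryDirectory() as tmp:
    folder = category_folder(tmp, "Quantum Computing")
    os.makedirs(folder)
    demo_path = os.path.join(folder, "data.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("Title: A\nbody A\n\nTitle: B\nbody B\n")
    demo_entries = load_entries(demo_path)
```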



### Categories and Sources
The dataset covers the following categories. The scraper tried each category's sources in the order listed:

1. **Technology News & Trends**
   - TechCrunch, Wired, The Verge
2. **Artificial Intelligence & Deep Learning**
   - AI News, OpenAI Blog, MIT Technology Review
3. **Machine Learning & Data Science**
   - Towards Data Science, Google AI Blog, Fast.ai
4. **Cybersecurity & Ethical Hacking**
   - Krebs on Security, The Hacker News, Dark Reading
5. **Finance & Stock Market**
   - Investopedia, Bloomberg, CNBC Markets
6. **Cryptocurrency & Blockchain**
   - CoinDesk, CoinTelegraph, CryptoSlate
7. **Healthcare AI & Medical Research**
   - Mayo Clinic, WebMD, Healthline
8. **Quantum Computing**
   - IBM Research, Quanta Magazine, MIT CSAIL
9. **Space & Astronomy (Tech in Space)**
   - NASA, ESA, Space.com
10. **Education & Online Learning Platforms**
    - EdSurge, Coursera Blog, Harvard Online Learning
11. **Robotics & Automation**
    - Robotics Business Review, IEEE RAS, Boston Dynamics
12. **Semiconductor Industry & Chip Design**
    - TSMC, Intel, ARM
13. **Cloud Computing & DevOps**
    - AWS, Google Cloud, Microsoft Azure
14. **Big Data & Analytics**
    - Cloudera, Google Cloud Big Data Blog, Databricks
15. **Business & Startups**
    - Y Combinator, Forbes Startups, TechCrunch Startups
16. **Computer Vision & Image Processing**
    - CVPR, Google Vision, Facebook AI
17. **Software Development & Programming**
    - Stack Overflow, GitHub Blog, GeeksforGeeks
18. **Networking & Internet Technologies**
    - Cisco Blogs, Cloudflare Blog, Ars Technica
19. **Travel & Tech for Students**
    - Lonely Planet, National Geographic, TripAdvisor
20. **Research Papers & Open Science**
    - arXiv, Google Scholar, Springer Open

## Dataset Size
The dataset size depends on how many articles could be extracted and on how much structured content each website exposed. Each category's `data.txt` stores its extracted entries as plain text, so the number of entries differs from one category to the next.
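
Since no fixed figures are given, the size can be measured directly from a local copy. The following is a minimal sketch, assuming the folder layout described above; `summarize_dataset` is a hypothetical helper, and it counts entries by their `Title: ` lines.

```python
import os
import tempfile

def summarize_dataset(base_dir):
    """Return {folder_name: (size_in_bytes, entry_count)} for each category folder."""
    summary = {}
    for name in sorted(os.listdir(base_dir)):
        data_path = os.path.join(base_dir, name, "data.txt")
        if not os.path.isfile(data_path):
            continue  # folder exists but holds no data.txt
        with open(data_path, encoding="utf-8") as fh:
            # Each stored entry begins with a "Title: " line
            count = sum(1 for line in fh if line.startswith("Title: "))
        summary[name] = (os.path.getsize(data_path), count)
    return summary

# Small demonstration on a temporary directory with made-up content
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "quantum_computing"))
    os.makedirs(os.path.join(tmp, "empty_category"))
    with open(os.path.join(tmp, "quantum_computing", "data.txt"), "w", encoding="utf-8") as fh:
        fh.write("Title: A\nbody\n\nTitle: B\nbody\n")
    demo_summary = summarize_dataset(tmp)
```

Calling `summarize_dataset("tech_intel")` on a real copy would give one `(bytes, entries)` pair per category that has a `data.txt`.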

In [5]:
categories = {
    "Technology News & Trends": 
        [
            "https://techcrunch.com", 
            "https://www.wired.com", 
            "https://www.theverge.com"
        ],
    "Artificial Intelligence & Deep Learning": 
        [
            "https://www.artificialintelligence-news.com", 
            "https://openai.com/blog", 
            "https://www.technologyreview.com/topic/artificial-intelligence"
        ],
    "Machine Learning & Data Science": 
        [
            "https://towardsdatascience.com", 
            "https://ai.googleblog.com", 
            "https://www.fast.ai"
        ],
    "Cybersecurity & Ethical Hacking": 
        [
            "https://krebsonsecurity.com", 
            "https://thehackernews.com", 
            "https://www.darkreading.com"
        ],
    "Finance & Stock Market": 
        [
            "https://www.investopedia.com", 
            "https://www.bloomberg.com", 
            "https://www.cnbc.com/markets"
        ],
    "Cryptocurrency & Blockchain": 
        [
            "https://www.coindesk.com", 
            "https://www.cointelegraph.com", 
            "https://www.cryptoslate.com"
        ],
    "Healthcare AI & Medical Research": 
        [
            "https://www.mayoclinic.org", 
            "https://www.webmd.com", 
            "https://www.healthline.com"
        ],
    "Quantum Computing": 
        [
            "https://research.ibm.com/blog/quantum", 
            "https://www.quantamagazine.org", 
            "https://www.csail.mit.edu/research/quantum-computing"
        ],
    "Space & Astronomy (Tech in Space)": 
        [
            "https://www.nasa.gov", 
            "https://www.esa.int", 
            "https://www.space.com"
        ],
    "Education & Online Learning Platforms": 
        [
            "https://www.edsurge.com", 
            "https://blog.coursera.org", 
            "https://online-learning.harvard.edu"
        ],
    "Robotics & Automation": 
        [
            "https://www.roboticsbusinessreview.com", 
            "https://www.ieee-ras.org", 
            "https://www.bostondynamics.com"
        ],
    "Semiconductor Industry & Chip Design": 
        [
            "https://www.tsmc.com/english/news", 
            "https://www.intel.com", 
            "https://www.arm.com"
        ],
    "Cloud Computing & DevOps": 
        [
            "https://aws.amazon.com/blogs", 
            "https://cloud.google.com/blog", 
            "https://azure.microsoft.com/en-us/blog/"
        ],
    "Big Data & Analytics": 
        [
            "https://blog.cloudera.com", 
            "https://cloud.google.com/blog/topics/big-data", 
            "https://databricks.com/blog"
        ],
    "Business & Startups": 
        [
            "https://www.ycombinator.com/blog/", 
            "https://www.forbes.com/startups/", 
            "https://techcrunch.com/startups/"
        ],
    "Computer Vision & Image Processing": 
        [
            "https://cvpr2024.thecvf.com", 
            "https://research.google.com/teams/vision/", 
            "https://ai.facebook.com/blog/tag/computer-vision/"
        ],
    "Software Development & Programming": 
        [
            "https://stackoverflow.blog", 
            "https://github.blog", 
            "https://www.geeksforgeeks.org"
        ],
    "Networking & Internet Technologies": 
        [
            "https://blogs.cisco.com", 
            "https://blog.cloudflare.com", 
            "https://arstechnica.com/information-technology/"
        ],
    "Travel & Tech for Students": 
        [
            "https://www.lonelyplanet.com", 
            "https://www.nationalgeographic.com/travel", 
            "https://www.tripadvisor.com"
        ],
    "Research Papers & Open Science": 
        [
            "https://arxiv.org", 
            "https://scholar.google.com", 
            "https://www.springeropen.com"
        ]
}

In [6]:
import requests
from bs4 import BeautifulSoup
import os

In [7]:
base_dir = "tech_intel"
os.makedirs(base_dir, exist_ok=True)

In [8]:
for category in categories:
    category_path = os.path.join(base_dir, category.replace(' ', '_').lower())
    os.makedirs(category_path, exist_ok=True)

In [None]:
def scrape_and_store(category, urls):
    """Try each URL in order and save the first successful scrape to data.txt."""
    filename = os.path.join(base_dir, category.replace(' ', '_').lower(), "data.txt")
    for url in urls:
        try:
            # The timeout keeps one unresponsive site from stalling the whole run
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
            if response.status_code != 200:
                continue  # Try the next source for this category
            soup = BeautifulSoup(response.text, "html.parser")
            text_data = []
            # Article content may sit in any of these container tags
            for article in soup.find_all(['article', 'div', 'section']):
                # Only the first heading and the first paragraph/span of each container are kept
                title = article.find(['h1', 'h2', 'h3'])
                content = article.find(['p', 'span'])
                if title and content:
                    text_data.append(f"Title: {title.get_text(strip=True)}\n{content.get_text(strip=True)}\n")

            if text_data:
                with open(filename, "w", encoding="utf-8") as file:
                    file.write("\n".join(text_data))
                print(f"Data saved for {category} from {url}")
                return  # Stop after the first successful scrape
        except Exception as e:
            # Report the failure and move on to the next URL
            print(f"Error scraping {url}: {e}")
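
One thing to keep in mind when reusing this function: `find_all` searches recursively, so a heading and paragraph nested inside several matching containers (for example an `<article>` inside a `<div>`) can be collected once per enclosing container. Whether the stored files contain such repeats is not stated here. The sketch below shows one possible post-processing step; `dedupe_entries` is a hypothetical helper and the sample strings are made up.

```python
def dedupe_entries(entries):
    """Drop repeated entries while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys keeps the first copy of each entry
    return list(dict.fromkeys(entries))

raw = [
    "Title: A\nfirst paragraph\n",   # matched via an outer <div>
    "Title: A\nfirst paragraph\n",   # same pair matched again via the inner <article>
    "Title: B\nanother paragraph\n",
]
unique = dedupe_entries(raw)
```

Applied to `text_data` just before writing, this would keep one copy of each title and text pair.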

In [10]:
for category, urls in categories.items():
    scrape_and_store(category, urls)

Data saved for Technology News & Trends from https://techcrunch.com
Data saved for Artificial Intelligence & Deep Learning from https://www.artificialintelligence-news.com
Data saved for Machine Learning & Data Science from https://towardsdatascience.com
Data saved for Cybersecurity & Ethical Hacking from https://krebsonsecurity.com
Data saved for Finance & Stock Market from https://www.investopedia.com
Data saved for Cryptocurrency & Blockchain from https://www.coindesk.com
Data saved for Healthcare AI & Medical Research from https://www.mayoclinic.org
Data saved for Quantum Computing from https://www.quantamagazine.org
Data saved for Space & Astronomy (Tech in Space) from https://www.nasa.gov
Data saved for Education & Online Learning Platforms from https://blog.coursera.org
Data saved for Robotics & Automation from https://www.roboticsbusinessreview.com
Data saved for Semiconductor Industry & Chip Design from https://www.arm.com
Data saved for Cloud Computing & DevOps from https://a