{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "913ed5f6",
   "metadata": {},
   "source": [
    "## Web Scraping Practice Questions\n",
    "\n",
    "# 1. Basic HTML Request and Parsing \n",
    "- Write a Python program to fetch the HTML content of https://www.geeksforgeeks.org using requests.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "136da42f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "def fetch_geeksforgeeks():\n",
    "    url = \"https://www.geeksforgeeks.org\"\n",
    "    headers = {\n",
    "        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n",
    "    }\n",
    "\n",
    "    try:\n",
    "        response = requests.get(url, headers=headers)\n",
    "\n",
    "        if response.status_code == 200:\n",
    "            print(\"Successfully fetched the content!\")\n",
    "            print(\"-\" * 30)\n",
    "            \n",
    "            print(response.text[:500])\n",
    "        else:\n",
    "            print(f\"Failed to retrieve content. Status code: {response.status_code}\")\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"An error occurred: {e}\")\n",
    "if __name__ == \"__main__\":\n",
    "    fetch_geeksforgeeks()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe26ba4b",
   "metadata": {},
   "source": [
    "- Parse the HTML using BeautifulSoup and print the <title> of the page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7df6df2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def get_page_title():\n",
    "    url = \"https://www.geeksforgeeks.org\"\n",
    "    \n",
    "    headers = {\n",
    "        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n",
    "    }\n",
    "\n",
    "    try:\n",
    "        response = requests.get(url, headers=headers)\n",
    "        \n",
    "        if response.status_code == 200:\n",
    "            soup = BeautifulSoup(response.text, 'html.parser')\n",
    "            if soup.title:\n",
    "                print(f\"Full Tag: {soup.title}\")\n",
    "                print(f\"Page Title: {soup.title.string}\")\n",
    "            else:\n",
    "                print(\"Title tag not found.\")\n",
    "        else:\n",
    "            print(f\"Error: Received status code {response.status_code}\")\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"An error occurred: {e}\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    get_page_title()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4a04f37",
   "metadata": {},
   "source": [
    "- Handle HTTP errors and network exceptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "059fa917",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from requests.exceptions import HTTPError, ConnectionError, Timeout, RequestException\n",
    "\n",
    "def fetch_with_error_handling(url):\n",
    "    headers = {'User-Agent': 'Mozilla/5.0'}\n",
    "    \n",
    "    try:\n",
    "        response = requests.get(url, headers=headers, timeout=10)\n",
    "        response.raise_for_status()\n",
    "        \n",
    "        print(\"Success! Data retrieved.\")\n",
    "        return response.text\n",
    "\n",
    "    except HTTPError as http_err:\n",
    "        print(f\"HTTP error occurred: {http_err}\") \n",
    "    except ConnectionError:\n",
    "        print(\"Error: Could not connect to the server. Check your internet.\")\n",
    "    except Timeout:\n",
    "        print(\"Error: The request timed out.\")\n",
    "    except RequestException as err:\n",
    "        print(f\"An unexpected error occurred: {err}\")\n",
    "    \n",
    "    return None\n",
    "\n",
    "content = fetch_with_error_handling(\"https://www.geeksforgeeks.org\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea703a76",
   "metadata": {},
   "source": [
    "2. Extract Links \n",
    "\n",
    "- Using the parsed HTML from Question 1, extract and print the first 5 hyperlinks (< a > tags) along with their text.\n",
    "- Use both .find() and .find_all() methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8d3b07c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def extract_hyperlinks():\n",
    "    url = \"https://www.geeksforgeeks.org\"\n",
    "    headers = {'User-Agent': 'Mozilla/5.0'}\n",
    "\n",
    "    try:\n",
    "        response = requests.get(url, headers=headers)\n",
    "        response.raise_for_status()\n",
    "        soup = BeautifulSoup(response.text, 'html.parser')\n",
    "\n",
    "        print(\"--- Using .find() (First Link Only) ---\")\n",
    "        first_link = soup.find('a')\n",
    "        if first_link:\n",
    "            print(f\"Text: {first_link.text.strip()} | URL: {first_link.get('href')}\")\n",
    "\n",
    "        print(\"\\n\" + \"=\"*50 + \"\\n\")\n",
    "        print(\"--- Using .find_all() (First 5 Links) ---\")\n",
    "        links = soup.find_all('a', limit=5)\n",
    "\n",
    "        for i, link in enumerate(links, 1):\n",
    "            link_text = link.text.strip()\n",
    "            link_url = link.get('href')\n",
    "            \n",
    "            if not link_text:\n",
    "                link_text = \"[No Visible Text]\"\n",
    "                \n",
    "            print(f\"{i}. Text: {link_text}\")\n",
    "            print(f\"   URL:  {link_url}\")\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"An error occurred: {e}\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    extract_hyperlinks()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39358252",
   "metadata": {},
   "source": [
    "3. Extract Headings\n",
    "- Scrape all < h2 > headings from a webpage and store them in a list.\n",
    "- Scrape all < a > from a webpage and store them in a list\n",
    "- Save the headings to a CSV file named headings.csv.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91595c78",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import csv\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def scrape_and_save_data(url):\n",
    "    headers = {'User-Agent': 'Mozilla/5.0'}\n",
    "    \n",
    "    try:\n",
    "       \n",
    "        response = requests.get(url, headers=headers)\n",
    "        response.raise_for_status()\n",
    "        soup = BeautifulSoup(response.text, 'html.parser')\n",
    "\n",
    "       \n",
    "        h2_headings = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]\n",
    "\n",
    "       \n",
    "        links = []\n",
    "        for link in soup.find_all('a'):\n",
    "            text = link.get_text(strip=True) or \"[No Text]\"\n",
    "            href = link.get('href')\n",
    "            if href:\n",
    "                links.append((text, href))\n",
    "\n",
    "        \n",
    "        with open('headings.csv', mode='w', newline='', encoding='utf-8') as file:\n",
    "            writer = csv.writer(file)\n",
    "            writer.writerow(['Heading Number', 'H2 Text'])  \n",
    "            for index, text in enumerate(h2_headings, 1):\n",
    "                writer.writerow([index, text])\n",
    "        \n",
    "        print(f\"Successfully saved {len(h2_headings)} headings to headings.csv\")\n",
    "        \n",
    "    \n",
    "        print(\"\\nFirst 5 links found:\")\n",
    "        for text, url in links[:5]:\n",
    "            print(f\"- {text}: {url}\")\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"An error occurred: {e}\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    target_url = \"https://www.geeksforgeeks.org\"\n",
    "    scrape_and_save_data(target_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de21ff61",
   "metadata": {},
   "source": [
    "4. Scrape Wikipedia Table\n",
    "- Write a Python program to scrape all rows from the first table on Wikipedia: List of countries by population.\n",
    "- Print each row as a list of cell values.\n",
    "- Ensure proper handling of encoding and exceptions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90443fd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import csv\n",
    "\n",
    "def scrape_wikipedia_population():\n",
    "    \n",
    "    url = \"https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population\"\n",
    "    headers = {\n",
    "        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n",
    "    }\n",
    "\n",
    "    try:\n",
    "        \n",
    "        response = requests.get(url, headers=headers, timeout=10)\n",
    "        response.raise_for_status()\n",
    "        \n",
    "        \n",
    "        response.encoding = 'utf-8'\n",
    "        \n",
    "        soup = BeautifulSoup(response.text, 'html.parser')\n",
    "\n",
    "        \n",
    "        table = soup.find('table', {'class': 'wikitable'})\n",
    "        \n",
    "        if not table:\n",
    "            print(\"Could not find the table on the page.\")\n",
    "            return\n",
    "\n",
    "        \n",
    "        print(f\"{'Row Data':<20}\")\n",
    "        print(\"-\" * 50)\n",
    "\n",
    "        for row in table.find_all('tr'):\n",
    "        \n",
    "            cells = row.find_all(['th', 'td'])\n",
    "            \n",
    "            \n",
    "            row_data = [cell.get_text(strip=True) for cell in cells]\n",
    "            \n",
    "            \n",
    "            if row_data:\n",
    "                print(row_data)\n",
    "\n",
    "    except requests.exceptions.HTTPError as http_err:\n",
    "        print(f\"HTTP error occurred: {http_err}\")\n",
    "    except requests.exceptions.ConnectionError:\n",
    "        print(\"Error: Network connection failed.\")\n",
    "    except Exception as e:\n",
    "        print(f\"An unexpected error occurred: {e}\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    scrape_wikipedia_population()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1372f98",
   "metadata": {},
   "source": [
    "5. Selectors and Navigation\n",
    "- Given the HTML snippet:\n",
    "<html><body>\n",
    "<p class=\"intro\">Welcome</p>\n",
    "<p class=\"intro\">Learn Python</p>\n",
    "<a href=\"https://python.org\">Python</a>\n",
    "</body></html>\n",
    "\n",
    "- Extract all < p > tags with class \"intro\".\n",
    "- Find the parent of the < a > tag.\n",
    "- Pring the next sibling of the first < p > tag.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef458597",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "html_snippet = \"\"\"\n",
    "<html><body>\n",
    "<p class=\"intro\">Welcome</p>\n",
    "<p class=\"intro\">Learn Python</p>\n",
    "<a href=\"https://python.org\">Python</a>\n",
    "</body></html>\n",
    "\"\"\"\n",
    "\n",
    "soup = BeautifulSoup(html_snippet, 'html.parser')\n",
    "\n",
    "intro_paragraphs = soup.find_all('p', class_='intro')\n",
    "print(\"1. Paragraphs with class 'intro':\")\n",
    "for p in intro_paragraphs:\n",
    "    print(f\"   - {p.text}\")\n",
    "\n",
    "anchor_tag = soup.find('a')\n",
    "parent_tag = anchor_tag.parent\n",
    "print(f\"\\n2. Parent of <a> tag: <{parent_tag.name}>\")\n",
    "\n",
    "first_p = soup.find('p')\n",
    "next_sibling = first_p.find_next_sibling()\n",
    "print(f\"\\n3. Next sibling of the first <p>: {next_sibling}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22e70137",
   "metadata": {},
   "source": [
    "6. Tag Manipulation \n",
    "- Using BeautifulSoup, do the following on < b class=\"boldest\">Hello</ b >:\n",
    "    - Change the tag name to < strong >.\n",
    "    - Add an id=\"greeting\" attribute.\n",
    "    - Replace the text \"Hello\" with \"Hi there\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04c6e747",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "html_snippet = '<b class=\"boldest\">Hello</b>'\n",
    "soup = BeautifulSoup(html_snippet, 'html.parser')\n",
    "tag = soup.b\n",
    "tag.name = \"strong\"\n",
    "tag['id'] = \"greeting\"\n",
    "tag.string = \"Hi there\"\n",
    "print(soup)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8f4f4bd",
   "metadata": {},
   "source": [
    "7. Advanced Naivgation \n",
    "- Given an HTML table: \n",
    "<table>\n",
    "<tr><td>Apple</td></tr>\n",
    "<tr><td>Banana</td></tr>\n",
    "</table>\n",
    "    - Find the string \"Apple\" and print its parent < td > tag.\n",
    "    - Print all sibling of the first < td > tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56a5c8bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "html = \"\"\"\n",
    "<table>\n",
    "<tr><td>Apple</td></tr>\n",
    "<tr><td>Banana</td></tr>\n",
    "</table>\n",
    "\"\"\"\n",
    "\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "\n",
    "apple_string = soup.find(string=\"Apple\")\n",
    "parent_td = apple_string.parent\n",
    "print(parent_td)\n",
    "\n",
    "first_td = soup.find('td')\n",
    "siblings = first_td.find_next_siblings()\n",
    "for sibling in siblings:\n",
    "    print(sibling)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7245ab7",
   "metadata": {},
   "source": [
    "9. Using SoupStrainer \n",
    "- Parse only < a > tags from the following HTML using SoupStrainer:\n",
    "< html >\n",
    "< a href=\"page 1.html\">Page 1 < /a >\n",
    "< p >Paragraph < /p >\n",
    "< a href=\"page 1.html\">Page 2< /a >\n",
    "< /html >\n",
    "\n",
    "    - print the parsed result.\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19ea7981",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup, SoupStrainer\n",
    "\n",
    "html = \"\"\"\n",
    "<html>\n",
    "<a href=\"page1.html\">Page 1</a>\n",
    "<p>Paragraph</p>\n",
    "<a href=\"page2.html\">Page 2</a>\n",
    "</html>\n",
    "\"\"\"\n",
    "\n",
    "only_a_tags = SoupStrainer(\"a\")\n",
    "\n",
    "soup = BeautifulSoup(html, 'html.parser', parse_only=only_a_tags)\n",
    "\n",
    "print(soup)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cd1e191",
   "metadata": {},
   "source": [
    "9. Exception Handling \n",
    "- Modify your table scraping program to gracefully handle the following:\n",
    "    - Timeout \n",
    "    - HTTPError\n",
    "    - RequestException\n",
    "    - AttributeError if the table is not found \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71cb88e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def scrape_wikipedia_safe():\n",
    "    url = \"https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population\"\n",
    "    headers = {'User-Agent': 'Mozilla/5.0'}\n",
    "\n",
    "    try:\n",
    "        response = requests.get(url, headers=headers, timeout=5)\n",
    "        response.raise_for_status()\n",
    "        \n",
    "        soup = BeautifulSoup(response.text, 'html.parser')\n",
    "        \n",
    "        table = soup.find('table', {'class': 'wikitable'})\n",
    "        \n",
    "        for row in table.find_all('tr'):\n",
    "            cells = row.find_all(['th', 'td'])\n",
    "            data = [cell.get_text(strip=True) for cell in cells]\n",
    "            print(data)\n",
    "\n",
    "    except requests.exceptions.Timeout:\n",
    "        print(\"Error: The request timed out.\")\n",
    "    except requests.exceptions.HTTPError as e:\n",
    "        print(f\"HTTP Error: {e}\")\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"Network Error: {e}\")\n",
    "    except AttributeError:\n",
    "        print(\"Error: The specified table was not found on the page.\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    scrape_wikipedia_safe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23ea41de",
   "metadata": {},
   "source": [
    "# Discussion \n",
    "The implementation of these web scraping techniques demonstrates the versatility of BeautifulSoup and Requests for data extraction and manipulation. By utilizing the requests library, we established robust connections to web servers while implementing necessary error handling for timeouts, HTTP errors, and network exceptions.\n",
    "\n",
    "Key takeaways from the technical exercises include:\n",
    "\n",
    "Navigation and Selection: We demonstrated that the HTML DOM can be navigated vertically through .parent and horizontally via .find_next_sibling(), allowing for precise data targeting even in complex structures like Wikipedia tables.\n",
    "\n",
    "Efficiency: The use of SoupStrainer highlighted a method for optimizing performance by parsing only specific tags, which significantly reduces memory overhead when processing large-scale HTML documents.\n",
    "\n",
    "Manipulation: Beyond extraction, we showed that BeautifulSoup can dynamically modify the HTML tree by renaming tags, updating attributes, and replacing text content in real-time.\n",
    "\n",
    "Data Persistence: The integration of Python's csv module allowed for the structured storage of scraped headings, bridging the gap between raw web data and usable local files."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0eaa087",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "This series of exercises successfully built a comprehensive toolkit for automated data collection. We progressed from basic HTTP requests to advanced tree navigation and efficient parsing strategies. By incorporating structured error handling and Git version control, the workflow ensures that the scraping process is not only functional but also professional and reproducible. The ability to transform raw HTML into structured formats like CSV or lists provides a critical foundation for further data analysis and machine learning applications."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.13.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}