# Problem Statement for Web Scraping of Information

## Web Scraping

- Choose news websites (e.g., BBC, The Hindu, Times Now, CNN) and use web scraping tools or libraries (e.g., BeautifulSoup, Selenium) to extract news articles.
- Retrieve the title and content of each news article. Ensure that you have a diverse dataset covering various topics.

# Indian Express News Scraper

This Python script scrapes news articles from the Business, Entertainment, Sports, Politics, Lifestyle, Education, and Technology sections of the Indian Express website and stores the scraped data in a CSV file.

## Prerequisites

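The script uses the following Python libraries. `requests` and `BeautifulSoup` must be installed; `csv` ships with the Python standard library.

- `requests`: Used to make HTTP requests.
- `BeautifulSoup`: A library for pulling data out of HTML and XML files (installed as the `beautifulsoup4` package and imported from `bs4`).
- `csv`: A standard-library module for reading and writing CSV files.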
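One way to confirm that the modules can be imported before running the scraper is sketched below. The helper name `missing_modules` is hypothetical and not part of the original script; `bs4` is the import name of the BeautifulSoup package.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names used by the script: requests, bs4 (BeautifulSoup), and csv.
missing = missing_modules(['requests', 'bs4', 'csv'])
if missing:
    print('Missing modules:', ', '.join(missing))
else:
    print('All required modules are available.')
```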

# Understanding the Code

The script is organized into a few functions that navigate the site's sections, scrape article details, and save the data to a CSV file. The main components are:

## `scrape_news(url, section_label, num_pages=3)`

This function serves as the core of the scraper. It extracts news articles from a specified section of the Indian Express website. The key parameters are:

- `url`: The URL of the section to be scraped.
- `section_label`: The label or category of the section, e.g., Business, Entertainment, etc.
- `num_pages`: The number of pages to scrape for the specified section (default is 3).

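The function iterates through the specified number of pages, building each page URL as `{url}page/{page_number}/`. For every page it calls `scrape_technology_page` when the section label is Technology and `scrape_page` otherwise, then returns the combined rows.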
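The pagination pattern can be illustrated in isolation. This is a minimal sketch, assuming the section URL ends with a slash as it does in the script; the helper name `build_page_urls` is hypothetical.

```python
def build_page_urls(section_url, num_pages=3):
    """Return the paginated URLs for a section, starting at page 1."""
    return [f'{section_url}page/{page_number}/'
            for page_number in range(1, num_pages + 1)]


# Build the URLs for the first two pages of the Business section.
urls = build_page_urls('https://indianexpress.com/section/business/', num_pages=2)
for page_url in urls:
    print(page_url)
```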

## `scrape_technology_page(url)`

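This function is specifically designed for the Technology section, whose pages use a different layout. It finds the `ul` element with class `article-list`, follows the link inside each item's `h3`, and collects `[title, content, 'Technology']` rows for the articles it finds.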
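The link lookup for this layout can be tried on a small piece of markup without touching the network. The HTML below is hypothetical and only mimics the structure the function searches for; the helper name `extract_list_links` is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the function looks for.
SAMPLE_HTML = """
<ul class="article-list">
  <li><h3><a href="https://example.com/a1">First</a></h3></li>
  <li><p>No headline here</p></li>
  <li><h3><a href="https://example.com/a2">Second</a></h3></li>
</ul>
"""


def extract_list_links(html):
    """Return the href of the headline link in each list item."""
    soup = BeautifulSoup(html, 'html.parser')
    article_list = soup.find('ul', class_='article-list')
    if article_list is None:
        # The page does not use the expected layout.
        return []
    links = []
    for item in article_list.find_all('li'):
        heading = item.find('h3')
        anchor = heading.find('a', href=True) if heading else None
        if anchor is not None:
            links.append(anchor['href'])
    return links


links = extract_list_links(SAMPLE_HTML)
print(links)
```

Checking for `None` at each step keeps a missing list or headline from raising an `AttributeError`.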

## `scrape_page(url, section_label)`

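A generic function used for non-Technology sections. It selects article links on the page with the CSS selector `.img-context h2.title a`, scrapes each linked article, and keeps only the articles for which both a title and content were found.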
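The CSS selector can be exercised on sample markup as follows. This is a sketch only: the HTML is hypothetical, and `extract_section_links` is not a function from the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching article blocks and one that should be ignored.
SAMPLE_HTML = """
<div class="articles">
  <div class="img-context"><h2 class="title"><a href="https://example.com/b1">One</a></h2></div>
  <div class="img-context"><h2 class="title"><a href="https://example.com/b2">Two</a></h2></div>
  <div class="other"><h2 class="title"><a href="https://example.com/skip">Skip</a></h2></div>
</div>
"""


def extract_section_links(html):
    """Return hrefs of links matched by the selector used for generic sections."""
    soup = BeautifulSoup(html, 'html.parser')
    return [anchor['href']
            for anchor in soup.select('.img-context h2.title a')
            if anchor.has_attr('href')]


section_links = extract_section_links(SAMPLE_HTML)
print(section_links)
```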

## `scrape_article(article_url)`

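Extracts the title and content of an individual article. It fetches the HTML of the article URL and parses it with BeautifulSoup, reading the title from the `h1` element with class `native_story_title` and joining the text of the paragraphs inside the `div` with class `full-details`. Either value is `None` when its element is missing from the page.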
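The parsing step can be separated from the HTTP request, which makes it easy to check against a fixed document. The markup below is a hypothetical stand-in for an article page, and `parse_article` is an illustrative name rather than part of the script.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup using the class names the script searches for.
SAMPLE_ARTICLE = """
<h1 itemprop="headline" class="native_story_title"> Sample headline </h1>
<div class="full-details">
  <p>First paragraph.</p>
  <p> Second paragraph. </p>
</div>
"""


def parse_article(html):
    """Return (title, content) from article HTML, using None for missing parts."""
    soup = BeautifulSoup(html, 'html.parser')

    title_element = soup.find('h1', class_='native_story_title')
    title = title_element.get_text().strip() if title_element else None

    content_element = soup.find('div', class_='full-details')
    if content_element is None:
        return title, None

    paragraphs = [p.get_text().strip() for p in content_element.find_all('p')]
    return title, '\n'.join(paragraphs)


title, content = parse_article(SAMPLE_ARTICLE)
print(title)
print(content)
```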

## `scrape_homepage_sections(base_url)`

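This function iterates through the Business, Entertainment, Sports, Politics, Lifestyle, Education, and Technology sections of the Indian Express website. It builds each section URL from `base_url` and calls `scrape_news` with `num_pages=5`, which overrides the default of 3. The Politics label maps to the site's `political-pulse` path.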
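The mapping from section labels to URLs might be built as sketched below. The paths are the ones used in the script; the helper `build_section_urls` and the use of `urljoin` are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Section labels and the site paths they map to, as used in the script.
SECTION_PATHS = {
    'Business': 'section/business/',
    'Entertainment': 'section/entertainment/',
    'Sports': 'section/sports/',
    'Politics': 'section/political-pulse/',
    'Lifestyle': 'section/lifestyle/',
    'Education': 'section/education/',
    'Technology': 'section/technology/',
}


def build_section_urls(base_url):
    """Return a label -> URL mapping for every section."""
    # Ensure a trailing slash so the relative path is appended to the base.
    if not base_url.endswith('/'):
        base_url += '/'
    return {label: urljoin(base_url, path)
            for label, path in SECTION_PATHS.items()}


section_urls = build_section_urls('https://indianexpress.com')
for label, url in section_urls.items():
    print(label, url)
```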

## `homepage_url`

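This variable stores the URL of the Indian Express homepage. The section URLs are built by appending paths to it, so it must end with a slash. The selectors in the parsing functions are specific to the Indian Express, so scraping a different source requires changing those functions as well as this URL.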

## Writing to CSV

The script writes the scraped data to a CSV file named `indian_express_combined_news_1.csv`. The CSV file contains columns for the title, content, and section of each article. You can customize the filename in the script as needed.
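The CSV layout can be checked without creating a file by writing to an in-memory buffer. This is a small sketch with a made-up row; `rows_to_csv_text` is a hypothetical helper, while the header matches the one in the script.

```python
import csv
import io

HEADER = ['Title', 'Content', 'Section']


def rows_to_csv_text(rows):
    """Serialize rows to CSV text with the same header the script writes."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row; the content contains a newline, as scraped articles do.
sample_rows = [['Sample title', 'Line one\nLine two', 'Business']]
csv_text = rows_to_csv_text(sample_rows)

# Reading the text back shows that multi-line content survives the round trip.
parsed = list(csv.reader(io.StringIO(csv_text, newline='')))
print(parsed)
```

Because article content is joined with newlines, fields are quoted by the `csv` module; reading the file back with `csv.reader` rather than splitting on lines keeps each article in one row.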

**Note:** Ensure that you comply with the website's terms of service and legal guidelines when using web scraping tools. Misuse or unauthorized scraping may violate website policies.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv

def scrape_news(url, section_label, num_pages=3):
    all_data = []

    for page_number in range(1, num_pages + 1):
        page_url = f'{url}page/{page_number}/'
        print(f'Scraping section: {section_label}, page: {page_number}')
        if section_label.lower() == 'technology':
            page_data = scrape_technology_page(page_url)
        else:
            page_data = scrape_page(page_url, section_label)
        all_data.extend(page_data)

    return all_data

def scrape_technology_page(url):
    # Set a timeout so a stalled connection does not hang the scraper
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

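    # Find the article list on the page; return no rows if the layout differs
    article_list = soup.find('ul', class_='article-list')
    if article_list is None:
        return []
    article_items = article_list.find_all('li')

    # Initialize a list to store the data
    data = []

    # Iterate over each article
    for item in article_items:
        # Skip list items that do not contain a headline link
        heading = item.find('h3')
        link = heading.find('a', href=True) if heading else None
        if link is None:
            continue
        article_title, article_content = scrape_article(link['href'])

        # Append the results only when both title and content were found
        if article_title and article_content:
            data.append([article_title, article_content, 'Technology'])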

    return data

def scrape_page(url, section_label):
    # Set a timeout so a stalled connection does not hang the scraper
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all article links on the page
    article_links = soup.select('.img-context h2.title a')

    # Initialize a list to store the data
    data = []

    # Iterate over each article link
    for link in article_links:
        article_url = link['href']
        article_title, article_content = scrape_article(article_url)

        # Append the results to the data list
        if article_title and article_content:  # Check if title and content are not None
            data.append([article_title, article_content, section_label])

    return data

def scrape_article(article_url):
    # Set a timeout so a stalled connection does not hang the scraper
    response = requests.get(article_url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract title
    title_element = soup.find('h1', itemprop='headline', class_='native_story_title')
    title = title_element.text.strip() if title_element else None

    # Extract paragraphs
    content_element = soup.find('div', class_='full-details')
    content = '\n'.join([paragraph.text.strip() for paragraph in content_element.find_all('p')]) if content_element else None

    return title, content

def scrape_homepage_sections(base_url):
    sections = {
        'Business': f'{base_url}section/business/',
        'Entertainment': f'{base_url}section/entertainment/',
        'Sports': f'{base_url}section/sports/',
        'Politics': f'{base_url}section/political-pulse/',
        'Lifestyle': f'{base_url}section/lifestyle/',
        'Education': f'{base_url}section/education/',
        'Technology': f'{base_url}section/technology/',
    }

    all_data = []

    for label, url in sections.items():
        section_data = scrape_news(url, label, num_pages=5)
        all_data.extend(section_data)

    return all_data

# Indian Express homepage URL
homepage_url = 'https://indianexpress.com/'

# Scrape news from different sections on the homepage with pagination
all_data = scrape_homepage_sections(homepage_url)

# Write the data to a CSV file
csv_filename = 'indian_express_combined_news_1.csv'
header = ['Title', 'Content', 'Section']

with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    
    # Write the header
    csv_writer.writerow(header)
    
    # Write the data
    csv_writer.writerows(all_data)

print(f'Data has been written to {csv_filename}.')


Scraping section: Business, page: 1
Scraping section: Business, page: 2
Scraping section: Business, page: 3
Scraping section: Business, page: 4
Scraping section: Business, page: 5
Scraping section: Entertainment, page: 1
Scraping section: Entertainment, page: 2
Scraping section: Entertainment, page: 3
Scraping section: Entertainment, page: 4
Scraping section: Entertainment, page: 5
Scraping section: Sports, page: 1
Scraping section: Sports, page: 2
Scraping section: Sports, page: 3
Scraping section: Sports, page: 4
Scraping section: Sports, page: 5
Scraping section: Politics, page: 1
Scraping section: Politics, page: 2
Scraping section: Politics, page: 3
Scraping section: Politics, page: 4
Scraping section: Politics, page: 5
Scraping section: Lifestyle, page: 1
Scraping section: Lifestyle, page: 2
Scraping section: Lifestyle, page: 3
Scraping section: Lifestyle, page: 4
Scraping section: Lifestyle, page: 5
Scraping section: Education, page: 1
Scraping section: Education, page: 2
Scrapi