## **Scraping E-News Articles Using Scrapy**

## **1. Project Overview**

### 1.1 Project Objectives

### **Motivation**

The primary objective of this project is to build a comprehensive dataset for training an abstractive text summarization model specifically for the Telugu language. In the landscape of natural language processing (NLP), quality training data is the cornerstone of developing accurate and nuanced language models.

### **Data Collection Strategy**

Our approach focuses on aggregating diverse textual content from multiple Telugu-language sources:

*   E-news articles

### **Key Objectives**

*   **Dataset Diversity:** Collect a wide range of Telugu language content to ensure the summarization model captures linguistic variations and contextual nuances.
*   **Volume and Quality:** Gather a substantial corpus of high-quality, contextually rich Telugu text.
*   **Source Breadth:** Extract content from multiple domains to provide comprehensive language representation.

### **Technical Approach**

By leveraging web scraping techniques, particularly the Scrapy framework, we will:

*   Systematically crawl identified Telugu language websites
*   Extract full-text articles
*   Preprocess and clean collected data
*   Prepare the dataset for machine learning model training
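
As a rough illustration of the "preprocess and clean" step, the sketch below normalizes whitespace and estimates how much of a text is written in Telugu script. The function names and the 0.5 threshold are assumptions made for this example, not part of the project code; the only fixed fact used is that the Telugu Unicode block spans U+0C00 to U+0C7F.

```python
import re
import unicodedata

# Telugu Unicode block: U+0C00 to U+0C7F
TELUGU_CHAR = re.compile(r"[\u0C00-\u0C7F]")
WHITESPACE_RUN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    normalized = unicodedata.normalize("NFC", text)
    return WHITESPACE_RUN.sub(" ", normalized).strip()


def telugu_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Telugu block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    telugu = sum(1 for c in chars if TELUGU_CHAR.match(c))
    return telugu / len(chars)


def is_mostly_telugu(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep texts that are at least `threshold` Telugu."""
    return telugu_ratio(text) >= threshold
```

A filter like this could be used to drop pages that are mostly navigation text or English boilerplate before they reach the training set.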

### **Expected Outcomes**

*   A robust, clean dataset of Telugu language texts
*   A foundation for developing an advanced abstractive summarization model
*   Contribution to Telugu language NLP research

## **2. Setting Up the Scraping Environment in Colab**

In this section, we install the Scrapy library and verify the installation. Finally, we initialize a Scrapy project, which creates the project directory with Scrapy's standard layout and template files.

#### **What is Scrapy?**

---


  **Scrapy** is a robust, open-source web scraping framework built in Python. Its strength lies in its ability to automate web crawling, navigate complex website structures, and extract large volumes of data. Scrapy's asynchronous architecture and built-in features for handling requests and data pipelines make it a preferred tool for applications such as data mining, market research, content aggregation, and more.


---






### 2.1 Scrapy Installation

#### Command: `!pip install scrapy`
- **Purpose**: Installs the Scrapy library in your Python environment
- **Breakdown**:
  - `!` tells Jupyter/Colab to run the rest of the line as a shell command; `pip` is Python's package installer
  - `install` is the pip subcommand for installing packages
  - `scrapy` is the web scraping framework being installed
- **What Happens**:
  - Downloads Scrapy from Python Package Index (PyPI)
  - Installs Scrapy and its dependencies
  - Ensures the library is ready for use in your project


In [None]:
%%capture output --no-stderr
# Install Scrapy (%%capture is a cell magic and must be the first line of the cell)
!pip install scrapy

### 2.2 Version Verification

#### Command: `!scrapy version`
- **Purpose**: Confirms Scrapy installation and displays the installed version

In [None]:
!scrapy version

Scrapy 2.12.0
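
The same check can be made from Python rather than the shell. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("Scrapy version:", installed_version("scrapy") or "not installed")
```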



### 2.3 Project Creation

#### Command: `!scrapy startproject telugu_news_scraper`
- **Purpose**: Initializes a new Scrapy project with standard directory structure
- **What This Command Does**:
  - Creates a new project directory `telugu_news_scraper`
  - Generates essential project files and folders:
    ```
    telugu_news_scraper/
    │
    ├── scrapy.cfg            # Deployment configuration
    │
    └── telugu_news_scraper/
        ├── __init__.py       # Makes directory a Python package
        ├── items.py          # Define data extraction structures
        ├── middlewares.py    # Custom request/response processing
        ├── pipelines.py      # Data processing after extraction
        ├── settings.py       # Project-wide settings
        │
        └── spiders/          # Directory for crawler scripts
            └── __init__.py
    ```


In [None]:
# start scrapy project
!scrapy startproject telugu_news_scraper

New Scrapy project 'telugu_news_scraper', using template directory '/usr/local/lib/python3.11/dist-packages/scrapy/templates/project', created in:
    /content/telugu_news_scraper

You can start your first spider with:
    cd telugu_news_scraper
    scrapy genspider example example.com


### 2.4 Project Structure Explanation
- **scrapy.cfg**: Project-level configuration
- **items.py**: Defines the data model for extracted items
- **middlewares.py**: Custom request/response processing
- **pipelines.py**: Data cleaning and processing
- **settings.py**: Global project settings
- **spiders/**: Directory to store your web crawlers
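
To confirm that `startproject` produced the layout described above, a short check like the following can help. It is only a sketch: the helper name `missing_project_files` is hypothetical, and the path passed at the end assumes the Colab location shown in the command output.

```python
from pathlib import Path
from typing import List

# Files that `scrapy startproject telugu_news_scraper` is expected to create,
# relative to the outer project directory.
EXPECTED_FILES = [
    "scrapy.cfg",
    "telugu_news_scraper/__init__.py",
    "telugu_news_scraper/items.py",
    "telugu_news_scraper/middlewares.py",
    "telugu_news_scraper/pipelines.py",
    "telugu_news_scraper/settings.py",
    "telugu_news_scraper/spiders/__init__.py",
]


def missing_project_files(project_root: str) -> List[str]:
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# An empty list means every expected file is present.
print(missing_project_files("/content/telugu_news_scraper"))
```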


In [None]:
# Navigate to the project directory.
%cd telugu_news_scraper/telugu_news_scraper/

/content/telugu_news_scraper/telugu_news_scraper


In [None]:
!pwd

/content/telugu_news_scraper/telugu_news_scraper


In [None]:
!touch

/content


## **3. Defining the News Article Data Model (items.py)**



### 3.1. Introduction to Scrapy Items

When you dive into the world of web scraping, you quickly realize that the internet is a wild, unstructured jungle of information. Enter Scrapy items - the Swiss Army knife of data extraction.



> **What are Scrapy Items?**


*   An Item is a container that holds scraped data. Scrapy supports several item types, including dictionaries, `scrapy.Item` subclasses, dataclasses, and attrs classes.
*   Think of Scrapy items as your digital data organizer. When you're crawling through websites, pulling information from every nook and cranny, you need a reliable way to collect and structure that data. That's exactly what items do.



> **The Magic of scrapy.Field()**





`scrapy.Field()` is simply an alias for Python's built-in `dict` class. Declaring a field on an item registers the field name, and any keyword arguments passed to `Field()` are stored as metadata for that field.

Imagine you're collecting puzzle pieces from different websites. The fields you declare are like a custom-designed puzzle board that ensures every piece finds its place: an item only accepts the fields it declares, so a mistyped field name raises an error instead of silently creating a new key.



> While scraping data from the internet, spiders return information in a raw, often messy form. Items give that data a fixed shape, so later stages such as item pipelines can clean, validate, and export it consistently.

* **Structural Integrity:** Converts loosely structured web data into records with a known set of fields
* **Flexibility:** Fields can hold any Python value, such as strings, lists, or nested dictionaries
* **Processing Power:** Gives item pipelines a predictable structure to clean and transform




In [None]:
# Define here the models for your scraped items

with open('items.py', 'w') as file:
    file.write(
        """import scrapy

class TeluguNewsScraperItem(scrapy.Item):
    \"\"\"
    Define the structure of scraped items with minimal fields.
    Only extracting URL, title, and text as requested.
    \"\"\"
    url = scrapy.Field()    # Full URL of the article
    title = scrapy.Field()  # Title of the article
    text = scrapy.Field()   # Full text content of the article
    """
    )
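
Since Scrapy also accepts dataclass objects as items, the same three-field model could be written without `scrapy.Item`. The sketch below is an alternative for illustration only; the class name `ArticleRecord` and the sample values are hypothetical and are not used elsewhere in this notebook.

```python
from dataclasses import asdict, dataclass


@dataclass
class ArticleRecord:
    """Hypothetical dataclass counterpart of TeluguNewsScraperItem."""
    url: str    # Full URL of the article
    title: str  # Title of the article
    text: str   # Full text content of the article


record = ArticleRecord(
    url="https://example.com/article-1",
    title="Sample title",
    text="Sample body text.",
)
print(asdict(record))
```

Like a `scrapy.Item`, the dataclass rejects undeclared fields: passing an unexpected keyword argument to the constructor raises a `TypeError`.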








## **4. Define Spider**

### 4.1 Introduction to Spiders

> **What is a spider?**

A spider in Scrapy is more than just a web crawler - it's the intelligent engine that navigates the complex web of online information.

**Key Characteristics**

* **Web Navigator:** Traverses websites recursively
* **Data Extractor:** Pulls specific information from web pages
* **Intelligent Crawler:** Follows links, respects website structures
* **Customizable Explorer:** Adapts to different website architectures



Imagine building a spider that's not just a crawler, but a smart data hunter. Let me break down how we transform a simple script into a web data extraction machine.

* **Step 1: Define Your Hunting Grounds (Domains)**



* **Step 2: Plant Your Starting Seeds (Seed URLs)**


* **Step 3: Create Navigation Rules**


* **Step 4: Identify Your Hunting Zones (Domain and Category Extraction)**



* **Step 5: The Article Extraction Function - Your Data Collector**



* **Step 6: Data Processing Pipeline - The Refinery**



**The Big Picture**

What we've built is more than a script - it's a smart, autonomous web explorer. It knows where to go, what to collect, and how to package its findings. From scattered web pages to a structured dataset, your spider transforms chaos into organized knowledge.
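
Step 4 (domain and category extraction) can be sketched with the standard library's `urlparse`. The function name, the `KNOWN_CATEGORIES` set, and the rule of taking the first path segment that matches a known category are all assumptions made for this illustration; real sites may organize their URLs differently.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical category names; adjust to the sections each site actually uses.
KNOWN_CATEGORIES = {"india", "world", "technology", "science", "politics", "business"}


def extract_domain_and_category(url: str) -> Tuple[str, Optional[str]]:
    """Return (domain without a leading 'www.', first known category or None)."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    # Split the path into non-empty segments and pick the first known category.
    segments = [s for s in parsed.path.lower().split("/") if s]
    category = next((s for s in segments if s in KNOWN_CATEGORIES), None)
    return domain, category


print(extract_domain_and_category("https://www.eenadu.net/business/some-article"))
```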

In [None]:
# Navigate to spider folder
%cd spiders

/content/telugu_news_scraper/telugu_news_scraper/spiders


In [None]:
!pwd

/content/telugu_news_scraper/telugu_news_scraper/spiders


In [None]:
# Create a file named spider1.py inside the spiders folder.

!touch spider1.py

## **5. Define the Spider in spider1.py**

In [None]:
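%%writefile spider1.py
# Saved to spiders/spider1.py by the %%writefile cell magic so that
# `scrapy crawl telugu_news` can discover the spider.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re
import json
from urllib.parse import urlparse


class TeluguNewsSpider(CrawlSpider):
    # `rules` are only honored by CrawlSpider, not by the base scrapy.Spider.
    name = 'telugu_news'

    # List of allowed domains the spider is permitted to crawl.
    # This prevents the spider from following links to other websites.
    allowed_domains = [
        'eenadu.net',
        'te.wikipedia.org',
    ]

    # Start URLs (Telugu Wikipedia is served from te.wikipedia.org, without "www.")
    start_urls = [
        'https://www.eenadu.net/',
        'https://te.wikipedia.org/',
    ]

    rules = (
        Rule(
            LinkExtractor(
                # Follow only links whose URL matches one of these section patterns.
                allow=(
                    r'/india/',
                    r'/world/',
                    r'/technology/',
                    r'/science/',
                    r'/politics/',
                    # URL paths cannot contain raw spaces; the hyphenated form is an assumption.
                    r'/andhra-pradesh/',
                    r'/business/',
                ),
                # Avoid these common non-article pages
                deny=(
                    r'/tag/',
                    r'/author/',
                    r'/search/',
                    r'/contact/',
                    r'/about/',
                    r'/privacy-policy/',
                    r'/sitemap/',
                    r'/advertisement/',
                    r'/gallery/',
                    r'/videos/',
                    r'/login/',
                    r'/register/',
                    r'/subscription/',
                    r'/rss/',
                ),
            ),
            callback='parse_article',
            follow=True,
        ),
    )

    def parse_article(self, response):
        """Yield one item per matched page.

        The selectors below are generic placeholders (an assumption, not
        taken from the original notebook); adjust them to each site's markup.
        """
        title = response.css('h1::text').get(default='').strip()
        paragraphs = response.css('p::text').getall()
        text = '\n'.join(p.strip() for p in paragraphs if p.strip())
        if title and text:
            yield {'url': response.url, 'title': title, 'text': text}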
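
The `allow` and `deny` patterns in the rule above work as regular-expression filters on each discovered URL: a link is followed only if it matches at least one `allow` pattern and no `deny` pattern. The sketch below approximates that decision with plain `re` calls so the behavior can be tried without running a crawl. It is not Scrapy's own implementation, and `passes_link_filter` is a name invented for this example.

```python
import re

# A shortened subset of the patterns used in the spider's rule.
ALLOW = [r"/india/", r"/world/", r"/business/"]
DENY = [r"/tag/", r"/gallery/", r"/videos/"]


def passes_link_filter(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny decision: deny takes precedence over allow."""
    # An empty allow list means every URL is allowed.
    allowed = any(re.search(p, url) for p in allow) if allow else True
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


for candidate in (
    "https://www.eenadu.net/business/story-1",
    "https://www.eenadu.net/business/tag/markets",
    "https://www.eenadu.net/sports/story-2",
):
    print(candidate, passes_link_filter(candidate))
```

Checking a handful of real URLs from a target site against the patterns this way is a quick method for catching patterns that never match, such as one containing a space.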

In [None]:
!scrapy crawl telugu_news -o telugu_news.json

Streaming output truncated to the last 5000 lines.
2025-05-07 11:38:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sakshi.com/index.php/news/politics/tdp-leaders-agitation-and-resignations-party-1171622>
{'url': 'https://www.sakshi.com/index.php/news/politics/tdp-leaders-agitation-and-resignations-party-1171622', 'title': 'టీడీపీలో ఆగ్రహ జ్వాలలు', 'text': 'Published\n| Last Updated on\nవైఎస్సార్\u200c జిల్లా ప్రొద్దుటూరులో టీడీపీ కార్యాలయం ఎదుట ఆ పార్టీ జెండాలను తగలబెడుతున్న వరదరాజులురెడ్డి అనుచరులు, గుంటూరు జిల్లా బాపట్లలో టీడీపీ నేత నరేంద్రవర్మరాజు అనుచరుల నిరసన బైక్\u200c ర్యాలీ\nనలభైయేళ్ల అనుభవం నవ్వుల పాలయ్యింది. లోక్\u200cసభ, అసెంబ్లీ అభ్యర్థుల ఎంపికలో తడబడింది. కొన్నిచోట్ల అభ్యర్థులను మార్చే పరిస్థితి ఏర్పడింది. కొందరు సీనియర్ల బెదిరింపులకు లొంగిపోయి వారు కోరిన విధంగా సీట్లు కేటాయించాల్సి వచ్చింది. ఎలా గోలా ఆ ప్రక్రియ ముగించామనుకుంటే మంగళవారం అసమ్మతి భగ్గుమంది. టీడీపీ సీట్లు దక్కించుకోలేని కొందరు అధినేత చంద్రబాబుపై ఆగ్రహోదగ్రులవుతూ తిరుగుబాటు బావుటా 