ModernDataEngineering: Building a Robust Data Pipeline with Proxy Rotation, Kafka, MongoDB, Redis, Logstash, Elasticsearch, and MinIO for Efficient Web Scraping

Utilizing Proxies and User-Agent Rotation

Proxies and Rotating User Agents: To overcome anti-scraping measures, our system combines proxies with rotating user agents. Proxies mask the scraper’s IP address, making it difficult for websites to detect and block it, while rotating user-agent strings further disguise the scraper by simulating requests from different browsers and devices.
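
As a minimal sketch of this idea (assuming the Python requests library; the user-agent strings and proxy addresses below are placeholder values, not the project’s actual pools):

import random
import requests

# Placeholder pools; in the pipeline these are populated dynamically.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]

def fetch(url: str) -> requests.Response:
    # Rotate the identity on every request: random user agent plus random proxy.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)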

Storing Proxies in Redis: Valid proxies are crucial for uninterrupted scraping. Our system stores and manages these proxies in a Redis database. Redis, known for its high performance, acts as an efficient, in-memory data store for managing our proxy pool. This setup allows quick access and updating of proxy lists, ensuring that our scraping agents always have access to working proxies.
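
A minimal sketch of such a pool, assuming the redis-py client and a hypothetical key named valid_proxies:

import redis

# Connection details are assumptions; the project reads them from its configuration.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_proxy(proxy: str) -> None:
    # A Redis set keeps the pool free of duplicate entries.
    r.sadd("valid_proxies", proxy)

def get_random_proxy():
    # srandmember hands back a random member, spreading requests across the pool.
    return r.srandmember("valid_proxies")

def remove_dead_proxy(proxy: str) -> None:
    # Evict proxies that stop responding so scrapers only ever see working ones.
    r.srem("valid_proxies", proxy)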

RSS Feed Extraction and Kafka Integration

Extracting News from RSS Feeds: The system is configured to extract news from various RSS feeds. RSS, a web feed format that lets users and applications receive website updates in a standardized, computer-readable form, is an excellent source for automated news aggregation.
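
A minimal sketch of the extraction step, assuming the feedparser library and an illustrative feed URL:

import feedparser

# Illustrative feed; the project reads its feed list from configuration.
FEED_URL = "http://feeds.bbci.co.uk/news/rss.xml"

def extract_entries(feed_url: str) -> list:
    parsed = feedparser.parse(feed_url)
    # Keep only the fields the downstream pipeline needs.
    return [
        {
            "title": entry.get("title"),
            "link": entry.get("link"),
            "description": entry.get("summary"),
            "published": entry.get("published"),
        }
        for entry in parsed.entries
    ]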

Quality Validation and Kafka Integration: Once the news is extracted, its quality is validated. The validated news data is then published to a Kafka topic (Kafka A). Kafka, a distributed streaming platform, is used here for its ability to handle high-throughput data feeds, ensuring efficient and reliable data transfer.
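
A minimal sketch of this step, assuming the kafka-python client, a simple completeness check as the quality rule, and a placeholder topic name rss_news standing in for topic A:

import json
from kafka import KafkaProducer

# Broker address is an assumption; the project takes it from configuration.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def is_valid(item: dict) -> bool:
    # A basic quality gate: required fields must be present and non-empty.
    return all(item.get(field) for field in ("title", "link", "description"))

def publish(item: dict) -> None:
    if is_valid(item):
        producer.send("rss_news", value=item)
        producer.flush()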

Data Flow and Storage

MongoDB Integration with Kafka Connect: Kafka Connect Mongo Sink consumes data from Kafka topic A and stores it in MongoDB.

MongoDB, a NoSQL database, is ideal for handling large volumes of unstructured data. The upsert functionality, based on the _id field, ensures that the data in MongoDB is current and avoids duplicates.
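
Purely as an illustration (the real configurations ship in the repository's connect folder), a MongoDB sink connector with upsert-style behaviour on _id can be registered against the Kafka Connect REST API roughly like this; the topic name, connection URI, database, and collection are assumptions:

import requests

mongo_sink = {
    "name": "mongo-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "rss_news",                       # assumed topic name
        "connection.uri": "mongodb://mongo:27017",  # assumed connection string
        "database": "news",
        "collection": "articles",
        # Reuse the _id carried in the message and replace any existing document,
        # which keeps the collection current and free of duplicates.
        "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInValueStrategy",
        "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneDefaultStrategy",
    },
}

# Kafka Connect's REST API listens on port 8083 by default.
requests.post("http://localhost:8083/connectors", json=mongo_sink, timeout=10)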

Data Accessibility in FastAPI: The collected data in MongoDB is made accessible through FastAPI with OAuth 2.0 authentication, providing secure and efficient access for administrators.
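
A minimal sketch of such a protected endpoint using FastAPI's OAuth2PasswordBearer; the token check and the route path are placeholders rather than the project's actual implementation:

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import OAuth2PasswordBearer

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

def verify_token(token: str = Depends(oauth2_scheme)) -> str:
    # Placeholder check; a real implementation would validate a JWT or look the token up.
    if token != "expected-admin-token":
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    return token

@app.get("/api/v1/admin/news")
def list_news(token: str = Depends(verify_token)):
    # Here the project would query MongoDB and return the stored articles.
    return {"items": []}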

Logstash and Elasticsearch Integration: Logstash monitors MongoDB replica sets for document changes, capturing these as events. These events are then indexed in Elasticsearch, a powerful search and analytics engine. This integration allows for real-time data analysis and quick search capabilities.

Data Persistence with Kafka Connect S3-MinIO Sink: To ensure data persistence, the Kafka Connect S3-MinIO sink is employed. It consumes records from Kafka topic A and stores them in MinIO, a high-performance object storage system. This step is crucial for long-term data storage and backup.
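
Along the same lines, and again only as an illustration with assumed values, the Confluent S3 sink connector can be pointed at MinIO by overriding the store URL:

import requests

minio_sink = {
    "name": "minio-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "rss_news",              # assumed topic name
        "s3.bucket.name": "news-archive",  # assumed bucket
        "store.url": "http://minio:9000",  # point the S3 connector at the MinIO endpoint
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "100",
    },
}

requests.post("http://localhost:8083/connectors", json=minio_sink, timeout=10)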

Public Data Access and Search

Elasticsearch for Public Search: The data collected and indexed in Elasticsearch is made publicly accessible through FastAPI. This setup lets users perform fast and efficient searches across the aggregated data; a sketch of the underlying search endpoint follows the example calls below.

Here are some example API calls and their intended functionality:

Basic Request Without Any Parameters:

Search with a General Keyword:

  • Searches across multiple fields (like title, description, and author) using a general keyword.

  • Example API Call: GET http://localhost:8000/api/v1/news/?search=Arsenal

  • This call returns news items where the word “Arsenal” appears in either the title, description, or author.

Search in a Specific Field:

Filter by Language:

Combining General Search with Language Filter:

Combining Specific Field Search with Language Filter:
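
All of the calls above map onto a single search endpoint. Below is a minimal sketch of how such an endpoint might query Elasticsearch, assuming the elasticsearch Python client, a hypothetical index named news, and the parameter names used in the examples:

from typing import Optional

from elasticsearch import Elasticsearch
from fastapi import FastAPI

app = FastAPI()
es = Elasticsearch("http://localhost:9200")  # assumed Elasticsearch address

@app.get("/api/v1/news/")
def search_news(search: Optional[str] = None, language: Optional[str] = None):
    must = []
    if search:
        # Match the keyword across several fields, as in the ?search=Arsenal example.
        must.append({"multi_match": {"query": search,
                                     "fields": ["title", "description", "author"]}})
    if language:
        # Assumes 'language' is indexed as a keyword field.
        must.append({"term": {"language": language}})
    body = {"query": {"bool": {"must": must}}} if must else {"query": {"match_all": {}}}
    result = es.search(index="news", body=body)
    return [hit["_source"] for hit in result["hits"]["hits"]]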

ModernDataEngineerPipeline - Startup Guide

This guide provides step-by-step instructions for setting up and running the "ModernDataEngineerPipeline" project.

Setup Steps

1. Clone the Repository

Start by cloning the repository from GitHub:

git clone https://github.com/Stefen-Taime/ModernDataEngineerPipeline

2. Navigate to the Project Directory

cd ModernDataEngineerPipeline

3. Launch Services with Docker

Use docker-compose to build and start the services:

docker-compose up --build -d

3.1 Use MongoDB and Redis Clusters

You can use managed MongoDB and Redis clusters from MongoDB Atlas and Redis Cloud (both offer free trial tiers), or deploy local MongoDB and Redis instances with Docker.

4. Navigate to the src Folder

cd src

5. Run the Proxy Handler

Execute proxy_handler.py to retrieve proxies and store them in Redis:

python proxy_handler.py

6. Handle RSS Feeds with Kafka

Use rss_handler.py to produce messages to Kafka:

python rss_handler.py

7. Add JSON Sink Connectors

Register the two JSON sink connectors found in the connect folder with Kafka Connect, either through the Confluent UI or via the Connect REST API.

8. Launch Logstash

Run Logstash inside its Docker container (replace <container_id> with the ID of the Logstash container):

docker exec -it <container_id> /bin/bash -c "mkdir -p ~/logstash_data && bin/logstash -f pipeline/ingest_pipeline.conf --path.data /usr/share/logstash/logstash_data"

9. Start the API

Finally, start the API:

cd api
python main.py

Follow these steps to set up and run the "ModernDataEngineerPipeline" project.
