FDA Drug Information Scraper

This project extracts detailed drug information from the FDA website and stores it in MongoDB. Each run records the drug details together with extraction metadata for traceability, and the scraper can optionally run on a schedule.

Features

Drug Data Extraction:

Captures detailed drug information including: Drug Name, Active Ingredients, Strength, Dosage Form/Route, Marketing Status, TE Code (therapeutic equivalence code), RLD (Reference Listed Drug) and RS (Reference Standard).

Metadata Collection:

Stores metadata for full traceability, including:
- Timestamp of data extraction
- Source URL
- Number of records extracted
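
As a rough sketch of how the metadata listed above might be assembled, the helper below builds a dictionary whose keys mirror the Sample Output section. The function name `build_metadata` and its parameters are illustrative assumptions, not names taken from the project.

```python
from datetime import datetime, timezone


def build_metadata(source_url, records):
    """Return a metadata dict for one scraping run (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    return {
        # ISO 8601 timestamp in UTC, millisecond precision, with a trailing "Z"
        "Timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "Source URL": source_url,
        "Number of Records": len(records),
    }
```

Deriving the record count from the extracted list keeps the metadata consistent with the stored data.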

MongoDB Integration:

Saves scraped data and metadata into MongoDB in a well-organized structure for easy querying.
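
One way the storage step could look with pymongo is sketched below. The database name `fda_drugs`, the collection name `drug_data`, and both function names are assumptions made for illustration; the project's database.py may organize this differently.

```python
def get_collection(mongo_uri, db_name="fda_drugs", collection_name="drug_data"):
    """Open a collection; the database and collection names are assumed."""
    # Imported lazily so save_document can be exercised without pymongo installed
    from pymongo import MongoClient

    client = MongoClient(mongo_uri)
    return client[db_name][collection_name]


def save_document(collection, products, metadata):
    """Insert one document holding the scraped rows and their metadata."""
    document = {"Products on NDA": products, "Metadata": metadata}
    result = collection.insert_one(document)
    return result.inserted_id
```

Storing the rows and their metadata in a single document means each run can be retrieved, with its provenance, in one query.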

Error Handling and Logging:

Implements retry mechanisms for handling network or structural issues.
Logs the scraping process, errors and database interactions in a dedicated log file.
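
A retry wrapper of the kind described above might look like the following minimal sketch. The name `with_retries`, the default of three attempts, and the linearly growing delay are assumptions, not the project's actual settings.

```python
import logging
import time

logger = logging.getLogger("scraper")


def with_retries(task, attempts=3, delay_seconds=2.0):
    """Call task() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds * attempt)  # wait longer after each failure
```

Calling `logging.basicConfig(filename="logs/application.log", level=logging.INFO)` once at startup would route these messages to the log file, assuming the logs/ directory already exists.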

File Downloads:

Optionally downloads associated documents (e.g. drug labels and letters) as PDFs into a local directory.
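
The download step could be sketched with the standard library as shown below. The `downloads` directory name, the helper names, and the fallback filename are assumptions; the project may use a different HTTP client or naming scheme.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def local_pdf_path(url, download_dir="downloads"):
    """Derive a local file path from the last segment of the URL path."""
    name = os.path.basename(urlparse(url).path) or "document.pdf"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return os.path.join(download_dir, name)


def download_pdf(url, download_dir="downloads", timeout=30):
    """Fetch url and write the response body into download_dir."""
    os.makedirs(download_dir, exist_ok=True)
    path = local_pdf_path(url, download_dir)
    with urlopen(url, timeout=timeout) as response, open(path, "wb") as fh:
        fh.write(response.read())
    return path
```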

Scheduling:

Automates the scraping process using a scheduler.
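
For illustration only, a scheduler can be as simple as the loop below, which uses just the standard library. This README does not say which scheduling library scheduler.py relies on, so `run_periodically` and its parameters are hypothetical.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None):
    """Run job() every interval_seconds; stop after max_runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep for whatever remains of the interval after the job finishes
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

Leaving `max_runs` unset keeps the loop running until the process is stopped.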

Project Structure

Web_Scrapper/
├── src/                  # Core project logic
│   ├── main.py           # Orchestrates scraping tasks
│   ├── scraper.py        # Handles scraping logic
│   ├── database.py       # Manages MongoDB interactions
│   ├── scheduler.py      # Scheduling logic
│   └── log_config.py     # Logging configuration
├── logs/                 # Log files
│   └── application.log   # Logs for debugging and monitoring
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation
└── venv/                 # Virtual environment

Prerequisites

Ensure the following tools are installed:

Python: Version 3.8 or above.
MongoDB: Installed locally or accessible via MongoDB Atlas.

Running the Project

1. Set Up the Virtual Environment
Create and activate the environment:

bash

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

2. Install Dependencies
Install the required libraries:

bash

pip install -r requirements.txt

3. Configure Environment Variables
Create a .env file in the root directory with the following:

MONGO_URI=your_mongodb_connection_string
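
The sketch below shows one way the `MONGO_URI` value could be read at startup using only the standard library. Both function names are assumptions; the project may instead rely on a package such as python-dotenv.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from path into os.environ, skipping comments."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, since connection strings may contain more
            key, value = line.split("=", 1)
            # Variables already set in the environment take precedence over the file
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_mongo_uri():
    """Return MONGO_URI or raise a clear error if it is missing."""
    uri = os.environ.get("MONGO_URI")
    if not uri:
        raise RuntimeError("MONGO_URI is not set; add it to .env or the environment")
    return uri
```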

4. Run the Scraper
Start the scraping process:

bash

python src/main.py

5. Enable Scheduling
To schedule periodic scraping tasks:

bash

python src/scheduler.py

Usage

Run Scraper:

Automatically extracts data and stores it in MongoDB.

Schedule Tasks:

Periodically scrape data using the scheduler.

Download Files:

Enable optional PDF downloads when prompted.

Sample Output

Scraped Data Document (json)

{
  "Products on NDA": [
    {
      "Drug Name": "VALSTAR PRESERVATIVE FREE",
      "Active Ingredients": "VALRUBICIN",
      "Strength": "40MG/ML",
      "Dosage Form/Route": "SOLUTION;INTRAVESICAL",
      "Marketing Status": "Prescription",
      "TE Code": "AO",
      "RLD": "Yes",
      "RS": "Yes"
    }
  ],
  "Metadata": {
    "Timestamp": "2024-12-11T19:16:28.687Z",
    "Source URL": "https://example.com",
    "Number of Records": 1
  }
}
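
A document shaped like the sample above can be sanity-checked before or after it is stored. The following is a minimal sketch; `find_problems` and `REQUIRED_FIELDS` are hypothetical names, and the field list simply mirrors the keys shown in the sample.

```python
REQUIRED_FIELDS = (
    "Drug Name", "Active Ingredients", "Strength", "Dosage Form/Route",
    "Marketing Status", "TE Code", "RLD", "RS",
)


def find_problems(document):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    products = document.get("Products on NDA", [])
    for index, product in enumerate(products):
        missing = [field for field in REQUIRED_FIELDS if field not in product]
        if missing:
            problems.append(f"record {index} is missing: {', '.join(missing)}")
    # The metadata count should match the number of stored records
    expected = document.get("Metadata", {}).get("Number of Records")
    if expected != len(products):
        problems.append(f"metadata reports {expected} records, found {len(products)}")
    return problems
```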
