<a href="https://colab.research.google.com/github/Crystalheart0828/AI-projects-notebooks/blob/main/Content_Crawler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content Crawler

Created with heart and soul by [Karen Ding](https://karending.com)<br>
Last updated on Mar 2, 2025


## License and Usage
This tutorial is licensed under the MIT License.

Feel free to use, modify, and share this tutorial!

If you find it helpful, please consider:

1. Linking back to the [original webpage](https://karending.com)
2. Mentioning where you learned it
3. Contributing improvements back to help others learn too

You are free to:

1. Use and adapt this tutorial for any purpose.
2. Share, copy, and redistribute it in any medium.
3. Modify, remix, and build upon the material.

For the full license details, please see the [LICENSE](https://github.com/Crystalheart0828/AI-projects-notebooks/blob/main/LICENSE) file included with this tutorial.

## Overview

**Content Crawler** is a web scraping tool: you list URLs in a column of a Google Sheet, and a Python script extracts each page's content and exports it to Google Docs stored in a Google Drive folder you choose.

This tutorial will show you how to build a content scraper that automatically:
1. Reads URLs from a Google Sheet
2. Extracts content from each webpage
3. Saves the content to Google Docs
4. Organizes everything in Google Drive folders

## Prerequisites (Your Key Ingredients! 🧰)

#### **Google Cloud Credentials and APIs (Google Sheets, Google Drive, Google Docs)**
Don't know what this is? No worries, I've been there too! Give any LLM (e.g., ChatGPT, Claude, or Gemini) the following prompt:

> I'm new to Google Cloud Console and haven't built a project yet. Could you please walk me through the step-by-step process to download the credentials file and activate the Google Sheets API, Google Drive API, and Google Docs API?

#### **Google Sheet with URLs**
- Create a Google Sheet
- Put a header in row 1 and your URLs below it in Column F (the script reads column F and skips the first row)
- Share the sheet with your service account email
- Copy the Sheet ID from the URL (the long string between /d/ and /edit)
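
If you'd rather not pick the ID out of the URL by eye, here is a minimal sketch that does it with a regular expression. The helper name `extract_sheet_id` and the example URL are made up for illustration; the pattern assumes the usual `/spreadsheets/d/<ID>/edit` URL shape described above.

```python
import re

def extract_sheet_id(sheet_url):
    """Return the part of a Google Sheets URL after /d/, or None if it is missing."""
    match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", sheet_url)
    return match.group(1) if match else None

# Hypothetical URL, for illustration only
example_url = "https://docs.google.com/spreadsheets/d/1AbC_dEf-123/edit#gid=0"
print(extract_sheet_id(example_url))
```

The function returns `None` when the URL does not look like a Sheets link, which makes a mistyped URL easy to spot before you save it as a secret.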

#### **Google Drive Folder**
- Create a folder in Google Drive
- Share it with your service account email
- Copy the Folder ID from the URL

#### **Google Colab**
[Google Colab](https://colab.research.google.com/) is a cloud-based notebook environment. As with Google Docs, your notebooks are saved in your Google Drive and can be easily shared with others. Simply sign in to Google Colab with your Google account, and you're ready to get started.

## Step-by-Step Guide

### **Step 1: Setting Up Your Environment 🛠️**
**What's happening here**: Think of this as preparing your kitchen before cooking. We're getting all our tools ready by installing the necessary Python packages. Just like you can't cook without pots and pans, you can't run the script without these packages!

We are going to install the following Python packages:
* google-auth (for Google authentication)
* google-api-python-client (for using Google services)
* gspread (for working with Google Sheets)
* beautifulsoup4 (for parsing web content)
* requests (for making internet requests)

In [None]:
pip install google-auth google-api-python-client gspread beautifulsoup4 requests
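
After installing, you may want to confirm that everything is importable before moving on. The sketch below is one possible check, not part of the original notebook. The helper name `find_missing` is hypothetical, and the mapping reflects the fact that some pip package names differ from the module names you import (for example, `beautifulsoup4` is imported as `bs4`).

```python
import importlib.util

# pip package name -> module name used in import statements
PACKAGE_MODULES = {
    "google-auth": "google.auth",
    "google-api-python-client": "googleapiclient",
    "gspread": "gspread",
    "beautifulsoup4": "bs4",
    "requests": "requests",
}

def find_missing(package_modules):
    """Return the pip names of packages whose modules cannot be located."""
    missing = []
    for package, module in package_modules.items():
        try:
            found = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent or malformed
            found = False
        if not found:
            missing.append(package)
    return missing

print("Missing packages:", find_missing(PACKAGE_MODULES))
```

An empty list means you're good to go; anything listed needs another `pip install`.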

### **Step 2: Importing Required Tools 📚**
**What's happening here**: Now we're bringing in all the tools we'll need. It's like taking out all the utensils and ingredients from your kitchen cabinets before starting to cook.

In [None]:
import json
import gspread
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import time
from google.colab import userdata

### **Step 3: Setting Up Google Authentication 🔑**
**What's happening here**: This is like showing your ID to enter a restricted area. We're telling Google who we are and what we want to do with their services (in this case, use Google Sheets, Drive, and Docs).

#### Setting up Colab Secrets

Before running any code, you need to set up your secrets in Colab:

1. Click on the key icon (🔑) in the left sidebar to open the Secrets panel
2. Add the following secrets, and turn on "Notebook access" for each one:
   - `service_account_json`: Paste the entire contents of your service account JSON file
   - `sheet_id`: Your Google Sheet ID
   - `folder_id`: Your Google Drive folder ID
   - `user_email`: Your email address
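
A common slip is pasting only part of the JSON file into `service_account_json`. Here is a small sketch that sanity-checks the pasted text before you try to authenticate. The helper name `check_service_account_json` is hypothetical, and the list of keys is an assumption based on the fields a typical service account key file contains.

```python
import json

# Fields a service account key file normally contains (assumed, not exhaustive)
REQUIRED_KEYS = ("type", "client_email", "private_key", "token_uri")

def check_service_account_json(raw_json):
    """Return a list of problems found; an empty list means the JSON looks usable."""
    try:
        info = json.loads(raw_json)
    except (TypeError, json.JSONDecodeError):
        return ["secret is empty or not valid JSON"]
    if not isinstance(info, dict):
        return ["secret is not a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in info]
    if info.get("type") not in (None, "service_account"):
        problems.append("type is not 'service_account'")
    return problems
```

In Colab you could call it as `check_service_account_json(userdata.get('service_account_json'))` and fix anything it reports before running the authentication cell.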

In [None]:
# Define what permissions we need
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/documents"
]

# Get our configuration from Colab secrets
# The service_account_json secret holds the full contents of your credentials file
service_account_info = json.loads(userdata.get('service_account_json'))
SHEET_ID = userdata.get('sheet_id')
PARENT_FOLDER_ID = userdata.get('folder_id')
USER_EMAIL = userdata.get('user_email')
SHEET_NAME = "Sheet1"  # Change this if your sheet has a different name

# Authenticate with Google
creds = Credentials.from_service_account_info(service_account_info, scopes=SCOPES)
client = gspread.authorize(creds)
drive_service = build('drive', 'v3', credentials=creds)
docs_service = build('docs', 'v1', credentials=creds)

print("Successfully connected to Google services!")

#### Get Your Service Account Email:

In [None]:
service_account_email = service_account_info['client_email']
print(f"Your service account email is: {service_account_email}")

#### Share Your Google Sheet:
* Open your Google Sheet in your browser
* Click the "Share" button in the top right corner
* In the "Share with people and groups" field, paste your service account email
* Set the role to "Editor"
* Click "Send" (Note: You don't need to check the "Notify people" box)

#### 💡 **Pro Tips!**

* The service account email usually ends with **"@*.iam.gserviceaccount.com"**
* Make sure to give **"Editor"** access if you want the script to write to the sheet.
* You only need to do this sharing step once for each Google Sheet

### **Step 4: Creating Helper Functions 🔧**
**What's happening here**: We're creating specialized tools for each task. Think of these as your cooking techniques: each one has a specific purpose in preparing our final dish.

#### a. Extract URLs function
Retrieves a list of URLs from column F of a specified Google Sheet, skipping the header row.

In [None]:
def extract_urls_from_sheet(sheet_id, sheet_name):
    """Gets all the URLs from our Google Sheet"""
    try:
        sheet = client.open_by_key(sheet_id)
        worksheet = sheet.worksheet(sheet_name)
        urls = worksheet.col_values(6)[1:]  # Get URLs from column F, skip header
        return [url.strip() for url in urls if url.strip()]  # drop blanks and trim stray spaces
    except Exception as e:
        print(f"Error accessing sheet: {str(e)}")
        return []
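
Cells in a sheet sometimes hold notes or half-typed links instead of real URLs. As an optional extra, the sketch below filters the list returned by `extract_urls_from_sheet` down to values that parse as web addresses. The helper name `keep_http_urls` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse

def keep_http_urls(values):
    """Keep only values that parse as http(s) URLs with a host name."""
    valid = []
    for value in values:
        candidate = value.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            valid.append(candidate)
    return valid

print(keep_http_urls(["https://example.com/a", "not a url", "ftp://files.example.com"]))
```

You could wrap the sheet call as `keep_http_urls(extract_urls_from_sheet(SHEET_ID, SHEET_NAME))` so the scraper never wastes a request on a cell that isn't a link.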

#### b. Scrape content function
Fetches and extracts readable text content from a given webpage, and returns it along with the original URL.

In [None]:
def scrape_content_from_url(url):
    """Reads the content from each webpage"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, "html.parser")

        paragraphs = soup.find_all('p')
        if not paragraphs:
            paragraphs = soup.find_all('div')

        article_content = ' '.join([p.get_text() for p in paragraphs])

        if len(article_content) < 50:
            print(f"Warning: Content for {url} is very short or empty.")
            return None

        return f"Original URL: {url}\n\n{article_content}"

    except requests.RequestException as e:
        print(f"Failed to retrieve content from {url}: {str(e)}")
        return None
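
Text pulled out of `<p>` tags often carries leftover newlines and runs of spaces from the page's HTML. If that bothers you in the final Doc, a tiny cleanup step like the one below can help. This is an optional sketch, and `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

print(normalize_whitespace("First  paragraph.\n\n   Second\tparagraph. "))
```

One place to apply it would be on `article_content` just before the length check in `scrape_content_from_url`.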

#### c. Create folder in Google Drive function
Creates a new folder in Google Drive under a specified parent folder, or returns the existing folder's ID if it already exists.

In [None]:
def create_folder_in_drive(folder_name, parent_folder_id):
    """Creates a new folder in Google Drive"""
    query = (
        f"'{parent_folder_id}' in parents and mimeType='application/vnd.google-apps.folder' "
        f"and name='{folder_name}' and trashed=false"  # ignore folders sitting in the trash
    )
    results = drive_service.files().list(q=query, fields='files(id)').execute()
    folders = results.get('files', [])

    if folders:
        return folders[0]['id']
    else:
        file_metadata = {
            'name': folder_name,
            'mimeType': 'application/vnd.google-apps.folder',
            'parents': [parent_folder_id]
        }
        folder = drive_service.files().create(body=file_metadata, fields='id').execute()
        return folder['id']
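
The folder names used in this tutorial are simple, but a name containing an apostrophe (say, "Karen's notes") would break the search query, because the query wraps the name in single quotes. Here is a hedged sketch of building the query with escaping. The helper names `escape_query_value` and `build_folder_query` are hypothetical, and the escaping rule (a backslash before single quotes and backslashes) is my reading of the Drive search syntax. The `trashed=false` clause keeps folders in the trash from matching.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"

def escape_query_value(value):
    """Escape backslashes and single quotes for use inside a Drive query string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")

def build_folder_query(folder_name, parent_folder_id):
    """Build a search query for a non-trashed folder with this name and parent."""
    return (
        f"'{escape_query_value(parent_folder_id)}' in parents "
        f"and mimeType='{FOLDER_MIME}' "
        f"and name='{escape_query_value(folder_name)}' "
        "and trashed=false"
    )

print(build_folder_query("Karen's notes", "abc123"))
```

The returned string can be passed as the `q` argument of `drive_service.files().list(...)` in `create_folder_in_drive`.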

#### d. Export the scraped content to Google Doc function
Creates a new Google Doc with the given content and title, moves it to a specified Google Drive folder, and shares it with a user via email.

In [None]:
def create_and_share_doc(content, title, folder_id):
    """Creates a new Google Doc and shares it"""
    # Create the doc
    doc = docs_service.documents().create(body={"title": title}).execute()
    document_id = doc['documentId']

    # Add content
    requests_body = [{
        'insertText': {
            'location': {'index': 1},
            'text': content
        }
    }]
    docs_service.documents().batchUpdate(
        documentId=document_id,
        body={'requests': requests_body}
    ).execute()

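    # Move to folder (new Docs start in the service account's root folder)
    file = drive_service.files().get(fileId=document_id, fields='parents').execute()
    previous_parents = ",".join(file.get('parents', []))
    drive_service.files().update(
        fileId=document_id,
        addParents=folder_id,
        removeParents=previous_parents,
        fields='id, parents'
    ).execute()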

    # Share the doc
    drive_service.permissions().create(
        fileId=document_id,
        body={
            'type': 'user',
            'role': 'writer',
            'emailAddress': USER_EMAIL
        },
        fields='id',
        sendNotificationEmail=True
    ).execute()

    return document_id

### **Step 5: Let's Start Scraping! 🚀**

Now for the fun part - let's put everything together and start scraping! This cell will:
1. Create a new folder for today's content
2. Get all URLs from your sheet
3. Visit each webpage and save its content
4. Create a Google Doc for each page
5. Share everything with you

In [None]:
# Create today's folder
current_date = datetime.now().strftime("%Y%m%d")
session_folder_name = f"Scraped_Content_{current_date}"
folder_id = create_folder_in_drive(session_folder_name, PARENT_FOLDER_ID)
print(f"Created session folder: {session_folder_name}")

# Create Original content folder
original_folder_id = create_folder_in_drive("Original", folder_id)
print("Created 'Original' folder for raw content")

# Get URLs from sheet
urls = extract_urls_from_sheet(SHEET_ID, SHEET_NAME)
print(f"Found {len(urls)} URLs to process")

# Process each URL
for i, url in enumerate(urls, start=1):
    print(f"\nProcessing URL {i}/{len(urls)}: {url}")

    content = scrape_content_from_url(url)
    if content:
        # Create and share the document
        doc_title = f"Original-{i}-{current_date}"
        document_id = create_and_share_doc(content, doc_title, original_folder_id)
        print(f"Created and shared document: https://docs.google.com/document/d/{document_id}/edit")

        # Take a short break
        time.sleep(1)
    else:
        print(f"Skipped URL {url} due to scraping failure")

print("\nAll done! Check your email for the shared documents.")
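
In the loop above, an unexpected error while creating a Doc would stop the whole run partway through. One possible variation, sketched below, keeps going and hands you a tally at the end. The name `process_urls` and its `scrape` and `save` parameters are hypothetical. They stand in for `scrape_content_from_url` and a small wrapper around `create_and_share_doc`.

```python
def process_urls(urls, scrape, save):
    """Scrape each URL, save what was found, and report successes and skips."""
    saved, skipped = [], []
    for index, url in enumerate(urls, start=1):
        try:
            content = scrape(url)
            if not content:
                skipped.append((url, "no content"))
                continue
            saved.append((url, save(content, index)))
        except Exception as error:  # keep going if a single URL fails
            skipped.append((url, str(error)))
    return saved, skipped
```

With the notebook's own functions, a call could look like `process_urls(urls, scrape_content_from_url, lambda content, i: create_and_share_doc(content, f"Original-{i}-{current_date}", original_folder_id))`. The `skipped` list then tells you which URLs to retry.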

#### 💡 **Pro Tips:**
1. Colab secrets are stored with your Google account, not in the notebook file. Avoid printing them, because printed values do appear in cell output
2. Remember to set up your secrets before running the notebook
3. If you're sharing your notebook, the secrets won't be shared - each user needs to set up their own
4. Colab sessions expire after a while, so you might need to rerun the authentication steps
5. Scraping is limited by network speed, so switching to a GPU or TPU runtime will not make it faster. For long URL lists, process the sheet in smaller batches so that a session timeout interrupts less work
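
For long URL lists, one way to keep a run manageable is to split the list into fixed-size batches and process one batch at a time. This is a minimal sketch, and the helper name `batched` and the batch size are illustrative choices, not part of the original notebook.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

for batch in batched(["url1", "url2", "url3", "url4", "url5"], 2):
    print(batch)
```

If a session expires mid-run, you only need to rerun the batch that was interrupted rather than the whole sheet.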

© 2025 Karen Ding. Created with heart and soul.<br>
Licensed under the MIT License. See the [LICENSE](https://github.com/Crystalheart0828/AI-projects-notebooks/blob/main/LICENSE) file for details.<br>
Find more tutorials at [karending.com](https://karending.com)