# 📚 Free Books Scraper - Google Colab Edition

Scrape 1000+ free books from archive.org and Project Gutenberg with PDFs and HD covers!

## 🚀 Quick Start
1. Run the first cell to install dependencies
2. Run the second cell to start scraping
3. Wait 30-60 minutes for completion
4. Use the third cell to explore your collection

In [None]:
# Install required packages
print("📦 Installing required packages...")
!pip install requests beautifulsoup4 pandas lxml --quiet
print("✅ Packages installed successfully!")

In [None]:
# Download and run the book scraper
import warnings
warnings.filterwarnings('ignore')

# Download the scraper files
!wget -q https://raw.githubusercontent.com/your-repo/book_scraper/main/book_scraper.py -O book_scraper.py
!wget -q https://raw.githubusercontent.com/your-repo/book_scraper/main/book_utils.py -O book_utils.py

print("📚 Starting Free Books Scraper...")
print("🎯 Target: 1000+ free books")
print("📂 Location: /content/books/")
print("⏰ Estimated time: 30-60 minutes")
print()

# Run the scraper
from book_scraper import BookScraper

scraper = BookScraper(download_dir="/content/books")
books, csv_path, zip_path = scraper.run_full_scraping(target_books=1000)

print("\n" + "="*60)
print("🎉 SCRAPING COMPLETED!")
print("="*60)
print(f"📚 Total books: {len(books)}")
print(f"📊 Database: {csv_path}")
print(f"📦 Zip file: {zip_path}")

In [None]:
# Explore your book collection
import pandas as pd
import os
from book_utils import BookManager

# Load the collection
manager = BookManager("/content/books")
df = manager.df

print("📊 COLLECTION OVERVIEW")
print("="*40)
print(f"Total books: {len(df)}")
# Missing paths may be stored as NaN or '', so normalise before counting
print(f"Books with PDFs: {df['local_pdf_path'].fillna('').ne('').sum()}")
print(f"Books with covers: {df['local_cover_path'].fillna('').ne('').sum()}")

# Show sample books
print("\n📖 SAMPLE BOOKS:")
sample = df[['title', 'author', 'categories']].head(5)
for i, (_, book) in enumerate(sample.iterrows()):
    print(f"{i+1}. {book['title']} by {book['author']}")
    print(f"   📂 {book['categories']}")

# Show file sizes (csv_path and zip_path are set by the scraping cell above)
if os.path.exists(csv_path):
    csv_size = os.path.getsize(csv_path) / (1024*1024)
    print(f"\n📄 CSV file size: {csv_size:.2f} MB")

if os.path.exists(zip_path):
    zip_size = os.path.getsize(zip_path) / (1024*1024)
    print(f"📦 Zip file size: {zip_size:.2f} MB")

In [None]:
# Search and browse your collection
print("🔍 SEARCH EXAMPLES")
print("="*30)

# Search by title
python_books = manager.search_books('Python')
print(f"\n🐍 Python books ({len(python_books)}):")
for i, (_, book) in enumerate(python_books.head(3).iterrows()):
    print(f"  {i+1}. {book['title']}")

# Browse by category
fiction_books = manager.get_books_by_category('Fiction')
print(f"\n📚 Fiction books ({len(fiction_books)}):")
for i, (_, book) in enumerate(fiction_books.head(3).iterrows()):
    print(f"  {i+1}. {book['title']} ({book['date']})")

# Most popular
popular = manager.get_most_popular(5)
print("\n⭐ Most popular books:")
for i, (_, book) in enumerate(popular.iterrows()):
    print(f"  {i+1}. {book['title']} (📊 {book['download_count']:,} downloads)")

In [None]:
# Download files from Colab
from google.colab import files

print("💾 DOWNLOAD FILES")
print("="*25)

# Download CSV database
print("📊 Downloading CSV database...")
files.download(csv_path)

# Download zip archive
print("📦 Downloading zip archive...")
files.download(zip_path)

print("\n✅ Files downloaded to your computer!")
print("💡 You can also find them in the Colab file browser (left sidebar)")

## 📖 How to Use Your Collection

### Load the Database
```python
import pandas as pd
df = pd.read_csv('books_database.csv')
```
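
When the CSV is read back, pandas turns empty cells into `NaN`, so a check such as `df['local_pdf_path'] != ''` would count missing paths as present. The sketch below shows one way to normalise those fields on load. It assumes the column names used in the cells above (`local_pdf_path`, `local_cover_path`) exist in the file; `load_books` is a hypothetical helper, not part of `book_utils`, and the in-memory CSV is made-up sample data.

```python
import io
import pandas as pd

def load_books(csv_source):
    """Read the book database and replace missing path fields with ''."""
    df = pd.read_csv(csv_source)
    # Assumed column names, taken from the notebook cells above.
    for col in ("local_pdf_path", "local_cover_path"):
        if col in df.columns:
            df[col] = df[col].fillna("").astype(str)
    return df

# Tiny in-memory stand-in for books_database.csv (illustrative data only)
sample_csv = io.StringIO(
    "title,local_pdf_path,local_cover_path\n"
    "Book A,pdfs/a.pdf,covers/a.jpg\n"
    "Book B,,\n"
)
books_df = load_books(sample_csv)
with_pdf = int((books_df["local_pdf_path"] != "").sum())
print(f"Books with PDFs: {with_pdf} of {len(books_df)}")
```

With the real file, `load_books('books_database.csv')` would be the equivalent call.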

### Find Books
```python
# Search by title (na=False treats rows with a missing value as non-matches)
python_books = df[df['title'].str.contains('Python', case=False, na=False)]

# Filter by category
fiction = df[df['categories'].str.contains('Fiction', case=False, na=False)]

# Most popular
popular = df.nlargest(10, 'download_count')
```
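
The filters above can be wrapped in a small helper that searches several columns at once and tolerates missing values. This is a rough sketch: `search` is a hypothetical function (it is not the `BookManager.search_books` method, whose behaviour is not shown here), and the `title` and `author` column names are assumed from the earlier cells.

```python
import pandas as pd

def search(df, term, columns=("title", "author")):
    """Return rows where any listed column contains `term`, ignoring case."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        if col in df.columns:
            text = df[col].fillna("").astype(str)
            # regex=False matches the term literally, so "C++" is safe to search for.
            mask |= text.str.contains(term, case=False, regex=False)
    return df[mask]

# Illustrative data only
demo = pd.DataFrame({
    "title": ["Learning Python", "Moby Dick", None],
    "author": ["A. Writer", "Herman Melville", "Monty Python"],
})
hits = search(demo, "python")
print(hits)
```

Here the first and third rows match: one through its title, the other through its author.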

### Access Files
```python
# Get first book
book = df.iloc[0]

# PDF path
pdf_path = book['local_pdf_path']

# Cover path
cover_path = book['local_cover_path']
```
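
A stored path may be empty, `NaN`, or point at a file that failed to download, so it can help to check before opening it. The following is a minimal sketch; `resolve_book_file` is a hypothetical helper, and treating relative paths as relative to the collection folder is an assumption, since the notebook does not say whether the CSV stores absolute or relative paths.

```python
import tempfile
from pathlib import Path

def resolve_book_file(path_value, base_dir="."):
    """Return a Path to an existing file, or None when the value is unusable."""
    # NaN from pandas is a float, so only non-empty strings are accepted.
    if not isinstance(path_value, str) or not path_value.strip():
        return None
    path = Path(path_value)
    if not path.is_absolute():
        # Assumption: relative paths are relative to the collection folder.
        path = Path(base_dir) / path
    return path if path.is_file() else None

# Demonstration with a temporary folder standing in for /content/books
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pdfs").mkdir()
    (Path(tmp) / "pdfs" / "sample.pdf").write_bytes(b"%PDF-1.4")
    found = resolve_book_file("pdfs/sample.pdf", base_dir=tmp)
    missing = resolve_book_file("pdfs/other.pdf", base_dir=tmp)
    empty = resolve_book_file(float("nan"), base_dir=tmp)
    found_name = found.name if found else None

print(found_name, missing, empty)
```

In the notebook this would be called as `resolve_book_file(book['local_pdf_path'], base_dir='/content/books')`.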

### File Structure
```
/content/books/
├── books_database.csv       # Complete database
├── free_books_collection.zip # All files zipped
├── pdfs/                    # Downloaded PDFs
├── covers/                  # HD covers
└── metadata/                # Statistics
```
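
To confirm that a run produced the layout above, the folders can be walked and summarised. This is a sketch under the assumption that the subfolder names match the tree shown; `summarize_collection` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real collection.

```python
import tempfile
from pathlib import Path

def summarize_collection(root):
    """Map each top-level subfolder of `root` to (file_count, total_bytes)."""
    summary = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            files = [p for p in sub.rglob("*") if p.is_file()]
            summary[sub.name] = (len(files), sum(p.stat().st_size for p in files))
    return summary

# Temporary folder standing in for /content/books
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pdfs", "covers", "metadata"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "pdfs" / "sample.pdf").write_bytes(b"0123456789")
    summary = summarize_collection(tmp)

for folder, (count, size) in summary.items():
    print(f"{folder}: {count} files, {size} bytes")
```

On Colab, `summarize_collection('/content/books')` would report on the real folders.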