# Fix HTTP 403 and 429 Errors When Downloading Images

This notebook demonstrates how to download images with the Python `requests` library while avoiding two common errors:
- **HTTP 403 Forbidden**: Server blocks requests without proper headers
- **HTTP 429 Too Many Requests**: Rate limiting from services like Wikimedia

## The Problems

### 1. HTTP 403 Forbidden
```
HTTPError: 403 Client Error: Forbidden
```
Cause: Missing or improper User-Agent header

### 2. HTTP 429 Too Many Requests
```
HTTPError: 429 Client Error: Use thumbnail steps listed on https://w.wiki/GHai
```
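Cause: Rate limiting by Wikimedia or other services. In this example the message text also points to Wikimedia's thumbnail guidance at the linked page.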

## The Solution
1. Add proper HTTP headers (especially User-Agent with identification)
2. Implement retry logic with exponential backoff
3. Handle redirects properly

In [None]:
# Install required packages if needed
# !pip install requests pillow matplotlib

In [None]:
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
import time

## Method 1: Simple Solution with Retry Logic (RECOMMENDED)

In [None]:
def download_image_with_retry(url, max_retries=5, initial_delay=2):
    """
    Download an image with retry logic for rate limiting.
    
    Args:
        url (str): Image URL
        max_retries (int): Maximum retry attempts
        initial_delay (int): Initial delay in seconds for exponential backoff
    
    Returns:
        PIL.Image: The downloaded image
    """
    # Proper headers for Wikimedia and other services
    # Note: Wikimedia requires identification in User-Agent
    headers = {
        'User-Agent': 'Python Image Downloader/1.0 (Educational/Research; Python requests)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Referer': 'https://www.google.com/',
        'DNT': '1'
    }
    
    delay = initial_delay
    
    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}/{max_retries}...")
            
            # Make request with headers
            response = requests.get(url, headers=headers, timeout=15, allow_redirects=True)
            
            # Handle rate limiting (429) with exponential backoff
            if response.status_code == 429:
                if attempt < max_retries - 1:
                    print(f"⚠ Rate limited (429). Waiting {delay} seconds...")
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff: 2s, 4s, 8s, 16s...
                    continue
                else:
                    raise Exception(f"Rate limit exceeded after {max_retries} attempts")
            
            # Check for other errors
            response.raise_for_status()
            
            # Open and return image
            img = Image.open(BytesIO(response.content))
            print(f"✓ Success! Final URL: {response.url}")
            return img
            
        except requests.exceptions.HTTPError as e:
            # 429 is handled before raise_for_status(), so any HTTP error
            # reaching this point (e.g. 403 or 404) is not retried.
            print(f"✗ HTTP Error: {e}")
            raise
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                print(f"⚠ Request failed. Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 2
            else:
                print(f"✗ Request Error: {e}")
                raise
    
    raise Exception("Failed to download image after all retries")
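
Servers that rate-limit often send a `Retry-After` header with a 429 response, either as a number of seconds or as an HTTP date. The function above does not read it and relies only on its own doubling schedule. A minimal sketch of how the header could be honored, assuming a hypothetical helper `retry_after_seconds` (not part of `requests`) that falls back to the backoff delay when the header is missing or unparseable:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, fallback, now=None):
    """Hypothetical helper: turn a Retry-After header value into a wait in seconds.

    header_value: the raw header string, or None if the header is absent
    fallback: delay to use when the header is missing or cannot be parsed
    now: timezone-aware "current time", injectable for testing
    """
    if not header_value:
        return fallback

    value = header_value.strip()

    # Form 1: a delay in whole seconds, e.g. "120"
    if value.isdigit():
        return int(value)

    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback
    if retry_at.tzinfo is None:
        # Treat a date without zone information as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

Inside the 429 branch, the wait could then be computed as `retry_after_seconds(response.headers.get("Retry-After"), delay)` before calling `time.sleep`.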

In [None]:
# Download and display the image from bit.ly URL
url = "http://bit.ly/46xv3sL"

try:
    img = download_image_with_retry(url)
    
    # Display the image
    plt.figure(figsize=(12, 9))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Downloaded Image from Wikimedia')
    plt.tight_layout()
    plt.show()
    
    print("\nImage details:")
    print(f"Size: {img.size}")
    print(f"Mode: {img.mode}")
    
    # Save the image
    img.save('downloaded_image.png')
    print("✓ Saved to: downloaded_image.png")
    
except Exception as e:
    print(f"Failed to download image: {e}")

## Method 2: Direct Wikimedia URL (Alternative)

If you already have the direct Wikimedia URL, you can skip the shortened link and pass that URL to the same function, which sends the same headers.

In [None]:
# Direct Wikimedia URL (from the error message)
wikimedia_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/Good_Smile_Company_offices_ladies.jpg/800px-Good_Smile_Company_offices_ladies.jpg"

try:
    img = download_image_with_retry(wikimedia_url, max_retries=5, initial_delay=2)
    
    plt.figure(figsize=(12, 9))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Good Smile Company Office')
    plt.show()
    
except Exception as e:
    print(f"Failed: {e}")

## Method 3: Using Requests Session for Multiple Downloads

In [None]:
def create_download_session():
    """
    Create a requests session with proper headers for image downloads.
    Useful for downloading multiple images efficiently.
    """
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Python Image Downloader/1.0 (Educational/Research; Python requests)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Referer': 'https://www.google.com/',
    })
    return session

# Create session
session = create_download_session()

# Download with session
def download_with_session(session, url, delay=3):
    """Download image using a session with rate limiting delay."""
    time.sleep(delay)  # Respectful delay to avoid rate limits
    response = session.get(url, timeout=15, allow_redirects=True)
    response.raise_for_status()
    return Image.open(BytesIO(response.content))

# Example usage
try:
    img = download_with_session(session, "http://bit.ly/46xv3sL", delay=3)
    plt.figure(figsize=(12, 9))
    plt.imshow(img)
    plt.axis('off')
    plt.show()
except Exception as e:
    print(f"Error: {e}")

## Understanding the Errors

### HTTP 403 Forbidden
**Cause**: Server blocks requests without proper identification

**Solution**: Add a User-Agent header that identifies your application

### HTTP 429 Too Many Requests
**Cause**: Rate limiting to prevent abuse

**Solution**: 
1. Add delays between requests
2. Implement exponential backoff (2s → 4s → 8s → 16s)
3. Use a descriptive User-Agent so servers can contact you if needed
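
The doubling schedule in step 2 can be sketched as a small generator. `backoff_delays` and its `max_delay` cap are illustrative names that do not appear in the notebook's code; the cap is an added assumption so the wait cannot grow without bound:

```python
def backoff_delays(initial_delay=2, factor=2, max_retries=4, max_delay=60):
    """Yield the wait (in seconds) before each retry: 2, 4, 8, 16 by default."""
    delay = initial_delay
    for _ in range(max_retries):
        # Never wait longer than max_delay, however many retries have passed
        yield min(delay, max_delay)
        delay *= factor


print(list(backoff_delays()))  # [2, 4, 8, 16]
```

When many clients retry at the same moment, adding a small random offset (jitter) to each delay is a common refinement.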

### Wikimedia-Specific Requirements

Wikimedia asks for:
- A User-Agent that identifies your bot/tool (required)
- Contact information in the User-Agent (recommended)
- Respect for rate limits, which means waiting between requests
- Following the guidance linked from the error message: https://w.wiki/GHai

Example of a good User-Agent for Wikimedia:
```python
headers = {'User-Agent': 'MyBot/1.0 (your@email.com) Python/3.x'}
```
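
One way to assemble such a string consistently is a small helper. This is only a sketch: `build_user_agent` and its parameters are hypothetical names, and the contact address is a placeholder to replace with a real one.

```python
import platform


def build_user_agent(tool_name, version, contact):
    """Return a User-Agent string in the form 'Tool/1.0 (contact) Python/3.x.y'."""
    if not contact:
        raise ValueError("contact must be a non-empty email address or URL")
    return f"{tool_name}/{version} ({contact}) Python/{platform.python_version()}"


# Placeholder values; substitute your own tool name and contact address
headers = {"User-Agent": build_user_agent("MyBot", "1.0", "your@email.com")}
print(headers["User-Agent"])
```

The resulting `headers` dict can be passed to `requests.get(url, headers=headers, ...)` or merged into a session with `session.headers.update(headers)`.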

## Key Takeaways

1. **Always add a User-Agent header** - Identifies your application
2. **Implement retry logic** - Handle temporary failures and rate limits
3. **Use exponential backoff** - Gradually increase wait time between retries
4. **Be respectful** - Don't hammer servers; add delays between requests
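5. **Handle redirects** - `requests.get` follows redirects by default; passing `allow_redirects=True` makes it explicit that URL shorteners are followed
6. **Add timeouts** - `requests` has no default timeout, so pass the `timeout` parameter to prevent hanging requests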

## Best Practices

```python
# ✓ Good: Respectful scraping
headers = {'User-Agent': 'MyApp/1.0 (contact@example.com)'}
for url in urls:
    response = requests.get(url, headers=headers, timeout=15)
    time.sleep(2)  # Wait between requests

# ✗ Bad: Will likely get blocked
for url in urls:
    requests.get(url)  # No headers, no delays, no timeout
```
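
When downloading many images, a fixed `time.sleep` after every request waits longer than needed if the request itself was slow. A minimal sketch of a throttle that enforces a minimum interval between requests instead; the `Throttle` class is a hypothetical helper, and its `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting:

```python
import time


class Throttle:
    """Hypothetical helper: keep at least `min_interval` seconds between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        """Sleep just long enough to respect the minimum interval; return the time slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record when this call finished so the next one can measure from here
        self._last = self._clock()
        return slept
```

Creating `throttle = Throttle(2)` once and calling `throttle.wait()` before each `requests.get(...)` in the loop would take the place of the fixed `time.sleep(2)`.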