# Web Scraping for SEO Content Analysis

## Project Overview

This project demonstrates a systematic approach to web scraping for educational content analysis. By extracting structured data from an SEO tutorial website, this notebook showcases key data engineering skills including HTTP requests, HTML parsing, and data transformation.

**Technologies:** Python, Requests, BeautifulSoup, Pandas  
**Domain:** Digital Marketing Analytics | Content Strategy

---

## Business Context

Understanding how educational content is structured and organized supports several business activities:
- **Content Strategy:** Identify topic coverage and gaps in competitor educational resources
- **SEO Analysis:** Understand how tutorial sites organize information for search optimization
- **Learning Path Design:** Extract curricula structures for course development
- **Market Research:** Analyze what topics are prioritized in digital marketing education

---

## Project Objectives

1. **Extract** structured content from a web-based SEO tutorial
2. **Parse** HTML to identify headings, links, and hierarchical structure
3. **Transform** raw HTML into analyzable data formats
4. **Demonstrate** the core web scraping workflow and flag where production code needs error handling

---

## 1. Environment Setup & Library Imports

This section imports the essential Python libraries needed for web scraping and data manipulation:

- **`requests`**: HTTP library for sending GET/POST requests to web servers and retrieving HTML content
- **`BeautifulSoup (bs4)`**: HTML/XML parser that creates a navigable tree structure for extracting data from markup
- **`pandas`**: Data manipulation library for structuring scraped data into DataFrames for analysis

**Why these libraries?**  
Requests, BeautifulSoup, and pandas are a widely used minimal stack for scraping static HTML pages. They do not execute JavaScript, so pages rendered client-side need a browser-automation tool instead.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


## 2. Data Collection: HTTP Request & Page Retrieval

This section demonstrates the fundamental web scraping workflow:

### Code Breakdown:
```python
url = requests.get("https://www.tutorialsfreak.com/seo-tutorial")
```
- Sends an HTTP GET request to the target URL
- Returns a `Response` object containing the status code, headers, and content
- Despite its name, the `url` variable holds that `Response` object, not the URL string

```python
print(url)          # Shows Response object and status code
print(url.text)     # Extracts raw HTML as string
```

### What This Accomplishes:
✅ Confirms the server responded (`<Response [200]>` means success)  
✅ Retrieves full HTML source code of the page  
✅ Provides raw data for subsequent parsing

**Production Note:** `requests.get` has no default timeout, so real-world code should pass `timeout=`, check the status code, and handle 404s, connection errors, and rate limiting (HTTP 429).
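
The error handling mentioned in the production note could look roughly like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the `User-Agent` string are illustrative assumptions and not part of the original notebook; `timeout`, `raise_for_status()`, and the `requests.exceptions` classes are standard Requests features.

```python
import requests

TUTORIAL_URL = "https://www.tutorialsfreak.com/seo-tutorial"


def fetch_html(url, timeout=10.0, session=None):
    """Return the page HTML as a string, or None if the request fails."""
    http = session or requests  # a Session can be passed in to reuse connections
    try:
        response = http.get(
            url,
            timeout=timeout,  # without this, requests can wait indefinitely
            headers={"User-Agent": "seo-content-analysis-notebook/0.1"},  # hypothetical identifier
        )
        response.raise_for_status()  # raises HTTPError for 4xx/5xx (e.g. 404, 429)
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error for {url}: {err}")
    except requests.exceptions.RequestException as err:
        print(f"Request failed for {url}: {err}")
    else:
        return response.text
    return None
```

With this in place, `html = fetch_html(TUTORIAL_URL)` would stand in for the bare `requests.get(...)` call, followed by a `None` check before any parsing.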

## 3. HTML Parsing & Data Structuring

Now that we have the raw HTML, we'll use BeautifulSoup to parse it and extract structured data into a pandas DataFrame.

### Objective:
Extract the SEO tutorial links (titles and URLs) and organize them into a clean, analyzable dataset.

### What We'll Extract:
- **Titles**: The visible text of each tutorial link
- **URLs**: Direct links to each tutorial page, converted to absolute form

Section headings (e.g., "Introduction to SEO", "Keyword Research") are not captured by the link filter used in this section.

This structured dataset can then be used for content analysis, competitive benchmarking, or building a knowledge base index.
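
Pairing each link with its section heading could be done by walking headings and links in document order. The sketch below runs on a hardcoded HTML snippet: the markup, the `h2`/`h3` choice, and the `links_by_section` name are assumptions, since the real page's structure is not shown in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may nest headings and links differently.
SAMPLE_HTML = """
<div>
  <h2>Introduction to SEO</h2>
  <a href="/seo-tutorial/what-is-seo">What is SEO?</a>
  <a href="/seo-tutorial/how-search-works">How Search Engines Work</a>
  <h2>Keyword Research</h2>
  <a href="/seo-tutorial/keyword-basics">Keyword Basics</a>
</div>
"""


def links_by_section(html, heading_tags=("h2", "h3")):
    """Return one row per link, tagged with the most recent heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    current_section = None
    rows = []
    # find_all with a list of tag names yields matches in document order
    for tag in soup.find_all(list(heading_tags) + ["a"]):
        if tag.name in heading_tags:
            current_section = tag.get_text(strip=True)
        elif tag.get("href"):
            rows.append({
                "Section": current_section,
                "Title": tag.get_text(strip=True),
                "URL": tag["href"],
            })
    return rows


for row in links_by_section(SAMPLE_HTML):
    print(row)
```

Links that appear before any heading get `Section=None`, which makes navigation or footer links easy to filter out afterwards.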

In [None]:
# Step 1: Parse the HTML with BeautifulSoup
soup = BeautifulSoup(url.text, 'html.parser')

# Step 2: Find all links on the page
links = soup.find_all('a', href=True)

# Step 3: Extract tutorial links (filter for SEO-related URLs)
tutorial_data = []

for link in links:
    href = link.get('href')
    title = link.get_text(strip=True)

    # Filter for relevant tutorial links
    if '/seo-tutorial/' in href and title:
        tutorial_data.append({
            'Title': title,
            'URL': 'https://www.tutorialsfreak.com' + href if href.startswith('/') else href
        })

print(f"Found {len(tutorial_data)} tutorial pages")
print("\nFirst 5 tutorials:")
for i, item in enumerate(tutorial_data[:5], 1):
    print(f"{i}. {item['Title']}")
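
The string concatenation in the cell above only handles hrefs that start with `/`, and the loop keeps repeated links (for example, the same page linked from both a menu and the body). A minimal alternative sketch uses the standard library's `urljoin` and a set of seen URLs. The `normalize_links` name and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tutorialsfreak.com/seo-tutorial"


def normalize_links(pairs, base_url=BASE_URL):
    """Turn (title, href) pairs into rows with absolute, de-duplicated URLs."""
    seen = set()
    rows = []
    for title, href in pairs:
        absolute = urljoin(base_url, href)  # resolves relative and absolute hrefs alike
        if absolute in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(absolute)
        rows.append({"Title": title, "URL": absolute})
    return rows


# Hypothetical hrefs: the first two point at the same page.
raw_links = [
    ("What is SEO?", "/seo-tutorial/what-is-seo"),
    ("What is SEO?", "https://www.tutorialsfreak.com/seo-tutorial/what-is-seo"),
    ("Keyword Research", "/seo-tutorial/keyword-research"),
]
clean_rows = normalize_links(raw_links)
print(len(clean_rows))
```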

In [None]:
# Step 4: Convert to pandas DataFrame
df_tutorials = pd.DataFrame(tutorial_data)

# Step 5: Add additional metadata
df_tutorials['Scraped_Date'] = pd.Timestamp.now().strftime('%Y-%m-%d')
df_tutorials['Source'] = 'TutorialsFreak SEO Tutorial'

# Step 6: Display dataset info
print("="*60)
print("STRUCTURED SEO TUTORIAL DATASET")
print("="*60)
print(f"\nTotal Records: {len(df_tutorials)}")
print(f"Columns: {list(df_tutorials.columns)}")
print(f"\nDataset Shape: {df_tutorials.shape}")
print("\n" + "="*60)
print("FIRST 10 ROWS:")
print("="*60)
df_tutorials.head(10)

In [None]:
# Step 7: Export to CSV for reusability
filename = 'seo_tutorial_dataset.csv'
df_tutorials.to_csv(filename, index=False)

print("\n" + "="*60)
print("✅ DATASET SUCCESSFULLY SAVED!")
print("="*60)
print(f"\nFilename: {filename}")
print("Location: Current working directory")
print("Format: CSV (comma-separated values)")
print(f"Records exported: {len(df_tutorials)}")
print("\nThis dataset can now be used for:")
print("  • Content gap analysis")
print("  • Competitive benchmarking")
print("  • SEO topic research")
print("  • Learning path design")
print("\n" + "="*60)

## 4. Text Analysis & NLP: Keyword Extraction

Now we'll perform basic Natural Language Processing (NLP) to extract insights from the scraped content.

### Objective:
Analyze the text content to identify the most emphasized SEO concepts and keywords on the tutorial page.

### NLP Techniques Applied:
- **Text Extraction**: Pull clean text from HTML body
- **Text Normalization**: Convert to lowercase, remove punctuation and stopwords
- **Tokenization**: Split text into individual words
- **Frequency Analysis**: Count word occurrences to identify key themes
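
The four steps listed above can be traced on a single hardcoded sentence. This is a toy sketch: the sample sentence and the four-word stopword set are made up for illustration and are much smaller than what a real analysis would use.

```python
import re
from collections import Counter

# Made-up sample text standing in for the scraped page body.
sample = (
    "SEO basics: keyword research, on-page SEO, and link building. "
    "Keyword research comes first!"
)
STOPWORDS = {"and", "the", "on", "comes"}  # tiny illustrative list

normalized = sample.lower()                      # normalization
tokens = re.findall(r"\b[a-z]+\b", normalized)   # tokenization (drops punctuation)
filtered = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
counts = Counter(filtered)                       # frequency analysis

print(tokens)
print(counts.most_common(3))
```

Note that the regex splits the hyphenated term "on-page" into "on" and "page", and "on" is then removed as a stopword. Multi-word SEO terms therefore lose their meaning in a single-word count.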

### Business Value:
Understanding keyword frequency helps with:
- **Content Strategy**: Identify which SEO topics are most emphasized
- **Competitive Analysis**: See what concepts competitors prioritize
- **SEO Research**: Surface the terminology a site uses most often as a starting point for keyword research

In [None]:
import re
from collections import Counter

# Step 1: Extract text from the HTML body (fall back to the whole document if <body> is missing)
text_content = (soup.body or soup).get_text(separator=' ', strip=True)

# Step 2: Clean the text - collapse whitespace and normalize case
text_content = re.sub(r'\s+', ' ', text_content)  # Collapse any run of whitespace into one space
text_content = text_content.lower()  # Convert to lowercase

# Step 3: Tokenize - keep runs of ASCII letters, which drops punctuation and digits
# (hyphenated terms such as "on-page" are split into separate words)
words = re.findall(r'\b[a-z]+\b', text_content)

# Step 4: Define common stopwords to exclude
stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
             'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'should',
             'can', 'could', 'may', 'might', 'must', 'this', 'that', 'these', 'those',
             'i', 'you', 'he', 'she', 'it', 'we', 'they', 'what', 'which', 'who',
             'when', 'where', 'why', 'how', 'all', 'each', 'every', 'both', 'few',
             'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
             'own', 'same', 'so', 'than', 'too', 'very', 'as', 'from', 'by', 'up', 'about'}

# Step 5: Filter out stopwords
filtered_words = [word for word in words if word not in stopwords and len(word) > 2]

print(f"Total words extracted: {len(words):,}")
print(f"Words after cleaning: {len(filtered_words):,}")
print(f"\nFirst 50 words: {' '.join(filtered_words[:50])}...")

In [None]:
# Step 6: Perform frequency analysis
word_freq = Counter(filtered_words)
top_keywords = word_freq.most_common(20)

# Display results
print("="*70)
print("TOP 20 KEYWORDS IN SEO TUTORIAL CONTENT")
print("="*70)
print(f"{'Rank':<6} {'Keyword':<20} {'Frequency':<12} {'Percentage'}")
print("-"*70)

for rank, (word, count) in enumerate(top_keywords, 1):
    percentage = (count / len(filtered_words)) * 100
    print(f"{rank:<6} {word:<20} {count:<12} {percentage:.2f}%")

print("\n" + "="*70)
print("✅ INSIGHTS:")
print("="*70)
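if top_keywords:  # Guard: top_keywords[0] raises IndexError when no words were found
    top_word, top_count = top_keywords[0]
    print(f"Most emphasized keyword: '{top_word}' (appears {top_count} times)")
    print(f"\nTop 5 SEO concepts: {', '.join(word for word, _ in top_keywords[:5])}")
    print("\nThese keywords represent the core topics covered in the tutorial,")
    print("helping identify content focus areas and competitive positioning.")
else:
    print("No keywords found; check that the page returned text content.")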

---

## Business Applications & Insights

### Potential Use Cases for This Scraping Framework:

**1. Competitive Content Analysis**
- Track how competitors structure their educational content
- Identify topic gaps in your own content strategy
- Benchmark against industry-standard SEO tutorial organization

**2. Automated Content Auditing**
- Periodically scrape tutorial sites to monitor for content updates
- Alert teams when new topics are added to industry knowledge bases
- Build a content coverage matrix for internal L&D teams
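
The monitoring idea above amounts to comparing two scrape snapshots. A minimal sketch, assuming each snapshot is a CSV with a `URL` column like the one exported in Section 3; the function names and the sample URLs are hypothetical.

```python
import csv
import io


def urls_in_snapshot(csv_text):
    """Collect the URL column of one exported snapshot into a set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["URL"] for row in reader}


def diff_snapshots(old_csv, new_csv):
    """Return (added, removed) URLs between two snapshots, each sorted."""
    old_urls = urls_in_snapshot(old_csv)
    new_urls = urls_in_snapshot(new_csv)
    return sorted(new_urls - old_urls), sorted(old_urls - new_urls)


# Hypothetical snapshots; real ones would be read from dated CSV files.
last_week = "Title,URL\nWhat is SEO?,https://example.com/seo-tutorial/what-is-seo\n"
this_week = (
    "Title,URL\n"
    "What is SEO?,https://example.com/seo-tutorial/what-is-seo\n"
    "Keyword Research,https://example.com/seo-tutorial/keyword-research\n"
)

added, removed = diff_snapshots(last_week, this_week)
print("New pages:", added)
print("Removed pages:", removed)
```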

**3. Lead Generation & Market Research**
- Extract contact information and social links from tutorial pages
- Build lists of educational resource providers for partnership outreach
- Identify trending topics based on content volume and recency

**4. SEO Research & Keyword Mapping**
- Automatically extract meta tags, headings, and keyword density from tutorial pages
- Compare on-page optimization strategies across multiple educational sites
- Build a database of how top sites structure content around specific keywords
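
Pulling the title, meta description, and headings from a page could be sketched as follows. The HTML here is a hardcoded stand-in and `on_page_summary` is a hypothetical helper name; real pages may omit any of these elements, which is why each lookup falls back to `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only to exercise the helper.
SAMPLE_PAGE = """
<html>
  <head>
    <title>SEO Tutorial for Beginners</title>
    <meta name="description" content="Learn SEO step by step.">
  </head>
  <body>
    <h1>SEO Tutorial</h1>
    <h2>Keyword Research</h2>
    <h2>Link Building</h2>
  </body>
</html>
"""


def on_page_summary(html):
    """Extract basic on-page SEO elements from one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs= is needed because "name" is also find()'s tag-name argument
    description = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "meta_description": description.get("content") if description else None,
        "headings": [
            (tag.name, tag.get_text(strip=True))
            for tag in soup.find_all(["h1", "h2", "h3"])
        ],
    }


summary = on_page_summary(SAMPLE_PAGE)
print(summary)
```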

---

## Conclusion & Next Steps

### What This Project Demonstrates:

✅ **Technical Proficiency**: Ability to send HTTP requests, parse HTML, and extract structured data  
✅ **Business Acumen**: Understanding how web scraping supports content strategy, SEO research, and competitive intelligence  
✅ **Documentation Standards**: Clear, recruiter-friendly explanations of technical concepts  
✅ **Production Awareness**: Recognition of error handling, rate limiting, and ethical scraping practices

---

### Recommended Enhancements:

Data transformation into a pandas DataFrame, basic NLP (word frequency analysis), and CSV export are already covered in Sections 3 and 4. The remaining enhancements would strengthen the project further:

1. **Create Visualizations**: Bar chart showing distribution of topics or keyword frequency
2. **Scale to Multiple Pages**: Loop through multiple tutorial URLs and aggregate data
3. **Add Error Handling**: try/except blocks, status code checks, and timeout management
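
The multi-page enhancement could be structured as below. This is a sketch under assumptions: `scrape_many` is a hypothetical helper, the `fetch` and `parse` callables are supplied by the caller, and `fetch` is assumed to return `None` on failure. The demo uses stand-in functions so it runs without network access.

```python
import time


def scrape_many(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each URL in turn; return (rows, failed_urls)."""
    rows, failed = [], []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to avoid hammering the server
        html = fetch(url)
        if html is None:  # convention assumed here: fetch returns None on failure
            failed.append(url)
            continue
        for row in parse(html):
            rows.append({**row, "Source_URL": url})  # record where each row came from
    return rows, failed


# Stand-in pages: the second URL simulates a failed request.
fake_pages = {"https://example.com/a": "Page A", "https://example.com/b": None}
rows, failed = scrape_many(
    fake_pages,
    fetch=fake_pages.get,
    parse=lambda html: [{"Title": html}],
    delay_seconds=0,
)
print(rows)
print(failed)
```

The returned `rows` list has the same list-of-dicts shape as `tutorial_data`, so it can be passed to `pd.DataFrame` in the same way, while `failed` gives a list of URLs to retry.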
