# Base44 Phenomenon Analysis - Data Collection

This notebook demonstrates the data collection process for Base44 applications using web scraping techniques.

## Research Question
"Base44 has become a phenomenon; we need to analyze the different types of projects using the tool and their level"

## Objectives
1. Scrape Base44 applications from multiple sources
2. Collect comprehensive app metadata
3. Store data in structured format for analysis
4. Validate data quality and completeness

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

# Import custom modules
from base44_scraper import Base44Scraper

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
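# 'seaborn-v0_8' is only available in matplotlib >= 3.6; fall back to the default style otherwise
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'default')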

print("Libraries imported successfully!")
print(f"Analysis started at: {datetime.now()}")

## Data Collection Strategy

We will collect Base44 application data from multiple sources:

1. **Base44.com website** - Official showcase/gallery
2. **Product Hunt** - Base44 app launches
3. **Social Media** - Twitter/X mentions
4. **Web Search** - Google searches for Base44 apps
5. **Community Sources** - Forums and Discord

### Data Points Collected
- App name and URL
- Description/purpose
- Category (MVP, internal tool, portal, SaaS replacement)
- Creation date (if available)
- Creator information
- Industry/domain
- Features mentioned
- User testimonials/reviews
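
One way to hold the data points listed above is a small record type. The sketch below is an assumption about shape only: `AppRecord` and its field names are hypothetical and are not taken from `base44_scraper`. The `source` field mirrors the column used later in this notebook.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class AppRecord:
    """Hypothetical record for one scraped app (not the scraper's real schema)."""
    name: str
    url: Optional[str] = None
    description: Optional[str] = None
    category: Optional[str] = None    # e.g. "MVP", "internal tool", "portal", "SaaS replacement"
    created_at: Optional[str] = None  # ISO date string, if available
    creator: Optional[str] = None
    industry: Optional[str] = None
    features: List[str] = field(default_factory=list)
    testimonials: List[str] = field(default_factory=list)
    source: Optional[str] = None

    def to_row(self) -> dict:
        """Flatten to a CSV-friendly dict; list fields become delimited strings."""
        row = asdict(self)
        row["features"] = ", ".join(self.features)
        row["testimonials"] = " | ".join(self.testimonials)
        return row


example = AppRecord(name="Demo App", url="https://example.com", features=["auth", "dashboard"])
example_row = example.to_row()
```

Storing `features` as a comma-separated string matches how the feature counts are computed in the exploration cells further down.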

In [None]:
# Initialize the Base44 scraper with rate limiting
scraper = Base44Scraper(rate_limit=2.0)  # 2 second delay between requests

print("Base44 Scraper initialized")
print(f"Rate limit: {scraper.rate_limit} seconds between requests")
print(f"User Agent: {scraper.session.headers['User-Agent'][:50]}...")
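
The `rate_limit=2.0` argument implies a minimum gap between requests. How `Base44Scraper` enforces it is not shown in this notebook. The following is a minimal sketch of one common approach, with `RateLimiter` as a hypothetical helper that is not part of the scraper module.

```python
import time


class RateLimiter:
    """Hypothetical helper: enforce a minimum delay between successive calls."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # no request made yet

    def wait(self) -> float:
        """Sleep just long enough to honour min_interval; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept


limiter = RateLimiter(min_interval=0.01)  # tiny interval so the demo runs quickly
first = limiter.wait()    # the first call never sleeps
second = limiter.wait()   # later calls sleep for whatever remains of the interval
```

In a scraper, `limiter.wait()` would be called immediately before each HTTP request.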

## 1. Scraping Base44 Official Showcase

In [None]:
# Scrape Base44 official showcase
print("Starting Base44 showcase scraping...")
showcase_apps = scraper.scrape_base44_showcase()

print(f"Found {len(showcase_apps)} apps from Base44 showcase")
if showcase_apps:
    print("\nSample app from showcase:")
    print(json.dumps(showcase_apps[0], indent=2))

## 2. Scraping Product Hunt

In [None]:
# Search Product Hunt for Base44 applications
print("Starting Product Hunt search...")
ph_apps = scraper.search_product_hunt("Base44")

print(f"Found {len(ph_apps)} apps from Product Hunt")
if ph_apps:
    print("\nSample app from Product Hunt:")
    print(json.dumps(ph_apps[0], indent=2))

## 3. Scraping Social Media Mentions

In [None]:
# Search social media for Base44 mentions
print("Starting social media search...")
social_apps = scraper.search_social_media("twitter")

print(f"Found {len(social_apps)} apps from social media")
if social_apps:
    print("\nSample app from social media:")
    print(json.dumps(social_apps[0], indent=2))

## 4. Web Search for Base44 Applications

In [None]:
# Search web for Base44 app mentions
print("Starting web search...")
web_apps = scraper.search_web_mentions()

print(f"Found {len(web_apps)} apps from web search")
if web_apps:
    print("\nSample app from web search:")
    print(json.dumps(web_apps[0], indent=2))

## 5. Comprehensive Data Collection

In [None]:
# Run complete scraping pipeline
print("Starting comprehensive Base44 application scraping...")
all_apps = scraper.run_full_scrape()

print(f"\nScraping completed!")
print(f"Total unique applications found: {len(all_apps)}")
print(f"Data saved to: data/raw/base44_apps.csv and data/raw/base44_apps.json")
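
The pipeline reports "unique" applications, which implies that records found by more than one source are merged. The internals of `run_full_scrape` are not shown here, so the sketch below is only an illustration of one possible rule: `dedupe_key` and `merge_unique` are hypothetical names, and the rule (normalised URL first, lowercased name as fallback) is an assumption.

```python
from typing import Dict, Iterable, List


def dedupe_key(app: Dict) -> str:
    """Prefer a normalised URL as the identity key; fall back to the lowercased name."""
    # Lowercasing the whole URL is a simplification: paths can be case-sensitive.
    url = (app.get("url") or "").strip().lower().rstrip("/")
    if url:
        return "url:" + url
    return "name:" + (app.get("name") or "").strip().lower()


def merge_unique(*sources: Iterable[Dict]) -> List[Dict]:
    """Concatenate source lists, keeping the first record seen for each key."""
    seen = set()
    merged = []
    for source in sources:
        for app in source:
            key = dedupe_key(app)
            if key in seen:
                continue
            seen.add(key)
            merged.append(app)
    return merged


showcase = [{"name": "Demo", "url": "https://example.com/demo/"}]
product_hunt = [
    {"name": "demo", "url": "https://EXAMPLE.com/demo"},  # same app, different casing
    {"name": "Other", "url": None},
]
unique_apps = merge_unique(showcase, product_hunt)
```

Keeping the first record seen means source order sets precedence, so the most trusted source would be passed first.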

## 6. Initial Data Exploration

In [None]:
# Load the scraped data
try:
    apps_df = pd.read_csv('../data/raw/base44_apps.csv')
    print(f"Loaded {len(apps_df)} applications")
    
    # Display basic information
    print("\n=== Dataset Overview ===")
    print(f"Shape: {apps_df.shape}")
    print(f"Columns: {list(apps_df.columns)}")
    
    # Display first few rows
    print("\n=== First 5 Applications ===")
    display(apps_df.head())
    
except FileNotFoundError:
    print("Data file not found. Please run the scraping cells first.")
    apps_df = pd.DataFrame()

In [None]:
if not apps_df.empty:
    # Data quality assessment
    print("=== Data Quality Assessment ===")
    print("\nMissing values:")
    missing_data = apps_df.isnull().sum()
    print(missing_data[missing_data > 0])
    
    print("\nData types:")
    print(apps_df.dtypes)
    
    # Check for duplicates
    duplicates = apps_df.duplicated(['name', 'url']).sum()
    print(f"\nDuplicate entries: {duplicates}")
    
    # Source distribution
    print("\n=== Source Distribution ===")
    source_counts = apps_df['source'].value_counts()
    print(source_counts)
    
    # Category distribution
    print("\n=== Category Distribution ===")
    category_counts = apps_df['category'].value_counts()
    print(category_counts)

## 7. Initial Visualizations

In [None]:
if not apps_df.empty:
    # Create visualizations for initial data exploration
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Base44 Applications - Initial Data Analysis', fontsize=16)
    
    # Source distribution
    source_counts = apps_df['source'].value_counts()
    axes[0, 0].pie(source_counts.values, labels=source_counts.index, autopct='%1.1f%%')
    axes[0, 0].set_title('Distribution by Data Source')
    
    # Category distribution
    category_counts = apps_df['category'].value_counts()
    axes[0, 1].bar(category_counts.index, category_counts.values)
    axes[0, 1].set_title('Distribution by Application Category')
    axes[0, 1].tick_params(axis='x', rotation=45)
    
    # Industry distribution
    if 'industry' in apps_df.columns:
        industry_counts = apps_df['industry'].value_counts().head(10)
        axes[1, 0].barh(industry_counts.index, industry_counts.values)
        axes[1, 0].set_title('Top 10 Industries')
    
    # Feature count distribution
    if 'features' in apps_df.columns:
        # Count comma-separated features; missing or blank entries count as 0
        feature_counts = apps_df['features'].apply(
            lambda x: len(str(x).split(',')) if pd.notna(x) and str(x).strip() else 0
        )
        axes[1, 1].hist(feature_counts, bins=10, edgecolor='black')
        axes[1, 1].set_title('Distribution of Feature Count per App')
        axes[1, 1].set_xlabel('Number of Features')
        axes[1, 1].set_ylabel('Number of Apps')
    
    plt.tight_layout()
    plt.show()

## 8. Summary Statistics

In [None]:
if not apps_df.empty:
    # Generate comprehensive summary statistics
    summary_stats = {
        'total_applications': len(apps_df),
        'unique_sources': apps_df['source'].nunique(),
        'unique_categories': apps_df['category'].nunique(),
        'unique_industries': apps_df['industry'].nunique() if 'industry' in apps_df.columns else 0,
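        # Cast NumPy integers to int so json writes numbers instead of strings
        'apps_with_urls': int(apps_df['url'].notna().sum()),
        'apps_with_descriptions': int(apps_df['description'].notna().sum()),
        'apps_with_features': int(apps_df['features'].notna().sum()) if 'features' in apps_df.columns else 0,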
        'data_collection_date': datetime.now().isoformat(),
        'source_breakdown': apps_df['source'].value_counts().to_dict(),
        'category_breakdown': apps_df['category'].value_counts().to_dict()
    }
    
    print("=== Data Collection Summary ===")
    print(json.dumps(summary_stats, indent=2, default=str))
    
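    # Save summary to file (create the output directory if it does not exist yet)
    os.makedirs('../data/processed', exist_ok=True)
    with open('../data/processed/data_collection_summary.json', 'w') as f:
        json.dump(summary_stats, f, indent=2, default=str)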
    
    print("\nSummary saved to: data/processed/data_collection_summary.json")

## 9. Data Validation and Quality Checks

In [None]:
if not apps_df.empty:
    print("=== Data Validation Results ===")
    
    # Check for required fields
    required_fields = ['name', 'description', 'category', 'source']
    for field in required_fields:
        missing_count = apps_df[field].isna().sum()
        completeness = (len(apps_df) - missing_count) / len(apps_df) * 100
        print(f"{field}: {completeness:.1f}% complete ({missing_count} missing)")
    
    # Check URL validity
    valid_urls = apps_df['url'].apply(
        lambda x: str(x).startswith(('http://', 'https://')) if pd.notna(x) else False
    ).sum()
    print(f"\nValid URLs: {valid_urls}/{len(apps_df)} ({valid_urls/len(apps_df)*100:.1f}%)")
    
    # Check description length
    desc_lengths = apps_df['description'].apply(
        lambda x: len(str(x)) if pd.notna(x) else 0
    )
    print(f"\nDescription lengths:")
    print(f"  Average: {desc_lengths.mean():.1f} characters")
    print(f"  Median: {desc_lengths.median():.1f} characters")
    print(f"  Range: {desc_lengths.min()}-{desc_lengths.max()} characters")
    
    # Feature analysis
    if 'features' in apps_df.columns:
        feature_counts = apps_df['features'].apply(
            lambda x: len(str(x).split(',')) if pd.notna(x) and str(x).strip() else 0
        )
        print(f"\nFeature counts per app:")
        print(f"  Average: {feature_counts.mean():.1f} features")
        print(f"  Median: {feature_counts.median():.1f} features")
        print(f"  Range: {feature_counts.min()}-{feature_counts.max()} features")
    
    print("\n✓ Data validation completed")
    print("Review the completeness figures above before treating the dataset as ready for analysis")
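
The `startswith` check above only inspects the scheme prefix, so a bare `https://` would count as valid. A slightly stricter variant, sketched here with the standard library's `urllib.parse.urlparse`, also requires a host part. `is_probable_url` is a hypothetical helper name, and the rule is an assumption about what "valid" should mean for this dataset.

```python
from urllib.parse import urlparse


def is_probable_url(value) -> bool:
    """Return True for strings with an http(s) scheme and a non-empty host."""
    if not isinstance(value, str):
        return False  # covers NaN/None coming from pandas
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


samples = ["https://example.com/app", "https://", "example.com", "ftp://example.com"]
checks = {url: is_probable_url(url) for url in samples}
```

Applied to the DataFrame, this would read `apps_df['url'].apply(is_probable_url).sum()`.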

## Next Steps

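This notebook covers the data collection phase. Its outputs are:

1. **Base44 applications** gathered from the sources listed above
2. **Metadata** for each application (name, URL, description, category, source and, where available, industry and features)
3. **Quality checks** reporting field completeness, URL validity, and description and feature statistics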

### Files Generated:
- `data/raw/base44_apps.csv` - Main dataset
- `data/raw/base44_apps.json` - JSON format
- `data/processed/data_collection_summary.json` - Collection summary

### Next Notebooks:
1. **02_app_classification.ipynb** - Classify and categorize applications
2. **03_quality_analysis.ipynb** - Evaluate application quality metrics
3. **04_visualization_results.ipynb** - Create comprehensive visualizations

This data will serve as the foundation for analyzing Base44 as a phenomenon in the no-code development space.