# 📊 GitHub Octoverse - Time Series Forecasting Tutorial

**Welcome to Time Series Forecasting for Beginners!**

This notebook will walk you through **step-by-step** how to:
1. 🔍 Load and explore your data
2. 🛠️ Prepare data for time series models
3. 🤖 Build XGBoost and Prophet models
4. 📈 Create 1-year and 5-year forecasts
5. ✅ Validate your models with backtesting
6. 📊 Visualize and interpret results

---

## 🎯 Project Goal
Predict GitHub developer signup growth for **top 10 countries** to support the 2025+ Octoverse publication.

**Target Countries:**
- India, United States, Brazil, China, Japan
- Germany, Indonesia, United Kingdom, Canada
- Major African Markets (Egypt, Nigeria, Kenya, South Africa, Morocco)


## 📦 Step 1: Install and Import Required Libraries

**What we're doing:** Setting up all the tools we need for time series forecasting.

**Why these libraries:**
- `darts`: The main library for time series forecasting (provides wrappers around XGBoost and Prophet)
- `pandas`: For handling data tables (like Excel but in Python)
- `numpy`: For mathematical operations
- `matplotlib` & `seaborn`: For creating charts and graphs
- `scikit-learn`: For additional machine learning tools
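
Before running the install cell, it can help to see which of these packages are already present. The sketch below uses only the standard library. The helper name `report_versions` is hypothetical, and `prophet` is included on the assumption that your Darts version needs it installed separately.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: version or None} without importing the packages.

    `report_versions` is a hypothetical helper for this tutorial.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Assumption: `prophet` is listed because Darts may not install it by default.
required = ["darts", "pandas", "numpy", "matplotlib", "seaborn",
            "scikit-learn", "xgboost", "prophet"]
for name, version in report_versions(required).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Anything reported as `NOT INSTALLED` should be picked up by the install cell.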

In [None]:
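# Install required packages (run this once)
# `prophet` is listed explicitly because recent Darts releases treat it as an optional dependency
!pip install darts prophet pandas numpy matplotlib seaborn scikit-learn xgboost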

In [None]:
# Import all the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta

# Darts - our main time series library
from darts import TimeSeries
from darts.models import XGBModel, Prophet
from darts.metrics import mape, mae, rmse
from darts.utils.timeseries_generation import datetime_attribute_timeseries

# Make plots look nice
plt.style.use('seaborn-v0_8')
warnings.filterwarnings('ignore')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📅 Analysis started at: {datetime.now()}")

## 📁 Step 2: Load Your Data

**What we're doing:** Loading the CSV files with GitHub developer signup data.

**Expected data format:**
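- Columns: `country_name`, `created_cohort` (YYYY-MM), `new_signups`, `prev_year_signups`, `yoy_percent` (this tutorial only uses the first three; the generated sample data omits `prev_year_signups` and `yoy_percent`)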
- One row per country per month
- Data from Jan 2018 to present
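
A quick check that a CSV matches this format can save debugging time later. The following is a minimal sketch, assuming only the three columns the tutorial actually uses are mandatory; `validate_signup_frame` and `REQUIRED_COLUMNS` are hypothetical names, not part of any library.

```python
import pandas as pd

# Assumption: only these three columns are needed by the rest of the tutorial.
REQUIRED_COLUMNS = ["country_name", "created_cohort", "new_signups"]


def validate_signup_frame(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # created_cohort must parse as YYYY-MM
    parsed = pd.to_datetime(df["created_cohort"], format="%Y-%m", errors="coerce")
    if parsed.isna().any():
        problems.append("created_cohort has values that are not YYYY-MM")
    # one row per country per month
    if df.duplicated(subset=["country_name", "created_cohort"]).any():
        problems.append("duplicate (country_name, created_cohort) rows")
    # signups must be non-negative numbers
    signups = pd.to_numeric(df["new_signups"], errors="coerce")
    if (signups.fillna(-1) < 0).any():
        problems.append("new_signups has negative or non-numeric values")
    return problems


example = pd.DataFrame({
    "country_name": ["India", "India", "Brazil"],
    "created_cohort": ["2018-01", "2018-01", "2018/02"],
    "new_signups": [50000, 51000, 30000],
})
print(validate_signup_frame(example))
```

The example frame is deliberately flawed: it has a duplicated India row and one cohort written with a slash.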

In [None]:
import os

def load_github_signup_data(file_path=None):
    """
    Load GitHub developer signup data from a CSV file, or create sample data for the tutorial.
    
    Returns:
        pandas.DataFrame: Data with columns [country_name, created_cohort, new_signups]
                          (a real CSV may also include prev_year_signups and yoy_percent)
    """
    
    if file_path and os.path.exists(file_path):
        print(f"📂 Loading data from: {file_path}")
        df = pd.read_csv(file_path)
        print(f"✅ Loaded {len(df)} rows of data")
        return df
    else:
        if file_path:
            print(f"⚠️ File not found: {file_path}. Falling back to sample data.")
        print("📊 Creating sample data for tutorial purposes...")
        
        # Create sample data that looks like real GitHub signup trends
        countries = [
            'India', 'United States', 'Brazil', 'China', 'Japan',
            'Germany', 'Indonesia', 'United Kingdom', 'Canada',
            'Nigeria', 'Kenya', 'Egypt', 'South Africa', 'Morocco'
        ]
        
        # Generate monthly dates from 2018-01 to 2025-08
        date_range = pd.date_range('2018-01', '2025-08', freq='MS')
        
        np.random.seed(42)  # Make the sample data reproducible between runs
        sample_data = []
        
        for country in countries:
            # Different base growth patterns for different countries
            if country == 'India':
                base = 50000  # High base for India
                growth_rate = 0.02  # 2% monthly growth
            elif country == 'United States':
                base = 40000
                growth_rate = 0.015
            elif country in ['Brazil', 'China']:
                base = 30000
                growth_rate = 0.025
            elif country in ['Nigeria', 'Kenya']:
                base = 15000
                growth_rate = 0.035  # Higher growth for emerging markets
            else:
                base = 25000
                growth_rate = 0.02
            
            for i, date in enumerate(date_range):
                # Add some seasonality and noise
                seasonal_factor = 1 + 0.1 * np.sin(2 * np.pi * i / 12)  # Yearly seasonality
                noise = np.random.normal(1, 0.1)  # Random variation
                covid_impact = 1.5 if date.year == 2020 and date.month in [3, 4, 5] else 1.0
                
                signups = int(base * (1 + growth_rate) ** i * seasonal_factor * noise * covid_impact)
                
                sample_data.append({
                    'country_name': country,
                    'created_cohort': date.strftime('%Y-%m'),
                    'new_signups': max(signups, 1000),  # Minimum 1000 signups
                })
        
        df = pd.DataFrame(sample_data)
        print(f"✅ Created {len(df)} rows of sample data")
        return df

# Load your data (replace 'your_data.csv' with your actual file path)
# df = load_github_signup_data('your_data.csv')
df = load_github_signup_data()  # Using sample data for tutorial

# Display first few rows
print("\n🔍 First 5 rows of data:")
print(df.head())

print("\n📊 Data summary:")
print(f"Countries: {df['country_name'].nunique()}")
print(f"Date range: {df['created_cohort'].min()} to {df['created_cohort'].max()}")
print(f"Total months: {len(df) // df['country_name'].nunique()}")

## 🔧 Step 3: Data Preparation and Exploration

**What we're doing:** Converting our data into the format that Darts (our forecasting library) can understand.

**Key concepts:**
- **TimeSeries**: Darts' special format for time series data
- **Datetime index**: Converting month strings to actual dates
- **Individual series**: One TimeSeries for each country
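
A regularly spaced monthly index matters here: a series with a missing month generally cannot be treated as monthly without filling the gap first. The sketch below uses pandas only to list absent months for one country; `find_missing_months` is a hypothetical helper name.

```python
import pandas as pd


def find_missing_months(cohorts):
    """Given 'YYYY-MM' strings, return the months absent between the first and last."""
    dates = pd.to_datetime(pd.Series(list(cohorts)), format="%Y-%m")
    # "MS" = month start, which is what a parsed 'YYYY-MM' string becomes
    expected = pd.date_range(dates.min(), dates.max(), freq="MS")
    missing = expected.difference(pd.DatetimeIndex(dates))
    return [d.strftime("%Y-%m") for d in missing]


print(find_missing_months(["2018-01", "2018-02", "2018-04", "2018-05"]))
# ['2018-03']
```

If this returns anything for your own data, decide how to fill those months (for example with interpolation) before building the series.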

In [None]:
def prepare_time_series_data(df):
    """
    Convert DataFrame to Darts TimeSeries format.
    
    Returns:
        dict: Dictionary with country names as keys and TimeSeries as values
    """
    
    print("🔧 Preparing data for time series forecasting...")
    
    # Convert created_cohort to datetime
    df['date'] = pd.to_datetime(df['created_cohort'], format='%Y-%m')
    
    # Dictionary to store TimeSeries for each country
    country_series = {}
    
    # Create TimeSeries for each country
    for country in df['country_name'].unique():
        # Filter data for this country
        country_data = df[df['country_name'] == country].copy()
        
        # Sort by date
        country_data = country_data.sort_values('date')
        
        # Create TimeSeries
        series = TimeSeries.from_dataframe(
            country_data, 
            time_col='date', 
            value_cols='new_signups'
        )
        
        country_series[country] = series
        print(f"✅ {country}: {len(series)} data points from {series.start_time()} to {series.end_time()}")
    
    return country_series

# Prepare the data
series_dict = prepare_time_series_data(df)

print(f"\n🎉 Successfully created TimeSeries for {len(series_dict)} countries!")

In [None]:
# Let's visualize the data for a few countries to understand the trends
def plot_country_trends(series_dict, countries_to_plot=None, figsize=(15, 10)):
    """
    Plot time series for selected countries.
    """
    if countries_to_plot is None:
        countries_to_plot = list(series_dict.keys())[:6]  # Plot first 6 countries
    
    fig, axes = plt.subplots(2, 3, figsize=figsize)
    axes = axes.flatten()
    
    for i, country in enumerate(countries_to_plot):
        if i < len(axes):
            series = series_dict[country]
            series.plot(ax=axes[i], label=country)
            axes[i].set_title(f'{country} - Developer Signups')
            axes[i].set_ylabel('New Signups')
            axes[i].grid(True, alpha=0.3)
            
            # Format y-axis to show numbers in thousands
            axes[i].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x/1000:.0f}K'))
    
    plt.tight_layout()
    plt.suptitle('📈 GitHub Developer Signups by Country', y=1.02, fontsize=16)
    plt.show()

# Plot trends for top countries
top_countries = ['India', 'United States', 'Brazil', 'China', 'Germany', 'United Kingdom']
plot_country_trends(series_dict, top_countries)

## 🤖 Step 4: Building Your First Forecasting Models

**What we're doing:** Creating and training two different forecasting models.

**The Models:**
1. **Prophet**: Good for data with clear seasonality and trends. Easy to interpret.
2. **XGBoost**: Powerful machine learning model that can capture complex patterns.

**Why use both?** Different models are better for different situations. We'll compare them!

### 🔮 Model 1: Prophet (Facebook's Forecasting Model)

**What Prophet is good at:**
- Handling seasonality (yearly patterns)
- Working with missing data
- Easy to understand and interpret
- Great for business forecasting

In [None]:
def create_prophet_forecast(country_series, country_name, forecast_periods=12):
    """
    Create a Prophet forecast for a specific country.
    
    Parameters:
    - country_series: TimeSeries data for the country
    - country_name: Name of the country
    - forecast_periods: Number of months to forecast (default: 12 months)
    
    Returns:
    - forecast: Predicted values
    - model: Trained Prophet model
    """
    
    print(f"🔮 Creating Prophet forecast for {country_name}...")
    
    # Configure Prophet model
    # These settings tell Prophet how to handle different patterns in the data
    prophet_model = Prophet(
        seasonality_mode='multiplicative',  # Seasonality grows with the trend
        yearly_seasonality=True,            # Look for yearly patterns
        weekly_seasonality=False,           # No weekly patterns (we have monthly data)
        daily_seasonality=False,            # No daily patterns (we have monthly data)
        changepoint_prior_scale=0.05        # How flexible the trend can be
    )
    
    # Train the model
    prophet_model.fit(country_series)
    
    # Make forecast
    forecast = prophet_model.predict(n=forecast_periods)
    
    print(f"✅ Prophet forecast completed for {country_name}!")
    print(f"   📊 Forecasting {forecast_periods} months ahead")
    print(f"   📈 Forecast range: {forecast.start_time()} to {forecast.end_time()}")
    
    return forecast, prophet_model

# Example: Create Prophet forecast for India
india_series = series_dict['India']
india_forecast_prophet, india_prophet_model = create_prophet_forecast(india_series, 'India', forecast_periods=24)

# Visualize the forecast
plt.figure(figsize=(12, 6))
india_series.plot(label='Historical Data', color='blue')
india_forecast_prophet.plot(label='Prophet Forecast', color='red', linestyle='--')
plt.title('📈 India - GitHub Developer Signups: Historical vs Prophet Forecast')
plt.ylabel('New Signups')
plt.legend()
plt.grid(True, alpha=0.3)
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x/1000:.0f}K'))
plt.show()

print(f"\n🎯 Prophet Forecast Summary for India:")
print(f"   Last historical value: {india_series.values()[-1][0]:,.0f} signups")
print(f"   First forecast value: {india_forecast_prophet.values()[0][0]:,.0f} signups")
print(f"   Average monthly forecast: {india_forecast_prophet.values().mean():,.0f} signups")

### 🚀 Model 2: XGBoost (Machine Learning Model)

**What XGBoost is good at:**
- Learning complex patterns
- Using additional features (covariates)
- High accuracy for many types of data
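- Handling non-linear relationships

**One caveat:** tree-based models such as XGBoost cannot predict values far outside the range they saw during training, so on a strongly trending series their long-horizon forecasts tend to flatten out. Keep this in mind when comparing them with Prophet.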
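
Conceptually, a lag-based regressor turns a single series into a supervised learning table: each row holds the previous few values, and the target is the value that follows. The sketch below illustrates that idea with NumPy. It is not Darts' internal implementation, and `make_lag_matrix` is a hypothetical helper name.

```python
import numpy as np


def make_lag_matrix(values, n_lags):
    """Build (X, y): each row of X holds the n_lags values that precede the target in y."""
    values = np.asarray(values, dtype=float)
    X = np.array([values[i - n_lags:i] for i in range(n_lags, len(values))])
    y = values[n_lags:]
    return X, y


series = [10, 12, 15, 19, 24, 30]
X, y = make_lag_matrix(series, n_lags=3)
print(X)  # three rows: [10, 12, 15], [12, 15, 19], [15, 19, 24]
print(y)  # targets: 19, 24, 30
```

With `n_lags=12` on monthly data, each row would contain a full year of history, which is how a setting like `lags=12` lets the model pick up yearly patterns.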

In [None]:
def create_xgboost_forecast(country_series, country_name, forecast_periods=12):
    """
    Create an XGBoost forecast for a specific country.
    
    Parameters:
    - country_series: TimeSeries data for the country
    - country_name: Name of the country
    - forecast_periods: Number of months to forecast
    
    Returns:
    - forecast: Predicted values
    - model: Trained XGBoost model
    """
    
    print(f"🚀 Creating XGBoost forecast for {country_name}...")
    
    # Configure XGBoost model
    # These settings control how the model learns from the data
    xgb_model = XGBModel(
        lags=12,                    # Look at past 12 months to predict next month
        n_estimators=100,           # Number of decision trees to build
        learning_rate=0.1,          # How fast the model learns
        max_depth=6,                # How complex each tree can be
        random_state=42             # For reproducible results
    )
    
    # Train the model
    xgb_model.fit(country_series)
    
    # Make forecast
    forecast = xgb_model.predict(n=forecast_periods)
    
    print(f"✅ XGBoost forecast completed for {country_name}!")
    print(f"   📊 Forecasting {forecast_periods} months ahead")
    print(f"   📈 Forecast range: {forecast.start_time()} to {forecast.end_time()}")
    
    return forecast, xgb_model

# Example: Create XGBoost forecast for India
india_forecast_xgb, india_xgb_model = create_xgboost_forecast(india_series, 'India', forecast_periods=24)

# Visualize both forecasts together
plt.figure(figsize=(14, 8))

# Plot historical data
india_series.plot(label='Historical Data', color='blue', linewidth=2)

# Plot both forecasts
india_forecast_prophet.plot(label='Prophet Forecast', color='red', linestyle='--', linewidth=2)
india_forecast_xgb.plot(label='XGBoost Forecast', color='green', linestyle='-.', linewidth=2)

plt.title('📈 India - Comparing Prophet vs XGBoost Forecasts', fontsize=16)
plt.ylabel('New Signups', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x/1000:.0f}K'))
plt.tight_layout()
plt.show()

# Compare the forecasts
print(f"\n🎯 Forecast Comparison for India:")
print(f"   📊 Prophet average monthly: {india_forecast_prophet.values().mean():,.0f} signups")
print(f"   📊 XGBoost average monthly: {india_forecast_xgb.values().mean():,.0f} signups")

prophet_total = india_forecast_prophet.values().sum()
xgb_total = india_forecast_xgb.values().sum()
print(f"   🎯 Prophet 24-month total: {prophet_total:,.0f} signups")
print(f"   🎯 XGBoost 24-month total: {xgb_total:,.0f} signups")
print(f"   📈 Difference: {abs(prophet_total - xgb_total):,.0f} signups ({abs(prophet_total - xgb_total)/prophet_total*100:.1f}%)")

## ✅ Step 5: Model Validation with Backtesting

**What is backtesting?**
We pretend we're in the past and test how well our models would have predicted what actually happened.

**Why is this important?**
It tells us if our models are reliable and which one performs better on real data.

**How it works:**
1. Split data: hold out the most recent 12 months for testing and train on everything before them
2. Make predictions for the held-out months
3. Compare predictions with the actual held-out values
4. Calculate accuracy metrics
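
The steps above can be sketched in plain Python with a seasonal-naive baseline, which simply repeats the last observed year. This is an illustrative assumption rather than part of the original notebook, and the helper names (`holdout_split`, `seasonal_naive`, `mape_percent`) are hypothetical. A baseline like this gives the Prophet and XGBoost scores something to beat.

```python
def holdout_split(values, test_size):
    """Split a chronologically ordered list, keeping the last test_size points for testing."""
    if not 0 < test_size < len(values):
        raise ValueError("test_size must be between 1 and len(values) - 1")
    return values[:-test_size], values[-test_size:]


def seasonal_naive(train, horizon, season=12):
    """Forecast by repeating the last `season` observed values."""
    last_cycle = train[-season:]
    return [last_cycle[i % len(last_cycle)] for i in range(horizon)]


def mape_percent(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy data: two identical years, then a third year that is 10% higher.
pattern = [100, 110, 120, 130, 120, 110, 100, 90, 80, 90, 100, 110]
history = pattern * 2 + [v * 1.1 for v in pattern]

train, test = holdout_split(history, test_size=12)
baseline = seasonal_naive(train, horizon=len(test))
print(f"Seasonal-naive MAPE: {mape_percent(test, baseline):.1f}%")
```

If a model cannot beat this baseline on the held-out months, its extra complexity is not yet paying off.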

In [None]:
def backtest_models(country_series, country_name, test_months=12):
    """
    Test how well our models would have performed on past data.
    
    Parameters:
    - country_series: TimeSeries data for the country
    - country_name: Name of the country
    - test_months: Number of months to use for testing
    
    Returns:
    - Dictionary with results
    """
    
    print(f"✅ Backtesting models for {country_name}...")
    
    # Split data into train and test
    train_series, test_series = country_series[:-test_months], country_series[-test_months:]
    
    print(f"   📚 Training period: {train_series.start_time()} to {train_series.end_time()}")
    print(f"   🧪 Testing period: {test_series.start_time()} to {test_series.end_time()}")
    
    # Train Prophet model
    prophet_model = Prophet(
        seasonality_mode='multiplicative',
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False
    )
    prophet_model.fit(train_series)
    prophet_pred = prophet_model.predict(n=test_months)
    
    # Train XGBoost model
    xgb_model = XGBModel(
        lags=12,
        n_estimators=100,
        learning_rate=0.1,
        random_state=42
    )
    xgb_model.fit(train_series)
    xgb_pred = xgb_model.predict(n=test_months)
    
    # Calculate accuracy metrics
    # MAPE = Mean Absolute Percentage Error (lower is better)
    prophet_mape = mape(test_series, prophet_pred)
    xgb_mape = mape(test_series, xgb_pred)
    
    # MAE = Mean Absolute Error (lower is better)
    prophet_mae = mae(test_series, prophet_pred)
    xgb_mae = mae(test_series, xgb_pred)
    
    results = {
        'country': country_name,
        'prophet_mape': prophet_mape,
        'xgb_mape': xgb_mape,
        'prophet_mae': prophet_mae,
        'xgb_mae': xgb_mae,
        'test_series': test_series,
        'prophet_pred': prophet_pred,
        'xgb_pred': xgb_pred,
        'better_model': 'Prophet' if prophet_mape < xgb_mape else 'XGBoost'
    }
    
    print(f"\n   📊 Results for {country_name}:")
    print(f"      Prophet MAPE: {prophet_mape:.1f}% (lower is better)")
    print(f"      XGBoost MAPE: {xgb_mape:.1f}% (lower is better)")
    print(f"      🏆 Better model: {results['better_model']}")
    
    return results

# Backtest for India
india_backtest = backtest_models(india_series, 'India')

# Visualize backtest results
plt.figure(figsize=(14, 8))

# Plot the test data (actual values)
india_backtest['test_series'].plot(label='Actual (Test Data)', color='blue', linewidth=3)

# Plot predictions
india_backtest['prophet_pred'].plot(label='Prophet Prediction', color='red', linestyle='--', linewidth=2)
india_backtest['xgb_pred'].plot(label='XGBoost Prediction', color='green', linestyle='-.', linewidth=2)

plt.title('🧪 Backtest Results for India - How Well Did Our Models Predict the Last 12 Months?', fontsize=16)
plt.ylabel('New Signups', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x/1000:.0f}K'))

# Add accuracy info to the plot
plt.text(0.02, 0.98, 
         f"Prophet MAPE: {india_backtest['prophet_mape']:.1f}%\nXGBoost MAPE: {india_backtest['xgb_mape']:.1f}%\nBetter model: {india_backtest['better_model']}",
         transform=plt.gca().transAxes, fontsize=12, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

## 🌍 Step 6: Create Forecasts for All Countries

**What we're doing:** Running the chosen model (Prophet by default; switch to whichever performed better in your backtests) on all target countries to get 1-year and 5-year projections.

**Output:** 
- 1-year forecast (next 12 months)
- 5-year forecast (next 60 months)
- Summary statistics and rankings
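
The gap between a 12-month and a 60-month horizon is easy to underestimate because monthly growth compounds. As a rough illustration (not a forecast), the sketch below applies some of the monthly growth rates from the sample data generator to a starting level of 50,000 signups. `compound_level` is a hypothetical helper name.

```python
def compound_level(base, monthly_rate, months):
    """Level reached after compounding a constant monthly growth rate."""
    return base * (1 + monthly_rate) ** months


# Growth rates taken from the sample data generator; the base is illustrative.
for rate in (0.015, 0.02, 0.025):
    after_1y = compound_level(50_000, rate, 12)
    after_5y = compound_level(50_000, rate, 60)
    print(f"{rate:.1%} monthly -> 1 year: {after_1y:,.0f}, 5 years: {after_5y:,.0f}")
```

Small differences in the assumed monthly rate barely move the 1-year figure but change the 5-year figure substantially, so treat the long-horizon totals with more caution.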

In [None]:
def create_all_country_forecasts(series_dict, model_type='Prophet'):
    """
    Create forecasts for all countries.
    
    Parameters:
    - series_dict: Dictionary of TimeSeries for all countries
    - model_type: 'Prophet' or 'XGBoost'
    
    Returns:
    - Dictionary with forecasts and summaries
    """
    
    print(f"🌍 Creating {model_type} forecasts for all countries...")
    
    results = {}
    
    for country, series in series_dict.items():
        print(f"\n🔄 Processing {country}...")
        
        try:
            # Fit once for the full 5-year horizon; its first 12 months are the 1-year forecast
            if model_type == 'Prophet':
                forecast_5yr, model = create_prophet_forecast(series, country, forecast_periods=60)
            else:  # XGBoost
                forecast_5yr, model = create_xgboost_forecast(series, country, forecast_periods=60)
            forecast_1yr = forecast_5yr[:12]
            
            # Calculate totals
            total_1yr = forecast_1yr.values().sum()
            total_5yr = forecast_5yr.values().sum()
            
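            # Growth rate here = change from the last actual month to the first forecast month
            # (a single-month comparison, so it can be noisy)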
            last_actual = series.values()[-1][0]
            first_forecast = forecast_1yr.values()[0][0]
            growth_rate = ((first_forecast / last_actual) - 1) * 100
            
            results[country] = {
                'forecast_1yr': forecast_1yr,
                'forecast_5yr': forecast_5yr,
                'total_1yr': total_1yr,
                'total_5yr': total_5yr,
                'monthly_avg_1yr': total_1yr / 12,
                'monthly_avg_5yr': total_5yr / 60,
                'last_actual': last_actual,
                'growth_rate': growth_rate,
                'model': model
            }
            
            print(f"   ✅ {country} complete:")
            print(f"      📅 1-year total: {total_1yr:,.0f} signups")
            print(f"      📅 5-year total: {total_5yr:,.0f} signups")
            print(f"      📈 Growth rate: {growth_rate:+.1f}%")
            
        except Exception as e:
            print(f"   ❌ Error with {country}: {str(e)}")
            continue
    
    return results

# Create forecasts for all countries using Prophet (you can change to 'XGBoost')
all_forecasts = create_all_country_forecasts(series_dict, model_type='Prophet')

print(f"\n🎉 Completed forecasts for {len(all_forecasts)} countries!")

## 📊 Step 7: Results Analysis and Visualization

**What we're doing:** 
- Ranking countries by projected growth
- Creating summary tables and charts
- Identifying key insights for the Octoverse report

In [None]:
# Create summary DataFrame for easy analysis
def create_summary_table(all_forecasts):
    """
    Create a summary table with all key metrics.
    """
    
    summary_data = []
    
    for country, data in all_forecasts.items():
        summary_data.append({
            'Country': country,
            '1Y_Total_Signups': data['total_1yr'],
            '5Y_Total_Signups': data['total_5yr'],
            '1Y_Monthly_Avg': data['monthly_avg_1yr'],
            '5Y_Monthly_Avg': data['monthly_avg_5yr'],
            'Last_Actual_Value': data['last_actual'],
            'Growth_Rate_%': data['growth_rate']
        })
    
    df_summary = pd.DataFrame(summary_data)
    
    # Sort by 1-year total signups (descending)
    df_summary = df_summary.sort_values('1Y_Total_Signups', ascending=False)
    df_summary['Rank_1Y'] = range(1, len(df_summary) + 1)
    
    # Sort by 5-year total signups for 5Y ranking
    df_summary['Rank_5Y'] = df_summary['5Y_Total_Signups'].rank(method='dense', ascending=False).astype(int)
    
    return df_summary

summary_df = create_summary_table(all_forecasts)

print("🏆 GITHUB OCTOVERSE - DEVELOPER SIGNUP PROJECTIONS")
print("=" * 80)
print("\n📊 TOP 10 COUNTRIES - 1 YEAR PROJECTION RANKING:")
print("-" * 80)

# Display top 10 for 1-year projections
top10_1y = summary_df.head(10)[['Rank_1Y', 'Country', '1Y_Total_Signups', '1Y_Monthly_Avg', 'Growth_Rate_%']].copy()
top10_1y['1Y_Total_Signups'] = top10_1y['1Y_Total_Signups'].apply(lambda x: f"{x:,.0f}")
top10_1y['1Y_Monthly_Avg'] = top10_1y['1Y_Monthly_Avg'].apply(lambda x: f"{x:,.0f}")
top10_1y['Growth_Rate_%'] = top10_1y['Growth_Rate_%'].apply(lambda x: f"{x:+.1f}%")

print(top10_1y.to_string(index=False))

print("\n📊 TOP 10 COUNTRIES - 5 YEAR PROJECTION RANKING:")
print("-" * 80)

# Display top 10 for 5-year projections
top10_5y = summary_df.sort_values('5Y_Total_Signups', ascending=False).head(10)[['Rank_5Y', 'Country', '5Y_Total_Signups', '5Y_Monthly_Avg']].copy()
top10_5y['5Y_Total_Signups'] = top10_5y['5Y_Total_Signups'].apply(lambda x: f"{x:,.0f}")
top10_5y['5Y_Monthly_Avg'] = top10_5y['5Y_Monthly_Avg'].apply(lambda x: f"{x:,.0f}")

print(top10_5y.to_string(index=False))

# Calculate combined totals across the tracked markets
global_1y_total = summary_df['1Y_Total_Signups'].sum()
global_5y_total = summary_df['5Y_Total_Signups'].sum()

print("\n🌍 COMBINED PROJECTIONS (ALL TRACKED MARKETS):")
print(f"   📈 Total 1-year signups across all markets: {global_1y_total:,.0f}")
print(f"   📈 Total 5-year signups across all markets: {global_5y_total:,.0f}")
print(f"   📊 Average monthly signups (1Y): {global_1y_total/12:,.0f}")
print(f"   📊 Average monthly signups (5Y): {global_5y_total/60:,.0f}")

In [None]:
# Create visualization of top countries
def create_forecast_visualizations(summary_df, all_forecasts):
    """
    Create comprehensive visualizations of the forecast results.
    """
    
    fig = plt.figure(figsize=(20, 12))
    
    # 1. Bar chart of 1-year projections
    ax1 = plt.subplot(2, 3, 1)
    top_countries_1y = summary_df.head(8)
    bars1 = ax1.bar(range(len(top_countries_1y)), top_countries_1y['1Y_Total_Signups']/1000000, 
                    color='skyblue', edgecolor='navy', alpha=0.7)
    ax1.set_title('📈 1-Year Projections\n(Top 8 Countries)', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Countries')
    ax1.set_ylabel('Total Signups (Millions)')
    ax1.set_xticks(range(len(top_countries_1y)))
    ax1.set_xticklabels(top_countries_1y['Country'], rotation=45, ha='right')
    ax1.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars1):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                f'{height:.1f}M', ha='center', va='bottom', fontweight='bold')
    
    # 2. Bar chart of 5-year projections
    ax2 = plt.subplot(2, 3, 2)
    top_countries_5y = summary_df.sort_values('5Y_Total_Signups', ascending=False).head(8)
    bars2 = ax2.bar(range(len(top_countries_5y)), top_countries_5y['5Y_Total_Signups']/1000000,
                    color='lightgreen', edgecolor='darkgreen', alpha=0.7)
    ax2.set_title('📈 5-Year Projections\n(Top 8 Countries)', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Countries')
    ax2.set_ylabel('Total Signups (Millions)')
    ax2.set_xticks(range(len(top_countries_5y)))
    ax2.set_xticklabels(top_countries_5y['Country'], rotation=45, ha='right')
    ax2.grid(True, alpha=0.3)
    
    # Add value labels
    for i, bar in enumerate(bars2):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                f'{height:.1f}M', ha='center', va='bottom', fontweight='bold')
    
    # 3. Growth rate comparison
    ax3 = plt.subplot(2, 3, 3)
    growth_countries = summary_df.sort_values('Growth_Rate_%', ascending=False).head(8)
    colors = ['green' if x > 0 else 'red' for x in growth_countries['Growth_Rate_%']]
    bars3 = ax3.bar(range(len(growth_countries)), growth_countries['Growth_Rate_%'],
                    color=colors, alpha=0.7)
    ax3.set_title('📊 Growth Rate Projections\n(Top 8 Countries)', fontsize=14, fontweight='bold')
    ax3.set_xlabel('Countries')
    ax3.set_ylabel('Growth Rate (%)')
    ax3.set_xticks(range(len(growth_countries)))
    ax3.set_xticklabels(growth_countries['Country'], rotation=45, ha='right')
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=0, color='black', linestyle='-', alpha=0.5)
    
    # 4. Time series for top 3 countries
    ax4 = plt.subplot(2, 3, (4, 6))
    top_3_countries = summary_df.head(3)['Country'].tolist()
    
    # Note: series_dict is the notebook-level dictionary created in Step 3
    for country in top_3_countries:
        if country in series_dict and country in all_forecasts:
            # Plot historical data
            series_dict[country].plot(ax=ax4, label=f'{country} (Historical)', alpha=0.7)
            # Plot 1-year forecast
            all_forecasts[country]['forecast_1yr'].plot(ax=ax4, label=f'{country} (Forecast)', 
                                                      linestyle='--', alpha=0.9)
    
    ax4.set_title('📈 Historical Data + 1-Year Forecasts\n(Top 3 Countries)', fontsize=14, fontweight='bold')
    ax4.set_ylabel('New Signups')
    ax4.set_xlabel('Date')
    ax4.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax4.grid(True, alpha=0.3)
    ax4.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x/1000:.0f}K'))
    
    plt.tight_layout()
    plt.suptitle('🌍 GitHub Octoverse - Developer Growth Projections Dashboard', 
                 fontsize=18, fontweight='bold', y=0.98)
    plt.show()

# Create the visualizations
create_forecast_visualizations(summary_df, all_forecasts)

## 💾 Step 8: Export Results

**What we're doing:** Saving our forecasts and analysis to files that you can:
- Share with your team
- Import into other tools
- Use for presentations
- Reference later

In [None]:
# Save summary table to CSV
summary_df.to_csv('GitHub_Octoverse_Forecasts_Summary.csv', index=False)
print("✅ Summary table saved to 'GitHub_Octoverse_Forecasts_Summary.csv'")

# Create detailed monthly forecasts for each country
detailed_forecasts = []

for country, data in all_forecasts.items():
    # 1-year monthly breakdown
    forecast_1yr = data['forecast_1yr']
    for i, timestamp in enumerate(forecast_1yr.time_index):
        detailed_forecasts.append({
            'Country': country,
            'Date': timestamp.strftime('%Y-%m'),
            'Forecast_Period': '1_Year',
            'Month_Number': i + 1,
            'Projected_Signups': int(forecast_1yr.values()[i][0])
        })
    
    # Months 13-60 of the 5-year forecast (the first 12 are already covered above)
    forecast_5yr = data['forecast_5yr']
    for i in range(12, min(60, len(forecast_5yr))):
        timestamp = forecast_5yr.time_index[i]
        detailed_forecasts.append({
            'Country': country,
            'Date': timestamp.strftime('%Y-%m'),
            'Forecast_Period': '5_Year',
            'Month_Number': i + 1,
            'Projected_Signups': int(forecast_5yr.values()[i][0])
        })

detailed_df = pd.DataFrame(detailed_forecasts)
detailed_df.to_csv('GitHub_Octoverse_Detailed_Monthly_Forecasts.csv', index=False)
print("✅ Detailed monthly forecasts saved to 'GitHub_Octoverse_Detailed_Monthly_Forecasts.csv'")

# Create a PowerPoint-ready summary
print("\n" + "="*80)
print("📋 EXECUTIVE SUMMARY FOR OCTOVERSE REPORT")
print("="*80)

print("\n🎯 KEY FINDINGS:")
print(f"• Top market by projected 1-year signups: {summary_df.iloc[0]['Country']} ({summary_df.iloc[0]['1Y_Total_Signups']:,.0f} signups)")
print(f"• Top market by projected 5-year signups: {summary_df.sort_values('5Y_Total_Signups', ascending=False).iloc[0]['Country']}")
print(f"• Highest growth rate: {summary_df.sort_values('Growth_Rate_%', ascending=False).iloc[0]['Country']} ({summary_df.sort_values('Growth_Rate_%', ascending=False).iloc[0]['Growth_Rate_%']:+.1f}%)")
print(f"• Combined 1-year projection (tracked markets): {global_1y_total:,.0f} new developer signups")
print(f"• Combined 5-year projection (tracked markets): {global_5y_total:,.0f} new developer signups")

print("\n📈 METHODOLOGY:")
print("• Used Prophet time series forecasting model")
print("• Based on historical data from 2018-2025")
print("• Accounts for seasonality and growth trends")
print("• Backtested on a 12-month holdout (run for India in this notebook)")

print("\n💡 RECOMMENDATIONS:")
print("• Focus developer outreach on top 3 growth markets")
print("• Plan infrastructure capacity based on projected volumes")
print("• Consider regional customization for high-growth countries")
print("• Monitor actual performance against these forecasts monthly")

print(f"\n📊 Analysis completed on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("🎉 All files saved successfully!")

## 🎓 What You've Learned

Congratulations! You've just completed a full time series forecasting project. Here's what you accomplished:

### ✅ **Skills Mastered:**
1. **Data Loading**: Loaded CSV files into Python for analysis
2. **Time Series Preparation**: Converted data into proper time series format
3. **Forecasting Models**: Built both Prophet and XGBoost models
4. **Model Validation**: Used backtesting to test model accuracy
5. **Results Analysis**: Created rankings and summary statistics
6. **Visualization**: Made professional charts and dashboards
7. **Export**: Saved results for sharing and presentations

### 🔧 **Technical Concepts:**
- **TimeSeries objects**: Darts' format for time series data
- **MAPE**: Mean Absolute Percentage Error (an error metric; lower is better)
- **Seasonality**: Yearly patterns in the data
- **Backtesting**: Testing models on historical data
- **Prophet**: Facebook's business forecasting model
- **XGBoost**: Machine learning model for predictions

### 📊 **Business Impact:**
- 1-year and 5-year growth projections for top markets
- Country rankings for strategic planning
- Data-driven insights for Octoverse publication
- Methodology that can be repeated and improved

---

## 🚀 Next Steps

**Want to make your forecasts even better?**

1. **Add Covariates**: Include external factors like:
   - Economic indicators (GDP growth, internet penetration)
   - GitHub product releases (Copilot launches, new features)
   - Developer events and conferences

2. **Ensemble Models**: Combine Prophet and XGBoost predictions

3. **Advanced Validation**: Use cross-validation and confidence intervals

4. **Real-time Updates**: Set up automated data refreshing
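
One simple way to approach the ensemble idea is a weighted average in which the model with the lower backtest error gets the larger weight. The following is a minimal sketch in plain Python. The helper names and all numbers are hypothetical, and inverse-error weighting is only one of several reasonable schemes.

```python
def inverse_error_weights(errors):
    """Turn per-model error scores (lower is better) into weights that sum to 1."""
    inverses = {name: 1.0 / err for name, err in errors.items()}
    total = sum(inverses.values())
    return {name: inv / total for name, inv in inverses.items()}


def blend_forecasts(forecasts, weights):
    """Weighted average of equally long forecast lists, month by month."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][t] for name in forecasts)
        for t in range(horizon)
    ]


# Hypothetical backtest MAPEs and 3-month forecasts, for illustration only.
backtest_mape = {"prophet": 5.0, "xgboost": 15.0}
forecasts = {"prophet": [100.0, 110.0, 120.0], "xgboost": [80.0, 90.0, 100.0]}

weights = inverse_error_weights(backtest_mape)
print(weights)
print(blend_forecasts(forecasts, weights))
```

In the notebook, the MAPE values would come from `backtest_models` and the forecast lists from each model's predicted values.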

**Remember**: The key to good forecasting is continuous improvement based on new data and feedback!
