# GitHub Trending Projects Research

This notebook explores the OSS Insight API to gather trending GitHub projects from the past 24 hours.

## Objectives
- Test the OSS Insight API for trending repositories
- Gather data from past 24 hours (all languages)
- Analyze the data structure and quality
- Explore potential integration with our newsletter pipeline

## API Endpoint
- **Base URL**: `https://api.ossinsight.io/v1/trends/repos/`
- **Parameters**: 
  - `period=past_24_hours`
  - `language=All` (default, all languages)


In [1]:
# Import required libraries
import requests
import json
import pandas as pd
from datetime import datetime
import time

# Set up display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)


In [2]:
# API Configuration
API_BASE_URL = "https://api.ossinsight.io/v1/trends/repos/"
PERIOD = "past_24_hours"
LANGUAGE = "All"  # All languages

# Build the request URL
url = f"{API_BASE_URL}?period={PERIOD}&language={LANGUAGE}"

print(f"🔗 API URL: {url}")
print(f"📅 Period: {PERIOD}")
print(f"🌐 Language: {LANGUAGE}")
print(f"⏰ Request time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


🔗 API URL: https://api.ossinsight.io/v1/trends/repos/?period=past_24_hours&language=All
📅 Period: past_24_hours
🌐 Language: All
⏰ Request time: 2025-09-27 20:34:39


In [3]:
# Make the API request
print("🚀 Making API request...")
start_time = time.time()

try:
    response = requests.get(url, headers={'Accept': 'application/json'}, timeout=30)
    response.raise_for_status()  # Raise an exception for bad status codes
    
    request_time = time.time() - start_time
    print(f"✅ Request successful! ({request_time:.2f}s)")
    print(f"📊 Status Code: {response.status_code}")
    print(f"📏 Response Size: {len(response.content):,} bytes")
    
except requests.exceptions.RequestException as e:
    print(f"❌ Request failed: {e}")
    response = None


🚀 Making API request...
✅ Request successful! (0.52s)
📊 Status Code: 200
📏 Response Size: 36,457 bytes


In [4]:
# Parse and examine the response
if response:
    try:
        data = response.json()
        print("✅ JSON parsing successful!")
        
        # Examine the structure
        print(f"\n📋 Response Structure:")
        print(f"   Type: {data.get('type', 'Unknown')}")
        
        if 'data' in data:
            data_section = data['data']
            print(f"   Data keys: {list(data_section.keys())}")
            
            # Check if we have rows
            if 'rows' in data_section:
                rows = data_section['rows']
                print(f"   Number of repositories: {len(rows)}")
                
                # Show column information
                if 'columns' in data_section:
                    columns = data_section['columns']
                    print(f"   Number of columns: {len(columns)}")
                    print(f"   Columns: {[col['col'] for col in columns]}")
        
        # Show result metadata
        if 'result' in data:
            result = data['result']
            print(f"\n📊 Query Result:")
            print(f"   Code: {result.get('code', 'N/A')}")
            print(f"   Message: {result.get('message', 'N/A')}")
            print(f"   Latency: {result.get('latency', 'N/A')}")
            print(f"   Row Count: {result.get('row_count', 'N/A')}")
        
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing failed: {e}")
        data = None
else:
    print("❌ No response to parse")
    data = None


✅ JSON parsing successful!

📋 Response Structure:
   Type: sql_endpoint
   Data keys: ['columns', 'rows', 'result']
   Number of repositories: 100
   Number of columns: 11
   Columns: ['repo_id', 'repo_name', 'primary_language', 'description', 'stars', 'forks', 'pull_requests', 'pushes', 'total_score', 'contributor_logins', 'collection_names']


In [5]:
# Convert to DataFrame for easier analysis
if data and 'data' in data and 'rows' in data['data']:
    rows = data['data']['rows']
    
    # Create DataFrame
    df = pd.DataFrame(rows)
    
    print(f"📊 DataFrame created with {len(df)} rows and {len(df.columns)} columns")
    print(f"📋 Columns: {list(df.columns)}")
    
    # Show first few rows
    print(f"\n🔍 First 5 repositories:")
    print(df.head())
    
else:
    print("❌ No data available to create DataFrame")
    df = None


📊 DataFrame created with 100 rows and 11 columns
📋 Columns: ['repo_id', 'repo_name', 'primary_language', 'description', 'stars', 'forks', 'pull_requests', 'pushes', 'total_score', 'contributor_logins', 'collection_names']

🔍 First 5 repositories:
      repo_id            repo_name primary_language  \
0  1042367133      github/spec-kit           Python   
1   997220241   HKUDS/RAG-Anything           Python   
2  1006414368  OpenCut-app/OpenCut       TypeScript   
3   994093166     basecamp/omarchy            Shell   
4  1062234789    JerryZLiu/Dayflow            Swift   

                                                      description stars forks  \
0  💫 Toolkit to help you get started with Spec-Driven Development   277    25   
1                        "RAG-Anything: All-in-One RAG Framework"   246    14   
2                              The open-source CapCut alternative   271    28   
3                                 Opinionated Arch/Hyprland Setup   243    13   
4                

In [6]:
# Analyze the data quality and content
if df is not None:
    print("📈 Data Analysis:")
    print(f"   Total repositories: {len(df)}")
    
    # Check for missing values
    print(f"\n🔍 Missing Values:")
    missing_values = df.isnull().sum()
    for col, missing in missing_values.items():
        if missing > 0:
            print(f"   {col}: {missing} ({missing/len(df)*100:.1f}%)")
    
    # Analyze primary languages
    if 'primary_language' in df.columns:
        print(f"\n🌐 Programming Languages Distribution:")
        lang_counts = df['primary_language'].value_counts().head(10)
        for lang, count in lang_counts.items():
            print(f"   {lang}: {count} ({count/len(df)*100:.1f}%)")
    
    # Analyze stars distribution
    if 'stars' in df.columns:
        df['stars_numeric'] = pd.to_numeric(df['stars'], errors='coerce')
        print(f"\n⭐ Stars Statistics:")
        print(f"   Min: {df['stars_numeric'].min()}")
        print(f"   Max: {df['stars_numeric'].max()}")
        print(f"   Mean: {df['stars_numeric'].mean():.1f}")
        print(f"   Median: {df['stars_numeric'].median():.1f}")
    
    # Show top repositories by stars
    if 'stars_numeric' in df.columns:
        print(f"\n🏆 Top 10 Repositories by Stars:")
        top_repos = df.nlargest(10, 'stars_numeric')[['repo_name', 'primary_language', 'stars', 'description']]
        for idx, row in top_repos.iterrows():
            desc = row['description'][:60] + "..." if len(str(row['description'])) > 60 else row['description']
            print(f"   {row['repo_name']} ({row['primary_language']}) - {row['stars']} stars")
            print(f"      {desc}")
            print()


📈 Data Analysis:
   Total repositories: 100

🔍 Missing Values:

🌐 Programming Languages Distribution:
   Python: 31 (31.0%)
   TypeScript: 26 (26.0%)
   Jupyter Notebook: 8 (8.0%)
   JavaScript: 7 (7.0%)
   : 5 (5.0%)
   Swift: 4 (4.0%)
   Rust: 3 (3.0%)
   HTML: 3 (3.0%)
   Shell: 2 (2.0%)
   PHP: 2 (2.0%)

⭐ Stars Statistics:
   Min: 6
   Max: 277
   Mean: 70.5
   Median: 46.5

🏆 Top 10 Repositories by Stars:
   github/spec-kit (Python) - 277 stars
      💫 Toolkit to help you get started with Spec-Driven Developme...

   OpenCut-app/OpenCut (TypeScript) - 271 stars
      The open-source CapCut alternative

   HKUDS/RAG-Anything (Python) - 246 stars
      "RAG-Anything: All-in-One RAG Framework"

   basecamp/omarchy (Shell) - 243 stars
      Opinionated Arch/Hyprland Setup

   JerryZLiu/Dayflow (Swift) - 219 stars
      Generate a timeline of your day, automatically

   imputnet/helium (Python) - 171 stars
      Private, fast, and honest web browser

   apple/ml-simplefold (Python) - 

In [7]:
# Save the raw data for further analysis
if data:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"../../data/github_trending_{timestamp}.json"
    
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"💾 Raw data saved to: {filename}")
    except Exception as e:
        print(f"❌ Failed to save data: {e}")

# Save DataFrame as CSV for easy analysis
if df is not None:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_filename = f"../../data/github_trending_{timestamp}.csv"
    
    try:
        df.to_csv(csv_filename, index=False, encoding='utf-8')
        print(f"💾 DataFrame saved to: {csv_filename}")
    except Exception as e:
        print(f"❌ Failed to save CSV: {e}")


❌ Failed to save data: [Errno 2] No such file or directory: '../../data/github_trending_20250927_203439.json'
❌ Failed to save CSV: Cannot save file into a non-existent directory: '../../data'


## Summary and Next Steps

### What We've Discovered:
- ✅ Successfully connected to OSS Insight API
- ✅ Retrieved trending repositories from past 24 hours
- ✅ Analyzed data structure and quality
- ✅ Saved data for further analysis

### Potential Integration with Newsletter:
1. **Content Source**: Trending GitHub projects could be a valuable addition to tech newsletters
2. **Data Quality**: High-quality data with stars, descriptions, and metadata
3. **Real-time**: Past 24 hours data provides fresh, relevant content
4. **Diverse Languages**: Covers projects in multiple programming languages

### Next Research Areas:
- Filter for tech-relevant projects (AI, web development, etc.)
- Analyze project descriptions for quality scoring
- Explore integration with existing pipeline
- Test different time periods and language filters
