# Advanced Visualizations for GitHub Language Analysis

This notebook focuses on creating advanced, interactive visualizations to better understand programming language relationships and patterns in our GitHub data.

## Table of Contents
1. [Setup and Data Loading](#Setup-and-Data-Loading)
2. [Hierarchical Language Relationships](#Hierarchical-Language-Relationships)
3. [Interactive Language Explorer](#Interactive-Language-Explorer)
4. [Repository Success Patterns](#Repository-Success-Patterns)
5. [Temporal Analysis Visualizations](#Temporal-Analysis-Visualizations)
6. [Export and Integration](#Export-and-Integration)

## Goals
- Create interactive and dynamic visualizations
- Explore hierarchical relationships between languages
- Visualize temporal patterns in language usage
- Generate publication-ready visualizations
- Export visualizations for web integration

## Setup and Data Loading

First, let's import the necessary libraries and load our prepared data.

In [7]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Set default theme for plotly
pio.templates.default = "plotly_white"

# Read the prepared data
df = pd.read_csv('../data/raw/repositories_enriched.csv')

print("Dataset Info:")
print(f"Number of repositories: {len(df)}")
print(f"Number of languages: {df['language'].nunique()}")
print("\nColumns available:")
print(df.columns.tolist())

Dataset Info:
Number of repositories: 1200
Number of languages: 12

Columns available:
['id', 'name', 'full_name', 'owner', 'description', 'language', 'created_at', 'updated_at', 'pushed_at', 'stars', 'forks', 'watchers', 'open_issues', 'size_kb', 'license', 'has_wiki', 'has_pages', 'contributors_count', 'commits_30d', 'commits_90d', 'commits_365d', 'has_readme', 'has_license', 'has_contributing', 'has_code_of_conduct', 'url', 'stars_normalized', 'forks_normalized', 'watchers_normalized', 'popularity_score', 'commits_30d_normalized', 'contributors_normalized', 'days_since_push', 'recency_score', 'activity_score', 'health_score', 'overall_score', 'stars_per_contributor', 'forks_per_contributor', 'engagement_per_contributor', 'engagement_density', 'recent_commit_share', 'quarter_commit_share', 'issue_to_commit_ratio', 'freshness_index', 'support_load', 'compliance_score', 'enterprise_ready', 'maturity_score', 'growth_signal', 'growth_segment', 'compliance_tier']
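
The cells that follow index columns such as `language`, `stars`, `forks`, `commits_30d`, `contributors_count` and `overall_score` directly. A small guard can make a schema mismatch fail early with a readable message. This is a minimal sketch: the helper name `missing_columns` and the `REQUIRED` list are assumptions for illustration, not part of the original notebook.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    present = set(available)
    return [col for col in required if col not in present]


# Columns referenced by the visualization cells in this notebook (assumed list)
REQUIRED = [
    'id', 'name', 'full_name', 'language', 'stars', 'forks', 'watchers',
    'open_issues', 'commits_30d', 'contributors_count', 'overall_score',
]

# Stand-in column list for demonstration; in the notebook, pass df.columns
example_columns = ['id', 'name', 'language', 'stars']
absent = missing_columns(example_columns, REQUIRED)
if absent:
    print(f"Missing columns: {absent}")
```

In the notebook itself the call would be `missing_columns(df.columns, REQUIRED)`, placed right after `pd.read_csv`.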


## Hierarchical Language Relationships

We'll create several hierarchical visualizations to explore relationships between languages:
1. Sunburst chart showing language categories and metrics
2. Treemap visualization of repository distributions
3. Hierarchical clustering dendrogram based on language similarities

In [8]:
# Create hierarchical category structure for languages
def create_language_hierarchy(df):
    # Calculate metrics for each language
    hierarchy_data = df.groupby('language').agg({
        'stars': 'sum',
        'forks': 'sum',
        'watchers': 'sum',
        'id': 'count'  # Count of repositories
    }).reset_index()
    
    # Categorize languages based on paradigms/features
    language_categories = {
        'Systems Programming': ['C++', 'Rust', 'Go'],
        'Web Development': ['JavaScript', 'TypeScript'],
        'Enterprise': ['Java', 'C#'],
        'Mobile & Modern': ['Kotlin', 'Swift'],
        'Scripting': ['Python', 'Ruby', 'PHP']
    }
    
    # Create data for sunburst chart
    sunburst_data = []
    for category, langs in language_categories.items():
        # Add category level (sum only the numeric columns, not the language names)
        category_metrics = hierarchy_data.loc[
            hierarchy_data['language'].isin(langs), ['id', 'stars', 'forks']
        ].sum()
        sunburst_data.append({
            'id': category,
            'parent': '',
            'value': category_metrics['id'],
            'stars': category_metrics['stars'],
            'forks': category_metrics['forks']
        })
        
        # Add language level
        for lang in langs:
            lang_data = hierarchy_data[hierarchy_data['language'] == lang]
            if len(lang_data) > 0:
                lang_metrics = lang_data.iloc[0]
                sunburst_data.append({
                    'id': lang,
                    'parent': category,
                    'value': lang_metrics['id'],
                    'stars': lang_metrics['stars'],
                    'forks': lang_metrics['forks']
                })
    
    return pd.DataFrame(sunburst_data)

# Create and display sunburst chart
hierarchy_df = create_language_hierarchy(df)

fig = go.Figure(go.Sunburst(
    ids=hierarchy_df['id'],
    labels=hierarchy_df['id'],  # text shown on each sector and used by %{label}; reuse the ids
    parents=hierarchy_df['parent'],
    values=hierarchy_df['value'],
    branchvalues='total',
    hovertemplate='<b>%{label}</b><br>' +
                  'Repositories: %{value}<br>' +
                  'Stars: %{customdata[0]:,.0f}<br>' +
                  'Forks: %{customdata[1]:,.0f}<extra></extra>',
    customdata=hierarchy_df[['stars', 'forks']].values
))

fig.update_layout(
    title={
        'text': 'Programming Language Hierarchy by Repository Count',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    width=800,
    height=800
)

# Save the figure to HTML for interactivity
sunburst_path = "../public/visualizations/language_hierarchy_sunburst.html"
fig.write_html(sunburst_path)
print(f"✅ Saved sunburst to {sunburst_path}")
fig.show()

✅ Saved sunburst to ../public/visualizations/language_hierarchy_sunburst.html
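
With `branchvalues='total'`, Plotly treats each parent's value as the total for its whole branch, so the children of a node must not sum to more than the node itself. In `create_language_hierarchy` this holds by construction, because each category value is the sum of its languages, but the check is cheap if the hierarchy is ever built differently. A minimal sketch, with a hypothetical helper name `inconsistent_parents` operating on plain records:

```python
from collections import defaultdict


def inconsistent_parents(records, tol=1e-9):
    """Return {parent_id: (parent_value, children_sum)} for branches whose
    children sum to more than the parent, or whose parent node is missing."""
    values = {r['id']: r['value'] for r in records}
    child_sums = defaultdict(float)
    for r in records:
        if r['parent']:  # an empty string marks a root node
            child_sums[r['parent']] += r['value']
    return {
        parent: (values.get(parent), total)
        for parent, total in child_sums.items()
        if parent not in values or total > values[parent] + tol
    }


# Toy hierarchy with the same 'id' / 'parent' / 'value' keys as hierarchy_df
toy = [
    {'id': 'Web Development', 'parent': '', 'value': 3},
    {'id': 'JavaScript', 'parent': 'Web Development', 'value': 2},
    {'id': 'TypeScript', 'parent': 'Web Development', 'value': 1},
]
problems = inconsistent_parents(toy)
```

For the real data the call would be `inconsistent_parents(hierarchy_df.to_dict('records'))`; an empty dict means every branch is consistent.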


In [9]:
# Create treemap visualization with categories and individual repositories
def create_repository_treemap(df):
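    # Map each language to a category. Note that this grouping differs from the
    # sunburst above: PHP gets its own 'Web Backend' category and 'Scripting'
    # becomes 'Scripting & AI'.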
    language_categories = {
        'Rust': 'Systems Programming',
        'C++': 'Systems Programming',
        'Go': 'Systems Programming',
        'TypeScript': 'Web Development',
        'JavaScript': 'Web Development',
        'Java': 'Enterprise',
        'C#': 'Enterprise',
        'Swift': 'Mobile & Modern',
        'Kotlin': 'Mobile & Modern',
        'Python': 'Scripting & AI',
        'Ruby': 'Scripting & AI',
        'PHP': 'Web Backend'
    }
    
    treemap_data = df.copy()
    treemap_data['category'] = treemap_data['language'].map(language_categories)
    treemap_data['repo_display'] = treemap_data['name']
    
    fig = px.treemap(
        treemap_data,
        path=['category', 'language', 'repo_display'],
        values='stars',
        color='overall_score',
        color_continuous_scale='Viridis',
        custom_data=['full_name', 'stars', 'forks', 'commits_30d', 'contributors_count']
    )
    
    fig.update_traces(
        hovertemplate='<b>%{label}</b><br>' +
        'Stars: %{customdata[1]:,.0f}<br>' +
        'Forks: %{customdata[2]:,.0f}<br>' +
        'Monthly Commits: %{customdata[3]:.0f}<br>' +
        'Contributors: %{customdata[4]:,.0f}<extra></extra>',
        marker=dict(line=dict(width=1, color='#1a1a1a'))
    )
    
    fig.update_layout(
        title={
            'text': 'Repository Distribution: Drill-down by Category > Language > Repository',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(color='white', size=20)
        },
        width=1200,
        height=700,
        paper_bgcolor='black',
        plot_bgcolor='black',
        font=dict(color='white')
    )
    
    treemap_path = "../public/visualizations/treemap_top_repos.html"
    fig.write_html(treemap_path)
    print(f"✅ Saved treemap to {treemap_path}")
    return fig

# Create and display treemap
treemap_fig = create_repository_treemap(df)
treemap_fig.show()

✅ Saved treemap to ../public/visualizations/treemap_top_repos.html
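
The introduction to this section also lists a hierarchical clustering dendrogram based on language similarities. A minimal sketch of one way to build it follows, under several assumptions that are not in the original notebook: SciPy is installed, languages are compared on their mean `stars`, `forks` and `contributors_count`, the counts are log-transformed and z-scored, and Ward linkage is used. The function name `language_dendrogram` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage


def language_dendrogram(df, metrics=('stars', 'forks', 'contributors_count')):
    """Cluster languages on their average metrics.

    Returns the SciPy linkage matrix and the language names in leaf order.
    """
    # One row per language, one column per averaged metric
    profile = df.groupby('language')[list(metrics)].mean()

    # log1p tames heavy-tailed counts; z-scoring puts metrics on a common scale
    scaled = np.log1p(profile)
    std = scaled.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    scaled = (scaled - scaled.mean()) / std

    # Ward linkage on Euclidean distances between language profiles
    Z = linkage(scaled.to_numpy(), method='ward')

    # no_plot=True computes the leaf ordering without drawing anything
    tree = dendrogram(Z, labels=profile.index.tolist(), no_plot=True)
    return Z, tree['ivl']
```

Calling `Z, order = language_dendrogram(df)` on the repository data gives a linkage matrix that can be passed to `scipy.cluster.hierarchy.dendrogram` again, without `no_plot`, to draw the tree with matplotlib.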


## Interactive Language Explorer

In this section, we'll build interactive views for exploring how languages and repositories compare across several metrics at once:
1. Parallel coordinates plot for multi-dimensional analysis
2. 3D scatter plot for exploring relationships between key metrics
3. Interactive gauge charts for comparing language performance

In [10]:
# Create parallel coordinates plot for top repositories
def create_parallel_coordinates(df, top_n=100):
    metrics = ['stars', 'forks', 'watchers', 'open_issues', 'commits_30d', 'contributors_count']
    top_repos = df.nlargest(top_n, 'stars')
    
    fig = go.Figure(data=
        go.Parcoords(
            line=dict(
                color=top_repos['stars'],
                colorscale='Viridis',
                showscale=True,
                cmin=top_repos['stars'].min(),
                cmax=top_repos['stars'].max()
            ),
            dimensions=[
                dict(range=[top_repos[col].min(), top_repos[col].max()],
                     label=col.replace('_', ' ').title(),
                     values=top_repos[col])
                for col in metrics
            ]
        )
    )
    
    fig.update_layout(
        title={
            'text': f'Multi-dimensional Analysis of Top {top_n} Repositories',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        width=1000,
        height=600
    )
    
    # Derive the filename from top_n so it matches the number of repositories plotted
    parallel_path = f"../public/visualizations/parallel_coordinates_top{top_n}.html"
    fig.write_html(parallel_path)
    print(f"✅ Saved parallel coordinates to {parallel_path}")
    return fig

parallel_fig = create_parallel_coordinates(df)
parallel_fig.show()

✅ Saved parallel coordinates to ../public/visualizations/parallel_coordinates_top100.html
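
Metrics such as stars and forks are usually heavy-tailed, so on linear axes a few very large repositories can compress the rest of the lines toward the bottom of each axis. One common workaround is to plot log-transformed values and relabel the ticks with the original magnitudes. The sketch below builds a single dimension dict in the shape that `go.Parcoords` expects; the helper name `log_dimension` is an assumption, and values below 1 are clamped to 1 so that zero counts sit on the bottom tick.

```python
import math


def log_dimension(label, values):
    """Build a Parcoords dimension dict on a log10 scale with readable ticks."""
    # Clamp to 1 so zero counts map to log10(1) = 0 instead of failing
    logged = [math.log10(max(v, 1)) for v in values]
    top = max(1, math.ceil(max(logged)))
    tickvals = list(range(0, top + 1))
    # Tick labels show the original magnitudes: 1, 10, 100, 1,000, ...
    ticktext = [f"{10 ** t:,}" for t in tickvals]
    return dict(
        label=label,
        values=logged,
        range=[0, top],
        tickvals=tickvals,
        ticktext=ticktext,
    )


dim = log_dimension('Stars', [0, 10, 1000])
```

Inside `create_parallel_coordinates`, the list comprehension over `metrics` could call `log_dimension(col.replace('_', ' ').title(), top_repos[col])` for the columns that need it.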


In [11]:
# Create 3D scatter plot
def create_3d_language_analysis(df):
    language_metrics = df.groupby('language').agg({
        'stars': 'mean',
        'forks': 'mean',
        'contributors_count': 'mean',
        'id': 'count'
    }).reset_index()
    
    language_metrics = language_metrics.rename(columns={'id': 'repository_count'})
    
    fig = go.Figure(data=[
        go.Scatter3d(
            x=language_metrics['stars'],
            y=language_metrics['forks'],
            z=language_metrics['contributors_count'],
            text=language_metrics['language'],
            mode='markers+text',
            marker=dict(
                size=language_metrics['repository_count'] / 10,
                color=language_metrics['repository_count'],
                colorscale='Viridis',
                colorbar=dict(title='Repository Count'),
                opacity=0.8
            ),
            hovertemplate=
            '<b>%{text}</b><br>' +
            'Avg Stars: %{x:.0f}<br>' +
            'Avg Forks: %{y:.0f}<br>' +
            'Avg Contributors: %{z:.1f}<br>' +
            '<extra></extra>'
        )
    ])
    
    fig.update_layout(
        title={
            'text': '3D Language Analysis: Stars, Forks, and Contributors',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        scene=dict(
            xaxis_title='Average Stars',
            yaxis_title='Average Forks',
            zaxis_title='Average Contributors'
        ),
        width=1000,
        height=800
    )
    
    scatter_path = "../public/visualizations/3d_language_analysis.html"
    fig.write_html(scatter_path)
    print(f"✅ Saved 3D scatter to {scatter_path}")
    return fig

scatter_3d_fig = create_3d_language_analysis(df)
scatter_3d_fig.show()

✅ Saved 3D scatter to ../public/visualizations/3d_language_analysis.html
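
The marker size above is `repository_count / 10`, which ties the on-screen size to the absolute size of the dataset. An alternative is to pass the raw counts and let Plotly scale them through `sizeref`, using the scaling formula given in Plotly's bubble-chart documentation for `sizemode='area'`. A minimal sketch; the helper name `area_sizeref` and the 40-pixel target are assumptions:

```python
def area_sizeref(sizes, max_marker_px=40):
    """Return a sizeref so the largest marker is about max_marker_px wide
    when markers are drawn with sizemode='area'."""
    return 2.0 * max(sizes) / (max_marker_px ** 2)


# Example: repository counts per language (stand-in values)
counts = [100, 50, 25]
marker = dict(
    size=counts,
    sizemode='area',
    sizeref=area_sizeref(counts),
    sizemin=4,  # keep the smallest languages visible
)
```

In `create_3d_language_analysis`, these keys would be merged into the existing `marker=dict(...)` alongside `color`, `colorscale` and `opacity`.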


In [12]:
# Create gauge charts for top languages
def create_gauge_charts(df, metric='stars', n_languages=4):
    language_metrics = df.groupby('language')[metric].mean().sort_values(ascending=False)
    top_languages = language_metrics.head(n_languages)
    
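    # The 2x2 grid holds at most four gauges; the zip below drops any extra languages
    fig = make_subplots(
        rows=2, cols=2,
        specs=[[{'type': 'indicator'}, {'type': 'indicator'}],
               [{'type': 'indicator'}, {'type': 'indicator'}]],
        subplot_titles=list(top_languages.index)  # plain list of language names
    )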
    
    max_value = language_metrics.max()
    
    positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
    for (language, value), (row, col) in zip(top_languages.items(), positions):
        fig.add_trace(
            go.Indicator(
                mode="gauge+number",
                value=value,
                title={'text': language},
                gauge={
                    'axis': {'range': [None, max_value]},
                    'steps': [
                        {'range': [0, max_value/3], 'color': "lightgray"},
                        {'range': [max_value/3, max_value*2/3], 'color': "gray"},
                        {'range': [max_value*2/3, max_value], 'color': "darkgray"}
                    ],
                    'bar': {'color': "darkblue"}
                }
            ),
            row=row, col=col
        )
    
    fig.update_layout(
        title={
            'text': f'Top {n_languages} Languages by Average {metric.replace("_", " ").title()}',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        width=1000,
        height=800
    )
    
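    # Note: the filename does not depend on `metric`, so repeated calls with
    # different metrics overwrite the same file
    gauge_path = f"../public/visualizations/gauge_charts_top{n_languages}.html"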
    fig.write_html(gauge_path)
    print(f"✅ Saved gauge charts to {gauge_path}")
    return fig

gauge_fig = create_gauge_charts(df, metric='stars')
gauge_fig.show()

for metric in ['forks', 'contributors_count', 'commits_30d']:
    create_gauge_charts(df, metric=metric)

✅ Saved gauge charts to ../public/visualizations/gauge_charts_top4.html


✅ Saved gauge charts to ../public/visualizations/gauge_charts_top4.html
✅ Saved gauge charts to ../public/visualizations/gauge_charts_top4.html
✅ Saved gauge charts to ../public/visualizations/gauge_charts_top4.html
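
All four calls above report the same output path, so only the last figure written (the `commits_30d` gauges) remains on disk. If one file per metric is wanted, the path can include the metric name. A minimal sketch with a hypothetical helper `gauge_output_path`; the naming scheme is an assumption, and any page that links to `gauge_charts_top4.html` would need updating to match:

```python
def gauge_output_path(metric, n_languages=4, out_dir="../public/visualizations"):
    """Build a distinct HTML path for each metric's gauge chart."""
    return f"{out_dir}/gauge_charts_top{n_languages}_{metric}.html"


# One path per metric, so no call overwrites another
metrics = ['stars', 'forks', 'contributors_count', 'commits_30d']
paths = [gauge_output_path(m) for m in metrics]
for path in paths:
    print(path)
```

Inside `create_gauge_charts`, the result would replace the fixed `gauge_path` string before the call to `fig.write_html`.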
