# Flight Route Analysis - OpenFlights Dataset

This notebook processes OpenFlights data for flight network analysis, including:
- Data loading and cleaning
- Route distance calculation
- Network graph building
- Hub analysis and shortest path finding
- Gephi visualization export

**Data Source**: [OpenFlights](https://openflights.org/data.html)

## 1. Setup and Data Loading

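Load the raw `.dat` files from the OpenFlights GitHub repository. The files are comma-separated with no header row, so column names are supplied explicitly.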

**Data Schemas:**
- **airports.dat**: Airport ID, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude, Timezone, DST, Tz database time zone, Type, Source
- **airlines.dat**: Airline ID, Name, Alias, IATA, ICAO, Callsign, Country, Active
- **routes.dat**: Airline, Airline ID, Source airport, Source airport ID, Destination airport, Destination airport ID, Codeshare, Stops, Equipment
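
OpenFlights marks missing values with the literal string `\N`. The notebook loads the files as-is and replaces `\N` during cleaning; as an alternative, the marker can be declared at read time. The following is a minimal sketch, assuming two illustrative rows in the `airlines.dat` column order (the `sample` string is not the real file):

```python
from io import StringIO

import pandas as pd

# Two illustrative rows in the airlines.dat column order (not the real file).
sample = (
    "2,135 Airways,\\N,,GNL,GENERAL,United States,N\n"
    "3,1Time Airline,\\N,1T,RNX,NEXTIME,South Africa,Y\n"
)
columns = ["airline_id", "name", "alias", "iata", "icao",
           "callsign", "country", "active"]

# na_values adds "\N" to the markers pandas already treats as missing.
df = pd.read_csv(StringIO(sample), header=None, names=columns, na_values=["\\N"])

print(df[["name", "alias", "iata"]])
```

With this option the `alias` column is read as missing directly, and empty fields such as the first row's `iata` are missing by default.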

In [1]:
# Install dependencies if needed
# %pip install -q pandas requests networkx

import pandas as pd
from io import StringIO
import requests
import networkx as nx
import math
import os
from pathlib import Path
from typing import Tuple, Dict, Any

# OpenFlights data URLs
AIRPORTS_URL = "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"
AIRLINES_URL = "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat"
ROUTES_URL = "https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat"

# Data schemas
airport_columns = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude", "timezone", "dst",
    "tz_database_time_zone", "type", "source"
]

airline_columns = [
    "airline_id", "name", "alias", "iata", "icao", "callsign",
    "country", "active"
]

route_columns = [
    "airline", "airline_id", "source_airport", "source_airport_id",
    "destination_airport", "destination_airport_id", "codeshare",
    "stops", "equipment"
]


def fetch_csv(url: str) -> str:
    """Fetch CSV data from URL."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def load_openflights_table(url: str, columns: list[str]) -> pd.DataFrame:
    """Load OpenFlights table from URL."""
    raw_text = fetch_csv(url)
    df = pd.read_csv(StringIO(raw_text), header=None, names=columns)
    return df


# Load data
print("Loading data from OpenFlights...")
airports_df = load_openflights_table(AIRPORTS_URL, airport_columns)
airlines_df = load_openflights_table(AIRLINES_URL, airline_columns)
routes_df = load_openflights_table(ROUTES_URL, route_columns)

# Basic type conversions; errors="coerce" turns unparseable values (e.g. "\N") into NaN
for df, cols in [
    (airports_df, ["airport_id", "altitude", "timezone", "latitude", "longitude"]),
    (airlines_df, ["airline_id"]),
    (routes_df, ["stops"]),
]:
    for col in cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")

print(f"Loaded: {len(airports_df)} airports, {len(airlines_df)} airlines, {len(routes_df)} routes")


Loading data from OpenFlights...
Loaded: 7698 airports, 6162 airlines, 67663 routes


### 1.1 Verify Loaded Data

Preview the first rows of each table to confirm that the columns line up with the schemas above.

In [2]:
# Preview airports data
print("Airports DataFrame:")
airports_df.head()


Airports DataFrame:


| | airport_id | name | city | country | iata | icao | latitude | longitude | altitude | timezone | dst | tz_database_time_zone | type | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Goroka Airport | Goroka | Papua New Guinea | GKA | AYGA | -6.08169 | 145.391998 | 5282 | 10.0 | U | Pacific/Port_Moresby | airport | OurAirports |
| 1 | 2 | Madang Airport | Madang | Papua New Guinea | MAG | AYMD | -5.20708 | 145.789001 | 20 | 10.0 | U | Pacific/Port_Moresby | airport | OurAirports |
| 2 | 3 | Mount Hagen Kagamuga Airport | Mount Hagen | Papua New Guinea | HGU | AYMH | -5.82679 | 144.296005 | 5388 | 10.0 | U | Pacific/Port_Moresby | airport | OurAirports |
| 3 | 4 | Nadzab Airport | Nadzab | Papua New Guinea | LAE | AYNZ | -6.569803 | 146.725977 | 239 | 10.0 | U | Pacific/Port_Moresby | airport | OurAirports |
| 4 | 5 | Port Moresby Jacksons International Airport | Port Moresby | Papua New Guinea | POM | AYPY | -9.44338 | 147.220001 | 146 | 10.0 | U | Pacific/Port_Moresby | airport | OurAirports |


In [3]:
# Preview airlines data
print("Airlines DataFrame:")
airlines_df.head()


Airlines DataFrame:


| | airline_id | name | alias | iata | icao | callsign | country | active |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | Unknown | \N | - | | \N | \N | Y |
| 1 | 1 | Private flight | \N | - | | | | Y |
| 2 | 2 | 135 Airways | \N | | GNL | GENERAL | United States | N |
| 3 | 3 | 1Time Airline | \N | 1T | RNX | NEXTIME | South Africa | Y |
| 4 | 4 | 2 Sqn No 1 Elementary Flying Training School | \N | | WYT | | United Kingdom | N |


In [4]:
# Preview routes data
print("Routes DataFrame:")
routes_df.head()


Routes DataFrame:


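| | airline | airline_id | source_airport | source_airport_id | destination_airport | destination_airport_id | codeshare | stops | equipment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2B | 410 | AER | 2965 | KZN | 2990 | | 0 | CR2 |
| 1 | 2B | 410 | ASF | 2966 | KZN | 2990 | | 0 | CR2 |
| 2 | 2B | 410 | ASF | 2966 | MRV | 2962 | | 0 | CR2 |
| 3 | 2B | 410 | CEK | 2968 | KZN | 2990 | | 0 | CR2 |
| 4 | 2B | 410 | CEK | 2968 | OVB | 4078 | | 0 | CR2 |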


## 2. Data Cleaning

Clean and validate the loaded data to ensure quality for analysis.

### 2.1 Airports Cleaning

Remove invalid entries, standardize formats, and validate coordinates.
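
The coordinate check can be illustrated on a tiny frame. This is a sketch on hypothetical rows (the `airports` frame below is made up); it uses `Series.between` as a compact equivalent of the explicit `>=` / `<=` comparisons in the cell that follows:

```python
import pandas as pd

# Hypothetical rows: one valid, one out-of-range latitude, one missing longitude.
airports = pd.DataFrame({
    "airport_id": [1, 2, 3],
    "latitude": [-6.08, 123.4, 10.0],
    "longitude": [145.39, 20.0, None],
})

# between() is inclusive on both ends and evaluates to False for NaN,
# so missing and out-of-range coordinates are rejected by the same mask.
valid = airports["latitude"].between(-90, 90) & airports["longitude"].between(-180, 180)
checked = airports[valid]

print(checked["airport_id"].tolist())
```

Only the first row survives: the second has a latitude outside [-90, 90] and the third has no longitude.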

In [5]:
# Clean airports data
airports_cleaned = airports_df.copy()

# Replace invalid values
airports_cleaned = airports_cleaned.replace({
    '\\N': pd.NA,
    'nan': pd.NA,
    'NaN': pd.NA,
    '': pd.NA,
    'Unknown': pd.NA,
    'unknown': pd.NA
})

# Clean IATA and ICAO codes
airports_cleaned['iata'] = airports_cleaned['iata'].replace(['-', 'nan', 'NaN'], pd.NA)
airports_cleaned['icao'] = airports_cleaned['icao'].replace(['-', 'nan', 'NaN'], pd.NA)

# Validate coordinates
airports_cleaned = airports_cleaned.dropna(subset=['latitude', 'longitude'])
airports_cleaned = airports_cleaned[
    (airports_cleaned['latitude'] >= -90) & (airports_cleaned['latitude'] <= 90) &
    (airports_cleaned['longitude'] >= -180) & (airports_cleaned['longitude'] <= 180)
]

# Clean other columns
airports_cleaned['altitude'] = pd.to_numeric(airports_cleaned['altitude'], errors='coerce').fillna(0)
airports_cleaned['timezone'] = pd.to_numeric(airports_cleaned['timezone'], errors='coerce')
dst_mapping = {'E': 'E', 'A': 'A', 'S': 'S', 'O': 'O', 'Z': 'Z', 'N': 'N', 'U': 'U'}
airports_cleaned['dst'] = airports_cleaned['dst'].map(dst_mapping).fillna('U')
airports_cleaned['type'] = airports_cleaned['type'].fillna('airport')
airports_cleaned['source'] = airports_cleaned['source'].fillna('Unknown')
airports_cleaned = airports_cleaned.drop_duplicates(subset=['airport_id'], keep='first')

# Clean string columns
string_columns = ['name', 'city', 'country', 'iata', 'icao', 'tz_database_time_zone', 'type', 'source']
for col in string_columns:
    if col in airports_cleaned.columns:
        airports_cleaned[col] = airports_cleaned[col].astype(str).str.strip()
        # astype(str) can turn missing values into the literal strings 'nan' or '<NA>'
        airports_cleaned[col] = airports_cleaned[col].replace(['nan', '<NA>'], pd.NA)

print(f"Airports cleaned: {len(airports_cleaned)} rows (from {len(airports_df)})")


Airports cleaned: 7698 rows (from 7698)


### 2.2 Airlines Cleaning

Drop airlines with a non-positive ID or with neither an IATA nor an ICAO code, and normalize placeholder values.

In [6]:
# Clean airlines data
airlines_cleaned = airlines_df.copy()

# Replace invalid values
airlines_cleaned = airlines_cleaned.replace({
    '\\N': pd.NA,
    'nan': pd.NA,
    'NaN': pd.NA,
    '': pd.NA,
    'Unknown': pd.NA,
    'unknown': pd.NA,
    '-': pd.NA
})

# Clean airline_id - remove invalid IDs
airlines_cleaned['airline_id'] = pd.to_numeric(airlines_cleaned['airline_id'], errors='coerce')
airlines_cleaned = airlines_cleaned[airlines_cleaned['airline_id'] > 0]

# Clean IATA and ICAO codes
airlines_cleaned['iata'] = airlines_cleaned['iata'].replace(['-', 'nan', 'NaN'], pd.NA)
airlines_cleaned['icao'] = airlines_cleaned['icao'].replace(['-', 'nan', 'NaN'], pd.NA)

# Remove airlines without any valid codes
airlines_cleaned = airlines_cleaned[
    ~(airlines_cleaned['iata'].isna() & airlines_cleaned['icao'].isna())
]

# Clean other columns
airlines_cleaned['callsign'] = airlines_cleaned['callsign'].replace(['-', 'nan', 'NaN'], pd.NA)
airlines_cleaned['active'] = airlines_cleaned['active'].replace(['\\N', 'nan', 'NaN'], 'N').fillna('N')
airlines_cleaned['country'] = airlines_cleaned['country'].fillna('Unknown')
airlines_cleaned = airlines_cleaned.drop_duplicates(subset=['airline_id'], keep='first')

# Clean string columns
string_columns = ['name', 'alias', 'iata', 'icao', 'callsign', 'country']
for col in string_columns:
    if col in airlines_cleaned.columns:
        airlines_cleaned[col] = airlines_cleaned[col].astype(str).str.strip()
        # astype(str) can turn missing values into the literal strings 'nan' or '<NA>'
        airlines_cleaned[col] = airlines_cleaned[col].replace(['nan', '<NA>'], pd.NA)

print(f"Airlines cleaned: {len(airlines_cleaned)} rows (from {len(airlines_df)})")


Airlines cleaned: 6159 rows (from 6162)


### 2.3 Routes Cleaning

Validate routes, remove duplicates, and ensure referential integrity.
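
Referential integrity here means that every route must point at airports (and an airline) that survived cleaning. A minimal sketch of the airport half of that check, using hypothetical IDs (airport `999` is deliberately absent from the made-up `airports` table):

```python
import pandas as pd

# Hypothetical data: airport 999 does not exist in the airports table.
airports = pd.DataFrame({"airport_id": [1, 2, 3]})
routes = pd.DataFrame({
    "source_airport_id": [1, 2, 999],
    "destination_airport_id": [2, 3, 1],
})

valid_ids = set(airports["airport_id"])

# A route is kept only if both endpoints resolve to a known airport.
mask = (routes["source_airport_id"].isin(valid_ids)
        & routes["destination_airport_id"].isin(valid_ids))
kept, dropped = routes[mask], routes[~mask]

print(f"kept {len(kept)}, dropped {len(dropped)}")
```

Keeping the rejected rows in `dropped` makes it easy to inspect which references were dangling before discarding them.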

In [7]:
# Clean routes data
routes_cleaned = routes_df.copy()

# Replace invalid values
routes_cleaned = routes_cleaned.replace({
    '\\N': pd.NA,
    'nan': pd.NA,
    'NaN': pd.NA,
    '': pd.NA,
    'Unknown': pd.NA,
    'unknown': pd.NA,
    '-': pd.NA
})

# Clean airline_id - remove invalid IDs
routes_cleaned['airline_id'] = pd.to_numeric(routes_cleaned['airline_id'], errors='coerce')
routes_cleaned = routes_cleaned[routes_cleaned['airline_id'] > 0]

# Clean airport IDs
routes_cleaned['source_airport_id'] = pd.to_numeric(routes_cleaned['source_airport_id'], errors='coerce')
routes_cleaned['destination_airport_id'] = pd.to_numeric(routes_cleaned['destination_airport_id'], errors='coerce')

# Remove routes with missing critical information
routes_cleaned = routes_cleaned.dropna(subset=['source_airport', 'destination_airport'])
routes_cleaned = routes_cleaned.dropna(subset=['source_airport_id', 'destination_airport_id'])

# Clean other columns
routes_cleaned['stops'] = pd.to_numeric(routes_cleaned['stops'], errors='coerce').fillna(0)
routes_cleaned['codeshare'] = routes_cleaned['codeshare'].fillna('N')
routes_cleaned['equipment'] = routes_cleaned['equipment'].fillna('Unknown')

# Remove routes where source and destination are the same
routes_cleaned = routes_cleaned[
    routes_cleaned['source_airport'] != routes_cleaned['destination_airport']
]

# Remove duplicate routes
routes_cleaned = routes_cleaned.drop_duplicates(
    subset=['airline_id', 'source_airport_id', 'destination_airport_id'], 
    keep='first'
)

# Validate references
valid_airline_ids = set(airlines_cleaned['airline_id'].dropna())
valid_airport_ids = set(airports_cleaned['airport_id'].dropna())
routes_cleaned = routes_cleaned[routes_cleaned['airline_id'].isin(valid_airline_ids)]
routes_cleaned = routes_cleaned[routes_cleaned['source_airport_id'].isin(valid_airport_ids)]
routes_cleaned = routes_cleaned[routes_cleaned['destination_airport_id'].isin(valid_airport_ids)]

# Clean string columns
string_columns = ['airline', 'source_airport', 'destination_airport', 'codeshare', 'equipment']
for col in string_columns:
    if col in routes_cleaned.columns:
        routes_cleaned[col] = routes_cleaned[col].astype(str).str.strip()
        # astype(str) can turn missing values into the literal strings 'nan' or '<NA>'
        routes_cleaned[col] = routes_cleaned[col].replace(['nan', '<NA>'], pd.NA)

print(f"Routes cleaned: {len(routes_cleaned)} rows (from {len(routes_df)})")


Routes cleaned: 66315 rows (from 67663)


### 2.4 Export Cleaned Data

Export cleaned datasets to CSV files.

In [8]:
# Export cleaned data
os.makedirs('../data/cleaned', exist_ok=True)

airports_cleaned.to_csv('../data/cleaned/airports_cleaned.csv', index=False, encoding='utf-8')
airlines_cleaned.to_csv('../data/cleaned/airlines_cleaned.csv', index=False, encoding='utf-8')
routes_cleaned.to_csv('../data/cleaned/routes_cleaned.csv', index=False, encoding='utf-8')

print("Cleaned data exported to data/cleaned/")


Cleaned data exported to data/cleaned/


## 3. Distance Calculation

Calculate great-circle distances between airports using the haversine formula.
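
The notebook computes distances one route at a time. For larger inputs, the same formula can be applied to whole arrays at once. The following is a sketch of a vectorized variant, not the notebook's implementation; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same value the notebook uses


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Sanity check: (0, 0) to (0, 90) is a quarter of the equator, i.e. pi/2 * R.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(float(quarter), 1))
```

A known-geometry check like the quarter-equator case is a cheap way to confirm that degrees are converted to radians and that the radius is in the intended unit.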

In [9]:
def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Calculate the great circle distance between two points on Earth."""
    # Convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    # Clamp to 1.0: rounding can push sqrt(a) just above 1 for near-antipodal points
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    
    # Radius of earth in kilometers
    r = 6371
    return c * r


def calculate_route_distances(routes_df, airports_df):
    """Calculate distances for all routes using airport coordinates."""
    # Create lookup dictionaries for coordinates
    airport_coords = {}
    for _, row in airports_df.iterrows():
        airport_coords[row['airport_id']] = (row['latitude'], row['longitude'])
    
    # Calculate distances
    distances = []
    for _, route in routes_df.iterrows():
        source_id = route['source_airport_id']
        dest_id = route['destination_airport_id']
        
        if source_id in airport_coords and dest_id in airport_coords:
            lat1, lon1 = airport_coords[source_id]
            lat2, lon2 = airport_coords[dest_id]
            distance = haversine_distance(lat1, lon1, lat2, lon2)
            distances.append(distance)
        else:
            distances.append(None)
    
    routes_with_distance = routes_df.copy()
    routes_with_distance['distance_km'] = distances
    
    return routes_with_distance


# Calculate distances for all routes
print("Calculating route distances...")
routes_with_distance = calculate_route_distances(routes_cleaned, airports_cleaned)

print(f"Routes with distance calculated: {len(routes_with_distance)}")
print(f"Routes with valid distance: {routes_with_distance['distance_km'].notna().sum()}")

# Show sample
print("\nSample routes with distances:")
routes_with_distance[['source_airport', 'destination_airport', 'distance_km']].head(10)


Calculating route distances...
Routes with distance calculated: 66315
Routes with valid distance: 66315

Sample routes with distances:


| | source_airport | destination_airport | distance_km |
|---|---|---|---|
| 0 | AER | KZN | 1506.825604 |
| 1 | ASF | KZN | 1040.43832 |
| 2 | ASF | MRV | 448.164909 |
| 3 | CEK | KZN | 770.5085 |
| 4 | CEK | OVB | 1338.631467 |
| 5 | DME | KZN | 715.64935 |
| 6 | DME | NBC | 892.382788 |
| 8 | DME | UUA | 951.432198 |
| 9 | EGO | KGD | 1171.881495 |
| 10 | EGO | KZN | 1008.25311 |


## 4. Network Graph Building

Build an undirected NetworkX graph from the cleaned airports (nodes) and routes (edges).
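
A simple `nx.Graph` stores at most one edge per airport pair, which is why the number of edges ends up well below the number of route rows. A small sketch with hypothetical route rows shows the effect, and how a `DiGraph` differs:

```python
import networkx as nx

# Hypothetical route rows: (source, destination, airline).
rows = [("SGN", "HAN", "VN"), ("SGN", "HAN", "VJ"), ("HAN", "SGN", "VN")]

undirected, directed = nx.Graph(), nx.DiGraph()
for src, dst, airline in rows:
    # add_edge on an existing pair updates its attributes instead of adding a new edge.
    undirected.add_edge(src, dst, airline=airline)
    directed.add_edge(src, dst, airline=airline)

print(undirected.number_of_edges(), directed.number_of_edges())
print(directed["SGN"]["HAN"]["airline"])
```

Three rows become one undirected edge or two directed edges, and per-edge attributes such as the airline reflect only the last row written for that pair. If per-airline edges matter, `nx.MultiGraph` or `nx.MultiDiGraph` would keep them separate.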

In [10]:
# Build flight network graph
print("Building flight network graph...")

# Create NetworkX graph
G = nx.Graph()

# Add nodes (airports)
for _, airport in airports_cleaned.iterrows():
    G.add_node(
        airport['airport_id'],
        iata=airport.get('iata', ''),
        name=airport.get('name', ''),
        city=airport.get('city', ''),
        country=airport.get('country', ''),
        latitude=airport.get('latitude', 0),
        longitude=airport.get('longitude', 0)
    )

# Add edges (routes)
for _, route in routes_with_distance.iterrows():
    if pd.notna(route.get('distance_km')):
        G.add_edge(
            route['source_airport_id'],
            route['destination_airport_id'],
            distance=route['distance_km'],
            airline_id=route.get('airline_id', ''),
            stops=route.get('stops', 0)
        )

print(f"Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
print(f"Network density: {nx.density(G):.4f}")
print(f"Is connected: {nx.is_connected(G)}")


Building flight network graph...
Graph created: 7698 nodes, 18679 edges
Network density: 0.0006
Is connected: False


## 5. Analysis Functions

Define functions for route analysis and hub identification.

In [11]:
def find_shortest_path(G, airports_df, source_iata, dest_iata):
    """Find shortest path between two airports."""
    # Find airport IDs from IATA codes
    source_airport = airports_df[airports_df['iata'] == source_iata]
    dest_airport = airports_df[airports_df['iata'] == dest_iata]
    
    if source_airport.empty or dest_airport.empty:
        return {"error": "Airport not found"}
    
    source_id = source_airport.iloc[0]['airport_id']
    dest_id = dest_airport.iloc[0]['airport_id']
    
    try:
        # Find shortest path
        path = nx.shortest_path(G, source_id, dest_id, weight='distance')
        
        # Calculate total distance
        total_distance = 0
        legs = []
        
        for i in range(len(path) - 1):
            edge_data = G[path[i]][path[i+1]]
            distance = edge_data.get('distance', 0)
            total_distance += distance
            
            # Get airport info
            source_airport_info = airports_df[airports_df['airport_id'] == path[i]].iloc[0]
            dest_airport_info = airports_df[airports_df['airport_id'] == path[i+1]].iloc[0]
            
            legs.append({
                "from": source_airport_info.get('iata', ''),
                "to": dest_airport_info.get('iata', ''),
                "distance_km": round(distance, 2)
            })
        
        # Get IATA codes for path
        path_iata = []
        for airport_id in path:
            airport_info = airports_df[airports_df['airport_id'] == airport_id].iloc[0]
            path_iata.append(airport_info.get('iata', ''))
        
        return {
            "source": source_iata,
            "destination": dest_iata,
            "path": path_iata,
            "total_distance_km": round(total_distance, 2),
            "legs": legs,
            "num_stops": len(path) - 2
        }
        
    except nx.NetworkXNoPath:
        return {"error": "No path found between airports"}


def analyze_hubs(G, airports_df, country=None, top_n=10):
    """Analyze airport hubs using centrality measures."""
    # Filter airports by country if specified
    airports_to_analyze = airports_df
    if country:
        airports_to_analyze = airports_df[
            airports_df['country'].str.contains(country, case=False, na=False)
        ]
    
    # Calculate centrality measures
    degree_centrality = nx.degree_centrality(G)
    betweenness_centrality = nx.betweenness_centrality(G, weight='distance')
    
    # Get top hubs
    top_hubs = []
    for airport_id in airports_to_analyze['airport_id']:
        if airport_id in G.nodes:
            airport_info = airports_to_analyze[airports_to_analyze['airport_id'] == airport_id].iloc[0]
            top_hubs.append({
                "airport": airport_info.get('iata', ''),
                "name": airport_info.get('name', ''),
                "city": airport_info.get('city', ''),
                "country": airport_info.get('country', ''),
                "degree_centrality": round(degree_centrality.get(airport_id, 0), 3),
                "betweenness": round(betweenness_centrality.get(airport_id, 0), 3)
            })
    
    # Sort by degree centrality
    top_hubs.sort(key=lambda x: x['degree_centrality'], reverse=True)
    
    return {
        "country": country or "Global",
        "top_hubs": top_hubs[:top_n],
        "total_airports_analyzed": len(airports_to_analyze)
    }


print("Analysis functions defined!")


Analysis functions defined!


## 6. Example Analysis

Run example analyses to demonstrate the functionality.

### 6.1 Shortest Route Example

Find the shortest route (by total distance) from Ho Chi Minh City (SGN) to London Heathrow (LHR).
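
"Shortest" depends on what is minimized. A toy sketch with made-up distances (the graph `G_toy` and its nodes are hypothetical) contrasts the distance-weighted search used by `find_shortest_path` with an unweighted search that minimizes the number of legs:

```python
import networkx as nx

# Toy network with made-up distances in km (not taken from the dataset).
G_toy = nx.Graph()
G_toy.add_weighted_edges_from(
    [("A", "B", 500), ("B", "C", 700), ("A", "C", 1500)],
    weight="distance",
)

# With weight="distance" the search minimizes total km, not the number of legs.
by_distance = nx.shortest_path(G_toy, "A", "C", weight="distance")
by_hops = nx.shortest_path(G_toy, "A", "C")
total_km = nx.path_weight(G_toy, by_distance, weight="distance")

print(by_distance, total_km)
print(by_hops)
```

The weighted search prefers the two-leg route through `B` because it is shorter in kilometres, while the unweighted search returns the direct edge.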

In [12]:
# Example: Find shortest route
print("=== SHORTEST ROUTE ANALYSIS ===")
route_result = find_shortest_path(G, airports_cleaned, 'SGN', 'LHR')

if 'error' in route_result:
    print(f"Error: {route_result['error']}")
else:
    print("Shortest Route Found:")
    print(f"From: {route_result['source']}")
    print(f"To: {route_result['destination']}")
    print(f"Route: {' -> '.join(route_result['path'])}")
    print(f"Total Distance: {route_result['total_distance_km']} km")
    print(f"Number of Stops: {route_result['num_stops']}")
    print("\nRoute Legs:")
    for i, leg in enumerate(route_result['legs'], 1):
        print(f"  {i}. {leg['from']} -> {leg['to']} ({leg['distance_km']} km)")


=== SHORTEST ROUTE ANALYSIS ===
Shortest Route Found:
From: SGN
To: LHR
Route: SGN -> DME -> LHR
Total Distance: 10229.11 km
Number of Stops: 1

Route Legs:
  1. SGN -> DME (7684.06 km)
  2. DME -> LHR (2545.05 km)


### 6.2 Hub Analysis Example

Analyze top hubs globally and by country (Vietnam example).

In [13]:
# Example: Hub Analysis
print("=== HUB ANALYSIS ===")

# Analyze global hubs
global_hubs = analyze_hubs(G, airports_cleaned, top_n=15)
print(f"\nTop {len(global_hubs['top_hubs'])} Global Hubs:")
for i, hub in enumerate(global_hubs['top_hubs'], 1):
    print(f"{i:2d}. {hub['airport']} - {hub['name']} ({hub['city']}, {hub['country']})")
    print(f"     Degree Centrality: {hub['degree_centrality']}, Betweenness: {hub['betweenness']}")

# Analyze hubs in Vietnam
vietnam_hubs = analyze_hubs(G, airports_cleaned, country='Vietnam', top_n=10)
print("\nTop Hubs in Vietnam:")
for i, hub in enumerate(vietnam_hubs['top_hubs'], 1):
    print(f"{i}. {hub['airport']} - {hub['name']} ({hub['city']})")
    print(f"   Degree Centrality: {hub['degree_centrality']}, Betweenness: {hub['betweenness']}")


=== HUB ANALYSIS ===

Top 15 Global Hubs:
 1. FRA - Frankfurt am Main Airport (Frankfurt, Germany)
     Degree Centrality: 0.032, Betweenness: 0.003
 2. AMS - Amsterdam Airport Schiphol (Amsterdam, Netherlands)
     Degree Centrality: 0.032, Betweenness: 0.004
 3. CDG - Charles de Gaulle International Airport (Paris, France)
     Degree Centrality: 0.031, Betweenness: 0.003
 4. ISL - Atatürk International Airport (Istanbul, Turkey)
     Degree Centrality: 0.03, Betweenness: 0.007
 5. ATL - Hartsfield Jackson Atlanta International Airport (Atlanta, United States)
     Degree Centrality: 0.028, Betweenness: 0.004
 6. PEK - Beijing Capital International Airport (Beijing, China)
     Degree Centrality: 0.027, Betweenness: 0.01
 7. ORD - Chicago O'Hare International Airport (Chicago, United States)
     Degree Centrality: 0.027, Betweenness: 0.006
 8. MUC - Munich Airport (Munich, Germany)
     Degree Centrality: 0.025, Betweenness: 0.001
 9. DXB - Dubai International Airport (Dubai, United

## 7. Export Data for Streamlit Application

Export processed data for use in the Streamlit web application.

In [14]:
# Export routes with distances for Streamlit app
print("=== EXPORTING ROUTES WITH DISTANCES ===")

os.makedirs('../data/cleaned', exist_ok=True)

routes_with_distance.to_csv('../data/cleaned/routes_graph.csv', index=False, encoding='utf-8')

print(f"Routes with distances exported: {routes_with_distance['distance_km'].notna().sum()} rows")


=== EXPORTING ROUTES WITH DISTANCES ===
Routes with distances exported: 66315 rows


## 8. Gephi Visualization

Prepare network graph with comprehensive attributes for Gephi visualization.

### 8.1 Create Comprehensive Graph

Create a directed graph carrying all node and edge attributes needed in Gephi.

In [15]:
# Create comprehensive network graph for Gephi visualization
print("=== CREATING COMPREHENSIVE NETWORK GRAPH ===")

# Create directed graph (flights have direction)
G_gephi = nx.DiGraph()

# Add nodes (airports) with comprehensive attributes
print("Adding airport nodes...")
for _, airport in airports_cleaned.iterrows():
    G_gephi.add_node(
        airport['airport_id'],
        label=airport.get('iata', ''),
        name=airport.get('name', ''),
        city=airport.get('city', ''),
        country=airport.get('country', ''),
        latitude=float(airport.get('latitude', 0)),
        longitude=float(airport.get('longitude', 0)),
        altitude=float(airport.get('altitude', 0)),
        timezone=float(airport.get('timezone', 0)) if pd.notna(airport.get('timezone')) else 0,
        type=airport.get('type', 'airport'),
        source=airport.get('source', 'Unknown')
    )

print(f"Added {G_gephi.number_of_nodes()} airport nodes")

# Add edges (routes) with comprehensive attributes
print("Adding route edges...")
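# edge_count counts route rows processed, not unique edges: a DiGraph keeps one edge
# per (source, destination) pair, so routes flown by several airlines collapse into one.
edge_count = 0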
for _, route in routes_with_distance.iterrows():
    if pd.notna(route.get('distance_km')):
        G_gephi.add_edge(
            route['source_airport_id'],
            route['destination_airport_id'],
            weight=float(route['distance_km']),
            distance_km=float(route['distance_km']),
            airline_id=int(route.get('airline_id', 0)) if pd.notna(route.get('airline_id')) else 0,
            stops=int(route.get('stops', 0)) if pd.notna(route.get('stops')) else 0,
            codeshare=route.get('codeshare', 'N'),
            equipment=route.get('equipment', 'Unknown')
        )
        edge_count += 1

print(f"Added {edge_count} route edges")
print(f"Total graph: {G_gephi.number_of_nodes()} nodes, {G_gephi.number_of_edges()} edges")


=== CREATING COMPREHENSIVE NETWORK GRAPH ===
Adding airport nodes...
Added 7698 airport nodes
Adding route edges...
Added 66315 route edges
Total graph: 7698 nodes, 36588 edges


### 8.2 Calculate Network Metrics

Calculate centrality measures and network statistics.
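
Degree centrality values from a directed graph are not directly comparable with those from an undirected one, which is likely why FRA scores 0.032 in section 6.2 but 0.0620 here. A minimal sketch on a hypothetical four-airport network in which every route is flown in both directions:

```python
import networkx as nx

# Hypothetical 4-airport network in which every route is flown in both directions.
pairs = [("HUB", "A"), ("HUB", "B"), ("HUB", "C")]

G_undirected = nx.Graph(pairs)
G_directed = nx.DiGraph(pairs + [(v, u) for u, v in pairs])

# degree_centrality divides each node's degree by (n - 1).
# In a DiGraph the degree is in-degree + out-degree, so the value can exceed 1.
print(nx.degree_centrality(G_undirected)["HUB"])
print(nx.degree_centrality(G_directed)["HUB"])
```

If separate inbound and outbound scores are needed, NetworkX also provides `nx.in_degree_centrality` and `nx.out_degree_centrality`.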

In [17]:
# Calculate network metrics for visualization
print("=== CALCULATING NETWORK METRICS ===")

# Calculate centrality measures
print("Calculating centrality measures...")
degree_centrality = nx.degree_centrality(G_gephi)
betweenness_centrality = nx.betweenness_centrality(G_gephi, weight='weight')
closeness_centrality = nx.closeness_centrality(G_gephi, distance='weight')
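# Note: PageRank treats edge weights as link strength, so weighting by distance
# favors long routes; pass weight=None for an unweighted ranking.
pagerank = nx.pagerank(G_gephi, weight='weight')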

# Add centrality measures to nodes
for node in G_gephi.nodes():
    G_gephi.nodes[node]['degree_centrality'] = degree_centrality.get(node, 0)
    G_gephi.nodes[node]['betweenness_centrality'] = betweenness_centrality.get(node, 0)
    G_gephi.nodes[node]['closeness_centrality'] = closeness_centrality.get(node, 0)
    G_gephi.nodes[node]['pagerank'] = pagerank.get(node, 0)

# Calculate basic network statistics
print("Calculating network statistics...")
stats = {
    'total_nodes': G_gephi.number_of_nodes(),
    'total_edges': G_gephi.number_of_edges(),
    'density': nx.density(G_gephi),
    'is_strongly_connected': nx.is_strongly_connected(G_gephi),
    'is_weakly_connected': nx.is_weakly_connected(G_gephi),
    'number_of_strongly_connected_components': nx.number_strongly_connected_components(G_gephi),
    'number_of_weakly_connected_components': nx.number_weakly_connected_components(G_gephi)
}

print("Network Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

# Find top hubs
top_hubs = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 Hubs by Degree Centrality:")
for i, (node, centrality) in enumerate(top_hubs, 1):
    airport_info = airports_cleaned[airports_cleaned['airport_id'] == node]
    if not airport_info.empty:
        iata = airport_info.iloc[0]['iata']
        name = airport_info.iloc[0]['name']
        city = airport_info.iloc[0]['city']
        country = airport_info.iloc[0]['country']
        print(f"  {i:2d}. {iata} - {name} ({city}, {country}) - Centrality: {centrality:.4f}")


=== CALCULATING NETWORK METRICS ===
Calculating centrality measures...
Calculating network statistics...
Network Statistics:
  total_nodes: 7698
  total_edges: 36588
  density: 0.0006175032918150637
  is_strongly_connected: False
  is_weakly_connected: False
  number_of_strongly_connected_components: 4607
  number_of_weakly_connected_components: 4568

Top 10 Hubs by Degree Centrality:
   1. FRA - Frankfurt am Main Airport (Frankfurt, Germany) - Centrality: 0.0620
   2. CDG - Charles de Gaulle International Airport (Paris, France) - Centrality: 0.0611
   3. AMS - Amsterdam Airport Schiphol (Amsterdam, Netherlands) - Centrality: 0.0602
   4. ISL - Atatürk International Airport (Istanbul, Turkey) - Centrality: 0.0586
   5. ATL - Hartsfield Jackson Atlanta International Airport (Atlanta, United States) - Centrality: 0.0563
   6. ORD - Chicago O'Hare International Airport (Chicago, United States) - Centrality: 0.0531
   7. PEK - Beijing Capital International Airport (Beijing, China) - Centr

### 8.3 Export Gephi Files

Export multiple network views in Gephi format (.gexf) for visualization.
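
Before opening a file in Gephi, it can be worth reading it back with NetworkX to confirm that nodes, edges, and attributes survived the export. The following is a sketch on a tiny made-up graph written to a temporary directory (`G_small` and `sample.gexf` are hypothetical names):

```python
import os
import tempfile

import networkx as nx

# Tiny directed graph with made-up attributes.
G_small = nx.DiGraph()
G_small.add_node(1, label="AAA", country="Nowhere")
G_small.add_node(2, label="BBB", country="Nowhere")
G_small.add_edge(1, 2, distance_km=1234.5, stops=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gexf")
    nx.write_gexf(G_small, path)
    # node_type=int restores integer node ids, which are otherwise read back as strings.
    G_loaded = nx.read_gexf(path, node_type=int)

print(G_loaded.number_of_nodes(), G_loaded.number_of_edges())
print(G_loaded[1][2]["distance_km"])
```

The same round trip applied to the exported `.gexf` files would show whether any attribute was dropped by the cleaning step.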

In [18]:
# Export multiple network views to Gephi format (.gexf)
print("=== EXPORTING MULTIPLE NETWORK VIEWS TO GEPHI ===")

# Ensure export directory exists
os.makedirs('../data/gephi', exist_ok=True)

# Clean the graph for Gephi export - remove pandas NA values
print("Cleaning graph attributes for Gephi compatibility...")

def clean_value(value):
    """Convert missing values (NaN, None, pd.NA) to None for Gephi compatibility."""
    # pd.isna() already returns True for pd.NA, so no separate identity check is needed
    if pd.isna(value):
        return None
    return value

def create_clean_graph(G):
    """Create a clean copy of graph for Gephi export"""
    G_clean = nx.DiGraph()
    
    # Add nodes with cleaned attributes
    for node, attrs in G.nodes(data=True):
        clean_attrs = {}
        for key, value in attrs.items():
            clean_value_result = clean_value(value)
            if clean_value_result is not None:
                clean_attrs[key] = clean_value_result
        G_clean.add_node(node, **clean_attrs)

    # Add edges with cleaned attributes
    for u, v, attrs in G.edges(data=True):
        clean_attrs = {}
        for key, value in attrs.items():
            clean_value_result = clean_value(value)
            if clean_value_result is not None:
                clean_attrs[key] = clean_value_result
        G_clean.add_edge(u, v, **clean_attrs)
    
    return G_clean

# 1. FULL NETWORK
print("1. Creating full network...")
G_full = create_clean_graph(G_gephi)
nx.write_gexf(G_full, '../data/gephi/flight_network_full.gexf')
print(f"   Full network: {G_full.number_of_nodes()} nodes, {G_full.number_of_edges()} edges")

# 2. MAJOR HUBS NETWORK (top 20% by degree centrality)
print("2. Creating major hubs network...")
top_20_percent = int(len(degree_centrality) * 0.2)
top_hubs_nodes = [node for node, _ in sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:top_20_percent]]
G_major_hubs = G_gephi.subgraph(top_hubs_nodes)
G_major_hubs_clean = create_clean_graph(G_major_hubs)
nx.write_gexf(G_major_hubs_clean, '../data/gephi/flight_network_major_hubs.gexf')
print(f"   Major hubs: {G_major_hubs_clean.number_of_nodes()} nodes, {G_major_hubs_clean.number_of_edges()} edges")

# 3. LONG DISTANCE NETWORK (routes > 5000km)
print("3. Creating long-distance network...")
long_distance_edges = [(u, v) for u, v, d in G_gephi.edges(data=True) if d.get('distance_km', 0) > 5000]
G_long_distance = G_gephi.edge_subgraph(long_distance_edges)
G_long_distance_clean = create_clean_graph(G_long_distance)
nx.write_gexf(G_long_distance_clean, '../data/gephi/flight_network_long_distance.gexf')
print(f"   Long distance: {G_long_distance_clean.number_of_nodes()} nodes, {G_long_distance_clean.number_of_edges()} edges")

print("\n=== NETWORK VIEWS EXPORTED ===")
print("- flight_network_full.gexf (complete network)")
print("- flight_network_major_hubs.gexf (top 20% hubs)")
print("- flight_network_long_distance.gexf (routes > 5000km)")


=== EXPORTING MULTIPLE NETWORK VIEWS TO GEPHI ===
Cleaning graph attributes for Gephi compatibility...
1. Creating full network...
   Full network: 7698 nodes, 36588 edges
2. Creating major hubs network...
   Major hubs: 1539 nodes, 31899 edges
3. Creating long-distance network...
   Long distance: 308 nodes, 2550 edges

=== NETWORK VIEWS EXPORTED ===
- flight_network_full.gexf (complete network)
- flight_network_major_hubs.gexf (top 20% hubs)
- flight_network_long_distance.gexf (routes > 5000km)


## 9. Network Summary

Print final statistics for each exported view and suggest which file to open in Gephi for which task.

In [19]:
# Network Summary
print("=== NETWORK SUMMARY ===")
print(f"Full network: {G_full.number_of_nodes()} airports, {G_full.number_of_edges()} routes")
print(f"Major hubs: {G_major_hubs_clean.number_of_nodes()} airports, {G_major_hubs_clean.number_of_edges()} routes")
print(f"Long distance: {G_long_distance_clean.number_of_nodes()} airports, {G_long_distance_clean.number_of_edges()} routes")

# Show top hubs
print("\nTop 10 Hubs by Degree Centrality:")
for i, (node, centrality) in enumerate(top_hubs[:10], 1):
    airport_info = airports_cleaned[airports_cleaned['airport_id'] == node]
    if not airport_info.empty:
        iata = airport_info.iloc[0]['iata']
        name = airport_info.iloc[0]['name']
        city = airport_info.iloc[0]['city']
        country = airport_info.iloc[0]['country']
        print(f"  {i:2d}. {iata} - {name} ({city}, {country}) - Centrality: {centrality:.4f}")

print("\n=== GEPHI ANALYSIS RECOMMENDATIONS ===")
print("1. flight_network_major_hubs.gexf - Best for hub analysis (manageable size)")
print("2. flight_network_long_distance.gexf - Best for long-haul routes")
print("3. flight_network_full.gexf - Complete overview (may be too dense)")


=== NETWORK SUMMARY ===
Full network: 7698 airports, 36588 routes
Major hubs: 1539 airports, 31899 routes
Long distance: 308 airports, 2550 routes

Top 10 Hubs by Degree Centrality:
   1. FRA - Frankfurt am Main Airport (Frankfurt, Germany) - Centrality: 0.0620
   2. CDG - Charles de Gaulle International Airport (Paris, France) - Centrality: 0.0611
   3. AMS - Amsterdam Airport Schiphol (Amsterdam, Netherlands) - Centrality: 0.0602
   4. ISL - Atatürk International Airport (Istanbul, Turkey) - Centrality: 0.0586
   5. ATL - Hartsfield Jackson Atlanta International Airport (Atlanta, United States) - Centrality: 0.0563
   6. ORD - Chicago O'Hare International Airport (Chicago, United States) - Centrality: 0.0531
   7. PEK - Beijing Capital International Airport (Beijing, China) - Centrality: 0.0530
   8. MUC - Munich Airport (Munich, Germany) - Centrality: 0.0494
   9. DME - Domodedovo International Airport (Moscow, Russia) - Centrality: 0.0487
  10. DFW - Dallas Fort Worth International