<a href="https://colab.research.google.com/github/AyushiKashyapp/indian_railways_network/blob/main/IndianRailways_Triples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Indian Railways Network

The Indian Railways network is the fourth largest railway network in the world. This project models the Indian Railways network as a knowledge graph.

The dataset used in this project can be found here: [Railways Dataset](https://www.kaggle.com/datasets/sripaadsrinivasan/indian-railways-dataset).

The files **trains** and **stations** are available in GeoJSON format.

In this Python notebook, we process the GeoJSON files and extract information about the stations and trains in the form of triples.

**Importing libraries**

- **pandas**: Provides data structures and functions needed for manipulating numerical tables and time series data.
- **geopandas**: Extends the capabilities of pandas to allow spatial operations on geometric types.
- **shapely.geometry.LineString**: Used for creating and manipulating line geometries, which are essential for representing connections or paths in geographic data.
- **json**: Provides methods for parsing JSON (JavaScript Object Notation) data.
- **networkx**: Provides tools for creating, manipulating, and studying the structure, dynamics, and functions of complex networks of nodes and edges.
- **matplotlib.pyplot**: A plotting library used for creating static, animated, and interactive visualizations in Python.

In [3]:
!pip install geopandas



In [27]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import json
import networkx as nx
import matplotlib.pyplot as plt

**Function to load GeoJSON file**

The `load_geojson` function loads a GeoJSON file, keeps only the features with valid LineString geometries, and converts them into a GeoDataFrame.

- Opens and parses the GeoJSON file at `file_path` using the `json` module.
- Iterates over the features and checks that each geometry is of type LineString and contains at least two coordinate points.
- For each candidate, attempts to create a `shapely.geometry.LineString` object. If that succeeds, the feature is added to the `features` list. If an error occurs while processing a feature, the error is caught, a message is printed, and the feature is ignored. Features that are not LineStrings, or that have fewer than two points, are skipped silently.
- After filtering, creates a GeoDataFrame from the valid features using `geopandas.GeoDataFrame.from_features`. If the file itself cannot be loaded, an error message is printed and an empty GeoDataFrame is returned.

In [9]:
def load_geojson(file_path):
    try:
        # Attempt to load GeoJSON file
        with open(file_path) as f:
            data = json.load(f)

        # Keep only features with valid LineString geometries
        features = []
        for feature in data['features']:
            try:
                geometry = feature['geometry']
                if geometry['type'] == 'LineString':
                    coordinates = geometry['coordinates']
                    if len(coordinates) >= 2:  # Ensure LineString has at least 2 points
                        LineString(coordinates)  # Attempt to create LineString object
                        features.append(feature)
            except Exception as e:
                print(f"Ignoring feature due to error: {e}")

        # Create GeoDataFrame from valid features
        gdf = gpd.GeoDataFrame.from_features(features)

        return gdf
    except Exception as e:
        print(f"Error loading GeoJSON: {e}")
        # Return an empty GeoDataFrame on error
        return gpd.GeoDataFrame()
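
The filtering rule inside `load_geojson` can be tried in isolation. The following is a minimal, dependency-free sketch: `is_usable_linestring` is a hypothetical helper (not part of the notebook), and the sample features are made up for illustration. Unlike the notebook code, it does not construct a shapely `LineString`, and it rejects a missing geometry quietly instead of printing an error.

```python
def is_usable_linestring(feature):
    """Return True if a GeoJSON feature is a LineString with at least 2 points."""
    geometry = feature.get("geometry") or {}
    if geometry.get("type") != "LineString":
        return False
    coordinates = geometry.get("coordinates") or []
    return len(coordinates) >= 2


# Illustrative features only; these are not rows from the dataset.
sample_features = [
    {"type": "Feature", "properties": {"name": "two points"},
     "geometry": {"type": "LineString", "coordinates": [[74.88, 32.70], [74.95, 32.76]]}},
    {"type": "Feature", "properties": {"name": "one point"},
     "geometry": {"type": "LineString", "coordinates": [[74.88, 32.70]]}},
    {"type": "Feature", "properties": {"name": "a point"},
     "geometry": {"type": "Point", "coordinates": [75.45, 27.25]}},
    {"type": "Feature", "properties": {"name": "no geometry"}, "geometry": None},
]

kept = [f["properties"]["name"] for f in sample_features if is_usable_linestring(f)]
print(kept)
```

Only the first sample feature passes both checks, which matches the two conditions tested in `load_geojson`.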


In [13]:
trains_file_path = '/content/trains.json'

trains = load_geojson(trains_file_path)

trains.head()

Unnamed: 0,geometry,third_ac,arrival,from_station_code,name,zone,chair_car,first_class,duration_m,sleeper,...,departure,return_train,to_station_code,second_ac,classes,to_station_name,duration_h,type,first_ac,distance
0,"LINESTRING (74.88012 32.70697, 74.95334 32.762...",0,12:15:00,JAT,Jammu Tawi Udhampur Special,NR,0,0,35.0,0,...,10:40:00,4602,UHP,0,,UDHAMPUR,1.0,DEMU,0,53.0
1,"LINESTRING (75.15488 32.92664, 75.14543 32.863...",0,08:35:00,UHP,UDHAMPUR JAMMUTAWI DMU,NR,0,0,50.0,0,...,06:45:00,4601,JAT,0,,JAMMU TAWI,1.0,DEMU,0,53.0
2,"LINESTRING (74.88012 32.70697, 74.95334 32.762...",0,17:50:00,JAT,JAT UDAHMPUR DMU,NR,0,0,35.0,0,...,16:15:00,4604,UHP,0,,UDHAMPUR,1.0,DEMU,0,53.0
3,"LINESTRING (75.15488 32.92664, 75.14543 32.863...",0,19:50:00,UHP,UDHAMPUR JAMMUTAWI DMU,NR,0,0,30.0,0,...,18:20:00,4603,JAT,0,,JAMMU TAWI,1.0,DEMU,0,53.0
4,"LINESTRING (72.84054 19.06191, 72.84008 19.069...",1,12:30:00,BDTS,Mumbai BandraT-Bikaner SF Special,NWR,0,0,55.0,1,...,14:35:00,4727,BKN,1,,BIKANER JN,21.0,SF,0,1212.0


In [14]:
stations_file_path = '/content/stations.json'
stations = gpd.read_file(stations_file_path)
stations.head()

Unnamed: 0,state,code,name,zone,address,geometry
0,Rajasthan,BDHL,Badhal,NWR,"Kishangarh Renwal, Rajasthan",POINT (75.45165 27.25206)
1,,XX-BECE,XX-BECE,,,
2,,XX-BSPY,XX-BSPY,,,
3,,YY-BPLC,YY-BPLC,,,
4,Uttar Pradesh,KHH,KICHHA,NER,"Kichha, Uttar Pradesh",POINT (79.51975 28.91343)


**Constructing a Directed Graph**

This cell constructs a directed graph representing the Indian Railways network using the networkx library, and then converts the graph data into RDF-like triples.

- **Adding Nodes (Stations)**: Iterates over each row in the stations DataFrame. Adds each station as a node in the graph G with attributes including state, code, zone, address, and geometry.
- **Adding Edges (Train Connections)**: Iterates over each row in the trains DataFrame and extracts the 'from' and 'to' station names. Adds an edge from the 'from' station to the 'to' station if one does not already exist, so only the first train encountered for each ordered station pair is stored. Edge attributes are the train number, departure time, arrival time, and duration (hours and minutes).
- **Extracting Nodes and Edges**: Extracts the nodes and the edges, each with its attribute data, as lists.
- **Generating Triples**: Iterates over the nodes and creates RDF-like triples for each node: a type triple (node, 'rdf:type', 'station') and one attribute triple per node attribute. Iterates over the edges and creates RDF-like triples for each edge: a connection triple (u, 'connectedTo', v) and one attribute triple per edge attribute, with the string `u-v` as the subject.
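
The triple pattern described in this list can be tried on toy data first. This is a minimal sketch that uses plain dictionaries as stand-ins for the graph, so it runs without networkx; the station names, codes, and train numbers are invented for illustration and are not taken from the dataset.

```python
# Toy stand-ins for the stations and trains tables (all values are made up).
toy_stations = {
    "Station A": {"code": "AAA", "zone": "NR"},
    "Station B": {"code": "BBB", "zone": "NR"},
}
toy_trains = [
    {"from": "Station A", "to": "Station B", "number": "T1", "departure": "10:40:00"},
    {"from": "Station A", "to": "Station B", "number": "T2", "departure": "16:15:00"},
]

# Keep one edge per ordered (from, to) pair, mirroring `if not G.has_edge(...)`.
toy_edges = {}
for train in toy_trains:
    pair = (train["from"], train["to"])
    if pair not in toy_edges:
        toy_edges[pair] = {"train_number": train["number"],
                           "departure": train["departure"]}

toy_triples = []

# Node triples: a type triple plus one triple per attribute.
for name, attrs in toy_stations.items():
    toy_triples.append((name, "rdf:type", "station"))
    for key, value in attrs.items():
        toy_triples.append((name, key, str(value)))

# Edge triples: a connection triple plus attribute triples on the "u-v" subject.
for (u, v), attrs in toy_edges.items():
    toy_triples.append((u, "connectedTo", v))
    for key, value in attrs.items():
        toy_triples.append((u + "-" + v, key, str(value)))

for triple in toy_triples:
    print(triple)
```

The second toy train shares its station pair with the first, so it contributes no triples. If every train should be represented, one option would be a multigraph or a train-centric subject, which is a design change beyond what this notebook does.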

In [25]:
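# Directed graph: stations are nodes, train connections are edges
G = nx.DiGraph()

# Add one node per station, keyed by station name
for idx, station in stations.iterrows():
  G.add_node(station['name'],
             state = station['state'],
             code = station['code'],
             zone = station['zone'],
             address = station['address'],
             geometry = station['geometry'])

# Add one edge per ordered (from, to) station pair
for idx, train in trains.iterrows():
  from_station = train['from_station_name']
  to_station = train['to_station_name']

  # Only the first train seen for a pair is stored; later ones are skipped
  if not G.has_edge(from_station, to_station):
    G.add_edge(from_station, to_station,
               train_number = train['number'],
               departure = train['departure'],
               arrival = train['arrival'],
               duration_hours = train['duration_h'],
               duration_minutes = train['duration_m'])

# Materialize nodes and edges together with their attribute dictionaries
nodes = list(G.nodes(data=True))
edges = list(G.edges(data=True))

triples = []

# Node triples: one type triple plus one triple per attribute
for node, data in nodes:
  triples.append((node, 'rdf:type','station'))
  for key, value in data.items():
    triples.append((node, key, str(value)))

# Edge triples: one connection triple plus one triple per attribute,
# using "<from>-<to>" as the subject of the attribute triples
for u, v, data in edges:
  triples.append((u, 'connectedTo', v))
  for key, value in data.items():
    triples.append((u + '-' + v, key, str(value)))

print(f"Triples: {len(triples)}")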

Triples: 74448
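
Some station rows in the preview (for example `XX-BECE`) have no state, zone, address, or geometry. Because every attribute value is passed through `str()`, such gaps end up as literal strings like `'None'` or `'nan'` in the triples. Below is a hedged sketch of one way to skip them; `is_missing` and `attribute_triples` are hypothetical helpers, and the assumption is that missing values appear as `None` or a float NaN.

```python
import math


def is_missing(value):
    """True for None and float NaN, the two missing-value markers assumed here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def attribute_triples(subject, attrs):
    """Build (subject, key, str(value)) triples, skipping missing values."""
    return [(subject, key, str(value))
            for key, value in attrs.items()
            if not is_missing(value)]


# A placeholder-style station similar to the XX-BECE row in the preview.
placeholder = {"state": None, "code": "XX-BECE", "zone": float("nan"),
               "address": None, "geometry": None}

print(attribute_triples("XX-BECE", placeholder))
```

Applying a filter like this would lower the triple count reported above, so whether to use it depends on whether explicit "missing" triples are wanted in the knowledge graph.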


**Exporting the triples as an Excel file**

In [28]:
df = pd.DataFrame(triples, columns=['Subject', 'Predicate', 'Object'])

excel_file = 'train_station_triples.xlsx'
df.to_excel(excel_file, index=False)

print(f"Triples saved to {excel_file}")

Triples saved to train_station_triples.xlsx
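
Writing `.xlsx` files with pandas relies on an Excel engine such as openpyxl being installed. Where a plain-text export is enough, the same three-column layout can be written as CSV with the standard library alone. The sketch below writes to an in-memory buffer and uses two sample triples; the address value comes from the stations preview, and everything else is illustrative.

```python
import csv
import io

# Two sample triples in the same (Subject, Predicate, Object) shape.
sample_triples = [
    ("Badhal", "rdf:type", "station"),
    ("Badhal", "address", "Kishangarh Renwal, Rajasthan"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Subject", "Predicate", "Object"])
writer.writerows(sample_triples)  # values containing commas are quoted automatically

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer can be replaced with a file opened via `open(path, "w", newline="")`, where `path` is whatever output file name is chosen.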
