# ESG Ontology RDF Schema Overview

This notebook generates an ESG (Environmental, Social, Governance) ontology schema using `rdflib`. The schema gives ESG data a shared vocabulary of classes and properties, so the data can be queried semantically and passed to downstream analysis such as PCA.

## Ontology Structure

### Classes
- `Company`, `Industry`, `ESGMetric`, `Category`, `Pillar`, `CalculationModel`, `ESGObservation`

### Object Properties
- `hasCompany`, `hasMetric`, `hasCategory`, `hasPillar`, `belongsToIndustry`, `relatedToMetric`

### Data Properties
- `hasYear`, `hasValue`, `hasUnit`, `hasDataType`

## Output
The schema is saved in Turtle format at:

```plaintext
./Normalized_Data/esg_ontology_schema.ttl
```

In [None]:
import os
from rdflib import Graph, Namespace, RDF, OWL, XSD

# === Define Namespaces ===
EX = Namespace("http://example.org/esg#")
g = Graph()
g.bind("ex", EX)
g.bind("xsd", XSD)
g.bind("owl", OWL)

# === Ontology Classes ===
classes = [
    "Company",
    "Industry",
    "ESGMetric",
    "Category",
    "Pillar",
    "CalculationModel",
    "ESGObservation"
]
for cls in classes:
    g.add((EX[cls], RDF.type, OWL.Class))

# === Object Properties ===
object_properties = [
    "hasCompany",
    "hasMetric",
    "hasCategory",
    "hasPillar",
    "belongsToIndustry",
    "relatedToMetric"
]
for prop in object_properties:
    g.add((EX[prop], RDF.type, OWL.ObjectProperty))

# === Data Properties ===
data_properties = [
    "hasYear",
    "hasValue",
    "hasUnit",
    "hasDataType"
]
for prop in data_properties:
    g.add((EX[prop], RDF.type, OWL.DatatypeProperty))

# === Universal schema path ===
schema_dir = "Normalized_Data"
schema_file = "esg_ontology_schema.ttl"
os.makedirs(schema_dir, exist_ok=True)
schema_path = os.path.join(schema_dir, schema_file)

# === Save schema to TTL ===
g.serialize(destination=schema_path, format="turtle")
print(f"ESG ontology schema saved to: {schema_path}")

## ESG RDF Triples: SASB-Aligned Industry Data

This notebook uses `rdflib` to convert cleaned ESG metric data for **Semiconductors** and **Biotechnology & Pharmaceuticals** into RDF triples for semantic querying. It writes a single `graph_output.ttl` file to the `ontology-graphdb/graphdb-import/` folder.

### Overview
- **Input**: `semiconductors_sasb_final.csv` and `biopharma_sasb_final.csv` from `Normalized_Data/`
- **Triples** include:
  - Class declarations (`Company`, `Industry`, `ESGMetric`, `Pillar`, etc.)
  - ESG observations per company, metric, and year
  - Object and data property links like `hasPillar`, `hasValue`, `belongsToIndustry`, etc.
- **Output**: Serialized `.ttl` file saved at `ontology-graphdb/graphdb-import/graph_output.ttl`

### Key RDF Concepts
| RDF Class          | Description                                                |
|--------------------|------------------------------------------------------------|
| `Company`          | Each distinct ESG-reporting organization                   |
| `Industry`         | The industry a company belongs to                          |
| `ESGMetric`        | A reported ESG indicator (e.g., CO2 Scope 1 Emission)      |
| `Pillar`           | High-level ESG dimension (Environmental, Social, Governance) |
| `Category`         | Group of related metrics (e.g., GHG Emissions)             |
| `CalculationModel` | Placeholder for the model associated with a metric         |
| `ESGObservation`   | A single metric entry with its value, year, and unit       |

This TTL file is later ingested into **GraphDB** to support SPARQL queries, metric selection, and ontology-enhanced PCA.

In [None]:
import pandas as pd
from rdflib import Graph, Namespace, Literal, RDF, OWL, XSD, URIRef
import os
import re

# === Setup RDF Graph and Namespaces
EX = Namespace("http://example.org/esg#")
g = Graph()
g.bind("ex", EX)
g.bind("xsd", XSD)
g.bind("owl", OWL)

# === URI Cleaner
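def safe_uri(text):
    """Turn free text into a URI in the EX namespace (lowercase, underscores for other characters)."""
    if pd.isnull(text):
        return EX["undefined"]
    clean = str(text).strip().lower()
    clean = re.sub(r'[^\w\-]', '_', clean)  # replace anything that is not a word character or hyphen
    clean = re.sub(r'__+', '_', clean)      # collapse runs of underscores
    clean = clean.strip('_')
    # Fall back to "undefined" if nothing usable remains (e.g. the input was only punctuation)
    return EX[clean or "undefined"]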

# === Load SASB Final Files
base_path = "Normalized_Data"
bio_path = os.path.join(base_path, "biopharma_sasb_final.csv")
semi_path = os.path.join(base_path, "semiconductors_sasb_final.csv")
df = pd.concat([pd.read_csv(semi_path), pd.read_csv(bio_path)], ignore_index=True)

# === Preprocess
for col in ["company_name", "Industry", "metric", "category", "pillar"]:
    df[col] = df[col].astype(str).str.strip().str.lower()

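# Ensure a unit column exists and that missing units become "unknown" rather than NaN
if "metric_unit" not in df.columns:
    df["metric_unit"] = "unknown"
df["metric_unit"] = df["metric_unit"].fillna("unknown")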
df = df.drop_duplicates()

# === Generate Triples
for _, row in df.iterrows():
    try:
        company = row["company_name"]
        industry = row["Industry"]
        metric = row["metric"]
        category = row["category"]
        pillar = row.get("pillar", "unknown")
        model = metric + "_model"
        year = int(row["year"])
        value = float(row["metric_value"])
        unit = row.get("metric_unit", "unknown")

        # URIs
        company_uri = safe_uri(company)
        industry_uri = safe_uri(industry)
        metric_uri = safe_uri(f"{metric}_{industry}")
        category_uri = safe_uri(category)
        model_uri = safe_uri(model)
        pillar_uri = safe_uri(pillar)
        obs_uri = safe_uri(f"{company}_{industry}_{metric}_{category}_{pillar}_{year}")

        # === Class Declarations
        g.add((company_uri, RDF.type, EX.Company))
        g.add((industry_uri, RDF.type, EX.Industry))
        g.add((metric_uri, RDF.type, EX.ESGMetric))
        g.add((model_uri, RDF.type, EX.CalculationModel))
        g.add((category_uri, RDF.type, EX.Category))
        g.add((pillar_uri, RDF.type, EX.Pillar))
        g.add((obs_uri, RDF.type, EX.ESGObservation))

        # === Metric Schema and Model Relationship
        g.add((obs_uri, EX.belongsToIndustry, industry_uri))
        g.add((metric_uri, EX.hasPillar, pillar_uri))
        g.add((model_uri, EX.relatedToMetric, metric_uri))

        # === Observation Details
        g.add((obs_uri, EX.hasCompany, company_uri))
        g.add((obs_uri, EX.hasMetric, metric_uri))
        g.add((obs_uri, EX.hasCategory, category_uri))
        g.add((obs_uri, EX.hasYear, Literal(year, datatype=XSD.gYear)))
        g.add((obs_uri, EX.hasValue, Literal(value, datatype=XSD.float)))
        g.add((obs_uri, EX.hasUnit, Literal(unit)))
        g.add((obs_uri, EX.hasPillar, pillar_uri))

    except Exception as e:
        print(f"Skipped row due to error: {e}")
        continue

# === Export TTL to universal path
ttl_output_path = os.path.join("ontology-graphdb", "graphdb-import", "graph_output.ttl")
os.makedirs(os.path.dirname(ttl_output_path), exist_ok=True)

g.serialize(destination=ttl_output_path, format="turtle")
print(f"RDF successfully saved to:\n{ttl_output_path}")
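
Because `safe_uri` replaces every character other than word characters and hyphens with an underscore, two distinct names can end up with the same URI, which would silently merge their triples. The sketch below is a hypothetical check, not part of the original pipeline: `safe_local_name` mirrors the normalization steps of `safe_uri` using only the standard library, and `find_collisions` groups inputs that normalize to the same local name. The company names are made up.

```python
import re
from collections import defaultdict

def safe_local_name(text):
    """Apply the same normalization steps as safe_uri, returning only the local name."""
    clean = str(text).strip().lower()
    clean = re.sub(r"[^\w\-]", "_", clean)  # non-word characters become underscores
    clean = re.sub(r"__+", "_", clean)      # collapse runs of underscores
    return clean.strip("_")

def find_collisions(names):
    """Group distinct input names that normalize to the same local name."""
    groups = defaultdict(set)
    for name in names:
        groups[safe_local_name(name)].add(name)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

# Hypothetical company names, for illustration only
names = ["A&B Corp", "A B Corp", "Acme Inc."]
collisions = find_collisions(names)
print(collisions)
```

Running a check like this on `df["company_name"].unique()` before generating triples would show whether any companies are merged by the URI cleaner.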