## OECD.AI – AI Policy Metadata Collection

In this section, we automatically collect a small sample of AI policy records from the **OECD.AI** policy observatory. The goal is to construct a structured metadata table that can later be combined with other AI ethics and governance sources (e.g., UNESCO, White House OSTP).

The script performs the following steps:

1. **Set up folders for data storage**  
   - Creates a base directory: `data/policies/`  
   - Creates a subfolder for documents: `data/policies/docs/`  
   These folders will store both the raw OECD CSV file and any downloaded policy documents.

2. **Download the official OECD AI policies CSV**  
   - Uses the URL:  
     `https://wp.oecd.ai/app/uploads/2024/03/oecd-ai-all-ai-policies.csv`  
   - Sends a polite HTTP request with a custom `User-Agent` for class use.  
   - Saves the file locally as `data/policies/oecd_policies.csv`.

3. **Load and sample the OECD dataset**  
   - Reads the CSV into a pandas DataFrame.  
   - Selects the **first 3 records** as a small sample for this project phase (the first rows of the file, not a random or stratified draw).  
   - Extracts key metadata fields such as:
     - `Title`
     - `Country`
     - `URL`
     - `Publication date` (or `Date`)

4. **Normalize publication dates**  
   - Uses `dateutil.parser` to convert fuzzy date strings into normalized ISO date format (`YYYY-MM-DD`) where possible.

5. **Build a clean metadata row for each sampled policy**  
   For each record, we construct a standardized dictionary with fields like:
   - `id` – unique identifier for the policy (e.g., `oecd_0`, `oecd_1`, …)  
   - `title` – policy title from the OECD CSV  
   - `organization` – fixed as `OECD.AI`  
   - `country` – country or region associated with the policy  
   - `url` / `doc_url` – original online location of the policy  
   - `publication_date` – normalized ISO date  
   - `type`, `format`, `language`, `notes`, `local_path` – enriched metadata fields

6. **Optionally download the policy documents themselves**  
   - For each sampled policy, the script attempts to download the content from `doc_url`.  
   - It inspects the HTTP `Content-Type` or file extension to decide whether to save as `.pdf` or `.html`.  
   - Files are stored under `data/policies/docs/` with safe, sanitized filenames.  
   - The `local_path` field is updated with the path to the saved file.

7. **Save the final sample dataset as a CSV**  
   - All metadata rows are written to `data/policies/ai_ethics_policies_sample.csv`.  
   - This CSV represents a **reproducible, documented subset** of OECD AI policy metadata that can be merged with other sources in later project phases.

Overall, this scraper is the first building block for our **AI ethics policy dataset**, demonstrating how to responsibly acquire, normalize, and persist public policy data for analysis.


In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
# one time run
import pathlib

# Base directory in the Google Drive
BASE_DIR = pathlib.Path("/content/drive/MyDrive/DSCI-511")
DATA_DIR = BASE_DIR / "data" / "policies"
DOC_DIR = DATA_DIR / "docs"

DATA_DIR.mkdir(parents=True, exist_ok=True)
DOC_DIR.mkdir(parents=True, exist_ok=True)

print("Data directory:", DATA_DIR)
print("Docs directory:", DOC_DIR)

Data directory: /content/drive/MyDrive/DSCI-511/data/policies
Docs directory: /content/drive/MyDrive/DSCI-511/data/policies/docs


In [29]:
import os, re, csv, time
import pandas as pd
from datetime import datetime
import requests
from dateutil import parser as dateparser

OUT_CSV = DATA_DIR / "ai_ethics_policies_sample.csv"

def sanitize_filename(s: str) -> str:
    """Make filenames safe for saving."""
    return re.sub(r'[^a-zA-Z0-9._-]+', '_', s)[:120]

def fetch(url, headers=None):
    """Wrapper around GET request with a polite user agent."""
    h = {"User-Agent": "Priti-DSCI-Project/1.0 (class use)"} | (headers or {})
    r = requests.get(url, headers=h, timeout=30)
    r.raise_for_status()
    return r

# 1. OECD AI Policies Data Scraping

rows = []
oecd_csv_url = "https://wp.oecd.ai/app/uploads/2024/03/oecd-ai-all-ai-policies.csv"
try:
    # Fetch and save the CSV
    r = fetch(oecd_csv_url)
    local_csv = DATA_DIR / "oecd_policies.csv"
    local_csv.write_bytes(r.content)
    print("Downloaded OECD policies CSV to:", local_csv)

    # Load CSV
    oecd_df = pd.read_csv(local_csv)

    # Take first 3 records as sample
    for i, rec in oecd_df.head(3).iterrows():
        title = str(rec.get("Title") or rec.get("title") or "OECD Policy")
        country = rec.get("Country") or ""
        url = rec.get("URL") or rec.get("Url") or ""
        pub = rec.get("Publication date") or rec.get("Date") or ""

        # Clean publication date
        pub_dt = ""
        if isinstance(pub, str) and pub.strip():
            try:
                pub_dt = dateparser.parse(pub, fuzzy=True).date().isoformat()
            except Exception:
                pub_dt = ""

        rows.append({
            "id": f"oecd_{i}",
            "title": title,
            "organization": "OECD.AI",
            "country": country,
            "url": url,
            "doc_url": url,
            "publication_date": pub_dt,
            "type": "policy",
            "format": "html" if (isinstance(url, str) and url.lower().endswith(".html")) else "unknown",
            "language": "",
            "notes": "From OECD.AI policies CSV",
            "local_path": ""
        })

except Exception as e:
    print("OECD CSV fetch failed:", e)

# 2. Download OECD documents

for row in rows:
    doc_url = row.get("doc_url")
    if not doc_url or not isinstance(doc_url, str) or not doc_url.strip():
        continue

    try:
        print("Downloading doc:", doc_url)
        rr = fetch(doc_url)

        content_type = rr.headers.get("Content-Type", "").lower()
        is_pdf = content_type.startswith("application/pdf") or doc_url.lower().endswith(".pdf")
        ext = ".pdf" if is_pdf else ".html"

        fname = sanitize_filename(f"{row['id']}_{row['title']}{ext}")
        path = DOC_DIR / fname

        if is_pdf:
            path.write_bytes(rr.content)
        else:
            path.write_text(rr.text, encoding="utf-8", errors="ignore")

        row["local_path"] = str(path)

        time.sleep(0.7)  # polite delay

    except Exception as e:
        print(" Document download failed:", e)
        row["notes"] += f" | download_failed:{e}"

# 3. Save metadata CSV

if rows:
    with open(OUT_CSV, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)

    print(f"Saved {len(rows)} OECD sample records to {OUT_CSV}")
else:
    print(" No OECD records saved.")


Downloaded OECD policies CSV to: /content/drive/MyDrive/DSCI-511/data/policies/oecd_policies.csv
Saved 3 OECD sample records to /content/drive/MyDrive/DSCI-511/data/policies/ai_ethics_policies_sample.csv
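
The column listing printed later in this notebook shows headers such as `English name`, `Public access URL`, and `Start date` rather than `Title`, `URL`, or `Publication date`, so the lookups above may fall back to their defaults (which would also be consistent with no "Downloading doc:" lines appearing in the output). A minimal sketch of a more tolerant lookup follows. The helper names `first_present` and `year_to_iso` are hypothetical, and treating `Start date` as the publication date (pinned to January 1) is an assumption, not something the OECD file states.

```python
import math
import pandas as pd

def first_present(rec, candidates, default=""):
    """Return the first non-missing, non-blank value among candidate column names."""
    for name in candidates:
        value = rec.get(name)
        # pd.isna catches None and NaN; a plain `or` chain would keep NaN because it is truthy
        if not pd.isna(value) and str(value).strip():
            return value
    return default

def year_to_iso(value):
    """Turn a year-like value such as 2016.0 into '2016-01-01' (January 1 is an assumption)."""
    try:
        year = float(value)
    except (TypeError, ValueError):
        return ""
    if math.isnan(year) or not 1900 <= year <= 2100:
        return ""
    return f"{int(year):04d}-01-01"

# Toy record shaped like one row of the OECD CSV (values taken from the data preview)
rec = pd.Series({
    "English name": "SPACERESOURCES.LU",
    "Country": "Luxembourg",
    "Start date": 2016.0,
    "Public access URL": "http://www.spaceresources.public.lu/en.html",
})

title = first_present(rec, ["Title", "English name"], default="OECD Policy")
url = first_present(rec, ["URL", "Public access URL", "Platform URL"])
pub = year_to_iso(first_present(rec, ["Publication date", "Start date"]))

print(title, url, pub)
```

Listing the older names first keeps the lookup compatible with a CSV that does use `Title` or `URL`, while still finding values in the file as it appears here.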


## 🔧 Data Cleaning & Preprocessing

Before merging the OECD dataset with other sources, we cleaned it thoroughly so that it is accurate, consistent, and suitable for merging. Raw data often contains issues such as missing values, duplicated rows, and inconsistent data types.

In this section, we walk through the complete data cleaning process, including:
- Identifying and handling missing values  
- Detecting and removing duplicate records   
- Fixing formatting inconsistencies  
- Handling outliers  
- Engineering clean, reliable features  

By the end of this phase, we will have a polished and fully preprocessed dataset that is ready for exploratory analysis.
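
For the duplicate check listed above, a minimal sketch might distinguish fully identical rows from rows that merely share an ID. The toy frame below is an assumption that mimics the OECD file, where one `Policy initiative ID` can appear on several rows that differ in their policy-instrument columns; the shortened IDs are hypothetical. Whether rows sharing an ID should be collapsed depends on the analysis.

```python
import pandas as pd

# Toy data: rows 0 and 1 share an ID but differ in instrument category;
# rows 2 and 3 are identical in every column.
toy = pd.DataFrame({
    "Policy initiative ID": ["p/1337", "p/1337", "p/1355", "p/1355"],
    "English name": ["DIGITAL LUXEMBOURG", "DIGITAL LUXEMBOURG",
                     "DIGITAL TECH FUND", "DIGITAL TECH FUND"],
    "Policy instrument type category": ["AI enablers and other incentives", "Governance",
                                        "Financial support", "Financial support"],
})

# Rows that repeat an earlier row in every column
exact_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row's ID, regardless of other columns
shared_id = int(toy.duplicated(subset=["Policy initiative ID"]).sum())

# Remove only exact duplicates (keeps one row per distinct instrument)
deduped = toy.drop_duplicates().reset_index(drop=True)

# Alternatively, keep a single row per initiative
one_per_initiative = toy.drop_duplicates(subset=["Policy initiative ID"], keep="first")

print(exact_dupes, shared_id, len(deduped), len(one_per_initiative))
```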

In [30]:
!find "/content/drive/MyDrive" -iname "oecd_policies.csv"


/content/drive/MyDrive/DSCI-511-Project_AI_Ethics_Dataset/DSCI-511/data/policies/oecd_policies.csv
/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies.csv


In [31]:
file_path = "/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies.csv"


## Analysis of the OECD Data

- Dataset has 1884 rows and 52 columns

In [32]:
import pandas as pd
df = pd.read_csv(file_path, encoding="utf-8", quotechar='"', low_memory=False)

# Data
print(df.shape)
display(df.head(10))

(1884, 52)


Unnamed: 0,Policy initiative ID,Platform URL,English name,Original name(s),Acronym,Country,Start date,End date,Description,Theme area(s),...,Objective,Deployment year,Cancellation reason,Entities involvement,Allocated funding,Methodology in place to assess the risk and evaluate the impact of AI in public services,Measures taken to communicate the use of the AI system to citizens (transparency),Measures taken to enable citizens to understand and challenge the outcome of the AI system (explainability and accountability),"Audit, certification, monitoring, evaluation or regulation process",Entered into force on
0,2021/data/policyInitiatives/1335,https://oecd.ai/en/dashboards/policy-initiativ...,SPACERESOURCES.LU,,,Luxembourg,2016.0,,"Within the SpaceResources.lu initiative, the c...",National AI Policies,...,,,,,,,,,,
1,2021/data/policyInitiatives/1337,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL LUXEMBOURG,Digital Lëtzebuerg,,Luxembourg,2014.0,,Consolidating Luxembourgs position in the ICT ...,National AI Policies,...,,,,,,,,,,
2,2021/data/policyInitiatives/1337,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL LUXEMBOURG,Digital Lëtzebuerg,,Luxembourg,2014.0,,Consolidating Luxembourgs position in the ICT ...,National AI Policies,...,,,,,,,,,,
3,2021/data/policyInitiatives/1355,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL TECH FUND,,,Luxembourg,2016.0,,A seed fund was set up in 2016 jointly by the ...,National AI Policies,...,,,,,,,,,,
4,2021/data/policyInitiatives/13968,https://oecd.ai/en/dashboards/policy-initiativ...,GAMEINN,,,Poland,2016.0,,Funding opportunities for the producers of vid...,National AI Policies,...,,,,,,,,,,
5,2021/data/policyInitiatives/13969,https://oecd.ai/en/dashboards/policy-initiativ...,POLAND-TAIWAN SCIENTIFIC CO-OPERATION,Polsko-Tajwańska Współpraca Badawcza,,Poland,2012.0,,Poland-Taiwan scientific co-operation is based...,National AI Policies,...,,,,,,,,,,
6,2021/data/policyInitiatives/13969,https://oecd.ai/en/dashboards/policy-initiativ...,POLAND-TAIWAN SCIENTIFIC CO-OPERATION,Polsko-Tajwańska Współpraca Badawcza,,Poland,2012.0,,Poland-Taiwan scientific co-operation is based...,National AI Policies,...,,,,,,,,,,
7,2021/data/policyInitiatives/13969,https://oecd.ai/en/dashboards/policy-initiativ...,POLAND-TAIWAN SCIENTIFIC CO-OPERATION,Polsko-Tajwańska Współpraca Badawcza,,Poland,2012.0,,Poland-Taiwan scientific co-operation is based...,National AI Policies,...,,,,,,,,,,
8,2021/data/policyInitiatives/14162,https://oecd.ai/en/dashboards/policy-initiativ...,AI PROGRAMME,Tekoälyohjelma,,Finland,2017.0,2019.0,"The AI programme, which published its interim ...",National AI Policies,...,,,,,,,,,,
9,2021/data/policyInitiatives/14162,https://oecd.ai/en/dashboards/policy-initiativ...,AI PROGRAMME,Tekoälyohjelma,,Finland,2017.0,2019.0,"The AI programme, which published its interim ...",National AI Policies,...,,,,,,,,,,


In [33]:
# last 10 rows
display(df.tail(10))

Unnamed: 0,Policy initiative ID,Platform URL,English name,Original name(s),Acronym,Country,Start date,End date,Description,Theme area(s),...,Objective,Deployment year,Cancellation reason,Entities involvement,Allocated funding,Methodology in place to assess the risk and evaluate the impact of AI in public services,Measures taken to communicate the use of the AI system to citizens (transparency),Measures taken to enable citizens to understand and challenge the outcome of the AI system (explainability and accountability),"Audit, certification, monitoring, evaluation or regulation process",Entered into force on
1874,2021/data/policyInitiatives/4525,https://oecd.ai/en/dashboards/policy-initiativ...,RESEARCH PLATFORMS,,,Belgium,2010.0,,"The program Research platforms, a program prev...",National AI Policies,...,,,,,,,,,,
1875,2021/data/policyInitiatives/4525,https://oecd.ai/en/dashboards/policy-initiativ...,RESEARCH PLATFORMS,,,Belgium,2010.0,,"The program Research platforms, a program prev...",National AI Policies,...,,,,,,,,,,
1876,2021/data/policyInitiatives/4759,https://oecd.ai/en/dashboards/policy-initiativ...,HORIZON 2020,,H2020,European Union,2014.0,2020.0,The EU framework programme for Research and In...,National AI Policies,...,,,,,,,,,,
1877,2021/data/policyInitiatives/4759,https://oecd.ai/en/dashboards/policy-initiativ...,HORIZON 2020,,H2020,European Union,2014.0,2020.0,The EU framework programme for Research and In...,National AI Policies,...,,,,,,,,,,
1878,2021/data/policyInitiatives/4759,https://oecd.ai/en/dashboards/policy-initiativ...,HORIZON 2020,,H2020,European Union,2014.0,2020.0,The EU framework programme for Research and In...,National AI Policies,...,,,,,,,,,,
1879,2021/data/policyInitiatives/4759,https://oecd.ai/en/dashboards/policy-initiativ...,HORIZON 2020,,H2020,European Union,2014.0,2020.0,The EU framework programme for Research and In...,National AI Policies,...,,,,,,,,,,
1880,2021/data/policyInitiatives/5016,https://oecd.ai/en/dashboards/policy-initiativ...,INNOVATION FUND DENMARK,INNOVATIONSFONDEN,,Denmark,2014.0,,Innovation Fund Denmark is the main public fun...,National AI Policies,...,,,,,,,,,,
1881,2021/data/policyInitiatives/5133,https://oecd.ai/en/dashboards/policy-initiativ...,ATHENA MAGAZINE,,,Belgium,1984.0,,Publication of a free monthly magazine “Athena...,National AI Policies,...,,,,,,,,,,
1882,2021/data/policyInitiatives/5133,https://oecd.ai/en/dashboards/policy-initiativ...,ATHENA MAGAZINE,,,Belgium,1984.0,,Publication of a free monthly magazine “Athena...,National AI Policies,...,,,,,,,,,,
1883,2021/data/policyInitiatives/5295,https://oecd.ai/en/dashboards/policy-initiativ...,AI R&D FRAMEWORK AND ACTIVITIES OF THE ISRAELI...,מסגרת פעילות של רשות החדשנות בתחומי בינה מלאכותית,IIA_AI,Israel,2019.0,,The Planned AI R&D Framework & Activities in p...,National AI Policies,...,,,,,,,,,,


- Understanding the data columns (e.g., `Description`)

In this step, we examine a subset of the **Description** column from our dataset.

In [34]:
display(df['Description'][1:10])

Unnamed: 0,Description
1,Consolidating Luxembourgs position in the ICT ...
2,Consolidating Luxembourgs position in the ICT ...
3,A seed fund was set up in 2016 jointly by the ...
4,Funding opportunities for the producers of vid...
5,Poland-Taiwan scientific co-operation is based...
6,Poland-Taiwan scientific co-operation is based...
7,Poland-Taiwan scientific co-operation is based...
8,"The AI programme, which published its interim ..."
9,"The AI programme, which published its interim ..."


In [35]:
# for examining the long text in 'description' column
import pandas as pd
from IPython.display import display

with pd.option_context('display.max_colwidth', None):
    display(df['Description'].iloc[1:10])

Unnamed: 0,Description
1,Consolidating Luxembourgs position in the ICT fields in the longer term.
2,Consolidating Luxembourgs position in the ICT fields in the longer term.
3,A seed fund was set up in 2016 jointly by the Ministry of Economy and a group of private investors.
4,Funding opportunities for the producers of video games (the thematic scope of the programme comprises the application of Artificial Intelligence in video games).
5,"Poland-Taiwan scientific co-operation is based on the agreement between the National Centre for Research and Development and the Ministry of Science and Technology of the Republic of China (originally established as the National Science Council of Taiwan) and consists in joint funding of scientific and R&D projects and organisation of scientific seminars, among others."
6,"Poland-Taiwan scientific co-operation is based on the agreement between the National Centre for Research and Development and the Ministry of Science and Technology of the Republic of China (originally established as the National Science Council of Taiwan) and consists in joint funding of scientific and R&D projects and organisation of scientific seminars, among others."
7,"Poland-Taiwan scientific co-operation is based on the agreement between the National Centre for Research and Development and the Ministry of Science and Technology of the Republic of China (originally established as the National Science Council of Taiwan) and consists in joint funding of scientific and R&D projects and organisation of scientific seminars, among others."
8,"The AI programme, which published its interim report with eight proposals in October 2017 and final report in 2019, steered Finland into the track of becoming a leading country in the application of AI."
9,"The AI programme, which published its interim report with eight proposals in October 2017 and final report in 2019, steered Finland into the track of becoming a leading country in the application of AI."


### Checking Missing Values in the Dataset

Before performing any analysis, it is important to understand how much data is missing in each column.  
- This step calculates and summarizes the number and percentage of missing values across the dataset.

- This summary helps identify columns that may require imputation, removal, or further investigation.


In [36]:
nan_counts = df.isna().sum()
nan_percent = (nan_counts / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing Count': nan_counts,
    'Missing %': nan_percent.round(2)
}).sort_values(by='Missing %', ascending=False)

print(f"Total rows: {len(df)}\n")
missing_summary


Total rows: 1884



Unnamed: 0,Missing Count,Missing %
Measures taken to enable citizens to understand and challenge the outcome of the AI system (explainability and accountability),1880,99.79
Budget amount\n(in local currency),1879,99.73
Methodology in place to assess the risk and evaluate the impact of AI in public services,1879,99.73
Cancellation reason,1875,99.52
Consultation process end date,1872,99.36
"Audit, certification, monitoring, evaluation or regulation process",1872,99.36
Allocated funding,1871,99.31
Consultation process objective,1869,99.2
Other AI Policy Area(s),1866,99.04
Consultation process begin date,1866,99.04


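- The missing-data percentages show that the fields describing concrete safeguards and resourcing are almost never filled in: explainability and accountability measures (99.79% missing), budget amount (99.73%), risk and impact assessment methodology (99.73%), audit or certification processes (99.36%), and allocated funding (99.31%). Descriptive fields such as objectives, target groups, and responsible organisations are considerably more complete, with no more than 60% of values missing. This suggests that the OECD records document what a policy is and whom it targets far better than how it is funded, audited, or made accountable. A missing field does not prove that a policy lacks that element; some fields may apply only to certain initiative types.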

- AI policy initiatives with a `Start date` of 2024

In [37]:

df[df['Start date'] == 2024]

Unnamed: 0,Policy initiative ID,Platform URL,English name,Original name(s),Acronym,Country,Start date,End date,Description,Theme area(s),...,Objective,Deployment year,Cancellation reason,Entities involvement,Allocated funding,Methodology in place to assess the risk and evaluate the impact of AI in public services,Measures taken to communicate the use of the AI system to citizens (transparency),Measures taken to enable citizens to understand and challenge the outcome of the AI system (explainability and accountability),"Audit, certification, monitoring, evaluation or regulation process",Entered into force on
1818,2021/data/policyInitiatives/27588,https://oecd.ai/en/dashboards/policy-initiativ...,AI INNOVATION PACKAGE TO SUPPORT ATIFICIAL INT...,,,European Union,2024.0,,The Commission has launched a package of measu...,National AI Policies,...,,,,,,,,,,
1819,2021/data/policyInitiatives/27588,https://oecd.ai/en/dashboards/policy-initiativ...,AI INNOVATION PACKAGE TO SUPPORT ATIFICIAL INT...,,,European Union,2024.0,,The Commission has launched a package of measu...,National AI Policies,...,,,,,,,,,,
1820,2021/data/policyInitiatives/27588,https://oecd.ai/en/dashboards/policy-initiativ...,AI INNOVATION PACKAGE TO SUPPORT ATIFICIAL INT...,,,European Union,2024.0,,The Commission has launched a package of measu...,National AI Policies,...,,,,,,,,,,
1821,2021/data/policyInitiatives/27588,https://oecd.ai/en/dashboards/policy-initiativ...,AI INNOVATION PACKAGE TO SUPPORT ATIFICIAL INT...,,,European Union,2024.0,,The Commission has launched a package of measu...,National AI Policies,...,,,,,,,,,,
1822,2021/data/policyInitiatives/27589,https://oecd.ai/en/dashboards/policy-initiativ...,NATIONAL ARTIFICIAL INTELLIGENCE RESEARCH RESO...,,NAIRR,United States,2024.0,2026.0,"The NAIRR pilot brings together computational,...",National AI Policies,...,,,,,,,,,,
1853,2021/data/policyInitiatives/27618,https://oecd.ai/en/dashboards/policy-initiativ...,Open-source artificial intelligence algorithm ...,Servicio de algoritmos de inteligencia artific...,,Peru,2024.0,,Continuous service enabled for public institut...,National AI Policies,...,,,,,,,,,,
1854,2021/data/policyInitiatives/27619,https://oecd.ai/en/dashboards/policy-initiativ...,Technological sandbox on the use of exponentia...,Sandbox tecnológico sobre uso de tecnologías e...,,Peru,2024.0,,Use of technologies that allow massive process...,National AI Policies,...,,,,,,,,,,
1855,2021/data/policyInitiatives/27620,https://oecd.ai/en/dashboards/policy-initiativ...,Continuous Artificial Intelligence Program for...,Programa de Inteligencia Artificial de manera ...,,Peru,2024.0,,The program includes the diagnosis and design ...,National AI Policies,...,,,,,,,,,,
1856,2021/data/policyInitiatives/27621,https://oecd.ai/en/dashboards/policy-initiativ...,Public-private regulatory sandbox regarding th...,Sandbox regulatorio público-privado respecto a...,,Peru,2024.0,,The project covers: improving the integration ...,National AI Policies,...,,,,,,,,,,


- We drop columns with more than 60% missing values. This threshold is the criterion we used to decide which columns to keep for further analysis and modeling.

- Columns with too many missing values provide little useful information, so any column where more than 60% of entries are missing is removed.

In [38]:

# Drop columns having more than 60% missing data
missing_threshold = 60  # percent; adjust according to the use case
cols_to_drop = missing_summary[missing_summary["Missing %"] > missing_threshold].index.tolist()

df_clean = df.drop(columns=cols_to_drop)

print("Dropped Columns:", len(cols_to_drop))
cols_to_drop


Dropped Columns: 26


['Measures taken to enable citizens to understand and challenge the outcome of the AI system (explainability and accountability)',
 'Budget amount\n(in local currency)',
 'Methodology in place to assess the risk and evaluate the impact of AI in public services',
 'Cancellation reason',
 'Consultation process end date',
 'Audit, certification, monitoring, evaluation or regulation process',
 'Allocated funding',
 'Consultation process objective',
 'Other AI Policy Area(s)',
 'Consultation process begin date',
 'Coordinating institution name',
 'Entities involvement',
 'Measures taken to communicate the use of the AI system to citizens (transparency)',
 'Shift(s) related to Covid',
 'Entered into force on',
 'Strategy priority targets and deadlines',
 'Deployment year',
 'Evaluation provides input to',
 'Evaluation type',
 'Objective',
 'Evaluation performed by',
 'Evaluation URL',
 'Link',
 'End date',
 'Policy instrument description(s)',
 'Acronym']
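
The same filter can be written more compactly by computing the missing share per column directly, without building the summary table first. This is a sketch on toy data (the values are illustrative, not taken from the full file); it applies the cutoff to unrounded percentages, so a column sitting exactly on a rounding boundary could be treated differently from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: "Acronym" is 80% missing, "Start date" is 20% missing
toy = pd.DataFrame({
    "Country": ["Luxembourg", "Poland", "Finland", "European Union", "Peru"],
    "Acronym": [np.nan, np.nan, np.nan, "H2020", np.nan],
    "Start date": [2016.0, 2012.0, np.nan, 2014.0, 2024.0],
})

missing_threshold = 60  # percent, same cutoff as above

# isna().mean() gives the fraction of missing values in each column
missing_pct = toy.isna().mean() * 100

# Keep columns at or below the threshold; list the ones that exceed it
toy_clean = toy.loc[:, missing_pct <= missing_threshold]
dropped = missing_pct[missing_pct > missing_threshold].index.tolist()

print("Dropped:", dropped)
print("Kept:", toy_clean.columns.tolist())
```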

### Handling missing values in text columns

- For text/object columns, missing entries are replaced with "Unknown" so that the dataset remains usable without introducing errors.

In [39]:
text_cols = df_clean.select_dtypes(include=["object"]).columns
df_clean[text_cols] = df_clean[text_cols].fillna("Unknown")


### Listing the text columns that were filled

- This cell repeats the text-column fill from the previous step and prints the affected column names for reference. Numeric columns are handled in the next section.

In [40]:
# Select all object/text columns that actually exist
text_cols = df_clean.select_dtypes(include=["object"]).columns.tolist()

print("Text columns:", text_cols)

# Fill NaN only in these columns
df_clean[text_cols] = df_clean[text_cols].fillna("Unknown")


Text columns: ['Policy initiative ID', 'Platform URL', 'English name', 'Original name(s)', 'Country', 'Description', 'Theme area(s)', 'Theme(s)', 'Background', 'Objective(s)', 'Target group type(s)', 'Target group(s)', 'Responsible organisation(s)', 'Yearly budget range', 'Public access URL', 'AI Principle(s)', 'AI Policy Area(s)', 'Policy instrument ID', 'Policy instrument type category', 'Policy instrument type', 'Policy instrument name', 'Policy instrument mini-field(s)']


### Fill missing numeric values using column means

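- Many analysis and modeling routines do not accept NaN, so we replace missing numeric values with the mean of each column, which keeps every row. For a year-valued column such as `Start date`, the mean can be a fractional year, so the filled values should be read as placeholders rather than real dates.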

In [41]:
num_cols = df_clean.select_dtypes(include=["number"]).columns.tolist()

df_clean[num_cols] = df_clean[num_cols].fillna(df_clean[num_cols].mean())
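
Because `Start date` holds years, one hedged alternative to the mean is to fill with a rounded median and keep a flag that marks which rows were imputed, so later analysis can exclude them. The column name `Start date missing` is hypothetical and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Start date": [2016.0, 2014.0, np.nan, 2019.0, 2024.0, 2012.0]})

# Record which rows were missing before any value is filled in
toy["Start date missing"] = toy["Start date"].isna()

# The median is robust to outlying years; rounding keeps it a whole year
# even when the number of observed values is even
median_year = int(round(float(toy["Start date"].median())))

toy["Start date"] = toy["Start date"].fillna(median_year)

print(median_year)
print(toy)
```

Keeping the flag means the imputation stays reversible: filtering on `~toy["Start date missing"]` recovers only the observed years.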


### Save the cleaned dataset

- We export the fully cleaned DataFrame as `oecd_policies_clean.csv` so it can be reused for future analysis and modeling without repeating the cleaning steps.
- After dropping 26 sparse columns, 26 of the original 52 columns remain.

In [42]:
df_clean.to_csv("/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies_clean.csv",
                index=False)


In [43]:
df_clean = pd.read_csv("/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies_clean.csv")
df_clean.head()


Unnamed: 0,Policy initiative ID,Platform URL,English name,Original name(s),Country,Start date,Description,Theme area(s),Theme(s),Background,...,Public access URL,Is a structural reform ?,Is evaluated ?,AI Principle(s),AI Policy Area(s),Policy instrument ID,Policy instrument type category,Policy instrument type,Policy instrument name,Policy instrument mini-field(s)
0,2021/data/policyInitiatives/1335,https://oecd.ai/en/dashboards/policy-initiativ...,SPACERESOURCES.LU,Unknown,Luxembourg,2016.0,"Within the SpaceResources.lu initiative, the c...",National AI Policies,National AI policies,"Luxembourg provides a unique legal, regulatory...",...,http://www.spaceresources.public.lu/en.html,False,False,Fostering a digital ecosystem for AI|Investing...,Unknown,http://aipo.oecd.org/2021/data/policyInitiativ...,Governance,"National strategies, agendas and plans",Unknown,Societal challenge(s) emphasised: None specifi...
1,2021/data/policyInitiatives/1337,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL LUXEMBOURG,Digital Lëtzebuerg,Luxembourg,2014.0,Consolidating Luxembourgs position in the ICT ...,National AI Policies,National AI policies,"Over the past ten years, Luxembourg’s digital ...",...,https://gouvernement.lu/en/dossiers/2014/digit...,False,True,"Inclusive growth, sustainable development and ...",Unknown,http://aipo.oecd.org/2021/data/policyInitiativ...,AI enablers and other incentives,Networking and collaborative platforms,Unknown,Unknown
2,2021/data/policyInitiatives/1337,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL LUXEMBOURG,Digital Lëtzebuerg,Luxembourg,2014.0,Consolidating Luxembourgs position in the ICT ...,National AI Policies,National AI policies,"Over the past ten years, Luxembourg’s digital ...",...,https://gouvernement.lu/en/dossiers/2014/digit...,False,True,"Inclusive growth, sustainable development and ...",Unknown,http://aipo.oecd.org/2021/data/policyInitiativ...,Governance,"National strategies, agendas and plans",Unknown,Implementation mechanism: Periodic monitoring ...
3,2021/data/policyInitiatives/1355,https://oecd.ai/en/dashboards/policy-initiativ...,DIGITAL TECH FUND,Unknown,Luxembourg,2016.0,A seed fund was set up in 2016 jointly by the ...,National AI Policies,National AI policies,This fund is part of the strategy Digital Lëtz...,...,https://digital-luxembourg.public.lu/initiativ...,False,False,Fostering a digital ecosystem for AI,Unknown,http://aipo.oecd.org/2021/data/policyInitiativ...,Financial support,Equity financing,Unknown,Focus: Other\nMechanism(s): Fund\nType of fina...
4,2021/data/policyInitiatives/13968,https://oecd.ai/en/dashboards/policy-initiativ...,GAMEINN,Unknown,Poland,2016.0,Funding opportunities for the producers of vid...,National AI Policies,National AI policies,Unknown,...,"http://www.ncbr.gov.pl/en/news/art,4200,playin...",False,False,Unknown,Unknown,http://aipo.oecd.org/2021/data/policyInitiativ...,Financial support,Grants for business R&D and innovation,Unknown,Contribution (e.g. matching funds) required fr...


In [49]:
df_clean.dtypes

Unnamed: 0,0
Policy initiative ID,object
Platform URL,object
English name,object
Original name(s),object
Country,object
Start date,float64
Description,object
Theme area(s),object
Theme(s),object
Background,object


In [51]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1884 entries, 0 to 1883
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Policy initiative ID               1884 non-null   object 
 1   Platform URL                       1884 non-null   object 
 2   English name                       1884 non-null   object 
 3   Original name(s)                   1884 non-null   object 
 4   Country                            1884 non-null   object 
 5   Start date                         1884 non-null   float64
 6   Description                        1884 non-null   object 
 7   Theme area(s)                      1884 non-null   object 
 8   Theme(s)                           1884 non-null   object 
 9   Background                         1884 non-null   object 
 10  Objective(s)                       1884 non-null   object 
 11  Target group type(s)               1884 non-null   objec

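For further analysis, some columns should be converted to more suitable dtypes:

- `Start date` holds years as `float64`. For time-based analysis, convert it to an integer year or to a datetime with an explicit year format (e.g., `pd.to_datetime(df_clean["Start date"].round().astype("Int64").astype(str), format="%Y", errors="coerce")`). Passing the raw floats to `pd.to_datetime` would interpret them as nanoseconds since 1970 rather than as years.

- Keep the boolean columns (`Is a structural reform ?`, `Is evaluated ?`) as Boolean for logical filtering and grouping.

- Keep the ID fields (`Policy initiative ID`, `Policy instrument ID`) as strings or categoricals; they are path-like identifiers, not numeric measures.

- Convert repetitive, low-cardinality text descriptors (e.g., `Country`, `AI Policy Area(s)`, `Theme area(s)`, `Policy instrument type`) to the `category` dtype to save memory and speed up groupby operations.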
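
The conversions above can be sketched on a toy frame. The column names and IDs come from the OECD file, but the missing `Start date` is an illustrative assumption and the exact set of columns to convert is a judgment call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Policy initiative ID": [
        "2021/data/policyInitiatives/1335",
        "2021/data/policyInitiatives/1337",
        "2021/data/policyInitiatives/13968",
        "2021/data/policyInitiatives/14162",
    ],
    "Country": ["Luxembourg", "Luxembourg", "Poland", "Finland"],
    "Start date": [2016.0, 2014.0, np.nan, 2017.0],
    "Is evaluated ?": [False, True, False, True],
})

# Years: float -> nullable integer -> text -> datetime parsed with an explicit
# year format. Raw floats passed to pd.to_datetime would be read as nanoseconds.
years = toy["Start date"].round().astype("Int64")
toy["Start date"] = pd.to_datetime(years.astype(str), format="%Y", errors="coerce")

# Repetitive labels: category dtype stores each distinct value once
toy["Country"] = toy["Country"].astype("category")

# Identifiers: keep as text, never as numbers
toy["Policy initiative ID"] = toy["Policy initiative ID"].astype("string")

parsed_years = toy["Start date"].dt.year.dropna().astype(int).tolist()
print(toy.dtypes)
print(parsed_years)
```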

### Locate the cleaned dataset in Google Drive

- We search through the Drive directory to confirm where the cleaned oecd_policies_clean.csv file is saved and verify that it was written correctly.

In [44]:
import os

for root, dirs, files in os.walk("/content/drive/MyDrive", topdown=True):
    for f in files:
        if "oecd_policies_clean" in f.lower():
            print(os.path.join(root, f))

/content/drive/MyDrive/DSCI-511-Project_AI_Ethics_Dataset/DSCI-511/data/policies/oecd_policies_clean.csv
/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies_clean.csv


In [45]:
print("Files in /content/drive/MyDrive/DSCI-511/data/policies:")
import os
print(os.listdir("/content/drive/MyDrive/DSCI-511/data/policies"))

Files in /content/drive/MyDrive/DSCI-511/data/policies:
['docs', 'oecd_policies.csv', 'ai_ethics_policies_sample.csv', 'oecd_policies_clean.csv']


### Reload the cleaned dataset and verify missing values
- We load the cleaned CSV file and generate a new missing-value summary to confirm that all missing data was handled by the cleaning process.

In [48]:
df_clean = pd.read_csv(
    "/content/drive/MyDrive/DSCI-511/data/policies/oecd_policies_clean.csv"
)

nan_counts_clean = df_clean.isna().sum()
nan_percent_clean = (nan_counts_clean / len(df_clean)) * 100

missing_summary_clean = pd.DataFrame({
    "Missing Count": nan_counts_clean,
    "Missing %": nan_percent_clean
}).sort_values(by="Missing %", ascending=False)

display(missing_summary_clean)


Unnamed: 0,Missing Count,Missing %
Policy initiative ID,0,0.0
Platform URL,0,0.0
English name,0,0.0
Original name(s),0,0.0
Country,0,0.0
Start date,0,0.0
Description,0,0.0
Theme area(s),0,0.0
Theme(s),0,0.0
Background,0,0.0


### Conclusion:
This is the first part of our project. Next, we need to scrape two more websites to enrich this dataset.
