## **Scope Definition for Knowledge Graph**

The Knowledge Graph should be able to answer the following questions.

### **1. Provenance & Temporal Query**

> **Question:**
> “Which videos published between 2019 and 2020 introduce the concept of *programming* or *Python basics*, and from which dataset batch do they originate?”

**Why it matters:**
Tests your ability to:

* Filter by **semantic keywords** (`programming`, `Python`)
* Handle **temporal reasoning** (date range)
* Trace **graph provenance** via named graph URIs
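
As a rough sketch of how this question could be checked against the preprocessed table before any RDF is built, the filter below combines an inclusive year range with a keyword match on the title and returns the batch tag. The sample rows are copied from the batch previews in this notebook; the helper name `find_intro_videos` and its keyword list are assumptions made for illustration.

```python
import re

import pandas as pd

# Sample rows copied from the batch previews in this notebook.
videos = pd.DataFrame(
    {
        "video_id": [
            "first_batch_vid_0002",
            "first_batch_vid_0005",
            "second_batch_vid_0002",
            "fourth_batch_vid_0002",
        ],
        "video_title_(original)": [
            "02 - What is programming?",
            "05 - What is Python?",
            "113 - Histogram equalization and CLAHE",
            "277 - 3D object segmentation in python",
        ],
        "publish_year": [2019, 2019, 2020, 2022],
        "batch_source": ["first_batch", "first_batch", "second_batch", "fourth_batch"],
    }
)


def find_intro_videos(df, keywords, start_year, end_year):
    """Hypothetical helper: case-insensitive keyword match within a year range."""
    pattern = "|".join(re.escape(k) for k in keywords)
    in_range = df["publish_year"].between(start_year, end_year)  # inclusive on both ends
    on_topic = df["video_title_(original)"].str.contains(pattern, case=False, regex=True)
    return df.loc[in_range & on_topic, ["video_id", "video_title_(original)", "batch_source"]]


result = find_intro_videos(videos, ["programming", "python"], 2019, 2020)
print(result)
```

The 2022 row mentions Python but falls outside the date range, so only the two 2019 videos are returned.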


### **2. Contextual Semantic Retrieval**

> **Question:**
> “Find all videos that explain *image segmentation* methods, including U-Net, LinkNet, watershed, and StarDist, even if the exact term ‘segmentation’ isn’t used in the title.”

**Why it matters:**

* Tests **semantic equivalence** (e.g., “object segmentation” vs “U-Net”)
* Checks **cross-batch linking** (topics appear in multiple years)
* Evaluates **description-text understanding** beyond keyword match
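
One simple way to approximate this behaviour is term expansion: the concept is mapped to the method names listed in the question, and both title and description are searched. This is only a lexical stand-in for real semantic matching. The `SEGMENTATION_TERMS` list and the `matches_concept` helper are assumptions; the sample rows are copied from the batch previews in this notebook, truncation included.

```python
import re

import pandas as pd

# Assumed expansion list: the concept plus the methods named in the question.
SEGMENTATION_TERMS = ["segmentation", "u-net", "linknet", "watershed", "stardist"]

# Rows as displayed in the batch previews (the "..." truncation is kept as shown).
videos = pd.DataFrame(
    {
        "video_id": ["third_batch_vid_0001", "second_batch_vid_0002", "fourth_batch_vid_0007"],
        "video_title_(original)": [
            "210 - Multiclass U-Net using VGG, ResNet, and ...",
            "113 - Histogram equalization and CLAHE",
            "279 - An introduction to object segmentation u...",
        ],
        "video_description_(original)": [
            "Multiclass semantic segmentation using U-Net w...",
            "If the image histogram is confined only to a s...",
            "For details on installing StarDist library and...",
        ],
    }
)


def matches_concept(df, terms):
    """Hypothetical helper: keep rows whose title or description mentions any term."""
    pattern = "|".join(re.escape(t) for t in terms)
    text = (
        df["video_title_(original)"].fillna("")
        + " "
        + df["video_description_(original)"].fillna("")
    )
    return df[text.str.contains(pattern, case=False, regex=True)]


hits = matches_concept(videos, SEGMENTATION_TERMS)
print(hits["video_id"].tolist())
```

The first row is found through "U-Net" even though its displayed title does not contain the word "segmentation".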


### **3. Provenance Chain Trace**

> **Question:**
> “Show the provenance trail for the video *‘3D U-Net for semantic segmentation’* — which dataset it was derived from, when it was added, and what transformations (cleaning, enrichment) were applied before indexing.”

**Why it matters:**

* Tests **Named Graph identity and lineage tracking** (`prov:wasDerivedFrom`, `prov:wasGeneratedBy`)
* Validates correct use of **PROV-O ontology**
* Ensures **auditability** — a core provenance requirement
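
To make the lineage statements concrete, the sketch below writes two PROV-O statements for one video as N-Quads lines, where the fourth term is the named graph. It uses plain string formatting and no RDF library. The `http://flow.ai/` namespace appears in the validation report later in this notebook, but the `dataset/` and `graph/` IRI patterns, the function name, and the `added_at` value are assumptions for illustration only.

```python
PROV = "http://www.w3.org/ns/prov#"
XSD_DATETIME = "<http://www.w3.org/2001/XMLSchema#dateTime>"
BASE = "http://flow.ai/"  # namespace seen in the validation report; sub-paths below are assumed


def provenance_quads(video_id, batch_name, added_at):
    """Return N-Quads lines that tie a video to its source batch inside that batch's named graph."""
    video = f"<{BASE}video/{video_id}>"
    dataset = f"<{BASE}dataset/{batch_name}>"  # hypothetical dataset IRI
    graph = f"<{BASE}graph/{batch_name}>"      # hypothetical named-graph IRI
    return [
        f"{video} <{PROV}wasDerivedFrom> {dataset} {graph} .",
        f'{video} <{PROV}generatedAtTime> "{added_at}"^^{XSD_DATETIME} {graph} .',
    ]


# The added_at value is a placeholder, not a real ingestion time.
quads = provenance_quads("third_batch_vid_0008", "third_batch", "2025-01-01T00:00:00Z")
for line in quads:
    print(line)
```

Transformation steps such as cleaning or enrichment would need their own `prov:Activity` resources linked through `prov:wasGeneratedBy`; they are left out here to keep the sketch short.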



### **4. Thematic Evolution Query**

> **Question:**
> “How has the topic of *image analysis* evolved across datasets — list videos per year that mention histogram, segmentation, or DCT, ordered chronologically.”

**Why it matters:**

* Tests **temporal aggregation and trend extraction**
* Requires joining data across all four named graphs
* Validates **semantic normalization** (different ways of describing similar topics)
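
On the tabular side, the aggregation behind this question might look like the sketch below, which groups matching titles by publish year in chronological order. The sample titles come from the batch previews in this notebook; the `TOPIC_PATTERN` name and the choice to match on titles only are assumptions.

```python
import pandas as pd

# Titles and years copied from the batch previews in this notebook.
videos = pd.DataFrame(
    {
        "video_title_(original)": [
            "04 - What is a digital image?",
            "113 - Histogram equalization and CLAHE",
            "115 - Auto segmentation using multi-otsu",
            "215 - 3D U-Net for semantic segmentation",
            "277 - 3D object segmentation in python",
        ],
        "publish_year": [2019, 2020, 2020, 2021, 2022],
    }
)

TOPIC_PATTERN = r"histogram|segmentation|dct"  # assumed keyword set from the question

# Keep only titles that mention one of the topics.
mentions = videos[
    videos["video_title_(original)"].str.contains(TOPIC_PATTERN, case=False, regex=True)
]

# One list of titles per year, ordered by year.
per_year = (
    mentions.sort_values("publish_year")
    .groupby("publish_year")["video_title_(original)"]
    .apply(list)
)
print(per_year)
```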


### **5. Quality & Duration Correlation**

> **Question:**
> “Which videos longer than 15 minutes discuss *COVID-19 data analysis* and what are their publish timestamps and batch provenance?”

**Why it matters:**

* Tests numeric comparison on `Approx Duration (ms)` (15 minutes = 900,000 ms)
* Combines **topic reasoning** with **duration filtering**
* Validates consistent data typing and temporal metadata
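
Because durations are stored in milliseconds, the threshold has to be converted before comparing. The sketch below shows one way to combine the duration filter with a topic match, using rows copied from the second-batch preview in this notebook (truncated titles are kept as displayed). The constant name `MIN_DURATION_MS` is an assumption.

```python
import pandas as pd

MIN_DURATION_MS = 15 * 60 * 1000  # 15 minutes expressed in milliseconds

# Rows copied from the second-batch preview in this notebook.
videos = pd.DataFrame(
    {
        "video_id": [
            "second_batch_vid_0006",
            "second_batch_vid_0008",
            "second_batch_vid_0002",
            "second_batch_vid_0004",
        ],
        "approx_duration_(ms)": [1147000, 914000, 1023000, 462000],
        "video_title_(original)": [
            "107 - Analysis of COVID-19 data using Python -...",
            "109 - Predicting COVID-19 cases using Python",
            "113 - Histogram equalization and CLAHE",
            "115 - Auto segmentation using multi-otsu",
        ],
        "video_publish_timestamp": [
            "2020-03-23T07:00:01Z",
            "2020-03-26T05:00:01Z",
            "2020-04-01T07:00:27Z",
            "2020-04-06T07:00:16Z",
        ],
        "batch_source": ["second_batch"] * 4,
    }
)

long_enough = videos["approx_duration_(ms)"] > MIN_DURATION_MS
on_topic = videos["video_title_(original)"].str.contains("covid-19", case=False, regex=False)

long_covid_videos = videos.loc[
    long_enough & on_topic, ["video_id", "video_publish_timestamp", "batch_source"]
]
print(long_covid_videos)
```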

### **Preprocessing steps:**
**Before the data can answer the questions above, it must pass through the following preprocessing steps:**
1. **Normalize column names** – make them lowercase, underscore-separated, and consistent across batches.
2. **Clean text fields** – remove newlines, escape characters, and trailing spaces in titles and descriptions.
3. **Convert timestamps** – ensure `Video Publish Timestamp` is in valid ISO 8601 (`xsd:dateTime`) format.
4. **Assign unique IDs** – generate stable identifiers for each video (e.g., `first_batch_vid_0001`, `first_batch_vid_0002`).
5. **Extract publish year** – derive a `publish_year` column from the timestamp for temporal filtering.
6. **Add batch provenance tag** – include a `batch_source` column noting which dataset the row came from.


In [1]:
import pandas as pd
import os

In [None]:
# Load the dataset
file_path = "../Prove_Data/Provenance_Datasets/first_batch_metadata.csv"
first_batch = pd.read_csv(file_path)

# Display first few rows
first_batch.head(10)

Unnamed: 0,Approx Duration (ms),Video Description (Original),Video Title (Original),Video Publish Timestamp
0,281000,If you are a student or researcher in any fiel...,01 - Why do you need to learn programming?,2019-05-05T22:09:40+00:00
1,394000,What is programming?\nA few programming terms....,02 - What is programming?,2019-05-05T22:23:57+00:00
2,579000,As a coder you need to understand the basics o...,03 - What is command prompt?,2019-05-06T23:02:25+00:00
3,644000,It is very important to understand what a digi...,04 - What is a digital image?,2019-05-06T23:07:35+00:00
4,519000,If you are an absolute beginner programmer the...,05 - What is Python?,2019-05-06T23:14:26+00:00
5,1145000,This video provides an overview of various dev...,06 - Python basics - IDE & operators,2019-05-06T23:28:29+00:00
6,1214000,This is a continuation of previous video and i...,07 - Python basics - logical operators and bas...,2019-05-06T23:37:05+00:00
7,148000,This video warns you about potential round-off...,08 - A warning about round off errors in Python,2019-05-06T23:37:41+00:00
8,716000,In this video you'll learn about using if and ...,09 - if else elif statements in Python,2019-05-07T00:01:55+00:00
9,1284000,Images are nothing but a list of numbers. Ther...,10 - lists tuples and dictionaries,2019-05-07T00:24:18+00:00


#### **1. Normalize Column Names**

In [3]:
# Normalize column names: lowercase + underscores
first_batch.columns = [c.strip().lower().replace(" ", "_") for c in first_batch.columns]

print("Normalized Column Names:")
print(first_batch.columns.tolist())

Normalized Column Names:
['approx_duration_(ms)', 'video_description_(original)', 'video_title_(original)', 'video_publish_timestamp']
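
The normalized names above still contain parentheses, which can be awkward when the names are later reused as RDF property names or in `df.query` expressions. A stricter variant, sketched below, collapses every run of non-alphanumeric characters into one underscore. The helper name `normalize_column` is an assumption, and the rest of this notebook keeps the parenthesised names, so adopting it would mean updating the later column references as well.

```python
import re


def normalize_column(name):
    """Lowercase, turn each run of non-alphanumeric characters into '_', trim edge underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


raw_columns = [
    "Approx Duration (ms)",
    "Video Description (Original)",
    "Video Title (Original)",
    "Video Publish Timestamp",
]
normalized = [normalize_column(c) for c in raw_columns]
print(normalized)
```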


#### **2. Clean Text Fields:**

Here we collapse newlines and repeated whitespace into single spaces and strip leading and trailing spaces, which keeps the RDF literals clean.

In [8]:
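# Text columns to check (only those present in this batch)
text_cols = ['video_description_(original)', 'video_title_(original)']
existing_cols = [c for c in text_cols if c in first_batch.columns]

# Function to check whether cleaning would alter the text
def check_text_changes(df, col):
    cleaned = (
        df[col].astype(str).str.replace(r'\s+', ' ', regex=True).str.strip()
    )
    # Rows whose current value differs from the cleaned value
    changes = df.loc[df[col] != cleaned, [col]].copy()
    return len(changes), changes.head(5)

for col in existing_cols:
    count, sample = check_text_changes(first_batch, col)
    if count > 0:
        print(f"{count} rows in '{col}' would be affected.")
    else:
        print(f"No changes needed for '{col}'. It’s already clean.")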


91 rows in 'video_description_(original)' would be affected.
4 rows in 'video_title_(original)' would be affected.


In [9]:
# Clean newline, extra spaces, and escape characters in text columns
text_cols = ['video_description_(original)', 'video_title_(original)']

for col in text_cols:
    if col in first_batch.columns:
        first_batch[col] = (
            first_batch[col]
            .astype(str)
            .str.replace(r'\s+', ' ', regex=True)
            .str.strip()
        )

# Check cleaned results
first_batch[text_cols].head(5)

Unnamed: 0,video_description_(original),video_title_(original)
0,If you are a student or researcher in any fiel...,01 - Why do you need to learn programming?
1,What is programming? A few programming terms. ...,02 - What is programming?
2,As a coder you need to understand the basics o...,03 - What is command prompt?
3,It is very important to understand what a digi...,04 - What is a digital image?
4,If you are an absolute beginner programmer the...,05 - What is Python?


#### **3. Convert Timestamps to ISO 8601 Format**
This enforces an RDF-compatible date-time format (`xsd:dateTime`).

In [10]:
# Convert timestamps to ISO 8601 in UTC (the literal 'Z' suffix is only correct after UTC conversion)
if 'video_publish_timestamp' in first_batch.columns:
    first_batch['video_publish_timestamp'] = pd.to_datetime(
        first_batch['video_publish_timestamp'], errors='coerce', utc=True
    ).dt.strftime('%Y-%m-%dT%H:%M:%SZ')

first_batch[['video_publish_timestamp']].head(5)

Unnamed: 0,video_publish_timestamp
0,2019-05-05T22:09:40Z
1,2019-05-05T22:23:57Z
2,2019-05-06T23:02:25Z
3,2019-05-06T23:07:35Z
4,2019-05-06T23:14:26Z
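
The small example below illustrates two details of this step: a timestamp with a non-UTC offset is shifted to UTC before the `Z` suffix is attached, and unparseable values become missing rather than raising an error. The first value is taken from this dataset; the `+02:00` row and the invalid rows are made up for illustration.

```python
import pandas as pd

raw = pd.Series(
    [
        "2019-05-05T22:09:40+00:00",  # value from the first batch
        "2019-05-06T01:02:25+02:00",  # made-up row with a non-UTC offset
        "not a date",                 # made-up invalid value
        None,                         # made-up missing value
    ]
)

# utc=True converts every parsed value to UTC; errors="coerce" turns failures into NaT.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)
iso = parsed.dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(iso)
print("Unparseable or missing:", int(parsed.isna().sum()))
```

Counting the missing values at this point gives an early signal of rows that would later fail a `sh:minCount 1` constraint on the publish timestamp.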


#### **4. Assign Unique IDs**
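These IDs later become the local part of each video's URI (e.g., `http://flow.ai/video/first_batch_vid_0001`). They stay stable only as long as the row order of the source CSV does not change.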

In [11]:
# Assign stable unique IDs
batch_name = "first_batch"
first_batch.insert(0, 'video_id', [f"{batch_name}_vid_{i+1:04d}" for i in range(len(first_batch))])

# Preview
first_batch[['video_id', 'video_title_(original)']].head(5)

Unnamed: 0,video_id,video_title_(original)
0,first_batch_vid_0001,01 - Why do you need to learn programming?
1,first_batch_vid_0002,02 - What is programming?
2,first_batch_vid_0003,03 - What is command prompt?
3,first_batch_vid_0004,04 - What is a digital image?
4,first_batch_vid_0005,05 - What is Python?
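
Positional IDs change if rows are reordered or removed in the source file. If that is a concern, one alternative is to derive the ID from the record's content. The sketch below hashes the title and publish timestamp; the function name `content_based_id`, the choice of fields, and the 10-character digest length are all assumptions, not what this notebook uses.

```python
import hashlib


def content_based_id(batch_name, title, publish_timestamp):
    """Hypothetical alternative ID: batch prefix plus a short hash of title and timestamp."""
    key = f"{title}|{publish_timestamp}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()[:10]  # shortened for readability
    return f"{batch_name}_vid_{digest}"


id_a = content_based_id("first_batch", "05 - What is Python?", "2019-05-06T23:14:26Z")
id_b = content_based_id("first_batch", "05 - What is Python?", "2019-05-06T23:14:26Z")
id_c = content_based_id("first_batch", "02 - What is programming?", "2019-05-05T22:23:57Z")

print(id_a)
print(id_a == id_b)  # same content, same ID
print(id_a == id_c)  # different content, different ID
```

The trade-off is that such IDs change whenever a title is edited, so the positional scheme may still be preferable when the source files are frozen.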


#### **5. Extract Publish Year**
This supports filtering and grouping videos by publication year.

In [12]:
# Extract year from timestamp
first_batch['publish_year'] = pd.to_datetime(
    first_batch['video_publish_timestamp'], errors='coerce'
).dt.year

first_batch[['video_publish_timestamp', 'publish_year']].head(5)

Unnamed: 0,video_publish_timestamp,publish_year
0,2019-05-05T22:09:40Z,2019.0
1,2019-05-05T22:23:57Z,2019.0
2,2019-05-06T23:02:25Z,2019.0
3,2019-05-06T23:07:35Z,2019.0
4,2019-05-06T23:14:26Z,2019.0
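
The years above are displayed as floats (`2019.0`), which is what pandas does when at least one timestamp in the column is missing. If integer years are wanted without filling the gaps first, the nullable `Int64` dtype is one option, sketched here on a two-row example in which the missing value is made up.

```python
import pandas as pd

timestamps = pd.Series(["2019-05-05T22:09:40Z", None])  # second value is a made-up gap

years = pd.to_datetime(timestamps, errors="coerce", utc=True).dt.year

# Nullable integer dtype: whole-number years, with <NA> where the timestamp is missing.
years_int = years.astype("Int64")
print(years_int)
```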


#### **6. Add Batch Provenance Tag**
This records which original CSV each row came from, which the provenance graphs rely on.

In [13]:
# Add batch provenance tag
first_batch['batch_source'] = batch_name

first_batch[['video_id', 'batch_source']].head(5)

Unnamed: 0,video_id,batch_source
0,first_batch_vid_0001,first_batch
1,first_batch_vid_0002,first_batch
2,first_batch_vid_0003,first_batch
3,first_batch_vid_0004,first_batch
4,first_batch_vid_0005,first_batch


#### **Wrapping the Whole Process in a Function**

In [24]:
import pandas as pd
import os
from datetime import datetime

In [None]:
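# Earlier version of the function, kept for reference; the active definition further below also fills missing timestamps.
# def preprocess_provenance_csv(file_path: str, batch_name: str, output_dir: str = "../Data/Preprocessed_Provenance_Datasets"):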
#     """Preprocess a provenance metadata CSV file and save a cleaned version."""

#     # Ensure output directory exists
#     os.makedirs(output_dir, exist_ok=True)

#     # Load CSV
#     df = pd.read_csv(file_path)

#     # 1. Normalize column names
#     df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

#     # 2. Clean text fields
#     text_cols = ['video_description_(original)', 'video_title_(original)']
#     for col in text_cols:
#         if col in df.columns:
#             df[col] = df[col].astype(str).str.replace(r'\s+', ' ', regex=True).str.strip()

#     # 3. Convert timestamps to ISO 8601
#     if 'video_publish_timestamp' in df.columns:
#         df['video_publish_timestamp'] = pd.to_datetime(df['video_publish_timestamp'], errors='coerce').dt.strftime('%Y-%m-%dT%H:%M:%SZ')

#     # 4. Assign unique IDs
#     df.insert(0, 'video_id', [f"{batch_name}_vid_{i+1:04d}" for i in range(len(df))])

#     # 5. Extract publish year
#     if 'video_publish_timestamp' in df.columns:
#         df['publish_year'] = pd.to_datetime(df['video_publish_timestamp'], errors='coerce').dt.year

#     # 6. Add batch provenance tag
#     df['batch_source'] = batch_name

#     # Save preprocessed file
#     output_path = os.path.join(output_dir, f"{batch_name}_metadata_preprocessed.csv")
#     df.to_csv(output_path, index=False)

#     print(f"Preprocessed and saved: {output_path}")
#     return df

In [None]:
# SHACL validation report (Turtle) listing videos that lack a publish timestamp or publish year:
# /home/chukwuemeka-james/Desktop/Data-Collection-and-Preparation/Advanced_Implementation/validation/validation_report.ttl
# @prefix flow: <http://flow.ai/schema/> .
# @prefix sh: <http://www.w3.org/ns/shacl#> .
# @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# [] a sh:ValidationReport ;
#     sh:conforms false ;
#     sh:result [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0107> ;
#             sh:resultMessage "Each Video must have a valid publish timestamp." ;
#             sh:resultPath flow:publishTimestamp ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b5 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/first_batch_vid_0105> ;
#             sh:resultMessage "Each Video must have a valid publish timestamp." ;
#             sh:resultPath flow:publishTimestamp ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b5 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0106> ;
#             sh:resultMessage "Each Video must have a valid publish timestamp." ;
#             sh:resultPath flow:publishTimestamp ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b5 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/first_batch_vid_0105> ;
#             sh:resultMessage "Each Video must have a publish year." ;
#             sh:resultPath flow:publishYear ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b6 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0106> ;
#             sh:resultMessage "Each Video must have a publish year." ;
#             sh:resultPath flow:publishYear ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b6 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0108> ;
#             sh:resultMessage "Each Video must have a publish year." ;
#             sh:resultPath flow:publishYear ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b6 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0104> ;
#             sh:resultMessage "Each Video must have a valid publish timestamp." ;
#             sh:resultPath flow:publishTimestamp ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b5 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0108> ;
#             sh:resultMessage "Each Video must have a valid publish timestamp." ;
#             sh:resultPath flow:publishTimestamp ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b5 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0104> ;
#             sh:resultMessage "Each Video must have a publish year." ;
#             sh:resultPath flow:publishYear ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b6 ],
#         [ a sh:ValidationResult ;
#             sh:focusNode <http://flow.ai/video/fourth_batch_vid_0107> ;
#             sh:resultMessage "Each Video must have a publish year." ;
#             sh:resultPath flow:publishYear ;
#             sh:resultSeverity sh:Violation ;
#             sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
#             sh:sourceShape _:n9fefde06c143498eb56acf4ae7f60d25b6 ] .

# _:n9fefde06c143498eb56acf4ae7f60d25b5 sh:datatype xsd:dateTime ;
#     sh:message "Each Video must have a valid publish timestamp." ;
#     sh:minCount 1 ;
#     sh:path flow:publishTimestamp .

# _:n9fefde06c143498eb56acf4ae7f60d25b6 sh:datatype xsd:gYear ;
#     sh:message "Each Video must have a publish year." ;
#     sh:minCount 1 ;
#     sh:path flow:publishYear .



In [None]:
def preprocess_provenance_csv(file_path: str, batch_name: str, output_dir: str = "../Data/Preprocessed_Provenance_Datasets"):
    """Preprocess a provenance metadata CSV file and save a cleaned version."""

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Load CSV
    df = pd.read_csv(file_path)

    # 1. Normalize column names
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

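    # 2. Clean text fields (missing values stay missing instead of becoming the string "nan")
    text_cols = ['video_description_(original)', 'video_title_(original)']
    for col in text_cols:
        if col in df.columns:
            cleaned = df[col].astype(str).str.replace(r'\s+', ' ', regex=True).str.strip()
            df[col] = cleaned.where(df[col].notna())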

    # 3. Convert timestamps to ISO 8601 in UTC, handle missing or invalid
    if 'video_publish_timestamp' in df.columns:
        df['video_publish_timestamp'] = pd.to_datetime(df['video_publish_timestamp'], errors='coerce', utc=True)

        # Fill missing timestamps with batch median or fallback default
        if df['video_publish_timestamp'].isna().any():
            median_date = df['video_publish_timestamp'].dropna().median()
            fallback_date = median_date if pd.notnull(median_date) else pd.Timestamp(f"{datetime.now().year}-01-01T00:00:00Z")
            df['video_publish_timestamp'] = df['video_publish_timestamp'].fillna(fallback_date)

        # Reformat as ISO 8601 string
        df['video_publish_timestamp'] = df['video_publish_timestamp'].dt.strftime('%Y-%m-%dT%H:%M:%SZ')

    # 4. Assign unique IDs
    df.insert(0, 'video_id', [f"{batch_name}_vid_{i+1:04d}" for i in range(len(df))])

    # 5. Extract publish year, filling any missing with batch median year
    if 'video_publish_timestamp' in df.columns:
        df['publish_year'] = pd.to_datetime(df['video_publish_timestamp'], errors='coerce').dt.year
        if df['publish_year'].isna().any():
            median_year = int(df['publish_year'].median()) if df['publish_year'].notna().any() else datetime.now().year
            df['publish_year'] = df['publish_year'].fillna(median_year).astype(int)

    # 6. Add batch provenance tag
    df['batch_source'] = batch_name

    # Save preprocessed file
    output_path = os.path.join(output_dir, f"{batch_name}_metadata_preprocessed.csv")
    df.to_csv(output_path, index=False)

    print(f"Preprocessed and saved: {output_path}")
    return df
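
Since the function above replaces missing timestamps with the batch median, it can help auditability to record which rows were filled. The sketch below shows the idea on a three-row frame in which the missing value is made up; the column name `timestamp_imputed` is an assumption and is not part of the function above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "video_publish_timestamp": [
            "2019-05-05T22:09:40+00:00",
            None,  # made-up missing value
            "2019-05-07T00:01:55+00:00",
        ]
    }
)

parsed = pd.to_datetime(df["video_publish_timestamp"], errors="coerce", utc=True)

# Remember which rows had no usable timestamp before filling them.
df["timestamp_imputed"] = parsed.isna()

# Fill the gaps with the median of the valid timestamps, as the function above does.
df["video_publish_timestamp"] = parsed.fillna(parsed.median())

print(df)
```

With such a flag in place, queries about publish dates can exclude or annotate imputed values instead of treating them as observed facts.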


In [None]:
# Apply preprocessing to all four datasets
datasets = {
    "first_batch": "../Prove_Data/Provenance_Datasets/first_batch_metadata.csv",
    "second_batch": "../Prove_Data/Provenance_Datasets/second_batch_metadata.csv",
    "third_batch": "../Prove_Data/Provenance_Datasets/third_batch_metadata.csv",
    "fourth_batch": "../Prove_Data/Provenance_Datasets/fourth_batch_metadata.csv"
}

for batch_name, path in datasets.items():
    preprocess_provenance_csv(path, batch_name)

Preprocessed and saved: ../Data/Preprocessed_Provenance_Datasets/first_batch_metadata_preprocessed.csv
Preprocessed and saved: ../Data/Preprocessed_Provenance_Datasets/second_batch_metadata_preprocessed.csv
Preprocessed and saved: ../Data/Preprocessed_Provenance_Datasets/third_batch_metadata_preprocessed.csv
Preprocessed and saved: ../Data/Preprocessed_Provenance_Datasets/fourth_batch_metadata_preprocessed.csv


In [None]:
# Load the dataset
file_path = "../Prove_Data/Preprocessed_Provenance_Datasets/first_batch_metadata_preprocessed.csv"
first_batch = pd.read_csv(file_path)

# Display first few rows
first_batch.head(10)

Unnamed: 0,video_id,approx_duration_(ms),video_description_(original),video_title_(original),video_publish_timestamp,publish_year,batch_source
0,first_batch_vid_0001,281000,If you are a student or researcher in any fiel...,01 - Why do you need to learn programming?,2019-05-05T22:09:40Z,2019,first_batch
1,first_batch_vid_0002,394000,What is programming? A few programming terms. ...,02 - What is programming?,2019-05-05T22:23:57Z,2019,first_batch
2,first_batch_vid_0003,579000,As a coder you need to understand the basics o...,03 - What is command prompt?,2019-05-06T23:02:25Z,2019,first_batch
3,first_batch_vid_0004,644000,It is very important to understand what a digi...,04 - What is a digital image?,2019-05-06T23:07:35Z,2019,first_batch
4,first_batch_vid_0005,519000,If you are an absolute beginner programmer the...,05 - What is Python?,2019-05-06T23:14:26Z,2019,first_batch
5,first_batch_vid_0006,1145000,This video provides an overview of various dev...,06 - Python basics - IDE & operators,2019-05-06T23:28:29Z,2019,first_batch
6,first_batch_vid_0007,1214000,This is a continuation of previous video and i...,07 - Python basics - logical operators and bas...,2019-05-06T23:37:05Z,2019,first_batch
7,first_batch_vid_0008,148000,This video warns you about potential round-off...,08 - A warning about round off errors in Python,2019-05-06T23:37:41Z,2019,first_batch
8,first_batch_vid_0009,716000,In this video you'll learn about using if and ...,09 - if else elif statements in Python,2019-05-07T00:01:55Z,2019,first_batch
9,first_batch_vid_0010,1284000,Images are nothing but a list of numbers. Ther...,10 - lists tuples and dictionaries,2019-05-07T00:24:18Z,2019,first_batch


In [None]:
# Load the preprocessed second batch
file_path = "../Prove_Data/Preprocessed_Provenance_Datasets/second_batch_metadata_preprocessed.csv"
second_batch = pd.read_csv(file_path)

# Display first few rows
second_batch.head(10)

Unnamed: 0,video_id,approx_duration_(ms),video_description_(original),video_title_(original),video_publish_timestamp,publish_year,batch_source
0,second_batch_vid_0001,691000,Discrete cosine transformation (DCT) is simila...,112 - Averaging image stack in real and DCT sp...,2020-03-30T07:00:11Z,2020,second_batch
1,second_batch_vid_0002,1023000,If the image histogram is confined only to a s...,113 - Histogram equalization and CLAHE,2020-04-01T07:00:27Z,2020,second_batch
2,second_batch_vid_0003,551000,BRISQUE calculates the no-reference image qual...,114 - Automatic image quality assessment using...,2020-04-03T07:00:09Z,2020,second_batch
3,second_batch_vid_0004,462000,Otsu is a well known approach for image segmen...,115 - Auto segmentation using multi-otsu,2020-04-06T07:00:16Z,2020,second_batch
4,second_batch_vid_0005,1282000,Social distancing is a social responsibility t...,Effect of Social Distancing on the spread of C...,2020-03-20T00:54:20Z,2020,second_batch
5,second_batch_vid_0006,1147000,Data analysis is a key step that often follows...,107 - Analysis of COVID-19 data using Python -...,2020-03-23T07:00:01Z,2020,second_batch
6,second_batch_vid_0007,1610000,Data analysis is a key step that often follows...,108 - Analysis of COVID-19 data using Python -...,2020-03-25T07:00:13Z,2020,second_batch
7,second_batch_vid_0008,914000,This video explains the process of fitting dat...,109 - Predicting COVID-19 cases using Python,2020-03-26T05:00:01Z,2020,second_batch
8,second_batch_vid_0009,1259000,This video describes the process of reading an...,110 - Visualizing COVID-19 cases & death infor...,2020-03-27T23:20:09Z,2020,second_batch
9,second_batch_vid_0010,505000,This video covers the topic of digging into th...,111 - What are the top 10 countries with highe...,2020-03-28T09:00:18Z,2020,second_batch


In [None]:
# Load the preprocessed third batch
file_path = "../Prove_Data/Preprocessed_Provenance_Datasets/third_batch_metadata_preprocessed.csv"
third_batch = pd.read_csv(file_path)

# Display first few rows
third_batch.head(10)

Unnamed: 0,video_id,approx_duration_(ms),video_description_(original),video_title_(original),video_publish_timestamp,publish_year,batch_source
0,third_batch_vid_0001,2271000,Multiclass semantic segmentation using U-Net w...,"210 - Multiclass U-Net using VGG, ResNet, and ...",2021-03-24T07:00:02Z,2021,third_batch
1,third_batch_vid_0002,1915000,Multiclass semantic segmentation using Linknet...,211 - U-Net vs LinkNet for multiclass semantic...,2021-03-31T07:00:10Z,2021,third_batch
2,third_batch_vid_0003,1332000,Classification of mnist hand sign language alp...,212 - Classification of mnist sign language al...,2021-04-07T07:00:00Z,2021,third_batch
3,third_batch_vid_0004,1531000,Classification of mnist hand sign language alp...,213 - Ensemble of networks for improved accura...,2021-04-14T09:00:00Z,2021,third_batch
4,third_batch_vid_0005,1773000,Improving semantic segmentation (U-Net) perfor...,214 - Improving semantic segmentation (U-Net) ...,2021-04-21T07:00:04Z,2021,third_batch
5,third_batch_vid_0006,1474000,Code generated in the video can be downloaded ...,204 - U-Net for semantic segmentation of mitoc...,2021-02-25T08:00:08Z,2021,third_batch
6,third_batch_vid_0007,1666000,This video explains U-Net segmentation of imag...,205 - U-Net plus watershed for instance segmen...,2021-03-02T08:00:09Z,2021,third_batch
7,third_batch_vid_0008,3000000,"Can be applied to 3D volumes from FIB-SEM, CT,...",215 - 3D U-Net for semantic segmentation,2021-04-28T07:00:03Z,2021,third_batch
8,third_batch_vid_0009,86000,Just got bored of spending time at home so dec...,Hidden gems around the Bay Area - Santa Cruz -...,2021-02-21T23:13:36Z,2021,third_batch
9,third_batch_vid_0010,407000,Link to sign up for Mito: https://hubs.ly/H0H0...,Python Tips and Tricks - 1: Mito (trymito.io) ...,2021-02-22T20:04:04Z,2021,third_batch


In [None]:
# Load the preprocessed fourth batch
file_path = "../Prove_Data/Preprocessed_Provenance_Datasets/fourth_batch_metadata_preprocessed.csv"
fourth_batch = pd.read_csv(file_path)

# Display first few rows
fourth_batch.head(10)

Unnamed: 0,video_id,approx_duration_(ms),video_description_(original),video_title_(original),video_publish_timestamp,publish_year,batch_source
0,fourth_batch_vid_0001,616000,Code generated in the video can be downloaded ...,276 - Grain segmentation using less than 10 li...,2022-06-29T07:00:12Z,2022,fourth_batch
1,fourth_batch_vid_0002,614000,Code generated in the video can be downloaded ...,277 - 3D object segmentation in python,2022-07-06T07:00:00Z,2022,fourth_batch
2,fourth_batch_vid_0003,758000,Code generated in the video can be downloaded ...,273 - What is Voronoi - explanation using pyth...,2022-06-08T07:00:13Z,2022,fourth_batch
3,fourth_batch_vid_0004,877000,Code generated in the video can be downloaded ...,274 - Object segmentation using voronoi and otsu,2022-06-15T07:00:30Z,2022,fourth_batch
4,fourth_batch_vid_0005,603000,Code generated in the video can be downloaded ...,282 - IHC color separation followed by nuclei ...,2022-08-10T07:00:06Z,2022,fourth_batch
5,fourth_batch_vid_0006,853000,Code generated in the video can be downloaded ...,278 - IHC color separation followed by nuclei ...,2022-07-13T07:00:06Z,2022,fourth_batch
6,fourth_batch_vid_0007,1244000,For details on installing StarDist library and...,279 - An introduction to object segmentation u...,2022-07-20T07:00:04Z,2022,fourth_batch
7,fourth_batch_vid_0008,1471000,Code generated in the video can be downloaded ...,280 - Custom object segmentation using StarDis...,2022-07-27T07:00:06Z,2022,fourth_batch
8,fourth_batch_vid_0009,819000,Code generated in the video can be downloaded ...,281 - Segmenting whole slide images (WSI) for ...,2022-08-03T07:00:15Z,2022,fourth_batch
9,fourth_batch_vid_0010,445000,Tips Tricks - 32 Code generated in the video c...,Automate periodic mouse movements using python,2022-04-22T07:00:16Z,2022,fourth_batch
