## Project - Nobel Prize Winners and University Data

In this project, data from two different sources is combined to explore connections between Nobel Prize winners and universities. Nobel laureate information is imported from a Kaggle CSV file into a SQLite database. University data is retrieved from DBpedia using a SPARQL query and is stored in the same database.

To link the datasets, a SQL view joins the two tables on exactly matching university names. This makes it possible to analyze which universities are associated with Nobel laureates, along with additional information such as the number of students and the Wikipedia page IDs of the universities.

If this notebook does not display correctly, it is also available on my GitHub repository: https://github.com/Fab2102/dke_project.

<br>

## Installing and importing all dependencies
- SPARQLWrapper
    - to query SPARQL endpoints like DBpedia

- sqlite3
    - to work with local SQL databases

- pandas
    - to handle and analyze CSV data

In [1]:
%pip install SPARQLWrapper
%pip install pandas

Note: you may need to restart the kernel to use updated packages.

In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
import sqlite3 
import pandas as pd

<br>

## Load SQL Extension and create/connect to SQLite Database

- Enabling SQL magic commands in Jupyter

- Creating an SQLite database named "data.db"

- Creating a Python connection object for programmatic access

In [3]:
%load_ext sql
%sql sqlite:///data.db

In [4]:
conn = sqlite3.connect('data.db')
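
A quick way to confirm that a connection works is to list the tables and views it can see. The following is a minimal sketch, not part of the original notebook: `list_tables` is a hypothetical helper, and an in-memory database stands in for `data.db` so the snippet runs anywhere.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list[str]:
    """Return the names of all tables and views in the connected database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# An in-memory database stands in for data.db in this sketch.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE nobel (year INTEGER, fullName TEXT)")
print(list_tables(demo_conn))
```

Calling the same helper on `conn` later in the notebook should list `nobel`, `universities` and `nobel_universities` once the corresponding cells have run.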

<br><br><br>

# CSV data
The dataset "Nobel Prize Winners: 1901 to 2023" is loaded with pandas and then written to an SQL table. This makes it easier to join with RDF data from a SPARQL query later on. The dataset was downloaded from Kaggle and can be found [here](https://www.kaggle.com/datasets/sazidthe1/nobel-prize-data).

<br>

## Processing CSV

- Keeps only the following 4 columns: `year`, `category`, `fullName`, `organizationName`

- Writes the pandas DataFrame to an SQL table named `nobel`

- Prints out the pandas DataFrame

In [5]:
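# Columns 0, 1, 5 and 13 are year, category, fullName and organizationName
df_csv = pd.read_csv("nobel_laureates_data.csv", encoding='utf8').iloc[:, [0, 1, 5, 13]].copy()
# Replace any existing "nobel" table so the cell can be re-run
df_csv.to_sql("nobel", conn, if_exists="replace", index=False)
df_csv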

| | year | category | fullName | organizationName |
|---|---|---|---|---|
| 0 | 2023 | medicine | Katalin Kariko | Szeged University |
| 1 | 2023 | economics | Claudia Goldin | Harvard University |
| 2 | 2023 | peace | Narges Mohammadi | |
| 3 | 2023 | literature | Jon Fosse | |
| 4 | 2023 | chemistry | Alexei Ekimov | Nanocrystals Technology Inc. |
| ... | ... | ... | ... | ... |
| 995 | 1901 | peace | Frederic Passy | |
| 996 | 1901 | peace | Henry Dunant | |
| 997 | 1901 | medicine | Emil von Behring | Marburg University |
| 998 | 1901 | chemistry | Jacobus H. van 't Hoff | Berlin University |
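
Selecting columns by position depends on the column order of the CSV file. As a hedged alternative, the columns can be selected by name with `usecols`. The sketch below uses a tiny in-memory stand-in for `nobel_laureates_data.csv`; the `motivation` column and its placeholder values are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for nobel_laureates_data.csv; the extra column is illustrative.
sample_csv = io.StringIO(
    "year,category,motivation,fullName,organizationName\n"
    "2023,economics,placeholder,Claudia Goldin,Harvard University\n"
    "2023,peace,placeholder,Narges Mohammadi,\n"
)

wanted = ["year", "category", "fullName", "organizationName"]

# usecols keeps only the named columns; indexing with [wanted] fixes their order.
df_sample = pd.read_csv(sample_csv, usecols=wanted)[wanted]

print(df_sample.columns.tolist())
# Empty organization fields are read as NaN, which to_sql stores as NULL.
print(df_sample["organizationName"].isna().sum())
```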


<br>

## Showing CSV data via SQL query

The following SQL query returns the first 10 rows from the `nobel` table.

In [6]:
%sql select * from nobel limit 10;

 * sqlite:///data.db
Done.


| year | category | fullName | organizationName |
|---|---|---|---|
| 2023 | medicine | Katalin Kariko | Szeged University |
| 2023 | economics | Claudia Goldin | Harvard University |
| 2023 | peace | Narges Mohammadi | |
| 2023 | literature | Jon Fosse | |
| 2023 | chemistry | Alexei Ekimov | Nanocrystals Technology Inc. |
| 2023 | chemistry | Louis Brus | Columbia University |
| 2023 | chemistry | Moungi Bawendi | Massachusetts Institute of Technology (MIT) |
| 2023 | physics | Anne L Huillier | Lund University |
| 2023 | physics | Ferenc Krausz | Max Planck Institute of Quantum Optics |
| 2023 | physics | Pierre Agostini | The Ohio State University |
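
Besides the `%sql` magic, the same table can be queried from Python through the connection object. The following sketch is an illustration under stated assumptions: it loads five rows copied from the output above into an in-memory database (instead of `data.db`) and counts laureates per category with `pd.read_sql_query`.

```python
import sqlite3

import pandas as pd

# Five rows copied from the preview above; an in-memory DB stands in for data.db.
rows = [
    (2023, "medicine", "Katalin Kariko", "Szeged University"),
    (2023, "economics", "Claudia Goldin", "Harvard University"),
    (2023, "chemistry", "Alexei Ekimov", "Nanocrystals Technology Inc."),
    (2023, "chemistry", "Louis Brus", "Columbia University"),
    (2023, "physics", "Pierre Agostini", "The Ohio State University"),
]
demo = sqlite3.connect(":memory:")
pd.DataFrame(
    rows, columns=["year", "category", "fullName", "organizationName"]
).to_sql("nobel", demo, index=False)

# Aggregate in SQL and receive the result as a DataFrame.
per_category = pd.read_sql_query(
    "SELECT category, COUNT(*) AS laureates "
    "FROM nobel GROUP BY category "
    "ORDER BY laureates DESC, category",
    demo,
)
print(per_category)
```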


<br><br><br>

# RDF data

In this section, RDF data is retrieved from the DBpedia SPARQL endpoint. The query fetches information about universities, including their English names, the number of students and their Wikipedia page IDs.

<br>

## Reading in DBpedia data via SPARQL

- Connects to the DBpedia SPARQL endpoint using `SPARQLWrapper`

- Queries the following fields: `universityName`, `numStudents` and `wikiPageID`

- Returns the results in JSON format and prints out the first 10 entries

In [7]:
sparql = SPARQLWrapper("https://dbpedia.org/sparql")

sparql.setQuery("""
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT DISTINCT
  ?universityName
  ?numStudents
  ?wikiPageID
WHERE {
  ?university rdf:type dbo:University .
  ?university rdfs:label ?universityName .
  ?university dbo:numberOfStudents ?numStudents .
  ?university dbo:wikiPageID ?wikiPageID .
  FILTER(lang(?universityName) = "en")
}
""")

sparql.setReturnFormat(JSON)
results = sparql.query().convert()["results"]["bindings"]

for res in results[:10]:
    name = res["universityName"]["value"]
    num_students = res["numStudents"]["value"]
    wiki_id = res["wikiPageID"]["value"]
    print(f"{name}, {num_students}, {wiki_id}")


Ca' Foscari University of Venice, 21000, 463087
Cabrillo College, 11033, 1615527
Cadet College Rawalpindi, 350, 42402295
Cadi Ayyad University, 102000, 37478842
Cairn University, 1558, 19575612
Cairo University, 231584, 609874
Cal Poly Pomona College of Business Administration, 4919, 19360211
Cal Poly Pomona College of Education and Integrative Studies, 1489, 19360252
Cal Poly Pomona College of Engineering, 5858, 4540702
Cal Poly Pomona College of Environmental Design, 1632, 19360061


<br>

## Transforming RDF data to CSV/SQL

- Uses a list comprehension to convert the SPARQL results into a pandas DataFrame

- Saves the DataFrame as a CSV file called `universites.csv`

- Reads the CSV back in and writes it to an SQL table called `universities`

- Prints out the pandas DataFrame


In [8]:
df_rdf = pd.DataFrame([{
    "UniversityName": res["universityName"]["value"],
    "NumStudents": res.get("numStudents", {}).get("value", ""),
    "WikiPageID": res.get("wikiPageID", {}).get("value", "")
} for res in results])

df_rdf.to_csv("universites.csv", sep=";", index=False, encoding="utf8")
pd.read_csv("universites.csv", delimiter=";").to_sql("universities", conn, if_exists="replace", index=False);

df_rdf


| | UniversityName | NumStudents | WikiPageID |
|---|---|---|---|
| 0 | Ca' Foscari University of Venice | 21000 | 463087 |
| 1 | Cabrillo College | 11033 | 1615527 |
| 2 | Cadet College Rawalpindi | 350 | 42402295 |
| 3 | Cadi Ayyad University | 102000 | 37478842 |
| 4 | Cairn University | 1558 | 19575612 |
| ... | ... | ... | ... |
| 9517 | Talimuddin Inter College | 6000 | 47065856 |
| 9518 | TUM Department of Sport and Health Sciences | 2228 | 66184710 |
| 9519 | TUM School of Medicine | 2180 | 66184374 |
| 9520 | TU Austria | 46000 | 48662412 |
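
The SPARQL values arrive as strings, and reading the CSV back with `pd.read_csv` has the side effect of letting pandas infer numeric column types before the table is written. A possible alternative, sketched here under assumptions, is to declare the column types explicitly and store missing values as `NULL`. The helper `to_int_or_none`, the in-memory database and the third sample row are all made up for illustration.

```python
import sqlite3


def to_int_or_none(text: str):
    """Return an int for digit-only strings, otherwise None (stored as SQL NULL)."""
    text = text.strip()
    return int(text) if text.isdigit() else None


raw_rows = [
    ("Cairn University", "1558", "19575612"),
    ("Cairo University", "231584", "609874"),
    ("Placeholder College", "", "0"),  # made-up row with a missing student count
]

# An in-memory database stands in for data.db in this sketch.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE universities ("
    "UniversityName TEXT, NumStudents INTEGER, WikiPageID INTEGER)"
)
demo.executemany(
    "INSERT INTO universities VALUES (?, ?, ?)",
    [(name, to_int_or_none(n), to_int_or_none(w)) for name, n, w in raw_rows],
)

# typeof() reports how SQLite actually stored each value.
stored = demo.execute(
    "SELECT typeof(NumStudents), NumStudents FROM universities ORDER BY rowid"
).fetchall()
print(stored)
```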


<br>

## Showing RDF data via SQL query
The following SQL query displays the first 10 rows from the `universities` table to preview the imported RDF data.

In [9]:
%sql select * from universities limit 10;

 * sqlite:///data.db
Done.


| UniversityName | NumStudents | WikiPageID |
|---|---|---|
| Ca' Foscari University of Venice | 21000 | 463087 |
| Cabrillo College | 11033 | 1615527 |
| Cadet College Rawalpindi | 350 | 42402295 |
| Cadi Ayyad University | 102000 | 37478842 |
| Cairn University | 1558 | 19575612 |
| Cairo University | 231584 | 609874 |
| Cal Poly Pomona College of Business Administration | 4919 | 19360211 |
| Cal Poly Pomona College of Education and Integrative Studies | 1489 | 19360252 |
| Cal Poly Pomona College of Engineering | 5858 | 4540702 |
| Cal Poly Pomona College of Environmental Design | 1632 | 19360061 |


<br><br><br>

# Combining CSV + RDF

- Drops the view `nobel_universities` if it already exists, so the cell can be re-run without errors

- Creates a new SQL view by joining Nobel Prize winners (CSV) with university data (RDF) on exactly matching university names

- Displays the first 10 rows from the joined view for a quick preview

In [10]:
%%sql

DROP VIEW IF EXISTS nobel_universities;

CREATE VIEW nobel_universities AS
SELECT a.year, a.category, a.fullName, b.UniversityName, b.NumStudents, b.WikiPageID
FROM nobel a
JOIN universities b ON a.organizationName = b.UniversityName;

SELECT * FROM nobel_universities LIMIT 10;


 * sqlite:///data.db
Done.
Done.
Done.


| year | category | fullName | UniversityName | NumStudents | WikiPageID |
|---|---|---|---|---|---|
| 2023 | economics | Claudia Goldin | Harvard University | 19218 | 18426501 |
| 2023 | chemistry | Louis Brus | Columbia University | 33413 | 6310 |
| 2023 | physics | Anne L Huillier | Lund University | 46000 | 17843 |
| 2022 | economics | Douglas Diamond | University of Chicago | 18452 | 32127 |
| 2022 | chemistry | Morten Meldal | University of Copenhagen | 37493 | 176767 |
| 2022 | physics | Anton Zeilinger | University of Vienna | 91715 | 53049 |
| 2021 | economics | David Card | University of California | 294662 | 31921 |
| 2021 | chemistry | David MacMillan | Princeton University | 8419 | 23922 |
| 2021 | physics | Giorgio Parisi | Sapienza University of Rome | 112564 | 1222318 |
| 2021 | physics | Syukuro Manabe | Princeton University | 8419 | 23922 |
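
Because the join requires identical strings, laureates whose organization is spelled differently in the two sources drop out of the view. For example, the 2023 rows above no longer include Moungi Bawendi, whose organization is recorded as "Massachusetts Institute of Technology (MIT)". One way to inspect such cases is a `LEFT JOIN` that keeps only the unmatched organization names. The sketch below runs on a small in-memory database; the `universities` entry "Massachusetts Institute of Technology" is an assumed DBpedia label used for illustration.

```python
import sqlite3

# In-memory stand-in for data.db with a few rows from the notebook output.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE nobel (fullName TEXT, organizationName TEXT);
CREATE TABLE universities (UniversityName TEXT);
INSERT INTO nobel VALUES
  ('Claudia Goldin', 'Harvard University'),
  ('Moungi Bawendi', 'Massachusetts Institute of Technology (MIT)'),
  ('Katalin Kariko', 'Szeged University'),
  ('Jon Fosse', NULL);
INSERT INTO universities VALUES
  ('Harvard University'),
  ('Massachusetts Institute of Technology');  -- assumed label
""")

# Organizations that appear in nobel but have no exact match in universities.
unmatched = demo.execute("""
SELECT a.organizationName, COUNT(*) AS laureates
FROM nobel a
LEFT JOIN universities b ON a.organizationName = b.UniversityName
WHERE a.organizationName IS NOT NULL
  AND b.UniversityName IS NULL
GROUP BY a.organizationName
ORDER BY laureates DESC, a.organizationName
""").fetchall()
print(unmatched)
```

Running the same query against the real tables would show which name variants account for most of the missing matches, which could then guide any name normalization.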
