In [50]:
import requests
import xml.etree.ElementTree as ET
import dicttoxml
import json
import matplotlib.pyplot as plt

# Introduction to knowledge graphs

In this assignment, we set the stage for the group project of 2AMD20. Through some small exercises, we introduce you to our expectations of the project. There are four parts, which correspond to the four requirements you can choose from in the group assignment: 
* Knowledge graphs and modeling (mandatory)
* Data quality and cleaning
* Data exchange
* Visual analytics

**All four parts are required in this mini-assignment, such that you can get a feel for what each of them means.**

# 1. Knowledge graphs
The first requirement for the group assignment is that you use knowledge graphs. This means that you use data in a graph format, preferably RDF, and its corresponding query language. In this mini-assignment, we will use XML, so that you have more time to get used to RDF and SPARQL.

We have prepared a dataset for you, but you may also download, query, or import extra data for this assignment. The data file we have prepared comes from a news API, which gathers news from "over 80,000 large and small news sources and blogs". We have searched for articles about football, in the languages of the countries where football is most popular.

In [51]:
# We can open XML data from a file.

tree = ET.parse('football_id.xml')
data = tree.getroot()

# To help you identify the articles, here are the keywords we used.
used_keywords = {
    'England': 'football',
    'Netherlands': 'voetbal',
    'USA': 'soccer',
    'Brazil/Portugal': 'futebol',
    'Germany': 'fussball',
    'Italian': 'calcio',
    'Spain/Argentina/Uruguay': 'fútbol',
    'France': 'le foot',
    'european': 'uefa',
    'world': 'fifa'
}


In [52]:
# If you want, you can retrieve news data from news api, 
# after you get your own apikey by creating an account: https://newsapi.org/docs/endpoints/everything

# apikey = '' 
# keyword = 'fifa'

# r = requests.get('https://newsapi.org/v2/everything?q=' + keyword + '&sortBy=relevancy&page=1&apiKey=' + apikey)
# json_data = json.loads(r.content)['articles']

# xml_data = dicttoxml.dicttoxml(json_data, return_bytes=False)
# data = ET.fromstring(xml_data)
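
The commented request above builds the query string by concatenation, which can be fragile for keywords that contain spaces or reserved characters such as `&`. A minimal sketch of a safer variant, assuming the same endpoint and parameters as above; `build_newsapi_url` is a hypothetical helper, not part of the assignment code:

```python
from urllib.parse import urlencode

NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_newsapi_url(keyword, apikey, page=1):
    """Return the request URL with every parameter percent-encoded."""
    params = {
        "q": keyword,
        "sortBy": "relevancy",
        "page": page,
        "apiKey": apikey,
    }
    return NEWSAPI_ENDPOINT + "?" + urlencode(params)


# 'YOUR_API_KEY' is a placeholder; use the key from your own account.
url = build_newsapi_url("le foot", "YOUR_API_KEY")
print(url)
```

Passing the same dictionary as `params=` to `requests.get` should give an equivalent result, since `requests` encodes the parameters itself.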



In [53]:
# The structure of the data is as follows: 

print(ET.tostring(data[0]).decode())

<article id="0" type="dict">
        <source type="dict">
        <id type="str">le-monde</id>
        <name type="str">Le Monde</name>
        </source>
        <author type="str">Laura Pottier</author>
        <title type="str">France-Canada\xa0: la gr&#232;ve pour l&#8217;&#233;quit&#233; dans le foot, le combat men&#233; par les Canadiennes</title>
        <description type="str">Championnes olympiques, les Canadiennes, qui affrontent les Bleues d&#8217;Herv&#233; Renard mardi, r&#233;clament des moyens &#233;gaux &#224; ceux de leurs homologues masculins.</description>
        <url type="str">https://www.lemonde.fr/sport/article/2023/04/11/france-canada-la-greve-pour-l-equite-dans-le-foot-le-combat-mene-par-les-canadiennes_6169093_3242.html</url>
        <urlToImage type="str">https://img.lemde.fr/2019/06/15/98/0/3500/2333/1440/960/60/0/7b3b61d_AI_SOCCER-WORLDCUP-CAN-NZL-_0615_1F.JPG</urlToImage>
        <publishedAt type="str">2023-04-11T15:00:01Z</publishedAt>
        <content t

In [54]:
# Each article has an id attribute and the following child elements:
# source (with id and name), author, title, description,
# url, urlToImage, publishedAt, content

# We can access details about an article as follows:

print(data[1].attrib['id'])
print(data[1].find('source').find('id').text)
print(data[1].find('source').find('name').text)
print(data[1].find('author').text)
print(data[1].find('title').text)
print(data[1].find('description').text)
print(data[1].find('url').text)
print(data[1].find('publishedAt').text)
print(data[1].find('content').text)

1
le-monde
Le Monde
Alexia Eychenne
L’aventure sud-africaine des «\xa0mamies foot\xa0»\xa0: «\xa0J’ai mal au dos. J’ai de l’arthrose. Au pire, on prend un cachet après le match\xa0»
Fin mars, treize Françaises, âgées de 52\xa0à 68\xa0ans, s’envolent pour le premier tournoi international des Soccer Grannies, en Afrique du Sud. Pour le plaisir du jeu, du collectif et le défi d’accepter son corps.
https://www.lemonde.fr/m-le-mag/article/2023/03/26/l-aventure-sud-africaine-des-mamies-foot-j-ai-mal-au-dos-j-ai-de-l-arthrose-au-pire-on-prend-un-cachet-apres-le-match_6166993_4500055.html
2023-03-26T02:00:03Z
Entraînement des «\xa0mamies foot\xa0» avec leur coach, Patricia Vittorelli, au gymnase de Saint-Just-Chaleyssin (Isère), le 9\xa0mars\xa02023. SÉBASTIEN ERÔME POUR «\xa0M LE MAGAZINE DU MONDE\xa0»\r\n«\xa0Sautille, Genevi… [+2622 chars]
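
The chained `find(...).text` calls above raise `AttributeError` when a child element is missing. A small sketch of a more tolerant access pattern, using `findtext` with a default value and ElementTree's limited XPath support. The sample XML is made up for illustration and only mimics the structure shown earlier:

```python
import xml.etree.ElementTree as ET

# Made-up sample with the same shape as the assignment data.
sample = ET.fromstring(
    "<root>"
    '<article id="0"><source><name>Le Monde</name></source>'
    "<title>First</title></article>"
    '<article id="1"><title>Second</title></article>'
    "</root>"
)

# Select one article by its id attribute instead of by position.
article = sample.find("article[@id='1']")

# findtext follows a path and returns the default when the element is missing.
title = article.findtext("title", default="")
source_name = article.findtext("source/name", default="unknown")

print(title, source_name)
```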


# 2. Data quality and cleaning (4 points)

One of the three possible dimensions in the group assignment is improving the quality of your data. The dataset we provide is relatively clean, so another way to improve the quality is to extend it. 

To practice with data quality, write a script that adds the country of origin to each news article. You can consider using the keywords in `used_keywords`, the top-level domain (TLD) of the URLs, or some other dimension of the data.

In [55]:
# Drop the articles that have a missing or empty value in any of the fields
# we use later for our data manipulation.

def is_element_valid(elem):
    # id, author and publishedAt are not checked because later steps do not use them.
    required = ['title', 'description', 'url', 'content']
    # findtext returns None if the child element is missing and '' if it has no text.
    return all((elem.findtext(name) or '').strip() for name in required)

print(len(data))

# Keep only the valid elements (note: data is now a plain list of elements).
data = [elem for elem in data if is_element_valid(elem)]

print(len(data))


1000
959


In [56]:
# We create a dictionary that maps the top-level domain of each URL to its country of origin.

dict_suffix = {
    'fr': 'France',
    'it': 'Italy',
    'de': 'Germany',
    'nl': 'Netherlands',
    'br': 'Brazil',
    'pt': 'Portugal',
    'us': 'USA',
    'ar': 'Argentina',
    'uy': 'Uruguay',
    'es': 'Spain',
    'ch': 'Switzerland',
    'at': 'Austria',
    'cz': 'Czech Republic',
    'uk': 'Great Britain',
    'tr': 'Turkey',
    'be': 'Belgium',
    'eu': 'European Union',
    'org': 'World',
    'com': 'World',
    'net': 'World'
}

In [57]:
# To classify an article by its origin, we take its URL, extract the host name,
# and look at the last label of that host name (the top-level domain, e.g. 'fr').
# dict_suffix maps each top-level domain to a country; generic domains
# (.com, .org, .net) are classified as 'World' articles.

from urllib.parse import urlparse

for elem in data:
    url = elem.find('url').text

    parsed_url = urlparse(url)

    domain = parsed_url.netloc  # example: 'www.lemonde.fr'

    domain_suffix = domain.split('.')[-1]  # 'fr'

    # Fall back to 'Unknown' for top-level domains that are not in dict_suffix.
    elem.attrib['Origin'] = dict_suffix.get(domain_suffix, 'Unknown')

    #print(elem.attrib['Origin'])
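
The suffix lookup can also be wrapped in a small function, which makes it easier to test on individual URLs. A minimal sketch under stated assumptions: `origin_from_url` and the shortened `SUFFIX_TO_COUNTRY` mapping are hypothetical names introduced here, with the notebook's `dict_suffix` playing the role of the mapping.

```python
from urllib.parse import urlparse

# Hypothetical, shortened mapping; the notebook's dict_suffix plays this role.
SUFFIX_TO_COUNTRY = {"fr": "France", "uk": "Great Britain", "com": "World"}


def origin_from_url(url, mapping=SUFFIX_TO_COUNTRY, default="Unknown"):
    """Map the last label of the URL's host name to a country."""
    # hostname is lowercased and excludes any port; it is None for malformed URLs.
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1]
    return mapping.get(suffix, default)


print(origin_from_url("https://www.lemonde.fr/sport/article"))
print(origin_from_url("https://www.bbc.co.uk:443/sport/football"))
print(origin_from_url("https://example.xyz/news"))
```

Using `hostname` rather than `netloc` avoids treating a port number such as `:443` as part of the suffix.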
    
        


# 3. Data exchange (4 points)

Another dimension of the group assignment is data exchange. In order to practice this, we want you to figure out which individual or football team is the subject of the data, and which country this club is located in. Given the time constraint for this assignment, your analysis does not have to be complete, but please make a good effort to extract some of this data.

You might want to find another data source to help you with this. Consider using wikidata, dbpedia or a more domain-specific resource.
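
As one possible route, club names and countries could be pulled from Wikidata's public SPARQL endpoint. The sketch below is an illustration only, not a required approach. The identifiers are assumptions to verify on wikidata.org before use (P31 as "instance of", Q476028 as "association football club", P17 as "country"), and `fetch_clubs` and `bindings_to_dict` are hypothetical helper names:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# Assumed identifiers: P31 = instance of, Q476028 = association football club,
# P17 = country. Check them on wikidata.org before relying on the results.
QUERY = """
SELECT ?clubLabel ?countryLabel WHERE {
  ?club wdt:P31 wd:Q476028 ;
        wdt:P17 ?country .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 500
"""


def bindings_to_dict(payload):
    """Turn a SPARQL JSON result into {club name: country name}."""
    return {
        row["clubLabel"]["value"]: row["countryLabel"]["value"]
        for row in payload["results"]["bindings"]
        if "clubLabel" in row and "countryLabel" in row
    }


def fetch_clubs():
    """Run the query; needs network access and a descriptive User-Agent."""
    url = WIKIDATA_SPARQL + "?" + urlencode({"query": QUERY, "format": "json"})
    request = Request(url, headers={"User-Agent": "2AMD20-mini-assignment (example)"})
    with urlopen(request, timeout=60) as response:
        return bindings_to_dict(json.load(response))


# Offline example with a hand-written payload in the SPARQL JSON results shape.
sample_payload = {
    "results": {
        "bindings": [
            {"clubLabel": {"value": "AC Milan"}, "countryLabel": {"value": "Italy"}},
            {"clubLabel": {"value": "Aberdeen"}},  # no country: skipped
        ]
    }
}
print(bindings_to_dict(sample_payload))
```

The resulting dictionary has the same shape as a club-to-country lookup table, so it could be merged with or compared against any other source of club data.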


In [58]:
# To classify an article by the country of the club it is about, we use a dataset
# that lists the most important football teams and their countries.
# The dataset is: "Soccer_Football Clubs Ranking.csv"

import pandas as pd

team_nation_full = pd.read_csv(r"C:\Users\Tommaso\Desktop\UNIVERSITA\ARTIFICIAL INTELLIGENCE\Knowledge Engineering\Soccer_Football Clubs Ranking.csv",sep = ',')  # replace with your own path

# Keep two columns by position: column 2 (country) first, then column 1 (club name).
team_nation_filtered = team_nation_full.iloc[:, [2, 1]]

keys = team_nation_filtered.iloc[:, 1]

# After extracting the name of each club and its country, we create a dict:
# key: name of the club
# value: country of the club

team_dict = dict(zip(keys, team_nation_filtered.iloc[:, 0]))

print(team_dict)


{'1. FC Köln': 'Germany', '1. FC Union Berlin': 'Germany', '12 de Octubre de Itaugua': 'Paraguay', '1º de Agosto': 'Angola', '1º de Maio': 'Angola', '2 de Mayo': 'Paraguay', '3 de Febrero': 'Paraguay', 'AaB': 'Denmark', 'Aalesund': 'Norway', 'Aarau': 'Switzerland', 'Aberdeen': 'Scotland', 'Aberystwyth': 'Wales', 'Abha Club': 'Saudi Arabia', 'Abia Warriors': 'Nigeria', 'ABM Galaxy FC': 'Vanuatu', 'Aboomoslem': 'Iran', 'AC Horsens': 'Denmark', 'AC Kuya Sport': 'Congo DR', 'AC LALA FC': 'Venezuela', 'AC Mamahira': 'Mali', 'AC Milan': 'Italy', 'AC Oulu': 'Finland', 'AC Rangers': 'Denmark', 'Academia Cantolao': 'Peru', 'Academia F. Amadou Diallo': 'Ivory Coast', 'Academia Puerto Cabello': 'Venezuela', 'Academia Quintana': 'Puerto Rico', 'Académica': 'Portugal', 'Académica do Soyo': 'Angola', 'Académica Lobito': 'Angola', 'Accra Great Olympics': 'Ghana', 'Accra Lions FC': 'Ghana', 'ACS FC Academica Clinceni': 'Romania', 'ACS Poli Timişoara': 'Romania', 'AD Grecia FC': 'Costa Rica', 'AD Guana

In [59]:
# Classify each article by the country of the club it is about.
# We look for a club name first in the title, then in the description, and then
# in the content of the article, and stop at the first match.
# Note: the text is split on whitespace, so only single-word club names that
# appear without attached punctuation can match.

nations_found = 0
for elem in data:
    found = False

    for field in ['title', 'description', 'content']:
        text = elem.find(field).text
        if text:
            for word in text.split():
                if word in team_dict:
                    elem.attrib['Nation'] = team_dict[word]
                    nations_found += 1
                    found = True
                    break
        if found:
            break

    if not found:
        # No club name recognised in any of the three fields.
        elem.attrib['Nation'] = None


print('NUMBER OF ARTICLES: ')
print(len(data))    
print('NUMBER OF ARTICLES CLASSIFIED BY NATION OF THE CLUB: ')
print(nations_found)


NUMBER OF ARTICLES: 
959
NUMBER OF ARTICLES CLASSIFIED BY NATION OF THE CLUB: 
220
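
The word-by-word lookup above cannot match club names that span several words, such as 'AC Milan', or names followed by punctuation. A hedged sketch of one way to extend it, by testing word n-grams from longest to shortest; `find_club_nation` and the small `clubs` dictionary are hypothetical and only illustrate the idea:

```python
PUNCTUATION = ".,;:!?()\"'«»"


def find_club_nation(text, club_to_nation, max_words=4):
    """Return (club, nation) for the first club name found in text, else None."""
    tokens = text.split()
    for start in range(len(tokens)):
        # Try the longest candidate first so 'AC Milan' is preferred over 'AC'.
        for length in range(min(max_words, len(tokens) - start), 0, -1):
            candidate = " ".join(tokens[start:start + length])
            # Strip punctuation only at the ends, so '1. FC Köln' stays intact.
            candidate = candidate.strip(PUNCTUATION)
            if candidate in club_to_nation:
                return candidate, club_to_nation[candidate]
    return None


# Small made-up lookup table; the notebook's team_dict would be used instead.
clubs = {"AC Milan": "Italy", "Aberdeen": "Scotland", "1. FC Köln": "Germany"}

print(find_club_nation("Derby: AC Milan, Inter", clubs))
print(find_club_nation("Sieg für 1. FC Köln!", clubs))
print(find_club_nation("No club here", clubs))
```

The match is still exact and case-sensitive, so club names that are also common words may produce false positives, and spelling variants are missed.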


# 4. Data visualization (2 points)
We are wondering if the news sources discuss mostly their national football club, or if they also talk about other clubs.

Make an interactive plot that shows the relation between a news article's country of origin and the topic it talks about. It needs to include an informative tooltip.

You can select whichever visualization library you prefer. Note that it is also possible to run D3.js in Jupyter notebooks.

In [60]:
import plotly.express as px


# Plot the data as a matrix:
# x-axis: origin of the article
# y-axis: country of the club the article is about

# Articles whose Nation is None (we could not identify the country of the club
# the article is about) are left out.
records = []
for el in data:
    origin = el.attrib.get('Origin', 'Unknown')
    nation = el.attrib.get('Nation', 'None')
    records.append({'Origin': origin, 'Nation': nation})

df = pd.DataFrame(records)

df['Nation'] = df['Nation'].fillna('None')

df_filtered = df[(df['Nation'].str.lower() != 'none')]

grouped = df_filtered.groupby(['Origin', 'Nation']).size().reset_index(name='Article Count')

fig = px.density_heatmap(
    grouped,
    x='Origin',
    y='Nation',
    z='Article Count',
    color_continuous_scale='Blues',
    text_auto=True,
    hover_name='Origin',
    hover_data={'Nation': True, 'Article Count': True},
    title='Relation Between Article Origin and Football Nation Mentioned (Filtered)'
)

fig.update_layout(
    xaxis_title='Article Origin',
    yaxis_title='Mentioned Football Nation',
    hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
    height=600,
)

fig.show()



# Observation:
# for most countries, the articles are mainly about clubs from that same country.
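
To back the observation above with a number, one could compute, per origin, the share of classified articles that are about a club from the same country. A minimal sketch with made-up pairs; `domestic_share` is a hypothetical helper, and in the notebook the pairs would come from the `Origin` and `Nation` attributes of each element in `data`:

```python
from collections import Counter


def domestic_share(pairs):
    """Return {origin: fraction of its classified articles about home clubs}."""
    totals = Counter()
    domestic = Counter()
    for origin, nation in pairs:
        if nation is None:  # article without a recognised club
            continue
        totals[origin] += 1
        if origin == nation:
            domestic[origin] += 1
    return {origin: domestic[origin] / totals[origin] for origin in totals}


# Made-up (origin, nation) pairs for illustration only.
example = [
    ("Italy", "Italy"),
    ("Italy", "Spain"),
    ("France", "France"),
    ("France", None),
]
shares = domestic_share(example)
print(shares)
```

The comparison only works where both mappings use the same country names: an origin of 'Great Britain' never equals a club country of 'Scotland' or 'Wales', and 'World' origins never match any club country.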

