# Multi-Stage Extraction of Metadata from Saxon OER Content

Sebastian Zug, André Dietrich 
TU Bergakademie Freiberg, Institut für Informatik

In [42]:
import pandas as pd
from pathlib import Path

## Step 1: Assessing the metadata of OER content from the OPAL LMS

In [43]:
data_folder = "/mnt/9cd5c6a1-07f3-4580-be34-8d8dd9d6fe6d/Connected_Lecturers/Opal/raw/"
data_file_attribs = "OPAL_files_attrib.p"
data_file_meta = "OPAL_files_meta.p"

In [44]:
df_files = pd.read_pickle(Path(data_folder) / data_file_attribs)

In [45]:
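def generateEmptinessStatistics(df):
    """Count empty-string cells per column and their share of all rows."""
    # Only '' counts as empty here; NaN values are not counted.
    df_empty = df.apply(lambda x: x == '').sum(axis=0).rename("count").to_frame()
    # Despite the column name, this is a fraction in [0, 1], not a percentage.
    df_empty['empty_in_percent'] = df_empty['count'] / len(df)
    return df_empty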

In [46]:
generateEmptinessStatistics(df_files)

Unnamed: 0,count,empty_in_percent
opal:filename,0,0.0
opal:oer_permalink,0,0.0
opal:license,0,0.0
opal:creator,13737,0.978419
opal:title,13695,0.975427
opal:comment,13848,0.986325
opal:language,14007,0.99765
opal:publicationMonth,13826,0.984758
opal:publicationYear,13826,0.984758
pipe:ID,0,0.0
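
`generateEmptinessStatistics` only counts exact empty strings. If a column also contained `NaN` or whitespace-only values (an assumption; the OPAL export may not), a slightly more tolerant variant could look like the sketch below. The function name `emptiness_statistics` and the toy data are hypothetical and not part of the OPAL data set.

```python
import pandas as pd


def emptiness_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Count values per column that are NaN, empty, or whitespace-only."""
    # NaN is detected via isna(); the string cast is only used for the
    # whitespace check, so non-string columns do not raise an error.
    is_blank = df.apply(
        lambda col: col.isna() | (col.astype(str).str.strip() == "")
    )
    stats = is_blank.sum(axis=0).rename("count").to_frame()
    stats["empty_fraction"] = stats["count"] / len(df)
    return stats


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "creator": ["Ada", "", None, "  "],
    "title": ["A", "B", "", "D"],
})
stats = emptiness_statistics(toy)
print(stats)
```

Naming the second column `empty_fraction` makes explicit that the value lies between 0 and 1.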


So who are the role models when it comes to providing metadata?

In [47]:
df_files[df_files["opal:creator"]!=""]['opal:creator'].value_counts().head(10)

opal:creator
Frank Babick                           91
Prof. Dr.-Ing. Johann Zitzelsberger    29
Dominik Kern                           20
Monique Meier                          10
Daniela Dobeleit                        7
Benno Wessely; Frank Babick             5
Valerie Uhlig                           5
Guido Philipp                           4
Nielsen Book Data                       4
Hartmut Simmert                         3
Name: count, dtype: int64
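
Some `opal:creator` entries list several people separated by a semicolon (for example `Benno Wessely; Frank Babick`), so the ranking above counts author combinations rather than individuals. The following is a minimal sketch of a per-person count, assuming that `;` is the only separator. Commas are deliberately left alone because they may belong to a "surname, first name" notation. The helper name `split_creators` is hypothetical.

```python
from collections import Counter


def split_creators(value: str) -> list[str]:
    """Split a creator string on ';' and drop empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Two entries and their counts as shown in the ranking above.
entries = [("Frank Babick", 91), ("Benno Wessely; Frank Babick", 5)]

per_person = Counter()
for value, count in entries:
    for person in split_creators(value):
        per_person[person] += count

print(per_person.most_common())
```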

## Step 2: Conventional aggregation of metadata

Which file types are actually represented in the OPAL OER data?

In [48]:
df_files["pipe:file_type"].value_counts().head(10)

pipe:file_type
pdf     6992
jpg     1237
mkv      869
mp4      602
png      563
zip      466
docx     441
html     430
pptx     224
xlsx     208
Name: count, dtype: int64

In the first series of experiments we focus on {`pdf`, `docx`, `pptx`, `xlsx`}. This choice was motivated by a spot check of the media files, which contained no metadata at all.

In [49]:
relevant_file_type = ['pdf', 'docx', 'pptx', 'xlsx']
df_files[df_files["pipe:file_type"].isin(relevant_file_type)]["pipe:file_type"].shape[0] / df_files.shape[0]

0.5601851851851852
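
This notebook does not show how `pipe:file_type` is derived. As a minimal sketch, assuming it corresponds to the lower-cased file name extension, the type and the share of relevant files could be computed as follows. The helper `file_type` and the names `Folien.PPTX` and `README` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def file_type(filename: str) -> str:
    """Return the lower-cased extension without the dot, or '' if none."""
    return Path(filename).suffix.lower().lstrip(".")


names = ["beleg.pdf", "Bereich III.pdf", "Folien.PPTX", "README"]
counts = Counter(file_type(name) for name in names)

relevant = {"pdf", "docx", "pptx", "xlsx"}
# Booleans sum as 0/1, so this is the fraction of relevant files.
share = sum(file_type(name) in relevant for name in names) / len(names)
print(counts, share)
```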

Let us look at the results.

In [50]:
df_meta = pd.read_pickle(Path(data_folder) / data_file_meta)

In [51]:
df_meta.shape

(7865, 9)

In [52]:
generateEmptinessStatistics(df_meta)

Unnamed: 0,count,empty_in_percent
pipe:ID,0,0.0
pipe:file_type,0,0.0
file:author,2856,0.363128
file:keywords,6930,0.881119
file:subject,6564,0.834584
file:title,3503,0.445391
file:created,0,0.0
file:modified,0,0.0
file:language,654,0.083153


For roughly 64 percent of the files we can at least suggest an author name, since `file:author` is empty in only 36.3 percent of the cases. That already sounds promising. Let us compare the metadata from the different sources to check their quality.

In [53]:
# Left-join df_meta with df_files on "pipe:ID"
df_merge = pd.merge(df_meta, df_files, on="pipe:ID", how="left")
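
A left join silently keeps rows of `df_meta` that have no partner and duplicates rows if `pipe:ID` is not unique. The sketch below uses toy frames (hypothetical data) to show how the `validate` and `indicator` parameters of `pd.merge` could make both cases visible, assuming `pipe:ID` is meant to be unique in both tables.

```python
import pandas as pd

# Hypothetical toy frames; "c" has no counterpart in `files`.
meta = pd.DataFrame({"pipe:ID": ["a", "b", "c"], "file:author": ["X", "", "Y"]})
files = pd.DataFrame({"pipe:ID": ["a", "b"], "opal:creator": ["", "Z"]})

merged = pd.merge(
    meta,
    files,
    on="pipe:ID",
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated
    indicator=True,         # adds a "_merge" column with the match status
)

# Rows from `meta` without a matching row in `files`.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```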

Let us first look at positive examples, that is, files for which both sources provide an author entry.

In [54]:
df_merge[(df_merge["opal:creator"]!="") &  (df_merge["file:author"]!="")][["opal:creator", "file:author"]].head(15)

Unnamed: 0,opal:creator,file:author
300,ESRI,ESRI
308,Prof. Dr. Nadine Bergner,Prof. Dr. Nadine Bergner
360,Jost-Hinrich Eschenburg,Jost-Hinrich Eschenburg
379,"Roeder, Klimova, Kuhn",Institute of Medical Informatics and Biometry ...
400,Frank Babick,Frank Babick u.a.
423,Frank Babick,Frank Babick
424,Frank Babick,Frank Babick
443,Frank Babick,Frank Babick
486,Sophia Peukert; Frank Irmler,ms733714
527,Oliver Löwe,Löwe Oliver
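
Several pairs above refer to the same person but differ in word order (`Oliver Löwe` versus `Löwe Oliver`) or in suffixes such as `u.a.`. The following is a minimal sketch of a tolerant comparison based on token sets. The title list and both helper names are hypothetical, and the subset rule is deliberately lenient: a match on the surname alone counts as agreement.

```python
import re

# Hypothetical list of academic title fragments to ignore.
TITLES = {"prof", "dr", "ing", "dipl"}


def name_tokens(name: str) -> frozenset:
    """Case-folded alphabetic tokens without titles and single letters."""
    tokens = re.findall(r"[^\W\d_]+", name.casefold())
    return frozenset(t for t in tokens if len(t) > 1 and t not in TITLES)


def same_person(a: str, b: str) -> bool:
    """True if one token set is non-empty and contained in the other."""
    ta, tb = name_tokens(a), name_tokens(b)
    return bool(ta) and bool(tb) and (ta <= tb or tb <= ta)


pairs = [
    ("Oliver Löwe", "Löwe Oliver"),
    ("Frank Babick", "Frank Babick u.a."),
    ("Sophia Peukert; Frank Irmler", "ms733714"),
]
for opal_creator, file_author in pairs:
    print(opal_creator, "|", file_author, "|", same_person(opal_creator, file_author))
```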


And now the other way round: do the metadata extracted from the files help us identify the authors? The picture is mixed ... evidently we still need a test mechanism here that checks whether an entry actually refers to a person _from the academic environment_.

In [55]:
df_merge[(df_merge["opal:creator"]=="") &  (df_merge["file:author"]!="")][["opal:creator", "file:author"]].head(15)

Unnamed: 0,opal:creator,file:author
5,,Ralf Laue
6,,"Walter, Susanne (FIN A 2.3)"
10,,home
11,,home
12,,home
21,,Anja
22,,Anna
23,,P. Menzel;Hamza
24,,lschlenker
25,,"Rana M. Tamim, Robert M. Bernard, Eugene Borok..."
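
Values such as `home`, `Anja` or `lschlenker` are account or first names rather than full author names. The sketch below is a cheap pre-filter that could precede the test mechanism mentioned above. It does not verify any academic affiliation; it only assumes that a usable entry has at least two alphabetic tokens and contains no generic account name. The blocklist and the function name are hypothetical.

```python
import re

# Hypothetical blocklist of generic account names.
GENERIC_NAMES = {"home", "admin", "user", "owner"}


def looks_like_person_name(value: str) -> bool:
    """Heuristic: at least two alphabetic tokens, none of them generic."""
    tokens = re.findall(r"[^\W\d_]+", value)
    if len(tokens) < 2:
        return False
    return not any(token.casefold() in GENERIC_NAMES for token in tokens)


candidates = ["Ralf Laue", "home", "Anja", "lschlenker", "P. Menzel;Hamza"]
plausible = [c for c in candidates if looks_like_person_name(c)]
print(plausible)
```

A filter like this only reduces noise; institution names such as `TU Dresden` would still pass and need a separate check.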


## Step 3: AI-based metadata aggregation

In [56]:
data_file_aimeta = "OPAL_ai_meta.p"
df_aimeta = pd.read_pickle(Path(data_folder) / data_file_aimeta)

In [57]:
df_aimeta.shape

(439, 6)

In [58]:
df_aimeta.head(10)

Unnamed: 0,pipe:ID,pipe:file_type,ai:author_raw,ai:author,ai:title,ai:keywords
0,8I6sM5zapD60,pdf,"Stephan Gerhold, Marcel Beyer","Stephan Gerhold, Marcel Beyer","""Übung 3 Photogrammetrie""","Fensterabstand, Photogrammetrie, Passpunkte, E..."
0,8ZICOHBmAHyQ,pdf,I don't know. The context provided does not me...,,Grundlagen der Tragwerke,"I'm happy to help! However, I need to clarify ..."
0,3ztCv-WpxJ4U,pdf,Norbert Engemaier,Norbert Engemaier,Referat,"Thesis, Essay, Referat, Prüfung, Zitierweise"
0,6mOhjfscZK2A,pdf,I don't know. The provided context does not me...,,"Vorlesung Technische Mechanik I - AGBF, TU Dre...","I'm happy to help! However, I need to clarify ..."
0,1eteONeHL82Y,pdf,I don't know. The provided context does not me...,,"""Modellelemente auf 'beste Größe' bringen""","Konnektoren, Ereignisse, Syntaxregeln, Prozess..."
0,1mjbqKfwSW7U,pdf,TU Dresden,TU Dresden,"""E-Learning an der TU Dresden""","I can help you with that! However, I need to c..."
0,1BruMQFjEIRY,pdf,"The author of the document ""1BruMQFjEIRY.pdf"" ...","The author of the document ""1BruMQFjEIRY.pdf"" ...",Erste Schritte in OPAL,"OPAL, Navigation, Startseite, Kursangebote, Fa..."
0,1hZf0CFwsBhA,pdf,I don't know. The context does not provide any...,,"""Tafel 3: Wiederholung""","I'd be happy to help! However, I need to clari..."
0,1PrtXFruGers,pdf,I don't know. The context only provides inform...,,"""Interfaces geben keine Struktur vor, sondern ...","Interfaces, Klassen, Generische Schnittstellen..."
0,1LvS8_gxjyYc,pdf,I don't know. The context provided does not me...,,"""Zusammenfassung: IDisposable, IComparable und...","Ich kann leider keine Datei mit dem Namen ""1Lv..."
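
Many raw model answers above are not metadata at all but refusals or filler such as `I don't know. ...` or `I'm happy to help! However, ...`. The following is a minimal sketch of a cleaning step that maps such answers to an empty string and strips surrounding quotation marks. The phrase list is taken from the outputs shown above and is certainly not exhaustive; the function name `clean_ai_value` is hypothetical.

```python
import re

# Opening phrases observed in the outputs above (not exhaustive).
NON_ANSWER = re.compile(
    r"^(i don't know|i'm happy to help|i'd be happy to help|"
    r"i can help you with that|the author of the document|"
    r"ich kann leider|keine antwort)",
    re.IGNORECASE,
)


def clean_ai_value(raw: str) -> str:
    """Return '' for non-answers, otherwise the value without outer quotes."""
    text = raw.strip().strip('"').strip()
    return "" if NON_ANSWER.match(text) else text


samples = [
    "I don't know. The context provided does not mention the author.",
    '"Übung 3 Photogrammetrie"',
    "Norbert Engemaier",
]
cleaned = [clean_ai_value(s) for s in samples]
print(cleaned)
```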


In [59]:
df_merge_all = pd.merge(df_aimeta, df_merge, on="pipe:ID", how="left")
df_merge_all

Unnamed: 0,pipe:ID,pipe:file_type,ai:author_raw,ai:author,ai:title,ai:keywords,pipe:file_type_x,file:author,file:keywords,file:subject,...,opal:filename,opal:oer_permalink,opal:license,opal:creator,opal:title,opal:comment,opal:language,opal:publicationMonth,opal:publicationYear,pipe:file_type_y
0,8I6sM5zapD60,pdf,"Stephan Gerhold, Marcel Beyer","Stephan Gerhold, Marcel Beyer","""Übung 3 Photogrammetrie""","Fensterabstand, Photogrammetrie, Passpunkte, E...",pdf,,,,...,beleg.pdf,https://bildungsportal.sachsen.de/opal/oer/8I6...,CC BY-SA 4.0 Int.,,,,,,,pdf
1,8ZICOHBmAHyQ,pdf,I don't know. The context provided does not me...,,Grundlagen der Tragwerke,"I'm happy to help! However, I need to clarify ...",pdf,,,,...,Bereich III.pdf,https://bildungsportal.sachsen.de/opal/oer/8ZI...,CC BY-NC-ND 4.0 Int.,,,,,,,pdf
2,3ztCv-WpxJ4U,pdf,Norbert Engemaier,Norbert Engemaier,Referat,"Thesis, Essay, Referat, Prüfung, Zitierweise",pdf,,,,...,000 Prüfungsleistungen Engemaier.pdf,https://bildungsportal.sachsen.de/opal/oer/3zt...,CC BY-SA 4.0 Int.,,,,,,,pdf
3,6mOhjfscZK2A,pdf,I don't know. The provided context does not me...,,"Vorlesung Technische Mechanik I - AGBF, TU Dre...","I'm happy to help! However, I need to clarify ...",pdf,,,,...,Bereich I.pdf,https://bildungsportal.sachsen.de/opal/oer/6mO...,CC BY-NC-ND 4.0 Int.,,,,,,,pdf
4,1eteONeHL82Y,pdf,I don't know. The provided context does not me...,,"""Modellelemente auf 'beste Größe' bringen""","Konnektoren, Ereignisse, Syntaxregeln, Prozess...",pdf,Ralf Laue,,,...,GPM-2018-01-Einfuehrung-EPK.pdf,https://bildungsportal.sachsen.de/opal/oer/1et...,CC BY-NC 4.0 Int.,,,,,,,pdf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
434,4Sq2K1faySBw,pdf,Christoph Scheffel,Christoph Scheffel,ES Experimentalpsychologisches Arbeiten 6. Abs...,"I can help you with that! However, I need to c...",pdf,Diana Vogel,,,...,ES_ExPra2_06_Abschluss.pdf,https://bildungsportal.sachsen.de/opal/oer/4Sq...,CC BY 4.0 Int.,,,,,,,pdf
435,6m4UxzhXYoDg,pdf,Dr.-Ing. Martin Kache,Dr.-Ing. Martin Kache,Fahrdynamik für Verkehrsingenieure,"I can help you with that! However, I need to c...",pdf,"Dr.-Ing. Martin Kache, Professur für Technik s...",Fahrdynamik,"Fahrdynamik, Luftwiderstand, Fahrzeugwiderstan...",...,Folien_Fahrdynamik_VIW_2019_VL_03.pdf,https://bildungsportal.sachsen.de/opal/oer/6m4...,CC BY-NC-ND 4.0 Int.,,,,,,,pdf
436,1pic31NO7y20,pdf,I don't know. The context provided does not me...,,"Keine Antwort, da keine Datei mit diesem Titel...","Lesen, Schreiben, Suchen, Stream-Klasse, C#",pdf,home,,,...,17.pdf,https://bildungsportal.sachsen.de/opal/oer/1pi...,CC BY 4.0 Int.,,,,,,,pdf
437,1tRioCE7tA_U,pdf,Lutz Hellmig,Lutz Hellmig,Grundlagen zur Didaktik der Informatik WS 2018...,"Informatik, Didaktik, Berufswissenschaft, Bild...",pdf,Dr. Lutz Hellmig,,,...,01 Einführung in die Didaktik der Informatik.pdf,https://bildungsportal.sachsen.de/opal/oer/1tR...,CC BY-NC-SA 4.0 Int.,,,,,,,pdf


In [60]:
df_merge_all[ (df_merge_all["opal:creator"]!="") &
             ((df_merge_all["file:author"]!="") | (df_merge_all["ai:author"]!=""))][["opal:creator", "file:author", "ai:author"]].tail(20)

Unnamed: 0,opal:creator,file:author,ai:author
245,ESRI,ESRI,
252,Prof. Dr. Nadine Bergner,Prof. Dr. Nadine Bergner,Prof. Dr. Nadine Bergner
296,Jost-Hinrich Eschenburg,Jost-Hinrich Eschenburg,J.-H. Eschenburg
311,"Roeder, Klimova, Kuhn",Institute of Medical Informatics and Biometry ...,
331,Frank Babick,Frank Babick u.a.,
353,Frank Babick,Frank Babick,Frank Babick
354,Frank Babick,Frank Babick,Frank Babick
371,Frank Babick,Frank Babick,Frank Babick
403,Sophia Peukert; Frank Irmler,ms733714,MANDY SCHÜTZ


In [61]:
df_merge_all[ (df_merge_all["opal:creator"]=="") &
             ((df_merge_all["file:author"]!="") | (df_merge_all["ai:author"]!=""))][["opal:creator", "file:author", "ai:author"]].tail(20)

Unnamed: 0,opal:creator,file:author,ai:author
417,,home,
418,,home,
419,,,Hamann/Meinhold
420,,banusch,"W. Domschke, u.a."
421,,MWill,The author of the document is not specified in...
422,,MWill,
423,,MWill,Bernd Delakowitz
424,,Andreas Sommer,Prof. Dr. Bernd Delakowitz / Dipl. Ing. Eric S...
425,,,M. Hamann
426,,alexander,Alexander Eychmüller
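
With three sources available, one option is to pick the first plausible author per file. The sketch below assumes a priority of `opal:creator`, then `file:author`, then `ai:author`. Both this ordering and the simple two-word plausibility rule are assumptions made for illustration, not something the data prescribes. The rows are taken from the two tables above.

```python
PRIORITY = ["opal:creator", "file:author", "ai:author"]  # assumed ordering


def has_two_words(value: str) -> bool:
    """Hypothetical plausibility rule: at least two whitespace-separated words."""
    return len(value.split()) >= 2


def pick_author(row: dict) -> tuple:
    """Return (source, value) of the first plausible entry, else (None, '')."""
    for source in PRIORITY:
        value = row.get(source, "")
        if value and has_two_words(value):
            return source, value
    return None, ""


rows = [
    {"opal:creator": "Frank Babick", "file:author": "Frank Babick", "ai:author": "Frank Babick"},
    {"opal:creator": "", "file:author": "alexander", "ai:author": "Alexander Eychmüller"},
    {"opal:creator": "", "file:author": "home", "ai:author": ""},
]
picked = [pick_author(row) for row in rows]
print(picked)
```

On a DataFrame, the same function could be applied row by row with `df.apply(lambda r: pick_author(r.to_dict()), axis=1)`.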
