OData API

https://oda.ft.dk/Home/OdataQuery

https://www.odata.org/

https://www.ft.dk/-/media/sites/ft/pdf/dokumenter/aabne-data/oda-browser_brugervejledning.ashx

https://www.odata.org/documentation/odata-version-3-0/url-conventions/

https://www.odata.org/documentation/odata-version-3-0/odata-version-3-0-core-protocol/

Source root Link: "https://oda.ft.dk/api/"

ftp://oda.ft.dk/ODAXML/Referat/samling/
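
As a rough illustration of the OData 3.0 URL conventions linked above, the sketch below builds a query URL against the source root. The helper name `build_odata_url` and the entity name `Sag` are assumptions made for illustration; the system query options `$top`, `$filter` and `$format` come from the OData specification. Nothing is sent over the network.

```python
from urllib.parse import quote

ODA_ROOT = "https://oda.ft.dk/api/"  # source root given above


def build_odata_url(entity, root=ODA_ROOT, **options):
    """Build an OData query URL; keyword 'top' becomes the '$top' option, and so on."""
    url = root.rstrip("/") + "/" + quote(entity)
    if options:
        # Percent-encode the values (spaces become %20) but keep the '$' prefix readable.
        query = "&".join(
            "${}={}".format(name, quote(str(value), safe="'"))
            for name, value in options.items()
        )
        url += "?" + query
    return url


# 'Sag' is an entity name assumed only for illustration.
print(build_odata_url("Sag", top=5, format="json"))
```

The resulting string can be passed to `requests.get` once the entity and field names have been checked in the ODA browser.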


## Folketinget data

The data on the meetings in Folketinget is collected by following the guide on Folketinget's homepage:
https://www.ft.dk/-/media/sites/ft/pdf/dokumenter/aabne-data/oda-browser_brugervejledning.ashx
Transcripts of all meetings going back to 2009 are available for download in XML format over FTP, using the login described in the guide and an FTP client. Because the aim is to compare topics from Twitter with topics in Folketinget, only the transcripts from the same period as the tweets are used for further analysis. 71% of the available tweets were posted in 2018 or later.

The parliamentary years in Folketinget do not follow the calendar year: each one starts and ends on the first Tuesday of October. The meetings from 2018 therefore run from October 2017 to October 2018. The data from 2017 is kept in order to have slightly more data, and because topics are assumed to vary little enough for it to remain relevant.
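
The date rule can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the labelling follows the convention used in the text, where a parliamentary year is named after the calendar year in which it ends.

```python
from datetime import date, timedelta


def first_tuesday_of_october(year):
    """Return the date of the first Tuesday in October of the given year."""
    first = date(year, 10, 1)
    # weekday(): Monday is 0, Tuesday is 1
    return first + timedelta(days=(1 - first.weekday()) % 7)


def session_label(day):
    """Label a date by the calendar year in which its parliamentary year ends."""
    if day >= first_tuesday_of_october(day.year):
        return day.year + 1
    return day.year


print(first_tuesday_of_october(2017))     # 2017-10-03
print(session_label(date(2017, 11, 15)))  # 2018
```

Applied to the `StartDateTime` of a speech, `session_label` would assign a meeting in November 2017 to the 2018 parliamentary year.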

In total, 284 meeting transcripts amounting to 256.6 MB of data have been downloaded and parsed for analysis.
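
Figures like these can be checked by walking the download folder. The sketch below is one way to do it; `summarise_transcripts` is a hypothetical helper, the folder name `samling` is taken from the path used later in the notebook, and 1 MB is assumed to mean 10^6 bytes.

```python
import os


def summarise_transcripts(folder):
    """Return (number of .xml files, total size in MB) found below folder."""
    count = 0
    total_bytes = 0
    for subdir, _dirs, files in os.walk(folder):
        for name in files:
            if name.endswith(".xml"):
                count += 1
                total_bytes += os.path.getsize(os.path.join(subdir, name))
    return count, total_bytes / 1e6


# os.walk yields nothing for a missing folder, so this prints zeros if 'samling' is absent.
n_files, size_mb = summarise_transcripts(os.path.join(os.getcwd(), "samling"))
print("{} transcripts, {:.1f} MB".format(n_files, size_mb))
```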

It is not stated how the transcripts are produced; however, they are proofread before publication.


In [16]:
#pip install bs4 lxml

In [15]:
import requests
import urllib.request
import os
import io
import json

import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

from bs4 import BeautifulSoup as bs
from bs4.element import Comment
import lxml

import xml.etree.ElementTree as et 
from xml.dom import minidom

import re

In [17]:
path = os.getcwd()

In [18]:
path = path+'/samling/'

## Parsing XML

In [153]:
starttid_vec = []
sluttid_vec=[]
navn_vec = []
efternavn_vec = []
tekst_vec = []
parti_vec = []
rolle_vec = []

# Go through all the folders and files containing transcripts
for subdir, dirs, files in os.walk(path):
    for file in files:
        filepath = os.path.join(subdir, file)
        if filepath.endswith(".xml"):
            tree = et.parse(filepath)
            root = tree.getroot()

            # Do this if the speaker has an associated party, i.e. not a minister or the chairman
            for speech in root.iter('Tale'):
                tags=[node.tag for node in speech.iter()]
                if('TekstGruppe' in tags):
                    if('GroupNameShort' in tags):
                        for node in speech.iter():
                            if(node.tag == 'StartDateTime'):
                                starttid_vec.append(node.text)
                            elif(node.tag == 'OratorFirstName'):
                                navn_vec.append(node.text)
                            elif(node.tag == 'OratorLastName'):
                                efternavn_vec.append(node.text)
                            elif(node.tag == 'OratorRole'):
                                rolle_vec.append(node.text)
                            elif(node.tag=='TekstGruppe'):
                                # This tag is peculiar: it has several levels of children, with 'Char' as the innermost.
                                # Concatenate all text within 'TekstGruppe'.
                                tekstgruppe = ''
                                for k in node.iter():
                                    if k.tag=="Char":
                                        # k.text is None for an empty 'Char' element
                                        tekstgruppe+=' '+(k.text or '')
                                tekst_vec.append(tekstgruppe)
                            elif(node.tag=='GroupNameShort'):
                                parti_vec.append(node.text)

                            # If there are more texts than speaker names, remove the excess.
                            # We will lose a few pieces of text, but only around 1%
                            if len(tekst_vec)>len(navn_vec):
                                tekst_vec.pop()
                            if len(starttid_vec)>len(navn_vec):
                                starttid_vec.pop()
                                
                    # Do this if the speaker is either a minister or the chairman
                    else:
                        parti_vec.append('Ukendt')
                        for node in speech.iter():
                            if(node.tag == 'StartDateTime'):
                                starttid_vec.append(node.text)
                            elif(node.tag == 'OratorFirstName'):
                                navn_vec.append(node.text)
                            elif(node.tag == 'OratorLastName'):
                                efternavn_vec.append(node.text)
                            elif(node.tag == 'OratorRole'):
                                rolle_vec.append(node.text)
                            elif(node.tag=='TekstGruppe'):
                                # This tag is peculiar: it has several levels of children, with 'Char' as the innermost.
                                # Concatenate all text within 'TekstGruppe'.
                                tekstgruppe = ''
                                for k in node.iter():
                                    if k.tag=="Char":
                                        # k.text is None for an empty 'Char' element
                                        tekstgruppe+=' '+(k.text or '')
                                tekst_vec.append(tekstgruppe)
                                
                            # If there are more texts than speaker names, remove the excess.
                            # We will lose a few pieces of text, but only around 1%
                            if len(tekst_vec)>len(navn_vec):
                                tekst_vec.pop()
                            if len(starttid_vec)>len(navn_vec):
                                starttid_vec.pop()

# Add everything to a dictionary and transform it into a Pandas dataframe
dictionary= {'StartDateTime':starttid_vec,'OratorFirstName':navn_vec,'OratorLastName':efternavn_vec,
                 'GroupNameShort':parti_vec,'OratorRole':rolle_vec ,'TekstGruppe':tekst_vec} # 'EndDateTime':sluttid_vec,
df = pd.DataFrame(dictionary)

print('In total, there are {} lines of text in the meetings.'.format(len(df)))
#df.to_csv(path[0:26]+'csv')

143075
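
One way to avoid the length mismatches that the loop above has to patch up is to build one record per `Tale` element, so that the fields of a speech cannot drift apart. The sketch below does this on a tiny hand-written XML string. The nesting of the sample and the `Moede` and `Wrapper` tags are assumptions; only the tag names used in the loop above are taken from the notebook. Unlike the loop, it joins all `Char` text of a speech into a single string. It is an alternative illustration, not the code that produced the figures in this notebook.

```python
import xml.etree.ElementTree as et

# Hand-written sample; 'Moede' and 'Wrapper' are placeholder tags.
SAMPLE = """
<Moede>
  <Tale>
    <OratorFirstName>Anna</OratorFirstName>
    <OratorLastName>Jensen</OratorLastName>
    <GroupNameShort>XX</GroupNameShort>
    <OratorRole>medlem</OratorRole>
    <StartDateTime>2019-02-27T13:01:45</StartDateTime>
    <TekstGruppe><Wrapper><Char>Tak for det.</Char><Char/><Char>Jeg er enig.</Char></Wrapper></TekstGruppe>
  </Tale>
  <Tale>
    <OratorFirstName>Bo</OratorFirstName>
    <OratorLastName>Hansen</OratorLastName>
    <OratorRole>minister</OratorRole>
    <StartDateTime>2019-02-27T13:02:06</StartDateTime>
    <TekstGruppe><Char>Tak.</Char></TekstGruppe>
  </Tale>
</Moede>
"""

FIELDS = ["StartDateTime", "OratorFirstName", "OratorLastName", "OratorRole", "GroupNameShort"]


def parse_speeches(root):
    """Return one dict per <Tale> that contains a <TekstGruppe>."""
    records = []
    for speech in root.iter("Tale"):
        if speech.find(".//TekstGruppe") is None:
            continue
        # findtext returns the text of the first matching descendant, or None if absent
        record = {field: speech.findtext(".//" + field) for field in FIELDS}
        if record["GroupNameShort"] is None:
            record["GroupNameShort"] = "Ukendt"  # same placeholder as in the loop above
        # Join the text of every <Char> descendant, skipping empty elements
        chars = [c.text for c in speech.iter("Char") if c.text]
        record["TekstGruppe"] = " ".join(chars)
        records.append(record)
    return records


records = parse_speeches(et.fromstring(SAMPLE))
for r in records:
    print(r["OratorFirstName"], r["GroupNameShort"], r["TekstGruppe"])
```

Passing `records` to `pd.DataFrame` would give the same columns as the dictionary above.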


In [127]:
df.head()

Unnamed: 0,StartDateTime,OratorFirstName,OratorLastName,GroupNameShort,OratorRole,TekstGruppe
0,2019-02-27T13:00:08,Pia,Kjærsgaard,DF,formand,Mødet er åbnet.
1,2019-02-27T13:01:34,Pia,Kjærsgaard,DF,formand,Det første spørgsmål er til justitsministeren...
2,2019-02-27T13:01:43,Pia,Kjærsgaard,DF,formand,Værsgo for at oplæse spørgsmålet.
3,2019-02-27T13:01:45,Christian,Langballe,DF,medlem,Spørgsmålet lyder: Vil regeringen stå fast på...
4,2019-02-27T13:02:05,Pia,Kjærsgaard,DF,formand,"Værsgo, ministeren."
5,2019-02-27T13:02:06,Søren Pape,Poulsen,,minister,Tak for det. Og tak for spørgsmålet. Fremmedk...
6,2019-02-27T13:03:39,Pia,Kjærsgaard,DF,formand,"Tak. Værsgo, hr. Christian Langballe."
7,2019-02-27T13:03:40,Christian,Langballe,DF,medlem,Det står tilbage at få besvaret nogle ting i ...
8,2019-02-27T13:04:53,Pia,Kjærsgaard,DF,formand,Tak. Og værsgo.
9,2019-02-27T13:04:55,Søren Pape,Poulsen,,minister,"Tak. Jeg skal prøve, så godt jeg kan – det va..."


In [128]:
### Cleaning the data ###

# The chairman (formand) adds no content to the debate, as he or she only moderates it
# by passing the floor from speaker to speaker.
# Some metadata on when the meeting ended (MødeSlut) is also included. This is irrelevant as well.
df=df[df['OratorRole']!='formand']
df=df[df['OratorRole']!='MødeSlut']
df.reset_index(drop=True,inplace=True)

# Generating a column of the full name of the speakers
df['FullName'] = df['OratorFirstName']+' '+df['OratorLastName']

# Converting the start time to datetime64 format
# (pd.to_datetime is used because dateutil is never imported in this notebook)
df['StartDateTime'] = pd.to_datetime(df['StartDateTime'])

In [130]:
df.groupby(['FullName'])['TekstGruppe'].apply(lambda x : ' '.join(x))
#df = df.drop_duplicates()

FullName
Aaja Chemnitz Larsen      Mange tak. Inuit Ataqatigiit støtter også lov...
Aki-Matilda Høegh-Dam     Tak for det. Først har jeg en kort kommentar ...
Aleqa Hammond             Grønland er i forandring, rigsfællesskabet er...
Alex Ahrendtsen           Tak. Det åbne spørgsmål er jo, hvad Socialdem...
Alex Vanopslagh           (Talen er under udarbejdelse)  (Talen er unde...
                                               ...                        
Victoria Velasquez        Tak for det. Det her ændringsforslag handler ...
Villum Christensen        Tak for det. Lad mig starte med at sige, at L...
Yildiz Akdogan            Tak for det, formand. Enhedslisten ønsker med...
Zenia Stampe              Tak for det. Der er jo ingen tvivl om, at sko...
Øjvind Vilsholm           Ordføreren siger, at hovedbegrundelsen for at...
Name: TekstGruppe, Length: 263, dtype: object

In [131]:
df.head()

Unnamed: 0,StartDateTime,OratorFirstName,OratorLastName,GroupNameShort,OratorRole,TekstGruppe,FullName
0,2019-02-27T13:01:45,Christian,Langballe,DF,medlem,Spørgsmålet lyder: Vil regeringen stå fast på...,Christian Langballe
1,2019-02-27T13:02:06,Søren Pape,Poulsen,,minister,Tak for det. Og tak for spørgsmålet. Fremmedk...,Søren Pape Poulsen
2,2019-02-27T13:03:40,Christian,Langballe,DF,medlem,Det står tilbage at få besvaret nogle ting i ...,Christian Langballe
3,2019-02-27T13:04:55,Søren Pape,Poulsen,,minister,"Tak. Jeg skal prøve, så godt jeg kan – det va...",Søren Pape Poulsen
4,2019-02-27T13:05:49,Christian,Langballe,DF,medlem,"Altså, jeg synes jo, at de her burkabrude, so...",Christian Langballe


In [140]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68445 entries, 0 to 68444
Data columns (total 7 columns):
StartDateTime      68445 non-null datetime64[ns]
OratorFirstName    68215 non-null object
OratorLastName     68215 non-null object
GroupNameShort     55527 non-null object
OratorRole         68215 non-null object
TekstGruppe        68445 non-null object
FullName           68215 non-null object
dtypes: datetime64[ns](1), object(6)
memory usage: 3.7+ MB


## Topics

In [148]:
# Danish stop-word list from https://gist.github.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b
with open('stopord.txt') as f:
    stopwords = f.read().splitlines()

In [149]:
# convert the text to a tf-idf weighted term-document matrix

# Hyperparameters
corpus = 2000000 # There are approx. ? distinct words after stop-word removal
l1_ratio = 0
max_iter = 50000
no_components = 30

min_df = 2
max_df = 0.6

data=df['TekstGruppe']

vectorizer = TfidfVectorizer(max_features=corpus, min_df=min_df, max_df=max_df, stop_words=stopwords)

X = vectorizer.fit_transform(data)
 
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
idx_to_word = np.array(vectorizer.get_feature_names_out())


# apply Non-negative Matrix Factorization to get topics using TF-IDF
nmf = NMF(n_components=no_components, solver="mu", l1_ratio = l1_ratio, max_iter = max_iter, random_state=1)

W = nmf.fit_transform(X)

H = nmf.components_

# Print the ten highest-weighted words of each topic
for i, topic in enumerate(H):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: tusind, tale, enig, sagde, egentlig, tænke, spørge, høre, talen, udarbejdelse
Topic 2: sagt, netop, enig, gå, står, handler, hele, egentlig, går, måde
Topic 3: dansk, folk, verden, ønsker, udlændinge, flygtninge, danske, land, lande, danmark
Topic 4: fælles, parlamentet, regler, europæiske, samarbejde, domstolen, europa, tyrkiet, lande, eu
Topic 5: gruppe, ryge, hjælpe, hjælp, liv, folk, uddannelse, samfund, unge, mennesker
Topic 6: henvisning, lyder, stk, forretningsordens, punktet, dagsordenen, taget, spørgeren, spørgsmålet, udgået
Topic 7: bruge, koster, pengene, året, 100, penge, 000, mio, mia, kr
Topic 8: del, sikre, regler, ordet, mulighed, støtte, forslaget, støtter, lovforslag, lovforslaget
Topic 9: parti, står, støtter, politik, regering, spørge, stemme, venstres, radikale, venstre
Topic 10: udsatte, skole, barnets, familier, barnet, forældrene, barn, børnene, forældre, børn
Topic 11: olesen, politik, ole, birk, stemme, regering, støtter, alliances, alliance, liberal
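
Because `argsort()` sorts in ascending order, each line above lists its ten words from weakest to strongest, so the most characteristic word of a topic is the last one. The pure-Python sketch below shows the same selection in descending order; `top_words`, the vocabulary and the weights are made up for illustration and are not values from the fitted model.

```python
def top_words(weights, vocabulary, k=10):
    """Return the k words with the highest weight, strongest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [vocabulary[j] for j in order[:k]]


# Made-up weights for one topic over a five-word vocabulary
vocabulary = ["eu", "børn", "lande", "skole", "europa"]
topic = [0.9, 0.0, 0.4, 0.1, 0.6]

print(top_words(topic, vocabulary, k=3))  # ['eu', 'europa', 'lande']
```

With the notebook's arrays, the equivalent is `idx_to_word[topic.argsort()[::-1][:10]]`.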


In [None]:
# Write to a .csv file

#with open('topics_and_weights.csv', 'w', newline='', encoding='utf-16') as csvfile:
 #   topicwriter = csv.writer(csvfile, delimiter=',',
  #                          quotechar='"', quoting=csv.QUOTE_MINIMAL)
# Create the header
   # topicwriter.writerow(['Topic','Word','Weight'])

In [None]:
# Write the topics, words and weights to the .csv file for Tableau
 
#for i, topic in enumerate(H):
 
 #   max_weights = topic.argsort()[-10:]
  #  words = [str(x) for x in idx_to_word[topic.argsort()[-10:]]]
     
   # for count, item in enumerate(max_weights):
        
    #    with open('topics_and_weights.csv', 'a', newline='', encoding='utf-16') as csvfile:
     #       topicwriter = csv.writer(csvfile, delimiter=',',
      #                      quotechar='"', quoting=csv.QUOTE_MINIMAL)
       #     topicwriter.writerow(["Topic"+str(i+1) ,words[count],H[i,item]])
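
The commented-out cells above reopen the output file for every row. A minimal sketch of the same export that writes through a single file handle is given below, using only the standard library. `write_topics` and the toy inputs are assumptions for illustration; in the notebook, `H` and `idx_to_word` would take their place.

```python
import csv
import io


def write_topics(csvfile, H, idx_to_word, n_words=10):
    """Write one (Topic, Word, Weight) row per top word of each topic."""
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Topic', 'Word', 'Weight'])
    for i, topic in enumerate(H):
        # Indices of the n_words largest weights, strongest first
        order = sorted(range(len(topic)), key=lambda j: topic[j], reverse=True)[:n_words]
        for j in order:
            writer.writerow(['Topic' + str(i + 1), idx_to_word[j], topic[j]])


# Toy inputs standing in for nmf.components_ and the vocabulary
H_toy = [[0.2, 0.7], [0.5, 0.1]]
words_toy = ['eu', 'børn']

# An in-memory buffer keeps the demonstration free of side effects
buffer = io.StringIO()
write_topics(buffer, H_toy, words_toy, n_words=1)
print(buffer.getvalue())
```

To produce the file itself, the buffer can be replaced by a handle from `open('topics_and_weights.csv', 'w', newline='', encoding='utf-16')`, as in the commented-out cell.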

In [None]:
#stopwords_edit = open('stopord.txt','a')
#stopwords_edit.write('\nenhedslisten')
#stopwords_edit.close()
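
The cell above appends the word unconditionally, so rerunning it adds the same word again. A small sketch that only appends words not already listed is shown below; `add_stopword` is a hypothetical helper, and UTF-8 is an assumed file encoding.

```python
import os
import tempfile


def add_stopword(path, word):
    """Append word to the stop-word file unless it is already listed. Return True if added."""
    with open(path, encoding='utf-8') as f:
        existing = set(f.read().splitlines())
    if word in existing:
        return False
    with open(path, 'a', encoding='utf-8') as f:
        f.write('\n' + word)
    return True


# Demonstration on a temporary file rather than the real stopord.txt
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, 'stopord.txt')
    with open(demo, 'w', encoding='utf-8') as f:
        f.write('og\ni')
    print(add_stopword(demo, 'enhedslisten'))  # True
    print(add_stopword(demo, 'enhedslisten'))  # False
```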