# Extracting text from German party programmes

How often do certain keywords come up in the different party programmes? The goal here is a dataframe summarizing the different party programmes.

As a first step, we import the required packages, read in the data, remove stopwords, and plot some relationships.

- ``tika`` extracts the content of each party's programme from the stored PDF files
- ``glob`` finds every file stored in the party programme folder
- ``nltk`` supports various NLP tasks; here it removes German stopwords and splits the text into words
- ``matplotlib`` creates plots visualizing the similarity between Tweets and the party programmes' content
- ``pandas`` handles the dataframe transformations
- ``collections`` provides the ``Counter`` class, which counts how often each word appears in the programmes

In [1]:
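# Packages
!pip install tika
from tika import parser
import glob
import nltk
from nltk.corpus import stopwords
# German stopword list used for filtering below
nltk.download('stopwords')
# word_tokenize additionally needs NLTK's Punkt tokenizer models
# ('punkt', or 'punkt_tab' in recent NLTK releases) if they are not installed yet
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter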



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Printing names of considered party programmes

The `glob` library is used to print all files located in the `party_programmes` folder of the GitHub repository. Beforehand, we downloaded the programme of each party currently represented in the German parliament.

In [2]:
print(glob.glob('party_programmes/*'))

['party_programmes\\eyjg.pdf', 'party_programmes\\FDP_BTW2021_Wahlprogramm_1.pdf', 'party_programmes\\gruene.txt', 'party_programmes\\programm_der_partei_die_linke_erfurt2011_druckfassung2020.pdf', 'party_programmes\\Wahlprogramm-DIE-GRUENEN-Bundestagswahl-2021_barrierefrei.pdf']
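The listing above also contains `gruene.txt`, and the backslashes in the paths are specific to Windows. A minimal sketch of a narrower listing, assuming only the PDF files are of interest; the helper name `list_programme_pdfs` is hypothetical and not part of the original notebook:

```python
import glob
import os


def list_programme_pdfs(folder="party_programmes"):
    """Return the sorted file names of all PDFs directly inside `folder`."""
    pattern = os.path.join(folder, "*.pdf")
    # basename drops the folder part, so the result looks the same on every OS
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


print(list_programme_pdfs())
```

If the folder does not exist, `glob.glob` simply returns an empty list, so the helper prints `[]` instead of raising an error.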


### Storing the party programmes 

Here, the paths of all programme files are first stored in a new object. In the second step, the `parser.from_file` function is used to create a separate object for each party's programme.

In [None]:
files = glob.glob('party_programmes/*')


spd_raw = parser.from_file('party_programmes/SPD-Zukunftsprogramm.pdf')
cdu_raw = parser.from_file('party_programmes/eyjg.pdf')
afd_raw = parser.from_file('party_programmes/20210611_AfD_Programm_2021.pdf')
fdp_raw = parser.from_file('party_programmes/FDP_BTW2021_Wahlprogramm_1.pdf')
linke_raw = parser.from_file('party_programmes/DIE_LINKE_Wahlprogramm_zur_Bundestagswahl_2021.pdf')
gruene_raw = parser.from_file('party_programmes/Wahlprogramm-DIE-GRUENEN-Bundestagswahl-2021_barrierefrei.pdf')
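The six calls above could also be written as a loop over the folder. The following is only a sketch under stated assumptions: `load_programmes` is a hypothetical helper, the parsing function is passed in as an argument so the sketch does not depend on a running Tika installation, and it assumes that the parsed result is a dictionary whose `content` entry may be `None` when no text could be extracted.

```python
import glob
import os


def load_programmes(folder, parse_file):
    """Parse every PDF in `folder`; keys are the file names without extension."""
    programmes = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        name = os.path.splitext(os.path.basename(path))[0]
        parsed = parse_file(path)
        # Assumption: 'content' can be None, so fall back to an empty string
        programmes[name] = (parsed.get("content") or "").strip()
    return programmes


# In the notebook this would be called as (not executed here):
# from tika import parser
# programmes = load_programmes("party_programmes", parser.from_file)
```

Keeping the texts in one dictionary makes it easy to apply the same cleaning steps to every programme later on, at the cost of less readable keys than hand-picked names such as `spd_raw`.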

### First transformations

In the next chunk, the programmes are cleaned for further analysis: leading and trailing whitespace is stripped, the `word_tokenize` function splits the text into words, and German stopwords are removed. These steps matter because stopwords carry little information about a programme's content; removing them lets us focus on the substantive terms. Here, we do this for the SPD programme as an example.

In [7]:

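# strip leading and trailing whitespace (including line breaks) from the extracted text
nonnewline = spd_raw['content'].strip()

# separate the text into individual words
text_tokens = word_tokenize(nonnewline)

# delete all stopwords; the list is built once as a set instead of on every iteration
german_stopwords = set(stopwords.words('german'))
tokens_without_sw = [word for word in text_tokens if word not in german_stopwords]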
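Two details of the cell above are easy to overlook: NLTK's German stopword list is written in lower case, so a capitalized "Die" at the start of a sentence is not removed, and `word_tokenize` returns punctuation marks as tokens of their own. A minimal sketch of a stricter filter follows. `clean_tokens` and the tiny `GERMAN_STOPWORDS` set are hypothetical stand-ins so that the example runs without NLTK; in the notebook one would pass `set(stopwords.words('german'))` instead.

```python
# Stand-in for set(stopwords.words('german')); only a handful of entries
GERMAN_STOPWORDS = {"die", "der", "und", "ist", "des", "eine"}


def clean_tokens(tokens, stop_words=GERMAN_STOPWORDS):
    """Keep purely alphabetic tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


tokens = ["Die", "Folgen", "des", "Klimawandels", "und", "die", "Energiewende", "."]
cleaned = clean_tokens(tokens)
print(cleaned)  # ['Folgen', 'Klimawandels', 'Energiewende']
```

The `isalpha` check is a deliberate simplification: it also drops hyphenated or alphanumeric tokens, which may or may not be desirable for a given keyword list.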

Here, we use the `Counter` class to count how often each remaining word occurs in the programme, excluding stopwords.

In [8]:
# count how often each remaining word occurs
counted_words = Counter(tokens_without_sw)
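A `Counter` behaves like a dictionary that maps each word to its frequency. The short sketch below uses a made-up token list (not taken from any programme) to show a few ways of inspecting it:

```python
from collections import Counter

# Made-up tokens, for illustration only
tokens = ["Klima", "Arbeit", "Klima", "Rente", "Arbeit", "Klima"]
counts = Counter(tokens)

total_tokens = sum(counts.values())   # number of tokens overall: 6
distinct_words = len(counts)          # number of different words: 3
top_two = counts.most_common(2)       # [('Klima', 3), ('Arbeit', 2)]
missing = counts["Bildung"]           # absent words count as 0, no KeyError

print(total_tokens, distinct_words, top_two, missing)
```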

Now we can manually specify words whose frequency in the respective programme we want to display. As an example, we look up the word form "Klimawandels" (the genitive of "Klimawandel", climate change) in the SPD programme and find that it appears three times.

In [9]:
# look up how often a given word occurs
counted_words["Klimawandels"]

3
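An exact lookup only finds one inflected form, so "Klimawandel" and "Klimawandels" are counted separately. One possible way to summarize keywords across parties is to count every token that starts with the keyword. This is a sketch under stated assumptions: `keyword_table` is a hypothetical helper, the token lists are invented, and prefix matching is a rough substitute for proper lemmatization (it also catches compounds such as "Rentenversicherung").

```python
from collections import Counter


def keyword_table(tokens_by_party, keywords):
    """Count, per party, all tokens that start with each keyword (case-insensitive)."""
    table = {}
    for party, tokens in tokens_by_party.items():
        counts = Counter(token.lower() for token in tokens)
        table[party] = {
            keyword: sum(n for word, n in counts.items()
                         if word.startswith(keyword.lower()))
            for keyword in keywords
        }
    return table


# Made-up token lists, for illustration only (not taken from the real programmes)
example_tokens = {
    "party_a": ["Klimawandel", "Klimawandels", "Rente", "Klimaschutz"],
    "party_b": ["Rente", "Renten", "Arbeit"],
}
table = keyword_table(example_tokens, ["Klimawandel", "Rente"])
print(table)
```

The nested dictionary has one entry per party and one count per keyword; `pandas.DataFrame.from_dict(table, orient="index")` would turn it into a dataframe with one row per party and one column per keyword.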

In [10]:
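# save the extracted text of the Greens' programme as a plain-text file;
# mode "w" overwrites the file, so re-running the cell does not append the text again
with open(r"party_programmes/gruene.txt", "w", encoding="utf-8") as file1:
    file1.write(gruene_raw['content'].strip())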