## Data pre-processing

This notebook covers the necessary steps to divide data into paragraphs and prepare them for the labeling. 
The steps that were taken to get to "full_data" from the initial datasets provided by [Marcinkiewicz and Jankowski](https://dataverse.harvard.edu/dataset.xhtml;jsessionid=205fd66a0b719b7a6c0613647c55?persistentId=doi%3A10.7910%2FDVN%2FF2PLOZ&version=&q=&fileTypeGroupFacet=%22Archive%22&fileAccess=Public), were done at the level of Excel. It included filtering rows with speeches less than 12 words and excluding all people who were not legislators nor members of the government at the time of their speech. There were also little steps necessary such as: adding a new column with only name without a title (legilator, PM etc.), assuring the names in the legislator as well as in the full_data set were the same (this also includes keeping one surname for people who did change them over time), removing the indication if the speech was delivered, dropping unnecessary columns.

In [7]:
# imports packages

import pandas as pd
import numpy as np
import os
import re

In [8]:
# sets wd
# os.chdir('C:/...')

In [9]:
# loades the data

speeches = pd.read_csv("data/full_data.csv")
legislators = pd.read_csv("data/legislators.csv")

In [4]:
joined = pd.merge(speeches, legislators,  how='left', on=['name','sejm'])

In [5]:
joined.head()

Unnamed: 0,legislator,name,debate,minister,gov_member,president_office,topic,link,speech,session,date,sejm,party_club_start,party_club_end,change
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Panie Marszałku Seniorze! Szanowne Panie! Szan...,1.0,14/10/1993,2,SLD,SLD,0.0
1,Izabela Jaruga-Nowacka,Izabela Jaruga-Nowacka,,0,0,0,poza punktami porządku,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Władysław Adamski: Ślubuję. Romuald Ajchler: Ś...,1.0,14/10/1993,2,UP,UP,0.0
2,Jan Rokita,Jan Rokita,3.0,0,0,0,Wybór wicemarszałków Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Panie Marszałku! Panie i Panowie Posłowie! Mam...,1.0,14/10/1993,2,UD,KKL,1.0
3,Jerzy Jaskiernia,Jerzy Jaskiernia,2.0,0,0,0,Pierwsze czytanie projektów uchwał w sprawie u...,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Panie Marszałku! Wysoka Izbo! Art. 4 ust. 2 re...,1.0,14/10/1993,2,SLD,SLD,0.0
4,Krzysztof Kamiński,Krzysztof Kamiński,2.0,0,0,0,Pierwsze czytanie projektów uchwał w sprawie u...,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Panie Marszałku! Pani Premier! Wysoka Izbo! Is...,1.0,14/10/1993,2,KPN,NZ,2.0


### 1.1 Spliting speeches into paragraphs

In [8]:
# drops all remaining non-politians that are not the memebr of the gov
joined = joined[~((joined['gov_member'] == 0) & joined['party_club_end'].isnull())] 

# creates the index
joined['index'] = range(1, len(joined) + 1)

In [10]:
# this is to remove all interupted text from other politicians that is usualy mentioned in the brackets
# this is to make sure the speech contains only the text of a given legislator and noone else
pattern = r'\([^)]*\)|\[[^]]*\]'
joined['speech'] = joined['speech'].str.replace(pattern," ")

  joined['speech'] = joined['speech'].str.replace(pattern," ")


In [11]:
# This function splits the text into paragaphs of approximately 100 words. It tries to split the data first by looking for the end of sentence
# punctuation mark. If it does not find it, then it devides at max_words. Similar procedue is to merge the short paragraphs up to 100.
# This function is not ideal, only later in the process I learnt there is better and faster way to do it in spaCy, which I would use if I was to re do the whole process again

def split_speech_into_paragraphs(df, speech_column, max_words=100):
    new_rows = []
    for _, row in df.iterrows():
        speech = row[speech_column]
        paragraphs = speech.split('\n\n')
        paragraph_number = 0
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph.split()) > max_words:
                chunk_words = paragraph.split()
                current_chunk = ''
                while chunk_words:
                    current_chunk += ' ' + chunk_words.pop(0)
                    if len(chunk_words) == 0 or len(current_chunk.split()) > max_words:
                        last_sentence_end = max([current_chunk.rfind('.'), current_chunk.rfind('!'), current_chunk.rfind('?')])
                        if last_sentence_end == -1:
                            last_sentence_end = max_words
                        else:
                            last_sentence_end += 1
                        paragraph_number += 1
                        new_row = row.copy()
                        new_row[speech_column] = current_chunk[:last_sentence_end].strip()
                        new_row['paragraph_number'] = paragraph_number
                        new_rows.append(new_row)
                        current_chunk = current_chunk[last_sentence_end:].strip()
            else:
                if len(new_rows) > 0 and len(new_rows[-1][speech_column].split()) + len(paragraph.split()) <= max_words:
                    merged_paragraph = new_rows[-1][speech_column] + ' ' + paragraph.strip()
                    last_sentence_end = max([merged_paragraph.rfind('.'), merged_paragraph.rfind('!'), merged_paragraph.rfind('?')])
                    if last_sentence_end == -1:
                        last_sentence_end = max_words
                    else:
                        last_sentence_end += 1
                    new_rows[-1][speech_column] = merged_paragraph[:last_sentence_end].strip()
                    new_rows[-1]['paragraph_number'] = paragraph_number
                else:
                    paragraph_number += 1
                    new_row = row.copy()
                    new_row[speech_column] = paragraph.strip()
                    new_row['paragraph_number'] = paragraph_number
                    new_rows.append(new_row)
        if len(current_chunk) > 0:
            paragraph_number += 1
            new_row = row.copy()
            new_row[speech_column] = current_chunk.strip()
            new_row['paragraph_number'] = paragraph_number
            new_rows.append(new_row)
            
    # some paragaphs got an index of zero, thus this corrects for that:
    if new_rows[-1]['paragraph_number'] == 0:
        max_paragraph_number = max(new_df['paragraph_number'])
        new_rows[-1]['paragraph_number'] = max_paragraph_number + 1
            
    return pd.DataFrame(new_rows)

In [12]:
divided=split_speech_into_paragraphs(joined,"speech")
divided

Unnamed: 0,legislator,name,debate,minister,gov_member,president_office,topic,link,speech,session,date,sejm,party_club_start,party_club_end,change,index,paragraph_number
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Panie Marszałku Seniorze! Szanowne Panie! Szan...,1.0,14/10/1993,2,SLD,SLD,0.0,1,1
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,"Tak uważaliśmy dwa lata temu, kiedy kształtowa...",1.0,14/10/1993,2,SLD,SLD,0.0,1,2
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,"Był działaczem Zrzeszenia Studentów Polskich, ...",1.0,14/10/1993,2,SLD,SLD,0.0,1,3
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Pełnił funkcję wiceprzewodniczącego Klubu Parl...,1.0,14/10/1993,2,SLD,SLD,0.0,1,4
0,Aleksander Kwaśniewski,Aleksander Kwaśniewski,1.0,0,0,0,Wybór marszałka Sejmu.,http://orka2.sejm.gov.pl/Debata2.nsf/5c30b337b...,Proszę mi jednak pozwolić powiedzieć nieco wię...,1.0,14/10/1993,2,SLD,SLD,0.0,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254450,Wojciech Król,Wojciech Król,49.0,0,0,0,Sprawozdanie Komisji Kultury i Środków Przekaz...,http://www.sejm.gov.pl/Sejm8.nsf/wypowiedz.xsp...,Czy nie będzie nam wstyd za treść uzasadnienia...,86.0,16/10/2019,8,PO-KO,PO-KO,0.0,254203,3
254450,Wojciech Król,Wojciech Król,49.0,0,0,0,Sprawozdanie Komisji Kultury i Środków Przekaz...,http://www.sejm.gov.pl/Sejm8.nsf/wypowiedz.xsp...,Ale tego szacunku nie da się zapisać w uchwale...,86.0,16/10/2019,8,PO-KO,PO-KO,0.0,254203,0
254452,Sekretarz Stanu w Ministerstwie Obrony Narodow...,Wojciech Skurkiewicz,41.0,0,1,0,Pytania w sprawach bieżących.,http://www.sejm.gov.pl/Sejm8.nsf/wypowiedz.xsp...,Pani Marszałek! Wysoka Izbo! Panie Pośle! W pi...,86.0,16/10/2019,8,PIS,PIS,0.0,254205,1
254452,Sekretarz Stanu w Ministerstwie Obrony Narodow...,Wojciech Skurkiewicz,41.0,0,1,0,Pytania w sprawach bieżących.,http://www.sejm.gov.pl/Sejm8.nsf/wypowiedz.xsp...,"To metoda oszustwa, w której przestępca podszy...",86.0,16/10/2019,8,PIS,PIS,0.0,254205,2


In [3]:
#divided.to_csv('data/divided.csv', encoding='utf-8', index=False)
divided = pd.read_csv("data/divided.csv")

### 1.2 Selecting texts for the labeling

In [13]:
first_250=divided[np.isin(divided, ['Kukiz','Konfederacja', 'PIS']).any(axis=1)]
first_250_df= first_250.sample(n=250)

In [14]:
slowa = ['władza','Polki', 'Polacy', 'Polska', 'społeczeńswo', 'kraj', 'politycy', 'rząd', 'rządzący', 'zdjacy', 'zdrada', 'elity']
second_250=divided[divided.speech.str.contains('|'.join(slowa))]
second_250_df= second_250.sample(n=250)

In [15]:
slowa_2=['suweren','obywatele', 'elity']
third_250=divided[divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_2))]
third_250_df= third_250.sample(n=250)

In [22]:
slowa_3=['elektorat']
first_50=divided[divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_3))]
first_50_df=first_50.sample(n=50)

In [26]:
last_200_df=divided.sample(n=200)

In [None]:
to_label=pd.concat([first_250_df, second_250_df,third_250_df,first_50_df,last_200_df], ignore_index=True)
to_label.drop_duplicates()
#extracted_joined.to_csv('data/to_label.csv', encoding='utf-8', index=False)

In [15]:
slowa_4 = ['suweren', 'obywatele', 'Polacy']
slowa_5 = [ 'elity', 'rzad', 'politycy']

additional_190 = divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_4)) & divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_5))
additional_190_df = divided[additional_190].sample(n=190)


In [None]:
slowa_6 = ['suweren', 'obywatele', 'Polacy']
slowa_7 = [ 'opozycja', 'zdrajcy', 'elity']

additional_120= divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_6)) & divided['speech'].str.contains('|'.join(r"\b{}\b".format(re.escape(word)) for word in slowa_7))
additional_120_df = divided[additional_120].sample(n=120)


In [None]:
to_label_additional=pd.concat([additional_190_df, additional_120_df], ignore_index=True)
to_label_additional.drop_duplicates()
#additional.to_csv('data/to_label_additional.csv', encoding='utf-8', index=False)

In [49]:
labeled = pd.read_csv("labeled.csv")
indices_to_delete = labeled.index[labeled['populism'] == 0].tolist()

indices_to_delete = np.random.choice(indices_to_delete, size=226, replace=False)
labeled_new = labeled.drop(indices_to_delete)
#labeled_new.to_csv('data/labeled_new.csv', encoding='utf-8', index=False)