>As the saying goes, "garbage in, garbage out", so we'll do some cleaning on our data first. The purpose of this project is to explore President Duterte's word cloud, so in this section I'll explore and clean the transcripts.

>Let's load the file

In [1]:
import numpy as np
import pandas as pd
import pickle as pkl
import os

with open('Duterte_Dialogue.pkl', 'rb') as fl:
    df = pkl.load(fl)

# 1. Cleaning
## 1.1 Dates Formatting
>Since I'm gonna be grouping these speeches by date in a later section for exploration, I'm gonna check that all the dates are formatted correctly.

In [62]:
grouped_date = df.groupby(df.date.map(lambda x: x[:-2] if x.find('_') != -1 else x))

In [63]:
grouped_date.groups

{' , ': Int64Index([478, 501], dtype='int64'),
 'April 02, 2017': Int64Index([573], dtype='int64'),
 'April 02, 2018': Int64Index([704], dtype='int64'),
 'April 03, 2017': Int64Index([572], dtype='int64'),
 'April 03, 2018': Int64Index([703], dtype='int64'),
 'April 04, 2017': Int64Index([570, 571], dtype='int64'),
 'April 05, 2017': Int64Index([568, 569], dtype='int64'),
 'April 05, 2018': Int64Index([701, 702], dtype='int64'),
 'April 06, 2017': Int64Index([566, 567], dtype='int64'),
 'April 09, 2017': Int64Index([565], dtype='int64'),
 'April 09, 2018': Int64Index([700], dtype='int64'),
 'April 1, 2020': Int64Index([14], dtype='int64'),
 'April 10, 2017': Int64Index([564], dtype='int64'),
 'April 10, 2018': Int64Index([308, 698, 699], dtype='int64'),
 'April 11, 2019': Int64Index([138], dtype='int64'),
 'April 12, 2017': Int64Index([562, 563], dtype='int64'),
 'April 12, 2018': Int64Index([307], dtype='int64'),
 'April 13, 2018': Int64Index([306], dtype='int64'),
 'April 13, 2019': 

>So some data points have malformed dates, specifically the groups ' , ' and 'ly 19, Ju, 2016' (as seen in the list above).
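
>A quicker way to catch labels like these is to try parsing every date and list the ones that fail. This is only a sketch: `find_bad_dates` and the sample list are made up for illustration, and it assumes the expected format is 'Month DD, YYYY' with an optional '_N' suffix.

```python
import re
from datetime import datetime

def find_bad_dates(labels):
    """Return the labels that do not parse as 'Month DD, YYYY'."""
    bad = []
    for label in labels:
        # Drop an optional duplicate-date suffix such as '_2' before parsing.
        base = re.sub(r'_\d+$', '', label).strip()
        try:
            datetime.strptime(base, '%B %d, %Y')
        except ValueError:
            bad.append(label)
    return bad

# Hypothetical sample mirroring the kinds of labels seen in this dataset.
sample = ['April 02, 2017', 'April 1, 2020', ' , ',
          'ly 19, Ju, 2016', 'April 28, 2017_2']
print(find_bad_dates(sample))
```

>On the real data this could be called as `find_bad_dates(df.date)`, since iterating a pandas Series yields its values.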

In [64]:
grouped_date.get_group(' , ')

Unnamed: 0,date,title,url,event,location,transcript
478,",",Phone conversation between President Duterte a...,http://pcoo.gov.ph//dec-02-2016-phone-conversa...,,02 December 2016,PRESIDENT DUTERTE: The President-elect Trump w...
501,", _2",Speech of Executive Secretary Salvador C. Medi...,http://pcoo.gov.ph//february-24-2017-speech-of...,,24 February 2017,


>Both records have their date stored in the location feature. On top of that, record 501 has no transcript. I visited the URL and it has no content, so I'm gonna remove it from the data.
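
>To check whether other records share this problem, one could scan for blank transcripts. A minimal sketch, assuming a blank transcript is either missing (None/NaN) or whitespace only; `is_blank` is a hypothetical helper and the sample dict is made up.

```python
import math

def is_blank(value):
    """True for None, NaN, or a string with nothing but whitespace."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ''

# Hypothetical sample: index -> transcript.
transcripts = {0: 'PRESIDENT DUTERTE: Good afternoon.', 1: '', 2: float('nan'), 3: '  \n'}
blank_ids = [idx for idx, text in transcripts.items() if is_blank(text)]
print(blank_ids)
```

>With the DataFrame, `df[df.transcript.map(is_blank)]` would list the affected rows.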

In [70]:
df.loc[478, 'date'] = 'December 02, 2016_2'
df.loc[478, 'location'] = np.nan
df.drop(index=501, inplace=True)

>There already exists a speech dated 'December 02, 2016', so I had to add the suffix '_2'.
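
>The groupby key used in this notebook strips the suffix with `x[:-2]`, which only works for single-digit suffixes (a label ending in '_10' would keep a trailing underscore). A minimal sketch of a regex-based alternative, together with a helper that picks the next free label; `base_date` and `next_free_label` are hypothetical names, not part of the notebook.

```python
import re

def base_date(label):
    """Strip a trailing '_N' suffix of any length, e.g. 'May 19, 2017_5'."""
    return re.sub(r'_\d+$', '', label)

def next_free_label(existing_labels, date):
    """Return `date` itself if unused, otherwise `date` plus the next '_N'."""
    used = sum(1 for label in existing_labels if base_date(label) == date)
    return date if used == 0 else f'{date}_{used + 1}'

# Illustrative labels only.
labels = ['December 02, 2016', 'April 28, 2017', 'April 28, 2017_2']
print(next_free_label(labels, 'December 02, 2016'))
print(next_free_label(labels, 'April 28, 2017'))
print(next_free_label(labels, 'July 19, 2016'))
```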

>Now let's try to correct the record that has 'ly 19, Ju, 2016' as its date.

In [95]:
grouped_date = df.groupby(df.date.map(lambda x: x[:-2] if x.find('_') != -1 else x))
grouped_date.get_group('ly 19, Ju, 2016')

Unnamed: 0,date,title,url,event,location,transcript
330,"ly 19, Ju, 2016",President Rodrigo Duterte’s Statement on Forei...,http://pcoo.gov.ph//july-19-2016-president-rod...,,President Duterte: Good afternoon.,I would like the Philippines to know that I pe...


>The date is not formatted properly and the location looks like part of his speech. I visited the link and the correct location is Malacañan Palace. I'm gonna correct the date, then prepend the current content of location, plus a passage that was missing in between, to the transcript.

In [111]:
before = grouped_date.get_group('ly 19, Ju, 2016' ).location.values[0] 
missing = ' I would like to arrest if you…You know what is going around that Secretary Yasay of the Department of Foreign Affairs is on his way out. I would to assure the Secretary that he is in good company and there is no truth to the rumor that there is a plan for his ouster, far from it actually. '
after = grouped_date.get_group('ly 19, Ju, 2016' ).transcript.values[0]
trans = before + missing + after

df.loc[330, 'transcript'] = trans
df.loc[330, 'date'] = 'July 19, 2016'
df.loc[330, 'location'] = np.nan

In [116]:
df.loc[330]

date                                              July 19, 2016
title         President Rodrigo Duterte’s Statement on Forei...
url           http://pcoo.gov.ph//july-19-2016-president-rod...
event                                                       NaN
location                                                    NaN
transcript    President Duterte: Good afternoon. I would lik...
Name: 330, dtype: object

In [118]:
df.date.tolist()

['May 26, 2020',
 'May 25, 2020',
 'May 22, 2020',
 'May 19, 2020',
 'May 12, 2020',
 'May 4, 2020',
 'April 27, 2020',
 'April 24, 2020',
 'April 16, 2020',
 'April 14, 2020',
 'April 13, 2020',
 'April 8, 2020',
 'April 6, 2020',
 'April 3, 2020',
 'April 1, 2020',
 'March 31, 2020',
 'March 24, 2020',
 'March 19, 2020',
 'March 16, 2020',
 'March 12, 2020',
 'March 11, 2020',
 'March 10, 2020',
 'March 9, 2020',
 'March 5, 2020',
 'March 3, 2020',
 'February 26, 2020',
 'February 25, 2020',
 'February 20, 2020',
 'February 15, 2020',
 'February 12, 2020',
 'February 11, 2020',
 'February 10, 2020',
 'February 6, 2020',
 'February 3, 2020',
 'January 29, 2020',
 'January 24, 2020',
 'January 21, 2020',
 'January 20, 2020',
 'January 19, 2020',
 'January 17, 2020',
 'January 16, 2020',
 'January 14, 2020',
 'January 13, 2020',
 'January 10, 2020',
 'January 8, 2020',
 'January 7, 2020',
 'January 6, 2020',
 'January 4, 2020',
 'December 30, 2019',
 'December 23, 2019',
 'December 20, 

>Now that we've fixed the dates, let's look at the speeches that share a date, since they could be the same speech just trimmed differently.

In [127]:
grouped_dates = df.groupby(df.date.map(lambda x: x[:-2] if x.find('_') != -1 else x))
multiple_speeches = []

for group in grouped_dates.groups:
    if grouped_dates.get_group(group).shape[0] > 1:
        multiple_speeches.append(group)
len(multiple_speeches)

101

>There are a total of 101 dates that have multiple speeches. I'm gonna go through them manually and see whether they differ.
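
>The same count can be reached without calling `get_group` in a loop. A sketch using only the standard library; the sample labels are illustrative and `dates_with_multiple_speeches` is a hypothetical name. (With pandas, `grouped_dates.size()` gives the per-date counts directly.)

```python
import re
from collections import Counter

def dates_with_multiple_speeches(labels):
    """Return base dates that occur more than once, ignoring '_N' suffixes."""
    counts = Counter(re.sub(r'_\d+$', '', label) for label in labels)
    return sorted(date for date, n in counts.items() if n > 1)

# Illustrative labels only.
labels = ['April 28, 2017', 'April 28, 2017_2', 'April 28, 2017_3',
          'June 11, 2017', 'June 11, 2017_2', 'May 08, 2018']
print(dates_with_multiple_speeches(labels))
```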

In [143]:
grouped_dates.get_group(multiple_speeches[9])

Unnamed: 0,date,title,url,event,location,transcript
550,"April 28, 2017",Toast of President Rodrigo Roa Duterte Republi...,http://pcoo.gov.ph//toast-of-president-rodrigo...,,"Rizal Hall, Malacañan Palace",\nHis Excellency President Joko Widodo; Madame...
551,"April 28, 2017_2",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//april-28-2017-speech-of-pr...,,"Fiesta Pavilion, Manila Hotel",\nThank you. Kindly sit down.\nPresident Ramos...
552,"April 28, 2017_3",Joint Press Statement of President Rodrigo Roa...,http://pcoo.gov.ph//joint-press-statement-of-p...,,"Reception Hall, Malacañan Palace",\nPRESIDENT DUTERTE: I get to read my statemen...
553,"April 28, 2017_4",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//media-interview-with-presi...,,"Malacañan, Palace","\nQ: Kay Mary Jane Veloso, sir. Are you going ..."


>The transcript in record 553 includes dialogue from another person, so I'm gonna clean it and keep only Duterte's part.

In [9]:
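def get_duterte_part(transcript):
    # Speaker labels that mark Duterte's turns. The transcript is split on
    # spaces below, so in practice only the one-word label 'DUTERTE:' can match.
    lt_names= ['PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:', 'PRESIDENT RODRIGO ROA DUTERTE:', 'DUTERTE:']

    lt_trans = transcript.replace('\n', ' ').split(' ')

    full_txt = ''
    isDuterte = False
    for word in lt_trans:
        # a colon marks a possible speaker label
        if ':' in word:
            if any(nm in word for nm in lt_names):
                isDuterte = True
                full_txt += '\n'  # start each Duterte turn on a new line
            elif 'Q:' not in word.strip():
                # any other token containing ':' (re)enables collecting
                isDuterte = True
            else:
                # 'Q:' starts a question, so stop collecting
                isDuterte = False

        if isDuterte:
            full_txt = full_txt + ' ' + word
    return full_txt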

In [156]:
df.loc[553, 'transcript'] = get_duterte_part(df.loc[553].transcript)
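
>`get_duterte_part` works word by word, so any token with a colon that is not 'Q:' (a time like '10:30', for example) switches collecting back on. A line-based alternative is sketched here. It assumes every speaker turn starts on its own line with an upper-case label followed by a colon; `duterte_turns` and the sample text are hypothetical, and unlike the function above it drops the label itself.

```python
import re

# An upper-case speaker label at the start of a line, e.g. 'Q:' or 'PRESIDENT DUTERTE:'.
SPEAKER = re.compile(r'^\s*([A-Z][A-Z .]*):\s*(.*)$')

def duterte_turns(transcript):
    """Keep only lines spoken by Duterte, including unlabeled continuation lines."""
    kept = []
    keep = False
    for line in transcript.splitlines():
        match = SPEAKER.match(line)
        if match:
            # A new label decides whether the following lines are kept.
            keep = 'DUTERTE' in match.group(1)
            line = match.group(2)
        if keep and line.strip():
            kept.append(line.strip())
    return '\n'.join(kept)

# Made-up sample in the same layout as the interviews.
sample = ("Q: Can we now hear from you a message?\n"
          "PRESIDENT DUTERTE: That is sampung questions already.\n"
          "Well, we had known along the buildup.\n"
          "Q: Sir, one more.\n"
          "PRESIDENT DUTERTE: Okay.")
print(duterte_turns(sample))
```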

In [159]:
grouped_dates.get_group(multiple_speeches[9])

Unnamed: 0,date,title,url,event,location,transcript
550,"April 28, 2017",Toast of President Rodrigo Roa Duterte Republi...,http://pcoo.gov.ph//toast-of-president-rodrigo...,,"Rizal Hall, Malacañan Palace",\nHis Excellency President Joko Widodo; Madame...
551,"April 28, 2017_2",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//april-28-2017-speech-of-pr...,,"Fiesta Pavilion, Manila Hotel",\nThank you. Kindly sit down.\nPresident Ramos...
552,"April 28, 2017_3",Joint Press Statement of President Rodrigo Roa...,http://pcoo.gov.ph//joint-press-statement-of-p...,,"Reception Hall, Malacañan Palace",\nPRESIDENT DUTERTE: I get to read my statemen...
553,"April 28, 2017_4",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//media-interview-with-presi...,,"Malacañan, Palace",\n DUTERTE: Those are one of the things that a...


>Now that we've fixed that, let's continue.

In [198]:
grouped_dates.get_group(multiple_speeches[34])

Unnamed: 0,date,title,url,event,location,transcript
664,"February 28, 2018",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph/wp-content/uploads/2018/02/...,,"Brgy. Maria Cristina, Balo-i, Lanao del Norte","Sir, I'm Kaye Imson, sir, from TV5 Manila. Sir..."
665,"February 28, 2018_2",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/02/...,,"Phase 2, Brgy. Mipaga, Marawi City, Lanao del Sur",Salamat po. Kindly sit down. Sa kurtesiya niny...


In [202]:
print(df.loc[664].transcript)

Sir, I'm Kaye Imson, sir, from TV5 Manila. Sir, given the situation now with the New People's Army, sir, maraming nagsu-surrender. At this point, are you still considering to --- for the resumption of peace negotiations with them, sir?   Not at this time. Maybe. Alongside with the mass surrenders is also the ferocity of those fighting. At alam mo ganito 'yan. Sabi ko nga, we have to reinvent the doctrine diyan sa detachment-detachment sa highway. Kasi sila nahalata ko, malayo sila, marunong na sila mag-sniper. Natutumba 'yung mga sundalo ko, nakatindig lang diyan sa detachment. So we are reconfiguring our --- the movement of the forces. Sinabi ko sa sundalo na ibahin na ninyo. Except 'yung mga population centers. Pero 'yung sa mga highway, ayaw ko na kasi marunong na silang mag-ano --- sniper. [off mic]   Hindi pa nga… hindi pa matapos ‘yung pagka-surrender. Kung mag-surrender na sila lahat bukas, eh 'di tapos na. What would be the use of… Pero kung nakita ko na lumalaban pa rin sila, 

>Looks like another one contains dialogue from another person. This one comes from a .docx file; it happened because the *docx* module failed to read the name/label of the other speaker, which is Q. So I wrote a function that re-reads the document and keeps only Duterte's part.

In [37]:
import docx
import re

def get_duterte_part_docx(dt):
    regex = re.compile(r'\[(.*?)\]')
    doc = docx.Document(f'duterte_docx2\\{dt}.docx')
    encountered = False
    isDuterte = True
    trans = []
    lt_names= ['PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:', 'PRESIDENT RODRIGO ROA DUTERTE:', 'DUTERTE:']
    for paragraph in doc.paragraphs:
        if encountered:
            for run in paragraph.runs:
                # if the text is bold we ask if it is a name of duterte or not
                if run.bold:
                    if any(nm in paragraph.text for nm in lt_names):
                        isDuterte = True
                        trans.append('\n')
                    else:
                        isDuterte = False

        # we'll only include President Duterte's dialogue in the list
        if isDuterte and encountered:
            if paragraph.text:
                trans.append(paragraph.text)

        # if we encounter the bracketed pattern, the transcript starts after it
        if re.search(regex, paragraph.text.replace('\n', '')) and not encountered:
            encountered = True
            # we'll extract the location (computed here but not returned)
            pattern = paragraph.text
            separator = pattern.find('|')
            loc = pattern[1:separator]
            if 'Delivered at' in loc:
                loc = loc[len('Delivered at'):].strip()
    single_str = ' '.join(trans)
    return single_str
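
>The core idea of `get_duterte_part_docx`, treating a bold run as a speaker label, can be tried without a .docx file. A simplified sketch, assuming each paragraph is reduced to its text plus a flag saying whether it has a bold run; the `Para` tuple, `keep_duterte_paragraphs`, and the sample paragraphs are made up for illustration, and the header detection and newline markers of the real function are left out.

```python
from collections import namedtuple

# Stand-in for a docx paragraph: its text and whether any run is bold.
Para = namedtuple('Para', ['text', 'has_bold'])

DUTERTE_LABELS = ('PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:',
                  'PRESIDENT RODRIGO ROA DUTERTE:', 'DUTERTE:')

def keep_duterte_paragraphs(paragraphs):
    """Join the paragraphs that belong to Duterte's turns."""
    kept = []
    is_duterte = True
    for para in paragraphs:
        if para.has_bold:
            # A bold run marks a speaker label; keep only Duterte's turns.
            is_duterte = any(name in para.text for name in DUTERTE_LABELS)
        if is_duterte and para.text:
            kept.append(para.text)
    return ' '.join(kept)

# Made-up paragraphs: the question has a bold label that was not read as text.
paragraphs = [
    Para('Sir, are you still considering peace negotiations?', True),
    Para('PRESIDENT DUTERTE: Not at this time. Maybe.', True),
    Para('So we are reconfiguring the movement of the forces.', False),
    Para('Sir, one more question.', True),
]
print(keep_duterte_paragraphs(paragraphs))
```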

In [215]:
df.loc[664, 'transcript'] = get_duterte_part_docx(df.loc[664].date)
df.loc[664].to_frame().T

Unnamed: 0,date,title,url,event,location,transcript
664,"February 28, 2018",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph/wp-content/uploads/2018/02/...,,"Brgy. Maria Cristina, Balo-i, Lanao del Norte",PRESIDENT DUTERTE: Not at this time. Maybe. Al...


>Now that it is fixed, let's continue.

In [225]:
num = 40

In [238]:
display(grouped_dates.get_group(multiple_speeches[num]))
print(num)
num += 1

Unnamed: 0,date,title,url,event,location,transcript
611,"June 11, 2017",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//june-11-2017-media-intervi...,,"Camp Evangelista Station Hospital, 4th Infantr...",\nQ: …Challenge them in your speeches. Can we ...
612,"June 11, 2017_2",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//june-11-2017-speech-of-pre...,,"Camp Bgen Edilberto Evangelista, Cagayan de Or...",\n\nSandali ha para isang basahan na lang ito....


49


In [247]:
print(df.loc[611].transcript)


Q: …Challenge them in your speeches. Can we now hear from you a message for the residents of Marawi City, 250,000 something who were displaced? What are their assurances about rehabilitation? What are they going to have? Those who returned… 
PRESIDENT DUTERTE: That is… That is sampung questions already.
Well, we had known along the buildup here in Marawi. That is why if you were tracking me, my statements in public was that “do not force my hand into it” ’cause there were already terroristic acts including all — also victims were the innocent men and women and children.
Alam mo kung hindi talaga bobo ‘yang mga oppositors, you know you have to declare martial law in the entire Mindanao kasi wala ito… We are not separated by waters and they can always seek refuge because apparently nandito karamihan wala namang Kristyanos, all Moro from Mindanao. 
So even on hot pursuits, they would go running towards Lanao del Norte or going down to Basilan, kaya kailangan mo… Kasi bakit mo i-martial l

>Another one that contains dialogue from another person. Let's extract Duterte's part only.

In [288]:
df.loc[611, 'transcript'] = get_duterte_part(df.loc[611].transcript)
df.loc[611].to_frame().T

Unnamed: 0,date,title,url,event,location,transcript
611,"June 11, 2017",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//june-11-2017-media-intervi...,,"Camp Evangelista Station Hospital, 4th Infantr...",\n DUTERTE: That is… That is sampung questions...


>Now that that is done, let's continue again.

In [722]:
num = 0

In [696]:
num -= 2

In [726]:
display(grouped_dates.get_group(multiple_speeches[num]))
print(num)
num += 1

Unnamed: 0,date,title,url,event,location,transcript
566,"April 06, 2017",Media Interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//april-06-2017-media-interv...,,"Camp Artemio Ricarte, Puerto Princesa City, Pa...",\n DUTERTE: Few questions ha. I’m having a sun...
567,"April 06, 2017_2",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//april-06-2017-speech-of-pr...,,"Camp Artemio Ricarte, Puerto Princesa City, Pa...",Salamat po. Somebody could give the tikas pahi...


3


In [730]:
with open('need_cleaning_.pkl', 'wb') as fl:
    pkl.dump(df, fl)
    
df.to_csv('need_cleaning.csv')

In [727]:
nm = 566
print(df.loc[nm].url)

http://pcoo.gov.ph//april-06-2017-media-interview-with-president-rodrigo-roa-duterte-following-his-visit-to-the-western-command/


In [728]:
print(df.loc[nm].transcript)


 DUTERTE: Few questions ha. I’m having a sunset limitation time sa Jolo. 
 DUTERTE: Well, the usual. We tried to be friends with everybody but we have to maintain our jurisdiction now, at least the areas under our control. And I have ordered the Armed Forces to occupy all — these so many islands, I think nine or 10 — lagyan ng structures and the Philippine flag. And sa coming Independence Day natin, I might, I may go to Pag-asa island to raise the flag there. Pati ‘yung ano, basta ‘yung bakante na ‘yung atin na, tirahan na natin… Mukhang agawan kasi ito ng isla eh. And what’s ours now at least kunin na natin and make a strong point there that it is ours. 
 DUTERTE: The money is there. I don’t know how the — the Army or the engineering battalion would do it. But that development there has my full support.  Gagastos ako diyan sa fortifications diyan because I want to — including the Benham Rise on the right side of the Philippines in the Pacific — I will officially claim it as ours and 

>Again, this contains dialogue from another person.

In [298]:
df.loc[609, 'transcript'] = get_duterte_part(df.loc[609].transcript)
df.loc[609].to_frame().T

Unnamed: 0,date,title,url,event,location,transcript
609,"June 17, 2017_2",Media interview with President Rodrigo Roa Dut...,http://pcoo.gov.ph//june-17-2017-media-intervi...,,"Brgy. Bancasi, Butuan City",\n DUTERTE: You want me to explain my absence?...


>We fixed it. Checking every group by hand is too slow, so I'm just gonna search for every data point whose transcript contains 'Q:' and fix them all at once.

In [345]:
index_contain_q = df[df.transcript.map(lambda x: 'Q:' in x)].index
index_contain_q.shape

(43,)

>There are a total of 43 transcripts that contain 'Q:', i.e. dialogue from another person.

In [354]:
for index in index_contain_q:
    df.loc[index, 'transcript'] = get_duterte_part(df.loc[index].transcript)
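
>'Q:' is not the only label used for other speakers; some interviews label the interviewer by name. A sketch for counting the upper-case speaker labels that appear at the start of a line, so the remaining cases can be found in one pass. `count_speaker_labels` is a hypothetical helper and the sample transcripts are shortened, made-up examples.

```python
import re
from collections import Counter

# An upper-case label at the start of the text or of a line, followed by a colon.
LABEL = re.compile(r'(?:^|\n)\s*([A-Z][A-Z .]*):')

def count_speaker_labels(transcripts):
    """Count how often each speaker label occurs across all transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(label.strip() for label in LABEL.findall(text))
    return counts

sample = [
    "ROCKY IGNACIO: Magandang araw.\n"
    "PRESIDENT DUTERTE: Magandang araw po.\n"
    "MS. IGNACIO: Mr. President kasi.",
    "Q: Sir?\nPRESIDENT DUTERTE: Yes.",
]
print(count_speaker_labels(sample).most_common())
```

>Labels that do not contain 'DUTERTE' are the candidates for removal.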

>Now that we've fixed that, let's continue what we were doing previously.

In [394]:
num = 60

In [409]:
display(grouped_dates.get_group(multiple_speeches[num]))
print(num)
num += 1

Unnamed: 0,date,title,url,event,location,transcript
580,"May 19, 2017",Episode 1,http://pcoo.gov.ph//president-rodrigo-roa-dute...,,Television Network (PTV) – 4,ROCKY IGNACIO: Magandang araw Pilipinas. Ito p...
581,"May 19, 2017_2",Speech of President Rrodrigo Roa Duterte durin...,http://pcoo.gov.ph//speech-of-president-rrodri...,,"Function Room 2&3, SMX Convention Center, Lana...",\nKindly sit down. Salamat po.\nMayor Inday Sa...
582,"May 19, 2017_3",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//speech-of-president-rodrig...,,"Function Room 1, SMX Convention Center, Lanang...",\nKindly sit down. Salamat po.\nAssistant Secr...
583,"May 19, 2017_4",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//speech-of-president-rodrig...,,"Central Park Subdivision, Bangkal, Brgy. Talom...",\n\nMaayong hapon kaninyong tanan mga silingan...
584,"May 19, 2017_5",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph//speech-of-president-rodrig...,,"Brgy. Los Amigos, Tugbok Dist., Davao City","\nIlang araw minemorize yan, yan. [audience ch..."


70


In [410]:
print(df.loc[580].transcript)

ROCKY IGNACIO: Magandang araw Pilipinas. Ito po si Rocky Ignacio. Ito po ang programang Mula sa Masa, Para sa Masa. Ito po ang programang maghahatid sa panguluhan kung ano ang kanilang mga problema.
At syempre ito rin ang magbibigay daan para maihatid ang mga mahahalagang programa ng pamahalaan.
Ngayong araw po, syempre makakasama natin ang
tinatagurin nating bida ng masa, si Pangulong Rodrigo Duterte.

PRESIDENT DUTERTE: Magandang araw po. I am happy that I have thistime with you para pag-usapan ‘yung mga tinatanong ninyo sa akin. Questions that are never answered by me or by anybody else tapos gusto ninyong malaman ngayon ang mga rason na ‘to and I am ready to respond to your questions.

MS. IGNACIO: Mr. President kasi meron pa rin mga katanungan ang mga Filipino sa usapin ng environment, isa po ‘yan sa mga matagal ng naging issue sa ating bansa, panoorin po natin ito.



 IGNACIO: ‘Yan Mr. President. Kasi marami rin po ang nanghinayang sa hindi nga pagkumpirma kay dating DENR Secret

>I'll turn every character to lower case, remove Duterte's name labels, and remove anything enclosed in brackets, parentheses, or quotation marks, along with other unnecessary characters. We'll also remove English and Filipino stop words. For English stop words we'll use the list from sklearn; for Filipino stop words I made a list in *Filipino_Stopwords.txt*.

In [1]:
import re

def clean_transcript(single_str):
    lt_names= ['PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:', 'PRESIDENT RODRIGO ROA DUTERTE:']
    str_cleaned = single_str
    for nm in lt_names:
        str_cleaned = str_cleaned.replace(nm, '')
    str_cleaned = str_cleaned.replace('\xa0', '')
    str_cleaned = str_cleaned.replace('\x0c', '')
    str_cleaned = str_cleaned.replace('\\', '')
    str_cleaned = str_cleaned.replace('—', '')
    str_cleaned = str_cleaned.replace('-', '')
    str_cleaned = str_cleaned.replace('...', '')
    str_cleaned = str_cleaned.replace('…', '')
    str_cleaned = str_cleaned.replace('END', '')
    str_cleaned = re.sub(r"\[(.*?)\]", '', str_cleaned) # remove everything enclosed in brackets
    str_cleaned = re.sub(r"\((.*?)\)", '', str_cleaned) # remove everything enclosed in parentheses
    str_cleaned = re.sub(r'\“(.+?)\”', '', str_cleaned) # remove everything in curly quotation marks, as he is probably quoting someone
    str_cleaned = str_cleaned.replace('‘', '')
    str_cleaned = [wrd.lower() for wrd in str_cleaned.split(' ') if wrd != '']
    str_cleaned = ' '.join(str_cleaned)

    return str_cleaned
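
>`clean_transcript` does not remove stop words itself. A minimal sketch of that step, assuming the stop words are available as Python sets; `remove_stop_words` is a hypothetical helper and the two tiny sets are illustrative only. In practice the English set could come from sklearn's `ENGLISH_STOP_WORDS` (in `sklearn.feature_extraction.text`) and the Filipino set from *Filipino_Stopwords.txt*, whose layout (for example one word per line) is an assumption here.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token found in `stop_words`."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# Tiny illustrative sets; the real lists are much longer.
english_stop = {'i', 'the', 'to', 'that', 'and', 'a'}
filipino_stop = {'ang', 'ng', 'sa', 'na', 'po'}

cleaned = 'i want to thank the people sa davao'
print(remove_stop_words(cleaned, english_stop | filipino_stop))
```

>Tokens are compared as-is, so a word with punctuation attached (such as 'sa,') would not match and would need to be stripped first.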

# TESTING

In [280]:
def get_duterte_part2(transcript):
    lt_names= ['PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:', 'PRESIDENT RODRIGO ROA DUTERTE:', 'DUTERTE:']

    lt_trans = transcript.replace('\n', ' ').split(' ')

    full_txt = ''
    isDuterte = False
    for word in lt_trans:
        if ':' in word:
            print(word)
            if any(nm in word for nm in lt_names):
                isDuterte = True
                full_txt += '\n'
            elif 'Q:' not in word.strip():
                isDuterte = True
            else:
                isDuterte = False

        if isDuterte:
            full_txt = full_txt + ' ' + word
    return full_txt

In [281]:
txt = df.loc[611].transcript

In [282]:
txt



# CLEANING

In [4]:
with open('need_cleaning_.pkl', 'rb') as fl:
    df = pkl.load(fl)

In [5]:
df

Unnamed: 0,date,title,url,event,location,transcript
0,"May 26, 2020",Excerpts from Speech of President Rodrigo Roa ...,https://pcoo.gov.ph/presidential-speech/excerp...,Meeting with Philippine Army (PA) and Philippi...,"Malago Clubhouse, Malacañang Park, Manila",So ako pati si Bong during my mayorship d...
1,"May 25, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,PRESIDENT RODRIGO ROA DUTERTE: I remember dist...
2,"May 22, 2020",Speech of President Rodrigo Roa Duterte during...,https://pcoo.gov.ph/presidential-speech/speech...,Commencement Exercsies of the Philippine Milit...,"Malago Clubhouse, Malacañang Park, Manila","Kindly sit down. [May upuan sila? Okay.], Defe..."
3,"May 19, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,PRESIDENT RODRIGO ROA DUTERTE: Good evening my...
4,"May 12, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,"PRESIDENT DUTERTE: Sir, one question. Itong op..."
...,...,...,...,...,...,...
713,"May 08, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the Rizal Hall, Malacañan Palace","Maybe after my prepared speech, this is just a..."
714,"May 05, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the Isla Ballroom, EDSA Shangri-La, Mandaluyon...",Thank you. Kindly sit down. I’d forego wit...
715,"May 04, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the SMX Convention Center, Davao City",Salamat po. Thank you for your courtesy. You m...
716,"May 02, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"Liwasang Alfaro G. Aguirre, Mulanay, Quezon",Kindly sit down. Thank you for your courtesy. ...


# TRANSCRIPTS THAT CONTAIN DIALOGUES

## 580

In [8]:
print(df.loc[580].transcript)

ROCKY IGNACIO: Magandang araw Pilipinas. Ito po si Rocky Ignacio. Ito po ang programang Mula sa Masa, Para sa Masa. Ito po ang programang maghahatid sa panguluhan kung ano ang kanilang mga problema.
At syempre ito rin ang magbibigay daan para maihatid ang mga mahahalagang programa ng pamahalaan.
Ngayong araw po, syempre makakasama natin ang
tinatagurin nating bida ng masa, si Pangulong Rodrigo Duterte.

PRESIDENT DUTERTE: Magandang araw po. I am happy that I have thistime with you para pag-usapan ‘yung mga tinatanong ninyo sa akin. Questions that are never answered by me or by anybody else tapos gusto ninyong malaman ngayon ang mga rason na ‘to and I am ready to respond to your questions.

MS. IGNACIO: Mr. President kasi meron pa rin mga katanungan ang mga Filipino sa usapin ng environment, isa po ‘yan sa mga matagal ng naging issue sa ating bansa, panoorin po natin ito.



 IGNACIO: ‘Yan Mr. President. Kasi marami rin po ang nanghinayang sa hindi nga pagkumpirma kay dating DENR Secret

In [25]:
lt_other_names = ['IGNACIO:', 'Ignacio:', 'video:', 'Aranzazu:']
df.loc[580, 'transcript'] = get_duterte_part_not_q(df.loc[580].transcript, lt_other_names).replace('Ms.', '')
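
>`get_duterte_part_not_q` is not defined in this part of the notebook. A minimal sketch of a helper with the same call shape, assuming it works like `get_duterte_part` but stops collecting at any of the caller-supplied labels instead of 'Q:'. The name `keep_duterte_turns` and the sample text are hypothetical; this is a guess at the behaviour, not the original function.

```python
def keep_duterte_turns(transcript, other_labels):
    """Keep words after a 'DUTERTE:' token; stop at any label in other_labels."""
    kept = []
    is_duterte = False
    for word in transcript.replace('\n', ' ').split(' '):
        if 'DUTERTE:' in word:
            is_duterte = True
            kept.append('\n')  # start each Duterte turn on a new line
        elif any(label in word for label in other_labels):
            is_duterte = False
        if is_duterte:
            kept.append(word)
    return ' '.join(kept)

# Shortened, made-up sample in the style of record 580.
sample = ('ROCKY IGNACIO: Magandang araw Pilipinas. '
          'PRESIDENT DUTERTE: Magandang araw po. '
          'MS. IGNACIO: Mr. President, panoorin po natin ito.')
print(repr(keep_duterte_turns(sample, ['IGNACIO:'])))
```

>Because the label is matched on a single word, a title such as 'MS.' that precedes the name is still collected. `str.replace` is case-sensitive, so `.replace('Ms.', '')` would not remove an upper-case 'MS.'.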

In [26]:
df.loc[580].transcript

'\n DUTERTE: Magandang araw po. I am happy that I have thistime with you para pag-usapan ‘yung mga tinatanong ninyo sa akin. Questions that are never answered by me or by anybody else tapos gusto ninyong malaman ngayon ang mga rason na ‘to and I am ready to respond to your questions.  MS.\n DUTERTE: The question is ripe. Una na natin pag usapan yung kay Secretary Lopez in the sense as she was rejected by the Commission on Appointments despite of my public pleadings in one or two public gatherings but I do not want to impose anything against the, or impose on the congressional body, Commission on Appointments. So tatanggapin natin yan.I chose Cimatu because I believe in him.\xa0 When I was the mayor of Davao City for the longest time, I was always appointed as the regional peace and order council chairman. So when he was assigned dito sa Davao City, we used to meet every now and then regarding the law and order situation of Region 11 of which I was the sabi ko nga, chairman. And we beca

## 643

In [28]:
df.loc[643].transcript



In [39]:
df.loc[643, 'transcript'] = get_duterte_part_docx(df.loc[643].date)

In [40]:
df.loc[643, 'transcript']



## 374

In [43]:
df.loc[374].transcript

' Philippines): Good afternoon sir. Si Ina po sa CNN. PRESIDENT\n DUTERTE: Akala ko ikaw ‘yung… Ms. Andolong: Sir, I just want to ask about, you apologized to Governor Espino and Mr. Sison and the Baraan, the brother. What went wrong in the validation process of that matrix because I understand–? PRESIDENT\n DUTERTE: Nothing went wrong. Everybody was correct but the focusing was, there were lapses. Connect-connect ‘yan, connect-connect ‘yan, connect-connect. They probably hit a snag somewhere and they stopped. So tinitingnan ko yung iba, sa DILG So kung wala man man kayong trabaho diyan, magtrabaho kayo so give me a chart there. Ms. Andolong: But this list… PRESIDENT\n DUTERTE: Hindi ko ma-connect eh. Ms. Andolong: Sir but this list still ended up on your table and you announced it. Will anyone be held responsible for those lapses? PRESIDENT\n DUTERTE: No, I take full responsibility. Ako ‘yung nag-announce. Hindi naman ito ‘yung… Sabi ko in this drive, I take full responsibility. Even 

In [51]:
missing = 'Now we are dealing with China. I will bring this to their attention. I’m leaving for Vietnam then maybe too for China. Lahat ng materials galing sa China. So we want them also to control their people and increase their focus on the criminals. Kaibigan man kaya tayo. Eh bakit ganon? If you consider as your friend, you want to help us, but most of the materials, lahat and the machines and the boilers are from China, what does that mean? Why I’m asking the Filipino in the coming days kung totohanin talaga ng Amerika, I’m going to ask you to sacrifice a little bit. But by next year, I would have entered alliances with so many countries. I will have an alliance in the military, trade and commerce with China. I have talked to President Medvedev of Russia and we agreed that I will go there and we’ll talk about what, how they can help us here. China is ready, ready to help us and I would add now they should be ready to help us with this goddamn problem of, itong drugs. So ‘yan, any question? Ikaw ‘yung representative ni Chua, ‘yung may -ari?'

In [53]:
lt_others = ['Philippines):', 'Andolong:', '(GMA):', 'Morong:', '(DZRH):', 'Uri:', 'Orejas:']
df.loc[374, 'transcript'] = missing + get_duterte_part_not_q(df.loc[374].transcript, lt_others)

In [54]:
df.loc[374].transcript

'Now we are dealing with China. I will bring this to their attention. I’m leaving for Vietnam then maybe too for China. Lahat ng materials galing sa China. So we want them also to control their people and increase their focus on the criminals. Kaibigan man kaya tayo. Eh bakit ganon? If you consider as your friend, you want to help us, but most of the materials, lahat and the machines and the boilers are from China, what does that mean? Why I’m asking the Filipino in the coming days kung totohanin talaga ng Amerika, I’m going to ask you to sacrifice a little bit. But by next year, I would have entered alliances with so many countries. I will have an alliance in the military, trade and commerce with China. I have talked to President Medvedev of Russia and we agreed that I will go there and we’ll talk about what, how they can help us here. China is ready, ready to help us and I would add now they should be ready to help us with this goddamn problem of, itong drugs. So ‘yan, any question? 

## 372

In [56]:
df.loc[372].url

'http://pcoo.gov.ph//sept-28-2016-media-interview-with-president-rodrigo-roa-duterte-before-his-departure-for-vietnam/'

In [63]:
lt_others = ['(ABS-CBN):', 'Hernandez:', '(GMA):', 'Morong:']
df.loc[372, 'transcript'] = get_duterte_part_not_q(df.loc[372].transcript, lt_others).replace('Ms.', '').replace('Mr.', '')

## 370

In [68]:
df.loc[370].transcript



In [69]:
df.loc[370].url

'http://pcoo.gov.ph//sept-30-2016-statement-of-president-rodrigo-roa-duterte-following-his-official-visit-to-vietnam/'

In [74]:
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1 RuxitSynthetic/1.0 v1355348020 t3296535826494656701 smf=0'
}

url = 'https://pcoo.gov.ph/sept-30-2016-statement-of-president-rodrigo-roa-duterte-following-his-official-visit-to-vietnam/'
response = requests.get(url, headers=headers)
html = BeautifulSoup(response.text, 'lxml')

In [91]:
html.find('table').findAll('td')[3].findAll('span')[5].text

'\u2029'

In [114]:
full_txt = ''
for span in html.find('table').findAll('td')[3].findAll('span'):
    if span.text != '\u2029':
        full_txt += '\n' + span.text

In [119]:
lt_others = ['Q:']
df.loc[370, 'transcript'] = get_duterte_part_not_q(full_txt, lt_others).replace('QUESTIONS AND ANSWERS:', '')

## 436

In [120]:
df.loc[436].transcript

'\nPRESIDENT PUTIN: Your Excellency, Mr. President, your colleagues and friends, we first of all cordially welcome you. It’s my pleasure to meet you.\nMr. President, in the presidential election in your country was held on the 9th of May for us it is indeed a very bright day, public holiday that marks the victory in the great patriotic war over the Nazis group.\nFor you, it has been your personal victory so once again congratulations, Mr. President.\nMr. President, this year marks 40th anniversary since diplomatic ties between our countries established back in time.\nHistorically, it’s quite a short period of time.\nWell, you have been able to do a lot in a short period of time in terms of developing the all round partnership between our countries and with respect to promoting greater trust and confidence between us.\nAnd it is my pleasure to have a chance to speak to you and your colleagues about developing our bilaterals.\nPRESIDENT DUTERTE: I have been looking for this moment to mee

In [123]:
txt = 'I have been looking for this moment to meet you Mr. President not only because you represent a great country but of your leadership too.\nAnd we’ve been longing to be part also of — despite the distance. We’ve been longing to be part of Europe especially in commerce and trade around the world.\nBut there was one thing that really stood before us and that was the result of Cold War.\nAnd historically, I have been identified with the Western world.\nIt was good until it lasted. And of late, I see a lot of these Western nations bullying small nations. And not only that they are into so much hypocrisy.\nAnd they want to… They seem to start a war but are afraid to go to war. That is what’s wrong with America and the other.\nThey were waging war in so many places in Vietnam, in Afghanistan and in Iraq. And for one single reason that there was a weapon of mass destruction and there was none.\nThey insist if you are allied with them that they follow you. They go to the Korean war, nothing happened. They got defeated.\nThey also got soldiers, Filipino soldiers in both Iraq and Vietnam and nothing happened, they lost. Then they went to an expedition in Iraq on an excuse of weapons of mass destruction and there was none.\nAnd they forced my country to contribute military forces. And when one soldier, one Filipino worker in the Middle East was captured by the groups there, they threatened to behead the Filipino unless we go out of war — in the war against the Middle East at that time.\nAnd because it was of national interest, and because their country was really over thinking of how to solve the problem. And the condition was that if we withdraw our forces, then they would spare the life of the Filipino worker. And we decided to withdraw.\nFrom that time on, the Americans made it hard for us and even in the (inaudible) times with IMF. So there are the things that I see which is not a good idea.\n'

In [124]:
df.loc[436, 'transcript'] = txt

## 405

In [127]:
df.loc[405].transcript

'\nPRIME MINISTER ABE: \nMay I once again express my heartfelt welcome to His Excellency President Duterte on your first visit to Japan after becoming the President of the Republic of the Philippines. \nThis year marks the milestone of the 60th anniversary of normalization of our diplomatic relations and the year opened with an auspicious event of Their Majesties, the Emperor and the Empress visiting the Philippines.\nIt is a great pleasure for me to welcome President Duterte, who has long served as the mayor of Davao, which has a long history of friendship with Japan and to have the second summit following our meeting last month.\nJapan and the Philippines are important partners, both sharing fundamental values, including freedom, democracy, and the rule of law. During our talk today, based on such partnership, we had an extremely meaningful exchanges of opinions between President Duterte and we agreed to raise our bilateral relationship of cooperation to a level higher than before. A

In [130]:
txt = '\nThank you, Your Excellency. I take this opportunity to express gratitude for the warm welcome and exceptional arrangements accorded me and my delegation. \nMy meeting with Prime Minister Shinzo Abe was productive and fruitful. We reaffirmed strong ties between our countries and peoples and resolved to make that friendship and partnership even stronger. \nWe discussed a broad range of issues important to both sides and more importantly, agreed to work closely to ensure that shared objectives are accomplished.\nWe want to increase two-way trade and investments, especially in agriculture and manufacturing and look at various opportunities for new areas of collaboration.\nThese include the development of quality infrastructure and transport systems crucial in order to sustain the Philippines’ economic growth.\nJapan’s Official Development Assistance for the Philippines is second to none in terms of real value and the positive impact on the lives of the Filipinos. \nWe agreed to continue harnessing ODA or the Official Development Assistance as a tool to bring about positive economic and structural changes in the Philippines. \nWe agreed to collaborate on political, security, and defense issues in order to create an enabling environment for our economies to grow. 
\nJapan will continue to play an important role in modernizing the capabilities of the Philippines for maritime domain awareness and maritime security as well as in humanitarian relief and disaster risk reduction response.\nWe welcome the ongoing new initiatives by Japan that support the Philippine government’s effort to realize Mindanao’s full potential.\nAs the Philippines take on the chairmanship of ASEAN in 2017, I sought and received Japan’s support and commitment to take an active role in the relevant meeting.\nThis is an important leadership role for the Philippines as we seek fully to realize the goal and rules-based, people-oriented and people-centered ASEAN.\nJapan will be a crucial ASEAN dialogue partner in ensuring the efforts to strengthen adherence to the rule of law as the bedrock of stable and secure relation in the ASEAN region and beyond.\nThe Philippines will continue to work closely with Japan on issues of common concern in the region and uphold the shared values of democracy, adherence to the rule of law and the peaceful settlement of disputes, including the South China Sea. \nThe full range of our relations over the past 60 years has shown that Japan is the Philippines’ special friend who is closer than a brother.\nThis visit has been a defining moment on the solid and strategic partnership between the Philippines and Japan. Today, we have taken steps to ensure that our ties remain vibrant and will gain greater strength in the years to come. Thank you.'

In [132]:
df.loc[405, 'transcript'] = txt

## 400

In [134]:
df.loc[400, 'transcript']

'PRIME MINISTER SHINZO ABE: Mr. President is quite a famous figure also in Japan and I’m very excited to see you in person.\nFirst of all, I would like to underscore that Japan firmly deplores the terror incident which took place in Davao the other day.\nI would like to take this opportunity to express My heartfelt condolences for the victims and their family members.\nAnd also, I would like to reiterate my feelings and sympathy for you.\nPRES. DUTERTE: Thank you.\nPRIME MINISTER ABE: This year is a milestone year which marks the 60th anniversary of the normalization of the diplomatic relationship between Japan and the Philippines. And under our close partnership, I do look forward to further developing our cooperation in a wide range of areas together with you.\nPRES. DUTERTE: I would like to thank the Japanese government for their efforts to commiserate with us in the bombing incident in Davao City. Thank you very much for your concern and Davao remains to be the largest recipient of

In [149]:
lt_others = ['PRIME MINISTER ABE:', 'ABE:']
df.loc[400, 'transcript'] = get_duterte_part_not_q(df.loc[400, 'transcript'], lt_others).replace('PRIME MINISTER SHINZO', '').replace('PRIME MINISTER', '')

## 525

In [160]:
# Note the comma after 'Mr.': without it Python joins 'Mr.' and the next string into one entry.
remove = ['QUESTIONS AND ANSWERS:', 'AC Nichols (CNN', 'Ms.', 'JP Bencito (Manila Standard', 'Julius Disamburun', 'Mr.',
         'Ayee Macaraig (Agence', 'Victoria Tulad']
lt_others = ['Philippines):', 'Nichols:', 'Inquirer):', 'Leila Salaverria (The Philippine Daily', 'Today):', '(PTV-4):', 
            'France-Presse):', 'Macaraig:', '(GMA-7):', 'Tulad:']
txt = get_duterte_part_not_q(df.loc[525, 'transcript'], lt_others)
for wrd in remove:
    txt = txt.replace(wrd, ' ')
df.loc[525, 'transcript'] = txt

In [151]:
df.loc[525, 'url']

'http://pcoo.gov.ph//march-23-2017-media-interview-of-president-rodrigo-roa-duterte-following-his-official-visits-to-republic-of-the-union-of-myanmar-and-the-kingdom-of-thailand/'

## 708

In [162]:
print(df.loc[708, 'transcript'])

  Yes, what is the trouble of the day? Hi, sir. Sir, can you --- sir ‘yung statement niyo lang kahapon, can you clarify about China, ‘yung may ouster plans, tutulong po sa inyo ‘yung China. Sino po in particular, sir?   No, that was the time when there was this parade ‘yung “oust, oust, Duterte.” We would not want to see someone duly elected na… otherwise, these threats about ‘yung mga yellow pati Magdalo, pati ‘yung… ‘yung mga left nag-join, they were demonstrating almost everyday. I was just trying to recall his statement by [one?] --- not… by one of them pero… Actually kasi nandiyan sila lahat, we will not like to see a duly-elected Philippine leader ousted by itong ‘yung oust-oust na ‘to. ‘Yun ‘yun. Si President Xi Jinping? You were talking to President Xi Jinping…   Again? You were talking with President Xi Jinping regarding these threats that China…   At that time, hindi na. It was just a statement siguro ‘yung… Pero ‘yung ano… Ah ‘yung sabi niya na --- oo. It was passed on to th

In [172]:
df.loc[708, 'transcript'] = get_duterte_part_docx(df.loc[708].date)

## 528

In [176]:
print(df.loc[528, 'transcript'])

PRESIDENT DUTERTE: Good evening. I would be glad to take your questions.
RG Cruz (ABS-CBN 2): Sir, good evening, Mr. President. 
PRESIDENT DUTERTE: Yes.
Mr. Cruz: Does the Pres — do you support the impeachment of the Vice President Leni Robredo amid criticism from Senator de Lima that it is stupid and an arrogance of power? And will you bring to court all of those who are behind the supposed concerted efforts to destabilize your administration amid the claim by Speaker Alvarez that there are plans by the Liberal Party to oust you the way former President Estrada was ousted? 
PRESIDENT DUTERTE: Yes, it’s all politics actually. In the matter of going after them, it has not reached that level of violence, destabilization. It’s more of publicity. And for as long as it is really a peaceful exercise of the freedom of speech and freedom of the press, there’s nothing I can do about it. It’s guaranteed under the Constitution.  
The talk about destabilization I think is a bit too… well it is jus

In [175]:
df.loc[528, 'url']

'http://pcoo.gov.ph//march-19-2017-media-interview-with-president-rodrigo-roa-duterte-following-his-meeting-with-the-filipino-community-in-the-republic-union-of-myanmar/'

In [185]:
remove = ['RG Cruz (ABS-CBN', 'Mr.', 'Dano Tingcungco (GMA', 'Jervis Manahan (PTV', 'Roy Mabasa (Manila ']
lt_others = ['2): Sir,', 'Cruz:', '7):', 'Tingcungco:', '):', 'Manahan:', 'Bulletin):']
txt = get_duterte_part_not_q(df.loc[528, 'transcript'], lt_others)
for wrd in remove:
    txt = txt.replace(wrd, ' ')
df.loc[528, 'transcript'] = txt

## 494

In [188]:
print(df.loc[494, 'transcript'])

PRESIDENT DUTERTE: His Excellency Prime Minister Shinzo Abe, distinguished members of the Japan delegation, colleagues in the Philippine government, friends, ladies and gentlemen.
Thank you for joining us this afternoon. Prime Minister Abe and I just concluded a productive meeting where we discussed the ways for further strengthening of bilateral ties.
As proven defense and long time partners, the Philippines and Japan are committed to further expand and deepen our relations across a broad range of areas. We had an active discussion on enhancing of maritime and security cooperation.
As maritime nations, the Philippines and Japan have a shared interest in keeping our waters safe and secure from threats of any kind.
Capacity-building and assets acquisition and upgrading will be a centerpiece of this collaboration. We hope to fast-track the delivery of the Philippine of key assets already in the pipeline and the acquisition of new ones.
As we seek these new innovations to the Philippines’

In [187]:
df.loc[494, 'url']

'http://pcoo.gov.ph//jan-12-2017-president-rodrigo-roa-dutertes-joint-press-statement-with-japanese-prime-minister-shinzo-abe/'

In [191]:
txt = df.loc[494, 'transcript']

index = txt.find('PRIME MINISTER ABE:')
df.loc[494, 'transcript'] = txt[:index]
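
>One caveat with slicing on `str.find`: when the marker is missing, `find` returns `-1` and `txt[:-1]` silently drops the last character. A small sketch of a safer cut, using a hypothetical helper `truncate_at_marker` and a made-up sample string:

```python
def truncate_at_marker(text, marker):
    """Return the part of `text` before the first `marker`; unchanged if it is absent."""
    # str.partition returns (text, '', '') when the marker is not found.
    head, _sep, _tail = text.partition(marker)
    return head


sample = "PRESIDENT DUTERTE: Thank you.\nPRIME MINISTER ABE: Thank you, Mr. President."
print(truncate_at_marker(sample, "PRIME MINISTER ABE:"))
print(truncate_at_marker("no marker here", "PRIME MINISTER ABE:"))
```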

# DUPLICATES

## 417 (duplicate, delete this)

In [195]:
df.loc[417]

date                                         October 19, 2016_2
title         Press Conference of President Rodrigo Roa Dute...
url           http://pcoo.gov.ph//oct-19-2016-press-conferen...
event                                                       NaN
location                      Grand Hyatt Hotel, Beijing, China
transcript     STATEMENT: PRESIDENT\n DUTERTE: Presently we ...
Name: 417, dtype: object

In [201]:
df.drop(index=417, inplace=True)
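
>Row 417 was spotted by eye. As a rough sketch of how repeated transcripts could be flagged automatically, the helper below groups ids whose text is identical once whitespace and case are normalised. The function name and the three toy records are assumptions for illustration only, not rows from the dataset.

```python
from collections import defaultdict


def find_duplicate_texts(records):
    """Group ids whose text matches after collapsing whitespace and lowercasing."""
    groups = defaultdict(list)
    for rec_id, text in records.items():
        key = " ".join(text.split()).lower()
        groups[key].append(rec_id)
    # Only groups with more than one id are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy records: 1 and 2 differ only in spacing and line breaks.
records = {
    1: "STATEMENT: PRESIDENT DUTERTE: Presently we are here.",
    2: " STATEMENT: PRESIDENT\n DUTERTE: Presently we  are here.",
    3: "A different transcript.",
}
print(find_duplicate_texts(records))
```

>On the real data this could be called as `find_duplicate_texts(df.transcript.to_dict())`. It only catches exact repeats, so near-duplicates would still need a manual check.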

# SKIPPED SOME DIALOGUES

## 625

In [206]:
print(df.loc[625, 'transcript'])


 DUTERTE: I can maybe spare you something like… Wala na ang relo ko, ibinigay ko sa sundalo. So I can… Mga five hours.
 DUTERTE: Yes?
 DUTERTE: Ang solusyon nasa Pilipino, kayo. Kayo ang… You know ang gustong magdala ng gobyerno, that is really extortion. That is just a matter of semantics. They call it “revolutionary tax,” actually it’s extortion. Kaya nga may away tayo eh. And kaya nga hindi kami papayag… I refuse now to resume the talks with them until they stop itong extortion. Kasi sasabihin ng tao, magbabayad sila ng buwis, wala namang protection kasi hinihingian sila ng NPA. So the only way is to — I do not want to continue the fight, it has been there for 50 years and I suppose that everybody’s tired killing people for 50 years. But if they want it another 50 years, wala kaming, wala — ang gobyerno walang magawa. It’s plain extortion. And if they want to continue, to resume the talks, one of the things that I would demand would really be that they stop the extortion activities

In [205]:
df.loc[625, 'url']

'http://pcoo.gov.ph//media-interview-with-president-rodrigo-roa-duterte-following-his-visit-to-the-403rd-infantry-brigade/'

In [209]:
url = 'https://pcoo.gov.ph/media-interview-with-president-rodrigo-roa-duterte-following-his-visit-to-the-403rd-infantry-brigade/'

response = requests.get(url, headers=headers)
html = BeautifulSoup(response.text, 'lxml')

In [214]:
container = html.find('table')
txt = container.find_all('td')[3].text

txt_cleaned = get_duterte_part_not_q(txt , ['Q:'])
df.loc[625, 'transcript'] = txt_cleaned

## 367

In [216]:
df.loc[367, 'transcript']



In [217]:
df.loc[367, 'url']

'http://pcoo.gov.ph//aug-01-2016-statement-of-president-rodrigo-roa-duterte-during-a-press-briefing/'

In [218]:
url = 'https://pcoo.gov.ph/aug-01-2016-statement-of-president-rodrigo-roa-duterte-during-a-press-briefing/'
response = requests.get(url, headers=headers)
html = BeautifulSoup(response.text, 'lxml')

In [221]:
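def get_txt_html(html):
    """Return the text of the fourth <td> of the first table, where the transcript sits on these pages."""
    container = html.find('table')
    txt = container.find_all('td')[3].text
    return txt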

In [237]:
txt = get_txt_html(html)

In [239]:
remove = ['Pia Ranada', 'Ms.', 'Joseph Morong', 'Mr.', 'QUESTION & ANSWER:']
lt_others = ['Q:', '(Rappler):', 'Ranada:', 'Monte:', '(GMA-7):', 'Morong:', 'Ignacio:', 'Tinaza:', 'Mendez:',
            'REPORTERS:']
txt = get_duterte_part_not_q(txt, lt_others)
for wrd in remove:
    txt = txt.replace(wrd, ' ')
df.loc[367, 'transcript'] = txt

## 349

In [240]:
df.loc[349, 'transcript']

' 11:35. As of today, I am declaring a ceasefire. So, I’m joining the Communist Party of the Philippines in its desire to seek peace for this nation.  Likewise, in the same manner, I am ordering the Armed Forces of the Philippines pati ang Philippine National Police: As of today, meron tayong ceasefire.  So, we avoid hostile actions against each other, we do not go into antagonistic behavior in front of whoever and as a matter of fact, I am encouraging people in government—the military and the police to be friendly with the forces of the revolutionary government of the Communist Party of the Philippines.  In the meantime, that we have the ceasefire, because of the Oslo talks, not because we want to be extra friendly but you know, forget for—in the meantime, even for a short period, and I hope it would go a long, long period for a peaceful resolution of the communist rebellion against the Republic of the Philippines.  ‘Yan lang po, very important so that everybody will be apprised of, w

In [241]:
df.loc[349, 'url']

'http://pcoo.gov.ph//aug-24-2016-press-conference-of-president-rodrigo-roa-duterte/'

In [262]:
url = 'https://pcoo.gov.ph/aug-24-2016-press-conference-of-president-rodrigo-roa-duterte/'

response = requests.get(url, headers=headers)

html = BeautifulSoup(response.text, 'lxml')

In [263]:
txt = get_txt_html(html)
txt

'I’ll be talking in English because I’m also—at the same time, addressing the nation.\nEffective this hour, it’s 11:35. As of today, I am declaring a ceasefire. So, I’m joining the Communist Party of the Philippines in its desire to seek peace for this nation.\n Likewise, in the same manner, I am ordering the Armed Forces of the Philippines pati ang Philippine National Police: As of today, meron tayong ceasefire.\n So, we avoid hostile actions against each other, we do not go into antagonistic behavior in front of whoever and as a matter of fact, I am encouraging people in government—the military and the police to be friendly with the forces of the revolutionary government of the Communist Party of the Philippines.\n In the meantime, that we have the ceasefire, because of the Oslo talks, not because we want to be extra friendly but you know, forget for—in the meantime, even for a short period, and I hope it would go a long, long period for a peaceful resolution of the communist rebelli

In [266]:
lt_others = ['Q:', 'PRRD', 'Q']
cleaned_txt = get_duterte_part_not_q(txt, lt_others)
df.loc[349, 'transcript'] = cleaned_txt

In [290]:
with open('semi_cleaned_df.pkl', 'wb') as fl:
    pkl.dump(df, fl)

# CLEANING PART 2

In [292]:
lt = ['Q:', 'Q']

df[df.transcript.map(lambda x: any(word == char for char in lt for word in x.split()))]

Unnamed: 0,date,title,url,event,location,transcript
335,"July 07, 2016",President Rodrigo Roa Duterte’s Press Statemen...,http://pcoo.gov.ph//july-07-2016-president-rod...,,"President’s Hall, Malacañan Palace [Aired via ...",\n[Video starts]\nDue to the constant and insi...
353,"August 21, 2016",Press Conference of President Rodrigo Roa Duterte,http://pcoo.gov.ph//aug-21-2016-press-conferen...,,"Presidential Guest House, DPWH Depot, Panacan,...","PRRD: Are we ready? Okay? Ah, good morning! ..."
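
>The filter above compares whole tokens, not substrings, so a transcript is flagged only when `Q:` or `Q` stands alone as a word. A plain-Python sketch of the same test on made-up strings (`has_leftover_marker` is a hypothetical name):

```python
def has_leftover_marker(text, markers=("Q:", "Q")):
    """True if any whitespace-separated token equals one of the markers exactly."""
    return any(token in markers for token in text.split())


print(has_leftover_marker("Q: Sir, good evening."))      # True: 'Q:' is its own token
print(has_leftover_marker("We went to Quezon City."))    # False: 'Quezon' is not 'Q'
print(has_leftover_marker("See the FAQ: section."))      # False: 'FAQ:' is not 'Q:'
```

>Exact matching keeps words that merely contain a capital Q from producing false positives, at the cost of missing labels glued to other characters such as `(Q:`.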


## 335

In [364]:
print(df.loc[335].transcript)



In [355]:
url = df.loc[335].url
response = requests.get(url, headers=headers)
html = BeautifulSoup(response.text, 'lxml')

In [356]:
txt = get_txt_html(html)
txt



In [358]:
remove= ['R.']
lt_others = ['Ignacio', 'Q']
cleaned_txt = get_duterte_part_not_q(txt, lt_others).replace('* * *', ' ')
df.loc[335, 'transcript'] = remove_parts(cleaned_txt, remove)

## 353

In [365]:
print(df.loc[353, 'transcript'])



In [366]:
df.loc[353, 'url']

'http://pcoo.gov.ph//aug-21-2016-press-conference-of-president-rodrigo-roa-duterte/'

In [372]:
txt = df.loc[353, 'transcript']
txt.find('Doris Bigornia:')

11497

In [382]:
remove = ['Doris', 'EDITH', 'REGALADO/PH', 'JONATHAN']
lt_others = ['Bigornia:', 'MODERATOR:', 'REGALADO:', 'STAR:', 'KADUAYO/RAPPLER:', 'TRISHA/CNN:', 'MILLER:', 
             'DORIS:', 'Journalist:', 'Q:', 'Q']
txt_cleaned = get_duterte_part_not_q(txt, lt_others)
txt_cleaned = remove_parts(txt_cleaned, remove)
df.loc[353, 'transcript'] = txt_cleaned

In [378]:
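def remove_parts(txt, lt_remove):
    """Drop every space-separated token of `txt` that appears in `lt_remove`."""
    full_txt = ''
    for wrd in txt.replace('\n', ' ').split(' '):
        if wrd not in lt_remove:
            full_txt += ' ' + wrd
    return full_txt


def get_duterte_part_not_q(transcript, others):
    """Keep only the words spoken by Duterte.

    `others` lists the speaker labels (for example 'Q:') that switch the filter off.
    """
    lt_names = ['PRESIDENT RODRIGO DUTERTE:', 'PRESIDENT DUTERTE:', 'PRESIDENT RODRIGO ROA DUTERTE:', 'DUTERTE:',
                'PRES. DUTERTE:', 'PRRD']

    lt_trans = transcript.replace('\n', ' ').split(' ')

    full_txt = ''
    isDuterte = True  # text before the first speaker label is kept
    for word in lt_trans:
        # Only tokens that look like a speaker label can change the current speaker.
        if (':' in word) or ('Q' == word) or ('PRRD' == word) or ('Ignacio' == word):
            # Tokens are single words, so in practice only 'DUTERTE:' and 'PRRD' can match here.
            if any(nm in word for nm in lt_names):
                isDuterte = True
                full_txt += '\n'
            elif any(nm in word.strip() for nm in others):
                isDuterte = False

        if isDuterte:
            full_txt = full_txt + ' ' + word
    return full_txt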
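
>The function above works word by word, which is why leftover name fragments have to be listed in `remove` afterwards. For transcripts where every turn starts on its own line, a line-based variant may be simpler. The sketch below is an alternative under that assumption, not the method used in this notebook; `keep_duterte_lines` and the sample transcript are made up for illustration.

```python
def keep_duterte_lines(transcript, duterte_labels, other_labels):
    """Keep lines spoken by Duterte, assuming each turn starts with a speaker label."""
    kept = []
    keeping = True  # like the word-based version, text before any label is kept
    for line in transcript.splitlines():
        stripped = line.strip()
        label = next((lab for lab in duterte_labels if stripped.startswith(lab)), None)
        if label is not None:
            keeping = True
            stripped = stripped[len(label):].strip()  # drop the label itself
        elif any(stripped.startswith(lab) for lab in other_labels):
            keeping = False
        # Lines without a label inherit the current speaker.
        if keeping and stripped:
            kept.append(stripped)
    return "\n".join(kept)


sample = (
    "PRESIDENT DUTERTE: Good evening.\n"
    "Mr. Cruz: Do you support the impeachment?\n"
    "A follow-up from the same reporter.\n"
    "PRESIDENT DUTERTE: Yes, it is all politics.\n"
    "It is guaranteed under the Constitution."
)
print(keep_duterte_lines(sample, ["PRESIDENT DUTERTE:"], ["Mr. Cruz:"]))
```

>Because whole label strings are matched at the start of a line, a colon inside a sentence does not flip the speaker. Transcripts that arrive as one long line would still need the word-based function.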

# CLEANING PART 3

In [384]:
df

Unnamed: 0,date,title,url,event,location,transcript
0,"May 26, 2020",Excerpts from Speech of President Rodrigo Roa ...,https://pcoo.gov.ph/presidential-speech/excerp...,Meeting with Philippine Army (PA) and Philippi...,"Malago Clubhouse, Malacañang Park, Manila",So ako pati si Bong during my mayorship d...
1,"May 25, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,PRESIDENT RODRIGO ROA DUTERTE: I remember dist...
2,"May 22, 2020",Speech of President Rodrigo Roa Duterte during...,https://pcoo.gov.ph/presidential-speech/speech...,Commencement Exercsies of the Philippine Milit...,"Malago Clubhouse, Malacañang Park, Manila","Kindly sit down. [May upuan sila? Okay.], Defe..."
3,"May 19, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,PRESIDENT RODRIGO ROA DUTERTE: Good evening my...
4,"May 12, 2020",Talk to the People of President Rodrigo Roa Du...,https://pcoo.gov.ph/presidential-speech/talk-t...,On Coronavirus Disease 2019 (COVID-19),Malago Clubhouse in Malacañang,"PRESIDENT DUTERTE: Sir, one question. Itong op..."
...,...,...,...,...,...,...
713,"May 08, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the Rizal Hall, Malacañan Palace","Maybe after my prepared speech, this is just a..."
714,"May 05, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the Isla Ballroom, EDSA Shangri-La, Mandaluyon...",Thank you. Kindly sit down. I’d forego wit...
715,"May 04, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"the SMX Convention Center, Davao City",Salamat po. Thank you for your courtesy. You m...
716,"May 02, 2018",Speech of President Rodrigo Roa Duterte during...,http://pcoo.gov.ph/wp-content/uploads/2018/05/...,,"Liwasang Alfaro G. Aguirre, Mulanay, Quezon",Kindly sit down. Thank you for your courtesy. ...
