# Assessing GDPR-Compliance in Web Applications: A Machine Learning Approach

We will assess the GDPR-compliance of web applications based on their privacy policies. We use a classification model, trained on a corpus of 18,397 natural sentences, to classify the privacy policies on whether five General Data Protection Regulation (GDPR) privacy policy core requirements are communicated in the policy.

__Relevance:__ The GDPR applies to any personal data processing of EU citizens. We aim to assess the state of GDPR-compliance in application software based on their privacy policies.

__Focus:__ web applications; as the web application paradigm is widely used due to the omnipresence of web browsers across PCs and mobile devices. 

__Goal:__ to scrutinize the privacy policies of web applications using ML, to assess whether core privacy policy requirements are communicated.

#### __RQ:__ What is the state of GDPR-compliance disclosure in web applications?

---

### Step 1: collect list of companies active in the Web Apps industry

To do so we utilize the Crunchbase database that allows us to identify companies that engage in web applications, filtered on location (which in our case will be the European Union). We used 

We've imported 1000 companies using the following criteria:
- Industry: Web Apps
- Location: Europe (European Union)

These criteria yield 2019 results of which we collected 1000.

---

In [174]:
import os
from newspaper import Article
from bs4 import BeautifulSoup
from six.moves.urllib.parse import urlparse
import urllib
import sys
import time
import nltk
import pandas as pd
import requests
import spacy
import random
# from googlesearch import search
from langdetect import detect
import re
import pickle
import math

### Step 2: read data

In [46]:
# crunch_data_init = pd.read_excel('data/Advanced Search _ Companies _ Crunchbase.xlsx', index_col=0) 
crunch_data = pd.read_excel('data/companies.xlsx', index_col=0) 

In [47]:
# crunch_data = crunch_data_init.iloc[0:999]

In [48]:
crunch_data

Unnamed: 0_level_0,Description,Location,Employees,Type,Website,Rank,Founded Date,Operating Status,Company Type,Contact Email,...,Industry 31,Industry 32,Industry 33,Industry 34,Industry 35,Industry 36,Industry 37,Link 1,Link 2,link_3
Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01s-community-company,01S Community company communicates and interac...,"Arezzo, Toscana, Italy",51-100,Private,www.01s.it/,1284758,,Active,For Profit,info@01s.it,...,,,,,,,,https://www.facebook.com/01esse/,https://www.linkedin.com/company/01s-community...,
1000-digital,1000 ° Digital develops innovative web applica...,"Leipzig, Sachsen, Germany",11-50,Private,www.1000grad.de,1480851,2000,Active,For Profit,info@1000grad.de,...,,,,,,,,https://www.facebook.com/1000graddigital,https://www.linkedin.com/company/1000digital/,https://twitter.com/1000digital
100-net,100% Net is a global internet solution for all...,"Pérols, Limousin, France",1-10,Private,www.100pour100net.com//,986874,2003,Active,For Profit,contact@100p100.net,...,,,,,,,,https://www.facebook.com/100pour100Net/,https://www.linkedin.com/company/100-net/,https://www.twitter.com/100pour100net
100starlings,"100 Starlings creates web and mobile apps, the...","London, England, United Kingdom",1-10,Private,www.100starlings.com/,388746,2015,Active,For Profit,info@100starlings.com,...,,,,,,,,https://www.linkedin.com/company/100starlings-...,https://twitter.com/100Starlings?utm_source=hi...,
10geeks-software-engineering,"10Geeks designs, develops, and analyzes tailor...","Fohren, Baden-Wurttemberg, Germany",1-10,Private,www.10geeks.com,651610,"Jan 1, 2012",Active,For Profit,info@10geeks.com,...,,,,,,,,https://www.linkedin.com/company/10geeks,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zonova,Zonova is an information technology and servic...,"Terrenoire, Rhone-Alpes, France",11-50,Private,zonova.io,853916,2017,Active,For Profit,contact@zonova.io,...,,,,,,,,https://www.facebook.com/ZetaOmegaNOVA/,https://www.linkedin.com/company/zonova/,
zoonect,"Zoonect offers web apps, mobile app, cloud pla...","Pistoia, Toscana, Italy",1-10,Private,www.zoonect.com,1741834,2015,Active,For Profit,office@zoonect.com,...,,,,,,,,https://www.facebook.com/zoonect/,https://www.linkedin.com/company/zoonect/,https://twitter.com/zoonect
zostera,Zostera specializes provides software and data...,"Aarlanderveen, Zuid-Holland, The Netherlands",1-10,Private,zostera.nl,1839410,,Active,For Profit,info@zostera.nl,...,,,,,,,,https://www.facebook.com/zostera.bv,https://www.linkedin.com/company/zostera/,https://twitter.com/zostera
zoznam-mobile,Zoznam Mobile offers services in the implement...,"Bratislava, Bratislava, Slovakia (Slovak Repub...",1-10,Private,zmb.sk/,1671707,"Jan 1, 2002",Active,For Profit,info@zoznammobile.sk,...,,,,,,,,,,


#### Clean websites list

In [49]:
websites_list = crunch_data["Website"].tolist()

In [50]:
websites_list

['www.01s.it/',
 'www.1000grad.de',
 'www.100pour100net.com//',
 'www.100starlings.com/',
 'www.10geeks.com',
 'www.121digitalmedia.eu/',
 '150sec.com/',
 'wollow-soft.com',
 '1minus1.com/',
 'www.1t-s.com',
 '21stwebb.co.uk',
 '23g.io',
 '247wms.com',
 '2advance.ch',
 'www.2develop.nl',
 'www.2m-a2i.fr/',
 'www.2open.it',
 'www.2see.nl',
 'www.2w.de',
 '33communication.com',
 'www.360d.be',
 'www.360telemetry.com/',
 'www.3asyr.com/',
 'www.3d3.nl/',
 'www.3d-dental.dk',
 '3dit.de; https//govie.de',
 '3ie.fr',
 'www.3m5.de',
 'www.3po.nl/',
 'www.3tiersystems.com',
 'www.3xw.ch',
 'www.40bis.nl',
 'www.4fx.co.uk/',
 'www.4homepages.de',
 '4kstudio.at',
 '4tpm.fr',
 '5w155.ch',
 '69pixl.com/',
 'www.7interactive.cz',
 'www.80si.com',
 'www.8balls.nl/',
 'www.8trust.com/',
 '8web.gr/',
 'www.960labs.com/',
 'www.999web.de',
 'www.99codelines.com',
 'a10sistemas.es',
 'a2colores.es',
 'www.aaltra.eu/',
 'aardenexperts.com/',
 'aardvark-creative.com',
 'www.aardvark.gr',
 'www.ab4d.com/',

In [51]:
# remove / from the end of the string that contains the website
# websites_list = [website.rstrip(website[-1]) if (website[-1] == "/") else website for website in websites_list]
websites_list = [website.rstrip(website[-1]) if (isinstance(website, str) and website[-1] == "/") else website for website in websites_list]
websites_list = [website.rstrip(website[-1]) if (isinstance(website, str) and website[-1] == "/") else website for website in websites_list]

In [129]:
(websites_list)

2792

---

### Step 3: scrape privacy policies

In [170]:
def get_privacy_policy_url(query):
    keyword_in_title = 0
    attempts = 0
    url = ""
    print("Query: " + query)
    
    try:
        query_results_list = return_google_results(query, 3, 5)
        print("Considering " + str(len(query_results_list)) + " URL(s) ...")
        for i, url in enumerate(query_results_list):
            term_in_url = 0
            attempts = attempts + 1
            print("Assessing privacy policy URL: " + url)
            
            if (re.findall('privacy', url) or re.findall('policy', url)): 
                print("Found relevant terms in URL! Succesful break!")
                break

#                     pass
            if keyword_in_title == 1 or attempts == 3 or i==(len(query_results_list)-1): 
                keyword_in_title = 0
                attempts = 0
                print("No results. Breaking ..")
                url = ""
#                 print(sentences)
                break   
    except Exception as e:
            print(str(e))
            pass
    return url

In [185]:
def return_google_results(keywords, num_results, attempts):
    user_agent_list = [
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    ]

    html_keywords = urllib.parse.quote_plus(keywords)
    sleep_init = 10
    
    url = "https://www.google.com/search?q=" + html_keywords + "&num=" + str(num_results)
    print("** Search query in URL: " + url)

    headers = {'User-Agent': random.choice(user_agent_list)}
    
    html = requests.get(url, headers=headers)

    if html.status_code == 429:
        if(attempts == 0):
            sys.exit("Too many request 429, attempted "+ str(5)+ " times, break ...")
        else:
            if 'Retry_After' in html.headers:
                print("Helaas, geen retry-after info")
            else:
                time.sleep(sleep_init)
                print("Too many requests (attempt "+ str(5 - attempts)+ "), we will attempt again in " + str(sleep_init) + " seconds")
                return_google_results(keywords, num_results, (attempts - 1))
    else: 
        pass
        
    soup = BeautifulSoup(html.text, 'html.parser')

    allData = soup.find_all("div",{"class":"g"})

    link_list = []
    print("len alldata: " + str(len(allData)))
    
    for i in range(0,len(allData)):
        link = allData[i].find('a').get('href')
        
        if(link is not None):
            if(link.find('https') != -1 and link.find('http') == 0 and link.find('aclk') == -1):
                print(link)
                link_list.append(link)
    print(link_list)
    return link_list

#### Collect privacy policy URLs

In [186]:
privacy_policies_url_list = []

In [187]:
# loop through each company URL and attempt to find the URL of the privacy policy
for i, url_company in enumerate(websites_list[75:]):
    print(i)
#     print(len(privacy_policies_url_list))
    if(isinstance("url_company", str) is False):
        privacy_policies_url_list.append("")
    else:
        query = "site:\"" + url_company + " \"privacy policy"
        privacy_policies_url_list.append(get_privacy_policy_url(query))
    print()
    time.sleep(10)
    if (i == 500):
        break

0
Query: site:"www.addink.net "privacy policy
** Search query in URL: https://www.google.com/search?q=site%3A%22www.addink.net+%22privacy+policy&num=3
len alldata: 0
[]
Considering 0 URL(s) ...

1
Query: site:"a-digital.one "privacy policy
** Search query in URL: https://www.google.com/search?q=site%3A%22a-digital.one+%22privacy+policy&num=3
len alldata: 2
https://a-digital.one/magazin/wichtige-security-header-fuer-eure-website-und-wie-ihr-sie-einbindet/
https://a-digital.one/magazin/die-4-besten-dsgvo-generatoren-per-knopfdruck-zur-sicheren-datenschutzerklaerung/
['https://a-digital.one/magazin/wichtige-security-header-fuer-eure-website-und-wie-ihr-sie-einbindet/', 'https://a-digital.one/magazin/die-4-besten-dsgvo-generatoren-per-knopfdruck-zur-sicheren-datenschutzerklaerung/']
Considering 2 URL(s) ...
Assessing privacy policy URL: https://a-digital.one/magazin/wichtige-security-header-fuer-eure-website-und-wie-ihr-sie-einbindet/
Assessing privacy policy URL: https://a-digital.one/mag

SystemExit: Too many request 429, attempted 5 times, break ...

In [184]:
keyword = "site:\"" + "https://www.facebook.com" + " \"privacy policy"
number_of_result = 3
llist = return_google_results(keyword, number_of_result, 1)

** Search query in URL: https://www.google.com/search?q=site%3A%22https%3A%2F%2Fwww.facebook.com+%22privacy+policy&num=3
Too many requests (attempt 4), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22https%3A%2F%2Fwww.facebook.com+%22privacy+policy&num=3


SystemExit: Too many request 429, attempted 5 times, break ...

In [101]:
privacy_policies_url_list[74]

''

In [102]:
# privacy_policies_url_list_0_74 = privacy_policies_url_list
# privacy_policies_url_list_15_ = privacy_policies_url_list

In [107]:
privacy_policies_url_list_0_74

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'https://1minus1.com/legal/privacy',
 '',
 'https://21stwebb.co.uk/privacy/',
 '',
 '',
 '',
 '',
 '',
 'https://www.2open.it/cookie-policy-it.html',
 '',
 '',
 'https://33communication.com/privacy_policy/',
 '',
 '',
 '',
 '',
 '',
 '',
 'https://blog.3ie.fr/privacy-policy/',
 '',
 '',
 '',
 '',
 '',
 'https://www.4fx.co.uk/legals-privacy-policy.html',
 'https://www.4homepages.de/privacy-policy',
 '',
 '',
 'https://5w155.ch/legal/privacy-policy',
 '',
 '',
 '',
 '',
 '',
 'https://8web.gr/en/wordpress-web-design-privacy-policy/',
 '',
 '',
 '',
 '',
 '',
 'https://www.aaltra.eu/privacy-policy',
 '',
 'https://www.aardvark-creative.com/privacy.shtml',
 '',
 '',
 '',
 '',
 '',
 '',
 'https://www.ablsoft.it/privacy-policy/',
 '',
 '',
 '',
 'https://accesslab.gr/en/privacy-policy/',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'https://www.adakom.de/en/privacy-policy.html',
 '',
 'https://adamfard.com/privacy-policy',
 'https://www.adamkey.com/en/pri

In [106]:
len([print(collected_url) for collected_url in privacy_policies_url_list if collected_url is not ""])

  len([print(collected_url) for collected_url in privacy_policies_url_list if collected_url is not ""])


0

#### Scrape!

In [112]:
nlp = spacy.load("en_core_web_md")

In [124]:
def scrape_policies_google(url):
    policies = []
    sentences = []    
    try:
        
        article = Article(url)
#             print(url)
        article.download() #Downloads the link’s HTML content
#             print(url)
        article.parse() #Parse the article
#             print(url)
#                 print(article.title)
        doc = nlp(article.text)
        print("PP language = EN?: " + str(detect(article.text) == 'en'))
        print("PP length > 10 sentences?: " + str(len(list(doc.sents)) > 10))

        if detect(article.text) == 'en' and len(list(doc.sents)) > 10:
            print("Policy meets requirements of language and length ... ")
            sentences = list(doc.sents)
            print("Scraping successful!")

        else:
            print("Scraping not successful")
    except:
            pass
    print()
    return sentences

In [110]:
len([print(collected_url) for collected_url in privacy_policies_url_list_0_74 if collected_url != ""])

https://1minus1.com/legal/privacy
https://21stwebb.co.uk/privacy/
https://www.2open.it/cookie-policy-it.html
https://33communication.com/privacy_policy/
https://blog.3ie.fr/privacy-policy/
https://www.4fx.co.uk/legals-privacy-policy.html
https://www.4homepages.de/privacy-policy
https://5w155.ch/legal/privacy-policy
https://8web.gr/en/wordpress-web-design-privacy-policy/
https://www.aaltra.eu/privacy-policy
https://www.aardvark-creative.com/privacy.shtml
https://www.ablsoft.it/privacy-policy/
https://accesslab.gr/en/privacy-policy/
https://www.adakom.de/en/privacy-policy.html
https://adamfard.com/privacy-policy
https://www.adamkey.com/en/privacy/
https://www.advertage.com/privacy-policy/
https://www.adverteaser.com/privacy.html
https://www.aersyn.com/privacy-policy/
https://www.agifly.be/privacy-policy
https://agileapp.co/privacy-policy/
https://ag-prop.com/page/privacy
https://www.aircury.com/privacy-policy
https://www.ais.pl/en/privacy-policy/


24

In [135]:
pp_list_sentences_0_74 = []
for i, pp_url in enumerate(privacy_policies_url_list_0_74):
    if pp_url == "":
        pp_list_sentences_0_74.append("")
    else:
        pp_list_sentences_0_74.append(scrape_policies_google(pp_url))

PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language and length ... 
Scraping successful!

PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language and length ... 
Scraping successful!

PP language = EN?: False
PP length > 10 sentences?: True
Scraping not successful


PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language and length ... 
Scraping successful!

PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language and length ... 
Scraping successful!

PP language = EN?: False
PP length > 10 sentences?: True
Scraping not successful

PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language and length ... 
Scraping successful!

PP language = EN?: True
PP length > 10 sentences?: False
Scraping not successful

PP language = EN?: True
PP length > 10 sentences?: True
Policy meets requirements of language

In [136]:
print(len(privacy_policies_url_list_0_74))

74


In [138]:
print((pp_list_sentences_0_74))

['', '', '', '', '', '', '', '', [Be prepared to be wowed by the best privacy policy you have ever read., It’s so good you may want to copy it., But please don’t do that because it’s not allowed.

, This is our privacy policy., It is fairly simple, because we are not doing anything particularly interesting with our data., It is positively enthralling, as all good privacy policies are.

, You might guess that this Privacy Policy is applicable to this website (Referred to as “This Site”) only and does not cover any other 1minus1 Limited websites, or those of its partners or any other external links from this website., 1minus1, Limited is committed to the privacy of its website visitors and is registered and observes all of the requirements of the Data Protection Act 2018.

, 1minus1, Limited collects non-personal information about your visit to This Site., This information is used for reporting on the demographics of our visitors., This information is available internally only, and we lo

In [120]:
[print(len(pp)) for pp in pp_list_sentences]

27
14
0
0
20
188
0
57
0
70
11
0
16
124
110
72
0
0
40
0
46
0
38
0
0


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

---

### Step 4: classification

#### Create Dataframe

In [140]:
crunch_data_0_74 = crunch_data.iloc[0:74].copy(deep=True)

In [141]:
crunch_data_0_74

Unnamed: 0_level_0,Description,Location,Employees,Type,Website,Rank,Founded Date,Operating Status,Company Type,Contact Email,...,Industry 31,Industry 32,Industry 33,Industry 34,Industry 35,Industry 36,Industry 37,Link 1,Link 2,link_3
Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01s-community-company,01S Community company communicates and interac...,"Arezzo, Toscana, Italy",51-100,Private,www.01s.it/,1284758,,Active,For Profit,info@01s.it,...,,,,,,,,https://www.facebook.com/01esse/,https://www.linkedin.com/company/01s-community...,
1000-digital,1000 ° Digital develops innovative web applica...,"Leipzig, Sachsen, Germany",11-50,Private,www.1000grad.de,1480851,2000,Active,For Profit,info@1000grad.de,...,,,,,,,,https://www.facebook.com/1000graddigital,https://www.linkedin.com/company/1000digital/,https://twitter.com/1000digital
100-net,100% Net is a global internet solution for all...,"Pérols, Limousin, France",1-10,Private,www.100pour100net.com//,986874,2003,Active,For Profit,contact@100p100.net,...,,,,,,,,https://www.facebook.com/100pour100Net/,https://www.linkedin.com/company/100-net/,https://www.twitter.com/100pour100net
100starlings,"100 Starlings creates web and mobile apps, the...","London, England, United Kingdom",1-10,Private,www.100starlings.com/,388746,2015,Active,For Profit,info@100starlings.com,...,,,,,,,,https://www.linkedin.com/company/100starlings-...,https://twitter.com/100Starlings?utm_source=hi...,
10geeks-software-engineering,"10Geeks designs, develops, and analyzes tailor...","Fohren, Baden-Wurttemberg, Germany",1-10,Private,www.10geeks.com,651610,"Jan 1, 2012",Active,For Profit,info@10geeks.com,...,,,,,,,,https://www.linkedin.com/company/10geeks,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
acusu,Acusu works across several sectors to thrive i...,"Tamworth, Staffordshire, United Kingdom",1-10,Private,acusu.com,1796802,,Active,For Profit,Sales@Acusu.Com,...,,,,,,,,https://www.facebook.com/AcusuUK,,
adakom,ADAKOM offers an IT solutions for heat exchang...,"Berlin, Berlin, Germany",11-50,Private,www.adakom.de,2015788,1998,Active,For Profit,info@adakom.de,...,,,,,,,,https://www.linkedin.com/company/adakom,,
adamapp,ADAMAPP helps shape your idea into reality wit...,"London, England, United Kingdom",51-100,Private,www.adamapp.com/,678657,"Jan 1, 2011",Active,For Profit,hello@adamapp.co.uk,...,,,,,,,,https://www.facebook.com/adamappltd/,https://twitter.com/AdastraCorp,
adam-fard-ux-studio,Adam Fard UX Studio is a tech company providin...,"Berlin, Berlin, Germany",1-10,Private,adamfard.com,212943,"Jan 1, 2016",Active,For Profit,contact@adamfard.com,...,,,,,,,,https://www.facebook.com/AdamFardStudio/,https://www.linkedin.com/company/adam-fard,


In [127]:
thresholds = [0.014130434782608696, 0.035326086956521736, 0.017934782608695653, 0.03369565217391304, 0.009782608695652175]

In [128]:
def preprocessing(pps):
#     tokenizer = nlp.tokenizer
    # tokenize sentences
    tokenized_sent = [sent.split() for sent in pps]
    
    # remove punctuation
    tokenized_sent = [[re.sub('[,’\'\.!?&“”():*_;"]', '', y) for y in x] for x in tokenized_sent]
    
    # remove words with numbers in them
    tokenized_sent = [[y for y in x if not any(c.isdigit() for c in y)] for x in tokenized_sent]
    
    # remove stopwords   
    tokenized_sent_clean = tokenized_sent
#     tokenized_sent_clean = [[y for y in x if y not in stopwords.words('english')] for x in tokenized_sent]
    
    # from nltk.stem import PorterStemmer
    porter = PorterStemmer()
    tokenized_sent_clean = [[porter.stem(y) for y in x] for x in tokenized_sent_clean]
    
#     lemmatizer = WordNetLemmatizer()
#     tokenized_sent_clean = [[lemmatizer.lemmatize(y) for y in x] for x in tokenized_sent_clean]

    
    detokenized_pps = []
    for i in range(len(tokenized_sent_clean)):
        t = ' '.join(tokenized_sent_clean[i])
        detokenized_pps.append(t) 
    
    return detokenized_pps