# Privacy Policy Scraper
Python script to scrape privacy policies from Google search results.
The script makes use of:
 - googlesearch library (https://python-googlesearch.readthedocs.io/en/latest/genindex.html)
 - newspaper library (https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#performing-nlp-on-an-article)
 - nltk (http://www.nltk.org/api/nltk.tokenize.html?highlight=punkt)
 
 Inspired by:
 https://pythondata.com/quick-tip-consuming-google-search-results-to-use-for-web-scraping/

In [1]:
# Import libs
import os
from googlesearch import search
from newspaper import Article
from six.moves.urllib.parse import urlparse
import nltk
import pandas as pd
import spacy
from langdetect import detect
import re

## Import SMEs and LEs

The following datasets were collected through Orbis Europe:
- dataset 1 (SMEs): SME_50000_Export 17_05_2021_12_54.xlsx
- dataset 2 (LEs): LE_50000_Export 17_05_2021_12_54.xlsx

In [2]:
# load companies' data
data_sme = pd.read_excel('data/SME_50000_Export 17_05_2021_12_54.xlsx', index_col=0, header=0, sheet_name='Results')
data_le = pd.read_excel('data/LE_50000_Export 17_05_2021_12_54.xlsx', index_col=0, header=0, sheet_name='Results')

In [3]:
data_sme.head()

Unnamed: 0,Company name Latin alphabet,Inactive,Quoted,Branch,OwnData,Woco,Country ISO code,"NACE Rev. 2, core code (4 digits)",Consolidation code,Last avail. year,...,Number of employees\nLast avail. yr,Additional address(es) - Country,Additional address(es) - Country ISO code,Additional address(es) - Standardized city,"NAICS 2017, core code (4 digits)","NAICS 2017, core code - description","NAICS 2017, primary code(s)","NAICS 2017, primary code(s) - description","NAICS 2017, secondary code(s) - description","NAICS 2017, secondary code(s) - description.1"
1.0,VIVACE LOGISTICA SA,No,No,No,No,No,ES,5210.0,U1,2018.0,...,84,,,,4931.0,Warehousing and Storage,493110.0,General Warehousing and Storage,All Other Support Activities for Transportation,All Other Support Activities for Transportation
,,,,,,,,,,,...,,,,,,,493190.0,Other Warehousing and Storage,,
2.0,MENSA,No,No,No,No,No,FR,4711.0,U1,2019.0,...,n.a.,,,,4451.0,Grocery Stores,445110.0,Supermarkets and Other Grocery (except Conveni...,,
,,,,,,,,,,,...,,,,,,,445120.0,Convenience Stores,,
3.0,MINITEC ESPANA SL,No,No,No,No,No,ES,2849.0,U1,2019.0,...,39,,,,3335.0,Metalworking Machinery Manufacturing,333517.0,Machine Tool Manufacturing,Other Commercial and Service Industry Machiner...,Other Commercial and Service Industry Machiner...


## Extract list of company names

In [4]:
# Get list of company names
# SME_list = data_sme['Company name Latin alphabet'].to_list()
SME_list = [name for name in data_sme['Company name Latin alphabet'].to_list() if str(name) != 'nan']
LE_list = [name for name in data_le['Company name Latin alphabet'].to_list() if str(name) != 'nan']

In [5]:
SME_list[:5]

['VIVACE LOGISTICA SA',
 'MENSA',
 'MINITEC ESPANA SL',
 'SNEXI',
 'PHARMACIE DE LA GRANGE LUX FORTINA']

## Select random policies for validation of the classification model

In [6]:
import random
# Draw 50 random indices (with replacement) into the company lists
idx_list = [random.randint(0, len(SME_list) - 1) for _ in range(50)]
print(idx_list)

[10926, 47193, 43768, 22477, 25883, 8274, 20550, 32552, 43751, 32033, 46787, 4063, 12873, 94, 1071, 13104, 36085, 11271, 40874, 26559, 47419, 590, 1325, 43843, 36509, 32343, 537, 31989, 45706, 31615, 9482, 13785, 48261, 29938, 43293, 892, 48970, 16905, 46133, 11607, 625, 45932, 2780, 2836, 26139, 3564, 21041, 36451, 15752, 5668]
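
`random.randint` draws with replacement, so the 50 indices are not guaranteed to be unique, and a rerun produces a different sample. A minimal sketch of an alternative, assuming unique indices and a reproducible draw are wanted. The helper name `sample_indices` and the seed value 42 are illustrative assumptions, not part of the original notebook:

```python
import random

def sample_indices(population_size, k, seed=42):
    """Return k distinct indices in [0, population_size), reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(range(population_size), k)

idx_demo = sample_indices(50000, 50)
print(len(idx_demo), len(set(idx_demo)))
```

Calling the helper again with the same seed returns the same indices, which makes the validation sample easier to document.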


#### Indices of randomly selected privacy policies (we will use the first 10) that will be manually labeled to validate the classification model for both SMEs and LEs
[15991, 16719, 34111, 5317, 27629, 47251, 47641, 27152, 17003, 15962, 43942, 4178, 15640, 32350, 1895, 15762, 32443, 3645, 2255, 2876, 47697, 11464, 12510, 21791, 14217, 253, 42911, 22824, 12261, 45549]

## SMEs and LEs: scrape policies for the test set

In [8]:
nlp = spacy.load("en_core_web_md")

Scrape policies from Google. For each company we walk through the search results and only consider URLs that contain "privacy", "policy", or "notice".
A result is accepted when the page is in English and long enough; otherwise we move on to the next result.
After 3 results without an accepted policy we go on to the next company in the list of SMEs or LEs. In short, we use the following requirements:
- English language;
- longer than 19 sentences (Avg SME: 101, Avg LE: 70);
- at most 3 attempts to find a privacy policy based on company name.

In [13]:
def scrape_policies_google(query):
    """Return the sentences of the first acceptable policy among the top 3 results."""
    sentences = []
    # Look at no more than the first 3 search results for this query
    for attempts, url in enumerate(search(query, lang='en'), start=1):
        # Only consider URLs that contain one of the key terms
        if re.search(r'privacy|policy|notice', url):
            try:
                article = Article(url)
                article.download()  # Download the link's HTML content
                article.parse()  # Parse the article
                doc = nlp(article.text)
                doc_sents = list(doc.sents)
                # Accept English texts longer than 19 sentences
                if detect(article.text) == 'en' and len(doc_sents) > 19:
                    print(url)
                    sentences = doc_sents
            except Exception:
                # Skip pages that fail to download, parse, or language-detect
                pass
        if sentences or attempts == 3:
            break
    return sentences
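
The URL check can be isolated and tried out without any network access. The sketch below is a hedged illustration: the helper name `url_mentions_policy`, the case-insensitive matching, and the `path_only` switch (which looks at the URL path only, via `urlparse`) are assumptions added for illustration and are not what the function above does.

```python
import re
from urllib.parse import urlparse

KEYWORDS = re.compile(r"privacy|policy|notice", re.IGNORECASE)

def url_mentions_policy(url, path_only=False):
    """True if a key term appears in the URL, or only in its path if path_only."""
    target = urlparse(url).path if path_only else url
    return bool(KEYWORDS.search(target))

print(url_mentions_policy("https://www.sagemcom.com/legal-notice/"))             # True
print(url_mentions_policy("https://privacy.example.com/about", path_only=True))  # False
```

A keyword match alone does not show that the page belongs to the queried company, so the accepted URLs are still worth a manual look.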

Create lists of the randomly selected companies

In [10]:
selected_sme = [SME_list[i] for i in idx_list]
selected_le = [LE_list[i] for i in idx_list]
selected_sme

['G2L NIORT',
 'TEATE ECOLOGIA - S.P.A.',
 'POLSOFT ENGINEERING SP. Z O.O.',
 'THRACE ANGUS FARM EOOD',
 'ALFAFIBER SP. Z O.O.',
 'CASARRUBIO IMPORT EXPORT S.L.',
 "SPORTPIU' SALUTE E BENESSERE SPORTIVA DILETTANTISTICA S.R.L.",
 'SCHWUNG HOME SP. Z O.O.',
 'D B PROGETTI S.R.L.',
 'HANSPETER BIEHLER GMBH',
 'ASCO ASSISTANCE & CONSEIL',
 'CASINO MANHATTAN SRL',
 'FRERES EN BIENS',
 'MOON C',
 'MILANO FOOD SERVICE S.R.L.',
 'MAFRECAL, LDA',
 'OEC OPTO DEVICES GMBH',
 "COMPAGNIA RICICLAGGIO INERTI SOCIETA' A RESPONSABILITA' LIMITATA (ENUNCIABILE ANCHE COME CO.R.I. S.R.L. )",
 'SMART INDUSTRIES LESNA INDUSTRIJA D.O.O.',
 'AXPE CONSULTING SL',
 'PERLAPELLE S.R.L.',
 'SATKO OOD',
 'ARMANDO GOMES LINDO & FILHOS, LDA',
 'ADVITIS',
 'I.DE.I CONCEPTS AB',
 'KARO 06 EOOD',
 'EXICOM, S.R.O.',
 'ELFA SRL',
 'ELEKTROTECHNIK BUERGLER & MOOSLECHNER GMBH',
 'SCANTEST SIA',
 'GLORIA CERVECEROS S.L.',
 "FRIENDZ SOCIETA' A RESPONSABILITA' LIMITATA",
 'SPOLDZIELNIA PRACY SPECJALISTOW RENTGENOLOGOW IM. PROF.

#### SMEs

In [11]:
pps_sme = []
counter = []

for company in selected_sme:
    query = company + " privacy policy"
    print(" - - - " + company + " - - - ")
    out = scrape_policies_google(query)
    
    if(len(out) != 0):
        pps_sme.extend(out)
        counter.append(out)
        print("Number of scraped policies: ", len(counter))
        print("")
    if(len(counter) == 20):
        break

 - - - G2L NIORT - - - 
181
https://go-2-learn.com/g2l-privacy-policy/
Number of scraped policies:  1

 - - - TEATE ECOLOGIA - S.P.A. - - - 
 - - - POLSOFT ENGINEERING SP. Z O.O. - - - 
 - - - THRACE ANGUS FARM EOOD - - - 
 - - - ALFAFIBER SP. Z O.O. - - - 
61
https://user.com/en/privacy-policy/
Number of scraped policies:  2

 - - - CASARRUBIO IMPORT EXPORT S.L. - - - 
 - - - SPORTPIU' SALUTE E BENESSERE SPORTIVA DILETTANTISTICA S.R.L. - - - 
100
 - - - SCHWUNG HOME SP. Z O.O. - - - 
3
 - - - D B PROGETTI S.R.L. - - - 
 - - - HANSPETER BIEHLER GMBH - - - 
9
 - - - ASCO ASSISTANCE & CONSEIL - - - 
371
https://www.asco.org/about-asco/legal/privacy-policy
Number of scraped policies:  3

 - - - CASINO MANHATTAN SRL - - - 
156
https://www.cashmio.com/privacy-policy
Number of scraped policies:  4

 - - - FRERES EN BIENS - - - 
129
https://www.cnp.be/privacy-policy/
Number of scraped policies:  5

 - - - MOON C - - - 
467
https://mooncamp.com/privacy-policy/
Number of scraped policies:  6

 

#### LEs

In [12]:
pps_le = []
counter = []

for company in selected_le:
    query = company + " privacy policy"
    print(" - - - " + company + " - - - ")
    out = scrape_policies_google(query)
    
    if(len(out) != 0):
        pps_le.extend(out)
        counter.append(out)
        print("Number of scraped policies: ", len(counter))
        print("")
    if(len(counter) == 20):
        break

 - - - NA FASTIGHETER AB - - - 
 - - - SCAUTO - - - 
7
38
https://www.scautorepair.com/privacy-policy.html
Number of scraped policies:  1

 - - - ESSOR AGRO - - - 
 - - - HERMANOS GELIDA SA - - - 
 - - - AMASTEN NYNASHAMN AB - - - 
 - - - PROMOCIONES ORCERA SL - - - 
131
https://www.helixperiences.com/en/privacy-policy
Number of scraped policies:  2

 - - - V FRANCE - - - 
91
https://www.politico.eu/article/europe-privacy-rules-survived-years-of-negotiations-lobbying/
Number of scraped policies:  3

 - - - CORPUS SIREO PROJEKTENTWICKLUNG WOHNEN GMBH - - - 
 - - - BILKOMPANIET GAVLEBORG AB - - - 
 - - - THELLIER CAMPING CAR - - - 
 - - - UNIVERSITAETSKLINIKUM SCHLESWIG-HOLSTEIN CAMPUS KIEL - - - 
 - - - TELFORT ZAKELIJK B.V. - - - 
48
 - - - IMECO SA - - - 
 - - - TRMC - - - 
52
https://www.trmchealth.org/privacy-policy/
Number of scraped policies:  4

 - - - SAGEMCOM HOLDING - - - 
69
https://www.sagemcom.com/legal-notice/
Number of scraped policies:  5

 - - - DELFINO - - - 
65
http:/

### Clean policies
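Clean the privacy policy sentences:
- lowercase
- remove semicolons
- drop sentences of 3 tokens or fewer

Newlines are removed later, when the sentences are stored in a DataFrame.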

In [19]:
pps_sme = [(str(sen).lower()).replace(";","") for sen in pps_sme if len(sen) > 3]
pps_le = [(str(sen).lower()).replace(";","") for sen in pps_le if len(sen) > 3]
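
In the cell above, `len(sen)` is applied to spaCy spans, so it counts tokens rather than characters. A minimal sketch of the same cleaning on plain strings, assuming whitespace-separated words are an acceptable stand-in for spaCy tokens. The helper name `clean_sentences` and the `min_tokens` parameter are hypothetical:

```python
def clean_sentences(sentences, min_tokens=4):
    """Lowercase, drop semicolons, flatten newlines, skip very short sentences."""
    cleaned = []
    for sen in sentences:
        text = str(sen)
        if len(text.split()) < min_tokens:  # whitespace tokens, not spaCy tokens
            continue
        cleaned.append(text.lower().replace(";", "").replace("\n", " "))
    return cleaned

demo = [
    "Your Privacy is very important to us;",
    "Meadow Hill Road",
    "We use\ncookies on this site.",
]
print(clean_sentences(demo))
```

The three-word address line is dropped, which is the kind of fragment the length filter is meant to remove.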

In [20]:
print(len(pps_sme))
print(len(pps_sme)/20)
print(len(pps_le))
print(len(pps_le)/20)

2009
100.45
1397
69.85


Test set statistics:
- Avg number of sentences in policy SMEs: 101
- Avg number of sentences in policy LEs: 70

### Save policies
Save the privacy policy texts as tab-separated CSV files, with one zero-initialised column per label

In [21]:
# pps_sme = [(str(sen).lower()) for sen in pps_sme if len(sen) > 3]

zer_list = [0] * len(pps_sme)

print(len(zer_list))
# d = {'Text': pps, 'DPO': zer_list, 'Purpose': zer_list, 'Acquired data': zer_list, 'Data sharing': zer_list, 'Rights': zer_list}

df_sme = pd.DataFrame(list(zip(pps_sme, zer_list,zer_list,zer_list,zer_list,zer_list)),
                                 columns =['Text', 'DPO', 'Purpose', 'Acquired data', 'Data sharing', 'Rights'])

# extra cleaning
df_sme = df_sme.replace('\n',' ', regex=True)

2009


In [22]:
zer_list = [0] * len(pps_le)
# d = {'Text': pps, 'DPO': zer_list, 'Purpose': zer_list, 'Acquired data': zer_list, 'Data sharing': zer_list, 'Rights': zer_list}

df_le = pd.DataFrame(list(zip(pps_le, zer_list,zer_list,zer_list,zer_list,zer_list)),
                                 columns =['Text', 'DPO', 'Purpose', 'Acquired data', 'Data sharing', 'Rights'])

# extra cleaning
df_le = df_le.replace('\n',' ', regex=True)

In [23]:
df_le

Unnamed: 0,Text,DPO,Purpose,Acquired data,Data sharing,Rights
0,this privacy policy sets out how s.c. auto rep...,0,0,0,0,0
1,s.c. auto repairs is committed to ensuring tha...,0,0,0,0,0
2,should we ask you to provide certain informati...,0,0,0,0,0
3,s.c. auto repairs may change this policy from ...,0,0,0,0,0
4,you should check this page from time to time t...,0,0,0,0,0
...,...,...,...,...,...,...
1392,a cookie is a small file that is downloaded an...,0,0,0,0,0
1393,"cookies allow the website, amongst other thing...",0,0,0,0,0
1394,the user has the option to prevent the generat...,0,0,0,0,0
1395,you can obtain more information by reading our...,0,0,0,0,0


In [24]:
df_sme.head(50)

Unnamed: 0,Text,DPO,Purpose,Acquired data,Data sharing,Rights
0,privacy policy for go2learn,0,0,0,0,0
1,", llc updated: february 19, 2021",0,0,0,0,0
2,", llc 4198",0,0,0,0,0
3,meadow hill road,0,0,0,0,0
4,"cazenovia, ny 13035",0,0,0,0,0
5,"go2learn, llc and our subsidiaries and affilia...",0,0,0,0,0
6,your privacy is very important to us.,0,0,0,0,0
7,our goal is to deliver content targeted to the...,0,0,0,0,0
8,in order to provide you with relevant informat...,0,0,0,0,0
9,this privacy policy applies to all products an...,0,0,0,0,0


In [25]:
df_sme.to_csv(r'data/SME-GDPR-test.csv', sep='\t', encoding='utf-8', index=False)
df_le.to_csv(r'data/LE-GDPR-test.csv', sep='\t', encoding='utf-8', index=False)

#### Select the companies whose indices were not drawn for the test set
- this is the final set that will be used for comparing the GDPR compliance or GDPR awareness between SMEs and LEs

In [26]:
final_sme = [sme for ind, sme in enumerate(SME_list) if ind not in idx_list]
final_le = [le for ind, le in enumerate(LE_list) if ind not in idx_list]
#2177
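
Checking `ind not in idx_list` scans the whole index list for each of the roughly 50,000 companies. A small sketch of the same filter with a set lookup, assuming the result should be identical. The helper name `exclude_indices` is hypothetical:

```python
def exclude_indices(items, excluded_indices):
    """Return items whose position is not listed in excluded_indices."""
    excluded = set(excluded_indices)  # constant-time membership tests
    return [item for ind, item in enumerate(items) if ind not in excluded]

names = ["A SA", "B SL", "C GMBH", "D AB"]
print(exclude_indices(names, [1, 3, 3]))  # ['A SA', 'C GMBH']
```

Duplicate indices in the excluded list are harmless here, since each position is removed at most once.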

In [41]:
len(f_pps_sme_final)
# final_sme.index('JAMONES EL CHARRO SA')

277

## SMEs: scrape 1000 policies

In [40]:
# pps_sme_final = []
# f_pps_sme_final = []

for company in final_sme:
    query = company + " privacy policy"
    print(" - - - " + company + " - - - ")
    out = scrape_policies_google(query)
    
    if(len(out) != 0):
        pps_sme_final.extend(out)
        f_pps_sme_final.append(out)
        print("Number of scraped policies: ", len(f_pps_sme_final))
        print("")
    if(len(f_pps_sme_final) == 1000):
        break

 - - - J - - - 
https://www.junaidjamshed.com/privacy-policy-j
Number of scraped policies:  258

 - - - A - - - 
https://en.wikipedia.org/wiki/Privacy_policy
Number of scraped policies:  259

 - - - M - - - 
https://www.groupm.com/mplatform-privacy-notice/
Number of scraped policies:  260

 - - - O - - - 
https://www.privacypolicies.com/blog/privacy-policy-template/?sa=X&ved=2ahUKEwiPi6jXuP3wAhXCO-wKHTAjDKEQ9QF6BAgGEAI
Number of scraped policies:  261

 - - - N - - - 
https://www.modeln.com/privacy-policy/
Number of scraped policies:  262

 - - - E - - - 
https://ec.europa.eu/info/privacy-policy_en
Number of scraped policies:  263

 - - - S - - - 
https://www.privacypolicies.com/blog/privacy-policy-template/
Number of scraped policies:  264

 - - -   - - - 
https://www.privacypolicies.com/blog/privacy-policy-template/?sa=X&ved=2ahUKEwif_9DfuP3wAhUpgP0HHS3TDZEQ9QF6BAgKEAI
Number of scraped policies:  265

 - - - E - - - 
https://ec.europa.eu/info/privacy-policy_en
Number of scraped poli

In [25]:
#clean list of sentences
pps_sme_final = [(str(sen).lower()) for sen in pps_sme_final if len(sen) > 3]

#clean list of list of sentences - this will be the output file
f_pps_sme_final = [[(str(sen).lower()).replace(";","") for sen in pol if len(sen) > 3] for pol in f_pps_sme_final]

#### Write policies to separate files

In [26]:
os.makedirs('data/SME', exist_ok=True)  # make sure the output folder exists
calc_len = 0
for i, policy in enumerate(f_pps_sme_final):
    zer_list = [0] * len(policy)
    
    calc_len = calc_len + len(zer_list)
    # d = {'Text': pps, 'DPO': zer_list, 'Purpose': zer_list, 'Acquired data': zer_list, 'Data sharing': zer_list, 'Rights': zer_list}

    df_sme_f = pd.DataFrame(list(zip(policy, zer_list,zer_list,zer_list,zer_list,zer_list)),
                                     columns =['Text', 'DPO', 'Purpose', 'Acquired data', 'Data sharing', 'Rights'])

    # extra cleaning
    df_sme_f = df_sme_f.replace('\n',' ', regex=True)
    df_sme_f.to_csv(r'data/SME/' + str(i) + '.csv', sep='\t', encoding='utf-8', index=False)

#### Average length of the policies (SMEs): 92 sentences

In [60]:
print(calc_len/len(f_pps_sme_final))

91.995


## LEs: scrape 1000 policies

In [19]:
# pps_le_final = []
# f_pps_le_final = []

for company in final_le_ctd:
    query = company + " privacy policy"
    print(" - - - " + company + " - - - ")
    out = scrape_policies_google(query)
    
    if(len(out) != 0):
        pps_le_final.extend(out)
        f_pps_le_final.append(out)
        print("Number of scraped policies: ", len(f_pps_le_final))
        print("")
    if(len(f_pps_le_final) == 1000):
        break

 - - - ARCE CERAMICAS SL - - - 
 - - - CINTRA - URBANIZACOES, TURISMO E CONSTRUCOES, S.A. - - - 
 - - - VTA AUSTRIA GMBH - - - 
https://vta.cc/en/privacy-policy
Number of scraped policies:  981

 - - - POWSZECHNA SPOLDZIELNIA MIESZKANIOWA RENAWA - - - 
 - - - "HENCKE SYSTEMBERATUNG" GMBH - - - 
 - - - CHASTAGNER LOCATION - - - 
 - - - BREDA TECNOLOGIE COMMERCIALI S.R.L. - - - 
http://www.breda.tech/en/cookie-policy/
Number of scraped policies:  982

 - - - TAZAKI FOODS LIMITED - - - 
 - - - DUVAL ELECTRICITE - - - 
https://www.tranzcom.com/en/privacy-policy/
Number of scraped policies:  983

 - - - CASTILLO DE SAN LUIS SL - - - 
 - - - TAYS SYDANKESKUS OY - - - 
 - - - RW EINDHOVEN BLOK 59 B.V. - - - 
 - - - GEDO GRUNDSTUECKSENTWICKLUNGS UND  VERWALTUNGSGESELLSCHAFT MBH & CO. KG - - - 
 - - - DEMAG ERGOTECH GMBH - - - 
https://www.sumitomo-shi-demag.eu/privacy-policy
Number of scraped policies:  984

 - - - PERRIN ET FILS - - - 
 - - - THERMO ELECTRON SWEDEN AB - - - 
 - - - BAVARIAN N

In [18]:
len(final_le)
final_le[2819]
final_le_ctd = final_le[2819:]
print(len(final_le))
print(len(final_le_ctd))

49950
47131
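
Slicing the company list by hand is one way to resume an interrupted run. A hedged sketch of a loop that reports where it stopped, so that the next call can continue from there. The names `collect_policies` and `fake_scrape` are hypothetical, and `fake_scrape` only stands in for `scrape_policies_google` so that the example runs offline. The guard against a plain string is a precaution, since looping over a string yields single characters:

```python
def collect_policies(companies, scrape, target, start=0):
    """Scrape companies[start:] until `target` policies are found.

    Returns (policies, next_index) so that a later call can resume at next_index.
    """
    if isinstance(companies, str):
        raise TypeError("companies must be a list of names, not a single string")
    policies = []
    position = start
    while position < len(companies) and len(policies) < target:
        sentences = scrape(companies[position] + " privacy policy")
        if sentences:
            policies.append(sentences)
        position += 1
    return policies, position

def fake_scrape(query):
    # Stand-in scraper: "finds" a policy only for names starting with "A"
    return ["sentence one", "sentence two"] if query.startswith("A") else []

found, resume_at = collect_policies(["ACME", "BETA", "ALFA", "ATLAS"], fake_scrape, target=2)
print(len(found), resume_at)  # 2 3
```

Passing `start=resume_at` on the next call continues with the first company that has not been queried yet.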


In [20]:
#clean list of sentences
pps_le_final = [(str(sen).lower()) for sen in pps_le_final if len(sen) > 3]

#clean list of list of sentences - this will be the output file
f_pps_le_final = [[(str(sen).lower()).replace(";","") for sen in pol if len(sen) > 3] for pol in f_pps_le_final]

#### Write policies to separate files

In [21]:
os.makedirs('data/LE', exist_ok=True)  # make sure the output folder exists
calc_len = 0
for i, policy in enumerate(f_pps_le_final):
    zer_list = [0] * len(policy)
    
    calc_len = calc_len + len(zer_list)
    # d = {'Text': pps, 'DPO': zer_list, 'Purpose': zer_list, 'Acquired data': zer_list, 'Data sharing': zer_list, 'Rights': zer_list}

    df_le_f = pd.DataFrame(list(zip(policy, zer_list,zer_list,zer_list,zer_list,zer_list)),
                                     columns =['Text', 'DPO', 'Purpose', 'Acquired data', 'Data sharing', 'Rights'])

    # extra cleaning
    df_le_f = df_le_f.replace('\n',' ', regex=True)
    df_le_f.to_csv(r'data/LE/' + str(i) + '.csv', sep='\t', encoding='utf-8', index=False)

#### Average length of the policies (LEs): 90 sentences

In [27]:
print(calc_len/len(f_pps_le_final))

90.25
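
The average can also be computed directly from the nested list of policies, which avoids keeping a running counter in sync with the list it is divided by. A minimal sketch, assuming each policy is a list of sentences. The helper name `average_policy_length` is hypothetical:

```python
from statistics import mean

def average_policy_length(policies):
    """Mean number of sentences per policy."""
    return mean(len(policy) for policy in policies)

demo_policies = [["a", "b", "c"], ["d"], ["e", "f"]]
print(average_policy_length(demo_policies))
```

In the notebook this would be called once with `f_pps_sme_final` and once with `f_pps_le_final`.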
