# Assessing GDPR-Compliance in Web Applications: A Machine Learning Approach

We assess the GDPR-compliance of web applications based on their privacy policies. A classification model, trained on a corpus of 18,397 natural sentences, classifies each privacy policy according to whether it communicates five core privacy policy requirements of the General Data Protection Regulation (GDPR).

__Relevance:__ The GDPR applies to the processing of personal data of individuals in the EU. We aim to assess the state of GDPR-compliance in application software based on its privacy policies.

__Focus:__ web applications, because the web application paradigm is widely used thanks to the omnipresence of web browsers across PCs and mobile devices.

__Goal:__ to scrutinize the privacy policies of web applications using machine learning (ML) and assess whether they communicate the core privacy policy requirements.

#### __RQ:__ What is the state of GDPR-compliance disclosure in web applications?

---

### Step 1: collect list of companies active in the Web Apps industry

We use the Crunchbase database to identify companies that are active in web applications, filtered by location (in our case, the European Union).

We imported 2,792 companies using the following criteria:
- Industry: Web Apps
- Location: Europe (European Union)

---

In [1]:
import os
from newspaper import Article
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import urllib.parse  # explicit submodule import; urllib.parse.quote_plus is used below
import sys
import time
import nltk
import pandas as pd
import requests
import spacy
import random
# from googlesearch import search
from langdetect import detect
import re
import pickle
import math

### Step 2: read data

In [2]:
# crunch_data_init = pd.read_excel('data/Advanced Search _ Companies _ Crunchbase.xlsx', index_col=0) 
crunch_data = pd.read_excel('data/companies.xlsx', index_col=0) 

In [3]:
# crunch_data = crunch_data_init.iloc[0:999]

In [4]:
crunch_data

Identifier,Description,Location,Employees,Type,Website,Rank,Founded Date,Operating Status,Company Type,Contact Email,...,Industry 31,Industry 32,Industry 33,Industry 34,Industry 35,Industry 36,Industry 37,Link 1,Link 2,link_3
01s-community-company,01S Community company communicates and interac...,"Arezzo, Toscana, Italy",51-100,Private,www.01s.it/,1284758,,Active,For Profit,info@01s.it,...,,,,,,,,https://www.facebook.com/01esse/,https://www.linkedin.com/company/01s-community...,
1000-digital,1000 ° Digital develops innovative web applica...,"Leipzig, Sachsen, Germany",11-50,Private,www.1000grad.de,1480851,2000,Active,For Profit,info@1000grad.de,...,,,,,,,,https://www.facebook.com/1000graddigital,https://www.linkedin.com/company/1000digital/,https://twitter.com/1000digital
100-net,100% Net is a global internet solution for all...,"Pérols, Limousin, France",1-10,Private,www.100pour100net.com//,986874,2003,Active,For Profit,contact@100p100.net,...,,,,,,,,https://www.facebook.com/100pour100Net/,https://www.linkedin.com/company/100-net/,https://www.twitter.com/100pour100net
100starlings,"100 Starlings creates web and mobile apps, the...","London, England, United Kingdom",1-10,Private,www.100starlings.com/,388746,2015,Active,For Profit,info@100starlings.com,...,,,,,,,,https://www.linkedin.com/company/100starlings-...,https://twitter.com/100Starlings?utm_source=hi...,
10geeks-software-engineering,"10Geeks designs, develops, and analyzes tailor...","Fohren, Baden-Wurttemberg, Germany",1-10,Private,www.10geeks.com,651610,"Jan 1, 2012",Active,For Profit,info@10geeks.com,...,,,,,,,,https://www.linkedin.com/company/10geeks,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zonova,Zonova is an information technology and servic...,"Terrenoire, Rhone-Alpes, France",11-50,Private,zonova.io,853916,2017,Active,For Profit,contact@zonova.io,...,,,,,,,,https://www.facebook.com/ZetaOmegaNOVA/,https://www.linkedin.com/company/zonova/,
zoonect,"Zoonect offers web apps, mobile app, cloud pla...","Pistoia, Toscana, Italy",1-10,Private,www.zoonect.com,1741834,2015,Active,For Profit,office@zoonect.com,...,,,,,,,,https://www.facebook.com/zoonect/,https://www.linkedin.com/company/zoonect/,https://twitter.com/zoonect
zostera,Zostera specializes provides software and data...,"Aarlanderveen, Zuid-Holland, The Netherlands",1-10,Private,zostera.nl,1839410,,Active,For Profit,info@zostera.nl,...,,,,,,,,https://www.facebook.com/zostera.bv,https://www.linkedin.com/company/zostera/,https://twitter.com/zostera
zoznam-mobile,Zoznam Mobile offers services in the implement...,"Bratislava, Bratislava, Slovakia (Slovak Repub...",1-10,Private,zmb.sk/,1671707,"Jan 1, 2002",Active,For Profit,info@zoznammobile.sk,...,,,,,,,,,,


#### Clean websites list

In [5]:
websites_list = crunch_data["Website"].tolist()

In [6]:
websites_list

['www.01s.it/',
 'www.1000grad.de',
 'www.100pour100net.com//',
 'www.100starlings.com/',
 'www.10geeks.com',
 'www.121digitalmedia.eu/',
 '150sec.com/',
 'wollow-soft.com',
 '1minus1.com/',
 'www.1t-s.com',
 '21stwebb.co.uk',
 '23g.io',
 '247wms.com',
 '2advance.ch',
 'www.2develop.nl',
 'www.2m-a2i.fr/',
 'www.2open.it',
 'www.2see.nl',
 'www.2w.de',
 '33communication.com',
 'www.360d.be',
 'www.360telemetry.com/',
 'www.3asyr.com/',
 'www.3d3.nl/',
 'www.3d-dental.dk',
 '3dit.de; https//govie.de',
 '3ie.fr',
 'www.3m5.de',
 'www.3po.nl/',
 'www.3tiersystems.com',
 'www.3xw.ch',
 'www.40bis.nl',
 'www.4fx.co.uk/',
 'www.4homepages.de',
 '4kstudio.at',
 '4tpm.fr',
 '5w155.ch',
 '69pixl.com/',
 'www.7interactive.cz',
 'www.80si.com',
 'www.8balls.nl/',
 'www.8trust.com/',
 '8web.gr/',
 'www.960labs.com/',
 'www.999web.de',
 'www.99codelines.com',
 'a10sistemas.es',
 'a2colores.es',
 'www.aaltra.eu/',
 'aardenexperts.com/',
 'aardvark-creative.com',
 'www.aardvark.gr',
 'www.ab4d.com/',

In [7]:
# Remove trailing slashes from each website string.
# str.rstrip("/") strips every trailing slash, so URLs ending in "//" are handled in a single pass.
# Non-string entries (e.g. NaN for a missing website) are left untouched.
websites_list = [website.rstrip("/") if isinstance(website, str) else website for website in websites_list]
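The list above still contains entries that are not a single clean domain, such as `'3dit.de; https//govie.de'`. A minimal sketch of a stricter cleaning step follows, assuming that keeping only the first site of a `;`-separated cell is acceptable. The helper name `normalize_website` is hypothetical and not part of the original notebook.

```python
def normalize_website(value):
    """Return a cleaned website string, or None when the entry is unusable.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Missing cells arrive from pandas as float NaN, not as strings
    if not isinstance(value, str):
        return None
    # Some cells hold several sites separated by ';': keep the first one (assumption)
    first = value.split(";")[0].strip()
    # Drop any number of trailing slashes
    first = first.rstrip("/")
    # An empty result is treated the same as a missing website
    return first or None


examples = ["www.100pour100net.com//", "3dit.de; https//govie.de", float("nan"), "zmb.sk/"]
cleaned = [normalize_website(v) for v in examples]
print(cleaned)
```

Returning `None` for unusable entries keeps the list aligned with the rows of `crunch_data`, so positions can still be matched back to companies.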

In [8]:
websites_list

['www.01s.it',
 'www.1000grad.de',
 'www.100pour100net.com',
 'www.100starlings.com',
 'www.10geeks.com',
 'www.121digitalmedia.eu',
 '150sec.com',
 'wollow-soft.com',
 '1minus1.com',
 'www.1t-s.com',
 '21stwebb.co.uk',
 '23g.io',
 '247wms.com',
 '2advance.ch',
 'www.2develop.nl',
 'www.2m-a2i.fr',
 'www.2open.it',
 'www.2see.nl',
 'www.2w.de',
 '33communication.com',
 'www.360d.be',
 'www.360telemetry.com',
 'www.3asyr.com',
 'www.3d3.nl',
 'www.3d-dental.dk',
 '3dit.de; https//govie.de',
 '3ie.fr',
 'www.3m5.de',
 'www.3po.nl',
 'www.3tiersystems.com',
 'www.3xw.ch',
 'www.40bis.nl',
 'www.4fx.co.uk',
 'www.4homepages.de',
 '4kstudio.at',
 '4tpm.fr',
 '5w155.ch',
 '69pixl.com',
 'www.7interactive.cz',
 'www.80si.com',
 'www.8balls.nl',
 'www.8trust.com',
 '8web.gr',
 'www.960labs.com',
 'www.999web.de',
 'www.99codelines.com',
 'a10sistemas.es',
 'a2colores.es',
 'www.aaltra.eu',
 'aardenexperts.com',
 'aardvark-creative.com',
 'www.aardvark.gr',
 'www.ab4d.com',
 'www.ab-data.de',
 

---

### Step 3: scrape privacy policies

In [9]:
def get_privacy_policy_url(query, max_attempts=3):
    """Return the first search result whose URL mentions 'privacy' or 'policy', else ""."""
    print("Query: " + query)

    try:
        query_results_list = return_google_results(query, 3, 5)
    except Exception as e:
        # Network or parsing errors are logged and treated as "no policy found"
        print(str(e))
        return ""

    print("Considering " + str(len(query_results_list)) + " URL(s) ...")
    # Inspect at most max_attempts candidate URLs
    for url in query_results_list[:max_attempts]:
        print("Assessing privacy policy URL: " + url)
        if re.search('privacy|policy', url):
            print("Found relevant terms in URL! Successful break!")
            return url

    print("No results. Breaking ..")
    return ""
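The URL check in `get_privacy_policy_url` is case-sensitive, so a path such as `/Privacy-Policy` would be missed. The selection step can be sketched in isolation as follows, assuming case-insensitive matching is wanted. The names `pick_policy_url` and `POLICY_PATTERN` and the `example.com` URLs are illustrative assumptions, not taken from the notebook.

```python
import re

# Case-insensitive variant of the 'privacy' / 'policy' check (assumption)
POLICY_PATTERN = re.compile(r"privacy|policy", re.IGNORECASE)


def pick_policy_url(candidate_urls, max_candidates=3):
    """Return the first candidate URL that mentions a policy keyword, else ""."""
    # Only the first max_candidates results are inspected
    for url in candidate_urls[:max_candidates]:
        if POLICY_PATTERN.search(url):
            return url
    return ""


candidates = [
    "https://www.example.com/about",
    "https://www.example.com/Privacy-Policy",
    "https://www.example.com/contact",
]
print(pick_policy_url(candidates))
```

Separating the selection from the network call makes the matching rule testable without sending any search requests.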

In [10]:
def return_google_results(keywords, num_results, attempts):
    """Return result links for a Google query, retrying up to `attempts` times on HTTP 429."""
    user_agent_list = [
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    ]

    html_keywords = urllib.parse.quote_plus(keywords)
    sleep_init = 10  # default wait in seconds between retries

    url = "https://www.google.com/search?q=" + html_keywords + "&num=" + str(num_results)

    for attempt in range(attempts + 1):
        print("** Search query in URL: " + url)
        # Pick a random User-Agent for every request
        headers = {'User-Agent': random.choice(user_agent_list)}
        html = requests.get(url, headers=headers)

        if html.status_code != 429:
            break
        if attempt == attempts:
            sys.exit("Too many request 429, attempted " + str(attempts) + " times, break ...")

        # Honor a numeric Retry-After header when present, otherwise wait sleep_init seconds
        retry_after = html.headers.get('Retry-After', '')
        wait = int(retry_after) if retry_after.isdigit() else sleep_init
        print("Too many requests (attempt " + str(attempt) + "), we will attempt again in " + str(wait) + " seconds")
        time.sleep(wait)

    soup = BeautifulSoup(html.text, 'html.parser')

    # Result containers, assumed to be <div class="g"> elements
    allData = soup.find_all("div", {"class": "g"})

    link_list = []
    print("len alldata: " + str(len(allData)))

    for result in allData:
        # A result container without an <a> tag is skipped instead of raising
        anchor = result.find('a')
        link = anchor.get('href') if anchor is not None else None

        # Keep absolute links that contain 'https' and are not ad-click redirects
        if link is not None and link.startswith('http') and 'https' in link and 'aclk' not in link:
            print(link)
            link_list.append(link)
    print(link_list)
    return link_list
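`return_google_results` waits a fixed number of seconds after an HTTP 429 response. A common alternative is to honor the standard `Retry-After` header when it carries a number of seconds, and otherwise to back off exponentially. The sketch below isolates that decision. The function name `seconds_to_wait`, the doubling factor, and the 300-second cap are assumptions chosen for illustration.

```python
def seconds_to_wait(headers, attempt, base_delay=10, max_delay=300):
    """Decide how long to sleep before retrying after an HTTP 429 response.

    headers: mapping of response headers (e.g. requests' response.headers)
    attempt: zero-based index of the attempt that just failed
    """
    retry_after = headers.get("Retry-After", "")
    # Retry-After may be a number of seconds or an HTTP date; only the numeric form is used here
    if retry_after.strip().isdigit():
        return min(int(retry_after), max_delay)
    # Fallback: exponential backoff (10, 20, 40, ... seconds), capped at max_delay
    return min(base_delay * 2 ** attempt, max_delay)


print(seconds_to_wait({"Retry-After": "30"}, attempt=0))
print(seconds_to_wait({}, attempt=2))
```

Growing the delay between attempts gives a rate limiter more time to reset than repeating the same 10-second pause.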

#### Collect privacy policy URLs

In [11]:
privacy_policies_url_list = []

In [12]:
# loop through each company URL and attempt to find the URL of the privacy policy
for i, url_company in enumerate(websites_list[:200]):
    print(i)
    # Non-string entries (e.g. NaN for a missing website) cannot be queried
    if not isinstance(url_company, str):
        privacy_policies_url_list.append("")
    else:
        query = "site:\"" + url_company + " \"privacy policy"
        privacy_policies_url_list.append(get_privacy_policy_url(query))
    print()
    # Pause between companies to limit the request rate
    time.sleep(10)

0
Query: site:"www.01s.it "privacy policy
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3
Too many requests (attempt 0), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3
Too many requests (attempt 1), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3
Too many requests (attempt 2), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3
Too many requests (attempt 3), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3
Too many requests (attempt 4), we will attempt again in 10 seconds
** Search query in URL: https://www.google.com/search?q=site%3A%22www.01s.it+%22privacy+policy&num=3


SystemExit: Too many request 429, attempted 5 times, break ...

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
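The log above shows the query `site:"www.01s.it "privacy policy`, in which the quotation marks enclose neither the domain nor the phrase. A minimal sketch of a query builder that leaves the domain unquoted and quotes only the phrase is given below. The helper name `build_policy_query` is hypothetical, and whether this form changes the search results has not been verified here.

```python
from urllib.parse import quote_plus


def build_policy_query(domain, phrase="privacy policy"):
    """Build a site-restricted search query with an exact-phrase match.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # The site: operator takes the bare domain; only the phrase is quoted
    return 'site:{} "{}"'.format(domain.strip(), phrase)


query = build_policy_query("www.01s.it")
print(query)              # site:www.01s.it "privacy policy"
print(quote_plus(query))  # site%3Awww.01s.it+%22privacy+policy%22
```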
