# Assessing GDPR-Compliance in Web Applications: A Machine Learning Approach

We assess the GDPR compliance of web applications based on their privacy policies. A classification model, trained on a corpus of 18,397 natural-language sentences, determines for each privacy policy whether five core privacy policy requirements of the General Data Protection Regulation (GDPR) are communicated.

__Relevance:__ The GDPR applies to the processing of personal data of individuals in the EU. We aim to assess the state of GDPR compliance in application software based on its privacy policies.

__Focus:__ web applications, because the web application paradigm is widely used thanks to the omnipresence of web browsers across PCs and mobile devices. In particular, we focus on organisations that provide cloud-based solutions: Cloud Computing, Cloud Data Services, Cloud Infrastructure, Cloud Management, and Cloud Storage.


__Goal:__ to scrutinize the privacy policies of web applications using ML, to assess whether core privacy policy requirements are communicated.

#### __RQ:__ What is the state of GDPR-compliance disclosure in web applications?

---

### Step 1: collect list of companies active in the Web Apps industry

To do so, we use the Crunchbase database, which lets us identify companies that provide web-based services and filter them by location (which in our case will be the European Union).

We've imported 2792 companies using the following criteria:
- Industry: Web Services -> Cloud Computing, Cloud Data Services, Cloud Infrastructure, Cloud Management, and Cloud Storage
- Location: USA, India, EU
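
The selection criteria above can be sketched as a simple record filter. This is only an illustration: the notebook applies these filters in Crunchbase itself, `matches_criteria` is a hypothetical helper, the sample records are made up, and the set of allowed countries is a small illustrative subset. The keys `Industries` and `Headquarters Location` follow the column names of the Crunchbase export, and the industry set assumes "Cloud Management" is a single category.

```python
# Assumption: the five cloud categories named in the criteria above.
TARGET_INDUSTRIES = {
    "Cloud Computing", "Cloud Data Services", "Cloud Infrastructure",
    "Cloud Management", "Cloud Storage",
}

def matches_criteria(record, allowed_countries):
    """Keep a record that lists a target industry and has an allowed HQ country."""
    industries = {name.strip() for name in record["Industries"].split(",")}
    # the country is the last element of "City, Region, Country"
    country = record["Headquarters Location"].split(", ")[-1]
    return bool(industries & TARGET_INDUSTRIES) and country in allowed_countries

# Made-up records and an illustrative (incomplete) country set.
records = [
    {"Industries": "Cloud Storage, Software",
     "Headquarters Location": "Berlin, Berlin, Germany"},
    {"Industries": "Accounting, CRM",
     "Headquarters Location": "Delft, Zuid-Holland, The Netherlands"},
    {"Industries": "Cloud Computing, Consulting",
     "Headquarters Location": "Lima, Lima, Peru"},
]
allowed = {"Germany", "The Netherlands", "United States", "India"}
selected = [r for r in records if matches_criteria(r, allowed)]
print(len(selected))
```

Only the first record passes both checks: the second lists no cloud category and the third is outside the allowed countries.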

---

In [1]:
import os
from newspaper import Article
from bs4 import BeautifulSoup
from six.moves.urllib.parse import urlparse
import urllib.parse
import sys
import time
import nltk
import glob
import pandas as pd
import requests
import spacy
import random
# from googlesearch import search
from langdetect import detect
import re
import pickle
import math
import numpy as np
import collections
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from tabulate import tabulate
from IPython.display import display, HTML

### Step 2: read data

In [2]:
path = r'C:\Users\aaberkan\OneDrive - UGent\Scripts\GDPR-Compliance in Web Applications\data\Crunchbase\Cloud'
filenames = glob.glob(path + "/*.csv")

In [3]:
# len = 30
len(filenames)

30

In [4]:
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

In [5]:
crunch_data = pd.concat(dfs, ignore_index=True)

In [6]:
crunch_data

Unnamed: 0,Organization Name,Organization Name URL,Full Description,Industries,Website,Headquarters Location,Description,CB Rank (Company),SEMrush - Monthly Visits,SEMrush - Average Visits (6 months),...,Founded Date,Founded Date Precision,Number of Founders,Number of Employees,Founders,Apptopia - Number of Apps,Apptopia - Downloads Last 30 Days,Aberdeen - IT Spend,Aberdeen - IT Spend Currency,Aberdeen - IT Spend Currency (in USD)
0,Ex Libris,https://www.crunchbase.com/organization/ex-libris,Ex Libris Group is a leading provider of libra...,"Apps, Cloud Computing, Enterprise Software, In...",http://www.exlibrisgroup.com,"Ballerup, Hovedstaden, Denmark",Ex Libris Group is a leading provider of libra...,165720,13961244,14073988.5,...,1986-01-01,year,2.0,501-1000,"Azriel Morag, Evgeniy Larionov",19.0,3443,,,
1,Exact,https://www.crunchbase.com/organization/exact,Exact is a global supplier of cloud business s...,"Accounting, Cloud Computing, CRM, Enterprise R...",http://www.exact.com,"Delft, Zuid-Holland, The Netherlands",Exact is a company that provides cloud based b...,229780,622387,672910.5,...,1984-07-23,day,3.0,1001-5000,"Arco Van Nieuwland, Eduard Hagens, Irfan Verdia",32.0,11294,,,
2,Xperience,https://www.crunchbase.com/organization/xperie...,Xperience provides software solutions within E...,"Cloud Computing, Consulting, CRM, Enterprise R...",https://www.xperience-group.com/,"Antrim, Antrim, United Kingdom",Xperience provides software solutions within E...,187002,402949,198967.83,...,1969-01-01,year,,101-250,,,,6256711.0,USD,6256711.0
3,Sage,https://www.crunchbase.com/organization/sage-c1e4,Sage DPW-Software is now implementing over 100...,"Cloud Computing, Computer, Private Cloud, Secu...",https://www.sagedpw.at/,"Vienna, Wien, Austria",Sage provides software for HR services.,362734,319471,219712.33,...,1972-01-01,year,,101-250,,,,,,
4,Novahé,https://www.crunchbase.com/organization/novahé,,"Cloud Computing, Information Technology, IT In...",https://www.novahe.fr/,"Orléans, Centre, France",Novahé is an information technology company sp...,930723,288214,,...,1986-06-30,day,,51-100,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21235,BackupAddict,https://www.crunchbase.com/organization/backup...,,"Cloud Storage, Information Technology, Profess...",https://www.backupaddict.com/,"Jersey Shore, Pennsylvania, United States",BackupAddict renders online backup plans to pr...,2145360,,,...,,,,11-50,,,,,,
21236,"NetConvergence, Inc.",https://www.crunchbase.com/organization/netcon...,,"Cloud Storage, Enterprise Software, Informatio...",http://www.netconvergence.com,"Santa Clara, California, United States","NetConvergence, Inc. is an intellectual proper...",2215033,,,...,,,,,,,,,,
21237,MyLabBook.com,https://www.crunchbase.com/organization/mylabb...,,"Cloud Data Services, Cloud Storage, Software",https://www.mylabbook.com,"Texas, South Carolina, United States",MyLabBook.com is a company that facilitates op...,2222155,,,...,1994-01-01,year,,1-10,,,,,,
21238,Radmedix,https://www.crunchbase.com/organization/radmedix,,"Cloud Storage, Manufacturing, Medical Device, ...",https://radmedix.com,"Dayton, Ohio, United States",Radmedix is an x-ray manufacturer that provide...,,,,...,,,,11-50,,,,,,


In [7]:
# remove duplicates
crunch_data.drop_duplicates(inplace=True)

In [8]:
crunch_data = crunch_data[((10267 + 1) + 4244):].copy(deep=True)

In [9]:
crunch_data

Unnamed: 0,Organization Name,Organization Name URL,Full Description,Industries,Website,Headquarters Location,Description,CB Rank (Company),SEMrush - Monthly Visits,SEMrush - Average Visits (6 months),...,Founded Date,Founded Date Precision,Number of Founders,Number of Employees,Founders,Apptopia - Number of Apps,Apptopia - Downloads Last 30 Days,Aberdeen - IT Spend,Aberdeen - IT Spend Currency,Aberdeen - IT Spend Currency (in USD)
16663,Polyaxon,https://www.crunchbase.com/organization/polyaxon,,"Artificial Intelligence, Cloud Infrastructure,...",https://polyaxon.com,"Berlin, Berlin, Germany",An open source platform for reproducible machi...,208230,2642,,...,2018-10-01,month,,,,,,,,
16664,Mainstream,https://www.crunchbase.com/organization/mainst...,,"Cloud Infrastructure, Information Technology, ...",https://www.mainstream.rs,"Belgrade, Vojvodina, Serbia","Mainstream provides advanced cloud, managed ho...",439301,2639,1560.17,...,2005-01-01,year,,51-100,,,,,,
16666,Replex,https://www.crunchbase.com/organization/replex...,Replex is the first governance and cost manage...,"Cloud Infrastructure, Cloud Management, Inform...",https://www.replex.io,"Leipzig, Sachsen, Germany",Replex provides software solutions.,26882,2439,,...,2016-02-01,month,5.0,11-50,"Christian Falk, Costantino Lattarulo, Dennis J...",5.0,,,,
16667,Power DCloud,https://www.crunchbase.com/organization/power-...,"Power Ecosystem is a web3 company, developer o...","Blockchain, Cloud Infrastructure, Cloud Storag...",https://thepower.io,"Tallinn, Harjumaa, Estonia",DCloud is a nextgen web3 all-in-one decentrali...,142924,2227,,...,2017-06-01,month,3.0,11-50,"Dmitry Burov, Igor Belousov, Max Mikhailenko",,,,,
16669,Red Kubes,https://www.crunchbase.com/organization/red-kubes,"For all that Kubernetes can do, it still requi...","Cloud Infrastructure, Software",https://redkubes.com/,"Utrecht, Utrecht, The Netherlands",Red Kubes is a Dutch start-up founded in 2019 ...,74886,2174,1725.67,...,2019-07-18,day,2.0,11-50,"Maurice Faber, Sander Rodenhuis",,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21234,Affordable Cloud Hosting,https://www.crunchbase.com/organization/afford...,,"Cloud Storage, Information Technology, Web Hos...",http://www.affordablecloudhosting.com,"Tampa, Florida, United States",Affordable Cloud Hosting is a company that pro...,2116766,,,...,1993-01-01,year,,,,,,269653.0,USD,269653.0
21235,BackupAddict,https://www.crunchbase.com/organization/backup...,,"Cloud Storage, Information Technology, Profess...",https://www.backupaddict.com/,"Jersey Shore, Pennsylvania, United States",BackupAddict renders online backup plans to pr...,2145360,,,...,,,,11-50,,,,,,
21236,"NetConvergence, Inc.",https://www.crunchbase.com/organization/netcon...,,"Cloud Storage, Enterprise Software, Informatio...",http://www.netconvergence.com,"Santa Clara, California, United States","NetConvergence, Inc. is an intellectual proper...",2215033,,,...,,,,,,,,,,
21238,Radmedix,https://www.crunchbase.com/organization/radmedix,,"Cloud Storage, Manufacturing, Medical Device, ...",https://radmedix.com,"Dayton, Ohio, United States",Radmedix is an x-ray manufacturer that provide...,,,,...,,,,11-50,,,,,,


In [10]:
crunch_data.to_csv("crunch_cloud_conc_p3.csv", sep='\t', header=True, index=False)

#### Clean websites list

In [11]:
websites_list = crunch_data["Website"].tolist()

In [13]:
len(websites_list)

3371

In [None]:
# remove trailing slashes from the website URLs; rstrip("/") also handles URLs ending in "//"
websites_list = [website.rstrip("/") if isinstance(website, str) else website for website in websites_list]

In [None]:
len(websites_list)

---

### Step 3: find privacy policy URLs

In [None]:
def get_privacy_policy_url(query):
    """Return the first search result whose URL looks like a privacy policy, else ""."""
    max_attempts = 3
    print("Query: " + query)

    try:
        query_results_list = return_google_results(query, 3, 5)
        print("Considering " + str(len(query_results_list)) + " URL(s) ...")
        # only the first `max_attempts` results are assessed
        for url in query_results_list[:max_attempts]:
            print("Assessing privacy policy URL: " + url)

            if re.search(r'privacy|policy|gdpr|terms|legal', url):
                print("Found relevant terms in URL! Successful break!")
                return url

        print("No results. Breaking ..")
    except Exception as e:
        print(str(e))
    return ""

In [None]:
def return_google_results(keywords, num_results, attempts):
    user_agent_list = [
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    ]

    html_keywords = urllib.parse.quote_plus(keywords)
    sleep_init = 10

    url = "https://www.google.com/search?q=" + html_keywords + "&num=" + str(num_results)
    print("** Search query in URL: " + url)

    # pick a random user agent for each request
    headers = {'User-Agent': random.choice(user_agent_list)}

    html = requests.get(url, headers=headers)

    if html.status_code == 429:
        if attempts == 0:
            sys.exit("Too many requests (429), attempted " + str(5) + " times, break ...")
        # use the Retry-After header when it gives a number of seconds, else the default wait
        retry_after = html.headers.get('Retry-After', '')
        wait = int(retry_after) if retry_after.isdigit() else sleep_init
        print("Too many requests (attempt " + str(5 - attempts) + "), we will attempt again in " + str(wait) + " seconds")
        time.sleep(wait)
        # return the result of the retry instead of parsing the 429 page
        return return_google_results(keywords, num_results, (attempts - 1))

    soup = BeautifulSoup(html.text, 'html.parser')

    # result blocks are assumed to be <div class="g"> elements
    allData = soup.find_all("div", {"class": "g"})

    link_list = []
    print("len alldata: " + str(len(allData)))

    for result in allData:
        anchor = result.find('a')
        link = anchor.get('href') if anchor is not None else None

        # keep absolute links that contain https and skip ad-click (aclk) redirects
        if link is not None and link.startswith('http') and 'https' in link and 'aclk' not in link:
            print(link)
            link_list.append(link)
    print(link_list)
    return link_list
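
The retry logic above waits a fixed 10 seconds after an HTTP 429 response. A common alternative, shown here only as a sketch and not as what the notebook does, is exponential backoff that prefers the server's `Retry-After` value when it is a plain number of seconds. The helper name `retry_wait` and the `base` and `cap` defaults are assumptions; a `Retry-After` header given as an HTTP date is ignored and falls back to the backoff.

```python
def retry_wait(headers, attempt, base=10, cap=300):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: uses Retry-After when it is an integer number of
    seconds, otherwise doubles the wait on every attempt, up to `cap`.
    """
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return int(retry_after)
    return min(base * 2 ** attempt, cap)

print(retry_wait({"Retry-After": "30"}, 0))
print(retry_wait({}, 2))
```

With the defaults, successive retries without a `Retry-After` header wait 10, 20, 40 seconds and so on, never more than 300.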

#### Collect privacy policy URLs

In [None]:
privacy_policies_url_list = []

In [None]:
# loop through each company URL and attempt to find the URL of the privacy policy
count_urls = 0
for i, url_company in enumerate(websites_list):    
    print(i)

#     print(len(privacy_policies_url_list))
    # skip missing websites (NaN is not a string and is never equal to itself)
    if(isinstance(url_company, str) is False or (url_company == url_company) is False):
        privacy_policies_url_list.append("")
    else:
        query = "site:" + url_company + " \"privacy policy\""
        privacy_policies_url_list.append(get_privacy_policy_url(query))
        if(len(privacy_policies_url_list[-1]) > 0):
            count_urls = count_urls + 1
    print("URL count: " + str(count_urls))
    print()
    time.sleep(55)
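
The loop above builds one `site:` query per company website. As a hedged sketch of that step, the hypothetical helper `build_site_query` below restricts the search to the host name only and quotes the search phrase. Stripping the scheme with `urlparse` is an assumption of this sketch; the notebook passes the full website URL.

```python
from urllib.parse import urlparse

def build_site_query(website, phrase="privacy policy"):
    """Build a search query limited to the website's host (hypothetical helper)."""
    parsed = urlparse(website)
    # URLs without a scheme end up in `path` instead of `netloc`
    host = (parsed.netloc or parsed.path).rstrip("/")
    return 'site:' + host + ' "' + phrase + '"'

print(build_site_query("https://www.xperience-group.com/"))
```

The same helper also accepts bare domains such as `exact.com`, which have no scheme to strip.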

In [None]:
privacy_policies_url_list[-2]

In [None]:
len(privacy_policies_url_list)

In [None]:
(len([(collected_url) for collected_url in privacy_policies_url_list if collected_url != ""]))

In [None]:
len(crunch_data[0:7243])

In [None]:
crunch_data_survived = crunch_data[0:7243].copy(deep=True)

In [None]:
crunch_data_survived['PP URL'] = privacy_policies_url_list

In [None]:
#save data
crunch_data_survived.to_csv("crunch_data_cloud_surv.csv", sep='\t', header=True, index=True, index_label="Name")

In [None]:
crunch_data_r = pd.read_csv("crunch_data_cloud_surv.csv", sep='\t', encoding='utf-8')

In [None]:
crunch_data_r

# Step 3: Scrape privacy policies

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
def scrape_policies_google(url):
    sentences = []
    try:
        article = Article(url)
        article.download()  # download the page's HTML content
        article.parse()  # extract the main text from the HTML
        doc = nlp(article.text)

        # a policy is kept only if it is in English and has more than 10 sentences
        is_english = detect(article.text) == 'en'
        is_long_enough = len(list(doc.sents)) > 10
        print("PP language = EN?: " + str(is_english))
        print("PP length > 10 sentences?: " + str(is_long_enough))

        if is_english and is_long_enough:
            print("Policy meets requirements of language and length ... ")
            sentences = list(doc.sents)
            print("Scraping successful!")
        else:
            print("Scraping not successful")
    except Exception as e:
        # skip pages that cannot be downloaded, parsed or language-detected
        print("Scraping failed: " + str(e))
    print()
    return sentences

In [None]:
pp_list_sentences = []
for i, pp_url in enumerate(privacy_policies_url_list):
    print(i)
    if pp_url == "":
        pp_list_sentences.append("")
    else:
        pp_list_sentences.append(scrape_policies_google(pp_url))

In [None]:
for pp in pp_list_sentences:
    print(len(pp))

# Step 4: Classification

In [None]:
crunch_data_r

In [None]:
GDPR_classes = ['DPO', 'Purpose', 'Acquired data', 'Data sharing', 'Rights']

In [None]:
thresholds = [0.014130434782608696, 0.035326086956521736, 0.017934782608695653, 0.03369565217391304, 0.009782608695652175]

#### Preprocessing

In [None]:
def preprocessing(pps):
    # tokenize each sentence on whitespace
    tokenized_sent = [sent.text.split() for sent in pps]

    # remove punctuation
    tokenized_sent = [[re.sub(r'[,’\'.!?&“”():*_;"]', '', y) for y in x] for x in tokenized_sent]

    # remove words with numbers in them
    tokenized_sent = [[y for y in x if not any(c.isdigit() for c in y)] for x in tokenized_sent]

    # stopword removal is currently disabled
#     tokenized_sent_clean = [[y for y in x if y not in stopwords.words('english')] for x in tokenized_sent]
    tokenized_sent_clean = tokenized_sent

    # stem each token with the Porter stemmer
    porter = PorterStemmer()
    tokenized_sent_clean = [[porter.stem(y) for y in x] for x in tokenized_sent_clean]

#     lemmatizer = WordNetLemmatizer()
#     tokenized_sent_clean = [[lemmatizer.lemmatize(y) for y in x] for x in tokenized_sent_clean]

    # join the tokens back into one string per sentence
    detokenized_pps = [' '.join(tokens) for tokens in tokenized_sent_clean]

    return detokenized_pps
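
To make the cleaning steps above concrete, here is a minimal standard-library sketch on a single made-up sentence. It reuses the same punctuation pattern and digit filter but leaves out the Porter stemming step, and `clean_sentence` is a hypothetical name. Unlike the function above, it also drops tokens that become empty after punctuation removal, which is an assumption of this sketch.

```python
import re

# same character class as the punctuation filter above
PUNCTUATION = r'[,’\'.!?&“”():*_;"]'

def clean_sentence(sentence):
    """Strip punctuation and digit-bearing tokens from one sentence (no stemming)."""
    tokens = sentence.split()
    tokens = [re.sub(PUNCTUATION, '', t) for t in tokens]
    # drop tokens that contain a digit, such as "3rd-party"
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # drop tokens that were pure punctuation, such as "&"
    tokens = [t for t in tokens if t]
    return ' '.join(tokens)

print(clean_sentence("We collect your e-mail, IP address & 3rd-party data!"))
# prints: We collect your e-mail IP address data
```

Hyphens are not part of the punctuation pattern, so "e-mail" survives intact.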

In [None]:
def set_GDPR_columns(df):
    df['DPO'] = 0
    df['Purpose'] = 0
    df['Acquired data'] = 0
    df['Data sharing']  = 0
    df['Rights'] = 0

In [None]:
set_GDPR_columns(crunch_data_r)

In [None]:
pp_list_sentences_prep = []

for j, pp in enumerate(pp_list_sentences):
    pp_list_sentences_prep.append(preprocessing(pp))

In [None]:
pp_list_sentences_prep

In [None]:
crunch_data_r['PP text'] = pp_list_sentences_prep

In [None]:
crunch_data_r

In [None]:
crunch_data_r.to_csv("crunch_data_pp_url_text.csv", sep='\t', header=True)

#### Classification

In [None]:
crunch_data_r = pd.read_csv("crunch_data_pp_url_text.csv", sep='\t', encoding='utf-8', index_col = 0)

In [None]:
crunch_data_r

In [None]:
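# copy the selection so that labels can be assigned without modifying a view of crunch_data_r
crunch_data_r_selected = crunch_data_r.loc[crunch_data_r['PP text'] != "[]"].copy()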

In [None]:
crunch_data_r_selected

In [None]:
for index, row in crunch_data_r_selected.iterrows():
    # 'PP text' was stored as the string form of a list, so split it back into sentences
    pp_text_split = row["PP text"].split(', ')

    for j, category in enumerate(GDPR_classes):
        # load the fitted vectorizer and classifier for this GDPR class
        filen = "linreg-oversampling-" + category + ".pkl"
        with open(filen, 'rb') as file:
            vectorizer, lr = pickle.load(file)

        x = vectorizer.transform(pp_text_split)
        y_pred = lr.predict(x)
        n_pos_pred = list(y_pred).count(1)

        # mark the label as positive (1) when the share of positively classified
        # sentences reaches the class threshold; the default state is negative (0)
        if (n_pos_pred/len(pp_text_split)) >= thresholds[j]:
            crunch_data_r_selected.at[index, category] = 1
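
The loop above turns sentence-level predictions into one policy-level label per GDPR class: a policy is positive when the share of positively classified sentences reaches the class threshold. A minimal sketch of that rule follows. The helper name `policy_label` and the 100-sentence example are illustrative; the two thresholds are the 'Rights' and 'Purpose' values from the `thresholds` list.

```python
def policy_label(sentence_predictions, threshold):
    """Return 1 if the share of positive sentence predictions reaches the threshold."""
    if not sentence_predictions:
        return 0  # an empty policy stays in the default negative state
    positives = sum(1 for p in sentence_predictions if p == 1)
    return 1 if positives / len(sentence_predictions) >= threshold else 0

# illustrative policy: 1 positive sentence out of 100 (share = 0.01)
predictions = [1] + [0] * 99
print(policy_label(predictions, 0.009782608695652175))  # 'Rights' threshold
print(policy_label(predictions, 0.035326086956521736))  # 'Purpose' threshold
```

With a share of 0.01 the policy is positive under the 'Rights' threshold but negative under the stricter 'Purpose' threshold.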

In [None]:
crunch_data_r_selected

# Classification Analysis (425 privacy policies)

In [None]:
for idx, GDPR_class in enumerate(GDPR_classes):
    # labels: 1 = positively classified, 0 = negatively classified
    counts = crunch_data_r_selected[GDPR_class].value_counts()
    n_total = crunch_data_r_selected.shape[0]
    n_pos = counts.get(1, 0)
    n_neg = counts.get(0, 0)

    print(GDPR_class)
    print("Positively classified:" + str(n_pos) + " (" + str((n_pos/n_total)*100) + "%)")
    print("Negatively classified:" + str(n_neg) + " (" + str((n_neg/n_total)*100) + "%)")
    classification_analysis = pd.DataFrame(
        [[GDPR_class, n_total, n_pos, n_neg]],
        columns=['GDPR Class', '# companies', 'Positive', 'Negative'])

    display(HTML(classification_analysis.to_html(index=False)))
    print()
    print()

# Statistical Analysis

### Select potentially interesting predictors

- Employees (object)
- Type (object)
- Founded Date (object)
- Location
- Operating Status (object)
- Industry 1 (object)

In [None]:
pd_stats = crunch_data_r_selected[["Employees", "Founded Date", "Location", "Industry 1", "DPO", "Purpose", "Acquired data", "Data sharing", "Rights"]].copy(deep=True)

In [None]:
pd_stats.info()

##### Employees

In [None]:
pd_stats["Employees"].value_counts()

##### Founded Date

In [None]:
pd_stats["Founded Date"].value_counts()

Convert to year

In [None]:
f_date = pd_stats["Founded Date"].tolist()

In [None]:
# keep the four-digit year; missing dates (NaN) are not strings and stay NaN
f_date_clean = [re.findall(r'(\d{4})', date)[0] if isinstance(date, str) else np.nan for date in f_date]

In [None]:
len(f_date)

In [None]:
(f_date_clean)

In [None]:
pd_stats["Founded Year"] = f_date_clean

In [None]:
pd_stats["Founded Year"].value_counts()
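
The year conversion above can be illustrated with a small standard-library sketch. `extract_year` is a hypothetical helper; it returns `None` for missing values or strings without a four-digit run, whereas the notebook keeps `np.nan` for missing dates.

```python
import re

def extract_year(founded_date):
    """Return the first four-digit run in a date string, or None if unavailable."""
    if not isinstance(founded_date, str):
        return None  # e.g. NaN for companies without a founding date
    match = re.search(r'\d{4}', founded_date)
    return match.group(0) if match else None

print(extract_year("1986-01-01"))
print(extract_year(float("nan")))
```

The year is returned as a string, matching the notebook, where `Founded Year` is later cast to a category.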

##### Location          

In [None]:
pd_stats["Location"].value_counts()

In [None]:
location = pd_stats["Location"].to_list()

In [None]:
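# the country is the last element of "City, Region, Country"; missing locations stay NaN
country = [loc.split(", ")[-1] if isinstance(loc, str) else np.nan for loc in location]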

In [None]:
len(country)

In [None]:
pd_stats["Country"] = country

In [None]:
pd_stats["Country"].value_counts()

##### Industry 1

In [None]:
pd_stats["Industry 1"].value_counts()

### Drop old columns

In [None]:
pd_stats.drop(['Founded Date', 'Location'], axis=1, inplace=True)

In [None]:
pd_stats.info()

### Cast to category

In [None]:
# Define the lambda function: label_categorical
label_categorical = lambda x: x.astype('category')

In [None]:
pd_stats = pd_stats.apply(label_categorical, axis=0)

In [None]:
pd_stats.dtypes

### LR with Statsmodels

In [None]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor

# scaling
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', None) 
pd.set_option('display.max_colwidth', None)

In [None]:
pd_stats.isna().sum()

In [None]:
GDPR_classes

#### Explore correlations

In [None]:
X = pd_stats.drop(GDPR_classes,axis=1) # independent features
X = pd.get_dummies(X, drop_first = True)
sns.clustermap(X.corr())

#### Split data

In [None]:
train, test = train_test_split(pd_stats, test_size=0.2, random_state=42)
X_train = train.drop(GDPR_classes,axis=1) # independent features

#### Encode non-numerical categorical data, and drop first to avoid collinearity

In [None]:
X_train = pd.get_dummies(X_train, drop_first = True)
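
Why dropping the first level avoids collinearity can be seen in a small pure-Python sketch. It mimics the idea behind `pd.get_dummies(..., drop_first=True)` without claiming to reproduce pandas' exact behaviour. `one_hot` is a hypothetical helper and the employee-size labels are taken from the data shown earlier.

```python
def one_hot(values, drop_first=False):
    """Encode category labels as rows of 0/1 indicators (illustrative helper)."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped level becomes the baseline
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

sizes = ["1-10", "11-50", "51-100", "11-50"]

full_cols, full = one_hot(sizes)
reduced_cols, reduced = one_hot(sizes, drop_first=True)

# With every level kept, each row sums to 1, duplicating an intercept column.
print([sum(row) for row in full])
print(reduced_cols, reduced)
```

After dropping the first level, a row of zeros identifies the baseline category, so the indicators no longer add up to the constant term.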

# Parameter Optimization

#### First without parameter optimization

In [None]:
train, test = train_test_split(pd_stats, test_size=0.25, random_state=25)
sel_alpha_list = dict()
acc_last = 0

In [None]:
y_train = train[GDPR_classes[0]] # dependent variable
y_test = test[GDPR_classes[0]] # dependent variable

In [None]:
# independent features
X_train = train.drop(GDPR_classes, axis=1) 
# encode non-numerical categorical data, and drop first to avoid collinearity
X_train = pd.get_dummies(X_train, drop_first = True)

X_test = test.drop(GDPR_classes, axis=1) # independent features
X_test = pd.get_dummies(X_test, drop_first = True)

X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

In [None]:
X_train

In [None]:
model = sm.Logit(y_train,X_train)
logit_model = model.fit()

In [None]:
pred_train = logit_model.predict(X_train)>=.5
pred_test = logit_model.predict(X_test)>=.5

In [None]:
acc_train = (y_train==pred_train).mean()
acc_test = (y_test==pred_test).mean()

print("Acc: ", acc_test)
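
The accuracy computation above thresholds the predicted probabilities at 0.5 and takes the share of matching labels. A minimal sketch with made-up probabilities and labels follows; `accuracy` is a hypothetical helper.

```python
def accuracy(probabilities, labels, cutoff=0.5):
    """Share of samples whose thresholded prediction equals the true label."""
    predictions = [int(p >= cutoff) for p in probabilities]
    hits = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return hits / len(labels)

# made-up example: the third sample is predicted positive but is negative
print(accuracy([0.9, 0.4, 0.6, 0.2], [1, 0, 0, 0]))
# prints: 0.75
```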

In [None]:
alpha_list = list(np.arange(0.001, 10, 0.1))

##### Optimize parameters

In [None]:
opt_alpha = optimize_logit(pd_stats, True, alpha_list, True)

In [None]:
X_train

In [None]:
y_train

In [None]:
def optimize_logit(pd_stats, reg, alpha_range, intercept_set):
    train, test = train_test_split(pd_stats, test_size=0.2, random_state=25)
    sel_alpha_list = dict()

    # independent features: encode non-numerical categorical data,
    # and drop first to avoid collinearity
    X_train = pd.get_dummies(train.drop(GDPR_classes, axis=1), drop_first = True)
    X_test = pd.get_dummies(test.drop(GDPR_classes, axis=1), drop_first = True)

    if(intercept_set):
        X_train = sm.add_constant(X_train)
        X_test = sm.add_constant(X_test)

    for GDPR_cat in GDPR_classes:
        print("***************** NEW ROUND!")
        print("GDPR-category: " + GDPR_cat)
        alpha_sel = alpha_range[0]
        acc_last = 0

        y_train = train[GDPR_cat] # dependent variable
        y_test = test[GDPR_cat] # dependent variable

        for alpha_op in alpha_range:
            model = sm.Logit(y_train,X_train)

            if(reg):
                logit_model = model.fit_regularized(method = 'l1', trim_mode = 'size', alpha = alpha_op)
            else:
                logit_model = model.fit()

            pred_test = logit_model.predict(X_test)>=.5
            acc_test = (y_test==pred_test).mean()

            print("Acc: ", acc_test)
            print("Alpha: ", alpha_op)

            # keep the alpha with the highest test accuracy; ties go to the later alpha
            if(acc_test >= acc_last):
                print("Alpha selected!")
                alpha_sel = alpha_op
                acc_last = acc_test

            print()

        # store the selected alpha and its accuracy for this GDPR class
        sel_alpha_list[GDPR_cat] = [alpha_sel, acc_last]

    return sel_alpha_list
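
The selection rule inside the grid search above can be isolated in a few lines: walk through the candidate alphas in order and keep the one with the highest test accuracy, letting later candidates win ties because the comparison is `>=`. This is a sketch with made-up accuracies, and `select_alpha` is a hypothetical helper.

```python
def select_alpha(results):
    """Pick the (alpha, accuracy) pair with the best accuracy; later alphas win ties."""
    alpha_sel, acc_last = results[0][0], 0
    for alpha, acc in results:
        if acc >= acc_last:
            alpha_sel, acc_last = alpha, acc
    return alpha_sel, acc_last

# made-up (alpha, test accuracy) pairs
results = [(0.001, 0.70), (0.101, 0.75), (0.201, 0.75), (0.301, 0.60)]
print(select_alpha(results))
# prints: (0.201, 0.75)
```

Because ties go to the later candidate, the rule favours the stronger L1 penalty among equally accurate models when the alphas are listed in ascending order.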