### Data Prep

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%pip install transformers pandas openpyxl

In [None]:
%pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [None]:
import pandas as pd
import os
from rouge import Rouge

In [None]:
# Data to summarize

directory_path = '/content/drive/MyDrive/GDPR/GDPR/Data/GDPR Key issues'
file_names = os.listdir(directory_path)  # order is arbitrary; later cells index into this list
print(file_names)

# Read every file, closing each handle once its contents are loaded
documents = []
for name in file_names:
    with open(os.path.join(directory_path, name), 'r') as f:
        documents.append(f.read())

['6. Personal Data.txt', '9. Processing.txt', '7. Privacy by Design.txt', '12. Right to be Forgotten.txt', '8. Privacy Impact Assessment.txt', '14. Third Countries.txt', '5. Fines_Penalties.txt', '10. Records of Processing Activities.txt', '11. Right of Access.txt', '13. Right to be Informed.txt', '2. Data Protection Officer.txt', '4. Encryption.txt', '1. Consent.txt', '3. Email Marketing.txt']


In [None]:
documents[-2] # '1. Consent.txt' in the listing above; used as the running example for every comparison

'Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing. While being one of the more well-known legal bases for processing personal data, consent is only one of six bases mentioned in the General Data Protection Regulation (GDPR). The others are: contract, legal obligations, vital interests of the data subject, public interest and legitimate interest as stated in Article 6(1) GDPR.\nThe basic requirements for the effectiveness of a valid legal consent are defined in Article 7 and specified further in recital 32 of the GDPR. Consent must be freely given, specific, informed and unambiguous. In order to obtain freely given consent, it must be given on a voluntary basis. The element “free” implies a real choice by the data subject. Any element of inappropriate pressure or influence which could affect the outcome of that choice renders the consent invalid. In doing so, the legal text takes a certain imbala

### Extractive Summarization tutorial - https://www.youtube.com/watch?v=9PoKellNrBc

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [None]:
stopwords = list(STOP_WORDS)

punctuation = punctuation + '\n'
punctuation

nlp = spacy.load('en_core_web_sm')

In [None]:
for doc in [documents[-2]]:
  doc = nlp(doc)

  # Tokenisation
  tokens = [token.text for token in doc]
  # print(tokens)

  # Count non-stop-word, non-punctuation tokens, keyed by lowercase text so
  # the lowercase lookups in the sentence-scoring loop below can find them
  word_frequencies = {}
  for word in doc:
      key = word.text.lower()
      if key not in stopwords and key not in punctuation:
          word_frequencies[key] = word_frequencies.get(key, 0) + 1

  # print(word_frequencies)

  # Normalise every count by the highest count so the weights fall in (0, 1]
  max_frequency = max(word_frequencies.values())
  for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

  # print(word_frequencies)

  sentence_tokens = [sent for sent in doc.sents]
  # print(sentence_tokens)

  sentence_scores = {}
  for sent in sentence_tokens:
    for word in sent:
      if word.text.lower() in word_frequencies.keys():
        if sent not in sentence_scores.keys():
          sentence_scores[sent] = word_frequencies[word.text.lower()]
        else:
          sentence_scores[sent] += word_frequencies[word.text.lower()]

  # print(sentence_scores)

  select_length = int(len(sentence_tokens)*0.4)
  # print(select_length)

  # Keep the highest-scoring sentences (returned in score order, not document order)
  summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
  final_summary = [sent.text for sent in summary]
  summary = ' '.join(final_summary)
  print(summary)

Especially considering that the European data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that part of the processing if an individual withdraws consent.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed and specific, the data subject must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’. If the consent should legitimise the processing of special categories of personal data, the information for the data subject must expressly refer to this.
 While being one of the more well-known legal bases for processing personal data, consent is only one of six bases mentioned in
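
The frequency-based scoring in the cell above can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the tutorial's code: `extractive_summary`, the tiny `STOP` list, and the regex sentence splitter are hypothetical stand-ins for spaCy's tokenizer, `STOP_WORDS`, and `doc.sents`. Unlike the cell above, it returns the selected sentences in document order.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; the notebook uses spaCy's STOP_WORDS).
STOP = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or",
        "it", "be", "for", "on", "that", "this"}

def extractive_summary(text, ratio=0.4):
    """Keep roughly `ratio` of the sentences, ranked by summed word weights."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    if not words:
        return ""
    # Normalise counts by the most frequent word so weights fall in (0, 1].
    freq = Counter(words)
    top = max(freq.values())
    weights = {w: c / top for w, c in freq.items()}
    # Score each sentence as the sum of the weights of its words.
    scores = {
        i: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    k = max(1, int(len(sentences) * ratio))
    # Pick the k best sentences, then restore their original order.
    keep = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in keep)
```

Because scores are plain sums, longer sentences tend to win; dividing by sentence length would be one way to counter that, at the cost of favouring short sentences.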

In [None]:
# Example summaries and references
summary = '''Especially considering that the European data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that part of the processing if an individual withdraws consent.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed and specific, the data subject must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’. If the consent should legitimise the processing of special categories of personal data, the information for the data subject must expressly refer to this.
 While being one of the more well-known legal bases for processing personal data, consent is only one of six bases mentioned in the General Data Protection Regulation (GDPR). Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing. For example, in an employer-employee relationship: The employee may worry that his refusal to consent may have severe negative consequences on his employment relationship, thus consent can only be a lawful basis for processing in a few exceptional circumstances. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing. Thus, the performance of a contract may not be made dependent upon the consent to process further personal data, which is not needed for the performance of that contract.
 As one can see consent is not a silver bullet when it comes to the processing of personal data. That being said, there is no form requirement for consent, even if written consent is recommended due to the accountability of the controller. Therefore, consent should always be chosen as a last option for processing personal data.
 The data subject must also be informed about his or her right to withdraw consent anytime.'''
reference = documents[-2]

rouge = Rouge()
scores = rouge.get_scores(summary, reference)
print('Rouge score:', scores)

Rouge score: [{'rouge-1': {'r': 0.549079754601227, 'p': 1.0, 'f': 0.7089108865127733}, 'rouge-2': {'r': 0.48538961038961037, 'p': 0.964516129032258, 'f': 0.6457883324790432}, 'rouge-l': {'r': 0.549079754601227, 'p': 1.0, 'f': 0.7089108865127733}}]
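
To make the `r`, `p` and `f` fields above easier to read, here is a simplified sketch of a ROUGE-1 style computation over sets of whitespace-separated tokens. The function name `rouge_1` is hypothetical, and the `rouge` package may tokenize and count n-grams differently, so treat this as an illustration of the idea rather than a re-implementation of the library.

```python
def rouge_1(hypothesis, reference):
    """Unigram overlap between a summary and a reference, as recall/precision/F1."""
    hyp = set(hypothesis.split())
    ref = set(reference.split())
    overlap = len(hyp & ref)
    # Precision: share of summary tokens that also occur in the reference.
    p = overlap / len(hyp) if hyp else 0.0
    # Recall: share of reference tokens that the summary covers.
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}
```

Since the reference here is the full source document, a purely extractive summary copies its tokens verbatim, which is consistent with the precision of 1.0 reported above; recall then mostly reflects how much of the document was kept.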


### Summarization with pre-trained summarizer

#### Default summarization

In [None]:
from transformers import pipeline


# Load Summarization Model
summarizer = pipeline("summarization")

# Generate Summaries
summaries = [summarizer(doc, max_length=1000, min_length=200, length_penalty=2.0, num_beams=9)[0]['summary_text'] for doc in [documents[-2]]]

summaries[0]

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 1000, but your input_length is only 859. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=429)


' Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing . The basic requirements for the effectiveness of a valid legal consent are defined in Article 7 and specified further in recital 32 of the General Data Protection Regulation (GDPR) consent must be freely given, specific, informed and unambiguous . For consent of children and adolescents in relation to information society services is a special case . For those who are under the age of 16, there is an additional consent or authorisation requirement from the holder of parental responsibility . The EU data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that choice if an individual withdraws consent’.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate inter

In [None]:
# Example summaries and references
summary = '''Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing . The basic requirements for the effectiveness of a valid legal consent are defined in Article 7 and specified further in recital 32 of the General Data Protection Regulation (GDPR) consent must be freely given, specific, informed and unambiguous . For consent of children and adolescents in relation to information society services is a special case . For those who are under the age of 16, there is an additional consent or authorisation requirement from the holder of parental responsibility . The EU data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that choice if an individual withdraws consent’.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate interest once the user’s consent is not permitted to'''
reference = documents[-2]
scores = rouge.get_scores(summary, reference)
print('Rouge score:', scores)

Rouge score: [{'rouge-1': {'r': 0.3282208588957055, 'p': 0.963963963963964, 'f': 0.4897025133727464}, 'rouge-2': {'r': 0.2435064935064935, 'p': 0.9259259259259259, 'f': 0.3856041098131786}, 'rouge-l': {'r': 0.3282208588957055, 'p': 0.963963963963964, 'f': 0.4897025133727464}}]


In [None]:
# # Create a DataFrame
# df = pd.DataFrame({"Document": [file[3:-4] for file in file_names], "Summary": summaries})

# # Save to CSV
# df.to_csv("/content/drive/MyDrive/GDPR/Data/GDPR_summaries.csv", index=False)

# df.head()

| Unnamed: 0 | Document | Summary |
|---|---|---|
| 0 | Consent | Processing personal data is generally prohibi... |
| 1 | Data Protection Officer | The General Data Protection Regulation (GDPR)... |
| 2 | Email Marketing | Processing is only allowed by the General Dat... |
| 3 | Encryption | Companies can reduce the probability of a dat... |
| 4 | Fines_Penalties | National authorities can or must assess fines... |


#### Pre-trained legal summarization model - https://huggingface.co/nsi319/legal-pegasus

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("nsi319/legal-pegasus")
model = AutoModelForSeq2SeqLM.from_pretrained("nsi319/legal-pegasus")


text = """The competent supervisory authority shall approve binding corporate rules in accordance with the consistency mechanism set out in Article 63, provided that they
are legally binding and apply to and are enforced by every member concerned of the group of undertakings, or group of enterprises engaged in a joint economic activity, including their employees
expressly confer enforceable rights on data subjects with regard to the processing of their personal data. The binding corporate rules referred to in paragraph 1 shall specify at least
the structure and contact details of the group of undertakings, or group of enterprises engaged in a joint economic activity and of each of its members
the data transfers or set of transfers, including the categories of personal data, the type of processing and its purposes, the type of data subjects affected and the identification of the third country or countries in question
their legally binding nature, both internally and externally
the application of the general data protection principles, in particular purpose limitation, data minimisation, limited storage periods, data quality, data protection by design and by default, legal basis for processing, processing of special categories of personal data, measures to ensure data security, and the requirements in respect of onward transfers to bodies not bound by the binding corporate rules
the rights of data subjects in regard to processing and the means to exercise those rights, including the right not to be subject to decisions based solely on automated processing, including profiling in accordance with Article 22, the right to lodge a complaint with the competent supervisory authority and before the competent courts of the Member States in accordance with Article 79, and to obtain redress and, where appropriate, compensation for a breach of the binding corporate rules
the acceptance by the controller or processor established on the territory of a Member State of liability for any breaches of the binding corporate rules by any member concerned not established in the Union; the controller or the processor shall be exempt from that liability, in whole or in part, only if it proves that that member is not responsible for the event giving rise to the damage
how the information on the binding corporate rules, in particular on the provisions referred to in points (d), (e) and (f) of this paragraph is provided to the data subjects in addition to Articles 13 and 14
the tasks of any data protection officer designated in accordance with  Article 37 or any other person or entity in charge of the monitoring compliance with the binding corporate rules within the group of undertakings, or group of enterprises engaged in a joint economic activity, as well as monitoring training and complaint-handling
the complaint procedures
the mechanisms within the group of undertakings, or group of enterprises engaged in a joint economic activity for ensuring the verification of compliance with the binding corporate rules. Such mechanisms shall include data protection audits and methods for ensuring corrective actions to protect the rights of the data subject. Results of such verification should be communicated to the person or entity referred to in point (h) and to the board of the controlling undertaking of a group of undertakings, or of the group of enterprises engaged in a joint economic activity, and should be available upon request to the competent supervisory authority
the mechanisms for reporting and recording changes to the rules and reporting those changes to the supervisory authority
the cooperation mechanism with the supervisory authority to ensure compliance by any member of the group of undertakings, or group of enterprises engaged in a joint economic activity, in particular by making available to the supervisory authority the results of verifications of the measures referred to in point (j)
the mechanisms for reporting to the competent supervisory authority any legal requirements to which a member of the group of undertakings, or group of enterprises engaged in a joint economic activity is subject in a third country which are likely to have a substantial adverse effect on the guarantees provided by the binding corporate rules; """

# Summarize the Consent document; the `text` sample defined above is not used in this cell
input_tokenized = tokenizer.encode(documents[-2], return_tensors='pt', max_length=1024, truncation=True)
summary_ids = model.generate(input_tokenized,
                                  num_beams=9,
                                  no_repeat_ngram_size=3,
                                  length_penalty=2.0,
                                  min_length=150,
                                  max_length=250,
                                  early_stopping=True)
summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
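# The sample output below appears to be carried over from the model card's SEC example; it was not generated from the GDPR input above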

# The Securities and Exchange Commission today charged AT&T, Inc. and three of its Investor Relations executives with aiding and abetting the company's violations of the antifraud provisions of Section 10(b) of the Securities Exchange Act of 1934 and Rule 10b-5 thereunder. According to the SEC's complaint, the company learned in March 2016 that a steeper-than-expected decline in its first quarter smartphone sales would cause its revenue to fall short of analysts' estimates for the quarter. The complaint alleges that to avoid falling short of the consensus revenue estimate for the third consecutive quarter, the executives made private, one-on-one phone calls to analysts at approximately 20 separate firms. On these calls, the SEC alleges that Christopher Womack, Michael Black, and Kent Evans allegedly disclosed internal smartphone sales data and the impact of that data on internal revenue metrics. The SEC further alleges that as a result of what they were told, the analysts substantially reduced their revenue forecasts, leading to the overall consensus Revenue Estimate falling to just below the level that AT&t ultimately reported to the public on April 26, 2016. The SEC is seeking permanent injunctive relief and civil monetary penalties against each defendant.


In [None]:
summary

'The General Data Protection Regulation (GDR) came into force on 25 May 2012 and is designed to protect the personal data of individuals in the European Union. According to the GDR, the processing of personal data is generally prohibited unless it is expressly allowed by law or the data subject has consented to it. For example, in an employer-employee relationship, the employee may worry that his consent may have severe negative consequences on his employment relationship. Therefore consent can only be a lawful basis for processing in a few exceptional circumstances. In addition, a so-called "coupling" or "prohibition of consent" applies. Thus, the performance of a contract may not be made dependent upon the consent to process further personal data. For consent to be informed and specific, the data must be freely given, specific, informed and unambiguous. In order to obtain freely given consent, it must be given on a basis of consent. The data subject must also be informed about his or

In [None]:
# Example summaries and references
summary = '''The General Data Protection Regulation (GDR) came into force on 25 May 2012 and is designed to protect the personal data of individuals in the European Union. According to the GDR, the processing of personal data is generally prohibited unless it is expressly allowed by law or the data subject has consented to it. For example, in an employer-employee relationship, the employee may worry that his consent may have severe negative consequences on his employment relationship. Therefore consent can only be a lawful basis for processing in a few exceptional circumstances. In addition, a so-called "coupling" or "prohibition of consent" applies. Thus, the performance of a contract may not be made dependent upon the consent to process further personal data. For consent to be informed and specific, the data must be freely given, specific, informed and unambiguous. In order to obtain freely given consent, it must be given on a basis of consent. The data subject must also be informed about his or her right to withdraw consent for any part of the processing. The controller must also inform about the use of the data for automated decision-making. In this regard, the controller also has to inform about possible risks of data transfers due to absence of appropriate safeguards or appropriate decision or'''
reference = documents[-2]
scores = rouge.get_scores(summary, reference)
print('Rouge score:', scores)

Rouge score: [{'rouge-1': {'r': 0.29754601226993865, 'p': 0.8290598290598291, 'f': 0.4379232466772316}, 'rouge-2': {'r': 0.22564935064935066, 'p': 0.6984924623115578, 'f': 0.34110429078748916}, 'rouge-l': {'r': 0.27300613496932513, 'p': 0.7606837606837606, 'f': 0.40180586518738953}}]


In [None]:
summaries = []
for doc in documents:
  input_tokenized = tokenizer.encode(doc, return_tensors='pt',max_length=1024,truncation=True)
  summary_ids = model.generate(input_tokenized,
                                  num_beams=9,
                                  no_repeat_ngram_size=3,
                                  length_penalty=2.0,
                                  min_length=150,
                                  max_length=300,
                                  early_stopping=True)
  summaries.append([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0])

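# Create a DataFrame; strip the numeric prefix (e.g. "10. ") and the ".txt" extension from each file name
df = pd.DataFrame({"Document": [os.path.splitext(file)[0].split('. ', 1)[1] for file in file_names], "Summary": summaries})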

# Save to CSV
df.to_csv("/content/drive/MyDrive/GDPR/GDPR/Data/GDPR_summaries.csv", index=False)

In [None]:
df.head()

#### Pre-trained legal summarization model with extractive summarization as a pre-processing step

In [None]:

# Input from extractive summary
text = """Especially considering that the European data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that part of the processing if an individual withdraws consent.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed and specific, the data subject must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’. If the consent should legitimise the processing of special categories of personal data, the information for the data subject must expressly refer to this.
 While being one of the more well-known legal bases for processing personal data, consent is only one of six bases mentioned in the General Data Protection Regulation (GDPR). Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing. For example, in an employer-employee relationship: The employee may worry that his refusal to consent may have severe negative consequences on his employment relationship, thus consent can only be a lawful basis for processing in a few exceptional circumstances. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing. Thus, the performance of a contract may not be made dependent upon the consent to process further personal data, which is not needed for the performance of that contract.
 As one can see consent is not a silver bullet when it comes to the processing of personal data. That being said, there is no form requirement for consent, even if written consent is recommended due to the accountability of the controller. Therefore, consent should always be chosen as a last option for processing personal data.
 The data subject must also be informed about his or her right to withdraw consent anytime.
 """

input_tokenized = tokenizer.encode(text, return_tensors='pt',max_length=1024,truncation=True)
summary_ids = model.generate(input_tokenized,
                                  num_beams=9,
                                  no_repeat_ngram_size=3,
                                  length_penalty=2.0,
                                  min_length=150,
                                  max_length=250,
                                  early_stopping=True)
summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]

print(summary)

The European data protection authorities have made it clear that if a controller chooses to rely on consent for any part of processing, they must be prepared to respect that choice if an individual withdraws consent. Strictly interpreted this means that the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed specific data must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’.<n>While consent is one of the more well-known legal bases for processing personal data, it is only one of six mentioned bases in the General Data Protection Regulation.
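
The `<n>` in the output above looks like the newline marker that Pegasus tokenizers emit. A minimal sketch of cleaning it up before scoring or saving a summary follows; the helper name `clean_pegasus_newlines` is hypothetical and not part of `transformers`.

```python
def clean_pegasus_newlines(summary: str) -> str:
    """Replace Pegasus '<n>' newline markers with single spaces."""
    # Split on the marker, drop empty pieces, and rejoin with one space each.
    parts = (part.strip() for part in summary.split("<n>"))
    return " ".join(part for part in parts if part)
```

Leaving the marker in place glues it to the neighbouring words when the text is split on whitespace, so removing it first may shift token-overlap scores slightly.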


In [None]:
# Example summaries and references
summary = '''The European data protection authorities have made it clear that if a controller chooses to rely on consent for any part of processing, they must be prepared to respect that choice if an individual withdraws consent. Strictly interpreted this means that the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed specific data must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’.<n>While consent is one of the more well-known legal bases for processing personal data, it is only one of six mentioned bases in the General Data Protection Regulation.'''
reference = documents[-2]
scores = rouge.get_scores(summary, reference)
print('Rouge score:', scores)

Rouge score: [{'rouge-1': {'r': 0.2607361963190184, 'p': 0.9659090909090909, 'f': 0.4106280159761022}, 'rouge-2': {'r': 0.1737012987012987, 'p': 0.84251968503937, 'f': 0.2880215314860819}, 'rouge-l': {'r': 0.2607361963190184, 'p': 0.9659090909090909, 'f': 0.4106280159761022}}]


#### Pre-trained default summarization model with extractive summarization as a pre-processing step

In [None]:
text = """Especially considering that the European data protection authorities have made it clear “that if a controller chooses to rely on consent for any part of the processing, they must be prepared to respect that choice and stop that part of the processing if an individual withdraws consent.” Strictly interpreted, this means the controller is not allowed to switch from the legal basis consent to legitimate interest once the data subject withdraws his consent. For consent to be informed and specific, the data subject must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’. If the consent should legitimise the processing of special categories of personal data, the information for the data subject must expressly refer to this.
 While being one of the more well-known legal bases for processing personal data, consent is only one of six bases mentioned in the General Data Protection Regulation (GDPR). Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing. For example, in an employer-employee relationship: The employee may worry that his refusal to consent may have severe negative consequences on his employment relationship, thus consent can only be a lawful basis for processing in a few exceptional circumstances. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing. Thus, the performance of a contract may not be made dependent upon the consent to process further personal data, which is not needed for the performance of that contract.
 As one can see consent is not a silver bullet when it comes to the processing of personal data. That being said, there is no form requirement for consent, even if written consent is recommended due to the accountability of the controller. Therefore, consent should always be chosen as a last option for processing personal data.
 The data subject must also be informed about his or her right to withdraw consent anytime.
 """

summarizer = pipeline("summarization")

# Generate Summaries
summaries = [summarizer(doc, max_length=1000, min_length=200, length_penalty=2.0, num_beams=9)[0]['summary_text'] for doc in [text]]

summaries[0]

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 1000, but your input_length is only 431. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=215)


' Consent is only one of six legal bases mentioned in the General Data Protection Regulation (GDPR) Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing . The data subject must at least be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against ‘function creep’ If the consent should legitimise the processing of special categories of personal data, the information must expressly refer to this . Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion . There is no form requirement for consent, even if written consent is recommended due to the accountability of the controller . The Data subject must also be informed about his or her right to withdraw consent anytime. For example, in an employer-employee relationship: The employee may worry that h

In [None]:
# Example summaries and references
summary = summaries[0]
reference = documents[-2]
scores = rouge.get_scores(summary, reference)
print('Rouge score:', scores)

Rouge score: [{'rouge-1': {'r': 0.3282208588957055, 'p': 0.9907407407407407, 'f': 0.493087553865234}, 'rouge-2': {'r': 0.2353896103896104, 'p': 0.9119496855345912, 'f': 0.3741935451256941}, 'rouge-l': {'r': 0.3282208588957055, 'p': 0.9907407407407407, 'f': 0.493087553865234}}]
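
The F-scores printed throughout this notebook can be gathered in one place for comparison. The sketch below copies the recorded values (rounded to four decimals) for the Consent document only; the dictionary labels are my own shorthand for the five approaches, not names used elsewhere in the notebook.

```python
# ROUGE F-scores recorded above for the Consent document, rounded to 4 decimals.
f_scores = {
    "extractive (spaCy word frequency)": {"rouge-1": 0.7089, "rouge-2": 0.6458, "rouge-l": 0.7089},
    "distilbart-cnn-12-6":               {"rouge-1": 0.4897, "rouge-2": 0.3856, "rouge-l": 0.4897},
    "legal-pegasus":                     {"rouge-1": 0.4379, "rouge-2": 0.3411, "rouge-l": 0.4018},
    "extractive -> legal-pegasus":       {"rouge-1": 0.4106, "rouge-2": 0.2880, "rouge-l": 0.4106},
    "extractive -> distilbart-cnn-12-6": {"rouge-1": 0.4931, "rouge-2": 0.3742, "rouge-l": 0.4931},
}

# Rank the approaches by ROUGE-1 F-score, best first.
ranked = sorted(f_scores.items(), key=lambda item: item[1]["rouge-1"], reverse=True)

for name, scores in ranked:
    print(f"{name:<36} R1={scores['rouge-1']:.4f}  R2={scores['rouge-2']:.4f}  RL={scores['rouge-l']:.4f}")
```

These numbers come from a single document scored against its own source text, so they favour approaches that copy sentences verbatim and should be read as a rough comparison rather than a measure of summary quality.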
