# Challenge - GDPR Compliant

![](http://eleanorglanvillecentre.lincoln.ac.uk/assets/images/content/_large/adalovelacehero.jpg)

The `ada_lovelace.txt` file, located in the `input` folder, contains some information about Ada Lovelace. The problem is that this file is full of identifying information about people and, as such, is really not GDPR-compliant 😱 (info: the [General Data Protection Regulation](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation), known as RGPD in French, is a regulation in EU law on data protection and privacy).

## Guidelines
The objective of this exercise is to write a function that cleans up a file by replacing all mentions of people's names with "\[REDACTED\]", in order to comply with European law.

In [5]:
# TODO : Imports
import numpy as np
import pandas as pd
import spacy
from spacy import displacy

# Load the medium English pipeline (includes the NER component)
nlp = spacy.load("en_core_web_md")



In [3]:
# TODO : load file and have a look at it
df = pd.read_csv('../input/ada_lovelace.txt',sep='\t',header=None)
df.head()

|   | 0 |
|---|---|
| 0 | Augusta Ada King, Countess of Lovelace (née By... |
| 1 | Lovelace became close friends with her tutor M... |


In [9]:
df = df.reset_index()

**Q1.** Using the spaCy NER tools, identify the **entities** in this document and their corresponding tags.

In [14]:
# TODO : Named Entities Recognition
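def ner_entries(my_df):
    """Return one list of spaCy entity spans per row of the text column."""
    entries_list = []
    for index, row in my_df.iterrows():
        print(f'row: {row}')  # debug: show the row being processed
        doc = nlp(row[0])  # column 0 holds the raw text
        entries = list(doc.ents)
        entries_list.append(entries)
    return entries_list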

entries = ner_entries(df)
entries

row: index                                                    0
0        Augusta Ada King, Countess of Lovelace (née By...
Name: 0, dtype: object
row: index                                                    1
0        Lovelace became close friends with her tutor M...
Name: 1, dtype: object


[[Augusta Ada King,
  Lovelace,
  née Byron,
  10 December 1815,
  27,
  November 1852,
  English,
  Charles Babbage's,
  the Analytical Engine,
  first,
  first,
  first,
  one,
  first],
 [Lovelace,
  Mary Somerville,
  Charles Babbage,
  1833,
  Somerville,
  many years,
  Andrew Crosse,
  David Brewster,
  Charles Wheatstone,
  Michael Faraday,
  Charles Dickens]]
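
The output above lists the entity spans but not their tags, which Q1 also asks for. A minimal sketch of how each entity could be paired with its label, using the `text` and `label_` attributes of spaCy entity spans. The helper name `entity_tags` and the stand-in document are hypothetical; the stand-in only exists so the snippet runs without loading the model.

```python
from types import SimpleNamespace


def entity_tags(doc):
    """Return (text, label) pairs for every entity found in `doc`."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Stand-in for a spaCy Doc so the sketch runs without the model (hypothetical data).
# With the notebook's pipeline you would call: entity_tags(nlp(df[0][0]))
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Mary Somerville", label_="PERSON"),
    SimpleNamespace(text="1833", label_="DATE"),
])
tags = entity_tags(fake_doc)
print(tags)
```

Keeping the label next to the text makes it easy to filter on `PERSON` later, when only people's names need to be redacted.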

**Q2.** Display the identified entities in a more visual manner.

In [11]:
# TODO : NER visualization
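def ner_viz(my_df):
    """Render the entities of each row with displaCy and return them all."""
    entries = []
    for index, row in my_df.iterrows():
        print(f'row: {row}')  # debug: show the row being processed
        doc = nlp(row[0])  # column 0 holds the raw text
        entries.extend(doc.ents)
        # Highlight the entities inline (rendered directly in a notebook)
        displacy.render(doc, style="ent")
    return entries
ner_viz(df)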

row: index                                                    0
0        Augusta Ada King, Countess of Lovelace (née By...
Name: 0, dtype: object


row: index                                                    1
0        Lovelace became close friends with her tutor M...
Name: 1, dtype: object


[Augusta Ada King,
 Lovelace,
 née Byron,
 10 December 1815,
 27,
 November 1852,
 English,
 Charles Babbage's,
 the Analytical Engine,
 first,
 first,
 first,
 one,
 first,
 Lovelace,
 Mary Somerville,
 Charles Babbage,
 1833,
 Somerville,
 many years,
 Andrew Crosse,
 David Brewster,
 Charles Wheatstone,
 Michael Faraday,
 Charles Dickens]

**Q3.** Write a function `replace_name_by_redacted` that modifies the document to replace all occurrences of names with "\[REDACTED\]", and apply it to the file.

In [6]:
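# TODO : `replace_name_by_redacted`
def replace_name_by_redacted(my_df):
    redacted = []
    for text in my_df[0]:
        # Replace PERSON entities back to front so earlier offsets stay valid
        for ent in reversed(nlp(text).ents):
            if ent.label_ == "PERSON":
                text = text[:ent.start_char] + "[REDACTED]" + text[ent.end_char:]
        redacted.append(text)
    return redacted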
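
The replacement step itself does not depend on spaCy: given character spans, it only needs to splice in the placeholder from the end of the text towards the start, so that earlier offsets are not shifted. A minimal sketch under that assumption; the helper name `redact_spans` and the hand-written sample offsets are hypothetical. With the notebook's pipeline, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]`.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace every span whose label is in `labels` with `placeholder`.

    `spans` is an iterable of (start_char, end_char, label) tuples, assumed
    to be non-overlapping.
    """
    # Process spans from last to first so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        if label in labels:
            text = text[:start] + placeholder + text[end:]
    return text


# Hand-written offsets standing in for NER output (hypothetical example).
sample = "Lovelace met Charles Babbage in 1833."
sample_spans = [(0, 8, "PERSON"), (13, 28, "PERSON"), (32, 36, "DATE")]
redacted_sample = redact_spans(sample, sample_spans)
print(redacted_sample)
```

Filtering on the label keeps dates and other non-personal entities intact, so only names are removed from the text.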