# Data Engineering Test

## Task 1: Masking Named Entities


**Objective:** 

Mask the named entities in the provided text, i.e., e-mail text data. <br>
The text is taken from the English translation of Noli Me Tangere, published by Project Gutenberg.

**Example:**

Original:
“On the last of October Don Santiago de los Santos, popularly known as Capitan Tiago, gave a dinner.” <br>
Masked: “On the last of October XXX XXXXXXXX XX XXX XXXXXX, popularly known as XXXXXXX XXXXX, gave a dinner.”

In [1]:
# !pip install spacy
# !python -m spacy download en_core_web_trf

In [1]:
import spacy
import re

from pprint import PrettyPrinter as pp

In [3]:
# Load RoBERTa-based transformer
NER = spacy.load("en_core_web_trf")


# Load document to mask
with open(
    file = "Data/DE Track exam - Noli Me Tangere EN.txt", 
    encoding = "utf8"
        ) as f:
    text = f.read()


# Extract named entities
NER_text = NER(text)


# Types of named entities to mask
NE = [
        'PERSON', # People, including fictional.
        'ORG',    # Companies, agencies, institutions, etc.
        'LOC',    # Non-GPE locations, mountain ranges, bodies of water.
        'GPE'     # Countries, cities, states. 
    ]


# Function defining how to mask named entities
#     Replace all characters by 'X' except whitespaces
mask = lambda x: re.sub(r'\S', 'X', x)


# Mask named entities
masked_text = ' '.join(
    [
        mask(token.text) 
            if token.ent_type_ in NE 
            else token.text 
        for token in NER_text
    ]
)


# Print masked text
pp().pprint(masked_text)

('Chapter I - A Social Gathering \n'
 ' = = = = = = = = = = = \n'
 ' On the last of October XXX XXXXXXXX XX XXX XXXXXX , popularly known as '
 'XXXXXXX XXXXX , gave a dinner . In spite of the fact that , contrary to his '
 'usual custom , he had made the announcement only that afternoon , it was '
 'already the sole topic of conversation in XXXXXXX and adjacent districts , '
 'and even in XXX XXXXXX XXXX , for at that time XXXXXXX XXXXX was considered '
 'one of the most hospitable of men , and it was well known that his house , '
 'like his country , shut its doors against nothing except commerce and all '
 'new or bold ideas . Like an electric shock the announcement ran through the '
 'world of parasites , bores , and hangers - on , whom God in His infinite '
 'bounty creates and so kindly multiplies in XXXXXX . Some looked at once for '
 'shoe - polish , others for buttons and cravats , but all were especially '
 'concerned about how to greet the master of the house in the most fami
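
Joining tokens with single spaces, as in the cell above, inserts a space before punctuation (visible in the output as `XXXXXX , popularly`). The following is a minimal sketch of an alternative that masks by character offsets and leaves every other character untouched. `mask_spans` is a hypothetical helper, not part of the notebook, and the spans are hand-written here; in practice they might be built from each entity's `start_char` and `end_char` in spaCy.

```python
import re


def mask_spans(text, spans):
    """Replace every non-whitespace character inside the (start, end) spans with 'X'."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])                   # untouched text before the span
        pieces.append(re.sub(r'\S', 'X', text[start:end]))  # masked entity, whitespace kept
        cursor = end
    pieces.append(text[cursor:])                            # untouched tail
    return ''.join(pieces)


sentence = (
    "On the last of October Don Santiago de los Santos, "
    "popularly known as Capitan Tiago, gave a dinner."
)

# Hand-written offsets standing in for NER output (assumption for illustration)
names = ["Don Santiago de los Santos", "Capitan Tiago"]
spans = [(sentence.index(n), sentence.index(n) + len(n)) for n in names]

masked = mask_spans(sentence, spans)
print(masked)
```

Because only characters inside the spans are rewritten, the masked string has the same length and spacing as the input, which matches the format of the example in the objective.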

## Task 2: Format Phone Numbers


**Objective**

Given 100 hypothetical Philippine phone numbers in different formats, <br>
process the raw phone numbers and transform them into the form: 639XXXXXXXXX.

**Example**

Original: 933 389 1120 <br>
Expected: 639333891120

In [4]:
import pandas as pd

In [5]:
# Load data
df = pd.read_csv('Data/DE Track exam - Phone #.csv', index_col='ID')
print(df.shape)
df.sample(10)

(100, 1)


| ID | Phone Numbers |
|---|---|
| 4 | 926 997 3019 |
| 81 | +63 (928) 3385041 |
| 39 | +639488594248 |
| 32 | +639595172926 |
| 16 | 990 884 3046 |
| 72 | +63 (987) 2684501 |
| 35 | +639392433440 |
| 45 | +63 (996) 7853519 |
| 48 | +63 (983) 5627910 |
| 75 | +63 (974) 6244215 |


In [6]:
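# Format phone numbers: drop non-digits, keep the last 10 digits, prepend '63'
df['Phone Numbers'] = (
    df['Phone Numbers']
        .astype(str)                          # ensure string methods apply
        .str.replace(r'\D', '', regex=True)   # remove '+', spaces, parentheses
        .apply(lambda x: '63' + x[-10:])
        .astype('int64')
)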
df.sample(10)

| ID | Phone Numbers |
|---|---|
| 49 | 639556902644 |
| 26 | 639423148894 |
| 55 | 639758850087 |
| 57 | 639793822711 |
| 29 | 639594888963 |
| 18 | 639594349256 |
| 67 | 639755146422 |
| 64 | 639739479800 |
| 21 | 639617887189 |
| 9 | 639301248769 |


In [7]:
# Sanity check
df['Phone Numbers'].astype(str).apply(len).unique()

array([12], dtype=int64)
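
The length check confirms that every value has 12 digits, but it cannot tell whether the ten retained digits begin with 9 or whether the discarded prefix was sensible. A stricter per-number check could look like the sketch below. `normalize_ph_mobile` is a hypothetical helper, and the set of accepted prefixes (none, a leading `0`, or `63`) is an assumption rather than something stated in the task.

```python
import re


def normalize_ph_mobile(raw):
    """Return the number as '639XXXXXXXXX'; raise ValueError if it does not fit that shape."""
    digits = re.sub(r'\D', '', str(raw))   # drop '+', spaces, parentheses, dashes
    national = digits[-10:]                # ten-digit number, expected to start with 9
    prefix = digits[:-10]                  # whatever preceded those ten digits

    if len(national) != 10 or not national.startswith('9'):
        raise ValueError(f"not a recognisable mobile number: {raw!r}")
    # Assumed acceptable prefixes: nothing, a leading '0', or country code '63'
    if prefix not in ('', '0', '63'):
        raise ValueError(f"unexpected prefix {prefix!r} in {raw!r}")
    return '63' + national


samples = ["933 389 1120", "+63 (928) 3385041", "+639488594248"]
normalized = [normalize_ph_mobile(s) for s in samples]
print(normalized)
```

Raising on unexpected input, instead of silently keeping the last ten digits, makes malformed rows visible before they are cast to integers.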