# ETL sandbox

This notebook contains the full ETL process that transforms a .txt file of the Colombian Constitution into an index on an Elastic Search Server (ESS). If you don't have an ESS, or want to run local loads, you can use a Docker image. Note that if you want to use this notebook to create JSON documents from .txt files, they must follow this hierarchy structure:

```
hierarchy = {
    'TITULO' : 'headline',
    'DISPOSICIONES' : 'headline',
    'CAPITULO' : 'chapter',
    'ARTÍCULO' : 'article'

}
```

Keep in mind that this order represents the heading level of each classification. In other words, a headline corresponds to an _"h1"_ in HTML, a chapter to an _"h2"_ and an article to a _"p"_.
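The mapping between hierarchy levels and HTML tags can be illustrated with a small sketch. It is not part of the project: `html_tags` and `render_line` are hypothetical names, and the sketch assumes that the first word of a line decides its level.

```python
import html

hierarchy = {
    'TITULO': 'headline',
    'DISPOSICIONES': 'headline',
    'CAPITULO': 'chapter',
    'ARTÍCULO': 'article',
}

# Hypothetical mapping from hierarchy level to HTML tag.
html_tags = {'headline': 'h1', 'chapter': 'h2', 'article': 'p'}


def render_line(line):
    """Wrap a line in the HTML tag that matches its hierarchy level."""
    words = line.split()
    # Compare in upper case so 'Artículo' matches the 'ARTÍCULO' key.
    level = hierarchy.get(words[0].upper()) if words else None
    # Lines that match no level are treated as plain paragraphs.
    tag = html_tags.get(level, 'p')
    return f"<{tag}>{html.escape(line.strip())}</{tag}>"


print(render_line('TITULO I\n'))
print(render_line('CAPITULO 1\n'))
print(render_line('Artículo 1. Colombia es un Estado social de derecho.\n'))
```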

## Importing libraries and tools.

The libraries needed to run the main project are listed, as convention dictates, in the requirements.txt file. The only library used here that is not already installed in the Python kernel is [tabulate](https://pypi.org/project/tabulate/). To install it you can use
```
pip install tabulate
```

or

```
pip3 install tabulate
```

About support.

In [23]:
!pip3 install tabulate
from classifier import *
from os import path
from support import *
from tabulate import tabulate



In [24]:
import requests
import json

## Extracting the data.

The file path is resolved with the `path` tools. This way it does not matter that the root path changes from one computer to another: the file with the constitution is loaded into our program without inconvenience.

Another consideration here is the removal of the main title of the text, "Constitucion Politica de Colombia", because it adds no relevant information, only redundancy.

In [25]:
root_folder = "LegalSearcher/ReadFiles"
constitution_f = f'../ReadFiles/constitucion_colombiana.txt'

filepath = path.abspath(constitution_f)

with open(filepath, 'r') as f:
    # The with block closes the file automatically
    or_text = f.readlines()

print('Original len:', len(or_text))
text = []

for line in or_text:
    if line != '\n':
        if line != ' \n':
            text.append(line)

# Drop the title "Constitucion Politica de Colombia"
text.pop(0)
print('Total elements:', len(text), '\n----------------------')

# Particular case: preview the first elements
text[:5]

Original len: 2742
Total elements: 1353 
----------------------


['TITULO I\n',
 'DE LOS PRINCIPIOS FUNDAMENTALES\n',
 'Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.\n',
 'Artículo 2. Son fines esenciales del Estado: servir a la comunidad, promover la prosperidad general y garantizar la efectividad de los principios, derechos y deberes consagrados en la Constitución; facilitar la participación de todos en las decisiones que los afectan y en la vida económica, política, administrativa y cultural de la Nación; defender la independencia nacional, mantener la integridad territorial y asegurar la convivencia pacifica y la vigencia de un orden justo.\n',
 'Las autoridades de la República están instituidas para proteger a todas las personas residentes en Colo

Here the first metric appears: the original length of the list of elements in the file. As can be seen, the title headlines do not end with a final period before the line break.

### Considerations for EDA.
To support the EDA process and remove noise from the data, the paragraphs of the text will be split by periods (".") and semicolons (";"). This was decided in search of the best granularity for the embeddings model. The split is made one punctuation mark at a time, so the metrics can be checked after each one.
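The `split_text_in_lines` function comes from the project's `support` module and is not shown in this notebook. As a rough idea of what such a splitter might do, here is a minimal sketch under the assumption that each piece keeps its delimiter and that lines without the delimiter are left untouched. `split_keep_delimiter` is a hypothetical name, not the project's function.

```python
def split_keep_delimiter(lines, delimiter="."):
    """Split every line on `delimiter`, keeping the delimiter on each piece."""
    pieces = []
    for line in lines:
        if delimiter not in line:
            # Headlines such as 'TITULO I\n' have no delimiter: keep them whole.
            pieces.append(line)
            continue
        for part in line.split(delimiter):
            # Re-attach the delimiter that str.split removed.
            pieces.append(part + delimiter)
    return pieces


sample = ['TITULO I\n', 'Artículo 1. Colombia es un Estado social de derecho.\n']
print(split_keep_delimiter(sample))
```

With this approach the line break at the end of a paragraph becomes a piece of its own (`'\n.'`), which has to be cleaned afterwards.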

In [26]:
dot_text = split_text_in_lines(text, delimiter=".")
print('Total elements:', len(dot_text), '\n----------------------')
dot_text[:5]

Total elements: 3593 
----------------------


['TITULO I\n',
 'DE LOS PRINCIPIOS FUNDAMENTALES\n',
 'Artículo 1.',
 ' Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.',
 '\n.']

And here some noise shows up. The function was written with the line break ("\n") in mind: keeping it helps the Frontend when rendering, but here it only adds noise. To remove it, and to tidy the sentences by removing the leading space left by the splitter, the following process will be executed:

    1) Remove the line breaks "\n.".
    2) Remove the space at the beginning.
    3) Remove the line breaks ('\n') on headlines and chapters.
    4) Print the true length of dot_text.
    5) Generate dcomma_text and get its length.
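The first three steps above can be sketched as a single pass over the pieces. This is a minimal sketch, not the notebook's own cell: `clean_pieces` is a hypothetical name, and it strips the trailing line break from every piece, which may differ slightly from the behaviour of the cell that follows.

```python
def clean_pieces(pieces):
    """Apply steps 1 to 3 to a list of split pieces."""
    cleaned = []
    for piece in pieces:
        # Step 2: drop the leading space left by the splitter.
        if piece.startswith(' '):
            piece = piece[1:]
        # Step 3: drop the trailing line break of headlines and chapters.
        piece = piece.rstrip('\n')
        # Step 1: skip the "\n." leftovers and anything that ended up empty.
        if piece in ('', '\n.'):
            continue
        cleaned.append(piece)
    return cleaned


sample = ['TITULO I\n', 'Artículo 1.', ' Colombia es un Estado social de derecho.', '\n.', '\n']
print(clean_pieces(sample))
```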

In [27]:
# dot_text = [ element for element in dot_text if element != '\n.']


# dot_text = [ element[1:] if element[0] == ' ' else element for element in dot_text]

In [28]:
# Remembering the original len
print('Original elements in dot_text:', len(dot_text), '\n----------------------')

ndot_text = []
for line in dot_text:
    # Step 1: Remove the line breaks "\n." and any empty element.
    if line == '\n.' or len(line) == 0:
        pass
    # Step 2: Remove the space at the beginning.
    elif line[0] == ' ':
        ndot_text.append(line[1:])
    # Step 3: Remove the \n for headlines and chapters.
    elif line[-1:] == '\n':
        # Some elements are just line breaks; they are dropped here
        if len(line) > 2:
            ndot_text.append(line.rstrip("\n"))
    # Nothing to change
    else:
        ndot_text.append(line)

# Step 4: Print the true length of dot_text.
print('Total elements:', len(ndot_text), '\n----------------------')

# Step 1 again: removing the leading space can leave new "\n." elements.
# Rebuild the list instead of popping items while iterating over it.
ndot_text = [line for line in ndot_text if line != '\n.']

# Step 5: Generate the dcomma_text, and extract the length of it.
dcomma_text = split_text_in_lines(ndot_text, delimiter=";")
print('Total elements:', len(dcomma_text), '\n----------------------')


Original elements in dot_text: 3593 
----------------------
Total elements: 2458 
----------------------
Total elements: 2550 
----------------------


### The problem between Frontend, Elastic Search and Data Science.

The problem here is that this data gives the frontend more work to do, since it has to know which elements to join and where to add a line break. This is the reason for not deleting the original text list.

## Transforming the articles into python dictionaries.

For this process, the classifier filters the lines through the hierarchy dictionary shown at the beginning. Fortunately, a quick check shows that the constitution can be treated as a semi-structured database.

The logic is that every headline starts with _**"TITULO"**_, except for the last one, which starts with _**"DISPOSICIONES"**_. Likewise, every chapter and every article starts with the words _**"CAPITULO"**_ and _**"Articulo"**_ respectively. Considering this, any other word (or digit of an ordered list) at the beginning of a paragraph means that the paragraph belongs to the last article mentioned.
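The grouping rule described above can be sketched as follows. This is not the project's `articles_info` function, whose code is not shown here. `group_articles` is a hypothetical name, and the sketch assumes that every title or chapter line is followed by exactly one line with its name.

```python
def group_articles(lines):
    """Group paragraphs under the article that precedes them."""
    articles = []
    current = {'headline': None, 'chapter': None}
    pending = None  # level whose name is expected on the next line
    for line in lines:
        words = line.split()
        first_word = words[0].upper() if words else ''
        if pending:
            # Assumption: the line after a title or chapter is its name.
            current[pending]['name'] = line.strip()
            pending = None
        elif first_word in ('TITULO', 'DISPOSICIONES'):
            # A new headline also resets the chapter.
            current = {'headline': {'title': line.strip(), 'name': None},
                       'chapter': None}
            pending = 'headline'
        elif first_word == 'CAPITULO':
            current['chapter'] = {'title': line.strip(), 'name': None}
            pending = 'chapter'
        elif first_word == 'ARTÍCULO':
            articles.append({**current, 'content': [line]})
        elif articles:
            # Any other paragraph belongs to the last article mentioned.
            articles[-1]['content'].append(line)
    return articles


lines = [
    'TITULO I\n',
    'DE LOS PRINCIPIOS FUNDAMENTALES\n',
    'Artículo 1. Colombia es un Estado social de derecho.\n',
    'Artículo 2. Son fines esenciales del Estado: servir a la comunidad.\n',
    'Las autoridades de la República están instituidas para proteger.\n',
]
for article in group_articles(lines):
    print(article['headline']['title'], '|', len(article['content']), 'paragraph(s)')
```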

```
hierarchy = {
    'TITULO' : 'headline',
    'DISPOSICIONES' : 'headline',
    'CAPITULO' : 'chapter',
    'ARTÍCULO' : 'article'

}
```

In [29]:
const_info = {
        'id': "constitucion",
        'source_name': "Constitución Política de Colombia",
    }

art_list = articles_info(const_info, text, debugging=False)

print('total articles = ', len(art_list))

total articles =  440


### Overview of the articles dictionary.
Every article dictionary has the following structure:

In [30]:
art_list[78]

{'index': 'constitucion',
 'legal_source': 'Constitución Política de Colombia',
 'id': 'constitucion00000203000079',
 'book': {'title': None, 'name': None, 'count': 0},
 'part': {'title': None, 'name': None, 'count': 0},
 'headline': {'title': 'DISPOSICIONES TRANSITORIAS\n',
  'name': 'DE LA REFORMA DE LA CONSTITUCION\n',
  'count': 14},
 'chapter': {'title': 'CAPITULO 5\n',
  'name': 'DE LOS DEBERES Y OBLIGACIONES\n',
  'count': 5},
 'section': {'title': None, 'name': None, 'count': 0},
 'article': {'name': 'Artículo 79.',
  'content': ['Artículo 79. Todas las personas tienen derecho a gozar de un ambiente sano. La ley garantizará la participación de la comunidad en las decisiones que puedan afectarlo.\n',
   'Es deber del Estado proteger la diversidad e integridad del ambiente, conservar las áreas de especial importancia ecológica y fomentar la educación para el logro de estos fines.\n']}}

Because of this structure, every component of the article can be accessed through Python's key-value lookups.

The variable "first" refers to the first article to be checked.
If last_pl contains first, only the article numbered at first will be returned.

In [31]:
art = 12
print(f'Article No {art+1}: \n',
      "id: ", art_list[art]['id'],"\n",
      #"lexical_diversity: ", art_list[art]['article']['lexical_diversity'],"\n",
      art_list[art]['headline']['title'],art_list[art]['headline']['name'],"\n",
      art_list[art]['chapter']['title'],art_list[art]['chapter']['name'],"\n",
      art_list[art]['article']['name'],art_list[art]['article']['content'])
print('\n--------------------------------------\n')

Article No 13: 
 id:  constitucion00000201000013 
 DISPOSICIONES TRANSITORIAS
 DE LA REFORMA DE LA CONSTITUCION
 
 CAPITULO 5
 DE LOS DEBERES Y OBLIGACIONES
 
 Artículo 13. ['Artículo 13. Todas las personas nacen libres e iguales ante la ley, recibirán la misma protección y trato de las autoridades y gozarán de los mismos derechos, libertades y oportunidades sin ninguna discriminación por razones de sexo, raza, origen nacional o familiar, lengua, religión, opinión política o filosófica.\n', 'El Estado promoverá las condiciones para que la igualdad sea real y efectiva y adoptara medidas en favor de grupos discriminados o marginados.\n', 'El Estado protegerá especialmente a aquellas personas que por su condición económica, física o mental, se encuentren en circunstancia de debilidad manifiesta y sancionará los abusos o maltratos que contra ellas se cometan.\n']

--------------------------------------



If you want to explore the articles further and check them yourself, you can take advantage of having all the dictionaries in a list. Select your range of interest by changing the first and last article of interest. Remember that list indices start at 0, so subtract 1 if you have a specific article in mind.
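When working with a slice, it is easy to lose track of the original positions. The sketch below shows one way to keep the list position and the article number aligned. `select_articles` is a hypothetical helper, and the demo list only stands in for `art_list`.

```python
def select_articles(articles, first_art, last_art):
    """Yield (article_number, article) for list positions first_art to last_art - 1."""
    # start=first_art keeps the counter aligned with the position in the full list.
    for position, article in enumerate(articles[first_art:last_art], start=first_art):
        # Lists start at 0, article numbers start at 1.
        yield position + 1, article


# Stand-in data: five minimal article dictionaries.
demo = [{'name': f'Artículo {n}.'} for n in range(1, 6)]

for number, article in select_articles(demo, 1, 3):
    print(number, article['name'])
```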

In [32]:
first_art = 1
last_art = 3
for n in range(len( art_list[first_art : last_art] )):
    article = art_list[n]
    art_number = n + 1
    
    print_article(art_number, article)

Article No 2: 
 id:  constitucion00000100000001 
 DISPOSICIONES TRANSITORIAS
 DE LA REFORMA DE LA CONSTITUCION
 
 None None 
 Artículo 1. ['Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.\n']

--------------------------------------

Article No 3: 
 id:  constitucion00000100000002 
 DISPOSICIONES TRANSITORIAS
 DE LA REFORMA DE LA CONSTITUCION
 
 None None 
 Artículo 2. ['Artículo 2. Son fines esenciales del Estado: servir a la comunidad, promover la prosperidad general y garantizar la efectividad de los principios, derechos y deberes consagrados en la Constitución; facilitar la participación de todos en las decisiones que los afectan y en la vida económica, política, administrativa y cultura

Here the advantage for the Frontend of keeping the line breaks becomes clear: there is no problem rendering the text. But what about Data Science? How will Elastic Search know where to look with an embedding model?

### Adding the embedding model to the articles dictionary

Using the same tool, the embedding model will be built from the split text dcomma_text

In [33]:
embed_list = articles_info(const_info, text, debugging=False)
print('total articles = ', len(embed_list))

total articles =  440


In [34]:
dcomma_text[143:145]

['1.', 'Elegir y ser elegido.']

Using zip, adding the embedding to a new key in the articles dictionary is easy.
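One thing to keep in mind is that `zip` stops silently at the shorter list, so a length mismatch would go unnoticed. A minimal sketch with an explicit check, using a hypothetical `attach_field` helper and stand-in data:

```python
def attach_field(articles, extras, key):
    """Store each item of `extras` under `key` in the matching article."""
    if len(articles) != len(extras):
        # zip would silently ignore the extra items, so fail loudly instead.
        raise ValueError(f"{len(articles)} articles but {len(extras)} extra items")
    for article, extra in zip(articles, extras):
        article[key] = extra
    return articles


# Stand-in data, not the real article dictionaries.
articles = [{'id': 'a1'}, {'id': 'a2'}]
pieces = [['1.', 'Elegir y ser elegido.'], ['2.']]
print(attach_field(articles, pieces, 'dot_comma_sep'))
```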

In [35]:
for embed, article in zip(embed_list, art_list):
    article['dot_comma_sep'] = embed['article']['content']

In [36]:
art_list[78]

{'index': 'constitucion',
 'legal_source': 'Constitución Política de Colombia',
 'id': 'constitucion00000203000079',
 'book': {'title': None, 'name': None, 'count': 0},
 'part': {'title': None, 'name': None, 'count': 0},
 'headline': {'title': 'DISPOSICIONES TRANSITORIAS\n',
  'name': 'DE LA REFORMA DE LA CONSTITUCION\n',
  'count': 14},
 'chapter': {'title': 'CAPITULO 5\n',
  'name': 'DE LOS DEBERES Y OBLIGACIONES\n',
  'count': 5},
 'section': {'title': None, 'name': None, 'count': 0},
 'article': {'name': 'Artículo 79.',
  'content': ['Artículo 79. Todas las personas tienen derecho a gozar de un ambiente sano. La ley garantizará la participación de la comunidad en las decisiones que puedan afectarlo.\n',
   'Es deber del Estado proteger la diversidad e integridad del ambiente, conservar las áreas de especial importancia ecológica y fomentar la educación para el logro de estos fines.\n']},
 'dot_comma_sep': ['Artículo 79. Todas las personas tienen derecho a gozar de un ambiente sano.

### Elastic Search format.

If something needs to be removed, like the lexical_diversity, you will have to edit the original **format_articles** function. This is because the function is designed _ad hoc_, and making it more general would be time invested in presentation rather than in efficiency.
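The code of **format_articles** is not shown in this notebook. As an illustration of the kind of edit involved, the sketch below removes unwanted keys from a nested dictionary. `drop_keys` is a hypothetical helper, and the choice of `count` and `lexical_diversity` as keys to remove is an assumption made for the example.

```python
def drop_keys(value, unwanted=('count', 'lexical_diversity')):
    """Return a copy of `value` without the unwanted keys, at any depth."""
    if isinstance(value, dict):
        return {key: drop_keys(item, unwanted)
                for key, item in value.items() if key not in unwanted}
    if isinstance(value, list):
        return [drop_keys(item, unwanted) for item in value]
    return value


# Stand-in article with the same shape as the dictionaries above.
article = {
    'id': 'constitucion00000100000001',
    'headline': {'title': 'TITULO I', 'name': 'DE LOS PRINCIPIOS FUNDAMENTALES', 'count': 1},
    'article': {'name': 'Artículo 1.', 'lexical_diversity': 0.5, 'content': ['Artículo 1.']},
}
print(drop_keys(article))
```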

In [37]:
levels = { 'book','part', 'headline', 'chapter', 'section', 'article' }
json_list = format_articles(art_list, headers_dict=levels, debugging=False)

## Load and storage of data.
At the time this notebook is written (not finished yet according to plan), the JSON file that contains all the transformed articles is stored in the same folder. This is only for the convenience of checking the files in one folder without opening any other tab.
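Since the articles contain accented characters, it can be worth checking that the file reads back exactly as it was written. The following is a minimal round-trip sketch. `save_json` and `load_json` are hypothetical helpers, and the file is written to a temporary folder rather than to the project folders.

```python
import json
import tempfile
from pathlib import Path


def save_json(records, filepath):
    """Write records as JSON, keeping accented characters readable."""
    text = json.dumps(records, ensure_ascii=False)
    # An explicit encoding avoids depending on the platform default.
    Path(filepath).write_text(text, encoding='utf-8')


def load_json(filepath):
    """Read the records back from the JSON file."""
    return json.loads(Path(filepath).read_text(encoding='utf-8'))


records = [{'legal_source': 'Constitución Política de Colombia'}]
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / 'demo-embedding.json'
    save_json(records, target)
    restored = load_json(target)

print(restored == records)
```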

In [38]:
dict_json = json.dumps(json_list, ensure_ascii=False)

root_folder = "LegalSearcher/ReadFiles/Embeddings"
embedding_f = f'../ReadFiles/Embeddings/constitucion-embedding.json'

filepath = path.abspath(embedding_f)


# The with block closes the file even if the write fails
with open(filepath, "w") as file:
    file.write(dict_json)

### Quick check of the articles state
This shows that lexical_diversity has been removed, and which objects each category holds.

In [39]:
article = json_list[1]
for key, value in article.items():
    print(key, ':', value)

index : constitucion
legal_source : Constitución Política de Colombia
id : constitucion00000100000002
headline : {'title': 'DISPOSICIONES TRANSITORIAS\n', 'name': 'DE LA REFORMA DE LA CONSTITUCION\n'}
section : {'title': None, 'name': None}
book : {'title': None, 'name': None}
chapter : {'title': None, 'name': None}
part : {'title': None, 'name': None}
article : {'name': 'Artículo 2.', 'content': ['Artículo 2. Son fines esenciales del Estado: servir a la comunidad, promover la prosperidad general y garantizar la efectividad de los principios, derechos y deberes consagrados en la Constitución; facilitar la participación de todos en las decisiones que los afectan y en la vida económica, política, administrativa y cultural de la Nación; defender la independencia nacional, mantener la integridad territorial y asegurar la convivencia pacifica y la vigencia de un orden justo.\n', 'Las autoridades de la República están instituidas para proteger a todas las personas residentes en Colombia, en 

### Load of articles to localhost
The PUT method is used here so that the document ids in the Elastic Search index are the same as in our database. To do so, the es_article_url variable is a formatted string to which the article id is added on every iteration.

To review the result of every _"put request"_, log_info is created so it can be displayed with the **tabulate** tool. If there is no error, that column will be empty.
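The `add_to_log` function belongs to the project's `support` module and is not shown here. One possible shape for such a log is a dictionary of lists, one list per column, which is a layout that **tabulate** should accept with `headers='keys'`. The sketch below is an assumption about that shape, with a hypothetical `add_entry` helper and illustrative status and error values.

```python
def add_entry(log, article_id, status, error=None, message=None):
    """Append one row to a column-oriented log (a dict of lists)."""
    row = {'id': article_id, 'status': status, 'error': error, 'message': message}
    for column, value in row.items():
        # Every column starts as None, so create its list on first use.
        if log.get(column) is None:
            log[column] = []
        log[column].append(value)
    return log


log = {'id': None, 'status': None, 'error': None, 'message': None}
log = add_entry(log, 'constitucion00000100000001', 201)
log = add_entry(log, 'constitucion00000100000002', 400, error='illustrative error')

for column, values in log.items():
    print(column, values)
```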

In [40]:
log_info = {'id': None,
            'status': None,
            'error': None,
            'message': None,
            }

Keep in mind that a successful load of a new article does not return a status code of 200 (Elasticsearch answers 201 when a document is created); instead, the response body is a success document with the data of the object that was created or overwritten.
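A small helper can make this check explicit. The sketch below assumes the usual shape of an Elasticsearch index response, where the body carries a `result` field with the value `created` or `updated`. `describe_index_response` is a hypothetical name, and the sample bodies are hand-written stand-ins, not real server output.

```python
def describe_index_response(status_code, body):
    """Summarise the outcome of an index (PUT) request."""
    # 201 means the document was created, 200 that an existing one was overwritten.
    if status_code in (200, 201) and body.get('result') in ('created', 'updated'):
        return body['result']
    error = body.get('error', {})
    # The error can be a nested object or a plain string.
    reason = error.get('reason') if isinstance(error, dict) else error
    return f"failed ({status_code}): {reason}"


print(describe_index_response(201, {'result': 'created'}))
print(describe_index_response(200, {'result': 'updated'}))
print(describe_index_response(400, {'error': {'reason': 'illustrative reason'}}))
```

With `requests`, the two arguments would come from `request_response.status_code` and `request_response.json()`.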

In [41]:
for article in json_list:
    es_article_url = f"http://localhost:9200/test_all/_doc/{article['id']}"
    request_response = requests.put(es_article_url, json=article)
    log_info = add_to_log(log_info, request_response, article)

# print(tabulate(log_info, headers='keys'))

## Query test to Elastic Search
To confirm the state of our data, an Elastic Search query is made here. The request body is passed as a Python dict through the json argument. The query word in this case is "Constitucion", an obvious word to appear in the Colombian Constitution.

In [42]:
local_test = "http://localhost:9200/test_all/_search"
query_test = {
    "query": {
        "simple_query_string": {
            "query": "Constitucion"
        }
    }
}
query_test = requests.get(local_test, json=query_test)

Because the raw query response is a little "dirty", it is left at the end as a commented line; if you want to look at the content, just uncomment the last cell. A cleaner view is to pull out only the data that is relevant for now: the number of articles that match the query, the max score, and the best-rated article.
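The same summary can be wrapped in a small function that also copes with an empty result. This is a minimal sketch: `summarize_hits` is a hypothetical helper, and `fake_result` is a hand-written stand-in with the same shape as a search response.

```python
def summarize_hits(result):
    """Extract the total, the max score and the id of the best-rated hit."""
    hits = result['hits']
    total = hits['total']
    # Newer Elasticsearch versions report the total as a dict with a 'value' key.
    total_value = total['value'] if isinstance(total, dict) else total
    best = hits['hits'][0] if hits['hits'] else None
    return {
        'total': total_value,
        'max_score': hits['max_score'],
        'best_id': best['_id'] if best else None,
    }


fake_result = {
    'hits': {
        'total': {'value': 2, 'relation': 'eq'},
        'max_score': 1.5,
        'hits': [
            {'_id': 'constitucion00000100000001', '_score': 1.5},
            {'_id': 'constitucion00000100000002', '_score': 0.5},
        ],
    }
}
print(summarize_hits(fake_result))
```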

In [43]:
result = json.loads(query_test.text)
print(result['hits']['total']['value'])
print(result['hits']['max_score'])
best_rated = result['hits']['hits'][0]
best_rated

440
0.0007571456


{'_index': 'test_all',
 '_type': '_doc',
 '_id': 'constitucion00000100000001',
 '_score': 0.0007571456,
 '_source': {'index': 'constitucion',
  'legal_source': 'Constitución Política de Colombia',
  'id': 'constitucion00000100000001',
  'headline': {'title': 'DISPOSICIONES TRANSITORIAS\n',
   'name': 'DE LA REFORMA DE LA CONSTITUCION\n'},
  'section': {'title': None, 'name': None},
  'book': {'title': None, 'name': None},
  'chapter': {'title': None, 'name': None},
  'part': {'title': None, 'name': None},
  'article': {'name': 'Artículo 1.',
   'content': ['Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.\n']},
  'dot_comma_sep': ['Artículo 1. Colombia es un Estado social de derecho, organiz

In [44]:
# query_test.text