# ETL sandbox

Here you can find all the ETL process than transforms a txt file of the Colombian Constitution into an index on a Elastic Search Server ESS. In case you don't have any ESS, or want to make local loads, you can use a docker image. Something to clarify, if you want to use this notebook to create json documents from .txt files, they must have the next hierarchy structure:

```
hierarchy = {
    'TITULO' : 'headline',
    'DISPOSICIONES' : 'headline',
    'CAPITULO' : 'chapter',
    'ARTÍCULO' : 'article'

}
```

An important consideration to have in mind, is that here, this order represents the head level of each clasification. In other words, headline is an _"h1"_ on html, chapter an _"h2"_ and article would be a _"p"_.

## Importing libraries and tools.

About the libraries and needed to run the main project, as the convensions says, are listed on the requirements.txt file. Here the only library that is not installed on python's kernel is [tabulate](https://pypi.org/project/tabulate/). To install it you can use 
```
pip install tabulate
```

or

```
pip3 install tabulate
```

About support.

In [1]:
from classifier import *
from os import path
from support import *
from tabulate import tabulate

In [2]:
import requests
import json

## Extracting the data.

A simple use of the path kernel tools. Having this here, it won't matter that the root path has change from computer, the file with the constitution will be loaded to our program with no inconvenience.

In [3]:
root_folder = "LegalSearcher/ReadFiles"
constitution_f = f'../ReadFiles/constitucion_colombiana.txt'

filepath = path.abspath(constitution_f)

with open(filepath, 'r') as f:
    text = f.readlines()

    f.close()
    
for line in text:
    if line == '\n':
        text.remove(line)

Just a removal of all "\n" (libre breaks) and the Title "Constitucion Politica de Colombia":

In [4]:
clean_text = text_remove_nl(text)

# Drop the title "Constitucion Politica de Colombia"
clean_text = clean_text[1:len(text)]

## Transforming the articles into python dictionaries.

For this process, the classifier filters through the hierarchy dictionary shown at the beggining. Fortunately, the constitution could be watched as an semi-structured data base if you make a fast check.

The logic of this is that every headline will start with _**"TITULO"**_, except for the last one that is _**"DISPOSICIONES"**_. On the other side, every chapter and article starts with the words _**"CAPITULO"**_ and _**"Articulo"**_. Considereing this, every other word (or digit of an ordered list) at the beggining of any paragraph means that it is on the last article mentioned.

```
hierarchy = {
    'TITULO' : 'headline',
    'DISPOSICIONES' : 'headline',
    'CAPITULO' : 'chapter',
    'ARTÍCULO' : 'article'

}
```

In [5]:
book_id = "Const"
art_list = articles_info(book_id, clean_text, debugging=False)
print('total articles = ', len(art_list))

total articles =  439


### Overview of the articles dictionary.
Every article dictionary has the next structure:

In [6]:
art_list[78]

{'index': 'constitucion_politica_de_colombia',
 'id': 'const02003069',
 'headline': {'title': 'TITULO II',
  'name': 'DE LOS DERECHOS, LAS GARANTIAS Y LOS DEBERES',
  'headline_id': 2},
 'chapter': {'title': 'CAPITULO 3',
  'name': 'DE LOS DERECHOS COLECTIVOS Y DEL AMBIENTE',
  'chapter_id': 3},
 'article': {'name': 'Artículo 79',
  'content': ['Artículo 79. Todas las personas tienen derecho a gozar de un ambiente sano. La ley garantizará la participación de la comunidad en las decisiones que puedan afectarlo.',
   'Es deber del Estado proteger la diversidad e integridad del ambiente, conservar las áreas de especial importancia ecológica y fomentar la educación para el logro de estos fines.'],
  'article_id': 'Const02003069',
  'lexical_diversity': 10.147058823529411}}

Because of this structure, every component of the article can be consulted following python's methods of key-value. 

The variable "first" is referes to the first article that wants to be checked.
If last_pl has fist on it, this will return only the article numbered at first

In [7]:
art = 0
print(f'Article No {art+1}: \n',
      "id: ", art_list[art]['article']['article_id'],"\n",
      "lexical_diversity: ", art_list[art]['article']['lexical_diversity'],"\n",
      art_list[art]['headline']['title'],art_list[art]['headline']['name'],"\n",
      art_list[art]['chapter']['title'],art_list[art]['chapter']['name'],"\n",
      art_list[art]['article']['name'],art_list[art]['article']['content'])
print('\n--------------------------------------\n')

Article No 1: 
 id:  Const01000001 
 lexical_diversity:  9.942857142857143 
 TITULO I DE LOS PRINCIPIOS FUNDAMENTALES 
 None None 
 Artículo 1 ['Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.']

--------------------------------------



In case you want to explore more about the articles and check by your self, here you can use the benefit of having all the dictionaries on a list. Select your range of interest by changing the first and last article of interest. Remember that list starts at 0, so remove 1 if you have an specific article on mind.

In [8]:
first_art = 1
last_art = 3
for n in range(len( art_list[first_art : last_art] )):
    article = art_list[n]
    art_number = n + 1
    
    print_article(art_number, article)

Article No 2: 
 id:  Const01000001 
 lexical_diversity:  9.942857142857143 
 TITULO I DE LOS PRINCIPIOS FUNDAMENTALES 
 None None 
 Artículo 1 ['Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.']

--------------------------------------

Article No 3: 
 id:  Const01000002 
 lexical_diversity:  18.7 
 TITULO I DE LOS PRINCIPIOS FUNDAMENTALES 
 None None 
 Artículo 2 ['Artículo 2. Son fines esenciales del Estado: servir a la comunidad, promover la prosperidad general y garantizar la efectividad de los principios, derechos y deberes consagrados en la Constitución; facilitar la participación de todos en las decisiones que los afectan y en la vida económica, política, administrativa y cultural de 

### Elastic Search format.

In case that there is something that needs to be removed, like the lexical_diversity, you will have to edit the original function of **format_articles**. This because, this function is designed _ad hoc_, and other considerations will be time invested on something that won't increment the efficiency but presentation.

In [9]:
json_list = format_articles(art_list, debugging=False)

## Load and storage of data.
By the time this notebook is written (not finished yet according to plan), the json file that will contain all the articles transformed is storaged on the same folder. This for nothing more than cosiness of checking only the files on the same folder and no opening any other tab.

In [10]:
dict_json = json.dumps(json_list, ensure_ascii=False)
file = open('es_col_constitution.json', "w")
file.write(dict_json)
file.close()

### Quick check of the articles state
Here is observed that the lexical_diversity is removed and that objects that every cathegory has.

In [11]:
article = json_list[0]
for key, value in article.items():
    print(key, ':', value)

index : constitucion_politica_de_colombia
id : const01000001
headline : {'title': 'TITULO I', 'name': 'DE LOS PRINCIPIOS FUNDAMENTALES'}
chapter : {'title': None, 'name': None}
article : {'name': 'Artículo 1', 'content': ['Artículo 1. Colombia es un Estado social de derecho, organizado en forma de República unitaria, descentralizada, con autonomía de sus entidades territoriales, democrática, participativa y pluralista, fundada en el respeto de la dignidad humana, en el trabajo y la solidaridad de las personas que la integran y en la prevalencia del interés general.']}


In [12]:
json_list[1]

{'index': 'constitucion_politica_de_colombia',
 'id': 'const01000002',
 'headline': {'title': 'TITULO I', 'name': 'DE LOS PRINCIPIOS FUNDAMENTALES'},
 'chapter': {'title': None, 'name': None},
 'article': {'name': 'Artículo 2',
  'content': ['Artículo 2. Son fines esenciales del Estado: servir a la comunidad, promover la prosperidad general y garantizar la efectividad de los principios, derechos y deberes consagrados en la Constitución; facilitar la participación de todos en las decisiones que los afectan y en la vida económica, política, administrativa y cultural de la Nación; defender la independencia nacional, mantener la integridad territorial y asegurar la convivencia pacifica y la vigencia de un orden justo.',
   'Las autoridades de la República están instituidas para proteger a todas las personas residentes en Colombia, en su vida, honra, bienes, creencias, y demás derechos y libertades, y para asegurar el cumplimiento de los deberes sociales del Estado y de los particulares.']}

### Load of articles to localhost
Here it will be used the PUT method, in order to make the index in Elastic Search the same as in our Data Base. For it, the local_url variable would be a formated string, where the article id will be added for every iteration.

In order to watch the result of every _"put request"_, log_info will be created in order to be used with the **tabulate** tool. In case there is no error, that column will be empty.

In [None]:
log_info = {'id': None,
            'status': None,
            'error': None,
            'message': None,
            }

Something to consider is the fact that a succesfull load of an article, will not return a status code of 200, instead, it will return a success document with all the data of the new object created or overwrited.

In [None]:
for article in json_list:
    es_article_url = f"http://localhost:9200/col_constitucion/_doc/{article['id']}"
    request_response = requests.put(es_article_url, json=article)
    log_info = add_to_log(log_info, request_response, article)

print(tabulate(log_info, headers='keys'))

## Query test to Elastic Search
In order to confirm the state of our info, here it will be made an Elastic Search query. The request will be past on a python dict trough json attribute. The query word in this case will be "Colombia", being the most obvious word to appear on the Colombian Constitution.

In [14]:
local_test = "http://localhost:9200/col_constitucion/_search"
query_test = {
    "query": {
        "simple_query_string": {
            "query": "Colombia"
        }
    }
}
query_test = requests.get(local_test, json=query_test)

Because the query text is a little "dirty", it will be share at the end as a commented line, in case you desire to look at the content, just uncoment the last cell. A cleaner view, is to call the relevant data for now, like it will be the number of articles that match with the query, the max score, and the best article rated.

In [25]:
result = json.loads(query_test.text)
print(result['hits']['total']['value'])
print(result['hits']['max_score'])
best_rated = result['hits']['hits'][0]
best_rated

16
4.4105473


{'_index': 'col_constitucion',
 '_type': '_doc',
 '_id': 'const01000009',
 '_score': 4.4105473,
 '_source': {'index': 'constitucion_politica_de_colombia',
  'id': 'const01000009',
  'headline': {'title': 'TITULO I', 'name': 'DE LOS PRINCIPIOS FUNDAMENTALES'},
  'chapter': {'title': None, 'name': None},
  'article': {'name': 'Artículo 9',
   'content': ['Artículo 9. Las relaciones exteriores del Estado se fundamentan en la soberanía nacional, en el respeto a la autodeterminación de los pueblos y en el reconocimiento de los principios del derecho internacional aceptados por Colombia.',
    'De igual manera, la política exterior de Colombia se orientará hacia la integración latinoamericana y del Caribe.']}}}

In [16]:
# query_test.text