<div align="center">
            <h1>
            TUTORIAL
            </h1>
</div>    



<br>
<br>
    


In this notebook, we will tour the main features of the package and show how to use them.

<br>

## 1. Install
****************

First, we need to install the package from PyPI. Depending on your internet connection, this may take 1 or 2 minutes.

In [1]:
!pip install legal_doc_processing
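
Once the install finishes, it can be worth confirming that the package is importable before going further. The helper below is a minimal sketch using only the standard library; the name `is_installed` is ours and is not part of legal_doc_processing.

```python
from importlib import util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the import system."""
    # find_spec returns None when no importable top-level module has that name
    return util.find_spec(module_name) is not None


# Hypothetical helper, not part of the package: check the install from In [1]
print(is_installed("legal_doc_processing"))
```

If this prints `False`, the install did not reach the Python environment used by the notebook kernel.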






<br>
<br>

## 2. Package check and third party downloads
************

<br>

Since we will be using major NLP libraries, we need to download data collections and required resources such as NLTK stop words, spaCy or transformers models, and TensorFlow / PyTorch resources.

This may take 1 or 2 minutes (depending on your internet connection).

In [2]:
from legal_doc_processing import boot
boot.boot()

PressRelease(source:cftc, {'cooperation_credit': '', 'court': 'U.S. Dis', 'currency': 'USD', 'decision_date': '2015-01-', 'extracted_sanctions': 'Personal', 'extracted_violations': 'withdrew', 'folder': '-- DUMMY', 'judge': 'Marcia M', 'monitor': '0', 'nature_de_sanction': 'Personal', 'nature_of_violations': '', 'reference': '-- DUMMY', 'extracted_authorities': 'CFTC\nU.S', 'type': 'Order', 'justice_type': 'U.S. - C', 'defendant': 'Allied M', 'country_of_violation': 'United S', 'penalty_details': '', 'monetary_sanction': '1000000', 'compliance_obligations': ''}, pipe/spacy:OK/OK

<br>
<br>

## 3. Understand basic package structure
********************
<br>

There are 3 main modules in legal-doc-processing:
- legal_doc for LegalDoc objects, i.e. orders, complaints and all other kinds of official documents
- press_release for PressRelease objects, i.e. the legal press release related to each case
- decision for Decision objects, which combine a LegalDoc and a PressRelease. A Decision object can read both the legal document and the press release, make predictions from each, and merge / clean the predictions from both.

You can import them as follows:

In [3]:
from legal_doc_processing import legal_doc # import legal document module
from legal_doc_processing import press_release # import press release module
from legal_doc_processing import decision # import decision module
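
To give an intuition of what "merge / clean predictions from both documents" can mean, here is a minimal sketch. It is not the Decision implementation: `merge_predictions` and its rule (prefer the legal document, fall back to the press release when a field is empty) are assumptions made for illustration, and the values are made up.

```python
def merge_predictions(legal_doc_preds: dict, press_release_preds: dict) -> dict:
    """Merge two feature dicts, preferring non-empty legal-document values.

    Hypothetical rule for illustration only; the package may merge differently.
    """
    merged = {}
    for key in sorted(set(legal_doc_preds) | set(press_release_preds)):
        primary = legal_doc_preds.get(key, "").strip()
        fallback = press_release_preds.get(key, "").strip()
        # Use the legal document value unless it is empty
        merged[key] = primary or fallback
    return merged


# Illustrative values only, not real package output
legal_preds = {"court": "U.S. District Court", "currency": "", "judge": ""}
press_preds = {"court": "", "currency": "USD", "defendant": "Allied Markets"}
merged = merge_predictions(legal_preds, press_preds)
print(merged)
```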


<br>

Each module has its own dedicated object:

In [4]:
print(f"press_release.PressRelease : {press_release.PressRelease} \n")
print(f"press_release.__doc__ : {press_release.__doc__} \n")
print(f"press_release.__dict__ : {press_release.__dict__}")

press_release.PressRelease : <class 'legal_doc_processing.press_release.press_release.PressRelease'> 

press_release.__doc__ : None 

press_release.__dict__ : {'__module__': 'legal_doc_processing.press_release.press_release', 'PressRelease': <class 'legal_doc_processing.press_release.press_release.PressRelease'>, 'load_X_y': <function press_release_X_y at 0x7faedc3c9e50>, 'load_df': <function press_release_df at 0x7fafcc1addc0>, 'from_file': <function from_file at 0x7faedbee8040>, 'from_text': <function from_text at 0x7faedbee80d0>, 'from_url': <function from_url at 0x7faedbee8160>, '__dict__': <attribute '__dict__' of '_PressRelease' objects>, '__weakref__': <attribute '__weakref__' of '_PressRelease' objects>, '__doc__': None}


<br>

As a consequence, it is recommended to use this import method:

In [5]:
from legal_doc_processing import press_release as pr

print(pr.PressRelease)

<class 'legal_doc_processing.press_release.press_release.PressRelease'>



<br>

LegalDoc, PressRelease, and Decision all inherit from the Base class. You can think of the Base class as an abstraction of the 3 other objects.

Of course you can import this class, even if you will not manipulate it directly:

In [6]:
from legal_doc_processing.base.base import Base

print(Base)

<class 'legal_doc_processing.base.base.Base'>


<br>
<br>

## 4. Object instantiation
*****************
<br>

You can instantiate an object in 3 ways:
- with a string
- with a text file path
- with a url


These three methods are built into the package:

### 4.1 with a string

In [7]:
pr.PressRelease
# or 
pr.from_text

print(pr.PressRelease)
print(pr.from_text)

<class 'legal_doc_processing.press_release.press_release.PressRelease'>
<function from_text at 0x7faedbee80d0>


### 4.2 with a file

In [8]:
pr.from_file
print(pr.from_file)

<function from_file at 0x7faedbee8040>


### 4.3 with a url

In [9]:
pr.from_url
print(pr.from_url)

<function from_url at 0x7faedbee8160>
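
Entry points like these typically differ only in how they obtain the text. The sketch below illustrates that pattern with a stand-in class; `Document` and the bodies of `from_text` and `from_file` are hypothetical and do not reproduce the package's code. A url variant would fetch the text over HTTP first, then follow the same path.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    """Stand-in for a PressRelease-like object (hypothetical)."""
    raw_text: str
    source: str


def from_text(text: str, source: str) -> Document:
    # Simplest case: the text is already in memory
    return Document(raw_text=text, source=source)


def from_file(file_path: str, source: str) -> Document:
    # Read the file, then reuse the text-based constructor
    text = Path(file_path).read_text(encoding="utf-8")
    return from_text(text, source)


# Usage with a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "press-release.txt"
    path.write_text("Release Number 7100-15", encoding="utf-8")
    doc = from_file(str(path), source="cftc")

print(doc)
```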


We are going to instantiate an object directly from the web.
First we need a url, and then we need the source:

In [10]:
url = "https://storage.googleapis.com/theolex_documents_processing/cftc/text/7100-15/press-release.txt"
source = "cftc"


<br>

Then we can instantiate our object:

In [11]:
press = pr.from_url(url=url, source=source)

print(press)

PressRelease(source:cftc, {'cooperation_credit': '', 'court': '', 'currency': '', 'decision_date': '', 'extracted_sanctions': '', 'extracted_violations': '', 'folder': '', 'judge': '', 'monitor': '', 'nature_de_sanction': '', 'nature_of_violations': '', 'reference': '', 'extracted_authorities': '', 'type': '', 'justice_type': '', 'defendant': '', 'country_of_violation': '', 'penalty_details': '', 'monetary_sanction': '', 'compliance_obligations': ''}, pipe/spacy:OK/OK



<br>

Instantiating the object takes between 0.3 and 1 second. This is due to the complex NLP methods used to process and restructure the document.


To do this, each object needs to initialize heavy components such as a spaCy instance, a transformers pipeline, etc.


For a single object this is not a problem, but what if we have 1000 documents to manage?


There is of course a solution: instantiate the spaCy and transformers objects once, outside of our PressRelease object, and pass them in.

### 4.4 instantiate many objects at a time

In [12]:
from legal_doc_processing.utils import get_pipeline, get_spacy

%time nlspa = get_spacy()
%time nlpipe=get_pipeline()

CPU times: user 400 ms, sys: 11.7 ms, total: 412 ms
Wall time: 411 ms
CPU times: user 1.73 s, sys: 207 ms, total: 1.94 s
Wall time: 5.13 s



<br>

As we can see, nlpipe needs about 5 seconds of wall time to instantiate, and nlspa about 0.4 seconds.

We can now easily loop over many objects very quickly. The example below uses 10 copies of the same text:

In [13]:
import requests

txt = requests.get(url).text
txt_list = [txt for _ in range(10)]  # 10 copies of the same text

# Reuse the shared spaCy and transformers objects for every instance
def make_press(text):
    return pr.PressRelease(text, source=source, nlspa=nlspa, nlpipe=nlpipe)

%time press_list = [make_press(text) for text in txt_list]


CPU times: user 3.48 s, sys: 8.45 ms, total: 3.49 s
Wall time: 3.49 s
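
The speed-up comes from building the expensive resources once and passing them to every object. The sketch below reproduces that pattern with stand-ins only: `load_heavy_model` and `Doc` are hypothetical, and the loader simply counts how many times it runs instead of loading anything.

```python
from typing import Optional

# Mutable counter so the loader can record how often it is called
load_calls = {"count": 0}


def load_heavy_model() -> dict:
    """Stand-in for an expensive loader such as get_spacy() or get_pipeline()."""
    load_calls["count"] += 1
    return {"name": "dummy-model"}


class Doc:
    """Hypothetical document class that accepts an optional shared model."""

    def __init__(self, text: str, model: Optional[dict] = None):
        self.text = text
        # Only load a model when the caller did not provide one
        self.model = model if model is not None else load_heavy_model()


texts = ["some text"] * 10

# Without sharing: one load per object
slow_docs = [Doc(t) for t in texts]
loads_without_sharing = load_calls["count"]

# With sharing: load once, reuse everywhere
load_calls["count"] = 0
shared_model = load_heavy_model()
fast_docs = [Doc(t, model=shared_model) for t in texts]
loads_with_sharing = load_calls["count"]

print(loads_without_sharing, loads_with_sharing)
```

The same reasoning applies to `nlspa` and `nlpipe`: the cost of loading them is paid once, however many documents follow.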


### 4.5 main attributes


<br>

Let's have a tour of this object:

In [14]:
press = press_list[0]
print(f"press : {press} \n")

print(f"press.__doc__ : {press.__doc__ } \n" )

print(f"press.__dict__.keys() : {press.__dict__.keys() } \n" )

press : PressRelease(source:cftc, {'cooperation_credit': '', 'court': '', 'currency': '', 'decision_date': '', 'extracted_sanctions': '', 'extracted_violations': '', 'folder': '', 'judge': '', 'monitor': '', 'nature_de_sanction': '', 'nature_of_violations': '', 'reference': '', 'extracted_authorities': '', 'type': '', 'justice_type': '', 'defendant': '', 'country_of_violation': '', 'penalty_details': '', 'monetary_sanction': '', 'compliance_obligations': ''}, pipe/spacy:OK/OK 

press.__doc__ : main press release doc class  

press.__dict__.keys() : dict_keys(['obj_name', 'juridiction', 'source', 'raw_text', 'date', 'h1', 'header', 'content', 'struct_content', 'abstract', 'end', 'nlpipe', 'nlspa', '_predict', '_feature_list', 'feature_list', '_cooperation_credit', '_court', '_currency', '_decision_date', '_extracted_sanctions', '_extracted_violations', '_folder', '_judge', '_monitor', '_nature_de_sanction', '_nature_of_violations', '_reference', '_extracted_authorities', '_type', '_just


<br>

We can find some basic attributes such as:

In [15]:
basic_attrs = ['obj_name', 'juridiction', 'source', 'raw_text', ]
    
strize = lambda attr :  str(getattr(press, attr))[:200].replace('\n', '/n')
for attr in  basic_attrs : 
    print(f"{attr.ljust(20)} : {strize(attr)}")
          

obj_name             : PressRelease
juridiction          : cftc
source               : cftc
raw_text             : Release Number 7100-15/n/n /n/nJanuary 12, 2015/n/nFederal Court in Florida Enters Order Freezing Assets in CFTC Foreign Currency Anti-/nFraud Action against Allied Markets LLC and its Principals Joshua Gill


<br>

We can also find attributes that describe the document structure:

In [16]:
structure_attrs = ['date', 'h1', 'header', 'content', 'struct_content', 'abstract', 'end']
    
strize = lambda attr :  str(getattr(press, attr))[:200].replace('\n', '/n')
for attr in structure_attrs : 
    print(f"{attr.ljust(20)} : {strize(attr)}")
          

date                 : January 12, 2015
h1                   : Federal Court in Florida Enters Order Freezing Assets in CFTC Foreign Currency Anti-. Fraud Action against Allied Markets LLC and its Principals Joshua Gilliland and Chawalit. Wongkhiao. CFIC Charges 
header               : 
content              : Washington, DC â The U.S. Commodity Futures Trading Commission (CFTC) today announced that it filed a civil enforcement Complaint in the U.S. District Court for the Middle District of Florida, charg
struct_content       : 
abstract             : Washington, DC â The U.S. Commodity Futures Trading Commission (CFTC) today announced that it filed a civil enforcement Complaint in the U.S. District Court for the Middle District of Florida, charg
end                  : 
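
The two cells above repeat the same one-line lambda. A small reusable function can do the same job. This is only a sketch: the name `preview_attrs` is ours, and a `SimpleNamespace` stands in for a real PressRelease so that the snippet runs on its own.

```python
from types import SimpleNamespace


def preview_attrs(obj, attrs, width=20, max_chars=200):
    """Return one 'name : value' line per attribute, truncated and on one line."""
    lines = []
    for attr in attrs:
        # Missing attributes are shown as empty rather than raising
        value = str(getattr(obj, attr, ""))[:max_chars].replace("\n", "/n")
        lines.append(f"{attr.ljust(width)} : {value}")
    return lines


# Stand-in object with values taken from the output above
fake_press = SimpleNamespace(source="cftc", date="January 12, 2015")
for line in preview_attrs(fake_press, ["source", "date", "header"]):
    print(line)
```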


Even more interesting, the extracted data points are available in the feature_dict attribute:

In [17]:
import pprint 

pprint.pprint(press.feature_dict)


{'compliance_obligations': '',
 'cooperation_credit': '',
 'country_of_violation': '',
 'court': '',
 'currency': '',
 'decision_date': '',
 'defendant': '',
 'extracted_authorities': '',
 'extracted_sanctions': '',
 'extracted_violations': '',
 'folder': '',
 'judge': '',
 'justice_type': '',
 'monetary_sanction': '',
 'monitor': '',
 'nature_de_sanction': '',
 'nature_of_violations': '',
 'penalty_details': '',
 'reference': '',
 'type': ''}


<br>
<br>

## 5. Predictions
**********************

<br>

### 5.1 Specific predictions


<br>

We can now make a prediction for a specific feature:

In [18]:
%time defendant =  press.predict("defendant")

CPU times: user 18.3 s, sys: 32.4 ms, total: 18.3 s
Wall time: 4.64 s


<br>

There are two ways of reading a prediction.

<br>

### 5.2 Raw vs final predictions 


<br>

The first one is the raw result of the prediction, which is a list of (answer, score) tuples.

For example:

In [19]:
print(defendant)

[('Joshua Gilliland', 5.36), ('Chawalit Wongkhiao', 1.79), ('Allied Markets', 0.76)]
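
Since each raw prediction is a list of (answer, score) tuples, ordinary Python is enough to rank or filter the candidates. Here is a minimal sketch using the values printed above; the 1.0 threshold is an arbitrary assumption, and nothing here tells us how the package itself interprets the scores.

```python
# Raw prediction copied from the output above
defendant = [
    ("Joshua Gilliland", 5.36),
    ("Chawalit Wongkhiao", 1.79),
    ("Allied Markets", 0.76),
]

# Highest-scoring candidate
best_answer, best_score = max(defendant, key=lambda pair: pair[1])

# Candidates above an arbitrary, illustrative threshold
confident = [answer for answer, score in defendant if score >= 1.0]

print(best_answer, best_score)
print(confident)
```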



<br>

We can access the same object through the special ```_feature_dict``` attribute, using the ```_[OUR_FEATURE]``` key.
For example:

In [20]:
print(press._feature_dict["_defendant"])

[('Joshua Gilliland', 5.36), ('Chawalit Wongkhiao', 1.79), ('Allied Markets', 0.76)]



<br>

Or, if we want a more usable object, we can access the human-readable string answer through ```feature_dict``` and the feature name.

For example:

In [21]:
print(press.feature_dict["defendant"])

Allied Markets
Chawalit Wongkhiao
Joshua Gilliland
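
The readable value is essentially the raw list with scores dropped and answers joined by newlines. The sketch below shows one way to do that; `to_readable` is a hypothetical helper, and the package's own ordering and cleaning rules may differ (its output above is not in score order).

```python
def to_readable(raw_prediction):
    """Join unique, non-empty answers with newlines, keeping first-seen order."""
    seen = []
    for answer, _score in raw_prediction:
        # Skip empty answers and duplicates
        if answer and answer not in seen:
            seen.append(answer)
    return "\n".join(seen)


# Raw prediction copied from the output above
raw_defendant = [
    ("Joshua Gilliland", 5.36),
    ("Chawalit Wongkhiao", 1.79),
    ("Allied Markets", 0.76),
]
readable = to_readable(raw_defendant)
print(readable)
```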


<br>

### 5.3 Make all predictions at once


<br>

And of course, the ```predict_all``` method and ```predict("all")``` make all predictions at once.

For example:

In [22]:
%time _ = press.predict_all()

CPU times: user 31.8 s, sys: 94.1 ms, total: 31.9 s
Wall time: 8.27 s


The readable data is then available in ```feature_dict```:

In [23]:
pprint.pprint(press.feature_dict)

{'compliance_obligations': '',
 'cooperation_credit': '',
 'country_of_violation': 'United States',
 'court': '',
 'currency': 'USD',
 'decision_date': '2015-01-12',
 'defendant': 'Allied Markets\nChawalit Wongkhiao\nJoshua Gilliland',
 'extracted_authorities': 'CFTC',
 'extracted_sanctions': 'Personal Expenses\n'
                        'none of the Defendants\n'
                        'none of the Defendants has ever been registered with '
                        'the CFTC\n'
                        'personal expenses\n'
                        'Freezing Assets\n'
                        'freezing and preserving assets',
 'extracted_violations': 'withdrew approximately $850,000 in pool participant '
                         'funds\n'
                         'Complaint charges\n'
                         'the Complaint\n'
                         'to trade forex in a commodity pool\n'
                         'misappropriated funds to pay their personal '
                         'e
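
Many fields can remain empty for a given document. To keep only the features that were actually found, a plain dict comprehension is enough. The dict below copies a few entries from the output above rather than calling the package, so it runs on its own.

```python
# A few entries copied from the readable output above
feature_dict = {
    "compliance_obligations": "",
    "cooperation_credit": "",
    "country_of_violation": "United States",
    "court": "",
    "currency": "USD",
    "decision_date": "2015-01-12",
    "extracted_authorities": "CFTC",
}

# Keep only the features with a non-empty prediction
found = {name: value for name, value in feature_dict.items() if value}
print(found)
```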

Or if you want the raw value of each feature:

In [24]:
pprint.pprint(press._feature_dict)

{'_compliance_obligations': [('', 1)],
 '_cooperation_credit': [('', 1)],
 '_country_of_violation': [('United States', 1)],
 '_court': [('', 1)],
 '_currency': [('USD', 1)],
 '_decision_date': [('2015-01-12', 1)],
 '_defendant': [('Joshua Gilliland', 5.36),
                ('Chawalit Wongkhiao', 1.79),
                ('Allied Markets', 0.76)],
 '_extracted_authorities': [('CFTC', 1), ('CFTC', 1)],
 '_extracted_sanctions': [('none of the Defendants has ever been registered '
                           'with the CFTC',
                           0.92),
                          ('Freezing Assets', 0.89),
                          ('personal expenses', 0.8),
                          ('Personal Expenses', 0.79),
                          ('freezing and preserving assets', 0.73),
                          ('none of the Defendants', 0.31)],
 '_extracted_violations': [('Defendants Allied Markets LLC', 2.82),
                           ('Complaint', 2.04),
                           ('Compla

<br>
<br>

## 6. Conclusion
**********************

<br>

We have now done a fairly complete tour of legal-doc-processing. Even if there is much more to cover, you can now use it on your own, and refer to the "behind the hood" section to go deeper into the package.