### PROJECT: JUDICIAL DECISION PREDICTOR 
<br>

#### Summary: ####
- Webscrapping: Collect all available decisions from the jurisprudence website.
- Machine Learning: Use models to understand decisions and learn behavioral patterns.
- Prediction: Use models to predict the probability of win/loss of a certain lawsuit. <br> <br>

#### Features: <br>
Must haves:
- Input the keywords of the type of lawsuit.
- Output the probability of win/loss according to the model's prediction. <br> <br>

Should haves:
- Input judge for better prediction.
- Statistics analysis and plots for better decision making.
<br> <br>

Could haves:
- Indications of keywords, jurisprudence and law articles to aid the lawyer.
<br> <br>

Won't haves:
- Never say never. <br> <br>

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import requests
import math
import time
import csv
import re
from bs4 import BeautifulSoup
from lxml import etree
from tqdm import tqdm

#### BASIC INFORMATIONS ####

In [2]:
# the code in this cell was coppied from stackoverflow, it is being used to 
# rotate ip, so i won't get banned for scrapping the website
import socks
import socket
from stem.control import Controller
from stem import Signal

err = 0
counter = 0
url = "checkip.dyn.com"

with Controller.from_port(port = 9151) as controller:
    try:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get("http://checkip.dyn.com")
            soup = BeautifulSoup(r.content)
            print(soup.find("body").text)
            counter = counter + 1
            #wait till next identity will be available
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.HTTPError:
        print("Could not reach URL")
        err = err + 1
print("Used " + str(counter) + " IPs and got " + str(err) + " errors")

Current IP Address: 185.222.202.133
Current IP Address: 185.220.101.13
Current IP Address: 198.98.59.161
Current IP Address: 199.249.230.118
Current IP Address: 162.247.74.216
Current IP Address: 62.102.148.68
Current IP Address: 192.160.102.169
Current IP Address: 199.249.230.64
Current IP Address: 5.199.130.188
Current IP Address: 171.25.193.235
Used 10 IPs and got 0 errors


In [3]:
response= requests.get('http://ipecho.net/plain')
print ("My Original IP Address:",response.text)

My Original IP Address: 185.220.101.30


In [4]:
#url and bs4 for the first steps
url = 'https://www.direitoemdia.pt/search?search_jurisprudence%5Bkeywords%5D=&search_jurisprudence%5Bdescriptors_operator%5D=or&search_jurisprudence%5Bprocess_number%5D=&search_jurisprudence%5Breferendary%5D=&search_jurisprudence%5Bdatelow%5D=&search_jurisprudence%5Bdatehigh%5D=&search_jurisprudence%5Bdate_created%5D=&search_jurisprudence%5Bhidden%5D='

response_url = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response_url.content, 'lxml')

In [5]:
# I will star doing civil law extraction, since it is the most common law
# Didn't use jpaz and 1a_instancia because has few access to judgements 

# getting tribunals for civil area
tribunals = ['stj', 'trl', 'trp', 'trc', 'tre', 'trg']

part1 = ['https://www.direitoemdia.pt/search?search_jurisprudence%5Bkeywords%5D=&search_'\
        'jurisprudence%5Bdescriptors_operator%5D=or&search_jurisprudence%5Bprocess_number%5D'\
         '=&search_jurisprudence%5Breferendary%5D=&search_jurisprudence%5Bdatelow%5D=&search_'\
         'jurisprudence%5Bdatehigh%5D=&search_jurisprudence%5Bdate_created%5D=&search_'\
         'jurisprudence%5Bhidden%5D=&penal-dgsi_jstj=2&uniaoeuropeia-ue_func_pub=5&civil'\
         '-dgsi_j' + trib + '=' for trib in tribunals]

# getting pages
total_pages = [s.text for s in soup.find_all('span', class_='badge')][:6]
each_pages = [list(range(1, math.ceil(int(page)/20)+1)) for page in total_pages]

In [6]:
# putting the link together
zipped = list(zip(part1, each_pages))
links = [url + str(p) for url, pag in zipped for p in pag]

In [7]:
#for all the tribunals (try it out latter on)
#links = [a['href'] for a in soup.find_all('a', class_='nav-link') if len(a['href']) > 1]

In [8]:
# getting total items found is unecessary for now, but perhaps we can use it further on
# n = int(re.findall('[0-9]+', str(list(soup.find('div', class_='col-md-2 text-right'))[1]))[0])

# getting total pages
# pages = [str(n) for n in list(range(1, math.ceil(n/20)))]

In [9]:
# getting all urls for all the pages, not using it for now
#pages_url = [ 'https://www.direitoemdia.pt/search?search_jurisprudence%5B' \
#                     'keywords%5D=&search_jurisprudence%5Bdescriptors_operator%5D=or&search_' \
#                     'jurisprudence%5Bprocess_number%5D=&search_jurisprudence%5Breferendary%5D' \
#                     '=&search_jurisprudence%5Bdatelow%5D=&search_jurisprudence%5Bdatehigh%5D=&'\
#                     'search_jurisprudence%5Bdate_created%5D=&search_jurisprudence%5Bhidden%5D=&'\
#                     'civil-dgsi_jstj=' + page + '&paginator_civil-dgsi_jstj=true#' \
#                     for page in pages]

In [10]:
#getting url for all areas, not using it for now
# all_urls = [list(map(lambda link: url + link, links)) for url in pages_url]
# all_urls = [i for ii in all_urls for i in ii]

#### FUNCTIONS TO EXTRACT INFORMATION ####

In [11]:
# getting head informations
def heading(soup):
    return [b.text.replace('\n', ' ') for b in soup.find_all('div', class_='col-md-6')][:40]

# parsing heading information
def proc_number(soup):
    return [b.text for b in soup.find_all('span', class_='processNumber')][:20]

def area(heading):
     return [v for p, v in enumerate(heading) if p%2 == 1]

def date(heading):
    date = [re.findall('\d{1,2}\s[de].+\d', h) for h in heading]
    date = [v for p, v in enumerate(date) if p%2 == 0]
    return [i for d in date for i in d]

def rulling_type(heading):
    r_type = [re.findall('.+?(?=Processo)', h) for h in heading]
    r_type = [v for p, v in enumerate(r_type) if p%2 == 0]
    return [i.strip() for r in r_type for i in r]

#separating descriptors information
def descriptors(soup):
    return [b.text.replace('&nbsp&nbsp&nbsp&nbsp>&nbsp&nbsp&nbsp&nbsp', ', ')
            for b in soup.find_all('div', class_='descriptors')][:20]

#separating body information
def decision(soup):
    return [b.text for b in soup.find_all('p', class_='result-text block-with-text')][:20]

#separating links for full decisions
def dec_links(soup):
    link_full = [b.get('href') for b in soup.find_all('a', class_='search-result-link')][:20]
    return ['https://www.direitoemdia.pt'+link for link in link_full]

#### GETTING INFORMATION FROM ALL PAGES ####

In [12]:
# information I will most likely use latter on starts with p_
## this function will extract all information and save it into a dataframe(.csv)
def all_info(links):
    error = []
    p_numb = []
    p_area = []
    p_date = []
    p_type = []
    p_desc = []
    p_decision = []
    decision_links = []
    with tqdm(total=len(links), desc="Extracting Decisions", bar_format="{l_bar}{bar} [ time left: {remaining} ]") as pbar:
        for link in links:
            try:
                response_url = requests.get(link, headers={'user-agent': 'Mozilla/5.0'})
                soup = BeautifulSoup(response_url.content, 'lxml')
                head = heading(soup)
                p_numb += proc_number(soup)
                p_area += area(head)
                p_date += date(head)
                p_type += rulling_type(head)
                p_desc += descriptors(soup)
                p_decision += decision(soup)
                decision_links += dec_links(soup)
                time.sleep(1)
            except:
                error += link
                continue
            pbar.update(1)
    print('You extracted {} pages out of {}.'.format(len(links)-len(error), len(links)))
    all_lists = [p_numb, p_area, p_date, p_type, p_desc, p_decision, decision_links]
    cols = ['Process Number', 'Area', 'Date of Judgement', 'Type of Judgement', 'Descriptors',
            'Main Judgement', 'Full Judgement Link']
    print('Wrapping up...')
    proc_dict = {k:v for k,v in zip(cols, all_lists)}
    dataframe = pd.DataFrame(proc_dict)
    print('Errors: ', error)
    return dataframe.to_csv('decision_df.csv')

In [13]:
# running the function
all_info(links[5500:])

Extracting Decisions:  60%|█████████████████████████████████████████▌                            [ time left: 1:19:14 ]


You extracted -94978 pages out of 583.
Wrapping up...
Errors:  ['h', 't', 't', 'p', 's', ':', '/', '/', 'w', 'w', 'w', '.', 'd', 'i', 'r', 'e', 'i', 't', 'o', 'e', 'm', 'd', 'i', 'a', '.', 'p', 't', '/', 's', 'e', 'a', 'r', 'c', 'h', '?', 's', 'e', 'a', 'r', 'c', 'h', '_', 'j', 'u', 'r', 'i', 's', 'p', 'r', 'u', 'd', 'e', 'n', 'c', 'e', '%', '5', 'B', 'k', 'e', 'y', 'w', 'o', 'r', 'd', 's', '%', '5', 'D', '=', '&', 's', 'e', 'a', 'r', 'c', 'h', '_', 'j', 'u', 'r', 'i', 's', 'p', 'r', 'u', 'd', 'e', 'n', 'c', 'e', '%', '5', 'B', 'd', 'e', 's', 'c', 'r', 'i', 'p', 't', 'o', 'r', 's', '_', 'o', 'p', 'e', 'r', 'a', 't', 'o', 'r', '%', '5', 'D', '=', 'o', 'r', '&', 's', 'e', 'a', 'r', 'c', 'h', '_', 'j', 'u', 'r', 'i', 's', 'p', 'r', 'u', 'd', 'e', 'n', 'c', 'e', '%', '5', 'B', 'p', 'r', 'o', 'c', 'e', 's', 's', '_', 'n', 'u', 'm', 'b', 'e', 'r', '%', '5', 'D', '=', '&', 's', 'e', 'a', 'r', 'c', 'h', '_', 'j', 'u', 'r', 'i', 's', 'p', 'r', 'u', 'd', 'e', 'n', 'c', 'e', '%', '5', 'B', 'r', '

In [2]:
# testing the dataframe
decision_df = pd.read_csv('Projeto/decision_df.csv', index_col='Unnamed: 0')
decision_df.head()

Unnamed: 0,Process Number,Area,Date of Judgement,Type of Judgement,Descriptors,Main Judgement,Full Judgement Link
0,Processo nº: 4064/14.1T8STB.E1.S2,CÍVEL,26 de novembro de 2019,Acórdão,"Revista excepcional, Revista excecional, Acórd...",I- Admitido o recurso de revista excepcional...,https://www.direitoemdia.pt/search/show/f4f4f5...
1,Processo nº: 5168/11.8TCLRS.L1.S1,CÍVEL,26 de novembro de 2019,Acórdão,"Contrato de arrendamento, Fim contratual, Obra...",1. Sobre o senhorio recai o dever de facultar ...,https://www.direitoemdia.pt/search/show/f4f4f5...
2,Processo nº: 18079/16.1T8LSB.L1.S1,CÍVEL,26 de novembro de 2019,Acórdão,"Poderes do supremo tribunal de justiça, Altera...",I - A matéria de facto só pode ser alterada pe...,https://www.direitoemdia.pt/search/show/489122...
3,Processo nº: 866/14.7TBPVZ-A.P1.S1,CÍVEL,26 de novembro de 2019,Acórdão,"Saneador-sentença, Excepção dilatória, Exceção...","I - Em sede de saneador-sentença, o juiz deve,...",https://www.direitoemdia.pt/search/show/489122...
4,Processo nº: 2288/08.0TBPTM.E1.S1,CÍVEL,26 de novembro de 2019,Acórdão,"Contrato-promessa de compra e venda, Incumprim...",I. Para decidir se é ou não de atribuir relev...,https://www.direitoemdia.pt/search/show/6427f8...


In [3]:
# since i got a few datasets, concatenating them:
#'''
d1 = pd.read_csv('Projeto/decision_df1.csv', index_col='Unnamed: 0')
d2 = pd.read_csv('Projeto/decision_df2.csv', index_col='Unnamed: 0')
d3 = pd.read_csv('Projeto/decision_df3.csv', index_col='Unnamed: 0')
d4 = pd.read_csv('Projeto/decision_df4.csv', index_col='Unnamed: 0')
d5 = pd.read_csv('Projeto/decision_df.csv', index_col='Unnamed: 0')
df = pd.concat([d1,d2,d3,d4,d5])
df.to_csv('decisions.csv')
#'''

In [16]:
#decision_df = pd.read_csv('decisions.csv', index_col='Unnamed: 0')
decision_df.head()#, len(decision_df)

(                          Process Number   Area       Date of Judgement  \
 0   Processo nº:  1986/06.7TVLSB-C.L1.S2  CÍVEL  21 de novembro de 2019   
 1       Processo nº:  92/13.2TBPMS.C1.S1  CÍVEL  21 de novembro de 2019   
 2  Processo nº:  11701/15.9T8LSR-A.L1.S2  CÍVEL  21 de novembro de 2019   
 3   Processo nº:  17085/15.8T8 LSB.L1.S2  CÍVEL  12 de novembro de 2019   
 4      Processo nº:  537/14.4T8FAR.E1.S1  CÍVEL  12 de novembro de 2019   
 
   Type of Judgement                                        Descriptors  \
 0           Acórdão  Litigância de má fé, Notificação, Notificação ...   
 1           Acórdão  Nulidade de acórdão, Litisconsórcio, Litiscons...   
 2           Acórdão  Responsabilidade civil extracontratual, Dívida...   
 3           Acórdão  Revista excepcional, Revista excecional, Forma...   
 4           Acórdão  Processo de inventário, Avaliação de bens, Con...   
 
                                       Main Judgement  \
 0  I.      De acordo com o despa

In [17]:
#len(decision_df['Descriptors'].unique())

9750