<a href="https://colab.research.google.com/github/ArvindSinghRawat/Spellcast-Bot/blob/feature%2Fv1%2Farvind%2Fscraping-logic/docs/scripts/Scraping_Logic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logic to Scrape Dictionary Words
> Scrapes words from a flat file, such as a PDF, into a standard JSON format that our app can read.

## Installing dependencies

In [1]:
!pip3 install textract

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# General imports

from typing import List, Dict, Any
from enum import Enum

## Downloading static files

In [3]:
!mkdir -p static
!mkdir -p output
!curl https://www.hzu.edu.in/uploads/2020/10/Law-dictionary.pdf -o static/dictionary.pdf

mkdir: cannot create directory ‘static’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1469k  100 1469k    0     0   363k      0  0:00:04  0:00:04 --:--:--  364k


## Reading the file in Python

In [4]:
import textract
import re

# Extract the PDF's text as bytes, using the pdfminer backend
inputFile = textract.process('./static/dictionary.pdf', method='pdfminer')

## Preprocessing

1. Split the file's content into paragraphs at blank lines
2. Replace backslash escape sequences and runs of whitespace with a single space

In [5]:
# Split the decoded text into paragraphs at blank lines
paragraphs = re.split(r'\n\s*\n+\s*', inputFile.decode())

# Replace backslash escape sequences and runs of whitespace with a single space
paragraphs = [re.sub(r'(\\+\S)|(\s+)', ' ', line) for line in paragraphs]
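
To make the two regular expressions above easier to follow, here is a minimal sketch that applies them to a short made-up string. The `sample` text is an assumption for illustration and is not taken from the PDF.

```python
import re

# Hypothetical sample standing in for the decoded PDF text.
# It contains a blank line with stray spaces and a literal backslash sequence.
sample = "abate v. To end\nor reduce.\n\n  \n57\n\nabdication   n. Giving up\\nan office."

# Step 1: split at blank lines (which may themselves contain whitespace)
paragraphs = re.split(r'\n\s*\n+\s*', sample)

# Step 2: replace backslash sequences and whitespace runs with one space
paragraphs = [re.sub(r'(\\+\S)|(\s+)', ' ', p) for p in paragraphs]

for p in paragraphs:
    print(repr(p))
```

Note that a single newline inside a paragraph does not split it. It is only collapsed into a space in the second step.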


### Functions to process the different forms of preprocessed text

In [27]:
# Defining constants
class WordType(Enum):
  VERB = 'v.'
  NOUN = 'n.'
  ABBREVIATION = 'abbr.'
  ADJECTIVE = 'adj.'

# Accumulates every scraped word, keyed by the word itself
result = dict()

In [28]:
def preprocess_word(text: str) -> str:
  """ Strips whitespace, digits and special characters from both ends of the text
  """
  return text.strip().strip('1234567890!@#$%^&*()_+{}|\\:"\'<>?,./').strip()
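
As a quick illustration of what `preprocess_word` keeps and removes, the sketch below repeats the helper so that it runs on its own and calls it on a few made-up inputs. Only the ends of the string are stripped, so digits and punctuation inside the text survive.

```python
def preprocess_word(text: str) -> str:
    """Copy of the notebook's helper: strip digits and punctuation from both ends."""
    return text.strip().strip('1234567890)(!@#$%^&*()_+{}|\\:"\'<>?,./').strip()

# Made-up inputs, chosen to resemble headwords in the extracted text
print(repr(preprocess_word('brief 1')))      # trailing sense number
print(repr(preprocess_word('  (abate), ')))  # surrounding punctuation
print(repr(preprocess_word('W-2 form')))     # interior digit and hyphen
```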

In [29]:
def process_raw_text(line: str) -> List[str]:
  """Filters unwanted tokens out of the raw text. Returns a list of lowercase words, one per element
  """
  result = list()
  # Rejoin words hyphenated across line breaks, e.g. 'offi- cial' -> 'official'
  line = re.sub(r"-\s+", "", line)
  split = line.split()
  for word in split:
    word = word.strip()
    # Tokens made of one repeated character (e.g. 'xx') are treated as noise
    all_same = all(ch == word[0] for ch in word)
    # Keep purely alphabetic tokens longer than one character that are not all uppercase
    if len(word) > 1 and word.isalpha() and not word.isupper() and not all_same:
      result.append(word.lower())

  return result
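
The filtering rules in `process_raw_text` can be checked on a short made-up line. The sketch below is a self-contained copy of the function plus one call. The sample sentence is an assumption, not text from the PDF.

```python
import re
from typing import List

def process_raw_text(line: str) -> List[str]:
    """Copy of the notebook's filter, repeated here so the example runs on its own."""
    result = []
    line = re.sub(r"-\s+", "", line)  # rejoin 'offi- cial' into 'official'
    for word in line.split():
        all_same = all(ch == word[0] for ch in word)
        if len(word) > 1 and word.isalpha() and not word.isupper() and not all_same:
            result.append(word.lower())
    return result

words = process_raw_text("The offi- cial conduct of AAA in public trust. Page 57 xx")
print(words)
```

Tokens with attached punctuation, such as `trust.`, fail the `isalpha()` check and are dropped along with the acronym, the page number and the repeated-character token.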

In [30]:
def map_type(tag: str) -> WordType:
  """ Maps a part-of-speech tag without its trailing period (e.g. 'adj') to a WordType.
  Returns None if the tag is missing, empty or unrecognized
  """
  if tag is None:
    return None
  tag = tag.strip()
  if tag == 'v':
    return WordType.VERB
  elif tag == 'adj':
    return WordType.ADJECTIVE
  elif tag == 'n':
    return WordType.NOUN
  elif tag == 'abbr':
    return WordType.ABBREVIATION
  else:
    return None
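
Because each `WordType` value is the tag followed by a period, the same mapping could also be done with an enum lookup by value. This is only a sketch of an alternative. The name `map_type_by_value` is hypothetical and does not appear elsewhere in the notebook.

```python
from enum import Enum
from typing import Optional

class WordType(Enum):
    VERB = 'v.'
    NOUN = 'n.'
    ABBREVIATION = 'abbr.'
    ADJECTIVE = 'adj.'

def map_type_by_value(tag: Optional[str]) -> Optional[WordType]:
    """Hypothetical alternative to map_type: look the tag up by its enum value."""
    if tag is None:
        return None
    try:
        # Enum lookup by value raises ValueError for unknown tags
        return WordType(tag.strip() + '.')
    except ValueError:
        return None

print(map_type_by_value('adj'))
print(map_type_by_value('xyz'))
```

With this approach a new part of speech only needs a new enum member, at the cost of tying the lookup to the exact format of the enum values.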

In [31]:
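def process_words_with_meaning(line: str) -> Dict[str, Any]:
  """ Parses a dictionary entry into its headword and a list of typed meanings.
  Returns None if the line contains no part-of-speech marker
  """
  if len(line) < 1:
    return None
  # The capturing group keeps the markers, so split is
  # [headword, type, meaning, type, meaning, ...]
  split = re.split(r'\s+(v|abbr|n|adj)\.\s+', line)
  if len(split) > 2:
    response = {
        'word': preprocess_word(split[0]),
        'meaning': list()
    }
    index = 1
    # Step by 2 so that every iteration starts on a type marker
    for i in range(1, len(split) - 1, 2):
      word_type, meaning = split[i:i+2]
      word_type = map_type(word_type)
      if word_type is not None:
        # Rejoin words hyphenated across line breaks
        meaning = re.sub(r"-\s+", "", meaning)
        response['meaning'].append({
            'value': preprocess_word(meaning),
            'type': word_type.name,
            'index': index
        })
        index += 1
    return response
  return None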
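
The parsing above relies on how `re.split` behaves when the pattern contains a capturing group: the captured markers are kept in the result, interleaved with the text between them. The sketch below shows this on a made-up entry with two senses. The sample string is an assumption.

```python
import re

pattern = r'\s+(v|abbr|n|adj)\.\s+'
entry = 'brief 1 n. A written statement. 2 v. To summarize.'

parts = re.split(pattern, entry)
print(parts)

# The first element is the headword; the rest alternates marker, meaning
headword, rest = parts[0], parts[1:]
pairs = list(zip(rest[0::2], rest[1::2]))
print(headword, pairs)
```

The sense numbers (`1`, `2`) stay attached to the neighbouring text after the split, which is why the notebook passes both the headword and each meaning through `preprocess_word`.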

In [32]:
def process_text(line: str, dictionary: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
  """ Adds the words found in the line to the dictionary: plain words map to None,
  and an entry with meanings maps to its parsed record
  """
  response = process_raw_text(line)
  if response is not None and len(response) > 0:
    for word in response:
      dictionary[word] = None
  response = process_words_with_meaning(line)
  if response is not None:
    dictionary[response['word']] = response
  return dictionary

In [33]:
def process_list_of_lines(lines: List[str], dictionary: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
  """ Iterates over the list of lines and prepares a dictionary out of those
  """
  if dictionary is None:
    dictionary = dict()
  if lines is not None and len(lines) > 0:
    for each_line in lines:
      process_text(each_line, dictionary)
  return dictionary

In [34]:
res = process_list_of_lines(paragraphs, result)

In [53]:
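# Bucket the keys:
#   vm  = single-word keys that have meanings
#   wm  = keys containing a space that have meanings
#   wom = words without meanings
vm = list()
wm = list()
wom = list()
for key, value in res.items():
  if value is None:
    wom.append(key)
  elif ' ' in key:
    wm.append(key)
  else:
    vm.append(key)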
print('Valid           (%d) : %s' % (len(vm), vm[0:5]))
print('With meaning    (%d) : %s' % (len(wm), wm[0:5]))
print('Without meaning (%d) : %s' % (len(wom), wom[0:5]))

Valid           (153) : ['AAA', 'AALS', 'ABA', 'abate', 'abdication']
With meaning    (219) : ['the combined value of all bequests and devises, and/or the debts owed by a tes- tator, exceed the assets in the testator’s estate.', 'state court’s decision involving those regulations and proceedings when they involve a substantial or sensitive area of state concern. Burford', 'mined in the state court. Younger', 'e.g., nurse-patient;', 'abuse of discretion']
Without meaning (8084) : ['page', 'law', 'dictionary', 'by', 'susan']


In [48]:
for i in sorted(res.keys()):
  if ' ' in i:
    print(i)

Age Discrimination in Employment
American Bar Foundation
Articles of Confederation
Association of American Law Schools
C & F
Consolidated Omnibus Budget Rec- onciliation Act of
Corpus Juris Secundum
Equal Protection Clause
Federal Insurance Contributions Act
General Agreement on Tariffs and
Good Samaritan law
Income Employee
Interstate Commerce Commission
Labor Management Relations Act
Latin. “To come”
Megan’s Law
Model Rules of Professional Conduct
Modified Accelerated Cost Recovery System
National Association of Security Dealers Automated Quotation system
Occupational Safety and Health Act
Occupational Safety and Health Administration
Restriction Fragment Length Poly- morphism
Roth IRA
Securities Act of
Securities Exchange Act of
Securities and Exchange Commission
Treasury Department, United States
W-2 form
W-4 form
abuse of discretion
academic freedom
accidental death and dismemberment
accord and satisfaction
ad damnum
ad hominem
ad litem
adequate remedy at law
adj. Relating to the 

In [37]:
paragraphs[999:1100]

['n. The criminal act or practice bribery of voluntarily giving, offering, receiving, or soliciting a bribe to influence the offi- cial conduct of a person in a position or office of public trust. See also kickback. commercial bribery. The voluntary giving, offering, receiving, or solicit- ing of a bribe to influence the dis- cretionary conduct or decision of an agent, officer, or employee of a business.',
 '05_542109 ch02.qxp 3/28/06 12:16 PM Page 57',
 '57',
 'burden of allegation',
 'n. Short-term loan to bridge loan cover excessive or concurrent obliga- tions, as in the case of a loan to cover two separate mortgages until borrower is able to sell one home. ',
 'brief 1 n. A written statement pre- pared by a lawyer and submitted to the court that outlines the pertinent facts of the case, the questions of law to be decided, the position of the lawyer’s client as to those questions, and the legal arguments and authorities (for example, statutes and appellate court decisions) that supp

## Saving the result as JSON

In [55]:
import json

# Serializing the dictionary to a JSON string
json_object = json.dumps(res, indent=4)

# Writing to output/dictionary.json
with open("output/dictionary.json", "w") as outfile:
    outfile.write(json_object)
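
To confirm that the saved file can be read back, a round trip along the following lines may help. This is a minimal sketch that uses a small made-up dictionary and a temporary directory instead of the notebook's `res` and `output/` folder, so both of those are assumptions.

```python
import json
import os
import tempfile

# Made-up data in the same shape as the notebook's result:
# entries with meanings map to a record, plain words map to None
sample = {
    'abate': {
        'word': 'abate',
        'meaning': [{'value': 'To end or reduce', 'type': 'VERB', 'index': 1}],
    },
    'page': None,
}

path = os.path.join(tempfile.mkdtemp(), 'dictionary.json')
with open(path, 'w') as outfile:
    json.dump(sample, outfile, indent=4)

with open(path) as infile:
    loaded = json.load(infile)

# None is written as JSON null and comes back as None
print(loaded == sample)
```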