# Text Processing

### Software Package & Built-in Function Documentation
- scikit-learn (sklearn) - http://scikit-learn.org/stable/documentation.html
- Natural Language Toolkit (NLTK) - http://www.nltk.org/
- spaCy - https://spacy.io/

## Text Preprocessing Workflow

- Removing state names
- Removing case names
- Removing common stopwords (for example, "the" isn't a useful word)
- Removing people's names (loaded from the NLTK names corpus)
- Removing days of the week and month names - these mislead the model into treating time periods as meaningful
- Stripping non-words (many numbers reference other cases - another interesting project could be keeping ONLY the numbers)
- Lemmatizing (reducing a word to its root form - e.g., "run" from "running")
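
The "stripping non-words" step, and the "numbers only" variant mentioned beside it, can be sketched with plain regular expressions. This is a minimal illustration rather than the notebook's actual cleaner; the helper names `strip_non_words` and `keep_only_numbers` and the sample sentence are assumptions made up for this example.

```python
import re

def strip_non_words(text):
    """Lowercase the text and blank out everything except a-z and spaces."""
    cleaned = re.sub(r"[^a-z ]", " ", text.lower())
    # Collapse the runs of spaces left behind by the substitution
    return re.sub(r" +", " ", cleaned).strip()

def keep_only_numbers(text):
    """Return the runs of digits in the text, in order of appearance."""
    return re.findall(r"\d+", text)

# Hypothetical sample sentence, not taken from the dataset
sample = "See Roe v. Wade, 410 U.S. 113 (1973)."
print(strip_non_words(sample))    # see roe v wade u s
print(keep_only_numbers(sample))  # ['410', '113', '1973']
```

Note that stripping punctuation splits abbreviations such as "U.S." into single letters, which is one reason the cleaner later drops one-character tokens.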


A note about stopwords if you've never done NLP before: there is no one-size-fits-all stopword list that can substitute for knowledge of your domain and of which words should be excluded (or kept). Creating a powerful stopword list is an iterative process and requires a lot of spot-checking of your actual data to do it right.
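
One way to do that spot-checking is to count which tokens survive the current stoplist most often and review them by hand. The sketch below assumes the documents are already tokenized into lists of strings; the function name `top_surviving_tokens` and the toy data are hypothetical.

```python
from collections import Counter

def top_surviving_tokens(token_lists, stoplist, n=10):
    """Count tokens that are NOT in the stoplist and return the n most frequent."""
    counts = Counter(
        tok
        for tokens in token_lists
        for tok in tokens
        if tok not in stoplist
    )
    return counts.most_common(n)

# Toy tokenized documents, made up for illustration
docs = [
    ["court", "hold", "appeal", "court"],
    ["appeal", "deny", "court"],
    ["hold", "appeal"],
]
stoplist = {"court"}

print(top_surviving_tokens(docs, stoplist, n=2))  # [('appeal', 3), ('hold', 2)]
```

Frequent survivors that carry no topical meaning are candidates for the next version of the stoplist; rerunning the count after each addition is what makes the process iterative.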

In [None]:
# If you get import module errors regarding spacy or en - run these commands
# %%bash
# pip install spacy
# pip install -U spacy
# python -m spacy validate
# python -m spacy download en   # shortcut name, spaCy 2.x only; newer versions need the full model name below
# python -m spacy download en_core_web_sm

In [None]:
import pandas as pd
import re
import spacy
import en_core_web_sm
import nltk
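# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module is gone in recent releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords
from nltk.corpus import names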
nlp = en_core_web_sm.load()

## Assign Variables to Common Words to Create Stoplist

In [None]:
# Lowercase the NLTK name lists, then add a simple plural form of each name
male_names = [w.lower() for w in names.words('male.txt')]
female_names = [w.lower() for w in names.words('female.txt')]
male_names_plur = [w + "s" for w in male_names]
female_names_plur = [w + "s" for w in female_names]
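# Iterating over a DataFrame yields its column labels, so this assumes each CSV
# stores the names in its header row; lowercase them to match the cleaned text
casenames = [str(c).lower() for c in pd.read_csv("casetitles.csv", encoding='iso-8859-1')]
statenames = [str(s).lower() for s in pd.read_csv("statenames.csv")]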

In [None]:
homespun_words = ['join', 'seek', 'ginnane', 'kestenbaum', 'hummel', 'loevinger', 'note', 'curiam', 'mosk', 'pd', \
                'paxton', 'rhino', 'buchsbaum', 'hirshowitz', 'misc', 'assistant', 'whereon', 'dismiss', 'sod', \
                'vote', 'present', 'entire', 'frankfurter', 'ante', 'leave', 'concur', 'entire', 'mootness', \
                'track', 'constitution', 'jj', 'blackmun', 'rehnquist', 'amici', 'sup', 'rep', 'stat', 'messes', \
                'like', 'rev', 'trans', 'bra', 'teller', 'vii', 'erisa', 'usca', 'annas', 'lead', 'cf', 'cca', \
                'fsupp', 'afdc', 'amicus', 'ante', 'orrick', 'kansa', 'pd', 'foth', 'stucky', 'aver',"united", \
                "may", "argued", "argue", "decide", "rptr", "nervine", "pp","fd" ,"june", "july", \
                "august", "september", "october", "november", "states", "ca", "joyce", "certiorari", "december",\
                "january", "february", "march", "april", "writ", "supreme court", "court", "dissent", \
                "opinion", "footnote","brief", "decision", "member", "curiam", "dismiss", "note", "affirm", \
                "question", "usc", "file"]

# No backslashes needed: line continuation is implicit inside the parentheses
STOPLIST = set(stopwords.words('english') + list(homespun_words) + list(ENGLISH_STOP_WORDS)
               + list(statenames) + list(casenames) + list(female_names) + list(male_names)
               + list(female_names_plur) + list(male_names_plur))
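
Because the cleaner in the next section lowercases the text and keeps only the letters a-z before splitting it into tokens, a stoplist entry containing a space, punctuation, or an uppercase letter (for example "supreme court") can never equal a single token. A small audit like the following, using a hypothetical helper `unmatched_entries` and a made-up example set, can flag such entries:

```python
import re

def unmatched_entries(stoplist):
    """Return entries that contain anything other than lowercase a-z.

    Such entries cannot match a token produced from text that was
    lowercased and stripped down to a-z before tokenization.
    """
    return sorted(w for w in stoplist if not re.fullmatch(r"[a-z]+", w))

# Made-up example entries, not the full STOPLIST
example = {"court", "supreme court", "New York", "writ", "u.s."}
print(unmatched_entries(example))  # ['New York', 'supreme court', 'u.s.']
```

Entries flagged this way could be split into single words, lowercased, or removed, depending on what they were meant to filter.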

## Text Cleaner (including stopwords)

In [None]:
# STOPLIST is the set built in the previous section. Rebuilding it here from the
# undefined `sub_list` would raise a NameError and drop the name, state, and case entries.

def tokenizeText(sample):
    ## lowercase once, then blank out whitespace debris, contraction suffixes, and non-letters
    sample = sample.lower()
    separators = ["\xa0\xa0\xa0\xa0", "\r", "\n", "\t", "n't", "'m", "'ll", '[^a-z ]']
    for sep in separators:
        sample = re.sub(sep, " ", sample)

    ## get the tokens using spaCy - this makes it possible to lemmatize the words
    tokens = nlp(sample)
    tokens = [tok.lemma_.strip() for tok in tokens]

    ## apply our stoplist, also dropping empty strings and single characters
    return [tok for tok in tokens if len(tok) > 1 and tok not in STOPLIST]

In [None]:
doc_list["lem"] = doc_list.case.apply(tokenizeText)
doc_list.to_pickle("full_proj_lemmatized.pickle") ## to be used in model selection