## Task Description



*   Text classification 
*   Represent text in meaningful way
    *   labelled inputs (supervised learning)
    *   multi-classification model (unsupervised learning)
*    Prediction based on classification patterns


There are many different approaches to text mining: data mining, machine learning, information retrieval and knowledge management.  Each seeks to extract, identify and use information from large collections of textual data.  

Text classification is a learning task within text mining. In this use case it involves preprocessing the data, weighting terms, and using the KNN algorithm in combination with the K-means clustering algorithm.

This classification methodology will be evaluated using precision, recall, and F-measure.

![Model Methodology](assets/HRF_predictive_descriptive_analysis_model_methodology_RJProctor.png)

Note: until we have documents to process, this notebook serves only as an option for future teams. The database inherited from Labs 29 is empty and has no schema.

[Labs 29  HRF Asylum B DS AWS RDS PostgresSQL](https://master:***@asylum.catpmmwmrkhp.us-east-1.rds.amazonaws.com/asylum)

[Labs 29 AWS Asylum A DS RDS PostgresSQL](https://master:***rds_endpoint=hrfasylum-database-a.catpmmwmrkhp.us-east-1.rds.amazonaws.com)


# Exploratory Model and  Visualizations 

The purpose of the model is to classify legal documents containing judicial decisions for immigration asylum seekers, and to gain insights from these documents that will help representatives advocate for their clients:

1.   Judicial decision
  *   Asylum Granted
  *   Asylum Relief Denied
  *   Other Relief Granted
  *   Admin Closure (expired)

2.   Insights from patterns in data of individual judges (IJ cases only - initial hearings)
3.   Insights from patterns in data of appellate (panel) judges (BIA cases only - appellate hearings)
4.   Insights from patterns in all data (IJ and BIA cases - combined initial and appellate hearings)


In [None]:
# database dependency
!pip install sqlalchemy psycopg2-binary

In [None]:
# model dependency
!pip install pydantic


In [None]:
# libraries
# dataframe tools
import pandas as pd


# natural language processor
import nltk
import spacy


# wrapper to show progress/time as self checking tool
from tqdm import tqdm
  

# key word extraction
import textacy
import textacy.ke


# regular expressions and filesystem traversal
import os
import re


# linear algebra and array/matrices tools
import numpy as np
from numpy.linalg import norm
import random as rd
import math
from collections import Counter


# database tools
import sqlalchemy


# linear modeling
from sklearn.linear_model import LinearRegression
from pydantic import BaseModel 


# graphing tools
import matplotlib.pyplot as plt
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import plotly.graph_objects as go


## Query Database

In [None]:
# database URL
database_url = input()


# connect to database
engine = sqlalchemy.create_engine(database_url)
connection = engine.connect()


#get_url (without password showing)
def get_url():
    """
    verify connection to database
    return database connection in URL format:

    dialect://user:password@host/dbname

    password will be hidden with ***
    """
    url_without_password = repr(connection.engine.url)
    return {'database_url': url_without_password}


# call function to verify connection to DB
get_url()



In [None]:
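# write data to a table using write function
def write_data(df):
    """
    append a dataframe to the database table
    """
    tablename = 'mytable'
    df.to_sql(tablename, connection, if_exists='append', index=False, method='multi')


# call function to append an existing pandas df to the table
# (df must already be defined)
write_data(df)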


In [None]:
# query data/tables using read function
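def read_data(tablename):
    """
    select up to 5 distinct records from the given table
    and return them as a list of dicts (one per record)

    tablename is interpolated into the SQL string, so pass
    only trusted, hard-coded table names
    """
    query = f"""SELECT DISTINCT * FROM {tablename} LIMIT 5"""
    df = pd.read_sql(query, connection)
    return df.to_dict(orient='records')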


# call function to read string fields from RDB table(s)
read_data('pdfs')
read_data('judges')


## Preprocessing Text


1.   Case folding - the process of changing all letters to lowercase
2.   Tokenizing - the process of splitting a document (parsing) into its individual words (tokens)
3.   Filtering - the process of selecting the important words from the tokens (for example, removing stop words)
4.   Stemming or lemmatization - techniques for reducing a word to its root; stemming strips affixes to get the stem, lemmatization maps the word to its dictionary form

The purpose of preprocessing is to turn raw text into a consistent form that can be converted to numerical values; this data then becomes the source for further processing.
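
The four steps can be sketched with only the standard library. This is a minimal illustration, not the spaCy pipeline used in this notebook: the stop-word set `STOP_WORDS`, the `stem` helper and the `preprocess` function are hypothetical names, and the suffix stripping is a toy stand-in for a real stemmer or lemmatizer.

```python
import re

# Toy stop-word list (assumption); a real pipeline would use a full list
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "was", "for", "in"}


def stem(word):
    """Crude suffix stripping (assumption); stands in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters of the word after stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    folded = text.lower()                                  # 1. case folding
    tokens = re.findall(r"[a-z]+", folded)                 # 2. tokenizing
    filtered = [t for t in tokens if t not in STOP_WORDS]  # 3. filtering
    return [stem(t) for t in filtered]                     # 4. stemming


print(preprocess("The judge granted asylum to the applicants."))
```

The output is a list of normalized tokens per document, which is the form the term-weighting step expects.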

In [None]:
nlp = spacy.load('en_core_web_sm')

def createTokens(df):
    '''
    function to tokenize, lemmatize, lowercase
    and remove punctuation, whitespace and stop words from text
    '''
    tokens = []
    for doc in tqdm(nlp.pipe(df.astype(str).values), total=df.size):
        if doc.is_parsed:
            tokens.append([n.lemma_.lower() for n in doc
                           if (not n.is_punct and not n.is_space and not n.is_stop)])
        else:
            # keep one (empty) token list per document
            tokens.append([])
    return tokens

## dependent on scraping tool separating documents into csv file type as currently written
raw = pd.read_csv('')
  # the tabular query has not been built yet, so we have no filename to insert here
tokens = createTokens(raw)

In [None]:
## dependent on previous code blocks
# find relevant keywords
# assumes raw holds a single column of text
text = ' '.join(raw.tolist())
nlp.max_length = len(text)

keywords = []
for tokenlist in tqdm(tokens):
    # join with spaces so the tokens stay separate words
    doc = nlp(' '.join(tokenlist))
    # extract and rank
    extract = textacy.ke.sgrank(doc, ngrams=(1,), window_size=2, normalize=None, topn=2, include_pos=['NOUN', 'PROPN'])
    for a, b in extract:
        keywords.append(a)



In [None]:
# sort unique keywords by frequency
resorted = sorted(set(keywords), key=lambda x: keywords.count(x), reverse=True)
# most freq unique keywords assigned to bins
top20 = resorted[:20]
top200 = resorted[:200]


In [None]:
## description of flow
# extract filenames
    # assume huge dataset for scalability
    # remove root folder
    # traverse through index.html to find pattern to remove title


# collect every folder that may hold an index.html file
folders = [x[0] for x in os.walk(str(os.getcwd()) + '/<foldername>/')]
# remove the extra / for the root folder
folders[0] = folders[0][:len(folders[0])-1]


# use re to match pattern of names and titles
name_pattern = ' '
  # insert pattern inside quotes once we have access to files
court_pattern = ' '
  # insert pattern inside quotes once we have access to files


# iterate
# read the file from index filenames
data = []
# flag so the conditional below only runs for the first folder
c = False

for i in folders:
    # file type will change based on scraper conversion
    with open(i + '/index.html', 'r') as file:
        text = file.read().strip()

    # extract title and names
    file_name = re.findall(name_pattern, text)
    file_court = re.findall(court_pattern, text)

    # use a conditional to remove the first two name matches, once
    if c == False:
        file_name = file_name[2:]
        c = True

    # pair each file path with its court, then iterate to next folder
    for j in range(len(file_name)):
        data.append((str(i) + str(file_name[j]), file_court[j]))

Term Frequency – Inverse Document Frequency (TF-IDF) feature weighting is one of the simplest and most widely used term-weighting schemes.

The Term Frequency (TF) method weights a term by how often it appears in a single document. The higher the TF value of a term in a document, the greater the effect of that term on the document.

The Inverse Document Frequency (IDF) method weights a term by how many documents in the corpus contain it: the fewer documents a term appears in, the higher its IDF value.

TF-IDF, in this use case, is a preprocessing dependency for the cosine similarity and clustering steps that support efficiency.
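
As a rough illustration of the weighting, the sketch below computes TF-IDF for a tiny tokenized corpus with only the standard library. The function name `tf_idf`, the sample documents and the smoothed IDF form `log((N + 1) / (df + 1))` are assumptions made for the example; other IDF variants are equally common.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF = share of the document's tokens; IDF is smoothed (assumption)
            term: (count / len(doc)) * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights


# made-up corpus, for illustration only
docs = [["asylum", "granted", "asylum"], ["asylum", "denied"], ["appeal", "denied"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the first document "granted" outweighs "asylum" even though "asylum" occurs twice, because "asylum" also appears in another document and so carries a lower IDF.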


In [None]:
## description of flow
# court - holds description of court instance (ij(initial hearing), bia(appeal))
# body
# TF-IDF(doc) = (TF-IDF(court) * alpha) +
              # (TF-IDF(body) * (1-alpha))

In [None]:
DF = {}

# iterate through all the documents storing all the document id's for each word
# processed_text is the body of the document
for i in range(len(processed_text)):
    tokens = processed_text[i]
    for word in tokens:
        try:
            # add to the set since the set exists
            DF[word].add(i)
        except KeyError:
            # create a set since the word doesn't have a set yet
            DF[word] = {i}


In [None]:
# len(DF)->unique words
# replace each set of document ids with its size, so that
# DF[word] becomes the number of documents containing the word
for i in DF:
    DF[i] = len(DF[i])
print(DF)


# keys of DF
total_vocab = [x for x in DF]
print(total_vocab)


In [None]:
# calculate TF-IDF
# N is the number of documents in the corpus
N = len(processed_text)


def doc_freq(word):
    # number of documents containing word (0 if unseen)
    return DF.get(word, 0)


tf_idf = {}
for doc in range(N):
    # tokens from the body of the document
    tokens = processed_text[doc]

    # count term occurrences in the body plus the court description
    counter = Counter(tokens + processed_court[doc])
    words_count = len(tokens + processed_court[doc])

    # compute TF-IDF for every (doc, token) pair in the body
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = doc_freq(token)
        # smoothed IDF, matching the query-side calculation
        idf = np.log((N + 1) / (df + 1))
        tf_idf[doc, token] = tf * idf

# TF-IDF = body_TF-IDF * body_weight +
          # court_TF-IDF * court_weight

           # where body_weight + court_weight = 1

# returns tuple(doc, token)
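
The weighted combination described in the comments can be sketched as a small helper. The function name `combine_weights`, the sample weights and the value `alpha=0.3` are assumptions for illustration; the actual weighting would need to be tuned once documents are available.

```python
def combine_weights(court_weights, body_weights, alpha=0.3):
    """Blend per-term TF-IDF weights: alpha * court + (1 - alpha) * body."""
    terms = set(court_weights) | set(body_weights)
    return {
        # a term missing from one part contributes 0 from that part
        t: alpha * court_weights.get(t, 0.0) + (1 - alpha) * body_weights.get(t, 0.0)
        for t in terms
    }


# made-up weights, for illustration only
court = {"bia": 0.5}
body = {"bia": 0.1, "asylum": 0.4}
print(combine_weights(court, body, alpha=0.3))
```

Because the two weights sum to 1, the blended value always stays between the court weight and the body weight of the term.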


In [None]:
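# rank using matching score
# works for short queries - need to consider scalability (cosine similarity)
def matchingScore(query):
    # placeholder tokenization; swap in the same preprocessing used for the documents
    tokens = query.lower().split()
    query_weights = {}
    for key in tf_idf:
        if key[1] in tokens:
            # key[0]->doc, key[1]->token
            query_weights[key[0]] = query_weights.get(key[0], 0) + tf_idf[key]
    # documents sorted by descending matching score
    return sorted(query_weights.items(), key=lambda x: x[1], reverse=True)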


### Cosine Similarity
Cosine similarity compares two vectors by the cosine of the angle between them: the inner product of the vectors divided by the product of their lengths. Vectors pointing in similar directions have a cosine close to 1.

In this use case, it is used to measure document similarity in text analysis.
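
A minimal sketch of the calculation on sparse term-weight vectors, using only the standard library. The function name `cosine_similarity` and the choice to return 0.0 for an empty vector are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # only terms present in both vectors contribute to the dot product
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention (assumption): an empty vector matches nothing
    return dot / (norm_a * norm_b)


# made-up term weights, for illustration only
doc_a = {"asylum": 1.0, "granted": 1.0}
doc_b = {"asylum": 1.0}
print(cosine_similarity(doc_a, doc_b))
```

Documents that share no terms score 0, and the score does not depend on document length, only on the direction of the weight vectors.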


In [None]:
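# convert query and documents to vectors
# use np to store document vectors
vect = np.zeros((N, len(total_vocab)))

for i in tf_idf:
    # use list of unique tokens to generate index for each token
    ind = total_vocab.index(i[1])
    vect[i[0]][ind] = tf_idf[i]


# tf, df, idf calculations from query and store in np array
# tokens holds the preprocessed query tokens
Q = np.zeros((len(total_vocab)))

counter = Counter(tokens)
words_count = len(tokens)

for token in np.unique(tokens):
    tf = counter[token]/words_count
    df = doc_freq(token)
    idf = math.log((N + 1) / (df + 1))
    # skip query tokens that are not in the vocabulary
    if token in total_vocab:
        Q[total_vocab.index(token)] = tf * idf


# calculate cosine similarity and return maximum k documents
def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

scores = np.array([cosine_sim(Q, d) for d in vect])
# k (number of documents to return) must be set before running
top_k = scores.argsort()[-k:][::-1]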


K-nearest neighbor (KNN) is a supervised learning algorithm in which a new query instance is classified by the majority category among its K nearest neighbors. The purpose of the KNN algorithm is to classify a new object (document) based on its attributes (keywords) and the training samples, using the minimum distances from the query instance to the training samples to determine the K nearest neighbors.

Traditional KNN text classifiers have several limitations.


1.   High calculation complexity to find the KNN samples - the similarity to every training sample must be calculated
2.   Dependence on the training set - the classifier is generated only from the training samples and does not use any additional data
3.   No weight difference between samples - all samples are treated equally, which doesn't match reality, where samples are commonly unevenly distributed

By combining KNN with K-means, the expected outcome is a reduction in the calculation complexity over the training set, once term weighting has been determined in preprocessing as the descriptor of document importance.
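
For reference, a plain KNN classifier with majority voting might look like the sketch below. It uses cosine similarity as the nearness measure, which is an assumption carried over from the similarity section; the function names, the two-dimensional sample vectors and the labels are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, samples, labels, k=3):
    """Label `query` by majority vote among its k most similar samples."""
    # every training sample is compared with the query (the costly step)
    ranked = sorted(range(len(samples)),
                    key=lambda i: cosine(query, samples[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# made-up training vectors and labels, for illustration only
samples = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["granted", "granted", "denied", "denied"]
print(knn_classify([0.8, 0.3], samples, labels, k=3))
```

The sort over all samples is where the first limitation shows up: classification cost grows with the size of the training set.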


In [None]:
## begin with average KNN model and then adapt to our model needs
my_data = 
  # path will be provided when Database access is provided
  
# set max columns to none to view every column in dataset
pd.options.display.max_columns = None

my_data.head()


In [None]:
# check feature type
my_data.info()


In [None]:
# generic code here until we have access to data and database

# convert any category dtype to object and drop the originals 
# to avoid conflicts in imputation

my_data['judge1'] = my_data['judge'].astype(object)
my_data['court1'] = my_data['court'].astype(object)

my_data = my_data.drop(['judge','court'], axis=1)



In [None]:
# encoder and imputer dependencies
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import KNNImputer

# create a list of categorical columns to iterate over
cat_cols = []
  # list will be edited when Database schema is finalized

# instantiate
encoder = OrdinalEncoder()
# assumption: scikit-learn's KNNImputer stands in for the undefined KNN()
imputer = KNNImputer()

def encode(data):
    """
    function to encode non-null data and
    replace encoded data in original data
    """
    # retain only non-null values
    non_nulls = np.array(data.dropna())

    # reshape data for encoding
    model_reshape = non_nulls.reshape(-1,1)

    # encode data
    model_encoded = encoder.fit_transform(model_reshape)

    # assign back encoded data values to non-null values
    data.loc[data.notnull()] = np.squeeze(model_encoded)

    return data

# iterate through each column in the data
for columns in cat_cols:
    my_data[columns] = encode(my_data[columns])
    

In [None]:
# impute data and convert the result back to a dataframe
encode_data = pd.DataFrame(np.round(imputer.fit_transform(my_data)), columns=my_data.columns)
    # if you prefer to use the remaining data as an array, leave out the pd.DataFrame() call
   

In [None]:
## description of flow
# convert document X to the same text feature form as the training samples
# calculate the similarities between all training samples and document X
# choose the k samples with the largest of the N similarities and treat them as a KNN collection
# calculate the probability of X belonging to each category respectively
# judge doc X to be the category which has the largest cosine similarity


### Cluster using K-means
Due to the number of similarity calculations between samples (each test sample against all of the training samples), traditional KNN has high calculation complexity and can be inefficient with larger datasets.

To address scalability, the KNN samples with the largest similarities are combined with a clustering technique in order to overcome this calculation complexity.

The purpose of combining K-means with KNN is to reduce the time for calculating similarities in the KNN algorithm.
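
One way to picture the combination: cluster each category's training samples with K-means and keep only the cluster centers as the training set for KNN. The sketch below is a standard-library illustration under several assumptions: Euclidean distance, initial centroids taken from the first k points (for reproducibility; random initialization is more common), and made-up sample vectors. `kmeans` and `reduce_training_set` are hypothetical names.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-means on dense vectors; returns the k centroids."""
    # assumption: seed with the first k points so runs are reproducible
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids no longer move
            break
        centroids = new_centroids
    return centroids


def reduce_training_set(samples_by_category, k):
    """Replace each category's samples with its k cluster centers."""
    return {cat: kmeans(samples, k) for cat, samples in samples_by_category.items()}


# made-up training vectors, for illustration only
training = {
    "granted": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.3], [0.8, 0.1]],
    "denied": [[0.1, 1.0], [0.2, 0.8], [0.1, 0.9], [0.3, 1.0]],
}
reduced = reduce_training_set(training, k=2)
print(reduced)
```

A query document is then compared against k centers per category instead of every training sample, which is where the time saving would come from.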


In [None]:
## begin with average K-means model and then adapt to our model needs
my_data = 
  # path will be provided when Database access is provided

my_data.head()


In [None]:
# use the term frequency (TF from TF-IDF) of the 5 types of claims to calculate frequency
# within corpus 
# create list of variables to visualize
claim_cols = [
              'FearOfPersecution',
              'SeriousPhysicalHarm',
              'CoercisveTreatment',
              'InvidiousProsecutionOrPunishment',
              'EconomicOrOtherPersecution',
              ]
  # list will be edited when Database schema is finalized
X = my_data[claim_cols]

# visualize data points: one column of points per claim type
# (plt.scatter takes a single x and a single y per call)
for col in claim_cols:
    plt.scatter([col] * len(X), X[col])
plt.xlabel('Claims for Asylum')
plt.ylabel('Frequency')

plt.show()


In [None]:
# select number of clusters based off initial visualization
k= 

# select random observations as centroids
centroids = X.sample(n=k)

# visualize data points with the initial centroids highlighted
# (plt.scatter takes a single x and a single y per call)
for col in X.columns:
    plt.scatter([col] * len(X), X[col], c='black')
    plt.scatter([col] * len(centroids), centroids[col], c='red')
plt.xlabel('Claims for Asylum')
plt.ylabel('Frequency')

plt.show()


In [None]:
# assign each object to the group that is closest to centroid
# update centroid by calculating average value of cluster
# repeat until centroids no longer move (convergence)
diff = 1
j = 0

while(diff != 0):
    XD = X
    i = 1
    for index1, row_c in centroids.iterrows():
        ED = []
        for index2, row_d in XD.iterrows():
            d1 = (row_c[' '] - row_d[' '])**2
             # need to use cosine similarity values here ??
            d2 = (row_c[' '] - row_d[' '])**2
              # need to use cosine similarity values here ??
            d = np.sqrt(d1 + d2)
            ED.append(d)
        X[i] = ED
        i = i + 1

    cluster = []

    for index, row in X.iterrows():
        min_dist = row[1]
        pos = 1
        for i in range(k):
            if row[i + 1] < min_dist:
                min_dist = row[i + 1]
                pos = i + 1
        cluster.append(pos)

    X['Cluster'] = cluster

    centroids_new = X.groupby(['Cluster']).mean()[[]]

    if j == 0:
        diff = 1
        j = j + 1
    else:
        # one term per feature column; add further terms as needed
        diff = (centroids_new[' ']-centroids[' ']).sum() + (centroids_new[' ']-centroids[' ']).sum()
        print(diff.sum())

    centroids = X.groupby(['Cluster']).mean()[[]]
    

In [None]:
## description of flow
# calculate weight of each document - sum all term weights and divide by total terms in each document
# use cosine similarity as weight

# cluster each large KNN sample using K-means
    # initialize the value of K as the number of clusters of doc to be created
    # generate the centroid randomly
    # assign each object to the group that is closest to centroid
    # update centroid by calculating average value of cluster
    # repeat until centroids no longer move (convergence)
# after clustering for each category, the cluster centers now represent the new training sets for KNN algorithm 
# - reducing time needed for calculating similarities


## Explore LDA as an alternative to K-Means


## Evaluation

In [None]:
## description of flow
# k-fold cross validation ??
# create confusion matrix - cluster by system (Y/N), cluster is actually (Y/N)
# calculate recall
# calculate precision
# calculate F-measure ??
# inertia ??
# dunn index ??
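
The precision, recall and F-measure steps in the flow can be sketched without any dependencies. The function name `precision_recall_f1` and the sample labels are assumptions for illustration, and the F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # confusion matrix counts for the positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # guard against empty denominators by returning 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# made-up labels, for illustration only
true = ["granted", "denied", "granted", "denied", "granted"]
pred = ["granted", "granted", "denied", "denied", "granted"]
print(precision_recall_f1(true, pred, positive="granted"))
```

With more than two outcome categories, the same function can be called once per category and the results averaged.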


# Exploratory Visualizations

## Bar chart
specific judge's (historical) grants vs denials of similarly classified cases

In [None]:
# generic code until we have real data and database schema
fig = go.Figure()

fig.add_trace(go.Bar(
  x = [['First', 'First', 'Second', 'Second'],
       ["A", "B", "A", "B"]],
  y = [2, 3, 1, 5],
  name = "Adults",
))

fig.add_trace(go.Bar(
  x = [['First', 'First', 'Second', 'Second'],
       ["A", "B", "A", "B"]],
  y = [8, 3, 6, 5],
  name = "Children",
))

fig.update_layout(title_text="Multi-category axis")

fig.show()

In [None]:
# generic code until we have data and database schema
df = px.data.tips()
fig = px.bar(df, x="day", y="total_bill", color="smoker", barmode="group", facet_col="sex",
             category_orders={"day": ["Thur", "Fri", "Sat", "Sun"],
                              "smoker": ["Yes", "No"],
                              "sex": ["Male", "Female"]})
fig.show()



## Line graph(s)
(historical) term frequency associated with specific judge and denials
(including appellate data)
(excluding appellate data)

In [None]:
# generic code until we have data and database schema
df = px.data.gapminder()
all_continents = df.continent.unique()

app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Checklist(
        id="checklist",
        options=[{"label": x, "value": x} 
                 for x in all_continents],
        value=all_continents[3:],
        labelStyle={'display': 'inline-block'}
    ),
    dcc.Graph(id="line-chart"),
])

@app.callback(
    Output("line-chart", "figure"), 
    [Input("checklist", "value")])
def update_line_chart(continents):
    mask = df.continent.isin(continents)
    fig = px.line(df[mask], 
        x="year", y="lifeExp", color='country')
    return fig

app.run_server(debug=True)



In [None]:
# generic code until we have data and database schema 
df = 

fig = px.line(df, x='Date', y='AAPL.High', title='Time Series with Range Slider and Selectors')

fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.show()


## Bubble chart
comparison of all current seated judges (historical) grants vs denials of similarly classified cases
(or alternative bubble chart to be determined)

In [None]:
# generic code until we have data and database schema
# Load data, define hover text and bubble size
data = px.data.gapminder()
df_2007 = data[data['year']==2007]
df_2007 = df_2007.sort_values(['continent', 'country'])

hover_text = []
bubble_size = []

for index, row in df_2007.iterrows():
    hover_text.append(('Country: {country}<br>'+
                      'Life Expectancy: {lifeExp}<br>'+
                      'GDP per capita: {gdp}<br>'+
                      'Population: {pop}<br>'+
                      'Year: {year}').format(country=row['country'],
                                            lifeExp=row['lifeExp'],
                                            gdp=row['gdpPercap'],
                                            pop=row['pop'],
                                            year=row['year']))
    bubble_size.append(math.sqrt(row['pop']))

df_2007['text'] = hover_text
df_2007['size'] = bubble_size
sizeref = 2.*max(df_2007['size'])/(100**2)

# Dictionary with dataframes for each continent
continent_names = ['Africa', 'Americas', 'Asia', 'Europe', 'Oceania']
continent_data = {continent:df_2007.query("continent == '%s'" %continent)
                              for continent in continent_names}

# Create figure
fig = go.Figure()

for continent_name, continent in continent_data.items():
    fig.add_trace(go.Scatter(
        x=continent['gdpPercap'], y=continent['lifeExp'],
        name=continent_name, text=continent['text'],
        marker_size=continent['size'],
        ))

# Tune marker appearance and layout
fig.update_traces(mode='markers', marker=dict(sizemode='area',
                                              sizeref=sizeref, line_width=2))

fig.update_layout(
    title='Life Expectancy v. Per Capita GDP, 2007',
    xaxis=dict(
        title='GDP per capita (2000 dollars)',
        gridcolor='white',
        type='log',
        gridwidth=2,
    ),
    yaxis=dict(
        title='Life Expectancy (years)',
        gridcolor='white',
        gridwidth=2,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)
fig.show()


# Linear Models
the team agrees that this model is not appropriate for our problem

In [None]:
# Build linear regression model using predictors
predictors = []
    # update predictors when schema for database is complete
    
# Split data into predictors X and output y
X = my_data[predictors]
y = my_data['adjudication']

# Initialize and fit model
lr = LinearRegression()
model = lr.fit(X, y)

# identify alpha and beta
print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')

# predict judicial outcome
model.predict(X)


In [None]:
# new predictions
new_X = [[_,_]]
print(model.predict(new_X))


In [None]:
class Adjudication(BaseModel):
    pass

When adding endpoints to ml.py:


*   consider null values 
*   client error/404
*   check instance type
*   bound integers
*   explicitly identify variable object type and allow for conversion
*   ***use class method and define function to create json request body***
