# Pre-processing

###### Prep Test index format: 
For the purposes of the LSAT, the typical "Prep Test/Section/Question Number" format was used; e.g. [21.0,02,05] indicates Prep Test 21, Section 2, Question 5.  Two irregularities exist.  First, a decimal was used in the prep test reference.  This allows the Prep Test exams to remain in both chronological and numerical order, since six LSAT prep test exams were released periodically throughout the years.  See https://www.powerscore.com/lsat/help/pub_ident.cfm.  For example, "June 2007" was imported as [51.5,01,01] since June 2007 falls chronologically between Prep Test 51 (Dec 2006) and Prep Test 52 (Sept 2007).

Second, section numbers 01 and 02 were used consistently throughout the data set, since the LSAC consistently provided its exams in LG, LR, LR, RC order throughout all of its practice exams online; see https://www.lsac.org/lsat/prep#lsatprepplus .  Since the official LSAT rotates these sections for different students on test day, the section numbers may not correspond with the actual exam a test taker experienced, or with one downloaded from a test provider's website such as https://www.kaplan.com .
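As a minimal sketch of how this index format can be handled, the snippet below splits an identifier into its three parts and sorts identifiers chronologically. It assumes the underscore-separated form stored in the `Order` column later in this notebook (e.g. "51.5_01_01"); the names `QuestionId` and `parse_order` are hypothetical and are not part of the notebook.

```python
from typing import NamedTuple


class QuestionId(NamedTuple):
    prep_test: float   # e.g. 51.5 for the June 2007 exam
    section: int
    question: int


def parse_order(order: str) -> QuestionId:
    """Split an identifier such as '51.5_01_01' into prep test, section and question."""
    prep_test, section, question = order.split("_")
    return QuestionId(float(prep_test), int(section), int(question))


# Because the prep test is numeric, sorting on the parsed tuple keeps
# "June 2007" (51.5) between Prep Test 51 and Prep Test 52.
ids = ["52.0_01_03", "51.5_01_01", "51.0_02_25"]
ordered = sorted(ids, key=parse_order)
print(ordered)
```

Sorting on the parsed tuple rather than on the raw string avoids surprises if a prep test number ever has a different number of digits.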

###### Pipe addition and removal from pipeline
Most Python cells within a Jupyter notebook may be run multiple times within the span of running an entire notebook.  However, a spaCy pipe cannot be re-added or overwritten while a pipe with the same name is already in the pipeline.  Therefore, several instances of code to delete pipes are commented out throughout this project.  These instances may be uncommented, run, then commented again to facilitate this process.

###### Code output review:
Jupyter Notebook has an extension titled "Collapsible Headings" which allows code folding based upon Markdown headers; see https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/collapsible_headings/readme.html .  While this is not necessary to run this notebook, it helps to view the overall flow, since a number of code snippets are included to give the reviewer a glimpse into how the data is being processed and returned.  Examples of this non-essential but helpful code can be found between Markdown headings titled similar to the following.

"REVIEW sample variable START" 

"REVIEW sample variable END"

### Required packages

In [34]:
# Required Packages & other dependencies
# NOTE: package imports are duplicated here and throughout the entire project.  This will facilitate
#   (1) Ease of import: One only needs to look here to view every required installation
#   (2) Understanding of functionality: duplicated package imports are commented out, but should identify 
#         to the reader where imports become necessary


import pandas as pd
import spacy
from spacy.matcher import Matcher
import numpy as np 

from spacy.util import minibatch, compounding, decaying

import itertools
from functools import reduce
import operator
import random



In [33]:
# NOTE: A Spacy model must be loaded, either a default model or a custom-designed one.  
#    For this capstone, "en_core_web_sm" was used.  More information may be found here
#    https://spacy.io/usage/models

### Prep Dataframe

In [3]:
# import pandas as pd
df = pd.read_csv('PT1988.v4.txt', sep='\t', encoding='utf_8') # allows for special utf8 char

In [4]:
# Adds the correct answer (as str) to the df labeled "Correct_Strings".
# NOTE: this cell was NOT USED in this capstone and may be omitted.

Correct_Strings = []     # A list of [str, str] containing the correct answers
for row in range(len(df)):    
    Correct_Letter_Col = df.loc[row, "Correct_Letter"]        # List of [str, str] of correct letters. [A, C, D, A,...]
    Correct_Strings.append(df.loc[row, Correct_Letter_Col])   # appends the str of correct answers to Correct_strings as text
   
df['Correct_Strings'] = Correct_Strings   # adds the list of correct answers to df as a column.

### Filtering out and correcting Sentence Boundaries

In [5]:
# relevant review material
# Custom Sentence stops, for ":"   https://spacy.io/usage/linguistic-features#sbd-custom
# Tokens, including "nbor()", and "is_sent_start" : https://spacy.io/api/token
# Rule based matching: https://spacy.io/usage/rule-based-matching
# Online Spacy Pattern Matcher : https://explosion.ai/demos/matcher

# import spacy  # Import Spacy module
# Disables non-essential pipes such as NER, tagger, etc.  https://spacy.io/usage/processing-pipelines#disabling
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])

# Custom boundary markers for sentences are updated below.  Where possible, a small sample of text with "\n" is used
#    to illustrate the original issue.  
# NOTE: a 'doc' object is a single stimulus of multiple sentences in the function.
def set_custom_boundaries(doc):
    last = len(doc) - 1   # index of the final token, used to avoid indexing past the end of the doc
    for token in doc[:-1]:
        nxt = token.nbor().text   # text of the token immediately to the right
        if token.text == ":":   # See 28.0_02_07 and 34.0_02_08 ("Tiya: But\nsome...") and 24.0_02_15 ("Dr. Z: \nMany")
            # Neither the colon nor the next two tokens may start a sentence.
            for j in range(token.i, min(token.i + 2, last) + 1):
                doc[j].is_sent_start = False
        if token.text in ('.', '?', '!') and nxt == '"':   # See "20.0_02_22" ( ."\n ) and "22.0_02_11" ( ?"\n )
            doc[token.i+1].is_sent_start = False
        if token.text == ',' and nxt == '"':   # See "31.0_02_15" as example
            doc[token.i].is_sent_start = False
        if token.text == 'Dr.':
            doc[token.i+1].is_sent_start = False
    return doc


# adds this sentence boundary recognizer before the parser. 
nlp.add_pipe(set_custom_boundaries, before="parser")
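The boundary rules can also be exercised without loading a spaCy model. The sketch below applies the same rules to a plain list of token strings and returns the indices that may not start a sentence. It is an illustration only: `blocked_sentence_starts` is a hypothetical name, and the clamp on the two tokens after a colon is an added guard against running past the end of the list.

```python
def blocked_sentence_starts(tokens):
    """Return the sorted indices of tokens that must NOT start a sentence."""
    blocked = set()
    last = len(tokens) - 1
    for i, text in enumerate(tokens[:-1]):   # the final token has no right neighbor
        nxt = tokens[i + 1]
        if text == ":":
            # the colon and up to two following tokens (speaker labels such as "Tiya: But")
            blocked.update(range(i, min(i + 2, last) + 1))
        if text in (".", "?", "!") and nxt == '"':
            blocked.add(i + 1)   # a closing quote stays with the sentence it ends
        if text == "," and nxt == '"':
            blocked.add(i)
        if text == "Dr.":
            blocked.add(i + 1)   # the name after "Dr." does not start a sentence
    return sorted(blocked)


tokens = ["Tiya", ":", "But", "some", "defects", "appear", "."]
print(blocked_sentence_starts(tokens))
```

Running the rules on plain strings makes it easy to add a new rule and check it against a problem stimulus before touching the pipeline.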

In [1]:
# To delete this pipeline part
# name, component = nlp.remove_pipe("set_custom_boundaries")

#### REVIEW Sent Tokenizer Test START

In [6]:
# Word Tokens and Neighbors
#
# This review snippet was utilized in two ways.
#   First: to identify whether or not a token was recognized as a sentence start.
#   Second: to help identify neighbors to certain word tokens used to identify and fix incorrect sentence boundaries
#     by utilizing "w.nbor()" (i.e. the word's neighbor) in the "if" statements of set_custom_boundaries.

test_boundaries = "28.0_02_07"  # Sample PT/Sec/Num of the stimulus to review
test_stimulus_to_sentences = nlp([*df[df['Order'].str.contains(test_boundaries)].loc[df[df['Order']==test_boundaries].index.values,"Stimulus"]][0]).text
# doc_test_sentences = nlp(test_stimulus_to_sentences)
print(test_stimulus_to_sentences)  # prints the entire stimulus as a single line.

print("word:\t Sent_Start\tNeighbor(0) to the right")  # column headers used in "for" loop below
# NOTE: the final token has no right-hand neighbor, so w.nbor() raises IndexError [E042] when the loop reaches it.
for w in nlp(test_stimulus_to_sentences):
    print(w, "\t", w.is_sent_start, "\t\t", w.nbor())


Sam: In a recent survey, over 95 percent of people who purchased a Starlight automobile last year said they were highly satisfied with their purchase. Since people who have purchased a new car in the last year are not highly satisfied if that car has a manufacturing defect, Starlight automobiles are remarkably free from such defects.  Tiya: But some manufacturing defects in automobiles become apparent only after several years of use.
word:	 Sent_Start	Neighbor0
Sam 	 True 		 :
: 	 False 		 In
In 	 False 		 a
a 	 False 		 recent
recent 	 None 		 survey
survey 	 None 		 ,
, 	 None 		 over
over 	 None 		 95
95 	 None 		 percent
percent 	 None 		 of
of 	 None 		 people
people 	 None 		 who
who 	 None 		 purchased
purchased 	 None 		 a
a 	 None 		 Starlight
Starlight 	 None 		 automobile
automobile 	 None 		 last
last 	 None 		 year
year 	 None 		 said
said 	 None 		 they
they 	 None 		 were
were 	 None 		 highly
highly 	 None 		 satisfied
satisfied 	 None 		 with
with 	 None 		 their
their

IndexError: [E042] Error accessing doc[78].nbor(1), for doc of length 79.

In [268]:
# Print a test Stimulus to Sentences to evaluate the tokenizer
#
# Used to evaluate a single stimulus and verify that sentences were properly parsed.
# nlp.replace_pipe("set_custom_boundaries", )
test_boundaries = "28.0_02_07"   # <- enter the "PT_Sec_Quest" to evaluate

# NOTE: 106 is the absolute row index for 28.0_02_07 above.  However, this code only requires that the "28.0_02_07" format 
#    be used
# logic behind stimulus call     [*df[df['Order'].str.contains("19.0_01_07")]   .loc[106,                                          "docS"        ] .sents]   
test_stimulus_to_sentences = nlp([*df[df['Order'].str.contains(test_boundaries)].loc[df[df['Order']==test_boundaries].index.values,"Stimulus"]][0]).sents
 
for ii, s in enumerate(test_stimulus_to_sentences):
    print(ii, s)

0 Sam: In a recent survey, over 95 percent of people who purchased a Starlight automobile last year said they were highly satisfied with their purchase.
1 Since people who have purchased a new car in the last year are not highly satisfied if that car has a manufacturing defect, Starlight automobiles are remarkably free from such defects.  
2 Tiya: But some manufacturing defects in automobiles become apparent only after several years of use.


#### REVIEW Sent Tokenizer Test END

In [7]:
###### Prints the current pipe order in the Spacy pipeline
nlp.pipe_names

['set_custom_boundaries', 'parser']

In [19]:
# Attaches one 'doc' object to each stimulus; the sentences of a stimulus are then available through doc.sents
#   Code modified from: https://stackoverflow.com/questions/46981137/tokenizing-using-pandas-and-spacy
df['docS'] = df['Stimulus'].apply(lambda x: nlp(x)) # creates a col of nlp docs from the Stimuli

In [20]:
df.head(3)  # Run time 30 seconds on 3606 rows

Unnamed: 0,index,Order,PT,Section,Question_Number,Stimulus,Question_Type,Star_Level,Correct_Letter,Question_Stem,A,B,C,D,E,END,Notes,Correct_Strings,docS
0,0,51.5_01_01,51.5,1,1,Economist: Every business strives to increase ...,Main Point,1,B,Which one of the following most accurately exp...,If an action taken to secure the survival of a...,Some measures taken by a business to increase ...,Only if the employees of a business are also i...,There is no business that does not make effort...,Decreasing the number of employees in a busine...,.,June 2007,Some measures taken by a business to increase ...,"(Economist, :, Every, business, strives, to, i..."
1,1,51.5_01_02,51.5,1,2,All Labrador retrievers bark a great deal. All...,Parallel Flaw,1,B,Which one of the following uses flawed reasoni...,All students who study diligently make good gr...,All type A chemicals are extremely toxic to hu...,All students at Hanson School live in Green Co...,All transcriptionists know shorthand. All engi...,All of Kenisha's dresses are very well made. A...,.,June 2007,All type A chemicals are extremely toxic to hu...,"(All, Labrador, retrievers, bark, a, great, de..."
2,2,51.5_01_03,51.5,1,3,"A century in certain ways is like a life, and ...",Inference (Completes),1,D,Which one of the following most logically comp...,reminisce about their own lives,fear that their own lives are about to end,focus on what the next century will bring,become very interested in the history of the c...,reflect on how certain unfortunate events of t...,.,June 2007,become very interested in the history of the c...,"(A, century, in, certain, ways, is, like, a, l..."


# Create Search Patterns for identifying Conclusions based upon classic keywords (not utilized)

#### NOTE: This was an initial base case model for the capstone. Later, it was abandoned for a base case that more closely modeled the original intent of this project.  It has been left here for review.

In [21]:
# Resources for basic Spacy keyword search patterns are below.
# 
# https://stackabuse.com/python-for-nlp-vocabulary-and-phrase-matching-with-spacy/
# Interactive Pattern Matcher online : https://explosion.ai/demos/matcher
# import spacy
from spacy.matcher import Matcher

#### Conclusion 1: by Keywords

In [22]:
c1 = [{'LOWER': 'thus'}]        # Thus
c2 = [{'LOWER': 'therefore'}]   # Therefore
c3 = [{'LOWER': 'hence'}]       # Hence
c4 = [{'LOWER': 'so'}]          # So
c5 = [{'LOWER': 'as'},          # as a result
      {'LOWER': 'a'},             
      {'LOWER': 'result'}]        
c6 = [{'LOWER': 'it'},          # it follows that
      {'LOWER': 'follows'},      
      {'LOWER': 'that'}]        
c7 = [{'LOWER': 'is'},          # is evidence that
      {'LOWER': 'evidence'},      
      {'LOWER': 'that'}]         
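For readers without spaCy at hand, the effect of these `LOWER` patterns can be approximated in plain Python: each pattern is a sequence of lower-cased tokens that must appear consecutively. The sketch below is a stand-in for the `Matcher`, not its implementation; `CONCLUSION_PHRASES` and `find_phrases` are hypothetical names, and the sample sentence is made up for illustration.

```python
CONCLUSION_PHRASES = [
    ("thus",), ("therefore",), ("hence",), ("so",),
    ("as", "a", "result"),
    ("it", "follows", "that"),
    ("is", "evidence", "that"),
]


def find_phrases(tokens, phrases=CONCLUSION_PHRASES):
    """Return (start, end) token spans whose lower-cased text equals one of the phrases."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for start in range(len(lowered)):
        for phrase in phrases:
            end = start + len(phrase)
            if tuple(lowered[start:end]) == phrase:
                hits.append((start, end))
    return hits


sentence = "It follows that the plan failed , so we must revise it".split()
print(find_phrases(sentence))
```

As with the real patterns, a bare keyword such as "so" matches wherever it occurs, including uses that do not introduce a conclusion.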

#### Conclusion 2: by Subsidiary Conclusion p171 (selects last conclusion found)

In [9]:
# No pattern matching was necessary for this model.  This conclusion type was identified by first finding all instances of keywords from C1 type above, then selecting the last sentence

#### Conclusion 3: by Evidence KW p173

In [9]:
# because, since, ___ is evidence of ___, After all, For
# e# represents the evidence keyword number pattern identified.

In [23]:
e1 = [{'LOWER': 'because'}]     #   
e2 = [{'LOWER': 'since'}]       #
e3 = [{'LOWER': 'for'}]         # ***** remove instances e.g. "are submitted for approval to"
e4 = [{'LOWER': 'is'},          #   
      {'LOWER': 'evidence'},    #          
      {'LOWER': 'of'}]          #
e5 = [{'LOWER': 'after'},       # ***** Assumes that "After all" does not start an entire stimulus.
      {'LOWER': 'all'}]         #         This pattern match was modified later on to tag the sentence prior to the one found

# m_tool.add('CON3b', None, e5)

# m_tool.remove("CON3a")

#### Conclusion 4: from Context p171  

#### Conclusion 5: as negation of opponents point p171 

In [13]:
# The purpose of this capstone was to identify conclusions based upon this type.  
#      Its code is implemented later as an ensemble (CNN,BOW) model

# Identify Conclusion sentences based upon c1-c3.   (not utilized)

In [24]:
cMatches = []
c1Matchs = []
c2Matchs = []
c3aMatchs = []
c3bMatchs = []

c1_tool = Matcher(nlp.vocab)
c3a_tool = Matcher(nlp.vocab)
c3b_tool = Matcher(nlp.vocab)

c1_tool.add('CON1', None, c1, c2, c3, c4, c5, c6, c7)    # by Conclusion Keyword
c3a_tool.add('CON3a', None, e1, e2, e3, e4)
c3b_tool.add('CON3b', None, e5)

# import numpy as np
m = np.empty((0,4))  # A single Matrix per stimulus, Creates an empty Numpy 2D array of SENTENCES x METHOD
m2 = []    # An empty list to contain all of the matrices found and add to "df"

for row in df.loc[:, 'docS']:  # 41 - 48 has some "after all"
    m = np.empty((0,4))    # Resets the array for each stimulus

    for sentence in row.sents:

        c1m  = c1_tool(sentence)      # Searches for patterns
        c3am = c3a_tool(sentence)
        c3bm = c3b_tool(sentence)
        
        c1mCount  = len(list(c1m))    # 
        c3amCount = len(list(c3am))
        c3bmCount = len(list(c3bm))
        
        m = np.vstack((m, np.array([c1mCount, 0, c3amCount, c3bmCount])))

    
    # Wrap-up processing of Numpy Matrix
    # Fold method 3b into 3a: shift col 3 up one row so that each "after all" hit is credited to the
    #   preceding sentence (np.roll with shift=-1), then sum it into col 2 ('conclusion by evidence kw').
    m[:,2] = m[:,2] + np.roll(m[:,3], -1, axis = 0)

    # If a c1 match exists, finds the last non-zero index for method c1 (Conclusion KW) and flags it for c2
    if np.any(m[:,0]): m[:,1] = np.bincount([np.max(np.nonzero(m[:,0]))], None, len(m[:,0]))
        
    # appends numpy matrix to running list, to become a column of counts.
    m2.append(m.astype(int))   # appends data found for current stimulus to a master list of all methods found.

df['cType_counts'] = m2  # 3-5 seconds
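The wrap-up step on each per-stimulus matrix can be illustrated with plain lists. The sketch below follows the intent stated in the `e5` pattern comments (an "after all" hit tags the sentence prior to the one found) and flags only the last sentence containing a conclusion keyword for method 2. `finalize_counts` is a hypothetical name, and the sample counts are made up.

```python
def finalize_counts(rows):
    """rows: one [c1, c2, c3a, c3b] count list per sentence. Returns a new list of rows."""
    n = len(rows)
    out = [list(r) for r in rows]
    # Fold method 3b into 3a, crediting the sentence *before* each "after all" hit.
    # The modulo wraps around like np.roll, hence the assumption that "After all"
    # never opens a stimulus.
    for i in range(n):
        out[i][2] += rows[(i + 1) % n][3]
    # Method 2: flag only the last sentence that contains a conclusion keyword.
    keyword_rows = [i for i, r in enumerate(rows) if r[0] > 0]
    if keyword_rows:
        out[keyword_rows[-1]][1] = 1
    return out


rows = [[1, 0, 0, 0],   # sentence 0: one conclusion keyword
        [0, 0, 1, 0],   # sentence 1: one evidence keyword
        [1, 0, 0, 1]]   # sentence 2: conclusion keyword plus "after all"
print(finalize_counts(rows))
```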

In [25]:
# PT/Sec/Quest      QType       [0,1,0]       Sent1
#                               [0,1,0]       Sent2
#                               [0,1,1]       Sent3
#                               [1,1,0]       Sent4
dftest = df.loc[:, ['Order', 'Question_Type', 'docS', 'cType_counts']]

In [37]:
dftest.head()

Unnamed: 0,Order,Question_Type,docS,cType_counts
0,51.5_01_01,Main Point,"(Economist, :, Every, business, strives, to, i...","[[0, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0]]"
1,51.5_01_02,Parallel Flaw,"(All, Labrador, retrievers, bark, a, great, de...","[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1,..."
2,51.5_01_03,Inference (Completes),"(A, century, in, certain, ways, is, like, a, l...","[[0, 0, 0, 0], [1, 1, 0, 0]]"
3,51.5_01_04,Flaw,"(Consumer, :, The, latest, Connorly, Report, s...","[[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]"
4,51.5_01_05,Weaken,"(Scientist, :, Earth, 's, average, annual, tem...","[[0, 0, 0, 0], [0, 0, 0, 0]]"


### Formats results to save as CSV file. 

In [None]:
frames = []   # one small DataFrame per stimulus; concatenated once after the loop

for index, Order, Question_Type, d, cType_counts in dftest.itertuples():
    numS = len(cType_counts)   # number of sentences in this stimulus

    # Identifier and question type appear only on the first row of each stimulus; remaining rows hold "0"
    dfOrder = pd.DataFrame(np.zeros((numS,1)).astype(int).astype(str))
    dfOrder.iloc[0,0] = Order
    
    dfQuestion_Type = pd.DataFrame(np.zeros((numS,1)).astype(int).astype(str))
    dfQuestion_Type.iloc[0,0] = Question_Type    
    
    dfStim = [s.text for s in d.sents]                          # sentence text, one row per sentence
    dfTypes = [counts.astype(int) for counts in cType_counts]   # per-sentence method counts

    dfQuestion = pd.concat([dfOrder, dfQuestion_Type, pd.DataFrame(np.asarray(dfTypes).astype(int)), pd.DataFrame(dfStim)], axis = 1)
    frames.append(dfQuestion)

dfResults = pd.concat(frames)    # 25 seconds
dfResults.to_csv('output2.csv')

# ##############################################################

# Ensemble model to ID conclusions based upon negations

### Prep Train / Test Data

In [None]:
# The list of lists below, in the format ["PT_Sec_Quest#", index of C5 sentence], was used to feed stimuli to the trainer.
#    e.g. [["19.0_01_07",  1]] represents PT19, Section 01, Question 07, where the second sentence (index 1) is
#       the conclusion as a negation.  A "-1" was used to identify stimuli where conclusion as negation was NOT
#       present.

In [38]:
c5train = [["19.0_01_07",  1],  
           ["20.0_01_14",  2], 
           ["20.0_02_22",  3],
           ["23.0_01_02",  2],
           ["24.0_01_15",  2],
           ["25.0_01_07",  1],
           ["28.0_02_13",  3],
           ["29.0_01_11",  1],
           ["29.0_02_06",  2],
           ["32.0_02_12",  1],
           ["34.0_01_15",  1],
           ["34.0_01_18",  1],
           ["35.0_01_03",  1],
           ["35.0_01_16",  1],
           ["36.0_01_06",  2],
           ["38.0_01_02",  3],
           ["51.5_01_01",  1],
           ["29.0_02_08", -1],
           ["58.0_02_01", -1],
           ["35.0_02_08",  2],  
           ["26.0_01_19", -1],
           ["64.0_01_25", -1],
           ["79.0_02_15",  1],
           ["66.0_01_18", -1],
           ["74.0_01_08", -1],
           ["68.0_01_11", -1],
           ["64.0_02_14",  1],  
           ["28.0_02_26",  1],  
           ["55.0_02_20",  3],  
           ["74.0_02_01",  1],
           ["24.0_02_04",  1], 
           ["67.0_02_23", -1], 
           ["73.0_02_06", -1], 
           ["38.0_02_03", -1], 
           ["38.0_02_19",  2], 
           ["67.0_01_06", -1], 
           ["39.0_01_13",  3], 
           ["67.0_01_12", -1], 
           ["32.0_02_03",  2],
           ["63.0_01_06", -1], 
           ["83.0_01_23", -1],  
           ["87.0_02_07", -1], 
           ["84.0_02_23", -1], 
           ["75.0_02_03",  3],  
           ["52.0_01_08",  1],  
           ["88.0_02_12", -1],
           ["70.0_01_07",  2],
           ["69.0_02_19",  1],
           ["49.0_02_06",  1],
           ["44.0_01_19", -1],
           ["56.5_02_22", -1],
           ["65.0_01_01",  2],
           ["71.0_01_03", -1],
           ["73.0_01_14",  2],
           ["32.0_02_09",  2],
           ["81.0_01_08", -1],
           ["32.0_01_17", -1],
           ["67.0_01_04", -1],
           ["73.0_02_19",  1],
           ["51.0_02_02",  1],
           ["43.0_01_15",  1],
           ["67.0_01_05", -1],
           ["26.0_01_15", -1],
           ["36.0_01_08",  1],
           ["45.0_02_13", -1],
           ["84.0_02_19",  2],
           ["61.0_02_22",  3],  
           ["73.0_01_03",  1],
           ["24.0_01_23",  2],
           ["26.0_01_20",  1],
           ["27.0_01_22", -1],
           ["71.0_02_07", -1],
           ["67.0_01_17", -1],
           ["59.0_01_05",  2],
           ["34.0_02_16", -1],
           ["58.0_02_10", -1],
           ["20.0_01_13",  2],
           ["37.0_02_22",  1],
           ["59.0_02_05",  3],
           ["32.0_02_25",  2],
           ["65.0_02_19",  1],
           ["57.0_01_21",  2],
           ["69.0_01_13", -1],
           ["59.0_01_24", -1],
           ["22.0_01_10", -1],
           ["26.0_02_04", -1],
           ["51.5_02_07",  2],
           ["36.0_02_03", -1],
           ["65.0_01_06",  2],
           ["57.0_01_01",  3],
           ["56.5_02_21", -1],
           ["73.0_02_05",  0],
           ["50.0_02_21", -1],
           ["79.0_02_20",  2],
           ["63.0_01_11", -1], 
           ["72.0_02_06",  3],    
           ["62.0_02_14",  3],
           ["23.0_01_03",  2],
           ["54.0_02_15", -1],
           ["65.0_02_16",  2],
           ["32.0_02_06",  1],
           ["53.0_01_10",  3],
           ["41.0_02_13",  2],    
           ["45.0_01_07", -1],
           ["50.0_01_11",  4],   
           ["65.0_02_17",  1],
           ["80.0_02_10",  2],
           ["46.0_01_01",  1],
           ["85.0_01_24", -1],
           ["54.0_01_11", -1], 
           ["71.0_01_18",  2],
           ["87.0_02_17", -1],
           ["66.0_02_23",  1],
           ["36.0_01_01",  2],
           ["38.0_02_18",  2],  
           ["39.0_02_08",  2],
           ["34.0_02_25",  2],
           ["61.0_02_10",  1],
           ["36.0_02_06", -1],
           ["64.0_01_18", -1],
           ["81.0_02_07", -1],
           ["28.0_02_06",  3],
           ["31.0_02_04",  2],
           ["43.0_02_20", -1],
           ["27.0_02_15",  1],
           ["71.0_02_23",  4],
           ["74.0_01_16",  5],   
           ["31.0_02_24",  3],   
           ["54.0_01_08",  1],
           ["42.0_01_08",  2],  
           ["33.0_02_13",  1],
           ["30.0_01_24",  4],
           ["52.0_01_17", -1],
           ["32.0_01_25",  1],
           ["22.0_02_25", -1],
           ["78.0_01_04",  1],
           ["19.0_02_25",  2],
           ["53.0_01_23",  2],
           ["82.0_02_26",  2],
           ["23.0_02_01",  3],
           ["23.0_02_16",  2],
           ["87.0_02_12",  2],
           ["50.0_02_04",  2],
           ["70.0_02_04",  1],
           ["43.0_01_04", -1],
           ["40.0_01_06", -1],
           ["51.5_01_02", -1],
           ["32.0_01_06", -1],
           ["70.0_01_17", -1],
           ["84.0_01_21", -1],
           ["36.0_02_14", -1],
           ["33.0_01_06", -1],
           ["31.0_01_23", -1],
           ["83.0_01_11",  1],
           ["87.0_01_02",  1],
           ["48.0_01_13",  2],
           ["80.0_01_25", -1],
           ["24.0_02_09", -1],
           ["57.0_01_25", -1],
           ["65.0_02_18", -1],
           ["85.0_01_18", -1],
           ["48.0_01_20", -1],
           ["61.0_02_03", -1],
           ["25.0_01_09", -1],
           ["29.0_02_10", -1],
           ["56.0_02_05", -1],
           ["61.0_01_13", -1],
           ["77.0_02_20", -1],
           ["81.0_01_12", -1],
           ["86.0_01_01",  1],
           ["78.0_01_17",  1],
           ["43.0_01_11",  1],
           ["60.0_01_04",  2],
           ["66.0_02_11",  1],
           ["76.0_02_02",  3],
           ["36.0_01_07",  2],
           ["30.0_01_03",  1],
           ["49.0_01_05",  1],
           ["58.0_02_18",  2],
           ["85.0_02_14",  2],
           ["22.0_02_19",  4],
           ["33.0_02_11",  5],
           ["53.0_01_16",  2],
           ["27.0_02_06",  3],
           ["51.0_01_19",  2],
           ["57.0_01_13",  2],
           ["59.0_01_07",  2],
           ["60.0_01_18",  2],
           ["69.0_01_23",  2],
           ["62.0_02_26",  1],
           ["34.0_01_13",  2]]   

In [39]:
# Returns the identifiers that are NOT in the labeled list above: c5train_index holds the labeled identifiers,
#    c5full_index holds ALL 3606 LR stimuli, and c5test_index is the difference between the two.
#    It should be noted that none of c5test_index was part of the original train/test questions 1 thru 191 above.
# import numpy as np 
c5train_index = list(np.array(c5train).transpose()[0])
c5full_index = list(df['Order'])
c5test_index = list(list(set(c5full_index)-set(c5train_index)) + list(set(c5train_index)-set(c5full_index)))
# In the format...
# ['19.0_01_07',
#  '20.0_01_14',
#  '20.0_02_22',..]

In [41]:
c5test = list(zip(c5test_index, [-1] * len(c5test_index)))
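The split between labeled and unlabeled stimuli comes down to a set difference on the identifiers. The sketch below shows the same idea on a toy list while keeping the original ordering, which a bare `set` difference does not guarantee. `held_out_ids` is a hypothetical name, and the identifiers are sample values in the notebook's format.

```python
def held_out_ids(all_ids, labeled_ids):
    """Return the ids in all_ids that are not labeled, keeping their original order."""
    labeled = set(labeled_ids)
    return [i for i in all_ids if i not in labeled]


all_ids = ["19.0_01_06", "19.0_01_07", "19.0_01_08", "20.0_01_14"]
labeled = [["19.0_01_07", 1], ["20.0_01_14", 2]]

test_ids = held_out_ids(all_ids, [pair[0] for pair in labeled])
test_pairs = [(i, -1) for i in test_ids]   # -1 marks "no conclusion index known"
print(test_pairs)
```

Unlike the symmetric difference used in the cell above, this version ignores labeled identifiers that are missing from the full list; whether that is preferable depends on whether such identifiers should surface as errors.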

## Functions

In [42]:
# Function that takes a list of ["PT_Sec_Quest", conclusion index or -1] pairs and returns a list of lists in the format [[sentences], [binary conclusion array], [PT_Sec_Quest]]

def prepCon(df, pairs):                    # pairs as [ [[Sentences], ConcNumber, PT/Sec/Quest], 
    results = []                           #            [[Sent1, Sent2, Sent3], 1, "20.0_02_05"]  ]
    for i in pairs:
#                   [*df[df['Order'].str.contains("19.0_01_07")].loc[106,"docS"                                  ].sents]      
        sentences = [*df[df['Order'].str.contains(i[0])        ].loc[df[df['Order']==i[0]].index.values[0],"docS"].sents]
        results.append([sentences,
                        [np.asarray([0] * len(sentences), dtype=np.int64)] if i[1] == -1 else [np.bincount([i[1]], None, len(sentences))],    # Index of conclusion, expressed as binary list.  e.g. [0,0,1,0]
                        [i[0]]])          # the PT/Sec/Quest Number Identifier.
    return results

alist = prepCon(df, c5train)
blist = prepCon(df, c5test)
# Current run time approx. 20 seconds
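The label vector that `prepCon` builds for each stimulus is a one-hot list over its sentences, or all zeros when the index is -1. A minimal sketch in plain Python is shown below; `conclusion_mask` is a hypothetical name, and the range check is an added safeguard that the notebook's `np.bincount` call does not perform.

```python
def conclusion_mask(n_sentences, conclusion_index):
    """Binary list marking the conclusion sentence; all zeros when the index is -1."""
    if conclusion_index == -1:
        return [0] * n_sentences
    if not 0 <= conclusion_index < n_sentences:
        raise ValueError(
            f"conclusion index {conclusion_index} is outside 0..{n_sentences - 1}"
        )
    mask = [0] * n_sentences
    mask[conclusion_index] = 1
    return mask


print(conclusion_mask(4, 2))    # third sentence is the conclusion
print(conclusion_mask(3, -1))   # no conclusion-as-negation present
```

The check matters because a label index larger than the number of parsed sentences usually means the sentence boundaries for that stimulus were split differently than expected.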

#### REVIEW prepCon results START

In [49]:
# Verify Train/Test Data output. While c5train is used here, any list of lists in the form [["PT_sec_quest", Conc#], etc.]
#    may be used.  This is used to test the function "prepCon()"
test_list = c5train
# test_list = [["19.0_01_07",  1],  
#              ["20.0_01_14",  2]]
compare_list = prepCon(df, test_list)
compare_list[:3]

[[[A commonly accepted myth is that left-handed people are more prone to cause accidents than are right-handed people.,
   But this is, in fact, just a myth, as is indicated by the fact that more household accidents are caused by right-handed people than are caused by left-handed people.],
  [array([0, 1], dtype=int64)],
  ['19.0_01_07']],
 [[Yolanda: Gaining access to computers without authorization and manipulating the data and programs they contain is comparable to joyriding in stolen cars; both involve breaking into private property and treating it recklessly.,
   Joyriding, however, is the more dangerous crime because it physically endangers people, whereas only intellectual property is harmed in the case of computer crimes.  ,
   Arjun: I disagree!,
   For example, unauthorized use of medical records systems in hospitals could damage data systems on which human lives depend, and therefore computer crimes also cause physical harm to people.],
  [array([0, 0, 1, 0], dtype=int64)],


#### REVIEW prepCon results END

## Prep Trainer

In [9]:
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])
# nlp.add_pipe(set_custom_boundaries, before="parser")
# nlp.pipe_names

In [52]:
# Adding the built-in textcat component to the pipeline.
# Page code edited: "text_cat" changed to "textcat" in line #4 below.
textcat=nlp.create_pipe( "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
nlp.add_pipe(textcat, last=True)

# Adding the labels to textcat
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

nlp.pipe_names

['set_custom_boundaries', 'parser', 'textcat']

In [51]:
# Run to remove pipe added above
# nlp.remove_pipe('textcat')

('textcat', <spacy.pipeline.pipes.TextCategorizer at 0x27a5b825c18>)

In [53]:
# Preps a single stimulus as a list of (sentence text, {'cats': {'POSITIVE': bool, 'NEGATIVE': bool}}) tuples
def load_stimuls(stim):      # takes one element of alist or blist
    sentences, labels, ident = stim   # sentences=list of sentence spans, labels=[0/1 array], ident=[PT_Sec_Quest]
    l = []
    for s, y in zip(sentences, labels[0]):
        l.append((s.text, {'cats': {'POSITIVE': bool(y), 'NEGATIVE': not bool(y)}}))
    return l
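The target format is easiest to see on plain strings. The sketch below pairs each sentence with the `{'cats': ...}` dictionary shown in the review output further down; `to_textcat_examples` is a hypothetical name, and the two sentences are made up for illustration.

```python
def to_textcat_examples(sentences, mask):
    """Pair each sentence with a {'cats': ...} annotation built from a 0/1 mask."""
    return [
        (text, {"cats": {"POSITIVE": bool(flag), "NEGATIVE": not bool(flag)}})
        for text, flag in zip(sentences, mask)
    ]


examples = to_textcat_examples(
    ["Some critics say the plan will fail.", "But they are mistaken."],
    [0, 1],   # the second sentence is the conclusion
)
for text, annotation in examples:
    print(annotation["cats"], text)
```

Because the two labels are mutually exclusive, `NEGATIVE` is always the complement of `POSITIVE`.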

In [54]:
def count_sent(stim):      # takes one element of alist or blist
    sentences, labels, ident = stim   # sentences=list of sentence spans, labels=[0/1 array], ident=[PT_Sec_Quest]
    return len(sentences)             # number of sentences in this stimulus

In [55]:
import itertools
from functools import reduce
import operator
import random
# Splits the training data into "train" and "dev" sets.

# This function both shuffles the stimuli (in place) and preps them in the format expected by the trainer.

def shuffle_data(l, split=0.8):   # Splits the data upon an 80/20% split.  Other values may be utilized.
    random.shuffle(l)
    split = int(len(l) * split)
    
    
    t_senLengths = []    
    train_text = []
    for x in l[:split]:
        train_text.append(load_stimuls(x))
        t_senLengths.append(count_sent(x))
        

    d_t = []
    d_c = []
    d_senLengths = []
    for stimuli in l[split:]:  
        for areConc in stimuli[1]:
            for isConc in areConc:
                d_c.append({'POSITIVE': bool(isConc), 'NEGATIVE': not bool(isConc)})
        d_senLengths.append(len(stimuli[0]))
        for sentences in stimuli[0]:
            d_t.append(sentences.text)


    
    return list(itertools.chain.from_iterable(train_text)), tuple(d_t), d_c, t_senLengths, d_senLengths

train_data, dev_texts, dev_cats, train_senLengths, dev_senLengths = shuffle_data(alist)
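Since `random.shuffle` reorders the list it is given, every call to `shuffle_data(alist)` also changes `alist` and produces a different split. Where a repeatable split is preferred, a seeded copy-and-shuffle is one option. The sketch below is an assumption, not the approach used in this notebook; `split_train_dev` and its `seed` parameter are hypothetical.

```python
import random


def split_train_dev(items, split=0.8, seed=0):
    """Shuffle a copy of `items` with a fixed seed and cut it into train and dev parts."""
    rng = random.Random(seed)   # private generator: leaves the global random state alone
    shuffled = list(items)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]


items = list(range(10))
train, dev = split_train_dev(items)
print(len(train), len(dev))
```

A fixed seed makes evaluation numbers comparable between runs; changing the seed gives a different but equally reproducible split.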

#### REVIEW Formatted train/test results START

In [56]:
train_data[:5]

[('Veterinarian: A disease of purebred racehorses that is caused by a genetic defect prevents afflicted horses from racing and can cause paralysis and death.',
  {'cats': {'POSITIVE': False, 'NEGATIVE': True}}),
 ('Some horse breeders conclude that because the disease can have such serious consequences, horses with this defect should not be bred.',
  {'cats': {'POSITIVE': False, 'NEGATIVE': True}}),
 ('But they are wrong because, in most cases, the severity of the disease can be controlled by diet and medication, and the defect also produces horses of extreme beauty that are in great demand in the horse show industry.',
  {'cats': {'POSITIVE': True, 'NEGATIVE': False}}),
 ('Anthropologist: Every human culture has taboos against eating certain animals.',
  {'cats': {'POSITIVE': False, 'NEGATIVE': True}}),
 ('Some researchers have argued that such taboos originated solely for practical reasons, pointing out, for example, that in many cultures it is taboo to eat domestic animals that prov

In [62]:
train_data[0][0]

"Camera manufacturers typically advertise their products by citing the resolution of their cameras' lenses, the resolution of a lens being the degree of detail the lens is capable of reproducing in the image it projects onto the film."

In [63]:
train_data[0][1]

{'cats': {'POSITIVE': False, 'NEGATIVE': True}}

In [43]:
type(train_data[0][1])

dict

In [64]:
train_data[0][1]['cats']

{'POSITIVE': False, 'NEGATIVE': True}

In [57]:
dev_texts[:5]

('Cynthia: Corporations amply fund research that generates marketable new technologies.',
 'But the fundamental goal of science is to achieve a comprehensive knowledge of the workings of the universe.',
 'The government should help fund those basic scientific research projects that seek to further our theoretical knowledge of nature.  ',
 'Luis: The basic goal of government support of scientific research is to generate technological advances that will benefit society as a whole.',
 'So only research that is expected to yield practical applications in fields such as agriculture and medicine ought to be funded.')

In [58]:
type(dev_texts)

tuple

In [59]:
dev_cats[:5]

[{'POSITIVE': False, 'NEGATIVE': True},
 {'POSITIVE': False, 'NEGATIVE': True},
 {'POSITIVE': False, 'NEGATIVE': True},
 {'POSITIVE': False, 'NEGATIVE': True},
 {'POSITIVE': True, 'NEGATIVE': False}]

In [60]:
print(type(dev_cats))

<class 'list'>
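
The cells above show the layout the training loop expects: each training item is a `(text, {'cats': {...}})` pair, while the dev set is kept as a tuple of texts and a parallel list of category dictionaries. A minimal sketch of how such pairs could be built and split follows. The helper name `make_example` and the two sample sentences are hypothetical and are not taken from the notebook.

```python
def make_example(text, is_positive):
    """Return one (text, annotations) pair in the layout shown above."""
    cats = {"POSITIVE": bool(is_positive), "NEGATIVE": not is_positive}
    return (text, {"cats": cats})


# Hypothetical sentences and labels, for illustration only.
labeled = [
    ("The committee met on Tuesday.", False),
    ("Therefore, the proposal should be rejected.", True),
]
examples = [make_example(text, flag) for text, flag in labeled]

# Split into parallel sequences, mirroring dev_texts (tuple) and dev_cats (list).
texts, annotations = zip(*examples)
cats = [annotation["cats"] for annotation in annotations]

print(type(texts), type(cats))
```

Keeping `POSITIVE` and `NEGATIVE` as mutually exclusive booleans matches the dictionaries printed by `train_data[0][1]['cats']`.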


#### REVIEW Formatted train/test results END

## Execute Training

In [36]:
# NOTE: Most of the code below is adapted from
#    https://www.machinelearningplus.com/nlp/custom-text-classification-spacy/
#    Edits are noted where appropriate.

In [61]:
def evaluate(tokenizer, textcat, texts, cats):  
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
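    fp = 1e-8  # False positives (small epsilon avoids division by zero in precision)
    fn = 1e-8  # False negatives (small epsilon avoids division by zero in recall)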
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "NEGATIVE":
                continue
            # tp_lim, fp_lim, tn_lim and fn_lim are global thresholds defined in a later cell.
            if score >= 0.5 and gold[label] >= tp_lim:
                tp += 1.0
            elif score >= 0.5 and gold[label] < fp_lim:
                fp += 1.0
            elif score < 0.5 and gold[label] < tn_lim:
                tn += 1.0
            elif score < 0.5 and gold[label] >= fn_lim:
                fn += 1.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # accuracy added to the evaluation process.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
        

    # the following code adds accuracy as well as TP, TN, FP, FN to the dictionary returned from this function.
    return {"textcat_a": accuracy, "textcat_p": precision, "textcat_r": recall, "textcat_f": f_score,  
            "TP": tp, "FP": fp, "TN": tn, "FN": fn}

#### START Old evaluator

#### END Old Evaluator

In [62]:
from spacy.util import minibatch, decaying  # both are used in the training cells below

# Disabling other components
viewbatch = []   
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']


def run_multi_batch(num_of_cycles, n_iter):
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'Accuracy', 'Precision', 'Recall', 'Fscore', "TP", "FP", "TN", "FN"))    
    for b in range(0, num_of_cycles):
        print("cycle:", b)
        with nlp.disable_pipes(*other_pipes):  # only train textcat
            optimizer = nlp.begin_training()

            # Performing training
            for i in range(n_iter):
                losses = {}
                batches = minibatch(train_data, size=iter(train_senLengths))

                for batch in batches:
                    texts, annotations = zip(*batch)
                    viewbatch.append([texts, annotations])   
                    # drop=next(drop_rate) applies the decaying dropout rate;
                    # for a constant rate, pass a fixed value instead (e.g. drop=0).
                    nlp.update(texts, annotations, sgd=optimizer, drop=next(drop_rate),
                               losses=losses)

                # Calling the evaluate() function and printing the scores
                with textcat.model.use_params(optimizer.averages):
                    scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
                print('{0:.3f}\t{1:.3f}\t\t{2:.3f}\t\t{3:.3f}\t{4:.3f}\t{5:.0f}\t{6:.0f}\t{7:.0f}\t{8:.0f}'  
                      .format(losses['textcat'], 
                              scores['textcat_a'], scores['textcat_p'],
                              scores['textcat_r'], scores['textcat_f'],
                              scores['TP'], scores['FP'], scores['TN'], scores['FN']))
                

In [69]:
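# Training hyperparameters and evaluation thresholds
# from spacy.util import decaying
# Decaying dropout generator, see https://spacy.io/usage/training#tips-dropout
# NOTE: the third argument appears to be the decay step rather than a count, so a value as large
#       as len(train_data) likely drops the rate to the 0.15 floor almost immediately.
drop_rate = decaying(0.25, 0.15, len(train_data))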
tp_lim = 0.5
fp_lim = 0.5
tn_lim = 0.5
fn_lim = 0.5
n_iter = 2   # epochs (full passes over train_data) per cycle; raising it adds epochs but not newly RANDOMIZED runs
num_of_cycles = 5  # each cycle calls nlp.begin_training() again, so raising it adds epochs AND RANDOMIZES them

run_multi_batch(num_of_cycles, n_iter)

Training the model...
LOSS 	Accuracy	Precision	Recall	Fscore	 TP  	 FP  	 TN  	 FN  
cycle: 0
0.193	0.821		0.552		0.593	0.571	16	13	94	11
0.484	0.858		0.700		0.519	0.596	14	6	101	13
cycle: 1
0.383	0.821		0.560		0.519	0.538	14	11	96	13
0.503	0.836		0.632		0.444	0.522	12	7	100	15
cycle: 2
0.669	0.836		0.600		0.556	0.577	15	10	97	12
0.762	0.806		0.520		0.481	0.500	13	12	95	14
cycle: 3
0.549	0.799		0.500		0.481	0.491	13	13	94	14
0.040	0.821		0.556		0.556	0.556	15	12	95	12
cycle: 4
0.344	0.843		0.600		0.667	0.632	18	12	95	9
0.535	0.843		0.625		0.556	0.588	15	9	98	12
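
The `drop=next(drop_rate)` argument in the training loop pulls one dropout value per batch from a generator. The sketch below is a plain-Python stand-in that shows how the step size controls how quickly such a rate reaches its floor. The name `linear_decay` and its exact schedule are assumptions for illustration and are not spaCy's `decaying` implementation.

```python
from itertools import islice


def linear_decay(start, stop, step):
    """Yield start, start - step, ... but never a value below stop."""
    current = float(start)
    while True:
        yield max(current, stop)
        current -= step


# A small step lowers the rate gradually over several batches.
small_step = list(islice(linear_decay(0.25, 0.15, 0.02), 7))

# A very large step hits the floor on the second value.
large_step = list(islice(linear_decay(0.25, 0.15, 100), 3))

print(small_step)
print(large_step)
```

Under this schedule, a step on the order of the dataset size behaves like a near-constant dropout of 0.15, so a small fractional step is the usual choice when a gradual decay is wanted.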


In [66]:
# Saves trained pipe to disk
output_dir="./YourSaveFolder"
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to ./YourSaveFolder


In [67]:
save_scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
save_scores

{'textcat_a': 0.8208955222655379,
 'textcat_p': 0.5555555553497942,
 'textcat_r': 0.5555555553497942,
 'textcat_f': 0.5555555553497942,
 'TP': 15.0,
 'FP': 12.00000001,
 'TN': 95.0,
 'FN': 12.00000001}
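
The scores above follow directly from the four confusion counts. As a rough check, the sketch below recomputes them from the rounded counts (TP=15, FP=12, TN=95, FN=12). The function name `scores_from_counts` is hypothetical, and it guards against empty denominators explicitly instead of using the `1e-8` offsets that `evaluate()` uses.

```python
def scores_from_counts(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


scores = scores_from_counts(tp=15, fp=12, tn=95, fn=12)
print({name: round(value, 3) for name, value in scores.items()})
# {'accuracy': 0.821, 'precision': 0.556, 'recall': 0.556, 'f_score': 0.556}
```

The fractional `12.00000001` values for FP and FN in the saved scores come from the `1e-8` starting values in `evaluate()`, not from the data.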

In [68]:
viewbatch[:3]

[[('Veterinarian: A disease of purebred racehorses that is caused by a genetic defect prevents afflicted horses from racing and can cause paralysis and death.',
   'Some horse breeders conclude that because the disease can have such serious consequences, horses with this defect should not be bred.',
   'But they are wrong because, in most cases, the severity of the disease can be controlled by diet and medication, and the defect also produces horses of extreme beauty that are in great demand in the horse show industry.'),
  ({'cats': {'POSITIVE': False, 'NEGATIVE': True}},
   {'cats': {'POSITIVE': False, 'NEGATIVE': True}},
   {'cats': {'POSITIVE': True, 'NEGATIVE': False}})],
 [('Anthropologist: Every human culture has taboos against eating certain animals.',
   'Some researchers have argued that such taboos originated solely for practical reasons, pointing out, for example, that in many cultures it is taboo to eat domestic animals that provide labor and that are therefore worth more 
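
The `viewbatch` output shows that each batch holds the sentences of one stimulus, which is consistent with passing an iterator of per-stimulus sentence counts as the batch size. The sketch below illustrates that idea in plain Python. The helper `batch_by_sizes` and the sample data are hypothetical, and it is assumed here that `train_senLengths` holds one sentence count per stimulus.

```python
from itertools import islice


def batch_by_sizes(items, sizes):
    """Yield consecutive batches whose lengths follow the given sizes."""
    iterator = iter(items)
    for size in sizes:
        batch = list(islice(iterator, size))
        if not batch:  # stop once the items are exhausted
            return
        yield batch


# Hypothetical data: five sentences from two stimuli (3 + 2 sentences).
sentences = ["s1", "s2", "s3", "s4", "s5"]
sentence_counts = [3, 2]

batches = list(batch_by_sizes(sentences, sentence_counts))
print(batches)
# [['s1', 's2', 's3'], ['s4', 's5']]
```

Batching this way keeps all sentences of a stimulus in the same update step.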

### Notes

In [169]:
# Optimizer
# https://stackoverflow.com/questions/56804988/change-default-learning-rate-in-spacys-optimizer
# https://spacy.io/api/cli#train-hyperparams
# https://github.com/explosion/spaCy/blob/69e70ffae16700e990d60640f27eb7f980c0ba50/spacy/_ml.py#L49

# Stats
# https://towardsdatascience.com/a-complete-understanding-of-precision-recall-and-f-score-concepts-23dc44defef6


# Prep for output and Human readable format

In [72]:
## Function to control floating point digits displayed.
# http://randlet.com/blog/python-significant-figures-format/
import math

def to_precision(x,p):
    """
    returns a string representation of x formatted with a precision of p

    Based on the webkit javascript implementation taken from here:
    https://code.google.com/p/webkit-mirror/source/browse/JavaScriptCore/kjs/number_object.cpp
    """

    x = float(x)

    if x == 0.:
        return "0." + "0"*(p-1)

    out = []

    if x < 0:
        out.append("-")
        x = -x

    e = int(math.log10(x))
    tens = math.pow(10, e - p + 1)
    n = math.floor(x/tens)

    if n < math.pow(10, p - 1):
        e = e -1
        tens = math.pow(10, e - p+1)
        n = math.floor(x / tens)

    if abs((n + 1.) * tens - x) <= abs(n * tens -x):
        n = n + 1

    if n >= math.pow(10,p):
        n = n / 10.
        e = e + 1

    m = "%.*g" % (p, n)

    if e < -2 or e >= p:
        out.append(m[0])
        if p > 1:
            out.append(".")
            out.extend(m[1:p])
        out.append('e')
        if e > 0:
            out.append("+")
        out.append(str(e))
    elif e == (p -1):
        out.append(m)
    elif e >= 0:
        out.append(m[:e+1])
        if e+1 < len(m):
            out.append(".")
            out.extend(m[e+1:])
    else:
        out.append("0.")
        out.extend(["0"]*-(e+1))
        out.append(m)

    return "".join(out)
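
For comparison, Python's built-in format specification can also limit output to a number of significant digits. The sketch below applies the `.3g` specifier to values similar to the scores printed later in this notebook. It is offered as an alternative view, not as a replacement for `to_precision`, and its exponent style differs.

```python
# Sample values chosen to resemble the POSITIVE scores shown in the results.
values = [1.74e-3, 7.58e-5, 0.977]

# ".3g" keeps three significant digits and switches to exponent
# notation only for very small or very large magnitudes.
formatted = [format(value, ".3g") for value in values]
print(formatted)
# ['0.00174', '7.58e-05', '0.977']
```

The built-in specifier prints `0.00174` where the results in this notebook show `1.74e-3`, so the two approaches are not interchangeable if the output must match exactly.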



#### Output results of training against remaining stimuli.

In [73]:
from IPython.display import display, HTML
display(HTML("<style>div.output_area pre {white-space: pre;}</style>"))

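def search_blist(rows):
    """Print each stimulus ID, then every sentence with its predicted POSITIVE score."""
    for row in rows:
        print(row[2])  # stimulus ID, e.g. ['34.0_01_08']
        for sentence in row[0]:
            d = nlp(sentence.as_doc().text)
            print(d)
            print(to_precision(d.cats['POSITIVE'], 3), '\t', d)
        print("\n")


# search_blist() prints its output and returns None, so there is nothing to assign.
search_blist(blist[:3])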

['34.0_01_08']
Conservationist: The risk to airplane passengers from collisions between airplanes using the airport and birds from the wildlife refuge is negligible. 
1.74e-3 	 Conservationist: The risk to airplane passengers from collisions between airplanes using the airport and birds from the wildlife refuge is negligible. 
In the 10 years since the refuge was established, only 20 planes have been damaged in collisions with birds, and no passenger has been injured as a result of such a collision. 
7.58e-5 	 In the 10 years since the refuge was established, only 20 planes have been damaged in collisions with birds, and no passenger has been injured as a result of such a collision. 
The wildlife refuge therefore poses no safety risk.  
9.79e-8 	 The wildlife refuge therefore poses no safety risk.  
Pilot: You neglect to mention that 17 of those 20 collisions occurred within the past 2 years, and that the number of birds in the refuge is rapidly increasing. 
0.977 	 Pilot: You neglect 

#### Used to verify training data is sent correctly

In [75]:
output_pairs = [["19.0_01_07",  1], 
                ["20.0_01_14",  2],  
                ["20.0_02_22",  3]]

def output_stimuli(df, output_pairs):
    """Print every sentence of each stimulus, flagged 1 if its index matches the labeled index, else 0."""
    for order_id, positive_idx in output_pairs:
        print(order_id)
        # Select the parsed Doc ("docS") of the first row whose 'Order' equals order_id.
        doc = df.loc[df['Order'] == order_id, "docS"].iloc[0]
        for j, sentence in enumerate(doc.sents):  # j = sentence index within the stimulus
            print("1" if j == positive_idx else "0", sentence)
        print("\n")
    
output_stimuli(df, output_pairs)


19.0_01_07
0 A commonly accepted myth is that left-handed people are more prone to cause accidents than are right-handed people.
1 But this is, in fact, just a myth, as is indicated by the fact that more household accidents are caused by right-handed people than are caused by left-handed people.


20.0_01_14
0 Yolanda: Gaining access to computers without authorization and manipulating the data and programs they contain is comparable to joyriding in stolen cars; both involve breaking into private property and treating it recklessly.
0 Joyriding, however, is the more dangerous crime because it physically endangers people, whereas only intellectual property is harmed in the case of computer crimes.  
1 Arjun: I disagree!
0 For example, unauthorized use of medical records systems in hospitals could damage data systems on which human lives depend, and therefore computer crimes also cause physical harm to people.


20.0_02_22
0 Wirth: All efforts to identify a gene responsible for predisposi