Merged
Commits
58 commits
6b72c4e
added unit test for windowing - work in progress
albertoandreottiATgmail Nov 24, 2017
ddd5a32
added script for converting negex dataset to CSV
albertoandreottiATgmail Nov 29, 2017
e7f2b96
work in progress - fixed problem with embeddings
albertoandreottiATgmail Nov 29, 2017
8754a95
some cleanup
albertoandreottiATgmail Nov 30, 2017
2601e8b
cleanup
albertoandreottiATgmail Nov 30, 2017
dc3f4c4
i2b2 reader
albertoandreottiATgmail Nov 30, 2017
fdca19b
fixed problem with windows
albertoandreottiATgmail Dec 4, 2017
9a06fbc
enhancements in i2b2 reader
albertoandreottiATgmail Dec 4, 2017
4dd89f6
added case for windowing unit test
albertoandreottiATgmail Dec 4, 2017
4b60b1b
work in progress assertion model in the pipeline
albertoandreottiATgmail Dec 8, 2017
427fdbd
work in progress
albertoandreottiATgmail Dec 8, 2017
29319d2
modified tokenizer to match the one using in embeddings
albertoandreottiATgmail Dec 12, 2017
01973f9
work in progress
albertoandreottiATgmail Dec 15, 2017
d7c29d8
work in progress
albertoandreottiATgmail Dec 15, 2017
1e3b516
deleted negex dataset reader
albertoandreottiATgmail Dec 18, 2017
e28b52a
deleted negex dataset reader
albertoandreottiATgmail Dec 18, 2017
f807856
little cleanup in test
albertoandreottiATgmail Dec 18, 2017
a274039
added tokenizers
albertoandreottiATgmail Dec 18, 2017
bd8e614
behavior for some tokenizers
albertoandreottiATgmail Dec 18, 2017
fd99273
refactor in windowing code
albertoandreottiATgmail Dec 18, 2017
9cf8c4e
some cleanup in test and comments
albertoandreottiATgmail Dec 18, 2017
b72db86
work in progress
albertoandreottiATgmail Dec 19, 2017
9f53eab
added commons evaluation metrics class
albertoandreottiATgmail Dec 20, 2017
8a035dd
some cleanup
albertoandreottiATgmail Dec 22, 2017
5f1eea7
Merge branch 'word_embeddings' into assertion_status
albertoandreottiATgmail Dec 22, 2017
935b49b
corrections in tests, annotations, and labels
albertoandreottiATgmail Dec 23, 2017
4a7fd43
restored complete i2b2 dataset
albertoandreottiATgmail Dec 23, 2017
d6049a9
code for before & after parameters in model
albertoandreottiATgmail Dec 23, 2017
f319eca
cosmetic
albertoandreottiATgmail Dec 23, 2017
26f4366
cosmetic
albertoandreottiATgmail Dec 23, 2017
c092b2c
cleanup
albertoandreottiATgmail Dec 23, 2017
fbb8106
cosmetic
albertoandreottiATgmail Dec 23, 2017
d2b8dc8
minor changes
albertoandreottiATgmail Jan 11, 2018
35c57e5
implemented simple version of the regex tokenizer
albertoandreottiATgmail Jan 11, 2018
abc4d43
added html documentation for assertion status
albertoandreottiATgmail Jan 18, 2018
705ae69
work in progress for notepad
albertoandreottiATgmail Jan 19, 2018
fdc59b8
refactor to include Dataset interface in models
albertoandreottiATgmail Jan 22, 2018
e6dfcdb
added reader for negex dataset
albertoandreottiATgmail Jan 24, 2018
aa195e6
added jupyter notebook for assertion status
albertoandreottiATgmail Jan 24, 2018
51cdf09
some changes to make parameter names match in notebook
albertoandreottiATgmail Jan 24, 2018
6e640c4
added parquet version of negex dataset for notebook
albertoandreottiATgmail Jan 24, 2018
b2baa0e
added test cases for negex dataset
albertoandreottiATgmail Jan 24, 2018
fc617de
fixed problem with parameters
albertoandreottiATgmail Jan 24, 2018
748b1e6
removed some hard-coded params
albertoandreottiATgmail Jan 24, 2018
9512e45
minor changes in parameters
albertoandreottiATgmail Jan 24, 2018
001a842
Merge remote-tracking branch 'origin/flexible-word-embeddings' into a…
albertoandreottiATgmail Jan 25, 2018
c207608
refactor to avoid embeddings serialization
albertoandreottiATgmail Jan 25, 2018
7ce496c
missing file
albertoandreottiATgmail Jan 25, 2018
35eb023
fixes for serialization
albertoandreottiATgmail Jan 25, 2018
a5132af
transient lazy pattern
albertoandreottiATgmail Jan 26, 2018
a447606
cleanup
albertoandreottiATgmail Jan 26, 2018
bf944d0
removed embeddings logic from RawAnnotator
albertoandreottiATgmail Jan 26, 2018
8e364e0
- New tokenizer wrap up
saif-ellafi Jan 27, 2018
214873a
unit test work in progress
albertoandreottiATgmail Jan 27, 2018
4d7d973
- Fixed bug in word embeddings write process
saif-ellafi Jan 27, 2018
3b3cad0
unit test
albertoandreottiATgmail Jan 27, 2018
13cc10c
removed hard-coded path
albertoandreottiATgmail Jan 27, 2018
f29a4ca
Merge remote-tracking branch 'origin/assertion-serialization-improvem…
albertoandreottiATgmail Jan 27, 2018
103 changes: 98 additions & 5 deletions docs/components.html
@@ -1055,7 +1055,8 @@ <h4 id="ViveknSentimentDetector" class="section-block"> 14. ViveknSentimentDetec
</div>
</div>


<h4 id="AssertionStatus" class="section-block"> 15. AssertionStatus: Assertion Status Classifier</h4>
<ul class="nav nav-tabs" role="tablist">
<li role="presentation" class="active"><a href="#python" aria-controls="home"
role="tab" data-toggle="tab">Python</a>
@@ -1067,8 +1068,100 @@ <h4 id="Finisher" class="section-block"> 15. Finisher: Getting data out </h4>
<div role="tabpanel" class="tab-pane active" id="python">
<div class="code-block">
<p>
Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.<br>
<b>Type:</b> assertion<br>
<b>Requires:</b> Document, Token<br>
<b>Functions:</b>
<ul>
<li>
setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column.
You don't need to specify them explicitly.
</li>
<li>
setInputCols(column names): sets the input column(s) that contain the text to be analyzed.
</li>
<li>
setOutputCol(name): sets the output column that will hold the labeled assertion annotations once the algorithm runs.
</li>
<li>
setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
</li>
<li>
setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
</li>
<li>
setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, or SparkNlp).
</li>
</ul>
<br>
<b>Input:</b> a document as output by the Document Assembler.<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">
assertion_status = AssertionStatusApproach() \
.setLabelCol("label") \
.setInputCols("document") \
.setOutputCol("assertion") \
.setBefore(11) \
.setAfter(13) \
.setEmbeddingsSource(embeddingsFile, 200, 3)</code></pre>
</div><!--//code-block-->
</div>
<div role="tabpanel" class="tab-pane" id="scala">
<div class="code-block">
<p>
Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.<br>
<b>Type:</b> assertion<br>
<b>Requires:</b> Document, Token<br>
<b>Functions:</b>
<ul>
<li>
setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column.
You don't need to specify them explicitly.
</li>
<li>
setInputCols(column names): sets the input column(s) that contain the text to be analyzed.
</li>
<li>
setOutputCol(name): sets the output column that will hold the labeled assertion annotations once the algorithm runs.
</li>
<li>
setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
</li>
<li>
setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
</li>
<li>
setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, or SparkNlp).
</li>
</ul>
<br>
<b>Input:</b> a document as output by the Document Assembler.<br>
<b>Example:</b><br>
</p>
<pre><code class="language-scala">
val assertionStatus = new AssertionStatusApproach()
.setLabelCol("label")
.setInputCols("document")
.setOutputCol("assertion")
.setBefore(11)
.setAfter(13)
.setEmbeddingsSource(embeddingsFile, 200, WordEmbeddingsFormat.Binary)</code></pre>
</div><!--//code-block-->
</div>
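The setBefore and setAfter parameters above describe a fixed-size token window around the target term. A rough, stdlib-only sketch of that windowing idea (illustrative only; `context_window` is a hypothetical name, not the library's implementation):

```python
def context_window(tokens, target_start, before=11, after=13):
    """Take up to `before` tokens preceding the target's first token,
    plus up to `after` tokens starting at the target's first token."""
    left = tokens[max(0, target_start - before):target_start]
    right = tokens[target_start:target_start + after]
    return left + right

tokens = "02) no valvular abnormalities seen".split()
# target "valvular abnormalities" begins at token index 2
print(context_window(tokens, 2, before=2, after=3))
# ['02)', 'no', 'valvular', 'abnormalities', 'seen']
```

The classifier then builds its features from this bounded window rather than from the whole sentence, so very long sentences do not blow up the input size.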


<h4 id="Finisher" class="section-block"> 16. Finisher: Getting data out </h4>
<ul class="nav nav-tabs" role="tablist">
<li role="presentation" class="active"><a href="#python" aria-controls="home"
role="tab" data-toggle="tab">Python</a>
</li>
<li role="presentation"><a href="#scala" aria-controls="profile" role="tab"
data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div role="tabpanel" class="tab-pane active" id="python">
<div class="code-block">
<p>
Once we have our NLP pipeline ready to go, we may want to use our
annotation results somewhere else where they are easy to consume. The Finisher
outputs annotation values as plain strings.
@@ -1153,7 +1246,7 @@ <h4 id="Finisher" class="section-block"> 15. Finisher: Getting data out </h4>
</ul>
</div><!--//code-block--></div>
</div>
<h4 id="TokenAssembler" class="section-block"> 17. TokenAssembler: Getting data reshaped </h4>
<ul class="nav nav-tabs" role="tablist">
<li role="presentation" class="active"><a href="#python" aria-controls="home"
role="tab" data-toggle="tab">Python</a>
@@ -1237,8 +1330,8 @@ <h4 id="TokenAssembler" class="section-block"> 16. TokenAssembler: Getting data
<li><a class="scrollto" href="#SentimentDetector">Sentiment Detector</a></li>
<li><a class="scrollto" href="#NERTagger">NERTagger</a></li>
<li><a class="scrollto" href="#SpellChecker">Spell Checker</a></li>
<li><a class="scrollto" href="#ViveknSentimentDetector">Vivekn Sentiment Detector</a></li>
<li><a class="scrollto" href="#AssertionStatus">Assertion Status</a></li>
<li><a class="scrollto" href="#Finisher">Finisher</a></li>

</ul><!--//nav-->
256 changes: 256 additions & 0 deletions python/example/logreg-assertion/assertion.ipynb
@@ -0,0 +1,256 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('../../')\n",
"\n",
"from pyspark.sql import SparkSession\n",
"from pyspark.ml import Pipeline\n",
"\n",
"from sparknlp.annotator import *\n",
"from sparknlp.common import *\n",
"from sparknlp.base import *\n",
"\n",
"if sys.version_info[0] < 3:\n",
" from urllib import urlretrieve\n",
"else:\n",
" from urllib.request import urlretrieve\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"assertion-status\")\\\n",
" .master(\"local[2]\")\\\n",
" .config(\"spark.driver.memory\",\"4G\")\\\n",
" .config(\"spark.driver.maxResultSize\", \"2G\")\\\n",
" .config(\"spark.jar\", \"lib/sparknlp.jar\")\\\n",
" .config(\"spark.kryoserializer.buffer.max\", \"500m\")\\\n",
" .getOrCreate()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"1. Required imports.\n",
"2. Create the Spark session."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"embeddingsFile = 'PubMed-shuffle-win-2.bin'\n",
"embeddingsUrl = 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/PubMed-shuffle-win-2.bin'\n",
"# this may take a couple of minutes\n",
"urlretrieve(embeddingsUrl, embeddingsFile)\n",
"\n",
"documentAssembler = DocumentAssembler()\\\n",
" .setInputCol(\"sentence\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"assertion = AssertionLogRegApproach()\\\n",
" .setLabelCol(\"label\")\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"assertion\")\\\n",
" .setBefore(11)\\\n",
" .setAfter(13)\\\n",
" .setEmbeddingsSource(embeddingsFile,200,3)\n",
"\n",
"\n",
"finisher = Finisher() \\\n",
" .setInputCols([\"assertion\"]) \\\n",
" .setIncludeKeys(True)\n",
"\n",
"pipeline = Pipeline(\n",
" stages = [\n",
" documentAssembler,\n",
" assertion,\n",
" finisher\n",
" ])\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+--------+-----+---+\n",
"| sentence| target| label|start|end|\n",
"+--------------------+--------------------+--------+-----+---+\n",
"|**initials ______...|multinodular goit...|Affirmed| 21| 25|\n",
"|02) mild aortic r...|mild aortic regur...|Affirmed| 1| 3|\n",
"|02) mild left atr...|mild left atrial ...|Affirmed| 1| 4|\n",
"|02) mild left atr...|mild left atrial ...|Affirmed| 1| 4|\n",
"|02) mild to moder...|mild to moderate ...|Affirmed| 1| 5|\n",
"|02) mild to moder...|mild to moderate ...|Affirmed| 1| 5|\n",
"|02) no valvular a...|valvular abnormal...| Negated| 2| 3|\n",
"|02) nondilated ri...|nondilated right ...|Affirmed| 1| 9|\n",
"|02) normal left v...|normal left ventr...|Affirmed| 1| 4|\n",
"|02) normal left v...|normal left ventr...|Affirmed| 1| 6|\n",
"|02) paradoxical s...|post-operative se...|Affirmed| 6| 8|\n",
"|02) small left ve...|small left ventri...|Affirmed| 1| 8|\n",
"|03) mild mitral r...|mild mitral regur...|Affirmed| 1| 3|\n",
"|03) mitral annula...|mitral annular ca...|Affirmed| 1| 3|\n",
"|03) moderate left...|moderate left atr...|Affirmed| 1| 4|\n",
"|03) normal pulmon...|normal pulmonary ...|Affirmed| 1| 5|\n",
"|03) thickened aor...|thickened aortic ...|Affirmed| 1| 3|\n",
"|03) thickened aor...|thickened aortic ...|Affirmed| 1| 6|\n",
"|03) thickened aor...|thickened aortic ...|Affirmed| 1| 8|\n",
"|03) thickened mit...|thickened mitral ...|Affirmed| 1| 6|\n",
"+--------------------+--------------------+--------+-----+---+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"#Load the input data to be annotated\n",
"data = spark. \\\n",
" read. \\\n",
" parquet(\"../../../src/test/resources/negex.parquet\"). \\\n",
" limit(3000)\n",
"data.cache()\n",
"data.count()\n",
"data.show()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Start fitting\n",
"Fitting is ended\n"
]
}
],
"source": [
"print(\"Start fitting\")\n",
"model = pipeline.fit(data)\n",
"print(\"Fitting is ended\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+------------------+\n",
"| sentence| target|finished_assertion|\n",
"+--------------------+--------------------+------------------+\n",
"|**initials ______...|multinodular goit...| result->Affirmed|\n",
"|02) mild aortic r...|mild aortic regur...| result->Affirmed|\n",
"|02) mild left atr...|mild left atrial ...| result->Affirmed|\n",
"|02) mild left atr...|mild left atrial ...| result->Affirmed|\n",
"|02) mild to moder...|mild to moderate ...| result->Affirmed|\n",
"|02) mild to moder...|mild to moderate ...| result->Affirmed|\n",
"|02) no valvular a...|valvular abnormal...| result->Negated|\n",
"|02) nondilated ri...|nondilated right ...| result->Affirmed|\n",
"|02) normal left v...|normal left ventr...| result->Affirmed|\n",
"|02) normal left v...|normal left ventr...| result->Affirmed|\n",
"|02) paradoxical s...|post-operative se...| result->Affirmed|\n",
"|02) small left ve...|small left ventri...| result->Affirmed|\n",
"|03) mild mitral r...|mild mitral regur...| result->Affirmed|\n",
"|03) mitral annula...|mitral annular ca...| result->Affirmed|\n",
"|03) moderate left...|moderate left atr...| result->Affirmed|\n",
"|03) normal pulmon...|normal pulmonary ...| result->Affirmed|\n",
"|03) thickened aor...|thickened aortic ...| result->Affirmed|\n",
"|03) thickened aor...|thickened aortic ...| result->Affirmed|\n",
"|03) thickened aor...|thickened aortic ...| result->Affirmed|\n",
"|03) thickened mit...|thickened mitral ...| result->Affirmed|\n",
"+--------------------+--------------------+------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"result = model.transform(data)\n",
"result.select(\"sentence\", \"target\", \"finished_assertion\").show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"pipeline.write().overwrite().save(\"./assertion_pipeline\")\n",
"model.write().overwrite().save(\"./assertion_model\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from pyspark.ml import PipelineModel, Pipeline\n",
"\n",
"Pipeline.read().load(\"./assertion_pipeline\")\n",
"sameModel = PipelineModel.read().load(\"./assertion_model\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
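The notebook above prints predicted assertions next to gold labels, and one commit in this PR adds a "commons evaluation metrics class". As a purely illustrative stdlib sketch (hypothetical names, not the actual class), overall accuracy and per-label recall over (gold, predicted) pairs could be computed like this:

```python
from collections import Counter

def evaluate(pairs):
    """pairs: iterable of (gold_label, predicted_label) tuples.
    Returns overall accuracy and per-label recall."""
    total = correct = 0
    gold_counts = Counter()   # gold occurrences per label
    hit_counts = Counter()    # correct predictions per label
    for gold, pred in pairs:
        total += 1
        gold_counts[gold] += 1
        if gold == pred:
            correct += 1
            hit_counts[gold] += 1
    accuracy = correct / total if total else 0.0
    recall = {label: hit_counts[label] / n for label, n in gold_counts.items()}
    return accuracy, recall

pairs = [("Affirmed", "Affirmed"), ("Negated", "Negated"),
         ("Negated", "Affirmed"), ("Affirmed", "Affirmed")]
accuracy, recall = evaluate(pairs)
print(accuracy)           # 0.75
print(recall["Negated"])  # 0.5
```

In the notebook, such pairs could be collected by selecting the `label` and `finished_assertion` columns from the transformed DataFrame.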
2 changes: 2 additions & 0 deletions python/sparknlp/__init__.py
@@ -6,6 +6,8 @@
sys.modules['com.johnsnowlabs.nlp.annotators.ner'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.ner.regex'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.ner.crf'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.assertion'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.assertion.logreg'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.pos'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.pos.perceptron'] = annotator
sys.modules['com.johnsnowlabs.nlp.annotators.sbd'] = annotator
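The `sys.modules` assignments above alias the Python `annotator` module under JVM-style package paths, presumably so that saved pipelines whose metadata records the Scala class path can be resolved from Python. A minimal stand-alone sketch of the pattern, with hypothetical module and class names (`jvm.nlp.logreg` is illustrative, not a real path):

```python
import sys
import types

# Hypothetical stand-in for sparknlp's `annotator` module, holding one class.
annotator = types.ModuleType('annotator')
annotator.AssertionLogRegApproach = type('AssertionLogRegApproach', (), {})

# Register the same module object under every level of a JVM-style dotted
# path, so ordinary `from ... import ...` statements resolve against it.
for name in ('jvm', 'jvm.nlp', 'jvm.nlp.logreg'):
    sys.modules[name] = annotator

from jvm.nlp.logreg import AssertionLogRegApproach
print(AssertionLogRegApproach.__name__)  # AssertionLogRegApproach
```

Registering each intermediate level (as the diff does for `...assertion` and `...assertion.logreg`) keeps imports at any depth of the dotted path working.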