22 changes: 11 additions & 11 deletions docs/components.html
@@ -172,7 +172,7 @@ <h4 id="DocumentAssembler" class="section-block"> 1. DocumentAssembler: Getting
</div><!--//code-block--></div>
</div>

<h4 id="RegexTokenizer" class="section-block">2. RegexTokenizer: Word tokens</h4>
<h4 id="Tokenizer" class="section-block">2. Tokenizer: Word tokens</h4>
<ul class="nav nav-tabs" role="tablist">
<li role="presentation" class="active"><a href="#python" aria-controls="home"
role="tab" data-toggle="tab">Python</a>
@@ -197,7 +197,7 @@ <h4 id="RegexTokenizer" class="section-block">2. RegexTokenizer: Word tokens</h4
<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">tokenizer = RegexTokenizer() \
<pre><code class="language-python">tokenizer = Tokenizer() \
.setInputCols(["sentences"]) \
.setOutputCol("token")</code></pre>
</div><!--//code-block-->
@@ -218,7 +218,7 @@ <h4 id="RegexTokenizer" class="section-block">2. RegexTokenizer: Word tokens</h4
<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">val regexTokenizer = new RegexTokenizer()
<pre><code class="language-python">val regexTokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")</code></pre>
</div><!--//code-block-->
@@ -653,7 +653,7 @@ <h4 id="SentenceDetector" class="section-block"> 9. SentenceDetector: Sentence B
</ul>
<b>Example:</b><br>
</p>
<pre><code class="language-python">sentence_detector = SentenceDetectorModel() \
<pre><code class="language-python">sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)</code></pre>
@@ -673,7 +673,7 @@ <h4 id="SentenceDetector" class="section-block"> 9. SentenceDetector: Sentence B
</ul>
<b>Example:</b><br>
</p>
<pre><code class="language-python">val sentenceDetector = new SentenceDetectorModel()
<pre><code class="language-python">val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")</code></pre>
</div>
@@ -790,7 +790,7 @@ <h4 id="SentimentDetector" class="section-block"> 11. SentimentDetector: Sentime
<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">sentiment_detector = SentimentDetectorModel() \
<pre><code class="language-python">sentiment_detector = SentimentDetector() \
.setInputCols(["lemma", "sentence"]) \
.setOutputCol("sentiment")</code></pre>
</div><!--//code-block-->
@@ -825,7 +825,7 @@ <h4 id="SentimentDetector" class="section-block"> 11. SentimentDetector: Sentime
<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">val sentimentDetector = new SentimentDetectorModel
<pre><code class="language-python">val sentimentDetector = new SentimentDetector
.setInputCols(Array("token", "sentence"))
.setOutputCol("sentiment")</code></pre>
</div><!--//code-block--></div>
@@ -902,7 +902,7 @@ <h4 id="SpellChecker" class="section-block"> 13. SpellChecker: Token spell
<b>Inputs:</b> Any text for corpus. A list of words for dictionary. A
comma
separated custom dictionary.<br>
-<b>Requires:</b> RegexTokenizer<br>
+<b>Requires:</b> Tokenizer<br>
<b>Functions:</b><br>
<ul>
<li>
@@ -947,7 +947,7 @@ <h4 id="SpellChecker" class="section-block"> 13. SpellChecker: Token spell
<b>Inputs:</b> Any text for corpus. A list of words for dictionary. A
comma
separated custom dictionary.<br>
-<b>Requires:</b> RegexTokenizer<br>
+<b>Requires:</b> Tokenizer<br>
<b>Functions:</b><br>
<ul>
<li>
@@ -1017,7 +1017,7 @@ <h4 id="ViveknSentimentDetector" class="section-block"> 14. ViveknSentimentDetec
<b>Input:</b> File or folder of text files of positive and negative data<br>
<b>Example:</b><br>
</p>
<pre><code class="language-python">sentiment_detector = SentimentDetectorModel() \
<pre><code class="language-python">sentiment_detector = SentimentDetector() \
.setInputCols(["lemma", "sentence"]) \
.setOutputCol("sentiment")</code></pre>
</div><!--//code-block-->
@@ -1225,7 +1225,7 @@ <h4 id="TokenAssembler" class="section-block"> 16. TokenAssembler: Getting data
<a class="scrollto" href="#code-section">Annotators</a>
<ul class="nav doc-sub-menu">
<li><a class="scrollto" href="#DocumentAssembler">Document Assembler</a></li>
<li><a class="scrollto" href="#RegexTokenizer">Regex Tokenizer</a></li>
<li><a class="scrollto" href="#Tokenizer">Regex Tokenizer</a></li>
<li><a class="scrollto" href="#Normalizer">Normalizer</a></li>
<li><a class="scrollto" href="#Stemmer">Stemmer</a></li>
<li><a class="scrollto" href="#Lemmatizer">Lemmatizer</a></li>
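Taken together, the renamed annotators above compose into a basic pipeline. The following is a minimal Python sketch based only on the calls shown in this diff; the import paths are an assumption (they vary across spark-nlp releases), and a running Spark session with a "text" column is presumed:

# Import paths are an assumption; adjust to your spark-nlp version.
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# SentenceDetector is the new name for SentenceDetectorModel.
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

# Tokenizer is the new name for RegexTokenizer.
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")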
4 changes: 2 additions & 2 deletions docs/notebooks.html
@@ -115,13 +115,13 @@ <h4 id="Notebook1" class="section-block"> Sentiment Analysis using John Snow Lab
#assembled = document_assembler.transform(data)

### Sentence detector
-sentence_detector = SentenceDetectorModel() \
+sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
#sentence_data = sentence_detector.transform(checked)
In [ ]:
### Tokenizer
-tokenizer = RegexTokenizer() \
+tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
#tokenized = tokenizer.transform(assembled)
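The commented-out transform() calls in this notebook hint at running the annotators one step at a time rather than inside a Pipeline. A sketch of that interactive flow, assuming the data frame "data" and the stage variables defined in the notebook's earlier cells:

# Each annotator is a transformer, so stages can be applied individually.
assembled = document_assembler.transform(data)
sentence_data = sentence_detector.transform(assembled)
tokenized = tokenizer.transform(sentence_data)
tokenized.show(5)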
10 changes: 5 additions & 5 deletions docs/quickstart.html
@@ -222,19 +222,19 @@ <h3 class="block-title">Sentence detection and tokenization</h3>
<p>
In this quick example, we now proceed to identify the sentences in each of our
document lines.
-SentenceDetectorModel requires a Document annotation, which is provided by the
+SentenceDetector requires a Document annotation, which is provided by the
DocumentAssembler
output, and it's itself a Document type token.
-The RegexTokenizer requires a Document annotation type, meaning it works both
+The Tokenizer requires a Document annotation type, meaning it works both
with DocumentAssembler
or SentenceDetector output, in here, we use the sentence output.
</p>
<pre><code class="language-python">import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetectorModel
val sentenceDetector = new SentenceDetectorModel()
<pre><code class="language-python">import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")

-val regexTokenizer = new RegexTokenizer()
+val regexTokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")</code></pre>
</div>
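The quickstart prose above makes a dependency point worth keeping: both DocumentAssembler and SentenceDetector emit Document-type annotations, so Tokenizer can consume either column. A Python rendering of the same snippet (a sketch; the diff shows the Scala version, and the class names carry over unchanged):

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Equally valid, because "document" is also a Document-type annotation:
# tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")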
4 changes: 2 additions & 2 deletions python/example/crf-ner/ner.ipynb
@@ -101,11 +101,11 @@
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"sentenceDetector = SentenceDetectorModel()\\\n",
"sentenceDetector = SentenceDetector()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"tokenizer = RegexTokenizer()\\\n",
"tokenizer = Tokenizer()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
4 changes: 2 additions & 2 deletions python/example/crf-ner/ner_benchmark.ipynb
@@ -182,11 +182,11 @@
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
" sentenceDetector = SentenceDetectorModel()\\\n",
" sentenceDetector = SentenceDetector()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
" tokenizer = RegexTokenizer()\\\n",
" tokenizer = Tokenizer()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
6 changes: 3 additions & 3 deletions python/example/dictionary-sentiment/sentiment.ipynb
@@ -45,11 +45,11 @@
"document_assembler = DocumentAssembler() \\\n",
" .setInputCol(\"text\")\n",
"\n",
"sentence_detector = SentenceDetectorModel() \\\n",
"sentence_detector = SentenceDetector() \\\n",
" .setInputCols([\"document\"]) \\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"tokenizer = RegexTokenizer() \\\n",
"tokenizer = Tokenizer() \\\n",
" .setInputCols([\"sentence\"]) \\\n",
" .setOutputCol(\"token\")\n",
"\n",
@@ -58,7 +58,7 @@
" .setOutputCol(\"lemma\") \\\n",
" .setDictionary(\"../../../src/test/resources/lemma-corpus/AntBNC_lemmas_ver_001.txt\")\n",
" \n",
"sentiment_detector = SentimentDetectorModel() \\\n",
"sentiment_detector = SentimentDetector() \\\n",
" .setInputCols([\"lemma\", \"sentence\"]) \\\n",
" .setOutputCol(\"sentiment_score\") \\\n",
" .setDictPath(\"../../../src/test/resources/sentiment-corpus/default-sentiment-dict.txt\")\n",
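The dictionary-sentiment stages above chain into a single pyspark Pipeline. A sketch of the closing steps, assuming the stage variables defined in the notebook and the dictionary files at the paths it references:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    lemmatizer,
    sentiment_detector
])

# fit() trains any trainable stages; transform() appends annotation columns.
model = pipeline.fit(data)
model.transform(data).select("sentiment_score").show()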
4 changes: 2 additions & 2 deletions python/example/entities-extractor/extractor.ipynb
@@ -51,11 +51,11 @@
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"sentenceDetector = SentenceDetectorModel()\\\n",
"sentenceDetector = SentenceDetector()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"tokenizer = RegexTokenizer()\\\n",
"tokenizer = Tokenizer()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
65 changes: 54 additions & 11 deletions python/example/vivekn-sentiment/sentiment.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {
"collapsed": true
},
@@ -20,9 +20,42 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+---------+--------------------+\n",
"|itemid|sentiment| text|\n",
"+------+---------+--------------------+\n",
"| 1| 0| ...|\n",
"| 2| 0| ...|\n",
"| 3| 1| omg...|\n",
"| 4| 0| .. Omga...|\n",
"| 5| 0| i think ...|\n",
"| 6| 0| or i jus...|\n",
"| 7| 1| Juuuuuuuuu...|\n",
"| 8| 0| Sunny Agai...|\n",
"| 9| 1| handed in m...|\n",
"| 10| 1| hmmmm.... i...|\n",
"| 11| 0| I must thin...|\n",
"| 12| 1| thanks to a...|\n",
"| 13| 0| this weeken...|\n",
"| 14| 0| jb isnt show...|\n",
"| 15| 0| ok thats it ...|\n",
"| 16| 0| &lt;-------- ...|\n",
"| 17| 0| awhhe man.......|\n",
"| 18| 1| Feeling stran...|\n",
"| 19| 0| HUGE roll of ...|\n",
"| 20| 0| I just cut my...|\n",
"+------+---------+--------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"#Load the input data to be annotated\n",
"data = spark. \\\n",
@@ -36,7 +69,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"collapsed": true
},
@@ -59,7 +92,7 @@
"outputs": [],
"source": [
"### Sentence detector\n",
"sentence_detector = SentenceDetectorModel() \\\n",
"sentence_detector = SentenceDetector() \\\n",
" .setInputCols([\"document\"]) \\\n",
" .setOutputCol(\"sentence\")\n",
"#sentence_data = sentence_detector.transform(checked)"
@@ -74,7 +107,7 @@
"outputs": [],
"source": [
"### Tokenizer\n",
"tokenizer = RegexTokenizer() \\\n",
"tokenizer = Tokenizer() \\\n",
" .setInputCols([\"sentence\"]) \\\n",
" .setOutputCol(\"token\")\n",
"#tokenized = tokenizer.transform(assembled)"
@@ -154,7 +187,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pipeline = Pipeline(stages=[\n",
@@ -178,7 +213,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for r in sentiment_data.take(5):\n",
@@ -188,7 +225,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"start = time.time()\n",
@@ -201,7 +240,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"start = time.time()\n",
@@ -214,7 +255,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"start = time.time()\n",
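The truncated cells at the end of this notebook each open with start = time.time(); their bodies are elided in the diff. For orientation only, a fit/transform cell here plausibly looks like the sketch below, with stage and variable names taken from the visible cells; the Vivekn sentiment stage itself is an assumption, as it does not appear in the changed lines:

import time
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer
    # the Vivekn sentiment stage is not visible in this diff
])

start = time.time()
sentiment_data = pipeline.fit(data).transform(data)
for r in sentiment_data.take(5):
    print(r)
print("elapsed: %.1f s" % (time.time() - start))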