![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/13.NLU_crashcourse_every_Spark_NLP_Model_in_one_line.ipynb)


# NLU 20 Minutes Crashcourse - the fast & easy Data Science route


## Spark NLP vs NLU, what's the difference?
[NLU](https://nlu.johnsnowlabs.com/) is a Python wrapper around Spark NLP. It gives you all of Spark NLP's features in 1 line of code and supports the common Pythonic data structures like Pandas and Modin Dataframes. It's the ultimate tool to swiftly explore the models in Spark NLP and evaluate them for different use cases. With NLU you can:

- Use any model in Spark NLP in 1 line of code, by leveraging NLU's automatic pipeline generation
- Predict on the most common Python data structures, like strings and Pandas Series or Dataframes
- Transform the Spark NLP Dataframe structure into a pretty Pandas Dataframe structure, which can be configured to `document`, `sentence`, `chunk`, and `token` output levels.
- Visualize the outputs of models in 1 line of code using the [viz methods](https://nlu.johnsnowlabs.com/docs/en/viz_examples)
- Use a [powerful Streamlit dashboard and building blocks](https://nlu.johnsnowlabs.com/docs/en/streamlit_viz_examples) that enable you to test any model in 0 lines of code using a GUI. In addition, you can compare embeddings using various manifold and matrix decomposition visualizations



Under the hood, NLU automagically generates a Spark NLP pipeline for you, based on the model name you pass to `nlu.load()`. All the NLP data transformations and predictions are still being performed by Spark NLP; NLU just gives you the simplest possible API for all of its features.



NLU's core processing of the data returned by Spark NLP currently runs on the Numpy engine and is not distributed by default. This means NLU is slower and takes up more memory than Spark NLP, because additional computation is performed on your data.
You have the option to set **.predict(return_spark_df=True)**. With this setting, all computations will be **distributed**, but NLU will not perform further data processing on the dataframe, so you will get the standard Spark NLP Dataframe structure.
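The two `predict` options described here, the output level and `return_spark_df`, can be wrapped in a small helper. This is only a sketch under assumptions: `run_pipeline` and `EchoPipe` are hypothetical names introduced for illustration, `output_level` is the keyword used later in this notebook, and `EchoPipe` merely stands in for a real pipeline so the snippet runs without Spark.

```python
VALID_LEVELS = {"document", "sentence", "chunk", "token"}


def run_pipeline(pipe, data, output_level="document", distributed=False):
    """Call pipe.predict and return either a pandas or a Spark result."""
    if distributed:
        # Spark result: NLU skips its own pandas post-processing.
        return pipe.predict(data, return_spark_df=True)
    if output_level not in VALID_LEVELS:
        raise ValueError(f"unknown output level: {output_level!r}")
    return pipe.predict(data, output_level=output_level)


class EchoPipe:
    """Hypothetical stand-in for an NLU pipeline: returns the keyword args it received."""

    def predict(self, data, **kwargs):
        return kwargs


print(run_pipeline(EchoPipe(), "some text", output_level="chunk"))
print(run_pipeline(EchoPipe(), "some text", distributed=True))
```

With NLU installed, the same helper could presumably be called as `run_pipeline(nlu.load('ner'), text, output_level='chunk')`.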



This short notebook will teach you a lot of things!
- Sentiment classification, binary, multi class and regressive
- Extract Parts of Speech (POS)
- Extract Named Entities (NER)
- Extract Keywords (YAKE!)
- Answer Open and Closed book questions with T5
- Summarize text and more with Multi task T5
- Translate text with Microsoft's Marian Model
- Train a Multi Lingual Classifier for 100+ languages from a dataset with just one language

## NLU Webinars and Video Tutorials
- [NLU & Streamlit Tutorial](https://vimeo.com/579508034#)
- [Crash course of the 50+ Medical Domains and the 200+ Healthcare models in NLU](https://www.youtube.com/watch?v=gGDsZXt1SF8)
- [Multi Lingual NLU Webinar - Tutorial on Chinese News dataset](https://www.youtube.com/watch?v=ftAOqJuxnV4)
- [John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code](https://events.johnsnowlabs.com/john-snow-labs-nlu-become-a-data-science-superhero-with-one-line-of-python-code?hsCtaTracking=c659363c-2188-4c86-945f-5cfb7b42fcfc%7C8b2b188b-92a3-48ba-ad7e-073b384425b0)
- [Python Web Conf - Python's NLU library: 1,000+ Models, 200+ Languages, State of the Art Accuracy, 1 Line of Code](https://2021.pythonwebconf.com/presentations/john-snow-labs-nlu-the-simplicity-of-python-the-power-of-spark-nlp)
- [NYC/DC NLP Meetup with NLU](https://youtu.be/hJR9m3NYnwk?t=2155)


## More resources
- [Join our Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA)
- [NLU Website](https://nlu.johnsnowlabs.com/)
- [NLU Github](https://github.com/JohnSnowLabs/nlu)
- [Many more NLU example tutorials](https://github.com/JohnSnowLabs/nlu/tree/master/examples)
- [Overview of every powerful nlu 1-liner](https://nlu.johnsnowlabs.com/docs/en/examples)
- [Check out the Modelshub for an overview of all models](https://nlp.johnsnowlabs.com/models)
- [Check out the NLU Namespace where you can find every model as a table](https://nlu.johnsnowlabs.com/docs/en/spellbook)
- [Intro to NLU article](https://medium.com/spark-nlp/1-line-of-code-350-nlp-models-with-john-snow-labs-nlu-in-python-2f1c55bba619)
- [In-depth and easy Sentence Similarity Tutorial, with StackOverflow Questions using BERTology embeddings](https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf)
- [1 line of Python code for BERT, ALBERT, ELMO, ELECTRA, XLNET, GLOVE, Part of Speech with NLU and t-SNE](https://medium.com/spark-nlp/1-line-of-code-for-bert-albert-elmo-electra-xlnet-glove-part-of-speech-with-nlu-and-t-sne-9ebcd5379cd)

# Install NLU
You need Java 8, PySpark and Spark NLP installed, [see the installation guide for instructions](https://nlu.johnsnowlabs.com/docs/en/install). If you need help or run into trouble, [ping us on Slack :)](https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA)

In [1]:
%%capture
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

# Simple NLU basics on Strings

## Context-based Spell Checking in 1 line

![Spell Check](https://i.imgflip.com/52wb7w.jpg)

In [2]:
nlu.load('spell').predict('I also liek to live dangertus')

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,spell,token
0,I,I
0,also,also
0,like,liek
0,to,to
0,live,live
0,dangerous,dangertus
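The `token` and `spell` columns above line up one to one, so the corrected sentence and the list of changes can be rebuilt with plain Python. A minimal sketch, with the rows copied by hand from the output above (the variable names are illustrative only):

```python
# (token, spell) pairs copied from the output table above
spell_rows = [
    ("I", "I"),
    ("also", "also"),
    ("liek", "like"),
    ("to", "to"),
    ("live", "live"),
    ("dangertus", "dangerous"),
]

# Join the corrected tokens back into one sentence.
corrected_sentence = " ".join(fixed for _, fixed in spell_rows)

# Keep only the tokens the model actually changed.
corrections = {token: fixed for token, fixed in spell_rows if token != fixed}

print(corrected_sentence)
print(corrections)
```

Note that the model picked "dangerous" rather than "dangerously", so the suggestions are worth reviewing before they are applied automatically.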


## Binary Sentiment classification in 1 Line
![Binary Sentiment](https://cdn.pixabay.com/photo/2015/11/13/10/07/smiley-1041796_960_720.jpg)


In [3]:
nlu.load('sentiment').predict('I love NLU and rainy days!')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,I love NLU and rainy days!,"[0.015170135535299778, 0.3010264039039612, 0.3...",pos,0.999995,"[[-0.046539001166820526, 0.6196600198745728, 0..."


## Part of Speech (POS) in 1 line
![Parts of Speech](https://image.shutterstock.com/image-photo/blackboard-background-written-colorful-chalk-600w-1166166529.jpg)

|Tag |Description | Example|
|------|------------|------|
|CC| Coordinating conjunction | This batch of mushroom stew is savory **and** delicious    |
|CD| Cardinal number | Here are **five** coins    |
|DT| Determiner | **The** bunny went home    |
|EX| Existential there | **There** is a storm coming    |
|FW| Foreign word | I'm having a **déjà vu**    |
|IN| Preposition or subordinating conjunction | He is cleverer **than** I am   |
|JJ| Adjective | She wore a **beautiful** dress    |
|JJR| Adjective, comparative | My house is **bigger** than yours    |
|JJS| Adjective, superlative | I am the **shortest** person in my family   |
|LS| List item marker | **1.** Find premises **2.** Secure finance **3.** Hire staff |
|MD| Modal | You **must** stop when the traffic lights turn red    |
|NN| Noun, singular or mass | The **dog** likes to run    |
|NNS| Noun, plural | The **cars** are fast    |
|NNP| Proper noun, singular | I ordered the chair from **Amazon**  |
|NNPS| Proper noun, plural | We visited the **Kennedys**   |
|PDT| Predeterminer | **Both** the children had a toy   |
|POS| Possessive ending | I built the dog'**s** house    |
|PRP| Personal pronoun | **You** need to stop    |
|PRP$| Possessive pronoun | Remember not to judge a book by **its** cover |
|RB| Adverb | The dog barks **loudly**    |
|RBR| Adverb, comparative | Could you sing **more** quietly please?   |
|RBS| Adverb, superlative | Everyone in the race ran fast, but John ran **the fastest** of all    |
|RP| Particle | He ate **up** all his dinner    |
|SYM| Symbol | What are you doing **?**    |
|TO| to | Please send it back **to** me    |
|UH| Interjection | **Wow!** You look gorgeous    |
|VB| Verb, base form | We **play** soccer |
|VBD| Verb, past tense | I **worked** at a restaurant    |
|VBG| Verb, gerund or present participle | **Smoking** kills people   |
|VBN| Verb, past participle | She has **done** her homework    |
|VBP| Verb, non-3rd person singular present | You **flit** from place to place    |
|VBZ| Verb, 3rd person singular present | He never **calls** me    |
|WDT| Wh-determiner | The store honored the complaints, **which** were less than 25 days old    |
|WP| Wh-pronoun | **Who** can help me?    |
|WP\$| Possessive wh-pronoun | **Whose** fault is it?    |
|WRB| Wh-adverb | **Where** are you going?  |

In [4]:
nlu.load('pos').predict('POS assigns each token in a sentence a grammatical label')

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,pos,token
0,NNP,POS
0,NNS,assigns
0,DT,each
0,NN,token
0,IN,in
0,DT,a
0,NN,sentence
0,DT,a
0,JJ,grammatical
0,NN,label
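Once the tags are available, simple aggregations such as tag frequencies or a noun filter need only the standard library. A minimal sketch, with the rows copied by hand from the output above:

```python
from collections import Counter

# (pos, token) pairs copied from the output table above
pos_rows = [
    ("NNP", "POS"), ("NNS", "assigns"), ("DT", "each"), ("NN", "token"),
    ("IN", "in"), ("DT", "a"), ("NN", "sentence"), ("DT", "a"),
    ("JJ", "grammatical"), ("NN", "label"),
]

# How often does each tag occur?
tag_counts = Counter(tag for tag, _ in pos_rows)

# All noun tags in the table above start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for tag, token in pos_rows if tag.startswith("NN")]

print(tag_counts.most_common(3))
print(nouns)
```

The noun filter also returns "assigns", because the model tagged it `NNS` although it is used as a verb here. Predictions are not always right, so downstream filters inherit such errors.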


## Named Entity Recognition (NER) in 1 line

![NER](http://ckl-it.de/wp-content/uploads/2021/02/ner-1.png)

|Type | 	Description |
|------|--------------|
| PERSON | 	People, including fictional like **Harry Potter** |
| NORP | 	Nationalities or religious or political groups like the **Germans** |
| FAC | 	Buildings, airports, highways, bridges, etc. like **New York Airport** |
| ORG | 	Companies, agencies, institutions, etc. like **Microsoft** |
| GPE | 	Countries, cities, states. like **Germany** |
| LOC | 	Non-GPE locations, mountain ranges, bodies of water. Like the **Sahara desert**|
| PRODUCT | 	Objects, vehicles, foods, etc. (Not services.) like **playstation** |
| EVENT | 	Named hurricanes, battles, wars, sports events, etc. like **hurricane Katrina**|
| WORK_OF_ART | 	Titles of books, songs, etc. Like **Mona Lisa** |
| LAW | 	Named documents made into laws. Like : **Declaration of Independence** |
| LANGUAGE | 	Any named language. Like **Turkish**|
| DATE | 	Absolute or relative dates or periods. Like every second **friday**|
| TIME | 	Times smaller than a day. Like **every minute**|
| PERCENT | 	Percentage, including ”%“. Like **55%** of workers enjoy their work |
| MONEY | 	Monetary values, including unit. Like **50$** for those pants |
| QUANTITY | 	Measurements, as of weight or distance. Like this person weighs **50kg** |
| ORDINAL | 	“first”, “second”, etc. Like David placed **first** in the tournament |
| CARDINAL | 	Numerals that do not fall under another type. Like **hundreds** of models are available in NLU |


In [5]:
nlu.load('ner').predict("John Biden from America and Putin from Russia do not share many opinions.", output_level='chunk')

onto_recognize_entities_sm download started this may take some time.
Approx size to download 160.1 MB
[OK!]


Unnamed: 0,document,entities,entities_class,entities_confidence,entities_origin_chunk,entities_origin_sentence,sentence_pragmatic,word_embedding_embeddings
0,John Biden from America and Putin from Russia ...,John Biden,PERSON,0.9735,0,0,[John Biden from America and Putin from Russia...,"[[-0.2747400104999542, 0.48680999875068665, -0..."
0,John Biden from America and Putin from Russia ...,America,GPE,0.9985,1,0,[John Biden from America and Putin from Russia...,"[[-0.2747400104999542, 0.48680999875068665, -0..."
0,John Biden from America and Putin from Russia ...,Putin,PERSON,0.8682,2,0,[John Biden from America and Putin from Russia...,"[[-0.2747400104999542, 0.48680999875068665, -0..."
0,John Biden from America and Putin from Russia ...,Russia,GPE,0.9968,3,0,[John Biden from America and Putin from Russia...,"[[-0.2747400104999542, 0.48680999875068665, -0..."
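With one row per entity chunk, filtering by class and confidence is straightforward. The sketch below uses plain dictionaries with values copied from the output above; `entities_of` is a hypothetical helper name, and the confidence values are assumed to be numeric.

```python
# Rows copied from the chunk-level output above (only the relevant columns)
ner_rows = [
    {"entities": "John Biden", "entities_class": "PERSON", "entities_confidence": 0.9735},
    {"entities": "America", "entities_class": "GPE", "entities_confidence": 0.9985},
    {"entities": "Putin", "entities_class": "PERSON", "entities_confidence": 0.8682},
    {"entities": "Russia", "entities_class": "GPE", "entities_confidence": 0.9968},
]


def entities_of(rows, entity_class, min_confidence=0.0):
    """Return entity texts of one class at or above a confidence threshold."""
    return [
        row["entities"]
        for row in rows
        if row["entities_class"] == entity_class
        and row["entities_confidence"] >= min_confidence
    ]


print(entities_of(ner_rows, "PERSON"))
print(entities_of(ner_rows, "PERSON", min_confidence=0.9))
```

On the Pandas Dataframe returned by `predict`, the equivalent would be a boolean mask on the `entities_class` and `entities_confidence` columns, provided the confidence column has a numeric dtype.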


## Transformer Based Sequence Classification with NLU

In [6]:
import nlu
# https://nlp.johnsnowlabs.com/2021/11/03/bert_sequence_classifier_finbert_en.html
seq_pipe = nlu.load('en.classify.bert_sequence.finbert')
seq_pipe.predict('Stocks rallied and the British pound gained.')

bert_sequence_classifier_finbert download started this may take some time.
Approximate size to download 390.9 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,classified_sequence,classified_sequence_confidence,sentence
0,positive,0.936465,Stocks rallied and the British pound gained.


In [7]:
# https://nlp.johnsnowlabs.com/2021/11/21/distilbert_sequence_classifier_industry_en.html
seq_pipe = nlu.load('en.classify.distilbert_sequence.industry')
seq_pipe.predict('Stellar Capital Services Limited is an India-based non-banking financial company ... loan against property, management consultancy, personal loans and unsecured loans.')

distilbert_sequence_classifier_industry download started this may take some time.
Approximate size to download 238.4 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,classified_sequence,classified_sequence_confidence,sentence
0,Asset Management & Custody Banks,0.908935,Stellar Capital Services Limited is an India-b...
0,Consumer Finance,0.981309,".. loan against property, management consultan..."


In [8]:
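# https://nlp.johnsnowlabs.com/2021/11/21/distilbert_sequence_classifier_banking77_en.html
# Note: the link above describes the banking77 model, but this cell loads the industry model again (see the download log and output below)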
seq_pipe = nlu.load('en.classify.distilbert_sequence.industry')
seq_pipe.predict('I am still waiting on my card?')

distilbert_sequence_classifier_industry download started this may take some time.
Approximate size to download 238.4 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,classified_sequence,classified_sequence_confidence,sentence
0,Internet & Direct Marketing Retail,0.306007,I am still waiting on my card?


## Transformer Based Token Classification with NLU

In [9]:
# https://nlp.johnsnowlabs.com/2021/12/27/roberta_token_classifier_ticker_en.html
tok_pipe = nlu.load('en.ner.stocks_ticker')
tok_pipe.predict('MFST is a great stock to buy!!!')

roberta_token_classifier_ticker download started this may take some time.
Approximate size to download 443.8 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,classified_token,token
0,I-TICKER,MFST
0,O,is
0,O,a
0,O,great
0,O,stock
0,O,to
0,O,buy
0,O,!!!
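Token-level tags such as `I-TICKER` usually need to be merged into whole entities before they are useful. The sketch below groups consecutive `B-`/`I-` tagged tokens; `extract_entities` is a hypothetical helper, and the assumption that tags follow the `B-`/`I-`/`O` scheme is based only on the output above.

```python
def extract_entities(rows):
    """Group consecutive B-/I- tagged tokens into (label, text) entities."""
    entities = []
    current_label, current_tokens = None, []
    for tag, token in rows:
        prefix, _, label = tag.partition("-")
        inside = prefix in ("B", "I") and bool(label)
        if inside and (prefix == "B" or label != current_label):
            # A new entity starts: flush the previous one, if any.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif inside:
            # Same label continues.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# (classified_token, token) pairs copied from the output above
ticker_rows = [
    ("I-TICKER", "MFST"), ("O", "is"), ("O", "a"), ("O", "great"),
    ("O", "stock"), ("O", "to"), ("O", "buy"), ("O", "!!!"),
]
print(extract_entities(ticker_rows))
```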


In [None]:
print("Please restart kernel if you are in Google Colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

# Let's apply NLU to a dataset!

<div>
<img src="http://ckl-it.de/wp-content/uploads/2021/02/crypto.jpeg " width="400"  height="250" >
</div>


In [1]:
import pandas as pd
import nlu
!wget http://ckl-it.de/wp-content/uploads/2020/12/small_btc.csv
# Keep the titles of the first 2000 rows, matching the output below
df = pd.read_csv('/content/small_btc.csv').iloc[0:2000].title
df

--2023-07-06 05:32:47--  http://ckl-it.de/wp-content/uploads/2020/12/small_btc.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22244914 (21M) [text/csv]
Saving to: ‘small_btc.csv’


2023-07-06 05:32:51 (5.83 MB/s) - ‘small_btc.csv’ saved [22244914/22244914]



0          Bitcoin Price Update: Will China Lead us Down?
1       Key Bitcoin Price Levels for Week 51 (15 – 22 ...
2       National Australia Bank, Citing Highly Flawed ...
3       Chinese Bitcoin Ban Driven by  Chinese Banking...
4                   Bitcoin Trade Update: Opened Position
                              ...                        
1995    Bitcoin Bill Pay Company Living Room of Satosh...
1996    NYDFS Extends BitLicense Bitcoin Regulation Co...
1997    Bitfinex Passes Stefan Thomas’s Proof Of Solve...
1998    Cryptocurrency Exchange Platform AlphaPoint Pa...
1999    Want to Buy And Sell Bitcoin Fast and Secure? ...
Name: title, Length: 2000, dtype: object

## NER on a Crypto News dataset
### The **NER** model, which you can load via `nlu.load('ner')`, recognizes 18 different classes in your dataset.
We set the output level to chunk, so that we get 1 row per detected entity.


#### Predicted entities:


NER is available in many languages, which you can [find in the John Snow Labs Modelshub](https://nlp.johnsnowlabs.com/models)

In [2]:
ner_df = nlu.load('ner').predict(df, output_level = 'chunk')
ner_df

onto_recognize_entities_sm download started this may take some time.
Approx size to download 160.1 MB
[OK!]


Unnamed: 0,document,entities,entities_class,entities_confidence,entities_origin_chunk,entities_origin_sentence,sentence_pragmatic,word_embedding_embeddings
0,Bitcoin Price Update: Will China Lead us Down?,,,,,,[Bitcoin Price Update: Will China Lead us Down?],"[[0.8403199911117554, 0.13267000019550323, -0...."
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,Week 51 (15 – 22,DATE,0.76498336,0,0,[Key Bitcoin Price Levels for Week 51 (15 – 22...,"[[-0.22009000182151794, 0.12280000001192093, 0..."
2,"National Australia Bank, Citing Highly Flawed ...",Australia,GPE,0.7144,0,0,"[National Australia Bank, Citing Highly Flawed...","[[-0.003313800087198615, 0.3894599974155426, 0..."
3,Chinese Bitcoin Ban Driven by Chinese Banking...,Chinese,NORP,0.9957,0,0,[Chinese Bitcoin Ban Driven by Chinese Bankin...,"[[0.4327400028705597, 0.3958199918270111, 0.58..."
3,Chinese Bitcoin Ban Driven by Chinese Banking...,Chinese,NORP,0.9437,1,0,[Chinese Bitcoin Ban Driven by Chinese Bankin...,"[[0.4327400028705597, 0.3958199918270111, 0.58..."
...,...,...,...,...,...,...,...,...
1995,Bitcoin Bill Pay Company Living Room of Satosh...,,,,,,[Bitcoin Bill Pay Company Living Room of Satos...,"[[0.8403199911117554, 0.13267000019550323, -0...."
1996,NYDFS Extends BitLicense Bitcoin Regulation Co...,,,,,,[NYDFS Extends BitLicense Bitcoin Regulation C...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1997,Bitfinex Passes Stefan Thomas’s Proof Of Solve...,Bitfinex Passes,PERSON,0.70669997,0,0,[Bitfinex Passes Stefan Thomas’s Proof Of Solv...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1998,Cryptocurrency Exchange Platform AlphaPoint Pa...,,,,,,[Cryptocurrency Exchange Platform AlphaPoint P...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."


### Top 100 Named Entities

In [3]:
# ner_df.entities.value_counts()[:100].plot.barh(figsize = (16,20))

In [4]:
x = ner_df.entities.value_counts()[:100].reset_index()

In [5]:
import plotly.express as px
fig = px.bar(x, x='entities', y='index', orientation='h', height=1000)
fig.show()
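The column names produced by `value_counts().reset_index()` depend on the pandas version: the `index` column used in the plot above exists in pandas 1.x, while pandas 2.x names the columns after the Series and `count`. A small helper that names both columns explicitly avoids this. This is a sketch, and `top_counts` plus the column names `entity` and `count` are choices made here, not part of NLU:

```python
import pandas as pd


def top_counts(series, n=50, value_name="entity"):
    """Return the n most frequent values as a frame with stable column names."""
    return (
        series.value_counts()          # NaN values are dropped by default
        .head(n)
        .rename_axis(value_name)       # name the index explicitly
        .reset_index(name="count")     # name the counts column explicitly
    )


# Tiny made-up sample standing in for ner_df.entities
sample = pd.Series(["China", "Bitcoin", "China", None, "Australia", "China", "Bitcoin"])
top = top_counts(sample, n=2)
print(top)
```

The result can then be plotted with `px.bar(top, x='count', y='entity', orientation='h', height=1000)`, and the same helper works for the filtered frames in the following sections.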

### Top 50 Named Entities which are PERSONS

In [6]:
# ner_df[ner_df.entities_class == 'PERSON'].entities.value_counts()[:50].plot.barh(figsize=(18,20), title ='Top 50 Occurring Persons in the dataset')

In [7]:
x = ner_df[ner_df.entities_class == 'PERSON'].entities.value_counts()[:50].reset_index()

In [8]:
fig = px.bar(x, x='entities', y='index', orientation='h', height=1000)
fig.show()

### Top 50 Named Entities which are Countries/Cities/States

In [9]:
# ner_df[ner_df.entities_class == 'GPE'].entities.value_counts()[:50].plot.barh(figsize=(18,20),title ='Top 50 Countries/Cities/States Occurring in the dataset')

In [10]:
x = ner_df[ner_df.entities_class == 'GPE'].entities.value_counts()[:50].reset_index()

In [11]:
fig = px.bar(x, x='entities', y='index', orientation='h', height=1000)
fig.show()

### Top 50 Named Entities which are PRODUCTS

In [12]:
# ner_df[ner_df.entities_class == 'PRODUCT'].entities.value_counts()[:50].plot.barh(figsize=(18,20),title ='Top 50 products occurring in the dataset')

In [13]:
x = ner_df[ner_df.entities_class == 'PRODUCT'].entities.value_counts()[:50].reset_index()

In [14]:
fig = px.bar(x, x='entities', y='index', orientation='h', height=1000)
fig.show()

### Top 50 Named Entities which are ORGANIZATIONS

In [15]:
# ner_df[ner_df.entities_class == 'ORG'].entities.value_counts()[:50].plot.barh(figsize=(18,20),title ='Top 50 organizations occurring in the dataset')

In [16]:
x = ner_df[ner_df.entities_class == 'ORG'].entities.value_counts()[:50].reset_index()

In [17]:
fig = px.bar(x, x='entities', y='index', orientation='h', height=1000)
fig.show()

## YAKE on a Crypto News dataset
### The **YAKE!** model (Yet Another Keyword Extractor) is an **unsupervised** keyword extraction algorithm.
You can load it via `nlu.load('yake')`. It has no weights and is very fast.
It has various parameters that can be configured to influence which keywords are being extracted; [see here for a more in-depth YAKE guide](https://github.com/JohnSnowLabs/nlu/blob/master/examples/webinars_conferences_etc/multi_lingual_webinar/1_NLU_base_features_on_dataset_with_YAKE_Lemma_Stemm_classifiers_NER_.ipynb)

In [18]:
yake_df = nlu.load('yake').predict(df)
yake_df

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,document,keywords,keywords_confidence
0,Bitcoin Price Update: Will China Lead us Down?,update,0.5798862558280943
0,Bitcoin Price Update: Will China Lead us Down?,china,0.5798862558280943
0,Bitcoin Price Update: Will China Lead us Down?,china lead,0.5066323531331214
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,price,0.5798862558280943
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,levels,0.5798862558280943
...,...,...,...
1998,Cryptocurrency Exchange Platform AlphaPoint Pa...,growth,0.26804494089513314
1998,Cryptocurrency Exchange Platform AlphaPoint Pa...,support growth,0.1840422979793308
1999,Want to Buy And Sell Bitcoin Fast and Secure? ...,bitcoin fast,0.3579604335906263
1999,Want to Buy And Sell Bitcoin Fast and Secure? ...,try coinrnr,0.2564243599387429


### Top 50 extracted Keywords with YAKE!

In [19]:
# yake_df.keywords.value_counts()[0:50].plot.barh(figsize=(14,18))

In [20]:
x = yake_df.keywords.value_counts()[0:50].reset_index()

In [21]:
fig = px.bar(x, x='keywords', y='index', orientation='h', height=1000)
fig.show()

## Binary Sentiment Analysis and Distribution on a dataset

In [22]:
sent_df = nlu.load('sentiment').predict(df)
sent_df

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,Bitcoin Price Update: Will China Lead us Down?,"[0.03502599149942398, 0.3820391297340393, 0.76...",neg,0.899232,"[[0.8403199911117554, 0.13267000019550323, -0...."
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,"[0.04941307753324509, 0.1195848360657692, 0.17...",neg,0.994394,"[[-0.22009000182151794, 0.12280000001192093, 0..."
2,"National Australia Bank, Citing Highly Flawed ...","[-0.1178220584988594, 0.02187376841902733, 0.3...",neg,0.998585,"[[-0.003313800087198615, 0.3894599974155426, 0..."
3,Chinese Bitcoin Ban Driven by Chinese Banking ...,"[0.19560889899730682, 0.19520443677902222, 0.3...",neg,0.999998,"[[0.4327400028705597, 0.3958199918270111, 0.58..."
4,Bitcoin Trade Update: Opened Position,"[0.048274993896484375, 0.14680083096027374, 0....",pos,0.985043,"[[0.8403199911117554, 0.13267000019550323, -0...."
...,...,...,...,...,...
1996,NYDFS Extends BitLicense Bitcoin Regulation Co...,"[0.1282588541507721, -0.1378742903470993, 0.17...",neg,0.998673,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1997,Bitfinex Passes Stefan Thomas’s Proof Of Solve...,"[-0.22165587544441223, -0.19953998923301697, 0...",neg,0.999471,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1998,Cryptocurrency Exchange Platform AlphaPoint Pa...,"[0.027292732149362564, 0.3372064232826233, -0....",pos,0.999911,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1999,Want to Buy And Sell Bitcoin Fast and Secure?,"[0.0892653837800026, 0.25083836913108826, 0.14...",pos,0.973229,"[[-0.17124000191688538, 0.5644699931144714, 0...."


In [23]:
# sent_df.sentiment.value_counts().plot.bar(title='Sentiment Distribution ')

In [24]:
x = sent_df.sentiment.value_counts().reset_index()

In [25]:
fig = px.bar(x, x='sentiment', y='index')
fig.show()

## Emotion Analysis and Distribution of Headlines

In [26]:
emo_df = nlu.load('emotion').predict(df)
emo_df

classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,emotion,emotion_confidence,sentence,sentence_embedding_use
0,fear,0.998173,Bitcoin Price Update: Will China Lead us Down?,"[0.05829370766878128, -0.03690449520945549, -0..."
1,joy,0.997696,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,"[0.03808825463056564, -0.04514158144593239, -0..."
2,fear,0.999997,"National Australia Bank, Citing Highly Flawed ...","[0.050343189388513565, -0.013036509975790977, ..."
3,fear,0.999135,Chinese Bitcoin Ban Driven by Chinese Banking ...,"[0.05515281856060028, -0.052379172295331955, -..."
4,joy,0.998864,Bitcoin Trade Update: Opened Position,"[0.05926975607872009, -0.05646343156695366, -0..."
...,...,...,...,...
1996,fear,0.998281,NYDFS Extends BitLicense Bitcoin Regulation Co...,"[0.0639236643910408, -0.05505230277776718, -0...."
1997,fear,0.772053,Bitfinex Passes Stefan Thomas’s Proof Of Solve...,"[0.05917809531092644, -0.04149803891777992, -0..."
1998,joy,0.999348,Cryptocurrency Exchange Platform AlphaPoint Pa...,"[0.053696729242801666, -0.02348092757165432, -..."
1999,fear,0.998905,Want to Buy And Sell Bitcoin Fast and Secure?,"[0.0626637190580368, -0.05945301055908203, -0...."


In [27]:
# emo_df.emotion.value_counts().plot.bar(title='Emotion Distribution')

In [28]:
x = emo_df.emotion.value_counts().reset_index()
x

Unnamed: 0,index,emotion
0,joy,914
1,fear,791
2,surprise,385
3,sadness,70
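Raw counts are easier to compare as shares of the total. The sketch below recomputes the distribution from the four counts shown above, using only the standard library:

```python
# Counts copied from the value_counts() output above
emotion_counts = {"joy": 914, "fear": 791, "surprise": 385, "sadness": 70}

total = sum(emotion_counts.values())
shares = {label: count / total for label, count in emotion_counts.items()}

for label, share in shares.items():
    # Print each label with its share formatted as a percentage.
    print(f"{label:<9}{share:6.1%}")
```

The counts add up to 2160, which is more than the 2000 headlines loaded earlier. One plausible reading is that the output is at sentence level, so headlines with several sentences contribute more than one row.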


In [29]:
import plotly.express as px
fig = px.bar(x, x='emotion', y='index')
fig.show()

**Make sure to restart your notebook again** before starting the next section

In [None]:
print("Please restart kernel if you are in Google Colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

# Answer **Closed Book** and **Open Book** Questions with Google's T5!

![T5 GIF](https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s1600/image3.gif)

You can load the **question answering** model with `nlu.load('en.t5')`

In [1]:
import nlu
# Load question answering T5 model
t5_closed_question = nlu.load('en.t5')

google_t5_small_ssm_nq download started this may take some time.
Approximate size to download 170.8 MB
[OK!]


## Answer **Closed Book Questions**  
Closed book means that no additional context is given and the model must answer the question with the knowledge stored in its weights

In [2]:
t5_closed_question.predict("Who is president of Nigeria?")

Unnamed: 0,document,t5
0,Who is president of Nigeria?,Muhammadu Buhari


In [3]:
t5_closed_question.predict("What is the most common language in India?")

Unnamed: 0,document,t5
0,What is the most common language in India?,Hindi


In [4]:
t5_closed_question.predict("What is the capital of Germany?")

Unnamed: 0,document,t5
0,What is the capital of Germany?,Berlin


## Answer **Open Book Questions**
These are questions where we give the model some additional context that is used to answer the question

In [5]:
import nlu
t5_open_book = nlu.load('answer_question')

t5_base download started this may take some time.
Approximate size to download 451.8 MB
[OK!]


In [6]:
context   = 'Peters last week was terrible! He had an accident and broke his leg while skiing!'
question1  = 'Why was peters week so bad?'
question2  = 'How did peter broke his leg?'

t5_open_book.predict([question1+context, question2 + context])

Unnamed: 0,document,t5
0,Why was peters week so bad?Peters last week wa...,He had an accident and broke his leg while skiing
1,How did peter broke his leg?Peters last week w...,skiing


In [7]:
# Ask T5 questions in the context of a News Article
question1 = 'Who is Jack ma?'
question2 = 'Who is founder of Alibaba Group?'
question3 = 'When did Jack Ma re-appear?'
question4 = 'How did Alibaba stocks react?'
question5 = 'Whom did Jack Ma meet?'
question6 = 'Who did Jack Ma hide from?'


# from https://www.bbc.com/news/business-55728338
news_article_context = """ context:
Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire.
His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses.
The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media.
Alibaba shares surged 5% on Hong Kong's stock exchange on the news.
"""

questions = [
             question1+ news_article_context,
             question2+ news_article_context,
             question3+ news_article_context,
             question4+ news_article_context,
             question5+ news_article_context,
             question6+ news_article_context,]



In [8]:
t5_open_book.predict(questions)

Unnamed: 0,document,t5
0,Who is Jack ma? context: Alibaba Group founder...,Alibaba Group founder
1,Who is founder of Alibaba Group? context: Alib...,Jack Ma
2,When did Jack Ma re-appear? context: Alibaba G...,Wednesday
3,How did Alibaba stocks react? context: Alibaba...,surged 5%
4,Whom did Jack Ma meet? context: Alibaba Group ...,100 rural teachers
5,Who did Jack Ma hide from? context: Alibaba Gr...,Chinese regulators
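In the cells above, the question and the context are concatenated directly, which in the first example leaves no space between them (`...so bad?Peters last week...`). A small helper keeps the separation explicit. This is a sketch: `build_prompt` is a hypothetical name, and the `question:` / `context:` markers follow the SQuAD input format described in the T5 paper. Whether the NLU `answer_question` pipeline needs exactly these markers is an assumption, not something this notebook confirms.

```python
def build_prompt(question, context):
    """Join a question and its context with explicit markers and clean spacing."""
    # Collapse newlines and repeated spaces in the context into single spaces.
    clean_context = " ".join(context.split())
    return f"question: {question.strip()} context: {clean_context}"


context = "Peters last week was terrible! He had an accident and broke his leg while skiing!"
questions = ["Why was peters week so bad?", "How did peter broke his leg?"]

prompts = [build_prompt(q, context) for q in questions]
for prompt in prompts:
    print(prompt)
```

The resulting list could then be passed to `t5_open_book.predict(prompts)`. Pass the raw context text to the helper, without a leading `context:` marker, to avoid duplicating it.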


In [None]:
print("Please restart kernel if you are in Google Colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

# Multi Task T5 model for Summarization and more
The main T5 model was trained for over 20 tasks from the SQUAD/GLUE/SUPERGLUE datasets. See [this notebook](https://github.com/JohnSnowLabs/nlu/blob/master/examples/webinars_conferences_etc/multi_lingual_webinar/7_T5_SQUAD_GLUE_SUPER_GLUE_TASKS.ipynb) for a demo of all tasks


# Overview of every task available with T5
[The T5 model](https://arxiv.org/pdf/1910.10683.pdf) is trained on various datasets for the 18 tasks listed in the table below, which fall into 8 categories.



1. Text summarization
2. Question answering
3. Translation
4. Sentiment analysis
5. Natural Language inference
6. Coreference resolution
7. Sentence Completion
8. Word sense disambiguation

### Every T5 Task with explanation:
|Task Name | Explanation |
|----------|--------------|
|[1.CoLA](https://nyu-mll.github.io/CoLA/)                   | Classify if a sentence is grammatically correct|
|[2.RTE](https://dl.acm.org/doi/10.1007/11736790_9)                    | Classify whether a statement can be deduced from a sentence|
|[3.MNLI](https://arxiv.org/abs/1704.05426)                   | Classify for a hypothesis and premise whether they entail or contradict each other, or neither (3 class).|
|[4.MRPC](https://www.aclweb.org/anthology/I05-5002.pdf)                   | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent)|
|[5.QNLI](https://arxiv.org/pdf/1804.07461.pdf)                   | Classify whether the answer to a question can be deduced from an answer candidate.|
|[6.QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)                    | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent)|
|[7.SST2](https://www.aclweb.org/anthology/D13-1170.pdf)                   | Classify the sentiment of a sentence as positive or negative|
|[8.STSB](https://www.aclweb.org/anthology/S17-2001/)                   | Rate the semantic similarity of a pair of sentences on a scale from 1 to 5 (21 classes)|
|[9.CB](https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601)                     | Classify for a premise and a hypothesis whether they contradict each other or not (binary).|
|[10.COPA](https://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418/0)                   | Classify for a question, premise, and 2 choices which choice the correct choice is (binary).|
|[11.MultiRc](https://www.aclweb.org/anthology/N18-1023.pdf)                | Classify for a question, a paragraph of text, and an answer candidate, if the answer is correct (binary).|
|[12.WiC](https://arxiv.org/abs/1808.09121)                    | Classify for a pair of sentences and an ambiguous word if the word has the same meaning in both sentences.|
|[13.WSC/DPR](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492/0)       | Predict for an ambiguous pronoun in a sentence what it is referring to.  |
|[14.Summarization](https://arxiv.org/abs/1506.03340)          | Summarize text into a shorter representation.|
|[15.SQuAD](https://arxiv.org/abs/1606.05250)                  | Answer a question for a given context.|
|[16.WMT1.](https://arxiv.org/abs/1706.03762)                  | Translate English to German|
|[17.WMT2.](https://arxiv.org/abs/1706.03762)                   | Translate English to French|
|[18.WMT3.](https://arxiv.org/abs/1706.03762)                   | Translate English to Romanian|
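
T5 picks a task through a text prefix placed in front of the input, which is what `setTask` configures in the summarization cell of this notebook. The following is a minimal sketch of how such inputs can be assembled. The prefix strings follow the conventions of the T5 paper and are assumptions here, as are the helper names `build_t5_input` and `build_squad_input`; check the model card before relying on them.

```python
# Hypothetical helpers. The prefixes follow the T5 paper's conventions and
# are assumptions, not values read from the NLU model.
T5_PREFIXES = {
    "cola": "cola sentence: ",
    "sst2": "sst2 sentence: ",
    "summarization": "summarize: ",
    "wmt_en_de": "translate English to German: ",
    "wmt_en_fr": "translate English to French: ",
    "wmt_en_ro": "translate English to Romanian: ",
}


def build_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix to a single-text input."""
    if task not in T5_PREFIXES:
        raise KeyError(f"Unknown task {task!r}; known tasks: {sorted(T5_PREFIXES)}")
    return T5_PREFIXES[task] + text.strip()


def build_squad_input(question: str, context: str) -> str:
    """Assemble the two-part input used for question answering."""
    return f"question: {question.strip()} context: {context.strip()}"


print(build_t5_input("sst2", "I loved this movie."))
print(build_squad_input("Who wrote the report?", "The report was written by Alice."))
```

Keeping the prefixes in one dictionary makes it easy to run the same text through several tasks and compare the outputs.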



In [3]:
import nlu

In [4]:
# Load the Multi Task Model T5
t5_multi = nlu.load('en.t5.base')

t5_base download started this may take some time.
Approximate size to download 451.8 MB
[OK!]


In [5]:
# https://www.reuters.com/article/instant-article/idCAKBN2AA2WF
text = """(Reuters) - Mastercard Inc said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year, joining a string of big-ticket firms that have pledged similar support.

The credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin and would soon accept it as a form of payment.

Asset manager BlackRock Inc and payments companies Square and PayPal have also recently backed cryptocurrencies.

Mastercard already offers customers cards that allow people to transact using their cryptocurrencies, although without going through its network.

"Doing this work will create a lot more possibilities for shoppers and merchants, allowing them to transact in an entirely new form of payment. This change may open merchants up to new customers who are already flocking to digital assets," Mastercard said. (mstr.cd/3tLaPZM)

Mastercard specified that not all cryptocurrencies will be supported on its network, adding that many of the hundreds of digital assets in circulation still need to tighten their compliance measures.

Many cryptocurrencies have struggled to win the trust of mainstream investors and the general public due to their speculative nature and potential for money laundering.
"""
t5_multi['t5_transformer'].setTask('summarize ')
short = t5_multi.predict(text)
short

Unnamed: 0,document,t5
0,(Reuters) - Mastercard Inc said on Wednesday i...,mastercard said on Wednesday it was planning t...


In [6]:
print(f"Original Length {len(short.document.iloc[0])}   Summarized Length : {len(short.t5.iloc[0])} \n summarized text :{short.t5.iloc[0]} ")


Original Length 1277   Summarized Length : 352 
 summarized text :mastercard said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year . the credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin . asset manager blackrock and payments companies Square and PayPal have also recently backed cryptocurrencies . 


In [7]:
short.t5.iloc[0]

'mastercard said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year . the credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin . asset manager blackrock and payments companies Square and PayPal have also recently backed cryptocurrencies .'
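
The lengths printed above (1277 characters in, 352 out) mean the summary keeps roughly 28% of the original characters. A small sketch of that calculation, using a hypothetical helper `compression_ratio` and stand-in strings instead of the notebook's DataFrame:

```python
def compression_ratio(original: str, summary: str) -> float:
    """Return the summary length divided by the original length, in characters."""
    if not original:
        raise ValueError("original text must not be empty")
    return len(summary) / len(original)


# Stand-in strings; with the notebook's DataFrame you would pass
# short.document.iloc[0] and short.t5.iloc[0] instead.
original = "word " * 40
summary = "word " * 10
print(f"Summary keeps {compression_ratio(original, summary):.0%} of the characters")
```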

# Fine Tuned T5 models
NLU provides various fine-tuned T5 models for a diverse set of tasks, including generating SQL queries from natural language, style transfer, and error correction.

In [None]:
print("Please restart kernel if you are in google colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

## T5 for generating SQL queries

In [9]:
# https://nlp.johnsnowlabs.com/2022/01/12/t5_small_wikiSQL_en.html
import nlu
nlu.load('t5.wikiSQL').predict('How many customers have ordered more than 2 items?')

t5_small_wikiSQL download started this may take some time.
Approximate size to download 249.9 MB
[OK!]


Unnamed: 0,document,t5
0,How many customers have ordered more than 2 it...,How many customers ordered > 2 items


## T5 for text style transfer and grammar correction

In [10]:
# https://nlp.johnsnowlabs.com/2022/01/12/t5_active_to_passive_styletransfer_en.html
nlu.load('t5.active_to_passive_styletransfer').predict('I am writing you a letter.')

t5_active_to_passive_styletransfer download started this may take some time.
Approximate size to download 252.7 MB
[OK!]


Unnamed: 0,document,t5
0,I am writing you a letter.,a letter is written by me.


**Make sure to restart your notebook again** before starting the next section

In [None]:
print("Please restart kernel if you are in google colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

# Conditional Text Generation with GPT2
- GPT2 is very capable of generating text, but it introduces a new engineering challenge known as Prompt Engineering.
- The output of GPT2 depends on the text sequence we feed it at the beginning, the so-called “prompt”. Choosing the right prompt for your problem is the biggest challenge.

## GPT2 with NLU



In [1]:
import nlu
gpt2_pipe = nlu.load('gpt2')
gpt2_pipe.print_info()

gpt2 download started this may take some time.
Approximate size to download 442.7 MB
[OK!]
The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['gpt2'] has settable params:
component_list['gpt2'].setBatchSize(4)                         | Info: Size of every batch | Currently set to : 4
component_list['gpt2'].setIgnoreTokenIds([])                   | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []
component_list['gpt2'].setRepetitionPenalty(1.0)               | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 1.0
component_list['gpt2'].setTask('')                             | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : 
component_list['gpt2'].setTemperature(1.0)                     | Info: The value used to module the next token probabiliti

In [2]:
gpt2_pipe['gpt2'].setMaxOutputLength(100)
gpt2_pipe.predict("""Explain the plot of Star Wars""")['generated'].values[0]

' Explain the plot of Star Wars: The Last Jedi.\n\nThe plot of The Last Star Wars is a story about a young Jedi named Obi-Wan Kenobi, who is sent to the planet Tatooine to study the Force. He is sent there by the Jedi Order to study and learn the Force, but he is not able to do so due to his lack of training. He eventually learns that the Force is not real, and that he must learn to use it to his advantage.\n.'

In [3]:
# Bad prompting: the prompt gives GPT2 no pattern to continue, so it does not pick up what we want and the output is poor
gpt2_pipe.predict("Suggest me a good Sci-Fi movie")['generated'].values[0]


" Suggest me a good Sci-Fi movie.\n\nI'm not sure if I'm going to be able to get a good movie out of this one. I'm not going to get it. I don't know if I'll be able. I just don't have the time. I can't get it out of my head. I have to get the movie out. I've got to get my hands on it.\n.\n (Laughs.)\n\n."

In [4]:
# Good prompting: help GPT2 out by including a few samples of the desired pattern in the prompt we condition it on
gpt2_pipe.predict("""Generate a top 10 movie list: \n
1.The Matrix \n
2.Terminator \n
3. """)['generated'].values[0]


' Generate a top 10 movie list: 1.The Matrix 2.Terminator 3.The Hunger Games 4.The Dark Knight Rises 5.The Lord of the Rings 6.The Hobbit 7.The Last Crusade 8.The Lion King 9.The Lego Movie 10.The LEGO Movie 11.The King of the Hill 12.The Little Mermaid 13.The Princess Bride 14.The Pirates of the Caribbean 15.The Prince of Persia 16.The Return of the Jedi 17.The Star Wars'
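
The good prompt above works because it ends on an open list slot, but the output also shows GPT2 running past item 10. Below is a minimal sketch, under the assumption that items always look like `N.Title`, of building such a prompt and trimming the generated list afterwards. Both helper names are hypothetical and not part of NLU.

```python
import re
from typing import List


def build_list_prompt(instruction: str, examples: List[str]) -> str:
    """Build a numbered few-shot prompt that ends on the next open list slot."""
    lines = [f"{instruction}:"]
    lines += [f"{i}.{item}" for i, item in enumerate(examples, start=1)]
    lines.append(f"{len(examples) + 1}.")
    return " \n".join(lines)


def extract_numbered_items(generated: str, limit: int) -> List[str]:
    """Pull 'N.Title' entries out of generated text and keep the first `limit`.

    Assumes titles do not themselves contain a number followed by a dot.
    """
    pattern = re.compile(r"\d+\.\s*(.+?)(?=\s+\d+\.|\s*$)", re.DOTALL)
    return [m.group(1).strip() for m in pattern.finditer(generated)][:limit]


prompt = build_list_prompt("Generate a top 10 movie list", ["The Matrix", "Terminator"])
print(prompt)

# Stand-in for the text GPT2 returns; in the notebook this would come from
# gpt2_pipe.predict(prompt)['generated'].values[0]
generated = "Generate a top 3 movie list: 1.The Matrix 2.Terminator 3.The Hunger Games 4.The Lion King"
print(extract_numbered_items(generated, limit=3))
```

Trimming after generation is a workaround; setting a shorter maximum output length is the other lever shown in this section.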

**Make sure to restart your notebook again** before starting the next section

In [None]:
print("Please restart kernel if you are in google colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

# Translate between more than 200 Languages with [Microsoft's Marian Models](https://marian-nmt.github.io/publications/)

Marian is an efficient, free Neural Machine Translation framework developed mainly by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages).
You need to specify the language your data is in as `start_language` and the language you want to translate to as `target_language`.    
The language references must be [ISO language codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

`nlu.load('<start_language>.translate_to.<target_language>')`       

**Translate Turkish to English:**     
`nlu.load('tr.translate_to.en')`

**Translate English to French:**     
`nlu.load('en.translate_to.fr')`


**Translate French to Hebrew:**     
`nlu.load('fr.translate_to.he')`
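
The reference pattern above can be generated programmatically when you want to loop over several language pairs. A minimal sketch with a hypothetical helper `translation_ref`; it only checks that the codes look like two-letter ISO 639-1 codes, not that a model exists for the pair.

```python
import re


def translation_ref(start_language: str, target_language: str) -> str:
    """Build '<start_language>.translate_to.<target_language>' from two ISO 639-1 codes."""
    for code in (start_language, target_language):
        if not re.fullmatch(r"[a-z]{2}", code):
            raise ValueError(f"{code!r} is not a two-letter lowercase language code")
    if start_language == target_language:
        raise ValueError("start and target language must differ")
    return f"{start_language}.translate_to.{target_language}"


for src, tgt in [("tr", "en"), ("en", "fr"), ("fr", "he")]:
    print(translation_ref(src, tgt))
```

The resulting string is what you would pass to `nlu.load(...)`.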





![Languages](https://camo.githubusercontent.com/b548abf3d1f9657d01fd74404354ec49fc11eea0/687474703a2f2f636b6c2d69742e64652f77702d636f6e74656e742f75706c6f6164732f323032312f30322f666c6167732e6a706567)

In [1]:
import nlu
import pandas as pd
!wget http://ckl-it.de/wp-content/uploads/2020/12/small_btc.csv
df = pd.read_csv('/content/small_btc.csv').iloc[0:20].title

--2023-07-06 05:55:43--  http://ckl-it.de/wp-content/uploads/2020/12/small_btc.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22244914 (21M) [text/csv]
Saving to: ‘small_btc.csv.1’


2023-07-06 05:55:48 (5.82 MB/s) - ‘small_btc.csv.1’ saved [22244914/22244914]



## Translate to German

In [2]:
translate_pipe = nlu.load('en.translate_to.de')
translate_pipe.predict(df)

translate_en_de download started this may take some time.
Approx size to download 268 MB
[OK!]


Unnamed: 0,sentence_dl,translated_translation
0,Bitcoin Price Update: Will China Lead us Down?,Bitcoin Price Update: Wird China uns nach unte...
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,Preisniveau für Bitcoin für Woche 51 (15 - 22 ...
2,"National Australia Bank, Citing Highly Flawed ...","National Australia Bank, zitiert hoch abgeflac..."
3,Chinese Bitcoin Ban Driven by Chinese Banking...,Chinesische Bitcoin Ban angetrieben durch chin...
4,Bitcoin Trade Update: Opened Position,Bitcoin Trade Update: Geöffnete Position
5,Key Bitcoin Price Levels for Week 52 (22 – 28 ...,Key Bitcoin Price Levels für Woche 52 (22 - 28...
6,Bitcoin Survival,Bitcoin Überleben
7,Massive Bitcoin Sell Going On,Massive Bitcoin verkaufen weiter
8,Why Bitcoin will rise on Monday 23rd by more t...,Warum Bitcoin am Montag um mehr als 10% steige...
9,"Why Bitcoin is falling, and will rise again",Warum Bitcoin fällt und wieder aufsteigt


## Translate to Chinese

In [3]:
translate_pipe = nlu.load('en.translate_to.zh')
translate_pipe.predict(df)

translate_en_zh download started this may take some time.
Approx size to download 280.9 MB
[OK!]


Unnamed: 0,sentence_dl,translated_translation
0,Bitcoin Price Update: Will China Lead us Down?,Bitcoin 价格最新消息:中国会带领我们下台吗 ? . . . . . . . . . . .
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,第51周(12月15 - 22日) Bitcoin 关键价格 水平 。 12月 15 - 2...
2,"National Australia Bank, Citing Highly Flawed ...","国家澳大利亚银行, 援引高易燃数据, 称 Bitcoin 是一个泡泡 。 Name UN C..."
3,Chinese Bitcoin Ban Driven by Chinese Banking...,被中国银行危机驱赶的中国 Bitcoin Ban ? ? ? 。 。 。 。 。 。 。 。 。
4,Bitcoin Trade Update: Opened Position,Bittcoin 贸易最新贸易 : 开放位置 : 开放位置 Name Name 开放位置 N...
5,Key Bitcoin Price Levels for Week 52 (22 – 28 ...,12月22 - 28日 - 我的贸易计划 - Bitcoin 价格第52周( 12月 22 ...
6,Bitcoin Survival,Bitcoin 生存 毕 生 活 生 生 业 业 业 业 业 业 业 业 业 业 业 业
7,Massive Bitcoin Sell Going On,大规模 Bittcoin 卖 卖 上 上 上 上 上 上 上 上 上 上 上 上 上 上
8,Why Bitcoin will rise on Monday 23rd by more t...,"为何比特币 会在 23 日星期一 上升 超过 10% 的 比例 , 超过 10% 的 比例 ..."
9,"Why Bitcoin is falling, and will rise again","为何比特币 跌了 , 还会再升 , 何必 跌 , 何必 , 何 跌 , 何 跌 ,"


## Translate to Hindi

In [4]:
translate_pipe = nlu.load('en.translate_to.hi')
translate_pipe.predict(df)

translate_en_hi download started this may take some time.
Approx size to download 275.1 MB
[OK!]


Unnamed: 0,sentence_dl,translated_translation
0,Bitcoin Price Update: Will China Lead us Down?,बिटटोन कीमत अद्यतन: क्या चीन हमें नीचे ले जाएगा?
1,Key Bitcoin Price Levels for Week 51 (15 – 22 ...,सप्ताह 51 (15 - 22 डेक) के लिए कुंजी बिटस्लेट्...
2,"National Australia Bank, Citing Highly Flawed ...","नैशनल ऑस्ट्रेलिया बैंक, उच्च रूप सेित डाटा का ..."
3,Chinese Bitcoin Ban Driven by Chinese Banking...,चीनी बिटपरन बैंगिंग संकट से ड्राइव?
4,Bitcoin Trade Update: Opened Position,बिटफिक्स अद्यतन:
5,Key Bitcoin Price Levels for Week 52 (22 – 28 ...,मैं इस बात को समझ नहीं पाया कि मैं क्या करूँ ।
6,Bitcoin Survival,मेक्सेन सुरक्षा
7,Massive Bitcoin Sell Going On,भारी धातु की लत पर बिकना
8,Why Bitcoin will rise on Monday 23rd by more t...,क्यों बिटकोन सोमवार 23 बजे से 10% तक बढ़ जाएगा
9,"Why Bitcoin is falling, and will rise again","अज़ाबेरी गिरता ही क्यों है, और फिर उठ खड़ा होगा"


# Train a Multi Lingual Classifier for 100+ languages from a dataset with just one language

[Leverage Language-agnostic BERT Sentence Embedding (LABSE) and achieve state of the art!](https://arxiv.org/abs/2007.01852)

Training a classifier with LABSE embeddings enables the knowledge to be transferred to 109 languages!
With the [SentimentDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl-multi-class-sentiment-analysis-annotator)  from Spark NLP you can achieve State Of the Art results on any binary class text classification problem.

### Languages supported by LABSE
![labse languages](http://ckl-it.de/wp-content/uploads/2021/02/LABSE.png)



**Make sure to restart your notebook again** before starting the next section

In [None]:
print("Please restart kernel if you are in google colab to clear up RAM")
# Deliberately raises a TypeError so that "Run all" stops at this cell
1 + 'Please Restart'

In [1]:
import nlu
# Download the French Twitter sentiment dataset: https://www.kaggle.com/hbaflast/french-twitter-sentiment-analysis
! wget http://ckl-it.de/wp-content/uploads/2021/02/french_tweets.csv

import pandas as pd

train_path = '/content/french_tweets.csv'

train_df = pd.read_csv(train_path)
# the text data to use for classification should be in a column named 'text'
columns=['text','y']
train_df = train_df[columns]
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df

--2023-07-06 06:04:01--  http://ckl-it.de/wp-content/uploads/2021/02/french_tweets.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10237264 (9.8M) [text/csv]
Saving to: ‘french_tweets.csv’


2023-07-06 06:04:05 (3.67 MB/s) - ‘french_tweets.csv’ saved [10237264/10237264]



Unnamed: 0,text,y
0,Prend la veille d'alix Ã lga pour son vol Ã ...,negative
1,A l'air bien personnel et pourtant professionn...,positive
2,Retour Ã l'avenir Ã©tait la bÃ»che de ma gÃ©n...,negative
3,"Hmm, l'Ã©paule fait de mauvais bruits si je re...",negative
4,Peut se rapporter. N'est pas nÃ© artsy soit & ...,positive
...,...,...
99995,Pas sÃ»r que hannity pourrait nuire ... pensez...,positive
99996,Si cela ne se passe pas sur ses propres points...,positive
99997,Je ne suis certainement pas fan de la pluie. A...,negative
99998,Il n'y a rien comme passer un temps de qualitÃ...,positive


## Train Deep Learning Classifier using `nlu.load('train.sentiment')`

All you need is a Pandas Dataframe with a label column named `y` and a text column named `text`.

We train on a French dataset and can then predict classes **in 100+ languages**.

In [2]:
from sklearn.metrics import classification_report
# Train longer!
trainable_pipe = nlu.load('xx.embed_sentence.labse train.sentiment')
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(60)
trainable_pipe['trainable_sentiment_dl'].setLr(0.005)
fitted_pipe = trainable_pipe.fit(train_df.iloc[:2000])
# predict with the fitted pipeline on the same 2,000 rows it was trained on, so the report below measures training fit, not held-out accuracy
preds = fitted_pipe.predict(train_df.iloc[:2000],output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)


labse download started this may take some time.
Approximate size to download 1.7 GB
[OK!]


In [3]:
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.87      0.83      0.85       924
    positive       0.86      0.89      0.88      1076

    accuracy                           0.86      2000
   macro avg       0.86      0.86      0.86      2000
weighted avg       0.86      0.86      0.86      2000



Unnamed: 0,document,sentence_embedding_labse,sentiment,sentiment_confidence,text,y
0,Prend la veille d'alix Ã lga pour son vol Ã ...,"[-0.054983869194984436, -0.01708587259054184, ...",negative,3.0,Prend la veille d'alix Ã lga pour son vol Ã ...,negative
1,A l'air bien personnel et pourtant professionn...,"[0.030396170914173126, -0.017040973529219627, ...",positive,4.0,A l'air bien personnel et pourtant professionn...,positive
2,Retour Ã l'avenir Ã©tait la bÃ»che de ma gÃ©n...,"[-0.00018120733147952706, -0.02415195666253566...",positive,1.0,Retour Ã l'avenir Ã©tait la bÃ»che de ma gÃ©n...,negative
3,"Hmm, l'Ã©paule fait de mauvais bruits si je re...","[-0.044084448367357254, -0.04020474851131439, ...",negative,4.0,"Hmm, l'Ã©paule fait de mauvais bruits si je re...",negative
4,Peut se rapporter. N'est pas nÃ© artsy soit & ...,"[0.02326550893485546, -0.03429331257939339, -0...",positive,0.0,Peut se rapporter. N'est pas nÃ© artsy soit & ...,positive
...,...,...,...,...,...,...
1995,J'avais ce problÃ¨me plus tÃ´t. Myspace stupide.,"[0.03161639720201492, 0.002442318247631192, -0...",negative,2.0,J'avais ce problÃ¨me plus tÃ´t. Myspace stupide.,negative
1996,Regarder les britains a eu du talent Simon a d...,"[0.007778815925121307, -0.00585961202159524, -...",positive,3.0,Regarder les britains a eu du talent Simon a d...,positive
1997,Bienvenue Ã,"[0.03602616861462593, -0.044472962617874146, 0...",positive,2.0,Bienvenue Ã,positive
1998,Le soleil est sorti,"[-0.052676863968372345, -0.05059286952018738, ...",positive,1.0,Le soleil est sorti,positive
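
The precision, recall and F1 columns in the report above come from simple counts of true positives, false positives and false negatives per class. Here is a minimal pure-Python sketch of that calculation on toy labels; `binary_report` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from typing import Dict, Sequence


def binary_report(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Dict[str, float]:
    """Compute precision, recall and F1 for one class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels; with the notebook's DataFrame you would pass preds['y'] and preds['sentiment']
y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "positive"]
print(binary_report(y_true, y_pred, label="positive"))
```

For a number that reflects generalization, the same calculation would need predictions on rows that were not passed to `fit`.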


### Test the fitted pipe on new example

#### The Model understands English
![en](https://www.worldometers.info/img/flags/small/tn_nz-flag.gif)

In [4]:
fitted_pipe.predict("This was awful!")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,This was awful!,"[0.03472338244318962, -0.06212151423096657, -0...",negative,0.996757


In [5]:
fitted_pipe.predict("This was great!")

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,This was great!,"[0.03868241235613823, -0.05859256163239479, -0...",positive,1.0


#### The Model understands German
![de](https://www.worldometers.info/img/flags/small/tn_gm-flag.gif)

In [6]:
# German for: 'This movie was great!'
fitted_pipe.predict("Der Film war echt klasse!")

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,Der Film war echt klasse!,"[-0.01518925279378891, -0.048348307609558105, ...",positive,1.0


In [7]:
# German for: 'This movie was really boring'
fitted_pipe.predict("Der Film war echt langweilig!")

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,Der Film war echt langweilig!,"[-0.01357333268970251, -0.051444556564092636, ...",positive,0.717493


#### The Model understands Chinese
![zh](https://www.worldometers.info/img/flags/small/tn_ch-flag.gif)

In [8]:
# Chinese for: "This movie was awful!"
fitted_pipe.predict("这部电影太糟糕了！")

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,这部电影太糟糕了！,"[-0.05075531452894211, -0.06159878522157669, -...",negative,0.999994


In [9]:
# Chinese for: "This move was great!"
fitted_pipe.predict("此举很棒！")


Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,此举很棒！,"[0.026034913957118988, -0.065867118537426, -0....",positive,1.0


#### The model understands Vietnamese
![vi](https://www.worldometers.info/img/flags/small/tn_vm-flag.gif)

In [10]:
# Vietnamese for : 'The movie was painful to watch'
fitted_pipe.predict('Phim đau điếng người xem')


Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,Phim đau điếng người xem,"[-0.054146409034729004, 0.04168795421719551, -...",negative,1.0


In [11]:

# Vietnamese for : 'I am really happy today.'
fitted_pipe.predict('Tôi thực sự hạnh phúc ngày hôm nay.')

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,Tôi thực sự hạnh phúc ngày hôm nay.,"[0.0002011935575865209, -0.05289962515234947, ...",negative,0.994323


#### The model understands Japanese
![ja](https://www.worldometers.info/img/flags/small/tn_ja-flag.gif)


In [12]:

# Japanese for : 'This is now my favorite movie!'
fitted_pipe.predict('これが私のお気に入りの映画です！')

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,これが私のお気に入りの映画です！,"[-0.006344369612634182, -0.03161679580807686, ...",positive,1.0


In [13]:

# Japanese for : 'I would rather kill myself than watch that movie again'
fitted_pipe.predict('その映画をもう一度見るよりも自殺したい')

Unnamed: 0,sentence,sentence_embedding_labse,sentiment,sentiment_confidence
0,その映画をもう一度見るよりも自殺したい,"[-0.04823153465986252, -0.036920782178640366, ...",negative,0.999997
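
Tallying the ten spot checks above gives a rough picture of how well the French-trained classifier transfers. In this sketch the predicted labels are copied from the outputs above, while the expected labels are an assumption read off the English glosses in the cell comments.

```python
from typing import List, Tuple

# (language, expected label from the cell comment, label predicted in the output above)
spot_checks: List[Tuple[str, str, str]] = [
    ("en", "negative", "negative"),  # This was awful!
    ("en", "positive", "positive"),  # This was great!
    ("de", "positive", "positive"),  # Der Film war echt klasse!
    ("de", "negative", "positive"),  # Der Film war echt langweilig!
    ("zh", "negative", "negative"),
    ("zh", "positive", "positive"),
    ("vi", "negative", "negative"),
    ("vi", "positive", "negative"),  # 'I am really happy today.'
    ("ja", "positive", "positive"),
    ("ja", "negative", "negative"),
]


def accuracy(rows: List[Tuple[str, str, str]]) -> float:
    """Share of rows where the predicted label equals the expected label."""
    return sum(expected == predicted for _, expected, predicted in rows) / len(rows)


def missed_languages(rows: List[Tuple[str, str, str]]) -> List[str]:
    """Languages with at least one wrong prediction, in order of appearance."""
    return [lang for lang, expected, predicted in rows if expected != predicted]


print(f"accuracy: {accuracy(spot_checks):.0%}, misses in: {missed_languages(spot_checks)}")
```

Ten sentences are far too few for a real evaluation, but the two misses are a reminder that cross-lingual transfer should be checked per language.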


# There are many more models you can put to use in 1 line of code!
## Check out [the Modelshub](https://nlp.johnsnowlabs.com/models) and the [NLU Namespace](https://nlu.johnsnowlabs.com/docs/en/spellbook) for more models

### NLU Webinars and Video Tutorials
- [NLU & Streamlit Tutorial](https://vimeo.com/579508034#)
- [Crash course of the 50+ Medical Domains and the 200+ Healthcare models in NLU](https://www.youtube.com/watch?v=gGDsZXt1SF8)
- [Multi Lingual NLU Webinar - Tutorial on Chinese News dataset](https://www.youtube.com/watch?v=ftAOqJuxnV4)
- [John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code](https://events.johnsnowlabs.com/john-snow-labs-nlu-become-a-data-science-superhero-with-one-line-of-python-code?hsCtaTracking=c659363c-2188-4c86-945f-5cfb7b42fcfc%7C8b2b188b-92a3-48ba-ad7e-073b384425b0)
- [Python Web Conf - Python's NLU library: 1,000+ Models, 200+ Languages, State of the Art Accuracy, 1 Line of Code](https://2021.pythonwebconf.com/presentations/john-snow-labs-nlu-the-simplicity-of-python-the-power-of-spark-nlp)
- [NYC/DC NLP Meetup with NLU](https://youtu.be/hJR9m3NYnwk?t=2155)

### More resources
- [Join our Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA)
- [NLU Website](https://nlu.johnsnowlabs.com/)
- [NLU Github](https://github.com/JohnSnowLabs/nlu)
- [Many more NLU example tutorials](https://github.com/JohnSnowLabs/nlu/tree/master/examples)
- [Overview of every powerful nlu 1-liner](https://nlu.johnsnowlabs.com/docs/en/examples)
- [Check out the Modelshub for an overview of all models](https://nlp.johnsnowlabs.com/models)
- [Check out the NLU Namespace where you can find every model as a table](https://nlu.johnsnowlabs.com/docs/en/spellbook)
- [Intro to NLU article](https://medium.com/spark-nlp/1-line-of-code-350-nlp-models-with-john-snow-labs-nlu-in-python-2f1c55bba619)
- [In-depth and easy Sentence Similarity Tutorial, with StackOverflow Questions using BERTology embeddings](https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf)
- [1 line of Python code for BERT, ALBERT, ELMO, ELECTRA, XLNET, GLOVE, Part of Speech with NLU and t-SNE](https://medium.com/spark-nlp/1-line-of-code-for-bert-albert-elmo-electra-xlnet-glove-part-of-speech-with-nlu-and-t-sne-9ebcd5379cd)