![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb)

## 0. Colab Setup

In [19]:
import os

# Install Java 8 (headless OpenJDK)
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install PySpark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


## 1. Start Spark Session

In [2]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)


Spark NLP version 2.5.4
Apache Spark version: 2.4.4


## LanguageDetectorDL Pre-trained Pipelines

* Available pre-trained pipelines: https://github.com/JohnSnowLabs/spark-nlp-models#multi-language---pipelines

| Pipeline              | Name                   | Build  | Lang  | Offline   |
|:----------------------|:-----------------------|:-------|:------|:----------|
| LanguageDetectorDL    | `detect_language_7`        | 2.5.2 |      `xx` |[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_7_xx_2.5.0_2.4_1591875676774.zip) |
| LanguageDetectorDL    | `detect_language_20`        | 2.5.2 |      `xx` |[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_20_xx_2.5.0_2.4_1591875683182.zip) |


* Available pre-trained models: https://github.com/JohnSnowLabs/spark-nlp-models#multi-language

| Model                        | Name               | Build            | Lang | Offline |
|:-----------------------------|:-------------------|:-----------------|:------|:------|
| LanguageDetectorDL    | `ld_wiki_7`        | 2.5.2 |      `xx`         | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_7_xx_2.5.0_2.4_1591875673486.zip) |
| LanguageDetectorDL    | `ld_wiki_20`        | 2.5.2 |      `xx`         | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_20_xx_2.5.0_2.4_1591875680011.zip) |

* The model with 7 languages: Czech, German, English, Spanish, French, Italian, and Slovak
* The model with 20 languages: Bulgarian, Czech, German, Greek, English, Spanish, Finnish, French, Croatian, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Turkish, and Ukrainian
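
The pipeline outputs in this notebook report languages as two-letter codes (for example `ru`, `en`, `fr`). Below is a minimal lookup sketch for turning those codes into readable names. It assumes the models emit ISO 639-1 codes for every language listed above, which this notebook only demonstrates for three of them. The names `LANGUAGES_20`, `LANGUAGES_7`, and `language_name` are hypothetical helpers, not part of Spark NLP.

```python
# Hypothetical lookup tables (not part of Spark NLP).
# Assumption: the models label languages with ISO 639-1 codes.
LANGUAGES_20 = {
    "bg": "Bulgarian", "cs": "Czech", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "hr": "Croatian", "hu": "Hungarian", "it": "Italian", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "sk": "Slovak", "sv": "Swedish", "tr": "Turkish", "uk": "Ukrainian",
}

# The 7-language model covers a subset of the 20-language model.
LANGUAGES_7 = {
    code: LANGUAGES_20[code]
    for code in ("cs", "de", "en", "es", "fr", "it", "sk")
}


def language_name(code, table=LANGUAGES_20):
    """Return the language name for a code, or a marker for unknown codes."""
    return table.get(code, "unknown code: " + code)


print(language_name("ru"))
```

Passing `table=LANGUAGES_7` restricts the lookup to the smaller model's languages, which makes it easy to spot a code that the 7-language model should never return.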



# LanguageDetectorDL
## Pre-trained Pipelines

In [3]:
from sparknlp.pretrained import PretrainedPipeline

In [4]:
# Download a pre-trained pipeline by name and language
language_detector_pipeline = PretrainedPipeline('detect_language_20', lang='xx')

# LanguageDetectorDL works best with text longer than 140 characters, although this
# depends on the language (how similar its characters are to those of other languages)
language_detector_pipeline.annotate("«Нападение на 13-й участок»")


detect_language_20 download started this may take some time.
Approx size to download 3 MB
[OK!]


{'document': ['«Нападение на 13-й участок»'], 'language': ['ru']}
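
The comment in the cell above notes that detection works best on text longer than 140 characters, and the example title is much shorter than that. One way to flag such inputs before trusting a prediction is a simple length check. This is a sketch only: `MIN_RELIABLE_CHARS` and `is_long_enough` are hypothetical names, and the cutoff is taken from that comment rather than from any Spark NLP setting.

```python
# Hypothetical guard (not part of Spark NLP): flag inputs that are probably
# too short for reliable detection. The 140-character cutoff comes from the
# comment in the notebook cell, not from a library parameter.
MIN_RELIABLE_CHARS = 140


def is_long_enough(text, min_chars=MIN_RELIABLE_CHARS):
    """Return True when the stripped text has more than min_chars characters."""
    return len(text.strip()) > min_chars


short_title = "«Нападение на 13-й участок»"
print(is_long_enough(short_title))  # False: the title is far below the cutoff
```

A short input can still be classified correctly, as the `ru` result above shows; the check only marks predictions that deserve a second look.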

# LanguageDetectorDL
## Pre-trained Models

In [5]:
from pyspark.ml import Pipeline  # explicit import; Pipeline is used in the next cell
from sparknlp.base import *
from sparknlp.annotator import *

In [6]:
# Wrap the raw "text" column into Spark NLP document annotations
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Pre-trained 20-language detector; predictions are written to the "lang" column
language_detector = LanguageDetectorDL.pretrained("ld_wiki_20")\
    .setInputCols(["document"])\
    .setOutputCol("lang")\
    .setThreshold(0.8)\
    .setCoalesceSentences(True)

languagePipeline = Pipeline(stages=[
    documentAssembler,
    language_detector
])

ld_wiki_20 download started this may take some time.
Approximate size to download 3 MB
[OK!]
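
The cell above calls `setThreshold(0.8)`, which sets a minimum score for accepting a prediction. The general idea can be sketched in plain Python. This illustrates score thresholding in general, not Spark NLP's internal code: the function name `pick_language`, the `"Unknown"` fallback label, the `>=` comparison, and the example scores are all assumptions made for this sketch.

```python
# Illustrative sketch of score thresholding (not Spark NLP's implementation).
# Assumptions: scores are probabilities keyed by language code, and a
# prediction below the threshold is replaced by a fallback label.
def pick_language(scores, threshold=0.8, fallback="Unknown"):
    """Return the top-scoring language if it reaches the threshold."""
    if not scores:
        return fallback
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else fallback


# Made-up scores for illustration only
confident = {"en": 0.97, "fr": 0.02, "de": 0.01}
ambiguous = {"cs": 0.55, "sk": 0.45}

print(pick_language(confident))
print(pick_language(ambiguous))
```

A higher threshold trades coverage for precision: closely related languages such as Czech and Slovak are more likely to fall below it and be left unlabeled.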


In [13]:
test_df = spark.createDataFrame([
  ['Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.'], 
  ['Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.']]
).toDF("text")

results = languagePipeline.fit(test_df).transform(test_df)

In [14]:
results.select("lang.result").show()

+------+
|result|
+------+
|  [en]|
|  [fr]|
+------+



In [18]:
# probabilities for other languages
results.select("lang.metadata").show(2, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------