![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/language-detection/Language_Detection_and_Indentification.ipynb)

# Language Detection and Identification

## 0. Colab

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


## 1. Start Spark Session

In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)


Spark NLP version 4.3.1
Apache Spark version: 3.3.0


## LanguageDetectorDL Pre-trained Models & Pipelines

* Available pre-trained pipelines: https://sparknlp.org/models?tag=language_detection


| Model              | Name                  | Build | Lang |
|:-------------------|:----------------------|:------|:-----|
| LanguageDetectorDL | `detect_language_21`  | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_43`  | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_95`  | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_99`  | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_220` | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_231` | 2.7.0 | `xx` |
| LanguageDetectorDL | `detect_language_375` | 2.7.0 | `xx` |

# LanguageDetectorDL
## Pre-trained Pipelines

In [None]:
from sparknlp.pretrained import PretrainedPipeline

In [None]:
# Download a pre-trained pipeline by name and language
language_detector_pipeline = PretrainedPipeline('detect_language_21', lang='xx')

# LanguageDetectorDL works best with text longer than 140 characters.
# Shorter inputs such as this one can be misclassified, especially among
# languages whose characters are similar.
language_detector_pipeline.annotate("«Нападение на 13-й участок»")


detect_language_21 download started this may take some time.
Approx size to download 7.7 MB
[OK!]


{'document': ['«Нападение на 13-й участок»'],
 'sentence': ['«Нападение на 13-й участок»'],
 'language': ['bg']}
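
The 140-character guideline from the comment above can be applied before annotation. The following is a minimal sketch in plain Python, not part of Spark NLP: the helper name `split_by_length` and the constant `MIN_RELIABLE_CHARS` are hypothetical, and the cut-off is a rule of thumb rather than a hard limit.

```python
# Rule of thumb taken from the comment above; not a hard limit of the model.
MIN_RELIABLE_CHARS = 140


def split_by_length(texts, min_chars=MIN_RELIABLE_CHARS):
    """Separate texts into (long_enough, too_short) by character count."""
    long_enough, too_short = [], []
    for text in texts:
        # Measure the text after trimming surrounding whitespace.
        if len(text.strip()) >= min_chars:
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short


samples = [
    "«Нападение на 13-й участок»",
    "Spark NLP is an open-source text processing library for advanced natural "
    "language processing for the Python, Java and Scala programming languages.",
]

reliable, doubtful = split_by_length(samples)
print("long enough:", len(reliable))
print("too short:", len(doubtful))
```

Texts in the second list can still be annotated; the split only marks which predictions deserve a closer look.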

# LanguageDetectorDL
## Pre-trained Models

In [None]:
from pyspark.ml import Pipeline

from sparknlp.base import *
from sparknlp.annotator import *

In [None]:
# Turn the raw "text" column into Spark NLP document annotations
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the pre-trained 21-language detector
language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21") \
    .setInputCols(["document"]) \
    .setOutputCol("lang") \
    .setThreshold(0.8) \
    .setCoalesceSentences(True)

languagePipeline = Pipeline(stages=[
    documentAssembler,
    language_detector
])

ld_wiki_tatoeba_cnn_21 download started this may take some time.
Approximate size to download 7.1 MB
[OK!]
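
The two settings used above, `setThreshold(0.8)` and `setCoalesceSentences(True)`, can be pictured with plain Python. This is an illustrative sketch, not Spark NLP's implementation: it assumes that coalescing averages the per-sentence scores into one result, and that a result below the threshold is replaced by a fallback label (assumed here to be `"Unknown"`). The function names `coalesce` and `pick_language` and the score values are made up for the example.

```python
def coalesce(sentence_scores):
    """Average a list of per-sentence {language: probability} dicts."""
    if not sentence_scores:
        raise ValueError("need at least one sentence")
    languages = set().union(*sentence_scores)  # all language codes seen
    n = len(sentence_scores)
    return {
        lang: sum(scores.get(lang, 0.0) for scores in sentence_scores) / n
        for lang in languages
    }


def pick_language(scores, threshold=0.8, fallback="Unknown"):
    """Return the top language, or the fallback if it scores below threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else fallback


# Made-up scores for a two-sentence document.
per_sentence = [
    {"en": 0.9, "fr": 0.1},
    {"en": 0.6, "fr": 0.4},
]

averaged = coalesce(per_sentence)
print(pick_language(averaged, threshold=0.8))
print(pick_language(averaged, threshold=0.7))
```

In this sketch a higher threshold trades coverage for precision: more documents fall back to the placeholder label, but the labels that remain are more confident.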


In [None]:
test_df = spark.createDataFrame([
  ['Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.'], 
  ['Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.']]
).toDF("text")

results = languagePipeline.fit(test_df).transform(test_df)

In [None]:
results.select("lang.result").show()

+------+
|result|
+------+
|  [en]|
|  [fr]|
+------+
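
Besides the top label, each annotation carries a metadata map, which the next cell prints in full. A small helper can reduce such a map to the few most likely languages. The sketch below is plain Python and assumes the map has been collected into a dict of language code to probability string, possibly with bookkeeping keys such as `sentence`; the helper name `top_languages`, the `ignore` parameter, and the sample values are hypothetical.

```python
def top_languages(metadata, k=3, ignore=("sentence",)):
    """Return the k highest-scoring (language, probability) pairs."""
    scores = {}
    for key, value in metadata.items():
        if key in ignore:
            continue  # skip bookkeeping entries that are not languages
        try:
            scores[key] = float(value)
        except (TypeError, ValueError):
            continue  # skip values that are not numeric
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up metadata for illustration only; real values come from the model.
sample_metadata = {"en": "0.98", "fr": "0.01", "de": "0.005", "sentence": "0"}

for lang, prob in top_languages(sample_metadata, k=2):
    print(lang, prob)
```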



In [None]:
# The metadata column holds the probabilities for the other languages
results.select("lang.metadata").show(2, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------