# Use Python to identify the language of a text

With the `langdetect` library we can identify the language of a text in a single function call. Detection is generally reliable for longer texts and less so for very short ones.

If `langdetect` is not installed, we can install it with `!pip install langdetect`.

We specifically import the `detect` function from `langdetect`. We also import `pandas` to handle the dataset.

In [1]:
from langdetect import detect
import pandas as pd

The `detect` function takes a text string as its input and returns the detected language as a short language code (ISO 639-1, for example `'en'` for English).

In [13]:
detect('This text is in English')

'en'

In [14]:
detect('Dette er et andet sprog')

'no'

In order to apply the function to our data, we load the data into a DataFrame using the `read_csv` function from `pandas`.

In [36]:
df = pd.read_csv('data/cookies.csv')

Using the `apply` method, we can *apply* the function to each row of the DataFrame. `axis=1` tells pandas to pass each row to the function, rather than each column (the default, `axis=0`).

In [37]:
df['desc_lan'] = df.apply(lambda row: detect(row['description']), axis=1)
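
To make the role of `axis=1` concrete, here is a small sketch on a toy DataFrame (the data and the use of `len` in place of `detect` are assumptions for illustration). It shows that the row-wise call gives the same result as applying the function directly to the single column.

```python
import pandas as pd

# Toy stand-in for the cookie data
toy = pd.DataFrame({
    'name': ['cookie_a', 'cookie_b'],
    'description': ['Short text', 'A somewhat longer text'],
})

# axis=1: the function receives one row at a time (a Series indexed by column name)
by_row = toy.apply(lambda row: len(row['description']), axis=1)

# Equivalent when only one column is needed: apply the function to that column
by_column = toy['description'].apply(len)

print(by_row.tolist())
print(by_column.tolist())
```

When the function only needs one column, the second form is usually shorter and easier to read. The row-wise form is useful when the function needs values from several columns.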

We now have a new column in the DataFrame, which we can summarise with `value_counts`.

In [46]:
df.value_counts('desc_lan')

desc_lan
da    554
en    129
af     32
id     29
fr      2
nl      2
ca      1
pl      1
dtype: int64

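As expected, most of the descriptions are in Danish or English. However, we also get some suspicious labels such as Afrikaans (`af`), Indonesian (`id`) and Catalan (`ca`). If we inspect the texts, we find that they are most likely misclassified because the input is too short. The rows shown below, for instance, all contain only the single word "Afventer" (Danish for "awaiting").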

In [39]:
df[df['desc_lan'] == 'af']

| Unnamed: 0 | name | provider | origin | expiration | duration | type | description | policy_link | policy_text | desc_lan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | `fp_last_login_adjustment_attempt` | Jyllands-posten | jyllands-posten | Session | | HTTP | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 13 | `last_login_event_source` | Jyllands-posten | jyllands-posten | Session | | HTTP | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 14 | `last_login_event_time` | Jyllands-posten | jyllands-posten | Session | | HTTP | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 15 | `last_login_event_type` | Jyllands-posten | jyllands-posten | Session | | HTTP | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 29 | `_c_ [x3]` | Jyllands-posten | jyllands-posten | Session | | HTTP | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 33 | `keesing-last-created-storages` | Jyllands-posten | jyllands-posten | Persistent | | HTML | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 34 | `keesing-unique-userid` | Jyllands-posten | jyllands-posten | Persistent | | HTML | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 56 | `ssoid` | Jyllands-posten | jyllands-posten | Persistent | | HTML | Afventer | https://www.jyllands-posten.dk/om/cookies/ | tirsdag 11. oktober 2022 Log ind Log ud Køb Me... | af |
| 63 | `instapage-visit-#` | g.fastcdn.co | jyllands-posten | Persistent | | HTML | Afventer | | | af |
| 95 | `__tea_cache_tokens_#` | Tiktok | jyllands-posten | Persistent | | HTML | Afventer | https://www.tiktok.com/legal/privacy-policy?la... | TikTokSelect regionU.S.If you live in the Unit... | af |
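
One way to act on this finding is to treat labels on very short texts as unreliable. The following is a minimal sketch on a toy DataFrame. The toy rows, the column names `desc_len` and `desc_lan_checked`, and the cut-off `MIN_CHARS = 20` are assumptions for illustration and should be tuned by inspecting the real data.

```python
import pandas as pd

# Toy stand-in for the cookie data; the labels mimic the kind of output seen above
toy = pd.DataFrame({
    'description': ['Afventer', 'Bruges til at huske brugerens sprogvalg på siden'],
    'desc_lan': ['af', 'da'],
})

MIN_CHARS = 20  # assumed cut-off, not a recommendation from langdetect

# Length of each description in characters
toy['desc_len'] = toy['description'].str.len()

# Keep the label only where the text is long enough; other rows become NaN
toy['desc_lan_checked'] = toy['desc_lan'].where(toy['desc_len'] >= MIN_CHARS)

print(toy[['description', 'desc_len', 'desc_lan_checked']])
```

Rows with a missing label can then be left out of language-specific analysis or reviewed by hand.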


Similarly, we can detect the language of the cookie policy texts. As these are generally longer texts, the `detect` function should perform better. However, some policy texts are missing in the dataset, and we have to account for that when we apply the function. pandas stores missing values as `NaN`, which is a float and not a string, so we only call `detect` when the value is a string.

In [9]:
df['policy_lan'] = df.apply(lambda row: detect(row['policy_text']) if isinstance(row['policy_text'], str) else None, axis=1)

If we count the values again, the classifications look more plausible: the suspicious Afrikaans, Indonesian and Catalan labels no longer appear.

In [11]:
df.value_counts('policy_lan')

policy_lan
en    408
da    243
pl     25
de      2
fr      1
dtype: int64
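
Note that these counts sum to 679, while the `desc_lan` counts sum to 750. The difference is presumably the rows without a policy text, which `value_counts` leaves out by default. A small sketch of how to include them and to report shares instead of counts, using a toy column as an assumption in place of the real data:

```python
import pandas as pd

# Toy stand-in for the policy_lan column, with one missing value
toy = pd.DataFrame({'policy_lan': ['en', 'en', 'da', None]})

# dropna=False keeps rows without a detected language in the summary
counts = toy['policy_lan'].value_counts(dropna=False)

# normalize=True reports shares of all rows instead of absolute counts
shares = toy['policy_lan'].value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```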

Now we can subset our data based on language, which is very useful when we start analysing the text.

For instance, we can create a DataFrame with English policy texts and a DataFrame with Danish policy texts.

In [12]:
en_df = df[df['policy_lan'] == 'en']

da_df = df[df['policy_lan'] == 'da']

We now have two smaller and cleaner datasets which are easier to proceed with if we want to analyse the content of the policies.
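
If more than two languages are of interest, the same split can be done for all of them at once with `groupby`. A minimal sketch on toy data (the DataFrame and the name `frames` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the cookie data with a detected policy language
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'policy_lan': ['en', 'da', 'en', None],
})

# One DataFrame per detected language, keyed by language code.
# Rows with a missing language are dropped by groupby by default.
frames = {lang: group for lang, group in toy.groupby('policy_lan')}

print(sorted(frames))
print(frames['en'])
```

Here `frames['en']` and `frames['da']` correspond to the two subsets created above.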