# ``langdetect`` to filter non-Danish/German/Polish tweets

This notebook shows how we used the ``langdetect`` package, with the German dataset as the example. We ran the same procedure on all three datasets (Danish, German, and Polish).

In [None]:
import os
import pandas as pd
from langdetect import DetectorFactory, LangDetectException, detect

In [None]:
# set working directory
os.chdir(r'C:\Users\maril\Documents\20-21 KU\block 4\DM\twitter')

In [None]:
# load data; read tweet ids as strings so they are not converted to numbers
df = pd.read_csv(r'sanity_check\de_sanity_check.csv', dtype={'id': str})
df.head()

The ``langdetect`` package predicts the language of the text it is given. It is not fully reliable on tweets, because they are short and often written in informal language. We therefore use it only to flag candidate tweets, and then check every flagged tweet manually.
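Since short, noisy text is the main source of errors, one possible mitigation is to strip URLs, @mentions, and hash signs before passing a tweet to the detector. The sketch below is an assumption on our part and was not part of the procedure in this notebook; ``clean_tweet`` and the two regular expressions are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical helper, not part of the original notebook.
URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')

def clean_tweet(text):
    """Strip URLs and @mentions, drop '#' signs, and collapse whitespace."""
    text = URL_RE.sub(' ', text)       # remove links
    text = MENTION_RE.sub(' ', text)   # remove @usernames
    text = text.replace('#', ' ')      # keep the hashtag word, drop the sign
    return ' '.join(text.split())      # collapse runs of whitespace

print(clean_tweet('@anna Das ist gut! https://t.co/abc #Wetter'))
```

The cleaned string could then be passed to ``detect`` instead of the raw tweet. A tweet that consists only of links and mentions becomes an empty string, so it would still need to be checked by hand.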

In [None]:
# detect tweets that appear to be in a foreign language, for manual checking

# set the seed so that results are reproducible (langdetect is non-deterministic by default)
DetectorFactory.seed = 0

# ids of all tweets that are classified as non-German
nogerman = []

# if a tweet has no meaningful words to predict a language from, detect() raises a LangDetectException
# we collect the ids of these tweets so that we can check them manually
exceptions = []

# iterate through the ids and texts in the dataframe
for tweet_id, text in zip(df['id'], df['text']):
    try:
        # if the detected language is not German, add the tweet id to the 'nogerman' list
        # (change 'de' to 'da' for Danish or 'pl' for Polish)
        if detect(text) != 'de':
            nogerman.append(tweet_id)
    # if an exception is raised, add the tweet id to the 'exceptions' list
    except LangDetectException:
        exceptions.append(tweet_id)

In [None]:
# print the text of every tweet in the 'exceptions' list for manual inspection
for tweet_id in exceptions:
    t = df.loc[df['id'] == tweet_id, 'text']
    print(t.iloc[0])

In [None]:
# print the text of every tweet in the 'nogerman' list for manual inspection
for tweet_id in nogerman:
    t = df.loc[df['id'] == tweet_id, 'text']
    print(t.iloc[0])
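
Once the flagged tweets have been checked by hand, the confirmed foreign-language tweets can be dropped from the dataframe by id. The following is a minimal sketch on made-up data, not the exact step we ran; ``df_demo``, ``confirmed_foreign``, and ``df_filtered`` are hypothetical names, and the ids and texts are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the tweet dataframe; ids and texts are made up.
df_demo = pd.DataFrame({
    'id': ['1', '2', '3'],
    'text': ['Guten Morgen zusammen', 'good morning everyone', '!!!'],
})

# Hypothetical outcome of the manual check: ids confirmed as non-German.
confirmed_foreign = ['2', '3']

# Keep only the rows whose id is not in the confirmed list.
df_filtered = df_demo[~df_demo['id'].isin(confirmed_foreign)].reset_index(drop=True)
print(df_filtered)
```

Filtering on a manually confirmed list, rather than directly on ``nogerman``, keeps the tweets that the detector flagged by mistake.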