# Identify the most likely language for each tweet

In this analysis I explore the langid package in more detail in order to determine the most likely language of each tweet.

Besides the regular classification function, which returns only the single best-scoring language, the langid package provides a ranking function that returns a confidence score for each language it knows. This analysis explores what that second function makes possible.

In addition to langid, I use the Country column, which states where the tweet was posted, to select the most likely language from the candidate languages that the ranking function returns.

In this first iteration on the dataset, I use only the tweet text and the country where the tweet was posted.

### Import and load the data

In [1]:
import pandas as pd
import numpy as np
import re
import langid

pd.set_option('display.max_colwidth', -1)
pd.set_option("display.max_rows", None)
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load data
T = pd.read_csv("all_annotated.tsv", sep = "\t")

In [3]:
print(T.shape)
T.head()

(10502, 10)


Unnamed: 0,Tweet ID,Country,Date,Tweet,Definitely English,Ambiguous,Definitely Not English,Code-Switched,Ambiguous due to Named Entities,Automatically Generated Tweets
0,434215992731136000,TR,2014-02-14,Bugün bulusmami lazimdiii,0,0,1,0,0,0
1,285903159434563584,TR,2013-01-01,Volkan konak adami tribe sokar yemin ederim :D,0,0,1,0,0,0
2,285948076496142336,NL,2013-01-01,Bed,1,0,0,0,0,0
3,285965965118824448,US,2013-01-01,I felt my first flash of violence at some fool who bumped into me.... I pity the fool.,1,0,0,0,0,0
4,286057979831275520,US,2013-01-01,Ladies drink and get in free till 10:30,1,0,0,0,0,0


### Compare regular language identifier with all possible identified languages

In [4]:
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
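
With `norm_probs=True`, the identifier reports confidence scores between 0 and 1 instead of raw, unnormalised scores. As far as I understand, this normalisation amounts to a softmax over the per-language log scores. The sketch below illustrates that idea with the standard library only; the helper name `normalise` and the three scores are made up for illustration, and this is not langid's actual implementation.

```python
import math

def normalise(log_scores):
    """Turn unnormalised log scores into probabilities that sum to 1 (softmax)."""
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}

# Hypothetical log scores, not produced by langid
scores = {'en': -10.0, 'nl': -11.0, 'de': -14.0}
probs = normalise(scores)
```

The ordering of the languages is unchanged by the normalisation; only the scale of the scores becomes comparable across tweets.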

In [5]:
# The threshold parameter controls how many languages are stored:
# a language is kept only if its confidence score, rounded to `threshold` decimal places, is still greater than zero.
def language_confidence(text, threshold=6):
    language_conf = []
    for pair in identifier.rank(text):
        if round(pair[1], threshold) > 0.:
            language_conf.append(pair)    
    return language_conf

In [6]:
T['Tweet'][0:5].apply(identifier.classify)

0    (az, 0.856022267327)
1    (ms, 0.865002557468)
2    (en, 0.169461505959)
3    (en, 1.0)           
4    (en, 0.999613490658)
Name: Tweet, dtype: object

In [7]:
T['Tweet'][0:5].apply(lambda x: language_confidence(x, threshold=2))

0    [(az, 0.856022267327), (tr, 0.143677328034)]
1    [(ms, 0.865002557468), (id, 0.0997131200704), (tr, 0.0340997291609)]
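
The effect of `threshold` is easiest to see on a fixed ranking. The sketch below applies the same rounding rule to the three scores shown above for the second tweet, without calling langid; `filter_ranked` is a hypothetical helper name used only for this illustration.

```python
def filter_ranked(ranked, threshold=6):
    """Keep (language, score) pairs whose score rounds to more than zero."""
    return [(lang, score) for lang, score in ranked if round(score, threshold) > 0.]

# Scores copied from the output above for the second tweet
ranked = [('ms', 0.865002557468), ('id', 0.0997131200704), ('tr', 0.0340997291609)]

one_decimal = filter_ranked(ranked, threshold=1)   # 'tr' rounds to 0.0 and is dropped
two_decimals = filter_ranked(ranked, threshold=2)  # all three languages survive
```

A smaller `threshold` therefore gives a shorter candidate list, at the risk of dropping the correct language when its score is low.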

### Remove mentions and URLs

In [8]:
# remove URLs starting with http (the pattern is a regular expression)
http = r'http\S+'
T['Tweet_stripped'] = T['Tweet'].str.replace(http, ' ', regex=True)

# remove @-mentions: drop every space-separated token that starts with '@'
T['Tweet_stripped'] = T['Tweet_stripped'].apply(
    lambda x: ' '.join(word for word in x.split(' ') if not word.startswith('@')))
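
To check the cleaning rules on individual strings before applying them to the whole column, a plain-Python version can help. This is a sketch with a hypothetical helper `strip_tweet`; unlike the cell above it splits on any whitespace, so the runs of spaces left behind by removed URLs are collapsed. The first example tweet is invented, the second is taken from the dataset.

```python
import re

URL_PATTERN = re.compile(r'http\S+')

def strip_tweet(text):
    """Remove http(s) URLs and tokens that start with '@' from a tweet."""
    without_urls = URL_PATTERN.sub(' ', text)
    # split() with no argument also collapses repeated whitespace
    return ' '.join(word for word in without_urls.split() if not word.startswith('@'))

cleaned = strip_tweet("@user check http://t.co/abc now")
# A token such as '(@' does not start with '@', so it is left in place
untouched = strip_tweet("Shopping! (@ Kohl's)")
```

Tokens where the `@` is preceded by another character, such as the Foursquare-style `(@ Kohl's)`, are not treated as mentions by this rule.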

### Create column for identified languages and confidence scores

In [9]:
T['language_conf'] = T['Tweet_stripped'].apply(lambda x: language_confidence(x, threshold=2)) 

### Find the most likely language

In [10]:
# convert country ISO codes to lower case
T['Country'] = T['Country'].str.lower()

In [11]:
# return only the language codes that pass the threshold, without their confidence scores
def languages(text, threshold=6):
    return [lang for lang, conf in language_confidence(text, threshold)]

T['language_only'] = T['Tweet_stripped'].apply(lambda x: languages(x, threshold=2))

In [12]:
# write function that returns the most likely language(s)
def tweet_language(row):
    # the country code coincides with one of the candidate language codes (e.g. tr, nl)
    if row.Country in row.language_only:
        return row.Country
    # English-speaking countries; a comparison with the string 'us|gb|ie' would never match
    elif row.Country in ('us', 'gb', 'ie') and 'en' in row.language_only:
        return 'en'
    elif row.Country == 'mx' and 'es' in row.language_only:
        return 'es'
    elif row.Country == 'br' and 'pt' in row.language_only:
        return 'pt'
    elif row.Country == 'br' and 'es' in row.language_only:
        return 'es'
    else:
        # no country-based match: keep the full list of candidate languages
        return row.language_only

T['language'] = T.apply(tweet_language, axis=1)

In [13]:
T[['Country', 'language', 'Tweet_stripped']]

Unnamed: 0,Country,language,Tweet_stripped
0,tr,tr,Bugün bulusmami lazimdiii
1,tr,tr,Volkan konak adami tribe sokar yemin ederim :D
2,nl,nl,Bed
3,us,en,I felt my first flash of violence at some fool who bumped into me.... I pity the fool.
4,us,en,Ladies drink and get in free till 10:30
5,nl,"[id, ms, sl, jv, en, de]",ahhahahahah dm!
6,us,en,Fuck
7,gb,"[bs, hr, jv, ms]",Watching #Miranda On bbc1!!! u r HILARIOUS ❤💋
8,rs,"[en, de, es, fr, ja, zh, it, pt, sv, ru, nl, da, pl, cs, fi, hu, ro, sk, el, bg, lt, et, ar, mt, ko, sl, no, ca, lv, vi, uk, tr, gl]",fino
9,us,en,Shopping! (@ Kohl's)
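
The country rules can also be written as a lookup table, which makes it easier to add countries later. The sketch below is one possible way to do that, not the approach used in the cells above: `COUNTRY_LANGUAGES` and `pick_language` are hypothetical names, and the table contains only the countries mentioned in this notebook.

```python
# Hypothetical mapping from lower-cased country code to the languages
# expected there, in order of preference (illustrative, not exhaustive)
COUNTRY_LANGUAGES = {
    'us': ['en'], 'gb': ['en'], 'ie': ['en'],
    'mx': ['es'],
    'br': ['pt', 'es'],
    'tr': ['tr'],
    'nl': ['nl'],
}

def pick_language(country, candidates):
    """Return the first expected language found among the candidates,
    or the full candidate list when the country gives no match."""
    # Countries missing from the table fall back to "country code == language code"
    for lang in COUNTRY_LANGUAGES.get(country, [country]):
        if lang in candidates:
            return lang
    return candidates

english = pick_language('us', ['de', 'cs', 'sk', 'en'])
turkish = pick_language('tr', ['az', 'tr'])
unresolved = pick_language('gb', ['bs', 'hr', 'jv', 'ms'])
```

An explicit table also avoids accidental matches where a country code happens to coincide with the code of an unrelated language.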
