# LT2222 Machine learning for statistical NLP: introduction
# Assignment 1: Language Identification

**Language Identification** (also known as LangID, LID, LI, or Language Detection) is an NLP task whose goal is to correctly identify the language of a word or a passage. It is a type of **classification task.** It is useful in a variety of applications, e.g. returning relevant results in the same language in Information Retrieval tasks. It is especially important when the possible user base is multilingual; in some cases this applies even to governmental websites or applications, for instance in countries that have more than one official language. For a comprehensive survey of Language Identification, see [Jauhiainen et al. (2019)](https://www.proquest.com/scholarly-journals/automatic-language-identification-texts-survey/docview/2554056804/se-2?accountid=11162). As the authors write, simpler ML methods, such as **Support Vector Machines (SVMs),** can achieve very good performance in this task. In fact, many of the best submissions for various LI shared tasks have been SVM-based.

While Language Identification is a term that encompasses all the possible modalities (e.g. speech or sign language), in this assignment the focus will be on the LI of **textual data.** The general task for this assignment is to import the [CoLI-Kenglish dataset](https://sites.google.com/view/kanglishicon2022/dataset?authuser=0), a dataset containing predominantly tokens in English and Kannada (one of the languages spoken in India), inspect its structure, select the features for the model to take into account, and then train and evaluate an SVM model.

In this assignment, you will be provided with some pre-existing code and instructions for the missing parts. The assignment should therefore be completed in your copy of this notebook. It is possible to score 25 points in this assignment, with an additional 6 extra points.



### Part 1: Importing the dataset (5 points)




The first step for this assignment is downloading the [CoLI-Kenglish dataset's](https://sites.google.com/view/kanglishicon2022/dataset?authuser=0) train set and test set with labels. The *wget* commands below will download those two .csv files into your working directory as *kanglish-train.csv* and *kanglish-test.csv*.

In [2]:
#!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=15I5-evuUKgXjVfR1kFPWnhtfMXAjwVir' -O kanglish-train.csv
#!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ajTuulVO6uWH6izCLOOuefgI_GUjz0UH' -O kanglish-test.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.
'id' is not recognized as an internal or external command,
operable program or batch file.




Next, we need to import the libraries that are relevant for this assignment. Feel free to add more to this list if you discover that you need to use a different library.

In [23]:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Now that we have both our dataset files and the necessary libraries, it is time to import and inspect the data in the notebook.

One easy way to import a .csv file in Python is using the pandas library. This will result in our data now being stored in a DataFrame object. These are very handy for storing and manipulating the data.

**YOUR TASK:**


*   Import the test and train files using [pandas' *read_csv* function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
*   Display the first 10 lines of the training set using the [*.head()* method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
*   Read about [indexing DataFrames](https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/) and use it to select the tag column, and then return the unique tags in that column using the [*.unique()* method](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html). Store that under a new variable name
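The three steps above can be tried out on a tiny in-memory table before touching the real files. The sketch below is only an illustration: the four rows and the names `toy_csv`, `toy`, `first_rows` and `toy_labels` are made up for this example and are not part of the dataset or the assignment code.

```python
import io

import pandas as pd

# Hypothetical miniature of the dataset: same two columns, made-up rows.
toy_csv = io.StringIO("word,tag\nhostel,en\ngotila,kn\nnear,en\nirodhu,kn\n")
toy = pd.read_csv(toy_csv)

first_rows = toy.head(2)           # the first 2 rows, still a DataFrame
tag_column = toy["tag"]            # a single column, returned as a Series
toy_labels = toy["tag"].unique()   # distinct values, in order of first appearance

print(first_rows)
print(toy_labels)
```

Note that `.unique()` keeps the order in which the values first appear in the column rather than sorting them.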




In [24]:
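# read in the files as DataFrames
# note: the wget commands above save the files as kanglish-train.csv and kanglish-test.csv;
# adjust the paths below so that they match your local file names
kanglish_train = pd.read_csv("Train_Kanglish.csv")
kanglish_test = pd.read_csv("Test_withLabels_Kanglish.csv")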

In [25]:
# display the first 10 lines of the training set
print(kanglish_train.head(10))

        word tag
0    anusthu  kn
1       woww  en
2    staying  en
3       near  en
4     hostel  en
5  confirmed  en
6      faith  en
7     linked  en
8     gotila  kn
9    germany  en


In [26]:
labels =  kanglish_train.loc[:,"tag"].unique() #assigning unique labels (tags) to labels-variable

In [27]:
# display the labels
print(labels)

['kn' 'en' 'name' 'location' 'en-kn' 'other']


### Part 2: Feature selection (10 points)

Now that we have the data imported and we know what it looks like, it is time for us to select the features that our machine learning model should be looking at. Character-based features, such as co-occurring characters, character repetitions, or sequence length, are known to be informative for this task.

**YOUR TASK:**


*   Create a function that takes a word and returns a dictionary containing the following: word length (number of characters) and the last 2 letters of the word (e.g. for the word "tag" this dictionary should look somewhat like this: *{'len': 3, 'suffix': 'ag'}*, with the key names being up to you)
*   Iterate over the *word* column of the train set and test set to create two separate lists of features representing these words
*   Use sklearn's [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to turn the feature dictionaries into a machine learning model-readable version: with just numbers. The output should be stored as *X_train* and *X_test*. **Important!** Note that the training and test data have to be encoded the same way. Pay attention to the *fit*, *transform*, and *fit_transform* methods that the DictVectorizer has in order to first fit it to your training data, then transform the training data, and transform the test data using the same vectorizer
*   Store the *tag* column of the train and test sets as *y_train* and *y_test*.
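To make the difference between *fit_transform* and *transform* concrete, here is a minimal sketch on made-up feature dictionaries. The `toy_` names and the example values are assumptions for illustration only; the point is that the columns are learned from the training dictionaries and then reused unchanged for the test dictionaries.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up feature dictionaries in the same shape as the ones described above.
toy_train_feats = [{"len": 3, "suffix": "ag"}, {"len": 5, "suffix": "hu"}]
toy_test_feats = [{"len": 4, "suffix": "hu"}, {"len": 6, "suffix": "zz"}]  # 'zz' is unseen

toy_vectorizer = DictVectorizer()
# fit_transform learns the columns from the training data and encodes it.
toy_X_train = toy_vectorizer.fit_transform(toy_train_feats)
# transform reuses the learned columns; it does not learn new ones.
toy_X_test = toy_vectorizer.transform(toy_test_feats)

print(toy_vectorizer.feature_names_)
print(toy_X_test.toarray())
```

Numeric values such as the length are kept as they are, while each string value gets its own one-hot column. A suffix that never occurred in the training data has no column, so it is silently dropped from the test encoding.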



In [28]:
#returns a dictionary containing 1) word length 2) the last two letters of the word

def encode_features(word):
  word_dict = {}
  word_dict['len'] = len(word)
  word_dict['suffix'] = word[-2:]
  
  return word_dict


In [29]:
# encode all of the words in training and test sets, creating two lists of dictionaries

training_set_list_of_dicts = []
for word in kanglish_train.loc[:,"word"]:
    training_set_list_of_dicts.append(encode_features(word))

testing_set_list_of_dicts = []
for word in kanglish_test.loc[:,"word"]:
    testing_set_list_of_dicts.append(encode_features(word))

print(training_set_list_of_dicts)
print(testing_set_list_of_dicts)

[{'len': 7, 'suffix': 'hu'}, {'len': 4, 'suffix': 'ww'}, {'len': 7, 'suffix': 'ng'}, {'len': 4, 'suffix': 'ar'}, {'len': 6, 'suffix': 'el'}, {'len': 9, 'suffix': 'ed'}, {'len': 5, 'suffix': 'th'}, {'len': 6, 'suffix': 'ed'}, {'len': 6, 'suffix': 'la'}, {'len': 7, 'suffix': 'ny'}, {'len': 6, 'suffix': 'hu'}, {'len': 7, 'suffix': 'dh'}, {'len': 6, 'suffix': 'de'}, {'len': 7, 'suffix': 're'}, {'len': 10, 'suffix': 'on'}, {'len': 6, 'suffix': 'ne'}, {'len': 7, 'suffix': 'de'}, {'len': 7, 'suffix': 'de'}, {'len': 13, 'suffix': 'ke'}, {'len': 8, 'suffix': 'te'}, {'len': 10, 'suffix': 'hu'}, {'len': 8, 'suffix': 'de'}, {'len': 8, 'suffix': 'de'}, {'len': 5, 'suffix': 'al'}, {'len': 7, 'suffix': 'ed'}, {'len': 5, 'suffix': 'si'}, {'len': 5, 'suffix': 'ri'}, {'len': 7, 'suffix': 'la'}, {'len': 11, 'suffix': 'du'}, {'len': 4, 'suffix': 'om'}, {'len': 4, 'suffix': 'en'}, {'len': 8, 'suffix': 'er'}, {'len': 5, 'suffix': 'al'}, {'len': 5, 'suffix': 'an'}, {'len': 7, 'suffix': 'sh'}, {'len': 6, 'suf

In [30]:
# instantiate a vectorizer
vectorizer = DictVectorizer()

In [31]:
# attune the vectorizer to your data

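# no separate fit call is needed here: fit_transform in the next cell fits the
# vectorizer on the training data and encodes that data in one step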

In [32]:
# use the vectorizer on the training data and the test data
X_train =  vectorizer.fit_transform(training_set_list_of_dicts)
X_test = vectorizer.transform(testing_set_list_of_dicts)

In [33]:
# extract the 'tag' column (classes)
y_train = kanglish_train.loc[:,"tag"]
y_test = kanglish_test.loc[:,"tag"]

print(y_train)
print(y_test)

0           kn
1           en
2           en
3           en
4           en
         ...  
14842    en-kn
14843    en-kn
14844    en-kn
14845    en-kn
14846    en-kn
Name: tag, Length: 14847, dtype: object
0             kn
1             kn
2             kn
3          en-kn
4           name
          ...   
4580          kn
4581          kn
4582          kn
4583    location
4584          kn
Name: tag, Length: 4585, dtype: object


### Part 3: Training the model (4 points)


We now have our train and test sets encoded in a machine learning-friendly format, with our features (X) and our classes (y) separated. It is high time we train a machine learning model.

**YOUR TASK**:


*   Instantiate a [LinearSVC model](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)
*   Fit the model on *X_train* and *y_train*.
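As a rough illustration of these two steps, the sketch below fits a LinearSVC on four made-up points. The data and the `toy_` names are hypothetical, and `random_state=0` is set only to make the toy run repeatable; it is not something the assignment asks for.

```python
from sklearn.svm import LinearSVC

# Made-up, clearly separated 2-D points with string class labels.
toy_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_y = ["kn", "kn", "en", "en"]

toy_model = LinearSVC(random_state=0)
toy_model.fit(toy_X, toy_y)

toy_predictions = toy_model.predict([[0.0, 0.5], [5.0, 5.5]])

print(toy_model.classes_)   # the classes the model knows, stored in sorted order
print(toy_predictions)
```

The fitted model stores its classes in sorted order in `classes_`, which is independent of the order in which the labels appear in the training data.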



In [34]:
#instantiating a LinearSVC model

model = LinearSVC()

In [35]:
# fit the model on your data

model.fit(X_train, y_train)




### Part 4: Evaluating the model (6 points)


We have successfully trained a model - but now what? In order to know how successful it is, we should evaluate it using some measures.

**YOUR TASK:**


*   Use your model to predict the classes for *X_test*
*   Use [sklearn's evaluation measure functions](https://scikit-learn.org/0.15/modules/model_evaluation.html) to calculate the following measures: accuracy, and per-class precision, recall, and F1 for the predictions in comparison with the ground truth (*y_test*). Note that you will have to specify some parameters in order to get the per-class measures; print them in a way that makes it clear which score refers to which class.
*   Discuss the results. Do you think the model is performing well? What classes is the model having problems with?
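One detail worth checking when printing per-class scores is the order in which they are returned. The sketch below uses made-up gold labels and predictions (all `toy_` names and `class_order` are hypothetical) to show that, with `average=None`, scikit-learn returns the scores in sorted label order unless an explicit `labels` list is passed.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up gold labels and predictions.
toy_true = ["kn", "kn", "en", "en", "en"]
toy_pred = ["kn", "en", "en", "en", "kn"]

# Explicit class order; without it the scores come back in sorted order (en, kn).
class_order = ["kn", "en"]

toy_precision = precision_score(toy_true, toy_pred, labels=class_order, average=None)
toy_recall = recall_score(toy_true, toy_pred, labels=class_order, average=None)
toy_f1 = f1_score(toy_true, toy_pred, labels=class_order, average=None)

for label, p, r, f in zip(class_order, toy_precision, toy_recall, toy_f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# The same call without labels= returns the scores for 'en' first, then 'kn'.
default_order_precision = precision_score(toy_true, toy_pred, average=None)
print(default_order_precision)
```

Passing the same list to `labels=` and to the loop that prints the scores keeps each score next to the class it belongs to.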



In [36]:
y_pred = model.predict(X_test)
print(y_pred)

['kn' 'kn' 'kn' ... 'kn' 'kn' 'en']


In [38]:
# print out the evaluation using various sklearn functions

accuracy = accuracy_score(y_test, y_pred)
print(f"The accuracy is {accuracy}.")

print("* * * * *")

# average=None returns one score per class; labels=labels makes the scores come back
# in the same order as `labels` (by default they are returned in sorted label order)
for label, score in zip(labels, precision_score(y_test, y_pred, labels=labels, average=None)):
    print(f"The precision for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels, recall_score(y_test, y_pred, labels=labels, average=None)):
    print(f"The recall score for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels, f1_score(y_test, y_pred, labels=labels, average=None)):
    print(f"The f1-score for '{label}' is {score}")


The accuracy is 0.7463467829880044.
* * * * *
The precision for 'kn' is 0.7347091932457787
The precision for 'en' is 0.8315665488810365
The precision for 'name' is 0.5263157894736842
The precision for 'location' is 0.0
The precision for 'en-kn' is 0.16216216216216217
The precision for 'other' is 0.1111111111111111
* * * * *
The recall score for 'kn' is 0.8924339106654512
The recall score for 'en' is 0.7788196359624932
The recall score for 'name' is 0.0847457627118644
The recall score for 'location' is 0.0
The recall score for 'en-kn' is 0.12903225806451613
The recall score for 'other' is 0.1
* * * * *
The f1-score for 'kn' is 0.80592714550319
The f1-score for 'en' is 0.8043292509256622
The f1-score for 'name' is 0.145985401459854
The f1-score for 'location' is 0.0
The f1-score for 'en-kn' is 0.1437125748502994
The f1-score for 'other' is 0.10526315789473685


**DISCUSS** the model performance.

In [None]:
# The accuracy of the model (about 0.75) is OK, I would say. It does a good job of labeling Kannada and English words (F1 of about 0.80 for both 'kn' and 'en'), but it fails quite badly on the mixed 'en-kn' words (F1 of about 0.14).
# The gap in precision and recall between the classes is not ideal and probably indicates that revisiting the feature extraction or choosing another model might be a good idea.
# The remaining classes do not look great either: 'location' has an F1-score of 0, 'other' about 0.11, and 'name' about 0.15.

### Extra part 1: Feature selection (3 points)

The features we selected in Part 2 are not necessarily the best ones, so let us expand on the feature selection.

**YOUR TASK:**


*   Pick one more feature we could use and justify your choice
*   Expand upon the code from Part 2 to include that feature
*   Train and evaluate the model as above
*   Discuss whether the model's performance has improved

In [39]:
#Explanation and discussion below this code

def encode_features2(word):
  word_dict2 = {}
  word_dict2['len'] = len(word)
  word_dict2['suffix'] = word[-2:]

  vowels = ['a', 'i', 'e', 'o', 'u']   #added new feature (number of vowels)
  vowel_counter = 0   
  for letter in word:
    if letter in vowels:
      vowel_counter += 1
  word_dict2['vowelcount'] = vowel_counter
  
  return word_dict2

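# encode with encode_features2 (not encode_features) so that the vowel count is
# actually included; re-run this cell to refresh the scores recorded below
training_set_list_of_dicts2 = []
for word in kanglish_train.loc[:,"word"]:
    training_set_list_of_dicts2.append(encode_features2(word))

testing_set_list_of_dicts2 = []
for word in kanglish_test.loc[:,"word"]:
    testing_set_list_of_dicts2.append(encode_features2(word))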

X_train2 =  vectorizer.fit_transform(training_set_list_of_dicts2)
X_test2 = vectorizer.transform(testing_set_list_of_dicts2)
y_train2 = kanglish_train.loc[:,"tag"]
y_test2 = kanglish_test.loc[:,"tag"]
model.fit(X_train2, y_train2)
y_pred2 = model.predict(X_test2)

accuracy2 = accuracy_score(y_test2, y_pred2)
print(f"The accuracy is {accuracy2}.")

print("* * * * *")

# average=None returns one score per class; labels=labels makes the scores come back
# in the same order as `labels` (by default they are returned in sorted label order)
for label, score in zip(labels, precision_score(y_test2, y_pred2, labels=labels, average=None)):
    print(f"The precision for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels, recall_score(y_test2, y_pred2, labels=labels, average=None)):
    print(f"The recall score for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels, f1_score(y_test2, y_pred2, labels=labels, average=None)):
    print(f"The f1-score for '{label}' is {score}")




The accuracy is 0.7485278080697928.
* * * * *
The precision for 'kn' is 0.7343517138599106
The precision for 'en' is 0.8310771041789288
The precision for 'name' is 0.5263157894736842
The precision for 'location' is 0.0
The precision for 'en-kn' is 0.16363636363636364
The precision for 'other' is 0.11235955056179775
* * * * *
The recall score for 'kn' is 0.898359161349134
The recall score for 'en' is 0.7788196359624932
The recall score for 'name' is 0.0847457627118644
The recall score for 'location' is 0.0
The recall score for 'en-kn' is 0.0967741935483871
The recall score for 'other' is 0.1
* * * * *
The f1-score for 'kn' is 0.8081180811808116
The f1-score for 'en' is 0.8041002277904329
The f1-score for 'name' is 0.145985401459854
The f1-score for 'location' is 0.0
The f1-score for 'en-kn' is 0.12162162162162163
The f1-score for 'other' is 0.10582010582010581




**DISCUSS** the model performance

In [None]:
#Tried adding the number of vowels in a word as a new feature. I based this purely on a non-scientific observation that the Kannada words have plenty of vowels.
#This had practically no effect on the performance of the model: accuracy stayed at around 0.75 (0.7485, an improvement of only about 0.002), which suggests that this feature does little to improve the model.
#The other scores did not improve significantly either.

#However, I think a better feature could have been 'consonant clusters', since Kannada has many clusters that are rare or impossible in English (such as 'bh', 'ddh', 'dk').
#I also observed that some (not many, though) of the Kannada words in the training data are written in Kannada script (like ಜಯನಗರದ, ಬಹಳ), which could also have been a feature to consider(?)
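The consonant-cluster and script ideas mentioned in the discussion above could be turned into features roughly as follows. This is only a sketch under assumptions: `sketch_features` is a hypothetical name, character bigrams stand in for "clusters" (they also capture vowel pairs), and the script check simply tests for characters in the Kannada Unicode block (U+0C80 to U+0CFF). It has not been evaluated on the dataset.

```python
def sketch_features(word):
    """Hypothetical feature dictionary: length, suffix, character bigrams, script flag."""
    feats = {"len": len(word), "suffix": word[-2:]}

    # Mark every pair of adjacent characters that occurs in the word.
    for first, second in zip(word, word[1:]):
        feats["bigram=" + first + second] = 1

    # 1 if any character lies in the Kannada Unicode block, otherwise 0.
    feats["kannada_script"] = int(any("\u0c80" <= ch <= "\u0cff" for ch in word))

    return feats


print(sketch_features("near"))
print(sketch_features("ಬಹಳ")["kannada_script"])
```

Because the result is still a plain dictionary per word, it could be passed to a DictVectorizer in the same way as the dictionaries from Part 2.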

### Extra part 2: Excluding the non-language classes (3 points)

As you may have noted in part 1, the dataset contains some tags that represent languages (kn, en, en-kn) and some that correspond to Named Entity types and miscellaneous tokens (name, location, other). Since our task is to detect the language of a token, and the ground truth is not provided for the latter three classes in the same way as it is for the first three, let us try to exclude them.

**YOUR TASK:**


*   Use [Boolean indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing) to filter out the words with tags other than kn, en, and en-kn
*   Proceed with encoding the features and training as in the main part of the assignment
*   Evaluate the model using whichever measures you deem relevant. Discuss your choice and whether the model's performance has improved
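Boolean indexing with several allowed values can be written with a single mask. The sketch below uses a made-up DataFrame (`toy_df`, `language_tags` and the example words are assumptions for illustration) and the pandas `Series.isin` method.

```python
import pandas as pd

# Made-up rows covering the six tags seen in Part 1.
toy_df = pd.DataFrame(
    {
        "word": ["gotila", "near", "ravi", "mysore", "schoolalli", "haha"],
        "tag": ["kn", "en", "name", "location", "en-kn", "other"],
    }
)

language_tags = ["kn", "en", "en-kn"]

# One Boolean value per row: True where the tag is one of the language tags.
keep = toy_df["tag"].isin(language_tags)
toy_languages = toy_df[keep]

print(toy_languages)
```

Building the mask from the same DataFrame that is being indexed keeps the mask and the rows aligned, so the filter can be applied in one step.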



In [41]:
#Using boolean masking and the ~ operator to exclude 'name', 'location' and 'other'
#A single .isin() mask per DataFrame avoids filtering one tag at a time and
#avoids applying a mask built on the full DataFrame to an already filtered one

non_language_tags = ['name', 'location', 'other']

kanglish_train3 = kanglish_train[~kanglish_train['tag'].isin(non_language_tags)]
kanglish_test3 = kanglish_test[~kanglish_test['tag'].isin(non_language_tags)]

labels3 =  kanglish_train3.loc[:,"tag"].unique()
print(labels3)

training_set_list_of_dicts3 = []
for word in kanglish_train3.loc[:,"word"]:
    training_set_list_of_dicts3.append(encode_features2(word))

testing_set_list_of_dicts3 = []
for word in kanglish_test3.loc[:,"word"]:
    testing_set_list_of_dicts3.append(encode_features2(word))

X_train3 =  vectorizer.fit_transform(training_set_list_of_dicts3)
X_test3 = vectorizer.transform(testing_set_list_of_dicts3)
y_train3 = kanglish_train3.loc[:,"tag"]
y_test3 = kanglish_test3.loc[:,"tag"]
model.fit(X_train3, y_train3)
y_pred3 = model.predict(X_test3)

accuracy3 = accuracy_score(y_test3, y_pred3)
print(f"The accuracy is {accuracy3}.")

print("* * * * *")

# average=None returns one score per class; labels=labels3 makes the scores come back
# in the same order as `labels3` (by default they are returned in sorted label order)
for label, score in zip(labels3, precision_score(y_test3, y_pred3, labels=labels3, average=None)):
    print(f"The precision for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels3, recall_score(y_test3, y_pred3, labels=labels3, average=None)):
    print(f"The recall score for '{label}' is {score}")

print("* * * * *")

for label, score in zip(labels3, f1_score(y_test3, y_pred3, labels=labels3, average=None)):
    print(f"The f1-score for '{label}' is {score}")



['kn' 'en' 'en-kn']
The accuracy is 0.8397560975609756.
* * * * *
The precision for 'kn' is 0.8190045248868778
The precision for 'en' is 0.898876404494382
The precision for 'en-kn' is 0.1791044776119403
* * * * *
The recall score for 'kn' is 0.9074749316317229
The recall score for 'en' is 0.7942636514065086
The recall score for 'en-kn' is 0.12903225806451613
* * * * *
The f1-score for 'kn' is 0.860972972972973
The f1-score for 'en' is 0.843338213762811
The f1-score for 'en-kn' is 0.15




**DISCUSS** the model performance and your choice of measures

In [None]:
#Looks like filtering out the non-language classes improved our model's accuracy nicely! It went up by about 9 percentage points, from 0.75 to 0.84.
#However, the precision and recall scores for the 'en-kn' class are still very low compared to those of the 'kn' and 'en' classes.
#Considering other models and/or spending more time on detailed feature extraction could still improve the performance.