# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Receiving objects: 100% (71/71), 11.28 MiB | 15.12 MiB/s, done.
Resolving deltas: 100% (29/29), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [2]:

!ls nlp-course/lm-languages-data-new

en.csv	 es.csv   fr.csv   in.csv   it.csv   nl.csv   pt.csv   test.csv   tests.csv   tl.csv
en.json  es.json  fr.json  in.json  it.json  nl.json  pt.json  test.json  tests.json  tl.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
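
As a minimal sketch of this token definition, assuming the tweets have already been loaded into a list of Python strings (the helper name `build_char_vocab` and the sample tweets are hypothetical, not part of the assignment):

```python
def build_char_vocab(texts):
    """Return a sorted list of every distinct character found in `texts`."""
    vocabulary = set()
    for text in texts:
        # Iterating over a Python str yields one Unicode character at a time
        vocabulary.update(text)
    return sorted(vocabulary)

# Hypothetical sample tweets, for illustration only
sample_tweets = ["hola", "olá!"]
sample_vocab = build_char_vocab(sample_tweets)
print(sample_vocab)
```

Using a set while scanning keeps each character once; sorting at the end only makes the resulting list deterministic.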

In [3]:
import os
import json

def preprocess(directory):
    # Initialize an empty set to store unique characters
    vocabulary = set()

    # Iterate over all files in the specified directory
    for filename in os.listdir(directory):
        # Check if the file is a JSON file
        if filename.endswith('.json'):
            # Construct the full path to the file
            filepath = os.path.join(directory, filename)

            # Open and read the JSON file
            with open(filepath, 'r', encoding='utf-8') as file:
                try:
                    # Load JSON data
                    data = json.load(file)

                    # Assuming the data is either a list or a dictionary that needs to be converted to string
                    # If the structure of JSON is known and different, adjust the logic here accordingly
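                    if isinstance(data, dict):
                        # Note: json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
                        # so they enter the vocabulary as ASCII escape sequences such as \u00e9
                        data = json.dumps(data)  # Convert dict to string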
                    elif isinstance(data, list):
                        data = " ".join(data)  # Convert list to a single string

                    # Add each letter in the JSON file to the set
                    for char in data:
                        vocabulary.add(char)

                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON from file {filename}: {e}")
                except Exception as e:
                    print(f"Error processing file {filename}: {e}")

    # Convert the set to a sorted list
    vocabulary_list = sorted(vocabulary)
    return vocabulary_list

# Example usage
directory = 'nlp-course/lm-languages-data-new'
vocab = preprocess(directory)
print(vocab)


[' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']


**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) whose keys are all the relevant sequences of n-1 tokens and whose values are dictionaries mapping each possible n-th token to its probability of occurring. For example, a trigram model (tokens are characters) should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
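
To make the dictionary layout concrete, the sketch below looks up a next-character probability and checks that each inner dictionary is a probability distribution. The helper `next_char_prob` is hypothetical; the model is the example dictionary above.

```python
import math

example_model = {
    "ab": {"c": 0.5, "b": 0.25, "d": 0.25},
    "ca": {"a": 0.2, "b": 0.7, "d": 0.1},
}

def next_char_prob(model, prefix, char):
    """Probability of `char` following `prefix`; 0.0 if the pair was never seen."""
    return model.get(prefix, {}).get(char, 0.0)

p_c_after_ab = next_char_prob(example_model, "ab", "c")
p_z_after_ab = next_char_prob(example_model, "ab", "z")

# Every inner dictionary should sum to 1 (up to floating-point error)
sums_to_one = all(
    math.isclose(sum(dist.values()), 1.0) for dist in example_model.values()
)
```

Returning 0.0 for unseen pairs is only one possible convention; it is exactly the case that smoothing is meant to address.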

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
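
One possible way to think about add-one (Laplace) smoothing, sketched for a single prefix: every character in the vocabulary receives one extra count, so the denominator grows by the vocabulary size. The function name `add_one_probs` and the toy counts are assumptions for illustration, and the sketch assumes every counted character is in the vocabulary.

```python
from collections import Counter

def add_one_probs(counts, vocabulary):
    """Add-one smoothed distribution over `vocabulary` for a single prefix."""
    total = sum(counts.values()) + len(vocabulary)
    return {char: (counts.get(char, 0) + 1) / total for char in vocabulary}

toy_vocab = ["a", "b", "c", "d"]
toy_counts = Counter({"c": 2, "b": 1, "d": 1})  # counts observed after one prefix

smoothed = add_one_probs(toy_counts, toy_vocab)
print(smoothed)
```

Here the unseen character "a" ends up with probability 1/8 instead of 0, and the smoothed values still sum to 1.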

In [4]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
    import collections

    # Initialize dictionaries to hold count of n-grams and (n-1)-grams
    ngram_counts = collections.defaultdict(collections.Counter)
    prefix_counts = collections.defaultdict(int)

    # Read the text data from the file
    with open(data_file_path, 'r', encoding='utf-8') as file:
        # Process the file line by line as raw text (the JSON is not parsed here)
        for line in file:
            # Add padding for the start of the line
            line = (' ' * (n-1)) + line.strip() + ' '  # add trailing space as a stop symbol
            # Generate n-grams and (n-1)-grams
            for i in range(len(line) - n + 1):
                prefix = line[i:i+n-1]
                ngram = line[i:i+n]
                ngram_counts[prefix][ngram[-1]] += 1
                prefix_counts[prefix] += 1

    # Build the model dictionary with probabilities
    model = {}
    for prefix, counts in ngram_counts.items():
        if add_one:
            # If add-one smoothing is enabled
            total = prefix_counts[prefix] + len(vocabulary)
            model[prefix] = {char: (count + 1) / total for char, count in counts.items()}
            # Adding unseen characters from the vocabulary with probability 1/total
            for char in vocabulary:
                if char not in model[prefix]:
                    model[prefix][char] = 1 / total
        else:
            # No smoothing
            total = prefix_counts[prefix]
            model[prefix] = {char: count / total for char, count in counts.items()}

    return model

# Example usage
n = 3  # Trigram model
vocabulary = vocab
data_file_path = 'nlp-course/lm-languages-data-new/fr.json'
add_one = True  # Enable add-one smoothing

# Generate the language model
language_model = lm(n, vocabulary, data_file_path, add_one)

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
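
For reference, the perplexity over N predicted tokens is exp(-(1/N) Σ log p_i), where p_i is the probability the model assigns to the i-th token. A minimal sketch, assuming the per-token probabilities have already been looked up in the model (the name `perplexity_from_probs` is hypothetical):

```python
import math

def perplexity_from_probs(probs):
    """Perplexity = exp of the negative mean log-probability."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

# A model that always assigns probability 0.25 is as uncertain as a
# uniform choice among 4 characters, so its perplexity is (about) 4.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long texts.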

In [5]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
    import math

    # Open the data file
    with open(data_file, 'r', encoding='utf-8') as file:
        log_probability = 0
        total_tokens = 0

        for line in file:
            # Preprocess the line to include start and end padding similar to how the model was trained
            line = (' ' * (n - 1)) + line.strip() + ' '

            # Calculate probabilities for each n-gram in the line
            for i in range(len(line) - n + 1):
                prefix = line[i:i+n-1]
                target = line[i+n-1]

                # Retrieve the probability of the target given the prefix from the model
                if prefix in model and target in model[prefix]:
                    log_probability += math.log(model[prefix][target])
                else:
                    # Handle the case where the prefix+target combination is not in the model
                    # Here we might assume a very small probability since the combination is unseen
                    # This is a simplification and you may want to handle it differently
                    log_probability += math.log(1e-10)  # Using a very small probability

                total_tokens += 1

    # Calculate perplexity
    perplexity = math.exp(-log_probability / total_tokens)
    return perplexity

# Example usage
n = 3  # For a trigram model
model = language_model
data_file = 'nlp-course/lm-languages-data-new/en.json'

# Calculate perplexity
perplexity_score = eval(n, model, data_file)
print(f"Perplexity: {perplexity_score}")


Perplexity: 14.873785457603114


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity values.

In [6]:
import os
import json
import pandas as pd


def match(n, add_one):
# n - the n-gram to use for creating n-gram models
# add_one - use add_one smoothing or not
    languages = ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
    perplexity_results = {}

    # Create language models for each language
    for lang in languages:
        data_file_path = f'nlp-course/lm-languages-data-new/{lang}.json'
        vocabulary = vocab
        language_model = lm(n, vocabulary, data_file_path, add_one)
        perplexity_results[lang] = []
        # Calculate perplexity for each language model applied to all data files
        for lang2 in languages:
            data_file = f'nlp-course/lm-languages-data-new/{lang2}.json'
            perplexity = eval(n, language_model, data_file)
            perplexity_results[lang].append(perplexity)

    # Organize results into a DataFrame:
    # each column is a model language, each row is the language of the data file it was applied to
    df = pd.DataFrame(perplexity_results, index=languages)
    return df

# Example usage
n = 3
add_one = True
perplexity_df = match(n, add_one)
perplexity_df


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 8.686375 | 17.171329 | 14.873785 | 14.833869 | 16.783752 | 14.011432 | 17.844941 | 14.268463 |
| es | 16.293154 | 8.254378 | 13.972432 | 16.648642 | 12.450591 | 16.401253 | 11.411747 | 14.454439 |
| fr | 15.176171 | 15.052031 | 8.086993 | 16.817121 | 14.831322 | 14.03626 | 15.920432 | 18.18619 |
| in | 17.420222 | 19.161696 | 16.908573 | 9.49853 | 19.485017 | 16.310158 | 20.190096 | 15.37172 |
| it | 16.52659 | 13.845702 | 15.084137 | 16.840061 | 8.713761 | 17.374312 | 14.771806 | 14.688347 |
| nl | 15.867946 | 19.153604 | 16.098658 | 15.948623 | 19.761119 | 9.247979 | 20.60842 | 17.918582 |
| pt | 16.892216 | 11.357509 | 14.622268 | 17.195326 | 12.53924 | 16.798257 | 7.897387 | 14.66963 |
| tl | 16.108458 | 16.339211 | 17.614316 | 14.024209 | 15.019211 | 16.321132 | 16.384947 | 8.557587 |


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [7]:
from IPython.display import display
import pandas as pd

# Assuming the 'match' function is already defined and correctly set up

results = []
for n in range(1, 5):  # n values from 1 to 4
    for smoothing in (True, False):  # With and without smoothing
        print(f"Running match for n={n}, add_one={smoothing}")
        perplexity_df = match(n, smoothing)
        results.append((n, smoothing, perplexity_df))

        # Display the DataFrame clearly
        display(perplexity_df.style.set_caption(f"Perplexity for n={n} with add_one={smoothing}"))
        print("\n")  # Add a newline for better spacing between tables

# Optionally, you can store these DataFrames in a dictionary or list for later use or analysis.


Running match for n=1, add_one=True


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 44.913774 | 47.749302 | 47.810771 | 48.297959 | 47.328697 | 46.925865 | 48.405195 | 47.77622 |
| es | 43.730705 | 41.557604 | 42.619202 | 45.298847 | 42.583547 | 44.238783 | 42.158546 | 44.961804 |
| fr | 44.230433 | 43.053705 | 41.72947 | 46.303093 | 43.371956 | 44.114269 | 43.68672 | 46.879007 |
| in | 46.814516 | 48.217867 | 48.725841 | 43.369426 | 47.886137 | 46.333671 | 47.822347 | 44.613255 |
| it | 44.947387 | 44.136638 | 44.662605 | 46.705154 | 43.036474 | 45.326571 | 44.557905 | 46.072525 |
| nl | 46.129808 | 47.459076 | 46.771084 | 47.273204 | 46.978823 | 44.110985 | 47.747456 | 48.165704 |
| pt | 43.752245 | 41.576061 | 42.707314 | 44.832334 | 42.373613 | 44.210645 | 41.004821 | 44.300062 |
| tl | 48.787659 | 50.052098 | 51.812878 | 47.185358 | 49.557798 | 49.874221 | 49.937014 | 45.82482 |




Running match for n=1, add_one=False


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 44.913756 | 47.751784 | 47.812567 | 48.298667 | 47.329994 | 46.926399 | 48.409758 | 47.776681 |
| es | 43.730507 | 41.557582 | 42.619773 | 45.298825 | 42.583544 | 44.238653 | 42.159462 | 44.961707 |
| fr | 44.230479 | 43.055271 | 41.729426 | 46.303723 | 43.372326 | 44.114417 | 43.689075 | 46.879541 |
| in | 46.815225 | 48.220873 | 48.729085 | 43.369414 | 47.888392 | 46.334129 | 47.827377 | 44.613746 |
| it | 44.947232 | 44.137272 | 44.66305 | 46.70536 | 43.036472 | 45.326491 | 44.559297 | 46.072506 |
| nl | 46.130244 | 47.461142 | 46.773012 | 47.273598 | 46.980178 | 44.110969 | 47.751212 | 48.166327 |
| pt | 43.75199 | 41.576048 | 42.707292 | 44.832168 | 42.373501 | 44.210382 | 41.004761 | 44.299782 |
| tl | 48.787965 | 50.054015 | 51.814745 | 47.185691 | 49.559206 | 49.8748 | 49.939852 | 45.824792 |




Running match for n=2, add_one=True


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 15.184921 | 21.079259 | 19.553253 | 19.569761 | 20.652947 | 18.703873 | 21.383025 | 18.874753 |
| es | 19.055744 | 13.999201 | 17.14462 | 19.582558 | 16.435927 | 19.875269 | 15.757241 | 18.471417 |
| fr | 20.079133 | 18.431445 | 14.551399 | 21.279743 | 18.812287 | 19.599815 | 18.503461 | 21.478121 |
| in | 19.624595 | 21.352905 | 20.533057 | 15.089155 | 20.872665 | 19.384003 | 21.980381 | 17.709672 |
| it | 19.168853 | 17.082039 | 18.258019 | 19.141274 | 14.594167 | 20.142009 | 17.57047 | 18.275834 |
| nl | 19.020785 | 21.456031 | 19.721476 | 19.282776 | 21.361681 | 15.071147 | 22.18754 | 20.554144 |
| pt | 20.137962 | 16.097614 | 17.641477 | 20.490693 | 16.989418 | 20.513851 | 13.951305 | 19.17618 |
| tl | 18.793508 | 19.901822 | 20.839133 | 17.231904 | 19.279677 | 19.762221 | 20.256163 | 14.544468 |




Running match for n=2, add_one=False


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 15.151921 | 21.409403 | 19.819029 | 19.740541 | 20.890401 | 18.828211 | 21.931186 | 19.06977 |
| es | 19.259996 | 13.968621 | 17.24634 | 19.683377 | 16.534176 | 19.99026 | 15.950108 | 18.606818 |
| fr | 20.386437 | 19.10197 | 14.520849 | 22.095104 | 19.196564 | 19.854856 | 19.461732 | 22.646376 |
| in | 19.900636 | 21.768877 | 20.881683 | 15.052627 | 21.276076 | 19.582796 | 22.562046 | 17.90554 |
| it | 19.412401 | 17.28124 | 18.40894 | 19.424252 | 14.559984 | 20.312078 | 17.963773 | 18.54206 |
| nl | 19.230696 | 21.885897 | 20.040796 | 19.517878 | 21.726074 | 15.038256 | 22.796087 | 20.888838 |
| pt | 20.274243 | 16.178917 | 17.758256 | 20.605879 | 17.086576 | 20.622501 | 13.914352 | 19.272831 |
| tl | 19.013015 | 20.199616 | 21.155149 | 17.342006 | 19.563453 | 19.932295 | 20.736295 | 14.507238 |




Running match for n=3, add_one=True


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 8.686375 | 17.171329 | 14.873785 | 14.833869 | 16.783752 | 14.011432 | 17.844941 | 14.268463 |
| es | 16.293154 | 8.254378 | 13.972432 | 16.648642 | 12.450591 | 16.401253 | 11.411747 | 14.454439 |
| fr | 15.176171 | 15.052031 | 8.086993 | 16.817121 | 14.831322 | 14.03626 | 15.920432 | 18.18619 |
| in | 17.420222 | 19.161696 | 16.908573 | 9.49853 | 19.485017 | 16.310158 | 20.190096 | 15.37172 |
| it | 16.52659 | 13.845702 | 15.084137 | 16.840061 | 8.713761 | 17.374312 | 14.771806 | 14.688347 |
| nl | 15.867946 | 19.153604 | 16.098658 | 15.948623 | 19.761119 | 9.247979 | 20.60842 | 17.918582 |
| pt | 16.892216 | 11.357509 | 14.622268 | 17.195326 | 12.53924 | 16.798257 | 7.897387 | 14.66963 |
| tl | 16.108458 | 16.339211 | 17.614316 | 14.024209 | 15.019211 | 16.321132 | 16.384947 | 8.557587 |




Running match for n=3, add_one=False


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 7.620488 | 56.303023 | 37.579557 | 37.32347 | 53.183904 | 32.493169 | 68.050394 | 38.836406 |
| es | 42.743886 | 7.303707 | 35.678155 | 48.064202 | 27.940227 | 43.661069 | 27.302277 | 34.878687 |
| fr | 50.574632 | 50.632157 | 7.164783 | 61.3032 | 48.144704 | 38.790958 | 62.982848 | 77.944654 |
| in | 51.760613 | 84.235944 | 53.731605 | 8.264972 | 89.975002 | 46.609948 | 107.320328 | 44.89577 |
| it | 42.824626 | 34.1815 | 36.513908 | 48.10335 | 7.657545 | 43.528788 | 42.730331 | 36.640865 |
| nl | 57.149752 | 101.31587 | 59.732935 | 54.095447 | 101.561768 | 8.076974 | 124.871862 | 80.387358 |
| pt | 44.419448 | 25.359895 | 35.075897 | 48.841731 | 28.365149 | 43.708966 | 6.911732 | 34.367224 |
| tl | 46.130429 | 51.250621 | 55.99681 | 38.610307 | 45.415965 | 46.706783 | 59.121675 | 7.418788 |




Running match for n=4, add_one=True


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 8.381474 | 92.617912 | 57.947174 | 57.943593 | 85.037547 | 49.847732 | 114.639833 | 57.969999 |
| es | 70.701788 | 8.131448 | 56.442589 | 81.6693 | 40.210249 | 71.723916 | 38.929464 | 52.730688 |
| fr | 79.86708 | 79.881004 | 7.793051 | 94.904976 | 76.620889 | 58.589556 | 103.37976 | 128.205171 |
| in | 86.664121 | 145.519819 | 84.909069 | 9.611182 | 153.490897 | 74.067515 | 187.716958 | 74.430434 |
| it | 70.174022 | 52.937778 | 58.456825 | 81.50116 | 8.50258 | 69.071106 | 69.502857 | 56.546374 |
| nl | 91.499017 | 165.599216 | 91.93408 | 85.379633 | 162.228838 | 9.163481 | 210.06327 | 133.70451 |
| pt | 74.072725 | 35.811398 | 56.667478 | 82.384317 | 41.034334 | 73.286716 | 7.827423 | 51.823527 |
| tl | 73.006809 | 81.942488 | 90.760856 | 61.813498 | 69.080833 | 75.19958 | 96.967296 | 8.489168 |




Running match for n=4, add_one=False


| data file / model | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 4.407223 | 1238.671027 | 546.234566 | 432.015598 | 972.015882 | 328.381099 | 1824.16389 | 289.700835 |
| es | 916.825014 | 4.592071 | 507.168675 | 1296.316153 | 204.056986 | 1078.822188 | 199.662157 | 411.673838 |
| fr | 851.50166 | 834.862139 | 4.362058 | 1298.911183 | 793.898735 | 570.408379 | 1385.450755 | 1700.846572 |
| in | 2928.744492 | 5167.926168 | 2665.912831 | 4.954239 | 5222.648964 | 1933.482418 | 8702.461342 | 793.122854 |
| it | 1085.514868 | 393.423314 | 634.913951 | 1218.983042 | 4.704696 | 1385.202549 | 679.318775 | 503.775789 |
| nl | 1535.754314 | 4809.818957 | 1947.82815 | 1718.002403 | 5444.210377 | 4.678423 | 8961.50119 | 2897.97524 |
| pt | 1343.03079 | 177.289655 | 687.778723 | 1480.072651 | 244.339478 | 1387.989158 | 4.393131 | 478.710349 |
| tl | 1050.214455 | 1198.617132 | 1824.732261 | 432.353666 | 908.720489 | 1352.749659 | 1597.418491 | 4.461753 |






**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g., returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
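
One simple non-trivial approach is to score the sentence with every language model and pick the language with the lowest score (for example, the lowest perplexity). The sketch below shows only this selection step, assuming the scoring function is supplied by the caller; `pick_language`, `toy_score`, and the toy models are hypothetical stand-ins, not real language models.

```python
def pick_language(sentence, models, score_fn):
    """Return the language whose model gives `sentence` the lowest score."""
    scores = {lang: score_fn(model, sentence) for lang, model in models.items()}
    return min(scores, key=scores.get)

# Toy stand-ins: each "model" is just a set of characters it considers typical,
# and the score counts the characters of the sentence that fall outside that set.
toy_models = {"en": set("the "), "es": set("el ñ")}

def toy_score(model, sentence):
    return sum(1 for ch in sentence if ch not in model)

predicted = pick_language("el ñ", toy_models, toy_score)
```

Passing the scoring function as a parameter keeps the selection logic independent of how perplexity is computed.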

In [8]:
def save_language_models(n, add_one, languages, vocab, data_directory):
    language_models = {}
    for lang in languages:
        data_file_path = f'{data_directory}/{lang}.json'
        language_models[lang] = lm(n, vocab, data_file_path, add_one)
    return language_models

# Example usage
n = 4
add_one = False
languages = ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
data_directory = 'nlp-course/lm-languages-data-new'

language_models = save_language_models(n, add_one, languages, vocab, data_directory)

# Now, you can access each language model using its language code
english_model = language_models['en']
spanish_model = language_models['es']
indonesian_model = language_models['in']  # 'in' is the code used here for Indonesian
italian_model = language_models['it']
dutch_model = language_models['nl']
portuguese_model = language_models['pt']
french_model = language_models['fr']
tagalog_model = language_models['tl']

In [9]:
import csv

# Open the CSV file
with open('nlp-course/lm-languages-data-new/test.csv', newline='', encoding='utf-8') as csvfile:
    # Create a CSV reader object
    csv_reader = csv.reader(csvfile)

    # Iterate over each row and print it
    for row in csv_reader:
        print(row)

Streaming output truncated to the last 5000 lines.
['836257315825283074', 'RT @gabrielasaiia: bicho eu sou mestre na arte de "gosto de saber o que a gente tem" mas nunca esclarecer o que tenho com a pessoa com medo…', 'pt']
['836309409080946692', 'ah, dimenticavo, Jamie Dornan è tanta roba🙌🏼😻', 'it']
['836314492585684995', "Manchester City Pantau 'The Next Marco Verratti' https://t.co/u00ptBNjQn", 'it']
['847726526887903233', '@flowertje74 Is dat zo?  Is al van lang geleden toen ik nog getrouwd was. Staring at the sea. Gaat even niet zo goe… https://t.co/fCkEtdq1Kb', 'nl']
['836493694220005377', 'RT @opedrocaruso: Se você tem amor a vida esteja acordado hoje entre quatro e cinco da manhã pra ver o segundo maior espetáculo da terra. #…', 'pt']
['836305772610994176', 'RT @Flackoshit_: @abrantes_gil foste no bugio?', 'it']
['836464006923575296', 'RT @indeNiaLLady: Sa Pasig pala yung place ng party (hoy saan sa Pasig, dito yan samin😭) Taga Batangas yun may birthday 😅✌… ', 'tl

In [12]:
import csv

def classify_language(language_models, n, file_path):
    results = []
    with open(file_path, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            sentence = row[1]
            true_lang = row[2]
            # Initialize variables to track the best fitting language
            min_perplexity = float('inf')
            best_language = None

            # Compute perplexity for each language model (eval2 is defined in cell In [11])
            for language, model in language_models.items():
                perplexity = eval2(n, model, sentence)
                if perplexity < min_perplexity:
                    min_perplexity = perplexity
                    best_language = language

            # Store the result as a tuple of the sentence and its predicted language
            results.append((sentence, best_language, true_lang))

    return results

# Assuming the test file path is as follows
test_file_path = 'nlp-course/lm-languages-data-new/test.csv'
# Use the previously created language_models and n value
results = classify_language(language_models, n, test_file_path)

# Print or otherwise process the results as needed
for result in results:
    print(result)


Streaming output truncated to the last 5000 lines.
('RT @gabrielasaiia: bicho eu sou mestre na arte de "gosto de saber o que a gente tem" mas nunca esclarecer o que tenho com a pessoa com medo…', 'pt', 'pt')
('ah, dimenticavo, Jamie Dornan è tanta roba🙌🏼😻', 'it', 'it')
("Manchester City Pantau 'The Next Marco Verratti' https://t.co/u00ptBNjQn", 'in', 'it')
('@flowertje74 Is dat zo?  Is al van lang geleden toen ik nog getrouwd was. Staring at the sea. Gaat even niet zo goe… https://t.co/fCkEtdq1Kb', 'nl', 'nl')
('RT @opedrocaruso: Se você tem amor a vida esteja acordado hoje entre quatro e cinco da manhã pra ver o segundo maior espetáculo da terra. #…', 'pt', 'pt')
('RT @Flackoshit_: @abrantes_gil foste no bugio?', 'pt', 'it')
('RT @indeNiaLLady: Sa Pasig pala yung place ng party (hoy saan sa Pasig, dito yan samin😭) Taga Batangas yun may birthday 😅✌… ', 'tl', 'tl')
('E essa câimbra no meu pé', 'pt', 'pt')
('Schuttershof 28, 4421 GP, Kapelle: Nieuw Vraagprijs: € 182.000 Woo

In [11]:
def eval2(n, model, sentence):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # sentence - the string sentence that you wish to calculate a perplexity score for
    import math

    log_probability = 0
    total_tokens = 0

    # Preprocess the sentence to include start and end padding similar to how the model was trained
    sentence = (' ' * (n - 1)) + sentence.strip() + ' '

    # Calculate probabilities for each n-gram in the sentence
    for i in range(len(sentence) - n + 1):
        prefix = sentence[i:i+n-1]
        target = sentence[i+n-1]

        # Retrieve the probability of the target given the prefix from the model
        if prefix in model and target in model[prefix]:
            log_probability += math.log(model[prefix][target])
        else:
            # Handle the case where the prefix+target combination is not in the model
            # Here we might assume a very small probability since the combination is unseen
            # This is a simplification and you may want to handle it differently
            log_probability += math.log(1e-10)  # Using a very small probability

        total_tokens += 1

    # Calculate perplexity
    perplexity = math.exp(-log_probability / total_tokens)
    return perplexity



**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
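
As a hedged sketch of what the F1 computation involves, the pure-Python helper below computes a per-label F1 and the unweighted (macro) average from two parallel label lists. The name `macro_f1` and the toy labels are assumptions for illustration; the scikit-learn function from the hint could be used instead.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Return (per-label F1 dict, macro-averaged F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1
            fn[true] += 1
    per_label = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0
        per_label[label] = 2 * tp[label] / denom if denom else 0.0
    return per_label, sum(per_label.values()) / len(per_label)

toy_true = ["en", "en", "es", "es"]
toy_pred = ["en", "es", "es", "es"]
per_label_f1, macro = macro_f1(toy_true, toy_pred)
```

Macro-averaging weights every language equally regardless of how many test sentences it has; whether that is the intended aggregate is an assumption here.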


In [13]:
print(results[0:5])

[('tweet_text', 'en', 'label'), ('RT @jarsofshine: In 08 I had a volunteer who had to sell his home + car to pay for heart surgery. He took 3 buses to work @ Obama office. @…', 'en', 'en'), ('IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa 1$ al di x chi lavora a salario chi sono i disperati che arrivano?NON CERTO I VERI DISPERATI.', 'it', 'it'), ('@jaynaldmase @acobasilianne @dingDANGdantes @dadaadustin @caesartorre @altesersss Basang mani yan.', 'tl', 'tl'), ('Daags voor @RondeVlaanderen, @VoltaClassic als opwarmer. Interview met winnaar 2016: Floris Gerts,fietsende corpsbal https://t.co/aYKHLVAmxz', 'nl', 'nl')]


In [16]:
expected_languages = {'en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'}

# Initialize the dictionaries for true positives, false positives, and false negatives
true_positives = {lang: 0 for lang in expected_languages}
false_positives = {lang: 0 for lang in expected_languages}
false_negatives = {lang: 0 for lang in expected_languages}

# Update the counts for TP, FP, FN based on each result
for _, predicted_lang, true_lang in results:
    # Skip rows with an invalid language code
    # (this also drops the CSV header row, whose true label is the string 'label')
    if predicted_lang not in expected_languages or true_lang not in expected_languages:
        continue

    if predicted_lang == true_lang:
        true_positives[true_lang] += 1
    else:
        false_positives[predicted_lang] += 1
        false_negatives[true_lang] += 1


In [17]:
f1_scores = {}
for language in expected_languages:
    tp = true_positives[language]
    fp = false_positives[language]
    fn = false_negatives[language]

    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0

    if precision == 0 and recall == 0:
        f1 = 0  # Prevent division by zero
    else:
        f1 = 2 * (precision * recall) / (precision + recall)

    f1_scores[language] = f1


In [18]:
for language, f1 in f1_scores.items():
    print(f"F1 Score for {language}: {f1:.4f}")


F1 Score for fr: 0.9391
F1 Score for in: 0.8966
F1 Score for pt: 0.9100
F1 Score for nl: 0.9155
F1 Score for tl: 0.8688
F1 Score for es: 0.8860
F1 Score for it: 0.8977
F1 Score for en: 0.8815


# **Good luck!**