# Assigning CoD codes with Natural Language Processing (NLP)

The goal is to assign each CoD string in 'frequency_tidy_cod_hackathon.csv' the correct CoD code from 'DK1875.csv'. The approach follows these steps:

1. Preprocess the data:
Read the 'frequency_tidy_cod_hackathon.csv' and 'DK1875.csv' files and load them into appropriate data structures, such as pandas dataframes. Perform any necessary data cleaning and preprocessing steps, such as removing missing values, standardizing text formats, and handling language variations.

2. Define a similarity measure:
Choose a similarity measure for comparing the CoD strings from 'frequency_tidy_cod_hackathon.csv' with the 'latin_name' and 'danish_name' columns from 'DK1875.csv'. This notebook uses word embedding similarity.

3. Perform matching:
Iterate over each CoD string in 'frequency_tidy_cod_hackathon.csv'.
Calculate the similarity score between the CoD string and the 'latin_name' and 'danish_name' columns in 'DK1875.csv' using the chosen similarity measure. Assign the CoD code based on the highest similarity score or a predefined threshold.

4. Evaluate and validate the results:
Compare the assigned CoD codes with the original codes in 'DK1875.csv' to evaluate the accuracy of the matching process. Assess any discrepancies or potential errors and make necessary adjustments.
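
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the similarity scores, the threshold of 0.6, and the hand-labelled `gold` codes are made-up values, and `assign_code` and `accuracy` are hypothetical helper names. The sketch only shows the mechanics of picking the best-scoring code, rejecting weak matches, and measuring accuracy against a labelled sample.

```python
def assign_code(scores, threshold):
    """Return the code with the highest score, or None if it is below threshold.

    `scores` maps each candidate CoD code to a similarity score.
    """
    if not scores:
        return None
    best_code = max(scores, key=scores.get)
    return best_code if scores[best_code] >= threshold else None


def accuracy(assigned, gold):
    """Fraction of gold-labelled strings whose assigned code equals the gold code."""
    if not gold:
        return 0.0
    correct = sum(1 for text, code in gold.items() if assigned.get(text) == code)
    return correct / len(gold)


# Hypothetical similarity scores for three CoD strings (illustrative numbers only).
toy_scores = {
    'variolæ': {'1': 0.95, '2': 0.20, '3': 0.10},
    'morbilli': {'1': 0.15, '2': 0.90, '3': 0.30},
    'ukendt': {'1': 0.25, '2': 0.30, '3': 0.28},
}
THRESHOLD = 0.6  # assumed cut-off; it would need tuning on real data

assigned = {text: assign_code(scores, THRESHOLD) for text, scores in toy_scores.items()}

# Hypothetical hand-labelled codes used as ground truth.
gold = {'variolæ': '1', 'morbilli': '2', 'ukendt': '3'}

print(assigned)
print(f'accuracy: {accuracy(assigned, gold):.2f}')
```

A string whose best score falls below the threshold is left unassigned (`None`) rather than forced onto the nearest code, which makes weak matches easy to review by hand.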

1. Preprocess the data

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Specify the file paths
frequency_file_path = '/content/drive/MyDrive/frequency_tidy_cod_hackathon.csv'
dk1875_file_path = '/content/drive/MyDrive/DK1875.csv'

# Load the data from the CSV files
frequency_df = pd.read_csv(frequency_file_path, sep=';', header=None)
dk1875_df = pd.read_csv(dk1875_file_path, sep=';', header=None)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Perform basic data exploration to understand the structure and contents of the dataframes.
print(frequency_df.head())
print(dk1875_df.head())

                                       0
0                               tidy_cod
1  morbus cordis (mb. cordis, mb. cord.)
2                                dødfødt
3                   pneumonia (pneumoni)
4                       bronchopneumonia
                                       0
0  DK_Danish_Code$latin_name$danish_name
1                       1$Variolæ$Kopper
2                   2$Morbilli$Mæslinger
3           3$Scarlatina$Skarlagensfeber
4       4$Diphtheritis$Ondartet Halssyge
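
The printed frames show two parsing problems: with `header=None` the header row ends up as the first data row, and 'DK1875.csv' is `$`-separated, so `sep=';'` leaves it as a single column. A minimal sketch of a load that yields named columns is given here. It uses an in-memory copy of the rows printed above instead of the Drive files, and it assumes the frequency file has a single `tidy_cod` column, as the output suggests. The names `frequency_sample` and `dk1875_sample` are illustrative.

```python
import io

import pandas as pd

# In-memory stand-ins for the two files, copied from the output printed above.
frequency_csv = io.StringIO(
    'tidy_cod\n'
    'morbus cordis (mb. cordis, mb. cord.)\n'
    'dødfødt\n'
    'pneumonia (pneumoni)\n'
    'bronchopneumonia\n'
)
dk1875_csv = io.StringIO(
    'DK_Danish_Code$latin_name$danish_name\n'
    '1$Variolæ$Kopper\n'
    '2$Morbilli$Mæslinger\n'
    '3$Scarlatina$Skarlagensfeber\n'
    '4$Diphtheritis$Ondartet Halssyge\n'
)

# header=0 (the default) takes the first row as column names.
# sep=';' keeps the commas inside the CoD strings from splitting a row.
frequency_sample = pd.read_csv(frequency_csv, sep=';')

# DK1875 uses '$' as its field separator; the code is kept as a string.
dk1875_sample = pd.read_csv(dk1875_csv, sep='$', dtype={'DK_Danish_Code': str})

print(frequency_sample.columns.tolist())
print(dk1875_sample.columns.tolist())
print(dk1875_sample.head())
```

With named columns, later cells can refer to `tidy_cod`, `latin_name`, `danish_name`, and `DK_Danish_Code` directly.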


2. Define a similarity measure

Using word embeddings for similarity measurement can be a powerful approach for assigning CoD codes based on the 'latin_name' and 'danish_name' columns in the 'DK1875.csv' file. Here's how we can leverage word embeddings to calculate similarity:

✻ Preprocess the text:
Convert the CoD strings, 'latin_name', and 'danish_name' columns to lowercase and remove any leading/trailing whitespaces or punctuation marks.
Tokenize the text by splitting it into individual words or subwords.

✻ Load pre-trained word embeddings:
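Download and load pre-trained FastText word embedding models.
These models map words to dense vectors in a continuous vector space, so that words with similar meanings lie close together. FastText also builds vectors from character n-grams, so it can produce a vector for a word that was not in its training vocabulary.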

✻ Convert text to word embeddings:
Iterate over the CoD strings, 'latin_name', and 'danish_name' values.
For each text, represent it as a vector by averaging the word embeddings of its constituent words.
You can use the pre-trained word embedding model to obtain the vector representation for each word.

✻ Calculate similarity:
Use a similarity metric like cosine similarity or Euclidean distance to compare the vector representations of the CoD strings with the 'latin_name' and 'danish_name' vectors.
Cosine similarity is a common choice, as it measures the cosine of the angle between two vectors, indicating their similarity.
Higher cosine similarity scores indicate greater similarity between the CoD string and the 'latin_name' or 'danish_name'.
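
The averaging and cosine-similarity steps above can be sketched with a toy embedding table. The three-dimensional vectors are invented for illustration and stand in for real FastText vectors; `tokenize`, `embed_text`, and `cosine_similarity` are hypothetical helper names, not part of any library.

```python
import re

import numpy as np

# Toy 3-dimensional word vectors (invented values, stand-ins for FastText output).
TOY_VECTORS = {
    'morbus': np.array([1.0, 0.0, 0.0]),
    'cordis': np.array([0.0, 1.0, 0.0]),
    'pneumonia': np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r'[^\W\d_]+', text.lower())


def embed_text(text):
    """Average the vectors of the known words; zero vector if none are known."""
    vectors = [TOY_VECTORS[w] for w in tokenize(text) if w in TOY_VECTORS]
    return np.mean(vectors, axis=0) if vectors else np.zeros(DIM)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


query = embed_text('Morbus cordis.')
print(cosine_similarity(query, embed_text('morbus cordis')))  # same words
print(cosine_similarity(query, embed_text('morbus')))         # partial overlap
print(cosine_similarity(query, embed_text('pneumonia')))      # no overlap
```

Returning 0.0 for an all-zero vector avoids a division by zero when a text contains no known words; with real FastText models this case is rare because of the subword vectors.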

To use FastText word embeddings for Danish and Latin, download the respective pre-trained models from the FastText website (https://fasttext.cc/docs/en/pretrained-vectors.html). The code below expects the downloaded archives as 'wiki.da.zip' and 'wiki.la.zip' in Google Drive; update the file paths in the code if they are stored elsewhere.

In [None]:
import csv
import zipfile

import fasttext
import numpy as np

# Step 1: Load the FastText word embedding models for Danish and Latin

# Specify the paths to the Danish and Latin model zip files in Google Drive
danish_model_zip_path = '/content/drive/MyDrive/wiki.da.zip'
latin_model_zip_path = '/content/drive/MyDrive/wiki.la.zip'

# Extract the Danish model files from the zip file
with zipfile.ZipFile(danish_model_zip_path, 'r') as zip_ref:
    zip_ref.extractall('/content/danish_model')

# Extract the Latin model files from the zip file
with zipfile.ZipFile(latin_model_zip_path, 'r') as zip_ref:
    zip_ref.extractall('/content/latin_model')

# Load the Danish and Latin FastText models
danish_model = fasttext.load_model('/content/danish_model/wiki.da.bin')
latin_model = fasttext.load_model('/content/latin_model/wiki.la.bin')


def embed(model, text):
    """Return the unit-length sentence vector of `text` (zero vector if empty)."""
    vector = model.get_sentence_vector(text.lower().strip())
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


# Step 2: Load the CoD strings from 'frequency_tidy_cod_hackathon.csv'
# The strings contain commas, so ';' is used as the delimiter (as in the pandas load).
cod_strings = []
with open(frequency_file_path, 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file, delimiter=';')
    next(reader)  # Skip the header row
    for row in reader:
        if row:
            cod_strings.append(row[0])

# Step 3: Load the CoD codes and names from 'DK1875.csv'
# The file is '$'-separated: DK_Danish_Code$latin_name$danish_name
cod_mapping = {}
with open(dk1875_file_path, 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file, delimiter='$')
    next(reader)  # Skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # Skip blank or malformed rows
        cod, latin_name, danish_name = row[0], row[1], row[2]
        # Embed each name once, with the model of its own language
        cod_mapping[cod] = {
            'latin': embed(latin_model, latin_name),
            'danish': embed(danish_model, danish_name),
        }

# Step 4: Assign CoD codes based on cosine similarity
results = []
for cod_string in cod_strings:
    # A CoD string may be Latin or Danish, so embed it with both models
    cod_vec_latin = embed(latin_model, cod_string)
    cod_vec_danish = embed(danish_model, cod_string)
    best_similarity = float('-inf')
    best_cod = None

    for cod, vectors in cod_mapping.items():
        # The dot product of unit-length vectors is their cosine similarity
        similarity_latin = float(cod_vec_latin @ vectors['latin'])
        similarity_danish = float(cod_vec_danish @ vectors['danish'])
        similarity = max(similarity_latin, similarity_danish)

        if similarity > best_similarity:
            best_similarity = similarity
            best_cod = cod

    results.append((cod_string, best_cod))

# Step 5: Save the results in a new file
with open('assigned_cod_codes.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['CoD String', 'CoD Code'])
    writer.writerows(results)


In [None]:
from google.colab import files

# Download the file
files.download('assigned_cod_codes.csv')

FileNotFoundError: ignored

In [None]:
import os

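# Specify the file path. A local path such as '/Users/...' does not exist
# inside the Colab runtime, so the file is written to the mounted Drive instead.
output_file_path = '/content/drive/MyDrive/frequency_tidy_cod_hackathon_assigned.csv'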

# Save the DataFrame as a CSV file
frequency_df.to_csv(output_file_path, index=False)

# Check if the file was created successfully
if os.path.isfile(output_file_path):
    print("CSV file saved successfully.")
else:
    print("Failed to save the CSV file.")

In [None]:
!pip3 install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [None]:
import pandas as pd
from fuzzywuzzy import fuzz

# Re-read both files with the first row as the header. 'DK1875.csv' is
# '$'-separated, which gives the columns DK_Danish_Code, latin_name, danish_name.
frequency_df = pd.read_csv(frequency_file_path, sep=';')
dk1875_df = pd.read_csv(dk1875_file_path, sep='$')

# Iterate over each CoD string in 'frequency_df'
for index, row in frequency_df.iterrows():
    cod_string = str(row['tidy_cod']).lower()
    highest_similarity = 0
    matched_cod_code = None

    # Iterate over each row in 'dk1875_df'
    for dk_index, dk_row in dk1875_df.iterrows():
        # fuzz.ratio is case-sensitive, so compare lowercased text
        latin_name = str(dk_row['latin_name']).lower()
        danish_name = str(dk_row['danish_name']).lower()

        # Calculate similarity scores (0-100) between the CoD string and the Latin and Danish names
        latin_similarity = fuzz.ratio(cod_string, latin_name)
        danish_similarity = fuzz.ratio(cod_string, danish_name)

        # Update the highest similarity and matched CoD code if necessary
        row_similarity = max(latin_similarity, danish_similarity)
        if row_similarity > highest_similarity:
            highest_similarity = row_similarity
            matched_cod_code = dk_row['DK_Danish_Code']

    # Assign the matched CoD code to the corresponding CoD string
    frequency_df.at[index, 'cod_code'] = matched_cod_code

# Save the updated dataframe to a new file or perform further analysis
frequency_df.to_csv('matched_cod_codes.csv', index=False)
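
Several CoD strings bundle spelling variants in parentheses, for example 'morbus cordis (mb. cordis, mb. cord.)', so comparing the whole string against a single name lowers the score. The sketch below splits such a string into its variants and keeps the best score over all of them. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the scores are comparable in spirit but not guaranteed to be identical. `split_variants`, `similarity`, and `best_match` are hypothetical helpers, the name table is limited to the DK1875 rows printed earlier, and the input 'kopper (variolæ)' is a made-up example.

```python
import re
from difflib import SequenceMatcher

# (code, latin_name, danish_name) rows taken from the DK1875 sample shown earlier.
DK1875_SAMPLE = [
    ('1', 'Variolæ', 'Kopper'),
    ('2', 'Morbilli', 'Mæslinger'),
    ('3', 'Scarlatina', 'Skarlagensfeber'),
    ('4', 'Diphtheritis', 'Ondartet Halssyge'),
]


def split_variants(cod_string):
    """Split 'main (alt1, alt2)' into ['main', 'alt1', 'alt2'], lowercased."""
    match = re.fullmatch(r'\s*(.*?)\s*\((.*)\)\s*', cod_string)
    if not match:
        return [cod_string.strip().lower()]
    main, alternatives = match.groups()
    variants = [main] + alternatives.split(',')
    return [v.strip().lower() for v in variants if v.strip()]


def similarity(a, b):
    """Similarity on a 0-100 scale based on difflib's ratio()."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(cod_string, table):
    """Return (code, score) of the best-scoring name over all variants."""
    best_code, best_score = None, -1.0
    for variant in split_variants(cod_string):
        for code, latin_name, danish_name in table:
            for name in (latin_name, danish_name):
                score = similarity(variant, name.lower())
                if score > best_score:
                    best_code, best_score = code, score
    return best_code, best_score


print(split_variants('morbus cordis (mb. cordis, mb. cord.)'))
print(best_match('kopper (variolæ)', DK1875_SAMPLE))
```

Taking the maximum over variants means an abbreviation or alternate spelling inside the parentheses can still produce a match when the main term does not.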