## Task: Identifying False Friends - Words with Similar Spelling but Different Semantics

The goal of this task is to identify "false friends": pairs of words in two languages (here English and Spanish) that have similar or identical spellings but different meanings.

### Requirements:

- Read pairs of English and Spanish words from a CSV file.
- Determine whether each pair of words is a potential "false friend" based on semantic similarity and spelling.
- Print the results indicating whether the words are potential false friends or not.

### Problem

The problem being addressed is the identification of pairs of words that look similar but have different meanings in English and Spanish. The solution should use various techniques, including Levenshtein distance and word embeddings, to make this determination.

### Solution

**You need to create a `.env` file containing your BabelNet API key (the code reads it from the `API_KEY` variable) to run the BabelNet-based approach.**
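
As a minimal sketch of that setup, the check below fails early when the key is missing instead of sending requests with an empty key. The helper name `require_api_key` is hypothetical and not part of the notebook; the variable name `API_KEY` matches what the code further down reads with `os.getenv`. In the notebook itself, `load_dotenv()` would be called first so that the contents of `.env` are visible in `os.environ`.

```python
import os


def require_api_key(env=None, name="API_KEY"):
    """Return the API key found in `env`, or raise if it is missing or blank.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to the .env file"
        )
    return key
```

With a `.env` file containing a single line of the form `API_KEY=<your key>`, calling `require_api_key()` after `load_dotenv()` would return the key.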

The solution involves the following steps:

#### Reading Pairs of Words from CSV

1. Open and read a CSV file containing pairs of English and Spanish words. The CSV file is specified by the `csv_file_path` variable.
2. Create a list of tuples, `translation_tuples`, to store the word pairs.

#### Identifying False Friends

1. Define a function `words_overlap` that calculates the Levenshtein distance between two words and returns `True` if the distance is less than or equal to a threshold (half of the maximum word length). This function helps identify similar spellings.
2. Define a function `count_overlapping_words` that counts the words shared by two meanings. It lowercases both meanings, splits them on whitespace, and returns the size of the intersection of the two token sets. (This helper is defined in the code, but `is_false_friend` does not call it.)
3. Define a function `is_false_friend` that takes a pair of words and determines if they are potential false friends. It performs the following steps:
   - Checks whether the words have similar spellings using `words_overlap`; pairs that fail this check are not considered false friends.
   - Retrieves the meanings of the English and Spanish words, using PyMultiDictionary in the first approach and BabelNet in the second.
   - Normalizes the meanings to lowercase and computes their semantic similarity using word embeddings (spaCy's `en_core_web_md` model).
   - Considers the words potential false friends if they have similar spellings and their semantic similarity is below a threshold (0.80), or if a meaning cannot be found.
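
The spelling check in step 1 can be illustrated without any third-party package. The sketch below uses a plain dynamic-programming edit distance as a stand-in for the `Levenshtein` library used in the notebook; the function names `levenshtein` and `spelling_is_similar` are illustrative only, while the threshold rule (half of the longer word's length, truncated to an integer) follows the description above.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


def spelling_is_similar(word1, word2):
    """True if the edit distance is at most half the longer word's length."""
    threshold = int(0.5 * max(len(word1), len(word2)))
    return levenshtein(word1.lower(), word2.lower()) <= threshold
```

For example, `carpet` and `carpeta` differ by a single insertion, so their distance is 1, well under the threshold of `int(0.5 * 7) = 3`, and `spelling_is_similar("Carpet", "Carpeta")` is `True`.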

#### Printing Results

1. Iterate through the list of word pairs from `translation_tuples`.
2. Call the `is_false_friend` function for each pair to determine if they are potential false friends.
3. Print the results, indicating whether the words are potential false friends or not.
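
For the BabelNet-based lookup mentioned in the steps above, the notebook's code takes the first entry of the `glosses` list in the synset response. The sketch below shows one possible refinement: picking the first gloss in a requested language. It works on a plain dictionary, so it runs without network access. The helper name `pick_gloss`, the `language` field on each gloss entry, and the sample data are all assumptions made for illustration; only the `glosses` list and its `gloss` key are confirmed by the notebook's own code.

```python
def pick_gloss(synset_data, lang):
    """Return the first non-empty gloss whose language matches `lang`, else ''.

    Assumes each entry in synset_data["glosses"] is a dict with a "gloss" key
    and (hypothetically) a "language" key such as "EN" or "ES".
    """
    for entry in synset_data.get("glosses", []):
        if entry.get("language", "").upper() == lang.upper() and entry.get("gloss"):
            return entry["gloss"]
    return ""


# Hypothetical response fragment, invented for this example
sample = {
    "glosses": [
        {"language": "EN", "gloss": "A floor covering."},
        {"language": "ES", "gloss": "Cubierta para documentos."},
    ]
}

english_gloss = pick_gloss(sample, "EN")
spanish_gloss = pick_gloss(sample, "es")
```

Returning an empty string when nothing matches mirrors the fallback used by the notebook's `get_synset_meaning`.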


In [1]:
import json

import Levenshtein
import requests
import spacy
from Levenshtein import distance as levenshtein_distance
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from PyMultiDictionary import MultiDictionary
from translate import Translator

# Requires the medium English model: python -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
dictionary = MultiDictionary()




In [2]:
import csv

# Specify the path to your CSV file
csv_file_path = '../lab6/false_friends.csv'

# Initialize an empty list to store the tuples
translation_tuples = []

# Open the CSV file and read its contents
with open(csv_file_path, newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    
    # Skip the header row
    next(csvreader)
    
    # Iterate through the rows and create tuples
    for row in csvreader:
        if len(row) == 2:
            english_word, spanish_word = row
            translation_tuples.append((english_word.strip(), spanish_word.strip()))

# Print the list of tuples
print(translation_tuples)


[('Actual', 'Actual'), ('Parent', 'Pariente'), ('Library', 'Librería'), ('Pretend', 'Pretender'), ('Sympathy', 'Simpatía'), ('Embarrassed', 'Embarazada'), ('Carpet', 'Carpeta'), ('Fabricate', 'Fabricar'), ('Actualize', 'Actualizar'), ('Sensation', 'Sensación'), ('Mayor', 'Mayor'), ('Excited', 'Excitado'), ('Intoxicated', 'Intoxicado'), ('Fabrication', 'Fabricación'), ('Casual', 'Casual'), ('Resume', 'Resumir'), ('Familiar', 'Familiar'), ('Pretend', 'Pretender'), ('Sensible', 'Sensible'), ('College', 'Colegio'), ('Education', 'Educación'), ('Introduce', 'Introducir'), ('Deception', 'Decepción'), ('Artist', 'Artista'), ('Accident', 'Accidente'), ('Parental', 'Parental'), ('Editor', 'Editor'), ('Eventually', 'Eventualmente')]


In [3]:
def words_overlap(word1, word2):
    """
    Determine if two words overlap based on a similarity threshold.

    Args:
        word1 (str): The first word.
        word2 (str): The second word.

    Returns:
        bool: True if the words overlap, False otherwise.
    """
    threshold = int(0.5 * max(len(word1), len(word2)))
    distance = Levenshtein.distance(word1, word2)
    return distance <= threshold

def count_overlapping_words(meaning1, meaning2):
    """
    Count the number of overlapping words between two meanings.

    Args:
        meaning1 (str): The first meaning.
        meaning2 (str): The second meaning.

    Returns:
        int: The number of overlapping words.
    """
    tokens1 = set(meaning1.lower().split())
    tokens2 = set(meaning2.lower().split())
    overlap = tokens1.intersection(tokens2)
    return len(overlap)

def is_false_friend(word1, word2):
    """
    Check if two words are potential false friends.

    Args:
        word1 (str): The first (English) word.
        word2 (str): The second (Spanish) word.

    Returns:
        bool: True if the words are potential false friends, False otherwise.
    """
    print(word1, word2)

    # Only pairs with similar spellings can be false friends
    if not words_overlap(word1.lower(), word2.lower()):
        return False

    # dictionary.meaning returns a tuple; index 1 holds the definition text
    translated_meaning1 = dictionary.meaning('en', word1)
    translated_meaning2 = dictionary.meaning('es', word2)

    # If a meaning is not found for either word, consider them potential false friends
    if not translated_meaning1[1] or not translated_meaning2[1]:
        return True

    token1 = nlp(translated_meaning1[1].lower())  # Normalize to lowercase
    token2 = nlp(translated_meaning2[1].lower())  # Normalize to lowercase

    # Low semantic similarity between the definitions suggests different meanings
    return token1.similarity(token2) < 0.80
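
The `similarity` call above is, by default in spaCy, a cosine similarity between the averaged word vectors of the two texts. The sketch below reproduces that computation on small hand-written vectors so the 0.80 threshold is easier to interpret. The function names and the toy vectors are illustrative assumptions, not output of `en_core_web_md`, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "word vectors" for the tokens of two definitions (made up)
definition1 = mean_vector([[1.0, 0.0], [0.8, 0.2]])
definition2 = mean_vector([[0.0, 1.0], [0.2, 0.8]])

# A score below 0.80 would count as "different meaning" under the notebook's rule
looks_different = cosine_similarity(definition1, definition2) < 0.80
```

Because the score depends only on the direction of the averaged vectors, long and short definitions are comparable, but averaging also blurs word order and function words.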


In [4]:
for x, y in translation_tuples:
    # is_false_friend prints the pair itself before returning its verdict
    verdict = "Potential false friends." if is_false_friend(x, y) else "NOT false friends."
    print(f"Word1 '{x}', Word2 '{y}'")
    print(verdict)
    print()  # Add a blank line between each iteration


Actual Actual
Word1 'Actual', Word2 'Actual'
Potential false friends.

Parent Pariente
Word1 'Parent', Word2 'Pariente'
Potential false friends.

Library Librería
Word1 'Library', Word2 'Librería'
Potential false friends.

Pretend Pretender
Word1 'Pretend', Word2 'Pretender'
Potential false friends.

Sympathy Simpatía
Word1 'Sympathy', Word2 'Simpatía'
Potential false friends.

Embarrassed Embarazada
Word1 'Embarrassed', Word2 'Embarazada'
Potential false friends.

Carpet Carpeta
Word1 'Carpet', Word2 'Carpeta'
Potential false friends.

Fabricate Fabricar
Word1 'Fabricate', Word2 'Fabricar'
Potential false friends.

Actualize Actualizar
Word1 'Actualize', Word2 'Actualizar'
Potential false friends.

Sensation Sensación
Word1 'Sensation', Word2 'Sensación'
Potential false friends.

Mayor Mayor
Word1 'Mayor', Word2 'Mayor'
Potential false friends.

Excited Excitado
Word1 'Excited', Word2 'Excitado'
Potential false friends.

Intoxicated Intoxicado
Word1 'Intoxicated', Word2 'Intoxicado'
Potential false friends.

Fa

In [5]:
import os

import requests
from dotenv import load_dotenv

# Load the BabelNet API key from the .env file
load_dotenv()
api_key = os.getenv("API_KEY")

BABELNET_URL = "https://babelnet.io/v8"


def get_synset_meaning(synset_id):
    """
    Get the meaning of a synset based on its ID.

    Args:
        synset_id (str): The ID of the synset.

    Returns:
        str: The first gloss of the synset, or '' if it has none.
    """
    # Passing params lets requests URL-encode the values
    response = requests.get(
        f"{BABELNET_URL}/getSynset",
        params={"id": synset_id, "key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    glosses = response.json().get('glosses') or []
    return glosses[0]['gloss'] if glosses else ''


def get_word_meaning(word, lang):
    """
    Get the meaning of a word in a specific language.

    Args:
        word (str): The word.
        lang (str): The language code (e.g. "EN" or "ES").

    Returns:
        str: The first non-empty gloss found for the word, or '' if none.
    """
    response = requests.get(
        f"{BABELNET_URL}/getSynsetIds",
        params={"lemma": word, "searchLang": lang, "key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    # One request per synset until a gloss is found
    for synset_id in response.json():
        meaning = get_synset_meaning(synset_id['id'])
        if meaning:
            return meaning
    return ''

def is_false_friend(word1, word2):
    """
    Check if two words are potential false friends.

    Args:
        word1 (str): The first (English) word.
        word2 (str): The second (Spanish) word.

    Returns:
        bool: True if the words are potential false friends, False otherwise.
    """
    print(word1, word2)

    # Only pairs with similar spellings can be false friends
    if not words_overlap(word1.lower(), word2.lower()):
        return False

    # Lemmatize each word (note: `nlp` is the English model, also applied to the Spanish word)
    lemma1 = nlp(word1.lower())[0].lemma_
    lemma2 = nlp(word2.lower())[0].lemma_

    translated_meaning1 = get_word_meaning(lemma1, "EN")
    translated_meaning2 = get_word_meaning(lemma2, "ES")

    # If a meaning is not found for either word, consider them potential false friends
    if not translated_meaning1 or not translated_meaning2:
        return True

    tokenized_meaning1 = nlp(translated_meaning1.lower())
    tokenized_meaning2 = nlp(translated_meaning2.lower())

    # Low semantic similarity between the glosses suggests different meanings
    return tokenized_meaning1.similarity(tokenized_meaning2) < 0.80

for x, y in translation_tuples:
    # is_false_friend prints the pair itself before returning its verdict
    verdict = "Potential false friends." if is_false_friend(x, y) else "NOT false friends."
    print(f"Word1 '{x}', Word2 '{y}'")
    print(verdict)
    print()  # Add a blank line between each iteration

Actual Actual
Word1 'Actual', Word2 'Actual'
Potential false friends.

Parent Pariente
Word1 'Parent', Word2 'Pariente'
NOT false friends.

Library Librería
Word1 'Library', Word2 'Librería'
Potential false friends.

Pretend Pretender
Word1 'Pretend', Word2 'Pretender'
Potential false friends.

Sympathy Simpatía
Word1 'Sympathy', Word2 'Simpatía'
NOT false friends.

Embarrassed Embarazada
Word1 'Embarrassed', Word2 'Embarazada'
Potential false friends.

Carpet Carpeta
Word1 'Carpet', Word2 'Carpeta'
Potential false friends.

Fabricate Fabricar
Word1 'Fabricate', Word2 'Fabricar'
Potential false friends.

Actualize Actualizar
Word1 'Actualize', Word2 'Actualizar'
Potential false friends.

Sensation Sensación
Word1 'Sensation', Word2 'Sensación'
NOT false friends.

Mayor Mayor
Word1 'Mayor', Word2 'Mayor'
Potential false friends.

Excited Excitado
Word1 'Excited', Word2 'Excitado'
Potential false friends.

Intoxicated Intoxicado
Word1 'Intoxicated', Word2 'Intoxicado'
Potential false friends.

Fabric