# Word Variants - French

This notebook prepares the lookups that map French word variants to their roots.

The data was collected following the procedure at https://github.com/hbenbel/French-Dictionary, using the kaikki.org French dictionary (kaikki.org/dictionary/French/index.html) as the source.

In [32]:
import pandas as pd
from tqdm import tqdm
import numpy as np
import math

In [59]:
rawPath = "../../../RawData/Variants/French/hbenbel/"

In [61]:
rawJSON = pd.read_json("../../../RawData/Variants/French/kaikki.org-dictionary-French.json", lines=True)

In [82]:
rawJSON = rawJSON[["word","senses","forms"]]

## Getting variants

Variants carry the tag `form-of` or `alt-of` in their first sense. In those cases the first entry in `links` points to the root word (this will need some checking).

In [65]:
rawJSON[rawJSON.word == "eu"]["senses"][1692][0]

{'links': [['avoir', 'avoir#French']],
 'glosses': ['past participle of avoir'],
 'tags': ['form-of', 'participle', 'past'],
 'form_of': [{'word': 'avoir'}],
 'id': 'eu-fr-verb-FnRNr-Pp',
 'categories': []}

In [64]:
rawJSON[rawJSON.word == "eu"]["forms"][1692]

[{'form': 'eue', 'tags': ['feminine']},
 {'form': 'eus', 'tags': ['masculine', 'plural']},
 {'form': 'eues', 'tags': ['feminine', 'plural']}]

In [71]:
rawJSON[rawJSON.word == "eue"]["senses"][91186]

[{'links': [['eu', 'eu#French']],
  'glosses': ['feminine singular of eu'],
  'tags': ['feminine', 'form-of', 'participle', 'singular'],
  'form_of': [{'word': 'eu'}],
  'id': 'eue-fr-verb-g2BJqgFA',
  'categories': []}]

In [96]:
rawJSON[rawJSON.word == "eue"]["senses"][91186][0]["links"][0][0]

'eu'

In [54]:
rawJSON[rawJSON.word == "oeufs"]["senses"][149408][0]["tags"]

['alt-of', 'masculine', 'nonstandard']

In [113]:
rawJSON[rawJSON.word == "oeufs"]["senses"][149408][0]["links"]

[['œufs', 'œufs#French']]

In [53]:
rawJSON[rawJSON.word == "œufs"]["senses"][84185][0]["tags"]

['form-of', 'masculine', 'plural']

In [79]:
rawJSON[rawJSON.word == "avoir"]["senses"][1361][0]

{'examples': [{'text': 'Near-synonym: posséder'},
  {'text': 'J’aimerais avoir 20 dollars.',
   'english': 'I would like to have 20 dollars.',
   'type': 'example'}],
 'links': [['have', 'have']],
 'raw_glosses': ['(transitive) to have (to own; to possess)'],
 'glosses': ['to have (to own; to possess)'],
 'tags': ['transitive'],
 'id': 'avoir-fr-verb-3fPpXyVX',
 'categories': []}

In [81]:
rawJSON[rawJSON.word == "ait"]["senses"][8483]

[{'links': [['avoir', 'avoir#French']],
  'glosses': ['third-person singular present subjunctive of avoir'],
  'tags': ['form-of', 'present', 'singular', 'subjunctive', 'third-person'],
  'form_of': [{'word': 'avoir'}],
  'id': 'ait-fr-verb-bnz7GJlw',
  'categories': []}]
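
The lookups above use hard-coded index labels such as `1692`, which would change if the dump were re-downloaded. A minimal sketch of a label-free lookup follows. `first_sense_summary` is a hypothetical helper that is not part of this notebook, and the small `demo` frame only stands in for `rawJSON`.

```python
import pandas as pd

def first_sense_summary(df, word):
    """Return (tags, first link target) for the first sense of every entry spelled `word`."""
    summaries = []
    for senses in df.loc[df["word"] == word, "senses"]:
        first = senses[0]
        tags = first.get("tags", [])
        links = first.get("links", [])
        # The first link target is what the notebook treats as the root candidate.
        target = links[0][0] if links else None
        summaries.append((tags, target))
    return summaries

# Hypothetical stand-in for rawJSON, reusing two senses shown above.
demo = pd.DataFrame({
    "word": ["eu", "eue"],
    "senses": [
        [{"links": [["avoir", "avoir#French"]],
          "tags": ["form-of", "participle", "past"]}],
        [{"links": [["eu", "eu#French"]],
          "tags": ["feminine", "form-of", "participle", "singular"]}],
    ],
})

print(first_sense_summary(demo, "eue"))
```

Because the helper returns one tuple per matching entry, it also shows when a spelling has several entries.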

## Variant mapper function

In [144]:
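def variantMapper(row):
    """Map one dictionary entry to its root, using only the entry's first sense."""
    word = row["word"]
    first_sense = row["senses"][0]

    # Without both tags and links we cannot identify a root, so the word is its own root.
    if "tags" not in first_sense or "links" not in first_sense:
        return {"Variant": word, "Root": word}

    # Inflected forms are tagged form-of; alternative spellings are tagged alt-of.
    tags = first_sense["tags"]
    isVariant = "form-of" in tags or "alt-of" in tags
    # For a variant, the first link target is taken as the root.
    root = first_sense["links"][0][0] if isVariant else word
    return {"Variant": word, "Root": root}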

In [145]:
variantMapper(rawJSON.iloc[149408])

{'Variant': 'oeufs', 'Root': 'œufs'}

In [146]:
variantMapper(rawJSON.iloc[1361])

{'Variant': 'avoir', 'Root': 'avoir'}

In [147]:
variantMapper(rawJSON.iloc[91186])

{'Variant': 'eue', 'Root': 'eu'}

In [148]:
variantMapper(rawJSON.iloc[1692])

{'Variant': 'eu', 'Root': 'avoir'}

In [141]:
variantMapper(rawJSON.iloc[8])

{'Variant': 'ablactation', 'Root': 'ablactation'}

In [149]:
variantMapper(rawJSON.iloc[375])

{'Variant': 'latin', 'Root': 'latin'}
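
`variantMapper` assumes that `senses` is a non-empty list and that `links` has at least one entry. A hedged sketch of a more defensive version follows. `variant_mapper_safe` is a hypothetical name, and preferring the `form_of` field (visible in the outputs above) over `links` is an assumption about which field identifies the root more reliably.

```python
def variant_mapper_safe(row):
    """Like variantMapper, but tolerant of missing or empty fields (illustrative only)."""
    word = row["word"]
    senses = row.get("senses")
    # Missing or empty senses: treat the word as its own root.
    if not isinstance(senses, list) or not senses:
        return {"Variant": word, "Root": word}

    first = senses[0]
    tags = first.get("tags") or []
    if "form-of" not in tags and "alt-of" not in tags:
        return {"Variant": word, "Root": word}

    # Assumption: an explicit form_of target is preferred, else the first link.
    form_of = first.get("form_of") or []
    links = first.get("links") or []
    if form_of and "word" in form_of[0]:
        root = form_of[0]["word"]
    elif links:
        root = links[0][0]
    else:
        root = word
    return {"Variant": word, "Root": root}

# Rows written as plain dicts; a pandas row also supports .get().
eu = {"word": "eu", "senses": [{"links": [["avoir", "avoir#French"]],
                                "tags": ["form-of", "participle", "past"],
                                "form_of": [{"word": "avoir"}]}]}
oeufs = {"word": "oeufs", "senses": [{"links": [["œufs", "œufs#French"]],
                                      "tags": ["alt-of", "masculine", "nonstandard"]}]}
print(variant_mapper_safe(eu), variant_mapper_safe(oeufs))
```

On the entries shown above this gives the same roots as `variantMapper`; whether it agrees on the whole dump has not been checked here.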

## Note:

We have now essentially created a tree structure (assuming that no two words are variants of each other). We should traverse it so that every word is either a root or a direct variant of a root.
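
To make the tree idea concrete, here is a small sketch using the three mappings seen above (`eue` → `eu`, `eu` → `avoir`, `avoir` → `avoir`). `chain` is a hypothetical helper used only for illustration.

```python
# Mappings taken from the variantMapper outputs above.
toy_map = {"eue": "eu", "eu": "avoir", "avoir": "avoir"}

def chain(word, mapping):
    """Return the list of words visited when following variant -> root links."""
    path = [word]
    while mapping.get(path[-1], path[-1]) != path[-1]:
        nxt = mapping[path[-1]]
        if nxt in path:  # guard against circular links
            break
        path.append(nxt)
    return path

print(chain("eue", toy_map))  # ['eue', 'eu', 'avoir']
```

The chain has two hops, so `eue` sits at depth 2 until the map is flattened.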

## Generate root/variant tree (arbitrary depth)

In [150]:
tqdm.pandas()
rawMap = rawJSON.dropna(subset=['word']).progress_apply(variantMapper, axis = 1)

100%|███████████████████████████████████████████████████████████████████████| 392163/392163 [00:35<00:00, 10965.52it/s]


In [156]:
rawdf = pd.DataFrame.from_records(rawMap)

In [159]:
rawdf[rawdf.Variant != rawdf.Root].sample(10)

|  | Variant | Root |
|---|---|---|
| 298916 | bitcoins | bitcoin |
| 23339 | borderai | border |
| 145270 | intimidaient | intimider |
| 337384 | sérançassions | sérancer |
| 112681 | scintillâmes | scintiller |
| 278150 | taquineries | taquinerie |
| 197123 | répudiâtes | répudier |
| 136629 | survêtît | survêtir |
| 268250 | homœomorphisme | homéomorphisme |
| 137002 | recuirai | recuirer |


## Normalise to depth 1

Right now a "variant chain" can form, where a variant's root is itself a variant of another word. Let's fix that by following each chain to its end. We cap the number of steps because some roots may be circular; we'll ignore those for now.

In [222]:
def create_variant_root_map(df):
    return df.set_index('Variant')['Root'].to_dict()

def find_ultimate_root(row, variant_root_map, max_depth=1000):
    """Follow Variant -> Root links in variant_root_map until they stop changing."""
    variant = row['Variant']
    # The starting root comes from the map, not from row['Root'].
    root = variant_root_map.get(variant, variant)

    depth = 0
    while root != variant and root in variant_root_map:
        if depth > max_depth:
            root = variant  # Suspected cycle: stop and keep the word reached so far
            break
        variant = root
        root = variant_root_map[variant]
        depth += 1

    return root

# Create the mapping dictionary
variant_root_map = create_variant_root_map(rawdf)

# Apply the function to each variant in the DataFrame with a max depth limit
rawdf['Ultimate_root'] = rawdf.progress_apply(lambda row: find_ultimate_root(row, variant_root_map, 1000), axis=1)

100%|███████████████████████████████████████████████████████████████████████| 392163/392163 [00:07<00:00, 52964.11it/s]
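
An alternative to the depth cap is to remember the words already visited, which detects a cycle as soon as it closes. A minimal sketch follows. `resolve_root` is a hypothetical function, and returning the starting word unchanged for a cycle is an arbitrary choice, not what the cell above does.

```python
def resolve_root(variant, mapping):
    """Follow variant -> root links; return (root, is_cycle)."""
    seen = {variant}
    current = variant
    while True:
        # Words absent from the mapping are treated as their own root.
        nxt = mapping.get(current, current)
        if nxt == current:
            return current, False
        if nxt in seen:
            # The chain came back to a word already visited: report a cycle.
            return variant, True
        seen.add(nxt)
        current = nxt

toy_map = {"eue": "eu", "eu": "avoir", "avoir": "avoir", "a": "b", "b": "a"}
print(resolve_root("eue", toy_map))  # ('avoir', False)
print(resolve_root("a", toy_map))    # ('a', True)
```

The explicit flag makes it possible to list the circular cases afterwards instead of inferring them from the result.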


In [223]:
variant_root_map["servir"]

'servir'

In [224]:
rawdf[(rawdf.Variant != rawdf.Root) & (rawdf.Root != rawdf.Ultimate_root)]


|  | Variant | Root | Ultimate_root |
|---|---|---|---|
| 53 | acceptant | accepter | acceptant |
| 85 | tables | table | tabler |
| 97 | représentant | représenter | représentant |
| 155 | servant | servir | servant |
| 194 | Philippines | Philippine | Philippin |
| ... | ... | ... | ... |
| 391802 | streamée | streamé | streamer |
| 391807 | streameurs | streameur | streamer |
| 391908 | Beauzelly | Bauzély | Baudile |
| 391996 | -ités | -ité | -té |


## A problem

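We see a problem: some ultimate roots come out equal to the variant itself even though the original root is different (e.g. `servant` has root `servir` but ultimate root `servant`), so they look circular. A more in-depth treatment could get to the bottom of what is going on here. For now we'll just use the original root in those cases, as that seems to be useful in the cases we care about.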
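
One possible cause, offered as an assumption rather than a diagnosis: the same spelling can have several entries (say a participle and a noun), and `set_index('Variant')['Root'].to_dict()` keeps only the last one. Since `variant_root_map["servir"]` is `'servir'`, `servant` cannot be reaching itself through `servir`; a later `servant` row whose root is itself would explain the output. The toy frame below is hypothetical and only illustrates the mechanism.

```python
import pandas as pd

# Hypothetical rows: "servant" appears twice, once as a form of "servir"
# and once as a separate entry that is its own root.
toy = pd.DataFrame({
    "Variant": ["servant", "servir", "servant"],
    "Root":    ["servir",  "servir", "servant"],
})

# Same construction as create_variant_root_map: later duplicates overwrite earlier ones.
lookup = toy.set_index("Variant")["Root"].to_dict()
print(lookup["servant"])  # 'servant'

# Spellings with more than one entry can be listed like this.
homographs = toy[toy["Variant"].duplicated(keep=False)]
print(homographs)
```

Running the `duplicated` check on `rawdf` would show how many of the flagged rows are explained this way.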

In [245]:
rawdf["MaybeBad1"] = rawdf.Root != rawdf.Ultimate_root 
rawdf["MaybeBad2"] = rawdf.Variant == rawdf.Ultimate_root

rawdf["Bad"] = rawdf["MaybeBad1"] & rawdf["MaybeBad2"]

In [247]:
rawdf[rawdf.Bad]

|  | Variant | Root | Ultimate_root | MaybeBad1 | MaybeBad2 | Bad |
|---|---|---|---|---|---|---|
| 53 | acceptant | accepter | acceptant | True | True | True |
| 97 | représentant | représenter | représentant | True | True | True |
| 155 | servant | servir | servant | True | True | True |
| 431 | infinitive | infinitif | infinitive | True | True | True |
| 508 | finale | final | finale | True | True | True |
| ... | ... | ... | ... | ... | ... | ... |
| 381432 | déparasité | déparasiter | déparasité | True | True | True |
| 385932 | étymologisant | étymologiser | étymologisant | True | True | True |
| 386594 | Chams | Cham | Chams | True | True | True |
| 387751 | exondé | exonder | exondé | True | True | True |


In [249]:
rawdf["Ultimate_root"] = np.where(rawdf["Bad"], rawdf["Root"], rawdf["Ultimate_root"])

In [252]:
rawdf[rawdf["Bad"]].sample(4)

|  | Variant | Root | Ultimate_root | MaybeBad1 | MaybeBad2 | Bad |
|---|---|---|---|---|---|---|
| 90326 | détaché | détacher | détacher | True | True | True |
| 80718 | décroissant | décroître | décroître | True | True | True |
| 102301 | félicitations | félicitation | félicitation | True | True | True |
| 12797 | doucette | doucet | doucet | True | True | True |


## Variant roots?

Some "roots" also look like variants. We'll include those for now, but we should be wary of them later.
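
One detectable case is a root that itself appears as a variant of some other word. A rough sketch of how such rows could be flagged follows. The toy frame is hypothetical (it leaves `eue` pointing at `eu` on purpose), and this check would not catch roots that merely look inflected.

```python
import pandas as pd

# Hypothetical lookup in which "eue" was left pointing at "eu", itself a variant.
toy = pd.DataFrame({
    "Variant": ["eue", "eu", "ait", "avoir"],
    "Root":    ["eu", "avoir", "avoir", "avoir"],
})

is_variant = toy["Variant"] != toy["Root"]
# Words known to be a variant of something else.
non_root_words = set(toy.loc[is_variant, "Variant"])
# Rows whose root is one of those words deserve a second look.
suspect = toy[is_variant & toy["Root"].isin(non_root_words)]
print(suspect)
```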

## Processed data

In [272]:
processed = rawdf[["Variant", "Ultimate_root"]].rename(columns = {"Ultimate_root" : "Root"}).drop_duplicates().reset_index(drop = True)

In [282]:
processed.sample(10)

|  | Variant | Root |
|---|---|---|
| 208947 | plafonnait | plafonner |
| 106393 | boitassent | boiter |
| 1684 | tricycle | tricycle |
| 68151 | imploriez | imploriez |
| 329046 | notabiliserez | notabiliser |
| 271644 | trépidé | trépider |
| 360188 | gourgourans | gourgouran |
| 115237 | détecteront | détecter |
| 379610 | bourguignote | bourguignote |
| 362181 | ne plus se connaître | ne plus se connaître |


In [283]:
import os

save_dir = "../../../ProcessedData/Variants/French/"
os.makedirs(save_dir, exist_ok=True)  # create the folder if it does not exist yet

In [284]:
processed.to_csv(save_dir + "kaikki.csv", encoding='utf-16', index=False)
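
Since the file is written as UTF-16, whoever reads it back needs to pass the same encoding. A small round-trip sketch follows. The two rows and the temporary folder are placeholders, not the real output, and `root_of` is a hypothetical name for the resulting lookup.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows standing in for the processed table.
demo = pd.DataFrame({"Variant": ["eue", "œufs"], "Root": ["avoir", "œufs"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kaikki.csv")
    demo.to_csv(path, encoding="utf-16", index=False)
    # The encoding must match the one used when writing.
    loaded = pd.read_csv(path, encoding="utf-16")

# Plain dict for fast variant -> root lookups.
root_of = dict(zip(loaded["Variant"], loaded["Root"]))
print(root_of["eue"])
```

A plain dict keeps only one root per spelling, which is fine here because the saved rows were already de-duplicated on the (Variant, Root) pair, but a spelling with two different roots would still collapse to one.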