# Top Frequent French words by POS

## How were the words selected? [top_frequent_POS.csv files]

* The words were selected using __[OpenLexicon](http://www.lexique.org/shiny/openlexicon/)__, a tool for browsing lexical databases and querying them by POS, lemma, etc.  
* For the grammatical categories NOUN, VERB, ADJ and ADV, I selected all forms, **ordered by frequency** in one of the lexicon's source corpora (column **freqlemlivres**).  
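
The selection step can be sketched with pandas as follows. This is only an illustration: it assumes the OpenLexicon export has been loaded into a DataFrame with the columns that appear later in this notebook (`Word`, `lemme`, `cgram`, `freqlemlivres`); the toy rows and the helper name `select_by_pos` are hypothetical and not part of OpenLexicon.

```python
import pandas as pd

# Toy stand-in for an OpenLexicon export (frequency values are illustrative only).
lexicon = pd.DataFrame(
    {
        "Word": ["jours", "homme", "mange", "jour", "hommes"],
        "lemme": ["jour", "homme", "manger", "jour", "homme"],
        "cgram": ["NOM", "NOM", "VER", "NOM", "NOM"],
        "freqlemlivres": [2.0, 3.0, 1.5, 2.0, 3.0],
    }
)


def select_by_pos(df, cgram):
    """Keep the forms tagged `cgram`, ordered by decreasing lemma frequency."""
    subset = df[df["cgram"] == cgram]
    ordered = subset.sort_values("freqlemlivres", ascending=False)
    return ordered.reset_index(drop=True)


top_nouns = select_by_pos(lexicon, "NOM")
print(top_nouns)
```

All inflected forms of a lemma share the lemma's frequency, which is why singular and plural forms end up next to each other in the csv files.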

For more information, see:  
Pallier, Christophe, New, Boris & Bourgin, Jessica (2019). Openlexicon. GitHub repository, https://github.com/chrplr/openlexicon

### Post-processing
To facilitate our studies, we decided to distinguish two subsets among these frequent words:  
- Words that are contained in Morphalou 
- Words that are contained in Morphalou and whose grammatical category is unambiguous from their form alone (e.g. "lit" can be a NOUN or a VERB, so it is ambiguous)
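
As a rough illustration of these two filters, the sketch below uses small hand-written sets in place of the real Morphalou inventories. The sets, the candidate list and the helper name `ambiguous_forms` are hypothetical; the cells further down apply the same idea to the actual csv files.

```python
# Toy stand-ins for the per-category Morphalou form inventories.
noun_forms = {"lit", "porte", "maison"}
verb_forms = {"lit", "porte", "mange"}
adj_forms = {"grand", "ferme"}


def ambiguous_forms(*inventories):
    """Return the forms that occur in at least two of the given inventories."""
    seen, ambiguous = set(), set()
    for inventory in inventories:
        ambiguous |= seen & inventory  # already seen in an earlier category
        seen |= inventory
    return ambiguous


ambig = ambiguous_forms(noun_forms, verb_forms, adj_forms)
all_forms = noun_forms | verb_forms | adj_forms

candidates = ["lit", "maison", "porte", "grand", "zzz"]
in_morphalou = [w for w in candidates if w in all_forms]  # first subset
pure = [w for w in in_morphalou if w not in ambig]        # second subset

print(sorted(ambig))
print(in_morphalou)
print(pure)
```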

### [top_frequent_POS_Morphalou.csv files]

In [1]:
import pandas as pd

forms = pd.read_csv("top_frequent_NOUN.csv")
print(forms)
Morphalou = set(pd.read_csv("../FlauBERT_WE/all_nouns_we.csv")["Unnamed: 0"])

to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] not in Morphalou]
forms = forms.drop(forms.index[to_remove])
print(forms)

forms.to_csv("top_frequent_NOUN_Morphalou.csv")
    

             Word      lemme cgram genre nombre  freqlemlivres
0           homme      homme   NOM     m      s        1398.85
1          hommes      homme   NOM     m      p        1398.85
2            jour       jour   NOM     m      s        1341.76
3           jours       jour   NOM     m      p        1341.76
4           temps      temps   NOM     m      s        1289.39
...           ...        ...   ...   ...    ...            ...
48282   évitement  évitement   NOM     m      s           0.00
48283   évènement  évènement   NOM     m      s           0.00
48284  évènements  évènement   NOM     m      p           0.00
48285     être-là    être-là   NOM     m    NaN           0.00
48286     îlotier    îlotier   NOM     m      s           0.00

[48287 rows x 6 columns]
                 Word           lemme cgram genre nombre  freqlemlivres
0               homme           homme   NOM     m      s        1398.85
1              hommes           homme   NOM     m      p        1398.85
2 

In [2]:
# also for VERBS
forms = pd.read_csv("top_frequent_VERB.csv")
Morphalou = set(pd.read_csv("../FlauBERT_WE/all_verb_we.csv")["Unnamed: 0"])
to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] not in Morphalou]
forms = forms.drop(forms.index[to_remove])
forms.to_csv("top_frequent_VERB_Morphalou.csv")

In [3]:
# and adjectives
import pandas as pd

forms = pd.read_csv("top_frequent_ADJ.csv")
Morphalou = set(pd.read_csv("../FlauBERT_WE/all_adjectives_we.csv")["Unnamed: 0"])
to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] not in Morphalou]
forms = forms.drop(forms.index[to_remove])
forms.to_csv("top_frequent_ADJ_Morphalou.csv")
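
The three cells above repeat the same filter. A possible way to factor it out is sketched below, using `Series.isin` instead of an explicit index loop. The helper name `keep_known_forms` and the toy vocabulary are assumptions made for the example; the three rows are taken from the noun table printed above.

```python
import pandas as pd


def keep_known_forms(forms, vocabulary, column="Word"):
    """Return the rows of `forms` whose `column` value belongs to `vocabulary`."""
    return forms[forms[column].isin(vocabulary)]


# Three rows from the noun table, with a toy vocabulary.
forms = pd.DataFrame(
    {
        "Word": ["homme", "être-là", "jour"],
        "freqlemlivres": [1398.85, 0.0, 1341.76],
    }
)
vocabulary = {"homme", "jour"}

filtered = keep_known_forms(forms, vocabulary)
print(filtered)
```

Like the `drop`-based version, this keeps the original row labels, so saving the result without `index=False` would still write them as an extra unnamed column.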

### [top_frequent_pure_POS_Morphalou.csv files]

In [4]:
# removing nouns, verbs and adjectives that are not "pure" (POS ambiguity across categories)
nouns_df = pd.read_csv("../Morphalou/all_nouns.csv", dtype= str)
nouns = []
for c in nouns_df.columns:
    if c not in ["Unnamed: 0", "gender", "category"]:
        nouns.extend(list(nouns_df[c]))
morphalou_nouns = set(nouns)

verbs_df = pd.read_csv("../Morphalou/all_verbs.csv", dtype= str)
verbs = []
for c in verbs_df.columns:
    if c != "Unnamed: 0":
        verbs.extend(list(verbs_df[c]))
morphalou_verbs = set(verbs)

adj_df = pd.read_csv("../Morphalou/all_adjectives.csv", dtype= str)
adj = []
for c in adj_df.columns:
    if c!= "Unnamed: 0":
        adj.extend(list(adj_df[c]))
morphalou_adj = set(adj)

# forms shared by at least two categories are considered ambiguous
noun_verbs = morphalou_nouns.intersection(morphalou_verbs)
noun_adj = morphalou_nouns.intersection(morphalou_adj)
verb_adj = morphalou_verbs.intersection(morphalou_adj)
morphalou_ambig = noun_verbs.union(noun_adj, verb_adj)
morphalou_ambig


{'prêches',
 'grandes',
 'spécialisée',
 'gourmandés',
 'syngenèse',
 'sud-vietnamiennes',
 'vogoul',
 'incréée',
 'occupés',
 'prédécoupées',
 'superfluide',
 'bordelais',
 'replantées',
 'suture',
 'sanctionnés',
 'mobilisant',
 'remâché',
 'bichons',
 'sumérien',
 'gallophobe',
 'vocatifs',
 'surdéterminé',
 'admise',
 'transfini',
 'couvrantes',
 'rhumier',
 'chaperonnier',
 'matabiches',
 'dîner',
 'aigri',
 'escroqué',
 'entaillés',
 'irréguliers',
 'récris',
 'diurétiques',
 'rengagé',
 'éligible',
 'routine',
 'songeurs',
 'fouinard',
 'fruitions',
 'misonéistes',
 'égyptienne',
 'épargne',
 'prédorsales',
 'forlonge',
 'accointes',
 'dévouées',
 'annulaires',
 'viroles',
 'assimilationnistes',
 'scolastiques',
 'pyrophyte',
 'tousseurs',
 'déraciné',
 'dernières',
 'aguichée',
 'récusant',
 'basales',
 'linotypiste',
 'délustrée',
 'agaçant',
 'occipitaux',
 'concédées',
 'dérobeuse',
 'creuser',
 'filoche',
 'chevrotant',
 'cyclocéphale',
 'senestres',
 'clocher',
 'paniqués'

In [5]:
forms = pd.read_csv("top_frequent_NOUN_Morphalou.csv")
to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] in morphalou_ambig]
forms = forms.drop(forms.index[to_remove])
print(forms)

forms.to_csv("top_frequent_pure_NOUN_Morphalou.csv", index=False)

       Unnamed: 0            Word           lemme cgram genre nombre  \
0               0           homme           homme   NOM     m      s   
1               1          hommes           homme   NOM     m      p   
2               2            jour            jour   NOM     m      s   
3               3           jours            jour   NOM     m      p   
4               5            oeil            oeil   NOM     m      s   
...           ...             ...             ...   ...   ...    ...   
11348       48267        éthylène        éthylène   NOM     m      s   
11349       48269      étiquetage      étiquetage   NOM     m      s   
11350       48274  évangélisation  évangélisation   NOM     f      s   
11351       48283       évènement       évènement   NOM     m      s   
11352       48284      évènements       évènement   NOM     m      p   

       freqlemlivres  
0            1398.85  
1            1398.85  
2            1341.76  
3            1341.76  
4            1234.59

In [6]:
forms = pd.read_csv("top_frequent_VERB_Morphalou.csv")
to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] in morphalou_ambig]
forms = forms.drop(forms.index[to_remove])
print(forms)

forms.to_csv("top_frequent_pure_VERB_Morphalou.csv", index=False)

      Unnamed: 0         Word        lemme cgram  freqlemlivres
0              0           es         être   VER       15085.47
2              2       furent         être   VER       15085.47
3              3          fus         être   VER       15085.47
4              5      fussent         être   VER       15085.47
7             13         sera         être   VER       15085.47
...          ...          ...          ...   ...            ...
8976       64773         sais         sais   VER           0.00
8980       64876       tiller       tiller   VER           0.00
8981       64896  télécharger  télécharger   VER           0.00
8982       64900  téléchargez  télécharger   VER           0.00
8984       64913       voulez       vouler   VER           0.00

[4220 rows x 5 columns]


In [7]:
forms = pd.read_csv("top_frequent_ADJ_Morphalou.csv")
to_remove = [i for i in range(len(forms)) if forms.loc[i,"Word"] in morphalou_ambig]
forms = forms.drop(forms.index[to_remove])
print(forms)

forms.to_csv("top_frequent_pure_ADJ_Morphalou.csv", index=False)

      Unnamed: 0           Word        lemme    cgram genre  freqlemlivres
18            26            sûr          sûr      ADJ     m         412.97
19            27           sûrs          sûr      ADJ     m         412.97
31            47          aucun        aucun  ADJ:ind     m         180.95
44            61          léger        léger      ADJ     m         151.01
60            78           cher         cher      ADJ     m         133.65
...          ...            ...          ...      ...   ...            ...
4801       22068  relationnelle  relationnel      ADJ     f           0.00
4803       22087      salariale     salarial      ADJ     f           0.00
4804       22108      sociétale     sociétal      ADJ     f           0.00
4805       22124   structurelle   structurel      ADJ     f           0.00
4807       22168          trine         trin      ADJ     f           0.00

[800 rows x 6 columns]


In [8]:
# adding number information
import pandas as pd
adj_df = pd.read_csv("top_frequent_pure_ADJ_Morphalou.csv")
morph = pd.read_csv("../Morphalou/all_adjectives.csv")
number = []

for form in list(adj_df["Word"]):
    j = 0  # number of labels appended for this form
    if form == "publique":
        number.append("singular")
    elif form == "publiques":
        number.append("plural")
    else:
        for col in morph.columns:
            if col != "lemma" and form in list(morph[col]):
                if "singular" in col:
                    number.append("singular")
                    j += 1
                elif "plural" in col:
                    number.append("plural")
                    j += 1
                else:
                    number.append("invariable")
                    j += 1
    # report forms matched in several columns (they would misalign the list)
    if j > 1:
        print(form)

print(len(number), len(list(adj_df["Word"])))

800 800
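
The loop above scans every Morphalou column for every form. One possible alternative, sketched here on toy data, builds a form-to-number lookup once and then queries it. The column names of `morph`, the helper `number_of` and the `"unknown"` fallback are assumptions for illustration; only the "singular" / "plural" substring test comes from the cell above.

```python
import pandas as pd

# Toy stand-in for ../Morphalou/all_adjectives.csv (column names are hypothetical).
morph = pd.DataFrame(
    {
        "lemma": ["sûr", "léger"],
        "masculine singular": ["sûr", "léger"],
        "masculine plural": ["sûrs", "légers"],
        "feminine singular": ["sûre", "légère"],
    }
)


def number_of(column_name):
    """Map a column name to a number label, as in the loop above."""
    if "singular" in column_name:
        return "singular"
    if "plural" in column_name:
        return "plural"
    return "invariable"


# Build the lookup once: form -> set of number labels found for it.
form_numbers = {}
for col in morph.columns:
    if col == "lemma":
        continue
    for form in morph[col].dropna():
        form_numbers.setdefault(form, set()).add(number_of(col))

words = ["sûr", "sûrs", "légère", "zzz"]
numbers = [sorted(form_numbers.get(w, {"unknown"})) for w in words]
print(numbers)
```

Keeping a set of labels per form makes forms with more than one possible number visible, which is the case the original loop signals by printing the form when `j > 1`.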


In [9]:
adj_df["number"] = number

In [10]:
adj_df.to_csv("top_frequent_pure_ADJ_Morphalou.csv", index = False)

In [11]:
adj_df

Unnamed: 0.1,Unnamed: 0,Word,lemme,cgram,genre,freqlemlivres,number
0,26,sûr,sûr,ADJ,m,412.97,singular
1,27,sûrs,sûr,ADJ,m,412.97,plural
2,47,aucun,aucun,ADJ:ind,m,180.95,singular
3,61,léger,léger,ADJ,m,151.01,singular
4,78,cher,cher,ADJ,m,133.65,singular
...,...,...,...,...,...,...,...
795,22068,relationnelle,relationnel,ADJ,f,0.00,singular
796,22087,salariale,salarial,ADJ,f,0.00,singular
797,22108,sociétale,sociétal,ADJ,f,0.00,singular
798,22124,structurelle,structurel,ADJ,f,0.00,singular


## How to access the words?

### Example: most frequent nouns

In [12]:
import pandas as pd

# load the noun forms (the file is already ordered by lemma frequency)
df = pd.read_csv("top_frequent_NOUN.csv", encoding="utf-8")
NOUNS = list(df["Word"])
print(NOUNS[:1000])

['homme', 'hommes', 'jour', 'jours', 'temps', 'oeil', 'oeils', 'yeux', 'main', 'mains', 'fois', 'chose', 'choses', 'peu', 'femme', 'femmes', 'heure', 'heures', 'tête', 'têtes', 'vie', 'vies', 'coup', 'coups', 'mère', 'mères', 'monde', 'mondes', 'nuit', 'nuits', 'enfant', 'enfants', 'père', 'pères', 'air', 'airs', 'moment', 'moments', 'an', 'ans', 'porte', 'portes', 'voix', 'fille', 'filles', 'maison', 'maisons', 'côté', 'côtés', 'visage', 'visages', 'rue', 'rues', 'soir', 'soirs', 'mot', 'mots', 'bras', 'pied', 'pieds', 'corps', 'place', 'places', 'eau', 'eaux', 'terre', 'terres', 'mort', 'morte', 'mortes', 'morts', 'regard', 'regards', 'chambre', 'chambres', 'gens', 'amour', 'amours', 'coeur', 'coeurs', 'peine', 'peines', 'bout', 'bouts', 'matin', 'matins', 'dieu', 'dieux', 'nom', 'noms', 'table', 'tables', 'fond', 'année', 'années', 'histoire', 'histoires', 'fait', 'faits', 'guerre', 'guerres', 'ami', 'amie', 'amies', 'amis', 'ville', 'villes', 'doute', 'doutes', 'force', 'forces', '