# Retrieving Noun Information in Word Embeddings using Minimum Intra-class STD + Maximum Inter-class Average Difference
This notebook attempts to isolate the word-embedding dimensions that encode **part-of-speech information** (noun vs. not noun) in a sample of FlauBERT word embeddings for **NOUNS**, **ADJECTIVES** and **VERBS**.  

## 0. Data Loading

In [2]:
import pandas as pd

# Load the FlauBERT word embeddings (WE). The grammatical annotations are dropped
# so that only the embedding dimensions remain, then a binary class label is added.
nouns = pd.read_csv('../Data/FlauBERT_WE/all_nouns_we.csv', index_col=0).drop(columns=["number", "gender"])
nouns['noun'] = 1
adjs = pd.read_csv('../Data/FlauBERT_WE/all_adjectives_we.csv', index_col=0).drop(columns=["number", "gender"])
adjs['noun'] = 0
verbs = pd.read_csv('../Data/FlauBERT_WE/all_verb_we.csv', index_col=0)
verbs['noun'] = 0
all_we = pd.concat([nouns, adjs, verbs])

# Min-max normalization, applied column by column (the 0/1 'noun' label is left unchanged)
normalized_data = (all_we - all_we.min())/(all_we.max() - all_we.min())
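
The normalization above rescales every column to the range [0, 1] with x' = (x - min) / (max - min). A minimal sketch on a made-up toy frame (the words and values are illustrative assumptions, not taken from the FlauBERT data) shows the effect, including the fact that a 0/1 label column passes through unchanged:

```python
import pandas as pd

# Toy frame standing in for the embeddings; all values are invented for illustration.
toy = pd.DataFrame(
    {"0": [2.0, 4.0, 6.0], "1": [-1.0, 0.0, 1.0], "noun": [1, 0, 0]},
    index=["chat", "petit", "manger"],
)

# Column-wise min-max scaling: min() and max() are computed per column,
# and pandas aligns them with the frame's columns during the arithmetic.
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_norm)
```

One caveat of this scaling: a column whose values are all identical would give 0/0 and therefore NaN, so it may be worth checking for constant columns before relying on the result.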

## 1. Lowest intra-class standard deviation

Separate the data into two classes: noun and not noun.

In [8]:
noun_norm = normalized_data[normalized_data["noun"] == 1]
not_noun_norm = normalized_data[normalized_data["noun"] == 0]

Lowest std for class **noun**:

In [7]:
noun_norm.loc[:,:'511'].std().sort_values()[:10]

441    0.099424
157    0.100420
287    0.101737
365    0.101786
209    0.102275
277    0.103280
421    0.103544
314    0.103922
60     0.104299
83     0.104794
dtype: float64

Lowest std for class **not_noun**:

In [9]:
not_noun_norm.loc[:,:'511'].std().sort_values()[:10]

69     0.098447
157    0.098538
277    0.099792
422    0.099919
441    0.100257
314    0.100976
243    0.101522
287    0.101838
25     0.102170
293    0.102694
dtype: float64
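
Several dimensions appear in both lists, which suggests that they simply vary little overall rather than being stable within one class only. A small sketch, using the dimension labels copied from the two outputs above, makes the overlap explicit:

```python
# Top-10 lowest-std dimensions, copied from the two outputs above.
noun_low_std = ["441", "157", "287", "365", "209", "277", "421", "314", "60", "83"]
not_noun_low_std = ["69", "157", "277", "422", "441", "314", "243", "287", "25", "293"]

# Keep the order of the noun list and retain only the labels found in both lists.
not_noun_set = set(not_noun_low_std)
shared = [dim for dim in noun_low_std if dim in not_noun_set]
print(shared)  # ['441', '157', '287', '277', '314']
```

With the real frames, the same lists could presumably be obtained from `noun_norm.loc[:, :'511'].std().nsmallest(10).index` and its `not_noun_norm` counterpart instead of being typed by hand.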

## 2. Dimensions with greatest difference between avg_noun and avg_not_noun

This seems encouraging: only a few dimensions stand out with the largest differences in each of the datasets considered. We also see higher differences in adjectives.

In [20]:
# Top 10 dimensions with the largest absolute difference between the two class means
(noun_norm.loc[:, :'511'].mean() - not_noun_norm.loc[:, :'511'].mean()).abs().sort_values(ascending=False)[:10]

159    0.080651
409    0.070257
458    0.064973
504    0.064909
346    0.064635
480    0.064218
401    0.061211
212    0.059305
128    0.057418
305    0.057274
dtype: float64
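
In the outputs above, none of the ten dimensions with the largest mean difference appears among the ten lowest-std dimensions of either class, so the two criteria do not single out the same dimensions on their own. One possible way to combine them is a single score per dimension. The sketch below assumes a Fisher-like ratio (absolute mean difference divided by the sum of the two class standard deviations); this ratio and the helper name `separation_score` are illustrative assumptions, not the method used in this notebook, and the toy values are invented.

```python
import pandas as pd

def separation_score(class_a: pd.DataFrame, class_b: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: |mean_a - mean_b| / (std_a + std_b) for each column."""
    mean_gap = (class_a.mean() - class_b.mean()).abs()
    spread = class_a.std() + class_b.std()
    return (mean_gap / spread).sort_values(ascending=False)

# Made-up toy data: dimension "0" separates the classes, dimension "1" does not.
toy_noun = pd.DataFrame({"0": [0.9, 0.8, 1.0], "1": [0.2, 0.8, 0.5]})
toy_not_noun = pd.DataFrame({"0": [0.1, 0.0, 0.2], "1": [0.3, 0.7, 0.5]})

scores = separation_score(toy_noun, toy_not_noun)
print(scores)
```

On the notebook's data, a call such as `separation_score(noun_norm.loc[:, :'511'], not_noun_norm.loc[:, :'511']).head(10)` would rank dimensions that are both far apart between classes and tight within each class.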