## Hands-on Project 

### Topic: Measuring Bias in Crowdsourced Datasets

### Goals

The primary objective is to identify and measure social biases embedded in a crowdsourced NLP dataset. These biases might manifest as stereotypes associated with identity labels (e.g., gender, race, or occupation) and could harm the fairness of downstream tasks, such as classification.

By performing a bias audit, we aim to:

1. Explore how social stereotypes are encoded in datasets.
2. Use pointwise mutual information (PMI) to quantitatively assess these biases.
3. Understand the ethical implications of training models on biased data.

---

### Key Concepts

#### 1. Crowdsourcing in NLP
- Crowdsourcing is widely used in NLP to gather large-scale annotations or generated text.
- However, the human annotators’ biases can influence the dataset, embedding stereotypes and potentially harmful associations.
---
#### 2. Pointwise Mutual Information (PMI)
PMI is a statistical measure that captures how strongly two words (or terms) are associated in a dataset compared to what we’d expect if they were independent.

*Formula:*

<div align="center" style="font-size: 1.5em;">
    \[
    PMI(x, y) = \log \frac{P(x, y)}{P(x) \cdot P(y)}
    \]
</div>

*Where:*
- \( P(x, y) \): Probability of \(x\) and \(y\) co-occurring.
- \( P(x) \): Probability of \(x\) appearing.
- \( P(y) \): Probability of \(y\) appearing.

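*A high positive PMI value indicates that two words co-occur more often than independence would predict. A value near zero indicates that they occur roughly independently, and a negative value indicates that they co-occur less often than expected.*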
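To make the sign of PMI concrete, here is a minimal sketch that evaluates the formula for three made-up probability settings. The function name `pmi_from_probs` and the numbers are illustrative assumptions, and base-2 logarithms are used here so the result is in bits.

```python
import math

def pmi_from_probs(p_xy: float, p_x: float, p_y: float) -> float:
    """PMI in bits (log base 2) from a joint and two marginal probabilities."""
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical setting: x and y each appear in 10% of examples,
# so independence predicts a joint probability of 0.1 * 0.1 = 0.01.
positive = pmi_from_probs(0.04, 0.1, 0.1)    # joint is 4x the independent rate
zero = pmi_from_probs(0.01, 0.1, 0.1)        # joint matches the independent rate
negative = pmi_from_probs(0.0025, 0.1, 0.1)  # joint is a quarter of that rate

print(positive, zero, negative)
```

The three values come out close to 2, 0, and -2 bits respectively, up to floating-point rounding.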

---
#### 3. Application of PMI for Bias Analysis
   - Use PMI to compute the associations between identity labels (e.g., "man", "woman", "Black", "White") and other terms in the dataset.
   - Example: If "doctor" has a higher PMI with "man" than "woman", it may suggest a gendered stereotype in the dataset.
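The comparison described above can be sketched on a tiny hand-made corpus. Everything here is hypothetical: the six toy examples, the helper `doc_pmi`, and the choice to count co-occurrence once per example with a single total `n_docs`. It is an illustration of the idea, not the notebook's exact computation.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each inner list is one tokenized example.
docs = [
    ["man", "doctor", "hospital"],
    ["man", "doctor"],
    ["woman", "nurse", "hospital"],
    ["woman", "doctor"],
    ["man", "surfing"],
    ["woman", "nurse"],
]

n_docs = len(docs)
# Number of examples containing each word.
word_count = Counter(w for d in docs for w in set(d))
# Number of examples containing each unordered pair of words.
pair_count = Counter(
    frozenset(p) for d in docs for p in combinations(sorted(set(d)), 2)
)

def doc_pmi(x, y):
    """Example-level PMI in bits, or None if the pair never co-occurs."""
    c_xy = pair_count[frozenset((x, y))]
    if c_xy == 0:
        return None
    return math.log2(n_docs * c_xy / (word_count[x] * word_count[y]))

print(doc_pmi("man", "doctor"), doc_pmi("woman", "doctor"))
```

In this toy data "doctor" scores higher with "man" than with "woman", which is the kind of gap the audit looks for. With so few examples the numbers carry no statistical weight.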

---


### Overview of the Exercise

#### 1. Dataset:  

 - Use the **SNLI** (Stanford Natural Language Inference) dataset, which consists of crowdsourced text pairs labeled as "entailment," "neutral," or "contradiction," designed for studying natural language inference.
 - These labels are annotated by human contributors, whose perceptions and potential biases may influence the dataset.
 - Sentence 1: The premise shown to the annotators.
 - Sentence 2: The hypothesis that an annotator writes in response to the premise.
 - Annotator Labels: The labels (e.g., neutral, contradiction, entailment) assigned by human annotators to describe the relationship between the premise (sentence 1) and hypothesis (sentence 2).
 - Gold Labels: The reference label for each pair, taken as the majority vote of the annotator labels (`-` when no majority exists).



#### 2. Identity Labels:  

 - Load the identity labels from the file `identity_labels.txt`, which contains a list of keywords commonly associated with demographic, social, and cultural identities.
 - Identity labels are terms and phrases associated with specific social groups, such as those related to gender, race, ethnicity, or other markers of identity.
 - The list of identity labels is based on Rudinger et al. (2017).



#### 3. Compute PMI:

  - For each identity label, calculate PMI values with other words in the dataset.
  - Identify terms with high PMI values for each label.


#### 4. Interpretation:

  - Analyze the results to see if the dataset encodes stereotypes.
  - This project aims to uncover patterns of bias, particularly stereotypes associated with identity labels, and assess how these biases could impact the fairness and reliability of NLP systems trained on the SNLI dataset.

**Load and prepare dataset**

In [1]:
#List of identity labels (based on Rudinger et al. 2017)
with open("C:/Users/Hp/Downloads/identity_labels.txt", 'r') as f:
    identity_labels = f.read().split("\n")
print(identity_labels)

['woman', 'women', 'man', 'men', 'girl', 'girls', 'boy', 'boys', 'she', 'he', 'her', 'him', 'his', 'female', 'male', 'mother', 'father', 'sister', 'brother', 'daughter', 'son', 'feminine', 'masculine', 'androgynous', 'trans', 'transgender', 'transsexual', 'nonbinary', 'non-binary', 'two-spirit', 'hijra', 'genderqueer', 'black', 'asian', 'hispanic', 'white', 'african', 'american', 'latino', 'latina', 'caucasian', 'africans', 'middle-eastern', 'australian', 'australians', 'asians', 'european', 'europeans', 'chinese', 'indian', 'indonesian', 'brazilian', 'pakistani', 'bangladeshi', 'russian', 'nigerian', 'japanese', 'mexican', 'filipino ', 'vietnamese ', 'german', 'egyptian', 'ethiopian', 'turkish', 'iranian', 'thai', 'congolese', 'french', 'british ', 'italian', 'korean', 'burmese', 'canadian ', 'australian ', 'spanish', 'dutch', 'swiss', 'saudi', 'argentinian ', 'taiwanese ', 'swedish ', 'belgian', 'polish', 'israeli', 'irish', 'greek', 'ukrainian ', 'jamaican ', 'mongolian', 'armenian'
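Several entries in the printed list carry trailing spaces (for example `'filipino '`) and `'australian'` appears twice, so such labels would never match a lowercased token exactly. A minimal cleaning sketch follows; the helper name `normalize_labels` and the sample input are assumptions for illustration.

```python
def normalize_labels(raw_labels):
    """Strip whitespace, lowercase, drop empties and duplicates, keep order."""
    seen = set()
    cleaned = []
    for label in raw_labels:
        label = label.strip().lower()
        if label and label not in seen:
            seen.add(label)
            cleaned.append(label)
    return cleaned

# Hypothetical sample that mimics the issues visible in the printed list.
raw = ["australian", "filipino ", "australian ", "british ", ""]
print(normalize_labels(raw))  # ['australian', 'filipino', 'british']
```

Applying something like this right after reading the file would keep the label list aligned with the tokenized text.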

In [2]:
#Load SNLI dataset
import pandas as pd
snli_data = pd.read_json(path_or_buf="C:/Users/Hp/Downloads/snli_1.0_train.jsonl", lines=True)
snli_data

Unnamed: 0,annotator_labels,captionID,gold_label,pairID,sentence1,sentence1_binary_parse,sentence1_parse,sentence2,sentence2_binary_parse,sentence2_parse
0,[neutral],3416050480.jpg#4,neutral,3416050480.jpg#4r1n,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,A person is training his horse for a competition.,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
1,[contradiction],3416050480.jpg#4,contradiction,3416050480.jpg#4r1c,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,"A person is at a diner, ordering an omelette.",( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
2,[entailment],3416050480.jpg#4,entailment,3416050480.jpg#4r1e,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,"A person is outdoors, on a horse.","( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
3,[neutral],2267923837.jpg#2,neutral,2267923837.jpg#2r1n,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at c...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,They are smiling at their parents,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...
4,[entailment],2267923837.jpg#2,entailment,2267923837.jpg#2r1e,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at c...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,There are children present,( There ( ( are children ) present ) ),(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...
...,...,...,...,...,...,...,...,...,...,...
550147,[contradiction],2267923837.jpg#3,contradiction,2267923837.jpg#3r1c,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted childr...,(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) ...,four kids won awards for 'cleanest feet',( ( four kids ) ( ( won awards ) ( ( ( for ` )...,(ROOT (S (NP (CD four) (NNS kids)) (VP (VBD wo...
550148,[neutral],2267923837.jpg#3,neutral,2267923837.jpg#3r1n,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted childr...,(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) ...,"four homeless children had their shoes stolen,...",( ( ( ( ( ( four ( homeless children ) ) ( had...,(ROOT (S (S (NP (CD four) (JJ homeless) (NNS c...
550149,[neutral],7979219683.jpg#2,neutral,7979219683.jpg#2r1n,A man is surfing in a bodysuit in beautiful bl...,( ( A man ) ( ( is ( surfing ( in ( ( a bodysu...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...,A man in a bodysuit is competing in a surfing ...,( ( ( A man ) ( in ( a bodysuit ) ) ) ( ( is (...,(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) ...
550150,[contradiction],7979219683.jpg#2,contradiction,7979219683.jpg#2r1c,A man is surfing in a bodysuit in beautiful bl...,( ( A man ) ( ( is ( surfing ( in ( ( a bodysu...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...,A man in a business suit is heading to a board...,( ( ( A man ) ( in ( a ( business suit ) ) ) )...,(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) ...


In [3]:
snli_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550152 entries, 0 to 550151
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   annotator_labels        550152 non-null  object
 1   captionID               550152 non-null  object
 2   gold_label              550152 non-null  object
 3   pairID                  550152 non-null  object
 4   sentence1               550152 non-null  object
 5   sentence1_binary_parse  550152 non-null  object
 6   sentence1_parse         550152 non-null  object
 7   sentence2               550152 non-null  object
 8   sentence2_binary_parse  550152 non-null  object
 9   sentence2_parse         550152 non-null  object
dtypes: object(10)
memory usage: 42.0+ MB


In [4]:
#De-duplicate
# .copy() makes snli_data_sub an independent DataFrame, which avoids
# pandas' SettingWithCopyWarning on the column assignments that follow.
snli_data_sub = snli_data[['sentence1','sentence2']].copy()
print(len(snli_data_sub))
snli_data_sub = snli_data_sub.drop_duplicates(subset=['sentence1', 'sentence2'], keep='last')
print(len(snli_data_sub))

550152
549526



**Data preparation**

In [5]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
nltk.download('punkt')

from nltk.corpus import stopwords
import string
nltk.download('stopwords')

from itertools import combinations
from tqdm import tqdm

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
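#util functions
stop_words = set(stopwords.words('english'))
# Extra punctuation beyond string.punctuation. The curly quotes are an assumed
# reading of the original characters; adjust them to match your data.
extra_punctuation = ['“', '”', '‘', '’', '—']
# A set gives constant-time membership tests while filtering tokens.
removal_list = stop_words | set(string.punctuation) | set(extra_punctuation) | {'lt', 'rt'}

def remove_stopwords(sent_tokens):
    filtered_words = [word for word in sent_tokens if word not in removal_list]
    return filtered_words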
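One side effect of stop-word removal is worth checking: English stop-word lists typically include personal pronouns, so identity labels such as "she" or "his" may be filtered out before any PMI is computed. The sketch below uses a small hard-coded stand-in for the stop-word list so that it runs without downloads; `labels_lost_to_stopwords` is a hypothetical helper.

```python
def labels_lost_to_stopwords(labels, stop_words):
    """Return the identity labels that stop-word filtering would remove."""
    removed = set(stop_words)
    return [label for label in labels if label.strip().lower() in removed]

# Stand-in subset of a stop-word list, hard-coded so the sketch runs offline.
sample_stop_words = {"a", "the", "is", "he", "she", "her", "him", "his"}
sample_labels = ["woman", "man", "she", "he", "her", "him", "his", "female"]

print(labels_lost_to_stopwords(sample_labels, sample_stop_words))
# ['she', 'he', 'her', 'him', 'his']
```

Running the same check with the real `stop_words` set and `identity_labels` list would show which labels cannot receive PMI scores under this preprocessing.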

In [7]:
#Lowercase
snli_data_sub['sentence1'] = snli_data_sub['sentence1'].map(str.lower)
snli_data_sub['sentence2'] = snli_data_sub['sentence2'].map(str.lower)
snli_data_sub


Unnamed: 0,sentence1,sentence2
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette."
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse."
3,children smiling and waving at camera,they are smiling at their parents
4,children smiling and waving at camera,there are children present
...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet'
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...


In [8]:
#Tokenize
snli_data_sub['sentence1_token'] = snli_data_sub['sentence1'].apply(nltk.word_tokenize)
snli_data_sub['sentence2_token'] = snli_data_sub['sentence2'].apply(nltk.word_tokenize)

snli_data_sub


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ..."
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,..."
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]"
...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet..."
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,...","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, ..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, bodysuit, is, competing, in, a..."
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, business, suit, is, heading, t..."


In [9]:
#Remove stop-words
snli_data_sub['sentence1_token_nostopwords'] = snli_data_sub['sentence1_token'].apply(remove_stopwords)
snli_data_sub['sentence2_token_nostopwords'] = snli_data_sub['sentence2_token'].apply(remove_stopwords)

snli_data_sub


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token,sentence1_token_nostopwords,sentence2_token_nostopwords
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ...","[person, horse, jumps, broken, airplane]","[person, training, horse, competition]"
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,...","[person, horse, jumps, broken, airplane]","[person, diner, ordering, omelette]"
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]","[person, horse, jumps, broken, airplane]","[person, outdoors, horse]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]","[children, smiling, waving, camera]","[smiling, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]","[children, smiling, waving, camera]","[children, present]"
...,...,...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet...","[four, dirty, barefooted, children]","[four, kids, awards, 'cleanest, feet]"
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,...","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, ...","[four, dirty, barefooted, children]","[four, homeless, children, shoes, stolen, feet..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, bodysuit, is, competing, in, a...","[man, surfing, bodysuit, beautiful, blue, water]","[man, bodysuit, competing, surfing, competition]"
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, business, suit, is, heading, t...","[man, surfing, bodysuit, beautiful, blue, water]","[man, business, suit, heading, board, meeting]"


In [10]:
#De-duplicate again: pairs that differ only in stop-words or punctuation collapse to the same token string
print(len(snli_data_sub))
snli_data_sub['sentence1_token_nostopwords_str'] = snli_data_sub['sentence1_token_nostopwords'].apply(lambda x: ' '.join(x))
snli_data_sub['sentence2_token_nostopwords_str'] = snli_data_sub['sentence2_token_nostopwords'].apply(lambda x: ' '.join(x))
snli_data_sub = snli_data_sub.drop_duplicates(subset=['sentence1_token_nostopwords_str', 'sentence2_token_nostopwords_str'], keep='last')
print(len(snli_data_sub))

549526


547562



**Word association analysis**
```latex
PMI(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}
```
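The step from probabilities to counts can be spelled out as follows. This derivation assumes that \(c(\cdot)\) denotes a count and that unigram and pair counts are taken over the same \(N\) units (for instance, examples); the text above does not define these symbols, so treat that reading as an assumption. The numbers in the worked line are hypothetical.

```latex
P(w_i) = \frac{c(w_i)}{N}, \qquad P(w_i, w_j) = \frac{c(w_i, w_j)}{N}

\frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}
  = \frac{c(w_i, w_j)/N}{\bigl(c(w_i)/N\bigr)\bigl(c(w_j)/N\bigr)}
  = \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}

% Hypothetical counts: N = 1000, c(w_i) = 100, c(w_j) = 50, c(w_i, w_j) = 20
PMI(w_i, w_j) = \log_2 \frac{1000 \cdot 20}{100 \cdot 50} = \log_2 4 = 2
```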


In [11]:
#Computing Unigram frequency
corpus = []
for sent in (snli_data_sub['sentence1_token_nostopwords'].tolist() + snli_data_sub['sentence2_token_nostopwords'].tolist()):
  corpus+=(sent)
unigram_frequency = Counter(corpus)
unigram_frequency.most_common(20)

[('man', 265138),
 ('woman', 137105),
 ('two', 121714),
 ('people', 120802),
 ('wearing', 80592),
 ('young', 61353),
 ('men', 60820),
 ('playing', 59220),
 ('girl', 59094),
 ('boy', 58041),
 ('white', 56785),
 ('shirt', 56204),
 ('black', 54824),
 ('dog', 53690),
 ('sitting', 53544),
 ('blue', 49040),
 ('standing', 46189),
 ('red', 43129),
 ('group', 43057),
 ('walking', 38678)]

In [12]:
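#Remove words that occur fewer than 15 times
from itertools import dropwhile
print(len(unigram_frequency))
# most_common() is sorted by descending count, so dropwhile skips the frequent
# words and yields only the tail with count < 15, which is then deleted.
for key, count in dropwhile(lambda key_count: key_count[1] >= 15, unigram_frequency.most_common()):
    del unigram_frequency[key]
print(len(unigram_frequency))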

36420
10662
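The same frequency cut-off can also be written as a filter that returns a new `Counter` instead of deleting entries in place. This is a sketch of an alternative, not the notebook's method; `filter_by_min_count` and the toy counts are hypothetical.

```python
from collections import Counter

def filter_by_min_count(counts, min_count):
    """Return a new Counter holding only entries with count >= min_count."""
    return Counter({key: c for key, c in counts.items() if c >= min_count})

# Hypothetical counts for illustration.
toy = Counter({"man": 20, "woman": 17, "omelette": 3, "diner": 15})
filtered = filter_by_min_count(toy, 15)
print(sorted(filtered))  # ['diner', 'man', 'woman']
```

Leaving the original counter untouched makes it easier to try several thresholds and compare vocabulary sizes.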


In [13]:
snli_data_sub['sentence1_sentence2_token_nostopwords'] = snli_data_sub['sentence1_token_nostopwords'] + snli_data_sub['sentence2_token_nostopwords']

def f7(seq):
    # De-duplicate a list without changing the order of first occurrences.
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

snli_data_sub['sentence1_sentence2_token_nostopwords_dedup'] = snli_data_sub['sentence1_sentence2_token_nostopwords'].apply(f7)



In [14]:
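# Count co-occurrence pairs: every pair of distinct words found in the same
# premise/hypothesis example is counted once for that example. These are
# co-occurrence pairs rather than adjacent-word bigrams; the name is kept
# because later cells use it.
bigram_frequency = Counter()
for ele in tqdm(snli_data_sub['sentence1_sentence2_token_nostopwords_dedup']):
  # ele is already de-duplicated, so combinations() yields each pair once.
  bigram_frequency.update(combinations(ele, 2))
bigram_frequency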

100%|███████████████████████████████████████████████████████████████████████| 547562/547562 [00:26<00:00, 20983.78it/s]


Counter({('broken', 'airplane'): 4,
         ('person', 'horse'): 216,
         ('airplane', 'training'): 1,
         ('airplane', 'competition'): 2,
         ('person', 'jumps'): 263,
         ('jumps', 'training'): 10,
         ('horse', 'jumps'): 91,
         ('person', 'competition'): 106,
         ('jumps', 'competition'): 65,
         ('person', 'training'): 18,
         ('jumps', 'broken'): 7,
         ('horse', 'training'): 7,
         ('horse', 'competition'): 126,
         ('person', 'broken'): 29,
         ('training', 'competition'): 36,
         ('horse', 'broken'): 5,
         ('jumps', 'airplane'): 6,
         ('person', 'airplane'): 37,
         ('horse', 'airplane'): 23,
         ('broken', 'training'): 1,
         ('broken', 'competition'): 1,
         ('ordering', 'omelette'): 1,
         ('airplane', 'diner'): 1,
         ('broken', 'ordering'): 1,
         ('airplane', 'omelette'): 1,
         ('jumps', 'diner'): 1,
         ('person', 'diner'): 5,
         ('jumps
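Because pairs are stored in order of appearance, the same two words can end up under two keys, such as `('person', 'horse')` and `('horse', 'person')`. One way to avoid that, sketched here as an alternative rather than the notebook's approach, is to sort the tokens of each example before forming pairs. The helper `count_pairs` and the toy examples are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_pairs(token_lists):
    """Count unordered co-occurrence pairs, once per example, under sorted keys."""
    pair_counts = Counter()
    for tokens in token_lists:
        unique_sorted = sorted(set(tokens))
        pair_counts.update(combinations(unique_sorted, 2))
    return pair_counts

# Hypothetical examples in which the same two words appear in different orders.
examples = [
    ["person", "horse", "jumps"],
    ["horse", "person", "competition"],
]
pairs = count_pairs(examples)
print(pairs[("horse", "person")])  # 2
```

With canonical keys, a lookup needs only one tuple per pair, and frequency thresholds apply to the combined count rather than to each ordering separately.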

In [15]:
#Remove word pairs that occur fewer than 10 times
from itertools import dropwhile
print(len(bigram_frequency))
for key, count in dropwhile(lambda key_count: key_count[1] >= 10, bigram_frequency.most_common()):
    del bigram_frequency[key]
print(len(bigram_frequency))

3670850
385251


In [16]:
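#Calculate PMI
import math

def pmi(word1, word2, unigram_freq, bigram_freq, unigram_total=None, bigram_total=None):
  """Return the log2 PMI of two words, or None if a word or the pair is unseen."""
  if word1 not in unigram_freq or word2 not in unigram_freq:
    return None
  # Pairs are stored in order of appearance, so add up both orderings.
  pair_count = bigram_freq.get((word1, word2), 0) + bigram_freq.get((word2, word1), 0)
  if pair_count == 0:
    return None
  # Totals are recomputed when not supplied; pass them in to avoid repeated sums.
  # Note: word and pair probabilities are normalized by different totals here.
  unigram_total = unigram_total or sum(unigram_freq.values())
  bigram_total = bigram_total or sum(bigram_freq.values())
  prob_word1 = unigram_freq[word1] / unigram_total
  prob_word2 = unigram_freq[word2] / unigram_total
  prob_word1_word2 = pair_count / bigram_total
  return math.log2(prob_word1_word2 / (prob_word1 * prob_word2))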
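Scoring one label against every vocabulary word repeats a lot of work. A possible alternative, sketched below under the same normalization as the cell above (word counts divided by the word total, pair counts divided by the pair total), scans the pair table once and scores only the words that actually co-occur with the label. The function `pmi_for_label` and the toy counters are hypothetical.

```python
import math
from collections import Counter, defaultdict

def pmi_for_label(label, unigram_freq, pair_freq):
    """PMI (bits) of `label` with each co-occurring word, in one pass over pairs."""
    if label not in unigram_freq:
        return {}
    unigram_total = sum(unigram_freq.values())
    pair_total = sum(pair_freq.values())
    # Collect co-occurrence counts for the label, merging both pair orderings.
    cooccurrence = defaultdict(int)
    for (w1, w2), count in pair_freq.items():
        if w1 == label:
            cooccurrence[w2] += count
        elif w2 == label:
            cooccurrence[w1] += count
    p_label = unigram_freq[label] / unigram_total
    scores = {}
    for word, count in cooccurrence.items():
        if word not in unigram_freq:
            continue  # word was removed by the frequency filter
        p_word = unigram_freq[word] / unigram_total
        scores[word] = math.log2((count / pair_total) / (p_label * p_word))
    return scores

# Hypothetical counts for illustration only.
toy_unigrams = Counter({"woman": 4, "man": 4, "purse": 2, "surfing": 2})
toy_pairs = Counter({("woman", "purse"): 2, ("surfing", "man"): 2})
print(pmi_for_label("woman", toy_unigrams, toy_pairs))
```

A label that is missing from the vocabulary yields an empty dictionary, which mirrors the case where the loop above finishes almost instantly.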


In [17]:
identity_labels

['woman',
 'women',
 'man',
 'men',
 'girl',
 'girls',
 'boy',
 'boys',
 'she',
 'he',
 'her',
 'him',
 'his',
 'female',
 'male',
 'mother',
 'father',
 'sister',
 'brother',
 'daughter',
 'son',
 'feminine',
 'masculine',
 'androgynous',
 'trans',
 'transgender',
 'transsexual',
 'nonbinary',
 'non-binary',
 'two-spirit',
 'hijra',
 'genderqueer',
 'black',
 'asian',
 'hispanic',
 'white',
 'african',
 'american',
 'latino',
 'latina',
 'caucasian',
 'africans',
 'middle-eastern',
 'australian',
 'australians',
 'asians',
 'european',
 'europeans',
 'chinese',
 'indian',
 'indonesian',
 'brazilian',
 'pakistani',
 'bangladeshi',
 'russian',
 'nigerian',
 'japanese',
 'mexican',
 'filipino ',
 'vietnamese ',
 'german',
 'egyptian',
 'ethiopian',
 'turkish',
 'iranian',
 'thai',
 'congolese',
 'french',
 'british ',
 'italian',
 'korean',
 'burmese',
 'canadian ',
 'australian ',
 'spanish',
 'dutch',
 'swiss',
 'saudi',
 'argentinian ',
 'taiwanese ',
 'swedish ',
 'belgian',
 'polish

In [18]:
def Sort(pairs):
    # Sort (word, score) tuples by score in descending order (reverse=True).
    return sorted(pairs, key=lambda a: a[1], reverse=True)

def get_pmi(identity_label,unigram_frequency,bigram_frequency):
  pmi_identity_label=[]
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=identity_label.lower(), word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    # Compare against None so that a PMI of exactly 0.0 is kept.
    if pmi_score is not None:
      pmi_identity_label.append((word,pmi_score))

  return Sort(pmi_identity_label)


In [19]:
male_pmi_data = get_pmi('man',unigram_frequency,bigram_frequency)
female_pmi_data = get_pmi('woman',unigram_frequency,bigram_frequency)
gay_pmi_data = get_pmi('gay',unigram_frequency,bigram_frequency)

100%|████████████████████████████████████████████████████████████████████████████| 10662/10662 [01:59<00:00, 88.90it/s]
100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [01:26<00:00, 123.39it/s]
100%|█████████████████████████████████████████████████████████████████████████| 10662/10662 [00:00<00:00, 19248.51it/s]
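One way to read two ranked lists side by side is to look at words scored for both labels and rank them by the difference in PMI. The sketch below assumes inputs shaped like the `(word, score)` lists returned by `get_pmi`; the helper `pmi_gap` and the toy scores are hypothetical and are not results from SNLI.

```python
def pmi_gap(pmi_a, pmi_b, top_k=3):
    """Words scored for both labels, ranked by PMI(a) - PMI(b), largest first."""
    scores_a, scores_b = dict(pmi_a), dict(pmi_b)
    shared = scores_a.keys() & scores_b.keys()
    gaps = [(word, scores_a[word] - scores_b[word]) for word in shared]
    return sorted(gaps, key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical scores, not results from SNLI.
toy_male = [("surfing", 1.5), ("purse", -1.0), ("doctor", 0.5)]
toy_female = [("purse", 2.0), ("surfing", -0.5), ("doctor", 0.0)]
print(pmi_gap(toy_male, toy_female))
# [('surfing', 2.0), ('doctor', 0.5), ('purse', -3.0)]
```

Calling it with `male_pmi_data` and `female_pmi_data` would surface the words most skewed toward "man"; swapping the arguments gives the words skewed toward "woman".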


In [20]:
search_term = "marble-looking"
mask = snli_data_sub["sentence1"].str.contains(search_term, case=False, na=False)
pd.set_option('display.max_colwidth', 0)
snli_data_sub[mask][["sentence1","sentence2"]]

Unnamed: 0,sentence1,sentence2
61122,a woman with a large purse and boots walking in a marble-looking hall.,a woman with a large purse and wearing a dress walks down a hallway.
61123,a woman with a large purse and boots walking in a marble-looking hall.,a woman with a large bag a tall shoes is walking indoors.
61124,a woman with a large purse and boots walking in a marble-looking hall.,the woman is wearing sandals inside the louvre.
61125,a woman with a large purse and boots walking in a marble-looking hall.,a man is running on a track
61126,a woman with a large purse and boots walking in a marble-looking hall.,the woman is wearing flip-flops.
61127,a woman with a large purse and boots walking in a marble-looking hall.,a woman walks down a dimly lit alley.
61128,a woman with a large purse and boots walking in a marble-looking hall.,a woman is walking indoors.
61129,a woman with a large purse and boots walking in a marble-looking hall.,"the woman is at a museum, looking at paintings."
61130,a woman with a large purse and boots walking in a marble-looking hall.,the woman is walking inside with a large handbag.
61131,a woman with a large purse and boots walking in a marble-looking hall.,a lady with a big pocketbook walking in a hallway.


In [21]:
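from collections import defaultdict as dd
pmi_identity_label = dd(list)
for identity_label in (identity_labels):
  # Some entries carry trailing spaces (e.g. 'filipino '); strip them so they can match tokens.
  identity_label = identity_label.strip().lower()
  if not identity_label:
    continue
  print(identity_label)
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=identity_label, word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    #print("PMI:",pmi_score)
    # Compare against None so that a PMI of exactly 0.0 is kept.
    if pmi_score is not None:
      pmi_identity_label[identity_label].append((word,pmi_score))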

woman


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [01:25<00:00, 124.43it/s]


women


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:39<00:00, 269.16it/s]


man


100%|████████████████████████████████████████████████████████████████████████████| 10662/10662 [02:14<00:00, 78.99it/s]


men


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:41<00:00, 254.31it/s]


girl


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:35<00:00, 297.48it/s]


girls


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:14<00:00, 716.72it/s]


boy


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:34<00:00, 306.00it/s]


boys


100%|███████████████████████████████████████████████████████████████████████████| 10662/10662 [00:12<00:00, 825.07it/s]


she


[Progress-bar output condensed. Each identity term listed below was printed and followed by a completed tqdm bar (100%, 10662/10662 iterations). Most passes finished in under a second; the slowest were black (01:11), young (01:11), white (01:10), asian (00:21), and old (00:21).]

he, her, him, his, female, male, mother, father, sister, brother, daughter, son, feminine, masculine, androgynous, trans, transgender, transsexual, nonbinary, non-binary, two-spirit, hijra, genderqueer, black, asian, hispanic, white, african, american, latino, latina, caucasian, africans, middle-eastern, australian, australians, asians, european, europeans, chinese, indian, indonesian, brazilian, pakistani, bangladeshi, russian, nigerian, japanese, mexican, filipino, vietnamese, german, egyptian, ethiopian, turkish, iranian, thai, congolese, french, british, italian, korean, burmese, canadian, australian, spanish, dutch, swiss, saudi, argentinian, taiwanese, swedish, belgian, polish, israeli, irish, greek, ukrainian, jamaican, mongolian, armenian, disability, disabled, handicap, handicapped, mentally, mental, autistic, autism, lesbian, lesbians, gay, bisexual, pansexual, asexual, queer, straight, muslim, christian, jew, jewish, sikh, buddhist, hindu, atheist, muslims, christians, jews, sikhs, buddhists, hindus, atheists, old, elderly, retired, teenage, young, senior, seniors, teenager, teenagers, and a final empty label.


In [22]:
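# Order each identity term's (word, PMI) list with Sort. Sort is not a Python
# builtin, so it must already be defined in the notebook before this cell runs.
for identity in pmi_identity_label:
    pmi_identity_label[identity] = Sort(pmi_identity_label[identity])


In [23]:
# Print the first 20 (word, PMI) pairs of each identity term's sorted list.
for key, ranked in pmi_identity_label.items():
    print(key, ranked[:20])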

woman [('marble-looking', 3.9246229431677464), ('edwards', 3.9246229431677464), ('bandidos', 3.9246229431677464), ('leaf-strewn', 3.9246229431677464), ('leaf-lined', 3.9246229431677464), ('leave-like', 3.9246229431677464), ('mirthlessly', 3.9246229431677464), ('coke-a-cola', 3.9246229431677464), ('ruts', 3.9246229431677464), ('wearubg', 3.9246229431677464), ('tableau', 3.9246229431677464), ('radishes', 3.9246229431677464), ('pull-overs', 3.9246229431677464), ('denomination', 3.9246229431677464), ('t-short', 3.9246229431677464), ('lavendar', 3.9246229431677464), ('buzzes', 3.9246229431677464), ('homemade-looking', 3.9246229431677464), ('event-', 3.9246229431677464), ('spheres', 3.9246229431677464)]
women [('disbelieving', 5.925517637795422), ('marley', 5.925517637795422), ('partially-drunk', 5.925517637795422), ('futon-style', 5.925517637795422), ('footrest', 5.925517637795422), ('hand-crafting', 5.925517637795422), ('headwraps', 5.861387300375705), ('blanked', 5.832408233403941), ('wel
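
The tied scores in the `woman` list above (twenty words, all at 3.9246...) are consistent with a known property of PMI: a word that occurs only in contexts containing the identity term has P(x, y) = P(x), so its PMI collapses to log(1 / P(y)) no matter how rare the word is. The sketch below illustrates this on a toy corpus and shows one common mitigation, a minimum-frequency cutoff. It is not the notebook's own PMI code: the function name `pmi_with_identity`, the `min_count` parameter, sentence-level co-occurrence, whitespace tokenization, and the base-2 logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_with_identity(sentences, identity, min_count=1):
    """Sentence-level PMI between `identity` and every co-occurring word.

    Hypothetical helper for illustration. Words seen in fewer than
    `min_count` sentences are skipped, because a word that only ever
    appears next to the identity term gets the maximum possible PMI
    even if it was seen a single time.
    """
    n = len(sentences)
    word_df = Counter()   # number of sentences containing each word
    joint_df = Counter()  # number of sentences containing word AND identity
    for sentence in sentences:
        tokens = set(sentence.lower().split())  # assumed tokenization
        word_df.update(tokens)
        if identity in tokens:
            joint_df.update(tokens - {identity})

    if word_df[identity] == 0:
        return []

    scores = []
    for word, joint in joint_df.items():
        if word_df[word] < min_count:
            continue
        # PMI = log2( P(word, identity) / (P(word) * P(identity)) )
        pmi = math.log2(joint * n / (word_df[word] * word_df[identity]))
        scores.append((word, pmi))
    # Highest PMI first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


toy_sentences = [
    "a woman is cooking dinner",
    "a woman is cooking radishes",
    "a man is cooking dinner",
    "a man is driving",
    "a woman is reading",
    "a man is reading",
]

unfiltered = pmi_with_identity(toy_sentences, "woman")
filtered = pmi_with_identity(toy_sentences, "woman", min_count=2)
print(unfiltered[:3])
print(filtered[:3])
```

On this toy corpus, `radishes` occurs once, next to `woman`, and tops the unfiltered ranking with a PMI of 1.0; with `min_count=2` it is dropped and `cooking` ranks first. A suitable cutoff for SNLI would have to be chosen empirically.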

In [24]:
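search_term = "extravagant"  # alternatives: "wolf-like", "slightly-untidy"
# Literal, case-insensitive substring match on the premise column (sentence1).
mask = snli_data_sub["sentence1"].str.contains(search_term, case=False, na=False, regex=False)
pd.set_option('display.max_colwidth', None)  # None disables column truncation
snli_data_sub.loc[mask, ["sentence1", "sentence2"]]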

| | sentence1 | sentence2 |
|---|---|---|
| 20205 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | the woman is preparing for her wedding. |
| 20206 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a woman dresses in western fashion, with unadorned fingers and head. |
| 20207 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | an indian lady is wearing a beautiful outfit. |
| 20208 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | an indian woman is wearing traditional styles. |
| 20209 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a woman shows off to a crowd of people. |
| 20210 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a woman is ready for a national cultural celebration. |
| 20211 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a japanese woman is wearing a kimono. |
| 20212 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a very colorfully robed female with fingers painted red, who's culture appears to be indian, has on an elaborate head piece. |
| 20213 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a woman is dressed in traditional, ethnic clothing. |
| 20214 | a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece. | a woman wearing an old-fashioned apron holds an apple pie in her hands. |
