# Exploratory Data Analysis - Bengali AI Dataset
# Grapheme Combinations

This dataset contains images of individual hand-written [Bengali characters](https://en.wikipedia.org/wiki/Bengali_alphabet). 
Bengali characters (graphemes) are written by combining three components: a grapheme_root,
vowel_diacritic, and consonant_diacritic. Your challenge is to classify the components of the grapheme in each
image. There are roughly 10,000 possible graphemes, of which roughly 1,000 are represented in the training set. The
test set includes some graphemes that do not exist in train but has no new grapheme components. It takes a lot of
volunteers filling out [sheets like this](https://github.com/BengaliAI/graphemePrepare/blob/master/collection/A4/form_1.jpg)
to generate a useful amount of real data; focusing the problem on the grapheme components rather than on recognizing
whole graphemes should make it possible to assemble a Bengali OCR system without handwriting samples for all 10,000
graphemes.

In [257]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
from IPython.display import Markdown, HTML
from src.jupyter import grid_df_display, combination_matrix

pd.set_option('display.max_columns',   500)
pd.set_option('display.max_colwidth', None)

%load_ext autoreload
%autoreload 2



## Inspect Raw Data

In [258]:
dataset = pd.read_csv('./data/train.csv')
# for key in ['grapheme_root','vowel_diacritic','consonant_diacritic','grapheme']:
#     dataset[key] = dataset[key].astype('category')  # ensures groupby().count() shows zeros
dataset.head()

,image_id,grapheme_root,vowel_diacritic,consonant_diacritic,grapheme
0,Train_0,15,9,5,ক্ট্রো
1,Train_1,159,0,0,হ
2,Train_2,22,3,5,খ্রী
3,Train_3,53,2,2,র্টি
4,Train_4,71,9,5,থ্রো


## Question: How many unique graphemes are there?

There are 168 grapheme roots, 11 vowel diacritics, 7 consonant diacritics, and 1295 unique graphemes among the 200,840 images (~200k) in the training dataset.

In [259]:
unique = dataset.apply(lambda col: col.nunique()); unique

image_id               200840
grapheme_root             168
vowel_diacritic            11
consonant_diacritic         7
grapheme                 1295
dtype: int64

## Question: Can all diacritics be used with any grapheme?

- Documentation claims 10,000+ possible graphemes, which is consistent with the theoretical maximum of `168 * 11 * 7 = 12936`

- Assuming that the training dataset is representative of common usage, 
  then certain combinations may never (or only rarely) be used in practice.

- Unconfirmed Theory: the physics of the human mouth may make some combinations unpronounceable. Needs a native speaker of Bengali to confirm.

- Conclusion: it may be possible to infer excluded combinations using simple logical rules
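
The arithmetic in the first bullet can be compared against what is actually observed. The following is a minimal sketch, not part of the original notebook: `combination_coverage` is a hypothetical helper, and the toy frame is a stand-in for `train.csv` that only shares its column names.

```python
import pandas as pd

COMPONENTS = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

def combination_coverage(df: pd.DataFrame) -> dict:
    """Compare the theoretical number of component triples with those observed."""
    theoretical = 1
    for col in COMPONENTS:
        theoretical *= int(df[col].nunique())      # e.g. 168 * 11 * 7 on the real data
    observed = len(df[COMPONENTS].drop_duplicates())  # distinct triples actually present
    return {
        "theoretical": theoretical,
        "observed": observed,
        "coverage": observed / theoretical,
    }

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "grapheme_root":       [0, 0, 1, 1, 1],
    "vowel_diacritic":     [0, 1, 0, 0, 1],
    "consonant_diacritic": [0, 0, 0, 1, 0],
})
coverage = combination_coverage(toy)
print(coverage)
```

On the real training data, the 1292 distinct triples counted in the sanity check further down would cover roughly 10% of the 12936 theoretical combinations.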

### Vowel / Consonant Combinations:
- Vowel #0 and Consonant #0 combine with everything
- Vowels #3, #5, #6, #8 have limited combinations with Consonants 
- Consonant #3 only combines with Vowel #0
- Consonant #6 only combines with Vowels #0 and #1

In [260]:
combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='consonant_diacritic', unique=False).applymap(len)

vowel_diacritic \ consonant_diacritic,0,1,2,3,4,5,6
0,23960,768,6262,619,5413,4180,306
1,18799,2843,3838,0,6573,3752,1081
2,17449,464,3764,0,1255,3035,0
3,11391,0,2290,0,0,2471,0
4,11832,1215,1563,0,2206,2032,0
5,3794,0,297,0,784,422,0
6,3873,0,463,0,0,0,0
7,16991,1197,3778,0,4072,2685,0
8,3210,0,0,0,167,151,0
9,10727,774,1210,0,800,2521,0
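
`combination_matrix` comes from the project-local `src.jupyter` module, whose source is not shown here. Assuming the cell above simply counts images per vowel/consonant pair, a similar table can probably be sketched with plain `pd.crosstab`. The helper name and the toy data below are hypothetical.

```python
import pandas as pd

def vowel_consonant_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows for every (vowel_diacritic, consonant_diacritic) pair."""
    return pd.crosstab(df["vowel_diacritic"], df["consonant_diacritic"])

# Hypothetical toy data with the same column names as train.csv
toy = pd.DataFrame({
    "vowel_diacritic":     [0, 0, 1, 1, 2],
    "consonant_diacritic": [0, 1, 0, 0, 0],
})
counts = vowel_consonant_counts(toy)

# Pairs whose count is zero never occur together in the data
never_seen = [
    (int(v), int(c))
    for v in counts.index
    for c in counts.columns
    if counts.loc[v, c] == 0
]
print(counts)
print(never_seen)
```

Note that `pd.crosstab` only creates rows and columns for values that occur at least once, so a diacritic that is entirely absent from the data would not show up as an all-zero row.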


### Grapheme Root Combinations:
- Vowel #0 combines with nearly every Root (155 of 168); Consonant #0 combines with every Root
- ALL Roots combine with Consonant #0
- Several Roots do NOT combine with Vowel #0 = [26, 28, 33, 34, 73, 82, 108, 114, 126, 152, 157, 158, 163]
- Several Roots do combine with ALL Vowels = [13, 23, 64, 72, 79, 81, 96, 107, 113, 115, 133, 147]
- Only Root #107 combines with ALL Consonants

In [261]:
root_vowels            = dataset.groupby('grapheme_root')['vowel_diacritic'].unique().apply(sorted).to_frame().T
root_consonants        = dataset.groupby('grapheme_root')['consonant_diacritic'].unique().apply(sorted).to_frame().T
root_vowels_values     = root_vowels.applymap(len).values.flatten()
root_consonants_values = root_consonants.applymap(len).values.flatten()

display(root_vowels)
display({
    "mean":   root_vowels_values.mean(),
    "median": np.median( root_vowels_values ),
    "min":    root_vowels_values.min(),
    "max":    root_vowels_values.max(),
    "unique_vowels":    unique['vowel_diacritic'],
    "root_combine_0":   sum([ 0 in lst for lst in root_vowels.values.flatten() ]),
    "unique_roots":     unique['grapheme_root'],
    "root_combine_not_0": str([ index for index, lst in enumerate(root_vowels.values.flatten()) if 0 not in lst ]),    
    "root_combine_all":       [ index for index, lst in enumerate(root_vowels.values.flatten()) if len(lst) == unique['vowel_diacritic'] ],
})
print('--------------------')
display(root_consonants)
display({
    "mean":   root_consonants_values.mean(),
    "median": np.median( root_consonants_values ),
    "min":    root_consonants_values.min(),
    "max":    root_consonants_values.max(),
    "unique_consonants":  unique['consonant_diacritic'],
    "root_combine_0": sum([ 0 in lst for lst in root_consonants.values.flatten() ]),
    "unique_roots":   unique['grapheme_root'],
    "root_combine_not_0": str([ index for index, lst in enumerate(root_consonants.values.flatten()) if 0 not in lst ]),        
    "root_combine_all":       [ index for index, lst in enumerate(root_consonants.values.flatten()) if len(lst) == unique['consonant_diacritic'] ],
})

grapheme_root,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167
vowel_diacritic,[0],[0],"[0, 1]",[0],[0],[0],[0],[0],[0],"[0, 1]",[0],[0],[0],"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 2, 7, 9]","[0, 1, 2, 7, 9]","[0, 1, 2, 6, 7, 9]","[0, 1, 2, 7, 9]","[0, 1, 2, 3, 4, 7, 9, 10]","[0, 10]","[0, 3]","[0, 1, 2, 3, 7]","[0, 1, 2, 3, 4, 6, 7, 8, 9]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1]","[0, 1, 2, 3, 4, 7]",[7],"[0, 2]","[1, 2, 4, 9]","[0, 1, 2, 4, 5, 6, 7, 9]","[0, 2, 7]","[0, 1, 2, 4, 7]","[0, 1, 2, 3, 4, 6, 7, 9]",[2],"[1, 2]","[0, 1, 2]","[0, 1, 2, 3, 4, 7, 9]","[0, 2]","[0, 1, 2, 3, 4, 5, 7, 8, 9, 10]","[0, 1, 2, 4]","[0, 1, 2, 4, 7]","[0, 1]","[0, 1, 2, 4, 7, 9]","[0, 1, 2, 3, 4, 7, 8, 9]","[0, 1, 2, 3, 4, 7, 9]",[0],"[0, 1, 3, 7]","[0, 1, 2]","[0, 1, 2, 4, 7, 9]","[0, 1]","[0, 1, 2, 4, 7]","[0, 2]","[0, 1, 2, 3, 4, 7]","[0, 1, 2, 3, 4, 5, 7, 9]","[0, 1, 2, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3]","[0, 1, 2, 4, 7, 9, 10]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 4]","[0, 1, 2, 7]","[0, 1, 2, 3, 4, 7]",[0],"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 2, 3, 7, 9]","[0, 1, 2, 3, 7]","[0, 1, 7]","[0, 2, 3, 7]","[0, 1, 2, 3, 7]","[0, 1, 2, 3, 7]","[0, 1, 2, 3, 4, 7, 8, 9]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]",[1],"[0, 1, 2, 3, 4, 7, 9, 10]","[0, 1, 2, 6, 7]","[0, 1, 2, 3, 4, 6, 7, 8, 9]","[0, 1, 2, 4, 5]","[0, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 7]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[3, 4]","[0, 1, 2, 4, 7, 9]","[0, 2, 7]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 7, 9]",[0],"[0, 1, 2, 3, 7]","[0, 1, 2, 3, 4, 7, 9]","[0, 2, 3, 7]","[0, 1, 2, 3, 4, 7]","[0, 1, 2, 3, 4, 7]","[0, 2, 3, 7]","[0, 1, 2, 4, 7, 9]","[0, 1, 2, 3, 7]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 7]","[0, 1, 2, 7]","[0, 1, 7]","[0, 1, 3]","[0, 1, 2, 4, 7, 9]",[0],"[0, 1, 2, 3, 4, 7, 9, 10]",[0],[0],"[0, 1, 7, 9]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]",[2],"[0, 1, 2, 3, 4, 7]","[0, 2]","[0, 1, 2, 4]","[0, 1, 2, 4, 7]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]",[1],"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 3]","[0, 1, 2, 3, 4, 
5, 6, 7, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 5, 7, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 7]","[0, 1, 2, 3, 4, 5, 7, 9, 10]","[0, 1, 2, 3, 4, 5, 7, 8, 9, 10]","[0, 1, 2, 3, 4, 7, 9, 10]","[0, 1, 2, 3, 7]",[4],"[0, 1, 2, 4, 7, 9]","[0, 1, 2, 7, 9]","[0, 1, 2, 3, 7, 9]",[0],"[0, 2, 7]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 2, 4]","[0, 1, 7, 9]","[0, 1, 2, 7]","[0, 2, 3, 7]","[0, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 6, 7]","[0, 1, 2, 3, 4, 7]","[0, 1, 2, 3, 4, 7]","[0, 1, 4, 7]","[0, 1, 7]","[0, 9]","[0, 7]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[0, 1, 2, 4, 6, 7, 9]","[0, 1, 2, 3, 4, 7, 9]","[0, 1, 2, 3, 4, 5, 6, 7, 9]","[0, 1, 2, 5, 7]","[1, 2, 7, 9]","[0, 1, 2, 3, 6, 7, 9]","[0, 3, 4, 5, 9]","[0, 1, 2, 3, 7, 8]","[0, 1, 2, 3, 6]","[1, 9]",[4],"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[0, 2, 7]","[0, 1]","[0, 1, 3]",[1],[0],"[0, 1, 2, 3, 4, 7, 9]","[0, 7]","[0, 1, 2, 3, 4, 5, 7, 9]"


{'mean': 4.869047619047619,
 'median': 5.0,
 'min': 1,
 'max': 11,
 'unique_vowels': 11,
 'root_combine_0': 155,
 'unique_roots': 168,
 'root_combine_not_0': '[26, 28, 33, 34, 73, 82, 108, 114, 126, 152, 157, 158, 163]',
 'root_combine_all': [13, 23, 64, 72, 79, 81, 96, 107, 113, 115, 133, 147]}

--------------------


grapheme_root,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167
consonant_diacritic,[0],[0],"[0, 4]","[0, 1]","[0, 1]",[0],"[0, 1]",[0],[0],"[0, 1, 4]",[0],[0],[0],"[0, 1, 2, 4, 5, 6]",[0],"[0, 5]",[0],[0],"[0, 4]",[0],[0],"[0, 2]","[0, 1, 2, 4, 5]","[0, 1, 2, 4, 5, 6]",[0],"[0, 4]",[0],[0],"[0, 4]","[0, 1, 2, 3, 4, 5]",[0],[0],[0],[0],[0],[0],[0],[0],"[0, 1, 2, 4]",[0],[0],[0],"[0, 1, 2, 4]","[0, 2, 3, 4, 5]",[0],[0],[0],[0],"[0, 1, 2]",[0],[0],[0],[0],"[0, 2, 4, 5, 6]",[0],"[0, 1, 4]","[0, 2, 4, 5]",[0],"[0, 4]","[0, 2, 4]",[0],[0],[0],[0],"[0, 1, 2, 4, 5, 6]","[0, 2, 4]",[0],[0],[0],[0],"[0, 4]","[0, 2, 3, 4, 5]","[0, 1, 2, 4, 5, 6]",[0],"[0, 2]","[0, 2]","[0, 2, 4]","[0, 5]",[0],"[0, 1, 2, 4, 5]","[0, 2]","[0, 2, 4]",[0],"[0, 5]",[0],"[0, 5]","[0, 2, 4, 5]",[0],[0],"[0, 4, 5]",[0],"[0, 4, 5]","[0, 4]",[0],[0],[0],"[0, 1, 2, 4, 5, 6]",[0],[0],[0],[0],"[0, 4]",[0],"[0, 1, 2, 4, 5, 6]",[0],[0],"[0, 4]","[0, 1, 2, 3, 4, 5, 6]",[0],[0],[0],"[0, 2]","[0, 4]","[0, 1, 2, 4, 5]",[0],"[0, 2, 4, 5]",[0],"[0, 5]",[0],"[0, 5]","[0, 2]",[0],"[0, 2, 4]","[0, 1, 4]","[0, 2, 4]",[0],[0],[0],"[0, 2]",[0],[0],[0],[0],"[0, 2, 4, 5]",[0],[0],"[0, 2]",[0],[0],"[0, 2, 4]","[0, 5]","[0, 4, 5]","[0, 4]",[0],"[0, 5]",[0],[0],"[0, 1, 2, 4, 5]","[0, 4, 5]","[0, 2, 4, 5, 6]","[0, 5]","[0, 2, 4]",[0],"[0, 4, 5]",[0],[0],[0],[0],[0],"[0, 1, 2, 4, 5]",[0],[0],[0],[0],[0],[0],[0],"[0, 4]"


{'mean': 1.9583333333333333,
 'median': 1.0,
 'min': 1,
 'max': 7,
 'unique_consonants': 7,
 'root_combine_0': 168,
 'unique_roots': 168,
 'root_combine_not_0': '[]',
 'root_combine_all': [107]}
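
The cell above recovers root numbers with `enumerate`, which matches the `grapheme_root` labels only because those labels happen to run contiguously from 0 to 167. A minimal sketch that keys on the labels themselves is shown below. `root_summary` is a hypothetical helper, and the toy data deliberately uses non-contiguous root labels.

```python
import pandas as pd

def root_summary(df: pd.DataFrame, component: str) -> dict:
    """Summarise which values of `component` each grapheme_root is seen with."""
    all_values = set(df[component].unique())
    # One set of observed component values per root, indexed by root label
    seen = df.groupby("grapheme_root")[component].agg(set)
    return {
        "roots_without_0": sorted(int(r) for r, vals in seen.items() if 0 not in vals),
        "roots_with_all":  sorted(int(r) for r, vals in seen.items() if vals == all_values),
        "mean_per_root":   float(seen.apply(len).mean()),
    }

# Hypothetical toy data; root labels are intentionally not 0..N-1
toy = pd.DataFrame({
    "grapheme_root":   [10, 10, 10, 20, 30, 30],
    "vowel_diacritic": [0, 1, 2, 1, 0, 2],
})
summary = root_summary(toy, "vowel_diacritic")
print(summary)
```

The same helper could be called with `"consonant_diacritic"` to reproduce the second summary.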

### Combination Matrices

This is the full list of which Grapheme Roots combine with each Vowel / Consonant Diacritic pair.

In [262]:
combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='grapheme_root', format=', ')

vowel_diacritic \ consonant_diacritic,0,1,2,3,4,5,6
0,"0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 29, 30, 31, 32, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74, 75, 76, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 153, 154, 155, 156, 159, 160, 161, 162, 164, 165, 166, 167","3, 4, 6, 9, 96","13, 21, 22, 23, 29, 38, 42, 43, 48, 53, 56, 59, 64, 65, 71, 72, 74, 75, 76, 79, 80, 81, 86, 96, 103, 107, 111, 113, 115, 120, 122, 124, 128, 133, 136, 139, 147, 149, 159","29, 43, 71, 107","13, 18, 22, 23, 38, 43, 53, 55, 58, 59, 64, 65, 70, 71, 72, 76, 79, 81, 86, 89, 91, 96, 107, 113, 115, 122, 124, 133, 139, 141, 142, 147, 151, 159","13, 15, 23, 29, 43, 53, 56, 64, 72, 85, 86, 89, 91, 96, 103, 107, 113, 115, 117, 119, 133, 141, 147, 148, 149, 150, 159","64, 72"
1,"13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 83, 85, 86, 88, 89, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101, 103, 106, 107, 109, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 127, 128, 129, 132, 133, 134, 135, 136, 139, 140, 141, 142, 143, 144, 147, 148, 149, 150, 151, 152, 153, 155, 156, 157, 159, 161, 162, 163, 165, 167","13, 22, 23, 29, 38, 42, 48, 55, 64, 72, 79, 96, 103, 107, 113, 123, 147, 159","13, 22, 23, 29, 38, 42, 43, 53, 56, 59, 64, 71, 72, 74, 79, 81, 107, 113, 115, 122, 124, 133, 139, 147, 151",,"2, 9, 13, 18, 22, 23, 25, 28, 29, 38, 42, 43, 53, 55, 56, 59, 64, 71, 72, 79, 81, 91, 92, 96, 101, 103, 106, 107, 112, 113, 115, 122, 123, 124, 133, 139, 141, 147, 148, 149, 153, 159, 167","13, 23, 29, 53, 56, 64, 72, 77, 83, 86, 89, 96, 103, 107, 113, 115, 119, 133, 144, 147, 149, 150, 153, 159","13, 23, 53, 96, 103, 107, 149"
2,"13, 14, 15, 16, 17, 18, 21, 22, 23, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 44, 47, 48, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 64, 65, 66, 68, 69, 70, 71, 72, 74, 75, 76, 77, 79, 81, 83, 84, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 101, 103, 107, 108, 109, 110, 111, 112, 113, 115, 117, 118, 119, 120, 122, 123, 124, 125, 127, 128, 129, 131, 132, 133, 134, 136, 137, 138, 139, 141, 142, 147, 148, 149, 150, 151, 152, 153, 155, 156, 159, 160, 165, 167","42, 96, 147","13, 23, 38, 43, 53, 56, 59, 64, 71, 72, 76, 79, 81, 96, 103, 107, 113, 115, 124, 133, 136, 139, 147, 159",,"23, 43, 64, 72, 79, 107, 133, 159","13, 22, 23, 53, 56, 64, 71, 72, 83, 86, 89, 96, 103, 107, 133, 140, 148, 149, 150, 153",
3,"13, 18, 20, 22, 23, 25, 32, 36, 38, 43, 44, 46, 52, 53, 55, 56, 57, 59, 62, 64, 65, 66, 68, 69, 70, 71, 72, 74, 76, 79, 81, 82, 85, 86, 88, 89, 90, 91, 92, 93, 95, 96, 100, 103, 107, 109, 113, 115, 116, 118, 119, 120, 122, 123, 124, 125, 129, 132, 133, 137, 138, 139, 141, 142, 147, 149, 150, 153, 154, 155, 156, 159, 162, 165, 167",,"13, 21, 23, 43, 59, 64, 65, 71, 72, 81, 115, 133, 139, 147",,,"13, 22, 23, 64, 72, 86, 89, 96, 103, 107, 117, 133, 140, 141, 149, 150",
4,"13, 18, 22, 23, 28, 29, 31, 32, 36, 38, 39, 40, 42, 43, 44, 48, 50, 52, 53, 55, 56, 58, 59, 60, 62, 64, 71, 72, 74, 76, 77, 79, 81, 82, 83, 85, 86, 89, 91, 92, 94, 96, 101, 103, 107, 109, 111, 112, 113, 115, 117, 118, 119, 120, 122, 123, 124, 126, 127, 132, 133, 134, 138, 139, 141, 142, 143, 147, 148, 149, 150, 154, 158, 159, 165, 167","13, 22, 23, 42, 48, 96, 103, 159","13, 23, 29, 43, 53, 64, 72, 107, 113, 115",,"13, 25, 38, 43, 53, 64, 72, 79, 81, 107, 113, 115, 133, 147","13, 23, 53, 64, 72, 79, 85, 96, 103, 107, 113, 133, 148",
5,"13, 23, 29, 38, 53, 64, 72, 77, 79, 81, 96, 113, 115, 117, 119, 122, 123, 133, 147, 150, 151, 154, 159, 167",,"113, 115",,"38, 81, 107, 113, 147","13, 72, 113",
6,"13, 16, 22, 23, 29, 32, 64, 72, 75, 76, 79, 81, 96, 107, 113, 115, 117, 133, 140, 147, 148, 150, 153, 156, 159",,"64, 72, 107",,,,
7,"13, 14, 15, 16, 17, 18, 21, 22, 23, 25, 26, 29, 30, 31, 32, 36, 38, 40, 42, 43, 44, 46, 48, 50, 52, 53, 55, 56, 58, 59, 61, 62, 64, 65, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 81, 83, 84, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 103, 106, 107, 109, 112, 113, 115, 117, 118, 119, 120, 121, 122, 123, 124, 125, 127, 128, 129, 131, 132, 133, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 146, 147, 148, 149, 150, 151, 152, 153, 155, 159, 160, 165, 166, 167","13, 23, 29, 38, 64, 96, 107, 159","13, 22, 23, 38, 43, 53, 56, 59, 64, 71, 72, 75, 79, 80, 81, 96, 107, 113, 115, 122, 133, 136, 139, 147",,"13, 18, 23, 38, 43, 53, 59, 64, 70, 71, 72, 79, 81, 89, 91, 107, 113, 115, 122, 124, 133, 139, 141, 147, 151, 159","13, 23, 29, 53, 56, 64, 72, 86, 89, 96, 103, 107, 133, 141, 147, 149, 150",
8,"13, 22, 23, 38, 43, 64, 71, 72, 76, 79, 81, 96, 107, 113, 115, 123, 133, 147, 155, 159",,,,43,64,
9,"13, 14, 15, 16, 17, 18, 22, 23, 28, 29, 32, 36, 38, 42, 43, 44, 48, 53, 54, 55, 56, 58, 59, 64, 65, 71, 72, 74, 76, 79, 81, 83, 85, 86, 89, 94, 96, 101, 103, 106, 107, 113, 115, 117, 118, 119, 120, 122, 123, 124, 127, 128, 129, 132, 133, 135, 138, 139, 145, 147, 148, 149, 150, 152, 153, 154, 157, 159, 165, 167","22, 23, 42, 55, 103","13, 23, 72, 81, 96, 107, 113, 115",,"43, 72, 89, 123, 147","13, 15, 23, 53, 56, 64, 71, 72, 83, 96, 103, 107, 133, 147, 149, 150",


## Sanity Checking: Found a Dataset BUG!

This combination_matrix lists 1292 unique root/vowel/consonant combinations, which is 3 fewer than the 1295 unique graphemes listed in the training dataset. Something is WRONG!

The discrepancy is a BUG in the dataset: the following root/vowel/consonant keys each have multiple unicode grapheme renderings!

{'64-3-2': ['র্তী', 'র্ত্রী'],
 '64-7-2': ['র্তে', 'র্ত্রে'],
 '72-0-2': ['র্দ্র', 'র্দ']}

In [329]:
from itertools import chain
{
"combinations": len(list(chain( *combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='grapheme_root').values.flatten() ))),
"unique_graphemes": unique['grapheme']
}

{'combinations': 1292, 'unique_graphemes': 1295}
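
The same comparison can be sketched without the project-local `combination_matrix` helper by counting distinct component triples directly. This is an illustrative alternative under that assumption, with a hypothetical `count_mismatch` helper and toy data in place of `train.csv`.

```python
import pandas as pd

KEYS = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

def count_mismatch(df: pd.DataFrame) -> tuple:
    """Return (distinct component triples, distinct grapheme strings)."""
    n_triples = len(df[KEYS].drop_duplicates())
    n_graphemes = int(df["grapheme"].nunique())
    return n_triples, n_graphemes

# Hypothetical toy data: the triple (1, 0, 0) is labelled with two different strings
toy = pd.DataFrame({
    "grapheme_root":       [1, 1, 2],
    "vowel_diacritic":     [0, 0, 0],
    "consonant_diacritic": [0, 0, 0],
    "grapheme":            ["a", "b", "c"],
})
n_triples, n_graphemes = count_mismatch(toy)
print(n_triples, n_graphemes)
```

If every triple mapped to exactly one grapheme string, the two numbers would be equal; any gap points to keys with more than one rendering.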

- Confirm that there are no null or NaN values in the dataset

In [331]:
dataset.isnull().sum()

image_id               0
grapheme_root          0
vowel_diacritic        0
consonant_diacritic    0
grapheme               0
dtype: int64

- Found the BUG! It is in the dataset!
- There are THREE unique root/vowel/consonant keys that each have multiple unicode renderings 

In [417]:
( 
    dataset
    .groupby(['grapheme_root', 'vowel_diacritic', 'consonant_diacritic'])
    .nunique(dropna=False) > 1 
).sum()

image_id               1292
grapheme_root             0
vowel_diacritic           0
consonant_diacritic       0
grapheme                  3
dtype: int64

In [418]:
( 
    dataset
    .groupby(['grapheme_root', 'vowel_diacritic', 'consonant_diacritic'])
    .nunique(dropna=False) > 1
).query("grapheme != False")

grapheme_root (index),vowel_diacritic (index),consonant_diacritic (index),image_id,grapheme_root,vowel_diacritic,consonant_diacritic,grapheme
64,3,2,True,False,False,False,True
64,7,2,True,False,False,False,True
72,0,2,True,False,False,False,True


In [419]:
multilabled_graphemes = {
    "64-3-2": dataset.query("grapheme_root == 64 & vowel_diacritic == 3 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
    "64-7-2": dataset.query("grapheme_root == 64 & vowel_diacritic == 7 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
    "72-0-2": dataset.query("grapheme_root == 72 & vowel_diacritic == 0 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
}
multilabled_graphemes

{'64-3-2': ['র্তী', 'র্ত্রী'],
 '64-7-2': ['র্তে', 'র্ত্রে'],
 '72-0-2': ['র্দ্র', 'র্দ']}
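
The three keys above are hard-coded from the previous cell's output. A minimal sketch of how the same dictionary might be built for any number of offending keys is shown below. `find_multilabeled` is a hypothetical helper, not part of the original notebook, and the toy frame stands in for `train.csv`.

```python
import pandas as pd

KEYS = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

def find_multilabeled(df: pd.DataFrame) -> dict:
    """Map 'root-vowel-consonant' keys to their grapheme strings when more than one exists."""
    per_key = df.groupby(KEYS)["grapheme"].unique()   # array of distinct strings per key
    return {
        "-".join(str(int(part)) for part in key): list(values)
        for key, values in per_key.items()
        if len(values) > 1
    }

# Hypothetical toy data: only the key (1, 0, 0) has two different labels
toy = pd.DataFrame({
    "grapheme_root":       [1, 1, 2, 1],
    "vowel_diacritic":     [0, 0, 0, 0],
    "consonant_diacritic": [0, 0, 0, 0],
    "grapheme":            ["a", "b", "c", "a"],
})
duplicates = find_multilabeled(toy)
print(duplicates)
```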

In [421]:
multilabled_grapheme_list = list(chain(*multilabled_graphemes.values())); multilabled_grapheme_list

['র্তী', 'র্ত্রী', 'র্তে', 'র্ত্রে', 'র্দ্র', 'র্দ']

This simply counts how many times each of these multi-keyed unicode graphemes is listed in the dataset.

In [422]:
dataset[ dataset['grapheme'].isin(multilabled_grapheme_list) ].groupby(['grapheme']).count()

grapheme,image_id,grapheme_root,vowel_diacritic,consonant_diacritic
র্তী,144,144,144,144
র্তে,153,153,153,153
র্ত্রী,145,145,145,145
র্ত্রে,150,150,150,150
র্দ,146,146,146,146
র্দ্র,151,151,151,151
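
To see how two renderings of the same key differ, it may help to list the Unicode code points of each string. The sketch below uses only the standard-library `unicodedata` module. `codepoints` is a hypothetical helper, and the two strings are copied from the output above, so their exact encoding here is an assumption.

```python
import unicodedata

def codepoints(text: str) -> list:
    """Return (code point, Unicode name) pairs for every character in `text`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Two renderings listed under key 64-3-2 (strings copied from the notebook output)
for grapheme in ["র্তী", "র্ত্রী"]:
    print(grapheme, len(grapheme), codepoints(grapheme))
```

Comparing the two lists side by side shows which extra characters the longer rendering carries.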
