# Blog Post 1

##### Alex Berry, Jason Chan, Hyunjoon Lee

Brown University Data Science Initiative\
DATA 2040: Deep Learning\
February 18th, 2020

In this blog post, we present our exploratory data analysis of the Bengali AI project.

## Exploratory Data Analysis (EDA) in the Training Set

There are two types of training data: labels and images. For each id, the image data is stored in **train_image_data.parquet** and the label data in **train.csv**. There are a total of 200840 samples. Each image of a Bengali grapheme consists of 32332 pixels (137 $\times$ 236) and has 3 labels: **grapheme_root**, **vowel_diacritic**, and **consonant_diacritic**. We downloaded the *Kalpurush font* to make sure the graphemes render properly.

```python
import pandas as pd

# Load the label table (one row per image id)
train_df = pd.read_csv('train.csv')
train_df.head()
```
![alt text](../figures/df_head.png)
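
To make the link between a flattened pixel row and a 137 $\times$ 236 image concrete, here is a minimal sketch of reshaping one row back into a 2-D array. It uses a small synthetic DataFrame rather than the real parquet file, and the column layout (an `image_id` column followed by one column per pixel) is an assumption based on how the later code slices the data.

```python
import numpy as np
import pandas as pd

HEIGHT, WIDTH = 137, 236  # image dimensions stated above

# Synthetic stand-in for a few rows of the image parquet file (assumed layout:
# one id column followed by HEIGHT * WIDTH pixel columns).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, HEIGHT * WIDTH), dtype=np.uint8)
fake_images = pd.DataFrame(pixels, columns=[str(i) for i in range(HEIGHT * WIDTH)])
fake_images.insert(0, "image_id", [f"Train_{i}" for i in range(3)])

# Drop the id column and reshape one flattened row into a 2-D image.
first_image = fake_images.iloc[0, 1:].to_numpy().astype(np.uint8).reshape(HEIGHT, WIDTH)
print(first_image.shape)  # (137, 236)
```

The resulting array can be passed directly to `plt.imshow` to view the grapheme.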


### Unique Values

We counted the number of unique grapheme roots, vowel diacritics, and consonant diacritics.

```python
# number of unique values
print(f'Number of unique grapheme roots: {train_df["grapheme_root"].nunique()}')
print(f'Number of unique vowel diacritic: {train_df["vowel_diacritic"].nunique()}')
print(f'Number of unique consonant diacritic: {train_df["consonant_diacritic"].nunique()}')
```
Number of unique grapheme roots: 168 \
Number of unique vowel diacritic: 11 \
Number of unique consonant diacritic: 7
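
The same counts can also be obtained in a single call with `DataFrame.nunique`. The sketch below uses a tiny made-up table in place of `train.csv`, so its numbers are illustrative only; `toy_df` and `label_cols` are names introduced here.

```python
import pandas as pd

label_cols = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

# Toy stand-in for train.csv; the values are invented for illustration.
toy_df = pd.DataFrame({
    "grapheme_root": [72, 64, 72, 13],
    "vowel_diacritic": [0, 7, 0, 1],
    "consonant_diacritic": [2, 2, 0, 0],
})

# Number of distinct values per label column.
unique_counts = toy_df[label_cols].nunique()
print(unique_counts)

# Upper bound on the number of distinct (root, vowel, consonant) combinations.
print(unique_counts.prod())
```

With the counts reported above, that bound is 168 $\times$ 11 $\times$ 7 = 12936 possible combinations, although not all of them necessarily occur in the training set.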

### Most/Least Common Characters for Each Component

#### 1. Top 10 Grapheme Roots

We first looked at the graphemes with the 10 most common roots. Their grapheme root IDs were [72, 64, 13, 107, 23, 96, 113, 147, 133, 115].

```python
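def get_n(df, field, n, ascend=False):
    # Count each value of `field` and keep the first n rows (most common first by default).
    # Returns two columns: 'id' (the value) and `field` (its count).
    counts = df[field].value_counts(ascending=ascend).head(n)
    return counts.rename_axis('id').reset_index(name=field)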

get_n(train_df, "grapheme_root", 10)
```
![alt text](/figures/top10_roots.png)

```python
import seaborn as sns

# top 10 most common grapheme roots
top_10_roots = get_n(train_df, 'grapheme_root', 10)
# keep the bars ordered from most to least common
sorted_indices = top_10_roots.sort_values(by="grapheme_root", ascending=False)["id"].tolist()
# would have preferred to use "component" instead of "id", but Unicode isn't supported in the plot
sns.barplot(x="id", y="grapheme_root", data=top_10_roots, order=sorted_indices).set_title("10 Most Common Roots by ID")
```
![alt text](top10_roots_bar.png)

The following figure shows 5 sample graphemes for each of the 10 most common root IDs.

```python
import numpy as np
import matplotlib.pyplot as plt

def make_contact_sheet(images, labels, ex_labels, num_samples):
    '''
      Plots a grid of sample images for the labels listed in ex_labels.
      Each column corresponds to a different ex_label and there are num_samples rows in total.
      Inputs:
        - images: A DataFrame of flattened images to sample from (the first column is the image id)
        - labels: An array-like where the ith element corresponds to the label of image i in images
        - ex_labels: An array specifying the labels we wish to sample
        - num_samples: A positive integer specifying how many samples we want for each label
    '''
    indices = [[0] * len(ex_labels) for i in range(num_samples)]
    samples = [[0] * len(ex_labels) for i in range(num_samples)]
    
    for i in range(len(ex_labels)):
        for j in range(num_samples):
            indices[j][i] = np.where(labels == ex_labels[i])[0][j]
            
    for i in range(len(ex_labels)):
        for j in range(num_samples):
            samples[j][i] = images.iloc[indices[j][i],1:].to_numpy().astype(int).reshape(137,236)
            
    f, axs = plt.subplots(num_samples, len(ex_labels), sharey = True, figsize = (20, 10))
    
    for i in range(len(samples)):
        for j in range(len(samples[0])):
            axs[i, j].imshow(samples[i][j])
            axs[i, j].axis("off")
            
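# Assumes train_images_0 is a DataFrame loaded from one of the image parquet files
# and train_label holds the labels read from train.csv (neither is defined in this post).
top10_root_labels = [72, 64, 13, 107, 23, 96, 113, 147, 133, 115]
make_contact_sheet(train_images_0, train_label['grapheme_root'], top10_root_labels, 5)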
```

![alt text](top_grapheme_roots.png)
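
The index-selection step inside `make_contact_sheet` can be isolated as follows. This is a simplified sketch on a toy label list, not the code used to produce the figure; the helper name `first_k_indices` is hypothetical.

```python
import numpy as np

def first_k_indices(labels, wanted_labels, k):
    """Map each wanted label to the positions of its first k occurrences."""
    labels = np.asarray(labels)
    return {lab: np.flatnonzero(labels == lab)[:k].tolist() for lab in wanted_labels}

# Toy labels, invented for illustration.
toy_labels = [72, 64, 72, 13, 64, 72]
print(first_k_indices(toy_labels, [72, 64], 2))
# {72: [0, 2], 64: [1, 4]}
```

Slicing with `[:k]` returns fewer indices when a label has fewer than `k` samples, instead of raising an `IndexError`, which may matter for the rarest roots.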

#### 2. Bottom 10 Grapheme Roots

Then, we looked at the graphemes with the 10 least common roots. Their grapheme root IDs were [63, 0, 12, 1, 130, 45, 158, 102, 33, 73].

![alt text](bottom10_roots_bar.png)

The following figure shows 5 sample graphemes for each of the 10 least common root IDs.

![alt text](bottom_grapheme_roots.png)
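
The code for the least common roots is not shown in this post. One way to get such a table, sketched here on a toy label column, is to sort the value counts in ascending order, which is presumably what calling `get_n` with `ascend=True` does. The helper name `least_common` is hypothetical.

```python
import pandas as pd

def least_common(df, field, n):
    """Return the n least frequent values of `field` with their counts."""
    counts = df[field].value_counts(ascending=True).head(n)
    return counts.rename_axis("id").reset_index(name="count")

# Toy labels, invented for illustration.
toy_df = pd.DataFrame({"grapheme_root": [72, 72, 72, 64, 64, 13, 0]})
bottom_2 = least_common(toy_df, "grapheme_root", 2)
print(bottom_2)
```

In this toy example the two rarest roots (13 and 0) each appear once; the order among tied values is not something to rely on.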

#### 3. Top 5 Vowels

After the grapheme roots, we looked at the 5 most common vowel diacritics. Their vowel diacritic IDs were [0, 1, 7, 2, 4].

![alt text](top5_vowels_bar.png)

The following figure shows 5 sample graphemes for each of the 5 most common vowel diacritic IDs.

![alt text](vowels.png)

#### 4. All 7 Consonants

After the vowel diacritics, we looked at all 7 consonant diacritics. Their consonant diacritic IDs were [0, 2, 5, 4, 1, 6, 3].

![alt text](top5_consonants_bar.png)

The following figure shows 5 sample graphemes for each of the 7 consonant diacritic IDs.

![alt text](consonant.png)
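
Since the bar charts compare raw counts, relative frequencies can make it easier to see how unevenly the diacritic classes are distributed. The following is a minimal sketch on invented labels, not the real distribution.

```python
import pandas as pd

# Toy consonant diacritic labels, invented for illustration.
toy_df = pd.DataFrame({"consonant_diacritic": [0, 0, 0, 0, 2, 2, 5, 0]})

# Fraction of samples in each class, most common first.
shares = toy_df["consonant_diacritic"].value_counts(normalize=True)
print(shares)
```

Here class 0 accounts for 5 of the 8 toy samples (0.625); the same call on `train_df` would give the real shares.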

#### 5. Top 10 Combinations

Lastly, we looked at the graphemes with the 10 most common combinations of **grapheme_root**, **vowel_diacritic**, and **consonant_diacritic**.

**Quantifying the Most Common Combinations**

```python
import collections as coll
from tqdm import tqdm

# Tally how often each (root, vowel, consonant) combination occurs
combo_tally = coll.Counter()
for _, row in tqdm(train_df.iterrows(), total=len(train_df)):
    # use a tuple of the three labels as the counter key
    combo = (row['grapheme_root'], row['vowel_diacritic'], row['consonant_diacritic'])
    combo_tally[combo] += 1
combo_tally.most_common(10)

# output:
# [((64, 7, 2), 303),
#  ((72, 0, 2), 297),
#  ((64, 3, 2), 289),
#  ((167, 7, 0), 283),
#  ((74, 1, 0), 178),
#  ((29, 0, 0), 178),
#  ((48, 4, 0), 177),
#  ((107, 7, 0), 177),
#  ((103, 1, 0), 177),
#  ((96, 9, 5), 177)]
```

![alt text](top10_combos.png)

The following figure shows one sample for each of the most common combinations of components.

![alt text](combinations.png)
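
Looping over rows with `iterrows` is slow on roughly 200,000 samples. A vectorized alternative, sketched here on a toy table rather than the real data, is to group by the three label columns and count the group sizes.

```python
import pandas as pd

label_cols = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

# Toy labels, invented for illustration.
toy_df = pd.DataFrame({
    "grapheme_root": [64, 64, 72, 64, 72],
    "vowel_diacritic": [7, 7, 0, 3, 0],
    "consonant_diacritic": [2, 2, 2, 2, 2],
})

# Count rows per (root, vowel, consonant) combination, most common first.
combo_counts = (
    toy_df.groupby(label_cols)
    .size()
    .sort_values(ascending=False)
)
print(combo_counts.head(3))
```

The result is a Series indexed by the label triple, so `combo_counts.loc[(64, 7, 2)]` looks up a single combination.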

### Conclusion of Component EDA

In the training set, the most common grapheme roots are 72, 64, and 13, and the least common are 63, 0, and 12. The most common vowel diacritics are 0, 1, and 7. The most common consonant diacritics are 0, 2, and 5. As expected, the most common combinations of components are mostly combinations of the most common individual components. The three most common combinations are 64-7-2, 72-0-2, and 64-3-2.