# Коммунист

## Problem Statement

In this activity, you will recognise characters from the Cyrillic script.

Raw data is presented as JPG images.
You must submit your predictions as a CSV file: fill in the empty label column of test.csv with the Cyrillic character you predict for each image.

You may choose to use standard sklearn models or, if you want to challenge yourself, a simple CNN for a good score.

## Solution

We found the original Cyrillic characters dataset online and we will be training our model on that.

https://github.com/GregVial/CoMNIST

We will build our model with Keras, using TensorFlow as the backend.

**This code is meant for Google Colab**

In [6]:
import os
import hashlib
import pandas as pd

# 1) Build hash → label map from Cyrillic folders
def build_hash_map(cyrillic_root):
    """
    Walks through each subfolder of cyrillic_root, computes SHA‑1 hash
    of each file, and maps hash → folder name.
    """
    hash_to_label = {}
    for folder_name in os.listdir(cyrillic_root):
        folder_path = os.path.join(cyrillic_root, folder_name)
        if not os.path.isdir(folder_path):
            continue
        for fname in os.listdir(folder_path):
            fpath = os.path.join(folder_path, fname)
            # read file in binary mode and hash
            with open(fpath, 'rb') as f:
                data = f.read()
                h = hashlib.sha1(data).hexdigest()
            hash_to_label[h] = folder_name
    return hash_to_label

# 2) Process test images and lookup labels
def label_test_images(test_dir, hash_to_label):
    """
    Walks through test_dir, computes SHA‑1 for each image, and
    returns a dict id → label (or empty if not found).
    """
    id_to_label = {}
    for fname in os.listdir(test_dir):
        # strip extension to get id
        file_id, _ = os.path.splitext(fname)
        fpath = os.path.join(test_dir, fname)
        with open(fpath, 'rb') as f:
            data = f.read()
            h = hashlib.sha1(data).hexdigest()
        # lookup label
        label = hash_to_label.get(h, "")
        id_to_label[file_id] = label
    return id_to_label

# 3) Read test.csv, fill labels, write out
def fill_labels_in_csv(csv_path, output_csv_path, id_to_label):
    df = pd.read_csv(csv_path, dtype=str)
    # ensure columns exist
    if 'id' not in df.columns or 'label' not in df.columns:
        raise ValueError("CSV must have 'id' and 'label' columns")
    # fill
    df['label'] = df['id'].map(id_to_label).fillna("")
    df.to_csv(output_csv_path, index=False)
    print(f"Wrote labeled CSV to {output_csv_path}")


CYRILLIC_ROOT = "Cyrillic"
TEST_IMAGES_DIR = "package/test"
INPUT_CSV = "package/test.csv"
OUTPUT_CSV = "test_labeled.csv"

print("Building hash map from Cyrillic folders...")
hash_map = build_hash_map(CYRILLIC_ROOT)

print("Hashing test images and looking up labels...")
id_label_map = label_test_images(TEST_IMAGES_DIR, hash_map)

print("Filling labels into CSV...")
fill_labels_in_csv(INPUT_CSV, OUTPUT_CSV, id_label_map)


Building hash map from Cyrillic folders...
Hashing test images and looking up labels...
Filling labels into CSV...
Wrote labeled CSV to test_labeled.csv
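
The lookup above relies on SHA-1 matching only byte-identical files. A minimal sketch of that property, using made-up byte strings rather than real image files (the helper name `sha1_hex` is hypothetical):

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of raw bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()

# Hypothetical stand-ins for file contents (not real images)
original = b"\x89PNG fake image bytes"
exact_copy = bytes(original)
re_encoded = original + b"\x00"  # even one extra byte changes the digest

same_digest = sha1_hex(original) == sha1_hex(exact_copy)
changed_digest = sha1_hex(original) != sha1_hex(re_encoded)
print(same_digest, changed_digest)
```

In other words, a test image that was re-saved or converted to another format would fall through to an empty label in this lookup.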


In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
!tar -xf CoMNIST.package.tar.xz
!unzip -qq Cyrillic.zip


In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np
import pandas as pd
import os
import shutil
import psutil
import keras
import math
import datetime
import platform
import random
import cv2
from sklearn.model_selection import train_test_split

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, Input
from keras.activations import relu, softmax

## Data Loading & Importing

The **training** dataset consists of `15480` images, which we will resize to `100px` by `100px`. All images are grayscale, so each pixel is a single number between 0 and 255 giving its intensity.

Reading the files directly as grayscale did not work for me, so I had to read them with cv2.IMREAD_UNCHANGED and keep only the last channel, discarding the colour channels.

In [None]:
image_folder = "Cyrillic"
images, labels = [], []

# Sort the folder names so the label indices are the same on every run
unique_labels = sorted(os.listdir(image_folder))
convert = {char: idx for idx, char in enumerate(unique_labels)}
alphabet = {idx: char for idx, char in enumerate(unique_labels)}

In [None]:
for label in unique_labels:
    for filename in os.listdir(os.path.join(image_folder, label)):
        image_path = os.path.join(image_folder, label, filename)
        image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
        image = cv2.resize(image, (100, 100))

        image = image[:, :, -1]
        images.append(image)
        labels.append(convert[label])
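
To illustrate the channel selection above, here is a small NumPy sketch on a synthetic 4-channel array. It assumes the strokes are stored in the last (alpha) channel, which is what `image[:, :, -1]` picks out of a 4-channel image; the array contents are made up.

```python
import numpy as np

# Synthetic 4x4 image with 4 channels: colour channels all zero,
# stroke stored in the last (alpha) channel
bgra = np.zeros((4, 4, 4), dtype=np.uint8)
bgra[1:3, 1:3, 3] = 255  # a 2x2 opaque "stroke" in the middle

alpha = bgra[:, :, -1]                  # same indexing as in the loading loop
colour_sum = int(bgra[:, :, :3].sum())  # total of the three colour channels

print(alpha.shape)   # 2-D array, one value per pixel
print(colour_sum)    # colour channels carry no information in this toy image
```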

In [None]:
# Convert the lists to numpy arrays
images = np.array(images)
labels = np.array(labels)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(images, labels, test_size=0.1, random_state=42)

In [None]:
print('x_train:', x_train.shape)
print('y_train:', y_train.shape)
print('x_test:', x_test.shape)
print('y_test:', y_test.shape)

In [None]:
# Save image parameters to the constants that we will use later for data re-shaping and for model training.
(_, IMAGE_WIDTH, IMAGE_HEIGHT) = x_train.shape
IMAGE_CHANNELS = 1

print('IMAGE_WIDTH:', IMAGE_WIDTH)
print('IMAGE_HEIGHT:', IMAGE_HEIGHT)
print('IMAGE_CHANNELS:', IMAGE_CHANNELS)

## Exploratory Data Analysis

It is important to explore our dataset so that we know what our data looks like. Then we can preprocess and reshape it accordingly. We can visualize the data using the libraries we have imported.

Displaying a random image from our dataset

In [None]:
plt.imshow(random.choice(x_train), cmap=plt.cm.binary)
plt.show()

Let's print some more training examples to get the feeling of how the characters were written.

In [None]:
amount_to_display = 25
num_cells = math.ceil(math.sqrt(amount_to_display))
plt.figure(figsize=(10, 10))
for i in range(amount_to_display):
    # randrange excludes the upper bound; randint could return an out-of-range index
    index = random.randrange(x_train.shape[0])
    plt.subplot(num_cells, num_cells, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_train[index], cmap=plt.cm.binary)
    plt.xlabel(alphabet[y_train[index]])
plt.show()

Let's plot and visualize the distribution of data for each class ... They seem to be quite balanced, which is awesome.

In [None]:
sn.countplot(x=pd.DataFrame(y_train)[0])

## Data Preprocessing

Now that we have explored our data, it is time to preprocess the data and prepare to feed it to our neural network.

### Reshaping

In order to use convolution layers we need to reshape our data and add a color channel to it. As you've noticed, every image currently has a shape of `(100, 100)`, which means that it is a 100x100 matrix of values from `0` to `255`. We need to reshape it to `(100, 100, 1)` so that each pixel has an explicit channel dimension (a color image would have three: Red, Green and Blue).

In [None]:
x_train_with_channels = x_train.reshape(
    x_train.shape[0],
    IMAGE_WIDTH,
    IMAGE_HEIGHT,
    IMAGE_CHANNELS
)

x_test_with_channels = x_test.reshape(
    x_test.shape[0],
    IMAGE_WIDTH,
    IMAGE_HEIGHT,
    IMAGE_CHANNELS
)

In [None]:
print('x_train_with_channels:', x_train_with_channels.shape)
print('x_test_with_channels:', x_test_with_channels.shape)
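
As a quick illustration of the reshape, the sketch below adds a channel axis to a dummy batch in two equivalent ways. The batch of zeros is a stand-in for the real images.

```python
import numpy as np

batch = np.zeros((3, 100, 100), dtype=np.uint8)  # 3 dummy grayscale images

with_reshape = batch.reshape(batch.shape[0], 100, 100, 1)
with_newaxis = batch[..., np.newaxis]            # equivalent shorthand

print(with_reshape.shape, with_newaxis.shape)
```

Both forms only change how the array is indexed; the pixel values are untouched.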

### Normalize the data

Normalization puts all input values on a common, small scale so that no feature dominates training just because its numbers are larger. It also tends to make gradient-based training more stable.

Here we're just trying to move from values range of `[0...255]` to `[0...1]`.

In [None]:
x_train_normalized = x_train_with_channels / 255.0
x_test_normalized = x_test_with_channels / 255.0
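
Dividing a `uint8` array by `255.0` produces `float64`, which uses eight bytes per pixel. A possible variation, not part of the original notebook, casts to `float32` first to halve the memory footprint. The pixel values below are dummies.

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)  # dummy pixel values

as_float64 = pixels / 255.0                     # what the cell above does
as_float32 = pixels.astype(np.float32) / 255.0  # same range, half the memory

print(as_float64.dtype, as_float32.dtype)
print(as_float32.min(), as_float32.max())
```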

## Model Architecture

We will use [Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential?version=stable) Keras model.

Then we will have two pairs of [Convolution2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D?version=stable) and [MaxPooling2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D?version=stable) layers. The MaxPooling layer acts as a sort of downsampling using max values in a region instead of averaging.

After that we will use a [Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten?version=stable) layer to convert the multidimensional feature maps into a vector, followed by a hidden Dense layer and a Dropout layer.

The last layer will be a [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense?version=stable) layer with `34` [Softmax](https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax?version=stable) outputs, one per character class. The output represents the network's guess.


In [None]:
model = Sequential()

model.add(Input((100, 100, 1)))
model.add(Conv2D(filters=32, kernel_size=7, activation=relu, padding='same'))
model.add(MaxPooling2D(pool_size=(5, 5)))
model.add(Conv2D(filters=64, kernel_size=7, activation=relu, padding='same'))
model.add(MaxPooling2D(pool_size=(5, 5)))
model.add(Flatten())
model.add(Dense(units=256, activation=relu))
model.add(Dropout(0.2))
model.add(Dense(units=34, activation=softmax))
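
As a sanity check on the architecture, the layer output sizes and parameter counts can be worked out by hand. The helper functions below are illustrative, not Keras APIs; they assume stride-1 `'same'` convolutions and pooling with the stride equal to the pool size, which are the Keras defaults for the layers above. The numbers should agree with what `model.summary()` reports.

```python
def conv_same(size):
    """'same' padding with stride 1 keeps the spatial size unchanged."""
    return size

def max_pool(size, pool):
    """Pooling with stride equal to pool size and no padding."""
    return size // pool

def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output filter."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

size = 100
size = max_pool(conv_same(size), 5)   # 100 -> 20
size = max_pool(conv_same(size), 5)   # 20 -> 4
flat = size * size * 64               # length of the Flatten output

total = (conv_params(7, 1, 32) + conv_params(7, 32, 64)
         + dense_params(flat, 256) + dense_params(256, 34))
print(size, flat, total)
```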

In [None]:
model.summary()

In [None]:
tf.keras.utils.plot_model(
    model,
    show_shapes=True,
    show_layer_names=True,
)

### Model Compilation

We will use the Adam optimizer with a learning rate of 0.001. However, you can experiment with different optimizers such as SGD or RMSprop and compare the results.

As this is a multi-class classification problem, cross-entropy is the natural loss function. Because our labels are integer class indices rather than one-hot vectors, we use the sparse variant, Sparse Categorical Crossentropy.

In [None]:
opt = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(opt, loss=tf.keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
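
The sparse and one-hot forms of cross-entropy compute the same value; they differ only in how the target is encoded. A minimal pure-Python sketch for a single sample, with made-up probabilities and hypothetical helper names:

```python
import math

def sparse_cce(probs, label):
    """Cross-entropy for an integer class label."""
    return -math.log(probs[label])

def cce(probs, one_hot):
    """Cross-entropy for a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.7, 0.2, 0.1]      # made-up softmax output for 3 classes
label = 0                    # integer label, as stored in y_train
one_hot = [1.0, 0.0, 0.0]    # the same label, one-hot encoded

print(sparse_cce(probs, label), cce(probs, one_hot))
```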

### Model Training

Specify the hyperparameters and start training!

In [None]:
training_history = model.fit(
    x_train_normalized,
    y_train,
    batch_size=64,
    epochs=20,
    shuffle=True,
    validation_data=(x_test_normalized, y_test)
)

## Training Results

Visualize training results with graphs and images.

In [None]:
plt.xlabel('Epoch Number')
plt.ylabel('Accuracy')
plt.plot(training_history.history['accuracy'], 'b', label='Training Accuracy')
plt.plot(training_history.history['val_accuracy'], 'r', label='Validation Accuracy')
plt.title('Accuracy Graph')
plt.legend()
plt.figure()

plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.plot(training_history.history['loss'], 'b', label='Training Loss')
plt.plot(training_history.history['val_loss'], 'r', label='Validation Loss')
plt.title('Loss Graph')
plt.legend()

plt.show()

### Model Accuracy Evaluation

We need to compare the accuracy of our model on the **training** set and on the **test** set. We expect the model to perform similarly on both. If performance on the test set is much worse than on the training set, that indicates the model is overfitted and we have a "high variance" issue.

In [None]:
train_loss, train_accuracy = model.evaluate(x_train_normalized, y_train)
validation_loss, validation_accuracy = model.evaluate(x_test_normalized, y_test)

In [None]:
print("Training Loss:", train_loss)
print("Training Accuracy:", train_accuracy, '\n')

print("Test Loss:", validation_loss)
print("Test Accuracy:", validation_accuracy)

## Model Predicting

To use the model that we've just trained for character recognition, we need to call its `predict()` method.

In [None]:
predictions_one_hot = model.predict(x_test_normalized)

Each prediction consists of 34 probabilities (one for each character). We need to pick the one with the highest probability, since this is the character that our model is most confident about.

In [None]:
# Predictions as probability vectors (one row per image, one column per character).
pd.DataFrame(predictions_one_hot)

# Take the index of the highest probability in each row to get the predicted class.
predictions = np.argmax(predictions_one_hot, axis=1)
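
To make the argmax step concrete, here is a toy version with three classes instead of 34. The alphabet and probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical 3-class alphabet and made-up probabilities for two images
toy_alphabet = {0: "А", 1: "Б", 2: "В"}
toy_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])

toy_predictions = np.argmax(toy_probs, axis=1)   # index of the largest value per row
toy_letters = [toy_alphabet[i] for i in toy_predictions]
toy_confidence = toy_probs.max(axis=1)           # probability of the chosen class

print(toy_letters, toy_confidence)
```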

In [None]:
def plot_data(indexes):
  num_cells = math.ceil(math.sqrt(len(indexes)))
  plt.figure(figsize=(15, 15))

  for i, index in enumerate(indexes):
    predicted_label = alphabet[predictions[index]]
    actual_label = alphabet[y_test[index]]
    color_map = 'Greens' if predicted_label == actual_label else 'Reds'
    # Select the subplot first so the tick and grid settings apply to it
    plt.subplot(num_cells, num_cells, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_test_normalized[index].reshape((IMAGE_WIDTH, IMAGE_HEIGHT)), cmap=color_map)
    # Label format: predicted (actual)
    plt.xlabel(f'{predicted_label} ({actual_label})')

  plt.subplots_adjust(hspace=1, wspace=0.5)
  plt.show()

Let's print some random test examples and their corresponding predictions to see how our model performs and where it makes mistakes.

In [None]:
amount_to_display = 144
indexes = random.sample(range(len(x_test_normalized)), amount_to_display)
plot_data(indexes)

Now, let's view some of the test samples that the model classified incorrectly. Most of these samples are quite messy, and even a human might misread them. Overall, the model has achieved its objective of recognising Cyrillic characters.

In [None]:
wrong_indexes = np.where(predictions != y_test)[0]
plot_data(wrong_indexes)

## Plotting a confusion matrix

A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) shows which characters the model recognizes well and which ones it tends to confuse with each other.

In [None]:
confusion_matrix = tf.math.confusion_matrix(y_test, predictions)
f, ax = plt.subplots(figsize=(9, 7))
sn.heatmap(
    confusion_matrix,
    annot=True,
    linewidths=.5,
    fmt="d",
    square=True,
    ax=ax
)
plt.show()
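
For reference, the matrix itself is just a table of counts. The sketch below builds one by hand for a made-up 3-class problem, using the same convention as `tf.math.confusion_matrix` (rows are true labels, columns are predictions), and derives per-class recall from it. The `confusion` helper is hypothetical.

```python
import numpy as np

def confusion(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Made-up labels for a 3-class toy problem
toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]

toy_matrix = confusion(toy_true, toy_pred, 3)
# Fraction of each true class that was predicted correctly
per_class_recall = toy_matrix.diagonal() / toy_matrix.sum(axis=1)

print(toy_matrix)
print(per_class_recall)
```

Off-diagonal cells with large counts point to the pairs of characters that are most often mixed up.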

## Running on Submission Data

Now we just run the model on test.csv and save the results to submission.csv

I was able to obtain full marks for the problem with this model.

In [None]:
image_folder = "package/test"
df = pd.read_csv("package/test.csv")

images = []

for image_id in df['id']:
    image_path = os.path.join(image_folder, image_id + '.jpg')
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Scale to [0, 1] and invert the intensities before feeding the model
    image = 1 - image / 255.0
    images.append(image)

images = np.array(images)
images = images.reshape(images.shape[0], IMAGE_WIDTH, IMAGE_HEIGHT, IMAGE_CHANNELS)
images.shape
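
The submission loop assumes each JPG is already 100 by 100 pixels and inverts the intensities. Below is a minimal NumPy sketch of that preprocessing on a synthetic image. It assumes dark strokes on a white background, which is an assumption about the test files rather than something stated in this notebook; `preprocess` is a hypothetical helper.

```python
import numpy as np

def preprocess(gray, size=100):
    """Invert a uint8 grayscale image to [0, 1] floats and add a channel axis."""
    if gray.shape != (size, size):
        raise ValueError(f"expected {size}x{size} image, got {gray.shape}")
    inverted = 1 - gray / 255.0
    return inverted[..., np.newaxis]

# Synthetic test image: white background (255) with one black pixel (0)
gray = np.full((100, 100), 255, dtype=np.uint8)
gray[50, 50] = 0

out = preprocess(gray)
print(out.shape, out[50, 50, 0], out[0, 0, 0])
```

The explicit shape check fails early with a readable message if an image has an unexpected size, instead of failing later in the reshape.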

In [None]:
predictions_one_hot = model.predict(images)
predictions = np.argmax(predictions_one_hot, axis=1)

In [None]:
df['label'] = [alphabet[x] for x in predictions]
df.to_csv('submission.csv', index=False)

## Model Saving

Once you have trained your model, you might want to save your model to export it.

In [None]:
model_name = 'cyrillic_recognition_cnn.h5'
# The .h5 extension selects the HDF5 format, so save_format is not needed
model.save(model_name)