# Exploratory Data Analysis
First, I will compute summary statistics on the train folder: how many images we have, how many annotations we have, and so on.

# **Dataset Description and Guidelines for Code Competition**

This document provides a detailed description of the dataset used in this competition, and also describes the structure and rules of the competition.

## **Dataset Overview**

The dataset for this competition includes approximately 65,000 annotated scientific figures of four types: bar graphs (both horizontal and vertical), dot plots, line graphs, and scatter plots. These figures are a mix of synthetically generated images and a few thousand figures extracted from professionally-produced sources. The challenge of this competition is to predict the data series represented in the test set figures.

## **Competition Format**

This competition follows a Code Competition structure. The actual test set is hidden and the public version only includes some sample data drawn from the training set to aid you in developing your solutions. When your submission is scored, this sample test data will be replaced with the actual test set.
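
Because the sample test data is swapped for the hidden test set at scoring time, inference code probably should not hard-code file names. A minimal sketch of discovering images at run time follows; the helper name `list_image_ids` is hypothetical, and the `.jpg` extension for test images is an assumption carried over from the training images.

```python
from pathlib import Path


def list_image_ids(image_dir):
    """Return the sorted image ids (file names without extension) in image_dir."""
    # Assumption: test images use the same .jpg extension as the training images.
    return sorted(p.stem for p in Path(image_dir).glob('*.jpg'))
```

Calling `list_image_ids('test/images')` would then return whatever images are present when the notebook is scored, rather than the sample ones seen during development.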

## **File and Field Descriptions**

The dataset is organized into the following folders and files:

- `train/annotations/`: This folder contains JSON image annotations that describe the figures, including:
    - `source`: Whether the figure is generated or extracted.
    - `chart-type`: The type of chart (dot, horizontal_bar, vertical_bar, line, or scatter).
    - `plot-bb`: The bounding box of the plot within the figure, specified by height, width, x0, and y0.
    - `text/id`: An identifier for a text item within the figure.
    - `text/polygon`: The region bounding the text item in the image.
    - `text/text`: The actual text.
    - `text/role`: The function of the text in the image (for example, chart_title, axis_title, tick_label, etc.).
    - `axes/{x|y}-axis/ticks/id`: An identifier that matches the tick to the associated text element id.
    - `axes/{x|y}-axis/ticks/tick_pt`: Coordinates of each tick in the figure.
    - `axes/{x|y}-axis/tick-type`: The graphical depiction of the tick element.
    - `axes/{x|y}-axis/values-type`: The data type of the values represented by the tick element. This can be either categorical or numerical.
    - `visual-elements`: Part of the figure representing the data series.
    - `data-series/{x|y}`: The x and y coordinates of the values depicted in the figure. This is the target to be predicted for the test set images.

- `train/images/`: This folder contains a collection of figures in JPG format to be used as training data.
- `test/images/`: This folder contains a collection of figures to be used as test data. You are expected to predict the corresponding data series for each figure in this folder.
- `sample_submission.csv`: A sample submission file in the correct format.
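
To make the field list above more concrete, here is a small sketch that pulls a few of those fields out of one annotation and resolves tick ids to their text labels. The exact nesting is an assumption inferred from the path-style field names (`text` as a list of items, `axes` keyed by `x-axis` and `y-axis`, `data-series` as a list of `x`/`y` pairs), and the `example` annotation is hand-written for illustration, not taken from the dataset.

```python
def summarize_annotation(ann):
    """Collect a few headline fields from one annotation dict."""
    # Map text ids to their strings so tick ids can be resolved to labels.
    text_by_id = {t['id']: t['text'] for t in ann.get('text', [])}

    # For each axis, look up the label text that each tick id points to.
    tick_labels = {}
    for axis_name, axis in ann.get('axes', {}).items():
        tick_labels[axis_name] = [
            text_by_id.get(tick['id']) for tick in axis.get('ticks', [])
        ]

    # The prediction target: one (x, y) pair per depicted value.
    series = [(pt['x'], pt['y']) for pt in ann.get('data-series', [])]

    return {
        'source': ann.get('source'),
        'chart_type': ann.get('chart-type'),
        'plot_bb': ann.get('plot-bb'),
        'tick_labels': tick_labels,
        'series': series,
    }


# Hypothetical annotation, written by hand only to exercise the function.
example = {
    'source': 'generated',
    'chart-type': 'vertical_bar',
    'plot-bb': {'height': 200, 'width': 300, 'x0': 40, 'y0': 20},
    'text': [
        {'id': 0, 'text': 'A', 'role': 'tick_label'},
        {'id': 1, 'text': 'B', 'role': 'tick_label'},
    ],
    'axes': {
        'x-axis': {'ticks': [{'id': 0, 'tick_pt': {'x': 90, 'y': 220}},
                             {'id': 1, 'tick_pt': {'x': 240, 'y': 220}}]},
        'y-axis': {'ticks': []},
    },
    'data-series': [{'x': 'A', 'y': 1.0}, {'x': 'B', 'y': 2.5}],
}

summary = summarize_annotation(example)
```

For a real file, the dict would come from `json.load` on one of the files in `train/annotations/`; if the actual layout differs from the assumed one, the key lookups would need adjusting.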

## **Data Splits**

The full test set comprises about 4,000 figures extracted from professionally-produced sources. It does not contain any generated figures. The distribution of chart types in the public and private test sets may not be identical.
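
Since the full test set contains only extracted figures, it may be useful to know which training annotations are extracted and which are generated, for example to build a validation set that resembles the test set. That use is a suggestion, not something the description prescribes. The sketch below groups annotation names by their `source` field; the helper name and the toy input are hypothetical, and the literal values `'generated'` and `'extracted'` are assumed from the field description.

```python
def group_by_source(annotations):
    """Group annotation names by the value of their `source` field.

    `annotations` maps a file name to its already-loaded annotation dict.
    """
    groups = {}
    for name, ann in annotations.items():
        # Files without a `source` field are collected under 'unknown'.
        groups.setdefault(ann.get('source', 'unknown'), []).append(name)
    return groups


# Toy input for illustration only.
toy = {
    'a.json': {'source': 'generated'},
    'b.json': {'source': 'extracted'},
    'c.json': {'source': 'generated'},
    'd.json': {},
}
groups = group_by_source(toy)
```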

## **Additional Resources**

The competition is inspired by the CHART-Info competition series. Resources from these competitions may be beneficial to participants.

In [1]:
import os
import json

# Define directory
dir_path = 'train/annotations'

# Create an empty dictionary to hold the counts
chart_types = {}

# Iterate through each JSON file in the directory
for file_name in os.listdir(dir_path):
    if file_name.endswith('.json'):
        file_path = os.path.join(dir_path, file_name)

        # Open the file and load as a JSON
        with open(file_path, 'r') as f:
            data = json.load(f)

        # Get the chart type
        chart_type = data.get('chart-type', None)
        if chart_type is not None:
            # If this chart type is already in the dictionary, increment the count
            if chart_type in chart_types:
                chart_types[chart_type] += 1
            # Otherwise, add it to the dictionary with a count of 1
            else:
                chart_types[chart_type] = 1

# Calculate total number of charts
total_charts = sum(chart_types.values())

# Print raw counts and percentages
for chart_type, count in chart_types.items():
    percentage = (count / total_charts) * 100
    print(f'Chart type: {chart_type}, Count: {count}, Percentage: {percentage:.2f}%')


Chart type: scatter, Count: 11243, Percentage: 18.56%
Chart type: vertical_bar, Count: 19189, Percentage: 31.68%
Chart type: dot, Count: 5131, Percentage: 8.47%
Chart type: line, Count: 24942, Percentage: 41.17%
Chart type: horizontal_bar, Count: 73, Percentage: 0.12%
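
The cell above counts annotations per chart type but does not check that every annotation has a matching image. A minimal sketch of that check follows, assuming images and annotations share a file stem (as the later `.json` to `.jpg` renaming implies); the function name `pairing_report` is hypothetical.

```python
from pathlib import Path


def pairing_report(annotation_dir, image_dir):
    """Count annotations and images and list any that lack a counterpart."""
    ann = {p.stem for p in Path(annotation_dir).glob('*.json')}
    img = {p.stem for p in Path(image_dir).glob('*.jpg')}
    return {
        'n_annotations': len(ann),
        'n_images': len(img),
        'annotations_without_image': sorted(ann - img),
        'images_without_annotation': sorted(img - ann),
    }
```

Running `pairing_report('train/annotations', 'train/images')` before the copy step would surface any missing image that would otherwise make `shutil.copy` fail partway through.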


In [3]:
# Data partition
import os
import json
import shutil
import random

# Define the paths
orig_annotation_path = 'train/annotations'
orig_image_path = 'train/images'
new_base_path = 'data_classification'
new_train_annotation_path = os.path.join(new_base_path, 'train/annotations')
new_train_image_path = os.path.join(new_base_path, 'train/images')
new_test_annotation_path = os.path.join(new_base_path, 'test/annotations')
new_test_image_path = os.path.join(new_base_path, 'test/images')

# Create new directories if they don't exist
os.makedirs(new_train_annotation_path, exist_ok=True)
os.makedirs(new_train_image_path, exist_ok=True)
os.makedirs(new_test_annotation_path, exist_ok=True)
os.makedirs(new_test_image_path, exist_ok=True)

# Store files by chart type
chart_files = {chart_type: [] for chart_type in chart_types}

# Iterate through each JSON file in the directory
for file_name in os.listdir(orig_annotation_path):
    if file_name.endswith('.json'):
        file_path = os.path.join(orig_annotation_path, file_name)

        # Open the file and load as a JSON
        with open(file_path, 'r') as f:
            data = json.load(f)

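        # Add the file to the appropriate list. Files without a chart type are
        # skipped, matching the counting cell above.
        chart_type = data.get('chart-type')
        if chart_type is not None:
            chart_files.setdefault(chart_type, []).append(file_name)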

# Now split the files into train and test sets and copy them
for chart_type, files in chart_files.items():
    # Shuffle the list for randomness
    random.shuffle(files)

    # Find the index that splits the files into about 60/40
    split_index = int(0.6 * len(files))

    # Split the files
    train_files = files[:split_index]
    test_files = files[split_index:]

    # Copy the files (shutil.copy leaves the originals in place)
    for file_name in train_files:
        # Copy annotation
        shutil.copy(os.path.join(orig_annotation_path, file_name),
                    os.path.join(new_train_annotation_path, file_name))
        # Copy image
        shutil.copy(os.path.join(orig_image_path, file_name.replace('.json', '.jpg')),
                    os.path.join(new_train_image_path, file_name.replace('.json', '.jpg')))

    for file_name in test_files:
        # Copy annotation
        shutil.copy(os.path.join(orig_annotation_path, file_name),
                    os.path.join(new_test_annotation_path, file_name))
        # Copy image
        shutil.copy(os.path.join(orig_image_path, file_name.replace('.json', '.jpg')),
                    os.path.join(new_test_image_path, file_name.replace('.json', '.jpg')))
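
The partition cell shuffles with an unseeded `random.shuffle`, so each run produces a different split. The sketch below shows one way to make the same per-chart-type 60/40 split reproducible, and works out the split sizes implied by the counts printed earlier. The function name `stratified_split` and its `seed` parameter are hypothetical additions, not part of the original cell.

```python
import random


def stratified_split(files_by_type, train_fraction=0.6, seed=0):
    """Split each chart type's files separately, with a fixed random seed."""
    rng = random.Random(seed)
    train, test = [], []
    for chart_type in sorted(files_by_type):
        # Sort first so the result does not depend on directory listing order.
        files = sorted(files_by_type[chart_type])
        rng.shuffle(files)
        cut = int(train_fraction * len(files))
        train.extend(files[:cut])
        test.extend(files[cut:])
    return train, test


# Counts printed by the chart-type counting cell.
counts = {
    'scatter': 11243,
    'vertical_bar': 19189,
    'dot': 5131,
    'line': 24942,
    'horizontal_bar': 73,
}

# (train, test) sizes per chart type under int(0.6 * n), as in the cell above.
sizes = {t: (int(0.6 * n), n - int(0.6 * n)) for t, n in counts.items()}
n_train = sum(tr for tr, _ in sizes.values())  # 36344
n_test = sum(te for _, te in sizes.values())   # 24234
```

With these counts, `horizontal_bar` ends up with only 43 training and 30 held-out files, which is worth keeping in mind when reading per-class metrics.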
