# Statistics of data

In [5]:
import pandas as pd
df15 = pd.read_csv("trainLabels15.csv")
df_pivot = df15.pivot(columns='level', values='image')
df_pivot.count()

level
0    25810
1     2443
2     5292
3      873
4      708
dtype: int64

In [7]:
df19 = pd.read_csv("trainLabels19.csv")
df_pivot = df19.pivot(columns='diagnosis', values='id_code')
df_pivot.count()

diagnosis
0    1805
1     370
2     999
3     193
4     295
dtype: int64

Both datasets are imbalanced: class 0 holds far more images than any of classes 1 to 4, and classes 3 and 4 are the smallest. This imbalance can bias a machine learning model toward the majority class (class 0) during training.
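To put numbers on this, the share of each class can be computed from the counts printed above. The sketch below re-enters those counts by hand rather than reading the CSV files, and the variable names (`counts15`, `counts19`, `shares`) are illustrative only.

```python
import pandas as pd

# Class counts copied from the outputs above (2015 and 2019 label files)
counts15 = pd.Series({0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708})
counts19 = pd.Series({0: 1805, 1: 370, 2: 999, 3: 193, 4: 295})

# Share of each class within its own dataset, in percent
shares = pd.DataFrame({
    "2015": counts15 / counts15.sum() * 100,
    "2019": counts19 / counts19.sum() * 100,
}).round(1)
print(shares)
```

The same table could be produced directly from the dataframes with `value_counts(normalize=True)` on the label column.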

### Addressing Class Imbalance
The proposed strategy to counter this imbalance is to rotate the images in classes 1, 2, 3, and 4 by 90, 180, and 270 degrees and add the rotated copies to their respective classes. Each original image then contributes three new images, so every augmented class grows to four times its original size.
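The idea can be illustrated on a tiny array instead of a real image. This is only a sketch: it uses NumPy's `np.rot90` as a stand-in for an image library, and the helper name `rotated_copies` is hypothetical.

```python
import numpy as np

def rotated_copies(image):
    """Return the 90, 180 and 270 degree (counterclockwise) rotations of an image array."""
    return [np.rot90(image, k) for k in (1, 2, 3)]

# A 2x3 "image" with distinct pixel values so every rotation is distinguishable
image = np.arange(6).reshape(2, 3)
copies = rotated_copies(image)

# One original plus three rotated copies per source image
images_per_original = 1 + len(copies)
print(images_per_original)
for copy in copies:
    print(copy.shape)
```

Because rotations by multiples of 90 degrees only rearrange pixels, no interpolation is involved and no image content is lost.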


## Merging 2019 and 2015 datasets

In [9]:
df19.rename(columns = {'id_code':'image', 'diagnosis':'level'}, inplace = True)
df19

| | image | level |
|---|---|---|
| 0 | 000c1434d8d7 | 2 |
| 1 | 001639a390f0 | 4 |
| 2 | 0024cdab0c1e | 1 |
| 3 | 002c21358ce6 | 0 |
| 4 | 005b95c28852 | 0 |
| ... | ... | ... |
| 3657 | ffa47f6a7bf4 | 2 |
| 3658 | ffc04fed30e6 | 0 |
| 3659 | ffcf7b45f213 | 2 |
| 3660 | ffd97f8cd5aa | 0 |


In [10]:
df = pd.merge(df15, df19, how='outer')
df_pivot = df.pivot(columns='level', values='image')
df_pivot.count()

level
0    27615
1     2813
2     6291
3     1066
4     1003
dtype: int64
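With `how='outer'` and no `on` argument, `pd.merge` joins on the columns the two frames share (here `image` and `level`). When the frames have no rows in common, this amounts to stacking them. The sketch below checks that on toy data; the rows are made up for illustration and are not taken from the label files.

```python
import pandas as pd

# Toy stand-ins for the two label tables (placeholder rows)
a = pd.DataFrame({"image": ["img_a", "img_b"], "level": [0, 2]})
b = pd.DataFrame({"image": ["img_c"], "level": [2]})

merged = pd.merge(a, b, how="outer")
stacked = pd.concat([a, b], ignore_index=True)

# Compare the rows as sorted tuples, since the two calls may order rows differently
same = sorted(merged.itertuples(index=False, name=None)) == sorted(
    stacked.itertuples(index=False, name=None)
)
print(len(merged), same)
```

For large frames `pd.concat` is usually the more direct choice, but it does not reorder rows the way the merge does.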

# Datasets
The two datasets are now merged into one dataframe. Further actions:
- **Generate CSV**: Create a CSV file containing the merged dataframe for easy access and reference.
- **Organizing the images**: Organize the images into their respective directories based on their class labels for efficient management and access.
- **Augmentation of Images**: Rotate images belonging to classes 1, 2, 3, and 4 to counter the imbalance within these classes, ensuring a more balanced representation across all classes.
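Before touching any image, the effect of the augmentation step on class sizes can be estimated: each image in classes 1 to 4 keeps its original and gains three rotated copies. The sketch below assumes the merged counts printed earlier, and its variable names are illustrative.

```python
import pandas as pd

# Merged class counts copied from the output above
merged_counts = pd.Series({0: 27615, 1: 2813, 2: 6291, 3: 1066, 4: 1003})

# Classes 1-4 keep each original and add three rotations; class 0 is left alone
factor = pd.Series({0: 1, 1: 4, 2: 4, 3: 4, 4: 4})
expected = merged_counts * factor
print(expected)

# Ratio of the largest class to the smallest, before and after augmentation
ratio_before = round(merged_counts.max() / merged_counts.min(), 1)
ratio_after = round(expected.max() / expected.min(), 1)
print(ratio_before, ratio_after)
```

Under these assumptions the classes become more balanced but not equal: classes 3 and 4 remain several times smaller than class 0.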

In [40]:
df.to_csv("train19&15.csv", index = False)

##### Dividing images into their respective directories according to their classes

In [49]:
import os
import shutil

# Source folders holding the resized 2015 and 2019 images
source_dirs = ['../../../Downloads/archive/resized train 15', '../../../Downloads/archive/resized train 19']

# Create one directory per class label
for label in df['level'].unique():
    os.makedirs('class ' + str(label), exist_ok=True)

# List each source folder once instead of once per image
available = {directory: set(os.listdir(directory)) for directory in source_dirs}

for index, row in df.iterrows():
    image_name = row['image'] + '.jpg'
    class_label = row['level']

    for directory in source_dirs:
        if image_name in available[directory]:
            image_to_copy = os.path.join(directory, image_name)
            destination = 'class ' + str(class_label)
            shutil.copy(image_to_copy, destination)
            break
    else:
        # Reached only when neither source folder contains the image
        print("IMAGE NOT FOUND")

##### Augmenting the images of classes 1, 2, 3, and 4

In [27]:
from tqdm.notebook import tqdm
import cv2

Augmenting_df = df[df['level'] != 0]

# Rotation applied for each file-name suffix. Angles in the names are counted
# counterclockwise, so a 90-degree clockwise turn is saved as "rotate270".
rotations = {
    '_rotate180.jpg': cv2.ROTATE_180,
    '_rotate270.jpg': cv2.ROTATE_90_CLOCKWISE,
    '_rotate90.jpg': cv2.ROTATE_90_COUNTERCLOCKWISE,
}

rotated_rows = []
for index, row in tqdm(Augmenting_df.iterrows(), total=len(Augmenting_df)):
    class_label = row['level']
    directory = 'class ' + str(class_label)
    Original_Image = cv2.imread(os.path.join(directory, row['image'] + '.jpg'))
    if Original_Image is None:
        # cv2.imread returns None when the file is missing or unreadable
        print("IMAGE NOT FOUND")
        continue
    for suffix, rotation in rotations.items():
        name = row['image'] + suffix
        cv2.imwrite(os.path.join(directory, name), cv2.rotate(Original_Image, rotation))
        # Rotated entries keep the '.jpg' extension in the 'image' column, unlike the originals
        rotated_rows.append({'image': name, 'level': class_label})

# Append all new rows at once; merging inside the loop rebuilds the dataframe on every iteration
Augmented_df = pd.concat([df, pd.DataFrame(rotated_rows)], ignore_index=True)
# Keep each original next to its rotated copies, as the outer merge did
Augmented_df = Augmented_df.sort_values(['image', 'level']).reset_index(drop=True)

    


In [30]:
Augmented_df

| | image | level |
|---|---|---|
| 0 | 000c1434d8d7 | 2 |
| 1 | 000c1434d8d7_rotate180.jpg | 2 |
| 2 | 000c1434d8d7_rotate270.jpg | 2 |
| 3 | 000c1434d8d7_rotate90.jpg | 2 |
| 4 | 001639a390f0 | 4 |
| ... | ... | ... |
| 72302 | ffd97f8cd5aa | 0 |
| 72303 | ffec9a18a3ce | 2 |
| 72304 | ffec9a18a3ce_rotate180.jpg | 2 |
| 72305 | ffec9a18a3ce_rotate270.jpg | 2 |


In [31]:
df_pivot = Augmented_df.pivot(columns='level', values='image')
df_pivot.count()

level
0    27615
1    11252
2    25164
3     4264
4     4012
dtype: int64

In [29]:
Augmented_df.to_csv("AugmentedTrain19&15.csv", index = False)