# Statistics of data

In [27]:
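import pandas as pd

# 2015 training labels: one row per image, class label in the `level` column
df15 = pd.read_csv("trainLabels15.csv")
# Spread the images into one column per level, then count the non-null entries per column
df_pivot = df15.pivot(columns='level', values='image')
df_pivot.count()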

level
0    25810
1     2443
2     5292
3      873
4      708
dtype: int64

In [28]:
df19 = pd.read_csv("trainLabels19.csv")
df_pivot = df19.pivot(columns='diagnosis', values='id_code')
df_pivot.count()

diagnosis
0    1805
1     370
2     999
3     193
4     295
dtype: int64

Both datasets are imbalanced. Class 0 accounts for about 73% of the 2015 labels (25,810 of 35,126) and about 49% of the 2019 labels (1,805 of 3,662), while classes 3 and 4 each make up less than 10% of either dataset. A model trained on this distribution tends to favor the majority class (class 0).
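
To put numbers on the imbalance, the counts printed above can be turned into class shares and ratios against class 0. The sketch below hard-codes the two count tables from the outputs above; the helper name `class_shares` is illustrative and not part of the notebook.

```python
# Counts copied from the two outputs above.
counts15 = {0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708}
counts19 = {0: 1805, 1: 370, 2: 999, 3: 193, 4: 295}


def class_shares(counts):
    """Return each class's fraction of the total number of labels."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in (("2015", counts15), ("2019", counts19)):
    shares = class_shares(counts)
    for label in sorted(counts):
        # A ratio above 1 means class 0 has that many times more images.
        ratio = counts[0] / counts[label]
        print(f"{name} class {label}: {shares[label]:6.1%} of labels, "
              f"class 0 / class {label} = {ratio:.1f}")
```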

### Addressing Class Imbalance
The proposed strategy to counter this imbalance is to rotate the images in classes 1, 2, 3, and 4 by 90, 180, and 270 degrees and add the rotated copies to their respective classes. Each original image then contributes three new images, so every rotated class ends up with four times its original number of images.
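
A minimal sketch of the rotation step, assuming the Pillow library is available (the notebook does not say which image library it uses). The function name `rotated_copies` and the file names in the usage comment are hypothetical.

```python
from PIL import Image


def rotated_copies(img, angles=(90, 180, 270)):
    """Return a dict mapping each angle to a rotated copy of `img`.

    expand=True lets the canvas swap width and height for 90 and 270
    degrees, so non-square images are not cropped.
    """
    return {angle: img.rotate(angle, expand=True) for angle in angles}


# Hypothetical usage; the path and naming scheme are placeholders:
# with Image.open("class 1/example.png") as img:
#     for angle, copy in rotated_copies(img).items():
#         copy.save(f"class 1/example_rot{angle}.png")
```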


## Merging 2019 and 2015 datasets

In [29]:
# Rename the 2019 columns to match the 2015 ones so the two frames can be combined
df19.rename(columns={'id_code': 'image', 'diagnosis': 'level'}, inplace=True)
df19

| | image | level |
|---|---|---|
| 0 | 000c1434d8d7 | 2 |
| 1 | 001639a390f0 | 4 |
| 2 | 0024cdab0c1e | 1 |
| 3 | 002c21358ce6 | 0 |
| 4 | 005b95c28852 | 0 |
| ... | ... | ... |
| 3657 | ffa47f6a7bf4 | 2 |
| 3658 | ffc04fed30e6 | 0 |
| 3659 | ffcf7b45f213 | 2 |
| 3660 | ffd97f8cd5aa | 0 |


In [34]:
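# With no `on` argument, merge joins on the shared columns (image, level);
# how='outer' keeps the rows of both frames
df = pd.merge(df15, df19, how='outer')
df_pivot = df.pivot(columns='level', values='image')
df_pivot.count()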

level
0    27615
1     2813
2     6291
3     1066
4     1003
dtype: int64

# Datasets
The two datasets are now merged into a single dataframe. Further actions:
- **Generate CSV**: Write the merged dataframe to a CSV file for easy access and reference.
- **Organize Images**: Sort the images into directories named after their class labels for efficient management and access.
- **Address Class Imbalance**: Rotate the images in classes 1, 2, 3, and 4 to reduce their imbalance relative to class 0 and give a more balanced representation across all classes.
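
As a quick check on the last step, the effect of the rotations on the merged counts can be worked out by hand. The sketch below assumes every image in classes 1 to 4 gets exactly three rotated copies (90, 180, and 270 degrees) and that class 0 is left untouched; the counts are copied from the merged output above.

```python
merged = {0: 27615, 1: 2813, 2: 6291, 3: 1066, 4: 1003}
COPIES_PER_IMAGE = 3  # 90, 180 and 270 degree rotations

# Class 0 keeps its count; every other class gains three copies per original.
projected = {
    label: n if label == 0 else n * (1 + COPIES_PER_IMAGE)
    for label, n in merged.items()
}
total = sum(projected.values())
for label, n in sorted(projected.items()):
    print(f"class {label}: {n:6d} images ({n / total:.1%})")
```

Under these assumptions class 2 comes close to class 0, while classes 3 and 4 remain several times smaller, so rotation alone narrows the gap without closing it.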

In [40]:
df.to_csv("train19&15.csv", index = False)

##### Dividing images into their respective directories by class

In [42]:
import os

# Create one directory per class label, e.g. "class 0" through "class 4"
for label in df['level'].unique():
    class_dir = 'class ' + str(label)
    os.makedirs(class_dir, exist_ok=True)  # no error if the directory already exists
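
The loop above only creates the directories. One way to then sort the image files into them is sketched below, assuming the images sit in a single source folder and are named after the `image` column plus a file extension. The function name `sort_images`, the folder arguments, and the extension list are assumptions, since the notebook does not show this step.

```python
import os
import shutil


def sort_images(rows, src_dir, dst_root, extensions=(".png", ".jpeg", ".jpg")):
    """Copy each image into `dst_root/class <level>/`.

    `rows` is an iterable of (image, level) pairs, e.g.
    df[['image', 'level']].itertuples(index=False).
    Returns the image names for which no file was found.
    """
    missing = []
    for image, level in rows:
        class_dir = os.path.join(dst_root, "class " + str(level))
        os.makedirs(class_dir, exist_ok=True)
        for ext in extensions:
            src = os.path.join(src_dir, str(image) + ext)
            if os.path.isfile(src):
                shutil.copy2(src, class_dir)
                break
        else:
            # No candidate extension matched a file on disk.
            missing.append(image)
    return missing
```

Copying rather than moving keeps the source folder intact, and the returned list makes it easy to spot labels without a matching file.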