**Instructions:**

Tasks in this homework are based on what is covered in laboratory exercises 3 and 4.

When you finish, download the notebook file in .ipynb format and upload it to the c3 homework 2 assignment section.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Course: Fundamental Concepts of AI**

#**Homework 2: Working with datasets**



**Student name and surname:**
Almir Mustafic
**Student index:**
20114
**Date:**
December 2, 2024

#**Comment on Homework 2**
As I was not sure whether we need to include the exercises from the Assignment (PDF), I am providing a [LINK](https://colab.research.google.com/drive/1wWv1Qpo3e3-maNM3F1Ru9fhsFiddlT7F) to the Colab notebook where I completed the class exercises.

### **Assignment 1: Stellar Classification Dataset - SDSS17** (2 points)

**Classification of Stars, Galaxies, and Quasars**  
The dataset for this task comes from the Sloan Digital Sky Survey (SDSS). It contains observations of celestial objects and their spectral characteristics. You can find the dataset and its description at this [Kaggle link](https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data). Download the dataset from the link and load it like you did in the labs.

### **Instructions**  
Perform the following tasks:

---

### **1. Dataset Overview**  
1. What is the dataset about?  
2. What is the format/type of this dataset?  
3. For which task is this dataset used (e.g., classification, regression, etc.)?  
4. What are the inputs and what are the outputs?  
5. How many samples/instances are in this dataset?  
6. List all features and explain their meanings.  
7. List all targets/labels in the dataset.  
8. Draw a black-box input-output diagram for the dataset.  
9. What are the data types for each column in the dataset?  

---

### **2. Data Exploration**  
1. Display the first 5 and last 5 rows of the dataset.  
2. Check if there are any missing values in the dataset for each column.  
3. What is the class distribution in the dataset? Use the `data['class'].value_counts()` function to calculate the number of samples for each class.  

---

### **3. Statistical Analysis**  
1. Use the `data.describe()` method to display the minimum, maximum, mean, standard deviation, and percentiles (20%, 50%, and 75%) for numerical features.  
2. Plot a bar plot for the class distribution. Ensure your plot has:  
   - Axis labels  
   - A descriptive title  
   - Color and name for each class bar
3. Choose two features (e.g., `delta` and `alpha`) and create a 2D scatter plot to visualize patterns in the dataset. Color the points according to their class. Make sure to add:  
   - Axis labels  
   - A descriptive title  
   - A legend  

---

### **4. Data Cleaning and Preparation**  
1. Discard all columns that represent IDs or metadata (e.g., `obj_ID`, `run_ID`, `plate`, etc.) because those are not particularly useful for ML models. Retain only the following columns:  
   - `u`, `g`, `r`, `i`, `z`, `redshift`, and `class`.  
2. Print the cleaned dataset.  

---

### **5. Data Saving**

1. Save the cleaned dataset to a file named stellar_cleaned.csv, using a semicolon (;) as the separator.
2. Verify that the saved file can be reloaded correctly and print the first few rows to confirm.

Submit your answers with detailed explanations, well-documented code, and all required plots included.

In [None]:
import json
import os
import pandas as pd
from google.colab import files
import matplotlib.pyplot as plt

# install kaggle and set the API key (token)
# os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/Colab Notebooks/'
# !kaggle datasets list
# ! pip install -q kaggle
# files.upload()
# ! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json
# ! kaggle datasets list

# Load the dataset
stars = '/content/drive/MyDrive/Colab Notebooks/star_classification.csv'
data = pd.read_csv(stars)

# 1.1) What is the dataset about?
# The dataset contains data on astronomical objects from the Sloan Digital Sky Survey (SDSS). The dataset includes different
# attributes (e.g., photometric magnitudes in different bands, redshift, etc.) for astronomical objects classified into three
# categories: galaxies, stars and quasars.

# 1.2) What is the format/type of this dataset?
# The dataset is a CSV file in tabular format, where each row represents an astronomical object, and each column contains a specific
# attribute of the object.

# 1.3) For which task is this dataset used (e.g., classification, regression, etc.)?
# The dataset is primarily used for classification tasks with the goal to classify the astronomical objects into one of the classes
# (galaxy, star, qso). The classification is based on various input features.

# 1.4) What are the inputs and what are the outputs?
# The inputs are the columns that describe the characteristics of the astronomical objects: u, g, r, i, z: Magnitudes in different photometric bands.
# The redshift of the object is a measure of how much the object's light has been stretched due to the expansion of the universe.
# The output is the class column that indicates the class of the astronomical object.

# 1.5) How many samples/instances are in this dataset?
# 100000    code to check: data.shape[0]

# 1.6) List all features and explain their meanings.
# u, g, r, i, z are magnitudes in different photometric bands measured by the SDSS telescope. They correspond to different wavelengths
# in the electromagnetic spectrum, from ultraviolet to infrared.
# The redshift is a measure of how much the wavelength of light from an astronomical object has been stretched due to the expansion of the universe.
# This measure helps estimate the distance of a specific object from Earth.
# Class represents the type or classification of the astronomical object.

# 1.7) List all targets/labels in the dataset.
# Galaxy, QSO and star.

# 1.8) Draw a black-box input-output diagram for the dataset.
#    +------------------------------------+
#    |                                    |
#    |           Input Features           |
#    |                                    |
#    |  u, g, r, i, z, redshift          |
#    |                                    |
#    +------------------------------------+
#               |     (Classification)
#               v
#    +------------------------------------+
#    |                                    |
#    |         Output (Class Label)       |
#    |                                    |
#    |          GALAXY, QSO, STAR         |
#    |                                    |
#    +------------------------------------+

# 1.9) What are the data types for each column in the dataset?
# Code to check:
# data.dtypes
# obj_ID: float64, alpha: float64, delta: float64, u: float64, g: float64, r: float64, i: float64, z: float64,
# run_ID: int64, rerun_ID: int64, cam_col: int64, field_ID: int64, spec_obj_ID: float64, class: object,
# redshift: float64, plate: int64, MJD: int64, fiber_ID: int64

# 2.1) Display the first 5 and last 5 rows
print("EXERCISE 2.1")
print("First 5 rows:")
print(data.head())
print("\nLast 5 rows:")
print(data.tail())

# 2.2) Check for missing values in each column
print("\nEXERCISE 2.2")
print("Missing values in each column:")
print(data.isnull().sum())

# 2.3) Check the class distribution i.e. number of samples for each class
print("\nEXERCISE 2.3")
print("Class distribution:")
print(data['class'].value_counts())

# 3.1) Display summary statistics for numerical features
print("\nEXERCISE 3.1")
print(data.describe()) # default percentiles: 25%, 50%, 75%
print(data.describe(percentiles=[.2, .5, .75])) # percentiles requested in the task: 20%, 50%, 75%

# 3.2) Plot a bar plot for the class distribution. Ensure your plot has: axis labels, descriptive title,
# color and name for each class bar
# Calculate the class distribution
print("\nEXERCISE 3.2")
class_counts = data['class'].value_counts()
plt.figure(figsize=(8,6))
bars = class_counts.plot(kind='bar', color=['#3498db', '#e74c3c', '#2ecc71'])
plt.xlabel('Class', fontsize=12)
plt.ylabel('Number of Samples', fontsize=12)
plt.title('Class Distribution in the Dataset', fontsize=14)
# Annotate each bar with its sample count (the class names are the x-axis tick labels)
for i, v in enumerate(class_counts):
    plt.text(i, v + 5, str(v), ha='center', color='black', fontsize=12)
plt.xticks(rotation=0)
plt.show()


# 3.3) Choose two features (e.g., delta and alpha) and create a 2D scatter plot to visualize patterns in the
# dataset. Color the points according to their class. Make sure to add: axis labels, descriptive title, legend
print("\nEXERCISE 3.3")
feature_x = 'delta'
feature_y = 'alpha'

plt.figure(figsize=(8,6))
for class_value in data['class'].unique():
    class_data = data[data['class'] == class_value]
    plt.scatter(class_data[feature_x], class_data[feature_y], label=f'Class {class_value}', alpha=0.6)

plt.xlabel(feature_x, fontsize=12)
plt.ylabel(feature_y, fontsize=12)
plt.title(f'{feature_x} vs {feature_y} by Class', fontsize=14)

plt.legend(title='Class')
plt.show()

# 4.1) Discard all columns that represent IDs or metadata (e.g., obj_ID, run_ID, plate, etc.) because those
# are not particularly useful for ML models. Retain only the following columns: u, g, r, i, z, redshift, and class.
print("\nEXERCISE 4.1")
columns_to_keep = ['u', 'g', 'r', 'i', 'z', 'redshift', 'class']
data_cleaned = data[columns_to_keep]
print(data_cleaned.head(20))
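
Task 5 (Data Saving) has no code in the cell above. A minimal sketch of how it could be done follows. It assumes pandas' `to_csv` and `read_csv` with `sep=';'`. So that it runs on its own, it uses a small hypothetical stand-in frame (`sample_cleaned`, with made-up values) and a temporary directory; in the notebook, `data_cleaned` would take the place of the stand-in.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for data_cleaned (same columns, made-up values).
sample_cleaned = pd.DataFrame({
    'u': [23.9, 24.8, 25.3],
    'g': [22.3, 22.8, 22.6],
    'r': [20.4, 22.6, 20.6],
    'i': [19.2, 21.2, 19.3],
    'z': [18.8, 21.6, 18.9],
    'redshift': [0.63, 0.78, 0.64],
    'class': ['GALAXY', 'GALAXY', 'QSO'],
})

# Save with a semicolon separator; index=False keeps the row index out of the file.
out_path = os.path.join(tempfile.mkdtemp(), 'stellar_cleaned.csv')
sample_cleaned.to_csv(out_path, sep=';', index=False)

# Reload with the same separator and print the first rows to confirm the round trip.
reloaded = pd.read_csv(out_path, sep=';')
print(reloaded.head())
print("Same shape after reload:", reloaded.shape == sample_cleaned.shape)
```

Reading the file back with the default comma separator would put everything into a single column, so the same `sep=';'` is needed on both sides.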





### **Assignment 2: MEDMNIST Dataset** (2 points)

You are already familiar with the MNIST digits dataset. MEDMNIST is a collection of medical image datasets designed for machine learning tasks in the medical domain. Explore the [MEDMNIST website](https://medmnist.com/) and select a dataset of your choice. Perform the following tasks:

---

### **Instructions**  
1. **Dataset Overview**  
   - What is the dataset you chose about?  
   - What is the format/type of the data (e.g., images, tabular data, etc.)?  
   - What are the inputs and outputs in this dataset?  
   - What is the targeted task (e.g., classification, regression, etc.)?  
   - Draw a black-box input-output diagram to illustrate the dataset's structure.  

2. **Dataset Exploration**  
   - How many instances/samples does the dataset have?  
   - In which Python data type is the data stored (e.g., NumPy arrays, Pandas DataFrame, etc.)?  
   - What are the shapes of the samples?

---

Make sure to include detailed explanations and any relevant code to justify your answers.

Below is the code you can use to get the data.

In [None]:
!pip install medmnist

In [68]:
import numpy as np
import medmnist
from medmnist import INFO, Evaluator
from medmnist.dataset import BloodMNIST
# from medmnist.dataset import SynapseMNIST3D

In [None]:
# data_flag = 'bloodmnist' # change this string to get the dataset you want, e.g. 'bloodmnist', 'dermamnist', 'pathmnist' ...
data_flag = 'bloodmnist'
info = INFO[data_flag]
task = info['task']
DataClass = getattr(medmnist.dataset, info['python_class'])

classes = info['label']

# Print classes
for label, name in classes.items():
    print(f"Class {label}: {name}")

In [None]:
# load the data
train_dataset = DataClass(split='train', download=True)
test_dataset = DataClass(split='test', download=True)
val_dataset = DataClass(split='val', download=True)

In [None]:
print("TRAIN DATASET")
print(train_dataset)
print("===================")
print("TEST DATASET")
print(test_dataset)
print("===================")
print("VAL DATASET")
print(val_dataset)

You can access the images and the labels using train_dataset.imgs and train_dataset.labels. The example is shown below.

In [None]:
train_dataset.imgs                  # all images: 4-dimensional array (N, 28, 28, 3)
# train_dataset.imgs[0]             # one image (28, 28, 3)
# train_dataset.imgs[0][0]          # first pixel row: 28 pixels with 3 RGB values each
# train_dataset.imgs[0][0][0]       # one pixel: its 3 RGB values
# train_dataset.imgs[0][0][0][0]    # one channel value (red) of that pixel

In [None]:
train_dataset.labels

In [None]:
# visualization
train_dataset.montage(length=20)

1.1) What is the dataset you chose about?
According to the [documentation](https://medmnist.com/), the BloodMNIST dataset consists of 17,092 images of blood cells in 8 classes. The output of print() for train_dataset, test_dataset and val_dataset states that the images are 3×28×28 pixels. The dataset is split into train (11,959), val (1,712) and test (3,421) sets. The task is multi-class classification. There are the following classes within this dataset:
- Basophil
- Eosinophil
- Erythroblast
- Immature granulocytes (myelocytes, metamyelocytes, promyelocytes)
- Lymphocyte
- Monocyte
- Neutrophil
- Platelet

Code to see the dataset classes
data_flag = 'bloodmnist'
info = INFO[data_flag]
classes = info['label']

for label, name in classes.items():
    print(f"Class {label}: {name}")

Data description
The image array val_dataset.imgs has 4 dimensions: (number of images, 28, 28, 3).
Each image is described in the documentation as 3x28x28 and is stored channels-last, with shape (28, 28, 3); a single image can be seen via val_dataset.imgs[0]. The next level, val_dataset.imgs[0][0], is one row of the image: 28 pixels with 3 RGB values each. A single pixel can be seen via val_dataset.imgs[0][0][0], and a single channel value of that pixel via val_dataset.imgs[0][0][0][1] (here, green). Therefore, if we want to see the color of an individual dot, we can do it via val_dataset.imgs[0][0][0]. See the following link for more info https://www.rapidtables.com/web/color/RGB_Color.html
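
The indexing levels described above can be checked on a small array. The sketch below does not download anything: it uses a synthetic NumPy array (`imgs`, a hypothetical stand-in for val_dataset.imgs) with the same channels-last layout and one pixel set by hand.

```python
import numpy as np

# Synthetic stand-in for val_dataset.imgs: 5 RGB images of 28x28 pixels, channels last.
imgs = np.zeros((5, 28, 28, 3), dtype=np.uint8)
imgs[0, 0, 0] = [158, 111, 117]   # set one pixel so the output is not all zeros

print(imgs.ndim)              # 4: images, rows, columns, channels
print(imgs[0].shape)          # (28, 28, 3): one image
print(imgs[0][0].shape)       # (28, 3): first pixel row of that image
print(imgs[0][0][0])          # [158 111 117]: one pixel, its R, G, B values
print(imgs[0][0][0][1])       # 111: the green channel of that pixel
```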

1.2) What is the format/type of the data (e.g., images, tabular data, etc.)?
The data is in the format of images. Each image is represented as a 3D array with dimensions (28, 28, 3), where 28x28 represents the pixel grid, and 3 corresponds to the RGB color channels.

1.3) What are the inputs and outputs in this dataset?
Inputs: Images, represented as 3D arrays of shape (28, 28, 3) (RGB pixel values), as explained above.
Outputs: Class labels, representing the categories (e.g., Basophil, Eosinophil, Erythroblast, etc.), see 1.1 above for more details.

1.4) What is the targeted task (e.g., classification, regression, etc.)?
The targeted task is classification, not regression. The goal is to classify each input image into one of the predefined classes, see 1.1 above.

1.5) Draw a black-box input-output diagram to illustrate the dataset's structure (click to open and see the black box).
+-------------------+
|   Input: Image    |
|  (28x28x3 Array)  |
+-------------------+
          |
          v
+-------------------+
|       Model       |
|  (Classification) |
+-------------------+
          |
          v
+--------------------------------+
| Output: Class Label            |
| (Basophil, Eosinophil,         |
| Erythroblast, Immature         |
| granulocytes, Lymphocyte,      |
| Monocyte, Neutrophil, Platelet)|
+--------------------------------+

2.1) How many instances/samples does the dataset have?
The dataset has 17,092 samples in total, split into train (11,959), val (1,712) and test (3,421) sets.

2.2) In which Python data type is the data stored (e.g., NumPy arrays, Pandas DataFrame, etc.)?
The data is stored in NumPy arrays.

2.3) What are the shapes of the samples?
The shape of each sample is (28, 28, 3), see some of the questions and answers above for more details.
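
The answers to 2.1 to 2.3 can be read off the arrays directly. The following is only a sketch: instead of the downloaded data it builds zero-filled stand-in arrays (`splits`, a hypothetical name) with the split sizes quoted above, and assumes the `uint8`, channels-last layout of the image arrays.

```python
import numpy as np

# Stand-in arrays with the BloodMNIST split sizes quoted above (pixel values are zeros).
split_sizes = {'train': 11959, 'val': 1712, 'test': 3421}
splits = {name: np.zeros((n, 28, 28, 3), dtype=np.uint8)
          for name, n in split_sizes.items()}

# Data type, shape and dtype of each split.
for name, split_imgs in splits.items():
    print(name, type(split_imgs).__name__, split_imgs.shape, split_imgs.dtype)

# Total number of samples across the three splits.
total = sum(len(split_imgs) for split_imgs in splits.values())
print("Total samples:", total)
print("Shape of one sample:", splits['train'][0].shape)
```

With the real data, the same checks would be applied to train_dataset.imgs, val_dataset.imgs and test_dataset.imgs.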

### **Assignment 3: Dataset of your choice** (2 points)

Choose a dataset of your choice and repeat the tasks similar to the first two assignments depending on the data format you choose.


---




In [119]:
# !pip install medmnist

import numpy as np
import medmnist
from medmnist import INFO, Evaluator
from medmnist.dataset import DermaMNIST

# data_flag = 'bloodmnist' # change this string to get the dataset you want, e.g. 'bloodmnist', 'dermamnist', 'pathmnist' ...
data_flag = 'dermamnist'
info = INFO[data_flag]
task = info['task']
DataClass = getattr(medmnist.dataset, info['python_class'])

classes = info['label']

# Print classes
for label, name in classes.items():
    print(f"Class {label}: {name}")

# load the data
train_dataset = DataClass(split='train', download=True)
test_dataset = DataClass(split='test', download=True)
val_dataset = DataClass(split='val', download=True)

print("TRAIN DATASET")
print(train_dataset)
print("===================")
print("TEST DATASET")
print(test_dataset)
print("===================")
print("VAL DATASET")
print(val_dataset)


train_dataset.imgs                  # all images: 4-dimensional array (N, 28, 28, 3)
# train_dataset.imgs[0]             # one image (28, 28, 3)
# train_dataset.imgs[0][0]          # first pixel row: 28 pixels with 3 RGB values each
# train_dataset.imgs[0][0][0]       # one pixel: its 3 RGB values
# train_dataset.imgs[0][0][0][0]    # one channel value (red) of that pixel

# labels
train_dataset.labels

# visualization
train_dataset.montage(length=20)


Class 0: actinic keratoses and intraepithelial carcinoma
Class 1: basal cell carcinoma
Class 2: benign keratosis-like lesions
Class 3: dermatofibroma
Class 4: melanoma
Class 5: melanocytic nevi
Class 6: vascular lesions
Using downloaded and verified file: /root/.medmnist/dermamnist.npz
Using downloaded and verified file: /root/.medmnist/dermamnist.npz
Using downloaded and verified file: /root/.medmnist/dermamnist.npz
TRAIN DATASET
Dataset DermaMNIST of size 28 (dermamnist)
    Number of datapoints: 7007
    Root location: /root/.medmnist
    Split: train
    Task: multi-class
    Number of channels: 3
    Meaning of labels: {'0': 'actinic keratoses and intraepithelial carcinoma', '1': 'basal cell carcinoma', '2': 'benign keratosis-like lesions', '3': 'dermatofibroma', '4': 'melanoma', '5': 'melanocytic nevi', '6': 'vascular lesions'}
    Number of samples: {'train': 7007, 'val': 1003, 'test': 2005}
    Description: The DermaMNIST is based on the HAM10000, a large collection of multi-so

array([[158, 111, 117],
       [161, 116, 121],
       [164, 121, 130],
       [167, 127, 135],
       [166, 133, 142],
       [169, 139, 147],
       [172, 146, 155],
       [174, 151, 159],
       [185, 162, 168],
       [189, 164, 168],
       [192, 165, 170],
       [197, 167, 169],
       [201, 167, 168],
       [202, 163, 164],
       [196, 154, 155],
       [186, 146, 147],
       [185, 153, 156],
       [185, 156, 160],
       [190, 161, 165],
       [193, 164, 168],
       [194, 165, 169],
       [193, 164, 168],
       [190, 161, 165],
       [188, 159, 163],
       [190, 161, 165],
       [189, 160, 164],
       [187, 158, 160],
       [186, 157, 159]], dtype=uint8)

1.1) What is the dataset you chose about?
According to the [documentation](https://medmnist.com/), the DermaMNIST dataset consists of 10,015 dermatoscopic images of common pigmented skin lesions, categorized into 7 classes. The images are resized from their original dimensions of 600x450 pixels to 28x28 pixels and have 3 color channels. The dataset is split into three sets: training (7,007 samples), validation (1,003 samples), and testing (2,005 samples), following a 7:1:2 ratio. The task is multi-class classification, where each image is classified into one of the seven lesion types. There are the following classes within this dataset:
- actinic keratoses and intraepithelial carcinoma
- basal cell carcinoma
- benign keratosis-like lesions
- dermatofibroma
- melanoma
- melanocytic nevi
- vascular lesions

Code to see the dataset classes
data_flag = 'dermamnist'
info = INFO[data_flag]
classes = info['label']

for label, name in classes.items():
    print(f"Class {label}: {name}")

Data description
The image array val_dataset.imgs has 4 dimensions: (number of images, 28, 28, 3).
Each image is described in the documentation as 3x28x28 and is stored channels-last, with shape (28, 28, 3); a single image can be seen via val_dataset.imgs[0]. The next level, val_dataset.imgs[0][0], is one row of the image: 28 pixels with 3 RGB values each. A single pixel can be seen via val_dataset.imgs[0][0][0], and a single channel value of that pixel via val_dataset.imgs[0][0][0][1] (here, green). Therefore, if we want to see the color of an individual dot, we can do it via val_dataset.imgs[0][0][0]. See the following link for more info https://www.rapidtables.com/web/color/RGB_Color.html

1.2) What is the format/type of the data (e.g., images, tabular data, etc.)?
Just like in the previous exercise, the data is in the format of images. Each image is represented as a 3D array with dimensions (28, 28, 3), where 28x28 represents the pixel grid, and 3 corresponds to the RGB color channels.

1.3) What are the inputs and outputs in this dataset?
The same as in the previous exercise:
Inputs are the images, represented as 3D arrays of shape (28, 28, 3) (RGB pixel values), as explained above.
Outputs are the class labels, representing the categories (e.g., dermatofibroma, melanoma, melanocytic nevi, etc.), see 1.1 above for more details.

1.4) What is the targeted task (e.g., classification, regression, etc.)?
The targeted task is classification, not regression. The goal is to classify each input image into one of the predefined classes, see 1.1 above.

1.5) Draw a black-box input-output diagram to illustrate the dataset's structure (click to open and see the black box).
+-------------------+
|   Input: Image    |
|  (28x28x3 Array)  |
+-------------------+
          |
          v
+-------------------+
|       Model       |
|  (Classification) |
+-------------------+
          |
          v
+-------------------------------+
| Output: Class Label           |
| (Actinic Keratoses and        |
| Intraepithelial Carcinoma     |
| Basal Cell Carcinoma          |
| Benign Keratosis-Like Lesions |
| Dermatofibroma                |
| Melanoma                      |
| Melanocytic Nevi              |
| Vascular Lesions)             |
+-------------------------------+


2.1) How many instances/samples does the dataset have?
The dataset has 10,015 samples in total, split into three sets: training (7,007 samples), validation (1,003 samples), and testing (2,005 samples), following a 7:1:2 ratio.

2.2) In which Python data type is the data stored (e.g., NumPy arrays, Pandas DataFrame, etc.)?
The data is stored in NumPy arrays.

2.3) What are the shapes of the samples?
The shape of each sample is (28, 28, 3), see some of the questions and answers above for more details.
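
Assignment 1 looked at the class distribution with `value_counts()`. For an image dataset whose labels are a NumPy array, a comparable count could be sketched with `np.unique`. The labels below are hypothetical, and the `(N, 1)` layout is an assumption about how train_dataset.labels is stored.

```python
import numpy as np

# Hypothetical labels in the (N, 1) layout assumed for train_dataset.labels.
labels = np.array([[5], [5], [4], [2], [5], [1], [4]])

# Flatten to 1-D, then count how often each class index occurs.
class_ids, counts = np.unique(labels.ravel(), return_counts=True)
for class_id, count in zip(class_ids, counts):
    print(f"Class {class_id}: {count} samples")
```

With the real data, `labels` would be replaced by train_dataset.labels, and the class indices could be mapped to names through info['label'].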