# Personal Information
Name: Yuan Su

StudentID: 14534053

Email: yuan.su2@student.uva.nl

Submitted on: **19.03.2023**

# Data Context
**The purpose of this data selection is to test the SET-alternative algorithm, so the data should be simple and well curated. Clean and balanced datasets are therefore the first choice. Conveniently, the machine learning libraries available in Python ship with several suitable ones. To test how universally the algorithm applies, the datasets are chosen to train both artificial neural networks (ANNs) and convolutional neural networks (CNNs). Thus, some regression datasets and image datasets are selected.**



# Data Description

**For regression tasks, the ‘sklearn’ library provides several suitable datasets. The ‘California housing dataset’, ‘diabetes dataset’ and ‘wine dataset’ are chosen because they differ in dimensionality, indicating different levels of complexity (note that ‘sklearn’ ships the wine dataset with a three-class label as its target). The total number of instances also varies, which makes these datasets suitable for comparing the running time of ANNs and sparse neural networks (SNNs) under different algorithms. More importantly, these datasets do not contain missing values, and the correlations amongst their features also vary, so they cover a range of conditions for training different neural network architectures. For CNN classification tasks, the ‘MNIST’ dataset from the ‘torchvision’ package is chosen because it has enough training instances to train a CNN, while the resulting CNN remains small enough to run on a personal device without transfer learning. The ‘FMNIST’ (Fashion-MNIST) dataset is also chosen because it is considerably more complex than ‘MNIST’ in terms of how its features and labels differ.**

In [1]:
# Imports
import os
import numpy as np 
import pandas as pd
from sklearn.datasets import fetch_california_housing,load_diabetes,load_wine,fetch_openml
import seaborn as sns
import matplotlib.pyplot as plt 
import torch
from torchvision import datasets
import torchvision

### Data Loading

In [2]:
california = fetch_california_housing()
diabetes = load_diabetes()
wine = load_wine()
mnist = fetch_openml('mnist_784',version = 1)



In [None]:
# MNIST data (transform=None, so samples are returned as PIL images)
train_dataset_m = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=None,
    target_transform=None,
    download=True)
test_dataset_m = torchvision.datasets.MNIST(
    root='./data',
    train=False,
    transform=None,
    target_transform=None,
    download=True)


In [None]:
# FMNIST data (transform=None, so samples are returned as PIL images)
train_dataset_f = torchvision.datasets.FashionMNIST(
    root='./data',
    train=True,
    transform=None,
    target_transform=None,
    download=True)
test_dataset_f = torchvision.datasets.FashionMNIST(
    root='./data',
    train=False,
    transform=None,
    target_transform=None,
    download=True)

In [None]:
california_df = pd.DataFrame(california.data, columns=california.feature_names)
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

### Analysis for regression tasks:
A general overview of the regression datasets is presented below, showing the number of instances and features in each.


In [None]:
overview = {'data_name':['california','diabetes','wine'],
            'instance':[california_df.shape[0],diabetes_df.shape[0],wine_df.shape[0]],
            'features':[california_df.shape[1],diabetes_df.shape[1],wine_df.shape[1]]
           }
overview = pd.DataFrame(overview)
overview
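
The claim that these datasets contain no missing values can be checked directly. The following is a minimal sketch, assuming the two datasets bundled with scikit-learn (diabetes and wine) so that it runs without a download; the helper name `summarize` is hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes, load_wine


def summarize(name, bunch):
    """Return instance count, feature count and missing-value count for a dataset."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    return {
        'data_name': name,
        'instances': df.shape[0],
        'features': df.shape[1],
        'missing_values': int(df.isna().sum().sum()),
    }


# One row per dataset, in the same spirit as the overview table
summary = pd.DataFrame([
    summarize('diabetes', load_diabetes()),
    summarize('wine', load_wine()),
])
print(summary)
```

The same helper could be applied to the California housing data once it has been fetched.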

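### Correlation analysis for regression tasks:

The correlations amongst features vary across the three datasets. The matrices below cover the feature columns only, since the targets are not included in the DataFrames.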

In [None]:
co_california = california_df.corr()
co_diabetes = diabetes_df.corr()
co_wine = wine_df.corr()

In [None]:
# Three stacked heatmaps; the figure is tall enough to keep tick labels readable
fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(10, 15))

sns.heatmap(co_california, cmap='coolwarm', ax=axs.flat[0])
axs.flat[0].set_title('Correlation California')

sns.heatmap(co_diabetes, cmap='coolwarm', ax=axs.flat[1])
axs.flat[1].set_title('Correlation Diabetes')

sns.heatmap(co_wine, cmap='coolwarm', ax=axs.flat[2])
axs.flat[2].set_title('Correlation Wine')
plt.subplots_adjust(hspace=0.8)
plt.show()
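
The correlation matrices computed here relate features to each other. To inspect how each feature correlates with the target, one option is to load a dataset with `as_frame=True`, which provides a frame that includes a `target` column. A minimal sketch on the diabetes dataset, assuming a scikit-learn version that supports `as_frame` (0.23 or later); the variable names are illustrative only.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns a Bunch whose .frame holds the features plus a 'target' column
diabetes_frame = load_diabetes(as_frame=True).frame

# Pearson correlation of every feature with the target
target_corr = diabetes_frame.corr()['target'].drop('target')

# Order features by absolute correlation, strongest first
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.loc[order]
print(target_corr)
```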

### Analysis for classification tasks:

In [None]:
mnist_x = mnist.data
mnist_y = mnist.target


Label distribution of MNIST

In [None]:
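# Count how many instances each digit label has and print the raw numbers
label_counts = np.bincount(mnist_y.astype(int))
print('Instances per label:', label_counts)
sns.countplot(x=mnist_y)
plt.title('MNIST Label Distribution')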
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
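
To put a number on how balanced the labels are, the counts can be reduced to a ratio between the most and the least frequent class. This is a minimal sketch with a hypothetical helper `imbalance_ratio` and a small synthetic label list standing in for `mnist_y`, so that it runs without downloading anything.

```python
import numpy as np


def imbalance_ratio(labels):
    """Ratio of the largest to the smallest class count (1.0 means perfectly balanced)."""
    counts = np.bincount(np.asarray(labels, dtype=int))
    counts = counts[counts > 0]  # ignore label values that never occur
    return counts.max() / counts.min()


# Synthetic stand-in for the real labels: class 0 occurs four times, class 1 twice
toy_labels = [0, 0, 0, 0, 1, 1]
print(imbalance_ratio(toy_labels))  # 2.0
```

With the notebook's data, the call would be `imbalance_ratio(mnist_y.astype(int))`.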

MNIST training and test set sizes, with some sample images

In [None]:
print('Training set size:', len(train_dataset_m))
print('Test set size:', len(test_dataset_m))

fig, ax = plt.subplots(1, 5, figsize=(15, 3))
for i in range(5):
    img, label = train_dataset_m[i]
    ax[i].imshow(img, cmap='gray')
    ax[i].set_title('Label: {}'.format(label))
plt.show()
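
Because the datasets are loaded with `transform=None`, each sample is a PIL image rather than a tensor, which is convenient for plotting but not for training. The sketch below shows the conversion a CNN would typically need: scale the 8-bit pixel values to [0, 1] and add a channel axis. It uses plain NumPy and a synthetic 28×28 array in place of a real image, and the helper name `to_cnn_input` is hypothetical. In the notebook, `np.asarray(img)` would supply the array, and `torchvision.transforms.ToTensor()` performs a comparable conversion.

```python
import numpy as np


def to_cnn_input(image):
    """Convert an (H, W) uint8 grayscale image to a (1, H, W) float32 array in [0, 1]."""
    arr = np.asarray(image, dtype=np.float32) / 255.0
    return arr[np.newaxis, :, :]  # prepend the channel axis


# Synthetic stand-in for one MNIST image: all black with a single white pixel
fake_image = np.zeros((28, 28), dtype=np.uint8)
fake_image[14, 14] = 255

x = to_cnn_input(fake_image)
print(x.shape, x.dtype, x.min(), x.max())
```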


Label distribution of FMNIST

In [None]:
# Reuse the Fashion-MNIST training set loaded earlier instead of loading it again
fmnist_train_labels = train_dataset_f.targets
labels, counts = torch.unique(fmnist_train_labels, return_counts=True)
label_dist = dict(zip(labels.numpy(), counts.numpy()))
plt.bar(list(label_dist.keys()), list(label_dist.values()))
plt.xlabel('Label')
plt.ylabel('Count')
plt.title('Fashion-MNIST Label Distribution')
plt.show()


FMNIST training and test set sizes, with some sample images

In [None]:
print('Training set size:', len(train_dataset_f))
print('Test set size:', len(test_dataset_f))
fig, ax = plt.subplots(1, 5, figsize=(15, 3))
for i in range(5):
    img, label = train_dataset_f[i]
    ax[i].imshow(img, cmap='gray')
    ax[i].set_title('Label: {}'.format(label))
plt.show()
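
The numeric labels in the plot titles are hard to interpret for Fashion-MNIST. The following is a small sketch of a label-to-name lookup using the ten class names defined by the Fashion-MNIST dataset; the helper `fmnist_label_name` is hypothetical. torchvision also exposes these names through the dataset's `classes` attribute.

```python
# Class names of Fashion-MNIST, indexed by their integer label
FMNIST_CLASSES = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot',
]


def fmnist_label_name(label):
    """Map an integer Fashion-MNIST label (0-9) to its class name."""
    return FMNIST_CLASSES[int(label)]


print(fmnist_label_name(9))  # Ankle boot
```

In the plotting loop, the title could then be set with `fmnist_label_name(label)` instead of the raw number.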