## 7. Group Assignment & Presentation



__You should be able to start up on this exercise after Lecture 1.__

*This exercise must be a group effort. That means everyone must participate in the assignment.*

In this assignment you will solve a data science problem end-to-end, pretending to be recently hired data scientists in
a company. To help you get started, we've prepared a checklist to guide you through the project. Here are the main steps
that you will go through:

1. Frame the problem and look at the big picture
2. Get the data
3. Explore and visualise the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models
7. Present your solution (video presentation) 

In each step we list a set of questions that one should have in mind when undertaking a data science project. The list
is not meant to be exhaustive, but does contain a selection of the most important questions to ask. We will be available
to provide assistance with each of the steps, and will allocate some part of each lesson towards working on the projects.

Your group must submit a _**single**_ Jupyter notebook, structured in terms of the first 6 sections listed above
(the seventh will be a video uploaded to some streaming platform, e.g. YouTube, Vimeo, etc.).


In [205]:
# Importing the necessary libraries for this project
# Standard Library modules
import os
import re
import shutil

# Dependencies
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

### 1. Analysis: Frame the problem and look at the big picture
1. Find a problem/task that everyone in the group finds interesting
2. Define the objective in business terms
3. How should you frame the problem (supervised/unsupervised etc.)?
4. How should performance be measured?

## 1. Analysis: Frame the problem and look at the big picture
TODO: Write a description of the problem and the business objective. Define the problem in business terms.

The following project will consist of a machine-learning system that identifies and classifies Pokemon from images,
allowing the users to:

##### 1. Verify if the image displays a Pokemon:
Determine if an uploaded image contains a Pokemon or not, so that the identification of a Pokemon can be automated,
saving time with large datasets.

##### 2. Type classification:
Additionally, the system should be able to predict the displayed Pokemon type/s'. This would allow the end user to achieve
greater personalization, reaching a point in which it could recommend strategies, based on the opponent's Pokemon team.


#### Framing the problem
Supervised learning is the most appropiate approach, since a complete dataset can be supplied. Moreover, since the data
is not continuous, classification will be used:

- Is it a Pokemon?
To answer this question, Binary classification we will need to implement a machine learning model that can perform
image classification. In order to train this model, a dataset containing images of Pokemon and Digimon will be provided.
To measure the performance of this model, there are a few metrics to take into account:
  - Accuracy: Measurement of how many guesses are correct. Above 90% should be an acceptable value.
  - Precision: Fraction of the predicted Pokemon that are actually Pokemon:
  $$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$
  - Recall: Fraction of actual Pokémon images correctly identified:
  $$\text{Recall}= \frac {\text{True Positives}}{\text{True Positives + False Negatives}} $$
  - F1 score: since the dataset is imbalanced, a harmonic mean of precision and recall is also useful:
  $$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

- Which is this species' type?
To answer this question, we will need a machine learning model capable of labelling the images accordingly, using many
different labels, as the expected output could consist of multiple elements, since some Pokemon species possess 2 types.

To measure the performance of this model, there are a few metrics to take into account:
  - Hamming Loss: Fraction of incorrectly predicted labels (either false positive or false negative):
  $$\text{Hamming Loss} = \frac{\text{Number of Incorrect Labels}}{\text{Total Labels}}$$
  - Precision: Fraction of predicted types that are correct (averaged across labels).
  - Recall: Fraction of actual types correctly identified:
  $$\text{Recall}= \frac {\text{True Positives}}{\text{True Positives + False Negatives}} $$
  - F1 score: since the dataset is imbalanced, a harmonic mean of precision and recall is also useful:
  $$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
  - Mean Average Precision (mAP): Average precision computed for each type, capturing how well the model ranks true positives.

### 2. Get the data
1. Find and document where you can get the data from
2. Get the data
3. Check the size and type of data (time series, geographical etc)

## 2. Get the data

### Data Sources

To train the model for Pokémon classification, two datasets have been selected
to cover the primary tasks: identifying Pokémon and distinguishing them from
non-Pokémon images. These datasets provide a robust foundation for the project.

- The PokeAPI dataset is an extensive dataset containing all sorts of information
about the Pokemon universe, including information on all Pokemon species, images
and other information such as moves, abilities, and locations. The dataset is
available on GitHub [here](https://github.com/PokeAPI/pokeapi.git) and the
corresponding submodules.

- The Digimon dataset is a collection of images scraped from the
[Wikimon.net](https://wikimon.net/Visual_List_of_Digimon) website and contains
about 1000 images of Digimon species and variations, since this only serves as
negative examples of Pokemon, we only need their images and none of the additional
information about them.

The web scrapper used to gather the Digimon data can be found on GitHub
[here](https://github.com/lorenzo-stacchio/Digimon_Dataset), although there
exists a Google Drive link with the data already gathered, which can be found
[here](https://drive.google.com/drive/folders/1tmcdsoX67NvmAgtmGJgo6kb3N6SlJeLu?usp=share_link)


### 3. Explore the data
1. Create a copy of the data for explorations (sampling it down to a manageable size if necessary)
2. Create a Jupyter notebook to keep a record of your data exploration
3. Study each feature and its characteristics:
    * Name
    * Type (categorical, int/float, bounded/unbounded, text, structured, etc)
    * Percentage of missing values
    * Check for outliers, rounding errors etc
4. For supervised learning tasks, identify the target(s)
5. Visualise the data
6. Study the correlations between features
7. Identify the promising transformations you may want to apply (e.g. convert skewed targets to normal via a log transformation)
8. Document what you have learned

## Data exploration

- PokeAPI Dataset:
    - The dataset contains images for all Pokemon species, one for each game
    they have been in among many alternate forms and color variations.
    - The dataset also includes CSV files with information about Pokemon, where
    to find them, their moves and abilities, etc...
- Digimon Dataset
    - The dataset contains images scraped from the Digimon wiki, this includes
    roughly about a thousand images of Digimon species and variations.

In [206]:
def read_properties(path):
    props = {}
    property_regex = re.compile(r'#{0}(.+)[:=]([^\n\r#]+)#?')
    with open(path, 'r') as f:
        lines = f.readlines()
        for line in lines:
            match = property_regex.match(line)
            if match:
                key = match.group(1).strip()
                value = match.group(2).strip()
                #print('Found property: {}={}'.format(key, value))
                props[key] = value
    return props

In [207]:
props = read_properties('paths.env')

pokeAPI_data_repo = props['pokeAPI-data']
pokeAPI_sprites_repo = props['pokeAPI-sprites']
digimon_datasource = props['digimon-images']

output_dir = props['dataset-dir']
display(props)

{'pokeAPI-data': '""',
 'pokeAPI-sprites': '""',
 'digimon-images': '""',
 'dataset-dir': './Dataset'}

In [208]:
# Using __ as a prefix to clarify that variables are not meant to be used later
# PokeAPI csv data
__pokeAPI_data_root = os.path.join(pokeAPI_data_repo, 'data/v2/csv')
# PokeAPI sprites folder
__pokeAPI_sprites_folder = os.path.join(pokeAPI_sprites_repo, 'sprites')
# Digimon images folder
__digimon_images_folder = os.path.join(digimon_datasource, 'images')

In [209]:
print('Checking for data...')
__pokeAPI_data_url = 'https://github.com/PokeAPI/pokeapi.git'
__pokeAPI_sprites_url = 'https://github.com/PokeAPI/sprites.git'
__digimon_data_url = 'https://drive.google.com/drive/folders/1tmcdsoX67NvmAgtmGJgo6kb3N6SlJeLu?usp=share_link'

__missing_sources = []
if not os.path.exists(__pokeAPI_data_root):
    print('You must download the PokeAPI data first from the following git: {}'.format(__pokeAPI_data_url))
    __missing_sources.append('PokeAPI data')
else:
    print('Retrieving PokeAPI data from: {}'.format(__pokeAPI_data_root))

if not os.path.exists(__pokeAPI_sprites_folder):
    print('You must download the PokeAPI sprites first from the following git: {}'.format(__pokeAPI_sprites_url))
    __missing_sources.append('PokeAPI sprites')
else:
    print('Retrieving PokeAPI sprites from: {}'.format(__pokeAPI_sprites_folder))

if not os.path.exists(__digimon_images_folder):
    print('You must download the Digimon data first from the following Google Drive: {}'.format(__digimon_data_url))
    __missing_sources.append('Digimon images')
else:
    print('Retrieving Digimon images from: {}'.format(__digimon_images_folder))

if len(__missing_sources) > 1:
    print('You must download the following sources: {}'.format(', '.join(__missing_sources)))
    # raise FileNotFoundError('Missing data sources')

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

Checking for data...
You must download the PokeAPI data first from the following git: https://github.com/PokeAPI/pokeapi.git
You must download the PokeAPI sprites first from the following git: https://github.com/PokeAPI/sprites.git
You must download the Digimon data first from the following Google Drive: https://drive.google.com/drive/folders/1tmcdsoX67NvmAgtmGJgo6kb3N6SlJeLu?usp=share_link
You must download the following sources: PokeAPI data, PokeAPI sprites, Digimon images


In [210]:
# Verify the datasource hasn't generated earlier
dataset_dir = os.path.join(output_dir, 'dataset')
dataset_path = os.path.join(dataset_dir, 'pokemon.csv')
generate_dataset = True
if os.path.exists(dataset_path):
    generate_dataset = False
    print('The dataset has already been generated, please delete to regenerate')

The dataset has already been generated, please delete to regenerate


In [211]:
def import_data():
    original_pokemon = pd.read_csv(os.path.join(__pokeAPI_data_root, 'pokemon.csv'))
    original_types = pd.read_csv(os.path.join(__pokeAPI_data_root, 'types.csv'))
    original_pokemon_types = pd.read_csv(os.path.join(__pokeAPI_data_root, 'pokemon_types.csv'))
    return original_pokemon, original_types, original_pokemon_types

## Trimming the data

The PokeAPI dataset is quite extensive, it contains all Pokemon (currently 1025)
and also many alternative form information. This information is assigned an ID
that is greater than 10000 so that it does not interfere with the original Pokemon
IDs, as such, we must filter out all IDs greater than 10000.

Additionally, we will remove a lot of the columns that are not relevant to our
current analysis, such as the foreign keys pointing to relationships outside our
scope and some of the data irrelevant to us such as height and weight.

There's also some types that are not relevant to our analysis, because they are
only used for specific mechanics in the games and do not represent a Pokemon's
type, such as the "shadow" type. We will remove these types from the dataset,
thankfully, the same as Pokemon IDs apply, these types have IDs greater than 10000.

We will also rename some of the columns to make them more readable and to avoid
confusion later on.

In [212]:
def clean_source_datasets(original_pokemon, original_types, original_pokemon_types):
    id_cutoff = 10000

    pokemon_types = original_pokemon_types[original_pokemon_types['pokemon_id'] < id_cutoff]
    pokemon = original_pokemon[original_pokemon['id'] < id_cutoff]
    types = original_types[original_types['id'] < id_cutoff]
    types = types.drop(columns=['damage_class_id', 'generation_id'])
    pokemon = pokemon.drop(columns=['species_id', 'height', 'weight', 'base_experience', 'order', 'is_default'])
    pokemon = pokemon.rename(columns={'identifier': 'name'})
    types = types.rename(columns={'identifier': 'type_label'})
    return pokemon, types, pokemon_types

In [213]:
def merge_dataset(pokemon, types, pokemon_types):
    merged = pokemon_types.merge(types, left_on='type_id', right_on='id').drop(columns=['id'])
    merged = merged.rename(columns={'identifier': 'type', 'slot': 'type_slot'})
    merged = pokemon.merge(merged, left_on='id', right_on='pokemon_id').drop(columns=['id'])
    dataset_target = os.path.join(dataset_dir, 'pokemon.csv')
    if not os.path.exists(dataset_dir):
        os.makedirs(dataset_dir)
    pokemon_merged.to_csv(dataset_target, index=False)
    print('Pokemon data saved to {} for saving'.format(dataset_target))
    return pokemon_merged

In [214]:
# Generate or read the dataset based on the previous checks
if generate_dataset:
    pokemon_merged = merge_dataset(*clean_source_datasets(*import_data()))
else:
    print('Reading dataset from {}'.format(dataset_path))
    pokemon_merged = pd.read_csv(dataset_path)
pokemon_merged

Reading dataset from ./Dataset/dataset/pokemon.csv


Unnamed: 0,name,pokemon_id,type_id,type_slot,type_label
0,bulbasaur,1,12,1,grass
1,bulbasaur,1,4,2,poison
2,ivysaur,2,12,1,grass
3,ivysaur,2,4,2,poison
4,venusaur,3,12,1,grass
...,...,...,...,...,...
1546,iron-crown,1023,9,1,steel
1547,iron-crown,1023,14,2,psychic
1548,terapagos,1024,1,1,normal
1549,pecharunt,1025,4,1,poison


In [215]:
# Check if images have already been extracted:
images_dir = os.path.join(output_dir, 'images')
extract_images = {}
pokemon_images_dir = os.path.join(images_dir, 'pokemon')
digimon_images_dir = os.path.join(images_dir, 'digimon')
extract_images['pokemon'] = not os.path.exists(pokemon_images_dir)
extract_images['digimon'] = not os.path.exists(digimon_images_dir)

In [216]:
def extract_pokemon_images():
    source_images = os.path.join(__pokeAPI_sprites_folder, 'pokemon/other/official-artwork')
    poke_image_files = os.listdir(source_images)
    print('Preparing {} Pokemon images...'.format(len(poke_image_files)))
    os.makedirs(pokemon_images_dir)
    # Like the data, images with ID higher than 10000 are not relevant
    valid_image_pattern = re.compile(r'^[0-9]{1,4}\.png')
    for file in poke_image_files:
        if valid_image_pattern.match(file):
            source = os.path.join(source_images, file)
            target = os.path.join(pokemon_images_dir, file)
            shutil.copy(source, target)
            print('Copied {} to {}'.format(file, target))

In [217]:
def extract_digimon_images():
    digimon_image_files = os.listdir(__digimon_images_folder)
    print('Preparing {} Digimon images...'.format(len(digimon_image_files)))
    os.makedirs(digimon_images_dir)
    for file in digimon_image_files:
        source = os.path.join(__digimon_images_folder, file)
        target = os.path.join(digimon_images_dir, file)
        shutil.copy(source, target)
        print('Copied file\n\tfrom: {}\n\tto: {}'.format(file, target))

In [218]:
if extract_images['pokemon']:
    extract_pokemon_images()
else:
    print('Pokemon images already extracted')
if extract_images['digimon']:
    extract_digimon_images()
else:
    print('Digimon images already extracted')

Pokemon images already extracted
Digimon images already extracted


In [219]:
# TODO: Add some visualization here to show some images and maybe some graphs

### 4. Prepare the data
Notes:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all data transformations you apply, for three reasons:
    * So you can easily prepare the data the next time you run your code
    * So you can apply these transformations in future projects
    * To clean and prepare the test set

1. Data cleaning:
    * Fix or remove outliers (or keep them)
    * Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)
2. Feature selection (optional):
    * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    * Discretize continuous features
    * Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features

In [220]:
# TODO: This is where the images must be processed to be used in a model

In [221]:
from PIL import Image
import os

# Configuration
output_folder = f"{images_dir}/new"  # Replace with your output folder path
new_size = (128, 128)  # Set the desired size (width, height)

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

def resize_images(input_folder, output_folder, new_size):
    for filename in os.listdir(input_folder):
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, filename)
        file_extension = os.path.splitext(filename)[1].lower()
        print(file_extension)
        try:
            with Image.open(input_path) as img:
                img.thumbnail(new_size)  # Resizes while maintaining aspect ratio
                if file_extension in ['.jpg', '.jpeg']:
                    img = img.convert('RGB')  # Ensure JPEG images are in RGB mode
                    output_path = os.path.splitext(output_path)[0] + '.png'  # Change extension to .png
                    img.save(output_path, 'PNG')
                else:
                    img.save(output_path, file_extension[1:].upper())
                print(f"Resized and saved: {filename}")
        except Exception as e:
            print(f"Failed to process {filename}: {e}")
# Check if output_folder is empty
if not os.listdir(f"{output_folder}/pokemon") and not os.listdir(f"{output_folder}/digimon"):
    print(f"The output folder {output_folder} is empty.")
    resize_images(pokemon_images_dir, f"{output_folder}/pokemon", new_size)
    resize_images(digimon_images_dir, f"{output_folder}/digimon", new_size)
else:
    print(f"The output folder {output_folder} is not empty.")


The output folder ./Dataset/images/new is not empty.


### 5. Short-list promising models
We expect you to do some additional research and train at **least one model per team member**.

1. Train mainly quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc) using default parameters
2. Measure and compare their performance
3. Analyse the most significant variables for each algorithm
4. Analyse the types of errors the models make
5. Have a quick round of feature selection and engineering if necessary
6. Have one or two more quick iterations of the five previous steps
7. Short-list the top three to five most promising models, preferring models that make different types of errors

In [222]:
# TODO: This is where we train two models !!! One for each task OR remove one of the tasks?

### FNN

#### Load the data

In [223]:
# Dataset properties
IMG_SIZE = 128
BATCH_SIZE = 32
BUFFER_SIZE = 2000

In [224]:
def data_loader(path):
    return tf.keras.preprocessing.image_dataset_from_directory(
        path,
        interpolation='area',
        image_size=(IMG_SIZE, IMG_SIZE),
        shuffle=False,
        batch_size=None
    )

In [225]:
# Split the dataset into training, validation, and test sets
train_ds = data_loader(f"{images_dir}/new").take(160).batch(BATCH_SIZE).prefetch(buffer_size=BUFFER_SIZE)
val_ds = data_loader(f"{images_dir}/new").skip(160).take(200).batch(BATCH_SIZE).prefetch(buffer_size=BUFFER_SIZE)
test_ds = data_loader(f"{images_dir}/new").skip(180).take(200).batch(BATCH_SIZE).prefetch(buffer_size=BUFFER_SIZE)

Found 2152 files belonging to 2 classes.
Found 2152 files belonging to 2 classes.
Found 2152 files belonging to 2 classes.


In [226]:
# remember class names
class_names = ['digimon', 'pokemon']

# optimize performance
train_ds = train_ds.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat().prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [227]:
# build model
fnn = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)), # 3 channels for RGB
    layers.Rescaling(1./255), # normalize pixel values to [0, 1]
    layers.Flatten(), # flatten the 128x128x3 input images
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.125),
    layers.Dense(1, activation='sigmoid') # output layer, binary classification
])

# compile model
fnn.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=0.0001),
    loss='binary_crossentropy',
    metrics=['accuracy'],
    jit_compile=True
)

# inspect model
fnn.summary()

In [228]:
# stop when val_accuracy doesnt improve
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    patience=5,
    restore_best_weights=True,
    verbose=1
)

# training loop
training = fnn.fit(
    train_ds,
    validation_data=val_ds,
    epochs=25,
    steps_per_epoch=2000 // BATCH_SIZE,
    callbacks=[early_stopping]
)

# evaluate on test set
test_loss, test_acc = fnn.evaluate(test_ds)
print(f"\nTest accuracy: {test_acc:.4f}")
print(f"Test loss: {test_loss:.4f}")

Epoch 1/25


ValueError: Exception encountered when calling Sequential.call().

[1mInvalid input shape for input Tensor("data:0", shape=(None, None, 128, 128, 3), dtype=float32). Expected shape (None, 128, 128, 3), but input has incompatible shape (None, None, 128, 128, 3)[0m

Arguments received by Sequential.call():
  • inputs=tf.Tensor(shape=(None, None, 128, 128, 3), dtype=float32)
  • training=True
  • mask=None

### 6. Fine-tune the system
1. Fine-tune the hyperparameters
2. Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error

In [None]:
# TODO: Optimizing the chosen models

### 7. Present your solution
1. Document what you have done
2. Create a nice 15 minute video presentation with slides
    * Make sure you highlight the big picture first
3. Explain why your solution achieves the business objective
4. Don't forget to present interesting points you noticed along the way:
    * Describe what worked and what did not
    * List your assumptions and you model's limitations
5. Ensure your key findings are communicated through nice visualisations or easy-to-remember statements (e.g. "the median income is the number-one predictor of housing prices")
6. Upload the presentation to some online platform, e.g. YouTube or Vimeo, and supply a link to the video in the notebook.

In [None]:
# TODO: Documentation

Géron, A. 2017, *Hands-On Machine Learning with Scikit-Learn and Tensorflow*, Appendix B, O'Reilly Media, Inc., Sebastopol.