## 7. Group Assignment & Presentation



__You should be able to start on this exercise after Lecture 1.__

*This exercise must be a group effort. That means everyone must participate in the assignment.*

In this assignment you will solve a data science problem end-to-end, pretending to be recently hired data scientists in a company. To help you get started, we've prepared a checklist to guide you through the project. Here are the main steps that you will go through:

1. Frame the problem and look at the big picture
2. Get the data
3. Explore and visualise the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models
7. Present your solution (video presentation) 

In each step we list a set of questions that one should have in mind when undertaking a data science project. The list is not meant to be exhaustive, but does contain a selection of the most important questions to ask. We will be available to provide assistance with each of the steps, and will allocate some part of each lesson towards working on the projects.

Your group must submit a _**single**_ Jupyter notebook, structured according to the first six sections listed above (the seventh will be a video uploaded to a streaming platform, e.g. YouTube or Vimeo).

### 1. Analysis: Frame the problem and look at the big picture
4. How should performance be measured?

**Problem**: Categorizing clothing product images to verify if they align with their corresponding categories.

**Objective**: Improve inventory management and enhance customer experience by ensuring product images accurately 
represent their assigned categories.

**Framing the problem**: Since verified labels are unavailable (the assigned categories are exactly what needs checking), the problem can be framed as an unsupervised learning task:
- *Clustering*: Use neural networks like autoencoders to group similar images and identify potential mismatches within clusters.
- *Representation Learning*: Extract features using convolutional neural networks (CNNs) and perform clustering in the feature space.
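
As a minimal sketch of how clusters could be used to spot mismatches, assume some model has already assigned each image to a cluster. The function name `flag_mismatches` and the example values are hypothetical and are not output from this notebook. Items whose assigned category differs from the majority category of their cluster are flagged for review:

```python
from collections import Counter, defaultdict

def flag_mismatches(cluster_ids, categories):
    """Return indices of items whose category differs from their cluster's majority category."""
    by_cluster = defaultdict(list)
    for cid, cat in zip(cluster_ids, categories):
        by_cluster[cid].append(cat)
    # Majority category per cluster
    majority = {cid: Counter(cats).most_common(1)[0][0] for cid, cats in by_cluster.items()}
    return [
        i for i, (cid, cat) in enumerate(zip(cluster_ids, categories))
        if cat != majority[cid]
    ]

# Hypothetical example: six items in two clusters
cluster_ids = [0, 0, 0, 1, 1, 1]
categories = ["Apparel", "Apparel", "Accessories", "Footwear", "Footwear", "Footwear"]
suspects = flag_mismatches(cluster_ids, categories)
print(suspects)
```

Flagged items are only candidates: a cluster can legitimately mix categories, so a manual check of the flagged images would still be needed.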

**Performance**: ??? (How should performance be measured?)
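
One option, offered as a sketch rather than a settled answer: if a small sample of products can be checked by hand, performance could be measured as the precision and recall of the flagged mismatches against that sample. The function name and the IDs below are hypothetical.

```python
def precision_recall(flagged, true_mismatches):
    """Precision and recall of flagged items against a hand-verified set of mismatches."""
    flagged, true_mismatches = set(flagged), set(true_mismatches)
    true_positives = len(flagged & true_mismatches)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_mismatches) if true_mismatches else 0.0
    return precision, recall

# Hypothetical IDs: four items flagged by the model, three confirmed by hand
flagged_ids = [1, 2, 3, 4]
verified_ids = [2, 3, 5]
precision, recall = precision_recall(flagged_ids, verified_ids)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision indicates how much manual review effort is wasted on false alarms, while recall indicates how many real mismatches slip through.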

### 2. Get the data
1. Find and document where you can get the data from
2. Get the data
3. Check the size and type of data (time series, geographical etc)

In [None]:
import pandas as pd

# Load the styles.csv file, skipping problematic lines
styles_df = pd.read_csv('../../../../../../OneDrive - ViaUC/MAL1/Assignment 7/fashion-dataset/styles.csv', on_bad_lines='skip')

styles_df.head()


| | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |
| 3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011.0 | Casual | Manchester United Men Solid Black Track Pants |
| 4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012.0 | Casual | Puma Men Grey T-shirt |


In [31]:
import os

# Create a new DataFrame with only 'masterCategory' and 'id' columns.
# .copy() makes new_df independent of styles_df, which avoids pandas'
# SettingWithCopyWarning when the 'image_path' column is added below.
new_df = styles_df[['masterCategory', 'id']].copy()

# Define the path to the images directory
images_dir = 'fashion-dataset/images'

# Create a new column 'image_path' in new_df with the corresponding image paths
new_df['image_path'] = new_df['id'].apply(lambda x: os.path.join(images_dir, f"{x}.jpg"))

# Display the first few rows of the updated DataFrame
new_df.head()


| | masterCategory | id | image_path |
|---|---|---|---|
| 0 | Apparel | 15970 | fashion-dataset/images/15970.jpg |
| 1 | Apparel | 39386 | fashion-dataset/images/39386.jpg |
| 2 | Accessories | 59263 | fashion-dataset/images/59263.jpg |
| 3 | Apparel | 21379 | fashion-dataset/images/21379.jpg |
| 4 | Apparel | 53759 | fashion-dataset/images/53759.jpg |


In [23]:
# 224x224 is a common input size for pretrained CNNs (e.g. ImageNet models)
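
A minimal sketch of resizing to that size with Pillow follows. The helper name `resize_image` and the constant `TARGET_SIZE` are assumptions for illustration, and a synthetic image stands in for a product photo so the example runs without the dataset.

```python
from PIL import Image

TARGET_SIZE = (224, 224)  # assumed input size; adjust to the chosen model

def resize_image(img, size=TARGET_SIZE):
    """Convert a PIL image to RGB and resize it to a fixed size.

    Note: a plain resize does not preserve the aspect ratio.
    """
    return img.convert("RGB").resize(size)

# Synthetic greyscale stand-in for a product photo
example = Image.new("L", (60, 80))
resized = resize_image(example)
print(resized.size, resized.mode)
```

Converting to RGB first gives every image the same number of channels, which most CNN input layers expect.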

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_df, test_df = train_test_split(new_df, test_size=0.2, random_state=42)

# Extract the image paths and labels for training and testing sets
train_images = train_df['image_path'].tolist()
train_labels = train_df['masterCategory'].tolist()
test_images = test_df['image_path'].tolist()
test_labels = test_df['masterCategory'].tolist()

# Display the lengths of the training and testing sets
print(f"Number of training images: {len(train_images)}")
print(f"Number of testing images: {len(test_images)}")

Number of training images: 35539
Number of testing images: 8885


In [None]:
from PIL import Image
import os

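# Keep only (path, label) pairs whose image file exists, so that images and
# labels stay aligned (filtering the images alone would shift the labels)
train_pairs = [(p, y) for p, y in zip(train_images, train_labels) if os.path.exists(p)]
test_pairs = [(p, y) for p, y in zip(test_images, test_labels) if os.path.exists(p)]

train_labels = [y for _, y in train_pairs]
test_labels = [y for _, y in test_pairs]

# Image.open is lazy: pixel data is only read when an image is first used
train_images = [Image.open(p) for p, _ in train_pairs]
test_images = [Image.open(p) for p, _ in test_pairs]

# Number of images found on disk in each set
len(train_images), len(test_images)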

(35535, 8884)

### 3. Explore the data
1. Create a copy of the data for explorations (sampling it down to a manageable size if necessary)
2. Create a Jupyter notebook to keep a record of your data exploration
3. Study each feature and its characteristics:
    * Name
    * Type (categorical, int/float, bounded/unbounded, text, structured, etc)
    * Percentage of missing values
    * Check for outliers, rounding errors etc
4. For supervised learning tasks, identify the target(s)
5. Visualise the data
6. Study the correlations between features
7. Identify the promising transformations you may want to apply (e.g. convert skewed targets to normal via a log transformation)
8. Document what you have learned
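
As a small sketch of step 3 (type, missing values and class balance per feature), the following uses only the five rows shown in the `styles_df.head()` output above; on the real data the same calls would be applied to a copy of `styles_df`. The variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the styles_df.head() output (subset of columns)
sample = pd.DataFrame({
    "id": [15970, 39386, 59263, 21379, 53759],
    "masterCategory": ["Apparel", "Apparel", "Accessories", "Apparel", "Apparel"],
    "baseColour": ["Navy Blue", "Blue", "Silver", "Black", "Grey"],
    "year": [2011.0, 2012.0, 2016.0, 2011.0, 2012.0],
})

# Type of each feature
print(sample.dtypes)

# Percentage of missing values per feature
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Class balance of a categorical feature
category_counts = sample["masterCategory"].value_counts()
print(category_counts)
```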

### 4. Prepare the data
Notes:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all data transformations you apply, for three reasons:
    * So you can easily prepare the data the next time you run your code
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    
    
1. Data cleaning:
    * Fix or remove outliers (or keep them)
    * Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)
2. Feature selection (optional):
    * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    * Discretize continuous features
    * Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features
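
The advice to write functions for every transformation can be sketched with standardisation as an example. This is an illustration under assumptions, not a prescribed implementation: the function names and numbers are made up, and the statistics are learned from the training data only and then reused on the test data.

```python
from statistics import mean, pstdev

def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against constant features
    return mu, sigma

def standardise(values, mu, sigma):
    """Apply a previously fitted standardisation to any data set."""
    return [(v - mu) / sigma for v in values]

# Illustrative numbers only
train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

mu, sigma = fit_standardiser(train)
train_scaled = standardise(train, mu, sigma)
test_scaled = standardise(test, mu, sigma)  # reuses the training statistics
print(train_scaled, test_scaled)
```

Keeping the fit and apply steps separate prevents information from the test set leaking into the preparation of the training set.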

### 5. Short-list promising models
We expect you to do some additional research and train **at least one model per team member**.

1. Train mainly quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc) using default parameters
2. Measure and compare their performance
3. Analyse the most significant variables for each algorithm
4. Analyse the types of errors the models make
5. Have a quick round of feature selection and engineering if necessary
6. Have one or two more quick iterations of the five previous steps
7. Short-list the top three to five most promising models, preferring models that make different types of errors
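
Steps 1 and 2 could be sketched as follows with scikit-learn. The data here is synthetic (`make_classification`) and merely stands in for whatever features and labels the project ends up using; the choice of three models is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data, not the fashion dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Quick models from different categories, mostly with default parameters
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy per model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation on the training data keeps the test set untouched until the final evaluation.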

### 6. Fine-tune the system
1. Fine-tune the hyperparameters
2. Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error
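
A minimal sketch of both steps using scikit-learn's `GridSearchCV` is shown below. It again runs on synthetic data, and the model and parameter grid are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the fashion dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: search a small hyperparameter grid with cross-validation on the training set
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)

# Step 2: evaluate the refitted best model once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test set is used only once, after the search, so the reported score remains an honest estimate of the generalisation error.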

### 7. Present your solution
1. Document what you have done
2. Create a nice 15-minute video presentation with slides
    * Make sure you highlight the big picture first
3. Explain why your solution achieves the business objective
4. Don't forget to present interesting points you noticed along the way:
    * Describe what worked and what did not
    * List your assumptions and your model's limitations
5. Ensure your key findings are communicated through nice visualisations or easy-to-remember statements (e.g. "the median income is the number-one predictor of housing prices")
6. Upload the presentation to some online platform, e.g. YouTube or Vimeo, and supply a link to the video in the notebook.

Géron, A. 2017, *Hands-On Machine Learning with Scikit-Learn and TensorFlow*, Appendix B, O'Reilly Media, Inc., Sebastopol.