As the first step of this tutorial, we will format our data.

We will be using a dataset from Kaggle: https://www.kaggle.com/paramaggarwal/fashion-product-images-small

Note: You will need to create a Kaggle account in order to download the data. Once you have done that, download the data to the machine where you are going to train your model.

In [1]:
%load_ext autoreload

%autoreload 2

In [2]:
import os
import sys
sys.path.append('../../')

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

Read in data
===

Note: change this to wherever you stored your data

In [4]:
DATA_DIR = '/home/ec2-user/fashion_dataset'

In [5]:
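# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='warn' instead.
df = pd.read_csv(f'{DATA_DIR}/styles.csv', error_bad_lines=False)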

b'Skipping line 6044: expected 10 fields, saw 11\nSkipping line 6569: expected 10 fields, saw 11\nSkipping line 7399: expected 10 fields, saw 11\nSkipping line 7939: expected 10 fields, saw 11\nSkipping line 9026: expected 10 fields, saw 11\nSkipping line 10264: expected 10 fields, saw 11\nSkipping line 10427: expected 10 fields, saw 11\nSkipping line 10905: expected 10 fields, saw 11\nSkipping line 11373: expected 10 fields, saw 11\nSkipping line 11945: expected 10 fields, saw 11\nSkipping line 14112: expected 10 fields, saw 11\nSkipping line 14532: expected 10 fields, saw 11\nSkipping line 15076: expected 10 fields, saw 12\nSkipping line 29906: expected 10 fields, saw 11\nSkipping line 31625: expected 10 fields, saw 11\nSkipping line 33020: expected 10 fields, saw 11\nSkipping line 35748: expected 10 fields, saw 11\nSkipping line 35962: expected 10 fields, saw 11\nSkipping line 37770: expected 10 fields, saw 11\nSkipping line 38105: expected 10 fields, saw 11\nSkipping line 38275: ex

Some rows are malformed (they contain more than the 10 expected fields), so we will just skip those lines for now.
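
To make the "expected 10 fields, saw 11" messages more concrete, here is a minimal sketch that finds rows whose field count differs from the header, using only the standard library. The helper `find_malformed_rows` and the three-column sample are made up for illustration; they are not part of Tonks or of the dataset.

```python
import csv
import io


def find_malformed_rows(csv_text):
    """Return (line_number, field_count) for rows that do not match the header width."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = len(header)
    bad_rows = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the 1-based line number of the row just read
            bad_rows.append((reader.line_num, len(row)))
    return bad_rows


# Hypothetical sample: the second product has an unquoted comma in its name.
sample = (
    "id,gender,productDisplayName\n"
    "1,Men,Blue Shirt\n"
    "2,Women,Red Dress, Long\n"
    "3,Boys,Cap\n"
)
bad_rows = find_malformed_rows(sample)
print(bad_rows)
```

An unquoted comma inside a product name is one plausible cause of such rows, but that is an assumption about this dataset rather than something checked here.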

In [6]:
df.shape

(44424, 10)

In [7]:
df.head()

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt


Each line in the CSV represents a single product. The columns tell us who the product is for, what type of product it is, and some of its attributes.

We also have the product name, which we will use as the text input to our models.

Each product also has a corresponding image in the `images` folder with the name `{id}.jpg`.

Creating Image URL
===

We need to create a column indicating where the images are stored.

Note: the paths are built from `DATA_DIR`, so make sure it points to wherever you stored your data.

In [8]:
df['image_urls'] = df['id'].apply(lambda x: f'{DATA_DIR}/images/{x}.jpg')

Let's check the first one to make sure it worked.

In [9]:
df.loc[0, 'image_urls']

'/home/ec2-user/fashion_dataset/images/15970.jpg'

In [10]:
df.head()

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName,image_urls
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt,/home/ec2-user/fashion_dataset/images/15970.jpg
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans,/home/ec2-user/fashion_dataset/images/39386.jpg
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch,/home/ec2-user/fashion_dataset/images/59263.jpg
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants,/home/ec2-user/fashion_dataset/images/21379.jpg
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt,/home/ec2-user/fashion_dataset/images/53759.jpg


Some of the rows do not have matching image files, so we need to remove those.

We'll check if the file exists and if not, then we'll throw out those rows.

In [11]:
bad_ids = []

for i, row in df.iterrows():
    if not os.path.isfile(row['image_urls']):
        print(row['id'])
        print(row['image_urls'])
        bad_ids.append(row['id'])

39403
/home/ec2-user/fashion_dataset/images/39403.jpg
39410
/home/ec2-user/fashion_dataset/images/39410.jpg
39401
/home/ec2-user/fashion_dataset/images/39401.jpg
39425
/home/ec2-user/fashion_dataset/images/39425.jpg
12347
/home/ec2-user/fashion_dataset/images/12347.jpg


In [12]:
df.shape

(44424, 11)

In [13]:
df = df[~df['id'].isin(bad_ids)]

In [14]:
df.shape

(44419, 11)
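
The check-and-filter steps above can also be written as a single boolean mask instead of a loop. The following is only a sketch on a toy DataFrame with temporary files standing in for the real images; `toy_df`, `has_image`, and the file names are illustrative.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty placeholder "images" for ids 1 and 3; id 2 has no file.
    for image_id in (1, 3):
        open(os.path.join(tmp_dir, f'{image_id}.jpg'), 'wb').close()

    toy_df = pd.DataFrame({'id': [1, 2, 3]})
    toy_df['image_urls'] = toy_df['id'].apply(
        lambda x: os.path.join(tmp_dir, f'{x}.jpg')
    )

    # True where the file exists on disk, False otherwise.
    has_image = toy_df['image_urls'].apply(os.path.isfile)

    missing_ids = toy_df.loc[~has_image, 'id'].tolist()
    kept_df = toy_df[has_image]

print(missing_ids)
print(kept_df.shape)
```

The mask version gives the same filtered rows as collecting `bad_ids` and calling `isin`, while still letting you inspect which ids were dropped.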

We also need to throw out rows without associated text.

In [15]:
df = df[df['productDisplayName'].notnull()]

In [16]:
df.shape

(44412, 11)
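
As a side note, the same filter can be expressed with `dropna`. The sketch below uses a small made-up DataFrame to show that the two forms select the same rows.

```python
import pandas as pd

# Toy data: the second product has no display name.
toy_df = pd.DataFrame({
    'id': [1, 2, 3],
    'productDisplayName': ['Blue Shirt', None, 'Grey T-shirt'],
})

# Boolean-mask form, as used in this notebook.
with_mask = toy_df[toy_df['productDisplayName'].notnull()]

# Equivalent form: drop rows where this one column is null.
with_dropna = toy_df.dropna(subset=['productDisplayName'])

print(with_mask.equals(with_dropna))
```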

Split into task specific datasets
===

Part of the power of Tonks is that you can train one model using a different dataset for each task. That way you don't have to label all of your items with the exact same labels.

We will fake having different datasets by subsampling from this main df. For this first version, we'll train on the `gender` and `season` columns.

In [17]:
gender_df = df.sample(frac=0.8)

In [18]:
gender_df.gender.value_counts()

Men       17777
Women     14852
Unisex     1695
Boys        677
Girls       529
Name: gender, dtype: int64

In [19]:
season_df = df.sample(frac=0.6)

In [20]:
season_df.season.value_counts()

Summer    12834
Fall       6931
Winter     5093
Spring     1777
Name: season, dtype: int64
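
The subsamples above are drawn without a seed, so the counts will differ from run to run. If you want repeatable subsets, one option is to pass `random_state` to `sample`. This is a sketch on toy data, and the seed value 17 is an arbitrary choice.

```python
import pandas as pd

toy_df = pd.DataFrame({'id': range(10)})

# With the same seed, the same rows are drawn in the same order.
first_sample = toy_df.sample(frac=0.8, random_state=17)
second_sample = toy_df.sample(frac=0.8, random_state=17)

print(len(first_sample))
print(first_sample.equals(second_sample))
```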

Format gender data
===

Now we will put the data in the format needed for Tonks.

In [21]:
gender_df.shape

(35530, 11)

First, we'll get rid of any null values because Tonks will throw an error if given a null value.

In [22]:
gender_df = gender_df[gender_df['gender'].notnull()].copy()  # copy so later column assignments act on an independent DataFrame

In [23]:
gender_df.shape

(35530, 11)

We set the `gender` column to be a pandas category type.

In [24]:
gender_df['gender'] = gender_df['gender'].astype('category')

We create a new column called `gender_cat` that contains the category to predict as an integer.
This is an important step because Tonks does not natively handle categories as text.

In [25]:
gender_df['gender_cat'] = gender_df['gender'].cat.codes
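
To illustrate how pandas assigns these integer codes, here is a small sketch on made-up labels. It also shows one reason to drop nulls first: pandas encodes a missing value as `-1`, which is not a valid class index.

```python
import pandas as pd

# Toy labels, including one missing value.
toy_labels = pd.Series(['Men', 'Women', None, 'Boys', 'Men']).astype('category')

# Categories are sorted, and each code is the position of the label in that list.
categories = list(toy_labels.cat.categories)
codes = toy_labels.cat.codes.tolist()

print(categories)  # ['Boys', 'Men', 'Women']
print(codes)       # [1, 2, -1, 0, 1]
```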

In [26]:
# Category mapping
{label: i for i, label in enumerate(gender_df['gender'].cat.categories)}

{'Boys': 0, 'Girls': 1, 'Men': 2, 'Unisex': 3, 'Women': 4}

Note: you will need to save this mapping for later so that you can use your model to make predictions.
Since this is a simple model, we won't save it, but for a real project, we would store it somewhere.
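
For a real project, one simple way to keep the mapping is to write it to a JSON file. The sketch below uses a temporary directory and the hypothetical file name `gender_mapping.json`; where you store it in practice is up to you.

```python
import json
import os
import tempfile

# The mapping printed above.
gender_mapping = {'Boys': 0, 'Girls': 1, 'Men': 2, 'Unisex': 3, 'Women': 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = os.path.join(tmp_dir, 'gender_mapping.json')

    # Save the label -> integer mapping.
    with open(mapping_path, 'w') as f:
        json.dump(gender_mapping, f)

    # Load it back, as you would at prediction time.
    with open(mapping_path) as f:
        loaded_mapping = json.load(f)

# Invert it to turn a predicted integer back into a label.
index_to_label = {index: label for label, index in loaded_mapping.items()}
print(index_to_label[2])  # Men
```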

Now our gender data is in the right format for Tonks, so we'll split it into training and validation sets and save them.

Note: the columns required for training the models are:
- gender_cat
- image_urls
- productDisplayName
- id (for bookkeeping)

We keep the other columns for convenience, but could drop them if we wanted to.
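
If you did want to drop the extra columns, selecting the required ones by name would be enough. This is a sketch on a made-up two-row DataFrame; the values and paths are placeholders.

```python
import pandas as pd

# Toy rows with one extra column ('season') that training does not need.
toy_df = pd.DataFrame({
    'id': [1, 2],
    'season': ['Fall', 'Summer'],
    'productDisplayName': ['Blue Shirt', 'Grey T-shirt'],
    'image_urls': ['/tmp/images/1.jpg', '/tmp/images/2.jpg'],
    'gender_cat': [2, 2],
})

required_columns = ['id', 'productDisplayName', 'image_urls', 'gender_cat']
slim_df = toy_df[required_columns]

print(list(slim_df.columns))
```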

In [27]:
gender_train_df, gender_valid_df = train_test_split(gender_df, train_size=0.75, random_state=17)

In [28]:
gender_train_df.to_csv(f'{DATA_DIR}/gender_train.csv', index=False)

In [29]:
gender_valid_df.to_csv(f'{DATA_DIR}/gender_valid.csv', index=False)
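
One thing to keep in mind when reading these files back is that CSV does not store the pandas category dtype, while the integer `gender_cat` column survives unchanged. The sketch below shows this round trip on toy data written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({
    'id': [1, 2],
    'gender': pd.Categorical(['Men', 'Women']),
})
toy_df['gender_cat'] = toy_df['gender'].cat.codes

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_train.csv')
    # index=False keeps the row index out of the file, as in the cells above.
    toy_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

print(list(reloaded_df.columns))
print(reloaded_df['gender_cat'].tolist())
print(reloaded_df['gender'].dtype.name == 'category')  # False after the round trip
```

This is another reason to rely on the integer column, together with a saved mapping, rather than on the category dtype itself.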

Format season data
===

In [30]:
season_df.shape

(26647, 11)

First, we'll get rid of any null values.

In [31]:
season_df = season_df[season_df['season'].notnull()].copy()  # copy so later column assignments act on an independent DataFrame

In [32]:
season_df.shape

(26635, 11)

We set the `season` column to be a pandas category type.

In [33]:
season_df['season'] = season_df['season'].astype('category')

We create a new column called `season_cat` that contains the category to predict as an integer.
This is an important step because Tonks does not natively handle categories as text.

In [34]:
season_df['season_cat'] = season_df['season'].cat.codes

In [35]:
# Category mapping
{label: i for i, label in enumerate(season_df['season'].cat.categories)}

{'Fall': 0, 'Spring': 1, 'Summer': 2, 'Winter': 3}

Note: you will need to save this mapping for later so that you can use your model to make predictions.
Since this is a simple model, we won't save it, but for a real project, we would store it somewhere.

Now our season data is in the right format for Tonks, so we'll split it into training and validation sets and save them.

In [36]:
season_train_df, season_valid_df = train_test_split(season_df, train_size=0.75, random_state=17)

In [37]:
season_train_df.to_csv(f'{DATA_DIR}/season_train.csv', index=False)

In [38]:
season_valid_df.to_csv(f'{DATA_DIR}/season_valid.csv', index=False)
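
As a quick sanity check on a split like the ones above, you can confirm the sizes and that no product appears in both sets. This is a sketch on a made-up 20-row DataFrame, not on the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 products with a fake integer label.
toy_df = pd.DataFrame({
    'id': range(100, 120),
    'season_cat': [0, 1, 2, 3] * 5,
})

toy_train_df, toy_valid_df = train_test_split(
    toy_df, train_size=0.75, random_state=17
)

# Every row lands in exactly one of the two sets.
overlap = set(toy_train_df['id']) & set(toy_valid_df['id'])

print(len(toy_train_df), len(toy_valid_df))
print(overlap)
```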

Now that we have some data, we can train a model with it! Move on to the notebook `Step2_train_image_model.ipynb` to see how.