# Step 2: Data Collection & Initial EDA of Candidate Datasets
## Dataset 2: PaperDoll/Chictopia

### MLE Capstone: Outfit Recommender - Spring 2021
### By: Bazeley, Mikiko 
### GH: [@mmbazel](https://github.com/MMBazel)  

In this notebook, I'll be exploring one of three datasets (DeepFashion, Paperdoll, and iMaterialist). 

Specifically in this notebook we'll: 

☑️ Load the data

☑️ Explore the dimensions of the dataset

☑️ Understand what categories are being represented

☑️ Explore samples of the data (the metadata dictionary with categories & attributes labels, the Train 🚂 file, and finally the images 📸 themselves)

☑️ Understand distributions of categories, attributes

<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

## Explanation of the data, according to the dataset page here: 
https://github.com/kyamagu/paperdoll/tree/master/data/chictopia

### Files


### ⚠️📝 Notes (About the Notebooks) ⚠️📝 

My guiding principles:
* ➡️ Be overly communicative = While that leads to verbose commenting, I hope it means I catch a bunch of questions early  
* ➡️ Human-readable over witty-optimization = For the most part I try to make everything I'm doing obvious
* ➡️ Write as much code as needed, and no more = There's a time and place for error-catching & object-oriented code & there are ways to make the notebook reproducible. That's not quite the goal for this notebook (or any of the other notebooks in the early stages of the project) and my goal was to write just the code needed to get this step done.  

<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

# <span style='background :red' > Step 1: Proper set-up & installation of necessary libraries & packages </span> 

1. Make sure you're using the right flavor of commands for your shell and environment, and that sqlite3 is available to you. The quickest check is to pull up the terminal (or whichever shell you're using) and run the command there.
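The same check can be done from inside the notebook. This is a minimal sketch, not part of the original workflow: the helper name `report_environment` and the list of tools are my own assumptions, chosen from the commands this notebook calls later.

```python
import shutil
import sqlite3
import sys


def report_environment(tools=("sqlite3", "wget", "tar", "gunzip")):
    """Describe the running interpreter and which CLI tools are on PATH."""
    report = {
        "python": sys.executable,                  # interpreter behind this kernel
        "sqlite_library": sqlite3.sqlite_version,  # SQLite version bundled with Python
    }
    for tool in tools:
        # shutil.which returns the full path, or None when the tool is not found
        report[tool] = shutil.which(tool)
    return report


for name, value in report_environment().items():
    print(f"{name:>15}: {value}")
```

Any tool that prints `None` is not visible to the shell the notebook spawns, even if it works in your own terminal.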


In [10]:
#!pip3 install lmdb 

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9680 sha256=c64acaff6bce3dd1bec05890c69dc1e3522a8e24a9c1dd0ce1927392ca0cda04
  Stored in directory: /Users/mikikobazeley/Library/Caches/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
##################### [TODO] SETUP #####################
# [TODO] Import any utilities functions


import io
import json
import os
import sqlite3
import sys

import lmdb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image

%matplotlib inline

print('Packages Imported')

modules = dir()

print(modules)
#print(os.environ)

# [TODO] Package install/load

Packages Imported
['Image', 'In', 'Out', '_', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i2', '_ih', '_ii', '_iii', '_oh', 'exit', 'get_ipython', 'io', 'json', 'lmdb', 'np', 'os', 'pd', 'plt', 'quit', 'sqlite3', 'sys']


In [3]:
# Confirm all the right libraries are present
# This is an important step because there's a good chance
# that for some packages where you use pip or pip3 install
# they could download to the wrong directory if you're not
# using the right pip executable

!conda list

# packages in environment at /Users/mikikobazeley/opt/anaconda3/envs/SPRINGBOARD_MLE_CAPSTONE_ENV:
#
# Name                    Version                   Build  Channel
_py-xgboost-mutex         2.0                       cpu_0  
_pytorch_select           0.1                       cpu_0    anaconda
_tflow_select             2.3.0                       mkl  
absl-py                   0.11.0             pyhd3eb1b0_1  
appnope                   0.1.2           py37hecd8cb5_1001  
argon2-cffi               20.1.0           py37haf1e3a3_1    anaconda
astor                     0.8.1            py37hecd8cb5_0  
async_generator           1.10             py37h28b3542_0    anaconda
attrs                     20.3.0             pyhd3eb1b0_0  
backcall                  0.2.0              pyhd3eb1b0_0  
blas                      1.0                         mkl  
bleach                    3.3.0              pyhd3eb1b0_0  
blis                      0.7.4                    pypi_0    pypi
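
To see whether a package landed in the environment the kernel is actually running, one option is to ask Python where it would import each package from. This is a sketch under assumptions: `locate_packages` is a hypothetical helper, and the package names are the ones imported in this notebook.

```python
import importlib.util
import sys


def locate_packages(names):
    """Map each top-level package name to the file it would be imported from (or None)."""
    locations = {}
    for name in names:
        spec = importlib.util.find_spec(name)  # None when the package is missing
        locations[name] = spec.origin if spec is not None else None
    return locations


print("Interpreter:", sys.executable)
for name, origin in locate_packages(["lmdb", "pandas", "PIL"]).items():
    print(f"{name:>8}: {origin or 'NOT FOUND in this environment'}")
```

If a path points outside the conda env directory, the package was installed by a different pip. Inside Jupyter, the `%pip install` magic installs into the kernel's own environment, which avoids the mismatch.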

In [4]:
# Confirm path of working directory
!pwd 

/Users/mikikobazeley/Github/personal/MMBazel/Wardrobe-Recommender/notebooks/Step2_EDA


<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

# <span style='background :red' > Step 2: Download the Chictopia Datasets</span> 

⚠️ The best way to not mess this process up is to follow the instructions here exactly: https://github.com/kyamagu/paperdoll/tree/master/data/chictopia

Make sure you have installed: 
➡️ wget (I used Homebrew to install & manage it, so it's globally available on my Mac as opposed to just the conda env => !brew install wget)

## <span style='background :orange' > Clone the Github Repo </span> 

In [5]:
#!git clone https://github.com/kyamagu/paperdoll /Volumes/MiniGator/Projects/Datasets/Paperdoll

In [6]:
# Confirm the repo has been cloned to the target directory
# Replace path with the path-to-downloaded-dataset
!ls -a -F /Volumes/MiniGator/Projects/Datasets/Paperdoll

[30m[43m.[m[m/              [30m[43m.git[m[m/           [31m.gitignore[m[m*     [31mREADME.md[m[m*
[30m[43m..[m[m/             [31m.gitattributes[m[m* [31mLICENSE.txt[m[m*    [30m[43mdata[m[m/


In [7]:
!ls -a -F /Volumes/MiniGator/Projects/Datasets/Paperdoll/data

[30m[43m.[m[m/          [30m[43m..[m[m/         [31m.gitignore[m[m* [30m[43mchictopia[m[m/


In [8]:
!ls -a -F /Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia

[30m[43m.[m[m/         [30m[43m..[m[m/        [31mREADME.md[m[m*


In [11]:
!wget https://s3-ap-northeast-1.amazonaws.com/kyamagu-public/chictopia2/photos.lmdb.tar -P /Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia

zsh:1: command not found: wget
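
When the notebook's shell cannot find `wget`, as in the output above, a standard-library download is one fallback. This is a minimal sketch: `download_file` is a hypothetical helper, and the URL and target directory in the commented call are the ones used in the cell above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_dir):
    """Stream `url` into `dest_dir`, keeping the URL's file name; skip if it already exists."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target  # assumption: an existing file is a complete download
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks rather than all at once
    return target


# Example call (commented out because the archive is large):
# download_file(
#     "https://s3-ap-northeast-1.amazonaws.com/kyamagu-public/chictopia2/photos.lmdb.tar",
#     "/Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia",
# )
```

The sketch does not resume interrupted downloads, so delete a partial file before retrying.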


In [None]:
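# https://stackoverflow.com/questions/13707429/decompress-gzip-file-to-specific-directory
# tar -C needs the target directory to exist, so create it first
!mkdir -p /Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia/photos/
!tar xf /Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia/photos.lmdb.tar -C /Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia/photos/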
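
The extraction can also be done in Python, which makes the "target directory must exist" requirement explicit. This is a sketch using the standard `tarfile` module; `extract_tar` is a hypothetical helper, and it assumes the archive comes from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_tar(archive_path, dest_dir):
    """Extract a tar archive into `dest_dir` (created if missing) and list what landed there."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:  # compression is auto-detected
        archive.extractall(dest_dir)
    return sorted(p.name for p in dest_dir.iterdir())


# Example call with the paths from the cell above:
# extract_tar(
#     "/Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia/photos.lmdb.tar",
#     "/Volumes/MiniGator/Projects/Datasets/Paperdoll/data/chictopia/photos/",
# )
```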

In [None]:
!gunzip -c chictopia.sql.gz | sqlite3 chictopia.sqlite3
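
If the `sqlite3` command-line tool is not installed, the same load can be sketched with Python's built-in `gzip` and `sqlite3` modules. `load_sql_dump` is a hypothetical helper. It assumes the dump is UTF-8 text and small enough to read into memory, which I have not verified for `chictopia.sql.gz`.

```python
import gzip
import sqlite3


def load_sql_dump(gz_path, db_path):
    """Run a gzip-compressed SQL dump against a SQLite file and return its table names."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as dump:
        script = dump.read()  # the whole dump is held in memory
    connection = sqlite3.connect(db_path)
    try:
        connection.executescript(script)
        connection.commit()
        tables = connection.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in tables]


# Example call with the file names from the cell above:
# load_sql_dump("chictopia.sql.gz", "chictopia.sqlite3")
```

The returned table names are a quick first look at the schema before running any queries.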


# <span style='background :pink' > ⬆️ LEFT OFF ABOVE ⬆️  </span> 

In [None]:
# Unzip to the same directory
# ! NOTE: If no path is provided, unzip will just unzip files 
# to your current-working-directory & NOT the same folder


# [UNCOMMENT BELOW] To run the command within the Jupyter notebook
# !unzip /Volumes/MiniGator/Projects/Datasets/iMaterialist/imaterialist-fashion-2020-fgvc7.zip -d /Volumes/MiniGator/Projects/Datasets/iMaterialist/

In [None]:
# List all files and folders, confirming we have all the data 
!ls -a -F /Volumes/MiniGator/Projects/Datasets/iMaterialist

<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

# <span style='background :red' > Step 3: Use CLI to start initial exploration 🔬  </span> 

We want to understand:

☑️ The data repository structure

☑️ Number of files, types of files

☑️ Type of information being captured & how it's represented 



<span style='background :yellow' > 💡 Given the size of the dataset, doing initial inspection with shell is fast & easy.
No need to break Jupyter before even getting started.  </span> 

In [None]:
# Sample list of images in train folder
!ls /Volumes/MiniGator/Projects/Datasets/iMaterialist/train | head -n 20

In [None]:
# Total number of images in train folder
# (plain ls, because ls -a would also count the . and .. entries)
!ls /Volumes/MiniGator/Projects/Datasets/iMaterialist/train | wc -l

In [None]:
# Sample list of images in test folder
!ls /Volumes/MiniGator/Projects/Datasets/iMaterialist/test | head -n 20

In [None]:
# Total number of images in test folder
# (plain ls, because ls -a would also count the . and .. entries)
!ls /Volumes/MiniGator/Projects/Datasets/iMaterialist/test | wc -l
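
For a count that also breaks files down by type, the listing can be sketched in Python. `count_files_by_extension` is a hypothetical helper. It counts only regular files, so directory entries never inflate the total.

```python
import os
from collections import Counter


def count_files_by_extension(directory):
    """Count regular files in `directory` (non-recursive), keyed by lowercased extension."""
    counts = Counter()
    with os.scandir(directory) as entries:  # scandir avoids building one huge list
        for entry in entries:
            if entry.is_file():
                extension = os.path.splitext(entry.name)[1].lower()
                counts[extension] += 1
    return counts


# Example call with the train folder used above:
# count_files_by_extension("/Volumes/MiniGator/Projects/Datasets/iMaterialist/train")
```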

In [None]:
# First 20 rows of the sample submission file included, for use in Kaggle competition
! head -n 20 /Volumes/MiniGator/Projects/Datasets/iMaterialist/sample_submission.csv

In [None]:
# First 20 rows of train metadata
! head -n 20 /Volumes/MiniGator/Projects/Datasets/iMaterialist/train.csv

In [None]:
# First 100 lines of the label_descriptions.json
! head -n 100 /Volumes/MiniGator/Projects/Datasets/iMaterialist/label_descriptions.json

<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

# <span style='background :red' > Step 4A: 🔬 Deeper EDA & Exploration of Dictionary 📚</span> 

☑️ Explore samples of the data (the metadata dictionary with categories & attributes labels, the Train 🚂 file, and finally the images 📸 themselves)

☑️ Understand distributions of categories, attributes

## <span style='background :orange' > Loading the categories and attributes dictionary from JSON </span>


In [None]:
# Loading file & transposing to get the right shape

label_descriptions_df = (pd.read_json('/Volumes/MiniGator/Projects/Datasets/iMaterialist/label_descriptions.json',orient='index').T)

label_descriptions_df.head(20)

## <span style='background :orange' > High-Level Describing </span>

In [None]:
label_descriptions_df.info()

In [None]:
label_descriptions_df.describe()

## <span style='background :orange' > Examining individual cells & entries </span>

In [None]:
# This corresponds to the first row & first column (categories)
label_descriptions_df.loc[0][0]

In [None]:
# This corresponds to the first row & second column (attributes)
label_descriptions_df.loc[0][1]

In [None]:
# This corresponds to the second row & first column (categories)
label_descriptions_df.loc[1][0]

In [None]:
# This corresponds to the second row & second column (attributes)
label_descriptions_df.loc[1][1]

In [None]:
# This corresponds to the 16th row & first column (categories)
label_descriptions_df.loc[15][0]

## <span style='background :orange' > Examining Nulls </span>

⚠️ The creators of the dataset did something interesting -- they stuck two separate dictionaries together in a single JSON file. So while it looks like there are nulls in the categories column from row 46 onward (rows that only hold attributes), that's not quite accurate. 

🤔 As we need to explore the "nulls" we need to remember they're not "true nulls" but the result of the dataset author's decision to save space & memory by dumping two lists of separate lengths together.
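
To make the "not true nulls" point concrete, here is a toy sketch of what happens when two lists of different lengths are laid side by side. The `categories` and `attributes` keys mirror the column names used in this notebook; the entries themselves are made up for illustration.

```python
import json
from itertools import zip_longest

# Toy stand-in for label_descriptions.json: two independent lists in one file
raw = json.loads("""
{
  "categories": [{"id": 0, "name": "cat_a"}, {"id": 1, "name": "cat_b"}],
  "attributes": [{"id": 0, "name": "attr_a"}, {"id": 1, "name": "attr_b"},
                 {"id": 2, "name": "attr_c"}, {"id": 3, "name": "attr_d"}]
}
""")

categories = raw["categories"]
attributes = raw["attributes"]

# Pairing the lists row by row pads the shorter one with None,
# which is what shows up as NaN in the combined dataframe
rows = list(zip_longest(categories, attributes))
padded = sum(1 for category, _ in rows if category is None)

print(len(categories), len(attributes), padded)  # 2 4 2
```

The padded rows carry no information about categories. They exist only because the attributes list is longer, so the two lists are best handled as separate tables.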

In [None]:
# Let's locate all rows where the categories column is null (i.e. rows that only hold attributes)

label_desc_isNull_df = label_descriptions_df.loc[pd.isnull(label_descriptions_df['categories'])]
label_desc_isNull_df

In [None]:
label_desc_isNull_df.head(20)

In [None]:
# Let's check out some specific examples
label_desc_isNull_df.loc[46][1]

In [None]:
label_desc_isNull_df.loc[62][1]

In [None]:
label_desc_isNull_df.loc[65][1]

In [None]:
label_desc_isNull_df.tail(20)

In [None]:
label_desc_isNull_df.loc[293][1]

In [None]:
label_desc_isNull_df.loc[274][1]

In [None]:
label_desc_isNull_df.loc[287][1]

## <span style='background :orange' > Cleaning Up Categories & Attributes Dicts </span>

### <span style='background :yellow' > Cleaning Up Categories </span>

We're going to pull out and split the categories list from the attributes list. We'll also make the columns easier to track by renaming, etc.

In [None]:
# We'll save off a copy of the dataframe for inspection. 
# It'll be deleted manually later once we're happy with the quality.

categories_notNull_df = label_descriptions_df.loc[pd.notnull(label_descriptions_df['categories'])]
categories_notNull_df = pd.json_normalize(categories_notNull_df['categories'])
categories_notNull_df.to_csv('../../data/interim/categories_notNull_df.csv')
categories_notNull_df

In [None]:
categories_notNull_df = categories_notNull_df.rename(
                                columns={'id':'id_categories',
                                        'name':'name_categories',
                                        'supercategory':'supercategory_categories',
                                        'level':'level_categories'})
categories_notNull_df = categories_notNull_df.set_index('id_categories')
categories_notNull_df

In [None]:
#Checking out the distribution of the categories 
categories_notNull_df['supercategory_categories'].value_counts().plot(kind='bar')

### <span style='background :yellow' > Cleaning Up Attributes </span>

Same exact process as we did with categories above. 

In [None]:
# We'll save off a copy of the dataframe for inspection. 
# It'll be deleted manually later once we're happy with the quality.

attributes_df = pd.json_normalize(label_descriptions_df['attributes'])
attributes_df.to_csv('../../data/interim/attributes_df.csv')
attributes_df

In [None]:
# Especially for attributes, it's important we set the id_attributes column as the index
# as we intend to use the dataframes for merging later.
# Why set the index instead of just using the default pandas index?
# You'll see later, but essentially the dataset creators, when creating the JSON file,
# skipped some attribute IDs, i.e. at Attribute ID=281 the numbering skips
# ahead by 40+.
# Luckily this seems to align with the attribute IDs captured in the Train file.
# Keep going, more explanation below.


attributes_df = attributes_df.rename(columns={'id':'id_attributes',
                                        'name':'name_attributes',
                                        'supercategory':'supercategory_attributes',
                                        'level':'level_attributes'})
attributes_df = attributes_df.set_index('id_attributes')
attributes_df

In [None]:
attributes_df['supercategory_attributes'].value_counts().plot(kind='bar')

<hr style="border-top: 5px solid black; margin-top: 1px; margin-bottom: 1px"></hr>

# <span style='background :red' > Step 4B: 🔬 Deeper EDA & Exploration of Train 🚂 File </span> 

## <span style='background :orange' > Loading the train.csv file 🚂 </span> 

We can note the following:
1. The same image is repeated multiple times but with a unique:
    * ClassId
    * Encoded Pixels
    * AttributesIds
    
2. Not all ImageIds have corresponding AttributesIds
3. AttributeIds are off: the numbering jumps to AttributeId = 281 from the prior entry of AttributeId = 234. We also see that the AttributesIds column in the train data includes AttributeId values of 300+. As part of further processing we'd need to visually confirm that the train images with AttributeIds of 235-293 are correctly labeled. 
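
Rather than spotting the jump by eye, the missing ID ranges can be listed programmatically. This is a small sketch; `find_id_gaps` is a hypothetical helper and the IDs in the demo call are made up.

```python
def find_id_gaps(ids):
    """Return (first_missing, last_missing) ranges between consecutive sorted unique IDs."""
    ordered = sorted(set(ids))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


print(find_id_gaps([0, 1, 2, 5, 6, 10]))  # [(3, 4), (7, 9)]
```

Running it on the attribute IDs from the label dictionary and on the exploded attribute IDs from the train file would show whether both skip the same ranges.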

In [None]:
# We load one chunk to visually examine (so that we don't overload the notebook memory)
with pd.read_csv("/Volumes/MiniGator/Projects/Datasets/iMaterialist/train.csv",chunksize=100) as reader:
    print(reader.get_chunk(20))

In [None]:
# Now I'll load the whole file in one go; for bigger files, however, it can make sense
# to still do lazy loading

train_df = pd.read_csv("/Volumes/MiniGator/Projects/Datasets/iMaterialist/train.csv")

## <span style='background :orange' > Examine the Train file 🚂</span> 

In [None]:
train_df.head(20)

In [None]:
train_df.info()

## <span style='background :orange' > Get Top-Level Counts of Combined Train 🚊  File & Classes Dict 📚 </span> 

In [None]:
# In the Train data:
# ClassId corresponds to the category ID
# Each row in train contains a category label & multiple attribute labels

train_df[['ClassId','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['ClassId']).count().merge(categories_notNull_df,how='left',left_on='ClassId',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_categories')
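
The count-then-merge chain above is reused for several charts, so it may be easier to read as a small helper. This is a sketch, not the notebook's original code: `counts_with_labels` is a hypothetical name, the toy frames stand in for `train_df` and `categories_notNull_df`, and it counts rows with `size()`, which matches `count()` only when the counted column has no nulls.

```python
import pandas as pd


def counts_with_labels(df, key, lookup, label_col):
    """Count rows per `key`, attach labels from `lookup` (indexed by ID), sort descending."""
    counts = df.groupby(key).size().rename("counts_").reset_index()
    merged = counts.merge(lookup, how="left", left_on=key, right_index=True)
    return merged.sort_values("counts_", ascending=False).set_index(label_col)


# Toy data with made-up labels, shaped like the train file and the categories table
toy_train = pd.DataFrame({"ImageId": ["a", "a", "b", "c"], "ClassId": [1, 0, 1, 1]})
toy_categories = pd.DataFrame(
    {"name_categories": ["shirt", "shoe"]},
    index=pd.Index([0, 1], name="id_categories"),
)

result = counts_with_labels(toy_train, "ClassId", toy_categories, "name_categories")
print(result)
```

With the real frames the call would look like `counts_with_labels(train_df, 'ClassId', categories_notNull_df, 'name_categories')`, and the result can be passed straight to `.plot(kind='barh', y='counts_')`.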

### <span style='background :yellow' > Visualize entire distribution of supercategory categories </span> 

In [None]:
train_df[['ClassId','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['ClassId']).count().merge(categories_notNull_df,how='left',left_on='ClassId',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('supercategory_categories').plot(kind='barh',y='counts_',use_index=True)

### <span style='background :yellow' > Visualize top 15 categories by supercategory count</span> 

In [None]:
train_df[['ClassId','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['ClassId']).count().merge(categories_notNull_df,how='left',left_on='ClassId',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('supercategory_categories').iloc[:15].plot(kind='barh',y='counts_',use_index=True)

### <span style='background :yellow' > Visualize entire distribution of fine category (not super category) count </span> 

In [None]:
train_df[['ClassId','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['ClassId']).count().merge(categories_notNull_df,how='left',left_on='ClassId',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_categories').plot(kind='barh',y='counts_',use_index=True)

### <span style='background :yellow' > Visualize top 15 of fine category (not super category) count </span> 

In [None]:
train_df[['ClassId','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['ClassId']).count().merge(categories_notNull_df,how='left',left_on='ClassId',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_categories').iloc[:15].plot(kind='barh',y='counts_',use_index=True)

## <span style='background :orange' > Get Top-Level Counts of Combined Train 🚊  File & Attributes Dict 📚 </span> 

### <span style='background :yellow' > Select out columns to start analysis of attributes</span> 

In [None]:
# Remember: The same image is represented multiple times with a unique ClassId 
# (aka one image can have multiple items of clothing) and each ClassId (item of clothing)
# can have multiple attributes (details like type of fabric, buttons, etc)
# This is why the AttributesIds column has a list of values

# .copy() so the column assignments below modify this frame, not a view of train_df
train_with_attributes = train_df[['ImageId','ClassId','AttributesIds']].copy()
train_with_attributes

### <span style='background :yellow' > Explode out attributes to make a long train_attributes dataframe </span> 

In [None]:
# Not all images have associated detailed attributes 
# If we just try to explode out the NaN values, we'll get an error
# So we need to convert the AttributesIds into a list of values (str)
# such that we can then encapsulate as a list, use pd.explode, 
# and then recast as int so we can merge on the indices.
# Phew!

train_with_attributes['AttributesIds'] = train_with_attributes['AttributesIds'].replace(np.nan,-1000).astype(str)
train_with_attributes['AttributesIds'] = train_with_attributes['AttributesIds'].apply(lambda x: list(x.split(",")))

train_with_attributes_long = train_with_attributes.explode('AttributesIds')
train_with_attributes_long['AttributesIds'] = train_with_attributes_long['AttributesIds'].astype(int)
train_with_attributes_long
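
An alternative to the placeholder value is to split, explode, and then drop the rows that had no attributes. This is a sketch on toy data with made-up IDs. It drops unlabeled rows entirely, so keep the placeholder approach if you want those rows to show up in the counts.

```python
import pandas as pd

# Toy frame shaped like train_with_attributes; None marks a row without attributes
toy = pd.DataFrame({
    "ImageId": ["a", "a", "b"],
    "ClassId": [1, 2, 1],
    "AttributesIds": ["3,7", None, "7"],
})

long_df = (
    toy.assign(AttributesIds=toy["AttributesIds"].str.split(","))  # "3,7" -> ["3", "7"]
       .explode("AttributesIds")              # one row per attribute ID
       .dropna(subset=["AttributesIds"])      # remove rows that had no attributes
       .astype({"AttributesIds": int})        # cast so it can be merged on integer IDs
       .reset_index(drop=True)
)
print(long_df)
```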

In [None]:
# Now we have a count of number of times (not images but occurrences) of the attributes
train_with_attributes_long[['AttributesIds','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['AttributesIds']).count().merge(attributes_df,how='left',left_on='AttributesIds',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_attributes')

In [None]:
# This should be an ugly mess of a chart -- long tail but we also have a bunch of the 
# top-level categories being repeated
train_with_attributes_long[['AttributesIds','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['AttributesIds']).count().merge(attributes_df,how='left',left_on='AttributesIds',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_attributes').plot(kind='barh',y='counts_',use_index=True)

In [None]:
# Grabbing just the first 15, we see there are a bunch of occurrences of NaN
# This isn't surprising as the dataset creators described how only a subset of the original data
# had additional detailed attributes information.
train_with_attributes_long[['AttributesIds','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['AttributesIds']).count().merge(attributes_df,how='left',left_on='AttributesIds',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_attributes').iloc[:15].plot(kind='barh',y='counts_',use_index=True)

In [None]:
# Ignoring NaN (assumed to be the first row after sorting), checking out the top 30 attributes
train_with_attributes_long[['AttributesIds','ImageId']].rename(columns={'ImageId':'counts_'}).groupby(['AttributesIds']).count().merge(attributes_df,how='left',left_on='AttributesIds',right_index=True).reset_index().sort_values('counts_',ascending=False).set_index('name_attributes').iloc[1:31].plot(kind='barh',y='counts_',use_index=True)

## <span style='background :orange' > Finishing EDA 🔬 by Previewing Some Random Images 📸 </span> 

💡 Ideally, it would be best to do a random selection of the images to preview by grabbing a list of the image_file names, creating a list of random numbers, then picking the image whose index corresponds to the random number element. 

🤔 Given that this is just meant to be a cursory preview, we'll do that in later notebooks for the dataset that ultimately gets selected.
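
For reference, the random selection described above could be sketched as follows. `sample_image_names` is a hypothetical helper, and the extension filter and the seed are my own assumptions. `random.sample` picks the names directly, which stands in for generating random indices by hand.

```python
import os
import random


def sample_image_names(directory, k, seed=None, extensions=(".jpg", ".jpeg", ".png")):
    """Pick up to `k` distinct image file names from `directory`, reproducibly if seeded."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(extensions)
    )
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(names, min(k, len(names)))


# Example call with the train folder used in this notebook:
# sample_image_names("/Volumes/MiniGator/Projects/Datasets/iMaterialist/train", k=7, seed=42)
```

Sorting before sampling makes a seeded draw independent of the order in which the file system lists the files.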

In [None]:
path_to_dir = '/Volumes/MiniGator/Projects/Datasets/iMaterialist/train'

In [None]:
listOfImageNames = ['00000663ed1ff0c4e0132b9b9ac53f6e.jpg',
                    '0000fe7c9191fba733c8a69cfaf962b7.jpg',
                    '0002ec21ddb8477e98b2cbb87ea2e269.jpg',
                    '0002f5a0ebc162ecfb73e2c91e3b8f62.jpg',
                    '0004467156e47b0eb6de4aa6479cbd15.jpg',
                    '00048c3a2fb9c29340473c4cfc06424a.jpg',
                    '0006ea84499fd9a06fefbdf47a5eb4c0.jpg'
                   ]

In [None]:
for imageName in listOfImageNames:
    # Image is the PIL module imported above, so open each file with Image.open
    display(Image.open(os.path.join(path_to_dir, imageName)))