<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#How-large-is-our-dataset-when-we-restrict-to-women's-apparel?" data-toc-modified-id="How-large-is-our-dataset-when-we-restrict-to-women's-apparel?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>How large is our dataset when we restrict to women's apparel?</a></span></li></ul></div>

In [1]:
import os
import pandas as pd
import json
import numpy as np

In [2]:
DATA_DIR = "../data/external/fashion-dataset"
STYLES_PATH = os.path.join(DATA_DIR, "styles.csv")

Read styles metadata:

In [3]:
%%capture

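# Skip malformed rows. `error_bad_lines=False` was removed in pandas 2.0;
# `on_bad_lines="skip"` (pandas >= 1.3) is its replacement.
df = pd.read_csv(STYLES_PATH, on_bad_lines="skip")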
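
What gets dropped when bad lines are skipped: the parser treats a row with more fields than the header as malformed, while a row with fewer fields is kept and padded with NaN. A small sketch on made-up in-memory data (these rows are illustrative, not taken from `styles.csv`), assuming pandas 1.3 or later, where the option is spelled `on_bad_lines="skip"`:

```python
import io

import pandas as pd

# Hypothetical rows for illustration only.
raw = (
    "id,gender,masterCategory\n"
    "1,Women,Apparel\n"
    "2,Men,Apparel,Topwear,Shirts\n"  # too many fields: skipped
    "3,Women,Accessories\n"
    "4,Women\n"                       # too few fields: kept, padded with NaN
)

toy = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(toy)
```

Because rows are dropped silently, comparing `len(df)` against the raw line count of the file is one way to see how many were lost.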

In [4]:
df.head()

,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt


# How large is our dataset when we restrict to women's apparel?

Restrict focus to women's apparel, and group by article type:

In [5]:
df_women = df.drop(['baseColour', 'season', 'year', 'usage', 'productDisplayName'], axis=1).groupby(
    ['gender', 'masterCategory', 'subCategory', 'articleType']).count().xs(('Women', 'Apparel'))

Breakdown of women's apparel counts in fashion-dataset:

In [6]:
df_women

subCategory,articleType,id
Apparel Set,Kurta Sets,90
Apparel Set,Swimwear,2
Bottomwear,Capris,129
Bottomwear,Churidar,23
Bottomwear,Jeans,228
Bottomwear,Jeggings,34
Bottomwear,Leggings,153
Bottomwear,Patiala,38
Bottomwear,Salwar,27
Bottomwear,Salwar and Dupatta,7


In [7]:
print("Women's apparel total image count:", df_women['id'].sum())

Women's apparel total image count: 8623


Save women's apparel ids:

In [8]:
women_ids = df.loc[(df['gender'] == 'Women') & (df['masterCategory'] == 'Apparel')]
women_ids.to_csv(
    '../data/processed/fashion-dataset/women_apparal_ids.csv',
    index=False, columns=['id'])

How are the JSON metadata files structured?

In [9]:
ids = women_ids['id'].tolist()
# Pretty-print the metadata for the first women's apparel item.
with open(os.path.join(DATA_DIR, 'styles', '{}.json'.format(ids[0]))) as f:
    example_md = json.load(f)
print(json.dumps(example_md, indent=2))

{
  "notification": {},
  "meta": {
    "code": 200,
    "requestId": "92c5cd1a-219f-4c22-b5f3-60a929bf9a3a"
  },
  "data": {
    "id": 26960,
    "price": 699,
    "discountedPrice": 699,
    "styleType": "P",
    "productTypeId": 320,
    "articleNumber": "1JW06761",
    "visualTag": "",
    "productDisplayName": "Jealous 21 Women Purple Shirt",
    "variantName": "Solid",
    "myntraRating": 1,
    "catalogAddDate": 1330675732,
    "brandName": "Jealous 21",
    "ageGroup": "Adults-Women",
    "gender": "Women",
    "baseColour": "Purple",
    "colour1": "NA",
    "colour2": "NA",
    "fashionType": "Fashion",
    "season": "Summer",
    "year": "2012",
    "usage": "Casual",
    "vat": 5.5,
    "displayCategories": "Casual Wear",
    "weight": "0",
    "navigationId": 1157,
    "landingPageUrl": "Shirts/Jealous-21/Jealous-21-Women-Purple-Shirt/26960/buy",
    "articleAttributes": {
      "Fit": "Regular Fit",
      "Pattern": "Solid",
      "Body or Garment Size": "Garment Measurem

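Going by the fields visible in the dump above, the useful payload sits under the `data` key, with garment details nested in `articleAttributes`. A sketch of pulling a few of them into a flat record; `flatten_style` is a hypothetical helper, and the sample document copies only a handful of values from the output above:

```python
import json


def flatten_style(md):
    """Return a flat dict of selected fields from one style JSON document."""
    data = md.get("data", {})
    attrs = data.get("articleAttributes", {})
    return {
        "id": data.get("id"),
        "brandName": data.get("brandName"),
        "price": data.get("price"),
        "baseColour": data.get("baseColour"),
        "fit": attrs.get("Fit"),
        "pattern": attrs.get("Pattern"),
    }


# Trimmed-down copy of the example metadata shown above.
sample = json.loads("""
{"meta": {"code": 200},
 "data": {"id": 26960, "price": 699, "brandName": "Jealous 21",
          "baseColour": "Purple",
          "articleAttributes": {"Fit": "Regular Fit", "Pattern": "Solid"}}}
""")

record = flatten_style(sample)
print(record)
```

Using `.get` throughout means a missing key yields `None` rather than an exception, which seems the safer default until all files have been checked for a consistent schema.
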
In [10]:
from PIL import Image

List the distinct image dimensions (width, height) in the women's apparel data:

In [11]:
# Distinct (width, height) pairs; `image_id` avoids shadowing the builtin `id`.
dims = {Image.open(os.path.join(DATA_DIR, 'images', '{}.jpg'.format(image_id))).size for image_id in ids}
dims

{(360, 480),
 (540, 720),
 (683, 1024),
 (1080, 1440),
 (1200, 1600),
 (1354, 2020),
 (1800, 2399),
 (1800, 2400),
 (1806, 2700),
 (2500, 3333),
 (2652, 3536),
 (2700, 2700),
 (2711, 3615),
 (2759, 3678),
 (2774, 3698),
 (2804, 3739)}
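
The pairs above are (width, height). Dividing width by height suggests most of the distinct sizes are close to 3:4, with a few near 2:3 and one square. A quick sketch of that arithmetic on the set printed above; rounding to two decimals is an arbitrary choice for grouping, and the counts are over distinct sizes, not over images:

```python
from collections import Counter

# The distinct sizes printed above.
dims = {
    (360, 480), (540, 720), (683, 1024), (1080, 1440),
    (1200, 1600), (1354, 2020), (1800, 2399), (1800, 2400),
    (1806, 2700), (2500, 3333), (2652, 3536), (2700, 2700),
    (2711, 3615), (2759, 3678), (2774, 3698), (2804, 3739),
}

# Group sizes by width/height ratio, rounded so near-identical ratios collapse.
ratio_counts = Counter(round(w / h, 2) for w, h in dims)
for ratio, n in sorted(ratio_counts.items()):
    print(ratio, n)
```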

TODO: use PIL to resize the images to a common size.
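
One possible way to approach this, sketched under assumptions: scale each image to fit inside a fixed target box while preserving its aspect ratio, then pad to the exact target size. The 360x480 target (the smallest size seen above), the white padding colour, and the helper name `resize_with_padding` are all illustrative choices, not decisions made in this notebook.

```python
from PIL import Image


def resize_with_padding(img, target=(360, 480), fill=(255, 255, 255)):
    """Scale img to fit inside target (width, height), then pad to that size."""
    target_w, target_h = target
    # Largest scale factor at which both sides still fit in the target box.
    scale = min(target_w / img.width, target_h / img.height)
    new_size = (max(1, round(img.width * scale)),
                max(1, round(img.height * scale)))
    resized = img.convert("RGB").resize(new_size, Image.LANCZOS)
    # Centre the resized image on a canvas of the exact target size.
    canvas = Image.new("RGB", target, fill)
    offset = ((target_w - new_size[0]) // 2, (target_h - new_size[1]) // 2)
    canvas.paste(resized, offset)
    return canvas


# Stand-in images created in memory, so the sketch needs no dataset files.
for size in [(1800, 2400), (2700, 2700), (683, 1024)]:
    out = resize_with_padding(Image.new("RGB", size, (120, 60, 200)))
    print(size, "->", out.size)
```

Padding keeps garments undistorted at the cost of blank borders on the 2:3 and square images; cropping to 3:4 instead would be the alternative if borders turn out to hurt downstream use.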