# Tidying the Data
The data we're working with is a little unusual: it comes as JSON with nested lists. We need to untangle and flatten it before we can use the machine learning tools we will be working with in R.


## Importing the Data Set
To start off, we need to look at the data set and see what it has to offer. At first glance, the data comes in a nested format: each outfit has the attributes `name`, `views`, `items`, `image`, `likes`, `date`, `set_url`, `set_id`, and `desc`. The `items` attribute is itself a list of items, each with its own attributes: `index`, `name`, `price`, `likes`, `image`, and `categoryid`, which together describe one piece of the outfit.
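
To make the nesting concrete, here is a minimal sketch with a single made-up outfit record. The keys follow the attributes listed above, but every value is invented for illustration and is not taken from the data set.

```python
import json

# A hypothetical outfit record: the keys mirror the attributes described
# above, the values are invented.
raw = '''
[
  {
    "name": "Example outfit",
    "views": 100,
    "likes": 5,
    "items": [
      {"index": 1, "name": "denim jacket", "price": 60.0, "likes": 12,
       "image": "", "categoryid": 25},
      {"index": 2, "name": "black boots", "price": 90.0, "likes": 7,
       "image": "", "categoryid": 261}
    ]
  }
]
'''

outfits = json.loads(raw)

# Each outfit is a dict; its "items" value is itself a list of dicts.
for outfit in outfits:
    item_names = [item["name"] for item in outfit["items"]]
    category_ids = [item["categoryid"] for item in outfit["items"]]
    print(outfit["name"], item_names, category_ids)
```

The outer list holds outfits and each inner `items` list holds that outfit's pieces, which is why a single pass of flattening is not enough.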

The data also comes pre-split into training, validation, and test files (`train_no_dup.json`, `valid_no_dup.json`, and `test_no_dup.json`), but for this project we will recombine them.

In [1]:
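import json
from collections import Counter

import pandas as pd

# Load the three pre-split files; `with` closes each file automatically.
with open('polyvore_data/train_no_dup.json') as f:
    dict1 = json.load(f)
with open('polyvore_data/valid_no_dup.json') as f:
    dict2 = json.load(f)
with open('polyvore_data/test_no_dup.json') as f:
    dict3 = json.load(f)

# One row per outfit; the nested `items` lists stay as a single column.
df1 = pd.json_normalize(dict1)
df2 = pd.json_normalize(dict2)
df3 = pd.json_normalize(dict3)

# Stack the splits. Row labels are not reset, so each split keeps its own
# 0-based labels and the same label can appear up to three times in `df`.
dflist = [df1, df2, df3]
df = pd.concat(dflist)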

In [2]:
df.head(3)

Unnamed: 0,name,views,items,image,likes,date,set_url,set_id,desc
0,Casual,8743,"[{'index': 1, 'name': 'mock neck embroidery su...",http://ak1.polyvoreimg.com/cgi/img-set/cid/214...,394,One month,http://www.polyvore.com/casual/set?id=214181831,214181831,A fashion look from January 2017 by beebeely-l...
1,Being a Vans shoe model with Luke. Idk about t...,188,"[{'index': 1, 'name': 'nirvana distressed t-sh...",http://ak1.polyvoreimg.com/cgi/img-set/cid/120...,9,Two years,http://www.polyvore.com/being_vans_shoe_model_...,120161271,A fashion look from April 2014 featuring destr...
2,These Chanel bags is a bad habit .x,562,"[{'index': 1, 'name': 'monki singlet', 'price'...",http://ak1.polyvoreimg.com/cgi/img-set/cid/143...,32,Two years,http://www.polyvore.com/these_chanel_bags_is_b...,143656996,12.19.14


In [3]:
df.shape

(21889, 9)

In [4]:
df['items'].loc[1,][0:2]

1    [{'index': 1, 'name': 'nirvana distressed t-sh...
1    [{'index': 1, 'name': 'classic bracelet', 'pri...
Name: items, dtype: object

## Unnesting the Data
Within each item, the attributes I'm interested in are `name` and `categoryid`, since they say the most about the visual aspects of the outfit, which is what I want to focus on. I've created a function that extracts this information and spreads it across its own columns.

In [5]:
def itemscol_to_dataframe(column):
    """Flatten a column of item lists into name and category id columns."""
    data = []

    for index, sublist in column.items():
        name_list = []
        category_list = []

        # Collect the name and category id of every item in this outfit.
        for item in sublist:
            name_list.append(item['name'])
            category_list.append(item['categoryid'])

        # Pad outfits with fewer than 8 items so every row has 16 values.
        if len(name_list) < 8:
            fill = 8 - len(name_list)
            name_list = name_list + (['empty'] * fill)
            category_list = category_list + ([0] * fill)
        row = name_list + category_list
        data.append(row)

    itemsdf = pd.DataFrame(data, columns = ['iname1', 'iname2', 'iname3', 'iname4',
                                        'iname5', 'iname6', 'iname7', 'iname8',
                                        'iid1', 'iid2', 'iid3', 'iid4',
                                        'iid5', 'iid6', 'iid7', 'iid8'])
    return itemsdf

In [6]:
items_df = itemscol_to_dataframe(df['items'])
items_df.head(2)

Unnamed: 0,iname1,iname2,iname3,iname4,iname5,iname6,iname7,iname8,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8
0,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,empty,4495,25,27,261,259,1967,2,0
1,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,disney alice wonderland cat rubber bracelet ho...,21,237,49,106,106,106,106,106


Since there can be up to 8 items in an outfit, this dataframe contains 8 * 2 = 16 columns. Outfits with fewer items have their excess item name and item id columns filled with `'empty'` and `0`, respectively. I've chosen not to use `NULL` / `N/A` here since this isn't really "missing" information; rather, it gives us information on how many items are in the outfit which might be useful.
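
The padding rule can be sketched on its own, without pandas. `pad_items` below is a hypothetical helper, not part of the notebook; it assumes at most 8 items per outfit, as stated above, and that `0` is never a real category id.

```python
MAX_ITEMS = 8  # assumption from the text: an outfit has at most 8 items

def pad_items(names, category_ids, size=MAX_ITEMS):
    """Pad both lists to `size` with the 'empty' / 0 sentinels."""
    fill = size - len(names)
    if fill < 0:
        raise ValueError(f"expected at most {size} items, got {len(names)}")
    return names + ["empty"] * fill, category_ids + [0] * fill

names, ids = pad_items(["monki singlet", "joy denim jacket"], [104, 25])
print(names)
print(ids)

# The sentinel keeps the outfit size recoverable after flattening.
n_items = sum(1 for category_id in ids if category_id != 0)
print(n_items)
```

Because the fill values are ordinary values rather than missing ones, the number of real items can still be read back from a flattened row.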

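Now we can join this with the outfit-level columns and drop the ones we no longer need; the result is exported as a CSV at the end. Note that the cell below joins onto `df1`, which holds only the training split, and reuses the name `df2` for the result.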

In [7]:
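# Note: `df1` is the training split only, and `df2` is reused here for the
# joined result (it no longer refers to the validation split).
# The join aligns on row labels; `items_df` is numbered from 0 in the same
# order as `df`, whose first rows are the training split.
df2 = df1.drop(columns = ['items', 'image', 'name',
                        'date', 'set_url', 'desc']
        ).join(items_df)
df2.head(3)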

Unnamed: 0,views,likes,set_id,iname1,iname2,iname3,iname4,iname5,iname6,iname7,iname8,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8
0,8743,394,214181831,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,empty,4495,25,27,261,259,1967,2,0
1,188,9,120161271,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,disney alice wonderland cat rubber bracelet ho...,21,237,49,106,106,106,106,106
2,562,32,143656996,monki singlet,joy denim jacket,topshop moto joni high rise skinny jeans,black pointed chelsea boots,pre-owned chanel shoulder bag,rag bone floppy brim fedora,empty,empty,104,25,237,261,37,55,0,0


In [8]:
# Concatenate the eight item names of each outfit into one string.
name_cols = [f'iname{i}' for i in range(1, 9)]
df2['inametotal'] = df2[name_cols].agg(' '.join, axis=1)

# Count how often each word appears across all item names.
all_item_names = df2['inametotal'].str.cat(sep = ' ')
item_word_list = all_item_names.split()
counter = Counter(item_word_list)
most_occur = counter.most_common(100)
most_occur

[('empty', 23722),
 ('black', 9909),
 ('leather', 8516),
 ('bag', 6350),
 ("women's", 5810),
 ('top', 4504),
 ('jeans', 4133),
 ('dress', 4100),
 ('gold', 4031),
 ('white', 3837),
 ('earrings', 3619),
 ('iphone', 3613),
 ('sunglasses', 3382),
 ('necklace', 3381),
 ('skirt', 3254),
 ('boots', 3142),
 ('suede', 3004),
 ('jacket', 2922),
 ('case', 2871),
 ('denim', 2763),
 ('ring', 2703),
 ('mini', 2622),
 ('yoins', 2563),
 ('high', 2535),
 ('blue', 2533),
 ('clutch', 2497),
 ('plus', 2465),
 ('bracelet', 2418),
 ('skinny', 2164),
 ('coat', 2127),
 ('shoulder', 2125),
 ('sandals', 2122),
 ('long', 2112),
 ('set', 2106),
 ('women', 2106),
 ('lace', 2069),
 ('red', 2014),
 ('new', 1996),
 ('print', 1986),
 ('pink', 1961),
 ('sleeve', 1954),
 ('ankle', 1949),
 ('silver', 1894),
 ('pre-owned', 1877),
 ('lipstick', 1861),
 ('shorts', 1850),
 ('topshop', 1818),
 ('sweater', 1788),
 ('size', 1749),
 ('faux', 1711),
 ('vintage', 1699),
 ('shoes', 1693),
 ('rose', 1689),
 ('pumps', 1651),
 ('de', 

In [9]:
import os

def get_color_synonyms(folder):
    """Map each color (the file name without extension) to its synonyms."""
    colordict = {}
    for file in os.listdir(folder):
        with open(os.path.join(folder, file)) as f:
            data = f.read()
        colordict[os.path.splitext(file)[0]] = data.split('\n')
    return colordict

colordict = get_color_synonyms('color_synonyms/')

def what_color(itemname):
    """Return up to 4 distinct colors mentioned in `itemname`."""
    itemnamelist = itemname.split()
    color = []
    for item in itemnamelist:
        for key in colordict:
            if item in colordict[key]:
                color.append(key)
    # A set removes duplicates but does not preserve the order of appearance.
    return list(set(color))[0:4]


In [10]:
df2['colors'] = df2['inametotal'].apply(what_color)
df2.head(10)
#len(df2['colors'].unique()) # there are 749 color combinations. 


Unnamed: 0,views,likes,set_id,iname1,iname2,iname3,iname4,iname5,iname6,iname7,...,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8,inametotal,colors
0,8743,394,214181831,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,...,4495,25,27,261,259,1967,2,0,mock neck embroidery suede sweatshirt luxe dou...,[]
1,188,9,120161271,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,...,21,237,49,106,106,106,106,106,nirvana distressed t-shirt rag bone rock w/ bl...,"[white, black]"
2,562,32,143656996,monki singlet,joy denim jacket,topshop moto joni high rise skinny jeans,black pointed chelsea boots,pre-owned chanel shoulder bag,rag bone floppy brim fedora,empty,...,104,25,237,261,37,55,0,0,monki singlet joy denim jacket topshop moto jo...,"[white, black, blue]"
3,2613,88,186627934,tops,saint laurent zip cutout stretch nappa leather...,corset super store women's black steampunk corset,allurez square diamond halo engagement ring we...,lip buckled matte womens corset lip uk,nude pink lipstick,amazing eye makeup miss,...,11,28,2,65,52,200,186,76,tops saint laurent zip cutout stretch nappa le...,"[black, red, yellow]"
4,62,3,206969379,yoins leather sexy v-neck sleeveless crop top,solid color long sleeve irregular blazer,alice+olivia floral pattern a-line skirt,zipped top chunky booties,gold boho turquoise leaf tassel earrings,etro heart locket necklace,bohemian flower mandala blue crystal clear pho...,...,11,236,9,261,64,62,1967,200,yoins leather sexy v-neck sleeveless crop top ...,"[black, red, blue, yellow]"
5,276,83,201969694,new look light blue denim oversized long sleev...,mango skinny jane jegging,sophia webster leather butterfly flats,michael michael kors mini selma crossbody bag,velvet vase embellished necklace,powder eye shadow deep sea ea,empty,...,11,241,47,37,62,196,0,0,new look light blue denim oversized long sleev...,"[white, black, blue]"
6,1580,395,216470135,isabel marant alpaca blend jumper,yoins light blue gradient color hole denim skirt,alice light blue shoes flats leather sandals,yoins beige leather-look gold-tone metal clutc...,yoins stone long pendant necklace,elizabeth mini set,,...,19,9,41,38,62,140,4438,0,isabel marant alpaca blend jumper yoins light ...,"[gray, black, blue, brown]"
7,591,233,216220312,oasis shadow bird knit pink,michelle mason women's suede wrap front mini s...,chloé lauren leather ballerinas,giuseppe zanotti patent-leather clutch bag,hermÃ¨s rose gold bracelet,deborah lippmann creme nail polish,empty,...,19,8,47,38,106,222,0,0,oasis shadow bird knit pink michelle mason wom...,"[black, red, yellow]"
8,1142,239,185225843,valentino pleated cotton-blend dress,valentino red ruffle detail blouse,valentino wool-blend coat,valentino rockstud metallic leather pumps,valentino small striped leather satchel,valentino printed cashmere silk scarf,empty,...,4,17,24,43,318,105,0,0,valentino pleated cotton-blend dress valentino...,"[black, red, gray]"
9,24915,492,213824660,yoins plus size blue stripe shirt,yoins grey sleeveless faux fur coat,topshop moto dark indigo jamie jeans,yoins grey buckle design chunky heels short boots,wild side mini hair,bobbi brown peach,set,...,11,24,237,261,58,188,171,37,yoins plus size blue stripe shirt yoins grey s...,"[brown, red, blue, gray]"


In [11]:
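(df2['colors'].values == '').sum() / len(df2)
# Note: `colors` holds lists at this point, so the comparison with '' never
# matches and the result is 0.0. About 5% of the outfits have no color info;
# these will be dropped, since there is plenty of data to work with.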

0.0

In [12]:
df2['colors'].apply(len).max() # what_color caps the list at 4 colors; before the cap, the longest list had 8

4

## Extracting Color Information
To work with this data set, I need to extract color information from the item names. To do this, I've created a folder, `color_synonyms`, containing 10 text files, one for each of 10 main colors. Each file lists synonyms for its color (I've also opted to put brown into the orange category, since brown is essentially dark orange). Initially, I was tempted to use only red, yellow, orange, green, blue, purple, and gray (with gray covering black, gray, and white), but since black is one of the most common colors, I felt it was important to distinguish between black, gray, and white.
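
As a rough sketch of how a folder laid out this way can be read, the snippet below writes two throwaway synonym files to a temporary folder and loads them back. `load_color_synonyms`, the file names, and the synonym lists are all made up for illustration. The real `color_synonyms` folder is assumed to hold one `.txt` file per color with one synonym per line.

```python
import tempfile
from pathlib import Path

def load_color_synonyms(folder):
    """Map each file's stem (the color) to the synonyms listed in it."""
    synonyms = {}
    for path in sorted(Path(folder).glob("*.txt")):
        lines = path.read_text().splitlines()
        # Skip blank lines so a trailing newline does not add '' as a synonym.
        synonyms[path.stem] = [line.strip() for line in lines if line.strip()]
    return synonyms

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file contents, not the notebook's actual lists.
    Path(folder, "white.txt").write_text("white\ncream\nivory\n")
    Path(folder, "yellow.txt").write_text("yellow\ngold\n")
    demo_synonyms = load_color_synonyms(folder)

print(demo_synonyms)
```

Using the file stem as the key means that adding a new color only requires dropping another text file into the folder.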

Using these text files, I've created a function that reads each file in the folder into a dictionary in which each color name maps to a list of its synonyms. Using this dictionary, I made a second function that takes a string and returns a list of the unique colors mentioned in it. For example:

`item = 'cream dress with gold accent white lace detail'`

`what_color(item)`

`output: ['white', 'yellow']`

I applied this function to the `inametotal` column, which contains all of an outfit's item names joined into one string. Since the main outfit items are always listed first, I limit the colors to 4. Once we have this information, we can drop the `iname` columns and replace them with 4 new columns: `color_1`, `color_2`, `color_3`, and `color_4`. I also drop entries with no value for `color_1`, since we have no color information for those outfits. The remaining empty strings are replaced with `'none'` rather than treated as missing values, because they tell us how many colors are in the outfit.
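
One detail worth noting: `list(set(color))` does not keep colors in the order they appear in the string, so slicing it does not necessarily keep the colors of the first items. If first-appearance order matters, an order-preserving variant could look like the sketch below. `first_colors` and the tiny synonym dictionary are hypothetical and are not what the notebook uses. The sketch also assumes each synonym belongs to exactly one color, and it pads with `'none'` directly instead of padding with `''` and replacing later.

```python
def first_colors(item_names, synonyms, limit=4, pad="none"):
    """Return up to `limit` distinct colors in order of first appearance."""
    # Invert the dictionary once: synonym -> color.
    lookup = {word: color for color, words in synonyms.items() for word in words}
    found = []
    for word in item_names.split():
        color = lookup.get(word)
        if color is not None and color not in found:
            found.append(color)
    found = found[:limit]
    # Pad so every outfit yields exactly `limit` entries.
    return found + [pad] * (limit - len(found))

# Hypothetical synonym lists for illustration only.
demo_synonyms = {"white": ["white", "cream"], "yellow": ["yellow", "gold"]}
print(first_colors("cream dress with gold accent white lace detail", demo_synonyms))
# → ['white', 'yellow', 'none', 'none']
```

Inverting the dictionary also replaces the inner loop over every color with a single lookup per word.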

In [13]:
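def what_color(itemname):
    """Return exactly 4 entries: the colors found, padded with ''."""
    itemnamelist = itemname.split()
    color = []

    # Record the color of every word that appears in a synonym list.
    for item in itemnamelist:
        for key in colordict:
            if item in colordict[key]:
                color.append(key)

    # Deduplicate, keep at most 4 colors, then pad to a fixed length of 4.
    colorlist = list(set(color))[0:4]
    fill = 4 - len(colorlist)
    colorlist = colorlist + ([''] * fill)

    return colorlist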

In [14]:
df2['colors'] = df2['inametotal'].apply(what_color)
df2.head()

Unnamed: 0,views,likes,set_id,iname1,iname2,iname3,iname4,iname5,iname6,iname7,...,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8,inametotal,colors
0,8743,394,214181831,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,...,4495,25,27,261,259,1967,2,0,mock neck embroidery suede sweatshirt luxe dou...,"[, , , ]"
1,188,9,120161271,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,...,21,237,49,106,106,106,106,106,nirvana distressed t-shirt rag bone rock w/ bl...,"[white, black, , ]"
2,562,32,143656996,monki singlet,joy denim jacket,topshop moto joni high rise skinny jeans,black pointed chelsea boots,pre-owned chanel shoulder bag,rag bone floppy brim fedora,empty,...,104,25,237,261,37,55,0,0,monki singlet joy denim jacket topshop moto jo...,"[white, black, blue, ]"
3,2613,88,186627934,tops,saint laurent zip cutout stretch nappa leather...,corset super store women's black steampunk corset,allurez square diamond halo engagement ring we...,lip buckled matte womens corset lip uk,nude pink lipstick,amazing eye makeup miss,...,11,28,2,65,52,200,186,76,tops saint laurent zip cutout stretch nappa le...,"[black, red, yellow, ]"
4,62,3,206969379,yoins leather sexy v-neck sleeveless crop top,solid color long sleeve irregular blazer,alice+olivia floral pattern a-line skirt,zipped top chunky booties,gold boho turquoise leaf tassel earrings,etro heart locket necklace,bohemian flower mandala blue crystal clear pho...,...,11,236,9,261,64,62,1967,200,yoins leather sexy v-neck sleeveless crop top ...,"[black, red, blue, yellow]"


In [15]:
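# Spread the 4-element color lists across four columns.
df2[['color_1', 'color_2', 'color_3', 'color_4']] = pd.DataFrame(df2.colors.to_list(), index = df2.index)

# Keep only outfits with at least one color, then drop the text columns.
name_cols = ['iname1', 'iname2', 'iname3', 'iname4', 'iname5', 'iname6', 'iname7', 'iname8']
df3 = (df2[df2['color_1'] != '']
       .drop(columns=name_cols + ['colors', 'inametotal'])
       .reset_index(drop=True))

# Replace the '' padding with 'none' (an exact match, so no regex is needed).
df3[['color_2', 'color_3', 'color_4']] = df3[['color_2', 'color_3', 'color_4']].replace('', 'none')
df3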

Unnamed: 0,views,likes,set_id,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8,color_1,color_2,color_3,color_4
0,188,9,120161271,21,237,49,106,106,106,106,106,white,black,none,none
1,562,32,143656996,104,25,237,261,37,55,0,0,white,black,blue,none
2,2613,88,186627934,11,28,2,65,52,200,186,76,black,red,yellow,none
3,62,3,206969379,11,236,9,261,64,62,1967,200,black,red,blue,yellow
4,276,83,201969694,11,241,47,37,62,196,0,0,white,black,blue,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16427,183,117,216801059,104,9,47,35,4428,319,0,0,black,gray,none,none
16428,2428,382,190488700,17,236,28,261,36,0,0,0,black,red,none,none
16429,2184,398,187504514,4,24,46,37,65,65,0,0,white,none,none,none
16430,3147,440,211085207,4,24,43,37,4428,316,0,0,black,gray,yellow,none


In [16]:
df3.to_csv('tidy_data.csv', index=False)