# Tidying the Data
The data that we have here today is a little special: it comes as a json file with nested lists. We will need to untangle and flatten this data set before we can make use of the machine learning tools we will be using in R.


## Importing the Data Set
To start off, we need to take a look at the data set and explore what it has to offer. Upon first look, we can see that the data comes in a nested format, with each outfit having the attributes `name`, `views`, `items`, `image`, `likes`, `date`, `set_url`, and `desc`. The `items` attribute is nested, which means it has its own set of attributes within it. This includes `index`, `name`, `price`, `likes`, `image`, and `categoryid`, which all describe parts of an outfit.

The data also comes pre-split, but for the sake of this project, we will need to recombine them.

In [17]:
import pandas as pd
import json
from collections import Counter

f = open('polyvore_data/train_no_dup.json')
dict1 = json.load(f)
f.close()
f = open('polyvore_data/valid_no_dup.json')
dict2 = json.load(f)
f.close()
f = open('polyvore_data/test_no_dup.json')
dict3 = json.load(f)
f.close()

df1 = pd.json_normalize(dict1)
df2 = pd.json_normalize(dict2)
df3 = pd.json_normalize(dict3)

dflist = [df1, df2, df3]
df = pd.concat(dflist)

In [18]:
df.head(3)

Unnamed: 0,name,views,items,image,likes,date,set_url,set_id,desc
0,Casual,8743,"[{'index': 1, 'name': 'mock neck embroidery su...",http://ak1.polyvoreimg.com/cgi/img-set/cid/214...,394,One month,http://www.polyvore.com/casual/set?id=214181831,214181831,A fashion look from January 2017 by beebeely-l...
1,Being a Vans shoe model with Luke. Idk about t...,188,"[{'index': 1, 'name': 'nirvana distressed t-sh...",http://ak1.polyvoreimg.com/cgi/img-set/cid/120...,9,Two years,http://www.polyvore.com/being_vans_shoe_model_...,120161271,A fashion look from April 2014 featuring destr...
2,These Chanel bags is a bad habit .x,562,"[{'index': 1, 'name': 'monki singlet', 'price'...",http://ak1.polyvoreimg.com/cgi/img-set/cid/143...,32,Two years,http://www.polyvore.com/these_chanel_bags_is_b...,143656996,12.19.14


In [19]:
df.shape

(21889, 9)

In [20]:
df['items'].loc[1,][0:2]

1    [{'index': 1, 'name': 'nirvana distressed t-sh...
1    [{'index': 1, 'name': 'classic bracelet', 'pri...
Name: items, dtype: object

## Unnesting the Data
The attributes that I'm interested within each item are `name` and `categoryid` since they give the most information about the visual aspects of the outfit which is what I want to focus on. I've created a function that extracts this information and pivots them into their own columns.

In [21]:
def itemscol_to_dataframe(column):
    data = []

    for index, sublist in column.items():
        name_list = []
        category_list = []

        for item in sublist:
            name_list.append(item['name'])
            category_list.append(item['categoryid'])

        if len(name_list) < 8:
            fill = 8 - len(name_list)
            name_list = name_list + (['empty'] * fill)
            category_list = category_list + ([0] * fill)
        row = name_list + category_list
        data.append(row)
        
    itemsdf = pd.DataFrame(data, columns = ['iname1', 'iname2', 'iname3', 'iname4',
                                        'iname5', 'iname6', 'iname7', 'iname8',
                                        'iid1', 'iid2', 'iid3', 'iid4',
                                        'iid5', 'iid6', 'iid7', 'iid8'])
    return itemsdf

In [22]:
items_df = itemscol_to_dataframe(df['items'])
items_df.head(2)

Unnamed: 0,iname1,iname2,iname3,iname4,iname5,iname6,iname7,iname8,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8
0,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,empty,4495,25,27,261,259,1967,2,0
1,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,disney alice wonderland cat rubber bracelet ho...,21,237,49,106,106,106,106,106


Since there can be up to 8 items in an outfit, this dataframe contains 8 * 2 = 16 columns. Outfits with fewer items have their excess item name and item id columns filled with `'empty'` and `0`, respectively. I've chosen not to use `NULL` / `N/A` here since this isn't really "missing" information; rather, it gives us information on how many items are in the outfit which might be useful.

Now we can combine this with the full dataframe, dropping unnecessary columns, and export it as a csv!

In [23]:
df2 = df1.drop(columns = ['items', 'image', 'name', 
                        'date', 'set_url', 'desc']
        ).join(items_df)
df2.head(3)

Unnamed: 0,views,likes,set_id,iname1,iname2,iname3,iname4,iname5,iname6,iname7,iname8,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8
0,8743,394,214181831,mock neck embroidery suede sweatshirt,luxe double zip hooded jacket,citizens humanity high rise rocket hem jean,suede tie short boots,cloth travel school backpack,,polyvore,empty,4495,25,27,261,259,1967,2,0
1,188,9,120161271,nirvana distressed t-shirt,rag bone rock w/ black skinny jeans,vans authentic black mono trainers,time low rubber bracelet hot topic,veil logo rubber bracelet,rubber bracelet hot topic,romance i'm,disney alice wonderland cat rubber bracelet ho...,21,237,49,106,106,106,106,106
2,562,32,143656996,monki singlet,joy denim jacket,topshop moto joni high rise skinny jeans,black pointed chelsea boots,pre-owned chanel shoulder bag,rag bone floppy brim fedora,empty,empty,104,25,237,261,37,55,0,0


In [24]:
df2.to_csv('tidy_data.csv')

In [27]:
df2['inametotal'] = df2['iname1'] + ' ' + df2['iname2'] + ' ' + df2['iname3'] + ' ' + df2['iname4'] + ' ' + df2['iname5'] + ' ' + df2['iname6'] + ' ' + df2['iname7'] + ' ' + df2['iname8']
all_item_names = df2['inametotal'].str.cat(sep = ' ')
all_item_names
item_word_list = all_item_names.split()
counter = Counter(item_word_list)
most_occur = counter.most_common(100)
most_occur
# Will need to use a website or something to get a list of color synonyms to group these with


[('empty', 23722),
 ('black', 9909),
 ('leather', 8516),
 ('bag', 6350),
 ("women's", 5810),
 ('top', 4504),
 ('jeans', 4133),
 ('dress', 4100),
 ('gold', 4031),
 ('white', 3837),
 ('earrings', 3619),
 ('iphone', 3613),
 ('sunglasses', 3382),
 ('necklace', 3381),
 ('skirt', 3254),
 ('boots', 3142),
 ('suede', 3004),
 ('jacket', 2922),
 ('case', 2871),
 ('denim', 2763),
 ('ring', 2703),
 ('mini', 2622),
 ('yoins', 2563),
 ('high', 2535),
 ('blue', 2533),
 ('clutch', 2497),
 ('plus', 2465),
 ('bracelet', 2418),
 ('skinny', 2164),
 ('coat', 2127),
 ('shoulder', 2125),
 ('sandals', 2122),
 ('long', 2112),
 ('set', 2106),
 ('women', 2106),
 ('lace', 2069),
 ('red', 2014),
 ('new', 1996),
 ('print', 1986),
 ('pink', 1961),
 ('sleeve', 1954),
 ('ankle', 1949),
 ('silver', 1894),
 ('pre-owned', 1877),
 ('lipstick', 1861),
 ('shorts', 1850),
 ('topshop', 1818),
 ('sweater', 1788),
 ('size', 1749),
 ('faux', 1711),
 ('vintage', 1699),
 ('shoes', 1693),
 ('rose', 1689),
 ('pumps', 1651),
 ('de', 