# Exploring Open Products Data
Several datasets on retail products are available at: http://pod.opendatasoft.com/explore/?sort=modified
Here, we explore the GTIN dataset (pod_gtin.csv) to see what it contains.

In [12]:
import pandas as pd
import numpy as np

# Read every column as text; dtype=str is the portable spelling of dtype='unicode'
pod_gtin = pd.read_csv('data/pod_gtin.csv', sep=';', dtype=str)

In [23]:
pod_gtin.head(2)

Unnamed: 0,GTIN Code,GTIN Name,Segmentation (GPC),GPC Segmentation Code,GPC Brick Code,GPC Class Code,GPC Family Code,GTIN Image,Brand Code,Brand,...,Package Type Code,Package Type Name,Weight (g),Weight (oz),Volume (ml),Volume (fl oz),Alcohol by Volume,Alcohol by Weight,Registration,Registration Country ISO Code
0,33674156872,Way Super Fisol Fish Oil 45 softgels,Healthcare,51000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,YNNNNA,Nature's Way,...,,,,,,,,,GS1 US,US
1,33674156889,Super Fisol Enteric-coated Fish Oil 500 mg 90 ...,Healthcare,51000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,YNNNNA,Nature's Way,...,,,,,,,,,GS1 US,US


In [24]:
pod_gtin.describe()

Unnamed: 0,GTIN Code,GTIN Name,Segmentation (GPC),GPC Segmentation Code,GPC Brick Code,GPC Class Code,GPC Family Code,GTIN Image,Brand Code,Brand,...,Package Type Code,Package Type Name,Weight (g),Weight (oz),Volume (ml),Volume (fl oz),Alcohol by Volume,Alcohol by Weight,Registration,Registration Country ISO Code
count,921804,715555,482126,482126,11278,11593,11596,921804,527822,527822,...,615,615,154507.0,240227.0,79170.0,69237.0,134.0,0.0,921804,921802
unique,921804,551958,31,31,129,53,22,921804,4007,3989,...,22,22,3003.0,2326.0,1156.0,931.0,36.0,0.0,106,104
top,657812120213,Ice Cream,Food - Beverage - Tobacco,50000000,10000278,72010300,72010000,http://product.okfn.org.s3.amazonaws.com/image...,L2E3J8,Carrefour,...,13,Pot en verre,0.0,16.0,0.0,12.0,45.0,,GS1 US,US
freq,1,1072,177670,177670,800,2020,3691,1,8562,8562,...,123,123,5220.0,14332.0,5798.0,5377.0,17.0,,603699,603699


In [61]:
np.shape(pod_gtin)

(921804, 39)
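
Because the file is loaded with every column as text, columns such as 'Weight (g)' hold strings rather than numbers. A minimal sketch of converting such a column before doing arithmetic on it, using a small made-up frame in place of pod_gtin (the values, including the 'n/a' entry, are illustrative assumptions and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for pod_gtin: all values are strings, as when every column is read as text.
toy = pd.DataFrame({'Weight (g)': ['269.0', '425.0', None, 'n/a']})

# errors='coerce' turns missing or unparseable entries into NaN instead of raising.
weight_g = pd.to_numeric(toy['Weight (g)'], errors='coerce')

print(weight_g.mean())
```

The same call could be applied to pod_gtin['Weight (g)'] or the other measurement columns if numeric summaries are needed.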

Select the rows of pod_gtin whose 'Segmentation (GPC)' column is not NaN. We need this subset so that we can sample items by category.

In [81]:
pod_gtin_with_segmentation_code = pod_gtin.dropna(subset=['Segmentation (GPC)'])

In [82]:
np.shape(pod_gtin_with_segmentation_code)

(482126, 39)
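
As a quick check on this step, dropna(subset=...) should keep exactly the rows in which the named column is present. A small sketch on made-up data (the toy frame is an assumption and only stands in for pod_gtin):

```python
import pandas as pd

# Toy stand-in for pod_gtin with two missing segmentations.
toy = pd.DataFrame({
    'GTIN Code': ['1', '2', '3', '4'],
    'Segmentation (GPC)': ['Healthcare', None, 'Clothing', None],
})

with_segment = toy.dropna(subset=['Segmentation (GPC)'])

# Equivalent boolean-mask form of the same filter.
mask = toy['Segmentation (GPC)'].notna()
assert with_segment.equals(toy[mask])

# Number of rows removed because the segmentation is missing.
n_dropped = len(toy) - len(with_segment)
print(n_dropped)
```

On the full data, the same subtraction gives 921804 - 482126 = 439678 rows without a segmentation.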

In [92]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
pod_gtin_with_segmentation_code['Segmentation (GPC)'].value_counts()

Food - Beverage - Tobacco                              177670
Beauty - Personal Care - Hygiene                       154427
Healthcare                                              48481
Pet Care - Food                                         24210
Toys - Games                                            14221
Cleaning - Hygiene Products                             10820
Clothing                                                 9440
Home Appliances                                          6635
Baby Care                                                6519
Stationery - Office Machinery - Occasion Supplies        5516
Automotive                                               4156
Kitchen Merchandise                                      4003
Household - Office Furniture - Furnishings               3355
Audio Visual - Photography                               3231
Personal Accessories                                     2436
Footwear                                                 1177
Sports E

<matplotlib.figure.Figure at 0x895bb00>
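
The cell above opens an empty figure, because value_counts() only returns a table and nothing is drawn. A hedged sketch of plotting the counts as a horizontal bar chart with pandas' built-in plotting follows. The three counts are copied from the top of the output above, and the Agg backend line is only an assumption for running outside a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First three rows of the value_counts() output shown above.
counts = pd.Series({
    'Food - Beverage - Tobacco': 177670,
    'Beauty - Personal Care - Hygiene': 154427,
    'Healthcare': 48481,
})

fig, ax = plt.subplots()
# Sort ascending so the largest segment ends up at the top of the chart.
counts.sort_values().plot(kind='barh', ax=ax)
ax.set_xlabel('Number of products')
fig.tight_layout()
```

In the notebook, `counts` would be replaced by the full value_counts() result.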

Now, we want to take a subset of this data. The selection needs two parameters: the 'Segmentation (GPC)' values to include and the number of rows to take for each of them. Note that the code below takes the first n rows of each segment with head(), not a random draw.

In [111]:
seg_gpc_list_to_select = {'Food - Beverage - Tobacco':100, 
                          'Beauty - Personal Care - Hygiene':100, 
                          'Healthcare': 100, 
                          'Cleaning - Hygiene Products': 100, 
                          'Home Appliances':100, 
                          'Arts - Crafts - Needlework':100, 
                          'Lawn - Garden Supplies':100, 
                          'Kitchen Merchandise':100, 
                          'Personal Accessories':100, 
                          'Clothing':100}

# to accumulate the dataframes to be concatenated
frames = [] 

for seg_gpc, n_rows in seg_gpc_list_to_select.items():
    pod_gtin_subset_to_add = pod_gtin_with_segmentation_code[pod_gtin_with_segmentation_code['Segmentation (GPC)'] == seg_gpc]
    pod_gtin_subset_to_add = pod_gtin_subset_to_add.head(n=n_rows)
    print(np.shape(pod_gtin_subset_to_add))
    frames.append(pod_gtin_subset_to_add)

pod_gtin_subset = pd.concat(frames)
print(np.shape(pod_gtin_subset))
#np.shape(pod_gtin_with_segmentation_code[pod_gtin_with_segmentation_code['Segmentation (GPC)'] == 'Food - Beverage - Tobacco'])

(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(100, 39)
(1000, 39)
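
The loop above uses head(), so each segment contributes its first rows in file order. If a random draw is preferred, one possible sketch is shown below. The helper name sample_per_segment, its seed parameter, and the toy data are all hypothetical and not part of the original notebook:

```python
import pandas as pd

def sample_per_segment(df, n_per_segment, column='Segmentation (GPC)', seed=0):
    """Randomly draw up to n rows for each requested segment (hypothetical helper)."""
    frames = []
    for segment, n in n_per_segment.items():
        rows = df[df[column] == segment]
        # min() guards against segments that have fewer than n rows.
        frames.append(rows.sample(n=min(n, len(rows)), random_state=seed))
    return pd.concat(frames)

# Toy stand-in for pod_gtin_with_segmentation_code.
toy = pd.DataFrame({
    'GTIN Code': [str(i) for i in range(7)],
    'Segmentation (GPC)': ['Clothing'] * 5 + ['Healthcare'] * 2,
})

toy_subset = sample_per_segment(toy, {'Clothing': 3, 'Healthcare': 5})
print(toy_subset['Segmentation (GPC)'].value_counts())
```

Fixing random_state keeps the subset reproducible between runs. With the notebook's data, the call would take pod_gtin_with_segmentation_code and seg_gpc_list_to_select as arguments.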


In [113]:
pod_gtin_subset.head(20)

Unnamed: 0,GTIN Code,GTIN Name,Segmentation (GPC),GPC Segmentation Code,GPC Brick Code,GPC Class Code,GPC Family Code,GTIN Image,Brand Code,Brand,...,Package Type Code,Package Type Name,Weight (g),Weight (oz),Volume (ml),Volume (fl oz),Alcohol by Volume,Alcohol by Weight,Registration,Registration Country ISO Code
48,33698815281,Wine Vinegar,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,,750.0,25.4,,,GS1 US,US
49,33698815298,Hot Cherry Peppers Ground,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,,946.0,32.0,,,GS1 US,US
50,33698815304,Hot Cherry Peppers Ground,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,,355.0,12.0,,,GS1 US,US
51,33698815502,Spanish Olives Stuffed Queens,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,269.0,9.5,,,,,GS1 US,US
52,33698815809,Giardiniera Fancy,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,32.0,,,,,GS1 US,US
53,33698815823,Pepper Steaks,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,,946.0,32.0,,,GS1 US,US
54,33698816714,Clam Sauce,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,425.0,15.0,,,,,GS1 US,US
55,33698816851,Diced Tomatoes,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,411.0,14.5,,,,,GS1 US,US
56,33698816912,Ready To Serve Soup,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,75.0,,,,,,GS1 US,US
57,33698817049,Pepperoni Pizza Kit,Food - Beverage - Tobacco,50000000,,,,http://product.okfn.org.s3.amazonaws.com/image...,W1EWKK,Sun of Italy,...,,,,,,,,,GS1 US,US


Exporting the subset of the pod_gtin dataset to a file.

In [121]:
file_name = 'pod_gtin_' + str(len(pod_gtin_subset)) + '.csv'
pod_gtin_subset.to_csv(file_name)
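
One detail worth knowing about this export: to_csv writes the row index by default, and read_csv then surfaces it as a column named 'Unnamed: 0'. A small sketch of the difference, using an in-memory buffer and a made-up frame instead of a real file (both are assumptions for illustration):

```python
import io
import pandas as pd

toy = pd.DataFrame({'GTIN Code': ['1', '2'], 'Brand': ['A', 'B']})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to pass index=False here depends on whether the original row positions should be kept in the exported file.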