# Notebook 1

The objective of this notebook is to clean the backend datasets, extract features and produce a file that can be used as an input to the modeling stage.

_This notebook contains:_
1. Data cleaning
2. Feature Creation
3. Feature extraction
4. Lemmatizing / Stemming
5. Brand Encoding

# Import packages and data 

In [7]:
import pandas as pd 
import numpy as np
import re
import regex as regex
import time
from tqdm import tqdm
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
from tqdm import tqdm


In [8]:
brands = pd.read_csv('behold_brands USC.csv')
product = pd.read_excel('Behold+product+data+04262021.xlsx')

**Quick peek into the data**

In [9]:
brands.head(2)

Unnamed: 0,brand_id,brand,brand_value,bio,quote,quote_attribute,intro,lifestyle_copy,short_bio,listing_bio
0,01ESKR0CH2KYC7KBNTN0S38EQA,Mari Giudicelli,Handmade / Artisan Crafted,,,,,,,Behold Mari Giudicelli! This Brazilian shoe de...
1,01ESKR0CH2KYC7KBNTN0S38EQA,Mari Giudicelli,Sustainable,,,,,,,Behold Mari Giudicelli! This Brazilian shoe de...


In [10]:
# Considering data stats
brands.describe().T

Unnamed: 0,count,unique,top,freq
brand_id,162,74,01EFJFZ29KVBK14BDJNFBDK1G7,5
brand,162,74,lemlem,5
brand_value,154,6,Women Owned,47
bio,157,71,lemlem is a women's resort wear brand made ent...,5
quote,157,71,"We believe in the power of collaboration, it i...",5
quote_attribute,154,59,Vogue,13
intro,154,70,"Founded in 2007 by Liya Kebede, lemlem is enti...",5
lifestyle_copy,156,66,...,8
short_bio,157,71,Á La Holiday,5
listing_bio,157,71,Behold LemLem! Founded by unstoppable supermod...,5


In [11]:
# Checking for null values
brands.isna().sum()

brand_id           0
brand              0
brand_value        8
bio                5
quote              5
quote_attribute    8
intro              8
lifestyle_copy     6
short_bio          5
listing_bio        5
dtype: int64

In [12]:
# Number of unique brands
print(f' Number of Brands in the Brand file: {len(brands.brand.unique())}')

 Number of Brands in the Brand file: 74


In [13]:
product.head(2)

Unnamed: 0,product_id,brand,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,product_active
0,01EX0PN4J9WRNZH5F93YEX6QAF,Two,Unknown,Khadi Stripe Shirt-our signature shirt,,2021-01-27 01:17:19.305 UTC,https://two-nyc.myshopify.com/products/white-k...,Our signature khadi shirt\navailable in black ...,Our signature khadi shirt\n\navailable in blac...,Khadi Stripe Shirt-our signature shirt,True
1,01F0C4SKZV6YXS3265JMC39NXW,Collina Strada,Unknown,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,,2021-03-09 18:43:10.457 UTC,https://collina-strada-2.myshopify.com/product...,Mid-length dress with ruffles and adjustable s...,Mid-length dress with ruffles and adjustable s...,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,True


In [14]:
product.describe().T

Unnamed: 0,count,unique,top,freq
product_id,61355,61355,01EQ9MQK5WPJFR5HD1812QTXW6,1
brand,61355,386,7 For All Mankind,9011
brand_category,60896,632,Unknown,53249
name,61354,42615,ANKLE SKINNY,247
details,9200,7034,True to size.,265
created_at,61355,61349,2020-08-28 21:40:42.81 UTC,2
brand_canonical_url,61355,58156,https://www.7forallmankind.com/airweft-denim-s...,13
description,51238,42210,product name by ancient greek sandals,127
brand_description,51234,42550,product name by ancient greek sandals,127
brand_name,61354,42602,ANKLE SKINNY,247


In [15]:
product.isna().sum()

product_id                 0
brand                      0
brand_category           459
name                       1
details                52155
created_at                 0
brand_canonical_url        0
description            10117
brand_description      10121
brand_name                 1
product_active             0
dtype: int64

In [16]:
print(f' Number of Brands in the Product file: {len(product.brand.unique())}')

 Number of Brands in the Product file: 386


In [17]:
print(f'Number of Brand Category: {len(product.brand_category.unique())}')

Number of Brand Category: 633


# Cleaning

In this stage, we get into the data cleaning. 

**Cleaning Tasks**:

1. fill missing values
2. remove \n, html tags and other encoding characters
3. remove punctuation
4. change to lower case
5. remove stopwords
6. create named entities


## Fill Missing Value

In [18]:
# create a copy of the original dataset
product_copy = product.copy()

In [19]:
# fill missing values
product_copy.fillna('None',inplace = True)

## Remove \n, tags and Punctuation

In [20]:
# remove unusefull elements
product_copy.description = product_copy.description.str.replace('\n',' ')
product_copy.description = product_copy.description.str.replace('\r',' ')
product_copy.description = product_copy.description.str.replace(r'  ',' ')
product_copy.description = product_copy.description.str.replace(r'\bs\b','')

# remove punctuation
product_copy.description = product_copy.description.str.replace(r'[^A-Za-z0-9 ]+','')



  product_copy.description = product_copy.description.str.replace(r'\bs\b','')
  product_copy.description = product_copy.description.str.replace(r'[^A-Za-z0-9 ]+','')


## Change to Lower Case

In [21]:
# change all the words to lower case 
product_copy.description = product_copy.description.str.lower()

In [22]:
# quick check
product_copy.description[1]

'midlength dress with ruffles and adjustable straps bias cut side seam invisible zipper made in new york model wears size small 100 rose sylk rose sylk is an organic cellulose fiber made from the natural waste of rose bushes and stems'

## Create Named Entities

**Note:** We decided to remove stopwords after identifying entities and creating features. Given the nature of the product description and general phrasing, we wanted to preserve the semantic meaning as much as possible while creating features. We tried removing stopwords and then extracting features but found the former to identify features better.

In [23]:
# Cleaning versions of New York City so that it can be captured in the features
product_copy['description'] = product_copy['description'].str.\
                                replace(r'\bnew\b\s\byork\b\s(?:\bcity\b)?','new_york_city ')



  product_copy['description'] = product_copy['description'].str.\


In [24]:
# check the result of replacement
product_copy['description'].astype(str).apply(lambda x: re.search(r'\bnew\b\s\byork\b\s(?:\bcity\b)?',x)).sum()

0

In [25]:
# Cleaning versions of USA so that it can be captured in the features

product_copy['description'] = product_copy['description'].str.\
                                replace(r'\b(?:the\s)?usa?\b','USA')
product_copy['description'] = product_copy['description'].str.\
                                replace(r'\b(?:the\s)?united\sstates\b','USA')

  product_copy['description'] = product_copy['description'].str.\
  product_copy['description'] = product_copy['description'].str.\


In [26]:
## check the result 
product_copy['description'].astype(str).apply(lambda x: re.search(r'\b(?:the\s)?usa?\b',x)).sum()

0

# Create Features (Categorizing)

To better capture the characteristics about the products, we decided to create some features manually using the regex extraction. Some common ways of categorizing products are:
    
    1. User type: women, men, or children
    2. Color
    3. Clothing Category: top, bottom, one piece...
    4. Occasions: sports, casual,cozy, formal, swim, holiday, business... 
    5. Made in:  USA, China, Europe...
    7. Wash Type: no machine wash,dry wash, hand wash, tumble dry, hang dry...
    8. Fabric: ploy, chiffon, cotton ...

## User Type

Because a large proportion of clothing is for women, we suppose that if there are no keywords indicate it is for children or men, we treat this product as women's clothing. For the products that don't have a description are also treated as women's clothing.

In [27]:
def isWomensClothing(txt):
    """ Function to determine whether it is an article of women's clothing """

    txt = str(txt)
    val = True
    if re.search(r'\b(Girl?|boy?|men|man|Gir?|baby|kid?)\b', txt, re.IGNORECASE ):
        val = False
    return val


In [28]:
product_copy['is_womens_clothing'] = pd.DataFrame(
                                                [product_copy.description.\
                                                  apply(isWomensClothing),
                                                product_copy.name.apply(isWomensClothing),
                                                product_copy.details.\
                                                 apply(isWomensClothing)]
                                                 ).all()
    
# any return false is false women
# all return True is True women

product_copy['is_womens_clothing'].value_counts()

True     59626
False     1729
Name: is_womens_clothing, dtype: int64

> **There are 59,626 records labelled as `is_women_clothing`**

In [29]:
def isChildrenClothing(txt):
    """ Function to determine whether it is an article of children's clothing """

    txt = str(txt)
    val = False
    if re.search(r'\b(Girl?|boy?|Gir?|baby|kid?)\b', txt, re.IGNORECASE ):
        val = True
    return val

In [30]:
product_copy['is_children_clothing'] = np.nan
product_copy['is_children_clothing'] = pd.DataFrame(
                                                  [product_copy.description.\
                                                  apply(isChildrenClothing),
                                                product_copy.name.apply(isChildrenClothing),
                                                product_copy.details.\
                                                 apply(isChildrenClothing)]
                                                    ).any()

product_copy['is_children_clothing'].value_counts()

False    60455
True       900
Name: is_children_clothing, dtype: int64

> **900 records are labeled as children's clothing.**

In [31]:
def isMenClothing(txt):
    """ Function to determine whether it is an article of women's clothing """

    txt = str(txt)
    val = False
    if re.search(r'\b(man|men|man\'s|men\'s)\b', txt, re.IGNORECASE ):
        val = True
    return val

In [32]:
product_copy['is_men_clothing'] = pd.DataFrame(  [product_copy.description.\
                                                  apply(isMenClothing),
                                                product_copy.name.apply(isMenClothing),
                                                product_copy.details.\
                                                 apply(isMenClothing)]
                                                    ).any()
product_copy['is_men_clothing'].value_counts()

False    60517
True       838
Name: is_men_clothing, dtype: int64

> **838 records are labelled as men's clothing.**

In [33]:
conditions = [(product_copy.is_womens_clothing == True),
              (product_copy.is_men_clothing == True),
              (product_copy.is_children_clothing == True)]
              
values = ['women','men','children']

product_copy['user_type']= np.select(conditions, values)

product_copy['user_type'].value_counts()

women       59626
children      891
men           838
Name: user_type, dtype: int64

## Clothing Category

We grouped the products into 9 categories:

    1. bottom 
    2. one-piece 
    3. shoes
    4. handbag
    5. scarf
    6. top 
    7. accessory
    8. linen
    9. lingerie. 
    
  We used text in 'name','description','details' fields in the product dataframe to identify these categories. 

In [34]:
bottom_seq=r'\b(capri?|leggings?|bottoms?|skirts?|sweatpants?|pants?|jeans?|midi|trousers?|shorts?|trunks?)\b' #|knee|ankle
one_piece_seq=r'\b(kimono|jumpsuit|dress(?:es)?|gowns?|swimsuit?|onesies?|unitards?|bodysuits?|rompers?|one ?piece)\b'
shoe_seq=r'\b(shoes?|sneakers?|flats|boot|heels?|sandals?|mules?|loafers?|pumps?)\b' #(?:ies|s)?
handbag_seq=r'\b(bags?|handbags?|clutch(?:es)?|wallets?|purses?|duffels?)\b'
scarf=r'\b(scar(?:f|ves)?|wraps?|stoles?|shawl)\b' # row 412, 705, 
top=r'\b(Tank|caftan|hoodies?|tshirts?|tees?|tops?|top|blouses?|jackets?|blazers?|shirts?|tops?|coats?|suits|sweaters?|sweatshirts?)\b'
acc=r'\b(capes?|socks?|earrings?|belts?|gloves?|headbands?|ties?|hats?|caps?)\b'
linen=r'\b(linens?)\b'
lingerie=r'\b(bras?)\b'

# product category hierarchy
dict_seq={'one_piece':one_piece_seq,
          'shoe':shoe_seq,
          'handbag':handbag_seq,
          'scarf':scarf,
          'top':top,'acc':acc,'linen':linen,
          'bottom':bottom_seq,'lingerie':lingerie
        }
# occurence score calculation
for d in dict_seq:
    product_copy[f'{d}_check']=0
    for col in ['name','description','details','brand_category']:
        product_copy[f'{d}_check']=product_copy[f'{d}_check']+product_copy[col].str.contains(dict_seq[d],case=False)


 #calculating max occurence         
product_copy['max_value_cat']=product_copy[["bottom_check","shoe_check","one_piece_check","handbag_check","scarf_check",'acc_check','top_check','linen_check','lingerie_check']].max(axis=1)
def max_presence(row):
    for d in dict_seq.keys():
        colname=f'{d}_check'
        if (row['max_value_cat']==row[colname])&(row['max_value_cat']!=0):
            return d
        elif row['max_value_cat']==0:
            return 'None'
#searching for the category with max occurence    
product_copy['final_category']=product_copy.apply(max_presence,axis=1)

  return func(self, *args, **kwargs)


In [35]:
product_copy['final_category'].value_counts()

top          18373
None         12943
one_piece     9636
bottom        8687
shoe          4901
acc           2741
handbag       2495
scarf          837
lingerie       440
linen          296
Name: final_category, dtype: int64

## Color

In [36]:
def findColors(txt):
    """ Function to determine the color of item """
   
    colors_re = r'\b(beige|light brown|black|blue ?green|blue|brown|umber|burgundy|gold(?:en)?|gray|grey|green|navy|neutral|orange|aurantia|pink|purple|violet|red|scarlet|silver|teal|white|yellow|(?:multi(?:ple)?|several|different|many|more than one) ?colou?rs?)\b'

    val = []
    txt = str(txt)
    if re.findall(colors_re, txt, re.IGNORECASE ):
        val = re.findall(colors_re, txt, re.IGNORECASE )
    return val

In [37]:
# find all colors in item descriptions and product name 
product_copy['color_list'] = product_copy['description'].apply(findColors) +product_copy['name'].apply(findColors)              


In [38]:
# extract the set of colors
product_copy['colors'] = product_copy['color_list'].apply(lambda x: set(y.lower() for y in x))


In [39]:
product_copy.sample(2)

Unnamed: 0,product_id,brand,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,...,scarf_check,top_check,acc_check,linen_check,bottom_check,lingerie_check,max_value_cat,final_category,color_list,colors
32371,01EPABXZX4KR0ZQHS92AP5VQ9Y,Holden,Unknown,Womens Oversized Wool Crew,,2020-11-04 19:32:26.66 UTC,https://holdenouterwear.myshopify.com/products...,the sweater modernized with deep drop tail and...,The sweater modernized with deep drop tail and...,Womens Oversized Wool Crew,...,0,1,0,0,0,0,1.0,top,[],{}
47268,01EC8PMCC0HKGT4AD2Z97G3APC,Sea,Unknown,O'Keefe Blouse,,2020-07-02 21:23:58.968 UTC,https://seanyc.myshopify.com/products/okeefe-b...,the okeefe long sleeve blouse features a peasa...,The O'Keefe long sleeve blouse features a peas...,O'Keefe Blouse,...,0,2,0,0,0,0,2.0,top,[],{}


In [40]:
# count the number of colors
product_copy['n_colors'] = product_copy['colors'].apply(len)
product_copy[['colors','n_colors']].head(5)


Unnamed: 0,colors,n_colors
0,"{white, black}",2
1,{pink},1
2,{red},1
3,{black},1
4,{black},1


In [41]:
# tag color labels
product_copy['final_color'] = np.nan

# label items with more than one color as "Multi"
product_copy.loc[product_copy['n_colors']>1,'final_color'] = 'multi'

# label items with one color
product_copy.loc[product_copy['n_colors']==1,'final_color'] = product_copy.loc[product_copy['n_colors']==1,'colors'].\
                                                                        apply(lambda x:list(x)[0])

In [42]:
product_copy['final_color'].value_counts()

multi                   5176
black                   4312
white                   1948
blue                    1902
gold                     915
navy                     771
pink                     737
green                    717
grey                     600
red                      534
silver                   392
yellow                   384
brown                    305
orange                   264
neutral                  219
beige                    149
multicolor               147
golden                   140
purple                   131
gray                     104
violet                   102
burgundy                  90
teal                      56
multi color               30
different colors           9
blue green                 9
scarlet                    8
different colours          7
multicolour                6
light brown                6
multiple colors            6
more than one colour       5
different color            4
several colours            2
multiple color

**There are many ways to say 'multicolors', we grouped them into one value 'multi'**

In [43]:
multi_pattern = r'\b(?:multi(?:ple)?|several|different|many|more than one\b)\s? ?colou?rs?'
for i in tqdm(range(len(product_copy))):
    match = re.search(multi_pattern,
                       str(product_copy.loc[i,'final_color']))
    if match:
        product_copy.loc[i,'final_color'] = 'multi'
                       
    

100%|██████████| 61355/61355 [00:02<00:00, 21293.25it/s]


In [44]:
product_copy['final_color'].value_counts()

multi          5397
black          4312
white          1948
blue           1902
gold            915
navy            771
pink            737
green           717
grey            600
red             534
silver          392
yellow          384
brown           305
orange          264
neutral         219
beige           149
golden          140
purple          131
gray            104
violet          102
burgundy         90
teal             56
blue green        9
scarlet           8
light brown       6
umber             1
Name: final_color, dtype: int64

## Wash Type

This is further sub-categorized into:

    Machine wash cold
    Tumble dry low
    Wash Cold
    Hang Dry
    No Bleach
    Dry clean
    Hand wash
    Tumble dry

In [45]:
def findCare(txt):
    """ Function to determine the care method of item """
    txt = str(txt)
    val = []
    care_re = r'\b(dry clean|hand wash|not? bleach|tumble dry|hang dry\
                |machine wash cold|machine wash|(?:[don\'t|not?]) machine wash)\b'
    if re.findall(care_re, txt, re.IGNORECASE ):
        val = re.findall(care_re, txt, re.IGNORECASE)
    return val

In [46]:
product_copy['care_list'] = product_copy['description'].apply(findCare)              
product_copy['wash_type'] = product_copy['care_list'].apply(lambda x: set(y.lower() for y in x))


In [47]:
def is_dry_clean(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b(dry clean)\b', txt, re.IGNORECASE ):
        val = 1
   
    return val

product_copy['is_dry_clean'] = pd.DataFrame(product_copy.description.apply(is_dry_clean))
product_copy['is_dry_clean'].value_counts()

0    53596
1     7759
Name: is_dry_clean, dtype: int64

In [48]:
def is_hand_wash(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b(handwash)\b', txt, re.IGNORECASE ):
        val = 1
   
    return val

product_copy['is_hand_wash'] = pd.DataFrame(product_copy.description.apply(is_hand_wash))
product_copy['is_hand_wash'].value_counts()

0    61268
1       87
Name: is_hand_wash, dtype: int64

In [49]:
def is_machine_wash_cold(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b(machine wash cold)\b', txt, re.IGNORECASE ):
        val = 1
   
    return val

product_copy['is_machine_wash_cold'] = pd.DataFrame(product_copy.description.apply(is_machine_wash_cold))
product_copy['is_machine_wash_cold'].value_counts()

0    58564
1     2791
Name: is_machine_wash_cold, dtype: int64

In [50]:
def is_machine_wash(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b(machine wash)\b', txt, re.IGNORECASE ):
        val =1
   
    return val

product_copy['is_machine_wash'] = pd.DataFrame(product_copy.description.apply(is_machine_wash))
product_copy['is_machine_wash'].value_counts()

0    58026
1     3329
Name: is_machine_wash, dtype: int64

In [51]:
def is_tumble_dry(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b(tumble dry)\b', txt, re.IGNORECASE ):
        val = 1
   
    return val

product_copy['is_tumble_dry'] = pd.DataFrame(product_copy.description.apply(is_tumble_dry))
product_copy['is_tumble_dry'].value_counts()

0    58834
1     2521
Name: is_tumble_dry, dtype: int64

In [52]:
def is_not_bleach(txt):
    txt = str(txt)
    val = 0
    if re.search(r'\b((not|no) bleach)\b', txt, re.IGNORECASE ):
        val = 1
   
    return val

product_copy['is_not_bleach'] = pd.DataFrame(product_copy.description.apply(is_not_bleach))
product_copy['is_not_bleach'].value_counts()

0    59351
1     2004
Name: is_not_bleach, dtype: int64

In [53]:
product_copy[product_copy['is_not_bleach']==1]['description']

60       details  super cozy lightweight jersey warm iv...
943      the classic army jacket add tomboy toughness t...
2660     details  lightweight long sleeve vintage terry...
2699     details  cashmere blend long sleeve crewneck h...
2965     details  luxe cotton long sleeve crisp white b...
                               ...                        
56547    high waisted wide leg pull on pant in our cust...
56651    one shouldered dress in our custom simon mille...
58836    the nejvi top is the intersection of three of ...
58837    these wideleg beauties feature an understated ...
59161    this machinewashable top has the makings of a ...
Name: description, Length: 2004, dtype: object

In [54]:
product_copy.iloc[60]

product_id                                     01F029NZMTHZ5V3RED5RS07D1W
brand                                                               Rails
brand_category                                                    Unknown
name                                              NESSA - BOTANICAL PALMS
details                                                              None
created_at                                    2021-03-05 22:56:09.878 UTC
brand_canonical_url     https://rails-25.myshopify.com/products/nessa-...
description             details  super cozy lightweight jersey warm iv...
brand_description       DETAILS |\nSuper cozy, lightweight jersey, war...
brand_name                                        NESSA - BOTANICAL PALMS
product_active                                                      False
is_womens_clothing                                                   True
is_children_clothing                                                False
is_men_clothing                       

## Fabric

We try to find the fabric of the product.

In [55]:
fabric_list = ['broadcloth','brocade','calico','cashmere','chambray','chiffon',
               'corduroy','cotton','eyelet','faille','foulard','furbelow',
               'fustian','gingham','grosgrain','jacquard','knit','linen','lisle',
              'madras','merino','paisley','sateen','satin','seersucker','shetland',
              'silk','taffeta','tulle','velvet','polyester']

In [56]:

def findFabric(txt):
    """ Function to determine the fabric of item """
    fabric_re = r'\b(broadcloth|brocade|calico|cashmere|chambray|chiffon|corduroy|cotton|eyelet|faille|foulard|furbelow|fustian|gingham|grosgrain|jacquard|knit\
                    |linen|lisle|madras|merino|paisley|sateen|satin|\
                    seersucker|shetland|silk|taffeta|tulle|velvet|polyester|rayon)\b'
    val = []
    txt = str(txt)
    if re.findall(fabric_re, txt, re.IGNORECASE ):
        val = re.findall(fabric_re, txt, re.IGNORECASE )
    return val

In [57]:
product_copy['fabric_list'] = product_copy['description'].apply(findFabric) +product_copy['name'].apply(findFabric)              

product_copy['fabrics'] = product_copy['fabric_list'].apply(lambda x: set(y.lower() for y in x))


In [58]:
product_copy['n_fabric'] = product_copy['fabrics'].apply(len)

In [59]:

# tag color labels
product_copy['final_fabric'] = np.nan
# label items with more than one color as "Multi"
product_copy.loc[product_copy['n_fabric']>1,'final_fabric'] = 'multi'
# label items with one color
product_copy.loc[product_copy['n_fabric']==1,'final_fabric'] = \
       product_copy.loc[product_copy['n_fabric']==1,'fabrics'].apply(lambda x:list(x)[0])
  

In [60]:
product_copy['final_fabric'].value_counts()

cotton       10752
multi         5085
polyester     1922
silk          1643
linen          881
cashmere       871
chiffon        739
rayon          634
satin          379
merino         287
velvet         274
chambray       256
jacquard       206
grosgrain      129
eyelet         102
corduroy        98
gingham         67
sateen          49
tulle           30
paisley         29
taffeta         21
foulard          9
madras           3
brocade          3
calico           2
faille           1
lisle            1
Name: final_fabric, dtype: int64

In [61]:
product_copy.columns

Index(['product_id', 'brand', 'brand_category', 'name', 'details',
       'created_at', 'brand_canonical_url', 'description', 'brand_description',
       'brand_name', 'product_active', 'is_womens_clothing',
       'is_children_clothing', 'is_men_clothing', 'user_type',
       'one_piece_check', 'shoe_check', 'handbag_check', 'scarf_check',
       'top_check', 'acc_check', 'linen_check', 'bottom_check',
       'lingerie_check', 'max_value_cat', 'final_category', 'color_list',
       'colors', 'n_colors', 'final_color', 'care_list', 'wash_type',
       'is_dry_clean', 'is_hand_wash', 'is_machine_wash_cold',
       'is_machine_wash', 'is_tumble_dry', 'is_not_bleach', 'fabric_list',
       'fabrics', 'n_fabric', 'final_fabric'],
      dtype='object')

## Made In

In this section, we try to identify the location the product is madufactured in.

Created a feature that includes the city/country the item is made in. Countries such as Italy, Ghana, India, and Spain were included because we saw a few items that had associations with those countries.

In [62]:
location_list_clean = []
for rows in range(0,len(product_copy)):
    #check if content in description column, if not (aka null), there's usually content in details column
    if pd.isnull(product_copy.loc[rows,'description']):
        if pd.isnull(product_copy.loc[rows,'details']):
            location_list_clean.append('None')
            continue
        else:
            #append entity to list and operate on the details column, will be used to slap it onto main dataframe later
            if len(re.findall(r'\b(USA|United States)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('USA')
            elif len(re.findall(r'\b(Italy)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Italy')
            elif len(re.findall(r'\b(New York|NewYork|new_york_city|NewYorkCity|NY|NYC|N\.Y\.)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('New York')
            elif len(re.findall(r'\b(Los Angeles|LA|LosAngeles|L\.A\.)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('LA')
            elif len(re.findall(r'\b(Ghana)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Ghana')
            elif len(re.findall(r'\b(China)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('China')
            elif len(re.findall(r'\b(India)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('India')
            elif len(re.findall(r'\b(Spain)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Spain')
            elif len(re.findall(r'\b(France)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('France')
            elif len(re.findall(r'\b(London|UK|U\.K\.)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('London/UK')
            elif len(re.findall(r'\b(Japan|Tokyo)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Japan')
            else:
                location_list_clean.append('None')
    else:
        #do the same thing as above if content is in description column
            if len(re.findall(r'\b(USA|United States)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('USA')
            elif len(re.findall(r'\b(Italy)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Italy')
            elif len(re.findall(r'\b(New York|NewYork|new_york_city|NewYorkCity|NY|NYC|N\.Y\.)\b',str(product.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('New York')
            elif len(re.findall(r'\b(Los Angeles|LA|LosAngeles|L\.A\.)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('LA')
            elif len(re.findall(r'\b(Ghana)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Ghana')
            elif len(re.findall(r'\b(China)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('China')
            elif len(re.findall(r'\b(India)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('India')
            elif len(re.findall(r'\b(Spain)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Spain')
            elif len(re.findall(r'\b(France)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('France')
            elif len(re.findall(r'\b(London|UK|U\.K\.)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('London/UK')
            elif len(re.findall(r'\b(Japan|Tokyo)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                location_list_clean.append('Japan')
            else:
                location_list_clean.append('None')

## Occasions

Occasions could be an interesting feature, since brands might have specialties and/or some brands are very occasion specific (ex. yoga attire only, swimwear only, etc). We wanted to capture all four seasons, while further segmenting Work, Beach, Swimwear, Sports, and Yoga.

In [63]:
occasion_list_cleaned = []
for rows in range(0,len(product_copy)):
    #check if content in description column, if not (aka null), there's usually content in details column
    if pd.isnull(product_copy.loc[rows,'description']):
        if pd.isnull(product_copy.loc[rows,'details']):
            occasion_list_cleaned.append('None')
            continue
        else:
            #append entity to list and operate on the details column, will be used to slap it onto main dataframe later
            if len(re.findall(r'\b(summer|summertime|sun|sunny|heat|June|July|August)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Summer_Season')
            elif len(re.findall(r'\b(work)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Work')
            elif len(re.findall(r'\b(beach|beachy|sand|sandy)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Beach')
            elif len(re.findall(r'\b(fall|autumn|thanksgiving|halloween|September|October|November)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Fall_Season')
            elif len(re.findall(r'\b(spring|March|April|Easter)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Spring_Season')
            elif len(re.findall(r'\b(winter|cold|New Year|Christmas|snow|December|January|February)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Winter_Season')                
            elif len(re.findall(r'\b(swim|water|wet|pool|swimwear)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Swimwear')    
            elif len(re.findall(r'\b(sports|golf|tennis|marathon|basketball|soccer|cycling|hiking|climbing|running|sport|sportswear|hike|climb|run|workout)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Sports') 
            elif len(re.findall(r'\b(yoga|studio|yogi|poses)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Yoga')
            else:
                occasion_list_cleaned.append('None')
    else:
        #do the same thing as above if content is in description column
            if len(re.findall(r'\b(summer|summertime|sun|sunny|heat|June|July|August)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Summer_Season')
            elif len(re.findall(r'\b(work)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Work')
            elif len(re.findall(r'\b(beach|beachy|sand|sandy)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Beach')
            elif len(re.findall(r'\b(fall|autumn|thanksgiving|halloween|September|October|November)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Fall_Season')
            elif len(re.findall(r'\b(spring|March|April|Easter)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Spring_Season')
            elif len(re.findall(r'\b(winter|cold|New Year|Christmas|snow|December|January|February)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Winter_Season')                
            elif len(re.findall(r'\b(swim|water|wet|pool|swimwear)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Swimwear')    
            elif len(re.findall(r'\b(sports|golf|tennis|marathon|basketball|soccer|cycling|hiking|climbing|running|sport|sportswear|hike|climb|run|workout)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Sports') 
            elif len(re.findall(r'\b(yoga|studio|yogi|poses)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >= 1:
                occasion_list_cleaned.append('Yoga')
            else:
                occasion_list_cleaned.append('None')

## Deadstock

We noticed some brands used deadstock fabrics/materials to create items, so we wanted to see if this could help predict the brand better.

In [64]:
deadstock_list = []
for rows in range(0,len(product_copy)):
    #check if content in description column, if not (aka null), there's usually content in details column
    if pd.isnull(product_copy.loc[rows,'description']):
        if pd.isnull(product_copy.loc[rows,'details']):
            deadstock_list.append(0)
            continue
        else:
            #append entity to list and operate on the details column, will be used to slap it onto main dataframe later
            if len(re.findall(r'\b(deadstock)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >=1:
                deadstock_list.append(1)
            else:
                deadstock_list.append(0)

    else:
        #do the same thing as above if content is in description column
            if len(re.findall(r'\b(deadstock)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >=1:
                deadstock_list.append(1)
            else:
                deadstock_list.append(0)

## Handcrafted

Similar to the logic of trying the deadstock feature, we noticed some brands handcrafted their items, so we wanted to see if this could help predict the brand better.

In [65]:
handcrafted_list = []
for rows in range(0,len(product_copy)):
    #check if content in description column, if not (aka null), there's usually content in details column
    if pd.isnull(product_copy.loc[rows,'description']):
        if pd.isnull(product_copy.loc[rows,'details']):
            handcrafted_list.append(0)
            continue
        else:
            #append entity to list and operate on the details column, will be used to slap it onto main dataframe later
            if len(re.findall(r'\b(handcrafted|hand-crafted)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >=1:
                handcrafted_list.append(1)
            else:
                handcrafted_list.append(0)

    else:
        #do the same thing as above if content is in description column
            if len(re.findall(r'\b(handcrafted|hand-crafted)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >=1:
                handcrafted_list.append(1)
            else:
                handcrafted_list.append(0)

## True to size

Some brands had many items that fit true to size, so we wanted to create a feature from this as well.

In [66]:
truetosize_list = []
for rows in range(0,len(product_copy)):
    #check if content in description column, if not (aka null), there's usually content in details column
    if pd.isnull(product_copy.loc[rows,'description']):
        if pd.isnull(product_copy.loc[rows,'details']):
            truetosize_list.append(0)
            continue
        else:
            #append entity to list and operate on the details column, will be used to slap it onto main dataframe later
            if len(re.findall(r'\b(true to size)\b',str(product_copy.loc[rows,'details']),flags=re.IGNORECASE)) >=1:
                truetosize_list.append(1)
            else:
                truetosize_list.append(0)

    else:
        #do the same thing as above if content is in description column
            if len(re.findall(r'\b(true to size)\b',str(product_copy.loc[rows,'description']),flags=re.IGNORECASE)) >=1:
                truetosize_list.append(1)
            else:
                truetosize_list.append(0)

In [67]:
product_copy['Made_In'] = location_list_clean
product_copy['Occasion'] = occasion_list_cleaned
product_copy['Deadstock'] = deadstock_list
product_copy['Handcrafted'] = handcrafted_list
product_copy['True_To_Size'] = truetosize_list

In [68]:
product_copy['True_To_Size'].value_counts()

0    55433
1     5922
Name: True_To_Size, dtype: int64

# Stopwords

At this stage, we remove stopwords before lemmatization.

In [69]:
import gensim
from gensim.parsing.preprocessing import remove_stopwords
stopwords = gensim.parsing.preprocessing.STOPWORDS
print(stopwords)

frozenset({'least', 'last', 'between', 'our', 'empty', 'her', 'of', 'de', 'former', 'get', 'whoever', 'per', 'cry', 'myself', 'therein', 'quite', 'namely', 'found', 'less', 'as', 'it', 'anyhow', 'do', 'beyond', 'have', 'enough', 'below', 'out', 'really', 'anything', 'onto', 'indeed', 'whatever', 'none', 'an', 'together', 'front', 'thru', 'we', 'five', 'has', 'hereby', 'made', 'co', 'to', 'you', 'be', 'throughout', 'well', 'here', 'sixty', 'most', 'or', 'what', 'perhaps', 'each', 'nine', 'sincere', 'sometime', 'thus', 'thereby', 'wherein', 'wherever', 'four', 'us', 'his', 'while', 'must', 'seems', 'couldnt', 'eight', 'there', 'every', 'third', 'among', 'ltd', 'does', 'him', 'any', 'fire', 'now', 'over', 'someone', 'computer', 'serious', 'toward', 'anywhere', 'before', 'sometimes', 'are', 'everything', 'becomes', 'were', 'few', 'how', 'from', 'where', 'full', 'he', 'thereupon', 'their', 'hasnt', 'all', 'should', 'keep', 'into', 'due', 'which', 'and', 'whether', 'whereafter', 'thick', 'bo



In [70]:
stopwords_mod = ['moreover', 'only', 'eight', 'otherwise', 'unless', 'done', 'as', 'somehow',
 'off', 'three', 'do', 'become', 'nothing', 'a', 'top', 'describe', 'not',
 'although', 'co', 'if', 'mostly', 'such', 'third', 'myself', 'sometime',
 'because', 'yours', 'within', 'noone', 'former', 'through', 'seeming',
 'further', 'fifteen', 'had', 'inc', 'into', 'is', 'who', 'amount', 'during', 'per', 'doing',
 'for', 'neither', 'an', 'yourself', 'under', 'still', 'doesn', 'this', 'name', 'rather', 'it',
 'whence', 'toward', 'various', 'somewhere', 'the', 'hasnt', 'few', 'thereupon', 'alone', 'all', 'own', 'yet',
 'well', 'ourselves', 'anywhere', 'with', 'many', 'themselves', 'until',
 'side', 'move', 'from', 'its', 'her', 'upon', 'here', 'don', 'above',
 'wherein', 'their', 'becomes', 'thus', 'up', 'either', 'another', 'can', 'beforehand', 'twelve', 'ours', 'call',
 'hereafter', 'me', 'part', 'less', 'between', 'other', 'de', 'kg', 'nowhere', 'at', 'without', 'among', 'thin', 'anyway', 'towards', 'using',
 'see', 'what', 'serious', 'whether', 'perhaps', 'thereafter', 'or', 'put', 'thick', 'sixty', 'five', 'i', 'how', 'even', 'one', 'didn', 'below', 'which', 'first', 'them', 'hundred', 'hereupon', 'mill', 'been', 'besides', 'to', 'amongst', 'make', 'however', 'just', 'must', 'both', 'each', 'any', 'again', 'are', 'everyone', 'herein', 'bill', 'then', 'get', 'fifty',
 'anyone', 'whereby', 'so', 'un', 'became', 'nor', 'were', 'used', 'whereupon',
 'show', 'give', 'seems', 'but', 'always', 'against', 'him', 'wherever', 'made',
 'some', 'last', 'along', 'computer', 'anyhow', 'cry', 'about', 'that', 'on', 'due',
 'meanwhile', 'his', 'our', 'when', 'these', 'and', 'several', 'formerly', 'since',
 'whole', 'am', 'eleven', 'once', 'whoever', 'eg', 'please', 'amoungst', 'least',
 'hence', 'us', 'ie', 'go', 'ever', 'every', 'none', 'others', 'of', 'fire', 'whenever', 'too', 'indeed',
 'already', 'by', 'becoming', 'whose', 'something', 'yourselves', 're', 'around', 'nine', 'via', 'where', 'forty', 'hereby', 'everything', 'sometimes', 'system', 'might', 'no', 'across', 'could', 'very', 'more', 'behind', 'afterwards', 'whereas', 'twenty', 'while', 'out', 'ten', 'latterly', 'namely', 'be', 'should', 'thereby', 'mine', 'whom', 'fill', 'two', 'beyond', 'take', 'my', 'else', 'throughout', 'would', 'thence', 'say', 'will', 'down', 'does', 'together', 'though', 'next', 'also', 'we', 'back', 'cannot', 'sincere', 'most', 'seemed', 'therein', 'she', 'being',
 'latter', 'they', 'seem', 'did', 'detail', 'whatever', 'someone', 'himself', 'regarding', 'nobody', 'six', 'bottom', 'elsewhere', 'find', 'etc', 'couldnt', 'your', 'interest', 'has', 'ltd', 'therefore', 'thru', 'four', 'km', 'anything', 'quite', 'now', 'everywhere', 'those', 'con', 'much', 'you', 'than', 'same', 'keep', 'full', 'cant', 'beside', 'herself', 'except', 'itself', 'after', 'may', 'before', 'often', 'in', 'almost', 'nevertheless', 'why', 'have', 'front', 'enough', 'whereafter', 'there', 'whither', 'he', 'found', 'really', 'hers', 'no','in',
 'empty', 'never', 'was', 'onto', 'over']

We added some customised stopwords. Almost all the products contains information about size.

    For instance: 'waist:60, bust:80, hip:90, size:S'. 

This is not useful to differenciate brands. 

Thus, we need to remove them. Apart from this, we used the English stopwords from gensim. 

In [71]:
custom_sw = ['size', 'fit','height', 'wide',
             'waist','bust' 'hip','measurement',
             'model','wear','small','medium','large',
            'sizing','high','small','long','cm','easy']
stopwords_list = stopwords_mod + custom_sw


In [72]:
# define a function to remove stopwords
def remove_sw(text):
    text = str(text)
    # split sentence into words
    words = word_tokenize(text)
    
    new_words = []
    # remove stopwords
    for w in words:
        if w in stopwords_list:
            continue
        new_words.append(w)
    
    return ' '.join(new_words)

In [73]:
%%time

product_copy['description'] = pd.DataFrame(product_copy['description'].apply(remove_sw))

CPU times: user 57.3 s, sys: 1.32 s, total: 58.6 s
Wall time: 1min 7s


In [68]:
# remove other stopwords pattern
pattern=r'\b(mm|(x)?(x)?s|wear(s|ing)?|measurements?|x|new(est)?|detail(s|ed)?|)\b'
product_copy.description=product_copy.description.str.replace(pattern,'')


  product_copy.description=product_copy.description.str.replace(pattern,'')


In [69]:
## check the result
product_copy.description[0]

'signature khadi shirt available black white beach city promise goto warm weather item perfect blazer hand loomed woven stripe khadi cotton slightly sheer gets softer wash ships week april color white black length 27 width 265 fits grid khadi cotton'

In [74]:
product.loc[4,'description']

nan

In [76]:
product.loc[4,'description']='Blank'
product.loc[4,'description']

'Blank'

# Lemma & Stemming


## Stemming

We considered stemming the description field but we got better results from the lemmatizer. The stemming code is available below for reference.


```python
stemmer = PorterStemmer()
def stemming_sentence(sentence):
    sentence = str(sentence)
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    stemming_sentence = []
    for word, tag in nltk_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            stemming_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            stemming_sentence.append(stemmer.stem(word))
    return stemming_sentence

%%time
stemmed_descriptions  = []
for des in product_copy.description:
    stemmed_description = stemming_sentence(des)
    stemmed_descriptions.append(stemmed_description)
    
stemmed_descriptions_join = []
for doc in tqdm(stemmed_descriptions):
    new_doc = ' '.join(doc)
    stemmed_descriptions_join.append(new_doc)
    
product_copy['stemmed_description'] = stemmed_descriptions_join 
```

## Lemmatization with wordnet tag

In [70]:
# https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258
def lemmatize_sentence(sentence):
    sentence = str(sentence)
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

In [71]:
## don't use list comprehension, save memory 
lemmatized_descriptions = []
for des in product_copy.description:
    lemmatized_description =  lemmatize_sentence(des)
    lemmatized_descriptions.append(lemmatized_description)
    



In [72]:
# the lemmatized_description is a list of list with all the words separated 
# so we need to join them together as a complete sentence
lemmatized_description_join = []
for doc in tqdm(lemmatized_descriptions):
    new_doc = ' '.join(doc)
    lemmatized_description_join.append(new_doc)

100%|██████████| 61355/61355 [00:00<00:00, 190429.39it/s]


In [73]:
product_copy['lemmatized_description'] = lemmatized_description_join

In [74]:
product_copy.head(2)

Unnamed: 0,product_id,brand,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,...,fabric_list,fabrics,n_fabric,final_fabric,Made_In,Occasion,Deadstock,Handcrafted,True_To_Size,lemmatized_description
0,01EX0PN4J9WRNZH5F93YEX6QAF,Two,Unknown,Khadi Stripe Shirt-our signature shirt,,2021-01-27 01:17:19.305 UTC,https://two-nyc.myshopify.com/products/white-k...,signature khadi shirt available black white be...,Our signature khadi shirt\n\navailable in blac...,Khadi Stripe Shirt-our signature shirt,...,"[cotton, cotton]",{cotton},1,cotton,,Beach,0,0,0,signature khadi shirt available black white be...
1,01F0C4SKZV6YXS3265JMC39NXW,Collina Strada,Unknown,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,,2021-03-09 18:43:10.457 UTC,https://collina-strada-2.myshopify.com/product...,midlength dress ruffles adjustable straps bias...,Mid-length dress with ruffles and adjustable s...,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,...,[],{},0,,New York,,0,0,0,midlength dress ruffle adjustable strap bias c...


In [75]:
#Filling in the fields in the final dataframe
product_copy['Made_In'] = location_list_clean
product_copy['Occasion'] = occasion_list_cleaned
product_copy['Deadstock'] = deadstock_list
product_copy['Handcrafted'] = handcrafted_list
product_copy['True_To_Size'] = truetosize_list

# Brand Encoding - the top brands

Identifying the top 30 brands for classification.

In [76]:
brand_analysis=pd.DataFrame(product_copy['brand'].astype(str).value_counts())
Brand_to_classify=list(brand_analysis.head(30).index)

In [77]:
Brand_to_classify

['7 For All Mankind',
 'Rails',
 'Intentionally Blank',
 'A.L.C.',
 'Rachel Comey',
 'Misa',
 'Studio 189',
 'ASTR the Label',
 'lemlem',
 'Simon Miller',
 'Cynthia Rowley',
 'Outerknown',
 'Chufy',
 'Faherty',
 'M.M.LaFleur',
 'Janessa Leone',
 'Araks',
 'Sea',
 'BROCHU WALKER',
 'Tanya Taylor',
 'Clare V.',
 'Nili Lotan',
 'Les Girls Les Boys',
 'Prism',
 'Sandy Liang',
 '6397',
 'Ancient Greek Sandals',
 'Alo Yoga',
 'Collina Strada',
 'Whit']

In [78]:
def brand_exclusion(row):
    if str(row) in Brand_to_classify:
        return str(row)
    else:
        return 'Other'
product_copy['label']=product_copy['brand'].apply(brand_exclusion)
product_copy['label'].astype(str).value_counts()

Other                    15589
7 For All Mankind         9011
Rails                     2864
Intentionally Blank       2534
A.L.C.                    2092
Rachel Comey              2081
Misa                      2030
Studio 189                1956
ASTR the Label            1942
lemlem                    1821
Simon Miller              1451
Cynthia Rowley            1347
Outerknown                1338
Chufy                     1209
Faherty                   1204
M.M.LaFleur               1192
Janessa Leone             1119
Araks                     1081
Sea                       1053
BROCHU WALKER             1001
Tanya Taylor               991
Clare V.                   922
Nili Lotan                 717
Les Girls Les Boys         695
Prism                      667
Sandy Liang                663
6397                       652
Ancient Greek Sandals      618
Alo Yoga                   525
Collina Strada             501
Whit                       489
Name: label, dtype: int64

In [79]:
# save data in pickle files
import pickle
# final_cleaned_data.pkl is the cleaned raw data with manully created features and brand lables
with open('final_cleaned_data.pkl', 'wb') as f:
    pickle.dump(product_copy, f)

In [80]:
# Creating a csv file of the cleaned data
product_copy.to_csv('final_cleaned_data.csv')

This cleaned file will be accessed in the 2_Prepare_Training_data notebook.

In [6]:
product_copy[['is_dry_clean', 'is_hand_wash', 'is_machine_wash_cold',
           'is_machine_wash', 'is_tumble_dry', 'is_not_bleach','Made_In','final_fabric','Handcrafted','True_To_Size']].astype(str).describe()

NameError: name 'product_copy' is not defined