# 02 - Ravelry API - Data Cleaning
___

In [2]:
import pandas as pd

## Contents
---
* [Read In Data - Drop and Modify](#Read-In-Data---Drop-and-Modify)
* [Null Cleaning](#Null-Cleaning)

## Read In Data - Drop and Modify
___

Read in the individual garment datasets and combine them into one dataframe.  View the data and check for nulls.

In [3]:
#Read in all of the different garment datasets - 2000 observations per csv.

hats = pd.read_csv('../data/beanie-toque_details.csv')
socks = pd.read_csv('../data/mid-calf_details.csv')
pullovers = pd.read_csv('../data/pullover_details.csv')

In [4]:
# I want to add the type of garment as a column in each dataframe.
hats['type'] = 'hat'
socks['type'] = 'socks'
pullovers['type'] = 'pullover'

In [5]:
# Some of the yarn weights in the garment dataframes have an unusual combined value.
# Change each one to the most similar standard weight.
hats['yarn_weight'] = hats['yarn_weight'].replace({'Aran / Worsted': 'Worsted'})
socks['yarn_weight'] = socks['yarn_weight'].replace({'Aran / Worsted': 'Worsted'})
pullovers['yarn_weight'] = pullovers['yarn_weight'].replace({'Aran / Worsted': 'Worsted', 'DK / Sport': 'DK'})

In [6]:
# Drop the 'cobweb' and 'thread' yarn types from the pullovers dataframe.
# Only one observation of each and not seen in other garment dataframes.
rows_to_drop = pullovers[pullovers['yarn_weight'].isin(['Thread', 'Cobweb'])].index
pullovers = pullovers.drop(rows_to_drop)

In [7]:
#Combine dataframes
rav_df = pd.concat([hats, socks, pullovers], ignore_index = True)

Dropping the sizes_available column.  There is no standardization in how it is populated; it can be whatever the designer enters.  I am assuming that most patterns are written for most sizes, or that the size is specified in the name column (e.g. Baby Hat), so I don't believe the column would help in creating a better model.

The downside is the assumption that anyone using this recommender falls within an average size range.  Ravelry could benefit from standardizing size ranges by using either measurements or categories exclusively.  However, this is difficult because different garments are sized and fit differently.

Also dropping the gauge_pattern column.  This value is also set by the designer and has no standardization.  Most patterns assume gauge is measured in stockinette stitch.  After filling nulls with 'stockinette' and viewing the data, 76% of all patterns have 'stockinette' as their gauge_pattern, so I don't believe it would be helpful in pattern recommendation.
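
The share check described above can be sketched on a small made-up dataframe.  The values below are toy data, not the Ravelry data, so the resulting share is only illustrative; `toy_df` and `stockinette_share` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the combined dataframe; these values are made up.
toy_df = pd.DataFrame({
    'gauge_pattern': ['stockinette', None, 'garter stitch', 'stockinette', None]
})

# Fill nulls with the assumed default, then compute the share of each value.
filled = toy_df['gauge_pattern'].fillna('stockinette')
pattern_share = filled.value_counts(normalize=True)

# Share of rows whose gauge_pattern is 'stockinette' after filling.
stockinette_share = pattern_share['stockinette']
print(pattern_share)
```

A column where one value dominates this heavily carries little information for telling patterns apart, which is the reasoning behind dropping it.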

In [8]:
rav_df.drop(columns = ['sizes_available', 'gauge_pattern'], inplace = True)

## Null Cleaning
___

|Function|Argument|Purpose|
|---|---|---|
|**null_pipeline**|*df* - dataframe to clean|Combines all helper functions and removes all null values from dataframe|
|**gauge_calculator**|*row* - dataframe row|Fills in the number of stitches per four inches depending on yarn weight|
|**yardage_calculator**|*row* - dataframe row|Fills in the average max yardage depending on yarn weight and garment type|

#### Function Details
___

**gauge_calculator** -- Gauge is typically described as the number of knitting stitches over a given width.  For example, a pattern with 20 in the gauge column and 4 in the gauge_divisor column means '20 stitches per 4 inches'.  This equates to 5 stitches per inch, which is indicative of a worsted weight yarn.  The gauge_calculator function is designed to fill nulls in the gauge column based on the observation's yarn weight.  Yarn weight information is referenced from the [Craft Yarn Council](https://www.craftyarncouncil.com/standards/yarn-weight-system).

**yardage_calculator** -- The yardage and gauge functions were originally 'if' and 'elif' statements for each conditional, but they created more nulls than they filled.  I could not figure out why and eventually copied my code into ChatGPT.  It recommended using a dictionary, which looks cleaner but wasn't in the spirit of my original code.  However, I realized that I would need different max_yardage values for each garment.  A sweater knit in worsted weight yarn takes more yarn to make than a hat.  Instead of making three different functions, I thought this would be a good case for a dictionary of dictionaries.  I modified ChatGPT's code into a nested dictionary so that I need just one function to fill max_yardage values for all of my garment types.  That also meant that I could concatenate all garment type datasets before cleaning.

**null_pipeline** -- This function combines all other functions, as well as a '.fillna' line of code that removes all null values in the dataset.  

The 'notes' column holds optional alphanumeric information that the designer can include with the pattern.  They also have the option of adding nothing.  Will replace nulls in the notes column with 'notes not provided'.  Nulls in the 'price' column mean that the pattern is available to download for free.  Will replace nulls in the price column with 0 to maintain the float datatype.  I will also round the prices to the nearest whole dollar amount.  The 'gauge_divisor' column will default to 4.0, as most patterns are written as x stitches per 4 inches when referring to gauge.  The filled-in gauge values also assume four inches.

In [9]:
def gauge_calculator(row):
    '''
    Meant to be applied row-wise to a garment dataframe to fill nulls in the
    gauge column.  If the row's gauge is null, returns the typical 4in gauge
    number associated with its yarn weight; otherwise returns the existing gauge.
    '''
    
    if pd.isnull(row['gauge']):
        yarn_weight_gauge = {
            'Worsted': 20,
            'DK': 22,
            'Bulky': 16,
            'Aran': 20,
            'Super Bulky': 12,
            'Fingering': 32,
            'Sport': 24,
            'Any gauge': 20,  # Assuming the most popular yarn type - Worsted
            'Unavailable': 20,  # Assuming worsted yarn for the reason stated above
            'Light Fingering': 32,
            'Jumbo': 5,
            'Lace': 40
        }
        
        return yarn_weight_gauge.get(row['yarn_weight'])
    else:
        return row['gauge']
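
As an aside, the same fill can be sketched without a row-wise `.apply`, using `Series.map` and `Series.fillna`.  This is only an illustration on made-up data with a subset of the mapping above, not the approach used in this notebook; `toy_df` is a hypothetical name.

```python
import pandas as pd

# Subset of the yarn weight -> stitches-per-4-inches mapping from gauge_calculator.
yarn_weight_gauge = {'Worsted': 20, 'DK': 22, 'Fingering': 32}

# Toy data: one known gauge, two nulls with mapped weights, one null with an unmapped weight.
toy_df = pd.DataFrame({
    'yarn_weight': ['Worsted', 'DK', 'Fingering', 'Thread'],
    'gauge': [18.0, None, None, None],
})

# map() looks up each yarn weight; fillna() only uses the lookup where gauge is null.
toy_df['gauge'] = toy_df['gauge'].fillna(toy_df['yarn_weight'].map(yarn_weight_gauge))
print(toy_df)
```

Note that a yarn weight missing from the dictionary stays null here, just as `.get` returns `None` in the function above, so every yarn weight present in the data needs an entry.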

In [10]:
def yardage_calculator(row):

    '''
    Meant to be applied row-wise to a garment dataframe.  Argument is 'row'.
    This function is meant to fill nulls in the max_yardage column.  If the row's max_yardage is null,
    returns the average max_yardage associated with its yarn weight and garment type.  Average max_yardage
    is based on the average of all available max_yardage values in the dataset.
    '''
    yarn_weight_yardage = {
        'Worsted': {'hat': 213.0, 'pullover': 1781.0, 'socks': 326.0},
        'DK': {'hat': 253.0, 'pullover': 1840.0, 'socks': 386.0},
        'Bulky': {'hat': 142.0, 'pullover': 1408.0, 'socks': 329.0},
        'Aran': {'hat': 188.0, 'pullover': 1552.0, 'socks': 399.0},
        'Super Bulky': {'hat': 99.0, 'pullover': 1077.0, 'socks': 187.0},
        'Fingering': {'hat': 301.0, 'pullover': 2137.0, 'socks': 429.0},
        'Sport': {'hat': 274.0, 'pullover': 2129.0, 'socks': 430.0},
        'Any gauge': {'hat': 259.0, 'pullover': 2869.0, 'socks': 450.0},
        'Unavailable': {'hat': 225.0, 'pullover': 2309.0, 'socks': 331.0},
        'Light Fingering': {'hat': 345.0, 'pullover': 2118.0, 'socks': 452.0},
        'Jumbo': {'hat': 124.0, 'pullover': 1752.0, 'socks': 264.0},
        'Lace': {'hat': 250.0, 'pullover': 1844.0, 'socks': 610.0}
    }
    if pd.isnull(row['max_yardage']):

        return yarn_weight_yardage.get(row['yarn_weight'], {}).get(row['type'])
    else:
        return row['max_yardage']
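
The averages in the nested dictionary were entered by hand.  One way such a lookup could be derived is sketched below, assuming a `groupby` on yarn weight and garment type; the dataframe is made-up toy data, and `yardage_lookup` is a hypothetical name, so this is not necessarily how the numbers above were produced.

```python
import pandas as pd

# Toy data: a few known yardages and one null.
toy_df = pd.DataFrame({
    'yarn_weight': ['Worsted', 'Worsted', 'Worsted', 'DK', 'DK'],
    'type':        ['hat', 'hat', 'pullover', 'hat', 'hat'],
    'max_yardage': [200.0, 220.0, 1800.0, 250.0, None],
})

# mean() skips nulls by default, so only known yardages contribute.
group_means = toy_df.groupby(['yarn_weight', 'type'])['max_yardage'].mean().round()

# Reshape the (yarn_weight, type) -> mean series into a dictionary of dictionaries.
yardage_lookup = {}
for (weight, garment), mean_yardage in group_means.items():
    yardage_lookup.setdefault(weight, {})[garment] = mean_yardage

# Same chained lookup as in yardage_calculator; a missing combination returns None.
missing = yardage_lookup.get('DK', {}).get('socks')
print(yardage_lookup)
```

The chained `.get(..., {}).get(...)` keeps the lookup from raising a `KeyError`, but it also means a missing yarn weight or garment type silently leaves the value null.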


In [11]:
def null_pipeline(df):

    # Custom functions
    df['max_yardage'] = df.apply(yardage_calculator, axis = 1)
    df['gauge'] = df.apply(gauge_calculator, axis = 1)
    
    # Fill the remaining nulls with fixed defaults by passing a dictionary to .fillna.
    # Gauge is usually described as x stitches per four inches, so null gauge_divisor values
    # are filled with 4.0.  This matches the gauge_calculator function, which fills in the number
    # of stitches per four inches depending on yarn weight.
    
    values = {'gauge_divisor': 4.0, 'notes': 'notes not provided', 'price': 0}
    df.fillna(values, inplace = True)
    
    return df

In [12]:
rav_clean_df = null_pipeline(rav_df)
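
A quick way to confirm that a pipeline like this left no nulls behind is to count them per column afterwards.  The sketch below uses a made-up dataframe and the same `.fillna` dictionary as above; one gauge value is left null on purpose to show what the check reports.  `toy_df` and `remaining` are hypothetical names.

```python
import pandas as pd

# Toy data with nulls in each column handled by the .fillna dictionary,
# plus one null gauge that the dictionary does not cover.
toy_df = pd.DataFrame({
    'gauge_divisor': [4.0, None],
    'notes': [None, 'A cozy hat'],
    'price': [None, 6.0],
    'gauge': [20.0, None],
})

toy_df = toy_df.fillna({'gauge_divisor': 4.0, 'notes': 'notes not provided', 'price': 0})

# Count nulls per column and keep only the columns that still have some.
null_counts = toy_df.isnull().sum()
remaining = null_counts[null_counts > 0]
print(remaining)
```

On the real data, an empty `remaining` would mean every yarn weight and garment type combination was covered by the lookup dictionaries.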

One more thing - normalize the gauge.  Some patterns report gauge over a width other than four inches.  I will divide gauge by gauge_divisor and add the result to a 'gauge_per_inch' column.

In [13]:
# One more thing - normalize the gauge
rav_clean_df['gauge_per_inch'] = rav_clean_df['gauge'] / rav_clean_df['gauge_divisor']

In [14]:
rav_clean_df.head(10)

|Unnamed: 0|id|name|author|difficulty_avg|gauge|gauge_divisor|max_yardage|notes|price|projects_count|queued_projects_count|rating_avg|yarn_weight|permalink|type|gauge_per_inch|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|990044|Musselburgh|Ysolda Teague|2.46|6.0|1.0|610.0|>Our favourite swatchless hat pattern\r\n> jus...|6.0|23932|7716|4.89|Fingering|musselburgh|hat|6.0|
|1|899479|Classic Ribbed Hat|Purl Soho|1.92|32.0|4.0|305.0|MATERIALS\r\nPurl Soho’s [Cashmere Merino Bloo...|0.0|10383|5374|4.83|DK|classic-ribbed-hat-5|hat|8.0|
|2|1353734|Alpine Bloom Hat|Caitlin Hunter|3.37|24.0|4.0|230.0|The Alpine Bloom hat is designed as a companio...|5.0|1553|2400|4.84|Sport|alpine-bloom-hat|hat|6.0|
|3|528611|Classic Cuffed Hat|Purl Soho|1.87|20.0|4.0|328.0|MATERIALS\r\n\r\n- Hat with Pom Pom: 1 (2, 2) ...|0.0|9106|5037|4.7|Worsted|classic-cuffed-hat|hat|5.0|
|4|7340640|The Traveler Hat|Andrea Mowry|2.18|20.0|4.0|350.0|[Do you enjoy Andrea’s patterns? Sign up for t...|7.0|838|312|4.83|Worsted|the-traveler-hat|hat|5.0|
|5|905506|February Hat|Kate Gagnon Osborn|2.62|18.0|4.0|213.0|When thinking about what I wanted to do for my...|0.0|3891|3528|4.75|Worsted|february-hat-3|hat|4.5|
|6|1213476|Manhattan Hat|Tori Yu|2.35|24.0|4.0|285.0|*“an effortless ribbed hat with a touch of big...|7.0|929|640|4.8|Worsted|manhattan-hat-2|hat|6.0|
|7|970741|October Hat|Sloane Rosenthal|3.24|29.0|4.0|188.0|What’s better than a free hat pattern? 12 free...|0.0|2808|2822|4.83|Aran|october-hat-3|hat|7.25|
|8|1285536|My Baker's Hat|Emily Russell|1.79|20.0|4.0|150.0|This hat was created for my boyfriend to keep ...|0.0|813|547|4.88|Worsted|my-bakers-hat|hat|5.0|
|9|44131|Berry Baby Hat|Michele Sabatier|2.01|20.0|4.0|100.0|This is a very popular hat as a newborn gift, ...|0.0|10766|7191|4.62|Worsted|berry-baby-hat|hat|5.0|


In [64]:
rav_clean_df.to_csv('../data/rav_clean.csv', index = False)

In [15]:
rav_clean_df

|Unnamed: 0|id|name|author|difficulty_avg|gauge|gauge_divisor|max_yardage|notes|price|projects_count|queued_projects_count|rating_avg|yarn_weight|permalink|type|gauge_per_inch|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|990044|Musselburgh|Ysolda Teague|2.46|6.0|1.0|610.0|>Our favourite swatchless hat pattern\r\n> jus...|6.00|23932|7716|4.89|Fingering|musselburgh|hat|6.0|
|1|899479|Classic Ribbed Hat|Purl Soho|1.92|32.0|4.0|305.0|MATERIALS\r\nPurl Soho’s [Cashmere Merino Bloo...|0.00|10383|5374|4.83|DK|classic-ribbed-hat-5|hat|8.0|
|2|1353734|Alpine Bloom Hat|Caitlin Hunter|3.37|24.0|4.0|230.0|The Alpine Bloom hat is designed as a companio...|5.00|1553|2400|4.84|Sport|alpine-bloom-hat|hat|6.0|
|3|528611|Classic Cuffed Hat|Purl Soho|1.87|20.0|4.0|328.0|MATERIALS\r\n\r\n- Hat with Pom Pom: 1 (2, 2) ...|0.00|9106|5037|4.70|Worsted|classic-cuffed-hat|hat|5.0|
|4|7340640|The Traveler Hat|Andrea Mowry|2.18|20.0|4.0|350.0|[Do you enjoy Andrea’s patterns? Sign up for t...|7.00|838|312|4.83|Worsted|the-traveler-hat|hat|5.0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|14993|1159217|Furbelow Pullover|Debi Maige|0.00|26.0|4.0|2622.0|Get the full issue for [$8.99][1].\r\n\r\nThe ...|7.99|4|32|0.00|Fingering|furbelow-pullover|pullover|6.5|
|14994|1163057|Kiely Swoncho|Tamy Gore|5.00|22.0|4.0|2340.0|\r\nA fun architectural design knit in strande...|8.50|11|32|5.00|DK|kiely-swoncho|pullover|5.5|
|14995|1164074|Suttons Bay Sweater|Plucky Knitter Design|2.67|18.0|4.0|1732.0|The Suttons Bay Sweater may just become your f...|8.00|11|47|4.83|Worsted|suttons-bay-sweater|pullover|4.5|
|14996|1168304|Hedera|Stephanie Lotven|2.83|22.0|4.0|3210.0|> **Buy 3, Get 1 FREE! Place 4 of my patterns ...|8.00|11|73|5.00|DK|hedera-4|pullover|5.5|
