# 01 - Ravelry API - Data Collection Script
___

References that helped me create this code can be found at:
* [How to access an API for first-time API users](https://medium.com/data-science-at-microsoft/how-to-access-an-api-for-first-time-api-users-879002f5f58d) by Riesling Walker
* [Ravelry API Documentation](https://www.ravelry.com/api)
* The helpful community in the [Ravelry API](https://www.ravelry.com/groups/ravelry-api) group

In [3]:
#Imports
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
import json


import os
from dotenv import load_dotenv
load_dotenv()

True

## Contents
---
* [API Script](#API-Script)
* [Photo Collection](#Photo-Collection)

## API Script 
___

In [4]:
username = os.getenv('ravname')
password = os.getenv('password')
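
If either environment variable is unset, `os.getenv` returns `None`, and the API calls further down will most likely be rejected as unauthorized without an obvious hint as to why. Below is a minimal sketch of a guard, assuming the same variable names as above (`ravname` and `password`). The helper name `load_credentials` is hypothetical and not part of the original notebook.

```python
import os

def load_credentials(user_var='ravname', pass_var='password'):
    """Return (username, password) from the environment, or raise if either is unset."""
    username = os.getenv(user_var)
    password = os.getenv(pass_var)
    # Collect the names of any variables that are missing or empty
    missing = [name for name, value in ((user_var, username), (pass_var, password)) if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return username, password
```

Calling `username, password = load_credentials()` right after `load_dotenv()` would surface a missing `.env` entry before any request is made.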

|Function|Argument|Purpose|
|---|---|---|
|**unique_pattern_collection**|*str* - pc <br> *int* - total|Function calls the Ravelry API along with specific parameters such as pattern language and craft type. Collects a JSON file of requested information and transforms it into a dataframe. The dataframe contains the pattern ID, pattern name, and its 'medium photo' URL.|
|**detail_collector**|*dataframe* - dataframe|Function accepts a 'dataframe' argument, intended to be the dataframe created by the unique_pattern_collection function. It takes the pattern ID column from the dataframe and calls the Ravelry API for each ID to collect specific details about the pattern, such as price and author. The details, along with the pattern ID, are returned as columns in a dataframe.|
|**data_collection_pipeline**|*str* - garment <br> *int* - total|Combines the two functions above to create a dataframe with pattern details and save it to a CSV file.|

In [17]:
def unique_pattern_collection(pc, total):

    '''
    Takes a string argument 'pc', which is a Ravelry pattern category, and an integer 'total', which is how many patterns to collect.  The Ravelry API is
    called with the specified category along with other pre-determined parameters and returns a JSON response.  The parameters are set to look for knitting patterns in the English
    language that are not discontinued and originate from the United States.  The function requests 150 patterns per page, starts on page one, and iterates through
    pages as needed.  It skips duplicates and records 'total' unique patterns in the 'posts' list.  From the posts list the function pulls the pattern ID, name, and 'medium photo' URL,
    creates a dataframe with these properties, and returns the dataframe.
    '''
    #Parameters used to filter what kind of patterns I want, keep track of posts collected, and unique IDs so duplicates are avoided.
    language = 'en'
    craft = 'knitting'
    availability = '-discontinued'
    photo = 'yes'
    country = 'united-states'
    sort = 'popularity'
    page = 1
    page_size = 150
    posts = []
    unique_ids = set()
    total_unique_patterns = total

    # Call the API with relevant parameters, username, and password - It helped to look at how the URL
    #structure looks on a Ravelry pattern to piece this together.
    while len(unique_ids) < total_unique_patterns:
        url = 'https://api.ravelry.com/patterns/search.json?page_size={}&craft={}&pc={}&availability={}&language={}&photo={}&country={}&sort={}&page={}'.format(page_size, craft, pc, availability, language, photo,country,sort, page)      
        response = requests.get(url, auth=requests.auth.HTTPBasicAuth(username, password))

        # If the call is good, transform the response to a json and get everything in the patterns section where relevant data is located
        if response.status_code == 200:
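            rav_data = response.json()
            patterns = rav_data.get('patterns') or []

            # Stop when a page comes back empty; otherwise the loop never ends
            # if fewer than `total` patterns match the search.
            if not patterns:
                print('No more patterns returned; stopping early.')
                break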

            # Check for unique pattern IDs.  If it's not in the unique ID list yet, that means it hasn't been collected and should be added to the pattern list.
            for pattern in patterns:
                if pattern['id'] not in unique_ids:
                    unique_ids.add(pattern['id'])
                    posts.append(pattern)
                    if len(unique_ids) == total_unique_patterns:
                        break
                        
        # Print status code if API breaks
        else:
            print(f'API request failed with status code {response.status_code}')
            break
            
        # Turn the page to continue the loop
        page += 1

    
    # Collect the relevant fields from each post (names chosen to avoid shadowing the built-in `id`)
    ids = [post['id'] for post in posts]
    names = [post['name'] for post in posts]
    photos = [post['first_photo']['medium_url'] if post.get('first_photo') and post['first_photo'].get('medium_url') else 'No photo' for post in posts]
    
    # Make a dataframe from above variables
    df = pd.DataFrame({'id': ids, 'name': names, 'photo': photos})
        
    # Print how many patterns were collected and return the dataframe
    print(f"Total unique pattern IDs collected: {len(unique_ids)}")

    return df
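
The search URL above is assembled with one long `.format()` call, which makes it easy to swap two positional arguments by accident. As a hedged alternative, the same query string can be built from a dictionary. This is a sketch only: the helper name `build_search_url` is hypothetical, and the parameter values simply mirror the ones hard-coded in `unique_pattern_collection`.

```python
from urllib.parse import urlencode

def build_search_url(pc, page, page_size=150):
    """Build the pattern search URL from a dict of query parameters."""
    params = {
        'page_size': page_size,
        'craft': 'knitting',
        'pc': pc,
        'availability': '-discontinued',
        'language': 'en',
        'photo': 'yes',
        'country': 'united-states',
        'sort': 'popularity',
        'page': page,
    }
    # urlencode joins the pairs as key=value&key=value and escapes special characters
    return 'https://api.ravelry.com/patterns/search.json?' + urlencode(params)

print(build_search_url('socks', page=2))
```

`requests.get` also accepts a `params=` dictionary directly, so the same dictionary could be passed to it instead of building the string by hand.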


In [20]:
def detail_collector(dataframe):

    '''
    Function accepts a 'dataframe' argument that is intended to be the dataframe created with the unique_pattern_collection function.  This function takes that dataframe and breaks
    out all of the values in the ID column into a list.  The Ravelry patterns endpoint is called once per ID, and the selected details are returned as a dataframe.
    '''
    # Break out the garment dataframe to just get a list of IDs to iterate through API.  Also need to keep a list of collected raw data.
    pattern_ids = dataframe['id'].tolist()
    raw_details = []
    
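    for pattern_id in pattern_ids:
        url = f'https://api.ravelry.com/patterns.json?ids={pattern_id}'
        response = requests.get(url, auth=requests.auth.HTTPBasicAuth(username, password))

        # Skip IDs whose request failed instead of crashing on an unexpected body
        if response.status_code != 200:
            print(f'API request for pattern {pattern_id} failed with status code {response.status_code}')
            continue

        pattern_data = response.json()
        # The response keys each pattern by its ID as a string; looking up the integer returns None
        pat_details = pattern_data.get('patterns', {}).get(str(pattern_id))

        # Only keep entries that actually came back
        if pat_details is not None:
            raw_details.append(pat_details)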
    
    # Fields to collect from each pattern's details
    id = [raw_detail['id'] for raw_detail in raw_details]
    name = [raw_detail['name'] for raw_detail in raw_details]
    difficulty_avg = [round(raw_detail['difficulty_average'],2) for raw_detail in raw_details]
    gauge = [raw_detail['gauge'] for raw_detail in raw_details]
    gauge_divisor = [raw_detail['gauge_divisor'] for raw_detail in raw_details]
    gauge_pattern = [raw_detail['gauge_pattern'] for raw_detail in raw_details]
    max_yardage = [raw_detail['yardage_max'] for raw_detail in raw_details]
    price = [raw_detail['price'] for raw_detail in raw_details]
    rating_avg = [round(raw_detail['rating_average'],2) for raw_detail in raw_details]
    projects_count = [raw_detail['projects_count'] for raw_detail in raw_details]
    queued_projects_count = [raw_detail['queued_projects_count'] for raw_detail in raw_details]
    sizes_available = [raw_detail['sizes_available'] for raw_detail in raw_details]
    yarn_weight = [raw_detail['yarn_weight']['name'] if raw_detail.get('yarn_weight') and raw_detail['yarn_weight'].get('name') else 'Unavailable' for raw_detail in raw_details]
    author = [raw_detail['pattern_author']['name'] for raw_detail in raw_details]
    notes = [raw_detail['notes'] for raw_detail in raw_details]

    
    final_df = pd.DataFrame({'id': id,
                            'name': name,
                            'author': author,
                            'difficulty_avg': difficulty_avg,
                            'gauge': gauge,
                            'gauge_divisor': gauge_divisor,
                            'gauge_pattern': gauge_pattern,
                            'max_yardage': max_yardage,
                            'notes': notes,
                            'price': price,
                            'projects_count': projects_count,
                            'queued_projects_count': queued_projects_count,
                            'rating_avg': rating_avg,
                            'sizes_available': sizes_available,
                            'yarn_weight': yarn_weight,
                        })
    
    return final_df
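
The `round(raw_detail['difficulty_average'],2)` and `round(raw_detail['rating_average'],2)` calls above raise a `TypeError` if the value is `None` or a `KeyError` if the key is absent. Whether the API ever returns a pattern without these values is an assumption here, not something the notebook confirms. A small helper (hypothetical name `safe_round`) shows one way to guard against it, using made-up sample records:

```python
def safe_round(value, ndigits=2):
    """Round numeric values; pass None through unchanged."""
    return None if value is None else round(value, ndigits)

# Made-up records standing in for API results (illustration only)
raw_details = [
    {'id': 1, 'rating_average': 4.756},
    {'id': 2, 'rating_average': None},
    {'id': 3},  # key missing entirely
]

# .get() returns None for a missing key, which safe_round passes through
rating_avg = [safe_round(d.get('rating_average')) for d in raw_details]
print(rating_avg)  # [4.76, None, None]
```

pandas stores the resulting `None` entries as missing values, so they can be filtered or imputed later rather than stopping the collection run.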


In [7]:
def data_collection_pipeline(garment, total):
    
    '''
    garment (str) - type of garment to be searched on Ravelry API
    total (int) - Number of patterns to be collected
    Function will call the Ravelry API twice to collect the following details about each pattern:
    id
    name
    author
    difficulty_avg
    gauge
    gauge_divisor
    gauge_pattern
    max_yardage
    notes
    price
    projects_count
    queued_projects_count
    rating_avg
    sizes_available
    yarn_weight
    Function will return the dataframe as well as save the dataframe as {garment}_details.csv
    '''
    
    df = unique_pattern_collection(garment, total)
    final_df = detail_collector(df)
    final_df.to_csv(f'../data/{garment}_details.csv', index = False)
    return final_df
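
The pipeline writes to `../data/`, and `to_csv` fails if that directory does not exist yet. A minimal sketch of a path helper that creates the directory first is shown below. The name `details_csv_path` and the `data_dir` parameter are hypothetical; only the `{garment}_details.csv` naming comes from the function above.

```python
from pathlib import Path

def details_csv_path(garment, data_dir='../data'):
    """Return the CSV path for a garment, creating the data directory if needed."""
    out_dir = Path(data_dir)
    # parents=True creates intermediate folders; exist_ok=True makes repeat calls harmless
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f'{garment}_details.csv'
```

Inside the pipeline this could be used as `final_df.to_csv(details_csv_path(garment), index=False)`, since `to_csv` accepts path-like objects.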
    

In [None]:
scarf_df = data_collection_pipeline('scarf', 10)

## Photo Collection

Use the unique_pattern_collection function to collect photo URLs for 100 patterns of each garment type.

In [32]:
socks_photos_df = unique_pattern_collection('socks', 100)

Total unique pattern IDs collected: 100


In [33]:
socks_photos_df.head()

| |id|name|photo|
|---|---|---|---|
|0|1039033|Vanilla Socks on Magic Loop|https://images4-f.ravelrycache.com/uploads/the...|
|1|1039035|Vanilla Socks on 9" Circulars|https://images4-f.ravelrycache.com/uploads/the...|
|2|130787|Hermione's Everyday Socks|https://images4-g.ravelrycache.com/flickr/3/7/...|
|3|1159708|DK Weight Vanilla Socks|https://images4-f.ravelrycache.com/uploads/the...|
|4|1091238|DRK Everyday Socks|https://images4-f.ravelrycache.com/uploads/dre...|


In [34]:
socks_photos_df.to_csv('../data/socks_photos.csv', index = False)