## Data for Collaborative Model Utility Matrix (User-Item)

The data comes from ravelry.com's API, which has a limit of 100k per call. Ravelry has over 10 million users and over 18 million projects (items), so the data set needs to be reduced. After some exploration, I decided to pull only the projects for patterns that have more than a threshold number of projects (600), and to use those projects to get the users. I will filter the users down once I have the data, to ensure the utility matrix is not too sparse.

https://www.ravelry.com/api#index
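The user-filtering step mentioned above is not part of this notebook. A minimal sketch of what it could look like, assuming a long-format table with `user_id` and `pattern_id` columns; the toy data, the `min_projects` threshold, and the helper names `filter_sparse_users` and `density` are hypothetical and only illustrate the idea:

```python
import pandas as pd

def filter_sparse_users(projects, min_projects=2):
    """Keep only rows for users with at least `min_projects` distinct patterns."""
    counts = projects.groupby('user_id')['pattern_id'].nunique()
    keep = counts[counts >= min_projects].index
    return projects[projects['user_id'].isin(keep)]

def density(projects):
    """Fraction of filled cells in the implied user-item matrix."""
    pairs = projects.drop_duplicates(['user_id', 'pattern_id'])
    n_users = pairs['user_id'].nunique()
    n_items = pairs['pattern_id'].nunique()
    return len(pairs) / (n_users * n_items)

# toy data (hypothetical): user 2 has only one project
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 3, 3],
    'pattern_id': [10, 11, 12, 10, 11, 12],
})
filtered = filter_sparse_users(toy, min_projects=2)
print(density(toy), density(filtered))
```

Dropping users with very few projects raises the density of the matrix, at the cost of not being able to recommend to those users from this model.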

In [1]:
import pandas as pd
import numpy as np

import json
import requests
from requests.auth import HTTPBasicAuth
from pprint import pprint

from config import basic_auth_username, basic_auth_password
from config import basic_auth_username_read_only, basic_auth_password_read_only

#### Get the list of patterns from which to pull projects

In [4]:
# load data
df = pd.read_csv('data/consolidated_patterns.csv', low_memory=False)

After experimenting with the project count, I decided to start with 600 as the cutoff. Some of these patterns have over 30k completed projects, and the API returns only 500 projects per request, so I start high and can include more patterns once I have a minimum viable product.

In [5]:
df[df['projects_count'] >= 601].shape

(3197, 24)
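To get a feel for how many requests the 500-per-page limit implies, here is a small arithmetic sketch. The helper `pages_needed` is hypothetical, and it assumes every project of a pattern is returned; the `last_page` value reported by the API's paginator is the authoritative number.

```python
import math

PAGE_SIZE = 500  # page size used in the request URL in this notebook

def pages_needed(projects_count, page_size=PAGE_SIZE):
    """Number of API pages required to fetch all projects of one pattern."""
    return math.ceil(projects_count / page_size)

print(pages_needed(601))     # a pattern just above the cutoff
print(pages_needed(30000))   # one of the most popular patterns
```

Under these assumptions a pattern just above the cutoff needs 2 requests and a pattern with 30,000 projects needs 60.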

In [3]:
def clean_out_bad_data(df):
    """ Remove any patterns that will result in a more sparse matrix """
    # stray header rows from appending to the .csv, and accidental duplicates
    df = df.drop(df[df['pattern_id']=='pattern_id'].index)
    df = df.drop_duplicates()
    
    # not enough projects (43451 rows)
    df = df.drop(df[df['projects_count'] <= 600].index)
    
    # drop rows with too many nulls (15795 rows)
    df = df.drop(df[df.isnull().sum(axis=1) >3].index)
    
    # drop if no category
    df = df.drop(df[df['categories'].isna()].index)

    return df

df = clean_out_bad_data(df)
df = df.reset_index(drop=True)

In [4]:
def get_search_results(pattern_id, page):
    """ Get json response for a page of projects for a particular pattern """
    # stays None if the request or the JSON decoding fails
    search_result = None
    try:
        response = requests.get(f"https://api.ravelry.com/patterns/{pattern_id}/projects.json?photoless=0&page_size=500&page={page}",auth=HTTPBasicAuth(basic_auth_username, basic_auth_password))
        search_result = response.json()
    except Exception as e:
        print(f'page number {page} failed: {e}')
    return search_result
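Since long collection runs hit occasional failed pages, one option is to wrap each request in a bounded retry with exponential backoff. This is only a sketch and not what was run here: `fetch_with_retries` and its parameters are hypothetical, and `fetch` stands for any zero-argument callable that raises on failure.

```python
import time

def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `max_attempts` is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns the result of `fetch()`, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as e:
            print(f'attempt {attempt + 1} failed: {e}')
            # no point in waiting after the final attempt
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

Capping the number of attempts means a page that is permanently missing gets skipped and logged instead of blocking the rest of the run.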


In [5]:
def get_all_projects_for_pattern(pattern_id, saving_lower,saving_upper):
    """Gets all pages of projects for a particular pattern and saves to .csv file. """
    
    #initialize page
    page = 1
    
    # get number of pages to pull
    try:
        search_result = get_search_results(pattern_id,page)
        last_page = search_result['paginator']['last_page']
        print(f'starting new pattern - number of pages {last_page}')
    except Exception as e:
        # without the page count there is nothing to loop over
        print(f"failed to get first page or page numbers: {e}")
        return
    
    while page < last_page+1:
        try:
            search_result = get_search_results(pattern_id, page)
            
            # for each page
            user_id = []
            pattern_ids =[]
            date_completed = []
            project_id =[]

            for i in range(len(search_result['projects'])):
                try:
                    user_id.append(search_result['projects'][i]['user_id'])
                    pattern_ids.append(search_result['projects'][i]['pattern_id'])
                    date_completed.append(search_result['projects'][i]['completed'])
                    project_id.append(search_result['projects'][i]['id'])
                    
                except Exception as e:
                    print(e)
                    
            # assemble dictionary          
            data = {'user_id':user_id,
                    'pattern_ids':pattern_ids,
                    'date_completed':date_completed,
                    'project_id': project_id,  
                   }
            
            #convert and save
            df = pd.DataFrame(data)   

            df.to_csv(f'data/users_projects_{saving_lower}-{saving_upper}.csv', mode ="a", index=False)
            print(f"yay pattern  {pattern_id}, page {page} saved!")
            
            page += 1


        except Exception as e:
            print(e, f'Stopped on page {page} - retrying!')
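Note that `to_csv` with `mode="a"` and the default `header=True` writes the column names again for every page, so the resulting files contain repeated header rows that have to be dropped when reading them back. A possible alternative, sketched with a hypothetical helper `append_to_csv`, writes the header only when the file is new or empty:

```python
import os
import pandas as pd

def append_to_csv(df, path):
    """Append `df` to `path`, writing the header row only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode='a', index=False, header=write_header)
```

With this helper, the `df.to_csv(...)` call inside the loop could be replaced by `append_to_csv(df, path)` with the same file name.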

I was worried about the file sizes, so I split the output into files of 200 patterns each (remember that each pattern has anywhere from 600 to 30,000 projects). The **saving_lower** and **saving_upper** lists automate the separation and saving. The collection ran for more than 16 hours at a time. (Some pages failed with missing-page errors and had to be rerun afterwards.)
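The bounds were typed in by hand, but they could also be generated. A minimal sketch, where `chunk_bounds` is a hypothetical helper and the start index of 100 and end index of 2962 are taken from the lists used in this notebook:

```python
def chunk_bounds(start, stop, step=200):
    """Return parallel lists of lower and upper bounds covering [start, stop)."""
    lowers = list(range(start, stop, step))
    # the last chunk is cut off at `stop` instead of running past it
    uppers = [min(low + step, stop) for low in lowers]
    return lowers, uppers

lowers, uppers = chunk_bounds(100, 2962)
print(list(zip(lowers, uppers)))
```

Each `(lower, upper)` pair then names one output file and selects the pattern rows `lower` up to but not including `upper`.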

In [6]:
# saving_lower = [100,300,500,700,900,1100,1300,1500,1700,1900,2100,2300,2500,2700]
# saving_upper = [300,500,700,900,1100,1300,1500,1700,1900,2100,2300,2500,2700,2900]
# saving_lower = [1900,2100,2300,2500,2700] # will need to go back and do 17-1900 and 2900+
# saving_upper = [2100,2300,2500,2700,2900]
# saving_lower = [1700,2900] 
# saving_upper = [1900,2962]
saving_lower = [1887,2900] 
saving_upper = [1900,2962]

The cell below was run to collect the projects. 

In [6]:
pattern_id_df = df[['pattern_id']]

# Get projects for each pattern, one output file per (lower, upper) range
for j in range(len(saving_lower)):
    print(f'starting new file #{j}')
    for i in range(saving_lower[j], saving_upper[j]):
        pattern_id = pattern_id_df['pattern_id'][i]
        print(f'file {j}, step {i}: getting projects for {pattern_id}')
        get_all_projects_for_pattern(pattern_id, saving_lower[j], saving_upper[j])