# Where We Left Off
In the previous post, we dealt with missing values and asked some last-minute questions. In this post, we're going to follow in the footsteps of **Robert Ritz's** [post](https://www.datafantic.com/create-an-auto-updating-dataset-on-kaggle-with-deepnote/) about getting a dataset to update automatically, which avoids cluttering up Kaggle with old datasets that are out of date and no longer useful.

Note that this post is about getting the data uploaded, not about doing any sort of analysis.

## A Bit of Cleaning

In [1]:
#| include: false
# Imports 
import pandas as pd             # for the data.
import numpy as np              # for a NaN type
import matplotlib.pyplot as plt # For plotting, and some customization of plots.
import seaborn as sns           # For pretty plots.
import requests as r            # For downloading from websites

# Fix the size of the graphs
sns.set(rc={"figure.figsize":(11, 8)})

In [2]:
#| include: false
urlJobs = "https://thecyclefrontier.wiki/wiki/Jobs"
urlLoot = "https://thecyclefrontier.wiki/wiki/Loot"
urlDataDrives = 'https://thecyclefrontier.wiki/wiki/Utilities#Data_Drives-0'
urlGun = "https://thecyclefrontier.wiki/wiki/Weapons"
urlGear = 'https://thecyclefrontier.wiki/wiki/Gear'
urlAmmo = "https://thecyclefrontier.wiki/wiki/Ammo"
urlMiner = "https://thecyclefrontier.wiki/wiki/Heavy_Mining_Tool"

In [3]:
#| include: false
# Functions:

def buildJobsRewards(data):
    # Function to take job rewards data and return a cleaned version

    rewardsSubset = data[["Name", "Description", "Difficulty"]].copy()
    rewardsSubset.columns = ["Units", "Rewards", "Job"]

    index = range( 0, len(rewardsSubset) - 4, 4)
    offset = np.array([1, 2, 3])

    rewardsSubset.Job = np.nan  # np.NaN was removed in NumPy 2.0

    for i in index:
        aJob = rewardsSubset.iloc[i, 0]
        indexes = i + offset
        rewardsSubset.iloc[ indexes, 2 ] = aJob
        
    cutNA = rewardsSubset.Job.isna()
    rewardsSubset = rewardsSubset[ ~cutNA ]

    rewardsSubset = rewardsSubset.assign(
        Units = rewardsSubset['Units'].astype(int)
    )

    return rewardsSubset

def breakLoot(taskString, index=0):
    parts = taskString.split(' ', maxsplit=1)
    if index == 0:
        return int(parts[index])
    elif index == 1:
        return parts[index]
    else:
        # This shouldn't be called.
        return None

def extractSite(siteData, columns, adjust, step,  offset):
    if not isinstance(columns, list):
        print("Columns argument must be a list.")
        return None
    siteSubset = siteData[columns].copy()

    siteSubset = siteSubset.assign(
        Loot = np.nan
    )

    # Some extra error handling 
    if not isinstance(adjust, int):
        print("adjust argument must be an int.")
        return None
    if not isinstance(step, int):
        print("step argument must be an int.")
        return None
    if not isinstance(offset, list):
        print("offset argument must be a list.")
        return None
    
    index = range( 0, len(siteSubset) - adjust, step)
    offset = np.array(offset)

    for i in index:
        aLoot = siteSubset.iloc[i, 1]
        indexes = i + offset
        siteSubset.iloc[indexes, len(siteSubset.columns)-1] = aLoot

    tmp = siteSubset.iloc[:, 1:len(siteSubset.columns)]
    tmp = tmp.ffill()  # fillna(method="ffill") is deprecated in recent pandas
    siteSubset.iloc[:, 1:len(siteSubset.columns)-1] = tmp

    cutNA = siteSubset.Loot.isna()
    returnData = siteSubset[ ~cutNA ]
    returnData = returnData.rename(columns={'Image':'Unit', 'Name':'Reward'})

    return returnData

In [4]:
#| include: false
# Jobs Data:
siteJobs = pd.read_html(urlJobs, match="Name",
    converters = {
        "Name": str,
        "Description": str, 
        "Unlocked": int, 
        "Tasks": str,
        "Rewards": str})

korolevRewards = buildJobsRewards( siteJobs[0] )
icaRewards = buildJobsRewards( siteJobs[1] )
osirisRewards = buildJobsRewards( siteJobs[2] )

# Add Jobs Together:
allJobRewards = pd.concat([korolevRewards, icaRewards, osirisRewards])

In [5]:
#| include: false
# Loot Data:
siteLoot = pd.read_html(urlLoot, attrs={"class":"zebra"})[0]

lootSubset = siteLoot[[
    'Image', 'Name', 'Rarity',
    'Personal Quarters', 'Campaigns',
    'Jobs', 'Printing']].copy()

filterIndex = lootSubset.Printing == "Yes"
lootSubset.loc[~filterIndex, "Printing"] = "No"


# Change range to 5 instead of 4
index = range( 0, len(lootSubset) - 4, 5)
offset = np.array([1, 2, 3, 4])

lootSubset = lootSubset.assign(
    Loot = np.nan
)

for i in index:
    # Correct Loot column
    aLoot = lootSubset.iloc[i, 1]
    indexes = i + offset
    lootSubset.iloc[ indexes, 7 ] = aLoot

tmp = lootSubset.iloc[:, 1:7]
tmp = tmp.ffill()  # fillna(method="ffill") is deprecated in recent pandas
lootSubset.iloc[:, 1:7] = tmp

cutNA = lootSubset.Loot.isna()
lootData = lootSubset[ ~cutNA ]
lootData = lootData.rename(columns={'Image':'Unit'})
lootData['Rarity'] = pd.Categorical(
    lootData.Rarity, categories = ['Common', 'Uncommon', 'Rare', 'Epic', 'Exotic', 'Legendary']
)
loot = lootData

In [6]:
#| include: false
tasks = []

for index in range(0,3):
    tasksSubset = siteJobs[index][["Name", "Description", "Tasks"]].copy()
    tasksSubset = tasksSubset[ ~tasksSubset.Tasks.isna()]
    tasksSubset = tasksSubset[ ~tasksSubset.Tasks.str.contains("Kill")]

    regex = r"(\d+\s[\w]+\s[\w]+)"
    tmp = tasksSubset.Tasks.str.extractall(regex)

    count = tmp.reset_index()[0].apply(breakLoot).values
    aLoot = tmp.reset_index()[0].apply(breakLoot, index=1).values

    tmp = tmp.assign(
        count = count,
        loot = aLoot
    )

    nameDescriptSlice = tasksSubset.loc[tmp.reset_index()["level_0"], ['Name', 'Description']]

    tmp = tmp.assign(
        name = nameDescriptSlice.Name.values,
        description = nameDescriptSlice.Description.values
    )

    taskSlice = tmp.reset_index().drop([
        'level_0',
        'match',
        0
    ], axis =1 )

    taskSlice = taskSlice[['name', 'count', 'loot', 'description']]
    tasks.append(taskSlice)
tasks = pd.concat([*tasks])

In [7]:
#| include: false
# Most corrections:
tasks.loc[ tasks.loot == "Master Unit", 'loot'] = 'Master Unit CPU'
tasks.loc[ tasks.loot == "Zero Systems", 'loot'] = 'Zero Systems CPU'
tasks.loc[ tasks.loot == "Pure Focus", 'loot'] = 'Pure Focus Crystal'
tasks.loc[ tasks.loot == "Heavy Mining", 'loot'] = 'Heavy Mining Tool'
tasks.loc[ tasks.loot == "Magnetic Field", 'loot'] = 'Magnetic Field Stabilizer'
tasks.loc[ tasks.loot == "Brittle Titan", 'loot'] = 'Brittle Titan Ore'
tasks.loc[ tasks.loot == "NiC Oil", 'loot'] = 'NiC Oil Cannister'
tasks.loc[ tasks.loot == "Charged Spinal", 'loot'] = 'Charged Spinal Base'
tasks.loc[ tasks.loot == "Hardened Bone", 'loot'] = 'Hardened Bone Plates'
tasks.loc[ tasks.loot == "Pale Ivy", 'loot'] = 'Pale Ivy Blossom'
tasks.loc[ tasks.loot == "Glowy Brightcap", 'loot'] = 'Glowy Brightcap Mushroom'
tasks.loc[ tasks.loot == "Blue Runner", 'loot'] = 'Blue Runner Egg'
tasks.loc[ tasks.loot == "Magic", 'loot'] = 'Magic-GROW Fertilizer'
tasks.loc[ tasks.loot == "Letium", 'loot'] = 'Letium Clot'
tasks.loc[ tasks.loot == "Azure Tree", 'loot'] = 'Azure Tree Bark'

# Fix the Gun Naming:
tasks.loc[tasks.loot.str.contains('Advocate at'), 'loot'] = 'Advocate'

tasks = tasks.copy()

In [8]:
#| include: false
siteDrive = pd.read_html(urlDataDrives, attrs={"class":"zebra"})[2]


drives = extractSite(siteDrive, ['Image', 'Name', 'Rarity', 'Weight'], 2, 3, [1, 2])
drives['Rarity'] = pd.Categorical(
        drives.Rarity, categories = ['Common', 'Uncommon', 'Rare', 'Epic', 'Exotic', 'Legendary']
    )
drives['Loot'] = drives['Rarity'].astype('str') + ' Data'

In [9]:
#| include: false
siteGun = pd.read_html(urlGun, attrs={"class":"zebra"})[0]

# Copy the filtered rows so the assignment below does not write to a view of siteGun
gunData = siteGun[~siteGun.Type.isna()].copy()
indx = gunData['Proj. Speed'] == 'Hitscan'
gunData.loc[indx, 'Proj. Speed'] = np.nan

gunData = gunData.assign(
    Unit = gunData['Sell Value'].str.replace(' K-Marks', '').astype('float'),
    Reward = "K-Marks",
    Loot = gunData['Name']
)

# # This removes the legendary weapons
# data = data.query('Faction != "Printing"')

guns = gunData.drop('Image', axis=1)

In [10]:
#| include: false
siteGear = pd.read_html(urlGear)

siteBackPacks = siteGear[0]
siteHelmet = siteGear[10]
siteShield = siteGear[22]

In [11]:
#| include: false
backpacks = extractSite(
    siteData = siteBackPacks.loc[siteBackPacks.Name.str.contains("Backpack|K-Marks")],
    columns = ['Image', 'Name', 'Rarity', 'Space', 'Sale Price'],
    adjust = 2, 
    step = 3,
    offset = [1, 2])

In [12]:
#| include: false
ammo = pd.read_html(urlAmmo)[0]
ammo = ammo.rename(
    {"Item Name":"Loot", "Sell Value":"Unit"}, axis=1
    ).assign(
        Reward = "K-Marks",
        Rarity = pd.Categorical(
            ammo.Rarity, categories = ['Common', 'Uncommon', 'Rare', 'Epic', 'Exotic', 'Legendary'])
    )[['Unit', 'Reward', 'Rarity', 'Loot']]

In [13]:
#| include: false
siteMiner = pd.read_html(urlMiner)
minerData = siteMiner[0]

row = minerData[1].T.to_list() + ['Heavy Mining Tool']
columns = minerData[0].to_list() + ['Loot']

mineTool = pd.DataFrame(columns = columns)
mineTool.loc[0] = row
mineTool = mineTool.assign(
    Reward = "K-Marks",
    Rarity = pd.Categorical(
        mineTool.Rarity, categories = ['Common', 'Uncommon', 'Rare', 'Epic', 'Exotic', 'Legendary']
    ),
    Unit = mineTool['Sell Value'].astype(int),
)[['Unit', 'Reward', 'Rarity', 'Loot']]

# Setup Integration
Following along with the article, we're going to create a Kaggle API key. We'll go to the account page and create a new key:
![](images/2022-12-06/create-an-api-key.png)

Once we have the key, we add it as an environment variable, per the post:
![](images/2022-12-06/add-the-api-key.png)
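
Before calling the Kaggle CLI, it can help to confirm that the credentials are visible to the notebook. The sketch below assumes the integration exposes them as `KAGGLE_USERNAME` and `KAGGLE_KEY`, which are the variable names the Kaggle CLI reads; the helper name `missing_kaggle_credentials` is made up for illustration.

```python
import os


def missing_kaggle_credentials(env=os.environ):
    """Return the names of the Kaggle credential variables that are unset or empty."""
    # Assumption: the credentials are stored under the names the Kaggle CLI looks for.
    required = ("KAGGLE_USERNAME", "KAGGLE_KEY")
    return [name for name in required if not env.get(name)]


missing = missing_kaggle_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Kaggle credentials found.")
```

If anything is reported missing, the `kaggle` commands further down will most likely fail with an authentication error.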

Since we're making our own dataset, we first need to initialize it with the Kaggle CLI, which generates a `dataset-metadata.json` template in the dataset folder:
```
!kaggle datasets init -p cycle-frontier-data
```
This did not work for me, since the command expects the folder to exist already. So, we'll create the folder first and then run init:
```
!mkdir cycle-frontier-data
!kaggle datasets init -p cycle-frontier-data
```

Once done, you'll want to update the `dataset-metadata.json` file, which can be found on the side:
![](images/2022-12-06/fill-out-the-dataset-metadata.png)
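
If you would rather fill out the metadata from code than by hand, something like the following sketch could work. It writes the same three fields that the `init` template contains (`title`, `id`, `licenses`). The helper name `write_dataset_metadata` and every value passed to it are placeholders, not the settings used for this dataset.

```python
import json
from pathlib import Path


def write_dataset_metadata(folder, title, owner, slug, license_name="CC0-1.0"):
    """Write a minimal dataset-metadata.json into `folder` and return its path."""
    folder = Path(folder)
    # Create the folder if it is missing; do nothing if it already exists.
    folder.mkdir(parents=True, exist_ok=True)
    metadata = {
        "title": title,
        "id": f"{owner}/{slug}",  # Kaggle dataset ids take the form owner/slug
        "licenses": [{"name": license_name}],
    }
    path = folder / "dataset-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path
```

A call might look like `write_dataset_metadata("cycle-frontier-data", "Cycle Frontier Data", "your-kaggle-username", "cycle-frontier-data")`, with your own username and title swapped in.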

# Import and Merge Data

Now we need to save the data to this folder, since Kaggle expects it there.

In [None]:
from pathlib import Path
dataPath = Path("cycle-frontier-data")

allJobRewards.to_csv(dataPath/"allJobsRewards", index=False)
loot.to_csv(dataPath/'loot', index=False)
tasks.to_csv(dataPath/'tasks', index=False)
drives.to_csv(dataPath/'drives', index=False)
guns.to_csv(dataPath/'guns', index=False)
backpacks.to_csv(dataPath/'backpacks', index=False)
ammo.to_csv(dataPath/'ammo', index=False)
mineTool.to_csv(dataPath/'minetool', index=False)
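
The cell above writes each table by hand and without a file extension. As an alternative sketch, the tables could be collected in a dictionary and written in a loop with a `.csv` suffix; adding the suffix is my assumption here, on the guess that it helps Kaggle recognize the files as tabular data. The helper name `save_frames` is hypothetical, and it accepts any object with a pandas-style `to_csv` method.

```python
from pathlib import Path


def save_frames(frames, folder):
    """Write each named table in `frames` to `<folder>/<name>.csv`; return the paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in frames.items():
        target = folder / f"{name}.csv"
        # index=False mirrors the cell above: the row index is not saved.
        frame.to_csv(target, index=False)
        written.append(target)
    return written
```

With the tables from this post, the call would be along the lines of `save_frames({"loot": loot, "tasks": tasks, "guns": guns}, "cycle-frontier-data")`.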

# Add Dataset to Kaggle; Schedule It To Run
Once done, run the `create` command to upload the data:
```
!kaggle datasets create -p cycle-frontier-data
```
The dataset will be private until you fill out its details and set it to public. Make sure that the bottom of the notebook includes the line that updates the dataset, per the post:
```
!kaggle datasets version -p cycle-frontier-data -m "Automatic Update"
```
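
If you prefer plain Python to the `!` shell escape, the same command can be run with `subprocess`. The sketch below also appends the date to the version message so each scheduled run is easier to tell apart; that is my own addition rather than something from the post. The function names are hypothetical, while `-p` and `-m` are the same flags used above.

```python
import subprocess
from datetime import datetime, timezone


def build_version_command(folder, when=None):
    """Build the `kaggle datasets version` command with a dated message."""
    when = when or datetime.now(timezone.utc)
    message = f"Automatic Update {when:%Y-%m-%d}"
    return ["kaggle", "datasets", "version", "-p", folder, "-m", message]


def push_new_version(folder):
    """Run the command and report whether the CLI exited successfully."""
    # Assumes the kaggle CLI is installed and on PATH; otherwise
    # subprocess.run raises FileNotFoundError.
    result = subprocess.run(build_version_command(folder), capture_output=True, text=True)
    return result.returncode == 0
```

Calling `push_new_version("cycle-frontier-data")` in the last cell would then take the place of the shell line.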

Last, you'll want to set the notebook to run on a schedule; I set it to run as rarely as possible. If the data is out of date, you can always log in and force it to run.