# Exploring Kaggle Competitions

For this exercise, I will explore Kaggle's current and past challenges to find those that are potentially biomedical. I will search the archive with the following list of terms, which the team considers "biomedical":

* medicine
* cancer
* cell
* biology
* neuro
* drug
* health
* life


## Setup

This exercise will utilize the [Kaggle public APIs](https://github.com/Kaggle/kaggle-api).

To use the Kaggle API, an account is required. Credentials are supplied by placing the account's API token at `~/.kaggle/kaggle.json`.

> To re-run the next few cells, first generate your API token: go to the Account tab of your Kaggle user profile and click Create API Token. A `kaggle.json` file will download to your machine. Upload that file using the `FileUpload` widget below.

First, install the needed libraries, including `pandas`.

In [4]:
!pip install -q pandas kaggle ipywidgets

Note: you may need to restart the kernel to use updated packages.


To avoid sharing secrets, I will use `FileUpload` from `ipywidgets` to upload my `kaggle.json` file to the notebook.

In [None]:
from ipywidgets import FileUpload

uploader = FileUpload()
display(uploader)

In [None]:
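!mkdir -p ~/.kaggle

import codecs
import os

# Resolve the path from the current user's home directory instead of
# hard-coding /home/vscode, so it matches the ~/.kaggle used by the shell commands.
cred_path = os.path.expanduser("~/.kaggle/kaggle.json")

# Assumes ipywidgets 8, where `uploader.value` is a tuple of uploaded files.
uploaded_file = uploader.value[0]
with open(cred_path, "w") as cred:
    cred.write(codecs.decode(uploaded_file.content, encoding="utf-8") + "\n")

# Restrict the token file to the current user.
!chmod 600 ~/.kaggle/kaggle.json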
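The same credential step can also be done entirely in Python. The following is a minimal sketch, not part of the original notebook: `save_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields, as in the `kaggle.json` that Kaggle generates. It checks the upload before writing it and creates the file with owner-only permissions.

```python
import json
import os
from pathlib import Path


def save_kaggle_token(content: bytes, kaggle_dir: Path) -> Path:
    """Validate an uploaded token and write it with owner-only permissions.

    Hypothetical helper: assumes the token JSON has "username" and "key" fields.
    """
    token = json.loads(content.decode("utf-8"))
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"token is missing fields: {sorted(missing)}")

    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    # Create the file with mode 0o600 from the start, so it is never
    # briefly readable by other users.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as cred:
        json.dump(token, cred)
        cred.write("\n")
    # Also tighten the mode if the file already existed with looser permissions.
    os.chmod(target, 0o600)
    return target
```

In the notebook, this could be called as `save_kaggle_token(bytes(uploader.value[0].content), Path.home() / ".kaggle")`, again assuming the ipywidgets 8 layout of `uploader.value`.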

## Search with Kaggle APIs

In [3]:
!for term in medicine cancer cell biology neuro drug health life; do echo ${term^^}; kaggle competitions list -s $term; echo ""; done

MEDICINE
No competitions found

CANCER
ref                                                               deadline             category   reward  teamCount  userHasEntered  
----------------------------------------------------------------  -------------------  --------  -------  ---------  --------------  
https://www.kaggle.com/competitions/rsna-breast-cancer-detection  2023-02-27 23:59:00  Featured  $50,000       1601           False  

CELL
No competitions found

BIOLOGY
ref                                                                                deadline             category   reward  teamCount  userHasEntered  
---------------------------------------------------------------------------------  -------------------  --------  -------  ---------  --------------  
https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction  2023-05-18 23:59:00  Featured  $60,000          5           False  

NEURO
No competitions found

DRUG
No competitions found

HEALTH
ref 

Because the API only returns _active_ challenges, the search yields very few results (n=2 challenges).
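To tally the results across terms without counting a challenge twice, the CLI text output can be parsed for competition URLs. This is a rough sketch under the assumption that every data row starts with a `https://www.kaggle.com/competitions/` URL, as in the output above; `extract_competition_refs` is a hypothetical helper, and other versions of the CLI may format rows differently.

```python
def extract_competition_refs(cli_output: str) -> list[str]:
    """Return competition URLs found in `kaggle competitions list` text output."""
    refs = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Data rows start with the competition URL; headers, separator lines
        # and "No competitions found" messages do not.
        if fields and fields[0].startswith("https://www.kaggle.com/competitions/"):
            refs.append(fields[0])
    return refs


# Abbreviated copy of the output shown above.
sample = """\
CANCER
ref  deadline  category  reward  teamCount  userHasEntered
---  --------  --------  ------  ---------  --------------
https://www.kaggle.com/competitions/rsna-breast-cancer-detection  2023-02-27 23:59:00  Featured  $50,000  1601  False

CELL
No competitions found

BIOLOGY
ref  deadline  category  reward  teamCount  userHasEntered
---  --------  --------  ------  ---------  --------------
https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction  2023-05-18 23:59:00  Featured  $60,000  5  False
"""

# A set removes challenges that are returned for more than one term.
unique_refs = sorted(set(extract_competition_refs(sample)))
print(len(unique_refs))
```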

## Meta Kaggle

Now, I will search through the Kaggle archive (Meta Kaggle) to find possible biomedical challenges, using the same key terms as earlier.

First, I will download the "Competitions" file from the `meta-kaggle` dataset. Because the file is large, it is downloaded as a zip archive, which I then need to extract.

In [4]:
!kaggle datasets download -d kaggle/meta-kaggle -f Competitions.csv

import pandas as pd
import zipfile

with zipfile.ZipFile('Competitions.csv.zip') as zip_ref:
    zip_ref.extractall()

Downloading Competitions.csv.zip to /workspaces/challenge-registry/apps/openchallenges/notebook/notebooks
  0%|                                                | 0.00/479k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 479k/479k [00:00<00:00, 70.6MB/s]
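Before extracting, it can be handy to look at the column names directly inside the archive. The sketch below uses only the standard library; `peek_csv_in_zip` is a hypothetical helper, and the UTF-8 encoding of `Competitions.csv` is an assumption.

```python
import csv
import io
import zipfile
from itertools import islice


def peek_csv_in_zip(zip_path, member, n_rows=3):
    """Return the header and the first n_rows of a CSV stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # "utf-8-sig" reads plain UTF-8 and also drops a leading BOM if present.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.reader(text)
            header = next(reader)
            rows = list(islice(reader, n_rows))
    return header, rows
```

For the download above, the call would be `peek_csv_in_zip("Competitions.csv.zip", "Competitions.csv")`.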


Next, I will read in the file as a data frame, then search for each of the terms within all of the columns.

In [5]:
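# Same terms as the API search, except "life" is narrowed to "life sciences".
search_terms = ["medicine", "cancer", "cell", "biology", "neuro", "drug", "life sciences", "health"]

competitions = pd.read_csv('Competitions.csv')
bio_challenges = []
for term in search_terms:
    # Keep rows where any column contains the term surrounded by whitespace.
    # The raw f-string keeps `\s` as a regex escape rather than a string escape.
    res = competitions[competitions.apply(lambda row: row.astype(str).str.contains(
        rf"\s{term}\s", regex=True, case=False).any(), axis=1)]
    bio_challenges.append(res)
# A challenge can match several terms, so duplicates are dropped before display.
bio_challenges = pd.concat(bio_challenges)
display(bio_challenges.drop_duplicates())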

| | Id | Slug | Title | Subtitle | HostSegmentTitle | ForumId | OrganizationId | CompetitionTypeId | EnabledDate | DeadlineDate | ... | UserRankMultiplier | CanQualifyTiers | TotalTeams | TotalCompetitors | TotalSubmissions | ValidationSetName | ValidationSetValue | EnableSubmissionModelHashes | EnableSubmissionModelAttachments | HostName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3492 | 23871 | aimdatathon | AIM Datathon 2020 | Join the AI in Medicine ( AIM ) Datathon 2020 | Community | 979798.0 | | 1 | 11/09/2020 02:01:22 | 11/22/2020 23:59:00 | ... | 0.0 | False | 0 | 0 | 0 | | | False | False | |
| 362 | 5354 | opc-recurrence | Oropharynx Cancer (OPC) Radiomics Challenge ::... | Determine from CT data whether a tumor will be... | Community | 1380.0 | | 1 | 07/26/2016 20:49:02 | 09/12/2016 23:59:00 | ... | 0.0 | False | 14 | 23 | 78 | | | False | False | |
| 364 | 5360 | oropharynx-radiomics-hpv | Oropharynx Cancer (OPC) Radiomics Challenge ::... | Predict from CT data the HPV phenotype of orop... | Community | 1379.0 | | 1 | 07/26/2016 20:48:54 | 09/12/2016 23:59:00 | ... | 0.0 | False | 9 | 17 | 70 | | | False | False | |
| 440 | 6004 | data-science-bowl-2017 | Data Science Bowl 2017 | Can you improve lung cancer detection? | Featured | 2363.0 | 360.0 | 1 | 01/12/2017 14:00:00 | 04/12/2017 23:59:00 | ... | 1.0 | True | 394 | 742 | 1676 | | | False | False | |
| 453 | 6152 | predict-impact-of-air-quality-on-death-rates | Predict impact of air quality on mortality rates | Predict CVD and cancer caused mortality rates ... | Community | 2619.0 | | 1 | 02/13/2017 18:15:14 | 05/05/2017 23:59:00 | ... | 0.0 | False | 52 | 53 | 443 | | | False | False | |
| 463 | 6243 | intel-mobileodt-cervical-cancer-screening | Intel & MobileODT Cervical Cancer Screening | Which cancer treatment will be most effective? | Featured | 2880.0 | 484.0 | 1 | 03/15/2017 19:00:22 | 06/21/2017 23:59:00 | ... | 1.0 | True | 260 | 386 | 1365 | | | False | False | |
| 533 | 6841 | msk-redefining-cancer-treatment | Personalized Medicine: Redefining Cancer Treat... | Predict the effect of Genetic Variants to enab... | Research | 4368.0 | 4.0 | 1 | 06/26/2017 17:31:39 | 10/02/2017 23:59:00 | ... | 0.0 | False | 355 | 445 | 3032 | | | False | False | |
| 1027 | 10306 | mubravo | Predicting Cancer Diagnosis | Bravo's machine learning competition! | Community | 48323.0 | | 1 | 07/31/2018 19:52:16 | 08/13/2018 20:59:00 | ... | 0.0 | False | 11 | 13 | 72 | | | False | False | |
| 1256 | 11848 | histopathologic-cancer-detection | Histopathologic Cancer Detection | Identify metastatic tissue in histopathologic ... | Playground | 87813.0 | 4.0 | 1 | 11/16/2018 18:51:17 | 03/30/2019 23:59:00 | ... | 0.0 | False | 1149 | 1347 | 20132 | | | False | False | |
| 1785 | 14867 | histologa-cancer-mama | Histología en Cancer de Mama | Crear un modelo para identificar metastasis en... | Community | 242775.0 | | 1 | 06/14/2019 15:19:01 | 06/27/2019 23:59:00 | ... | 0.0 | False | 0 | 0 | 0 | | | False | False | |
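When reading these results, it helps to know exactly which strings the `\s{term}\s` pattern from the search cell accepts. The sketch below compares it with a word-boundary pattern (`\b`), an alternative that is not used in this notebook. The first two titles come from the table above; the last two are made-up examples.

```python
import re

titles = [
    "Can you improve lung cancer detection?",  # term surrounded by spaces
    "Histología en Cancer de Mama",            # term surrounded by spaces
    "Detect breast cancer",                    # made-up: term at the end of the text
    "Cancer: early detection",                 # made-up: term at the start, then punctuation
]


def matches(pattern, text):
    """Case-insensitive search, mirroring `case=False` in the search cell."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


term = "cancer"
whitespace_pattern = rf"\s{term}\s"            # pattern used in the search cell
boundary_pattern = rf"\b{re.escape(term)}\b"   # alternative: word boundaries

whitespace_hits = [matches(whitespace_pattern, t) for t in titles]
boundary_hits = [matches(boundary_pattern, t) for t in titles]

for title, ws, wb in zip(titles, whitespace_hits, boundary_hits):
    print(f"{title!r}: whitespace={ws}, boundary={wb}")
```

The whitespace pattern requires a whitespace character on both sides of the term, so it skips a term at the very start or end of a field or next to punctuation, while the word-boundary pattern accepts all four titles.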


Searching for the terms this way gives a total of 50 challenges, which is lower than expected. For example, searching for "cancer" directly on Kaggle returns 124 challenges. The lower number is due to Kaggle not publishing a complete dump of its database: