<a href="https://colab.research.google.com/github/mlites/mlites2019/blob/master/kaggle_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Link to your Google Drive to enable import of data at ```/content/gdrive```




# Introduction to Kaggle Datasets

In this exercise, we'll learn how to get data into Colab.

First, we'll mount Google Drive.

Second, we'll search for data stored on Kaggle's servers.

Finally, we'll download the datasets we'll use in this course.

## Mounting Google Drive is easy

In [0]:
from google.colab import drive # from PACKAGE import MODULE
drive.mount('/content/gdrive') # mount your Google Drive at the /content/gdrive folder in your VM
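
Once the mount succeeds, a quick way to confirm it is to list what sits under the mount point. The sketch below is a minimal helper, not part of the original notebook: `list_mounted` is a hypothetical name, and the only path it assumes is the `/content/gdrive` mount point used above. It returns an empty list when the folder does not exist, so it is safe to run before mounting.

```python
from pathlib import Path

# Assumption: the mount point passed to drive.mount() in the cell above
MOUNT_POINT = Path('/content/gdrive')

def list_mounted(path=MOUNT_POINT):
    """Return the sorted names directly under `path`, or [] if it is not a directory."""
    path = Path(path)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())

# An empty list here usually means the Drive has not been mounted yet
print(list_mounted())
```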

## Setting up Kaggle is a bit more complicated
[Setting up Kaggle in Colab](https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463)

1. Sign up for Kaggle if you're not already a member
2. Go to _My Account_
3. Go to _Create New API Token_
4. That will download a file called **kaggle.json**
5. In Colab, click *Files* -> *Upload* in the left sidebar and upload **kaggle.json** (the cell below expects it at /content/kaggle.json)



In [0]:
# install kaggle package

# pip is a python program for installing new packages
# the ! indicates this command is to be run on the system's command line
!pip install kaggle 

# set up key authentication
!mkdir -p /content/kaggle # mkdir is the system command to make a new directory; -p avoids an error if it already exists
!mkdir -p ~/.kaggle # the leading . makes the directory hidden; this is where our credentials will be stored

# alternative to uploading kaggle.json: write the token file from Python
# import json, os # json encodes data as text; os.path.expanduser resolves '~', which open() does not do
# token = {"username":"YOUR-USER-NAME","key":"SOME-VERY-LONG-STRING"}
# with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file: # save the token to a file as JSON
#     json.dump(token, file)

!mv /content/kaggle.json ~/.kaggle/kaggle.json # move the file to the place where kaggle expects to find it
!chmod 600 ~/.kaggle/kaggle.json # make the file readable and writable by its owner only, to avoid leaking your credentials

!kaggle config set -n path -v /content/kaggle # set kaggle's download path to the /content/kaggle directory we made earlier
!chmod 600 /root/.kaggle/kaggle.json # in Colab ~ is /root, so this is the same file as above; re-apply owner-only permissions after the config step

# finally, import the kaggle package

import kaggle
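
The credential steps above can also be done from Python instead of uploading the file. The sketch below is a minimal example, assuming the two-field `kaggle.json` layout shown in the commented lines of the cell; `write_kaggle_token` is a hypothetical helper name, not part of the kaggle package. It expands `~` explicitly, because `open()` does not, and applies the same owner-only permissions as `chmod 600`.

```python
import json
import os
from pathlib import Path

def write_kaggle_token(username, key, config_dir='~/.kaggle'):
    """Write kaggle.json into config_dir with owner-only permissions; return its path."""
    # open() does not expand '~', so resolve it to the home directory first
    config_dir = Path(config_dir).expanduser()
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / 'kaggle.json'
    with open(token_path, 'w') as f:
        json.dump({'username': username, 'key': key}, f)
    os.chmod(token_path, 0o600)  # same effect as `chmod 600` on the file
    return token_path

# Not called here so that an existing token is never overwritten by accident:
# write_kaggle_token('YOUR-USER-NAME', 'SOME-VERY-LONG-STRING')
```

Avoid committing a notebook with a real key filled in; anyone who can read the notebook can then use your account.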



Find Kaggle datasets of interest

Kaggle package API details

https://github.com/Kaggle/kaggle-api#datasets

Only the first 20 results are shown; additional pages can be requested with the `--page` flag.
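
Because results arrive one page at a time, collecting several pages means repeating the command with a different `--page` value. The sketch below shows one way to do that from Python. It uses only the `--search` and `--page` flags that appear in this notebook; `list_command` and `fetch_pages` are hypothetical helper names, and `fetch_pages` assumes the `kaggle` command is installed and authenticated.

```python
import subprocess

def list_command(search, page=1):
    """Build the argument list for one page of `kaggle datasets list`."""
    return ['kaggle', 'datasets', 'list', '--search', search, '--page', str(page)]

def fetch_pages(search, pages):
    """Run the command once per page number and return the raw text of each page."""
    outputs = []
    for page in pages:
        result = subprocess.run(list_command(search, page),
                                capture_output=True, text=True, check=True)
        outputs.append(result.stdout)
    return outputs

# Show the command for page 2 without running it
print(' '.join(list_command('environment', page=2)))
# prints: kaggle datasets list --search environment --page 2
```

Calling `fetch_pages('environment', range(1, 4))` would then request the first three pages in turn.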

In [0]:
!kaggle datasets list --tags oceans #find datasets tagged with 'oceans'
!kaggle datasets list --user noaa #find datasets by user 'NOAA'
!kaggle datasets list --search environment --page 2 #find page 2 of datasets using search term 'environment'
!kaggle datasets list --search alaska #find datasets using search term 'alaska'

ref                                                           title                                            size  lastUpdated          downloadCount  
------------------------------------------------------------  ----------------------------------------------  -----  -------------------  -------------  
noaa/noaa-icoads                                              NOAA ICOADS                                     171GB  2018-03-13 17:37:47              0  
noaa/deep-sea-corals                                          Deep Sea Corals                                  10MB  2017-08-28 17:11:03            409  
uciml/el-nino-dataset                                         El Nino Dataset                                   3MB  2016-11-06 21:02:18           1006  
teajay/global-shark-attacks                                   Global Shark Attacks                            548KB  2018-07-04 17:59:54           4540  
noaa/seismic-waves                                            Tsunami Causes

# Iditarod dataset

Now find the specific files you want and download them

Let's see what files are available for the 2017 Iditarod dataset

In [0]:
!kaggle datasets files iditarod/iditarod-race

name         size  creationDate         
----------  -----  -------------------  
report.csv  139KB  2017-03-22 15:03:30  


Let's download all the files. They'll show up under /content/kaggle/datasets, in a folder named after the dataset.

In [0]:
!kaggle datasets download iditarod/iditarod-race

Downloading iditarod-race.zip to /content/kaggle/datasets/iditarod/iditarod-race
  0% 0.00/21.5k [00:00<?, ?B/s]
100% 21.5k/21.5k [00:00<00:00, 20.3MB/s]


In [0]:
#unzip the files
!unzip /content/kaggle/datasets/iditarod/iditarod-race/iditarod-race.zip -d /content/kaggle/datasets/iditarod/iditarod-race/


Archive:  /content/kaggle/datasets/iditarod/iditarod-race/iditarod-race.zip
  inflating: /content/kaggle/datasets/iditarod/iditarod-race/report.csv  
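
The same extraction can be done without leaving Python, using the standard library's `zipfile` module. This is a minimal sketch rather than a replacement for the cell above; `extract_zip` is a hypothetical helper name, and by default it extracts next to the archive, which matches the `-d` target used with `unzip`.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir=None):
    """Extract every member of zip_path into dest_dir (default: the archive's own folder)."""
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir) if dest_dir is not None else zip_path.parent
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()  # names of the files that were extracted

# Example, using the path from the download step above:
# extract_zip('/content/kaggle/datasets/iditarod/iditarod-race/iditarod-race.zip')
```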


In [0]:
#take a quick look

import pandas as pd
d = pd.read_csv('/content/kaggle/datasets/iditarod/iditarod-race/report.csv')
d.shape


(1146, 17)

In [0]:
d.head()

Unnamed: 0,Number,Name,Status,Country,Checkpoint,Latitude,Longitude,Distance,Time,Speed,Arrival Date,Arrival Time,Arrival Dogs,Elapsed Time,Departure Date,Departure Time,Departure Dogs
0,2,Ryan Redington,Veteran,United States,Fairbanks,64.8321,-147.813,,0.0,,,,,0.0,03/06/2017,11:00:00,16.0
1,3,Otto Balogh,Rookie,Hungary,Fairbanks,64.8321,-147.813,,0.0,,,,,,,,
2,4,Misha Wiljes,Rookie,Czech Republic,Fairbanks,64.8321,-147.813,,0.0,,,,,0.0,03/06/2017,11:04:00,15.0
3,5,Cody Strathe,Veteran,United States,Fairbanks,64.8321,-147.813,,0.0,,,,,0.0,03/06/2017,11:06:00,16.0
4,6,Linwood Fiedler,Veteran,United States,Fairbanks,64.8321,-147.813,,0.0,,,,,0.0,03/06/2017,11:08:00,16.0
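
The table stores dates and times in separate text columns, so comparing two departures means combining them first. The sketch below does this with the standard library only. It assumes the dates are written month/day/year, which fits the race's March 2017 start, and that missing values arrive as empty strings, as the `csv` module would return them, rather than as pandas `NaN`. `parse_checkpoint_time` is a hypothetical helper name.

```python
from datetime import datetime

def parse_checkpoint_time(date_text, time_text):
    """Combine 'MM/DD/YYYY' and 'HH:MM:SS' strings into a datetime; None if either is empty."""
    if not date_text or not time_text:
        return None
    return datetime.strptime(f'{date_text} {time_text}', '%m/%d/%Y %H:%M:%S')

# Departure values copied from rows 0 and 2 of the table above
first = parse_checkpoint_time('03/06/2017', '11:00:00')
third = parse_checkpoint_time('03/06/2017', '11:04:00')
print(third - first)  # prints 0:04:00
```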


# BIOL342 Dataset

Now we'll download the data that we'll use for examples in this course

In [0]:
!kaggle datasets list --user rec3141
!kaggle datasets files rec3141/biol342-genome-data
!kaggle datasets download rec3141/biol342-genome-data


ref                          title                                 size  lastUpdated          downloadCount  
---------------------------  ------------------------------------  ----  -------------------  -------------  
rec3141/biol342-genome-data  Decontamination of Microbial Genomes  29MB  2019-04-17 16:32:31              1  
name                    size  creationDate         
----------------------  ----  -------------------  
biol342_cov_len_gc.tsv   6MB  2019-04-17 16:32:31  
biol342_depths.tsv      43MB  2019-04-17 16:32:32  
biol342_paired.tsv      12MB  2019-04-17 16:32:32  
biol342_tax.tsv          5MB  2019-04-17 16:32:32  
biol342_tnf.tsv         79MB  2019-04-17 16:32:28  
Downloading biol342-genome-data.zip to /content/kaggle/datasets/rec3141/biol342-genome-data
 31% 9.00M/29.4M [00:00<00:00, 27.0MB/s]
100% 29.4M/29.4M [00:00<00:00, 66.2MB/s]


In [0]:
!unzip /content/kaggle/datasets/rec3141/biol342-genome-data/biol342-genome-data.zip -d ./

Archive:  /content/kaggle/datasets/rec3141/biol342-genome-data/biol342-genome-data.zip
  inflating: ./biol342_paired.tsv    
  inflating: ./biol342_tnf.tsv       
  inflating: ./biol342_cov_len_gc.tsv  
  inflating: ./biol342_tax.tsv       
  inflating: ./biol342_depths.tsv    


In [0]:
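covlengc = pd.read_csv('biol342_cov_len_gc.tsv',sep='\t') # sep='\t' because the file is tab-separated
covlengc.shape # not displayed: a cell only shows its last expression, so use print(covlengc.shape) to see it
covlengc.head(25) # show up to the first 25 rows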


Unnamed: 0,contig,student,cov,len,gc
0,student0_1,student0,29.0114,255873,0.4995
1,student0_2,student0,31.5053,190425,0.5151
2,student0_3,student0,39.5121,149891,0.5077
3,student0_4,student0,37.9206,135958,0.5212
4,student0_5,student0,34.0143,121845,0.5204
5,student0_6,student0,30.4397,117759,0.5067
6,student0_7,student0,33.0765,114165,0.521
7,student0_8,student0,40.6539,112487,0.5224
8,student0_9,student0,34.6739,97943,0.5203
9,student0_10,student0,34.1807,96681,0.5237


Import PyTorch, NumPy, Matplotlib, pandas, and Pillow

In [0]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import PIL
print(PIL.__version__) # PILLOW_VERSION was deprecated and later removed; __version__ is the supported attribute
print("done")


5.3.0
done
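
To record the versions of all the libraries at once, the standard library's `importlib.metadata` can look them up by package name without importing them. This is a minimal sketch assuming Python 3.8 or later; `installed_version` is a hypothetical helper name, and the package names in the list are the usual PyPI names, which may differ from the import names (Pillow is imported as `PIL`).

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Assumed PyPI names for the libraries imported above
for name in ['torch', 'numpy', 'matplotlib', 'pandas', 'Pillow']:
    print(name, installed_version(name))
```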
