<a href="https://colab.research.google.com/github/mlites/mlites2019/blob/master/kaggle_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Kaggle Datasets

In this exercise, we'll learn how to get data into Colab.

First, we'll mount Google Drive, where we'll store our work.

Second, we'll search for data stored on Kaggle's servers.

Finally, we'll download some datasets we'll use in this course.

In [0]:
# FYI If you ever need to kill the VM and start fresh, uncomment and run this:
#!kill -9 -1

# the ! indicates a command is to be run on the system's command line

## Mounting Google Drive
...is easy

In [0]:
from google.colab import drive # from PACKAGE import MODULE
drive.mount('/content/gdrive') # mount your Google Drive to the /content/gdrive folder in your VM

Now we'll make a new directory in our Google Drive to work out of. This will make things easier later because the Colab Virtual Machines get recycled every once in a while, so work you've done isn't necessarily persistent across sessions.

We'll call the directory 'mlites' and add a symbolic link to it in the default /content/ location in Colab, so we can refer to it as /content/mlites.

In [0]:
!mkdir -p "/content/gdrive/My Drive/mlites"
!ln -sfn "/content/gdrive/My Drive/mlites" "/content/mlites"
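
The same setup can be done from Python, which makes it easy to keep the step safe to rerun in a new session. This is only a minimal sketch: the helper name `ensure_linked_dir` is made up for illustration, and it runs in a temporary directory rather than on a real Google Drive mount.

```python
import os
import tempfile


def ensure_linked_dir(target_dir, link_path):
    """Create target_dir if needed and make link_path a symlink to it."""
    os.makedirs(target_dir, exist_ok=True)  # like `mkdir -p`
    if os.path.islink(link_path):
        os.remove(link_path)  # drop an existing link so it can be recreated
    os.symlink(target_dir, link_path)  # like `ln -s target link`
    return os.path.realpath(link_path)


# Demonstration in a throwaway directory. In Colab the arguments would be
# "/content/gdrive/My Drive/mlites" and "/content/mlites".
base = tempfile.mkdtemp()
target = os.path.join(base, "My Drive", "mlites")
link = os.path.join(base, "mlites")

ensure_linked_dir(target, link)
resolved = ensure_linked_dir(target, link)  # a second call does not fail
print(os.path.islink(link), resolved == os.path.realpath(target))
```

If `link_path` already exists as a regular directory rather than a link, `os.symlink` raises `FileExistsError`, which is probably what you want: it stops you from hiding real files behind a link.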

## Setting up Kaggle

...is a bit more complicated

[Setting up Kaggle in Colab](https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463)

1. Sign up for Kaggle if you're not already a member
2. Go to _My Account_
3. Go to _Create New API Token_
4. That will download a file called **kaggle.json**
5. In Colab, click *Files* -> *Upload* in the left sidebar and upload **kaggle.json** (it will land in /content/)



In [0]:
# set up key authentication
# FYI in the Colab VM (Virtual Machine) your user name is 'root'

# !mkdir is the system command to make a new directory; -p avoids an error if it already exists
!mkdir -p /content/mlites/kaggle

# move the authentication file into the mlites folder so we don't have to re-upload it in the future
!mv /content/kaggle.json /content/mlites/kaggle/

# the leading . makes the directory hidden; this is where our credentials will be stored
!mkdir -p /root/.kaggle

# copy the file to the place where kaggle expects to find it
!cp /content/mlites/kaggle/kaggle.json /root/.kaggle/kaggle.json

# restrict the permissions so only your user can read the file, to avoid leaking your credentials
!chmod 600 /root/.kaggle/kaggle.json


# install the kaggle API package

# pip is a Python program for installing new packages
!pip install kaggle

# set up kaggle to use the /content/mlites/kaggle directory we made earlier
!kaggle config set -n path -v /content/mlites/kaggle
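
The copy-and-restrict step for the credentials file can also be sketched in Python. The function name `install_credentials` and the fake token below are assumptions for illustration only; the sketch writes to a temporary directory, not to /root/.kaggle.

```python
import os
import shutil
import stat
import tempfile


def install_credentials(src_json, config_dir):
    """Copy an API token into config_dir and make it readable by its owner only."""
    os.makedirs(config_dir, exist_ok=True)
    dest = os.path.join(config_dir, "kaggle.json")
    shutil.copyfile(src_json, dest)
    os.chmod(dest, 0o600)  # same effect as `chmod 600`
    return dest


# Demonstration with a fake token in a temporary directory.
work = tempfile.mkdtemp()
fake_token = os.path.join(work, "kaggle.json")
with open(fake_token, "w") as fh:
    fh.write('{"username": "example", "key": "not-a-real-key"}')

dest = install_credentials(fake_token, os.path.join(work, ".kaggle"))
mode = stat.S_IMODE(os.stat(dest).st_mode)  # permission bits only
print(oct(mode))
```

Mode 600 means the owner can read and write the file and nobody else can access it.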



## Find Kaggle datasets of interest

Kaggle package API details

https://github.com/Kaggle/kaggle-api#datasets

Only the first 20 results are shown. Additional pages can be requested with the --page flag.

In [0]:
!kaggle datasets list --tags oceans #find datasets tagged with 'oceans'
!kaggle datasets list --user noaa #find datasets by user 'NOAA'
!kaggle datasets list --search environment --page 2 #find page 2 of datasets using search term 'environment'
!kaggle datasets list --search alaska
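
Because results come back one page at a time, it can help to generate the command for each page rather than typing them out. The sketch below only builds the command lines from the `--search` and `--page` flags used above; it does not call Kaggle, and the helper name `dataset_search_commands` is hypothetical.

```python
def dataset_search_commands(term, pages):
    """Build one `kaggle datasets list` command line per requested page."""
    return [
        ["kaggle", "datasets", "list", "--search", term, "--page", str(page)]
        for page in pages
    ]


commands = dataset_search_commands("environment", range(1, 4))
for argv in commands:
    # In a notebook with the kaggle CLI configured, each command could be
    # run with subprocess.run(argv, check=True).
    print(" ".join(argv))
```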

## Iditarod dataset

Now find the specific files you want and download them

Let's see what files are available for the 2017 Iditarod dataset

In [0]:
!kaggle datasets files iditarod/iditarod-race

Let's download all the files. They'll show up in /content/mlites/kaggle/datasets.

In [0]:
!kaggle datasets download iditarod/iditarod-race

In [0]:
#unzip the files into the current working directory ("/content/")
!unzip /content/mlites/kaggle/datasets/iditarod/iditarod-race/iditarod-race.zip -d ./
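
Extraction can also be done with Python's standard `zipfile` module instead of the `unzip` command. This is a minimal sketch that builds a tiny archive in a temporary directory so it can run anywhere; the helper name `extract_archive` and the file contents are made up.

```python
import os
import tempfile
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names


# Build a small example archive, then extract it.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "example.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("report.csv", "a,b\n1,2\n")

out_dir = os.path.join(work, "out")
extracted = extract_archive(zip_path, out_dir)
print(extracted)
```

Listing the member names first is a quick way to see what an archive contains before deciding what to rename.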

In [0]:
# let's rename it to something more useful
!mv report.csv iditarod.csv

# and take a quick look
import pandas as pd # pandas is a library providing high-performance, easy-to-use data structures and data analysis tools
iditarod = pd.read_csv('iditarod.csv') # reads the CSV file into a pandas DataFrame
iditarod.shape # the shape is the (number of rows, number of columns) of the DataFrame


In [0]:
iditarod.head(10) # head() returns the first rows of the DataFrame, in this case 10
iditarod.tail(10) # tail() returns the last rows

# note that a notebook cell displays only the value of its last expression, so only the tail() table appears here as a nicely formatted table
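
To see several previews from a single cell, each result can be printed explicitly. The sketch below uses a few made-up rows read from a string instead of the real Iditarod file, so the column names and values are illustrative only.

```python
import io

import pandas as pd

# Made-up rows standing in for a real CSV file.
csv_text = """checkpoint,musher,hours
Nome,A,10.5
Nome,B,11.0
Safety,A,9.0
Safety,B,9.5
"""

df = pd.read_csv(io.StringIO(csv_text))
n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

first_rows = df.head(2)
last_rows = df.tail(2)

# print() shows each result, not just the last expression in the cell.
print(n_rows, n_cols)
print(first_rows)
print(last_rows)
```

Printed DataFrames appear as plain text rather than the formatted table a notebook shows for the last expression.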

## BIOL342 Dataset

Now we'll download the data that we'll use for examples in this course

In [0]:
!kaggle datasets list --user rec3141
!kaggle datasets files rec3141/biol342-genome-data
!kaggle datasets download rec3141/biol342-genome-data


In [0]:
!unzip /content/mlites/kaggle/datasets/rec3141/biol342-genome-data/biol342-genome-data.zip -d ./

In [0]:
covlengc = pd.read_csv('biol342_cov_len_gc.tsv', sep='\t') # read in a TAB-separated file; sep sets the column delimiter
covlengc.shape
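
If you are unsure whether a file really is tab-separated, you can peek at its header with Python's standard `csv` module before loading it with pandas. This is a sketch on made-up text; the column names are hypothetical and are not taken from the BIOL342 file.

```python
import csv
import io

# Made-up tab-separated text standing in for a real .tsv file.
tsv_text = "contig\tcoverage\tlength\tgc\nc1\t12.5\t1000\t0.41\nc2\t3.0\t250\t0.55\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)  # first line holds the column names
rows = list(reader)    # remaining lines are the data rows

print(header)
print(len(rows), "data rows")
```

If the header comes back as a single long string, the delimiter is probably something other than a tab.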

In [0]:
covlengc.head(10)

In [0]:
covlengc.tail(10)

## The End

Nice job!

In this lesson we learned:

1. How to mount our Google Drive into our Jupyter notebook
2. How to set up the Kaggle package in our Jupyter notebook
3. How to search for and download Kaggle datasets using **!kaggle datasets**
4. How to perform system commands like cp, mv, mkdir, unzip, chmod, and pip on the command line using **!**
5. How to import a text file into a pandas dataframe
6. How to preview and find some basic information about the data using the DataFrame methods **head()** and **tail()** and the **shape** attribute