# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited. (By default, Python does not re-import a module that is already loaded, because most modules are assumed not to change during a session.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

The next code cell allows the notebook to import from the `src` module.  The relevant part of the directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_train_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

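Because this notebook lives in `notebooks/exploratory`, we need to go up two parent directories (`os.pardir` twice) to reach the project root that contains `src`.  This relies on the notebook's working directory being its own folder, which is normally the case in Jupyter.  You'll want to include this code at the top of any notebook that uses the `src` code.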

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
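
Counting parent directories breaks if a notebook is moved to a different depth. As an alternative, here is a minimal sketch that walks upward until it finds a folder containing `src`. The helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of this project; the demo builds a throwaway copy of the layout above so it can run anywhere.

```python
import sys
import tempfile
from pathlib import Path


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}' directory")


def add_to_sys_path(path):
    """Append `path` to sys.path once, mirroring the cell above."""
    path_str = str(path)
    if path_str not in sys.path:
        sys.path.append(path_str)


# Demo on a temporary copy of the layout: <tmp>/notebooks/exploratory and <tmp>/src
with tempfile.TemporaryDirectory() as tmp:
    notebook_dir = Path(tmp) / "notebooks" / "exploratory"
    notebook_dir.mkdir(parents=True)
    (Path(tmp) / "src").mkdir()
    demo_root = find_project_root(notebook_dir)
    found_expected = demo_root == Path(tmp).resolve()

# In a real notebook (assumption: the kernel starts somewhere below the project root):
# add_to_sys_path(find_project_root(Path.cwd()))
```

The trade-off is that this version depends on the folder name `src` instead of on the notebook's depth in the tree.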
    

The code that downloads all of the data and loads it into a SQL database is in `data_collection`, inside the `data` module within the `src` module.  You only need to run `download_data_and_load_into_sql` once for the whole project.

In [3]:
from src.data import data_collection

In [4]:
# # Download the competition data from kaggle
# ! kaggle competitions download -c riiid-test-answer-prediction -p ../../data

The next cell may take 10-20 minutes to run, depending on your network connection and computer specs.

In [5]:
data_collection.download_data_and_load_into_sql()

Successfully created database and all tables

Successfully loaded CSV file into `train` table
        
Successfully loaded CSV file into `questions` table
        
Successfully loaded CSV file into `lectures` table
        
Successfully loaded CSV file into `example_test` table
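
Because the load only needs to happen once, it can help to check whether a table is already present before running the download again. The sketch below is one possible guard and is not part of `src`: `table_exists` is a hypothetical helper, and it assumes that the `riiid_education` database already exists and that the tables live in PostgreSQL's default `public` schema.

```python
def table_exists(conn, table_name, schema="public"):
    """Return True if `schema.table_name` exists.

    `conn` is any DB-API 2.0 connection to PostgreSQL that uses the
    `%s` parameter style, such as one returned by psycopg2.connect().
    """
    query = (
        "SELECT EXISTS ("
        "SELECT 1 FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s)"
    )
    cur = conn.cursor()
    try:
        cur.execute(query, (schema, table_name))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()


# Possible usage in this notebook (assumes the database was created earlier):
# conn = psycopg2.connect(dbname="riiid_education")
# if not table_exists(conn, "train"):
#     data_collection.download_data_and_load_into_sql()
# conn.close()
```

If the database itself has not been created yet, `psycopg2.connect` will fail before the check runs, so the very first load still has to be done directly.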
        


Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd

In [7]:
DBNAME = "riiid_education"

In [8]:
conn = psycopg2.connect(dbname=DBNAME)

In [9]:
pd.read_sql("SELECT * FROM train LIMIT 10;", conn)

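| | row_id | timestamp | user_id | content_id | content_type_id | task_container_id | user_answer | answered_correctly | prior_question_elapsed_time | prior_question_had_explanation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 115 | 5692 | 0 | 1 | 3 | 1 | | |
| 1 | 1 | 56943.0 | 115 | 5716 | 0 | 2 | 2 | 1 | 37000.0 | False |
| 2 | 2 | 118363.0 | 115 | 128 | 0 | 0 | 0 | 1 | 55000.0 | False |
| 3 | 3 | 131167.0 | 115 | 7860 | 0 | 3 | 0 | 1 | 19000.0 | False |
| 4 | 4 | 137965.0 | 115 | 7922 | 0 | 4 | 1 | 1 | 11000.0 | False |
| 5 | 5 | 157063.0 | 115 | 156 | 0 | 5 | 2 | 1 | 5000.0 | False |
| 6 | 6 | 176092.0 | 115 | 51 | 0 | 6 | 0 | 1 | 17000.0 | False |
| 7 | 7 | 194190.0 | 115 | 50 | 0 | 7 | 3 | 1 | 17000.0 | False |
| 8 | 8 | 212463.0 | 115 | 7896 | 0 | 8 | 2 | 1 | 16000.0 | False |
| 9 | 9 | 230983.0 | 115 | 7863 | 0 | 9 | 0 | 1 | 16000.0 | False |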


Notice the `LIMIT 10` above.  These tables hold a large amount of data, and **your goal is to build your main query in SQL, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient, since it has to load the rows into memory before it can work on them.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
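
To illustrate what "doing the work in SQL" can look like, here is a minimal sketch that computes per-user accuracy with `GROUP BY`, so only one summary row per user comes back. It uses an in-memory SQLite table as a stand-in for the PostgreSQL `train` table so that it runs anywhere. The rows are invented for illustration, and the filter `content_type_id = 0` assumes, following the competition's data description, that 0 marks question rows.

```python
import sqlite3

# Stand-in for the real `train` table, with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train ("
    "user_id INTEGER, content_type_id INTEGER, answered_correctly INTEGER)"
)
rows = [
    (115, 0, 1), (115, 0, 1), (115, 0, 0),   # hypothetical user with 3 questions
    (124, 0, 1), (124, 0, 0), (124, 1, -1),  # last row stands for a lecture
]
conn.executemany("INSERT INTO train VALUES (?, ?, ?)", rows)

# The database aggregates; Python only receives one row per user
query = """
SELECT user_id,
       COUNT(*) AS n_questions,
       AVG(answered_correctly) AS accuracy
FROM train
WHERE content_type_id = 0
GROUP BY user_id
ORDER BY user_id;
"""
summary = conn.execute(query).fetchall()
conn.close()

for user_id, n_questions, accuracy in summary:
    print(user_id, n_questions, round(accuracy, 3))
```

In the notebook, the same query string could be passed to `pd.read_sql(query, conn)` with the `psycopg2` connection, which returns the small summary as a DataFrame.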

Make sure you close the database connection when you are done using it.

In [10]:
conn.close()
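
If a query raises an error before `conn.close()` is reached, the connection stays open. One way to guard against that is `contextlib.closing`, sketched below with an in-memory SQLite connection as a stand-in so the example runs without a database server. The same pattern should apply to `psycopg2.connect(dbname=DBNAME)`. In psycopg2, a plain `with conn:` block commits or rolls back the transaction but leaves the connection open, which is why `closing` is used here.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even after an exception
with closing(sqlite3.connect(":memory:")) as conn:
    value = conn.execute("SELECT 1").fetchone()[0]

# Any further use of the connection now fails, confirming it was closed
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
```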