# Riiid Data Wrangling

### Table of contents<a id="TOC"></a>

* [Data Collection](#Step1)<br>
* [Data Organization](#Step2)<br>
* [Data Definition](#Step3)<br>
* [Data Cleaning](#Step4)<br>
* [Finalize Dataset](#Step5)<br>

## <a class="anchor" id="Step1">[Step one - Data Collection](#TOC)</a>

Organize your data to streamline the next steps of your capstone.

1. Import all the libraries
2. Data loading
3. Data joining

In [1]:
import pandas as pd

In [2]:
path = '/Users/adelsadr/Springboard/CapstoneTwoRiiid/riiid-test-answer-prediction/'
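The join step listed above is not shown in this notebook. A minimal sketch follows, assuming (per the competition's data description, not this notebook) that `content_id` in the interaction data refers to `question_id` in `questions.csv` whenever `content_type_id` is 0. The helper name `join_questions` and the demo frames are hypothetical.

```python
import pandas as pd


def join_questions(interactions: pd.DataFrame, questions: pd.DataFrame) -> pd.DataFrame:
    """Attach question metadata to the question rows of an interactions table."""
    # Keep only question events; lecture rows (content_type_id == 1)
    # have no counterpart in the questions table.
    question_rows = interactions[interactions["content_type_id"] == 0]
    return question_rows.merge(
        questions,
        how="left",
        left_on="content_id",
        right_on="question_id",
        validate="many_to_one",  # raises if question_id is not unique
    )


# Tiny made-up frames, only to show the shape of the result.
demo_interactions = pd.DataFrame({
    "row_id": [0, 1, 2],
    "content_id": [0, 1, 89],
    "content_type_id": [0, 0, 1],
})
demo_questions = pd.DataFrame({
    "question_id": [0, 1],
    "part": [1, 1],
    "tags": ["51 131 162 38", "131 36 81"],
})
joined = join_questions(demo_interactions, demo_questions)
print(joined)
```

On the full `train` frame this merge is memory-hungry, so it may be worth running it on a sample first.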

## <a class="anchor" id="Step2">[Step two - Data Organization](#TOC)</a>

Create a file structure and add your work to the GitHub repository you’ve created for this project.

1. File structure
2. GitHub
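One possible way to set up the file structure is sketched below. The folder names (`data/raw`, `data/interim`, `data/processed`, `notebooks`, `reports`) are assumptions chosen for illustration, not a layout prescribed by this project, and the demo runs in a temporary directory so that nothing is written to the working tree.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; adjust the names to match your repository.
DEFAULT_SUBDIRS = ("data/raw", "data/interim", "data/processed", "notebooks", "reports")


def make_project_dirs(root, subdirs=DEFAULT_SUBDIRS):
    """Create each sub-directory under root and return the created paths."""
    root = Path(root)
    created = []
    for sub in subdirs:
        target = root / sub
        target.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(target)
    return created


# Demo inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_dirs(tmp)
    created_names = sorted(p.relative_to(tmp).as_posix() for p in created)
    all_exist = all(p.is_dir() for p in created)

print(created_names)
```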

## <a class="anchor" id="Step3">[Step three - Data Definition](#TOC)</a>

Gain an understanding of your data features to inform the next steps of your project.

1. Column names
2. Data types
3. Description of the columns
4. Counts and percentages of unique values
5. Ranges of values
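The cells in this section rely on `info()` and `describe()`, which do not report unique-value counts. As a rough sketch, the checklist above could be covered by one helper; `describe_columns` is a hypothetical name, and the demo frame reuses the five `lectures.csv` rows printed further down in this notebook.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, unique count and percent, min, max."""
    numeric = df.select_dtypes(include="number")
    n_unique = df.nunique()  # NaN is not counted as a distinct value
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": n_unique,
        "pct_unique": (n_unique / len(df) * 100).round(2),
        "min": numeric.min(),  # stays NaN for non-numeric columns
        "max": numeric.max(),
    })
    # Restore the original column order after index alignment.
    return summary.reindex(df.columns)


demo = pd.DataFrame({
    "lecture_id": [89, 100, 185, 192, 317],
    "tag": [159, 70, 45, 79, 156],
    "part": [5, 1, 6, 5, 5],
    "type_of": ["concept", "concept", "concept", "solving question", "solving question"],
})
summary = describe_columns(demo)
print(summary)
```

Calling `describe_columns(train)` on the full training set computes `nunique()` for every column, which can take a while on roughly 100 million rows.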

In [3]:
example_sample_submission = pd.read_csv(path + 'example_sample_submission.csv')
example_sample_submission.info()
print('\n\n\n')
print(example_sample_submission.head())
print('\n\n\n')
example_sample_submission.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   row_id              104 non-null    int64  
 1   answered_correctly  104 non-null    float64
 2   group_num           104 non-null    int64  
dtypes: float64(1), int64(2)
memory usage: 2.6 KB




   row_id  answered_correctly  group_num
0       0                 0.5          0
1       1                 0.5          0
2       2                 0.5          0
3       3                 0.5          0
4       4                 0.5          0






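| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| row_id | 104.0 | 53.480769 | 32.012734 | 0.0 | 25.75 | 53.5 | 80.25 | 108.0 |
| answered_correctly | 104.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| group_num | 104.0 | 1.711538 | 1.09432 | 0.0 | 1.0 | 2.0 | 3.0 | 3.0 |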


In [4]:
example_test = pd.read_csv(path + 'example_test.csv')
example_test.info()
print('\n\n\n')
print(example_test.head())
print('\n\n\n')
example_test.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   row_id                          104 non-null    int64  
 1   group_num                       104 non-null    int64  
 2   timestamp                       104 non-null    int64  
 3   user_id                         104 non-null    int64  
 4   content_id                      104 non-null    int64  
 5   content_type_id                 104 non-null    int64  
 6   task_container_id               104 non-null    int64  
 7   prior_question_elapsed_time     103 non-null    float64
 8   prior_question_had_explanation  103 non-null    object 
 9   prior_group_answers_correct     4 non-null      object 
 10  prior_group_responses           4 non-null      object 
dtypes: float64(1), int64(7), object(3)
memory usage: 9.1+ KB




   row_id  group_num    timestamp   

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| row_id | 104.0 | 53.48077 | 32.01273 | 0.0 | 25.75 | 53.5 | 80.25 | 108.0 |
| group_num | 104.0 | 1.711538 | 1.09432 | 0.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| timestamp | 104.0 | 16692830000.0 | 20396550000.0 | 0.0 | 1574161000.0 | 4693169000.0 | 27465570000.0 | 76816460000.0 |
| user_id | 104.0 | 1156358000.0 | 596609600.0 | 7792299.0 | 644823300.0 | 1257605000.0 | 1609175000.0 | 2103437000.0 |
| content_id | 104.0 | 6772.702 | 4503.711 | 18.0 | 2028.75 | 6225.5 | 11031.25 | 13354.0 |
| content_type_id | 104.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| task_container_id | 104.0 | 1094.413 | 1588.054 | 0.0 | 162.75 | 378.5 | 1179.0 | 6958.0 |
| prior_question_elapsed_time | 103.0 | 24949.63 | 9969.309 | 7000.0 | 18000.0 | 24667.0 | 28875.0 | 72400.0 |


In [5]:
lectures = pd.read_csv(path + 'lectures.csv')
lectures.info()
print('\n\n\n')
print(lectures.head())
print('\n\n\n')
lectures.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   lecture_id  418 non-null    int64 
 1   tag         418 non-null    int64 
 2   part        418 non-null    int64 
 3   type_of     418 non-null    object
dtypes: int64(3), object(1)
memory usage: 13.2+ KB




   lecture_id  tag  part           type_of
0          89  159     5           concept
1         100   70     1           concept
2         185   45     6           concept
3         192   79     5  solving question
4         317  156     5  solving question






| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| lecture_id | 418.0 | 16983.401914 | 9426.16466 | 89.0 | 9026.25 | 17161.5 | 24906.25 | 32736.0 |
| tag | 418.0 | 94.480861 | 53.586487 | 0.0 | 50.25 | 94.5 | 140.0 | 187.0 |
| part | 418.0 | 4.267943 | 1.872424 | 1.0 | 2.0 | 5.0 | 6.0 | 7.0 |


In [6]:
questions = pd.read_csv(path + 'questions.csv')
questions.info()
print('\n\n\n')
print(questions.head())
print('\n\n\n')
questions.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13523 entries, 0 to 13522
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question_id     13523 non-null  int64 
 1   bundle_id       13523 non-null  int64 
 2   correct_answer  13523 non-null  int64 
 3   part            13523 non-null  int64 
 4   tags            13522 non-null  object
dtypes: int64(4), object(1)
memory usage: 528.4+ KB




   question_id  bundle_id  correct_answer  part            tags
0            0          0               0     1   51 131 162 38
1            1          1               1     1       131 36 81
2            2          2               0     1  131 101 162 92
3            3          3               0     1  131 149 162 29
4            4          4               3     1    131 5 162 38






| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| question_id | 13523.0 | 6761.0 | 3903.89818 | 0.0 | 3380.5 | 6761.0 | 10141.5 | 13522.0 |
| bundle_id | 13523.0 | 6760.510907 | 3903.857783 | 0.0 | 3379.5 | 6761.0 | 10140.0 | 13522.0 |
| correct_answer | 13523.0 | 1.455298 | 1.149707 | 0.0 | 0.0 | 1.0 | 3.0 | 3.0 |
| part | 13523.0 | 4.264956 | 1.652553 | 1.0 | 3.0 | 5.0 | 5.0 | 7.0 |


In [7]:
train = pd.read_csv(path + 'train.csv')
train.info()
print('\n\n\n')
print(train.head())
print('\n\n\n')
train.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101230332 entries, 0 to 101230331
Data columns (total 10 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   row_id                          int64  
 1   timestamp                       int64  
 2   user_id                         int64  
 3   content_id                      int64  
 4   content_type_id                 int64  
 5   task_container_id               int64  
 6   user_answer                     int64  
 7   answered_correctly              int64  
 8   prior_question_elapsed_time     float64
 9   prior_question_had_explanation  object 
dtypes: float64(1), int64(8), object(1)
memory usage: 7.5+ GB




   row_id  timestamp  user_id  content_id  content_type_id  task_container_id  \
0       0          0      115        5692                0                  1   
1       1      56943      115        5716                0                  2   
2       2     118363      115     

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| row_id | 101230332.0 | 50615170.0 | 29222680.0 | 0.0 | 25307580.0 | 50615170.0 | 75922750.0 | 101230300.0 |
| timestamp | 101230332.0 | 7703644000.0 | 11592660000.0 | 0.0 | 524343600.0 | 2674234000.0 | 9924551000.0 | 87425770000.0 |
| user_id | 101230332.0 | 1076732000.0 | 619716300.0 | 115.0 | 540811600.0 | 1071781000.0 | 1615742000.0 | 2147483000.0 |
| content_id | 101230332.0 | 5219.605 | 3866.359 | 0.0 | 2063.0 | 5026.0 | 7425.0 | 32736.0 |
| content_type_id | 101230332.0 | 0.01935222 | 0.1377596 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| task_container_id | 101230332.0 | 904.0624 | 1358.302 | 0.0 | 104.0 | 382.0 | 1094.0 | 9999.0 |
| user_answer | 101230332.0 | 1.376123 | 1.192896 | -1.0 | 0.0 | 1.0 | 3.0 | 3.0 |
| answered_correctly | 101230332.0 | 0.6251644 | 0.5225307 | -1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| prior_question_elapsed_time | 101230332.0 non-null count 98878794.0 | 25423.81 | 19948.15 | 0.0 | 16000.0 | 21000.0 | 29666.0 | 300000.0 |
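With default dtypes, `train.info()` reports more than 7.5 GB of memory. The ranges in the summary above suggest that most columns fit in narrower types, so one option is to pass an explicit `dtype` mapping to `pd.read_csv`. The mapping below is a sketch that assumes the reported minima and maxima hold for every row (for example, that `user_id` stays within the int32 range); `TRAIN_DTYPES` and `load_train` are hypothetical names, and the two demo rows are partly made up.

```python
import io

import pandas as pd

# Narrower dtypes chosen from the ranges reported by describe() above.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",            # max is about 8.7e10, too large for int32
    "user_id": "int32",              # assumes every id fits in int32
    "content_id": "int16",           # max 32736
    "content_type_id": "int8",       # 0 or 1
    "task_container_id": "int16",    # max 9999
    "user_answer": "int8",           # -1 to 3
    "answered_correctly": "int8",    # -1 to 1
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",  # nullable boolean, keeps missing values
}


def load_train(source, nrows=None):
    """Read train.csv (a path or file-like object) with compact dtypes."""
    return pd.read_csv(source, dtype=TRAIN_DTYPES, nrows=nrows)


# Illustrative two-row sample; the last four values of each row are made up.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)
small = load_train(sample_csv)
print(small.dtypes)
```

In the notebook this would be called as `load_train(path + 'train.csv')`, optionally with `nrows` to work on a subset first.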


## <a class="anchor" id="Step4">[Step four - Data Cleaning](#TOC)</a>

Clean up the data in order to prepare it for the next steps of your project.

1. NA or missing values
2. Duplicates
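No cleaning code appears in this notebook yet. The sketch below shows one way to inspect both items before deciding how to treat them; it only reports missing values and removes exact duplicate rows, and it does not choose an imputation strategy. The helper names `missing_report` and `drop_duplicate_rows` and the demo frame are hypothetical.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    return report[report["n_missing"] > 0]


def drop_duplicate_rows(df: pd.DataFrame, subset=None):
    """Return the de-duplicated frame and the number of rows removed."""
    n_dupes = int(df.duplicated(subset=subset).sum())
    deduped = df.drop_duplicates(subset=subset).reset_index(drop=True)
    return deduped, n_dupes


# Made-up frame with one missing value and one repeated row.
demo = pd.DataFrame({
    "row_id": [0, 1, 2, 2],
    "prior_question_elapsed_time": [None, 37000.0, 55000.0, 55000.0],
})
report = missing_report(demo)
deduped, n_dupes = drop_duplicate_rows(demo)
print(report)
print(n_dupes)
```

In the output above, `prior_question_elapsed_time` has 98,878,794 non-missing values out of 101,230,332 rows in `train`, so it is the first column such a report would flag.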

## <a class="anchor" id="Step5">[Step five - Finalize Dataset](#TOC)</a>

Save the cleaned data to a CSV file for use in later stages of the project.
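A minimal sketch of the save step, assuming a hypothetical helper `save_clean` and an output location of your choosing. The demo writes to a temporary directory and reads the file back to confirm that the round trip preserves the data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean(df: pd.DataFrame, out_path) -> Path:
    """Write df to out_path as CSV, creating parent folders if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)  # the RangeIndex carries no information
    return out_path


# Round-trip demo with a made-up frame.
demo = pd.DataFrame({"row_id": [0, 1, 2], "answered_correctly": [1, 0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    written = save_clean(demo, Path(tmp) / "processed" / "train_clean.csv")
    round_trip = pd.read_csv(written)

print(round_trip.equals(demo))
```

Note that CSV does not store dtypes, so any compact dtypes chosen at load time need to be passed again when the file is read back.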