In [None]:
library(tidyverse)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(data.table)
library(lme4)
library(arm)
library(grid)

In [None]:
library(bayesplot)

In [None]:
library(rstanarm)

# **1. Train Data**

In [None]:
path = "../input/riiid-test-answer-prediction/"
list.files(path)

**There are five files. I will start with train.csv, which is the main data.**

In [None]:
train <- fread(paste0(path, "train.csv"),
               na.strings=c("", "NULL"))


In [None]:
glimpse(train)

**There are 101,230,332 rows and 10 columns. The columns are described as:**

* row_id: (int64) ID code for the row.

* timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

* user_id: (int32) ID code for the user.
 
* content_id: (int16) ID code for the user interaction
 
* content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
 
* task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
 
* user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
 
* answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.
 
* prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.
 
* prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.

## **The full dataset is too large to work with comfortably, so I will keep only the first 2 million rows for EDA.**

In [None]:
data <- read.csv("../input/riiid-test-answer-prediction/train.csv", header = T, na.strings = c("","NA"), nrows = 2000000)

In [None]:
head(data,n=5)

## Checking NAs

In [None]:
map(data, ~sum(is.na(.)))

In [None]:
sapply(data, function(x)sum(is.na(x))) / nrow(data) * 100

**There are 46,676 missing values in $prior_question_elapsed_time, about 2.33% of the column.**

**There are 7,693 missing values in $prior_question_had_explanation, about 0.38% of the column.**
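
The NA counts and percentages above can also be collected into one small table. The sketch below uses a made-up toy data frame (not the competition data) and base R only; the name `na_summary` is an assumption for illustration.

```r
# Toy stand-in for the training sample (values are made up for illustration)
toy <- data.frame(
  prior_question_elapsed_time = c(NA, 21000, 18000, NA, 25000),
  prior_question_had_explanation = c(NA, TRUE, TRUE, FALSE, TRUE),
  answered_correctly = c(1, 0, 1, 1, 0)
)

# is.na() on a data frame returns a logical matrix, so colSums() and
# colMeans() give the NA count and NA share per column in one pass
na_summary <- data.frame(
  na_count = colSums(is.na(toy)),
  na_percent = round(colMeans(is.na(toy)) * 100, 2)
)
na_summary
```

Keeping count and percentage side by side makes it easier to quote both numbers without recomputing them by hand.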

## Understanding timestamp:
**timestamp: time between the first interaction and this interaction, timestamp = 0 if this is the first interaction.**

In [None]:
hist(data$timestamp, main = "Distribution of timestamp", xlab = "timestamp")

**Most timestamps fall in the first bin, near 0, which suggests that many users have only one or a few interactions.**
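
A histogram bin covers a wide range of values, so a tall first bar does not by itself show how many timestamps are exactly 0. A direct check could look like the sketch below, which uses a made-up toy log (not the real data) and hypothetical names `share_zero` and `share_single`.

```r
# Toy interaction log (made-up values): user 1 has three events,
# users 2 and 3 have one event each
toy <- data.frame(
  user_id = c(1, 1, 1, 2, 3),
  timestamp = c(0, 5000, 90000, 0, 0)
)

# Share of rows whose timestamp is exactly 0
share_zero <- mean(toy$timestamp == 0)

# Share of users with a single interaction
events_per_user <- table(toy$user_id)
share_single <- mean(events_per_user == 1)

c(share_zero = share_zero, share_single = share_single)
```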

**Next I want to see whether a timestamp of 0 affects answer correctness.**

In [None]:
data1 <- data %>% filter(answered_correctly != -1)
data2<- data1 %>% group_by(user_id) %>% summarise(count = sum(answered_correctly))
data1$answered_correctly <- factor(data1$answered_correctly,levels = c('0','1'))

In [None]:
data1 <- merge(data1,data2,by="user_id")

In [None]:
head(data1,n=100)

In [None]:
d1 <- data1 %>% filter(timestamp == 0)
d2 <- data1 %>% filter(timestamp != 0)

In [None]:
# Plot the share of incorrect (0) and correct (1) answers within each subset
g1 <- ggplot(d1, aes(x = answered_correctly, fill = answered_correctly)) +
        geom_bar(aes(y = after_stat(count / sum(count)))) +
        labs(title = "timestamp = 0", y = "proportion")

g2 <- ggplot(d2, aes(x = answered_correctly, fill = answered_correctly)) +
        geom_bar(aes(y = after_stat(count / sum(count)))) +
        labs(title = "timestamp != 0", y = "proportion")
 

grid.arrange(g1,g2)

**The comparison shows that whether the timestamp is 0 has little effect on answer correctness.**
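
The same comparison can be read off a table of row-wise proportions, which avoids judging bar heights by eye. This is a minimal sketch on made-up toy data; the column `first_event` and the object `prop_by_group` are hypothetical names.

```r
# Toy answers (made-up values): four first events and six later events
toy <- data.frame(
  timestamp = c(0, 0, 0, 0, 1200, 5000, 80000, 90000, 120000, 150000),
  answered_correctly = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1)
)
toy$first_event <- ifelse(toy$timestamp == 0, "timestamp = 0", "timestamp != 0")

# margin = 1 normalises each row, so every group sums to 1
prop_by_group <- prop.table(
  table(toy$first_event, toy$answered_correctly),
  margin = 1
)
round(prop_by_group, 2)
```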

## Understanding content_type_id:
0 if the user is answering questions, 1 if the user is watching a lecture.

In [None]:
ggplot(data, aes(x=factor(1), fill=factor(content_type_id)))+
  geom_bar(width = 1) +
  coord_polar("y")

In [None]:
tab <- matrix(c(mean(data$content_type_id == '1'), mean(data$content_type_id == '0')),ncol = 2)
colnames(tab) <- c("Lecture","Question")
tab

**About 2% of the interactions are lectures and 98% are questions.**

## Understanding prior_question_elapsed_time:
**Average time in milliseconds the user took per question in the previous question bundle; null for the first question bundle.**

In [None]:
hist(data$prior_question_elapsed_time, main = "Distribution of prior_question_elapsed_time", xlab = "prior_question_elapsed_time",xlim=c(0,200000))

**Most elapsed times are below 40,000 ms (40 seconds).**
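
A statement like this can be backed with quantiles and a direct share, which are less affected by the long right tail than a histogram reading. The values below are made up for illustration and the names `q` and `share_below_40s` are hypothetical.

```r
# Made-up elapsed times in milliseconds, with NA for a first bundle
elapsed <- c(NA, 12000, 17000, 21000, 24000, 26000, 33000, 38000, 45000, 300000)

# Quartiles of the non-missing values
q <- quantile(elapsed, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Share of non-missing values below 40 seconds
share_below_40s <- mean(elapsed < 40000, na.rm = TRUE)

q
share_below_40s
```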

## Number of interactions

In [None]:
df <- data.frame(table(data$user_id))
colnames(df)<- c("user_id","inter_num")
df1 <- arrange(df, desc(inter_num))[1:10,]

In [None]:
hist(df$inter_num,main="Distribution of interaction counts",xlab="number of interactions",breaks="FD",xlim=c(0,800))

## Top 10 users by interaction counts

In [None]:
df1$user_id <- factor(df1$user_id, levels = df1$user_id)

In [None]:
ggplot(df1, aes(user_id, inter_num, fill = user_id)) + geom_col() + coord_flip() +
    scale_fill_brewer(palette="Spectral")

## Understanding Content_id:

**ID for user interaction**

In [None]:
content_used <- data.frame(table(data$content_id))
colnames(content_used )<- c("content_id","used_counts")
content_used1 <- content_used[order(-content_used$used_counts),][1:10,]

In [None]:
content_used1$content_id <- factor(content_used1$content_id, levels = content_used1$content_id)

In [None]:
hist(content_used$used_counts, main = "Distribution of content_id usage counts", xlab = "number of times used", xlim = c(0,2000))

## Top 10 most used content_id

In [None]:
ggplot(content_used1, aes(content_id, used_counts, fill = content_id)) + geom_col() + coord_flip() +
    scale_fill_brewer(palette="Spectral")

##  Understanding answered_correctly

**0 if answer is wrong, 1 if answer is correct, -1 if content type is lecture.**

**As mentioned before, about 2% of the interactions are lectures. Ignore them and look only at questions.**

In [None]:
question <- data %>% filter(content_type_id == 0)
answer <- data.frame(table(question$answered_correctly))
colnames(answer) <- c("Answer_correctness","Count")

In [None]:
ggplot(answer, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
geom_bar(stat="identity") +
geom_text(aes(label = Count), vjust = 2, color= "black", size=5)

**About one third of the answers are incorrect.**

## Understanding user_answer:

**the user's answer to the question, read -1 for lectures.**

In [None]:
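user_answer <- data.frame(table(data$user_answer))
# Name the columns first so that filter() can refer to the user_answer column
colnames(user_answer) <- c("user_answer","count")
# Drop the -1 placeholder used for lectures
user_answer1 <- user_answer %>% filter(user_answer != "-1")
user_answer1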

**Excluding -1, there are four user answer groups: 0, 1, 2 and 3.**

In [None]:
ggplot(user_answer1, aes(x = user_answer, y = count, fill = user_answer))+ 
geom_bar(stat="identity") +
geom_text(aes(label = count), vjust = 2, color= "black", size=5)

**Do the proportions of correct and incorrect answers differ across the four answer groups?**

In [None]:
f <- function(n){
    answer <- data %>% filter(data$user_answer == n)
    answer <- data.frame(table(answer$answered_correctly))
    colnames(answer) <- c("Answer_correctness","Count")
    return(answer)
}

answer0 <- f(0)
answer1 <- f(1)
answer2 <- f(2)
answer3 <- f(3)


In [None]:
g3 <- ggplot(answer0, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) + 
        labs(title = "answer group 0")

g4 <- ggplot(answer1, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) +
        labs(title = "answer group 1")

g5 <- ggplot(answer2, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) +
        labs(title = "answer group 2")

g6 <- ggplot(answer3, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) +
        labs(title = "answer group 3")


grid.arrange(g3,g4,g5,g6)

**It turns out the percentage of correct answers is similar in each group.**
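
The four panels can be condensed into one cross-tabulation with one row per answer group. The sketch below runs on made-up toy data; `counts` and `rates` are hypothetical names.

```r
# Toy question rows (made-up values)
toy <- data.frame(
  user_answer = c(0, 0, 1, 1, 1, 2, 2, 3, 3, 3),
  answered_correctly = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0)
)

# Counts per answer group and correctness flag
counts <- table(
  user_answer = toy$user_answer,
  answered_correctly = toy$answered_correctly
)

# Row-wise shares: each answer group sums to 1
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

The column for `answered_correctly = 1` is then the correct rate of each answer group, which can be compared directly.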

## prior_question_had_explanation: 
**Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture.**

**Earlier, we found 7693 NAs in the prior_question_had_explanation column.**

**Let's see the percentages of True, False and NA.**

In [None]:
prior_a <- data.frame(table(data$prior_question_had_explanation))
colnames(prior_a) <- c("explanation_exists","count")
# Count the NAs directly instead of hard-coding the number
row <- data.frame("explanation_exists" = "NA", count = sum(is.na(data$prior_question_had_explanation)))
prior_a <- rbind(prior_a,row)
prior_a$count <- as.numeric(prior_a$count)

In [None]:
ggplot(prior_a, aes(x = explanation_exists, y = count, fill = explanation_exists)) +
geom_bar(stat="identity") +
geom_text(aes(label = round(count / sum(count), digits=3)), vjust = 2, color= "black", size=5) 

**About 89% of the prior questions had an explanation.**

**How does the correct answer change among the 3 groups?**

In [None]:
f1 <- function(n){
    # Compare as lower-case text so the filter works whether the column was read as logical or character
    answer <- data %>% filter(tolower(as.character(prior_question_had_explanation)) == tolower(n))
    answer <- data.frame(table(answer$answered_correctly))
    colnames(answer) <- c("Answer_correctness","Count")
    return(answer)
}
# Avoid the names T and F, which would mask R's built-in TRUE/FALSE shortcuts
expl_false <- f1("False")
expl_true <- f1("True")

In [None]:
None <- subset(data, is.na(data$prior_question_had_explanation))
NAs <- data.frame(table(None$answered_correctly))
colnames(NAs) <- c("Answer_correctness","Count")

In [None]:
gg1 <- ggplot(expl_true, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) + 
        labs(title = "Prior question has explanation = TRUE")

gg2 <- ggplot(expl_false, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) + 
        labs(title = "Prior question has explanation = FALSE")

gg3 <- ggplot(NAs, aes(x = Answer_correctness, y = Count, fill = Answer_correctness)) +
        geom_bar(stat="identity") +
        geom_text(aes(label = round(Count/sum(Count),digits = 2)), vjust = 2, color= "black", size=3) + 
        labs(title = "Prior question has explanation = NA")

grid.arrange(gg1,gg2,gg3)



**We can see that users answer correctly more often when the prior question bundle had an explanation.**
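
The three groups (TRUE, FALSE and NA) can also be compared in one step by turning NA into its own factor level. This is a minimal sketch on made-up toy data, assuming the column is logical; `group` and `correct_rate` are hypothetical names.

```r
# Toy question rows (made-up values); NA marks a user's first bundle
toy <- data.frame(
  prior_question_had_explanation = c(NA, NA, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  answered_correctly = c(1, 0, 0, 1, 1, 1, 0, 1)
)

# addNA() makes NA a regular factor level, so tapply() keeps that group
group <- addNA(factor(toy$prior_question_had_explanation))

# Mean of a 0/1 column is the correct rate within each group
correct_rate <- tapply(toy$answered_correctly, group, mean)
correct_rate
```

Note that this only describes an association in the sample: the NA group consists of first bundles, so differences may also reflect how early the question appears.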

## Distribution of task_container_id

In [None]:
hist(data$task_container_id, main = "Distribution of task_container_id", xlab = "task_container_id")

## Distribution of user accuracy

In [None]:
question <- data %>% filter(content_type_id == 0)
ques_num <- as.data.frame(table(question$user_id))
colnames(ques_num) <- c("user_id","ques_num")
#ques_num[order(-ques_num$ques_num),]
try <- question %>% group_by(user_id) %>% dplyr::summarise(correct_num = sum(answered_correctly))

In [None]:
result <- merge(ques_num,try, by="user_id")
result$correctRate <- result$correct_num / result$ques_num
result <- merge(df,result,by="user_id")
result$lec_num <- result$inter_num - result$ques_num
result <- result[order(-result$inter_num),]

In [None]:
result$lec <- ifelse(result$lec_num > 0, 1, 0)
result

In [None]:
ggplot(result,aes(x = correctRate * 100))+
geom_histogram(breaks = seq(0, 100, 2),fill = 'purple', alpha = 0.7) +
labs(title = "distribution of answer correctness rate", x = "Correctness Rate in %")

**Divide the users into 2 groups: users with 50 or more question interactions are defined as experienced, and users with fewer than 50 as new.**

**Then I want to see their correct rates separately.**

In [None]:
more <- result %>% filter(ques_num >= 50)
less <- result %>% filter(ques_num < 50)
me <- result %>% filter(ques_num >= 500)
a <- data.frame("new" = mean(less$correctRate), "new_count" = nrow(less), 
                "experienced" = mean(more$correctRate),"experienced_count" = nrow(more),
                "more_experienced" = mean(me$correctRate), "more_experienced_count" = nrow(me))

In [None]:
a

**Experienced users (question interaction counts >= 50) have a higher correct rate than new users (question interaction counts < 50).**

**More experienced users (question interaction counts >= 500) also have a higher correct rate than experienced users, but only slightly.**
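
The grouping can also be expressed with `cut()`, which assigns every user to exactly one bin. Note this is a variation on the cells above: there the "experienced" group (>= 50) also contains the ">= 500" users, while the sketch below uses non-overlapping bins. The data and the names `experience` and `group_summary` are made up for illustration.

```r
# Made-up per-user summary: questions answered and correct rate
toy <- data.frame(
  ques_num = c(10, 30, 49, 50, 120, 499, 500, 900),
  correctRate = c(0.40, 0.50, 0.45, 0.60, 0.62, 0.64, 0.66, 0.68)
)

# right = FALSE gives the intervals [0, 50), [50, 500), [500, Inf)
toy$experience <- cut(
  toy$ques_num,
  breaks = c(0, 50, 500, Inf),
  right = FALSE,
  labels = c("new", "experienced", "more_experienced")
)

# Mean correct rate and number of users per bin
group_summary <- aggregate(correctRate ~ experience, data = toy, FUN = mean)
group_summary$users <- as.vector(table(toy$experience))
group_summary
```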

## Distribution of correct rate by question

In [None]:
ques_num2 <- data.frame(table(question$content_id))
colnames(ques_num2) <- c("content_id","ques_num")
try2 <- question %>% group_by(content_id) %>% dplyr::summarise(correct_num = sum(answered_correctly))

In [None]:
result2 <- merge(ques_num2, try2, by="content_id")
result2$correctRate <- result2$correct_num / result2$ques_num
result2$false_num <- result2$ques_num - result2$correct_num
result2 <- result2 %>% relocate(false_num, .before = correctRate)

In [None]:
ggplot(result2,aes(x = correctRate * 100))+
geom_histogram(breaks = seq(0, 100, 2),fill = 'purple', alpha = 0.7) +
labs(title = "distribution of correct rate by question", x = "Correctness Rate in %")

**We can see that many questions have a correct rate of 100% and some have a correct rate of 0%, while most have a correct rate of around 75%.**
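
One possible explanation for the spikes at 0% and 100% is that some questions were attempted only a few times in this sample; this is an assumption, not something verified here. A sketch of how to check it on made-up toy data follows, where `min_attempts` is a hypothetical threshold.

```r
# Toy question rows (made-up values)
toy <- data.frame(
  content_id = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4),
  answered_correctly = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0)
)

# Attempts and correct rate per question
attempts <- tapply(toy$answered_correctly, toy$content_id, length)
correct_rate <- tapply(toy$answered_correctly, toy$content_id, mean)
per_question <- data.frame(
  content_id = names(attempts),
  attempts = as.vector(attempts),
  correct_rate = as.vector(correct_rate)
)

# Keep only questions with enough attempts for a stable rate
min_attempts <- 3  # hypothetical threshold
reliable <- per_question[per_question$attempts >= min_attempts, ]
reliable
```

If the extreme rates disappear after filtering, they were mostly produced by questions with very few attempts.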

# **2. Question Data**

In [None]:
ques <- read.csv("../input/riiid-test-answer-prediction/questions.csv",header=TRUE, na.strings = c("","NA"))
head(ques,n=5)

In [None]:
glimpse(ques)

**There are 13,523 rows and 5 columns, described as:**

* question_id: foreign key for the train/test content_id column, when the content type is question (0).

* bundle_id: code for which questions are served together.

* correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

* part: the relevant section of the TOEIC test.

* tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

In [None]:
map(ques, ~sum(is.na(.)))

**There is 1 missing value under $tags**
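
Because tags holds several codes in one space-separated string, counting tags means splitting each string first and skipping the missing row. Below is a minimal base R sketch on made-up toy questions; `tag_list`, `top_tags` and the cut-off of 2 are hypothetical choices for illustration.

```r
# Toy questions (made-up tag codes), including one missing tags value
toy <- data.frame(
  question_id = 0:4,
  tags = c("51 131 162", "131 36", NA, "51 131", "9"),
  stringsAsFactors = FALSE
)

# Drop the row without tags before splitting
tagged <- toy[!is.na(toy$tags), ]

# One character vector of tag codes per question
tag_list <- strsplit(tagged$tags, " ", fixed = TRUE)

# Frequency of every tag across all questions, most frequent first
tag_counts <- sort(table(unlist(tag_list)), decreasing = TRUE)
top_tags <- names(tag_counts)[1:2]

# Flag questions that carry at least one of the top tags
tagged$hot <- as.integer(sapply(tag_list, function(tags) any(tags %in% top_tags)))
tagged
```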

In [None]:
ques <- ques %>% filter(!is.na(tags))

### Part

In [None]:
part <- data.frame(table(ques$part))
colnames(part) <- c("part_number","count")
part <- arrange(part,desc(count))

In [None]:
ggplot(part,aes(x=reorder(part_number,-count), y = count,fill=part_number)) +
geom_bar(stat='identity') +
labs(title = "Distribution of $part", x = "Part Number")

**Part 5 appears most often.**

### Top 20 question tags

In [None]:
# str_split() is vectorised, so all tag strings can be split and flattened in one call
tagCount <- data.frame("tag" = as.numeric(unlist(str_split(ques$tags, " "))))

In [None]:
tc <- data.frame(table(tagCount$tag))
colnames(tc) <- c("tag","count")
tc1 <- arrange(tc,desc(count))[1:20,]

hot_tag <- tc1$tag

In [None]:
ggplot(tc1,aes(x=reorder(tag,-count), y = count,fill=tag)) +
geom_bar(stat='identity') +
labs(title = "Top 20 tags", x = "Tag Number")

In [None]:
ques$hot <- 0

In [None]:
# Flag questions that carry at least one of the top 20 tags
hot_tags <- as.character(hot_tag)
ques$hot <- sapply(str_split(ques$tags, " "),
                   function(tag_row) as.integer(any(tag_row %in% hot_tags)))

In [None]:
dat <- data %>% filter(content_type_id == 0) %>% dplyr::select("user_id","content_id","task_container_id","answered_correctly")
colnames(dat)[2] <- "question_id"
dat <- merge(dat,ques,by="question_id")

# **3. Lecture Data**

In [None]:
lec <- read.csv("../input/riiid-test-answer-prediction/lectures.csv",header=TRUE, na.strings = c("","NA"))
head(lec,n=10)

In [None]:
glimpse(lec)

**There are 418 rows and 4 columns, described as:**

* lecture_id: foreign key for the train/test content_id column, when the content type is lecture (1).

* part: top level category code for the lecture.

* tag: one tag codes for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

* type_of: brief description of the core purpose of the lecture

In [None]:
map(lec, ~sum(is.na(.)))

**There is no missing value.**

### Part Distribution

In [None]:
part1 <- data.frame(table(lec$part))
colnames(part1) <- c("part_number","count")
part1 <- arrange(part1,desc(count))

In [None]:
ggplot(part1,aes(x=reorder(part_number,-count), y = count,fill=part_number)) +
geom_bar(stat='identity') +
labs(title = "Distribution of $part", x = "Part Number")

### Top 30 Lecture Tags

In [None]:
tg <- data.frame(table(lec$tag))
colnames(tg) <- c("tag","count")
tg1 <- arrange(tg,desc(count))[1:30,]

In [None]:
ggplot(tg1,aes(x=reorder(tag,-count), y = count,fill=tag)) +
geom_bar(stat='identity') +
labs(title = "Top 30 tags", x = "Tag Number")

### Distribution of lecture type

In [None]:
type <- data.frame(table(lec$type_of))
colnames(type) <- c("type","count")
type <- arrange(type,desc(count))

In [None]:
ggplot(type,aes(x=reorder(type,-count), y = count,fill=type)) +
geom_bar(stat='identity') +
labs(title = "Distribution of lecture type",x = "lecture type")

**I want to see whether watching lectures is associated with a higher correct rate.**

In [None]:
r1 <- result %>% filter(lec_num == 0)
r2 <- result %>% filter(lec_num > 0)
r <- data.frame("Never_Attend_Lecture" = mean(r1$correctRate),"Lectured" = mean(r2$correctRate))
r

**It turns out people who watched lectures have higher accuracy than those who did not.**
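
Users who watched lectures may also simply have answered more questions, so it can help to compare lecture watchers and non-watchers within the same experience group. This is a sketch under that assumption, on made-up per-user data; `lectured`, `experience` and `rate_table` are hypothetical names.

```r
# Made-up per-user summary
toy <- data.frame(
  ques_num = c(20, 30, 40, 45, 200, 300, 400, 600),
  lec_num = c(0, 0, 1, 0, 0, 3, 5, 10),
  correctRate = c(0.40, 0.50, 0.55, 0.45, 0.60, 0.66, 0.64, 0.70)
)

toy$lectured <- ifelse(toy$lec_num > 0, "lectured", "never")
toy$experience <- ifelse(toy$ques_num >= 50, "experienced", "new")

# Mean correct rate for every experience x lecture combination
rate_table <- tapply(
  toy$correctRate,
  list(toy$experience, toy$lectured),
  mean
)
round(rate_table, 3)
```

If the gap between "lectured" and "never" shrinks inside each row, part of the overall difference is explained by experience rather than by lectures.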

In [None]:
r3 <- result %>% filter(lec_num >= 50)
r4 <- result %>% filter(lec_num < 50)
rr <- data.frame("Less_Than_50_Lecture" = mean(r4$correctRate),"More_lecture" = mean(r3$correctRate))
rr

## Data frame of interaction counts, question counts, lecture counts, correct counts and correct rate by user

In [None]:
head(result, n=5)

## Data frame of question counts, correct counts, false counts and correct rate by question

In [None]:
head(result2, n=10)

**I will do the modeling in another file.**