# Working with immune repertoire sequence data using `R`

We will be using `tidyverse` and `Bioconductor` to work with pre-processed immune repertoire sequence data in `R`.
You have already learned about both of these collections of packages, and today's class will give you the opportunity to practice using them in a new biological context.

Goals for today's class:
1. Familiarize yourself with the format of immune repertoire sequence data
2. Work through an example analysis, and practice on your own

## Load packages

In [None]:
library(Biostrings)
library(tidyverse)

## Loading the data

In [None]:
indiv1 <- read.csv('https://drive.google.com/uc?id=1b1oXlhg99_YCPK_HzFL2ohYPDxvTbQOo', sep = '\t', header = TRUE)
indiv2 <- read.csv('https://drive.google.com/uc?id=10qMY16H9wD_wuC4ISAJRTRx_bOeclg5D', sep = '\t', header = TRUE)

Let's take a look at the data. 
Recall that each annotated sequence takes the form:

5'-[__V gene__]-(_possible V-gene deletion_)-[__N1 insertion__]-(_possible D-gene deletion_)-[__D gene__]-(_possible D-gene deletion_)-[__N2 insertion__]-(_possible J-gene deletion_)-[__J gene__]-3'

Using the column names given by the file, we can interpret this as:

5'-[__`v_gene`__]-(_`v_trim`_)-[__`vd_insert`__]-(_`d0_trim`_)-[__`d_gene`__]-(_`d1_trim`_)-[__`dj_insert`__]-(_`j_trim`_)-[__`j_gene`__]-3'

In [None]:
head(indiv2)
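
To make the column layout concrete, here is a minimal sketch on a toy table. The values are made up, and it assumes the trim and insert columns hold whole-number nucleotide counts, which you should confirm against the `head()` output above.

```r
# Toy stand-in for a few rows of the real table; all values are made up.
# Assumption: trim and insert columns hold integer nucleotide counts.
toy <- data.frame(
    v_trim    = c(3L, 0L, 5L),
    vd_insert = c(2L, 0L, 4L),
    d0_trim   = c(1L, 2L, 0L),
    d1_trim   = c(0L, 3L, 1L),
    dj_insert = c(5L, 1L, 0L),
    j_trim    = c(4L, 2L, 6L)
)

# Nucleotides removed from the gene ends in each rearrangement
toy$total_trimmed <- toy$v_trim + toy$d0_trim + toy$d1_trim + toy$j_trim

# Nucleotides added at the two junctions (V-D and D-J)
toy$total_inserted <- toy$vd_insert + toy$dj_insert

toy
```

Each row is one rearrangement, so per-sequence summaries like these are simple column arithmetic.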

## Some example analyses using `tidyverse` and `Bioconductor`

Find which V-gene occurs most frequently for individual 2

In [None]:
Vcounts2 <- indiv2 %>%
                group_by(v_gene) %>%
                summarise(total_count = n()) %>%
                arrange(desc(total_count))

head(Vcounts2)
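
The `group_by()`, `summarise()`, `arrange()` pattern above can also be written with `dplyr`'s `count()`. The sketch below compares the two on a toy data frame; the gene labels are hypothetical placeholders, not real V-gene names.

```r
library(dplyr)  # part of the tidyverse

# Toy stand-in for `indiv2`; gene labels are hypothetical placeholders
toy <- data.frame(
    v_gene = c("geneA", "geneB", "geneA", "geneC", "geneA", "geneB")
)

# Long form, as in the cell above
long_form <- toy %>%
    group_by(v_gene) %>%
    summarise(total_count = n()) %>%
    arrange(desc(total_count))

# Short form: count() groups and tallies; sort = TRUE orders by descending count
short_form <- toy %>%
    count(v_gene, sort = TRUE, name = "total_count")

short_form
```

Both forms should give the same table; which one to use is a matter of taste.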

Plot the distribution of trimming lengths for that gene

In [None]:
# isolate most frequently used V-gene (if several genes tie, all are kept)
most_frequent_V <- Vcounts2 %>%
                    filter(total_count == max(total_count))

# filter original data set by most frequently used V-gene
# (%in% rather than == so the filter stays correct if several genes tie)
most_frequent_V_data <- indiv2 %>%
                            filter(v_gene %in% most_frequent_V$v_gene)

# plot trimming distribution
most_frequent_V_data %>%
    ggplot(aes(x = v_trim)) +
    geom_density() +
    theme_classic()
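
If the trimming lengths are whole-number counts (an assumption worth checking in your data), a bar chart of counts per length is an alternative to the smoothed density curve. A minimal sketch on made-up values:

```r
library(ggplot2)  # part of the tidyverse

# Toy trimming lengths; values are made up for illustration
toy <- data.frame(v_trim = c(0, 0, 1, 2, 2, 2, 3, 5))

# geom_bar() counts how many rows fall on each distinct x value
p <- ggplot(toy, aes(x = v_trim)) +
    geom_bar() +
    theme_classic()

p
```

The density plot is easier to compare across groups, while the bar chart shows exactly how many sequences have each trimming length.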

Calculate the nucleotide composition of the V-D N-insertions in individual 2

In [None]:
# filter for sequences that have VD N-inserts
n_indiv2 <- indiv2 %>%
                filter(vd_insert != 0)

# convert VD N-insert column to a Biostrings `DNAStringSet`
nucs_indiv2 <- DNAStringSet(n_indiv2$vd_insert_nucs)

# get frequencies
nucs_indiv2 %>%
  letterFrequency(c("A", "T", "C", "G"), collapse = TRUE, as.prob = TRUE)
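
To see what `collapse` and `as.prob` do, it can help to run `letterFrequency()` on a few short sequences. The sketch below uses made-up insert sequences, not values from the data set.

```r
library(Biostrings)

# Toy N-insert sequences; made up for illustration
toy_inserts <- DNAStringSet(c("ACG", "GG", "T"))

# Default: one row of raw counts per sequence, one column per letter
per_sequence <- letterFrequency(toy_inserts, letters = c("A", "T", "C", "G"))

# collapse = TRUE pools all sequences before counting;
# as.prob = TRUE divides the counts by the total number of letters
pooled <- letterFrequency(toy_inserts, letters = c("A", "T", "C", "G"),
                          collapse = TRUE, as.prob = TRUE)

per_sequence
pooled
```

Because the sequences are pooled, long inserts contribute more to the overall frequencies than short ones.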

## In-class exercises

(20 minutes)

### 1. Plot the distributions of V-gene trimming for each V-gene. Find the V-gene with the largest average number of nucleotides trimmed.

In [None]:
#  your code here...


### 2. Find the N-insertion base frequencies for each N-insertion junction (combining data from both individuals). Are they similar?

_Hint: you can use the `rbind` function to combine data sets_
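
As a sketch of that hint, the toy example below stacks two small data frames with `rbind`. The added `individual` column is a hypothetical name, not a column in the files; it keeps track of which rows came from which data set. The values and gene labels are made up.

```r
library(dplyr)  # part of the tidyverse

# Toy stand-ins for `indiv1` and `indiv2`; values are made up
toy1 <- data.frame(v_gene = c("geneA", "geneB"), v_trim = c(2, 0))
toy2 <- data.frame(v_gene = c("geneA", "geneC", "geneC"), v_trim = c(1, 4, 3))

# Label each data set before stacking so the source stays identifiable.
# rbind() requires both data frames to have the same column names.
combined <- rbind(
    toy1 %>% mutate(individual = "indiv1"),
    toy2 %>% mutate(individual = "indiv2")
)

combined
```

`dplyr::bind_rows()` does the same job and fills in `NA` where the columns differ.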

In [None]:
# your code here...


### 3. Find the rearrangement(s) (`cdr3` column) with the largest overlap between the two individuals. Which nucleotide rearrangement most commonly leads to that CDR3 amino acid sequence in each individual?

In [None]:
# your code here...
