# KEGG annotation tables

- This notebook uses eggNOG-mapper output to build the key tables used to sum ORF counts to the KEGG KO gene level and to translate KO IDs into gene names for visualization.
---
## Prepare Environment

In [27]:
source('../r_scripts/functions.R')
checkAndLoadPackages('tidyverse','tibble','KEGGREST')

Installing package into ‘/Users/lquirk/Library/R/arm64/4.3/library’
(as ‘lib’ is unspecified)

“package ‘KEGGREST’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages”



Bellow Packages Successfully Loaded:

tidyverse    tibble 
     TRUE      TRUE 


# 1. Read in eggnog annotation tables
---
The `get.eggnog` function reads in one eggNOG-mapper annotation table per organism. The directory and the organism number in the file name ('04', '08', '06', or '13') are specific to how I organized my files. Only the columns of interest are kept from the annotation table.

In [28]:
get.eggnog <- function(org){
    emap <- read.csv(paste('../expression_analysis/eggnog_annotations/',org,".emapper.annotations", sep=""),
                 sep = "\t",
                 comment.char = "#",
                 header = FALSE,
                 na.strings = "-")
    
    colnames(emap) <- c(
        'orfs', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs', 'max_annot_lvl', 'COG_category', 'Description', 
        'Preferred_name', 'GOs', 'EC', 'ko_id', 'KEGG_Pathway', 'KEGG_Module', 'KEGG_Reaction', 
        'KEGG_rclass',  'BRITE', 'KEGG_TC', 'CAZy', 'BiGG_Reaction', 'PFAMs')
    
    emap[, c("orfs", "seed_ortholog", "EC","GOs","ko_id",'KEGG_Pathway', 'KEGG_Module', 'KEGG_Reaction', 
       'KEGG_rclass',  'BRITE', 'KEGG_TC', 'CAZy', 'BiGG_Reaction',"PFAMs")]
    }

emap4 <- get.eggnog('04')
emap8 <- get.eggnog('08')
emap6 <- get.eggnog('06')
emap13 <- get.eggnog('13')
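
The read options used in `get.eggnog` can be checked on a small in-memory example. This is a minimal sketch in base R: the toy file contents and the reduced four-column layout are invented for illustration and do not match the real 21-column emapper output.

```r
# Toy stand-in for an .emapper.annotations file (contents are invented).
toy <- paste(
  "## emapper header line, skipped because of comment.char",
  "orf_1\tseedA\tko:K00001\tPF00001",
  "orf_2\tseedB\t-\tPF00002",
  sep = "\n"
)

# Same options as get.eggnog: tab-separated, '#' lines ignored,
# no header row, and '-' read as NA.
emap_toy <- read.csv(text = toy,
                     sep = "\t",
                     comment.char = "#",
                     header = FALSE,
                     na.strings = "-")
colnames(emap_toy) <- c("orfs", "seed_ortholog", "ko_id", "PFAMs")

# Keep only the columns of interest.
emap_toy <- emap_toy[, c("orfs", "ko_id", "PFAMs")]
print(emap_toy)
```

The `-` placeholder becoming `NA` is what later lets `get.anno` drop ORFs that have no value for a given annotation.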

## 1.2 Create a df for each organism with the annotation of choice, removing rows with NAs

In [29]:
get.anno <- function(emap, anno){
    # extract the orfs column and the annotation of choice, e.g. 'PFAMs'
    df <- select(emap, all_of(c("orfs", anno)))
    # drop ORFs that have no value for this annotation
    filter(df, !is.na(.data[[anno]]))
    }

ko4_ls <- get.anno(emap4,"ko_id")
ko8_ls <- get.anno(emap8,"ko_id")
ko6_ls <- get.anno(emap6,"ko_id")
ko13_ls <- get.anno(emap13,"ko_id")

In [30]:
pfam4_ls = get.anno(emap4, 'PFAMs')
pfam8_ls = get.anno(emap8, 'PFAMs')
pfam6_ls = get.anno(emap6, 'PFAMs')
pfam13_ls = get.anno(emap13, 'PFAMs')


,orfs,PFAMs
,<chr>,<chr>
1,NODE_10000_length_2085_cov_180.263917_g1941_i2.p2,Mg_trans_NIPA
2,NODE_10000_length_2085_cov_180.263917_g1941_i2.p3,Mg_trans_NIPA
3,NODE_10002_length_2085_cov_50.039761_g2374_i1.p2,"Exo_endo_phos,SAP"
4,NODE_10004_length_2085_cov_16.794732_g143_i20.p1,"Bac_rhamnosid6H,F5_F8_type_C,RicinB_lectin_2"
5,NODE_10004_length_2085_cov_16.794732_g143_i20.p4,"Bac_rhamnosid6H,F5_F8_type_C,RicinB_lectin_2"
6,NODE_10005_length_2085_cov_13.522366_g594_i1.p1,PFK


,orfs,ko_id
,<chr>,<chr>
1,NODE_10001_length_2060_cov_5.581278_g5365_i0.p1,ko:K00914
2,NODE_10002_length_2059_cov_37.920947_g5366_i0.p1,ko:K06911
3,NODE_10003_length_2059_cov_28.617321_g5366_i1.p1,ko:K06911
4,NODE_10009_length_2059_cov_5.150050_g5369_i0.p1,ko:K21797
5,NODE_10010_length_2058_cov_58.211083_g5370_i0.p1,ko:K18038
6,NODE_10011_length_2058_cov_52.328967_g4722_i3.p3,ko:K03020


# 2. Create a ko-to-orf mapping table
---
The ko-to-orf mapping table associates each ORF with a KEGG KO and can be used as a key. Because some ORFs were assigned multiple KOs (comma-separated in `ko_id`), these are first split into multiple columns with the `split_into_multiple` function, adapted from a Stack Overflow post.
Then `clean_ko` pivots the table into three columns, `orfs`, `ko_iteration`, and `ko_id`, and drops the 'ko:' prefix from each KO. An ORF with n KOs appears n times, with `ko_iteration` running from 1 to n.
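
To make the target shape concrete, here is a minimal base R sketch of the same split-then-lengthen idea on invented data. It is not the notebook's implementation (which uses `str_split_fixed` and `pivot_longer`), and it records the iteration as a plain integer rather than a label such as `ko:_1`.

```r
# Toy input with the same two columns as ko4_ls (values are invented).
toy <- data.frame(
  orfs  = c("orf_1", "orf_2"),
  ko_id = c("ko:K00001", "ko:K00002,ko:K00003"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into its individual KOs.
parts <- strsplit(toy$ko_id, ",", fixed = TRUE)
n_ko  <- lengths(parts)

long <- data.frame(
  # repeat each ORF once per KO assigned to it
  orfs         = rep(toy$orfs, times = n_ko),
  # 1..n within each ORF, mirroring the ko_iteration column
  ko_iteration = sequence(n_ko),
  # strip the leading 'ko:' prefix
  ko_id        = sub("^ko:", "", unlist(parts)),
  stringsAsFactors = FALSE
)
print(long)
```

The number of rows in `long` equals the total number of KOs across all ORFs, which is the same consistency check the notebook prints for the real tables.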

In [32]:
split_into_multiple <- function(column, pattern = ",", into_prefix){
    # adapted from a post on Stack Overflow
    cols <- str_split_fixed(column, pattern, n = Inf)
    # replace empty matrix entries with NA
    cols[which(cols == "")] <- NA
    # turn the matrix into a tibble with unique but arbitrary column names
    cols <- as_tibble(cols, .name_repair = make.names)
    # m = number of columns in the tibble 'cols'
    m <- dim(cols)[2]
    # name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m'
    names(cols) <- paste(into_prefix, 1:m, sep = "_")
    print(paste('# of values in matrix w/o NA:',sum(!is.na(cols)),sep=' '))
    return(cols)
}
clean_ko <- function(df, org, into_prefix, value_col='ko_id',name_col='ko_iteration'){
    # split the ko_id's into multiple columns, named ko:_1 to ko:_n
    # (at this point we have the same number of rows but far more columns)
    anno_iterations <- split_into_multiple(df[,2], ",", into_prefix)
    # select the orfs column from the original df and bind it to the split columns
    df = df %>% select(orfs) %>% bind_cols(anno_iterations)
    # pivot all ko:_n columns into one column of values (ko_id) and one
    # column holding the former column names 'ko:_1'...'ko:_n' (ko_iteration).
    # ko_iteration tells us how many ko_id's were assigned to a particular orf;
    # 'orfs' repeats for rows with > 1 ko_iteration.
    # NA values from the matrix are dropped.
    df_clean = pivot_longer(df, cols = !orfs, values_drop_na = T,
                            values_to = value_col, names_to = name_col)
    # finally, strip the 'ko:' prefix from the ko_id's; we now have a
    # 3-column table with each orf repeated once per ko_id that eggnog
    # assigned to it
    if(str_detect(colnames(df_clean[3]),'ko_id')==T){
        df_clean$ko_id <- gsub(into_prefix, '', df_clean$ko_id)
    }
    # the final df should have the same number of rows as sum(!is.na(cols))
    print(paste('# rows in final df:',nrow(df_clean), sep='  '))
    #write.csv(df_clean, paste('../kegg_names/ko',org, '_ls.csv', sep=''), row.names=F)
    df_clean
    }

ko4 <- clean_ko(ko4_ls,'4','ko:')
ko6 <- clean_ko(ko6_ls,'6','ko:')
ko8 <- clean_ko(ko8_ls,'8','ko:')
ko13 <- clean_ko(ko13_ls,'13','ko:')
head(ko4)

[1] "# of values in matrix w/o NA: 14281"
[1] "# rows in final df:  14281"
[1] "# of values in matrix w/o NA: 35251"
[1] "# rows in final df:  35251"
[1] "# of values in matrix w/o NA: 12765"
[1] "# rows in final df:  12765"
[1] "# of values in matrix w/o NA: 30911"
[1] "# rows in final df:  30911"


orfs,ko_iteration,ko_id
<chr>,<chr>,<chr>
NODE_10001_length_2060_cov_5.581278_g5365_i0.p1,ko:_1,K00914
NODE_10002_length_2059_cov_37.920947_g5366_i0.p1,ko:_1,K06911
NODE_10003_length_2059_cov_28.617321_g5366_i1.p1,ko:_1,K06911
NODE_10009_length_2059_cov_5.150050_g5369_i0.p1,ko:_1,K21797
NODE_10010_length_2058_cov_58.211083_g5370_i0.p1,ko:_1,K18038
NODE_10011_length_2058_cov_52.328967_g4722_i3.p3,ko:_1,K03020


In [33]:
pfam4 = clean_ko(pfam4_ls, '4','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam8 = clean_ko(pfam8_ls, '8','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam6 = clean_ko(pfam6_ls, '6','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam13 = clean_ko(pfam13_ls, '13','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')

write.csv(pfam4, '../expression_analysis/eggnog_annotations/pfam4.csv', row.names=F)
write.csv(pfam8, '../expression_analysis/eggnog_annotations/pfam8.csv', row.names=F)
write.csv(pfam6, '../expression_analysis/eggnog_annotations/pfam6.csv', row.names=F)
write.csv(pfam13, '../expression_analysis/eggnog_annotations/pfam13.csv', row.names=F)

[1] "# of values in matrix w/o NA: 44311"
[1] "# rows in final df:  44311"
[1] "# of values in matrix w/o NA: 37925"
[1] "# rows in final df:  37925"
[1] "# of values in matrix w/o NA: 68828"
[1] "# rows in final df:  68828"
[1] "# of values in matrix w/o NA: 61554"
[1] "# rows in final df:  61554"


In [39]:
rho4 = filter(pfam4, pfam_id =='7tm_1') %>% mutate('org'='4')
rho6 = filter(pfam6, pfam_id =='7tm_1')%>% mutate('org'='6')
rho13 = filter(pfam13, pfam_id =='7tm_1')%>% mutate('org'='13')
rho_pfam = bind_rows(rho4,rho6,rho13)
write.csv(rho_pfam, '../expression_analysis/eggnog_annotations/rho_pfam.csv', row.names=F)

## 2.1 Add ISIP transcript ORFs 
The tables created above are tidy data frames of ORF (`orfs`) and annotation (`ko_id`) for mapping to counts, but they still lack the ISIP annotations. To add them, I:
- Read in the ISIP isoform annotations
- Make a tidy df of all ISIP isoforms for each organism
- Check that the ISIPs were also counted by Salmon, and remove any that were not
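
The last step in the list can be sketched as a simple membership filter. This is a hedged illustration in base R: `isip_toy` and `salmon_orfs` are hypothetical stand-ins for an ISIP table and for the ORF names in a Salmon quantification table, and all values are invented.

```r
# Hypothetical ISIP hits in the same 3-column layout as the ko tables.
isip_toy <- data.frame(
  orfs         = c("orf_1", "orf_2", "orf_3"),
  ko_iteration = "ko:_1",
  ko_id        = c("isip_1a", "isip_2a", "isip_3"),
  stringsAsFactors = FALSE
)

# Hypothetical ORF names that Salmon quantified.
salmon_orfs <- c("orf_1", "orf_3", "orf_9")

# Keep only the ISIP ORFs that Salmon also counted.
isip_counted <- isip_toy[isip_toy$orfs %in% salmon_orfs, ]
print(isip_counted)
```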

In [None]:
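## read in the files
dir <- "/work/nclab/lucy/SAB/Annotation/eggnog/isip/pep_hit/"

read.isip <- function(org, prot){
    # read the space-separated hits and transpose so each hit becomes a row
    f = t(read.delim(file=paste(dir, org, 'isip', prot,'_hits.txt', sep=""),
                 sep = ' ',header = F))
    # no hits: return NULL, which rbind() ignores
    if (nrow(f) < 1) return(NULL)
    # label every hit with its ISIP type, in the same 3-column layout as the ko tables
    data.frame("orfs"=f, 
               "ko_iteration"="ko:_1",
               "ko_id"=paste("isip_", prot, sep=""))
}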

i41 <- read.isip("04", "1")
i41a <- read.isip("04", "1a")
i41b <- read.isip("04", "1b")
i42a <- read.isip("04", "2a")
i43 <- read.isip("04", "3")
isip4 <- rbind(i41a, i41b, i42a, i43)
isip4 <- distinct(isip4, orfs, .keep_all = T)

i81 <- read.isip("08", "1")
i81a <- read.isip("08", "1a")
i81b <- read.isip("08", "1b")
i82a <- read.isip("08", "2a")
i83 <- read.isip("08", "3")
isip8 <- rbind( i81a, i81b, i82a, i83)  
isip8 <- distinct(isip8, orfs, .keep_all = T)


i61 <- read.isip("06", "1")
i61a <- read.isip("06", "1a")
i61b <- read.isip("06", "1b")
i62a <- read.isip("06", "2a")
i63 <- read.isip("06", "3")
isip6 <- rbind(i61a, i61b, i62a, i63)
isip6 <- distinct(isip6, orfs, .keep_all = T)

i132a <- read.isip("13", "2a")
i133 <- read.isip("13", "3")
isip13 <- rbind(i132a, i133)
isip13 <- distinct(isip13, orfs, .keep_all = T)

Combine the ISIP and ko data frames, taking care not to repeat an ORF that was already annotated.

1. Merge the full ko definition list with the ko's found in each organism using a many-to-one relationship. An organism's ko df has repeated `ko_id` values, because the same annotation can be matched to multiple ORFs, so many rows in the organism's df may match one row in the full ko definition df. The column `ko_id` has the same name in both objects and is used as the join key. For this we use dplyr's `left_join` with `relationship = "many-to-one"`.

2. The ISIP df and the ko df have the column `orfs` in common. We create a list of the ORFs that both contain, and ISIP rows are added to the ko df only for ORFs that are not in that list.
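
The two steps above can be sketched on toy tables. This is a minimal base R illustration under stated assumptions: all table contents are invented, `merge(..., all.x = TRUE)` stands in for dplyr's `left_join`, and the uniqueness check stands in for `relationship = "many-to-one"`.

```r
# Toy tables (all values invented) in the same layout as the real ones.
ko_toy <- data.frame(
  orfs         = c("orf_1", "orf_2", "orf_3"),
  ko_iteration = "ko:_1",
  ko_id        = c("K00001", "K00001", "K00002"),
  stringsAsFactors = FALSE
)
isip_toy <- data.frame(
  orfs         = c("orf_3", "orf_4"),
  ko_iteration = "ko:_1",
  ko_id        = c("isip_3", "isip_3"),
  stringsAsFactors = FALSE
)
def_toy <- data.frame(
  ko_id = c("K00001", "K00002"),
  name  = c("gene A", "gene B"),
  stringsAsFactors = FALSE
)

# Step 1: many-to-one join. Each ko_id must appear only once in the
# definition table, otherwise the join would multiply rows.
stopifnot(!anyDuplicated(def_toy$ko_id))
ko_named <- merge(ko_toy, def_toy, by = "ko_id", all.x = TRUE)

# Step 2: find the ORFs present in both tables, then append only the
# ISIP rows whose ORF is not already annotated.
shared <- intersect(ko_toy$orfs, isip_toy$orfs)
ko_fin <- rbind(ko_toy, isip_toy[!isip_toy$orfs %in% shared, ])
print(ko_fin)
```

Here `orf_3` keeps its existing KO annotation and only `orf_4` is added from the ISIP table, so no ORF is repeated.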

In [None]:

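combine_save <- function(isip, ko_df, org){
    # append the ISIP rows to the organism's ko table
    ko_fin <- bind_rows(ko_df, isip)
 
    write.csv(ko_fin, paste("../kegg_names/ko", org,"_ls.csv", sep=""), row.names=F)
    # sanity check: both counts should be identical
    print(nrow(ko_fin))
    print(nrow(ko_df)+nrow(isip))
    # return the combined table so that it can be assigned
    ko_fin
}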


ko4_def <- combine_save(isip4, ko4, "4")
ko8_def <- combine_save(isip8, ko8, "8")
ko6_def <- combine_save(isip6, ko6, "6")
ko13_def <- combine_save(isip13, ko13, "13")

# 3. Create a list of unique ko's found across all organisms. 
---
The list of unique ko's is formatted so that it can be read by a bash script. The list is later used to look up the name and symbol of each ko_id, which are then mapped back to the organism-specific tables made above.

In [None]:
#create a df of all unique ko's found for all annotations
all.ko <- bind_rows(ko4, ko6,ko8,ko13) %>% select(ko_id)

all.ko <- distinct(all.ko, ko_id, .keep_all = T)
#so that bash will read my file correctly, I need to start with ko= 
#and surround all ko_id's with double quotes
kk=c('ko= ')
q=c('"')
all.ko=rbind(kk, q, all.ko, q)

#write to a .txt file with ko_id's separated by a space
write.table(all.ko, "../kegg_names/all.ko.txt", quote=F, sep=" ", eol=" ", row.names = FALSE, col.names = FALSE)
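
For comparison, the same kind of single-line shell assignment can be built with `paste0`. This is only a sketch: the ko_id's are invented, the file goes to a temporary location, and whether `koNames.sh` expects exactly this spacing is an assumption, since the script is not shown here.

```r
# Toy ko_id's (invented); the real list comes from all.ko above.
ko_ids <- c("K00001", "K00002", "K00003")

# Build one shell assignment of the form ko="K00001 K00002 K00003".
line <- paste0('ko="', paste(ko_ids, collapse = " "), '"')

# Write to a temporary file so the sketch leaves nothing behind in the project.
out <- tempfile(fileext = ".txt")
writeLines(line, out)
print(readLines(out))
```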

## 3.1 Run the bash script `koNames.sh` in the script folder
This gives a total of 7218 rows, one per ko_id.

## 3.2 Clean up ko_def table
---
Separate the KO ID, symbol, and name into different columns, add the ISIP genes, and save.

In [None]:
#read in table with name and symbol matches to kegg ko's
#make tidy
ko_def <- read.delim("../kegg_names/ko_pathways.txt", head=F,sep = ";")
id <- str_extract(ko_def$V1,'K[[:digit:]]*')
sym <- str_remove(ko_def$V1, 'ko:K[[:digit:]]*')
ko_def <- data.frame(ko_id = id, symbol=sym, name=ko_def$V2)
isip=data.frame(ko_id=c('isip_1a','isip_2a','isip_3'), 
                symbol=c('isip_1a','isip_2a','isip_3'),
                name=c('Iron stress induced protein 1a',
                       'Iron stress induced protein 2a', 'Iron stress induced protein 3'))
ko_def=bind_rows(ko_def, isip)
print(paste(nrow(ko_def), "    =   all unique ko's found across organisms"))
write.csv(ko_def,'../kegg_names/ko_def.csv', row.names=F)
tail(ko_def)
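
The separation into ID, symbol, and name can be illustrated on a single line. This is a hedged base R sketch: the layout `<ko id><tab><symbols>; <name>` is an assumption about `ko_pathways.txt`, and the example entry reuses the flavodoxin values that appear later in this notebook.

```r
# One line in the assumed layout: '<ko id><tab><symbols>; <name>'.
raw <- "ko:K03839\tfldA, nifF, isiB; flavodoxin I"

# Split on the first ';' into the id/symbol part and the name part.
pos   <- regexpr(";", raw, fixed = TRUE)
left  <- substr(raw, 1, pos - 1)
right <- substr(raw, pos + 1, nchar(raw))

# Pull out the K number, then remove it (and the 'ko:' prefix) to get the symbol.
ko_id  <- regmatches(left, regexpr("K[0-9]{5}", left))
symbol <- trimws(sub("^ko:K[0-9]{5}", "", left))
name   <- trimws(right)

parsed <- data.frame(ko_id = ko_id, symbol = symbol, name = name,
                     stringsAsFactors = FALSE)
print(parsed)
```

Unlike the cell above, this sketch also trims the whitespace left around the symbol and name.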

# Using the KEGG API for KEGG annotations
The imported csv files of modules, pathways, and individual ko's (those not associated with a pathway or module) carry the subcategories and broad categories I am interested in. To get each module or pathway name, plus all of its ko's with their names and symbols, we loop through each module/pathway code and extract this information with the KEGG API. Because pathways and modules are associated with one another, we keep these tables separate.
## After running each loop, we should have a data frame for each original file with the columns:
- ko_id
- name
- symbol
- pathway/module
- sub_category
- broad_category
## Extract all ko's from the tables which appear in at least one sample
We will use the resulting tables to subset the data later when making heat maps. Heat maps can be built by pathway, module, or category.
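
The three kinds of KEGG REST request this section relies on can be sketched as plain URL strings, without contacting the server. The helper name `kegg_url` is hypothetical; the base URL, the `find` and `link` operations, and the example identifiers are the ones used in this notebook.

```r
# Base URL as used in the notebook; kegg_url is a hypothetical helper.
base_url <- "https://rest.kegg.jp/"

kegg_url <- function(operation, database, entry) {
  paste0(base_url, operation, "/", database, "/", entry)
}

# Pathway name, ko's linked to the pathway, and name/symbol of one ko.
urls <- c(
  path_name = kegg_url("find", "pathway", "map00195"),
  path_kos  = kegg_url("link", "ko", "map00195"),
  ko_name   = kegg_url("find", "ko", "K03839")
)
print(urls)
```

Building the URLs separately from fetching them makes it easy to inspect a request before looping over many pathway codes.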

## Pull out the pathways used for heat maps in the paper to make the loop faster

In [None]:
sub_cat_path=read.csv('../kegg_names/subcategories_path.csv')

heatMapPath = filter(sub_cat_path, sub_category %in% c(
    'Carbon fixation in photosynthetic organisms', 'Photosynthesis', 
    'Photosynthesis-antenna proteins', 'Nitrogen metabolism', 'Carotenoid biosynthesis'))


In [None]:
#3. pathways
## get name for each pathway
## add pathway name to dataframe
## get list of all ko's for each pathway
## get name and symbol for each ko

# getURL() comes from the RCurl package
library(RCurl)

url=c("https://rest.kegg.jp/")
find.path=list()
for (i in heatMapPath$Pathway){
    p=getURL(paste(url,'find/pathway/', i, sep=''))
    find.path=unname(c(find.path,p))
}

path_name=data.frame('path_name'=str_remove(find.path,'path:map[[:digit:]]{5}\t'))
path_name$path_name=str_remove(path_name$path_name, '\n')
head(path_name)
sub_path=bind_cols(heatMapPath, path_name)
colnames(sub_path)[1]='Path'

p=list()
for (i in sub_path$Path){
    link.ko=getURL(paste(url,'link/ko/', i, sep=''))
    #link.ko=str_extract(p, 'ko:K[[:digit:]]+')
    p=c(p,link.ko)
}

p=data.frame(ko_id=unlist(str_extract_all(p,'K[[:digit:]]+')), 
               path=(unlist(str_extract_all(p, 'map[[:digit:]]{5}')))) 
symbol=list()
name=list()
for (i in p$ko_id){
    g=getURL(paste(url,'find/ko/', i, sep=''))
    symbol=c(symbol,str_remove(str_extract(unname(g), '\t.+;'), ';'))
    name=c(name, str_remove(str_extract(unname(g),'; .*'), '; '))
}
path_ko=data.frame('Path'=p$path,
                   'ko_id'=p$ko_id,
                   'symbol'=str_remove(unlist(symbol),'\t'),
                   'name'= str_remove(unlist(name), '\\[EC:.*\\]'))

sub_path=left_join(path_ko, sub_path, by='Path', relationship='many-to-many')

head(sub_path)
dim(sub_path)

## add fld genes 
flavodoxin=data.frame(Path='map00195', ko_id=c('K03839','K03840','K21567','K00528'), 
                      symbol=c('fldA, nifF, isiB','fldB','fnr','fpr'), 
                      name=c('flavodoxin I','flavodoxin II','ferredoxin/flavodoxin---NADP+ reductase',
                            'ferredoxin/flavodoxin---NADP+ reductase'), sub_category='Photosynthesis',
                     broad_category='photosynthesis', path_name='Photosynthesis')
sub_path=bind_rows(sub_path, flavodoxin) 

tail(sub_path)
dim(sub_path)
filter(sub_path, path_name=='Photosynthesis')

Change symbols, names, or pathways for use in the heat map.
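
The relabelling in this part follows one pattern: match a regular expression against `symbol` or `name`, then overwrite `sub_category` for the matching rows. A minimal base R sketch on toy rows (symbols and names are illustrative, and `grepl` stands in for `str_detect`):

```r
# Toy rows; every row starts with the same sub_category.
toy <- data.frame(
  symbol       = c("psaA", "psbA", "fldA, nifF, isiB"),
  name         = c("toy PSI protein", "toy PSII protein", "flavodoxin I"),
  sub_category = "Photosynthesis",
  stringsAsFactors = FALSE
)

# pattern -> label rules, applied in order; if two patterns match the
# same row, the later rule overwrites the earlier one.
rules <- c("psa" = "PSI", "psb" = "PSII")
for (pat in names(rules)) {
  toy$sub_category[grepl(pat, toy$symbol)] <- rules[[pat]]
}

# Rows that no rule touched keep their original label.
print(toy)
```

Because later assignments win, the order of the relabelling statements matters whenever patterns can overlap.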

In [None]:
sub_path$path_name[sub_path$path_name=='Photosynthesis - antenna proteins']='Photosynthesis'

sub_path[str_detect(sub_path$symbol, "DUF"), 'sub_category'] =  "NADH dehydrogenase"
sub_path[str_detect(sub_path$symbol, "COX|cox|CYC"), 'sub_category'] = "cytochrome c"
sub_path[str_detect(sub_path$symbol, "ATP.*V"),'sub_category'] = "V-Type ATP-ase"
sub_path[str_detect(sub_path$symbol, "ATP.*F"), 'sub_category'] = "F-Type ATP-ase"

heat.path=(filter(sub_path, (ko_id %in% ko_def$ko_id)==T))

heat.path$name= str_replace_all(heat.path$name, c('light-harvesting complex I '='LHCA ','light-harvesting complex II'='LHCB',  
                                 'chlorophyll'='Chl','photosystem I '='PSI ', 'photosystem II'='PSII',
                                 'F-type .* subunit '='F-Type ATP-ase ', 'isip_'='Iron starvation induced protein ',
                                 'fructose-bisphosphate aldolase'='FBA', 'ribulose-bisphosphate carboxylase'='RuBisCO',
                                               'MFS transporter, NNP family, '=''))

urea_cycle = filter(ko_def, ko_id%in%c('K00611','K01940','K01755','K01476'))
urea_cycle = mutate(urea_cycle, 'sub_category'='Urea cycle', 'broad_category'='Nitrogen metabolism','path_name'='Nitrogen metabolism', 'Path'='map00910')
heat.path = full_join(heat.path, urea_cycle)

In [None]:
heat.path[str_detect(heat.path$symbol, 'MDH1'), 'name'] = 'malate dehydrogenase 1'
heat.path[str_detect(heat.path$symbol, 'MDH2'), 'name'] = 'malate dehydrogenase 2'
heat.path[str_detect(heat.path$symbol, 'maeB|ppdK'), 'sub_category'] = 'CAM light'
heat.path[str_detect(heat.path$symbol, 'MDH1|MDH2|mdh|ppc'), 'sub_category'] = 'CAM dark'
heat.path[heat.path$ko_id %in% 
          c('K00855','K00927','K01100','K01601','K01623','K01624','K01783',
            'K01803','K01807','K01808','K02446','K03841','K11532','K00134',
            'K00615'), 'sub_category'] = 'Calvin cycle' 
heat.path$sub_category = str_replace(heat.path$sub_category, 'Carbon fixation in photosynthetic organisms', 'C4 Dicarboxilic acid cycle')
heat.path$name = str_remove(heat.path$name, '.(phosphorylating).')
filter(heat.path, path_name=='Carbon fixation in photosynthetic organisms')


In [None]:
heat.path[str_detect(heat.path$name, 'carbonic anhydrase'), 'sub_category'] = 'Carbonic anhydrase'
heat.path[str_detect(heat.path$name, 'nitrate/nitrite transport'), 'sub_category'] = 'Nitrogen transporters'
heat.path[str_detect(heat.path$name, 'glutamate|glutamine'), 'sub_category'] = 'GS/GOGAT and GDH'
heat.path[str_detect(heat.path$name, 'nitrite reductase'), 'sub_category'] = 'Nitrite reductase'
heat.path[str_detect(heat.path$name, 'nitrate reductase'), 'sub_category'] = 'Nitrite reductase'
heat.path[str_detect(heat.path$symbol, 'CPS1'), 'sub_category'] = 'Urea cycle'
heat.path$name = str_remove(heat.path$name, '\\[.*\\]')
filter(heat.path, path_name=='Nitrogen metabolism')


In [None]:
heat.path[str_detect(heat.path$symbol, 'psa'), 'sub_category'] = 'PSI'
heat.path[str_detect(heat.path$symbol, 'psb'), 'sub_category'] = 'PSII'
heat.path[str_detect(heat.path$symbol, 'LHCA'), 'sub_category'] = 'LHCA'
heat.path[str_detect(heat.path$symbol, 'LHCB'), 'sub_category'] = 'LHCB'
heat.path$sub_category = str_replace(heat.path$sub_category, 'Photosynthesis', 'Electron transport chain')
heat.path[str_detect(heat.path$sub_category, 'F-Type ATP-ase'), 'path_name'] = 'F-Type ATP-ase'

filter(heat.path, path_name=='Photosynthesis')

In [None]:
heat.path[str_detect(heat.path$name, 'violaxanthin|zeaxanthin'), 'path_name'] = 'Photosynthesis'
heat.path[str_detect(heat.path$name, 'violaxanthin|zeaxanthin'), 'sub_category'] = 'Xanthophyll cycle'

filter(heat.path, path_name=='Photosynthesis')

In [None]:
unique(heat.path$path_name)
write.csv(heat.path, '../kegg_names/pathwaysHeatMap.csv', row.names=F)

In [None]:
all(heat.path$ko_id %in% ko_def$ko_id)
dim(heat.path)