# KEGG annotation tables

- This notebook uses eggNOG-mapper output to build the key tables that sum ORF counts to the KEGG KO (gene) level and translate KO IDs into gene names for visualization.
---
## Prepare Environment

In [1]:
library('tidyverse')
library('tibble') 
library('KEGGREST')
library('RCurl')

── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘RCurl’


The following object is masked from ‘package:ti

# 1. Read in eggnog annotation tables
---
The `get.eggnog` function reads the eggNOG-mapper annotation table for one organism. The directory and the organism-number file prefix (04, 06, 08 or 13) are specific to how I organized my files. Only the columns of interest are kept from the annotation table.

In [2]:
get.eggnog <- function(org){
    emap <- read.csv(paste('../expression_analysis/eggnog_annotations/',org,".emapper.annotations", sep=""),
                 sep = "\t",
                 comment.char = "#",
                 header = FALSE,
                 na.strings = "-")
    
    colnames(emap) <- c(
        'orfs', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs', 'max_annot_lvl', 'COG_category', 'Description', 
        'Preferred_name', 'GOs', 'EC', 'ko_id', 'KEGG_Pathway', 'KEGG_Module', 'KEGG_Reaction', 
        'KEGG_rclass',  'BRITE', 'KEGG_TC', 'CAZy', 'BiGG_Reaction', 'PFAMs')
    
    emap[, c("orfs", "seed_ortholog", "EC","GOs","ko_id",'KEGG_Pathway', 'KEGG_Module', 'KEGG_Reaction', 
       'KEGG_rclass',  'BRITE', 'KEGG_TC', 'CAZy', 'BiGG_Reaction',"PFAMs")]
    }

emap4 <- get.eggnog('04')
emap8 <- get.eggnog('08')
emap6 <- get.eggnog('06')
emap13 <- get.eggnog('13')

### 1.2 Create df for each organism with annotation of choice, removing the rows with NA's

In [3]:
get.anno <- function(emap, anno){
    # extract the query (orfs) column and the annotation of choice, e.g. PFAMs
    vars <- c("orfs", anno)
    df <- select(emap, all_of(vars))
    # drop ORFs that have no value for the chosen annotation
    df %>% filter(!is.na(.data[[anno]]))
    }

ko4_ls <- get.anno(emap4,"ko_id")
ko8_ls <- get.anno(emap8,"ko_id")
ko6_ls <- get.anno(emap6,"ko_id")
ko13_ls <- get.anno(emap13,"ko_id")

In [4]:
pfam4_ls = get.anno(emap4, 'PFAMs')
pfam8_ls = get.anno(emap8, 'PFAMs')
pfam6_ls = get.anno(emap6, 'PFAMs')
pfam13_ls = get.anno(emap13, 'PFAMs')


# 2. Create a ko-to-orf mapping table
---
The ko-to-orf mapping table associates each ORF with a KEGG KO and can be used as a key. Because some ORFs were assigned multiple KOs, the comma-separated `ko_id` values are first split into multiple columns with the `split_into_multiple` function, adapted from a Stack Overflow post. An ORF with multiple KOs then appears once per assigned KO, and `ko_iteration` numbers its KOs from 1 to n. 
Then, `clean_ko` pivots the table into three columns (`orfs`, `ko_iteration`, and `ko_id`) and drops the `ko:` prefix from each KO.  

In [5]:
split_into_multiple <- function(column, pattern = ",", into_prefix){
    # adapted from a post on Stack Overflow
    cols <- str_split_fixed(column, pattern, n = Inf)
  # replace empty matrix entries with NA
    cols[which(cols == "")] <- NA
  # turn the matrix into a tibble with unique but arbitrary column names
    cols <- as_tibble(cols, .name_repair = make.names)
  # where m = # columns in tibble 'cols'
    m <- dim(cols)[2]
  # assign column names as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m' 
    names(cols) <- paste(into_prefix, 1:m, sep = "_")
    print(paste('# of values in matrix w/o NA:',sum(!is.na(cols)),sep=' '))
    return(cols)
}
clean_ko <- function(df, org, into_prefix, value_col='ko_id',name_col='ko_iteration'){
  # split up ko_id's into multiple columns naming each column 
  # ko:_1 to ko:_n 
  # (remember at this point we have the same number of rows but 
  # far more columns)
    anno_iterations <- split_into_multiple(df[,2], ",", into_prefix)
  # select the orfs column from the original df and bind it to the
  # split columns 
    df = df %>% select(orfs) %>% bind_cols(anno_iterations)
    df_clean = pivot_longer(df, cols = !orfs, values_drop_na = T,
                            values_to = value_col, names_to = name_col)
    if(str_detect(colnames(df_clean[3]),'ko_id')==T){
    df_clean$ko_id <- gsub(into_prefix, '', df_clean$ko_id)
        }
  # the final df should have same number of rows as sum(!na(cols))
    print(paste('# rows in final df:',nrow(df_clean), sep='  '))
    #write.csv(df_clean, paste('../kegg_names/ko',org, '_ls.csv', sep=''), row.names=F)
    df_clean
    }

ko4 <- clean_ko(ko4_ls,'4','ko:')
ko6 <- clean_ko(ko6_ls,'6','ko:')
ko8 <- clean_ko(ko8_ls,'8','ko:')
ko13 <- clean_ko(ko13_ls,'13','ko:')
head(ko4)

[1] "# of values in matrix w/o NA: 14281"
[1] "# rows in final df:  14281"
[1] "# of values in matrix w/o NA: 35251"
[1] "# rows in final df:  35251"
[1] "# of values in matrix w/o NA: 12765"
[1] "# rows in final df:  12765"
[1] "# of values in matrix w/o NA: 30911"
[1] "# rows in final df:  30911"


orfs,ko_iteration,ko_id
<chr>,<chr>,<chr>
NODE_10001_length_2060_cov_5.581278_g5365_i0.p1,ko:_1,K00914
NODE_10002_length_2059_cov_37.920947_g5366_i0.p1,ko:_1,K06911
NODE_10003_length_2059_cov_28.617321_g5366_i1.p1,ko:_1,K06911
NODE_10009_length_2059_cov_5.150050_g5369_i0.p1,ko:_1,K21797
NODE_10010_length_2058_cov_58.211083_g5370_i0.p1,ko:_1,K18038
NODE_10011_length_2058_cov_52.328967_g4722_i3.p3,ko:_1,K03020
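
The split-and-pivot step can be checked on a toy table. The sketch below is a base-R illustration of the same idea, not the notebook's `clean_ko`; the ORF names and KO IDs are made up, and `toy`, `parts`, and `toy_long` are hypothetical names.

```r
# Toy input: two ORFs, the second carrying two comma-separated KOs.
# ORF names and KO IDs here are made up for illustration.
toy <- data.frame(
  orfs  = c("orf_A", "orf_B"),
  ko_id = c("ko:K00001", "ko:K00002,ko:K00003"),
  stringsAsFactors = FALSE
)

# Split each ko_id string on commas: one character vector per ORF.
parts <- strsplit(toy$ko_id, ",", fixed = TRUE)
n_per_orf <- lengths(parts)

# Repeat each ORF once per KO, number the KOs within each ORF,
# and strip the "ko:" prefix.
toy_long <- data.frame(
  orfs         = rep(toy$orfs, times = n_per_orf),
  ko_iteration = paste0("ko:_", sequence(n_per_orf)),
  ko_id        = sub("^ko:", "", unlist(parts)),
  stringsAsFactors = FALSE
)
print(toy_long)
```

As in the printed counts above, the number of rows in the long table should equal the number of non-missing KO values, here three.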


## 2.1 Add ISIP and PTOX transcript ORFs 
The tables created above are tidy dataframes of query (ORF) and annotation (ko_id) for mapping to counts, but they still need the ISIP annotations. To add them, I: 
- Read in ISIP isoform annotations
- Make a tidy df of all ISIP isoforms for each organism
- Check that ISIP's were also counted by Salmon, remove any which were not

In [6]:
## read in the files
dir <- "../expression_analysis/eggnog_annotations/"

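read.isip <- function(org, prot){
    # read the space-separated hit file and transpose it to one ORF per row
    f = t(read.delim(file=paste(dir, org, 'isip', prot,'_hits.txt', sep=""),
                 sep = ' ', header = F))
    # build the same three columns as the ko tables
    # (the function returns NULL when f has no rows)
    if (nrow(f) >=1){
    f <- data.frame("orfs"=f, 
                    "ko_iteration"="ko:_1",
                    "ko_id"=paste("isip_", prot, sep=""))}
}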

i41 <- read.isip("04", "1")
i41a <- read.isip("04", "1a")
i41b <- read.isip("04", "1b")
i42a <- read.isip("04", "2a")
i43 <- read.isip("04", "3")
isip4 <- rbind(i41a, i41b, i42a, i43)
isip4 <- distinct(isip4, orfs, .keep_all = T)

i81 <- read.isip("08", "1")
i81a <- read.isip("08", "1a")
i81b <- read.isip("08", "1b")
i82a <- read.isip("08", "2a")
i83 <- read.isip("08", "3")
isip8 <- rbind( i81a, i81b, i82a, i83)  
isip8 <- distinct(isip8, orfs, .keep_all = T)


i61 <- read.isip("06", "1")
i61a <- read.isip("06", "1a")
i61b <- read.isip("06", "1b")
i62a <- read.isip("06", "2a")
i63 <- read.isip("06", "3")
isip6 <- rbind(i61a, i61b, i62a, i63)
isip6 <- distinct(isip6, orfs, .keep_all = T)

i132a <- read.isip("13", "2a")
i133 <- read.isip("13", "3")
isip13 <- rbind(i132a, i133)
isip13 <- distinct(isip13, orfs, .keep_all = T)

head(isip13)

Unnamed: 0_level_0,orfs,ko_iteration,ko_id
Unnamed: 0_level_1,<chr>,<chr>,<chr>
V1,NODE_11480_length_1968_cov_823.584169_g5802_i0.p2,ko:_1,isip_2a
V2,NODE_12355_length_1900_cov_867.879037_g5802_i1.p2,ko:_1,isip_2a
V3,NODE_14509_length_1755_cov_775.599287_g5802_i2.p2,ko:_1,isip_2a
V4,NODE_1485_length_3787_cov_81.809908_g739_i0.p1,ko:_1,isip_2a
V5,NODE_19097_length_1503_cov_1567.323776_g9985_i0.p2,ko:_1,isip_2a
V6,NODE_29963_length_1047_cov_1763.823409_g3409_i1.p3,ko:_1,isip_2a


In [189]:
read.csv('../expression_analysis/eggnog_annotations/04ptox_hits.txt')

orfs,ko_iteration,ko_id
<chr>,<chr>,<chr>
NODE_15645_length_1522_cov_47.151139_g8781_i0.p1,ko:_1,ptox_a
NODE_15735_length_1514_cov_43.294934_g8781_i1.p1,ko:_1,ptox_b
NODE_2807_length_3757_cov_7.428882_g1403_i0.p2,ko:_1,ptox_c


In [190]:
ptox4 = read.csv('../expression_analysis/eggnog_annotations/04ptox_hits.txt')
ptox8 = read.csv('../expression_analysis/eggnog_annotations/08ptox_hits.txt')
ptox6 = read.csv('../expression_analysis/eggnog_annotations/06ptox_hits.txt')
ptox13 = read.csv('../expression_analysis/eggnog_annotations/13ptox_hits.txt')
ptox13

orfs,ko_iteration,ko_id
<chr>,<chr>,<chr>
NODE_14000_length_1790_cov_57.659872_g7170_i0.p1,ko:_1,ptox_a
NODE_16240_length_1655_cov_63.367257_g7170_i1.p1,ko:_1,ptox_b
NODE_18734_length_1522_cov_64.064872_g8003_i1.p2,ko:_1,ptox_c


Combine the ISIP and PTOX proteins with the ko dataframes, taking care not to repeat an ORF that was already annotated.

1. Merge the full ko definition list with the ko's found in each organism using a many-to-one relationship. An organism's ko df repeats ko_id values, because the same annotation can match multiple ORFs, so many rows in the organism's df may match one row in the full ko definition df. Both objects name the column `ko_id`, which is used as the merge key. For this we use dplyr's `left_join` with `relationship = "many-to-one"`. 

2. The isip df and ko df share the `orfs` column. We create a list of the ORFs that the isip df and ko df have in common, and add isip rows to the ko df only for ORFs not in that list.

In [191]:

combine_save <- function(isip, ko_df, ptox, org){
    
    ko_fin <- bind_rows(ko_df, isip, ptox)
 
    write.csv(ko_fin, paste("../expression_analysis/kegg_files/ko", org,"_ls.csv", sep=""), row.names=F)
    ko_fin
}


ko4_def <- combine_save(isip4, ko4, ptox4, "4")
ko8_def <- combine_save(isip8, ko8, ptox8, "8")
ko6_def <- combine_save(isip6, ko6, ptox6, "6")
ko13_def <- combine_save(isip13, ko13, ptox13, "13")
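
`combine_save` binds the three tables directly, so the overlap check described in step 2 is not part of that function. A minimal base-R sketch of that check follows, on toy tables with the same three columns; all ORF names are hypothetical, and `ko_toy`, `isip_toy`, `shared`, and `combined` are illustrative names rather than objects from this notebook.

```r
# Toy tables with the same three columns as the ko and isip dataframes.
# All ORF names below are hypothetical.
ko_toy <- data.frame(
  orfs = c("orf_A", "orf_B"),
  ko_iteration = "ko:_1",
  ko_id = c("K00001", "K00002"),
  stringsAsFactors = FALSE
)
isip_toy <- data.frame(
  orfs = c("orf_B", "orf_C"),
  ko_iteration = "ko:_1",
  ko_id = "isip_2a",
  stringsAsFactors = FALSE
)

# ORFs present in both tables
shared <- intersect(isip_toy$orfs, ko_toy$orfs)

# Keep only the ISIP rows whose ORF is not already annotated, then append them
isip_new <- isip_toy[!isip_toy$orfs %in% shared, ]
combined <- rbind(ko_toy, isip_new)
print(combined)
```

With dplyr loaded, `anti_join(isip_toy, ko_toy, by = "orfs")` would select the same ISIP rows.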


# 3. Create a list of unique ko's found across all organisms. 
---
The list of unique ko's is formatted so that it can be read by a bash script. It is later used to look up the name and symbol of each ko_id and map them back to the organism-specific tables made above.  

In [9]:
#create a df of all unique ko's found for all annotations
all.ko <- bind_rows(ko4, ko6,ko8,ko13) %>% select(ko_id)

all.ko <- distinct(all.ko, ko_id, .keep_all = T)
nrow(all.ko)
#so that bash will read my file correctly, the list needs to start with ko= 
#and be wrapped in double quotes
kk=c('ko=')
q=c('"') 
all.ko=rbind(kk, q, all.ko, q)

#write to a .txt file with ko_id's separated by a space
#write.table(all.ko, "../expression_analysis/kegg_files/all.ko.txt", quote=F, sep="", eol=" ", row.names = FALSE, col.names = FALSE)
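
As an alternative way to build the bash-readable line, the sketch below collapses the IDs into a single `ko="..."` assignment with `paste0`. The exact format that `koNames.sh` expects is not shown in this notebook, so this form is an assumption; `ids` and `ko_line` are hypothetical names and the KO IDs are placeholders.

```r
# Placeholder KO IDs standing in for all.ko$ko_id
ids <- c("K00001", "K00002", "K00003")

# Collapse to a space-separated list and wrap it as a bash variable assignment
ko_line <- paste0('ko="', paste(ids, collapse = " "), '"')
cat(ko_line, "\n")
```

The resulting string could then be written to a text file with `writeLines(ko_line, <path>)`.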

## 3.1 Run bash script `koNames.sh` in script folder
A total of 7218 rows (ko_id's).

## 3.2 Clean up ko_def table
---
Separate the KO ID, symbol, and name into different columns, add the ISIP and PTOX genes, and save. 

In [192]:
#read in table with name and symbol matches to kegg ko's
#make tidy
ko_def <- read.delim("../expression_analysis/kegg_files/ko_pathways.txt", head=F,sep = ";")
id <- str_extract(ko_def$V1,'K[[:digit:]]*')
sym <- str_remove(ko_def$V1, 'ko:K[[:digit:]]*')
ko_def <- data.frame(ko_id = id, symbol=sym, name=ko_def$V2)

isip_ptox=data.frame(ko_id=c('ptox1','ptox2','ptox3','ptox4','isip_1a','isip_2a','isip_3'), 
                symbol=c('ptox1','ptox2','ptox3','ptox4','isip_1a','isip_2a','isip_3'),
                name=c('Plastid terminal oxidase A','Plastid terminal oxidase B','Plastid terminal oxidase C',
                       'Plastid terminal oxidase D','Iron stress induced protein 1a',
                       'Iron stress induced protein 2a', 'Iron stress induced protein 3'))
ko_def=bind_rows(ko_def, isip_ptox)
print(paste(nrow(ko_def), "    =   all unique ko's found across organisms"))
write.csv(ko_def,'../expression_analysis/kegg_files/ko_def.csv', row.names=F)
tail(ko_def)

[1] "7195     =   all unique ko's found across organisms"


Unnamed: 0_level_0,ko_id,symbol,name
Unnamed: 0_level_1,<chr>,<chr>,<chr>
7190,ptox2,ptox2,Plastid terminal oxidase B
7191,ptox3,ptox3,Plastid terminal oxidase C
7192,ptox4,ptox4,Plastid terminal oxidase D
7193,isip_1a,isip_1a,Iron stress induced protein 1a
7194,isip_2a,isip_2a,Iron stress induced protein 2a
7195,isip_3,isip_3,Iron stress induced protein 3


# Using the KEGG API for KEGG annotations 
To create a KEGG pathway key with ko_id, name, and symbol for the pathways of interest, use the KEGG API, starting with the KEGG pathway codes for:
- Photosynthesis
- Photosynthesis - antenna proteins
- Carbon fixation in photosynthetic organisms
- Nitrogen metabolism
- Carotenoid biosynthesis
Get the pathway name and all of the ko's within each pathway by looping through the pathway codes and extracting this information with the KEGG API. 
Then get the KEGG ko name and symbol by looping through each ko_id.
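
The parsing in these two loops can be tried offline. The sketch below uses base R on hand-written strings shaped like the tab-delimited text that the regular expressions in this notebook assume; the sample strings are illustrative, not captured API output, and `link_txt`, `find_txt`, and `path_ko_toy` are hypothetical names.

```r
# Illustrative text shaped like a `link/ko/<pathway>` response:
# one "path:<map>\tko:<K number>" pair per line.
link_txt <- "path:map00195\tko:K02703\npath:map00195\tko:K02704\n"

lines  <- strsplit(link_txt, "\n", fixed = TRUE)[[1]]
fields <- strsplit(lines, "\t", fixed = TRUE)
path_ko_toy <- data.frame(
  Path  = sub("^path:", "", vapply(fields, `[`, character(1), 1)),
  ko_id = sub("^ko:",   "", vapply(fields, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)

# Illustrative text shaped like one `find/ko/<K number>` response line:
# "ko:<K number>\t<symbol>; <name> [EC:...]"
find_txt <- "ko:K02703\tpsbA; photosystem II P680 reaction center D1 protein [EC:1.10.3.9]\n"
body   <- sub("^[^\t]*\t", "", sub("\n$", "", find_txt))  # drop the ID field and newline
symbol <- sub(";.*$", "", body)                            # text before the first ';'
name   <- sub("^[^;]*; ", "", body)                        # text after the first '; '
name   <- trimws(sub("\\[EC:.*\\]", "", name))             # drop the EC annotation

print(path_ko_toy)
print(c(symbol = symbol, name = name))
```

Splitting on the tab and on `; ` keeps the path and ko_id paired by line, which avoids relying on two separate `str_extract_all` calls returning vectors of equal length.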
## After running each loop, we should have a dataframe with columns:
- ko_id
- name
- symbol
- pathway/module
- sub_category
- broad_category
## Add in genes not on Kegg, or not associated with a pathway
After adding the additional genes, clean up the table by adding appropriate pathway names and shortening gene names for the heatmap.
## Extract all ko's from tables which appear in at least one sample
We will use the resulting tables for subsetting data later in making heat maps. We can create heat maps based on pathway, module, or category. 

In [154]:
#3. pathways
## get name for each pathway
## add pathway name to dataframe
## get list of all ko's for each pathway
## get name and symbol for each ko
heatMapPath = data.frame('Pathway'=c('map00195','map00196','map00710','map00910','map00906'), 
                         sub_category=c('Photosynthesis','Photosynthesis-antenna proteins','Carbon fixation in photosynthetic organisms',
                                       'Nitrogen metabolism','Carotenoid biosynthesis'))

url=c("https://rest.kegg.jp/")
find.path=list()
for (i in heatMapPath$Pathway){
    p=getURL(paste(url,'find/pathway/', i, sep=''))
    find.path=unname(c(find.path,p))
}

path_name=data.frame('path_name'=str_remove(find.path,'path:map[[:digit:]]{5}\t'))
path_name$path_name=str_remove(path_name$path_name, '\n')

sub_path=bind_cols(heatMapPath, path_name)
colnames(sub_path)[1]='Path'

p=list()
for (i in sub_path$Path){
    link.ko=getURL(paste(url,'link/ko/', i, sep=''))
    #link.ko=str_extract(p, 'ko:K[[:digit:]]+')
    p=c(p,link.ko)
}

p=data.frame(ko_id=unlist(str_extract_all(p,'K[[:digit:]]+')), 
               path=(unlist(str_extract_all(p, 'map[[:digit:]]{5}')))) 
symbol=list()
name=list()
for (i in p$ko_id){
    g=getURL(paste(url,'find/ko/', i, sep=''))
    symbol=c(symbol,str_remove(str_extract(unname(g), '\t.+;'), ';'))
    name=c(name, str_remove(str_extract(unname(g),'; .*'), '; '))
}
path_ko=data.frame('Path'=p$path,
                   'ko_id'=p$ko_id,
                   'symbol'=str_remove(unlist(symbol),'\t'),
                   'name'= str_remove(unlist(name), '\\[EC:.*\\]'))

sub_path=left_join(path_ko, sub_path, by='Path', relationship='many-to-many')

tail(sub_path)
dim(sub_path)

Unnamed: 0_level_0,Path,ko_id,symbol,name,sub_category,path_name
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
259,map00906,K22492,CYP175A,beta-carotene 3-hydroxylase,Carotenoid biosynthesis,Carotenoid biosynthesis
260,map00906,K22502,crtY,lycopene beta-cyclase,Carotenoid biosynthesis,Carotenoid biosynthesis
261,map00906,K23037,"CRTS, ASY",beta-carotene 4-ketolase/3-hydroxylase,Carotenoid biosynthesis,Carotenoid biosynthesis
262,map00906,K25072,crtNc,"4,4'-diapolycopenoate synthase",Carotenoid biosynthesis,Carotenoid biosynthesis
263,map00906,K25073,cruO,1'-hydroxy-gamma-carotene C-4' ketolase,Carotenoid biosynthesis,Carotenoid biosynthesis
264,map00906,K25074,crtU,carotenoid chi-ring synthase,Carotenoid biosynthesis,Carotenoid biosynthesis


In [193]:
read.vsd.k <- function(org){
    vsd.k <- read.csv(paste('../expression_analysis/vsd_files/', org,'vsd.kegg.csv',sep=''))
    colnames(vsd.k)[1] <- 'ko_id'
    colnames(vsd.k) <- gsub('X', '', colnames(vsd.k))
    vsd.k
    }

vsd.4.k = read.vsd.k('04')

vsd.8.k = read.vsd.k('08')

vsd.6.k = read.vsd.k('06')

vsd.13.k = read.vsd.k('13')


In [194]:
a=unique(vsd.4.k$ko_id)
a=a[a%in%sub_path$ko_id]

b=unique(vsd.8.k$ko_id)
b=b[b%in%sub_path$ko_id]

c=unique(vsd.6.k$ko_id)
c=c[c%in%sub_path$ko_id]

d=unique(vsd.13.k$ko_id)
d=d[d%in%sub_path$ko_id]


length(b)
length(c)
length(d)
length(a)
x=union(a,b)
y=union(c,d)
z=union(x,y)

length(x)
length(y)
length(z)
sub_path=sub_path[(sub_path$ko_id)%in%z,]
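
The same filter-then-union logic can be written once over a list of organisms. This is a sketch on placeholder data, assuming each organism's KO vector is stored as a list element; `ko_sets`, `pathway_kos`, `in_path`, and `z_toy` are hypothetical names.

```r
# Placeholder KO vectors standing in for the ko_id columns of the four vsd tables
ko_sets <- list(
  org4  = c("K02703", "K02704"),
  org8  = c("K02704", "K02705"),
  org6  = c("K02703"),
  org13 = c("K02706", "K99999")
)
# Placeholder for sub_path$ko_id
pathway_kos <- c("K02703", "K02704", "K02705", "K02706", "K02707")

# Keep only pathway KOs within each organism, then take the union across organisms
in_path <- lapply(ko_sets, function(x) intersect(unique(x), pathway_kos))
z_toy   <- Reduce(union, in_path)

print(lengths(in_path))
print(length(z_toy))
```

`z_toy` plays the role of `z` above: the set of pathway KOs that appear in at least one organism.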

In [195]:
## add flavodoxin and ferredoxin genes 
flavodoxin=data.frame(Path='map00195', ko_id=c('K03839','K03840','K21567','K00528'), 
                      symbol=c('fldA, nifF, isiB','fldB','fnr','fpr'), 
                      name=c('flavodoxin I','flavodoxin II','ferredoxin/flavodoxin---NADP+ reductase',
                            'ferredoxin/flavodoxin---NADP+ reductase'), sub_category='Photosynthesis',
                      path_name='Photosynthesis')
sub_path=bind_rows(sub_path, flavodoxin)  

## add urea cycle genes
urea_cycle = filter(ko_def, ko_id%in%c('K00611','K01940','K01755','K01476'))
urea_cycle = mutate(urea_cycle, 'sub_category'='Urea cycle', 'path_name'='Nitrogen metabolism', 'Path'='map00910')
sub_path = full_join(sub_path, urea_cycle)

## add ISIP genes
sub_path = bind_rows(sub_path, isip_ptox)


[1m[22mJoining with `by = join_by(Path, ko_id, symbol, name, sub_category, path_name)`


### Filter pathway table for genes found across organisms

In [201]:
heat.path=(filter(sub_path, (ko_id %in% ko_def$ko_id)==T))

### Correct any pathway names

In [202]:
heat.path$path_name[heat.path$path_name=='Photosynthesis - antenna proteins']='Photosynthesis'

heat.path[str_detect(heat.path$symbol, 'ATP.*F'), 'path_name'] = 'F-Type ATP-ase'

heat.path[str_detect(heat.path$name, 'violaxanthin|zeaxanthin|Iron stress.*|Plastid.*'), 'path_name'] = 'Photosynthesis'
tail(heat.path)

Unnamed: 0_level_0,Path,ko_id,symbol,name,sub_category,path_name
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
111,,ptox2,ptox2,Plastid terminal oxidase B,,Photosynthesis
112,,ptox3,ptox3,Plastid terminal oxidase C,,Photosynthesis
113,,ptox4,ptox4,Plastid terminal oxidase D,,Photosynthesis
114,,isip_1a,isip_1a,Iron stress induced protein 1a,,Photosynthesis
115,,isip_2a,isip_2a,Iron stress induced protein 2a,,Photosynthesis
116,,isip_3,isip_3,Iron stress induced protein 3,,Photosynthesis


### Correct subcategories

In [203]:

sub_symbol = function(sym_pat, new_sub_cat){
    heat.path[str_detect(heat.path$symbol, sym_pat), 5] = new_sub_cat
     heat.path
    }
heat.path=sub_symbol('DUF', 'NADH dehydrogenase')
heat.path=sub_symbol("COX|cox|CYC", "cytochrome c")
heat.path=sub_symbol("ATP.*V", "V-Type ATP-ase")
heat.path=sub_symbol("ATP.*F", "F-Type ATP-ase") 
heat.path=sub_symbol( 'maeB|ppdK', 'CAM light')
heat.path=sub_symbol('MDH1|MDH2|mdh|ppc', 'CAM dark')
heat.path=sub_symbol('CPS1', 'Urea cycle') 
heat.path=sub_symbol('psa', 'PSI') 
heat.path=sub_symbol('psb', 'PSII') 
heat.path=sub_symbol('LHCA', 'LHCA')
heat.path=sub_symbol('LHCB', 'LHCB') 
heat.path=sub_symbol('isip', 'ISIP') 
heat.path=sub_symbol('ptox', 'PSII') 

sub_name = function(name_pat, new_sub_cat){
    heat.path[str_detect(heat.path$name, name_pat), 5] = new_sub_cat
    heat.path
    }
heat.path=sub_name('carbonic anhydrase', 'Carbonic anhydrase')
heat.path=sub_name('nitrate/nitrite transport', 'Nitrogen transporters')
heat.path=sub_name('glutamate|glutamine', 'GS/GOGAT') 
heat.path=sub_name('nitrite reductase', 'Nitrite reductase')
heat.path=sub_name('nitrate reductase', 'Nitrate reductase') 
heat.path=sub_name('violaxanthin|zeaxanthin', 'Xanthophyll cycle')

heat.path[heat.path$ko_id %in% 
          c('K00855','K00927','K01100','K01601','K01623','K01624','K01783',
            'K01803','K01807','K01808','K02446','K03841','K11532','K00134',
            'K00615'), 'sub_category'] = 'Calvin cycle' 

heat.path$sub_category = str_replace(heat.path$sub_category, 'Photosynthesis', 'Electron transport chain')

heat.path$sub_category = str_replace(heat.path$sub_category, 'Carbon fixation in photosynthetic organisms', 'C4 Dicarboxilic acid cycle')
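
`sub_symbol` and `sub_name` read `heat.path` from the global environment and return a modified copy, so each call has to be assigned back. A sketch of a variant that takes the table as an argument is shown below on a toy table; `recat_by_symbol` and `toy_path` are hypothetical names, and base-R `grepl` stands in for `str_detect`.

```r
# Variant of sub_symbol that takes the table as an argument instead of
# reading the global heat.path; the function name is hypothetical.
recat_by_symbol <- function(df, sym_pat, new_sub_cat){
  hit <- grepl(sym_pat, df$symbol)      # rows whose symbol matches the pattern
  df$sub_category[hit] <- new_sub_cat   # overwrite the sub-category for those rows
  df
}

# Toy table with only the two columns the function touches
toy_path <- data.frame(
  symbol = c("psaA", "psbA", "LHCA1"),
  sub_category = "Photosynthesis",
  stringsAsFactors = FALSE
)
toy_path <- recat_by_symbol(toy_path, "psa", "PSI")
toy_path <- recat_by_symbol(toy_path, "psb", "PSII")
print(toy_path)
```

As in the cell above, rules are applied in order, so a later pattern that matches the same row overwrites an earlier assignment.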

### Correct gene names

In [204]:
heat.path[str_detect(heat.path$symbol, 'MDH1'), 'name'] = 'malate dehydrogenase 1'
heat.path[str_detect(heat.path$symbol, 'MDH2'), 'name'] = 'malate dehydrogenase 2'

heat.path$name= str_replace_all(heat.path$name, c(
    'light-harvesting complex I '='LHCA ',
    'light-harvesting complex II'='LHCB', 
    'chlorophyll'='Chl','photosystem I '='PSI ', 
    'photosystem II'='PSII', 
    'F-type .* subunit '='F-Type ATP-ase ', 
    'isip_'='Iron starvation induced protein ',
    'fructose-bisphosphate aldolase'='FBA', 
    'ribulose-bisphosphate carboxylase'='RuBisCO',
    'MFS transporter, NNP family, '=''))

heat.path$name = str_remove(heat.path$name, '.(phosphorylating).')
heat.path$name = str_remove(heat.path$name, '\\[.*\\]')

In [205]:
unique(heat.path$sub_category)
addGenes=filter(ko_def, ko_id %in%c('K03320','K01427','K04564','K04565','K08717','K08716'))
addGenes$name=str_remove(addGenes$name, '\\[.*\\]')
addGenes= mutate(addGenes, Path='map1111', 
                 sub_category=c('Nitrogen transporters', 'Nitrogen transporters','Superoxide dismutase','Nitrogen transporters','Superoxide dismutase','Urea cycle'),
                 broad_category=c('Nitrogen metabolism','Nitrogen metabolism','Photosynthesis','Nitrogen metabolism','Photosynthesis','Nitrogen metabolism'),
                 path_name=c('Nitrogen metabolism','Nitrogen metabolism','Photosynthesis','Nitrogen metabolism','Photosynthesis','Nitrogen metabolism'))

heat.path=bind_rows(heat.path, addGenes)

heat.path$name.2=str_squish(heat.path$name)

heat.path[heat.path$symbol==' ureG','name.2']='urease G'
heat.path[heat.path$symbol==' ureF','name.2']='urease F'
heat.path[heat.path$symbol=='CPS1','name.2']='carbamoyl-phosphate synthase 1'
heat.path[heat.path$symbol=='SLC14A','name.2']='SCF 14 urea transporter'
#filter(all.path, sub_category=='Superoxide dismutase')
heat.path$name.2 = str_replace(heat.path$name.2,'.*superoxide dismutase,','SOD')

#filter(all.path, sub_category=='Electron transport chain')
heat.path[heat.path$symbol=='petC','name.2'] = 'cytochrome b6f iron-sulfur subunit'

#filter(all.path, sub_category=='PSI')
heat.path$name.2 = str_replace(heat.path$name.2, 'psa', 'PSI')

filter(heat.path, sub_category=='PSII')
heat.path$name.2 = str_replace(heat.path$name.2, 'psb', 'PSII')
heat.path$name.2 = str_remove(heat.path$name.2, 'reaction center')
heat.path$name.2 = str_replace(heat.path$name.2, 'oxygen-evolving enhancer', 'OEC')
heat.path$name.2 = str_replace(heat.path$name.2, '13kDa', 'Psb28')
#filter(all.path, sub_category%in%c('LHCA','LHCB'))
heat.path$name.2 = str_remove(heat.path$name.2, 'Chl a/b binding protein ')
heat.path$name.2 = str_remove(heat.path$name.2, 'catalytic subunit')
heat.path$name.2 = str_remove(heat.path$name.2, '.system')
heat.path$name.2= str_replace(heat.path$name.2, 'nitrate/nitrite transport substrate-binding protein', 'nitrate/nitrite transport substrate-binding\nprotein')
heat.path[str_detect(heat.path$name.2,'glutamate dehydrogenase'),'sub_category'] = 'GDH'
heat.path[str_detect(heat.path$symbol,'ncd2|CYP55'),'sub_category'] = 'Nitrogen recycling'
heat.path$name.2 = str_replace(heat.path$name.2, 'dehydrogenase \\(oxaloacetate', 'dehydrogenase\n\\(oxaloacetate')
heat.path[str_detect(heat.path$ko_id, 'K22502'),8] ='lycopene beta-cyclase a'
heat.path[str_detect(heat.path$ko_id, 'K06443'),8] ='lycopene beta-cyclase b'
heat.path[str_detect(heat.path$symbol, 'nrtC'), 8]='nitrate/nitrite transport ATP-binding protein C'
heat.path[str_detect(heat.path$symbol, 'nrtD'), 8]='nitrate/nitrite transport ATP-binding protein D'
heat.path[str_detect(heat.path$ko_id, 'K01673'), 8]='carbonic anhydrase 1'
heat.path[str_detect(heat.path$ko_id, 'K01674'), 8]='carbonic anhydrase 2'

tail(heat.path,10)


Path,ko_id,symbol,name,sub_category,path_name,broad_category,name.2
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
map00195,K02703,psbA,PSII P680 reaction center D1 protein,PSII,Photosynthesis,,PSII P680 reaction center D1 protein
map00195,K02704,psbB,PSII CP47 Chl apoprotein,PSII,Photosynthesis,,PSII CP47 Chl apoprotein
map00195,K02705,psbC,PSII CP43 Chl apoprotein,PSII,Photosynthesis,,PSII CP43 Chl apoprotein
map00195,K02706,psbD,PSII P680 reaction center D2 protein,PSII,Photosynthesis,,PSII P680 reaction center D2 protein
map00195,K02714,psbM,PSII PsbM protein,PSII,Photosynthesis,,PSII PsbM protein
map00195,K02716,psbO,PSII oxygen-evolving enhancer protein 1,PSII,Photosynthesis,,PSII oxygen-evolving enhancer protein 1
map00195,K02717,psbP,PSII oxygen-evolving enhancer protein 2,PSII,Photosynthesis,,PSII oxygen-evolving enhancer protein 2
map00195,K02719,psbU,PSII PsbU protein,PSII,Photosynthesis,,PSII PsbU protein
map00195,K08901,psbQ,PSII oxygen-evolving enhancer protein 3,PSII,Photosynthesis,,PSII oxygen-evolving enhancer protein 3
map00195,K08902,psb27,PSII Psb27 protein,PSII,Photosynthesis,,PSII Psb27 protein


Unnamed: 0_level_0,Path,ko_id,symbol,name,sub_category,path_name,broad_category,name.2
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
113,,ptox4,ptox4,Plastid terminal oxidase D,PSII,Photosynthesis,,Plastid terminal oxidase D
114,,isip_1a,isip_1a,Iron stress induced protein 1a,ISIP,Photosynthesis,,Iron stress induced protein 1a
115,,isip_2a,isip_2a,Iron stress induced protein 2a,ISIP,Photosynthesis,,Iron stress induced protein 2a
116,,isip_3,isip_3,Iron stress induced protein 3,ISIP,Photosynthesis,,Iron stress induced protein 3
117,map1111,K03320,"amt, AMT, MEP","ammonium transporter, Amt family",Nitrogen transporters,Nitrogen metabolism,Nitrogen metabolism,"ammonium transporter, Amt family"
118,map1111,K08717,utp,urea transporter,Nitrogen transporters,Nitrogen metabolism,Nitrogen metabolism,urea transporter
119,map1111,K04564,SOD2,"superoxide dismutase, Fe-Mn family",Superoxide dismutase,Photosynthesis,Photosynthesis,SOD Fe-Mn family
120,map1111,K08716,SLC14A,solute carrier family 14 (urea transporter),Nitrogen transporters,Nitrogen metabolism,Nitrogen metabolism,solute carrier family 14 (urea transporter)
121,map1111,K04565,SOD1,"superoxide dismutase, Cu-Zn family",Superoxide dismutase,Photosynthesis,Photosynthesis,SOD Cu-Zn family
122,map1111,K01427,URE,urease,Urea cycle,Nitrogen metabolism,Nitrogen metabolism,urease


In [206]:
unique(heat.path$path_name)
write.csv(heat.path, '../expression_analysis/kegg_files/pathwaysHeatMap.csv', row.names=F)

### PFAM Rhodopsin protein

Pull out the rhodopsin ORFs (Pfam ID `7tm_1`) and, using the ko_ls dataframes, check which other proteins were assigned to each rhodopsin ORF.

In [31]:
pfam4 = clean_ko(pfam4_ls, '4','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam8 = clean_ko(pfam8_ls, '8','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam6 = clean_ko(pfam6_ls, '6','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')
pfam13 = clean_ko(pfam13_ls, '13','pfam', value_col = 'pfam_id',name_col = 'pfam_iteration')

write.csv(pfam4, '../expression_analysis/eggnog_annotations/pfam4.csv', row.names=F)
write.csv(pfam8, '../expression_analysis/eggnog_annotations/pfam8.csv', row.names=F)
write.csv(pfam6, '../expression_analysis/eggnog_annotations/pfam6.csv', row.names=F)
write.csv(pfam13, '../expression_analysis/eggnog_annotations/pfam13.csv', row.names=F)

[1] "# of values in matrix w/o NA: 44311"
[1] "# rows in final df:  44311"
[1] "# of values in matrix w/o NA: 37925"
[1] "# rows in final df:  37925"
[1] "# of values in matrix w/o NA: 68828"
[1] "# rows in final df:  68828"
[1] "# of values in matrix w/o NA: 61554"
[1] "# rows in final df:  61554"


In [229]:
rho4 = filter(pfam4, pfam_id =='7tm_1') %>% mutate('org'='4',symbol='rho',name='Rhodopsin')
rho6 = filter(pfam6, pfam_id =='7tm_1')%>% mutate('org'='6',symbol='rho',name='Rhodopsin')
rho13 = filter(pfam13, pfam_id =='7tm_1')%>% mutate('org'='13',symbol='rho',name='Rhodopsin')
colnames(rho4)

In [224]:
getnew=function(rho){
    m=1:nrow(rho)
    rho$ko_id=paste('rho_',m,sep='')
    rho$name=paste('Rhodopsin',m,sep=' ')
    rho$symbol=paste('rho_',m,sep='')
    rho
    }
getnew(rho13)


orfs,pfam_iteration,pfam_id,org,ko_id,name,symbol
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
NODE_14933_length_1730_cov_18.881110_g7679_i0.p1,pfam_1,7tm_1,13,rho_1,Rhodopsin 1,rho_1
NODE_1698_length_3663_cov_38.943175_g96_i7.p3,pfam_1,7tm_1,13,rho_2,Rhodopsin 2,rho_2
NODE_1698_length_3663_cov_38.943175_g96_i7.p7,pfam_1,7tm_1,13,rho_3,Rhodopsin 3,rho_3
NODE_186_length_5911_cov_236.270983_g96_i0.p10,pfam_1,7tm_1,13,rho_4,Rhodopsin 4,rho_4
NODE_186_length_5911_cov_236.270983_g96_i0.p1,pfam_1,7tm_1,13,rho_5,Rhodopsin 5,rho_5
NODE_186_length_5911_cov_236.270983_g96_i0.p7,pfam_1,7tm_1,13,rho_6,Rhodopsin 6,rho_6
NODE_21070_length_1411_cov_11.938714_g11103_i0.p1,pfam_1,7tm_1,13,rho_7,Rhodopsin 7,rho_7
NODE_241_length_5596_cov_245.657070_g96_i1.p5,pfam_1,7tm_1,13,rho_8,Rhodopsin 8,rho_8
NODE_241_length_5596_cov_245.657070_g96_i1.p3,pfam_1,7tm_1,13,rho_9,Rhodopsin 9,rho_9
NODE_241_length_5596_cov_245.657070_g96_i1.p10,pfam_1,7tm_1,13,rho_10,Rhodopsin 10,rho_10
