# 2_6 InterProScan results analysis

Manuel Jara-Espejo$^{1}$\
Aboobaker lab, Department of Biology, University of Oxford

## Contents of notebook
1. Introduction
2. Identify transcription factors based on Pfam domains
3. GO terms

## Files
* Input: InterProScan output (.tsv)
* Output: tf_by_family.txt, interpro_GOterms.csv

### 1. Introduction
The non-redundant peptide file generated with TransDecoder and CD-HIT was annotated with InterProScan to obtain Pfam domain and Gene Ontology (GO) term information.

### 2. Identify transcription factors based on Pfam domains

In [73]:
%%bash
cd ../annotation/protein_coding_genes/pfam_search/

#ruby get_TF-ids.rb > TF_families.stats.txt
head -15 TF_families.stats.txt

Genes having domains: 14350
Pfam families indentified: 4711
Pfam families descritpions: 4528
Genes having TF domains: 983
TF Pfam families indentified: 64
Zinc finger, C2H2 type	504
RFX DNA-binding domain	4
Homeodomain	93
TAZ zinc finger	3
THAP domain	34
HMG (high mobility group) box	33
Helix-loop-helix DNA-binding	47
TATA-binding protein (TBP)	3
bZIP Maf transcription factor	6
P53 DNA-binding domain	4


In [20]:
%%bash
cd ../annotation/protein_coding_genes/pfam_search/

In [71]:
import pandas as pd
import numpy as np
# Column 0: TF family name; column 1: comma-separated gene IDs
tf_by_family = pd.read_table("/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/pfam_search/tf_genes_by_family.txt", sep="\t", header = None )
# Split the gene list and reshape to one row per (family, gene) pair
tf_by_family = tf_by_family.assign(gene_id=tf_by_family[1].str.split(',')).drop(1,axis=1).explode('gene_id')
tf_by_family.rename(columns={0: 'TF_family'}, inplace=True)
# Remove repeated (family, gene) pairs; the index keeps gaps where duplicates were dropped
tf_by_family = tf_by_family.reset_index(drop=True).drop_duplicates()
tf_by_family.to_csv("/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/pfam_search/tf_by_family.txt", 
                    sep='\t',index=False)

In [72]:
tf_by_family.head()

Unnamed: 0,TF_family,gene_id
0,"Zinc finger, C2H2 type",MSTRG.35407
1,"Zinc finger, C2H2 type",MSTRG.55036
4,"Zinc finger, C2H2 type",MSTRG.51612
7,"Zinc finger, C2H2 type",MSTRG.55232
8,"Zinc finger, C2H2 type",MSTRG.67235
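The split-and-explode reshaping used above can be checked on a small in-memory table. This is a minimal sketch, not the notebook's own data: the gene IDs and the two-row input are made up for illustration, and the per-family count at the end is an extra step that mirrors the kind of summary shown in TF_families.stats.txt.

```python
import pandas as pd

# Toy stand-in for tf_genes_by_family.txt: family name, then comma-separated gene IDs.
# The gene IDs are hypothetical and only illustrate the layout.
raw = pd.DataFrame({
    0: ["Homeodomain", "THAP domain"],
    1: ["MSTRG.1,MSTRG.2,MSTRG.2", "MSTRG.3"],
})

long_df = (
    raw.assign(gene_id=raw[1].str.split(","))  # string -> list of gene IDs
       .drop(columns=1)
       .explode("gene_id")                     # one row per list element
       .rename(columns={0: "TF_family"})
       .drop_duplicates()                      # a gene listed twice counts once
       .reset_index(drop=True)                 # contiguous index after deduplication
)

# Number of distinct genes assigned to each family
genes_per_family = long_df.groupby("TF_family")["gene_id"].nunique()
print(long_df)
print(genes_per_family)
```

Resetting the index after `drop_duplicates()` rather than before gives a contiguous index, which can be easier to read than the gapped index in the output above.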


### 3. GO terms
The GO terms were extracted from the GO annotation column (column 14) of the InterProScan output file.

#### 3.1 Loading the InterProScan table

In [232]:
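interproscan_df = pd.read_table("/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/pfam_search/peps_filt.tsv", sep="\t", header = None)
# Keep the protein ID (column 0) and the GO annotations (column 13); copy to avoid chained-assignment warnings
interproscan_df = interproscan_df.iloc[:,[0,13]].copy()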

In [233]:
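# Protein IDs have four dot-separated fields; the first two (e.g. MSTRG and 3923) form the gene ID
interproscan_df[["gene","transcript","d1","d2"]] = interproscan_df[0].str.split("\\.", expand = True)
# drop() returns a new DataFrame, so the result has to be assigned back
interproscan_df = interproscan_df.drop(['d1','d2'],axis=1)
interproscan_df["gene_id"] = interproscan_df[['gene', 'transcript']].agg('.'.join, axis=1)
interproscan_df = interproscan_df[["gene_id",13]].drop_duplicates()
interproscan_df.rename(columns={13: 'GO_term'}, inplace=True)
# Keep only rows that carry at least one GO annotation
interpro_GOterms = interproscan_df[interproscan_df["GO_term"].str.contains("GO:",na=False)]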
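The gene ID can also be derived without temporary columns by splitting from the right. The sketch below assumes protein IDs with four dot-separated fields whose first two fields form the gene ID; the full IDs shown (with a trailing isoform number and `p1`/`p2` ORF suffix) are hypothetical examples of such a layout, not values taken from the input file.

```python
import pandas as pd

# Hypothetical protein IDs: <gene prefix>.<gene number>.<isoform>.<ORF>
protein_ids = pd.Series(["MSTRG.3923.1.p1", "MSTRG.3923.2.p1", "MSTRG.2511.1.p2"])

# Split at most twice from the right and keep the leading part (the gene ID)
gene_ids = protein_ids.str.rsplit(".", n=2).str[0]
print(gene_ids.tolist())
```

This only holds if every ID has exactly two trailing fields after the gene ID; IDs with a different number of fields would need a different rule.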

In [238]:
# Split the '|'-separated GO annotations into one row per (gene, GO term) pair
interpro_GOterms = interpro_GOterms.assign(GO_term=interpro_GOterms["GO_term"].str.split('|')).explode('GO_term')
interpro_GOterms
#interpro_GOterms.to_csv("/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/pfam_search/interpro_GOterms.csv", 
#                    sep='\t',index=False)

Unnamed: 0,gene_id,GO_term
2,MSTRG.3923,GO:0005576
2,MSTRG.3923,GO:0016829
6,MSTRG.2511,GO:0005515
9,MSTRG.28453,GO:0005524
9,MSTRG.28453,GO:0016887
...,...,...
53317,MSTRG.19665,GO:0015074
53319,MSTRG.25706,GO:0005840
53361,MSTRG.33262,GO:0003735
53361,MSTRG.33262,GO:0005840
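As a sanity check on a long-format table like the one above, the GO identifiers can be validated and counted per gene. This is a minimal sketch on toy rows: the first three rows reuse gene/GO pairs shown in the output above, while `MSTRG.0` with `-` is a made-up row standing in for a protein without GO annotation, and the `GO:` plus seven digits pattern is the standard GO identifier format.

```python
import pandas as pd

# Toy rows mimicking the GO column: '|'-separated terms, '-' when no term is assigned.
df = pd.DataFrame({
    "gene_id": ["MSTRG.3923", "MSTRG.2511", "MSTRG.28453", "MSTRG.0"],
    "GO_term": ["GO:0005576|GO:0016829", "GO:0005515", "GO:0005524|GO:0016887", "-"],
})

# One row per (gene, GO term) pair
long_go = df.assign(GO_term=df["GO_term"].str.split("|")).explode("GO_term")

# Keep only well-formed GO identifiers and drop repeated pairs
is_go_id = long_go["GO_term"].str.match(r"^GO:\d{7}$", na=False)
long_go = long_go[is_go_id].drop_duplicates()

# Number of distinct GO terms per gene
terms_per_gene = long_go.groupby("gene_id")["GO_term"].nunique()
print(terms_per_gene)
```

Filtering on the full identifier pattern after exploding is stricter than a substring match before it, since it also removes any malformed fragment left over from the split.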
