# TACOS
In notebook TACOS_201, we generated files for upload to TACOS. Here, we analyze the results. Each file was uploaded manually to the TACOS server, and the corresponding cell line was selected from the pull-down menu. Each generated CSV file was saved to disk. The 10 results files contained three fields per line: FASTA defline, predicted localization (either ‘Nucleus’ or ‘Cytoplasm’), and a probability (between 0 and 1). We considered ‘Nucleus’ to be a correct prediction if the true CNRCI was zero or below, and ‘Cytoplasm’ to be a correct prediction if the true CNRCI was zero or above.
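
The scoring rule described above can be written as a small predicate. This is a minimal sketch, not the notebook's own code: the function name `is_correct` and the upper-case constants are hypothetical names introduced for illustration, and treating an unrecognized label as incorrect is an assumption.

```python
# Labels as returned by the TACOS server (taken from the text above).
NUC_LABEL = 'Nucleus'
CYTO_LABEL = 'Cytoplasm'

def is_correct(cnrci: float, prediction: str) -> bool:
    """Return True when the predicted label agrees with the sign of CNRCI.

    A CNRCI of exactly zero is accepted for either label.
    """
    if prediction == NUC_LABEL:
        return cnrci <= 0
    if prediction == CYTO_LABEL:
        return cnrci >= 0
    return False  # assumption: an unrecognized label counts as incorrect

print(is_correct(-0.524915, 'Nucleus'))
```

Because the two conditions overlap at zero, a transcript with CNRCI exactly 0 is scored as correct whichever label TACOS returns.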

Use the TACOS [web server](https://balalab-skku.org/TACOS/).
Assume the TACOS_output directory was populated with 10 CSV files from the TACOS server.   
This notebook generates performance statistics.    

In [1]:
import numpy as np
np.random.seed(seed=1234)
FASTA_SIZE = 100
RESULTS='TACOS_output/'
# The lncATLAS database has data for 15 cell lines. 
# TACOS supports 10 of those. 
TACOS_CELL_LINES=['A549','GM12878','HELA','HEPG','HESC','HT1080','HUVEC','NHEK','SKMEL','SKNS']
ATLAS_CELL_LINES=['A549','GM12878','HeLa.S3','HepG2','H1.hESC','HT1080','HUVEC','NHEK','SK.MEL.5','SK.N.SH']
# Labels returned by TACOS server
nuc_label = 'Nucleus'
cyto_label = 'Cytoplasm'

## Processing
Process the files returned by TACOS. 
Each file is a CSV with a header row; the first field of each data row is the FASTA defline we submitted.
A typical row contains the transcript ID (TID), gene ID (GID), CNRCI truth, predicted class, and predicted probability.
For example:

    ENST00000607222 ENSG00000272106 -0.524915,Nucleus,0.487
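
As an illustration of that layout, one row can be split into typed fields as follows. This is a sketch under the assumption that every data row has exactly two commas and that the defline fields are separated by single spaces, as in the example; `parse_row` is a hypothetical helper name, not part of TACOS or of this notebook.

```python
def parse_row(line: str):
    """Split one TACOS result row into (tid, gid, cnrci, prediction, prob)."""
    # The row is 'defline,prediction,probability'.
    inputs, prediction, prob = line.strip().split(',')
    # The defline is 'TID GID CNRCI', separated by single spaces.
    tid, gid, rci = inputs.split(' ')
    return tid, gid, float(rci), prediction, float(prob)

row = parse_row('ENST00000607222 ENSG00000272106 -0.524915,Nucleus,0.487\n')
print(row)
```

A row that does not match this shape raises `ValueError`, which is a useful signal that the file is not in the expected format.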

In [2]:
# For each of 10 cell lines:
# parse the CSV file returned by TACOS (the first line is a header).
# From each row, extract CNRCI (ground truth), predicted class (TACOS), 
# and predicted probability (TACOS).
# Count prediction as correct if CNRCI<=0 and prediction==nucleus
# or CNRCI>=0 and prediction==cytoplasm, incorrect otherwise.
# (CNRCI exactly zero is accepted for either class.)
# Show statistics per cell line.

accucacy_per_cell_line = list()
total_correct = 0
total_incorrect = 0
for i in range(len(TACOS_CELL_LINES)):
    cell_line = TACOS_CELL_LINES[i]
    real_name = ATLAS_CELL_LINES[i]
    print('Processing cell line:', real_name, cell_line)
    correct = 0
    incorrect = 0
    total = 0
    probabilities = list()
    outfn = RESULTS+cell_line+'.csv'         # match saved web server results
    try:
        with open(outfn, 'r') as fin:
            header = None
            for line in fin:
                if header is None:
                    header = line
                    continue
                # expect text like this:
                # ENST00000607222 ENSG00000272106 -0.524915,Nucleus,0.487
                line = line.strip()
                inputs,prediction,prob = line.split(',')
                prob = float(prob)
                tid,gid,rci = inputs.split(' ')
                rci = float(rci)
                # Give benefit of doubt if rci is exactly zero!
                if rci <= 0 and prediction==nuc_label:
                    correct += 1
                elif rci >= 0 and prediction==cyto_label:
                    correct += 1
                else:
                    incorrect += 1
                total += 1
                probabilities.append(prob)
        rounded_accuracy_percent = int(0.5+100*correct/(total))
        accucacy_per_cell_line.append(rounded_accuracy_percent)
        total_correct += correct
        total_incorrect += incorrect
        print('Average probability: %f' % np.mean(probabilities))
        print('Correct / Incorrect: %d/%d' % (correct,incorrect))
        print('Cell line: %s Accuracy: %d%%' % (cell_line,rounded_accuracy_percent))
        print()
    except FileNotFoundError:
        # Catch only the missing-file case so that parsing errors are not hidden.
        print('File not found:', outfn)
        print()


Processing cell line: A549 A549
Average probability: 0.496537
Correct / Incorrect: 59/36
Cell line: A549 Accuracy: 62%

Processing cell line: GM12878 GM12878
Average probability: 0.463760
Correct / Incorrect: 47/49
Cell line: GM12878 Accuracy: 49%

Processing cell line: HeLa.S3 HELA
Average probability: 0.376287
Correct / Incorrect: 47/40
Cell line: HELA Accuracy: 54%

Processing cell line: HepG2 HEPG
Average probability: 0.487021
Correct / Incorrect: 53/42
Cell line: HEPG Accuracy: 56%

Processing cell line: H1.hESC HESC
Average probability: 0.480206
Correct / Incorrect: 65/32
Cell line: HESC Accuracy: 67%

Processing cell line: HT1080 HT1080
Average probability: 0.642303
Correct / Incorrect: 50/39
Cell line: HT1080 Accuracy: 56%

Processing cell line: HUVEC HUVEC
Average probability: 0.410225
Correct / Incorrect: 55/34
Cell line: HUVEC Accuracy: 62%

Processing cell line: NHEK NHEK
Average probability: 0.479372
Correct / Incorrect: 49/45
Cell line: NHEK Accuracy: 52%

Processing cell

In [10]:
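# For each cell line,
# show the accuracy (rounded to a whole percent).
# Also show the mean and standard deviation of these rounded accuracies
# across cell lines. Note that np.std defaults to the population form (ddof=0).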

print('Accuracy per cell line:')
for i in range(len(TACOS_CELL_LINES)):
    print(TACOS_CELL_LINES[i], accucacy_per_cell_line[i])
print('Mean:', np.mean(accucacy_per_cell_line))
print('Standard deviation:', np.std(accucacy_per_cell_line))

Accuracy per cell line:
A549 62
GM12878 49
HELA 54
HEPG 56
HESC 67
HT1080 56
HUVEC 62
NHEK 52
SKMEL 52
SKNS 51
Mean: 56.1
Standard deviation: 5.503635162326805
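
The standard deviation printed above is the population form, because `np.std` uses `ddof=0` by default. With only 10 cell lines the sample form (`ddof=1`) is noticeably larger. The sketch below reproduces both from the accuracies listed above, using the standard-library `statistics` module rather than NumPy so that it runs on its own; the variable names are illustrative.

```python
import statistics

# Rounded per-cell-line accuracies copied from the output above.
accuracies = [62, 49, 54, 56, 67, 56, 62, 52, 52, 51]

mean_acc = statistics.mean(accuracies)
pop_sd = statistics.pstdev(accuracies)    # divides by n, like np.std(x)
sample_sd = statistics.stdev(accuracies)  # divides by n-1, like np.std(x, ddof=1)

print('Mean:', mean_acc)
print('Population SD:', pop_sd)
print('Sample SD:', sample_sd)
```

Which form to report depends on whether the 10 cell lines are treated as the whole population of interest or as a sample from a larger set.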


In [8]:
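# Pooled accuracy over every scored row. The "10*100=1000" in the label is the
# nominal number of sequences submitted; the per-cell-line counts above show
# fewer than 100 scored rows per file, so the true denominator is smaller.
overall = 100.0*total_correct/(total_correct+total_incorrect)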
print('Overall accuracy of 10*100=1000 genes: %f%%' % overall)

Overall accuracy of 10*100=1000 genes: 56.147987%
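
The overall figure pools all scored rows, whereas the earlier mean averages the per-cell-line accuracies with equal weight. The two can differ when files contain different numbers of rows. The sketch below contrasts them using only the eight cell lines whose counts are visible in the output above; the remaining two are omitted because their output is truncated, so these numbers will not match the 10-cell-line totals. The variable names are illustrative.

```python
# (correct, incorrect) counts copied from the per-cell-line output above.
# Only the eight cell lines with visible counts are included.
counts = {
    'A549': (59, 36),
    'GM12878': (47, 49),
    'HELA': (47, 40),
    'HEPG': (53, 42),
    'HESC': (65, 32),
    'HT1080': (50, 39),
    'HUVEC': (55, 34),
    'NHEK': (49, 45),
}

# Macro average: every cell line gets equal weight.
per_line = {name: c / (c + w) for name, (c, w) in counts.items()}
macro = sum(per_line.values()) / len(per_line)

# Pooled accuracy: every scored row gets equal weight.
total_correct = sum(c for c, _ in counts.values())
total_scored = sum(c + w for c, w in counts.values())
pooled = total_correct / total_scored

print('Macro average: %.2f%%' % (100 * macro))
print('Pooled accuracy: %.2f%%' % (100 * pooled))
```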


In [9]:
print('done')

done
