# Python to count co-occurrence
In a previous notebook, with K=2,
we noted which K-mer pairs are most indicative of ProteinCoding or NonCoding.
Here, we write Python to count the co-occurrence of those pairs.
For any pair (a,b), we define count(a,b) := min(count(a), count(b)).
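
As a small illustration of this definition, the sketch below counts 2-mers in a made-up toy sequence and takes the minimum for one pair. The helper names `count_kmers` and `cooccurrence` are hypothetical and not part of this notebook's pipeline, and counting overlapping 2-mers is an assumption about how the upstream counts were produced.

```python
from collections import Counter

def count_kmers(seq, k=2):
    """Count overlapping K-mers in a sequence (toy helper, an assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cooccurrence(counts, pair):
    """Return min(count(a), count(b)) for a pair written as 'a-b'."""
    a, b = pair.split('-')
    return min(counts[a], counts[b])

counts = count_kmers("ACGCGTACG")      # made-up sequence
print(counts["CG"], counts["AC"])      # prints: 3 2
print(cooccurrence(counts, "CG-AC"))   # prints: 2
```

A pair scores zero whenever either of its 2-mers is absent from the sequence, because `Counter` returns 0 for missing keys.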

In [None]:
#!/usr/bin/env python
# coding: utf-8

'''
Input: 2-mer count CSV
Output: input plus added columns for co-occurrence
'''
import sys
import pandas as pd

def make_comatrix(inprefix,outprefix):
    SUFFIX='.features.csv'
    infile= inprefix + SUFFIX
    outfile= outprefix + SUFFIX
    print("in/out: "+infile+" / "+outfile)

    pairs_of_interest=[
        # indicators of nc
        'CA-CG','CC-CG','CG-AC','CG-AG',
        'CG-GC','CG-CT','CG-GA','CG-GG',
        'CG-GT','CG-TG','CG-AA','CG-AT',
        'CG-TA','CG-TC','CG-TT','GC-GG',
        # indicators of pc
        'AA-AT','AA-CA','AA-CT','AA-TT',
        'AC-TA','CC-TA','CT-TT','GA-TA',
        'TA-AA','TA-AG','TA-AT','TA-CA',
        'TA-CT','TA-TC','TA-TG','TA-TT',
    ]

    all_seqs=[]
    # one row per sequence, one column per 2-mer count
    df2 = pd.read_csv(infile)
    rows=df2.shape[0]
    for r in range(rows):
        features_per_seq=[]
        for pair in pairs_of_interest:
            # each pair is written 'XX-YY'
            mer0=pair[:2]
            mer1=pair[3:]
            val0=df2.iloc[r].loc[mer0] # this is slow
            val1=df2.iloc[r].loc[mer1] # convert to hashes?
            # co-occurrence is the smaller of the two counts
            minval=min(val0,val1)
            features_per_seq.append(minval)
        all_seqs.append(features_per_seq)
    # append the co-occurrence columns to the original counts
    df3=pd.DataFrame(all_seqs,columns=pairs_of_interest)
    dfc=pd.concat([df2,df3],axis='columns')
    dfc.to_csv(outfile,index_label='index')

In [None]:
#make_comatrix('ncRNA.2mer','ncRNA.2mer_co')
# This took an hour.
# This read in ncRNA.2mer.features.csv
# This generated ncRNA.2mer_co.features.csv

In [None]:
#make_comatrix('pcRNA.2mer','pcRNA.2mer_co')
# This took an hour.
# This read in pcRNA.2mer.features.csv
# This generated pcRNA.2mer_co.features.csv
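
The hour-long runtime noted above most likely comes from the per-row `iloc`/`loc` lookups inside the double loop, which the code comments already flag as slow. A vectorized variant is sketched below, assuming every 2-mer named in a pair exists as a numeric column of the input table. It computes each co-occurrence column with one `numpy.minimum` call. The function name `add_cooccurrence` and the toy data are hypothetical, and this sketch has not been run on the real feature files.

```python
import numpy as np
import pandas as pd

def add_cooccurrence(df, pairs):
    """Return df plus one min(count(a), count(b)) column per 'a-b' pair."""
    new_cols = {}
    for pair in pairs:
        mer0, mer1 = pair.split('-')
        # Element-wise minimum over whole columns, no per-row lookups.
        new_cols[pair] = np.minimum(df[mer0].to_numpy(), df[mer1].to_numpy())
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis='columns')

# Toy 2-mer counts for three made-up sequences.
toy = pd.DataFrame({'CG': [3, 0, 5], 'AC': [2, 4, 1], 'TA': [0, 7, 2]})
out = add_cooccurrence(toy, ['CG-AC', 'TA-AC'])
print(out)
```

The loop now runs once per pair rather than once per pair per sequence, so its cost no longer scales with the number of rows in Python-level code.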

## Results with SVM and RF
First we tested a support vector machine (SVM) and a random forest (RF) on the 16 2-mer counts alone. Scores were about 79%.
Then we added co-occurrence counts.
We ranked 2-mer pairs by their ratio of co-occurrence in nc vs pc sequences.
We took the 16 pairs at each end of the distribution.
We counted the co-occurrence of a pair as the min of its two 2-mer counts.
We observed that the co-occurrence features mostly reflected AT vs GC sequence content.
This was quick and dirty; we did not exclude overlapping K-mers.
Lastly, we tested SVM and RF on the 16 2-mer counts plus the 32 co-occurrence counts. Scores were about 74%, lower than with the 2-mer counts alone.
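
The pair-selection step described above could be sketched as follows. The notebook does not state exactly how the nc-vs-pc ratio was computed, so this version assumes the mean min-count per class, with a pseudocount of 1 to avoid division by zero. The function name `rank_pairs`, the pseudocount, and the toy tables are all illustrative assumptions.

```python
from itertools import combinations
import numpy as np
import pandas as pd

def rank_pairs(nc, pc, mers, pseudo=1.0):
    """Rank unordered K-mer pairs by mean co-occurrence in nc over mean in pc."""
    ratios = {}
    for a, b in combinations(mers, 2):
        nc_mean = np.minimum(nc[a], nc[b]).mean()
        pc_mean = np.minimum(pc[a], pc[b]).mean()
        # Pseudocount (an assumption) keeps the ratio finite.
        ratios[f'{a}-{b}'] = (nc_mean + pseudo) / (pc_mean + pseudo)
    return pd.Series(ratios).sort_values(ascending=False)

# Toy 2-mer counts, two made-up sequences per class.
nc = pd.DataFrame({'CG': [4, 6], 'GC': [5, 3], 'TA': [0, 1], 'AA': [1, 0]})
pc = pd.DataFrame({'CG': [0, 1], 'GC': [2, 1], 'TA': [5, 4], 'AA': [6, 3]})
ranked = rank_pairs(nc, pc, ['CG', 'GC', 'TA', 'AA'])
top_nc = ranked.head(1)   # the notebook kept 16 at this end
top_pc = ranked.tail(1)   # and 16 at this end
print(ranked)
```

On the toy data the top-ranked pair is `CG-GC` and the bottom-ranked pair is `TA-AA`, which mirrors the AT vs GC split noted above.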