# Task description 
### Take the ‘proteins.fasta’ file as input and produce the following tables as output: 
1.	A list of each of the unique 10mer sequences in the input file, with a count of how many times each occurs. It should be two tab-separated columns with the 10mer sequence in the first column and its counts in the second column, sorted in decreasing order by count. As a bonus, include a third column that contains an ‘X’ if the counts of the peptide are in the 20th to 30th percentile of overall counts (with the first percentile being the highest). 
Example: 
          DGTRREEFQW     35
          HHPWVWLKSS     28
          PRPRRRPRWS     5
          ...



Installing the Biopython package for handling sequence data.

In [1]:
pip install biopython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Importing the packages needed for the analysis.
Mounting Google Drive so that the input file can be read from Google Colab.

In [2]:
import Bio
import pandas as pd
import numpy as np
from Bio import SeqIO
from collections import Counter
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


**Main body of program.**

In [3]:
# list to hold every 10mer found in the input
fragment_10mer = []

# reading each record from the FASTA file
for seq_record in SeqIO.parse("/content/gdrive/My Drive/Colab Notebooks/proteins.fasta","fasta"):
  sequence = str(seq_record.seq)

  # slicing the sequence into overlapping 10mers; stopping at
  # len(sequence) - 10 guarantees that every slice is a full 10mer
  for i in range(len(sequence) - 9):
    fragment_10mer.append(sequence[i:i+10])

In [4]:
#counting the unique seq
final_out = Counter(fragment_10mer)
#final_out
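
The sliding-window extraction and counting can be checked without the FASTA file. The following is a minimal sketch on made-up toy sequences; the helper name `count_kmers` and the input strings are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def count_kmers(sequences, k=10):
    """Count every overlapping k-mer across an iterable of sequence strings."""
    counts = Counter()
    for sequence in sequences:
        # a sequence of length n has n - k + 1 overlapping windows of length k;
        # sequences shorter than k give an empty range and contribute nothing
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts

# toy input (hypothetical): two 12-residue strings that differ only in the last residue
toy = ["ACDEFGHIKLMN", "ACDEFGHIKLMP"]
toy_counts = count_kmers(toy)
print(toy_counts.most_common())
```

Each toy string yields three windows, and the first two windows are shared, so they are counted twice while the last window of each string is counted once.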

**Transforming the data into a DataFrame for further manipulation**

In [5]:
df = pd.DataFrame.from_records(final_out.most_common(),columns=['sequence','count_10mer'])
df['sequence'] = df['sequence'].astype('str') 
df

Unnamed: 0,sequence,count_10mer
0,GEDGCWYGME,500
1,EDGCWYGMEI,500
2,VVTTDISEMG,500
3,VTTDISEMGA,500
4,TTDISEMGAN,500
...,...,...
28847,WLETKGVDRL,1
28848,LETKGVDRLK,1
28849,ETKGVDRLKR,1
28850,TKGVDRLKRM,1


**Adding an ‘X’ if the counts of the peptide are in the 20th to 30th percentile of overall counts**

In [6]:
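# flag counts lying between the 0.2 and 0.3 quantiles of the count distribution
# (pandas computes quantiles in ascending order)
lower = df['count_10mer'].quantile(.2)
upper = df['count_10mer'].quantile(.3)
in_band = df['count_10mer'].between(lower, upper)

# 'X' for flagged rows, empty string otherwise
df['X_mark'] = np.where(in_band, 'X', '')
df[df['X_mark'] == 'X']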

Unnamed: 0,sequence,count_10mer,X_mark
17442,ALALGMMVLR,2,X
17443,LALGMMVLRI,2,X
17444,ALGMMVLRIV,2,X
17445,LGMMVLRIVR,2,X
17446,GMMVLRIVRN,2,X
...,...,...,...
28847,WLETKGVDRL,1,X
28848,LETKGVDRLK,1,X
28849,ETKGVDRLKR,1,X
28850,TKGVDRLKRM,1,X
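
The cell above marks counts lying between the 0.2 and 0.3 quantiles of the ascending distribution, which is the low end of the counts (the rows shown have counts of 1 and 2). The task statement counts percentiles from the highest value down, and under that reading the band would correspond to the 0.7 to 0.8 quantiles instead. A minimal sketch of that alternative interpretation on synthetic counts follows; the function name `mark_top_band`, its parameters, and the toy data are assumptions for illustration only.

```python
import pandas as pd

def mark_top_band(counts: pd.Series, top_start=0.20, top_end=0.30) -> pd.Series:
    """Return 'X' where a count falls in the band measured from the top.

    With the first percentile being the highest, the 20th to 30th percentile
    corresponds to the 0.80 down to 0.70 quantiles in ascending order.
    """
    upper = counts.quantile(1 - top_start)  # 0.80 quantile (ascending)
    lower = counts.quantile(1 - top_end)    # 0.70 quantile (ascending)
    in_band = counts.between(lower, upper)  # inclusive on both ends
    return in_band.map({True: "X", False: ""})

# synthetic counts 1..100 (hypothetical data, not the notebook's counts)
toy_counts = pd.Series(range(1, 101))
toy_marks = mark_top_band(toy_counts)
print(toy_counts[toy_marks == "X"].tolist())
```

On real data with many tied counts, the band can cover far more or fewer than a tenth of the rows, because every row sharing a count inside the band is marked.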


**Converting the result to CSV and TSV files**

In [7]:
df.to_csv('/content/gdrive/My Drive/Colab Notebooks/10mer_task1.csv', index=False)
df.to_csv('/content/gdrive/My Drive/Colab Notebooks/10mer_task1.tsv', index=False, sep='\t')
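
The example in the task description shows no header row. If the output needs to match that layout exactly, the header could be suppressed when writing. The sketch below assumes that reading of the task; it writes to an in-memory buffer and reuses the three example rows from the task description rather than the real results.

```python
import io
import pandas as pd

# the three example rows from the task description, deliberately out of order
toy_df = pd.DataFrame({
    "sequence": ["DGTRREEFQW", "PRPRRRPRWS", "HHPWVWLKSS"],
    "count_10mer": [35, 5, 28],
})

# sort by count in decreasing order, then write tab-separated rows without a header
ordered = toy_df.sort_values("count_10mer", ascending=False)
buffer = io.StringIO()
ordered.to_csv(buffer, sep="\t", index=False, header=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```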

# New Section

2.	A file that contains the following summary statistics on the counts: min, max, median, mean, and variance. There should be one label and value per row, separated by a colon and a space. 
Example: 

      min: 1 
      max: 52 
      median: 6 
      ... 


**Further descriptive analysis**

In [8]:
df.describe()

Unnamed: 0,count_10mer
count,28852.0
mean,58.479066
std,89.831048
min,1.0
25%,1.0
50%,10.0
75%,89.0
max,500.0


In [9]:
variance = df['count_10mer'].var()

In [10]:
print(df['count_10mer'].std())
print((df['count_10mer'].std())**2)

89.83104818473653
8069.617217968456
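
The squared standard deviation matches the variance because pandas uses the same divisor for both: `Series.var()` and `Series.std()` default to the sample estimate (`ddof=1`), whereas NumPy's `np.var` defaults to the population estimate (`ddof=0`). A small sketch on made-up counts shows the difference; the values here are illustrative and are not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# toy counts (hypothetical)
toy_counts = pd.Series([1, 2, 3, 4])

sample_var = toy_counts.var()               # pandas default: ddof=1, divides by n - 1
population_var = toy_counts.var(ddof=0)     # divides by n
numpy_var = np.var(toy_counts.to_numpy())   # NumPy default: ddof=0

print(sample_var, population_var, numpy_var)
```

Which estimate the task expects is not stated, so the choice of `ddof` is worth recording next to the reported variance.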


In [11]:
minimum = df['count_10mer'].min()
maximum = df['count_10mer'].max()
median = df['count_10mer'].median()
mean = df['count_10mer'].mean()
standard_deviation = df['count_10mer'].std()
variance = df['count_10mer'].var()

print(minimum,"\n",maximum,"\n",median,"\n",mean,"\n",standard_deviation,"\n",variance )

1 
 500 
 10.0 
 58.479065576043254 
 89.83104818473653 
 8069.617217968457


In [12]:
summary_dict = {
    "minimum" : df['count_10mer'].min(),
    "maximum" : df['count_10mer'].max(),
    "median" : df['count_10mer'].median(),
    "mean" : df['count_10mer'].mean(),
    "standard_deviation" : df['count_10mer'].std(),
    "variance" : df['count_10mer'].var()
}

for keys,values in summary_dict.items():
  print(keys," : ",'{:.2f}'.format(values))

minimum  :  1.00
maximum  :  500.00
median  :  10.00
mean  :  58.48
standard_deviation  :  89.83
variance  :  8069.62


In [13]:

with open('/content/gdrive/My Drive/Colab Notebooks/summary.txt', 'w') as f:
  for keys,values in summary_dict.items():
    # one "label: value" pair per row, separated by a colon and a space
    f.write('%s: %s\n' % (keys, values))
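
The task asks for the labels min, max, median, mean, and variance, one per row. A minimal sketch that produces exactly those labels in the `label: value` layout is given below; the helper name `format_summary` and the toy counts are assumptions for illustration, and the variance is the pandas default sample variance (`ddof=1`).

```python
import pandas as pd

def format_summary(counts: pd.Series) -> str:
    """Build one 'label: value' line per statistic requested in the task."""
    stats = {
        "min": counts.min(),
        "max": counts.max(),
        "median": counts.median(),
        "mean": counts.mean(),
        "variance": counts.var(),  # sample variance (ddof=1), the pandas default
    }
    return "".join(f"{label}: {value}\n" for label, value in stats.items())

# toy counts (hypothetical, not the notebook's data)
toy_counts = pd.Series([1, 2, 3, 4, 10])
summary_text = format_summary(toy_counts)
print(summary_text)
```

The returned string could then be written to the summary file with a single `f.write(summary_text)` call.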