# Master DSLS / Programming 3 / Assignment 6
# Final Assignment

## Introduction
This is the final assignment for Programming 3. In this assignment, I develop scikit-learn machine learning models to predict the function of the proteins in a specific dataset. The models use the small InterPro_annotations_accession values of a protein to predict its large InterPro_annotations_accession.
Small and large InterPro_annotations_accession are defined as follows:

If the feature length of an InterPro_annotations_accession (Stop_location - Start_location) divided by Sequence_length is greater than 0.9, it is a large InterPro_annotations_accession.

Otherwise, it is a small InterPro_annotations_accession.

More briefly:

            |(Stop - Start)| / Sequence >  0.9 --> Large

            |(Stop - Start)| / Sequence <= 0.9 --> Small
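
As a quick worked example of this rule, here is a minimal sketch in plain Python. The function name `classify_feature` and the sample numbers are hypothetical and only serve as an illustration; the 0.9 threshold comes from the definition above.

```python
def classify_feature(start: int, stop: int, sequence_length: int,
                     threshold: float = 0.9) -> str:
    """Return "large" if the feature covers more than `threshold` of the sequence."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    ratio = abs(stop - start) / sequence_length
    return "large" if ratio > threshold else "small"


# Hypothetical features on a 200-residue protein.
print(classify_feature(5, 195, 200))   # ratio 0.95
print(classify_feature(10, 190, 200))  # ratio 0.90, not strictly above the threshold
print(classify_feature(50, 120, 200))  # ratio 0.35
```

Note that a feature with a ratio of exactly 0.9 counts as small, because the comparison is strict.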

I will also check the dataset for "bias" and "noise", i.e. records that do not make sense for this task:

e.g. lines from the TSV file that do not contain InterPro numbers (marked with -)

e.g. proteins that do not have a large feature (according to the criteria above)
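
The two checks above could be sketched with pandas as follows. This is only an illustration under assumptions: the toy rows are invented (they are not taken from all_bacilli.tsv), and the helper names `annotated`, `is_large` and `cleaned` are hypothetical. The column names follow the InterProScan TSV schema used in this notebook.

```python
import pandas as pd

# Toy annotations (hypothetical values, not taken from all_bacilli.tsv).
toy = pd.DataFrame({
    "Protein_accession": ["P1", "P1", "P2", "P2", "P3"],
    "InterPro_annotations_accession": ["IPR001", "IPR002", "-", "IPR003", "IPR004"],
    "Start_location": [1, 10, 5, 20, 1],
    "Stop_location": [98, 40, 50, 60, 30],
    "Sequence_length": [100, 100, 80, 80, 120],
})

# 1. Drop rows without an InterPro accession.
annotated = toy[toy["InterPro_annotations_accession"] != "-"].copy()

# 2. Flag large features using the 0.9 coverage rule.
coverage = (
    (annotated["Stop_location"] - annotated["Start_location"]).abs()
    / annotated["Sequence_length"]
)
annotated["is_large"] = coverage > 0.9

# 3. Keep only proteins that have at least one large feature.
has_large = annotated.groupby("Protein_accession")["is_large"].transform("any")
cleaned = annotated[has_large]

print(sorted(cleaned["Protein_accession"].unique()))
```

In this toy table only P1 survives: P2 loses its "-" row and has no large feature left, and P3 has no large feature at all.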

## 1. Goal

The goal of this assignment is to predict the large InterPro_annotations_accession of a protein from its small InterPro_annotations_accession values.
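
To make this goal more concrete, here is one possible way to shape the data for a classifier: one row per protein, one column per small accession (with counts as values), and the large accession as the label. This is a sketch under assumptions, not necessarily the layout used later in this assignment: the toy values are invented, and keeping only the first large accession per protein is a simplification chosen for illustration.

```python
import pandas as pd

# Toy table, already filtered and classified (hypothetical values).
# Size: 1 = large feature, 0 = small feature.
toy = pd.DataFrame({
    "Protein_accession": ["P1", "P1", "P1", "P2", "P2"],
    "InterPro_annotations_accession": ["IPR100", "IPR001", "IPR002", "IPR200", "IPR001"],
    "Size": [1, 0, 0, 1, 0],
})

small = toy[toy["Size"] == 0]
large = toy[toy["Size"] == 1]

# Features: one row per protein, one column per small accession, values are counts.
X = pd.crosstab(small["Protein_accession"], small["InterPro_annotations_accession"])

# Labels: the large accession of each protein
# (assumption: keep the first one if a protein has several).
y = large.groupby("Protein_accession")["InterPro_annotations_accession"].first()

# Align features and labels on the proteins that have both.
common = X.index.intersection(y.index)
X, y = X.loc[common], y.loc[common]

print(X)
print(y)
```

A feature matrix `X` and label vector `y` of this shape are what a scikit-learn classifier expects in its `fit` method.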

I use the /data/dataprocessing/interproscan/all_bacilli.tsv file, which is available on assemblix2012 and assemblix2019. Because this file contains ~4,200,000 protein annotations, I put a subset of all_bacilli.tsv on GitHub and on my local machine for code testing.
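
One way such a subset could be created is sketched below, using only the standard library. This is an assumption, not the method actually used to build the subset on GitHub. The function name `sample_proteins` and its parameters are hypothetical. It samples whole proteins rather than single lines, so that the large and small features of a kept protein stay together.

```python
import random


def sample_proteins(src_path: str, dst_path: str,
                    fraction: float = 0.01, seed: int = 42) -> int:
    """Copy all lines of a random subset of proteins; return the number of lines kept.

    The protein accession is assumed to be the first tab-separated column.
    """
    rng = random.Random(seed)
    decisions = {}  # protein accession -> True (keep) or False (drop)
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            accession = line.split("\t", 1)[0]
            if accession not in decisions:
                decisions[accession] = rng.random() < fraction
            if decisions[accession]:
                dst.write(line)
                kept += 1
    return kept
```

For example, `sample_proteins("all_bacilli.tsv", "all_bacilli_subset.tsv", fraction=0.01)` would keep the annotations of roughly 1% of the proteins; the output file name here is made up.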

In [32]:
# Output format : https://interproscan-docs.readthedocs.io/en/latest/OutputFormats.html
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [63]:
schema = StructType([
    StructField("Protein_accession", StringType(), True),
    StructField("Sequence_MD5_digest", StringType(), True),
    StructField("Sequence_length", IntegerType(), True),
    StructField("Analysis", StringType(), True),
    StructField("Signature_accession", StringType(), True),
    StructField("Signature_description", StringType(), True),
    StructField("Start_location", IntegerType(), True),
    StructField("Stop_location", IntegerType(), True),
    StructField("Score", FloatType(), True),
    StructField("Status", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("InterPro_annotations_accession", StringType(), True),
    StructField("InterPro_annotations_description", StringType(), True),
    StructField("GO_annotations", StringType(), True),
    StructField("Pathways_annotations", StringType(), True)])
spark = SparkSession.builder.master("local[16]").appName("InterPro").getOrCreate()
df = spark.read.option("sep","\t").option("header","False").csv("all_bacilli.tsv",schema=schema)

In [64]:
df.printSchema()

root
 |-- Protein_accession: string (nullable = true)
 |-- Sequence_MD5_digest: string (nullable = true)
 |-- Sequence_length: integer (nullable = true)
 |-- Analysis: string (nullable = true)
 |-- Signature_accession: string (nullable = true)
 |-- Signature_description: string (nullable = true)
 |-- Start_location: integer (nullable = true)
 |-- Stop_location: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Status: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- InterPro_annotations_accession: string (nullable = true)
 |-- InterPro_annotations_description: string (nullable = true)
 |-- GO_annotations: string (nullable = true)
 |-- Pathways_annotations: string (nullable = true)



In [65]:
dfPandas = df.toPandas()

                                                                                

In [139]:
# Remove rows without an InterPro accession ("-"); copy so the new columns are added to an independent DataFrame
dfPandas_filter_InterPro = dfPandas[dfPandas["InterPro_annotations_accession"] != "-"].copy()
# Feature length: absolute distance between the stop and start locations
dfPandas_filter_InterPro["Length"] = (dfPandas_filter_InterPro["Stop_location"] - dfPandas_filter_InterPro["Start_location"]).abs()
# Fraction of the sequence covered by the feature, used to separate large from small
dfPandas_filter_InterPro["Ratio"] = dfPandas_filter_InterPro["Length"] / dfPandas_filter_InterPro["Sequence_length"]
# 1 for large, 0 for small InterPro_annotations_accession
dfPandas_filter_InterPro["Size"] = (dfPandas_filter_InterPro["Ratio"] > 0.9).astype(int)
# Count the proteins that have both a large and a small InterPro_annotations_accession
large_proteins = set(dfPandas_filter_InterPro[dfPandas_filter_InterPro["Size"] == 1]["Protein_accession"])
small_proteins = set(dfPandas_filter_InterPro[dfPandas_filter_InterPro["Size"] == 0]["Protein_accession"])
len(large_proteins & small_proteins)

80