# Master DSLS / Programming 3 / Assignment 6
# Final Assignment

## Introduction
This is the final assignment for Programming 3. In this assignment, I develop scikit-learn machine learning models to predict the function of the proteins in a specific dataset. The models use the small InterPro_annotations_accession of a protein to predict its large InterPro_annotations_accession.
Small and large InterPro_annotations_accession are defined as follows:

If the feature length of an InterPro_annotations_accession (Stop_location - Start_location) divided by Sequence_length is greater than 0.9, it is a large InterPro_annotations_accession.

Otherwise, it is a small InterPro_annotations_accession.

In short:

            |Stop - Start| / Sequence_length >  0.9 --> large

            |Stop - Start| / Sequence_length <= 0.9 --> small
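
The rule above can be sketched as a small pure-Python helper. The function name `classify_feature` and the `threshold` parameter are illustrative and do not appear in the notebook; the Spark code further down applies the same comparison column-wise.

```python
def classify_feature(start, stop, sequence_length, threshold=0.9):
    """Label a feature 'large' if it covers more than `threshold` of the sequence."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    # Fraction of the protein sequence covered by this feature.
    ratio = abs(stop - start) / sequence_length
    # Strictly greater than the threshold counts as large; equality is small.
    return "large" if ratio > threshold else "small"
```

For example, a feature from position 1 to 95 on a 100-residue protein has a ratio of 0.94 and is labelled large, while one from 10 to 100 has a ratio of exactly 0.9 and is labelled small.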

I will also check the dataset for "bias" and "noise", meaning records that do not make sense for this task:

e.g. lines in the TSV file that have no InterPro number (the field contains "-")

e.g. proteins that have no large feature (according to the criteria above)
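
The two checks above can be illustrated on a handful of in-memory records. This is only a sketch: the function name `drop_noise`, the dictionary keys, and the toy accession values are placeholders chosen for the example, not names from the dataset or the notebook.

```python
def drop_noise(rows, threshold=0.9):
    """Remove unannotated rows and proteins that have no large feature."""
    # Check 1: discard annotations without an InterPro accession ("-").
    annotated = [r for r in rows if r["interpro"] != "-"]
    # Check 2: collect the proteins that have at least one large feature.
    has_large = {
        r["protein"]
        for r in annotated
        if abs(r["stop"] - r["start"]) / r["seq_len"] > threshold
    }
    return [r for r in annotated if r["protein"] in has_large]


# Toy records (placeholder values, not taken from all_bacilli.tsv).
rows = [
    {"protein": "P1", "interpro": "IPR_A", "start": 1, "stop": 99, "seq_len": 100},
    {"protein": "P1", "interpro": "IPR_B", "start": 10, "stop": 40, "seq_len": 100},
    {"protein": "P1", "interpro": "-", "start": 1, "stop": 100, "seq_len": 100},
    {"protein": "P2", "interpro": "IPR_C", "start": 5, "stop": 30, "seq_len": 200},
]
cleaned = drop_noise(rows)
```

Here the "-" row is dropped by the first check and protein P2 is dropped by the second, because its only feature covers 12.5% of the sequence.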

## 1. Goal

The goal of this assignment is to predict the large InterPro_annotations_accession of a protein from its small InterPro_annotations_accession.

I will use the dataset in the /data/dataprocessing/interproscan/all_bacilli.tsv file on assemblix2012 and assemblix2019. Because this file contains ~4,200,000 protein annotations, I will put a subset of all_bacilli.tsv on GitHub and keep a local copy for code testing.
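
One simple way to produce such a subset is to copy the first lines of the TSV file. The notebook does not say how the subset was actually made, so the helper below is only a sketch: `write_subset`, its parameters, and the default of 1000 lines are assumptions for illustration.

```python
from itertools import islice


def write_subset(source_path, target_path, n_lines=1000):
    """Copy the first `n_lines` lines of a text file; return the number written."""
    written = 0
    with open(source_path, "r", encoding="utf-8") as src, \
         open(target_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines without reading the rest of the file.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

Taking the head of the file is cheap but not random, so annotations of the same protein stay together while later proteins are never sampled.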

In [1]:
# Output format : https://interproscan-docs.readthedocs.io/en/latest/OutputFormats.html
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
# Note: abs and max here are the Spark column functions and shadow the Python builtins.
from pyspark.sql.functions import abs, max, when
from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
import warnings
import time
warnings.filterwarnings('ignore')

In [2]:
# Create a df by PySpark
schema = StructType([
    StructField("Protein_accession", StringType(), True),
    StructField("Sequence_MD5_digest", StringType(), True),
    StructField("Sequence_length", IntegerType(), True),
    StructField("Analysis", StringType(), True),
    StructField("Signature_accession", StringType(), True),
    StructField("Signature_description", StringType(), True),
    StructField("Start_location", IntegerType(), True),
    StructField("Stop_location", IntegerType(), True),
    StructField("Score", FloatType(), True),
    StructField("Status", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("InterPro_annotations_accession", StringType(), True),
    StructField("InterPro_annotations_description", StringType(), True),
    StructField("GO_annotations", StringType(), True),
    StructField("Pathways_annotations", StringType(), True)])
path = "/data/dataprocessing/interproscan/all_bacilli.tsv"
# path = "all_bacilli.tsv"
spark = SparkSession.builder.master("local[16]").config("spark.debug.maxToStringFields", "100").appName("InterPro").getOrCreate()
df = spark.read.option("sep","\t").option("header","False").csv(path,schema=schema)



In [3]:
df.printSchema()

root
 |-- Protein_accession: string (nullable = true)
 |-- Sequence_MD5_digest: string (nullable = true)
 |-- Sequence_length: integer (nullable = true)
 |-- Analysis: string (nullable = true)
 |-- Signature_accession: string (nullable = true)
 |-- Signature_description: string (nullable = true)
 |-- Start_location: integer (nullable = true)
 |-- Stop_location: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Status: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- InterPro_annotations_accession: string (nullable = true)
 |-- InterPro_annotations_description: string (nullable = true)
 |-- GO_annotations: string (nullable = true)
 |-- Pathways_annotations: string (nullable = true)



In [4]:
# Version 1: the preprocessing as one chained expression
# - remove rows where InterPro_annotations_accession == "-"
# - compute the feature length (Length), not the length of the whole protein
# - compute the ratio of feature length to sequence length (Ratio)
# - flag each feature (Size): 1 for large, 0 for small InterPro_annotations_accession
start = time.time()
df = df.filter(df.InterPro_annotations_accession != "-")\
        .withColumn("Length", abs(df["Stop_location"] - df["Start_location"]))\
        .withColumn("Ratio", (abs(df["Stop_location"] - df["Start_location"])/df["Sequence_length"]))\
        .withColumn("Size", when((abs(df["Stop_location"] - df["Start_location"])/df["Sequence_length"])>0.9,1).otherwise(0))
# get the intersection to make sure there is a match of large and small InterPro_annotations_accession(at least one large and one small InterPro_annotations_accession)
intersection = df.filter(df.Size == 0).select("Protein_accession").intersect(df.filter(df.Size == 1).select("Protein_accession"))
intersection_df = intersection.join(df,["Protein_accession"])
# get the number of small InterPro_annotations_accession in each Protein_accession
small_df = intersection_df.filter(df.Size == 0).groupBy(["Protein_accession"]).pivot("InterPro_annotations_accession").count()
# A protein can have several large InterPro_annotations_accession. Keep only the one with the highest Ratio
# (dropDuplicates below picks one arbitrarily if several share that Ratio).
large_df = intersection_df.filter(df.Size == 1).groupby(["Protein_accession"]).agg(max("Ratio").alias("Ratio"))
large_df = large_df.join(intersection_df,["Protein_accession","Ratio"],"inner").dropDuplicates(["Protein_accession"])
# Drop the useless columns
columns = ("Sequence_MD5_digest","Analysis","Signature_accession","Signature_description",
        "Score","Status","Date","InterPro_annotations_description","GO_annotations",
        "Pathways_annotations","Ratio","Size","Stop_location","Start_location","Sequence_length","Length")
large_df = large_df.drop(*columns)
# Create the df for ML
ML_df = large_df.join(small_df,["Protein_accession"])
end = time.time()
print(end-start)
# 20.608827114105225 in Hanze

                                                                                

17.291804790496826
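
To make the `groupBy(...).pivot(...).count()` step concrete, here is a plain-Python sketch of the same idea on toy data: each protein becomes a row, each small accession a column, and each cell the number of occurrences. The function name `pivot_counts` and the toy identifiers are illustrative only. One difference from Spark: missing combinations are 0 here, whereas the Spark pivot leaves them null until `fillna(0)` is applied.

```python
from collections import Counter, defaultdict


def pivot_counts(pairs):
    """Build a count matrix from (protein, small_accession) pairs."""
    counts = defaultdict(Counter)
    for protein, accession in pairs:
        counts[protein][accession] += 1
    proteins = sorted(counts)
    accessions = sorted({a for c in counts.values() for a in c})
    # Counter returns 0 for accessions a protein does not have.
    matrix = [[counts[p][a] for a in accessions] for p in proteins]
    return proteins, accessions, matrix


# Toy input (placeholder identifiers).
pairs = [("P1", "IPR_A"), ("P1", "IPR_A"), ("P1", "IPR_B"), ("P2", "IPR_B")]
proteins, accessions, matrix = pivot_counts(pairs)
# proteins   -> ['P1', 'P2']
# accessions -> ['IPR_A', 'IPR_B']
# matrix     -> [[2, 1], [0, 1]]
```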


In [None]:
# Version 2: the same pipeline as above, written step by step
start = time.time()
# remove InterPro_annotations_accession == "-"
df = df.filter(df.InterPro_annotations_accession != "-")
# get the length of the feature (not of the whole protein)
df = df.withColumn("Length", abs(df["Stop_location"] - df["Start_location"]))
# get the ratio to distinguish them to large and small InterPro_annotations_accession
df = df.withColumn("Ratio", df["Length"]/df["Sequence_length"])
# 1 for large, 0 for small InterPro_annotations_accession
df = df.withColumn("Size", when(df.Ratio>0.9,1).otherwise(0))
# get the intersection to make sure there is a match of large and small InterPro_annotations_accession(at least one large and one small InterPro_annotations_accession)
intersection = df.filter(df.Size == 0).select("Protein_accession").intersect(df.filter(df.Size == 1).select("Protein_accession"))
intersection_df = intersection.join(df,["Protein_accession"])
# get the number of small InterPro_annotations_accession in each Protein_accession
small_df = intersection_df.filter(df.Size == 0).groupBy(["Protein_accession"]).pivot("InterPro_annotations_accession").count()
# A protein can have several large InterPro_annotations_accession. Keep only the one with the highest Ratio
# (dropDuplicates below picks one arbitrarily if several share that Ratio).
large_df = intersection_df.filter(df.Size == 1).groupby(["Protein_accession"]).agg(max("Ratio").alias("Ratio"))
large_df = large_df.join(intersection_df,["Protein_accession","Ratio"],"inner").dropDuplicates(["Protein_accession"])
# Drop the useless columns
columns = ("Sequence_MD5_digest","Analysis","Signature_accession","Signature_description",
        "Score","Status","Date","InterPro_annotations_description","GO_annotations",
        "Pathways_annotations","Ratio","Size","Stop_location","Start_location","Sequence_length","Length")
large_df = large_df.drop(*columns)
# Create the df for ML
ML_df = large_df.join(small_df,["Protein_accession"])
end = time.time()
print(end-start)
# 20.974359273910522 in Hanze

In [46]:
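# Target: the large InterPro_annotations_accession of each protein
y = ML_df.select("InterPro_annotations_accession")
# Features: counts of small accessions; columns[2:] skips Protein_accession and the target column.
# Missing counts (null after the pivot) are filled with 0.
X = ML_df.select(ML_df.columns[2:]).fillna(0)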
# X.write.option("header",True).csv('X_dfSpark.csv')
# y.write.option("header",True).csv('y_dfSpark.csv')

In [39]:
# df_spark = spark.read.option("header",True).format("csv").load("X_dfSpark.csv")
# df_spark.toPandas()

Unnamed: 0,IPR000182,IPR000192,IPR000223,IPR000291,IPR000297,IPR000398,IPR000456,IPR000515,IPR000522,IPR000644,...,IPR043141,IPR043519,IPR043797,IPR043894,IPR044920,IPR045304,IPR045562,IPR045865,IPR046342,IPR046357
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
78,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
# collect() returns one-element rows, so flatten y to the 1-D shape scikit-learn expects
y = np.array(y.collect()).ravel()
X = np.array(X.collect())

                                                                                

In [48]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics 
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)

In [49]:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# predict on the test set
y_pred = clf.predict(X_test)
# compute the accuracy with the metrics module
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy

0.0625
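
A short piece of arithmetic helps to read this score. The sketch below assumes, which the notebook does not state, that the model was fit on the 79-row subset whose table (rows 0 to 78) is shown earlier. With a float `test_size`, `train_test_split` rounds the number of test samples up.

```python
import math

n_samples = 79   # assumption: rows 0..78 of the table shown earlier
test_size = 0.2  # value used in the notebook
accuracy = 0.0625

# Number of test samples for a float test_size (rounded up).
n_test = math.ceil(test_size * n_samples)
# Number of correct predictions implied by the reported accuracy.
n_correct = round(accuracy * n_test)
```

Under that assumption the test set holds 16 proteins and the accuracy corresponds to a single correct prediction, so the figure says little about how the model would behave on the full dataset.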