<a href="https://colab.research.google.com/github/Srividhyak2011/Demo-Datascienceproject/blob/main/M3_MP1_NB_Essential_Genes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project 01: Prediction of Essential Genes from Networks

## Learning Objectives

At the end of the mini project, you will be able to -

* Get an understanding of the dataset.
* Build and analyze Networks (or Graphs)
* Extract features from the network
* Predict Essential Genes using a classification algorithm

## Information

### Background of the project

This mini project is based on research carried out at the Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI) at IIT Madras. More details can be found in this article: [https://doi.org/10.3389/fgene.2021.722198](https://www.frontiersin.org/articles/10.3389/fgene.2021.722198/full).

The goal of this project is to apply machine learning to predict essential genes, using features derived from the protein network in the STRING dataset.

### About the paper cited above

Features Used in the Paper

267 Genetic Features + 16 Network Centrality features.

12 Centralities [1 to 12] + 4 other Auxiliary network metrics

These features are computed from the graph. Once extracted, they turn the omics data into an ordinary tabular dataset on which machine learning models can be trained.


### About the Dataset

The dataset is downloaded directly from the [STRING Database](https://string-db.org/cgi/download).
We work on the bacterium *Actinomyces coleocanis* because its dataset is small, which keeps the runtime short and makes reruns quick.
The [Actinomyces coleocanis](https://stringdb-static.org/download/protein.links.v11.5/525245.protein.links.v11.5.txt.gz) file is downloaded and unzipped. The text file contains 3 columns: protein1, protein2 and the score (`combined_score`).
This 3-column data describes a graph: each row is an edge between two proteins, weighted by the score.
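The three-column layout described above can be illustrated with a minimal sketch. It assumes the file is whitespace-separated with a single header row; the three sample rows are copied from the file preview shown later in this notebook, and the variable names `sample` and `edges` are only illustrative.

```python
import io

import pandas as pd

# Three rows in the STRING "protein.links" layout (space-separated, one header row).
sample = io.StringIO(
    "protein1 protein2 combined_score\n"
    "525245.HMPREF0044_0001 525245.HMPREF0044_1430 635\n"
    "525245.HMPREF0044_0001 525245.HMPREF0044_1224 170\n"
    "525245.HMPREF0044_0001 525245.HMPREF0044_0084 164\n"
)

# A regex separator matching any run of whitespace splits the three columns.
edges = pd.read_csv(sample, sep=r"\s+")

# Each row is one weighted edge: (protein1, protein2, combined_score).
print(edges)
```

Each row can be read as "protein1 is associated with protein2 with this confidence score", which is exactly the edge list a graph library expects.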

NetGenes contains essential gene predictions for 2,700+ bacteria, made using features derived from STRING protein–protein functional association networks. It offers a refined interface for accessing and downloading the data, along with some additional information. The dataset lists the essential genes for each bacterium.
Clicking on a specific bacterium name navigates to an interactive page.
[Netgenes Database](https://rbc-dsai-iitm.github.io/NetGenes/)


### Small note on Proteins

Proteins are large, complex molecules that play many critical roles in the body. They are necessary for building the structural components of the human body, such as muscles and organs. Proteins also determine how the organism looks, how well its body metabolises food or fights infection and sometimes even how it behaves. Proteins are chains of chemical building blocks called amino acids. A protein may contain a few amino acids or it could have several thousands.



### Small note on Genes

A gene is a basic unit of heredity in a living organism that normally resides in long strands of DNA called chromosomes. Genes are coded instructions that decide what the organism is like, how it behaves in its environment and how it survives. They hold the information to build and maintain an organism’s cells and pass genetic traits to offspring. A gene consists of a long combination of four different nucleotide bases namely adenine, cytosine, guanine and thymine.


### Relationship between GENES and PROTEINS
Genes and proteins are two functionally related entities found in the cells of a living organism.
Most genes contain the information required to make proteins. Note that a gene is not part of a protein, and vice versa.
For more information, click [Here](https://pediaa.com/difference-between-gene-and-protein/).




### Importance of Essential Genes
Essential genes are genes required for a cell or an organism to survive. They support functions such as cell growth, metabolism and reproduction. Disruption or deletion of such genes causes cell death, indicating that these genes perform essential biological functions. The majority of genes in an organism are non-essential; only a small fraction are essential.

**Python Packages used:**  
* [`networkx`](https://networkx.org/documentation/stable/_downloads/networkx_reference.pdf) for graph analysis
* [`requests`](https://docs.python-requests.org/en/latest/) for fetching data over the internet 
* [`pandas`](https://pandas.pydata.org/docs/reference/index.html) for data frames and for reading CSV files easily  
* [`numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions  
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for the metrics and pre-processing
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting


In [1]:
#@title Download dataset
%%capture
!gdown "1SjpKNyYWqpEs3JwXF_cTF3980ZtZOld9"
!unzip "Essential genes data.zip"

In [2]:
!pip install networkx==2.4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting networkx==2.4
  Downloading networkx-2.4-py3-none-any.whl (1.6 MB)
Installing collected packages: networkx
  Attempting uninstall: networkx
    Found existing installation: networkx 3.0
    Uninstalling networkx-3.0:
      Successfully uninstalled networkx-3.0
Successfully installed networkx-2.4


## Importing the packages

In [3]:
### The required libraries and packages ###
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from collections import Counter
from operator import itemgetter
from google.colab import drive
import os
from tqdm import tqdm
import time
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

import networkx as nx

## Importing the Data

In [4]:
# The columns are whitespace-separated; the raw string keeps the regex escape intact
df_raw = pd.read_csv('525245.protein.links.v11.5.txt', sep=r'\s+', engine="python")
print(df_raw.shape)
df_raw.head()

(213398, 3)


|   | protein1 | protein2 | combined_score |
|---|---|---|---|
| 0 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_1430 | 635 |
| 1 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_1224 | 170 |
| 2 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0084 | 164 |
| 3 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0281 | 165 |
| 4 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0968 | 165 |


In [5]:
df = df_raw.copy()
df.head()

|   | protein1 | protein2 | combined_score |
|---|---|---|---|
| 0 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_1430 | 635 |
| 1 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_1224 | 170 |
| 2 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0084 | 164 |
| 3 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0281 | 165 |
| 4 | 525245.HMPREF0044_0001 | 525245.HMPREF0044_0968 | 165 |


In [8]:
print(f"df.shape = {df.shape}")
n_uniq_protein1 = df["protein1"].nunique()
n_uniq_protein2 = df["protein2"].nunique()
print(f"n_uniq_protein1 = {n_uniq_protein1}, n_uniq_protein2 = {n_uniq_protein2}")

df.nunique()

df.shape = (213398, 3)
n_uniq_protein1 = 1530, n_uniq_protein2 = 1530


protein1          1530
protein2          1530
combined_score     850
dtype: int64

In [9]:
df.isnull().sum()

protein1          0
protein2          0
combined_score    0
dtype: int64

## Graded Exercises (10 points)

Exercises 1 to 4 deal with the data, the graph structure, its visualization and data preparation of **FEATURES** only.

Exercise 5 deals with linking the feature data with the target data.

Exercise 6 deals with the classification model.

### Exercise 1 (1 point): Create the networkx graph object

**Hint**: Use `networkx`'s `add_weighted_edges_from` method
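As a small illustration of the hint, the sketch below builds an undirected weighted graph from a toy edge table. The protein IDs `A`, `B`, `C` and the names `toy` and `G_toy` are hypothetical. The toy table lists every pair in both directions to show that an undirected `nx.Graph` stores each pair only once. This is consistent with the real data in this notebook, where 213,398 rows become 106,699 edges, exactly half.

```python
import networkx as nx
import pandas as pd

# Toy edge table (hypothetical protein IDs); each pair appears in both directions.
toy = pd.DataFrame(
    {
        "protein1": ["A", "B", "A", "C"],
        "protein2": ["B", "A", "C", "A"],
        "combined_score": [635, 635, 170, 170],
    }
)

G_toy = nx.Graph()
# Each (u, v, w) triple becomes an edge u-v carrying the attribute weight=w.
G_toy.add_weighted_edges_from(toy.itertuples(index=False, name=None))

# Four rows collapse into two undirected edges.
print(G_toy.number_of_nodes(), G_toy.number_of_edges())
print(G_toy["A"]["B"]["weight"])
```

Passing plain tuples via `itertuples(index=False, name=None)` is one option; passing `df.values` also works when the first three columns are the two endpoints and the weight.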

In [15]:
# YOUR CODE HERE
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
G.add_weighted_edges_from(df.values)
G.edges(data=True)


EdgeDataView([('525245.HMPREF0044_0001', '525245.HMPREF0044_1430', {'weight': 635}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_1224', {'weight': 170}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0084', {'weight': 164}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0281', {'weight': 165}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0968', {'weight': 165}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_1540', {'weight': 234}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0143', {'weight': 182}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_1465', {'weight': 153}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0732', {'weight': 204}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0766', {'weight': 157}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0054', {'weight': 154}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0969', {'weight': 177}), ('525245.HMPREF0044_0001', '525245.HMPREF0044_0535', {'weight': 201}), ('525245.HMPREF0044_1430', '525245.HMPREF0044_0084', {'weight':

### Exercise 2 (2 points): Network Analysis

Provide the following Graph parameters

1. Display the information of the network using `networkx`'s `.info`
2. Compute the number of nodes, the number of edges and the average degree of the network using `networkx`'s `.number_of_nodes`, `.number_of_edges` and `.degree` (take the degree of each node, then average)
3. Compute the density of the network using `networkx`'s `.density`
4. Compute the minimum spanning tree using `networkx`'s `.minimum_spanning_tree` and draw it using `.spring_layout` and `.draw_networkx`
5. Determine the diameter and center of the graph using `networkx`'s `.diameter` and `.center`
6. Visualise the degree distribution as a histogram using `networkx`'s `.degree`
7. List the components of the network using `networkx`'s `.connected_components`
8. Create a subgraph using `networkx`'s `.subgraph` and print the largest component of the network using the `max` of the components

**Hints**: Refer to the `nx.<method>` highlighted above to achieve the respective tasks
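The basic size statistics in the list above can be sketched on a toy graph. The graph `H` and its node names are hypothetical; the degree counts at the end are the raw input for a histogram (for example with `plt.hist`), and the plotting itself is left out here.

```python
from collections import Counter

import networkx as nx

# Toy graph (hypothetical): a triangle A-B-C with a pendant node D attached to C.
H = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

n_nodes = H.number_of_nodes()
n_edges = H.number_of_edges()

# H.degree() yields (node, degree) pairs; the mean equals 2 * edges / nodes.
degrees = [d for _, d in H.degree()]
avg_degree = sum(degrees) / n_nodes

# Density = edges present / edges possible = 2m / (n (n - 1)) for an undirected graph.
density = nx.density(H)

# Degree distribution as counts per degree value.
degree_counts = Counter(degrees)

print(n_nodes, n_edges, avg_degree, round(density, 4))
print(sorted(degree_counts.items()))
```

The same identity can be checked against the notebook's graph: 2 × 106699 / 1530 ≈ 139.4758, which matches the average degree reported by `nx.info`.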

In [18]:
nx.info(G)

'Name: \nType: Graph\nNumber of nodes: 1530\nNumber of edges: 106699\nAverage degree: 139.4758'

In [21]:
print("Number of Nodes : ", G.number_of_nodes())

Number of Nodes :  1530


In [20]:
print("Number of Edges : ", G.number_of_edges())

Number of Edges :  106699


In [25]:
G.degree()

DegreeView({'525245.HMPREF0044_0001': 13, '525245.HMPREF0044_1430': 17, '525245.HMPREF0044_1224': 174, '525245.HMPREF0044_0084': 212, '525245.HMPREF0044_0281': 202, '525245.HMPREF0044_0968': 47, '525245.HMPREF0044_1540': 138, '525245.HMPREF0044_0143': 118, '525245.HMPREF0044_1465': 425, '525245.HMPREF0044_0732': 273, '525245.HMPREF0044_0766': 36, '525245.HMPREF0044_0054': 153, '525245.HMPREF0044_0969': 117, '525245.HMPREF0044_0535': 133, '525245.HMPREF0044_0002': 94, '525245.HMPREF0044_0663': 114, '525245.HMPREF0044_0158': 533, '525245.HMPREF0044_0569': 145, '525245.HMPREF0044_0231': 165, '525245.HMPREF0044_0013': 104, '525245.HMPREF0044_1052': 192, '525245.HMPREF0044_1300': 67, '525245.HMPREF0044_1379': 102, '525245.HMPREF0044_0575': 348, '525245.HMPREF0044_0047': 323, '525245.HMPREF0044_1427': 67, '525245.HMPREF0044_0841': 342, '525245.HMPREF0044_0619': 93, '525245.HMPREF0044_0160': 400, '525245.HMPREF0044_0618': 142, '525245.HMPREF0044_0067': 175, '525245.HMPREF0044_1194': 88, '5252

In [None]:
# This is just a guideline.
# Please use separate cells to perform the required tasks


#===========================================
# Compute number of nodes, number of edges 
# and the average degree of the network "G"
#===========================================
# YOUR CODE HERE

#===========================================
# Compute the density of the network "G"
#===========================================
# YOUR CODE HERE

#===========================================
# Compute the minimum spanning tree in the 
# network "G" and draw it.
#===========================================
# YOUR CODE HERE

#===========================================
# Draw the degree distribution histogram.
#===========================================
# YOUR CODE HERE

#===========================================
# Compute largest connected component (LC) 
# of the network "G"
#===========================================
# YOUR CODE HERE

#===========================================
# List the components in the network "G"
#===========================================
# YOUR CODE HERE

#===========================================
# Get the SubGraph
#===========================================
# YOUR CODE HERE
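The component, diameter, center and spanning-tree tasks can be sketched in the same spirit. The toy graph `H`, its weights and its node names are hypothetical, and drawing with `nx.spring_layout` and `nx.draw_networkx` is omitted so that the sketch does not depend on a plotting backend.

```python
import networkx as nx

# Toy weighted graph (hypothetical): a 4-node component plus a separate 2-node component.
H = nx.Graph()
H.add_weighted_edges_from(
    [("A", "B", 1), ("B", "C", 2), ("A", "C", 5), ("C", "D", 1), ("X", "Y", 3)]
)

# connected_components yields one set of nodes per component.
components = sorted(nx.connected_components(H), key=len, reverse=True)

# Largest component as a standalone graph; .copy() detaches it from H.
largest = H.subgraph(max(nx.connected_components(H), key=len)).copy()

# Diameter and center are only defined on a connected graph,
# so they are computed on the largest component rather than on H.
diameter = nx.diameter(largest)
center = nx.center(largest)

# Minimum spanning tree by edge weight: n - 1 edges for a connected graph.
mst = nx.minimum_spanning_tree(largest, weight="weight")

print(len(components), sorted(largest.nodes()))
print(diameter, center, mst.number_of_edges())
```

One design point worth keeping in mind: in STRING a higher score means a more confident association, whereas a minimum spanning tree keeps the lowest-weight edges. `nx.maximum_spanning_tree` exists for the opposite choice; which one suits the analysis is a judgment call.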

### Exercise 3  (3 points): Centrality Feature Extraction

Compute the Centralities

As established in the introduction, centralities are needed to transform the network data into features for machine learning.

For specific information, click the link adjacent to the name, or for a full list click [here](https://networkx.org/documentation/stable/reference/algorithms/centrality.html#reaching).


In graph/network analysis, centrality measures are vital tools for understanding networks in detail.

These algorithms use graph theory to calculate the importance of any given node in a network. They cut through noisy data, revealing parts of the network that need attention, but they all work differently. Each measure has its own definition of 'importance'. Many such measures exist.
However, the following network metrics are used in the paper.



1. **closeness centrality** [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.closeness_centrality.html#networkx.algorithms.centrality.closeness_centrality) {**has been provided as an example with code in the next cell**},
2. betweenness centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality),
3. degree centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html#networkx.algorithms.centrality.degree_centrality),
4. eigenvector centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.eigenvector_centrality.html#networkx.algorithms.centrality.eigenvector_centrality),
5. subgraph centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.subgraph_centrality.html#networkx.algorithms.centrality.subgraph_centrality),
6. information centrality,
7. random-walk centrality,
8. load centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.load_centrality.html#networkx.algorithms.centrality.load_centrality),
9. harmonic centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.harmonic_centrality.html#networkx.algorithms.centrality.harmonic_centrality),
10. reaching (local) centrality [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.local_reaching_centrality.html#networkx.algorithms.centrality.local_reaching_centrality),
11. pagerank [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html),
12. clustering coefficient [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.clustering.html#networkx.algorithms.cluster.clustering),
13. average_neighbor_degree [link](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.assortativity.average_neighbor_degree.html)
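Several of the measures listed above return a `{node: value}` dictionary, so they can be collected into one feature table in a single step. The sketch below is one possible way to do this on a hypothetical toy graph `H`; the column names are illustrative, and only measures that need no extra numerical libraries are included.

```python
import networkx as nx
import pandas as pd

# Toy graph (hypothetical): triangle A-B-C with a pendant node D attached to C.
H = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# Each call returns {node: value}; pandas aligns the dicts on the node names.
features = pd.DataFrame(
    {
        "centrality_degree": nx.degree_centrality(H),
        "centrality_betweenness": nx.betweenness_centrality(H),
        "centrality_closeness": nx.closeness_centrality(H),
        "centrality_harmonic": nx.harmonic_centrality(H),
        "clustering_coefficient": nx.clustering(H),
        "average_neighbor_degree": nx.average_neighbor_degree(H),
    }
)
features.index.name = "protein1"

print(features.round(3))
```

Node `C` touches every other node, so its degree centrality is 1.0, while the pendant node `D` has a clustering coefficient of 0.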

**Note**: 

- Most of the methods mentioned above return a dictionary (key-value pairs of node_name: value).

- Some of the methods mentioned above return only a single number, so check the documentation to see what each one returns. For such a method, run it once per node to build a dictionary of node names and their values.
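The second point in the note above can be sketched with `nx.local_reaching_centrality`, which takes a graph and a single node and returns one number. The toy graph `H` and the name `reaching` are hypothetical.

```python
import networkx as nx

# Toy graph (hypothetical node names).
H = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# nx.local_reaching_centrality(G, v) returns one number for the single node v,
# so a dict comprehension builds the {node: value} mapping.
reaching = {node: nx.local_reaching_centrality(H, node) for node in H.nodes()}

print(reaching)
```

The resulting dictionary has the same shape as the output of the other centrality functions, so it can be turned into a DataFrame column in the same way.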

In [None]:
# closeness centrality
# Returns a dict {node: closeness}; G is the graph built in Exercise 1
centrality_closeness = nx.closeness_centrality(G)

# One row per protein, indexed by protein1, with the centrality as a column
df_centrality_closeness = pd.DataFrame(
    {
        "protein1": list(centrality_closeness.keys()),
        "centrality_closeness": list(centrality_closeness.values()),
    }
).set_index("protein1")
df_centrality_closeness.head(2)

In [None]:
# betweenness centrality

# YOUR CODE HERE

In [None]:
# degree centrality

# YOUR CODE HERE

In [None]:
# eigenvector centrality

# YOUR CODE HERE

In [None]:
# subgraph centrality

# YOUR CODE HERE

In [None]:
# information centrality

# YOUR CODE HERE

In [None]:
# random-walk centrality

# YOUR CODE HERE

In [None]:
# load centrality

# YOUR CODE HERE

In [None]:
# harmonic centrality

# YOUR CODE HERE

In [None]:
# local reaching centrality

# YOUR CODE HERE

In [None]:
# pagerank

# YOUR CODE HERE

In [None]:
# clustering coefficient

# YOUR CODE HERE

In [None]:
# average_neighbor_degree

# YOUR CODE HERE

### Exercise 4 (2 points): Feature Engineering and Data Preparation

 - Add the above computed values as new columns to the existing dataframe to form new features for machine learning. 
 - Remove the columns *protein2* and the *combined_score*
 - Check for the null values. Drop if any
 - Scale the values of each column
 - Check for correlations of every feature with every other using `seaborn`'s **annotated heatmap**. Drop one of the features in the pair which exhibits a high correlation coefficient, *i.e.* $r>0.9$
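The first three steps in the list above could look like the sketch below. It assumes that one row per protein is wanted once the edge-level columns are gone; the toy tables `df_edges` and `df_centrality` and all their values are hypothetical.

```python
import pandas as pd

# Toy edge table and per-node feature table (hypothetical values and names).
df_edges = pd.DataFrame(
    {
        "protein1": ["A", "A", "B", "C"],
        "protein2": ["B", "C", "A", "A"],
        "combined_score": [635, 170, 635, 170],
    }
)
df_centrality = pd.DataFrame(
    {"centrality_degree": [1.0, 0.5, 0.5], "centrality_closeness": [1.0, 0.67, None]},
    index=pd.Index(["A", "B", "C"], name="protein1"),
)

# One row per protein: drop the edge-level columns, then de-duplicate protein1.
df_nodes = df_edges.drop(columns=["protein2", "combined_score"]).drop_duplicates()

# Attach the node-level features by matching protein1 against the feature index.
df_features = df_nodes.join(df_centrality, on="protein1").set_index("protein1")

# Count missing values per column, then drop incomplete rows.
print(df_features.isnull().sum())
df_features = df_features.dropna()
print(df_features)
```

Here protein `C` has a missing closeness value, so `dropna` removes it and two rows remain.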

In [None]:
#Add the above computed values as new columns to the existing dataframe to form new features for machine learning.

# YOUR CODE HERE

In [None]:
# Check for the null values. Drop if any

# YOUR CODE HERE

In [None]:
# Scale the features

# YOUR CODE HERE

In [None]:
# Display the correlation matrix using Heatmap

# YOUR CODE HERE

In [None]:
# Drop Highly Correlating Pairs of features

# YOUR CODE HERE
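The scaling and correlation steps of this exercise can be sketched on synthetic data. The feature names `f1`, `f2`, `f3` are hypothetical, with `f2` built as a linear function of `f1` so that the pair is perfectly correlated. The matrix `corr` is what `sns.heatmap(corr, annot=True)` would display; the plot is omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature table (hypothetical): f2 is a linear function of f1, f3 is unrelated noise.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
X = pd.DataFrame({"f1": f1, "f2": 2 * f1 + 1, "f3": rng.normal(size=200)})

# Standardise every column to zero mean and unit variance.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X), columns=X.columns, index=X.index
)

# Absolute Pearson correlation between every pair of features.
corr = X_scaled.corr().abs()

# Keep only the strict upper triangle so that every pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of each pair whose |r| exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X_scaled.drop(columns=to_drop)

print(to_drop, list(X_reduced.columns))
```

Dropping the later column of each pair is an arbitrary convention; keeping whichever feature is easier to interpret is an equally reasonable choice.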

### Exercise 5 (1 point): Target Data

Obtain the Target Data from the file **"Actinomyces coleocanis.csv"**

In [None]:
df_target = pd.read_csv( "Actinomyces_coleocanis_Essential_Genes.csv")
print(df_target.shape)
df_target.head(3)

In [None]:
# Create a list (or a set or numpy array) of Essential Genes from the above DataFrame
# YOUR CODE HERE

In [None]:
# Create a new feature called "gene_essentiality"
# Assign 1 to the protein1 if it is present in the list of essential genes
# This becomes your target variable

# YOUR CODE HERE
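One way to build the target column is a membership test against the set of essential genes. The sketch below uses hypothetical protein IDs and a hypothetical `essential_genes` set; the column layout of the real target file is not shown in this notebook, so how that set is extracted from it is left open.

```python
import pandas as pd

# Toy feature table indexed by protein ID (hypothetical IDs and values).
df_features = pd.DataFrame(
    {"centrality_degree": [0.9, 0.2, 0.4]},
    index=pd.Index(["A", "B", "C"], name="protein1"),
)

# Hypothetical set of essential genes, standing in for the one read from the target file.
essential_genes = {"A", "C"}

# isin gives True/False per protein; casting to int yields the 1/0 target.
df_features["gene_essentiality"] = df_features.index.isin(essential_genes).astype(int)

print(df_features)
```

Before relying on the result, it is worth checking that the identifiers in the target file use the same format as `protein1`, since a mismatch would silently label every protein as 0.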

### Exercise 6 (1 point) : Gene Essentiality Classification

Determine the essential proteins using any `sklearn` classifier of your choice

- Split the data into training and testing datasets
- Build a model, fit and predict
- Print the classification report and the confusion matrix, and plot the ROC curve
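The steps above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the data comes from `make_classification` with roughly 20% positives to mimic the fact that essential genes are a minority, and logistic regression with balanced class weights is just one reasonable classifier choice. The ROC coordinates are computed but not plotted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table: about 20% positives (hypothetical data).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# stratify=y keeps the class ratio the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" is one option for an imbalanced target.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# fpr and tpr are the x and y coordinates of the ROC curve.
fpr, tpr, _ = roc_curve(y_test, y_score)
auc_value = roc_auc_score(y_test, y_score)
print("ROC AUC:", round(auc_value, 3))
```

Passing `fpr` and `tpr` to `plt.plot(fpr, tpr)` would draw the curve. With an imbalanced target, the per-class precision and recall in the report are usually more informative than overall accuracy.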

In [None]:
# Train-test split the features and target

# YOUR CODE HERE

In [None]:
# Instantiate a model (Classifier)
# YOUR CODE HERE

# Fit on Train data
# YOUR CODE HERE

# Predict on test data
# YOUR CODE HERE

In [None]:
# Print the confusion Matrix
# YOUR CODE HERE

# Print the classification Report
# YOUR CODE HERE

# Plot the ROC Curve
# YOUR CODE HERE

#### Discuss your findings and what you learned in this mini project with your Mentor.

## Additional Ungraded Exercise for Practice:

- Try this out on other small datasets from the [STRING DB LINK](https://string-db.org/cgi/download). Select a species name, for example dog, human or cat; the site will display the corresponding Latin name. Download the relevant datasets, then explore and use them.