![Russ College Logo](images/logo.png)

<b>
    <p style="text-align:center;color:#00694E;font-family:copperplate;font-size:40px">
        CS 4170 - Data Mining with Applications in the Life Sciences
    </p>
</b>

## **Course Description**
CS 4170 - Data Mining with Applications in the Life Sciences - is a computer science elective offered here at Ohio University that is often taken by undergraduate students in their third or fourth year. If you have ever had any interest in mining data or working with data to solve complex problems, CS 4170 would be the perfect course for you! Students in this course will design and develop their own custom software to solve real-world life science problems. Some of the topics students will cover in this course include: processing DNA sequences and protein sequences, restriction maps, data pipelines, and the Entrez programming utilities. In this class, you will learn about data classification (decision trees and random forests), association rule mining, and data cleaning. CS 4170 is an extremely interesting course where students will learn various data mining techniques to help solve modern problems in the life sciences.

## **Learning Outcomes**
- Students will gain the ability to develop programs that combine third party tools to form customized data analysis pipelines
- Students will gain the ability to use a programming language to architect and construct software packages that solve computational biology problems
- Students will gain the ability to develop programs that perform processing of biological sequence data
- Students will learn basic concepts of database management

## **What You'll Learn**

### **Data Classification**
Credit to https://www.geeksforgeeks.org/basic-concept-classification-data-mining/ for the following information:

"Data mining involves mining or "digging" deep into data that is in different forms to develop patterns, and to gain knowledge on those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and to solve problems."
<br> <br>
"Data classification is a data analysis task, or a process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations), a new observation belongs to, on the basis of a training set of data containing observations and whose categories membership is known. Two examples of data classification that you will learn more about in this class are decision trees and random forests."

#### **Decision Trees**

Credit to https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/ for the following information:

![Decision tree diagram](images/cs4170-decisiontree.png)

"A decision tree is a supervised learning algorithm that works for both discrete and continuous variables. It splits the dataset into subsets on the basis of the most significant attribute in the dataset. How the decision tree identifies this attribute and how this splitting is done is decided by the algorithms.

The most significant predictor is designated as the root node, splitting is done to form sub-nodes called decision nodes, and the nodes which do not split further are terminal or leaf nodes.

In the decision tree, the dataset is divided into homogeneous and non-overlapping regions. It follows a top-down approach as the top region presents all the observations at a single place which splits into two or more branches that further split. This approach is also called a greedy approach as it only considers the current node between the worked on without focusing on the future nodes.

The decision tree algorithms will continue running until a stop criteria such as the minimum number of observations etc. is reached.

Once a decision tree is built, many nodes may represent outliers or noisy data. Tree pruning method is applied to remove unwanted data. This, in turn, improves the accuracy of the classification model.

To find the accuracy of the model, a test set consisting of test tuples and class labels is used. The percentages of the test set tuples are correctly classified by the model to identify the accuracy of the model. If the model is found to be accurate then it is used to classify the data tuples for which the class labels are not known."

Watch the following video to learn more about decision trees! https://www.youtube.com/watch?v=7VeUPuFGJHk
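
The workflow described above (build a tree, stop on a criterion, then measure accuracy on a held-out test set) can be sketched with scikit-learn, a third-party Python library. This is only a minimal sketch: the synthetic dataset, the 30% test split, and the `max_depth` and `min_samples_leaf` values are assumptions chosen for illustration, not settings taken from the course.

```python
# Sketch only: the dataset and hyperparameters below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

# hold out 30% of the tuples as a test set whose class labels are known
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# stopping criteria: cap the depth and require a minimum number of observations per leaf
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=1)
tree.fit(X_train, y_train)

# accuracy is the fraction of test tuples that the tree classifies correctly
y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Test accuracy: %.3f' % accuracy)
print('Tree depth: %d, leaves: %d' % (tree.get_depth(), tree.get_n_leaves()))
```

Try changing `max_depth` or `min_samples_leaf` to see how the stopping criteria affect the size of the tree and its test accuracy. scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`.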

#### **Random Forests**
Watch the following video to learn more about random forests! https://www.youtube.com/watch?v=J4Wdy0Wc_xQ <br>

A random forest is a data classification algorithm that is built from many decision trees. In the real world, a single decision tree can sometimes be inaccurate because it is not flexible when classifying new samples. A random forest algorithm uses bagging (random sampling with replacement) and feature randomness when building each individual tree. This process creates a "random forest" of largely uncorrelated trees whose combined prediction is usually more accurate than that of any individual tree.
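
To make bagging and feature randomness concrete, here is a simplified, hand-rolled forest built from scikit-learn decision trees. It is a sketch under stated assumptions: the synthetic dataset, the choice of 25 trees, and the hard majority vote are all illustrative, and this is not how scikit-learn's own `RandomForestClassifier` is implemented internally.

```python
# Sketch only: a hand-rolled illustration of bagging plus feature randomness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data with labels 0 and 1 (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote over binary labels cannot tie
trees = []
for i in range(n_trees):
    # bagging: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # feature randomness: each split considers only sqrt(n_features) randomly chosen features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# every tree votes on every test sample; the forest predicts the majority class
votes = np.array([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
forest_acc = np.mean(forest_pred == y_test)
print('Single tree accuracy: %.3f' % single_acc)
print('Bagged forest accuracy: %.3f' % forest_acc)
```

Comparing the two printed accuracies shows how much the vote of many randomized trees helps over one tree on this particular dataset.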


The examples below use the third-party library scikit-learn, a Python machine learning library that provides an implementation of random forests. In the first example, the function make_classification() creates a synthetic binary classification problem with 1,000 examples and 20 input features. Using a random forest algorithm, the code evaluates the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. The code outputs the mean and standard deviation of the accuracy of the model across all repeats and folds.

In [None]:
# Ensure kernel is set to Python3
# CREDIT to https://machinelearningmastery.com/random-forest-ensemble-in-python/ for example

# evaluate random forest algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

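When this example was written, the random forest reached an accuracy of about 90.5%, which is pretty high! Your result may differ slightly from run to run because the model is not given a fixed random seed.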
<br>
<br>
Random forests can also be used for regression problems. Similar to the last example, the following code evaluates the model using repeated k-fold cross-validation, with three repeats and 10 folds. The code outputs the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that values closer to zero are better, and a perfect model has an MAE of 0. Run the following code to see the output.

In [None]:
# Ensure kernel is set to Python3
# CREDIT to https://machinelearningmastery.com/random-forest-ensemble-in-python/ for example

# evaluate random forest ensemble for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import RandomForestRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=2)
# define the model
model = RandomForestRegressor()
# evaluate the model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

The reported value is the negated mean absolute error, so a value closer to 0 means that the random forest's predictions are closer to the true target values. Whether the error counts as small depends on the scale of the targets. Learn more about the scikit-learn Python library at https://scikit-learn.org/stable/, and feel free to experiment with it!
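
Because scikit-learn reports the negated MAE, it can help to flip the sign and compare the error with the spread of the target values. The sketch below assumes a smaller dataset (200 samples), a plain 5-fold split, and 50 trees so that it runs quickly; those settings are illustrative choices, not part of the example above.

```python
# Sketch only: recover the usual (positive) MAE and put it in context.
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# smaller synthetic regression problem (assumed for illustration)
X, y = make_regression(n_samples=200, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)
model = RandomForestRegressor(n_estimators=50, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# each score is the negated MAE of one fold, so every value is <= 0
neg_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)

mae = -mean(neg_scores)   # flip the sign to recover the usual MAE
target_spread = y.std()   # scale of the target values, for context
print('MAE: %.3f' % mae)
print('Target standard deviation: %.3f' % target_spread)
print('MAE / target std: %.3f' % (mae / target_spread))
```

The last line expresses the error as a fraction of the typical variation in the targets, which is easier to interpret than the raw number.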

### **Data Cleaning**
Data cleaning is a very important step in the process of mining and analyzing data. As its name indicates, data cleaning involves cleaning up the raw data that you mine so that it can be analyzed. This involves, but is not limited to, fixing or removing incorrect, incomplete, irrelevant, and duplicate data. The process of data cleaning can also involve removing "noise" from a dataset, which is data that is meaningless and does not contribute to the overall trend of the data.
Refer to the following article to learn more about the process of data cleaning! https://www.javatpoint.com/data-cleaning-in-data-mining
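
The four kinds of problem data listed above can each be handled in a line or two with pandas, a third-party Python library. This is a minimal sketch: the table, its column names (`sample_id`, `gc_content`, `notes`), and every value in it are hypothetical and invented purely for illustration.

```python
# Sketch only: the records below are hypothetical, made up for this example.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'sample_id': ['s1', 's2', 's2', 's3', 's4', 's5'],
    'gc_content': [0.41, 0.55, 0.55, np.nan, 1.70, 0.48],  # a fraction, should lie in [0, 1]
    'notes': ['ok', 'ok', 'ok', 'ok', 'ok', 'ok'],         # not needed for the analysis
})

clean = (
    raw.drop_duplicates()                                   # duplicate data: the repeated s2 row
       .dropna(subset=['gc_content'])                       # incomplete data: s3 has no measurement
       .loc[lambda d: d['gc_content'].between(0, 1)]        # incorrect data: 1.70 is not a valid fraction
       .drop(columns=['notes'])                             # irrelevant data: drop the unused column
       .reset_index(drop=True)
)
print(clean)
```

In a real project, the rules for what counts as incorrect or irrelevant depend on the dataset, so the checks above would need to be adapted.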

Credit to https://algodaily.com/lessons/introduction-to-data-cleaning-and-wrangling/why-do-we-clean-data for the following image.

![Data cleaning](images/cs4170-cleaning.png)

Consider the example dataset below. The data has a positive correlation, but it also has a fairly large range and a lot of variation from point to point. This dataset contains a lot of "noise", or data that is meaningless when looking at the overall trend. Run the code below to view the graph of the dataset.

In [None]:
# Ensure kernel is set to Python3
# Credit to https://stackoverflow.com/questions/37598986/reducing-noise-on-data for example
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0, 500
x = np.arange(1, 100, 0.1)  # x axis
z = np.random.normal(mu, sigma, len(x))  # noise
y = x ** 2 + z  # signal + noise

plt.plot(x, y, linewidth = 2, linestyle = "-", c = "b")  # includes some noise
plt.show()

This dataset is in need of some data cleaning! Smoothing away the meaningless point-to-point variation helps us analyze the data and see the general trend. To do this, we use scipy, a third-party Python library, to filter out the noise with a Savitzky-Golay filter (savgol_filter). Run the code below to see the result after we clean the dataset.

In [None]:
# Ensure kernel is set to Python3
# Credit to https://stackoverflow.com/questions/37598986/reducing-noise-on-data for example
from scipy.signal import savgol_filter
# smooth y with a Savitzky-Golay filter: window of 101 points, polynomial order 2
w = savgol_filter(y, 101, 2)
plt.plot(x, w, 'b')  # high frequency noise removed
plt.show()

**Coding challenge!** - Change the '101' value in the savgol_filter function to 11 and then 501 to see how much noise is filtered out of the data.

## **Conclusion**
CS 4170 is one of the most interesting computer science electives offered here at Ohio University. If you have ever had any interest in working with large data sets or solving problems in the life sciences, you should seriously consider taking this course. Topics in this class include data classification, association rule mining, data cleaning, and more! One of the best parts of this course is that students learn how to contribute to the life sciences by using software to solve modern-day problems.

<b>
    <p style="text-align:center;color:#00694E;font-family:copperplate;font-size:13px">
        © 2022 GAMA: Gavin Dassatti, Alex Heffner, Matthew Lang, and Aaron Begy. All rights reserved.
    </p>
</b>