# Introduction to NumPy and Matplotlib

First, let's check that everything is installed correctly. Try to import all of these packages.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

## 1. Creating arrays with NumPy

### Vectors

In [None]:
a = np.array([1,2,3,4])
print(a)
print(type(a))

### Matrices

In [None]:
# Integers
b = np.array([[1,2,3,4],[5,6,7,8]])
print(b)

In [None]:
# Floats
c = np.array([[1.2,2.1,3.7,4.1],[5.3,6.1,7.2,8.1]])
print(c)

We can check the dimensions of the array with `.shape`.

In [None]:
print(b.shape)

We can perform operations such as transposing or reshaping.

In [None]:
print(b.T)

In [None]:
print(b.reshape((1,8)))
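As a small sketch of how reshaping behaves (the array name `m` is only illustrative and is not one of the notebook's variables): reshaping keeps the same elements in the same row-by-row order, and one dimension can be given as `-1` so that NumPy infers it from the others.

```python
import numpy as np

# Illustrative array with shape (2, 4)
m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Reshape to 4 rows; -1 asks NumPy to infer the number of columns
m_4x2 = m.reshape((4, -1))
print(m_4x2.shape)   # (4, 2)

# Flatten to a 1-D vector
flat = m.reshape(-1)
print(flat)          # [1 2 3 4 5 6 7 8]

# Reshaping is not the same as transposing: the element order differs
print(np.array_equal(m_4x2, m.T))  # False
```

Both `m_4x2` and `m.T` have shape `(4, 2)`, but only the transpose swaps rows and columns.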

We can initialize arrays in different ways with zeros, ones or a range.

In [None]:
print(np.ones((3,3)))
print(np.zeros((3,3)))
print(np.arange(12))
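A few more ways to initialize arrays, as a hedged sketch (the variable names are illustrative). `np.arange` excludes the stop value, whereas `np.linspace` includes it.

```python
import numpy as np

# arange(start, stop, step): the stop value is excluded
evens = np.arange(0, 10, 2)
print(evens)            # [0 2 4 6 8]

# linspace(start, stop, num): num evenly spaced values, stop included
grid = np.linspace(0.0, 1.0, 5)
print(grid)

# A range reshaped into a 3x4 matrix
table = np.arange(12).reshape((3, 4))
print(table.shape)      # (3, 4)

# zeros and ones return floats unless a dtype is given
int_zeros = np.zeros((2, 2), dtype=int)
print(int_zeros.dtype)
```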

## 2. Matrix operations

In [None]:
# Vectors
x = np.array([1,2,3,4])
y = np.array([5,6,7,8])
# Matrices
X = np.array([[1,2,3,4],[5,6,7,8]])
Y = np.array([[4,3,1,2],[8,6,5,7]])

In [None]:
print(x)
print(y)
print(X)
print(Y)

### Sum

In [None]:
print(x + y)

In [None]:
print(X+Y)

### Scalar multiplication

In [None]:
print(4*X)

In [None]:
print(X/3)

### Element-wise multiplication

In [None]:
print(X*Y)

### Matrix-matrix product
We always need to check the dimensions of our matrices before multiplying.

In [None]:
print(X.shape)
print(Y.shape)

We cannot multiply them as they are because the dimensions do not match: `(2,4) x (2,4)` gives an error.

In [None]:
print(np.dot(X,Y))

Therefore, we need to transpose either X or Y. If we transpose X, we have `(4,2) x (2,4)`, and now the dimensions match (the last dimension of the first matrix is equal to the first dimension of the second matrix).

In [None]:
t_X = X.T
print(np.dot(t_X,Y))
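For 2-D arrays, the `@` operator gives the same result as `np.dot`. The sketch below uses illustrative names (`A`, `B`, `P`, `Q`) and shows that transposing the second matrix instead of the first also works, but produces a result of a different shape.

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])   # shape (2, 4)
B = np.array([[4, 3, 1, 2], [8, 6, 5, 7]])   # shape (2, 4)

# Inner dimensions must agree: (4, 2) @ (2, 4) -> (4, 4)
P = A.T @ B
print(P.shape)                              # (4, 4)
print(np.array_equal(P, np.dot(A.T, B)))    # True

# Transposing the second matrix instead: (2, 4) @ (4, 2) -> (2, 2)
Q = A @ B.T
print(Q)
```

The rule of thumb is that the result has the outer dimensions of the two operands.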

To multiply a vector by a matrix, the dimensions must match as well. Reshaping the vector into a 2-D array makes its dimensions explicit.

In [None]:
print(x.shape)
print(X.shape)

If we try to multiply them as they are, the dimensions are `(4,) x (2,4)`, which do not match. We therefore reshape `x` into a row vector of shape `(1,4)` and transpose `X`, which gives `(1,4) x (4,2)`.

In [None]:
x_res = x.reshape((1,4))
print(x_res.shape)

In [None]:
print(np.dot(x_res,t_X))
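The following sketch (with illustrative names `v` and `M`) compares an explicit row vector with a plain 1-D vector. NumPy accepts both as long as the inner dimensions agree; only the shape of the result differs.

```python
import numpy as np

v = np.array([1, 2, 3, 4])                    # shape (4,)
M = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])    # shape (2, 4)

# Explicit row vector: (1, 4) @ (4, 2) -> (1, 2)
row = v.reshape((1, 4)) @ M.T
print(row.shape)     # (1, 2)

# A 1-D vector gives a 1-D result: (4,) @ (4, 2) -> (2,)
flat = v @ M.T
print(flat.shape)    # (2,)

# Matrix times vector: (2, 4) @ (4,) -> (2,)
print(M @ v)         # [30 70]
```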

## 3. Slicing matrices
Slicing matrices is similar to slicing regular lists in Python.

We can access the first row of a matrix or the first column.

In [None]:
# Everything
print(X[:])
# Row
print(X[0])
# Column
print(X[:,0])

We can access certain elements or a range of them.

In [None]:
# First and third columns
print(X[:,(0,2)])
# First three columns
print(X[:,:3])

You can also slice based on a condition.

In [None]:
print(X[X > 3])
print(X[Y == 3])
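Condition-based slicing works through a boolean mask. The sketch below (the names `M` and `mask` are illustrative) shows the mask itself, how to combine two conditions, and how to count matches.

```python
import numpy as np

M = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# A comparison returns a boolean array with the same shape as M
mask = M > 3
print(mask)

# Indexing with the mask returns the selected values as a 1-D array
print(M[mask])                    # [4 5 6 7 8]

# Combine conditions with & (and) or | (or); the parentheses are required
print(M[(M > 2) & (M % 2 == 0)])  # [4 6 8]

# True counts as 1, so summing the mask counts the matches
print(np.sum(mask))               # 5
```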

## 4. Plotting with matplotlib
Let's first load a dataset that we can plot easily.

In [None]:
import sklearn.datasets as datasets

In [None]:
data = datasets.load_iris()
all_X = data.data
X = all_X[:,0]
Y = all_X[:,1]
t = data.target
t_n = data.target_names.tolist()
print(X.shape)
print(Y.shape)

### Scatter plot

In [None]:
plt.scatter(X,Y)
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(X,Y, c=t)
plt.title('Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.grid()
plt.show()
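Coloring by class with `c=t` does not tell the reader which color is which class. One possible approach, sketched here with synthetic stand-in data (the values, `feat_x`, `feat_y`, `labels` and the class names are assumptions for illustration, not the iris data), is to plot one class at a time and add a legend.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (illustrative only, not the iris values)
rng = np.random.default_rng(0)
feat_x = rng.normal(size=60)
feat_y = rng.normal(size=60)
labels = np.repeat([0, 1, 2], 20)
names = ['class A', 'class B', 'class C']   # hypothetical class names

fig, ax = plt.subplots(figsize=(10, 7))
for k, name in enumerate(names):
    # Select the points of class k with a boolean mask
    ax.scatter(feat_x[labels == k], feat_y[labels == k], label=name)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.grid()
ax.legend()
plt.show()
```

With the iris data, the same loop could run over `t_n` and mask with `t == k`.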

### Histogram

In [None]:
plt.hist(X)
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.hist(x=X, color='blue', alpha=0.7, rwidth=0.85)
plt.title('Sepal Length distribution')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.grid(alpha=0.5, linestyle='--')
plt.show()
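To see the numbers behind a histogram, `np.histogram` returns the bin counts and bin edges without drawing anything. The sketch below uses a small made-up array (`values` is illustrative) and 10 bins, which is Matplotlib's default for `plt.hist`.

```python
import numpy as np

# Made-up measurements (illustrative values only)
values = np.array([4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.1, 6.3, 6.7, 7.0])

# 10 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=10)
print(counts)        # number of values in each bin
print(edges)         # 11 bin edges for 10 bins

# Every value falls into exactly one bin
print(counts.sum() == values.size)   # True
```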

In [None]:
plt.figure(figsize=(10,7))
plt.hist(x=t, bins='auto', color='red', alpha=0.7, rwidth=0.85)
plt.title('Classes distribution')
plt.xlabel('Iris plant')
plt.ylabel('Frequency')
plt.xticks([0.1,1,1.9],t_n)
plt.grid(alpha=0.5, linestyle='--')
plt.show()

### Boxplot

In [None]:
plt.boxplot(all_X)
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(all_X)
plt.title('Features distribution')
plt.ylabel('Centimeters')
plt.xticks([1,2,3,4],data.feature_names)
plt.grid(alpha=0.5, linestyle='--')
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot([X[t==0],X[t==1],X[t==2]])
plt.title('Sepal Length distribution in the different classes')
plt.ylabel('Centimeters')
plt.grid(alpha=0.5, linestyle='--')
plt.xticks([1,2,3],data.target_names)
plt.show()
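The elements of a boxplot can be computed by hand. This sketch assumes Matplotlib's default whisker setting of 1.5 times the interquartile range and uses a made-up array (`sample` is illustrative): the box spans the first to the third quartile, the line inside it is the median, and points beyond the whisker limits are drawn as outliers.

```python
import numpy as np

# Made-up data with one extreme value
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Quartiles: the box edges and the median line
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
print(q1, median, q3)    # 3.0 5.0 7.0

# Whisker limits at 1.5 * IQR from the box; points outside are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sample[(sample < lower) | (sample > upper)]
print(outliers)          # [30.]
```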

### Line plot

In [None]:
plt.plot(np.arange(150),all_X[:,0])
plt.plot(np.arange(150),all_X[:,1])
plt.plot(np.arange(150),all_X[:,2])
plt.plot(np.arange(150),all_X[:,3])
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.plot(np.arange(150),all_X[:,0], label=data.feature_names[0])
plt.plot(np.arange(150),all_X[:,1], label=data.feature_names[1])
plt.plot(np.arange(150),all_X[:,2], label=data.feature_names[2])
plt.plot(np.arange(150),all_X[:,3], label=data.feature_names[3])
plt.xlabel('Examples')
plt.ylabel('Centimeters')
plt.grid(alpha=0.5, linestyle='--')
plt.legend()
plt.show()

# Exercises

The aim of this exercise is to get you familiar with vector and matrix operations and with several commonly used Python libraries.

In this exercise you will learn:

- How to work with vector and matrix operations by hand

- How to work with vector and matrix operations using the Python library NumPy

- How to plot using Matplotlib, a comprehensive Python library for visualization

- How to get started with Scikit-learn, a widely used Python machine learning library


#### First, let's start with vector and matrix operations by hand

**Q1**: Create a new vector using scalar-vector multiplication **(1 point)**

! Remember to use numbers different from those in the slides, otherwise you won't get a point !

**Q2**: Create a new matrix using matrix-matrix product **(1 point)**

! Remember to use numbers different from those in the slides and the example, otherwise you won't get a point !

#### Now let's work with vector and matrix operations using the NumPy library

**Q3**: Create a matrix $X$ of shape `(4,2)` of integers (they should not be all zeros or ones). Compute the transpose of $X$ and calculate the matrix-matrix products $X^{T}X$ and $XX^{T}$ **(1 point)**

**Q4**: Create a vector $b$ of shape `(5,)` of integers (they should not be all zeros or ones). 
Try to do element-wise multiplication of $X^{T}*b$ and sum $X^{T} + b$.
What happened and why? **(1 point)**



**Q5**: Do element-wise multiplication of $X^{T}*x$ and sum $X^{T} + x$.
What happened and why? **(1 point)**

**Q6**: Calculate the mean of the vector $b$. Hint: `np.sum` **(1 point)**

**Q7**: Create a matrix `W` of shape `(2,5)` of floats (between 0 and 1). Calculate $y$, where $y = XW + b$ **(1 point)**

**Q8**: Calculate the following function: $Z = 1/(1 - e^{y})$. Hint: `np.exp` **(1 point)**

**Q9**: Extract the values of $Z$ that are positive. What do you see and why? **(1 point)**

#### Now we will work with the popular Scikit-learn and Matplotlib libraries

**Q10**: Inspect the diabetes dataset from Scikit-learn (sklearn). **(1 point)**
 - How many samples does the dataset consist of?
 - What does the target data represent?


In [None]:
data = datasets.load_diabetes()
all_X = data.data
target = data.target

In [None]:
data

**Q11**: Produce 3 different plots based on the features and target of that dataset and describe what you observe in the plots **(2 points)**

# Dataset for machine learning
The aim of this exercise is for you to learn how to cluster proteins by similarity, extract features from protein sequences and perform homology partitioning. One package must be installed before you start: run `pip install biopython`. For each exercise, show the code in this notebook together with the answer to each question.

### Exercise 1

We will be using CD-HIT on the dataset `multiloc_secretory.fasta`. You can read more about CD-HIT and how it works [here](https://github.com/weizhongli/cdhit/wiki/3.-User%27s-Guide#CDHIT).


Once the job has finished, a `*.clstr` file and a file containing the FASTA sequences will be generated.

In [None]:
%cd
!rm -Rf ml_data
!mkdir ml_data
%cd ml_data
!ln -s /exercises/ml_intro/ml_data/multiloc_secretory.fasta ./multiloc_secretory.fasta # command to make symbolic link
!pwd
!ls

In [None]:
# cd-hit command
!cd-hit -i "multiloc_secretory.fasta" -o multiloc40 -c 0.40 -n 2 -G 0 -aS 0.80

**Q1**: Briefly describe the CD-HIT algorithm and what the different parameters used in the command mean **(1 point)**

**Q2**: Inspect the `*.clstr` file and describe briefly the format **(1 point)**

**Q3**: How many clusters has CD-HIT produced? **(1 point)**

### Exercise 2
**Q4**: Read the `*.clstr` file and create a dictionary (`cluster_dict`) where the **key** is the **accession number** and the **value** is the **cluster number** **(2 points)**

Hint: You will need to code this 

### Exercise 3
**Q5**: Inspect the `multiloc_secretory.fasta` file and get familiar with it. Read the `multiloc_secretory.fasta` file into python and create two dictionaries.
- One dictionary (`frequencies_dict`) where the **key** is the **accession number** and the **value** is a **list of the amino acid frequencies of the N-terminal part of the protein**. We will define the N-terminal part as the **first 30** amino acids in the protein sequence.
- One dictionary (`class_dict`) where the **key** is the **accession number** and the **value** is the **class** (Secretory or Non-Secretory). **(3 points)**

To read the fasta file you can use Biopython or create your own function. Biopython example: https://biopython.org/wiki/SeqIO under `Sequence Input` section.
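If you prefer to write your own reader, the general idea can be sketched in plain Python. This is only an illustration: `read_fasta` is a hypothetical helper, the two records are made up, and the header layout of the real file may differ, so extracting the accession number and class from the header is left open.

```python
import io

def read_fasta(handle):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        if line.startswith('>'):
            header = line[1:]             # a new record starts here
            records[header] = ''
        elif header is not None:
            records[header] += line       # sequences may span several lines
    return records

# Made-up two-record example; io.StringIO stands in for an open file
example = io.StringIO('>seq1 demo\nMKTAY\nIAKQR\n>seq2 demo\nMSLLT\n')
records = read_fasta(example)
print(records)
```

For a file on disk, the same function could be called inside a `with open(...) as handle:` block.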

Hint: You will need to code this

To calculate the list of amino acid frequencies, use the following code:

In [None]:
# n_terminal_seq is a string with the protein sequence
analysed_seq = ProteinAnalysis(n_terminal_seq)
# aa_frequencies will be a list with 20 values, one for the frequency of each amino acid
aa_frequencies = list(analysed_seq.get_amino_acids_percent().values())
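To make clear what such a frequency list represents, here is a hand-written sketch of the same idea in plain Python. It is not Biopython's implementation: `aa_fractions` is a hypothetical helper, the alphabetical ordering of the one-letter codes is an assumption, and the sequence is made up.

```python
from collections import Counter

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'   # the 20 standard one-letter codes

def aa_fractions(sequence, n_terminal=30):
    """Fraction of each standard amino acid in the first n_terminal residues.

    Assumes a non-empty sequence.
    """
    region = sequence[:n_terminal]
    counts = Counter(region)
    return [counts[aa] / len(region) for aa in AMINO_ACIDS]

demo_seq = 'MKKLLA'                       # made-up sequence
fractions = aa_fractions(demo_seq)
print(len(fractions))                     # 20
print(fractions[AMINO_ACIDS.index('K')])  # 2 of the 6 residues are K
```

Slicing with `sequence[:30]` is safe for shorter proteins as well, because it simply returns the whole sequence.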

### Exercise 4
Run the `dataset_partition` function, which takes as input the dictionary with the **cluster number** (`cluster_dict`) as the first argument and the dictionary with the **class** (`class_dict`) as the second argument. Assign its return value to a new dictionary `partition_dict`, which assigns each protein to either the **Train** or the **Validation** set (stored as `'Train'` or `'Valid'`) based on the cluster number.

In [None]:
def dataset_partition(cluster_dict, class_dict):
    ''' Function to separate proteins into N partitions with balanced classes'''
    
    
    n_partitions = 5
    n_class = 2
    sec_dict = {'Non-secretory':0,'Secretory':1}
    label_list = []
    for prot_id in cluster_dict:
        label_list.append(sec_dict[class_dict[prot_id]])
        
    cluster_vector = np.array(list(cluster_dict.values()))  
    label_vector = np.array(label_list)            
    
    
    # Unique cluster number                    
    u_cluster = np.unique(cluster_vector)
    
    # Initialize matrices
    loc_number = np.ones((n_partitions,n_class))
    cl_number = np.zeros(cluster_vector.shape[0])
    
    for i in u_cluster:
        # Extract the labels for the proteins in that cluster
        positions = np.where(cluster_vector == i)
        cl_labels = label_vector[positions]
        
        # Count number of each class
        u, count = np.unique(cl_labels, return_counts=True)
        
        u = u.astype(np.int32)
        temp_loc_number = np.copy(loc_number)
        temp_loc_number[:,u] += count
        loc_per = loc_number/temp_loc_number
        best_group = np.argmin(np.sum(loc_per,axis=1))
        loc_number[best_group,u] += count
        
        # Store the selected partition
        cl_number[positions] = best_group
    
    part_numbers = loc_number.astype(np.int32)-np.ones(loc_number.shape)
    
    tr_numbers = part_numbers[:4]
    val_numbers = part_numbers[4]
    print('Training set. Secretory: %i; Non-secretory: %i\n' % (np.sum(tr_numbers[:,1]),np.sum(tr_numbers[:,0])))
    print('Validation set. Secretory: %i; Non-secretory: %i\n' % (val_numbers[1],val_numbers[0]))
    index = 0
    output_dict = {}
    for prot_id in cluster_dict:
        if cl_number[index] == 4:
            output_dict[prot_id] = 'Valid'
        else:
            output_dict[prot_id] = 'Train'
        index += 1
    
    return output_dict

**Q6**: Why do we need to take into consideration the cluster number when dividing the dataset into train and validation set? **(2 points)**

**Q7**: How many secretory and non-secretory samples do we see in the training and validation sets? What is the total sample size of the training set? What is the total sample size of the validation set? **(1 point)**

### Exercise 5
**Q8**: Create 4 output files based on the created dictionaries: 
- **freq_train.txt**. Amino acid frequencies for the proteins belonging to the train set (Tabular file with 20 columns).
- **freq_valid.txt**. Amino acid frequencies for the proteins belonging to the validation set (Tabular file with 20 columns). 
- **class_train.txt**. Label for the proteins belonging to the train set. Write `0` for `Non-secretory` and `1` for `Secretory`. (Tabular file with 1 column).
- **class_valid.txt**. Label for the proteins belonging to the validation set. Write `0` for `Non-secretory` and `1` for `Secretory`. (Tabular file with 1 column). **(2 points)**

Note: You can create these 4 files using a single `for` loop through `partition_dict`, which ensures that the order of `freq_*.txt` and `class_*.txt` is the same.

Hint: To convert a list into text separated by tabs use: `'\t'.join(map(str,your_list))`.
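A short sketch of what the join produces (the frequency lists here are shortened, made-up values, and `io.StringIO` stands in for an open output file):

```python
import io

freqs = [0.1, 0.0, 0.25]                 # shortened, made-up frequency list
line = '\t'.join(map(str, freqs))
print(repr(line))                        # '0.1\t0.0\t0.25'

# Writing one line per protein; a newline ends each row
out = io.StringIO()
for row in ([0.1, 0.0, 0.25], [0.5, 0.5, 0.0]):
    out.write('\t'.join(map(str, row)) + '\n')
print(out.getvalue())
```

`map(str, ...)` is needed because `join` only accepts strings, not floats.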


Hint: You will need to code this 