# Part 1: Introduction and brief overview

This workshop is conducted by ***Milena Vujović$^{1}$***, ***Frederikke Isa Marin$^{1}$*** and ***Anna-Lisa Schaap-Johansen$^{1}$***. 

Data wrangling, visualisation and basic statistics have become staples of a researcher's everyday life. It doesn't matter whether your field of interest is chemistry, biology, immunology or bioinformatics: it is of utmost importance that you are able to present the results of your analysis in a clear and concise way. Furthermore, it is paramount that you are able to choose the best analysis to answer your scientific question. 

With this in mind, today we will go over
1. manipulating data with pandas
2. visualisation techniques available in python 

At the end of this day you should be able to:
1. import and manipulate your data effortlessly within the pandas software library 
2. confidently choose the best way to visualise your data 
3. make your desired graph with little effort 


We will go over two datasets in order to fulfil our goals. 

The first dataset, stored in "aa_frequency_location.tsv", has information on the N-terminus of proteins. The dataset consists of two classes: Secretory and Non-secretory proteins. The input consists of 20 features, which are the amino acid frequencies of the first 30 amino acids of a protein (the N-terminal part). Our main question is whether we can see any differences in amino acid usage between secretory and non-secretory proteins, and how we can use this to classify the proteins. 

The second dataset is stored in "tissue_expression.tsv". It contains gene expression levels for 189 samples and 7 tissues. 

(source: http://genomicsclass.github.io/book/pages/pca_svd.html)


Suggested reading, useful links and inspiration: 
- Python graph gallery: https://python-graph-gallery.com/
- Pandas cheat sheet: http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3
- Pandas cheat sheet for data science in Python: https://www.datacamp.com/community/blog/python-pandas-cheat-sheet?utm_source=adwords_ppc&utm_campaignid=1655852085&utm_adgroupid=77088685371&utm_device=c&utm_keyword=%2Bpandas%20%2Bcheat%20%2Bsheet&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=353755544529&utm_targetid=kwd-589281899014&utm_loc_interest_ms=&utm_loc_physical_ms=1005023&gclid=Cj0KCQjwnv71BRCOARIsAIkxW9EbjTYewJGXuT-YGeJu1TCijpeHLcYetiFa73kju8JJbRh9IVYhk7gaAvNjEALw_wcB

- Seaborn cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf

***
You can contact us at<br>
Milena Vujovic: milvu@dtu.dk (twitter: *@sciencisto* ) <br>
Frederikke Isa Marin: frisa@dtu.dk (twitter: *@fimarin42*) <br> 
Anna-Lisa Schaap-Johansen: alsj@dtu.dk (twitter: *@SchaapJohansen*) <br>
and also for the duration of this course on wechat :)

$^{1}$ Bioinformatics section, DTU Health Technology, Technical University of Denmark, Greater Copenhagen area, Denmark<br>


In [None]:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.plotting import parallel_coordinates

import umap
from sklearn.manifold import TSNE


import matplotlib.pyplot as plt

from sklearn import decomposition
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

In [None]:
plt.rcParams['figure.figsize'] = [10, 10]

In [None]:
## Symbolic link to the data: 
%cd
%cd ml_data
!ln -s /exercises/ml_intro/ml_data/aa_frequency_location.tsv ./aa_frequency_location.tsv # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/aa_frequency_location_incomplete.tsv ./aa_frequency_location_incomplete.tsv # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/tissue_expression.tsv ./tissue_expression.tsv # command to make symbolic link


!pwd
!ls

# Load the data 

For loading data into a pandas dataframe, we are using the **pandas.read_csv()** function. Good practice when using a function in python (or any other programming language) is to look at documentation for that function. You can find it online at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Alternatively, in jupyter, you can type the function name with a question mark after it. Try it in the cell below. 

In [None]:
pd.read_csv?

You will now have all the documentation for this function as output, with the function arguments listed and their descriptions below. 
Let's try to load in our dataframe. 

**Note**
If you one day want to run this in a jupyter notebook on your local computer and not on the server like we are doing now, a separate window will open. Once you have seen which options are available, you are free to close the pop-up window by clicking x in its top-right corner. 




In [None]:
aa_freq_loc_df = pd.read_csv("aa_frequency_location.tsv")
aa_freq_loc_df 

This doesn't look so good. Does it? 

Our first dataset is stored in a .tsv file, meaning tab-separated file. Because the function's default is .csv, meaning comma-separated file, we need to specify that our separator is a tab. Let's try again below. 

In [None]:
aa_freq_loc_df = pd.read_csv("aa_frequency_location.tsv", sep = "\t")
aa_freq_loc_df
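
The effect of the separator can also be seen on a tiny in-memory table. The sketch below is only an illustration: the text, the column names `A`, `C` and `location`, and the values are made up and do not come from the workshop file.

```python
import io
import pandas as pd

# Hypothetical two-row table, tab-separated like a .tsv file
tsv_text = "A\tC\tlocation\n0.10\t0.03\tSecretory\n0.07\t0.00\tNon-secretory\n"

# Default separator is a comma: each whole line ends up in a single column
wrong = pd.read_csv(io.StringIO(tsv_text))

# Tab separator: the three columns are split correctly
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes is a quick way to notice that a file was read with the wrong separator.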

# Data manipulation
A pandas dataframe is a two-dimensional data structure used to store data in rows and columns. It has the following structure: 

<img src = "Figures_2020_05_19/Data_Frame.png">



## Q1 What are the number of rows and columns in the dataframe? 

Now we have our data stored in a pandas dataframe. The rows correspond to individual proteins. Using **shape** in pandas, determine how many rows and columns we have. It is an attribute of the dataframe rather than a function, so it is accessed without parentheses: dataframe.shape. To look up documentation for anything that operates on a dataframe, use pd.DataFrame.function? (insert the name instead of \*function\*).

*Hint: you can look at the function's documentation using ? in a new jupyter cell*

*Hint: you have used the same function in yesterday's exercises*

In [None]:
pd.DataFrame.shape?

**Answer:** 

What are the column names in the dataframe? 

Use the columns attribute of the dataframe to list all the column names: 


In [None]:
aa_freq_loc_df.columns

Another popular option is to call list() on the dataframe. Because iterating over a dataframe yields its column labels, list() returns all the column names available in the dataframe. 

In [None]:
list(aa_freq_loc_df)

The output is a list of column names. Notice that the type of object differs from the output of the columns attribute. There, we receive a pandas Index object because it holds the labels along the column axis. 

You can check this by using the **type()** function around the object you want to inspect. Try it below on aa_freq_loc_df.columns. 


In [None]:
type(aa_freq_loc_df.columns)
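
As a small recap, the same inspection steps can be tried on a toy dataframe. This is a sketch on made-up data; `toy_df` and its values are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Hypothetical miniature version of the protein table
toy_df = pd.DataFrame({
    "A": [0.10, 0.07, 0.13],
    "C": [0.03, 0.00, 0.03],
    "location": ["Secretory", "Non-secretory", "Secretory"],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = toy_df.shape

# columns returns a pandas Index, list() returns a plain Python list
column_index = toy_df.columns
column_list = list(toy_df)

print(n_rows, n_cols)
print(type(column_index), type(column_list))
```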

## Q2 What do the rows and columns correspond to? 

**Answer**: 

In a pandas dataframe we can choose individual columns to look at by specifying the column name. For the location column we do this in the following way: 

In [None]:
aa_freq_loc_df.location

Another way to get all the values in a column is: 

In [None]:
aa_freq_loc_df["location"]

Now we have the one column with its corresponding index. If we want to extract a list, we convert our column to a list: 

In [None]:
list(aa_freq_loc_df["location"])

If we want to see the length of the list, in this case this corresponds to the number of rows, we use the len function on our list. Try it out: 

In [None]:
len(list(aa_freq_loc_df["location"]))

Indeed it corresponds to the number of rows. This way we have performed a sanity check that our list contains all the information for all the rows in the dataframe. 

## Q3 how many different categories of proteins do we have based on the location? 

In order to check this, we can apply the pandas **unique()** function to our extracted column; it will output the unique values. 

**Hint** if you want to see information on a function performed on a pandas column, the syntax is pd.Series.function? because a pandas dataframe column is a series of values with corresponding indices. 
Alternatively, because this is a pandas function and not a dataframe function, you can also use pd.function? 

There are therefore 2 equally valid ways to get to the answer: 
1. apply unique on the series: df["column name"].unique()
2. apply the pd function on a column: pd.unique(df["column name"])
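
The two routes listed above can be sketched on a toy dataframe as follows. The data are made up for illustration; note that single brackets are used so that a Series is selected, since selecting with double brackets returns a dataframe, which has no unique() method.

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset
toy_df = pd.DataFrame({
    "A": [0.10, 0.07, 0.13, 0.05],
    "location": ["Secretory", "Non-secretory", "Secretory", "Non-secretory"],
})

location = toy_df["location"]          # single brackets give a Series

via_method = location.unique()         # route 1: method on the Series
via_function = pd.unique(location)     # route 2: top-level pandas function
n_categories = location.nunique()      # number of distinct values

print(list(via_method), n_categories)
```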


In [None]:
pd.Series.unique?

**Answer**: 



## Q4 extract a single row from the dataframe (1 point)

In pandas we can also select the rows that we are interested in by their index. 

Select the first row of the dataframe. This is achieved by using the **loc** indexer in pandas. Check what the index of the first row is. 

*Hint: use the ? option with pd.DataFrame.loc and check how it is used.*

When you have done that, extract all the values in the first row by substituting the correct index in the following code: 
`aa_freq_loc_df.loc[row_index,]`

In [None]:
pd.DataFrame.loc?

## Q5 what is the location of the first protein in the dataframe and what is the frequency of Serine? 

*Hint: look at the row we have just extracted from the dataframe*

**Answer**: 

## Q6 extract a single value from the dataframe 

You might have noticed that **loc** accepts two arguments, the row and the column index. 

`df.loc[row_index, column_index]`
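
A minimal sketch of label-based selection on a toy dataframe is shown here. The index labels 10, 11 and 12 and all values are hypothetical; they are chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy_df = pd.DataFrame(
    {
        "A": [0.10, 0.07, 0.13],
        "C": [0.03, 0.05, 0.02],
        "location": ["Secretory", "Non-secretory", "Secretory"],
    },
    index=[10, 11, 12],
)

row = toy_df.loc[11]              # whole row with index label 11 (a Series)
value = toy_df.loc[11, "C"]       # single value: row label 11, column "C"
first_by_position = toy_df.iloc[0]  # iloc selects by position instead of label

print(row)
print(value)
```

The distinction between `loc` (labels) and `iloc` (positions) matters once a dataframe has been filtered and its index no longer starts at 0.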

Try to extract the row with the index 23 and column for valine. What is the value that you get? 

**Answer**: 

It is also important that we can filter values in our dataframe by some condition. For example, let's extract all the rows that correspond to Secretory proteins. To do this, we look within the location column and find all rows that have the value "Secretory". To make the condition we use the comparison operator ==, meaning equal to, which gives a boolean mask that we use to index the dataframe.

In [None]:
aa_freq_loc_df[aa_freq_loc_df['location'] == "Secretory"]
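
To see what happens in two steps, the condition can first be stored as a boolean mask and then used for indexing. The following is a sketch on made-up data; `toy_df` and the threshold 0.09 are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset
toy_df = pd.DataFrame({
    "A": [0.10, 0.07, 0.13, 0.05],
    "C": [0.03, 0.05, 0.02, 0.01],
    "location": ["Secretory", "Non-secretory", "Secretory", "Non-secretory"],
})

# Step 1: the comparison gives one True/False value per row
mask = toy_df["location"] == "Secretory"

# Step 2: indexing with the mask keeps only the rows where it is True
secretory = toy_df[mask]

# The same pattern works for numerical conditions
high_a = toy_df[toy_df["A"] > 0.09]

print(mask.tolist())
print(secretory)
```

Note that the filtered dataframe keeps the original index labels of the rows it retains.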

## Q7 What is the shape of the new dataframe? 

To be able to perform new functions on our dataframe it's best to save it as a new variable: aa_freq_loc_df_sec 

Check what is the shape of the new dataframe. What is the starting index and the last index of the dataframe? 
How many secretory proteins are there? 

In [None]:
aa_freq_loc_df_sec = aa_freq_loc_df[aa_freq_loc_df["location"] == "Secretory"]


**Answer**: 

## Q8 Filter all the rows in which Histidine has a frequency above 0.066 

Using the above example as inspiration create a dataframe histidine_above_0066, where the value of the histidine frequency is filtered to be above 0.066. 

How many rows does this dataframe contain and how many columns? 

*Hint: change the column you are filtering on (no longer location) and change the condition from equal to "value" into greater than value*



In [None]:
histidine_above_0066 = 

**Answer**: 

## Reset the index of the histidine_above_0066 dataframe

You might notice that the histidine_above_0066 dataframe still carries the index of the original dataframe. Use the **reset_index** function to renumber the index from 0. Use the drop = True argument to avoid saving the old index in a new column. Remember to save the result back into the histidine_above_0066 dataframe. 

In [None]:
pd.DataFrame.reset_index?

In [None]:
histidine_above_0066 = histidine_above_0066.reset_index(drop = True)
histidine_above_0066
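
The effect of the drop argument can be checked on a toy example. This sketch uses made-up data; `toy_df` and the threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"A": [0.10, 0.07, 0.13, 0.05]})

filtered = toy_df[toy_df["A"] > 0.09]         # keeps the original labels 0 and 2

renumbered = filtered.reset_index(drop=True)  # new index 0, 1; old index discarded
kept = filtered.reset_index()                 # old index stored in a column named "index"

print(list(filtered.index), list(renumbered.index))
print(list(kept.columns))
```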

## Q9 Type of data

It is always important to know how your data is stored in the dataframe, that is, which type it has. You can check this with the **dtypes** attribute in pandas.

Pandas supports most data types, and it's important to note that columns with mixed and string types are stored with the object type.
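
A small sketch of how dtypes reports different kinds of columns is given below. The dataframe is made up for illustration. The exact dtype reported for pure string columns may depend on the pandas version, so only the numerical and mixed columns are discussed here.

```python
import pandas as pd

# Hypothetical toy data with a float, an integer and a mixed column
toy_df = pd.DataFrame({
    "frequency": [0.10, 0.07, 0.13],
    "count": [3, 2, 4],
    "mixed": [1, "two", 3.0],   # numbers and text in the same column
})

# One dtype per column; the mixed column is stored as object
print(toy_df.dtypes)
```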

What types of data do you have in your dataframe? 


In [None]:
aa_freq_loc_df.dtypes

**Answer**: 

# Data visualisation

Now that we have familiarised ourselves with the dataset we can proceed to visualise the data. We will go through all the chart types mentioned in the lecture, analyse and explain why each of these might be a good choice for visualisation. 

## Scatter plot. 

The figures in today's lecture have been generated with the following code: 

In [None]:
md = {'Variable1': [1.3, 3.4, 2.3, 3.5, 2.2], 
      'Variable2': [1.1, 4.3, 2.1, 9.4, 7.8], 
      'Variable3': [3.5, 2.5, 7.8, 1.2, 3.4],
      "Group" : ["A", "A", "A", "B", "B"]}

mv_df = pd.DataFrame(data = md)
mv_df

In [None]:
sns.set_style("ticks")
scPlt = sns.lmplot(x="Variable1", y="Variable2", data=mv_df, fit_reg=False, hue = "Group", legend = True)
scPlt.set(xlabel='Variable 1', ylabel='Variable 2')
scPlt.set(xlim = (0,5), ylim = (0,10))
plt.title("Scatter Plot with legend")


### Q10 Using the code above as inspiration create a scatter plot 

Let's compare the frequency of Alanine vs Tryptophan. Plot Alanine on the x axis, Tryptophan on the y axis and colour by location. Set the limits on both axes to 0.5. Use our complete dataframe aa_freq_loc_df as the data. Set appropriate titles for the axes and the plot.

### Q11 Move the legend inside the plot 

Instead of using seaborn's default legend position within the **sns.lmplot** function, place the legend by adding the **plt.legend()** function at the end of your plotting code. 

*Hint* Use the loc argument to specify the position. Use plt.legend? in jupyter to learn which values the loc argument takes.

You might notice that even though we have 3058 samples we don't see a lot of points. 

### Q12 Can you think why this is? 

**Answer**: 

### Q13 Reduce the opacity of points in the plot 

In order to visualise the number of overlapping points we can reduce the opacity (that is, increase the transparency) of the dots in the scatter plot.

Try adding the **scatter_kws={'alpha':0.1}** argument to your lmplot function. How does this change the plot? 

**Answer**: 

### Q14 Try substituting the value of 0.1 with 1 and 0.01. How does this change the plot? Which one of the three alpha values would you choose for visualisation and why?

In [None]:
# Plot with opacity 1

In [None]:
# Plot with opacity 0.01

**Answer**: 

### Q15 Visualise the relationship between Alanine and Cysteine frequency with the opacity (alpha) set to 0.1

Instead of Alanine vs Tryptophan, now visualise Alanine vs Cysteine frequency. 


As you can see, the plots don't differ a lot, and visualising them one by one would be very tedious: with 20 amino acids there are 190 pairwise combinations! We need to find a better way to visualise our data. 
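
The number of pairwise combinations can be verified with a few lines of code. In this sketch the one-letter amino acid codes are used as stand-ins for the column names, which is an assumption about how the columns are labelled.

```python
from itertools import combinations
import math

# The 20 standard amino acids as one-letter codes (assumed column labels)
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")

# Every unordered pair of two different amino acids
pairs = list(combinations(amino_acids, 2))
n_pairs = len(pairs)

# The same number from the binomial coefficient "20 choose 2"
print(n_pairs, math.comb(len(amino_acids), 2))
```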


## Correlogram

In order to visualise all the relationships between the frequencies, use a correlogram as shown in the lectures. The code for producing the plot in the lectures is: 
`
sns.pairplot(mv_df, kind="scatter", hue = "Group")
plt.show()
`

### Q16 Create a correlogram of amino acid frequencies 

Using the code above as inspiration create a correlogram of aminoacid frequencies in the aa_freq_loc_df dataset. 




### Q17 How many plots have you visualised? 

**Answer**: 

As you can see, it might be difficult to draw any conclusions from this large number of scatter plots. For instance, some amino acid frequencies seem to differ between Non-secretory and Secretory proteins. However, for visualising relationships between amino acid frequencies, this manner of representation is overwhelming. Because the information is not presented clearly and we cannot show the relationships in a clear and coherent way, scatter plot visualisations are not optimal for this type of data.


## Histogram

The code used to produce histograms in the lecture slides is: 

`
sns.set_style("ticks")
hist = sns.distplot(mv_df['Variable1'], bins = 10, hist=True, kde = False)
hist.set(xlabel='Variable 1', ylabel='Count')
plt.title("Histogram variable 1")`
### Q18 Create a histogram of the data for Alanine 

Using the code above as inspiration, create a histogram of the Alanine frequency in the protein data: aa_freq_loc_df. 



### Q19 increase the bin number to 100 and visualise the plot again. Why has the y axis changed?  

**Answer**: 

### Q20 Create a density plot with Alanine frequency. 

*Hint* code used to generate the density plot from the lectures: 

`sns.set_style("ticks")
dens_plt = sns.kdeplot(mv_df['Variable1'], shade=True)
dens_plt.set(xlabel = "Value", ylabel = "Probability Density")
plt.title("Density plot Var 1 ")`





### Q21 What are the differences between the density plot and the histogram? 

**Answer**  

You can also create density plots (and histograms) with multiple variables on the same axis. 

`sns.set_style("ticks")
dens_plt = sns.kdeplot(mv_df['Variable1'], shade=True)
dens_plt = sns.kdeplot(mv_df['Variable2'], shade=True)
dens_plt.set(xlabel = "Variable value", ylabel = "Probability Density")
plt.title("Density plot Variable 1 and 2")`

### Q22 Create a density plot with Alanine and Tryptophan shown on the same axis. Change the area colours to red and blue by adding the arguments color = "r" and color = "b", respectively, to the kdeplot function

*Hint* use the color argument when creating the plot. 


Since we have a lot of variables (20 amino acids), it would be better if we didn't need to specify each density plot by hand. You can use a **for** loop to iterate through the column names. An example from today's teaching material is created by: 
`
sns.set_style("ticks")
for col in list(mv_df)[:-1]:
    dens_plt = sns.kdeplot(mv_df[col], shade=True)
    
dens_plt.set(xlabel = "AA Frequency", ylabel = "Probability Density")
`

The list of column names is subsetted with [:-1] because our last column is "Group", a categorical variable that we need to exclude from the density plot, which only takes numerical variables. 
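
Slicing with [:-1] relies on the categorical column being the last one. As a sketch of an alternative that does not depend on column order, the numerical columns can be selected by their type with select_dtypes; this is shown as one possible option, not as the approach used in the lecture.

```python
import pandas as pd

# The example dataframe from the lecture code
md = {'Variable1': [1.3, 3.4, 2.3, 3.5, 2.2],
      'Variable2': [1.1, 4.3, 2.1, 9.4, 7.8],
      'Variable3': [3.5, 2.5, 7.8, 1.2, 3.4],
      "Group": ["A", "A", "A", "B", "B"]}
mv_df = pd.DataFrame(data=md)

# Option 1: drop the last column name by position
by_position = list(mv_df)[:-1]

# Option 2: keep only the numerical columns, wherever they are
by_dtype = list(mv_df.select_dtypes(include="number"))

print(by_position)
print(by_dtype)
```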

### Q23 Create a density plot of all amino acid frequencies 

*Hint* Using the inspiration above, iterate through the column names, removing the column which has categorical data. 

*Hint2* For moving the legend outside the plot paste this after the for loop: 

`plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0.)`

If interested how this works, again you can use the ? in jupyter to get information on the plt.legend function

We have now created a plot which contains the density distributions of all the amino acids in the dataframe.
### Q24 Can we use this plot to say anything about the amino acid usage in secretory vs non-secretory proteins? 

**Answer** 

## Box plot

In today's lecture we also mentioned boxplots. Specifically three types: 
1) Group on the x axis
2) Variable on the x axis 
3) all variables on the X axis grouped by categorical variable. 

### Q25 Considering the data frame that we are analysing, which two types of boxplot are suitable for plotting our data? And which one of those two do you think will reveal most information about our dataset and why? 

**Answer**: 

### Q26 Generate a boxplot with variables on the X axis grouped by location 

To produce the boxplot from today's lecture we had to transform the data frame from what is known as the "tidy" format to the "long" format. This is achieved using the **melt** function. (Remember?) When data is transformed from the tidy to the long format, each row is an observation and each column is a variable. 

Tidy format             |  Long Format
:-------------------------:|:-------------------------:
<img src = "Figures_2020_05_19/DF_tidy.png">  |  <img src = "Figures_2020_05_19/DF_long.png"> 

`
mv_df_long = pd.melt(mv_df, "Group", var_name="Variables", value_name="Value")
`
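
To see what melt does to the shape of the data, the lecture example can be run end to end. The sketch below rebuilds the mv_df dataframe from the scatter plot section and compares the shapes before and after.

```python
import pandas as pd

# The example dataframe from the lecture code
md = {'Variable1': [1.3, 3.4, 2.3, 3.5, 2.2],
      'Variable2': [1.1, 4.3, 2.1, 9.4, 7.8],
      'Variable3': [3.5, 2.5, 7.8, 1.2, 3.4],
      "Group": ["A", "A", "A", "B", "B"]}
mv_df = pd.DataFrame(data=md)

# "Group" is kept as an identifier; the three variable columns are stacked
mv_df_long = pd.melt(mv_df, "Group", var_name="Variables", value_name="Value")

print(mv_df.shape)        # 5 rows, 4 columns
print(mv_df_long.shape)   # 15 rows (5 rows x 3 variables), 3 columns
print(mv_df_long.head())
```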

This new dataframe can now be plotted with the **boxplot** function: 

`
sns.set_style("ticks")
bx_plt = sns.boxplot(x="Variables", hue="Group", y="Value", data=mv_df_long, palette= "Set2")
bx_plt.set( ylabel = "Value")
plt.title("Variables across groups")
`

### Q27 Use the code above as inspiration to convert aa_freq_loc_df to the long format, store it in aa_freq_loc_df_long and plot it grouped by location. Name the variable column AminoAcids and the value column Frequency. 

*Hint* using the example from the density plot, move the legend outside of plot area. 

In [None]:
# convert the data to long format: 


In [None]:
# Plot the box plot


### Q28 Which amino acid frequency is mostly different between Non-secretory and Secretory proteins? 


**Answer** 


## Violin plot 
In order to also understand group size distribution, we might resort to the violin plot instead of the boxplot. 

### Q29 Using the **sns.violinplot** function, create a grouped violin plot for all amino acids 

*Hint* It is almost the same as creating a boxplot of the same type, you just need to substitute the box plot function with the violin plot


## Parallel plot 

The plot in today's lecture is generated by the following code: 

`
parallel_coordinates(mv_df, "Group", colormap= plt.get_cmap("Set2"))`


### Q30 Create a parallel plot of amino acid frequencies and colour based on the protein group 

*Hint* Use the tidy data instead of the long format

*Hint* Use the code above as inspiration


### Q31 Can you see the same differences in the Leucine Frequencies as you can see in the boxplot? Why do you think that is? 


**Answer** 


Let's try to calculate the mean value per group and plot the trends on the parallel plot. This way we might get a better visualisation, but we lose information on individual points. 

We need to group values in the dataframe by the categorical variables first, and then apply the mean function on the grouped data. 

`
mv_df_mean =  mv_df.groupby('Group', as_index=False).mean()
sns.set_style("ticks")
parallel_coordinates(mv_df_mean, "Group", colormap= plt.get_cmap("Set2"), axvlines=False)
`
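
The grouping step on its own can be inspected without plotting. This sketch applies the groupby line above to the lecture dataframe and pulls out one group mean as a check.

```python
import pandas as pd

# The example dataframe from the lecture code
md = {'Variable1': [1.3, 3.4, 2.3, 3.5, 2.2],
      'Variable2': [1.1, 4.3, 2.1, 9.4, 7.8],
      'Variable3': [3.5, 2.5, 7.8, 1.2, 3.4],
      "Group": ["A", "A", "A", "B", "B"]}
mv_df = pd.DataFrame(data=md)

# One row per group; as_index=False keeps "Group" as a regular column
mv_df_mean = mv_df.groupby('Group', as_index=False).mean()

# Mean of Variable1 in each group, for a manual check
mean_a = float(mv_df_mean.loc[mv_df_mean["Group"] == "A", "Variable1"].iloc[0])
mean_b = float(mv_df_mean.loc[mv_df_mean["Group"] == "B", "Variable1"].iloc[0])

print(mv_df_mean)
```

Keeping "Group" as a column matters here because parallel_coordinates expects the class column to be part of the dataframe.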


### Q32 create a parallel plot of mean amino acid frequencies by group 

*Hint* use the ? option in jupyter to get more information of the pd.DataFrame.groupby function if you would like to know more about the options. 


### Q33 What can you observe about the Leucine frequencies now? Is it more similar to the boxplot or the parallel plot with all the points?

**Answer** 

## PCA 

Another popular visualisation technique is the dimensionality reduction method Principal Component Analysis (PCA). For datasets with a large number of observables, it rotates the coordinate system so that the first dimension aligns with the direction of maximal variance in the data, known as Principal Component 1 (PC1). The next coordinate (PC2) is orthogonal to the first principal component and explains less variance in the data. The same follows for PC3 and so on, until all the dimensions of the data are explained. In our case we have 20 observables corresponding to the 20 amino acid frequencies. 

Because we are visualising in 2D, we only need to choose 2 principal components along which we can visualise the points (the proteins). 
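
The two properties described above (components ordered by explained variance and orthogonal to each other) can be checked numerically. The following is a sketch on randomly generated toy data; the array `x_toy`, its size and the induced correlation are all assumptions made for illustration.

```python
import numpy as np
from sklearn import decomposition

# Hypothetical toy data: 200 points, 5 observables, two of them correlated
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 5))
x_toy[:, 1] += 2.0 * x_toy[:, 0]

pca_toy = decomposition.PCA(n_components=3)
scores = pca_toy.fit_transform(x_toy)      # coordinates of the points along PC1..PC3

# Fraction of variance explained by each component, in decreasing order
ratios = pca_toy.explained_variance_ratio_

# Dot products between the component directions: identity matrix if orthonormal
gram = pca_toy.components_ @ pca_toy.components_.T

print(scores.shape)
print(ratios)
print(np.round(gram, 6))
```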

We will follow these steps: 

1. Separate data into numerical and categorical variables
2. Normalize the data by the mean and scale by the standard deviation
3. Fit the PCA and visualize the Principal components. 

First we need to separate the categorical variable from the observables. We will store observables in x, and categorical variable in y. 


`observables = list(mv_df)[:-1]
#Separating out the features
x = mv_df.loc[:, observables].values
#Separating out the target
y = mv_df.loc[:,['Group']].values`


### Q34 Separate the data in x (observables) and y (categorical data). What is the shape of x? Does x have column names? 


**Answer**: 

You can check the type of variable x by using the function type(). You can see that x is no longer a pandas dataframe but a numpy array:

In [None]:
type(x)

Now that the numerical data (amino acid frequencies) are stored in x, we need to normalize and scale the data. 
We need to normalize by the mean and scale by the standard deviation. 


In [None]:
x_scaled = StandardScaler().fit_transform(x)
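
What "normalize by the mean and scale by the standard deviation" means for each column can be written out by hand and compared with StandardScaler. This is a sketch on randomly generated toy data, assuming the default settings of StandardScaler; `x_toy` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 points, 3 observables
rng = np.random.default_rng(0)
x_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# By hand: subtract the column mean, divide by the column standard deviation
x_manual = (x_toy - x_toy.mean(axis=0)) / x_toy.std(axis=0)

# With scikit-learn, using its default arguments
x_sklearn = StandardScaler().fit_transform(x_toy)

# True if both approaches agree up to floating point precision
same = np.allclose(x_manual, x_sklearn)
print(same)
```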

### Q35 Use the information from StandardScaler? to find out which arguments it uses by default. Have both the mean and the standard deviation been used? 

In [None]:
StandardScaler?

**Answer**: 


### Q36 What is the shape of the scaled observables? Is it different from the unscaled data? 

In [None]:
x_scaled.shape

**Answer**: 

### Q37 Inspect the mean and standard deviation of the original numerical data x and the normalized and scaled data x_scaled. What is the mean and standard deviation in both cases? 

*Hint* Use np.mean and np.std on the data to calculate the mean and standard deviation
*Hint*: Observe the standard deviation in the scaled data and comment on the value. 

**Answer**: 

We now need to perform principal component analysis on the scaled data.

`#define pca number of components:
pca_scaled = decomposition.PCA(n_components=3)

#fit the PCA on the scaled data:
principalComponents_scaled = pca_scaled.fit_transform(x_scaled)

#convert the PCA result into a dataframe for easier manipulation
principalDf_scaled = pd.DataFrame(data = principalComponents_scaled)

#add the categorical variables that we can use to label the data.
finalDf_scaled = pd.concat([principalDf_scaled, mv_df[["Group"]]], axis = 1)`


### Q38 use the code from the lectures as inspiration to perform PCA on x_scaled. Set the number of components to 12. 




### Q39 What is the maximum number of components we can use in the PCA? Why? 

**Answer** 

### Q40 What is the shape of the finalDf_scaled and what do the column names correspond to? 

**Answer**: 

### Q41 Visualize PCA with PC1 and PC2. How much of the variance is explained by PC1 and PC2? Can we see a clear separation between secretory and non-secretory proteins? 

*Hint* Use the code from the lectures as inspiration:

`#Percent of variance explained by PC1 and PC2 with indices 0 and 1, respectively
per_PC1 = str(round(pca_scaled.explained_variance_ratio_[0]*100, 2)) 
per_PC2 = str(round(pca_scaled.explained_variance_ratio_[1]*100, 2))

fig, ax = plt.subplots(figsize=(10,10))
#define number of colours in the colour palette:
cmap = sns.color_palette("hls", 2) 
p = sns.scatterplot(x=0, y=1, # Visualise PC1: index 0 and PC2: index 1
                     hue="Group",
                     palette=cmap, s=100,
                     data=finalDf_scaled, 
                   alpha = 0.5)

#place the legend outside the plot
p.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) 
#make axis labels using per_PC1 and per_PC2 variables
p.set(xlabel="".join(["PC1 (",per_PC1,"%)"]), ylabel="".join(["PC2 (",per_PC2,"%)"])) 
plt.tight_layout()
p.get_figure().savefig("Teaching_figures/PCA_on_scaled_data.png")`

### Q42 using the code above as inspiration plot PC1 vs PC3 on the PCA plot. How much variance is explained by PC3? 

**Answer**: 

### Q43 Plot the cumulative variance using the code below. How many Principal Components are needed to explain at least 60% of the variance in the data? 

In [None]:
var = np.cumsum(pca_scaled.explained_variance_ratio_*100)
plt.figure(figsize=(20,10))
plt.ylabel('Variance Explained')
plt.xlabel('Number of Principal Components')
plt.xlim(0,12)
plt.xticks(np.arange(0, 12, step=1), np.arange(1,13))
plt.yticks(np.arange(0, 110, step=10), np.arange(0, 110, step=10))  # tick positions and labels must have the same length

plt.title('PCA Analysis')
plt.plot(var, marker='o', linestyle='--')
plt.show()
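
Besides reading the value off the plot, the number of components needed to reach a threshold can be computed from the cumulative sum. The sketch below uses made-up explained variance ratios (`toy_ratios` is hypothetical and not the result for the protein data).

```python
import numpy as np

# Hypothetical explained variance ratios for five components
toy_ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.10])

cumulative = np.cumsum(toy_ratios)   # running total of explained variance
threshold = 0.60

# Position of the first component where the running total reaches the threshold,
# plus one because components are counted from 1
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_needed)
```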

**Answer**: 

## t-SNE

We will now proceed to do tSNE, which is part of the scikit-learn library. In order to familiarise ourselves with the options of the function, run TSNE? below. The parameters we're mostly interested in are the perplexity, the number of iterations and the number of components. The end of the documentation shows how to use TSNE to embed x (our data) into x_embedded. 

### Q44 Use the code in the example in documentation to embed the data x into 2 dimensional space using tSNE. 

Set the number of iterations to 2000 and perplexity to 50. Store the tSNE embedding into x_embedded. 

*Hint* x_embedded = TSNE(arguments you need to use).fit_transform(x)

*Hint* You only need the 4$^{th}$ line of the example. 

In [None]:
TSNE?

### Q45 what is the shape of the embedded data? Why do we have that number of columns? 


**Answer**: 

You might notice that this takes some time to run. If you remember, in the lectures we mentioned that when dealing with a large number of points it is a good idea to do PCA dimensionality reduction before applying the t-SNE algorithm. This speed-up will become more evident when dealing with even larger datasets. However, keep in mind that these are two different dimensionality reduction techniques, so it's important to understand your data and when this speed-up is justified. You should be careful about the number of dimensions you keep in the PCA prior to tSNE: you do not want to reduce the data to 2 or 3 dimensions already, as this would make the tSNE redundant. 


Now, we have already done PCA on the frequencies data before by running the following code: 

`
pca_scaled = decomposition.PCA(n_components=12)
principalComponents_scaled = pca_scaled.fit_transform(x_scaled)
`



### Q46 Use the principalComponents_scaled data instead of x in your TSNE embedding. What is the shape of x_embedded now? 

**Answer**: 

Visualise the results: 

In [None]:
#convert tSNE result into pandas dataframe
tsne_df = pd.DataFrame(data = x_embedded)
#append the categorical variables to the dataframe
tSNE_finalDf = pd.concat([tsne_df, aa_freq_loc_df[["location"]]], axis = 1)

#visualize the tSNE plot
fig, ax = plt.subplots(figsize=(10,10))
cmap = sns.color_palette("hls", 2) # notice that there are only 2 colours because we have 2 protein locations
p = sns.scatterplot(x=0, y=1,
                     hue="location",
                     palette=cmap, s=100,
                     data=tSNE_finalDf, 
                   alpha = 0.5)

p.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
p.set(xlabel="t-SNE 1", ylabel="t-SNE 2")
plt.tight_layout()



### Q47 Run the tSNE embedding again in the same way in the cells below and visualise the tSNE plot. Is the plot exactly the same as before? 

In [None]:
#convert tSNE result into pandas dataframe
tsne_df = pd.DataFrame(data = x_embedded)
#append the categorical variables to the dataframe
tSNE_finalDf = pd.concat([tsne_df, aa_freq_loc_df[["location"]]], axis = 1)

#visualize the tSNE plot
fig, ax = plt.subplots(figsize=(10,10))
cmap = sns.color_palette("hls", 2) # notice that there are only 2 colours because we have 2 protein locations
p = sns.scatterplot(x=0, y=1,
                     hue="location",
                     palette=cmap, s=100,
                     data=tSNE_finalDf, 
                   alpha = 0.5)

p.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
p.set(xlabel="t-SNE 1", ylabel="t-SNE 2")
plt.tight_layout()



**Answer** 

Because the tSNE algorithm starts by choosing a random point in the dataset, each time we start the algorithm its choice depends on the random seed of the computer. Therefore we need to set the random seed to a fixed value within the TSNE function, so that our results are reproducible. 
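
The role of the random seed can be explored on a small toy dataset, where tSNE runs in a few seconds. This is only a sketch: the data are randomly generated, the perplexity of 10 is an arbitrary choice for 60 points, and all other TSNE arguments are left at their defaults.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical toy data: 60 points, 5 observables
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(60, 5))

# Two runs with the same fixed random_state
emb_a = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(x_toy)
emb_b = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(x_toy)

# Compare the two embeddings coordinate by coordinate
print(emb_a.shape)
print(np.allclose(emb_a, emb_b))
```

Repeating the comparison with two different random_state values is a quick way to see how much the embedding depends on the seed.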

### Q48 Using the random_state argument in the TSNE function set the random seed to 42. Rerun the tSNE and generate the plot as before. Does the tSNE plot change if you rerun the cells again? 


**Answer**: 

## UMAP

Even though we can see more separation between groups with the tSNE analysis compared to PCA, there is still overlap. To circumvent this problem we would probably have to keep the original data (without PCA, unscaled) and increase the perplexity and the number of iterations, all of which are computationally expensive. 

UMAP seeks to improve on the limitations of tSNE by implementing a slightly different learning algorithm. It also doesn't necessarily start with random initialisation, so there is in theory no need to set the random seed as with tSNE. This being said, some of the more specialised arguments of the UMAP function might require random initialisation, so it is good practice to set the random_state to a fixed value regardless, so that you are certain you are always able to reproduce the same UMAP low-dimensional embedding. 

### Q49 Create a UMAP embedding of the x data using the code bellow as inspiration: 

`
x_embedded_UMAP = umap.UMAP(arguments).fit_transform(x)
`
Set the following parameters: 
* number of neighbours: 20
* number of final dimensions: 2 
* random state: 42
* metric: manhattan
* number of epochs: 700
* minimum distance: 0.01 



*Hint* Look at the documentation of UMAP using umap.UMAP? in the cell below and find the arguments you need to put into the UMAP function. 






In [None]:
umap.UMAP?

### Q50 Now plot the UMAP result using the code below. Do you see any clear separation of clusters with no overlap? With this in mind, do you think that the separation of proteins has improved compared to the use of PCA and tSNE? 

In [None]:
#convert UMAP result into pandas dataframe
umap_df = pd.DataFrame(data = x_UMAP_embedded)
#append the categorical variables to the dataframe
umap_finalDf = pd.concat([umap_df, aa_freq_loc_df[["location"]]], axis = 1)

#visualize the UMAP plot
fig, ax = plt.subplots(figsize=(10,10))
cmap = sns.color_palette("hls", 2)
p = sns.scatterplot(x=0, y=1,
                     hue="location",
                     palette=cmap, s=100,
                     data=umap_finalDf, 
                   alpha = 0.5)

p.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
p.set(xlabel="UMAP 1", ylabel="UMAP 2")
plt.tight_layout()
p.get_figure().savefig("UMAP.png")


**Answer** 

# What to do when you have missing data - Data imputing 

Sometimes you will have data with missing values, usually denoted as NA or NaN values in your dataframe. This is fairly common in biological data and can arise for a number of reasons, such as experimental error, mislabelling etc. We will now work through a dataframe which has such issues. 

Scenario: 

A researcher wants to work with the same dataset that we analysed before: aa_frequency_location.tsv. However, he lives in a very windy country and the internet cable to his building is very poorly attached. Every time the wind blows, his internet cuts out for a split second. He tried to download our dataset, but it ended up with some missing values because the connection kept dropping. By the time he had downloaded the file, the wind blew so hard that it took the cable completely off the building! He couldn't access the original data anymore, and because he has a tight deadline for the analysis he had to work with the dataset as it is. This dataset is therefore named: aa_frequency_location_incomplete.tsv

Here we will go through some of the techniques and best practices when working with missing data. 


#### Dealing with empty values in the dataset


Missing values are problematic because they make it difficult to analyse the data.
    
Here are some common alternatives to deal with missing values:

- Removing the row or column with the missing data
- Setting the value to a dummy value like 0
- Using an imputation method, for example setting the value to the feature or sample mean, or to the most frequent value in the column



In [None]:
aa_freq_loc_df_incomplete = pd.read_csv("aa_frequency_location_incomplete.tsv", sep = "\t")

aa_freq_loc_df_incomplete

### Q Our dataset consists of 3058 samples. We want to figure out how many of these samples have missing values, and where we see the highest number of missing values

Use the following commands:

`aa_freq_loc_df_incomplete.isna().sum()`

and

`aa_freq_loc_df_incomplete[aa_freq_loc_df_incomplete.isnull().any(axis=1)]`
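
To see what these two commands return, here is a small sketch on a made-up table. `toy_df` and its values are hypothetical and only stand in for `aa_freq_loc_df_incomplete`. Note that `isna()` and `isnull()` are two names for the same pandas method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for aa_freq_loc_df_incomplete
toy_df = pd.DataFrame({
    "A": [0.10, np.nan, 0.20, 0.05],
    "C": [0.00, 0.03, np.nan, 0.02],
    "location": ["Secretory", "Secretory", "Non-secretory", "Non-secretory"],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = toy_df[toy_df.isna().any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing), "of", len(toy_df), "rows have a missing value")
```

The first result answers "which columns are affected most", the second answers "how many samples are affected".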


### Removing missing values

One way to handle missing values, like NaN in our case, is to remove them. Let's try to filter out samples with at least 1 missing value and see what happens:

In [None]:
aa_freq_loc_df_incomplete.dropna()

### Q Has the data changed after using the df.dropna() function? If you see any changes, which kind of changes have occurred? 


**Answer**: 

### Q Use the following calculation to figure out how much the data has changed. Write whether this is a lot, and whether it would affect our downstream modelling too much

result = (x_original - x_new)/ x_original

final_result = result*100  #to get it in percentage
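
A minimal sketch of this calculation, assuming that `x_original` and `x_new` are the number of rows before and after `dropna()` (the text does not define them, so this is an interpretation). The table `toy_df` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; in the notebook this would be aa_freq_loc_df_incomplete
toy_df = pd.DataFrame({
    "A": [0.10, np.nan, 0.20, 0.05, 0.08],
    "C": [0.00, 0.03, 0.01, 0.02, 0.04],
})

# dropna() returns a new dataframe; toy_df itself is left unchanged
toy_df_complete = toy_df.dropna()

x_original = len(toy_df)           # number of rows before filtering
x_new = len(toy_df_complete)       # number of rows after filtering

result = (x_original - x_new) / x_original
final_result = result * 100        # percentage of rows removed
print(f"{final_result:.1f}% of the rows were removed")
```

Because `dropna()` does not modify the dataframe in place, the filtered result has to be stored in a new variable before the two row counts can be compared.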

### Setting the value to a dummy value

We can for example also set the missing values to a dummy value like 0. In some cases this will suffice, however, in our case this might have a negative impact since we need the information about the amino acid frequencies for downstream analysis.

Here we show how to do it:

In [None]:
aa_freq_loc_df_incomplete.fillna(0)

### Imputing

In general df.dropna() removes data, and that is not always ideal. If we don't have a lot of data to begin with, or if only a tiny part of the dataset is missing, we could instead impute the data.
When imputing data, we have to take the dataset into consideration. In our case, we have both secretory and non-secretory sequences in our dataset. Furthermore, our theory is that the amino acid distribution will be different depending on whether a protein is secretory or non-secretory, so we should take this into account when we use an imputing method.


First we will go through how to use the mean:

In [None]:
# we use the mean values of the non-secretory proteins to fill in the missing values for non-secretory proteins
non_sec_rows = aa_freq_loc_df_incomplete.loc[aa_freq_loc_df_incomplete['location'] == "Non-secretory"]
# numeric_only=True skips the text column "location", which has no mean
aa_freq_imputed_mean_non_sec = non_sec_rows.fillna(non_sec_rows.mean(numeric_only=True))


### Q We have shown how to do it for non-secretory proteins, now do it for secretory proteins. Do we still have missing values when we use the .isna().sum() function?

In [None]:
# we do the mean values for the secretory proteins to fill in the missing values for secretory proteins

aa_freq_imputed_mean_sec = ?

In [None]:
# We concatenate the two dataframes together
aa_freq_imputed_mean = pd.concat([aa_freq_imputed_mean_non_sec, aa_freq_imputed_mean_sec])
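
The split, fill and concatenate steps above can also be written in a single pass with `groupby`. This is an alternative sketch and not the approach used in the cells above. The table `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for aa_freq_loc_df_incomplete
toy_df = pd.DataFrame({
    "A": [0.10, np.nan, 0.30, 0.02, np.nan, 0.06],
    "location": ["Secretory", "Secretory", "Secretory",
                 "Non-secretory", "Non-secretory", "Non-secretory"],
})

# Mean of each numeric column within each location class,
# repeated for every row so it lines up with the original table
numeric_cols = list(toy_df.select_dtypes("number").columns)
class_means = toy_df.groupby("location")[numeric_cols].transform("mean")

# Fill each missing value with the mean of its own class
toy_imputed = toy_df.copy()
toy_imputed[numeric_cols] = toy_df[numeric_cols].fillna(class_means)
print(toy_imputed)
```

One practical difference is that this version keeps the rows in their original order, whereas concatenating two class-specific dataframes groups the rows by class.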

#### Here we will show how to use the most frequent value in a column to fill the missing values

In [None]:
# we do the most frequent values 
#for the non-secretory proteins to fill in the missing values for non-secretory proteins
aa_freq_imputed_most_frequent_non_sec = aa_freq_loc_df_incomplete.loc[aa_freq_loc_df_incomplete['location'] == "Non-secretory"].apply(lambda x:x.fillna(x.value_counts().index[0]))

In [None]:
# we use the most frequent values 
#of the secretory proteins to fill in the missing values for secretory proteins
aa_freq_imputed_most_frequent_sec = aa_freq_loc_df_incomplete.loc[aa_freq_loc_df_incomplete['location'] == "Secretory"].apply(lambda x:x.fillna(x.value_counts().index[0]))

In [None]:
# We concatenate the two dataframes together
aa_freq_imputed_most_frequent = pd.concat([aa_freq_imputed_most_frequent_non_sec, aa_freq_imputed_most_frequent_sec])
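
To make the `lambda` used above easier to follow, here is what it does to a single column. The series `toy_col` is a made-up example, not data from the workshop files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column of amino acid frequencies with one missing value
toy_col = pd.Series([0.00, 0.00, 0.03, np.nan, 0.07])

# value_counts() ignores NaN by default and sorts by count, highest first,
# so the first index entry is the most frequent observed value
most_frequent = toy_col.value_counts().index[0]

# Replace the missing entry with that value
toy_filled = toy_col.fillna(most_frequent)
print(toy_filled.tolist())
```

For continuous features such as frequencies, many values may occur only once, in which case the "most frequent" value is not very informative and the mean may be a more stable choice.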

In [None]:
aa_freq_loc_df_incomplete.loc[2992]

In [None]:
aa_freq_imputed_most_frequent.loc[2992]

# Exercise part
# Tissue expression dataset 

For this part of the practical session we will use the tissue expression dataset located in the tissue_expression.tsv file. 

## Q1 Load the data  (1 point)
Load the data as before into the tissue_expression_df. 

### Q2 What is the shape of the data? (1 point)


**Answer**: 

### Q3 How many different tissue types do we have present in the data and what are they? (1 point)

*Hint* Use the unique function

**Answer**: 

### Q4 How many gene columns do we have in the dataset? (1 point)


**Answer**: 

## Data visualisation 

### Q5 Which three of the analyses that we did today do you think would be suitable for analysing this dataset? (3 points)

*Hint*: Think about how many observables we have in the data. 

**Answer**: 


### Q6 What is the difference between PCA and tSNE? (2 points)


**Answer**: 

### Q7 What is the difference between tSNE and UMAP? (2 points)

**Answer**: 

### Q8 Separate the data into observables and categorical variables. Store the observables in X_tissue and the categorical variables in Y_tissue. What is the shape of X_tissue? (2 points)

*Hint* Use the code from the analysis of the previous dataset with amino acid frequencies as inspiration. 


**Answer**:

## Q9 Do the PCA analysis on the X_tissue data and visualise the first two principal components. (7 points)

*Hint* Use the code from the previous dataset for inspiration. 
Steps: 

- scale the data: Save the scaled data in X_tissue_scaled. 
- perform PCA with 20 components 
- convert the PCA result into a dataframe and append the categorical data.
- plot the result. Remember to change the number of colours in the colour palette to the number of tissue types we have in the dataset. 

In [None]:
# Scale the data 


In [None]:
# Perform PCA 


# Convert result to dataframe 




### How much variance is explained by PC1 and PC2? How well is the data separated into groups by tissue? (2 points)


**Answer**: 

## Q10 Perform tSNE analysis on the data. (7 points)

Steps: 
1. Perform PCA on the data again, this time choose 100 components

    `pca_scaled = decomposition.PCA(n_components=100) 
     principalComponents_tissue_scaled = pca_scaled.fit_transform(X_tissue_scaled)`
2. Embed the data in tSNE using these parameters: 
    * number of components 2, number of iterations 2000, perplexity 50, random state 42. 
3. Convert the tSNE result into a dataframe. 
4. Visualise the results
    
    
*Hint* Use the code from the previous dataset as inspiration. 

In [None]:
#1. Step1: 



In [None]:
#2. Step2:


In [None]:
#3. 
#convert tSNE result into pandas dataframe


#append the categorical variables to the dataframe




In [None]:
#4. 

#visualize the tSNE plot


### Q11 Can you see separation into groups now and how many roughly are there? Are they the groups coloured by tissue of origin? (2 points)

**Answer**: 

## Q12 Perform UMAP analysis on the data. (7 points)

Steps: 
1. Embed the X_tissue data in UMAP. Use the following parameters:
    * number of neighbours 20
    * number of components 2
    * random state 42
    * metric manhattan
    * number of epochs 700
    * minimum distance 0.1
    
2. Convert the UMAP result to a dataframe and append the tissue labels. 
3. Visualise the UMAP plot

*Hint: use the UMAP analysis on the amino acid frequency dataset as inspiration*

In [None]:
# 1. Embed X_tissue into UMAP


In [None]:
#convert UMAP result into pandas dataframe


#append the categorical variables to the dataframe




In [None]:
#visualize the UMAP plot


### Q13 Can you see separation into groups now and how many roughly are there? Are they the groups coloured by tissue of origin? (2 points)

**Answer** 

### Q14 Do you think the separation is better when using UMAP compared to tSNE? Why do you think that is? (2 points)

**Answer**: 