## Scientific Question: Which kinases that phosphorylate tau protein are more highly expressed in Alzheimer's disease?

Alzheimer's disease (AD) is a neurodegenerative disease characterized by the accumulation of amyloid-beta plaques between neurons and of neurofibrillary tangles caused by abnormal hyperphosphorylation of the tau protein. These lead to synaptic loss and neuronal death, which mainly affect regions of the brain associated with memory.

Studies have linked tau hyperphosphorylation to kinase activity, and the sites at which tau is phosphorylated are equally important. Serine 262 has been found to be hyperphosphorylated early in AD progression, before neurofibrillary tangles form (Azorsa, D.O., Robeson, R.H., Frost, D. et al.). Inhibiting the kinases that phosphorylate this site could therefore inform the development of new therapies for Alzheimer's disease patients.

Nevertheless, 352 human kinases are reported to be able to phosphorylate tau (Cavallini A, Brewerton S, Bell A, et al), and which of these kinases are more highly expressed in the phosphorylation of tau's epitopes is not well understood. Several studies have found that translation-associated kinases such as GSK3α, GSK3β, and MAPK13 were more highly expressed in AD and had higher activity in phosphorylating all four of the tau epitopes examined (Cavallini A, Brewerton S, Bell A, et al). To obtain data on kinase phosphorylation activity, the Qiagen kinome siRNA library screen was used as the dataset.

### Scientific Hypothesis: If tau protein is hyperphosphorylated in Alzheimer's disease, then translation-associated kinases will be more highly expressed.

There are 352 human kinases known to phosphorylate the tau epitopes that lead to neurofibrillary tangles, and studies have found that translation-associated kinases such as GSK3α, GSK3β, and MAPK13 are direct contributors to tau phosphorylation (Cavallini A, Brewerton S, Bell A, et al). The dataset comes from a screen of the Qiagen kinome siRNA library, which covers 572 known and predicted kinases with two siRNAs per kinase. The screen uses standard deviations and two-tailed t-tests to determine the effect of each siRNA on the 12E8 tau epitope relative to non-silencing siRNA. The data are organized by gene name (kinase), number of cells transfected, and the average fold effect, standard deviation, and p-value for 12E8 tau, total tau, and the ratio of 12E8 tau to total tau, where 12E8 refers to the tau epitope serine 262/serine 356 (Azorsa, D.O., Robeson, R.H., Frost, D. et al., 2010).
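The two-tailed t-test mentioned above can be sketched as follows. This is only my reading of how the p-values were derived: I assume a one-sample test of three replicate fold changes against a baseline of 1.0 (the non-silencing control after normalization), which the paper does not spell out in these terms. The function name `two_tailed_p_triplicate` is hypothetical, and the closed-form tail probability is valid only for three replicates (2 degrees of freedom).

```python
import numpy as np

def two_tailed_p_triplicate(fold_changes, baseline=1.0):
    """One-sample, two-tailed t-test for exactly three replicates.

    Assumption: the screen compares replicate fold changes against a
    baseline of 1.0. With n = 3 there are 2 degrees of freedom, and the
    Student t tail has the closed form p = 1 - |t| / sqrt(t^2 + 2).
    """
    x = np.asarray(fold_changes, dtype=float)
    if x.size != 3:
        raise ValueError("closed form is only valid for 3 replicates")
    sem = x.std(ddof=1) / np.sqrt(x.size)  # standard error of the mean
    t = (x.mean() - baseline) / sem
    p = 1.0 - abs(t) / np.sqrt(t**2 + 2.0)
    return float(t), float(p)

# Triplicate pTau fold changes for the well "AAK1 s1",
# copied from the table output later in this notebook
t_stat, p_value = two_tailed_p_triplicate([1.06182, 1.20902, 1.28131])
print(round(t_stat, 2), round(p_value, 3))
```

For this well the sketch gives a p-value of about 0.104, close to the pvalue of 0.10422 listed for AAK1 s1 in the table output, which suggests the assumption is at least plausible. For other replicate counts, a general-purpose routine such as `scipy.stats.ttest_1samp` would be the more usual choice.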

To answer my scientific question and test my hypothesis, I first searched for AD tau phosphorylation research papers that made their data available. I used the dataset from the paper "High-content siRNA screening of the Kinome identifies Kinases involved in Alzheimer's disease related Tau hyperphosphorylation". From this kinome siRNA data, I planned to extract the kinases of interest (GSK3α, GSK3β, and MAPK13) and compare them with the rest of the kinases in the dataset to identify whether these translation-associated kinases are in fact more strongly associated with increases or decreases in pS262 tau phosphorylation.

#### Load Packages
- **Pandas** is a Python package for data analysis and for preparing data for machine learning tasks.
- **NumPy** is a Python library for multi-dimensional arrays and matrices.
- **Matplotlib** is a Python plotting library with functions to create plots and figures, add labels, etc.
- **Seaborn** is a library used to create statistical graphics; it builds on matplotlib and works closely with pandas data structures.


In [1]:
#load packages

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns




### Load in data and Bioinformatics Analysis 

The CSV file was obtained from the research article "High-content siRNA screening of the Kinome identifies Kinases involved in Alzheimer's disease related Tau hyperphosphorylation" and contains siRNA screening results for 572 human kinases. The dataset contains the gene name, average cell count, average fold effect, standard deviation, p-values, total tau, and ratio of 12E8 tau/total tau.

In the code below, I cleaned my data so that only the relevant information was kept. The dataset contains two siRNAs per kinase, each screened in triplicate, for a total of 3,432 target siRNA screens, so I limited myself to single siRNA screens. I then attempted to select my target kinases GSK3α, GSK3β, and MAPK13 from the rest of the kinases in order to observe how they compare with the 10 candidate kinases mentioned in the article (Azorsa, D.O., Robeson, R.H., Frost, D. et al, 2010).

In [4]:
#importing data
import pandas as pd 
df = pd.read_csv('siRNA_project3_data.csv')

#rename row names
df = df.set_index('Well Contents')

df

Well Contents,Average Cell Count,pTau,pTau2,pTau3,Total Tau,Total Tau2,Total Tau3,ptau/total Tau,ptau/total Tau2,ptau/total tau3,...,pvalue,Z score,Average Total Tau,St Dev2,pvalue2,Z score3,Average pTau/total Tau,St Dev3,pvalue3,Z score2
AAK1 s1,155,1.06182,1.20902,1.28131,1.04109,0.85652,0.99569,1.01992,1.41154,1.28686,...,0.10422,0.31564,0.96443,0.09617,0.58740,0.56083,1.23944,0.20007,0.17394,-0.27517
AAK1 s2,53,1.09388,1.24184,0.84360,1.27356,0.94028,1.16454,0.85892,1.32070,0.72441,...,0.65823,-1.00221,1.12613,0.16992,0.32733,1.84107,0.96801,0.31276,0.87569,-1.76811
AATK s1,109,1.01821,1.18966,1.33320,0.89499,0.73863,0.91132,1.13768,1.61064,1.46293,...,0.18613,-0.66081,0.84831,0.09534,0.11032,-0.35857,1.40375,0.24197,0.10177,-0.22956
AATK s2,15,1.51576,1.02901,1.44839,0.66848,1.10942,0.85471,2.26748,0.92751,1.69459,...,0.16174,0.41577,0.87754,0.22136,0.43907,-0.12717,1.62986,0.67232,0.24613,0.82951
ABL1 s1,72,0.96955,0.98117,0.82610,0.85111,1.09443,0.70459,1.13917,0.89652,1.17246,...,0.27427,-1.26035,0.88337,0.19691,0.41283,-0.08096,1.06938,0.15063,0.50866,-1.14071
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SMG1 s1,265,0.80189,0.73718,1.35025,0.77440,0.72418,0.71088,1.03550,1.01795,1.89939,...,0.86704,-0.53434,0.73649,0.03350,0.00534,-1.24825,1.31761,0.50391,0.38894,0.25587
SMG1 s2,340,0.92254,0.75075,1.32081,0.90673,0.77885,0.73902,1.01744,0.96392,1.78724,...,0.99178,-1.43402,0.80820,0.08762,0.06306,-0.67615,1.25620,0.46067,0.43705,-0.66090
SNARK s1,465,1.05773,1.27677,0.90792,0.97046,0.70574,0.45806,1.08992,1.80911,1.98209,...,0.52930,-0.71408,0.71142,0.25625,0.19040,-1.44241,1.62704,0.47313,0.14861,0.97274
SNARK s2,368,1.17672,0.96725,0.00000,0.63795,0.77056,0.00000,1.84453,1.25526,#DIV/0!,...,0.51356,-2.28756,0.70425,0.09377,0.15546,-1.49919,1.54989,0.41668,#DIV/0!,0.95325
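Limiting the data to a single siRNA per kinase could be sketched as below. This is a minimal example rather than the paper's procedure: it assumes every row label follows the "GENE s1" / "GENE s2" pattern visible in the output above, and `df_demo`, `gene`, and `sirna` are names I made up for illustration. The five rows are copied from that output; the real analysis would apply the same steps to the full `df`.

```python
import pandas as pd

# A few rows copied from the output above (cell counts only)
df_demo = pd.DataFrame(
    {"Average Cell Count": [155, 53, 109, 15, 72]},
    index=pd.Index(
        ["AAK1 s1", "AAK1 s2", "AATK s1", "AATK s2", "ABL1 s1"],
        name="Well Contents",
    ),
)

# Split labels such as "AAK1 s1" into a gene symbol and an siRNA tag.
# Splitting from the right keeps a gene name intact even if it has spaces.
parts = df_demo.index.to_series().str.rsplit(" ", n=1, expand=True)
df_demo["gene"] = parts[0]
df_demo["sirna"] = parts[1]

# Keep only the first siRNA for each kinase
single = df_demo[df_demo["sirna"] == "s1"]
print(single)
```

Keeping the gene symbol in its own column also makes it easier to look up a kinase later without having to know which siRNA tag it carries.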


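**NOTE**
In the code below, I attempted to select the data I wanted to work with by picking out my kinases of interest using loc, but the code raised a KeyError for 'Well Contents'. The cause is that set_index('Well Contents') moves that column into the index, so df['Well Contents'] no longer exists as a column and the rows have to be selected through the index instead. The loc call also passes the three labels as separate arguments rather than as a single list.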

In [1]:
#selecting translation-associated kinases
import pandas as pd 

df = pd.read_csv('siRNA_project3_data.csv')

#rename row names
df = df.set_index('Well Contents')


df.loc[df['Well Contents'] == 'GSK3A s1', 'GSK3B s1', 'MAPK13']

df

KeyError: 'Well Contents'
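One way the selection could work is sketched below, filtering on the index rather than on a column. The rows and Z scores are copied from the table output earlier in the notebook. The labels 'GSK3A s1', 'GSK3B s1', and 'MAPK13 s1' are assumptions: I have not confirmed that these kinases appear under exactly these names in the CSV. 'AATK s1' is included only as a stand-in so that the demo returns a row.

```python
import pandas as pd

# Rows copied from the earlier table output (Z score column only)
df_demo = pd.DataFrame(
    {"Z score": [0.31564, -1.00221, -0.66081, 0.41577, -1.26035]},
    index=pd.Index(
        ["AAK1 s1", "AAK1 s2", "AATK s1", "AATK s2", "ABL1 s1"],
        name="Well Contents",
    ),
)

# Hypothetical target labels; 'AATK s1' is a stand-in present in the demo rows
targets = ["GSK3A s1", "GSK3B s1", "MAPK13 s1", "AATK s1"]

# After set_index, 'Well Contents' is the index, so filter on df.index.
# isin() does not raise an error when some labels are absent.
is_target = df_demo.index.isin(targets)
selected = df_demo.loc[is_target]   # kinases of interest
others = df_demo.loc[~is_target]    # everything else, for comparison

# Report labels that were requested but not found
missing = [t for t in targets if t not in df_demo.index]
print(selected)
print("Not found:", missing)
```

Checking `missing` first is a quick way to find out whether the kinases of interest are in the dataset at all, or whether they are listed under different gene symbols.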

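**NOTE**

For the code below, I tried importing the candidate kinase data and setting the row names just as I did for the siRNA data, but pandas did not recognize 'Gene Name' as a column. As the DataFrame printed in the next section shows, the real column titles ('Gene Name', 'Cell Count', and so on) were read in as the first data row, and the columns themselves were named 'Unnamed: 0', 'Unnamed: 1', etc.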

In [22]:
#importing data of candidate kinases
import pandas as pd 
df = pd.read_csv('Candidate_Kinases.csv')

# rename row names
df = df.set_index('Gene Name')

df





KeyError: "None of ['Gene Name'] are in the columns"
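A possible fix is to tell pandas which line holds the column titles. The sketch below uses an in-memory stand-in for Candidate_Kinases.csv with an empty first row followed by the real header row. That layout is inferred from the DataFrame printed in the next section, not checked against the file itself. The new column names, and the assignment of the three Average/St Dev/p value groups to 12E8 tau, total tau, and their ratio, are my assumptions based on the dataset description above.

```python
import io
import pandas as pd

# Stand-in for Candidate_Kinases.csv (assumed layout: empty row, then header).
# The three data rows are copied from the printed DataFrame.
raw = """,,,,,,,,,,
Gene Name,Cell Count,Average,St Dev,p value,Average,St Dev,p value,Average,St Dev,p value
EIF2AK2,217,0.606,0.075,0.012,0.647,0.102,0.026,0.962,0.237,0.806
DCK,363,1.068,0.015,0.017,0.701,0.051,0.01,1.53,0.107,0.013
STK19,347,1.334,0.103,0.03,0.654,0.115,0.035,2.065,0.231,0.015
"""

# header=1 takes the column names from the second line and drops the first
cand = pd.read_csv(io.StringIO(raw), header=1)

# pandas renames repeated titles to "Average", "Average.1", "Average.2", ...
original_cols = list(cand.columns)

# Hypothetical descriptive names (group order is an assumption)
cand.columns = [
    "Gene Name", "Cell Count",
    "ptau_avg", "ptau_sd", "ptau_p",
    "total_avg", "total_sd", "total_p",
    "ratio_avg", "ratio_sd", "ratio_p",
]
cand = cand.set_index("Gene Name")
print(cand)
```

With the header read correctly, the numeric columns are parsed as numbers rather than text, which matters for any later plotting or statistics.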

### Create Visualization 

Here I wanted to make a bar plot using seaborn, which would have appeared below. The x and y axes would be labeled "Kinase" and "Average Tau expression". Specifically, I would have liked to plot the target kinases GSK3α, GSK3β, and MAPK13 together so that they could easily be compared with the candidate kinases the research article proposed.

In [67]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Candidate_Kinases.csv')
df






Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,Gene Name,Cell Count,Average,St Dev,p value,Average,St Dev,p value,Average,St Dev,p value
1,EIF2AK2,217,0.606,0.075,0.012,0.647,0.102,0.026,0.962,0.237,0.806
2,EIF2AK2,318,0.719,0.069,0.018,0.706,0.082,0.025,1.038,0.257,0.822
3,CDKL1,398,0.852,0.032,0.015,0.761,0.085,0.04,1.132,0.157,0.281
4,DCK,363,1.068,0.015,0.017,0.701,0.051,0.01,1.53,0.107,0.013
5,DGKQ,406,1.119,0.042,0.039,0.805,0.059,0.029,1.394,0.111,0.026
6,PFKFB3,458,1.204,0.046,0.016,0.813,0.029,0.008,1.483,0.088,0.011
7,ERK8,466,1.234,0.063,0.024,0.807,0.009,0.001,1.528,0.084,0.008
8,STK19,347,1.334,0.103,0.03,0.654,0.115,0.035,2.065,0.231,0.015
9,PRKG2,413,1.384,0.152,0.048,0.721,0.107,0.046,1.959,0.412,0.056


In [None]:
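#This was my attempt at plotting the candidate kinases
#'Unnamed: 0' = kinases; 'Unnamed: 2', 'Unnamed: 5', 'Unnamed: 8' = the three "Average" columns
#Row 0 of df holds the real column titles, so drop it and convert the values to numbers
plot_df = df.iloc[1:].copy()
value_cols = ['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 8']
plot_df[value_cols] = plot_df[value_cols].astype(float)

#One panel per "Average" column, side by side
fig, axes = plt.subplots(1, 3, figsize = (20,6))

for ax, col in zip(axes, value_cols):
    #barplot draws the values themselves; countplot would only count rows
    #kinases listed twice (two siRNAs) are averaged into one bar
    sns.barplot(x = 'Unnamed: 0', y = col, data = plot_df, ax = ax)
    ax.set_xlabel('Kinase')
    ax.set_ylabel('Average (' + col + ')')
    ax.tick_params(axis = 'x', rotation = 45)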

### Analysis of results 


Here I was hoping to have two bar plots of the average phosphorylation of tau: one showing my kinases of interest, GSK3α, GSK3β, and MAPK13 (translation-associated kinases), and the other showing the 10 candidate kinases proposed by the research paper. With these plots, I expected to answer my scientific question of which kinases are more highly expressed in the phosphorylation of tau protein in Alzheimer's disease. I hypothesized that translation-associated kinases would be more highly expressed and would hyperphosphorylate tau protein. I was hoping to find a correlation between my selected kinases and the candidates, some of which are also linked to translation, which would have supported my hypothesis.