<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">ALSgeneScanner: a pipeline for the analysis and interpretation of DNA sequencing data of ALS patients</h1>

Authors: Alfredo Iacoangeli, Ahmad Al Khleifat, William Sproviero, Aleksey Shatunov, Ashley R. Jones, Sarah Opie-Martin - https://doi.org/10.1080/21678421.2018.1562553
Pages 207-215 | Received 12 Sep 2018, Accepted 27 Nov 2018, Published online: 05 Mar 2019

"Genetic factors are an important cause of ALS, with variants in more than 25 genes having strong evidence, and weaker evidence available for variants in more than 120 genes. With the increasing availability of next-generation sequencing data, non-specialists, including health care professionals and patients, are obtaining their genomic information without a corresponding ability to analyze and interpret it. Furthermore, the relevance of novel or existing variants in ALS genes is not always apparent. Here the authors present ALSgeneScanner, a tool that is easy to install and use, able to provide an automatic, detailed, annotated report, on a list of ALS genes from whole-genome sequencing (WGS) data in a few hours and whole exome sequence data in about 1 h on a readily available mid-range computer. This will be of value to non-specialists and aid in the interpretation of the relevance of novel and existing variants identified in DNA sequencing data."

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/end-als/end-als/clinical-data/filtered-metadata/metadata/aals_released_files.csv', encoding='ISO-8859-2')
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

**<span style="color:#DC143C;">whole-genome sequencing (WGS)</span>**

In [None]:
df["experiment"].value_counts()

Computational performance of the pipeline to process whole-genome sequencing and whole exome sequencing data from fastq file to the generation of the final result report.

Authors: Alfredo Iacoangeli, Ahmad Al Khleifat, William Sproviero, Aleksey Shatunov, Ashley R. Jones, Sarah Opie-Martin - https://doi.org/10.1080/21678421.2018.1562553

![](https://www.tandfonline.com/na101/home/literatum/publisher/tandf/journals/content/iafd20/2019/iafd20.v020.i03-04/21678421.2018.1562553/20200214/images/medium/iafd_a_1562553_f0003_c.jpg)

Source: https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

In [None]:
df["data_level"].value_counts()

In [None]:
df["data_level"].value_counts().plot.bar(color=['blue', 'red','lime','purple'], title='ALS Released Files Data Level');

In [None]:
df["differentiation"].value_counts()

In [None]:
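# Correlate numeric columns only; recent pandas versions raise an error
# when corr() is called on a frame that still contains text columns.
corr_matrix = df.select_dtypes(include='number').corr()
corr_matrix['data_level'].sort_values().plot(kind="bar")
print(corr_matrix['data_level'].sort_values())
plt.show()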

In [None]:
sns.clustermap(corr_matrix, annot=True, fmt=".3f", figsize=(10,10))
# clustermap draws several axes, so put the title on the figure itself
plt.suptitle("Correlation Between Features")
plt.show()

In [None]:
# Let's first list the numerical features that contain NaN values
numerical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes!='O']
numerical_nan

In [None]:
# categorical features with missing values
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

In [None]:
# replacing missing values in categorical features
for feature in categorical_nan:
    df[feature] = df[feature].fillna('None')
    
df[categorical_nan].isna().sum()

In [None]:
df = pd.get_dummies(df)

In [None]:
Y = df['data_level'].values
X = df.drop(labels=['data_level'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=58)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)

# The model's input_dim must equal the number of feature columns, x_train.shape[1] (5278 in the output above).

In [None]:
#Code by Muhammed Halil Akkaynak https://www.kaggle.com/halilakkaynak/tps-apr-eda-and-ann/notebook
from keras.layers import Dense
from keras.models import Sequential

def create_ann_model():
    model = Sequential()
    model.add(Dense(8, activation="relu", input_dim=x_train.shape[1]))  # one input per feature column (5278 in the original run)
    model.add(Dense(4, activation="relu"))
    model.add(Dense(2, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", metrics=['accuracy'])
    return model

model = create_ann_model()
model.summary()

In [None]:
model.fit(x_train, y_train, epochs=5, batch_size=32) # note: use one-hot encoding (OHE)
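
The `ohe` note on the `model.fit` call appears to stand for one-hot encoding. The cell below is a minimal sketch of that idea, assuming the target is a vector of integer class labels. The helper `one_hot` is hypothetical (it is not part of this notebook or of Keras) and uses only NumPy; the labels are made up for illustration. A target encoded this way would normally be paired with a softmax output layer and a categorical cross-entropy loss rather than the single sigmoid unit used here.

```python
import numpy as np

def one_hot(labels):
    """Return (classes, matrix); row i has a single 1 in the column of labels[i]."""
    labels = np.asarray(labels).ravel()
    # classes: sorted unique labels; idx: position of each label within classes
    classes, idx = np.unique(labels, return_inverse=True)
    # Picking rows of the identity matrix yields the one-hot rows
    return classes, np.eye(len(classes), dtype=np.float32)[idx]

# Made-up labels for illustration only, not values from the dataset
classes, encoded = one_hot([1, 3, 1, 4])
print(classes)
print(encoded)
```

Each row of `encoded` sums to 1, and `classes` records which label each column stands for.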

In [None]:
#Code by Muhammed Halil Akkaynak https://www.kaggle.com/halilakkaynak/tps-apr-eda-and-ann/notebook

from sklearn.metrics import confusion_matrix, accuracy_score

# Threshold the sigmoid outputs at 0.5 to get hard 0/1 predictions
pred = (model.predict(x_val) > 0.5).astype(int).ravel()
y_true = np.asarray(y_val).astype(int)
cm = confusion_matrix(y_true, pred)
score = accuracy_score(y_true, pred)
print("Score: ", score)
fig, ax = plt.subplots(figsize=(8,8))
# Cell values are counts, so format them as integers
sns.heatmap(cm, annot=True, linewidths=0.01, cmap="Blues", linecolor="green", fmt="d", ax=ax)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
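
For reference, the two metrics used above come down to simple counting. The following is a minimal sketch in plain Python, assuming binary 0/1 labels. The helper names `confusion_counts` and `accuracy` are hypothetical and the labels are made up. The layout follows the scikit-learn convention of true labels on the rows and predicted labels on the columns.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count.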

In [None]:
# There is no submission for this dataset, so the code below is kept for later use.

#pred = model.predict(df)
#submission['data_level'] = (pred[:, 0] > 0.5).astype(int)
#submission.to_csv('submission.csv', index=False)

In [None]:
#Code by Olga Belitskaya https://www.kaggle.com/olgabelitskaya/sequential-data/comments
from IPython.display import display,HTML
c1,c2,f1,f2,fs1,fs2=\
'#eb3434','#eb3446','Akronim','Smokum',30,15
def dhtml(string,fontcolor=c1,font=f1,fontsize=fs1):
    display(HTML("""<style>
    @import 'https://fonts.googleapis.com/css?family="""\
    +font+"""&effect=3d-float';</style>
    <h1 class='font-effect-3d-float' style='font-family:"""+\
    font+"""; color:"""+fontcolor+"""; font-size:"""+\
    str(fontsize)+"""px;'>%s</h1>"""%string))
    
    
dhtml('Thank you Muhammed Halil Akkaynak @halilakkaynak for the code' )