# Deep Learning

# Utilizing Apache Spark and PySpark for Deep Learning in Neuroimaging Studies

## Why Spark?

- **Efficient Large Data Handling:**
   - Spark is designed for fast, in-memory computation on datasets too large for a single machine, which suits the size of neuroimaging data.

- **PySpark - Python Integration:**
   - PySpark provides a Python interface to Spark, making it easier to implement deep learning algorithms using Python libraries.

- **Distributed Computing:**
   - Spark distributes work across cores or cluster nodes, so the parallelizable steps of our deep learning workflow run faster.

- **MLlib for Machine Learning:**
   - Spark's MLlib offers machine learning algorithms built for distributed data, which we can use alongside our deep learning models.

## Advantages:

- Spark handles our complex and large participant data efficiently.
- PySpark allows seamless integration with Python's deep learning libraries.
- Distributed processing capability accelerates our deep learning computations.
- Scalability ensures that our system can grow with our data needs.

In summary, Apache Spark and PySpark provide a robust, scalable, and efficient platform for deep learning in our neuroimaging study, enabling us to process large datasets with speed and ease.


## 1 - Setting up the Spark session

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("PySpark_VSCode").getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/31 14:47:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/31 14:47:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
import pandas as pd
# Import Keras only through tensorflow.keras; mixing it with the standalone
# keras package can lead to incompatible layer and model objects
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input



In [None]:
combined_df = pd.read_csv('/home/skander/combined.csv')

In [None]:

# Set 'participant_id' as index
combined_df_with_index = combined_df.set_index('participant_id')

# Extract features for pattern discovery
# Every remaining column is a feature ('participant_id' is now the index)
features = combined_df_with_index

# The features DataFrame is now ready for unsupervised learning techniques
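
The autoencoder in the next section ends in a sigmoid layer and is trained with binary cross-entropy, both of which assume inputs in the range [0, 1]. The notebook does not show a scaling step, so the following is only a minimal sketch of min-max scaling, assuming every feature column is numeric. The helper name `min_max_scale` and the toy DataFrame are hypothetical stand-ins for illustration, not part of the original pipeline.

```python
import pandas as pd


def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each numeric column to [0, 1] (hypothetical helper)."""
    col_min = df.min()
    col_range = df.max() - col_min
    # Constant columns have zero range; divide by 1 instead so they map to 0
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range


# Toy stand-in for the real features DataFrame (values are made up)
toy_features = pd.DataFrame(
    {"region_a": [1.0, 2.0, 3.0], "region_b": [10.0, 10.0, 10.0]},
    index=pd.Index(["p1", "p2", "p3"], name="participant_id"),
)
scaled = min_max_scale(toy_features)
```

On the real data, the equivalent call would be `features = min_max_scale(features)` before training.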



## 2 - Autoencoder network

In [None]:
# Set the input dimension to the number of feature columns in the dataset
input_dim = features.shape[1]  # shape is (n_participants, n_features)

# Define the size of the encoded representations
encoding_dim = 32  # You can experiment with this value

# Input layer
input_layer = Input(shape=(input_dim,))

# Encoded layer - where the data is compressed
encoded = Dense(encoding_dim, activation='relu')(input_layer)

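# Decoded layer - reconstruction of the input
# (sigmoid outputs lie in [0, 1], so the features must be scaled to that range)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

# Autoencoder model
autoencoder = Model(input_layer, decoded)

# Compile the model (binary cross-entropy also assumes targets in [0, 1])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')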

# Train the autoencoder
autoencoder.fit(features, features,  # Input and output are the same for an autoencoder
                epochs=50,          # Number of epochs
                batch_size=256,     # Batch size
                shuffle=True)       # Shuffle the data
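
To make the architecture concrete, here is a minimal NumPy sketch of what a single-hidden-layer autoencoder like the one above computes in its forward pass, and how a per-participant reconstruction error could be derived from it. The weights are random placeholders rather than trained values, and the function names, toy dimensions, and use of mean squared error are assumptions made for illustration only.

```python
import numpy as np


def relu(x):
    """Element-wise max(x, 0), as used in the encoding layer."""
    return np.maximum(x, 0.0)


def sigmoid(x):
    """Element-wise logistic function, as used in the decoding layer."""
    return 1.0 / (1.0 + np.exp(-x))


def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode with a ReLU dense layer, then decode with a sigmoid dense layer."""
    encoded = relu(x @ w_enc + b_enc)
    decoded = sigmoid(encoded @ w_dec + b_dec)
    return encoded, decoded


rng = np.random.default_rng(0)
n_participants, input_dim, encoding_dim = 5, 8, 3  # toy sizes, not the real data

# Made-up inputs already in [0, 1), plus untrained placeholder weights
x = rng.random((n_participants, input_dim))
w_enc = rng.normal(scale=0.1, size=(input_dim, encoding_dim))
b_enc = np.zeros(encoding_dim)
w_dec = rng.normal(scale=0.1, size=(encoding_dim, input_dim))
b_dec = np.zeros(input_dim)

encoded, decoded = autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec)

# Mean squared reconstruction error, one value per participant (row)
reconstruction_error = np.mean((x - decoded) ** 2, axis=1)
```

With the trained Keras model, the analogous reconstruction would come from `autoencoder.predict(features)`, and rows with unusually high error are the ones the model reconstructs poorly.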