**DeepCryoPicker**

Description:

Deep learning methods for Cryo-EM data analysis.

Cryo-electron microscopy (Cryo-EM) is widely used to determine the three-dimensional (3D) structures of macromolecules. Particle picking from 2D micrographs remains a challenging early step in the Cryo-EM pipeline because particle shapes are diverse and micrographs have an extremely low signal-to-noise ratio (SNR). As a result, significant human intervention is often required to generate a high-quality set of particles for the downstream structure determination steps.

**Imports**

In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import cv2

**Mounting and importing data sets**

The datasets used for training and testing are stored in Google Drive.

The Cryo-EM micrographs used in this repository were collected from the following sources:


*   'Apoferritin': the first dataset is EMPIAR-10146, the apoferritin tutorial dataset for cisTEM. The dataset description is listed at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10146/#&gid=1&pid=1 (this entry appears to be deprecated, as the data can no longer be accessed from this link).
*   'KLH': the second dataset, KLH, contains both top-view and side-view particles. The KLH dataset is available online at http://nramm.nysbc.org/.
*   'Ribosome': the third dataset, EMPIAR-10028 (80S ribosome), contains irregularly shaped protein particles. It was downloaded from https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10028/
*   'Beta-galactosidase': the fourth dataset, EMPIAR-10017 (beta-galactosidase), has complex protein particle shapes. It was downloaded from https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10017/

All four data sets are imported into a single 'dataset' dictionary, keyed by folder name. To access a specific data set, use:

dataset['dataset_name']['data']

This returns all images within 'dataset_name'. The matching labels are stored under dataset['dataset_name']['labels'].
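
As a rough illustration of that layout, the sketch below builds a stand-in dictionary with the same two keys per folder and reads from it. The tiny zero-filled arrays and the choice of two folders are assumptions made only so the example runs without the Drive data. The real images come from the loading cell that follows.

```python
import numpy as np

# Hypothetical stand-in for the real dataset: two tiny blank "micrographs" per folder.
dataset = {
    name: {
        "data": np.zeros((2, 4, 4, 3), dtype=np.uint8),  # (n_images, height, width, channels)
        "labels": np.array([name] * 2),                  # one label per image
    }
    for name in ["Apoferritin", "KLH"]
}

klh_images = dataset["KLH"]["data"]      # all images in the 'KLH' folder
klh_labels = dataset["KLH"]["labels"]    # the folder name repeated once per image
first_image = klh_images[0]              # a single image array (height, width, channels)
```

Note that the inner keys are the strings 'data' and 'labels', so they need quotes when indexing.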



In [9]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Drive path to the directory containing the datasets
path = "/content/drive/MyDrive/DeepCryoPicker/Data Sets"
# Empty dictionary to store image data
dataset = {}
# Loop through all folders in the directory and import images.
# Images are stored as dataset[folder]['data'], labels as dataset[folder]['labels']
for folder in os.listdir(path):
  folder_path = os.path.join(path, folder)
  # Skip stray files that sit next to the dataset folders
  if not os.path.isdir(folder_path):
    continue
  data = []
  labels = []
  for image_path in os.listdir(folder_path):
    image = cv2.imread(os.path.join(folder_path, image_path))
    # cv2.imread returns None for files it cannot decode; skip those
    if image is None:
      continue
    # OpenCV loads images as BGR; convert to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data.append(image)
    labels.append(folder)
  # Convert data and labels to numpy arrays
  data = np.array(data)
  labels = np.array(labels)
  # Add the data and labels to the dataset dictionary
  dataset[folder] = {'data': data, 'labels': labels}

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
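
The loading cell converts each folder's list of images with np.array. If the micrographs in a folder differ in size (an assumption, since the image dimensions are not stated here), NumPy cannot build a regular array from them and, depending on the version, either warns or raises an error. A minimal sketch of one way to handle both cases is shown below. The helper name `stack_or_object_array` is hypothetical and is not part of the notebook.

```python
import numpy as np

def stack_or_object_array(images):
    """Stack same-shaped images into one array; otherwise keep them in a 1-D object array."""
    shapes = {img.shape for img in images}
    if len(shapes) == 1:
        # All images share one shape: a regular (n, height, width, channels) array.
        return np.stack(images)
    # Mixed shapes (or no images): hold each image as a separate element.
    out = np.empty(len(images), dtype=object)
    for i, img in enumerate(images):
        out[i] = img
    return out

# Usage with synthetic images standing in for micrographs.
same_size = stack_or_object_array([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
mixed_size = stack_or_object_array([np.zeros((2, 2, 3)), np.zeros((3, 2, 3))])
```

With an object array, each element keeps its own shape, so any later step that expects a single 4D array would need to resize or crop the images first.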




In [39]:
all_labels = []
for folder in dataset.keys():
    all_labels.extend(dataset[folder]['labels'])
labels = list(set(all_labels))

print(labels)

['Ribosome', 'Apoferritin', 'Beta-galactosidase', 'KLH']
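
Because the label list is built from a set, its order is arbitrary and can change between runs. If a stable class order is needed later (for example, to map class names to integer indices), one option is to sort the names first. The sketch below assumes the four labels printed above. The names `class_names` and `label_to_index` are illustrative and do not appear in the notebook.

```python
# The class names as printed by the cell above (order is arbitrary).
labels = ['Ribosome', 'Apoferritin', 'Beta-galactosidase', 'KLH']

# Sorting gives the same order on every run.
class_names = sorted(labels)

# Map each class name to a fixed integer index.
label_to_index = {name: i for i, name in enumerate(class_names)}
```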


**Data Preprocessing**

Step 1 is to preprocess the entire dataset.

In [42]:
# Run the notebook containing the preprocessing code
libdir = "/content/drive/MyDrive/DeepCryoPicker/"
%run {libdir}PreProcessing.ipynb

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
