# Galaxy Morphology Classification Project
---

**In this session, you will learn:-**
* How to do image preprocessing
* How to build CNN models
* How to train and validate them

**The main focus:-**
* Focus on the process rather than on the individual steps
* The steps will change from dataset to dataset, but the process will remain the same

### IMPORTANT NOTE
* RUN THIS NOTEBOOK ON KAGGLE SO YOU DON'T NEED TO DOWNLOAD THE DATASET
* Don't forget to turn on the GPU before you connect the runtime

#### Adding the dataset
* Input -> From Kaggle -> search for Galaxy Zoo -> Add (this option only works if your account is verified with a phone number)

In [None]:
# To begin with import most necessary dependencies
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Step 1: Get the Dataset

* Make an empty directory and name it as `galaxy_zoo_dataset`
* Unzip all the useful data files in that directory.

In [None]:
# Create a new directory to save the useful data for this project
# (exist_ok=True avoids an error if the cell is run more than once)
os.makedirs('galaxy_zoo_dataset', exist_ok=True)

In [None]:
# Unzip the csv file 
!unzip /kaggle/input/galaxy-zoo-the-galaxy-challenge/training_solutions_rev1.zip -d /kaggle/working/galaxy_zoo_dataset

In [None]:
# Unzip the directory of training images
!unzip /kaggle/input/galaxy-zoo-the-galaxy-challenge/images_training_rev1.zip -d /kaggle/working/galaxy_zoo_dataset

* We have successfully unzipped the training images and the csv file

### Step 2: Read the csv file and split the data into train-test
* For demonstration purposes, I will take a subset of the dataset
* To complete the project, you may need more images

In [None]:
# Check the csv file
csv_data_path = '/kaggle/working/galaxy_zoo_dataset/training_solutions_rev1.csv'
df = pd.read_csv(csv_data_path)
df.sample(5)

* `GalaxyID` is the ID of the galaxy image in the dataset
* The rest of the columns are the probabilities of the respective morphologies
* These morphologies describe features such as the shape of the galaxy, the galactic core, etc.

In [None]:
# Check the generic info of the data
df.info()

* Thank God! There are no missing values.
* Only `GalaxyID` is in integer format; the rest of the data is float type.

In [None]:
# Take only first 2000 data instances for demonstration
df_minimum = df[:2000]
df_minimum.shape

* We have taken 2000 rows as sample data for training.
* Please note that you will need more data to get a better RMSE score.
* Now we shall split these 2000 rows further into train and test sets.

In [None]:
# Get the train_test_split from sklearn
from sklearn.model_selection import train_test_split

In [None]:
# Create two dfs - one for train, another for test, with test size as 15%
df_train, df_test = train_test_split(df_minimum, test_size=.15, shuffle=True, random_state=42)

In [None]:
# Check before proceeding ahead
print(f'Shape of Train Data Frame:- {df_train.shape}')
print(f'Shape of Test Data Frame:- {df_test.shape}')
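
As a rough cross-check of the shapes printed above, the sketch below reproduces a shuffled 85/15 split with plain NumPy. It is not scikit-learn's implementation, only an illustration of the idea: `split_indices` is a hypothetical helper, and rounding the test count up is an assumption. For 2000 samples it gives 1700 training rows and 300 test rows.

```python
import math
import numpy as np

def split_indices(n_samples, test_size=0.15, seed=42):
    """Hypothetical helper: shuffle row indices and split them into train/test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # random order of 0..n_samples-1
    n_test = math.ceil(test_size * n_samples)    # assumption: round the test count up
    return shuffled[n_test:], shuffled[:n_test]  # (train indices, test indices)

train_idx, test_idx = split_indices(2000)
print(len(train_idx), len(test_idx))
```

Every row index lands in exactly one of the two sets, which is the property that keeps the test rows unseen during training.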

* So now we have the training and testing sets split for us!
* What's next? Well, let's visualize some images.

### Step 3: Visualizing RANDOM images from the dataset
1. Create the path for the training image directory and call it `root_dir`
2. Get the list of image file names present in `root_dir` and call it `ids_jpg`
3. Using `np.random.choice()`, randomly choose one file name from the `ids_jpg` list
4. Now create the complete image path for visualization purposes
5. Read the image and display it
6. Add a title with the galaxy ID and the shape of the image

In [None]:
# Coding all the above points for step 3
root_dir = "/kaggle/working/galaxy_zoo_dataset/images_training_rev1/" # Root Path of Dir where training images are saved
ids_jpg = os.listdir(root_dir)   # List of files in the directory (eg. 10001.jpg)
id_ = np.random.choice(ids_jpg)  # Randomly choose one item from the list above
img_path = root_dir + id_      # Complete image path
random_image = plt.imread(img_path) # Get image pixels array
plt.imshow(random_image) # Display the image
plt.title(f'Galaxy ID:- {id_[:-4]}\nShape:- {random_image.shape}', 
          color = 'tab:pink')
plt.show()

* Every time you run the above code cell, it displays a different random image from the dataset.
* What do you notice here? Can we crop some part of the images?
* Yes! WE CAN... as our region of interest is exactly in the center of the image.
* How will we crop? Let's check it in the next step.

### Step 4: Preprocessing Images
* It is always important to preprocess images before passing them to the model
* It should help the model train faster
* How? Well, that's what we have to think about...

##### **Steps for one image:-**
1. Create a function that will crop the center part of the image
2. First, read the image
3. Then choose where to begin the cropping and the crop size
4. You may further resize the image to a much smaller size
5. The final step will be to normalize the image
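
The cropping step can be sketched on a synthetic array before touching real images. This is a minimal illustration, assuming a 424 x 424 RGB image; `center_crop` and `fake_img` are hypothetical names used only here.

```python
import numpy as np

def center_crop(img, crop_size):
    """Hypothetical helper: crop a (height, width, channels) array around its center."""
    top = (img.shape[0] - crop_size[0]) // 2     # first row of the crop
    left = (img.shape[1] - crop_size[1]) // 2    # first column of the crop
    return img[top:top + crop_size[0], left:left + crop_size[1]]

# Synthetic stand-in for one 424 x 424 RGB galaxy image
fake_img = np.zeros((424, 424, 3), dtype=np.uint8)
cropped = center_crop(fake_img, (256, 256))
print(cropped.shape)  # (256, 256, 3)
```

For a 424-pixel side and a 256-pixel crop, the start offset works out to `(424 - 256) // 2 = 84`, which is where the `(84, 84)` start point used in this step comes from.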

In [None]:
# Read the image from the path chosen in Step 3
img_array = plt.imread(img_path)
plt.imshow(img_array)
plt.show()

In [None]:
# Crop from (84, 84) and choose crop size as (256, 256)
START_FROM = (84, 84)
CROP_SIZE = (256, 256)
cropped_img = img_array[START_FROM[0]:START_FROM[0]+CROP_SIZE[0],
                        START_FROM[1]:START_FROM[1]+CROP_SIZE[1]]

# Check the output
plt.imshow(cropped_img)
plt.title(f'Shape:- {cropped_img.shape}')
plt.show()

In [None]:
# Check the max-min pixels of the cropped_img
print(f'Maximum Pixel of Cropped Image:- {cropped_img.max()}')
print(f'Minimum Pixel of Cropped Image:- {cropped_img.min()}')

In [None]:
# Let's resize using skimage
from skimage.transform import resize
resized_img = resize(cropped_img, (64, 64))

In [None]:
# Check the max-min pixels of the resized_img
print(f'Maximum Pixel of Resized Image:- {resized_img.max()}')
print(f'Minimum Pixel of Resized Image:- {resized_img.min()}')

* As we can see, the output is between 0 and 1 by default, so we don't need a separate normalization step.
* But wait... why is it between 0 and 1? For that, check the latest activity in your Slack.
* Now just create a function that does this preprocessing on any image...
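
On the question of the 0-to-1 range: before interpolating, `resize` converts the image to floating point, and for `uint8` input that conversion divides by 255 unless `preserve_range=True` is passed. The sketch below mimics only that conversion step with NumPy (it does not resize anything); `to_unit_range` is a hypothetical name.

```python
import numpy as np

def to_unit_range(img_uint8):
    """Hypothetical helper: map uint8 pixels (0..255) to floats in [0, 1]."""
    return img_uint8.astype(np.float64) / 255.0

pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = to_unit_range(pixels)
print(scaled.min(), scaled.max())  # 0.0 1.0
```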

In [None]:
def get_image(path, x1, y1, resize_shape, crop_size):  
    """
    Get the preprocessing for single galaxy image
    
    Parameters
    ----------
    path: Image Path for the image on which you want to apply image processing
    x1: Start pixel for rows to begin the cropping
    y1: Start pixel for cols to begin the cropping
    resize_shape: The final shape of the image
    crop_size: Image will be cropped from start pixels to the crop size
    
    Returns
    -------
    preprocessed_img: Centered image of the galaxy
    
    """
    img_array = plt.imread(path)                       
    crop_img = img_array[x1:x1+crop_size[0], y1:y1+crop_size[1]] 
    preprocessed_img = resize(crop_img, resize_shape)                   
    return preprocessed_img

* The function is ready! Now it is time to test it.
* Create a code cell that will generate a side-by-side subplot to compare original and preprocessed image.

In [None]:
# Get random image path
id_ = np.random.choice(ids_jpg) 
img_path = root_dir + id_      
org_img = plt.imread(img_path)

# Preprocess it
x_data = get_image(img_path, 84, 84, (64,64), (256, 256))

# Display before after images
plt.figure(figsize=(8,5))
plt.suptitle(f'Galaxy ID:- {id_[:-4]}')

plt.subplot(121)
plt.imshow(org_img)
plt.title(f'Original Image Shape:-\n{org_img.shape}', color = 'tab:pink')

plt.subplot(122)
plt.imshow(x_data)
plt.title(f'Resized into:-\n{x_data.shape}', color = 'tab:pink')

plt.tight_layout()
plt.show()

* Now that we have this for one image, how about preprocessing all the images in our data?
* For that, create another function to prepare batches of images according to the dataframe.

In [None]:
# To check the progress of the loop we will need a library called tqdm
from tqdm import tqdm 

In [None]:
# Image Data
ORG_SHAPE = (424,424)
CROP_SIZE = (256,256)
RESIZE_SHAPE = (64,64)

In [None]:
def get_all_images(dataframe, resize_shape=RESIZE_SHAPE, crop_size=CROP_SIZE):
    """
    Use dataframe to get image ids and preprocess all of them using get_image function
    
    Parameters
    ----------
    dataframe: Data frame should have first column for galaxy ids
    resize_shape: Image to be resized into this shape
    crop_size: Crop size for the image before resizing
    
    Return
    ------
    x_batch: Array of batch of images (batch_size, Height, Width, Channels)
    y_batch: Array of respective probabilities for image (batch_size, Cols) 
    """

    # Get the top-left corner of the centered crop, where the region of interest is
    x1 = (ORG_SHAPE[0]-crop_size[0])//2       # with the default crop: (424-256)//2 = 84
    y1 = (ORG_SHAPE[1]-crop_size[1])//2       # with the default crop: (424-256)//2 = 84

    # Form x and y batches
    sel = dataframe.values                     # dataframe values as an array
    ids = sel[:,0].astype(int).astype(str)     # Get Galaxy IDs as strings
    y_batch = sel[:,1:]                        # Get all label columns (everything except Galaxy ID)
    x_batch = []                               # Collect the preprocessed images here
    for i in tqdm(ids):
        # Call get_image on the image file that matches this Galaxy ID
        x = get_image(root_dir + i + '.jpg', x1, y1, resize_shape=resize_shape, crop_size=crop_size)
        x_batch.append(x)          # append the cropped and resized image to x_batch
    x_batch = np.array(x_batch)    # convert the list of images into a NumPy array

    # Return the batches
    return x_batch, y_batch
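
To make the ID handling in `get_all_images` concrete, here is a small sketch on a made-up two-row frame. The values and the two label column names are invented for illustration, and the real frame has 37 label columns. Because the frame mixes an integer column with float columns, `.values` returns a single float array, so the IDs are cast back to `int` before being turned into file-name strings.

```python
import pandas as pd

# Toy frame with invented values; column names other than GalaxyID are illustrative
toy_df = pd.DataFrame({
    "GalaxyID": [100008, 100023],
    "Class1.1": [0.38, 0.33],
    "Class1.2": [0.61, 0.66],
})

sel = toy_df.values                          # mixed int/float columns become one float64 array
ids = sel[:, 0].astype(int).astype(str)      # 100008.0 -> 100008 -> '100008'
labels = sel[:, 1:]                          # everything except the ID column
file_names = [i + ".jpg" for i in ids]
print(file_names)
print(labels.shape)
```

Skipping the `astype(int)` step would turn the ID into a string like `'100008.0'`, which would not match any image file name.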

In [None]:
# Apply the function to get X_train, y_train, X_test, y_test
X_train, y_train = get_all_images(df_train)
X_test, y_test = get_all_images(df_test)      # Validation set

In [None]:
# Check the shapes for training set
print('X_train Shape:- ')
print(X_train.shape)

print('\ny_train Shape:- ')
print(y_train.shape)

In [None]:
# Check the shapes for testing set
print('X_test Shape:- ')
print(X_test.shape)

print('\ny_test Shape:- ')
print(y_test.shape)

### Step 5: Build the CNN Model 
* We shall use the Keras API to build the model
* The aim of the model is to accept an image of a fixed size and produce 37 different outputs, each between 0 and 1

In [None]:
# Important imports for building the CNN model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dropout, Flatten, Dense
import keras
from tensorflow.keras import backend as K

* Initiate the model with the Sequential API
* It allows us to add layers in sequence
* For now, we will have this architecture:- `Input Layer` -> `Conv` -> `Conv` -> `Max Pool` -> `Flatten` -> `Dense` -> `Dropout` -> `Dense (output)`
* Later, you can modify it according to your preferences

In [None]:
# Create the Sequential Model
model = Sequential()

# Build the model with Inputs, 2 Conv, 1 MaxPool followed by Flatten, Dense, Dropout and Output
model.add(Conv2D(512, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.8)) # to reduce overfitting
model.add(Dense(37, activation='sigmoid'))

# Check the summary
model.summary()

* Here, you can calculate the output shape of the Conv2D layer using `o = (i - f + 2p)/s + 1`, where `i` is the input size, `f` the filter size, `p` the padding and `s` the stride
* You can calculate the parameters of the Conv2D layer using `params = (filter_size^2 * filter_channels)*total_filters +  total_filters`
* Example for the first Conv2D layer (no padding, stride 1) ~
    * `output shape`:- `(64 - 3 + 2*0)/1 + 1` = `62`
    * `params`:- `3^2 * 3 * 512 + 512` = `14,336`
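
The two formulas above can be checked with a few lines of Python and compared against `model.summary()`. This is a sketch with hypothetical helper names (`conv_output_size`, `conv_params`), assuming no padding and stride 1, which are the Keras `Conv2D` defaults used in this model.

```python
def conv_output_size(i, f, p=0, s=1):
    """Spatial output size of a convolution: o = (i - f + 2p) / s + 1 (floored)."""
    return (i - f + 2 * p) // s + 1

def conv_params(filter_size, in_channels, n_filters):
    """Weights of all filters plus one bias per filter."""
    return filter_size ** 2 * in_channels * n_filters + n_filters

# First Conv2D: 64x64x3 input, 512 filters of size 3x3
out1 = conv_output_size(64, 3)
params1 = conv_params(3, 3, 512)

# Second Conv2D: takes the 62x62x512 output of the first layer, 256 filters of size 3x3
out2 = conv_output_size(out1, 3)
params2 = conv_params(3, 512, 256)

print(out1, params1)   # 62 14336
print(out2, params2)   # 60 1179904
```

Note that the second layer has far more parameters than the first, because its filters span all 512 channels produced by the first layer.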

### Step 6: Compile the model
* Compile the model using loss, metrics and optimizer
* We are choosing:-
    * Loss:- MSE
    * Optimizer:- Adam
    * Metric:- RMSE
    

In [None]:
# Create a function to calculate RMSE from the outputs
def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))
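
For intuition, the same quantity can be sketched in NumPy on invented numbers. `rmse_numpy` is a hypothetical helper, not part of Keras; it mirrors the square, mean, square-root order used above.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Hypothetical helper: root mean squared error over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(np.square(y_pred - y_true)))

# Two samples with two outputs each; every prediction is off by exactly 0.5
y_true = [[0.0, 1.0], [1.0, 0.0]]
y_pred = [[0.5, 0.5], [0.5, 0.5]]
print(rmse_numpy(y_true, y_pred))  # 0.5
```

Since RMSE is the square root of MSE, a lower MSE loss always means a lower RMSE on the same data; RMSE is simply easier to read because it is in the same units as the targets.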

In [None]:
model.compile(loss='mse', 
              optimizer=keras.optimizers.Adam(learning_rate=0.001), 
              metrics=[root_mean_squared_error])

* You can change the loss and optimizer to check how it performs
* Don't forget to tune the optimizer's hyperparameters

### Step 7: Train the model and validate it
* Make sure to turn on the GPU or else one epoch may take approx 10 mins
* With GPU it should take approx 5-10 seconds per epoch

In [None]:
# 1700 training samples / batch size 32 = 53.125 --> 54 batches per epoch (the last batch has only 4 samples)
model.fit(X_train, 
          y_train, 
          epochs=50, 
          validation_data=(X_test, y_test), 
          batch_size=32) # default bs = 32 if None
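
The batch count mentioned in the comment above can be verified with a short calculation. This sketch assumes the 1700 training rows produced by the 85/15 split of 2000 samples; the variable names are illustrative.

```python
import math

n_samples = 1700      # 85% of the 2000 sampled rows
batch_size = 32

steps_per_epoch = math.ceil(n_samples / batch_size)          # a partial batch still counts as a step
last_batch = n_samples - (steps_per_epoch - 1) * batch_size  # samples left for the final step
print(steps_per_epoch, last_batch)  # 54 4
```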

### Step 8: Get the predictions

In [None]:
# These are the true test outputs
y_test_df = pd.DataFrame(y_test, columns = list(df_minimum.columns[1:]))
y_test_df.head()

In [None]:
# Get the prediction outputs
pred_test = np.array(model.predict(X_test))
pred_test_df = pd.DataFrame(pred_test, columns = list(df_minimum.columns[1:]))
pred_test_df.head()

In [None]:
# Print the RMSE on the test/val data
print(np.array(root_mean_squared_error(y_test_df.values, pred_test_df.values)))

**How to get a better score?**
1. You can use more samples in training
2. You can adjust your image processing pipeline
3. You can change your model architecture
4. You can change the loss and optimizer, or tune the hyperparameters of Adam
5. You can increase or decrease the number of epochs
6. You can change the batch size during training

## Your next assignment
* It is based on this project
* More information on it will be shared later