# What is it?

*   Understanding how and why we are here is one of the fundamental questions for the human race.
*   Part of the answer to this question lies in the origins of galaxies, such as our own Milky Way.
*   Yet questions remain about how the Milky Way (or any of the other ~100 billion galaxies in our Universe) formed and evolved.
*   Galaxies come in all shapes, sizes and colors: from beautiful spirals to huge ellipticals.
*   Understanding the distribution, location and types of galaxies as a function of shape, size, and color is a critical piece of solving this puzzle.



# Credit: NASA and European Space Agency

* With each passing day telescopes around and above the Earth capture more and more images of distant galaxies. 
* As better and bigger telescopes continue to collect these images, the datasets begin to explode in size. 
* In order to better understand how the different shapes (or morphologies) of galaxies relate to the physics that create them, such images need to be sorted and classified. 
* Galaxy Zoo data is made

# More about it
Galaxies in this set have already been classified once with the help of hundreds of thousands of volunteers, who collectively classified the shapes in these images by eye in a successful citizen science crowdsourcing project. However, this approach becomes less feasible as datasets grow to contain hundreds of millions (or even billions) of galaxies. That's where you come in.

The aim is to analyze the JPG images of galaxies to find automated metrics that reproduce the probability distributions derived from human classifications. For each galaxy, determine the probability that it belongs in a particular class. 

# CNN on galaxy data

This notebook demonstrates how to implement a Convolutional Neural Network (CNN) on galaxy image data using Keras with a TensorFlow backend. Here is a breakdown of the code and what it does:

### Preprocessing Images:
- The `get_image` function crops and resizes images from a specified directory (`'/content/images_training_rev1/'`) to a target shape (`IMG_SHAPE`) for model input.
- The `get_all_images` function processes all images in a given DataFrame (`df_train` or `df_test`), extracting features (`X_train`, `X_test`) and labels (`y_train`, `y_test`).
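The centre crop these helpers rely on can be illustrated with NumPy alone. This is a minimal sketch: a synthetic array stands in for a 424x424 JPG, and `center_crop` is a hypothetical helper name that does not appear in the notebook.

```python
import numpy as np

ORIG_SHAPE = (424, 424)   # size of the raw images, as in the notebook
CROP_SIZE = (256, 256)    # central region kept before resizing

def center_crop(img, crop_size):
    """Return the central crop_size window of an H x W x C image (hypothetical helper)."""
    x1 = (img.shape[0] - crop_size[0]) // 2
    y1 = (img.shape[1] - crop_size[1]) // 2
    return img[x1:x1 + crop_size[0], y1:y1 + crop_size[1]]

# Synthetic stand-in for an RGB galaxy image
fake_image = np.zeros(ORIG_SHAPE + (3,), dtype=np.uint8)
cropped = center_crop(fake_image, CROP_SIZE)
print(cropped.shape)  # (256, 256, 3)
```

With these sizes the crop starts 84 pixels in from each edge, which discards mostly empty sky around the galaxy before the image is downsized.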

### CNN Model Architecture:
- The model is defined using a Sequential approach, starting with convolutional layers (`Conv2D`) followed by activation functions (`Activation`), max pooling (`MaxPooling2D`), and dropout regularization (`Dropout`).
- The final layers are dense (fully connected) layers with activation functions and dropout, culminating in a 37-unit output layer with sigmoid activation, so each output is an independent probability in [0, 1], one per class column.
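To see how the spatial size shrinks through the convolutional stack, the arithmetic can be traced in plain Python. This sketch assumes the layer order used in Step 6 (three blocks of two 3x3 convolutions) and Keras' default unpadded, stride-1 convolutions; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel=3):
    """Output side length of a 'valid' (unpadded) stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output side length of non-overlapping max pooling."""
    return size // pool

size = 64  # IMG_SHAPE side length
trace = [size]
for block in range(3):
    size = conv_out(conv_out(size))  # two 3x3 convolutions per block
    trace.append(size)
    if block < 2:                    # the third block uses global max pooling instead
        size = pool_out(size)
        trace.append(size)

print(trace)  # [64, 60, 30, 26, 13, 9]
```

The final 9x9 feature maps are then collapsed by global max pooling into one value per channel before the dense layers.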

### Model Compilation and Training:
- The model is compiled using binary cross-entropy loss and the Adamax optimizer, with a custom root mean squared error (RMSE) metric defined.
- Training is performed using `model.fit` on the training data (`X_train`, `y_train`) for a specified number of epochs.
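The RMSE metric can be reproduced outside Keras to sanity-check predictions. This is a NumPy sketch of the same formula; the arrays are made-up values for illustration, not data from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all galaxies and all class columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up vote fractions for two galaxies and three classes (illustrative only)
y_true = np.array([[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]])
y_pred = y_true.copy()
print(rmse(y_true, y_pred))  # 0.0 for a perfect prediction
```

A prediction that is off by the same amount in every entry has an RMSE equal to that amount, which makes the metric easy to interpret on the probability scale.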

### Testing and Prediction:
- The `test_image_generator` function prepares test images from a directory (`'/content/images_test_rev1/'`) for prediction.
- Predictions are made batch-wise using the trained model (`model.predict`) on test data, and results are saved in a DataFrame (`submission_df`) for submission or further analysis.
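The batch-wise loop amounts to walking over the file list in fixed-size slices, with a shorter final slice when the count is not a multiple of the batch size. A small sketch of that index arithmetic, where `batch_ranges` is a hypothetical helper name:

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, stop) index pairs covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

print(list(batch_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```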

### Points for Improvement:
- Experiment with different CNN architectures, such as varying the number of layers, kernel sizes, and activation functions, to optimize model performance.
- Consider data augmentation techniques (e.g., rotation, flipping) to increase the diversity of training data and improve model generalization.
- Hyperparameter tuning (e.g., learning rate, batch size) can further enhance model training and convergence.
- Monitor training/validation metrics (loss, RMSE) to assess model performance and prevent overfitting.
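As one possible take on the augmentation idea, the sketch below applies a random quarter-turn rotation and an optional mirror using NumPy only. It assumes the class probabilities do not depend on image orientation; `augment` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def augment(image, rng):
    """Randomly rotate by a multiple of 90 degrees and optionally mirror an H x W x C image."""
    k = int(rng.integers(0, 4))         # number of quarter turns
    out = np.rot90(image, k=k, axes=(0, 1))
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)      # horizontal mirror
    return out

rng = np.random.default_rng(0)
image = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
augmented = augment(image, rng)
print(augmented.shape)  # (64, 64, 3): square images keep their shape
```

Because these transforms only rearrange pixels, the labels can be reused unchanged for each augmented copy.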

Overall, this code provides a framework for implementing a CNN on galaxy image data, predicting for each galaxy the probability that it belongs to each morphology class.

# Step 1: Import libraries


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 2: Mount Google Drive and unzip the training data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
!unzip /content/gdrive/MyDrive/training_solutions_rev1.zip

Mounted at /content/gdrive
Archive:  /content/gdrive/MyDrive/training_solutions_rev1.zip
  inflating: training_solutions_rev1.csv  


# Step 3: Read the training data and split it into training and testing data

In [None]:


df = pd.read_csv('/content/training_solutions_rev1.csv')

df_train, df_test = train_test_split(df, test_size=.2)
df_train.shape, df_test.shape

((49262, 38), (12316, 38))
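The two shapes printed above can be checked by hand. This sketch assumes that `train_test_split` rounds the test share up to a whole number of rows and assigns the remainder to training.

```python
import math

n_rows = 49262 + 12316          # total rows, from the shapes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_rows)   # assumption: the test share is rounded up
n_train = n_rows - n_test
print(n_train, n_test)  # 49262 12316
```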

# Explore the data

In [None]:
df_train

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
13793,304641,0.510246,0.489754,0.000000,0.000000,0.489754,0.000000,0.489754,0.000000,0.489754,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
12564,285730,0.044123,0.955877,0.000000,0.025878,0.929999,0.529328,0.400670,0.622594,0.307405,...,0.025878,0.210590,0.268091,0.143913,0.000000,0.168093,0.016409,0.043661,0.013458,0.380973
13730,303711,0.119398,0.843129,0.037472,0.018549,0.824580,0.824580,0.000000,0.375184,0.449396,...,0.000000,0.319282,0.055902,0.000000,0.018778,0.076614,0.037180,0.000000,0.055958,0.186653
37170,642945,0.383957,0.583812,0.032231,0.000000,0.583812,0.000000,0.583812,0.000000,0.583812,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1145,117104,0.577271,0.366701,0.056028,0.000000,0.366701,0.000000,0.366701,0.173233,0.193468,...,0.000000,0.173233,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.173233
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15013,322307,0.090098,0.884271,0.025631,0.884271,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.148409,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
47195,788018,0.839485,0.122043,0.038472,0.000000,0.122043,0.000000,0.122043,0.060236,0.061807,...,0.000000,0.000000,0.060236,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.060236
43448,734086,0.305538,0.685270,0.009192,0.493072,0.192198,0.000000,0.192198,0.000000,0.192198,...,0.158492,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
58850,959987,0.597198,0.340394,0.062408,0.000000,0.340394,0.000000,0.340394,0.000000,0.340394,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


# The training data contains each image ID and the probability that the galaxy belongs to each class

In [None]:
df_test

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
33589,592092,0.620564,0.365558,0.013878,0.110130,0.255428,0.000000,0.255428,0.209258,0.046170,...,0.000000,0.111547,0.097711,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.209258
59457,968398,0.229249,0.726264,0.044486,0.059928,0.666336,0.237056,0.429280,0.165295,0.501041,...,0.059928,0.000000,0.000000,0.165295,0.000000,0.165295,0.000000,0.000000,0.000000,0.000000
27158,497588,0.867365,0.132635,0.000000,0.000000,0.132635,0.000000,0.132635,0.000000,0.132635,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
837,112527,0.253295,0.746705,0.000000,0.053274,0.693431,0.273646,0.419784,0.261472,0.431959,...,0.000000,0.261472,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.261472
37348,645725,0.534292,0.413716,0.051992,0.027719,0.385997,0.042828,0.343169,0.292645,0.093352,...,0.000000,0.097451,0.146322,0.048872,0.000000,0.146322,0.048872,0.000000,0.000000,0.097451
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47400,790899,0.726109,0.247454,0.026437,0.000000,0.247454,0.000000,0.247454,0.000000,0.247454,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
57391,938178,0.743362,0.231170,0.025468,0.000000,0.231170,0.000000,0.231170,0.000000,0.231170,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
49511,820913,0.600826,0.358629,0.040545,0.024028,0.334601,0.087299,0.247302,0.152234,0.182367,...,0.000000,0.050694,0.101540,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.152234
36753,637212,0.000000,1.000000,0.000000,0.022000,0.978000,0.133008,0.844992,0.978000,0.000000,...,0.000000,0.755239,0.111381,0.111381,0.066637,0.244990,0.044098,0.044098,0.044098,0.534078


# Step 4: Unzip and load images

In [None]:
!unzip /content/gdrive/MyDrive/images_training_rev1.zip
!unzip /content/gdrive/MyDrive/images_test_rev1.zip

Streaming output truncated to the last 5000 lines.
  inflating: images_test_rev1/944041.jpg  
  inflating: images_test_rev1/944051.jpg  
  inflating: images_test_rev1/944064.jpg  
  inflating: images_test_rev1/944073.jpg  
  inflating: images_test_rev1/944075.jpg  
  inflating: images_test_rev1/944076.jpg  
  inflating: images_test_rev1/944077.jpg  
  inflating: images_test_rev1/944085.jpg  
  inflating: images_test_rev1/944088.jpg  
  inflating: images_test_rev1/944094.jpg  
  inflating: images_test_rev1/944102.jpg  
  inflating: images_test_rev1/944104.jpg  
  inflating: images_test_rev1/944114.jpg  
  inflating: images_test_rev1/944133.jpg  
  inflating: images_test_rev1/944139.jpg  
  inflating: images_test_rev1/944142.jpg  
  inflating: images_test_rev1/944147.jpg  
  inflating: images_test_rev1/944152.jpg  
  inflating: images_test_rev1/944153.jpg  
  inflating: images_test_rev1/944155.jpg  
  inflating: images_test_rev1/944207.jpg  
  inflating: images_test_rev1/94

# Step 5: Preprocessing images

In [None]:

from skimage.transform import resize
from tqdm import tqdm  # progress bar for long-running loops
import matplotlib.pyplot as plt
%matplotlib inline

ORIG_SHAPE = (424,424)
CROP_SIZE = (256,256)
IMG_SHAPE = (64,64)

def get_image(path, x1, y1, shape, crop_size):
    # Read one JPG, keep the crop_size window starting at (x1, y1), and resize to shape
    x = plt.imread(path)
    x = x[x1:x1+crop_size[0], y1:y1+crop_size[1]]
    # skimage's resize converts uint8 input to floats in [0, 1] by default,
    # so no further division by 255 is needed
    x = resize(x, shape)
    return x

def get_all_images(dataframe, shape=IMG_SHAPE, crop_size=CROP_SIZE):
    # Offsets of the central crop
    x1 = (ORIG_SHAPE[0]-crop_size[0])//2
    y1 = (ORIG_SHAPE[1]-crop_size[1])//2

    sel = dataframe.values
    ids = sel[:,0].astype(int).astype(str)   # first column is GalaxyID
    y_batch = sel[:,1:]                       # remaining 37 columns are class probabilities
    x_batch = []
    for i in tqdm(ids):
        x = get_image('/content/images_training_rev1/'+i+'.jpg', x1, y1, shape=shape, crop_size=crop_size)
        x_batch.append(x)
    x_batch = np.array(x_batch)
    return x_batch, y_batch

X_train, y_train = get_all_images(df_train)
X_test, y_test = get_all_images(df_test)

100%|██████████| 49262/49262 [14:14<00:00, 57.65it/s]
100%|██████████| 12316/12316 [03:33<00:00, 57.68it/s]


In [None]:
X_train

# Step 6: Build the model

In [None]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization, GlobalMaxPooling2D
from keras import backend as K

def root_mean_squared_error(y_true, y_pred):
    # Custom metric: RMSE over all 37 outputs in the batch
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

model = Sequential()
model.add(Conv2D(512, (3, 3), input_shape=(IMG_SHAPE[0], IMG_SHAPE[1], 3)))
model.add(Conv2D(256, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(256, (3, 3)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(GlobalMaxPooling2D())


model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(37))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[root_mean_squared_error])
model.summary()

# Step 7: Fit the model

In [None]:
batch_size = 128
model.fit(X_train, y_train, batch_size=batch_size, epochs=30, validation_data=(X_test, y_test))

# Step 8: Predict on the test data

In [None]:
import os
from tqdm import tqdm

batch_size = 1  # larger values also work here and predict faster

def test_image_generator(ids, shape=IMG_SHAPE, crop_size=CROP_SIZE):
    # Load, centre-crop and resize a batch of test images given their file names
    x1 = (ORIG_SHAPE[0]-crop_size[0])//2
    y1 = (ORIG_SHAPE[1]-crop_size[1])//2
    x_batch = []
    for i in ids:
        x = get_image('/content/images_test_rev1/'+i, x1, y1, shape=shape, crop_size=crop_size)
        x_batch.append(x)
    return np.array(x_batch)

val_files = os.listdir('/content/images_test_rev1/')
val_predictions = []
N_val = len(val_files)
for i in tqdm(np.arange(0, N_val, batch_size)):
    upper = min(i+batch_size, N_val)  # the last batch may be shorter
    X = test_image_generator(val_files[i:upper])
    y_pred = model.predict(X)
    val_predictions.append(y_pred)
# Stack the per-batch arrays directly; this also works when batches differ in length
Y_pred = np.vstack(val_predictions)
# Keep IDs as integers so the sort below is numeric rather than lexicographic
ids = [int(v.split('.')[0]) for v in val_files]
submission_df = pd.DataFrame(Y_pred, columns=df.columns[1:])
submission_df.insert(0, 'GalaxyID', ids)
submission_df = submission_df.sort_values(by=['GalaxyID'])
submission_df.to_csv('sample_submission.csv', index=False)

In [None]:
submission_df[100:200]