# Classifying Galaxies Using Convolutional Neural Networks

Around the clock, telescopes aboard orbiting satellites and at ground-based observatories take millions of pictures of millions upon millions of celestial bodies. These data, covering stars, planets, and galaxies, provide an invaluable resource to astronomers.

However, there is a bottleneck: until the data is annotated, it’s incredibly difficult for scientists to put it to good use. Additionally, scientists are usually interested in subsets of the data, like galaxies with unique characteristics.

In this project, you will build a neural network to classify deep-space galaxies. You will be using image data curated by Galaxy Zoo, a crowd-sourced project devoted to annotating galaxies in support of scientific discovery.

You will identify “odd” properties of galaxies. The data falls into four classes:

[1,0,0,0] - Galaxies with no identifying characteristics.
Three regular galaxies. Each has a bright center, surrounded by a cloud of stars.

[0,1,0,0] - Galaxies with rings.
Three ringed galaxies. Each has a bright center, surrounded by a ring of stars.

[0,0,1,0] - Galactic mergers.
Three photos of galaxies. Each contains two bright orbs surrounded by clouds. These images show galaxies in the process of merging.

[0,0,0,1] - “Other”: irregular celestial bodies.
Three photos of irregular celestial objects. Each is an irregular cloud. The second has four bright orbs, seemingly suspended above the cloud of stars.

Because the dataset comprises over one thousand images, you’ll use a custom function, load_galaxy_data(), to load the compressed data files into the Codecademy learning environment as NumPy arrays. Take a look at the shape of the data.
Use .shape to print the dimensions of the input_data and labels.
What does the last dimension of the data indicate about the image data? What does the last dimension of the labels indicate about the labels?
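As a rough illustration of how to read those shapes, the sketch below builds synthetic stand-ins rather than the real Galaxy Zoo arrays. The array sizes (10 images, 128x128 pixels, 3 channels, 4 classes) are assumptions chosen to mirror the input shape used later in this project.

```python
import numpy as np

# Synthetic stand-ins (assumption): 10 RGB images of 128x128 pixels, 4 classes.
images = np.zeros((10, 128, 128, 3), dtype=np.uint8)
one_hot = np.eye(4, dtype=int)[np.arange(10) % 4]

print(images.shape)   # (samples, height, width, channels)
print(one_hot.shape)  # (samples, classes)

n_channels = images.shape[-1]         # last image dimension: color channels
n_classes = one_hot.shape[-1]         # last label dimension: number of classes
class_index = one_hot.argmax(axis=1)  # recover the integer class from each one-hot row
```

The last dimension of the images counts color channels, while the last dimension of the labels counts classes, with exactly one 1 per row.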

Next, divide the data into training and validation data, using sklearn’s train_test_split() function.
Set the test_size argument to be 0.20.
Shuffle the data.
Set the random_state to be 222.
Set stratify=labels. This ensures that ratios of galaxies in your testing data will be the same as in the original dataset.
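To see what stratification does, here is a minimal sketch on synthetic one-hot labels. The 50/30/10/10 class counts and the tiny placeholder images are assumptions made up for the demonstration; only the train_test_split() arguments match the settings listed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced one-hot labels (assumption): 50/30/10/10 across 4 classes.
class_ids = np.repeat([0, 1, 2, 3], [50, 30, 10, 10])
labels = np.eye(4)[class_ids]
images = np.zeros((len(labels), 8, 8, 3))  # tiny placeholder images

x_train, x_valid, y_train, y_valid = train_test_split(
    images, labels, test_size=0.20, stratify=labels, shuffle=True, random_state=222)

# Summing one-hot rows gives per-class counts for each split.
train_counts = y_train.sum(axis=0).astype(int)
valid_counts = y_valid.sum(axis=0).astype(int)
print(train_counts, valid_counts)
```

Both splits keep the original 5:3:1:1 class ratio, which is the property that stratify=labels is meant to guarantee.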

In [None]:
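import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# load_galaxy_data() is the custom loader supplied by the learning environment.
input_data, labels = load_galaxy_data()

print(input_data.shape)
print(labels.shape)

# Hold out 20% for validation, keeping the class ratios the same in both splits.
x_train, x_valid, y_train, y_valid = train_test_split(input_data, labels, test_size=0.20, stratify=labels, shuffle=True, random_state=222)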

Now, it’s time to preprocess the input.
Define an ImageDataGenerator, and configure it so that the object will normalize the pixels using the rescale parameter.

Next, create two NumpyArrayIterators using the .flow(x,y,batch_size=?) method. We recommend using a batch size of 5. Significantly larger batch sizes may cause memory issues on the Codecademy platform.
Create a training data iterator by calling .flow() on your training data and labels.
Create a validation data iterator by calling .flow() on your validation data and labels.
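The rescale parameter multiplies every pixel by the given factor. The NumPy sketch below imitates that arithmetic on a made-up row of pixel values, assuming 8-bit images with values from 0 to 255. It is not the ImageDataGenerator itself, only the normalization it applies.

```python
import numpy as np

# Made-up 8-bit pixel values (assumption): darkest, mid-dark, and brightest.
pixels = np.array([[0, 51, 255]], dtype=np.uint8)

# Equivalent of rescale=1./255: multiply each pixel by 1/255.
rescaled = pixels.astype(np.float32) * (1.0 / 255)
print(rescaled)
```

After rescaling, every value lies between 0 and 1, which tends to keep gradient updates well-behaved during training.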



Next, build your model, starting with the input shape and output layer.
Create a tf.keras.Sequential model named model.
Add a tf.keras.Input layer. Refer back to the shape of the data. What should the input shape be?
Add a tf.keras.layers.Dense layer as your output layer. Make sure that it outputs 4 features, for the four classes (“Normal”, “Ringed”, “Merger”, “Other”).
Remember to use a softmax activation on this final layer.

Before you finish designing your architecture, compile your model with an optimizer, loss, and metrics.
Use model.compile(optimizer=?,loss=?, metrics=[?,?]) to compile your model.
Use tf.keras.optimizers.Adam with a learning_rate of 0.001.
Because the labels are one-hot categories, use tf.keras.losses.CategoricalCrossentropy() as your loss.
Set [tf.keras.metrics.CategoricalAccuracy(),tf.keras.metrics.AUC()] as your metrics.
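To make the pairing of a softmax output with categorical cross-entropy concrete, here is a small NumPy sketch. The logits are made-up scores for a single galaxy, and the two helper functions are illustrative re-implementations, not the Keras classes named above.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    # Mean over samples of -sum(y_true * log(y_prob)).
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

logits = np.array([[2.0, 0.5, 0.1, -1.0]])  # made-up scores for one galaxy
y_true = np.array([[1.0, 0.0, 0.0, 0.0]])   # one-hot label for the first class

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.round(3), round(loss, 3))
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the true class, so it shrinks toward zero as that probability approaches 1.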


Now, let’s go back and finish fleshing out your architecture. An architecture that works well on this task is two convolutional layers, interspersed with max pooling layers, followed by two dense layers:

Conv2D: 8 filters, each 3x3 with strides of 2

MaxPooling2D: pool_size=(2, 2), strides=2

Conv2D: 8 filters, each 3x3 with strides of 2

MaxPooling2D: pool_size=(2, 2), strides=2

Flatten Layer

Hidden Dense Layer with 16 hidden units

Output Dense Layer

Try coding up this architecture yourself, using:

tf.keras.layers.Conv2D

tf.keras.layers.MaxPooling2D

tf.keras.layers.Flatten()

tf.keras.layers.Dense()

Don’t forget to use “relu” activations for Dense and Conv2D hidden layers!

Use model.summary() to confirm that the layers and their output shapes match this architecture.
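The shapes and parameter counts that model.summary() reports can also be worked out by hand. The sketch below does that arithmetic, assuming a 128x128x3 input and the Keras default of no padding ("valid") for both Conv2D and MaxPooling2D; out_size is a hypothetical helper written for this example.

```python
def out_size(size, window, stride):
    # Output length along one spatial axis when no padding is applied.
    return (size - window) // stride + 1

size = 128
sizes = []
# (window, stride) for: Conv2D, MaxPooling2D, Conv2D, MaxPooling2D
for window, stride in [(3, 2), (2, 2), (3, 2), (2, 2)]:
    size = out_size(size, window, stride)
    sizes.append(size)

flat = size * size * 8  # Flatten: height * width * 8 filters

# Weights plus biases for each trainable layer.
params = {
    "conv1": 3 * 3 * 3 * 8 + 8,   # 3x3 kernel, 3 input channels, 8 filters
    "conv2": 3 * 3 * 8 * 8 + 8,   # 3x3 kernel, 8 input channels, 8 filters
    "dense": flat * 16 + 16,      # hidden layer with 16 units
    "output": 16 * 4 + 4,         # 4-class softmax layer
}
total_params = sum(params.values())
print(sizes, flat, total_params)
```

Under these assumptions the spatial size shrinks from 128 to 63, 31, 15, and finally 7, so the Flatten layer emits 392 values and most of the trainable parameters sit in the hidden Dense layer.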

In [None]:
data_generator = ImageDataGenerator(rescale=1./255)

training_iterator = data_generator.flow(x_train, y_train,batch_size=5)
validation_iterator = data_generator.flow(x_valid, y_valid, batch_size=5)

# First pass: input and output layers only (replaced by the full architecture below).
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(128, 128, 3)))
model.add(tf.keras.layers.Dense(4,activation="softmax"))

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(128, 128, 3)))
model.add(tf.keras.layers.Conv2D(8, 3, strides=2, activation="relu")) 
model.add(tf.keras.layers.MaxPooling2D(
    pool_size=(2, 2), strides=(2,2)))
model.add(tf.keras.layers.Conv2D(8, 3, strides=2, activation="relu")) 
model.add(tf.keras.layers.MaxPooling2D(
    pool_size=(2,2), strides=(2,2)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(16, activation="relu"))
model.add(tf.keras.layers.Dense(4, activation="softmax"))

model.summary()

Use model.fit(...) to train your model.
The first argument should be your training iterator.
Set steps_per_epoch to be the length of your training data, divided by your batch size.
Set epochs to be 8.
Set validation_data to be your validation iterator.
Set validation_steps to be the length of your validation data, divided by your batch size.
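The step counts are simple arithmetic on the dataset sizes. The sketch below uses hypothetical split sizes (1,120 training and 280 validation images), since the exact counts depend on the loaded data; the batch size of 5 is the one recommended earlier.

```python
import math

batch_size = 5
n_train, n_valid = 1120, 280  # hypothetical split sizes, not read from the dataset

# Whole batches available per pass over each split.
steps_per_epoch = n_train // batch_size
validation_steps = n_valid // batch_size

# When a size is not divisible by the batch size, rounding up
# includes the final, smaller batch instead of dropping it.
steps_with_partial_batch = math.ceil(1123 / batch_size)

print(steps_per_epoch, validation_steps, steps_with_partial_batch)
```

Integer division keeps the step counts whole numbers, which is what a count of batches should be.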



Now you can run your code to train the model. Training may take a minute or two. After training for eight epochs, your model’s accuracy should be around 0.60-0.70, and your AUC should fall into the 0.80-0.90 range!
What do these results mean?
Your accuracy tells you that your model assigns the highest probability to the correct class more than 60% of the time. For a classification task with four classes, this is no small feat: a model guessing uniformly at random would achieve only ~25% accuracy on the dataset. Your AUC tells you that for a random galaxy, there is more than an 80% chance your model would assign a higher probability to a true class than to a false one.
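Categorical accuracy can be sketched in a few lines of NumPy: a prediction counts as correct when the class with the highest probability matches the one-hot label. The probabilities below are made up for four galaxies and do not come from the trained model.

```python
import numpy as np

# Made-up softmax outputs for four galaxies (each row sums to 1).
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],  # true class is index 2, so this one is wrong
    [0.1, 0.1, 0.2, 0.6],
])
labels = np.eye(4)[[0, 1, 2, 3]]  # one-hot labels, one galaxy per class

predicted = probs.argmax(axis=1)
actual = labels.argmax(axis=1)
accuracy = float(np.mean(predicted == actual))

# Expected accuracy when guessing uniformly among 4 classes.
random_baseline = 1 / labels.shape[1]
print(accuracy, random_baseline)
```

Here three of the four predictions are correct, which gives 0.75 accuracy against a 0.25 baseline for uniform guessing.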

    



In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=[tf.keras.metrics.CategoricalAccuracy(),tf.keras.metrics.AUC()])


model.fit(
        training_iterator,
        steps_per_epoch=len(x_train) // 5,  # whole number of batches per epoch
        epochs=8,
        validation_data=validation_iterator,
        validation_steps=len(x_valid) // 5)


from visualize import visualize_activations
visualize_activations(model,validation_iterator)