# Normalizing image data in preparation for classification:

### Data can often be transformed so that the underlying information is preserved while the data becomes better suited as input to machine learning algorithms. This transformation is called "pre-processing". It usually means scaling (or "standardizing") the data by applying the same operation to every data point, which tends to improve the performance of machine learning algorithms without discarding information.

### This notebook:
- Opens each cutout
- Applies 4 different data normalization techniques to each cutout
- Flattens each image (in both raw and normalized forms), in preparation for the classifier
- Saves all flattened data to .csv format in the correct form for the classifier
- Plots a few cutouts so you can visualize what the normalizing techniques are doing to the data

In [None]:
import numpy as np
import os
import astropy.io.fits as fits
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt

### Getting list of all cutout paths

In [None]:
cutout_path = 'cutouts/'
cutout_files = []
gals = 0
stars = 0

for file in os.listdir(cutout_path):
    if file.endswith('.fits'):
        cutout_files.append(file)
        if file.startswith('gal'):
            gals += 1
        elif file.startswith('star'):
            stars += 1

print("Found", len(cutout_files), "cutouts:", stars, "stars, and", gals, "gals")

### Flatten and save the un-normalized "raw" image data for every cutout. 

- Open each cutout
- Set truth labels (0=Star, 1=Galaxy)
- Flatten the data (turns each 20x20 pixel image from shape (20, 20) to shape (400,))
- Save to .csv
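
As a rough illustration of the flattening step, the sketch below uses a synthetic 20x20 array as a stand-in for a real cutout (the array contents and the variable names `image`, `row`, and `restored` are assumptions for illustration only). It shows how the label and the 400 pixels end up in a single row, and that reshaping recovers the original image.

```python
import numpy as np

# Hypothetical stand-in for one 20x20 cutout (real cutouts come from the FITS files)
image = np.arange(400, dtype=float).reshape(20, 20)
obj_id = 1  # truth label: 0 = star, 1 = galaxy

# Flatten row by row: shape (20, 20) -> (400,)
data_flattened = image.flatten()

# Prepend the truth label so the row is [label, pixel_0, ..., pixel_399]
row = np.append(obj_id, data_flattened)

# Reshaping the pixel part of the row recovers the original image exactly
restored = row[1:].reshape(20, 20)
```

Flattening loses no pixel values; only the 2D layout is dropped, and it can be restored with a reshape as long as the image size is known.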

In [None]:
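rows = []

for file in tqdm(cutout_files):
    image = fits.getdata(os.path.join(cutout_path, file))

    # Truth label: 0 = star, 1 = galaxy (same prefixes as the counting cell above)
    if file.startswith('star'):
        obj_id = 0
    elif file.startswith('gal'):
        obj_id = 1
    else:
        continue  # skip files that are neither stars nor galaxies

    # Flatten (20, 20) -> (400,) and prepend the truth label
    rows.append(np.append(obj_id, image.flatten()))

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
gal_star = pd.DataFrame(rows)
gal_star.to_csv('raw_image_data.csv', header=False, index=False)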
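
The .csv files are written without a header or index, with the truth label in the first column. A minimal sketch of reading such a file back and splitting it into labels and pixels is shown below. It uses an in-memory buffer and two made-up 4-pixel rows instead of the real file, and the names `labels` and `pixels` are hypothetical; how the classifier actually loads the data is not shown in this notebook.

```python
import io
import numpy as np
import pandas as pd

# Two hypothetical rows in the same layout as raw_image_data.csv, shortened to 4 pixels
rows = [np.append(0, [1.0, 2.0, 3.0, 4.0]),
        np.append(1, [5.0, 6.0, 7.0, 8.0])]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, header=False, index=False)
buffer.seek(0)

# header=None is needed because the file was written without a header row
loaded = pd.read_csv(buffer, header=None)

# Column 0 holds the truth label; the remaining columns hold the pixel values
labels = loaded.iloc[:, 0].to_numpy().astype(int)
pixels = loaded.iloc[:, 1:].to_numpy()
```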

## Define each data normalization technique function:

### Technique 1: 

#### For each cutout:
- Take the base-10 log of each pixel value
- Find the minimum logged pixel value across the image
- Subtract that minimum value from each pixel

In [None]:
def norm_1(data):
    data_log = np.log10(data)
    min_log_data = np.amin(data_log)
    data_norm = data_log - min_log_data
    return data_norm
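
A small worked example may help here. The sketch below repeats the same two steps under a hypothetical name, `log_min_subtract`, and applies them to three made-up pixel values. It also counts non-positive pixels, on the assumption (not confirmed by this notebook) that some cutouts could contain zeros or negative values, for which the log is undefined.

```python
import numpy as np

def log_min_subtract(data):
    # Same steps as technique 1: base-10 log, then subtract the image minimum
    data_log = np.log10(data)
    return data_log - np.amin(data_log)

# log10 gives [1, 2, 3]; subtracting the minimum (1) gives [0, 1, 2]
pixels = np.array([10.0, 100.0, 1000.0])
result = log_min_subtract(pixels)

# Assumption: a cutout might contain zero or negative pixels.
# log10 returns -inf or NaN there, so it is worth counting them first.
noisy = np.array([-1.0, 0.0, 10.0])
n_invalid = int(np.sum(noisy <= 0))
```

After this normalization the faintest pixel in every image is exactly 0, and each unit step corresponds to a factor of 10 in the raw pixel value.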

### Technique 2:

#### For each cutout:
- Calculate minimum pixel value across image
- Calculate maximum pixel value across image
- Scale all data to the range [0, 1] with:<br>
$data\_norm = \frac{data - min}{max - min}$

In [None]:
def norm_2(data):
    min_data = np.min(data)
    max_data = np.max(data)
    data_norm = (data - min_data) / (max_data - min_data)
    return data_norm
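
To make the formula concrete, the sketch below applies the same min-max scaling to three made-up pixel values, using a hypothetical function name, `min_max_scale`. It also flags one edge case as an assumption: a cutout in which every pixel has the same value has `max - min = 0`, so the division is undefined.

```python
import numpy as np

def min_max_scale(data):
    # Same steps as technique 2: shift by the minimum, divide by the range
    min_data = np.min(data)
    max_data = np.max(data)
    return (data - min_data) / (max_data - min_data)

# (data - 2) / (6 - 2) gives [0, 0.5, 1]
pixels = np.array([2.0, 4.0, 6.0])
scaled = min_max_scale(pixels)

# Assumption: a completely flat cutout could occur; its range is zero,
# so it should be detected before dividing.
flat = np.array([5.0, 5.0, 5.0])
is_flat = (np.max(flat) - np.min(flat)) == 0
```

Because every image is scaled by its own range, the brightest pixel of each cutout becomes 1 regardless of how bright the object is.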

### Techniques 3 & 4:

These techniques are split into 2 parts, because you need to pause in the middle to calculate the maximum pixel value over ALL images. The only difference between techniques 3 & 4 is that technique 4 logs the data first. From start to end, they do the following:

#### For each cutout:
- Take the base-10 log of each pixel (**technique 4 only**)
- Find the minimum pixel value in the image
- Subtract that value from each pixel
- Calculate the maximum value over the min-subtracted cutout
- Once you have calculated the maximum value for each image, calculate the maximum of those (giving you the maximum pixel value over ALL images)
- Divide each pixel in the image by the maximum value over ALL images

In [None]:
def norm_3_4_part_1(data, log=False):
    if log:
        data = np.log10(data)
    min_data = np.amin(data)
    data_min_subtracted = data - min_data
    max_val = np.amax(data_min_subtracted)
    return data_min_subtracted, max_val

def norm_3_4_part_2(data, max_val):
    data_norm = data/max_val
    return data_norm
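
The two-part flow can be sketched on a pair of tiny made-up "images" (three pixels each, chosen only for illustration). This is a vectorized sketch of technique 3 rather than the per-cutout loop used in this notebook, and the variable names are hypothetical.

```python
import numpy as np

# Two hypothetical flattened "images" with very different brightness
images = np.array([[1.0, 2.0, 3.0],
                   [10.0, 30.0, 50.0]])

# Part 1: subtract each image's own minimum, then record each image's maximum
min_subtracted = images - images.min(axis=1, keepdims=True)
per_image_max = min_subtracted.max(axis=1)   # [2, 40]

# Pause: the maximum over ALL images is only known once every image has been seen
global_max = per_image_max.max()             # 40

# Part 2: divide every image by the same global maximum
normalized = min_subtracted / global_max
```

Unlike technique 2, the faint image here peaks at 0.05 rather than 1, so the relative brightness between cutouts is kept.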

## Applying normalization techniques:

#### Defining variables for appending to

In [None]:
# Create empty list (per technique) to append each normalized image to
rows_norm_1 = []
rows_norm_2 = []
rows_norm_3 = []
rows_norm_4 = []

# Create empty list to append max pixel value from each cutout (technique 3 & 4 only)
max_pixel_all_images_3 = []
max_pixel_all_images_4 = []

### Run each normalization technique on each cutout:

***Note:*** Only part 1 of techniques 3 & 4 happens here

In [None]:
for idx, row in tqdm(gal_star.iterrows(), total=gal_star.shape[0]):
    # Separate type and data
    raw_data = row[1:].values
    obj_id = row[:1].values

    # Run raw data through each normalization technique
    norm_1_data = norm_1(raw_data)
    norm_2_data = norm_2(raw_data)
    norm_3_data, max_val_3 = norm_3_4_part_1(raw_data, log=False)
    norm_4_data, max_val_4 = norm_3_4_part_1(raw_data, log=True)

    # For technique 3 & 4: append the max pixel value for the current cutout
    max_pixel_all_images_3.append(max_val_3)
    max_pixel_all_images_4.append(max_val_4)

    # Append values for current cutout to the corresponding list
    rows_norm_1.append(np.append(obj_id, norm_1_data))
    rows_norm_2.append(np.append(obj_id, norm_2_data))
    rows_norm_3.append(np.append(obj_id, norm_3_data))
    rows_norm_4.append(np.append(obj_id, norm_4_data))

# Build each DataFrame once (DataFrame.append was removed in pandas 2.0)
gal_star_norm_1 = pd.DataFrame(rows_norm_1)
gal_star_norm_2 = pd.DataFrame(rows_norm_2)
gal_star_norm_3 = pd.DataFrame(rows_norm_3)
gal_star_norm_4 = pd.DataFrame(rows_norm_4)

#### Save technique 1 & 2 to .csv for the classifier, as they are complete

In [None]:
gal_star_norm_1.to_csv('norm_1_image_data.csv', header=False, index=False)
gal_star_norm_2.to_csv('norm_2_image_data.csv', header=False, index=False)

#### Finish technique 3 & 4, and save to .csv for the classifier as well:

In [None]:
# Maximum min-subtracted pixel value over ALL images (one scalar per technique)
global_max_3 = np.amax(max_pixel_all_images_3)
global_max_4 = np.amax(max_pixel_all_images_4)

# Copy the part-1 results, then rescale only the pixel columns (column 0 is the label).
# This avoids DataFrame.append, which was removed in pandas 2.0.
gal_star_norm_3_final = gal_star_norm_3.copy()
gal_star_norm_4_final = gal_star_norm_4.copy()
gal_star_norm_3_final.iloc[:, 1:] = norm_3_4_part_2(gal_star_norm_3.iloc[:, 1:].values, global_max_3)
gal_star_norm_4_final.iloc[:, 1:] = norm_3_4_part_2(gal_star_norm_4.iloc[:, 1:].values, global_max_4)

gal_star_norm_3_final.to_csv('norm_3_image_data.csv', header=False, index=False)
gal_star_norm_4_final.to_csv('norm_4_image_data.csv', header=False, index=False)

### Plot the first 5 objects to visualize what each normalization method is doing

In [None]:
def plot_image(raw, norm1, norm2, norm3, norm4, obj_id):
    # Truth labels: 0 = star, 1 = galaxy
    object_name = 'Star:' if obj_id == 0 else 'Galaxy:'
    fig, axes = plt.subplots(nrows=1, ncols=5, sharex=True, sharey=True, figsize=(15, 10))

    # One panel per version of the image, each with a matching title
    images = [raw, norm1, norm2, norm3, norm4]
    names = ['Raw Data', 'Norm 1', 'Norm 2', 'Norm 3', 'Norm 4']
    for ax, img, name in zip(axes, images, names):
        ax.imshow(img, cmap='gray')
        ax.set_title(object_name + ' ' + name)
    plt.show()

In [None]:
for idx in range(5):
    obj_id = gal_star[0][idx]
    # The *_img names avoid overwriting the norm_1 and norm_2 functions defined above
    raw_img = np.reshape(gal_star.loc[idx][1:].values, (20, 20))
    norm_1_img = np.reshape(gal_star_norm_1.loc[idx][1:].values, (20, 20))
    norm_2_img = np.reshape(gal_star_norm_2.loc[idx][1:].values, (20, 20))
    norm_3_img = np.reshape(gal_star_norm_3_final.loc[idx][1:].values, (20, 20))
    norm_4_img = np.reshape(gal_star_norm_4_final.loc[idx][1:].values, (20, 20))
    plot_image(raw_img, norm_1_img, norm_2_img, norm_3_img, norm_4_img, obj_id)