References and Credit:

https://www.kaggle.com/awsaf49/seti-bl-tf-starter-tpu


![](https://drive.google.com/uc?id=1rkibSC447DCaBvjbwAOvD1gvg8mhOy3g)

<center><h1>Problem Statement 📝</h1></center>

###  🎯Goal: 
>**The goal of this competition is to identify the anomalous signals in the scans of Breakthrough Listen targets**

###  🎯Description: 
<div class="alert alert-block alert-warning">   

<strong>Green Bank Telescope :</strong>

📌 The Robert C. Byrd Green Bank Telescope, or GBT, is the world’s premier single-dish radio telescope operating at meter to millimeter wavelengths. Its enormous 100-meter diameter collecting area, its unblocked aperture, and its excellent surface accuracy provide unprecedented sensitivity across the telescope’s full 0.1 – 116 GHz (3.0m – 2.6mm) operating range.
    
<br>

📌 The single focal plane is ideal for rapid, wide-field imaging systems – cameras. Because the GBT has access to 85% of the celestial sphere, it serves as the wide-field imaging complement to ALMA and the EVLA. Its operation is highly efficient, and it is used for observations about 6500 hours every year, with 2000-3000 hours per year available to high frequency science.
    
<br>

![](https://drive.google.com/uc?id=1d3ZXLr5yKUseO_aEICWozdF3OT9HcW5P)
    
<br>
    
<strong> About Breakthrough Listen project: </strong>

📌 It is an initiative to find signs of intelligent life in the universe. This 100-million-dollar project was started in 2015 by the famous cosmologist Stephen Hawking and Yuri Milner (an Internet investor). Its teams are spread all over the world. It is a ten-year project in which around 1,000,000 stars near Earth will be surveyed and the whole galactic plane of the Milky Way scanned.
    
<br>

</div>

**This problem is as interesting as debating the existence of a superpower or God. Aamir Khan's PK, released in 2014, is one of my favourite Bollywood movies. The film follows an alien who comes to Earth on a research mission but loses his remote to a thief, who later sells it to a godman.**

![](https://drive.google.com/uc?id=1jmm2VSVq7zkJuY1EJcKgDRVdAtql9XpG)



<center><h1>Data Description 🤿 </h1></center>

> - ```train/``` - a training set of cadence snippet files stored in numpy float16 format (v1.20.1), one file per cadence snippet id, with corresponding labels found in the train_labels.csv file. Each file has dimension (6, 273, 256), with the 1st dimension representing the 6 positions of the cadence, and the 2nd and 3rd dimensions representing the 2D spectrogram.
> - ```test/``` - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition
> - ```sample_submission.csv``` - a sample submission file in the correct format
> - ```train_labels.csv``` - targets corresponding (by id) to the cadence snippet files found in the train/ folder


<center><h1>Understanding the Data </h1></center>
<div class="alert alert-block alert-info">  

Breakthrough Listen generates spectrograms that are stored as either filterbank-format or HDF5-format files.
<br>

📌 <strong> FilterBank Format :</strong>
<br>
2D arrays of intensity as a function of frequency and time, accompanied by headers containing metadata such as the direction the telescope was pointed in and the frequency scale.
<br>
📌 <strong> Snippets :</strong>
<br>
NumPy arrays consisting of small regions of the spectrograms.
<br>
📌 <strong> Technosignatures :</strong>
<br>
A technosignature or technomarker is any measurable property or effect that provides scientific evidence of past or present technology. Technosignatures are analogous to the biosignatures that signal the presence of life, whether or not intelligent.
<br>
📌 <strong> Cadence Snippets :</strong>
<br>
Candidate technosignatures are isolated from radio frequency interference (RFI) by alternating observations of the primary target star “A” with observations of three nearby stars “B”, “C” and “D”: 5 minutes on star “A”, then 5 minutes on star “B”, then back to star “A” for 5 minutes, then “C”, then back to “A”, then finishing with 5 minutes on star “D”. One set of six observations (ABACAD) is referred to as a “cadence”.
<br>
📌 <strong> Haystack :</strong>
<br>
Thousands of cadence snippets put together are referred to as the haystack.
<br>
📌 <strong> Needle Signal /Target :</strong>
<br>
The goal of this problem is to identify the hidden needle signals in this haystack.

</div>
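The ABACAD cadence described above maps onto the first dimension of each (6, 273, 256) snippet. Below is a minimal sketch of which indices would hold the on-target and off-target scans. It assumes the six positions are stored in observation order, which the data description does not state explicitly, and `split_cadence_positions` is a hypothetical helper, not part of the competition code.

```python
# Hypothetical helper: split the six cadence positions into on-target ("A")
# and off-target ("B", "C", "D") indices, ASSUMING ABACAD storage order.
CADENCE = "ABACAD"

def split_cadence_positions(cadence=CADENCE, target="A"):
    # Indices where the telescope pointed at the primary target
    on_target = [i for i, star in enumerate(cadence) if star == target]
    # Indices where the telescope pointed at a nearby comparison star
    off_target = [i for i, star in enumerate(cadence) if star != target]
    return on_target, off_target

on_idx, off_idx = split_cadence_positions()
print(on_idx)   # [0, 2, 4]
print(off_idx)  # [1, 3, 5]
```

Under that assumption, a signal that shows up only in positions 0, 2 and 4 would be a candidate needle, while a signal present in all six positions is more likely to be RFI.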

<center><h1>Import Libraries 📚</h1></center>

In [None]:
! pip install -q efficientnet

In [None]:
import re
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from functools import partial
import efficientnet.tfkeras as efn
from kaggle_datasets import KaggleDatasets
from sklearn.model_selection import train_test_split

# EDA 📊

<center><h1>Coming Soon </h1></center>

In [None]:
# TPU or GPU detection
# Detect hardware, return appropriate distribution strategy
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f'Running on TPU {tpu.master()}')
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
GCS_PATH = KaggleDatasets().get_gcs_path('setibl-256x256-tfrec-dataset')
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
IMAGE_SIZE = [256, 256]

In [None]:
files_train = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/train*.tfrec')))
TEST_FILENAMES  = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')))

In [None]:
TRAINING_FILENAMES, VALID_FILENAMES = train_test_split(
    files_train,
    test_size=0.2, random_state=42
)

In [None]:
print('Train TFRecord Files:', len(TRAINING_FILENAMES))
print('Validation TFRecord Files:', len(VALID_FILENAMES))
print('Test TFRecord Files:', len(TEST_FILENAMES))

In [None]:
def read_train_tfrecord(example):
    tfrec_format = {
        'image'                        : tf.io.FixedLenFeature([], tf.string),
        'image_id'                     : tf.io.FixedLenFeature([], tf.string),
        'target'                       : tf.io.FixedLenFeature([], tf.int64)
    }           
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['target']


def read_test_tfrecord(example, return_image_id):
    tfrec_format = {
        'image'                        : tf.io.FixedLenFeature([], tf.string),
        'image_id'                     : tf.io.FixedLenFeature([], tf.string),
    }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['image_id'] if return_image_id else 0

 
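def prepare_image(img, augment=True, dim=IMAGE_SIZE[0]):
    # Decode the PNG bytes and scale pixel values to [0, 1]
    img = tf.image.decode_png(img, channels=3)
    img = tf.cast(img, tf.float32) / 255.0

    if augment:
        # Random horizontal flip is the only augmentation applied
        img = tf.image.random_flip_left_right(img)

    # Give the tensor a static shape of (dim, dim, 3)
    img = tf.reshape(img, [dim, dim, 3])
    return img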

def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
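`count_data_items` relies on each TFRecord file name carrying its record count. The sketch below shows what the regex extracts, using only the standard library. The shard names are hypothetical and only illustrate the `-<count>.` naming pattern that the function expects.

```python
import re

# Same pattern as count_data_items: the first run of digits found
# between a "-" and a "." in the file name.
COUNT_PATTERN = re.compile(r"-([0-9]*)\.")

def items_in_file(filename):
    # Returns the record count encoded in the file name
    return int(COUNT_PATTERN.search(filename).group(1))

# Hypothetical shard names, for illustration only
example_files = ["train00-2000.tfrec", "train01-1500.tfrec"]
total = sum(items_in_file(f) for f in example_files)
print(total)  # 3500
```

A file name without a `-<digits>.` segment makes `search` return `None`, so the call fails with an `AttributeError`. The count is therefore only as reliable as the naming convention of the dataset.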

In [None]:
      
def load_dataset(filenames, ordered=False, augment=True, labeled=True, return_image_ids=True):
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
    dataset = dataset.with_options(ignore_order)
    if labeled: 
        dataset = dataset.map(read_train_tfrecord, num_parallel_calls=AUTO)
    else:
        dataset = dataset.map(lambda example: read_test_tfrecord(example, return_image_ids), 
                    num_parallel_calls=AUTO) 
    dataset = dataset.map(lambda img, imgid_or_label: (prepare_image(img, augment=augment, dim=IMAGE_SIZE[0]), 
                                               imgid_or_label), 
                num_parallel_calls=AUTO)
    return dataset
        

def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES, ordered=False, augment = True,labeled=True)
    dataset = dataset.repeat()
    dataset = dataset.shuffle(128, seed = 0)
    dataset = dataset.batch(BATCH_SIZE,drop_remainder=True)
    dataset = dataset.prefetch(AUTO)
    return dataset

def get_validation_dataset(ordered=True):
    dataset = load_dataset(VALID_FILENAMES, ordered=ordered, augment = False,labeled=True)
    dataset = dataset.batch(BATCH_SIZE,drop_remainder=True)
    #dataset = dataset.cache()
    dataset = dataset.prefetch(AUTO)
    return dataset

def get_test_dataset(ordered=False):
    # Test data is unlabeled: parse it with read_test_tfrecord to get image ids
    dataset = load_dataset(TEST_FILENAMES, ordered=ordered, augment=False, labeled=False, return_image_ids=True)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
NUM_TRAINING_IMAGES = count_data_items(TRAINING_FILENAMES)
NUM_VALIDATION_IMAGES = count_data_items(VALID_FILENAMES)
NUM_TEST_IMAGES = count_data_items(TEST_FILENAMES)
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
print(
    'Dataset: {} training images, {} validation images, {} unlabeled test images'.format(
        NUM_TRAINING_IMAGES, NUM_VALIDATION_IMAGES, NUM_TEST_IMAGES
    )
)
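The relationship between the global batch size and `STEPS_PER_EPOCH` can be checked with plain arithmetic. This is a rough sketch: the image count is a made-up number, and 8 replicas is an assumption (e.g. the 8 cores of a TPU v3-8). On a single GPU or CPU the replica count would be 1.

```python
def steps_per_epoch(num_images, per_replica_batch=8, num_replicas=8):
    # Global batch size scales with the number of replicas, as BATCH_SIZE does
    global_batch = per_replica_batch * num_replicas
    # Floor division drops the final partial batch
    return global_batch, num_images // global_batch

# Hypothetical image count, for illustration only
batch, steps = steps_per_epoch(num_images=40_010)
print(batch, steps)  # 64 625
```

The 10 leftover images in this example are not seen in the first pass. Because the training dataset repeats, they are picked up in later steps.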

In [None]:
def build_lrfn(lr_start=0.00001, lr_max=0.000075, lr_min=0.000001, lr_rampup_epochs=20, lr_sustain_epochs=0, lr_exp_decay=.8):
    lr_max = lr_max * strategy.num_replicas_in_sync
    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_exp_decay ** (epoch - lr_rampup_epochs - lr_sustain_epochs) + lr_min
        return lr
    return lrfn
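To see what this schedule produces, the sketch below reimplements it in plain Python with the replica count passed in as a parameter instead of being read from `strategy`. The name `build_lrfn_sketch` and the default of 8 replicas are assumptions made for illustration.

```python
def build_lrfn_sketch(num_replicas=8, lr_start=0.00001, lr_max=0.000075,
                      lr_min=0.000001, lr_rampup_epochs=20,
                      lr_sustain_epochs=0, lr_exp_decay=0.8):
    # Peak learning rate is scaled linearly with the number of replicas
    lr_max = lr_max * num_replicas

    def schedule(epoch):
        if epoch < lr_rampup_epochs:
            # Linear warm-up from lr_start to lr_max
            return (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        if epoch < lr_rampup_epochs + lr_sustain_epochs:
            # Optional plateau at the peak
            return lr_max
        # Exponential decay towards lr_min
        decay_epochs = epoch - lr_rampup_epochs - lr_sustain_epochs
        return (lr_max - lr_min) * lr_exp_decay ** decay_epochs + lr_min

    return schedule

schedule = build_lrfn_sketch()
for epoch in (0, 10, 20, 25):
    print(epoch, f"{schedule(epoch):.6f}")
```

With these defaults the rate climbs from `1e-5` at epoch 0 to `6e-4` at epoch 20, then decays by a factor of 0.8 per epoch. When training runs for a single epoch, only `schedule(0)` is ever used, so the model trains at `lr_start`.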

In [None]:
with strategy.scope():
    
    train_dataset = get_training_dataset()
    valid_dataset = get_validation_dataset()
    
    model = tf.keras.Sequential([
        efn.EfficientNetB6(
            input_shape=(256,256,3),
            weights='imagenet',
            include_top=False
        ),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(512, activation= 'relu'), 
        tf.keras.layers.Dropout(0.25), 
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    model.compile(
        optimizer='adam',
        loss = 'binary_crossentropy',
        metrics=['accuracy']
    )

model.summary()

In [None]:
lrfn = build_lrfn()
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
VALID_STEPS = NUM_VALIDATION_IMAGES // BATCH_SIZE


history = model.fit(
    train_dataset, epochs=1,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_data=valid_dataset,
    validation_steps=VALID_STEPS,
    callbacks=[
        tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=1),
        tf.keras.callbacks.ModelCheckpoint(
            "./model.h5",
            # Keras logs the training loss as 'loss'; use 'val_loss' to track validation instead
            monitor='loss', verbose=0,
            save_best_only=True, save_weights_only=False,
            mode='auto', save_freq='epoch'
        )
    ]
)

<h1> Work in progress 🚧 </h1>