# Hunting Exoplanets In Space - Exploring The DataFrame

The search for exoplanets rests on deep physical and mathematical theories.
For now, we only need the basic principle behind these theories to understand how exoplanets are found.

---



### Finding Exoplanets Principle

Imagine that you are in your room during the daytime with the window curtains open. The room would probably be well lit. Now, imagine that you close the curtains and block the sunlight from entering the room. The room becomes darker and visibility drops.

So, whenever the curtains are open, the brightness of the light is higher, and when the curtains are closed, the brightness is lower. We can measure the brightness of the light using a light meter (a photometer).

The same principle is applied in searching for an exoplanet. There are billions of galaxies in the universe, and each of them contains anywhere from millions to hundreds of billions of stars. One such galaxy is the Milky Way, in which our solar system exists. The solar system has a star called the Sun. In astronomy, a star is a celestial body that emits its own light. There are 8 planets in our solar system orbiting the Sun. Similarly, other stars in our galaxy can have planets revolving around them.

In 2009, NASA launched the Kepler space telescope. This telescope was used to measure the brightness of distant stars in the Milky Way.


<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/kepler-space-telescope.jpg' width="800">

Whenever a planet, while orbiting its star, passes between the telescope and the star, it blocks a small part of the star's light, so the brightness recorded by the telescope drops. Once the planet moves out of the line of sight, the recorded brightness returns to its normal, higher level.

This method of detecting exoplanets around distant stars through dips in the brightness of the light emitted by a star is called the **Transit Method**. You can read about it by clicking on the link provided in the **Activities** section under the title **How Do Astronomers Find Exoplanets?**

Essentially, if you plot the brightness on the vertical axis and time on the horizontal axis, you will see that the brightness of the star recorded by the telescope dips and recovers periodically. Thus, in the graph, you will notice a repeating pattern. Such regular dips are a strong sign that the star has at least one planet.

<img src = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/transit-method-gif.gif' width='800'>
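The periodic dips described above can be imitated with synthetic data. The following is a minimal sketch, assuming an idealised box-shaped transit; the function name `simulate_light_curve` and every parameter value are illustrative assumptions, not Kepler measurements.

```python
import numpy as np

def simulate_light_curve(n_points=1000, period=200, first_transit=100,
                         duration=20, depth=0.01):
    """Return (time, flux) for a star with an idealised periodic transit.

    Every default value here is an illustrative assumption.
    """
    time = np.arange(n_points)
    flux = np.ones(n_points)  # baseline brightness, normalised to 1.0
    # True for the time steps during which the planet is in front of the star
    in_transit = ((time - first_transit) % period) < duration
    flux[in_transit] -= depth  # the planet blocks a fraction `depth` of the light
    return time, flux

time, flux = simulate_light_curve()

# Count the dips by counting transitions from "not dimmed" to "dimmed"
dimmed = (flux < 1.0).astype(int)
n_dips = int(np.sum(np.diff(dimmed) == 1))

print(f"Baseline flux: {flux.max():.2f}")
print(f"In-transit flux: {flux.min():.2f}")
print(f"Number of dips: {n_dips}")
```

Plotting `flux` against `time` would show a flat line interrupted by evenly spaced dips, which is the signature the transit method looks for.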

The image below shows the brightness curves for some of the exoplanets (Kepler 4b to Kepler 8b) discovered by the Kepler space telescope. For each planet, you can see how the brightness of its star changes over time. The Flux values on the vertical axis represent the brightness level of the star.

<img src = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/transit-method.jpg' width='800'>


As you can see in the image above, the bigger the planet (Kepler 6b), the deeper the dip in the brightness level. And the longer the orbital period of a planet, the broader the dip (Kepler 7b). Kepler 7b has the longest orbital period among these 5 planets, at 4.9 days.

So, this is how NASA finds a planet beyond our solar system. Now, let's load the Kepler space telescope dataset into a Pandas DataFrame in order to find out which stars beyond our solar system have a planet.
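The link between planet size and dip depth can be made concrete with a rough calculation. This sketch assumes the common geometric approximation that the fraction of light blocked equals the ratio of the planet's disc area to the star's disc area; it ignores effects such as limb darkening. The function name `transit_depth` and the radius constants are assumptions added for illustration.

```python
def transit_depth(planet_radius_km, star_radius_km):
    """Fraction of starlight blocked, approximated as the ratio of disc areas."""
    return (planet_radius_km / star_radius_km) ** 2

# Approximate radii, used only for illustration
SUN_RADIUS_KM = 695_700
JUPITER_RADIUS_KM = 69_911
EARTH_RADIUS_KM = 6_371

jupiter_depth = transit_depth(JUPITER_RADIUS_KM, SUN_RADIUS_KM)
earth_depth = transit_depth(EARTH_RADIUS_KM, SUN_RADIUS_KM)

print(f"Jupiter-sized planet: {jupiter_depth:.4%} dip")
print(f"Earth-sized planet:   {earth_depth:.4%} dip")
```

Under this approximation, a Jupiter-sized planet dims a Sun-like star by roughly 1%, while an Earth-sized planet causes a dip more than a hundred times smaller, which is why larger planets are easier to spot.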

---

#### Loading The Datasets

The datasets are too large to upload here. Use the links below to download them.

Dataset links:

1. Train dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset
   
   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv
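As an alternative to downloading the files first, `pd.read_csv` also accepts a URL. The sketch below wraps the call in a small helper; the name `load_exo_csv` and the `LABEL` check are assumptions added for illustration, and a tiny in-memory sample with made-up values stands in for the real file so the example runs without a network connection.

```python
import io
import pandas as pd

# URLs copied from the dataset links above
TRAIN_URL = "https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv"
TEST_URL = "https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv"

def load_exo_csv(source):
    """Read a flux CSV from a local path, a URL, or a file-like object."""
    df = pd.read_csv(source)
    if "LABEL" not in df.columns:
        raise ValueError("Expected a LABEL column in the dataset")
    return df

# Made-up two-row sample with the same column naming as the real files
sample_csv = io.StringIO("LABEL,FLUX.1,FLUX.2\n2,1.5,-0.5\n1,0.25,0.75\n")
sample_df = load_exo_csv(sample_csv)
print(sample_df.shape)

# With network access, the real files could be loaded the same way:
# exo_train_df = load_exo_csv(TRAIN_URL)
# exo_test_df = load_exo_csv(TEST_URL)
```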

In [3]:
# TRAINING FILE

import pandas as pd
exo_train_df = pd.read_csv('exoTrain.csv')
print(exo_train_df.shape)
exo_train_df.head()


(5087, 3198)


Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


In [4]:
# TESTING FILE

import pandas as pd
exo_test_df = pd.read_csv('exoTest.csv')
print(exo_test_df.shape)
exo_test_df.head()

(570, 3198)


Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,...,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,2,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,...,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,2,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,2,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


In [5]:
# DIMENSIONS OF THE DATASETS

print(f"Shape of Training set: {exo_train_df.shape}")
print(f"Shape of Test set: {exo_test_df.shape}")

Shape of Training set: (5087, 3198)
Shape of Test set: (570, 3198)


---

#### Slicing A DataFrame Using The `iloc[]` Indexer

**Syntax:**

`dataframe_name.iloc[row_position_start : row_position_end, column_position_start : column_position_end]`

In this syntax:

- `row_position_start` is the position of the first row to include in the new Pandas Series or DataFrame.
- `row_position_end` is the position at which the row selection stops. The row at this position is **not** included.
- `column_position_start` is the position of the first column to include in the new Pandas Series or DataFrame.
- `column_position_end` is the position at which the column selection stops. The column at this position is **not** included.

Positions start at 0. You can verify manually that a slice contains the expected values by comparing it with the first 5 rows of the DataFrame shown by the `head()` function.
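The slicing syntax above can be tried on a small DataFrame. This is a minimal sketch; `demo_df` and its values are made up for illustration and only mimic the column naming of the Kepler data.

```python
import pandas as pd

# Small made-up DataFrame with the same column naming as the Kepler data
demo_df = pd.DataFrame({
    "LABEL": [2, 2, 1, 1],
    "FLUX.1": [1.5, -0.5, 10.0, 20.0],
    "FLUX.2": [2.5, -1.5, 11.0, 21.0],
    "FLUX.3": [3.5, -2.5, 12.0, 22.0],
})

first_two_rows = demo_df.iloc[0:2, :]    # rows 0 and 1, all columns
label_and_flux1 = demo_df.iloc[:, 0:2]   # all rows, columns 0 and 1
first_star_flux = demo_df.iloc[0, 1:]    # one row, FLUX columns only -> Series

print(first_two_rows.shape)
print(list(label_and_flux1.columns))
print(type(first_star_flux).__name__)
```

Note that `0:2` selects positions 0 and 1 only, because the end position is excluded, and that selecting a single row position returns a Series rather than a DataFrame.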

In [6]:
# DATAFRAME INFO
# info() prints its report itself and returns None, so it is called on its own line.

print("Training info:")
exo_train_df.iloc[:, 0:2].info()
print("\nTest info:")
exo_test_df.info()


Training info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5087 entries, 0 to 5086
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   LABEL   5087 non-null   int64  
 1   FLUX.1  5087 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 79.6 KB

Test info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Columns: 3198 entries, LABEL to FLUX.3197
dtypes: float64(3197), int64(1)
memory usage: 13.9 MB


---

#### Check For The Missing Values

In [7]:
print(exo_train_df.isna().sum())
print(exo_test_df.isna().sum())

LABEL        0
FLUX.1       0
FLUX.2       0
FLUX.3       0
FLUX.4       0
            ..
FLUX.3193    0
FLUX.3194    0
FLUX.3195    0
FLUX.3196    0
FLUX.3197    0
Length: 3198, dtype: int64
LABEL        0
FLUX.1       0
FLUX.2       0
FLUX.3       0
FLUX.4       0
            ..
FLUX.3193    0
FLUX.3194    0
FLUX.3195    0
FLUX.3196    0
FLUX.3197    0
Length: 3198, dtype: int64


There are no missing values in either DataFrame.
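Because the per-column output above is truncated, a single total can be easier to read. The sketch below is one possible way to do this; the helper name `count_missing` and the two small DataFrames are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

def count_missing(df):
    """Return the total number of missing cells in a DataFrame."""
    # First sum() counts per column, second sum() adds the columns together
    return int(df.isna().sum().sum())

# Made-up examples: one complete DataFrame and one with a single gap
complete_df = pd.DataFrame({"LABEL": [1, 2], "FLUX.1": [0.5, -1.2]})
gappy_df = pd.DataFrame({"LABEL": [1, 2], "FLUX.1": [0.5, np.nan]})

print(count_missing(complete_df))
print(count_missing(gappy_df))
```

Applied to `exo_train_df` and `exo_test_df`, a result of 0 would confirm the statement above for all 3198 columns at once.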

In [8]:
# Columns in the Training and Testing DataFrames.

print(exo_train_df.columns,"\n","=="*50,"\n")
print(exo_test_df.columns)

Index(['LABEL', 'FLUX.1', 'FLUX.2', 'FLUX.3', 'FLUX.4', 'FLUX.5', 'FLUX.6',
       'FLUX.7', 'FLUX.8', 'FLUX.9',
       ...
       'FLUX.3188', 'FLUX.3189', 'FLUX.3190', 'FLUX.3191', 'FLUX.3192',
       'FLUX.3193', 'FLUX.3194', 'FLUX.3195', 'FLUX.3196', 'FLUX.3197'],
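      dtype='object', length=3198) 
 ==================================================================================================== 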

Index(['LABEL', 'FLUX.1', 'FLUX.2', 'FLUX.3', 'FLUX.4', 'FLUX.5', 'FLUX.6',
       'FLUX.7', 'FLUX.8', 'FLUX.9',
       ...
       'FLUX.3188', 'FLUX.3189', 'FLUX.3190', 'FLUX.3191', 'FLUX.3192',
       'FLUX.3193', 'FLUX.3194', 'FLUX.3195', 'FLUX.3196', 'FLUX.3197'],
      dtype='object', length=3198)


---

In [12]:
print(f"Value counts Training: {exo_train_df.iloc[:,0].value_counts()}\n")
print(f"Value counts Testing: {exo_test_df.iloc[:,0].value_counts()}")

Value counts Training: LABEL
1    5050
2      37
Name: count, dtype: int64

Value counts Testing: LABEL
1    565
2      5
Name: count, dtype: int64


#### Explanation

This code prints the distribution of class labels in the training and testing datasets.

- **Label `1`** → No planet detected.
- **Label `2`** → At least one planet detected.

The classes are heavily imbalanced: only 37 of the 5087 training stars (about 0.73%) and 5 of the 570 test stars (about 0.88%) carry label `2`. Knowing this distribution matters for any later analysis of the data.
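The imbalance can also be expressed as proportions. This is a minimal sketch that rebuilds the training labels from the counts printed above rather than reading the CSV; on the real data, the same `value_counts(normalize=True)` call could be applied to `exo_train_df['LABEL']`.

```python
import pandas as pd

# Class counts taken from the training output above
train_labels = pd.Series([1] * 5050 + [2] * 37, name="LABEL")

# Share of each class as a fraction of all rows
class_share = train_labels.value_counts(normalize=True)

# How many label-1 stars there are for every label-2 star
imbalance_ratio = (train_labels == 1).sum() / (train_labels == 2).sum()

print(class_share.round(4))
print(f"Label 1 to label 2 ratio: {imbalance_ratio:.1f} : 1")
```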


---