# Read _.fits_ data
---






In [1]:
DATA_DIR = './database/dataset_exoplanets_confirmed'

Our database, stored in `DATA_DIR`, can be found on the `CoRoT IAS Archive` website or in my [Google Drive folder](https://drive.google.com/drive/folders/1xMlUVc8K5BRd8663Q-UdfkTworL4vDIK?usp=sharing/), which for now is structured as follows:

```text
./database
   │
   └── dataset_exoplanets_confirmed
         ├── ..._.fits
         ├── ..._.fits
         ⋮       ⋮
         └── ..._n.fits
  
```

The folder `database/dataset_exoplanets_confirmed` contains all the _.fits_ files of _CoRoT targets with confirmed exoplanets_, obtained from the `CoRoT IAS Archive`.

The main purpose of reading the _.fits_ data set is to turn it into a pandas DataFrame. For this, we use the `Astropy==4.2` library.

In [2]:
from astropy.io import fits

image_file = fits.open(DATA_DIR + '/EN2_STAR_CHR_0101086161_20070516T060226_20071005T074409.fits')

print(type(image_file), "\n")
print(len(image_file))

<class 'astropy.io.fits.hdu.hdulist.HDUList'> 

4


Now the variable `image_file` holds all the information stored in the _.fits_ file.

To understand the data better, it helps to know how this file type works. The built-in `type` function shows that `image_file` is an instance of the `HDUList` class.

> A short aside on FITS: the Flexible Image Transport System (FITS) is a portable file format used to store, transmit and process data formatted as multi-dimensional arrays or tables. It is widely used in the astronomy community to store images and tables. We use the software _[QFits View](https://www.mpe.mpg.de/~ott/QFitsView/)_ to open these files on our machine and, as soon as a file opens, it asks us to choose an extension.

So _.fits_ files can have several extensions; as the `len` call shows, this CoRoT file has only 4 entries. We are interested in the most informative extension with the lowest computational cost, since we do not want to deal with big-data issues, and the third table (index 3) is the most attractive one.


In [3]:
scidata = image_file[3].data

print(type(scidata), "\n")
print(scidata)

<class 'astropy.io.fits.fitsrec.FITS_rec'> 

[(54236.75758185, 112626.77,   0) (54236.76350826, 112605.61,   0)
 (54236.76943468, 112771.5 ,   8) ... (54378.80910033, 112496.13, 256)
 (54378.81502574, 112344.83,   0) (54378.82095114, 112318.5 ,  80)]


The variable `scidata` contains the information of the third table. It belongs to the `FITS_rec` class and, to manipulate it, we transform it into an array using the `NumPy==1.19.5` library and then plot the raw data using `Plotly==4.4.1`.

> At this point, we come to a problem worth commenting on. When the original data is converted into an array with `x = np.array(scidata)`, plotting it with the *Plotly* library raises the following error: <br /><br /> **ValueError: Big-endian buffer not supported on little-endian compiler** <br /><br /> This error occurs when the data has a different byte order than the machine on which we are running *Python* (FITS files store their data in big-endian order).
> To deal with this problem, we must convert the NumPy array to the byte order of the native system before transforming it into a DataFrame or Series. <br /> <br />Therefore, we use the methods *byteswap()* and *newbyteorder()*. <br /><br />Reference: [Pandas Documentation - Byte-ordering issues](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#byte-ordering-issues)

In [4]:
import numpy as np

data_aux = np.array(scidata).byteswap().newbyteorder()
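
The byte-order conversion can also be tried on a small array without opening a FITS file. The sketch below builds a hypothetical big-endian structured array (the `big` and `native` names and the sample values are illustrative only) and converts it with `astype()` and `dtype.newbyteorder('=')`. This is an alternative to the `byteswap().newbyteorder()` chain used above, which relies on `ndarray.newbyteorder()`, a method removed in NumPy 2.0.

```python
import numpy as np

# Hypothetical stand-in for a FITS table: a structured array stored big-endian
big = np.array(
    [(54236.75758185, 112626.77, 0), (54236.76350826, 112605.61, 0)],
    dtype=[('DATEBARTT', '>f8'), ('WHITEFLUXSYS', '>f4'), ('STATUSSYS', '>i4')],
)

# astype() with a native-order dtype rewrites the bytes and keeps the values
native = big.astype(big.dtype.newbyteorder('='))

# '>' means big-endian, '=' means native order
print(big.dtype['DATEBARTT'].byteorder)
print(native.dtype['DATEBARTT'].byteorder)
print(native['DATEBARTT'])
```

The resulting array can be passed to `pd.DataFrame` in the same way as `data_aux`.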

In [5]:
import plotly.express as px

fig = px.line(data_aux, x='DATEBARTT', y='WHITEFLUXSYS', title='Raw Light Curve')
fig.show()

# Knowing and fixing the header
---
With the byte order fixed, the data stored in `data_aux` can be transformed into a `Pandas==1.1.5` DataFrame. We can now look at what `DATEBARTT`, `WHITEFLUXSYS` and `STATUSSYS` mean.

[Reference: Part II.4 The “ready to use” CoRoT data from The CoRoT Legacy Book](http://idoc-corot.ias.u-psud.fr/sitools/common/html/doc/cII_4_data.pdf)

In [6]:
import pandas as pd

data = pd.DataFrame(data_aux)
data.head()

| | DATEBARTT | WHITEFLUXSYS | STATUSSYS |
|---|---|---|---|
| 0 | 54236.757582 | 112626.773438 | 0 |
| 1 | 54236.763508 | 112605.609375 | 0 |
| 2 | 54236.769435 | 112771.5 | 8 |
| 3 | 54236.775361 | 113113.601562 | 0 |
| 4 | 54236.781288 | 112621.789062 | 256 |


## Quick data analysis 

We will do a quick analysis of our light curves, aiming to prepare them for the filtering techniques and the future Machine Learning model.

The `shape` attribute shows that, in this sample, there are 23951 rows and 3 columns.

In [7]:
(row, columns) = data.shape
print(row, columns)

23951 3


`isnull().values.any()` checks whether there is any missing (NaN, Not a Number) value in our data. As it returned _False_, there are no missing values.

Note. Most Machine Learning models cannot work with NaN values.

In [8]:
data.isnull().values.any()

False
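
To see what the same check reports when a value is missing, here is a minimal sketch on a hypothetical three-row fragment (the `sample` DataFrame and its `NaN` entry are invented for illustration). It also counts the missing cells per column and drops the affected rows, which is one possible way to handle them.

```python
import numpy as np
import pandas as pd

# Hypothetical light-curve fragment with one missing flux value
sample = pd.DataFrame({
    'DATEBARTT': [54236.757582, 54236.763508, 54236.769435],
    'WHITEFLUXSYS': [112626.77, np.nan, 112771.50],
})

has_missing = sample.isnull().values.any()   # True if any cell is missing
missing_per_column = sample.isnull().sum()   # number of missing cells per column
cleaned = sample.dropna()                    # keep only the complete rows

print(has_missing)
print(missing_per_column)
print(len(cleaned))
```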

The `describe()` method returns a DataFrame with the statistical summary of the data. It gives us parameters such as the mean, standard deviation, minimum and maximum of each column.

In [9]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| DATEBARTT | 23951.0 | 54307.807634 | 41.011608 | 54236.757582 | 54272.313222 | 54307.818667 | 54343.325513 | 54378.820951 |
| WHITEFLUXSYS | 23951.0 | 112501.03125 | 292.86673 | 111110.515625 | 112276.492188 | 112529.890625 | 112724.828125 | 113609.890625 |
| STATUSSYS | 23951.0 | 101.434429 | 204.139657 | 0.0 | 0.0 | 0.0 | 256.0 | 1024.0 |


## STATUSSYS 

It represents the status flag as an `int`. As this information is not useful in this research, we will just discard it.
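
Although the column is dropped here, the values seen above (0, 8, 80, 256, 1024) look like sums of powers of two. The sketch below assumes, without confirmation from the CoRoT documentation, that the flag is a bit mask, and only splits each value into its set bits. It does not attach any meaning to the individual bits, and `set_bits` is a hypothetical helper.

```python
def set_bits(flag):
    """Return the powers of two that add up to a non-negative integer flag."""
    return [1 << i for i in range(flag.bit_length()) if flag & (1 << i)]

# Status values that appear in the outputs above
for value in (0, 8, 80, 256, 1024):
    print(value, set_bits(value))
```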

In [10]:
data.drop('STATUSSYS', axis=1, inplace=True)
data.head()

| | DATEBARTT | WHITEFLUXSYS |
|---|---|---|
| 0 | 54236.757582 | 112626.773438 |
| 1 | 54236.763508 | 112605.609375 |
| 2 | 54236.769435 | 112771.5 |
| 3 | 54236.775361 | 113113.601562 |
| 4 | 54236.781288 | 112621.789062 |


## WHITEFLUXSYS

It represents the amount of white light flux captured by CoRoT after the SYSTEMATIC correction.

SYSTEMATIC is a data correction procedure applied to faint-star data. It consists of the correction of residual systematic skews across the whole set of light curves of the run, applied on top of BARFILL.

BARFILL consists of the correction of jumps and the replacement of invalid and missing data using the inpainting method, applied on top of BAR.

BAR is the process that corrects aliasing, offsets, backgrounds and satellite jitter, as well as the change of the temperature set point and the loss of long-term efficiency.


## DATEBARTT

It represents the date of the measurement in the solar barycentric reference frame. It is usually measured in Julian Days.


The Julian Date or Julian Day (JD) is a method of counting days sequentially from an arbitrary date in the past: noon on January 1, 4713 BC in the Julian calendar, or November 24, 4714 BC in the Gregorian calendar. It was proposed by Joseph Justus Scaliger and is based on the Julian calendar.

Julian Days are counted continuously, without grouping them into weeks, months or years. Each day starts at noon and lasts until the next noon. This is a great advantage for the night period (when astronomical observations are made), since the date does not change during the night, unlike in the Gregorian calendar, in which the day starts at midnight and ends at the following midnight. In addition, with Julian Days it becomes much easier to measure the interval between two events, since it is only necessary to subtract one JD from the other.
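
Both ideas can be sketched with the standard library alone. The example below assumes, as the `fmt='mjd'` argument used later in this notebook does, that the values are Modified Julian Dates (MJD = JD - 2400000.5, so MJD 0 is 1858-11-17 00:00). The helper `mjd_to_datetime` is hypothetical and is not part of the `julian` package.

```python
from datetime import datetime, timedelta

MJD_EPOCH = datetime(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17 00:00

def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date (float, in days) to a datetime."""
    return MJD_EPOCH + timedelta(days=mjd)

# First and last DATEBARTT values printed earlier for this light curve
first, last = 54236.75758185, 54378.82095114

print(mjd_to_datetime(first))          # calendar date of the first sample
print(round(last - first, 4), "days")  # elapsed time is a plain subtraction
```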



With that in mind, it is useful to turn `DATEBARTT` into the conventional date form (day, month and year).

So we are going to define a function that transforms a Julian date into standard time and then apply it to `data`.

Note. We are using `julian==0.14`.

In [11]:
import julian
from datetime import datetime

def julian_to_stdtime(old_date):
  """
  Convert a Julian date to a Gregorian date string.

  The value is interpreted as a Modified Julian Date (fmt='mjd').

  :param numpy.float64 old_date: Julian date as a float
  :return str: Date formatted as 'YYYY-MM-DD HH:MM:SS[.ffffff]'
  """
  # julian.from_jd already returns a datetime, so no str/strptime round trip
  # is needed (that round trip fails when the microsecond part is zero)
  return str(julian.from_jd(old_date, fmt='mjd'))

In [12]:
data.DATEBARTT = data.DATEBARTT.apply(julian_to_stdtime)

data.rename(columns={'DATEBARTT': 'DATE', 'WHITEFLUXSYS': 'WHITEFLUX'}, inplace=True)

data.head()

| | DATE | WHITEFLUX |
|---|---|---|
| 0 | 2007-05-16 18:10:55.071642 | 112626.773438 |
| 1 | 2007-05-16 18:19:27.113766 | 112605.609375 |
| 2 | 2007-05-16 18:27:59.155929 | 112771.5 |
| 3 | 2007-05-16 18:36:31.198092 | 113113.601562 |
| 4 | 2007-05-16 18:45:03.240256 | 112621.789062 |


Now, we can see our x-axis in Year/Month/Day format.

In [13]:
fig = px.line(data, x='DATE', y='WHITEFLUX', title='Light Curve in standard time format')
fig.show()

# Sampling light curve 
---
With this project, we intend to find the best filtering technique to apply to a light curve. However, we cannot apply those filters to every light curve, so it is good practice to create a sample light curve that holds the general information of all the data.

First, to analyse our time-series curves, we can look at a graph that contains the box plot of each curve.

In [14]:
# 1. How to make a box plot for one curve
# 2. Loop to do it for all of them
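
The cell above is still a to-do. As a minimal sketch of the two steps, the code below computes the numbers that a box plot draws (quartiles and outliers beyond the 1.5 × IQR fences, one common convention) and loops over several curves. The `box_stats` helper and the `curves` dictionary are hypothetical: `curve_a` reuses the five flux values shown earlier, and `curve_b` is invented.

```python
import numpy as np

def box_stats(flux):
    """Quartiles and 1.5 * IQR outliers of one light curve."""
    flux = np.asarray(flux, dtype=float)
    q1, median, q3 = np.percentile(flux, [25, 50, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker fences
    outliers = flux[(flux < low) | (flux > high)]
    return {'q1': q1, 'median': median, 'q3': q3, 'outliers': outliers.tolist()}

# Hypothetical container: one flux list per light curve
curves = {
    'curve_a': [112626.773438, 112605.609375, 112771.5, 113113.601562, 112621.789062],
    'curve_b': [1.0, 2.0, 3.0, 4.0, 5.0],
}

# Step 2: repeat the same summary for every curve
for name, flux in curves.items():
    print(name, box_stats(flux))
```

For the drawing itself, `plotly.express` offers `px.box`, which takes a DataFrame and a column name in the same way as the `px.line` calls used in this notebook.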

# Algorithms
---
Here, `fits_to_csv` is implemented for a single light curve. It summarizes all the work done up to the Section Sampling light curve, and is then applied to every file in our database.

Note. `fits_to_csv` uses the `julian_to_stdtime` function (Knowing and fixing the header - DATEBARTT).

In [15]:
# Libs already imported
from astropy.io import fits
import numpy as np
import plotly.express as px
import pandas as pd
import julian
from datetime import datetime

# Libs used for files/folder manipulation
import os
import shutil

# Lib for measuring the execution time
import time

In [16]:
def fits_to_csv(path):
  '''
  Normalize one .fits file and convert it
  into a .csv file

  :param str path: Path to the .fits file
  '''
  # Copy the table out while the file is open, then close it
  with fits.open(path) as image_file:
    data_aux = np.array(image_file[3].data).byteswap().newbyteorder()
  data = pd.DataFrame(data_aux)
  data.drop('STATUSSYS', axis=1, inplace=True)
  data.rename(columns={'DATEBARTT': 'DATE', 'WHITEFLUXSYS': 'WHITEFLUX'}, inplace=True)

  # Rows with a conversion problem; dropping them does not affect our study
  file_name = os.path.basename(path)
  if file_name == 'EN2_STAR_CHR_0101368192_20070516T060050_20071015T062306.fits':
    data.drop(19573, inplace=True)

  if file_name == 'EN2_STAR_MON_0630831435_20110708T151253_20110930T044950.fits':
    data.drop(2521, inplace=True)

  data.DATE = data.DATE.apply(julian_to_stdtime)

  # Creating folder with .csv files
  CSV_DIR = 'csv_files'
  os.makedirs(CSV_DIR, exist_ok=True)

  # Write the .csv file, named after the .fits file, into the .csv folder
  name = os.path.splitext(file_name)[0] + '.csv'
  data.to_csv(os.path.join(CSV_DIR, name), index=False)

Applying *fits_to_csv()* to `./database`.

In [17]:
my_dir = DATA_DIR
t_o = time.time()

# Walk the data folder and convert every file found in it
for root_dir_path, sub_dirs, files in os.walk(my_dir):
  for file_name in files:
    fits_to_csv(os.path.join(root_dir_path, file_name))

t_f = time.time()

print("It takes:", round(t_f-t_o, 2), "seconds to apply")

It takes: 13.96 seconds to apply
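
The folder walk and the file naming can be checked without any FITS data. The sketch below uses a temporary folder with empty placeholder files; `csv_name_for` and `list_fits` are hypothetical helpers, and the `.fits` filter is an extra safeguard that the loop above does not apply.

```python
import os
import tempfile

def csv_name_for(fits_path):
    """Return the .csv file name that matches a .fits path."""
    stem, _ = os.path.splitext(os.path.basename(fits_path))
    return stem + '.csv'

def list_fits(folder):
    """Collect the full path of every .fits file below folder, sorted."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for file_name in files:
            if file_name.endswith('.fits'):
                found.append(os.path.join(root, file_name))
    return sorted(found)

# Temporary folder with two placeholder .fits files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ('a.fits', 'b.fits', 'notes.txt'):
        open(os.path.join(tmp, file_name), 'w').close()
    names = [csv_name_for(p) for p in list_fits(tmp)]

print(names)
```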


Zipping *csv_files* folder

In [18]:
!zip -r /content/csv_files.zip /content/csv_files

  adding: content/csv_files/ (stored 0%)
  adding: content/csv_files/EN2_STAR_CHR_0315239728_20100305T001525_20100329T065610.csv (deflated 67%)
  adding: content/csv_files/EN2_STAR_CHR_0315198039_20100305T001525_20100329T065610.csv (deflated 67%)
  adding: content/csv_files/EN2_STAR_CHR_0102912369_20070203T130553_20070402T070126.csv (deflated 67%)
  adding: content/csv_files/EN2_STAR_IMAG_0102725122_20120112T183055_20120329T093058.csv (deflated 66%)
  adding: content/csv_files/EN2_STAR_MON_0652180928_20110708T151253_20110930T044950.csv (deflated 66%)
  adding: content/csv_files/EN2_STAR_MON_0105793995_20080415T231048_20080907T224903.csv (deflated 67%)
  adding: content/csv_files/EN2_STAR_CHR_0105891283_20080415T231048_20080907T224903.csv (deflated 67%)
  adding: content/csv_files/EN2_STAR_IMAG_0102708694_20120112T183055_20120329T093058.csv (deflated 66%)
  adding: content/csv_files/EN2_STAR_MON_0102725122_20071023T223035_20080303T093534.csv (deflated 66%)
  adding: content/csv_files/EN
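
The `!zip` command above depends on the Colab shell and its `/content` folder. As a portable alternative, the sketch below uses `shutil.make_archive` from the standard library. It runs in a temporary folder with one placeholder file, so the paths and the `example.csv` file are illustrative only.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder csv_files folder holding a single small file
    csv_dir = os.path.join(tmp, 'csv_files')
    os.mkdir(csv_dir)
    with open(os.path.join(csv_dir, 'example.csv'), 'w') as handle:
        handle.write('DATE,WHITEFLUX\n')

    # Creates <tmp>/csv_files.zip containing the csv_files folder
    archive = shutil.make_archive(
        os.path.join(tmp, 'csv_files'), 'zip', root_dir=tmp, base_dir='csv_files'
    )
    archive_name = os.path.basename(archive)
    archive_exists = os.path.isfile(archive)

print(archive_name, archive_exists)
```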

Downloading zipped folder

In [19]:
from google.colab import files
files.download("csv_files.zip")
