# 3. Extracting the minimum necessary data from the original Gaia data files
To make a Milky Way texture, we need only the two-dimensional positions of the stars on the celestial sphere and their magnitudes.
Here, let's extract them from the original data files and save them in separate files to reduce the data size.

The original Gaia DR1 files in CSV format, downloaded in the previous section, are compressed with gzip. We can handle these gzipped files with the `gzip` module in Python.

First of all, enter the following code in the cell below. It imports the Python modules used in this section and then defines the directories for input and output.
```
import gzip
import os.path
import struct

src_dir = './orig_data'
dest_dir = './extract_data'
```

Then, let's look at the first several lines of a Gaia DR1 data file.

For the first index `i1` and the second `i2`, the input filename is generated using the `format` method as follows:
```
fni = '{0}/GaiaSource_000-{1:03d}-{2:03d}.csv.gz'.format(src_dir,i1,i2)
```
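
As a small illustration of the zero padding done by `:03d`, the sketch below wraps the same format string in a helper. The name `gaia_filename` is only illustrative and is not used elsewhere in this section.

```python
def gaia_filename(src_dir, i1, i2):
    """Build the input file name for indices i1 and i2 (illustrative helper)."""
    # {1:03d} and {2:03d} pad the indices with zeros to three digits
    return '{0}/GaiaSource_000-{1:03d}-{2:03d}.csv.gz'.format(src_dir, i1, i2)

print(gaia_filename('./orig_data', 0, 3))
# ./orig_data/GaiaSource_000-000-003.csv.gz
```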

With the `gzip` module, gzipped files can be handled as if they were plain text files.
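
A minimal sketch of this idea, using a tiny made-up file instead of the Gaia data: opening with mode `'rt'` lets `gzip` decompress and decode in one step, so each line arrives as a normal string. The temporary file and its contents are assumptions for the demonstration only.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped CSV file to a temporary directory (made-up contents)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('name,value\n')
    f.write('a,1\n')

# Mode 'rt' makes gzip decompress and decode, so each line is already a str
with gzip.open(demo_path, 'rt', encoding='utf-8') as f:
    demo_lines = [line.rstrip('\n') for line in f]

print(demo_lines)
```

The code in this section opens the files in binary mode and calls `decode('utf-8')` on each line instead; both approaches give the same text.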

Enter the following code and run it. This reads a gzipped data file and shows the first 4 lines.
```
# Generate file name for given i1 and i2
i1 = 0
i2 = 0
fni = '{0}/GaiaSource_000-{1:03d}-{2:03d}.csv.gz'.format(src_dir,i1,i2)

with gzip.open(fni, 'rb') as f: # Open the input file in gzip format
    for i in range(4):
        line = f.readline()
        line = line.decode('utf-8') # convert binary data to utf-8 text
        print (line)
```

You can see that the first line describes the data columns and the following lines contain the actual data.

Here, for our purpose, we need the Galactic coordinates (l, b) and the magnitude.
A careful look at the first line shows:
- The Galactic longitude l is the 54th element.
- The Galactic latitude b is the 55th element.
- The magnitude (Gaia G-magnitude) is the 52nd element.
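
Instead of counting by eye, the positions can also be looked up by name. The sketch below is only an illustration: the helper `column_indices` is hypothetical, the header is a shortened made-up one, and the column names `l`, `b`, and `phot_g_mean_mag` are assumed here and should be checked against the first line printed above.

```python
def column_indices(header_line, names):
    """Return the zero-based index of each requested column name."""
    columns = header_line.strip().split(',')
    return {name: columns.index(name) for name in names}

# Shortened, made-up header for illustration; the real first line has many more columns
demo_header = 'source_id,ra,dec,phot_g_mean_mag,phot_variable_flag,l,b\n'
print(column_indices(demo_header, ['l', 'b', 'phot_g_mean_mag']))
# {'l': 5, 'b': 6, 'phot_g_mean_mag': 3}
```

The returned values are zero-based array indices, i.e. one less than the element positions counted from one.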

As a check, enter the following slightly modified code in the cell below and run it. For the header line (i = 0), the printed values are the column names themselves.
Note that array indices start at zero in Python. Thus, the array index for, e.g., the 54th element is 53.
```
# Generate file name for given i1 and i2
i1 = 0
i2 = 0
fni = '{0}/GaiaSource_000-{1:03d}-{2:03d}.csv.gz'.format(src_dir,i1,i2)

with gzip.open(fni, 'rb') as f: # Open the input file in gzip format
    for i in range(4):
        line = f.readline()
        line = line.decode('utf-8') # convert binary data to utf-8 text

        elements = line.split(',') # split the line into elements
        
        l = elements[53]
        b = elements[54]
        mag = elements[51]
        
        print (i, l, b, mag)
```

Next, let's save the extracted data in a binary file.
The code to write the data for one star to a binary file looks like this:
```
# extract Galactic coordinates and magnitude
l = float(elems[53]) # Galactic longitude
b = float(elems[54]) # Galactic latitude
mag = float(elems[51]) # Gaia G-magnitude

# pack them into a binary data and write it
dat = struct.pack('fff', l, b, mag)
fout.write(dat)
```
Here, the `float()` function converts a string into a floating-point value. The function `struct.pack()` in the `struct` module packs the three float variables into binary data, storing each as a single-precision float. The file object `fout` must be opened for writing in binary mode.
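
To see what `struct.pack()` produces, the following sketch packs three made-up values and unpacks them again with `struct.unpack()`. It is a stand-alone illustration and not part of the extraction code.

```python
import struct

# Pack three example values (made up) as single-precision floats
dat = struct.pack('fff', 123.456, -7.5, 15.25)
print(len(dat), struct.calcsize('fff'))  # 12 bytes per star

# unpack() reverses pack(); values come back rounded to single precision
l, b, mag = struct.unpack('fff', dat)
print(l, b, mag)
```

Because the values are stored in single precision, a number such as 123.456 comes back with a small rounding error, which is negligible for drawing a texture.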

A function that reads an original data file with indices `i1` and `i2`, extracts the necessary values (l, b, and mag), and saves them in a binary file can be written as follows:
```
def extract_file(i1, i2):
    if not os.path.isdir(dest_dir):
        os.mkdir(dest_dir)
    
    # input file name
    fni0 = '{0}/GaiaSource_000-{1:03d}-{2:03d}'.format(src_dir,i1,i2)
    fni = '{0}.csv.gz'.format(fni0)
    # output file name
    fno = '{0}/gaia_data_{1:03d}_{2:03d}.dat'.format(dest_dir,i1,i2)

    print ('"' + fni + '":', end=' ')
    if not os.path.exists(fni):
        print(' Not Found')
        return
    if os.path.exists(fno):
        print(' Skipped')
        return

    print(' Extracting...', end='')

    count = 0
    # Open the gzipped input and the binary output; both are closed automatically
    with gzip.open(fni, 'r') as f, open(fno, 'wb') as fout:
        first = True
        for line_bin in f:
            if first: # Skip first line
                first = False
                continue
            
            # split the line into elements for a star
            line = line_bin.decode('utf-8')
            elems = line.split(',')
            
            # extract Galactic coordinates and magnitude
            l = float(elems[53]) # Galactic longitude
            b = float(elems[54]) # Galactic latitude
            mag = float(elems[51]) # Gaia G-magnitude

            # pack them into binary data and write it
            dat = struct.pack('fff', l, b, mag)
            fout.write(dat)

            count += 1
    print(' Completed ({0} stars)'.format(count))
```
Enter this code in the cell below and run it to define the function.

Check if it works well by running the following code:
```
i1 = 0
i2 = 0
extract_file(i1, i2)
```

If it works well, the file size is reduced from ~40 MB (original file) to ~2.5 MB (extracted file).
Next, let's process all the files you downloaded.
Enter the following code in the cell below and run it.
```
i1 = 0
for i2 in range(5):
    extract_file(i1,i2)
```
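
To check an extracted file, it can be read back with the same `'fff'` layout. The sketch below is a minimal example under that assumption; the helper `read_extracted` is hypothetical, and the demonstration uses a temporary file with two made-up stars rather than real Gaia data.

```python
import os
import struct
import tempfile

RECORD = struct.Struct('fff')  # one star: l, b, mag as single-precision floats

def read_extracted(path):
    """Read a file of packed (l, b, mag) records and return them as a list of tuples."""
    with open(path, 'rb') as f:
        data = f.read()
    return list(RECORD.iter_unpack(data))

# Self-contained demonstration with two made-up stars in a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.dat')
with open(demo_path, 'wb') as f:
    f.write(RECORD.pack(10.0, -20.5, 12.25))
    f.write(RECORD.pack(350.0, 45.0, 18.5))

stars = read_extracted(demo_path)
print(len(stars), os.path.getsize(demo_path) // RECORD.size)
print(stars[0])
```

The number of stars equals the file size divided by the record size. Since `'fff'` uses the machine's native byte order, the file should be read on the same kind of machine that wrote it.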

That's all for this section.

To create all extracted files, the following code can be used:
```
def extract_all():
    for i1 in range(21):
        for i2 in range(256):
            extract_file(i1, i2)
```

However, processing all the files takes a very long time, so we don't do it in this workshop.