## Raw data processing
----------------------------
In this notebook we extract data from two text files: the mass table of the 2020 Atomic Mass Evaluation (AME2020) [[1](https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt)] and the output file of the Duflo-Zuker 10-parameter (DZ10) Fortran program.

In [1]:
#Libraries for data processing
import numpy as np 
import pandas as pd
import csv

If you are working in Google Colaboratory, run the next cell and comment out the one that follows it.

In [2]:
#Mount Google Drive
from google.colab import drive #Import drive from Google Colab

root = "/content/drive"        #Default location for the drive

drive.mount(root)              #We mount the google drive at /content/drive

#Import join used to join root path and my_google_drive_path
from os.path import join  

#Path to the project on Google Drive
my_google_drive_path = "MyDrive/StudentProject2023-main"

project_path = join(root, my_google_drive_path)

Mounted at /content/drive


### AME2020 data extraction

We will process the AME mass table and extract the data we need into a .csv file. As can be seen in the [original text file](https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt), the first 36 lines only explain how to read the table, so we removed them by hand. The individual steps of the extraction are explained as comments in the code.
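The explanatory lines could also be skipped in code instead of by hand. The following is only a minimal sketch of that idea: `skip_header` and `N_HEADER_LINES` are hypothetical names that are not part of the notebook, the count of 36 is taken from the text above, and an in-memory stand-in with dummy content replaces the real file.

```python
import io
from itertools import islice

N_HEADER_LINES = 36  # number of explanatory lines, as stated in the text above


def skip_header(lines, n_header=N_HEADER_LINES):
    """Return all lines after the first n_header ones (hypothetical helper)."""
    return list(islice(lines, n_header, None))


# In-memory stand-in for the real file: 36 dummy header lines and 2 dummy rows
fake_file = io.StringIO(
    "".join(f"header line {i}\n" for i in range(N_HEADER_LINES)) + "row A\nrow B\n"
)
data_lines = skip_header(fake_file)
print(len(data_lines))
```

With a real file, the open file handle would be passed in place of `fake_file`, assuming the untouched file does start with exactly that many explanatory lines.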

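Strangely, the first time we run this cell, the output contains only 2570 entries when there should be 3558; if we run it again, all is fine. We do not understand where this error comes from, and it is unfortunately a recurring problem. One possible cause, which we have not verified, is that the .csv file is read back before its buffered lines have been flushed to disk; a file that is never closed can stay incomplete on disk.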
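One way to rule out unflushed write buffers as the reason for a truncated output file is to write inside a `with` block, which closes (and therefore flushes) the file when the block ends. The sketch below only illustrates that pattern: the temporary directory, the file name `example.csv` and the rows are made-up placeholders, not the notebook's data.

```python
import os
import tempfile

rows = ["1;1;0;1", "0;1;1;2"]  # made-up placeholder rows

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.csv")

    # The file is closed, and its buffer flushed, when this block ends
    with open(out_path, "w") as out_file:
        out_file.write("N-Z;N;Z;A\n")
        for row in rows:
            out_file.write(row + "\n")

    # Reading the file back now sees every line that was written
    with open(out_path) as check_file:
        n_lines = sum(1 for _ in check_file)

print(n_lines)
```

Whether this removes the behaviour described above has not been checked here; it only guarantees that the file on disk is complete once the `with` block has finished.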


In [4]:
#Open AME mass table document
mass_file = open(join(project_path,"1_raw_data/mass_1.mas20.txt"),"r+")


#We create the .csv file and give the name of the columns
mass_csv = open(join(project_path,"2_processed_data/mass_data.csv"),"w+")
ame_csv_header_row = "N-Z;N;Z;A;ame_ME;ame_ME_unc;ame_BE/A;ame_BE/A_unc;ame_BDE;ame_BDE_unc;ame_AM;ame_AM_unc\n"
mass_csv.writelines(ame_csv_header_row)

#Extract data from the AME mass table into a csv file
#The .csv file should have 3559 lines: 3558 nuclides
#plus the first line, which holds the column names

element_list = mass_file.readlines()

#The following lines are for the purpose of standardization of the data
#As the data is in a complicated format (some values are empty) and thus
#we need to process the data ourselves

for element in element_list :

    splitted_line = element.split() #Split a string separated by spaces
    #We will get a list of 15 elements in the end

    #All elements have a column with only "B-" written
    #We use it as a constant to know whether there are blanks or not

    if splitted_line.index("B-") == 11 : #There is a 0 and a decay mode
        #We want to get rid of indices 0 and 6
        #Index 0 is only for Fortran users
        #Index 6 is decay mode
        splitted_line.pop(0) 
        splitted_line.pop(5) #5 as index 0 is already removed by .pop(0)
    
    elif splitted_line.index("B-") == 10 : #There is a 0 or a decay mode, not both
        if (int(splitted_line[1]) - int(splitted_line[2]) == int(splitted_line[0]) and 
            int(splitted_line[1]) + int(splitted_line[2]) == int(splitted_line[3])) :
            #We use the N-Z column to know whether the first token is the Fortran 0 or not
            #Here the line starts with N-Z, N, Z, A, so the extra token is the decay mode
            splitted_line.pop(5)

        else : #The first token is the Fortran 0 and there is no decay mode
            splitted_line.pop(0)
    
    if len(splitted_line) != 15 :
        #The beta-decay energy uncertainty is sometimes empty, we add a 0
        splitted_line.insert(11,"0") 

    #The beta-decay energy is sometimes given as "*", we replace it by "0"
    if splitted_line[10].find("*") != -1 :
        splitted_line[10] = "0"
    
    #We get rid of the element symbol and "B-" in the list
    splitted_line.pop(4) #Getting rid of element symbols
    splitted_line.pop(8) #Getting rid of "B-" string
    #We now have a list of 13 elements

    #Values for atomic_mass follow a strange format
    #We thus concatenate two columns
    #index 10 & 11
    atomic_mass_coma = splitted_line.pop(11)
    atomic_mass_coma = "." + atomic_mass_coma.replace(".","")
    splitted_line[10] = splitted_line[10] + atomic_mass_coma
    
    
    #We now have list of 12 elements

    #Remove "#" and standardization of the list in order to convert into array
    for i in range(12) :
        if splitted_line[i].find("#") != -1 :
            splitted_line[i] = splitted_line[i].replace("#","")

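    mass_csv.writelines(";".join(splitted_line) + "\n")

#Close both files so that all buffered lines are written to disk
mass_file.close()
mass_csv.close()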
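The atomic-mass step in the cell above can be looked at in isolation. The helper below is a small sketch, not the notebook's code: `join_atomic_mass` is a hypothetical name, the input tokens are illustrative values, and it assumes, as the cell does, that the second token always has six digits before its decimal point.

```python
def join_atomic_mass(integer_part, micro_part):
    """Join the two atomic-mass tokens into one decimal string.

    integer_part holds the digits before the decimal point and micro_part
    the remaining digits, written with their own decimal point
    (for example "1" and "008664.91590").
    """
    # Drop the decimal point of the second token and put one in front of it
    # instead, so that its digits continue the first token
    return integer_part + "." + micro_part.replace(".", "")


example_mass = join_atomic_mass("1", "008664.91590")  # illustrative tokens
print(example_mass)
```

Keeping the values as strings until the file is written, as the cell does, avoids any rounding before the data reach the .csv file.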

### Duflo-Zuker with 10 parameters data extraction

We will now process the DZ10 data produced by the Fortran program.

The same problem occurs here: the first time we run the cell we obtain 13685 or 16042 entries, while there should be 16040.


In [6]:
#Open DZ10 document
duzu_file=open(join(project_path,"1_raw_data/duzu.txt"),"r+")

#Extract data from DZ10 into a .csv file

dz_element_list=duzu_file.readlines()

dz_csv=open(join(project_path,"2_processed_data/dz_data.csv"),"w+")
dz_csv_header_row="Z;N;dz_BE/A;dz_ME\n"
dz_csv.writelines(dz_csv_header_row)

for element in dz_element_list :
    dz_split_line=element.split() 

    #We simply get rid of names
    #Each pop shifts the list, so this removes the original indices 0, 2, 4 and 6
    dz_split_line.pop(0)
    dz_split_line.pop(1)
    dz_split_line.pop(2)
    dz_split_line.pop(3)

    #There are around 120x200 elements in this text file
    #Some elements are associated with "NaN" as Binding Energy
    #Some elements are associated with negative Binding Energy
    #We get rid of both
    #float is enough for a sign test and exists on every platform,
    #unlike np.float128
    if not(dz_split_line[2].find("NaN")!=-1 or 
           dz_split_line[3].find("NaN")!=-1 or
           float(dz_split_line[2])<0) :
        dz_csv.writelines(";".join(dz_split_line)+"\n")

#Close both files so that all buffered lines are written to disk
duzu_file.close()
dz_csv.close()
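The filtering rule used in the cell above can be written as a small predicate, which makes it easy to test on its own. This is only a sketch under stated assumptions: `keep_dz_row` is a hypothetical name, the sample rows are made-up values in the `Z;N;dz_BE/A;dz_ME` column order, and, unlike the cell, it also drops rows whose values cannot be parsed as numbers.

```python
import math


def keep_dz_row(binding_energy, mass_excess):
    """Return True if both values are real numbers and the binding energy is not negative."""
    try:
        be_value = float(binding_energy)
        me_value = float(mass_excess)
    except ValueError:
        # Assumption of this sketch: unparsable values are discarded
        return False
    if math.isnan(be_value) or math.isnan(me_value):
        return False
    return be_value >= 0


# Made-up rows: one valid, one with NaN values, one with negative binding energy
sample_rows = [
    ["8", "8", "7.9", "-4.7"],
    ["1", "150", "NaN", "NaN"],
    ["2", "100", "-0.5", "900.0"],
]
kept_rows = [row for row in sample_rows if keep_dz_row(row[2], row[3])]
print(len(kept_rows))
```

Parsing with `float` and testing with `math.isnan` catches any spelling of NaN that Python accepts, whereas the substring search in the cell relies on the exact text `NaN`.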