# Convert a raw txt dataset into csv format

This notebook extracts data from a raw txt file and saves it in csv format, using pandas dataframes as the intermediate step.

1) The raw data is located under **/raw_txt**.

2) The new csv data is saved under **/processed_csv**.

3) After running, you will get:

    - a series of csv files under "/processed_csv", one per block of the raw file;
    
    - a combined csv file named "/processed_csv/processed_combined.csv";
    
    - a readme file named "/processed_csv/processed_readme.csv".

First, let's import packages.

In [1]:
import pandas as pd
import os

The function **process_line** takes a line that has already been split on single spaces and drops the empty strings that runs of consecutive spaces leave behind.

It returns the remaining items as a list.

In [2]:
def process_line(line):
    info = []
    for item in line:
        if len(item) == 0:
            continue
        else:
            info.append(item)
    return info
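
As a quick check of what this does, the sketch below runs the same filtering on a made-up row (the sample string is an assumption, not a line from the real dataset). It also shows that, for space-separated input, the result matches Python's built-in `str.split()` with no argument.

```python
def process_line(line):
    # same logic as the function above: keep only the non-empty items
    return [item for item in line if len(item) != 0]

# a made-up data row with irregular spacing (not taken from the real dataset)
sample = "12   0.53  7.1 A"

items = sample.split(' ')   # splitting on a single space leaves empty strings
cleaned = process_line(items)

print(items)    # ['12', '', '', '0.53', '', '7.1', 'A']
print(cleaned)  # ['12', '0.53', '7.1', 'A']

# str.split() without an argument collapses runs of whitespace in one step
print(cleaned == sample.split())  # True
```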

The function **txt2csv** converts the raw txt file into csv files (via pandas dataframes) and saves them. It calls **process_line** on each line in turn.

The parameters are:

- *raw_txt_file_name*: the file name of the raw txt data

- *extracted_csv_file_name*: the file name of the combined csv file; the per-block csv files reuse this name with an index appended

- *readme_file_name*: the file name of the "readme" csv, which holds statistics about the dataset (the number of rows in each saved csv file)

- *drop_feature_list*: the noisy features (columns) to drop; it can be an empty list if no features should be dropped.
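
Inside **txt2csv**, the per-block file names are built by removing ".csv" from *extracted_csv_file_name* and appending "_0.csv", "_1.csv", and so on. The sketch below shows an equivalent way to build those names with `os.path.splitext`, which only touches the final extension, whereas `str.replace` removes every occurrence of ".csv" in the path. The helper name `block_csv_name` is hypothetical and not part of the notebook.

```python
import os

def block_csv_name(extracted_csv_file_name, csv_num):
    # hypothetical helper: insert the block index before the file extension
    stem, ext = os.path.splitext(extracted_csv_file_name)
    return '{}_{}{}'.format(stem, csv_num, ext)

print(block_csv_name('processed_csv/processed_combined.csv', 0))
# processed_csv/processed_combined_0.csv
```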

In [11]:

def txt2csv(raw_txt_file_name, extracted_csv_file_name, readme_file_name, drop_feature_list):
    # define variables to save results
    csv_readme_dict = {'ID':[], 'FileName':[], 'Num':[]}
    csv_combine_list = []
    count = 0
    csv_num = 0
    result_dict = {}
    
    # opening file
    raw_txt_file = open(raw_txt_file_name, 'r')
    
    # processing lines one by one
    for line in raw_txt_file:
        count += 1
        line = line.strip()
        #print("Line{}: {}".format(count, line.strip()))
        check_line = line.replace('*','')
        
        if check_line == '':
            #print('Empty line; Skip it!')
            continue
            
        if '==>' in line:
            #print('The last line for this particular csv file.')
            
            #### transfer dict to csv file
            result_df = pd.DataFrame.from_dict(result_dict)
            
            #------------- here, let's check whether we want to drop any noisy features
            if len(drop_feature_list)!=0:
                print('We need to drop **{}** noisy features!'.format(len(drop_feature_list)))
                # name the columns by keyword; pandas 2.x no longer accepts a positional axis here
                result_df = result_df.drop(columns=drop_feature_list)
            else:
                print('No noisy features to drop.')
            
            
            #-------------- save current dataframe to csv
            cur_save_name = extracted_csv_file_name.replace('.csv','')
            cur_save_name = cur_save_name + '_{}.csv'.format(csv_num)
            result_df.to_csv(cur_save_name, index=False)
            print('Congrats! **{}** has been saved!'.format(cur_save_name))
            
            ########## let's get statistical info #############
            csv_readme_dict['ID'].append(csv_num)
            csv_readme_dict['FileName'].append(cur_save_name)
            csv_readme_dict['Num'].append(len(result_df))
            csv_combine_list.append(result_df)
            count = 0
            csv_num += 1
            result_dict = {}
            continue
            
        line = check_line.split(' ')
        info = process_line(line)
        if count == 2:
            for item in info:
                result_dict[item]=[]
        else:
            keys = list(result_dict.keys())
            num = len(info)
            for i in range(num):
                cur_key = keys[i]
                cur_info = info[i]
                result_dict[cur_key].append(cur_info)
    # closing files
    raw_txt_file.close()
    
    # let's save the combined csv and statistical info
    combined_df = pd.concat(csv_combine_list, axis=0)
    combined_df.to_csv(extracted_csv_file_name, index=False)
    csv_readme_dict['ID'].append(csv_readme_dict['ID'][-1]+1)
    csv_readme_dict['FileName'].append(extracted_csv_file_name)  # record the real path, like the per-block rows
    csv_readme_dict['Num'].append(len(combined_df))
    csv_readme_df = pd.DataFrame.from_dict(csv_readme_dict)
    csv_readme_df.to_csv(readme_file_name, index=False)
    
    print('')
    print('')
    print('>>>The raw txt has been processed successfully!')
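
The raw file itself is not shown in this notebook, so the layout below is only inferred from the code: each block appears to start with a line of `*` characters (or a blank line), followed by a header line, then the data rows, and it ends with a line containing `==>`. Under that assumption, the following sketch mirrors the parsing rules of **txt2csv** on a small synthetic string, using plain dictionaries instead of dataframes. The name `parse_blocks` and the sample text are hypothetical; unlike the original loop, `zip` silently ignores extra values in a row instead of raising an `IndexError`.

```python
# synthetic input following the layout inferred from txt2csv (not real data)
RAW = """\
**********
id PAamp  value
1  0.10   3.5
2  0.20   4.0
==> end of block 0
**********
id PAamp  value
3  0.30   5.5
==> end of block 1
"""

def parse_blocks(text):
    blocks, current, count = [], {}, 0
    for line in text.splitlines():
        count += 1
        line = line.strip()
        check_line = line.replace('*', '')
        if check_line == '':
            continue                      # blank or all-star line: skip it
        if '==>' in line:
            blocks.append(current)        # end of the current block
            current, count = {}, 0
            continue
        info = check_line.split()
        if count == 2:
            # the second line of each block holds the column names
            current = {name: [] for name in info}
        else:
            # data row: append each value to its column, in header order
            for key, value in zip(current, info):
                current[key].append(value)
    return blocks

blocks = parse_blocks(RAW)
print(len(blocks))  # 2
print(blocks[1])    # {'id': ['3'], 'PAamp': ['0.30'], 'value': ['5.5']}
```

Note that every value is kept as a string at this stage; numeric types are only recovered when the saved csv files are read back.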

Below, we start to run our code.

In [12]:
if __name__ == '__main__':
    raw_txt_file_name = 'raw_txt/data_all_v4.txt' # the raw txt data file name
    extracted_csv_file_name = 'processed_csv/processed_combined.csv' # the extracted data file name 
    readme_file_name = 'processed_csv/processed_readme.csv' # the readme file name 
    
    # the noisy features we want to remove
    drop_feature_list = ['PAamp', 'PBamp', 'PCamp', 'PDamp', 'PFamp', 'energy']
    
    # create the output folder if it does not exist
    if not os.path.exists('processed_csv'):
        os.makedirs('processed_csv')
        print('The folder has been created!')
    else:
        print('We have the folder. Do not need to create it!')
        
    txt2csv(raw_txt_file_name, extracted_csv_file_name, readme_file_name, drop_feature_list)

We have the folder. Do not need to create it!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_0.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_1.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_2.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_3.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_4.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_5.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_6.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_7.csv** has been saved!
We need to drop **6** noisy features!
Congrats! **processed_csv/processed_combined_8.csv** has bee