# Arctic Heat IWG Data Cleanup

The IWG flight data from Arctic Heat flights can be parsed and cleaned to ease analysis and reduce transmission costs.

The two main tasks are:
+ remove the repeated header lines
+ keep only the columns you need

### Specify files to work with

In [144]:
file_path = '/Users/bell/ecoraid/2018/Additional_FieldData/ArcticHeat/IWG_data/'
file_name = '20180531_210924_IWG.dat'

df = file_path+file_name

Import the standard packages.

In [145]:
import pandas as pd
import os


In [146]:
# read in the file and drop fully duplicated rows; this collapses the repeated header lines to their first occurrence
ds = pd.read_csv(df,header=None)
ds.drop_duplicates(inplace=True) 

column_names = (ds.loc[ds[0] == 'IWG1_NAMES']).values[0]

In [147]:
# drop the remaining header row and use its values (saved above) as the column names
ds.drop((ds.loc[ds[0] == 'IWG1_NAMES']).index,inplace=True)
ds.columns = column_names
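
The header handling above can also be tried on a small in-memory sample. The following is a minimal sketch, not the notebook's exact method: the three-column sample and the names `sample`, `raw`, `is_header`, `names`, and `clean` are hypothetical, and it filters header rows with a boolean mask instead of `drop_duplicates`, so identical data rows are kept.

```python
import io

import pandas as pd

# Hypothetical three-column sample; real IWG1 files carry many more fields.
sample = io.StringIO(
    "IWG1_NAMES,TIME,LAT\n"
    "IWG1,20180531T212652,71.28\n"
    "IWG1_NAMES,TIME,LAT\n"
    "IWG1,20180531T212653,71.29\n"
)

# Read everything as strings, with no header row, so header lines arrive as data.
raw = pd.read_csv(sample, header=None, dtype=str)

# Rows whose first field is the header tag are header lines.
is_header = raw[0] == "IWG1_NAMES"

# Take the column names from the first header line, then keep only data rows.
names = raw.loc[is_header].iloc[0].tolist()
clean = raw.loc[~is_header].copy()
clean.columns = names
clean = clean.reset_index(drop=True)

print(clean)
```

Masking on the header tag keeps the intent explicit: only lines tagged `IWG1_NAMES` are removed, regardless of whether any data rows happen to repeat.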

### A quick summary of all contents of the file

In [148]:
ds.describe()

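| | IWG1_NAMES | TIME | LAT | LON | ALTGPS | GPS_GEOIDHT | ALTPAFT | ALTRAFT | GS | TAS | ... | none | none.1 | none.2 | none.3 | FLID | MISSIONID | STORMID | SST | PYRAUCLEAR | RH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20621 | 20621 | 20621.0 | 20621.0 | 20621 | 0.0 | 20621 | 20621 | 20621.0 | 20621.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 20621 | 20621 | 0.0 | 20621.0 | 20621.0 | 20621.0 |
| unique | 1 | 20616 | 19260.0 | 14716.0 | 787 | 0.0 | 338 | 1406 | 3431.0 | 3187.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 | 960.0 | 15962.0 | 716.0 |
| top | IWG1 | 20180531T212652 | 71.284866 | -149.993195 | 144 | | 460 | 505 | 71.14 | 69.09 | ... | | | | | 20180531L1 | ARCTIC_HEAT | | -1.94 | 671.44 | 95.2 |
| freq | 20621 | 2 | 34.0 | 27.0 | 665 | | 1860 | 728 | 53.0 | 55.0 | ... | | | | | 20621 | 20621 | | 94.0 | 8.0 | 507.0 |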


In [149]:
ds.columns

Index(['IWG1_NAMES', 'TIME', 'LAT', 'LON', 'ALTGPS', 'GPS_GEOIDHT', 'ALTPAFT',
       'ALTRAFT', 'GS', 'TAS', 'IAS', 'MACH', 'GSZ', 'THDG', 'TRK', 'DA',
       'PITCH', 'ROLL', 'SA', 'AA', 'TA', 'TD', 'TTM', 'PS', 'PQ', 'PCAB',
       'WS', 'WD', 'UWZ', 'none', 'none', 'none', 'none', 'FLID', 'MISSIONID',
       'STORMID', 'SST', 'PYRAUCLEAR', 'RH'],
      dtype='object')

### Fields to keep and smaller file output

Copy the header fields you want to keep from the column list above into the `keep_columns` variable below.

In [150]:
keep_columns = ['TIME','LAT', 'LON', 'ALTGPS','SST', 'PYRAUCLEAR', 'RH']

In [151]:
ds[keep_columns].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20621 entries, 0 to 21308
Data columns (total 7 columns):
TIME          20621 non-null object
LAT           20621 non-null object
LON           20621 non-null object
ALTGPS        20621 non-null object
SST           20621 non-null object
PYRAUCLEAR    20621 non-null object
RH            20621 non-null object
dtypes: object(7)
memory usage: 1.3+ MB
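
Every column is reported as `object` because the header lines were read in as data, so pandas kept all values as strings. If numeric or datetime types are wanted, a conversion step along these lines may help. This is a sketch under assumptions: the two-row `frame` is hypothetical, the `TIME` format string is inferred from the sample value `20180531T212652` shown earlier, and `errors="coerce"` is one possible policy for unparseable entries.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the all-string columns reported by info().
frame = pd.DataFrame(
    {
        "TIME": ["20180531T212652", "20180531T212653"],
        "LAT": ["71.284866", "71.284900"],
        "SST": ["-1.94", "bad"],
    }
)

# Parse timestamps; the format is an assumption based on the sample TIME value.
frame["TIME"] = pd.to_datetime(frame["TIME"], format="%Y%m%dT%H%M%S")

# Convert the remaining columns to floats; unparseable entries become NaN.
numeric_cols = ["LAT", "SST"]
frame[numeric_cols] = frame[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(frame.dtypes)
```

With typed columns, `describe()` reports numeric statistics (mean, min, max) rather than only counts of unique strings.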


In [152]:
ds[keep_columns].to_csv(file_path+file_name.replace('.dat','.clean.csv'))
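
Note that `to_csv` writes the row index as an extra leading column by default. If the original row numbers are not needed downstream (an assumption about how the file will be used), passing `index=False` omits that column and should shrink the output further. A small sketch with a hypothetical one-row frame named `small`:

```python
import io

import pandas as pd

# Hypothetical one-row frame with a non-trivial index value.
small = pd.DataFrame({"TIME": ["20180531T212652"], "SST": ["-1.94"]}, index=[7])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
small.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```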

In [153]:
o_size=os.stat(df).st_size
n_size=os.stat(df.replace('.dat','.clean.csv')).st_size

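# f-strings replace % formatting and str.format()
# note: dividing by 1024 and then by 1000 mixes binary and decimal units, so the MB values are approximate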
print(f"The original file was {o_size / 1024 / 1000 :02.2f}MB")
print(f"The new file is {n_size / 1024 / 1000 :02.2f}MB")

The original file was 4.28MB
The new file is 1.33MB
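
The size check above divides by 1024 and then by 1000, which mixes binary and decimal units. A consistent alternative is sketched below; the helper name `size_mib` and the temporary demo file are hypothetical, and the figures printed by the notebook would differ slightly under this definition.

```python
import os
import tempfile


def size_mib(path):
    """Return the size of the file at path in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.stat(path).st_size / 1024 ** 2


# Demonstrate on a throwaway file of exactly 2 MiB.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 ** 2))
    tmp_path = tmp.name

demo_size = size_mib(tmp_path)
os.remove(tmp_path)

print(f"The demo file is {demo_size:.2f} MiB")
```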
