# Arctic Heat IWG Data Cleanup

The IWG flight data from Arctic Heat flights can be parsed and cleaned to ease analysis and minimize transmission costs.

The two main tasks are:
+ clean up the repeated header lines
+ keep only the columns you need

### Specify files to work with

In [71]:
file_path = '/Users/bell/ecoraid/2019/Additional_FieldData/ArcticHeat/IWG_data/'
file_name = '20190720_231146_IWG.dat'

# note: df holds the full path to the file as a string, not a DataFrame
df = file_path+file_name

Import the standard packages.

In [72]:
import pandas as pd
import os


In [73]:
# read in the file and drop all duplicate header lines, leaving only the first occurrence
ds = pd.read_csv(df,header=None)
ds.drop_duplicates(inplace=True) 

column_names = (ds.loc[ds[0] == 'IWG1_NAMES']).values[0]

In [74]:
# drop the remaining header line from the data and use its values (saved above) as column names
ds.drop((ds.loc[ds[0] == 'IWG1_NAMES']).index,inplace=True)
ds.columns = column_names
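
The header cleanup can also be sketched on a small in-memory sample, which makes it easy to check the logic without the flight file. The sample text and the names `sample`, `raw`, `is_header`, and `clean` are hypothetical and not part of the original notebook. This variant filters header rows with a boolean mask instead of `drop_duplicates`, so identical data rows, if any exist, are left in place.

```python
import io
import pandas as pd

# Hypothetical miniature IWG-style file: the header line repeats mid-stream.
sample = io.StringIO(
    "IWG1_NAMES,TIME,LAT,LON\n"
    "IWG1,20190721T005914,71.287003,-156.784943\n"
    "IWG1_NAMES,TIME,LAT,LON\n"
    "IWG1,20190721T005915,71.287100,-156.785000\n"
)

# Read everything as strings, with no header row, as in the notebook.
raw = pd.read_csv(sample, header=None, dtype=str)

# Rows whose first field is the header tag are header lines, not data.
is_header = raw[0] == "IWG1_NAMES"

# Take column names from the first header line, then keep only data rows.
names = raw.loc[is_header].iloc[0].tolist()
clean = raw.loc[~is_header].copy()
clean.columns = names
clean = clean.reset_index(drop=True)

print(clean)
```

Resetting the index is optional; the notebook keeps the original row labels, which is why the index later runs from 0 to 10466 for 10129 rows.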

### A quick summary of all contents of the file

In [75]:
ds.describe()

| | IWG1_NAMES | TIME | LAT | LON | ALTGPS | GPS_GEOIDHT | ALTPAFT | ALTRAFT | GS | TAS | ... | none | none | none | none | FLID | MISSIONID | STORMID | SST | PYRAUCLEAR | RH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10129 | 10129 | 10129.0 | 10129.0 | 10129 | 0.0 | 10129 | 10129 | 10129.0 | 10129.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 10129 | 10129 | 0.0 | 10129.0 | 10129.0 | 10129.0 |
| unique | 1 | 10129 | 9739.0 | 9872.0 | 969 | 0.0 | 805 | 212 | 2038.0 | 1708.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 | 1923.0 | 7977.0 | 540.0 |
| top | IWG1 | 20190721T005914 | 71.287003 | -156.784943 | 3314 | | 10990 | -98 | 0.0 | 0.0 | ... | | | | | 20190720L1 | ARCTIC_HEAT | | -0.06 | 612.74 | 100.0 |
| freq | 10129 | 1 | 138.0 | 135.0 | 178 | | 2070 | 9443 | 163.0 | 152.0 | ... | | | | | 10129 | 10129 | | 27.0 | 8.0 | 4592.0 |


In [76]:
ds.columns

Index(['IWG1_NAMES', 'TIME', 'LAT', 'LON', 'ALTGPS', 'GPS_GEOIDHT', 'ALTPAFT',
       'ALTRAFT', 'GS', 'TAS', 'IAS', 'MACH', 'GSZ', 'THDG', 'TRK', 'DA',
       'PITCH', 'ROLL', 'SA', 'AA', 'TA', 'TD', 'TTM', 'PS', 'PQ', 'PCAB',
       'WS', 'WD', 'UWZ', 'none', 'none', 'none', 'none', 'FLID', 'MISSIONID',
       'STORMID', 'SST', 'PYRAUCLEAR', 'RH'],
      dtype='object')

### Fields to keep and smaller file output

Copy the header fields you want to keep from the column list above into the `keep_columns` variable below.

In [77]:
keep_columns = ['TIME','LAT', 'LON', 'ALTGPS','SST', 'PYRAUCLEAR', 'RH']

In [78]:
ds[keep_columns].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10129 entries, 0 to 10466
Data columns (total 7 columns):
TIME          10129 non-null object
LAT           10129 non-null object
LON           10129 non-null object
ALTGPS        10129 non-null object
SST           10129 non-null object
PYRAUCLEAR    10129 non-null object
RH            10129 non-null object
dtypes: object(7)
memory usage: 633.1+ KB
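
Every kept column is reported as `object` because the file was read with the header lines mixed into the data, so the values are stored as strings. If numeric or time-based analysis is planned, a conversion step along these lines may help. This is a minimal sketch on a hypothetical two-row frame (`obj` and `typed` are illustrative names), and it assumes every `TIME` value follows the `YYYYMMDDTHHMMSS` pattern seen in the summary above.

```python
import pandas as pd

# Hypothetical frame mimicking the all-string columns reported by info().
obj = pd.DataFrame({
    "TIME": ["20190721T005914", "20190721T005915"],
    "LAT": ["71.287003", "71.287100"],
    "SST": ["-0.06", "bad"],  # "bad" stands in for an unparseable value
})

typed = obj.copy()

# Parse timestamps with an explicit format (assumed from the sample values).
typed["TIME"] = pd.to_datetime(typed["TIME"], format="%Y%m%dT%H%M%S")

# Convert measurements to floats; values that cannot be parsed become NaN.
for col in ["LAT", "SST"]:
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

print(typed.dtypes)
```

Using `errors="coerce"` trades silent NaN values for robustness; dropping that argument makes unexpected values raise an error instead.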


In [79]:
ds[keep_columns].to_csv(file_path+file_name.replace('.dat','.clean.csv'))
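
Note that `to_csv` writes the row index as an extra, unnamed first column by default. Since the goal is a smaller file, passing `index=False` may be worth considering. The comparison below is a sketch on a hypothetical two-row frame (`demo` is an illustrative name) and writes to strings rather than to disk.

```python
import pandas as pd

# Hypothetical frame with a gappy index, like the one left after dropping rows.
demo = pd.DataFrame(
    {"TIME": ["20190721T005914", "20190721T005915"], "RH": ["100.0", "99.5"]},
    index=[0, 2],
)

# With no path argument, to_csv returns the CSV text as a string.
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

# Compare the number of characters written in each case.
print(len(with_index), len(without_index))
```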

In [80]:
o_size=os.stat(df).st_size
n_size=os.stat(df.replace('.dat','.clean.csv')).st_size

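# report the sizes with f-strings (instead of % formatting or str.format())
# note: dividing by 1024 and then by 1000 mixes binary and decimal units, so the MB figures are approximate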
print(f"The original file was {o_size / 1024 / 1000 :02.2f}MB")
print(f"The new file is {n_size / 1024 / 1000 :02.2f}MB")

The original file was 2.14MB
The new file is 0.66MB
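
For a size report with consistent units, a small helper can make the choice between decimal megabytes and binary mebibytes explicit. The function `size_mb` and the temporary demo file are hypothetical additions used only to keep the sketch self-contained.

```python
import os
import tempfile

def size_mb(path, base=1000):
    """Return the file size in MB (base=1000) or MiB (base=1024)."""
    return os.stat(path).st_size / base / base

# Create a throwaway 2,000,000-byte file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_000_000)
demo_path = tmp.name

mb = size_mb(demo_path)              # decimal megabytes
mib = size_mb(demo_path, base=1024)  # binary mebibytes
os.remove(demo_path)

print(f"{mb:.2f} MB")
print(f"{mib:.2f} MiB")
```

In the notebook, the same helper could be called with `df` and the `.clean.csv` path in place of `demo_path`.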
