# Arctic Heat IWG data Cleanup

The IWG flight data from Arctic Heat flights can be parsed and cleaned to ease analysis and reduce transmission costs.

The two main tasks are:
+ clean up the repeated header lines
+ keep only the columns you need
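
As a minimal sketch of these two tasks, the snippet below runs on a small in-memory sample rather than a real flight file. The sample rows and the helper name `clean_iwg` are assumptions for illustration; only the `IWG1_NAMES` header marker and the column names come from the data described here.

```python
import io
import pandas as pd

# Hypothetical sample: the header line is repeated mid-file
SAMPLE = """IWG1_NAMES,TIME,LAT,LON,SST
IWG1,20180914T200847,66.89,-162.60,8.47
IWG1_NAMES,TIME,LAT,LON,SST
IWG1,20180914T200848,66.90,-162.61,8.50
"""

def clean_iwg(source, keep_columns):
    """Drop repeated header lines and keep only the requested columns."""
    # Read everything as strings so header and data rows share one dtype
    raw = pd.read_csv(source, header=None, dtype=str)
    is_header = raw[0] == "IWG1_NAMES"
    # The first header line supplies the column names
    names = raw.loc[is_header].iloc[0].tolist()
    data = raw.loc[~is_header].copy()
    data.columns = names
    return data[keep_columns].reset_index(drop=True)

cleaned = clean_iwg(io.StringIO(SAMPLE), ["TIME", "LAT", "SST"])
print(cleaned)
```

This variant filters header rows by their marker instead of calling `drop_duplicates`, so identical data rows, if any exist, would be left in place.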

### Specify files to work with

In [141]:
file_path = '/Users/bell/ecoraid/2018/Additional_FieldData/ArcticHeat/IWG_data/'
file_name = '20180914_200847_IWG.dat'

df = file_path+file_name

Include standard package imports 

In [142]:
import pandas as pd
import os


In [143]:
# read in the file and drop all duplicate header lines, leaving only the first occurrence
ds = pd.read_csv(df,header=None)
ds.drop_duplicates(inplace=True) 

column_names = (ds.loc[ds[0] == 'IWG1_NAMES']).values[0]

In [144]:
# find the remaining header line, drop it, and use its values to name the columns
ds.drop((ds.loc[ds[0] == 'IWG1_NAMES']).index,inplace=True)
ds.columns = column_names

### A quick summary of all contents of the file

In [145]:
ds.describe()

| | IWG1_NAMES | TIME | LAT | LON | ALTGPS | GPS_GEOIDHT | ALTPAFT | ALTRAFT | GS | TAS | ... | none | none.1 | none.2 | none.3 | FLID | MISSIONID | STORMID | SST | PYRAUCLEAR | RH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19218 | 19218 | 19218.0 | 19218.0 | 19218 | 0.0 | 19218 | 19218 | 19218.0 | 19218.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 19218 | 19218 | 0.0 | 19218.0 | 19218.0 | 19218.0 |
| unique | 1 | 19218 | 16990.0 | 15449.0 | 506 | 0.0 | 216 | 1199 | 4695.0 | 3382.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 | 1348.0 | 13701.0 | 453.0 |
| top | IWG1 | 20180914T231315 | 66.890564 | -162.607361 | 244 | | 680 | 492 | 0.0 | 74.69 | ... | | | | | 20180914L2 | ARCTIC_HEAT | | 8.47 | 93.42 | 100.0 |
| freq | 19218 | 1 | 206.0 | 191.0 | 746 | | 2373 | 908 | 251.0 | 53.0 | ... | | | | | 19218 | 19218 | | 132.0 | 8.0 | 1018.0 |


In [146]:
ds.columns

Index(['IWG1_NAMES', 'TIME', 'LAT', 'LON', 'ALTGPS', 'GPS_GEOIDHT', 'ALTPAFT',
       'ALTRAFT', 'GS', 'TAS', 'IAS', 'MACH', 'GSZ', 'THDG', 'TRK', 'DA',
       'PITCH', 'ROLL', 'SA', 'AA', 'TA', 'TD', 'TTM', 'PS', 'PQ', 'PCAB',
       'WS', 'WD', 'UWZ', 'none', 'none', 'none', 'none', 'FLID', 'MISSIONID',
       'STORMID', 'SST', 'PYRAUCLEAR', 'RH'],
      dtype='object')
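
The summary above reports a count of 0 for several columns, such as `GPS_GEOIDHT` and `STORMID`, meaning they hold no data in this file. One possible way to discard such columns automatically, sketched here on a made-up frame (the name `frame` and its values are hypothetical), is `DataFrame.dropna` with `axis=1` and `how="all"`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two columns that are entirely empty
frame = pd.DataFrame(
    [["20180914T231315", "66.89", np.nan, np.nan],
     ["20180914T231316", "66.90", np.nan, np.nan]],
    columns=["TIME", "LAT", "GPS_GEOIDHT", "STORMID"],
)

# how="all" drops a column only when every value in it is missing
non_empty = frame.dropna(axis=1, how="all")
print(list(non_empty.columns))
```

This only removes columns with no values at all; choosing which populated columns to keep is still a manual decision.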

### Fields to keep and smaller file output

Copy the header fields you want to keep from the column listing above into the `keep_columns` variable below.

In [147]:
keep_columns = ['TIME','LAT', 'LON', 'ALTGPS','SST', 'PYRAUCLEAR', 'RH']

In [148]:
ds[keep_columns].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19218 entries, 0 to 19858
Data columns (total 7 columns):
TIME          19218 non-null object
LAT           19218 non-null object
LON           19218 non-null object
ALTGPS        19218 non-null object
SST           19218 non-null object
PYRAUCLEAR    19218 non-null object
RH            19218 non-null object
dtypes: object(7)
memory usage: 1.2+ MB
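
Every kept column is reported as `object` because the file was read with header lines mixed into the data, so the values are still strings. A possible follow-up, sketched here on made-up sample rows, is to convert the numeric fields with `pd.to_numeric` and parse `TIME` with `pd.to_datetime`. The timestamp format `%Y%m%dT%H%M%S` is inferred from values such as `20180914T231315`, and the frame name `sample` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for ds[keep_columns]; all values are strings
sample = pd.DataFrame({
    "TIME": ["20180914T231315", "20180914T231316"],
    "LAT": ["66.890564", "66.890600"],
    "SST": ["8.47", "8.50"],
})

typed = sample.copy()
# Parse the compact timestamp into a proper datetime column
typed["TIME"] = pd.to_datetime(typed["TIME"], format="%Y%m%dT%H%M%S")
for col in ["LAT", "SST"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

print(typed.dtypes)
```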


In [149]:
ds[keep_columns].to_csv(file_path+file_name.replace('.dat','.clean.csv'))

In [150]:
o_size=os.stat(df).st_size
n_size=os.stat(df.replace('.dat','.clean.csv')).st_size

# f-strings (Python 3.6+) replace % formatting and str.format()
print(f"The original file was {o_size / 1024 / 1000 :02.2f}MB")
print(f"The new file is {n_size / 1024 / 1000 :02.2f}MB")

The original file was 3.99MB
The new file is 1.23MB
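
The size comparison can also be wrapped in a small helper that uses a single unit convention. This is a sketch only: the function name `size_mb` is hypothetical, and it uses decimal megabytes (1 MB = 1,000,000 bytes), so its numbers would differ slightly from those printed above.

```python
import os
import tempfile

def size_mb(path):
    """Return the file size in decimal megabytes (1 MB = 1,000,000 bytes)."""
    return os.stat(path).st_size / 1_000_000

# Demonstrate on a temporary 2,500,000-byte file rather than a real flight file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 2_500_000)
demo_path = tmp.name

demo_size = size_mb(demo_path)
print(f"The demo file is {demo_size:.2f} MB")
os.remove(demo_path)
```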
