# Pre-Processing the Text Data

A key principle in pre-processing data is to have a clear idea of what the input data looks like and what we want the final output to look like.

The steps followed for pre-processing are as follows:

- Understanding the Format of the Data 
- Storing The Cyclone in a Dictionary
- Converting the Dictionary to a Dataframe
- Restructuring the Columns and making it readable
- Replacing Sentinel Values and Removing Empty Strings
- Removing Unwanted Spaces and Reindexing the Data frame
- Saving this Dataframe to a CSV File


### Understanding the Format of the Data

Let us take a look at the modified CSV format used by the HURDAT2 team:

In [1]:
from IPython.display import IFrame
IFrame("http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf", width=950, height=600)

The file has already been downloaded, so let us now read it.

In [2]:
# Open the file and read its full contents
with open("data/hurdat2-1851-2018-120319.txt", "r") as atlantic:
    atlantic_raw = atlantic.read()

# Count the first two characters of every line in the file
import io
from collections import Counter

c = Counter()
for line in io.StringIO(atlantic_raw):
    c[line[:2]] += 1
# Print the counter output
print(c)

Counter({'19': 32362, '20': 9756, '18': 9228, 'AL': 1873})


Let's take a moment to understand what the Counter output means:

* AL : Number of header lines, one per Atlantic storm, from 1851-2018
* 18 : Number of entries in the 19th century (1851 - 1899)
* 19 : Number of entries in the 20th century (1900 - 1999)
* 20 : Number of entries in the 21st century (2000 - 2018)
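
The same counting idea can be used as a quick sanity check. The sketch below is only an illustration on a made-up sample in the same layout (the second storm and all shortened rows are hypothetical, not real records). It assumes that the third field of each header line holds the number of entries for that storm, which is the field the notebook later unpacks as `storm_entries_n`.

```python
import io
from collections import Counter

# Illustrative sample in the HURDAT2 layout (shortened, hypothetical rows).
sample_raw = (
    "AL011851, UNNAMED, 2,\n"
    "18510625, 0000, , HU, 28.0N, 94.8W, 80, -999,\n"
    "18510625, 0600, , HU, 28.0N, 95.4W, 80, -999,\n"
    "AL012001, SAMPLE, 1,\n"
    "20010605, 1200, , TS, 25.0N, 90.0W, 40, 1005,\n"
)

# Count lines by their first two characters, as in the cell above.
prefix_counts = Counter(line[:2] for line in io.StringIO(sample_raw))
header_lines = prefix_counts["AL"]
data_lines = sum(n for prefix, n in prefix_counts.items() if prefix != "AL")

# Sum the entry counts declared in the header lines (third field).
declared_rows = sum(
    int(line.split(",")[2])
    for line in io.StringIO(sample_raw)
    if line.startswith("AL")
)

print(header_lines, data_lines, declared_rows)
```

If the declared total and the counted data lines disagree on the real file, that would point to a parsing problem worth investigating before going further.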


### Storing The Cyclone in a Dictionary

Let us now create a list of dictionaries, one per cyclone, each holding the cyclone's header line and its data lines.

In [3]:
import io

# Build a list with one dictionary per cyclone: its header line and its data lines
atlantic_storms_r = []
atlantic_storm_r = {'header': None, 'data': []}

for i, line in enumerate(io.StringIO(atlantic_raw)):
    if line[:2] == 'AL':
        # A new header: store the cyclone collected so far, then start a new one
        atlantic_storms_r.append(atlantic_storm_r.copy())
        atlantic_storm_r['header'] = line
        atlantic_storm_r['data'] = []
    else:
        atlantic_storm_r['data'].append(line)
# Drop the first element, which is the empty placeholder appended at the first header.
# Note: the cyclone after the last header is never appended by this loop, which is
# why the length below is one less than the number of 'AL' lines counted earlier.
atlantic_storms_r = atlantic_storms_r[1:]
# Number of Atlantic cyclones
len(atlantic_storms_r)

1872
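
In the loop above, a cyclone is appended only when the next `AL` header is read, so the cyclone that follows the final header needs one extra append after the loop. A minimal sketch of a variant that does this is given below. The helper name `group_storms` and the sample text are hypothetical and used only for illustration.

```python
import io


def group_storms(raw_text):
    """Group HURDAT2-style lines into one dict per storm (hypothetical helper)."""
    storms = []
    current = None
    for line in io.StringIO(raw_text):
        if line[:2] == "AL":
            # A new header closes the storm collected so far, if any.
            if current is not None:
                storms.append(current)
            current = {"header": line, "data": []}
        elif current is not None:
            current["data"].append(line)
    # Append the storm that follows the final header.
    if current is not None:
        storms.append(current)
    return storms


# Illustrative sample (shortened, hypothetical rows).
sample_raw = (
    "AL011851, UNNAMED, 2,\n"
    "18510625, 0000, , HU,\n"
    "18510625, 0600, , HU,\n"
    "AL012001, SAMPLE, 1,\n"
    "20010605, 1200, , TS,\n"
)

storms = group_storms(sample_raw)
print(len(storms))
print([len(storm["data"]) for storm in storms])
```

Starting from `None` instead of an empty placeholder also removes the need to drop the first list element afterwards.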

### Converting the Dictionary to a Dataframe

In [4]:
# Convert the list of dictionaries to pandas DataFrames, which are easier to work with later

import pandas as pd

atlantic_storm_dfs = []
for storm_dict in atlantic_storms_r:
    # Header fields: storm id, name and number of entries
    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(",")[:3]
    # Remove the trailing newline ('\n'), split the fields and strip whitespace
    data = [[entry.strip() for entry in datum[:-1].split(",")] for datum in storm_dict['data']]
    frame = pd.DataFrame(data)
    frame['id'] = storm_id
    frame['name'] = storm_name
    atlantic_storm_dfs.append(frame)

# Print the first cyclone's data to see how it looks
atlantic_storm_dfs[0]

,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,id,name
0,18510625,0,,HU,28.0N,94.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
1,18510625,600,,HU,28.0N,95.4W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
2,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
3,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
4,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
5,18510626,0,,HU,28.2N,97.0W,70,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
6,18510626,600,,TS,28.3N,97.6W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
7,18510626,1200,,TS,28.4N,98.3W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
8,18510626,1800,,TS,28.6N,98.9W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
9,18510627,0,,TS,29.0N,99.4W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED


In [5]:
# Concatenate all the cyclone DataFrames into one
atlantic_storms = pd.concat(atlantic_storm_dfs)
len(atlantic_storms)

51310
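
Because each per-cyclone frame was built separately, `pd.concat` keeps every frame's own row numbers, so the combined index restarts from 0 for each cyclone. The sketch below uses two small made-up frames (the values are illustrative only) to show the effect. It also shows `ignore_index=True` as one possible alternative to renumbering the rows later.

```python
import pandas as pd

# Two tiny per-cyclone frames with illustrative values.
first = pd.DataFrame({"date": ["18510625", "18510625"]})
first["id"] = "AL011851"
second = pd.DataFrame({"date": ["18510705"]})
second["id"] = "AL021851"

# Plain concatenation keeps each frame's own row numbers.
stacked = pd.concat([first, second])
print(stacked.index.tolist())     # [0, 1, 0]

# ignore_index=True renumbers the rows consecutively.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```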

### Restructuring the Columns and making it readable

In [6]:
# Restructure the columns so that 'id' and 'name' come first
atlantic_storms = atlantic_storms.reindex(columns=atlantic_storms.columns[-2:].append(atlantic_storms.columns[:-2]))
# Print the first 5 rows
atlantic_storms.head()


,id,name,0,1,2,3,4,5,6,7,...,11,12,13,14,15,16,17,18,19,20
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,


In [7]:
#Display the Columns of the Dataframe
atlantic_storms.columns

Index([  'id', 'name',      0,      1,      2,      3,      4,      5,      6,
            7,      8,      9,     10,     11,     12,     13,     14,     15,
           16,     17,     18,     19,     20],
      dtype='object')

In [8]:
# Make the DataFrame's columns readable
atlantic_storms.columns = [
        "id",
        "name",
        "date",
        "hours_minutes",
        "record_identifier",
        "status_of_system",
        "latitude",
        "longitude",
        "maximum_sustained_wind_knots",
        "maximum_pressure",  # HURDAT2 documents this field as the minimum pressure (millibars)
        "34_kt_ne",
        "34_kt_se",
        "34_kt_sw",
        "34_kt_nw",
        "50_kt_ne",
        "50_kt_se",
        "50_kt_sw",
        "50_kt_nw",
        "64_kt_ne",
        "64_kt_se",
        "64_kt_sw",
        "64_kt_nw",
        "na"  # empty trailing field produced by the comma that ends each line
]
del atlantic_storms['na']
pd.set_option("display.max_columns", None)

In [9]:
# Let's have a look at our DataFrame:
atlantic_storms.head()

,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999


### Replacing Sentinel Values and Removing Empty Strings

Now that we have completed most of the parsing, let us apply some final fixes by changing the sentinel values, which are '-999', to NaN (Not a Number).

In [10]:
# Replace all sentinel values ('-999') with NaN
# We use NumPy (Numerical Python) for the NaN value.
import numpy as np
atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)
# Check one value that previously held the sentinel
atlantic_storms.iloc[0]['34_kt_sw']

nan
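
Every field was produced by splitting text, so the sentinel is stored as the string `'-999'` rather than as a number. The sketch below uses a small made-up frame (the values are illustrative) to show why the string form is the one that has to be replaced. The call to `pd.to_numeric` at the end is an optional extra step that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Small frame of string fields, mirroring the object dtype of the parsed data.
sample = pd.DataFrame(
    {
        "maximum_sustained_wind_knots": ["80", "45"],
        "34_kt_sw": ["-999", "30"],
    },
    dtype=object,
)

# An integer sentinel does not match the string "-999", so nothing changes.
unchanged = sample.replace(to_replace=-999, value=np.nan)
print(unchanged["34_kt_sw"].tolist())

# The string sentinel matches and is replaced by NaN.
cleaned = sample.replace(to_replace="-999", value=np.nan)

# With the sentinel gone, the column can be converted to floats if needed.
radii = pd.to_numeric(cleaned["34_kt_sw"])
print(radii.tolist())
```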

In [11]:
# Checking Data types of Columns 
atlantic_storms.dtypes

id                              object
name                            object
date                            object
hours_minutes                   object
record_identifier               object
status_of_system                object
latitude                        object
longitude                       object
maximum_sustained_wind_knots    object
maximum_pressure                object
34_kt_ne                        object
34_kt_se                        object
34_kt_sw                        object
34_kt_nw                        object
50_kt_ne                        object
50_kt_se                        object
50_kt_sw                        object
50_kt_nw                        object
64_kt_ne                        object
64_kt_se                        object
64_kt_sw                        object
64_kt_nw                        object
dtype: object

In [12]:
atlantic_storms['record_identifier'].value_counts()

     50238
L     1003
I       28
P       10
T        8
S        7
R        6
C        5
W        4
G        1
Name: record_identifier, dtype: int64

Now let us also replace all the empty strings with NaN.

In [13]:
# Replacing All Empty String with nan values
atlantic_storms = atlantic_storms.replace(to_replace="", value=np.nan)
atlantic_storms['record_identifier'].value_counts(dropna=False)

NaN    50238
L       1003
I         28
P         10
T          8
S          7
R          6
C          5
W          4
G          1
Name: record_identifier, dtype: int64
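
The two replacements, for the sentinel and for empty strings, can also be combined by passing a list to `replace`. The sketch below shows this on a small made-up frame (the values are illustrative). It is an alternative formulation under that assumption, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Small frame of string fields with both kinds of missing marker.
sample = pd.DataFrame(
    {
        "record_identifier": ["", "L", ""],
        "maximum_pressure": ["-999", "-999", "1002"],
    },
    dtype=object,
)

# Replace the sentinel and the empty string in a single call.
cleaned = sample.replace(to_replace=["-999", ""], value=np.nan)

# dropna=False keeps the missing values visible in the counts.
counts = cleaned["record_identifier"].value_counts(dropna=False)
print(counts)

# Number of missing values per column after cleaning.
missing = cleaned.isna().sum()
print(missing)
```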

In [14]:
#Let us have a look at the Data frame now
atlantic_storms.head()

,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,,...,,,,,,,,,,
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,,...,,,,,,,,,,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,,...,,,,,,,,,,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,,...,,,,,,,,,,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,,...,,,,,,,,,,


### Removing Unwanted Spaces and Reindexing the Data frame

In [15]:
# Final fixes

# Strip unwanted spaces from the names
atlantic_storms['name'] = atlantic_storms['name'].str.strip()

# Reindex so that the rows are numbered consecutively across all cyclones
atlantic_storms.index = range(len(atlantic_storms.index))
atlantic_storms.index.name = "index"

### Saving this Dataframe to a CSV File

Let us now save the DataFrame to a CSV file, which we will use for annotating the data.

In [16]:
atlantic_storms.tail()

index,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
51305,AL152018,NADINE,20181011,1800,,TS,14.3N,34.6W,45,1002,...,30,50,0,0,0,0,0,0,0,0
51306,AL152018,NADINE,20181012,0,,TS,14.8N,35.0W,40,1003,...,0,50,0,0,0,0,0,0,0,0
51307,AL152018,NADINE,20181012,600,,TS,15.3N,35.3W,40,1004,...,0,50,0,0,0,0,0,0,0,0
51308,AL152018,NADINE,20181012,1200,,TS,15.8N,35.7W,35,1006,...,0,50,0,0,0,0,0,0,0,0
51309,AL152018,NADINE,20181012,1800,,TD,16.2N,37.0W,30,1008,...,0,0,0,0,0,0,0,0,0,0


In [17]:
atlantic_storms.to_csv("atlantic.csv")
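
When the saved file is read back later, pandas infers column types again, so text fields that look numeric can change. The sketch below writes a small made-up frame to an in-memory buffer instead of a file and reads it back. It assumes the time field keeps leading zeros such as `0000`, and the `dtype` mapping is one possible way to preserve them.

```python
import io

import numpy as np
import pandas as pd

# Small frame shaped like the cleaned data (illustrative values).
frame = pd.DataFrame(
    {
        "id": ["AL011851", "AL011851"],
        "date": ["18510625", "18510625"],
        "hours_minutes": ["0000", "0600"],
        "maximum_pressure": [np.nan, np.nan],
    }
)
frame.index.name = "index"

# Write to an in-memory buffer; NaN values are written as empty fields.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Read back, using the named index column and keeping date and time as text.
round_trip = pd.read_csv(
    buffer,
    index_col="index",
    dtype={"date": str, "hours_minutes": str},
)
print(round_trip)
```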

## License

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).