# Data Cleaning

A data line for a cyclone is treated as invalid if:
- The wind speed is 0
- The cyclone's location is outside the bounding box around Japan (latitude 23 to 48, longitude 113 to 163)
- The year is not between 2000 and 2022
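
The three rules above can be sketched as a single predicate. This is only an illustration: the function and constant names are hypothetical, the bounds are copied from the cleaning code in this notebook, and the year is assumed to be a full four-digit year (the notebook itself works with the two-digit year from the cyclone ID).

```python
# Bounds copied from the notebook's cleaning code; all names here are hypothetical.
LAT_MIN, LAT_MAX = 23, 48      # latitude bounds of the box around Japan
LON_MIN, LON_MAX = 113, 163    # longitude bounds of the box around Japan


def is_valid_record(year, lat, lon, wind):
    """Return True if a data line passes all three cleaning rules."""
    # Rule 1: a wind speed of 0 is invalid
    if wind == 0:
        return False
    # Rule 2: the position must lie inside the bounding box (edges included)
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return False
    # Rule 3: only the years 2000 to 2022 are kept
    return 2000 <= year <= 2022
```

For instance, `is_valid_record(2010, 35.0, 139.5, 50)` passes, while the same record with a wind speed of 0 or a latitude of 50 is rejected.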

The data will be divided into 4 seasons:
- March to May is Spring
- June to August is Summer
- September to November is Autumn
- December to February is Winter (January and February count toward the Winter of the previous year)
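
As a rough sketch, the season mapping can be written as one small function. It mirrors the key scheme used by the cleaning code in this notebook (two-digit year times 10, plus 0 to 3 for Spring to Winter); the names `season_key`, `describe` and `SEASONS` are hypothetical and only serve the illustration.

```python
SEASONS = ("Spring", "Summer", "Autumn", "Winter")


def season_key(year, month):
    """Map a two-digit year (0 = 2000) and a month (1..12) to a season key."""
    # January and February belong to the Winter of the previous year
    if month < 3:
        return (year - 1) * 10 + 3
    # March..May -> 0, June..August -> 1, September..November -> 2, December -> 3
    return year * 10 + (month - 3) // 3


def describe(key):
    """Turn a season key back into a readable label."""
    year, season = divmod(key, 10)
    return f"{SEASONS[season]} {2000 + year}"


print(describe(season_key(5, 7)))   # July 2005
print(describe(season_key(6, 1)))   # January 2006 counts as the previous Winter
```

Under this scheme, January 2006 and December 2005 share the same key, which is why a winter is labelled with the year in which it starts.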

The cleaned data is used to create the heatmap and the GIF timelapse of the heatmap.

In [7]:
# define helper functions
def insert_to_dict(data, datatemp, key):
    # append datatemp to the list stored under key, creating the list on first use
    data.setdefault(key, []).append(datatemp)

def classify_data(year, month, newLine, data, insert_to_dict):
    # fixed-width fields of the data line; latitude and longitude are stored in tenths of a degree
    wind = int(newLine[33:36])
    lat = int(newLine[15:18])/10
    long = int(newLine[19:23])/10

    # remove records with a wind speed of 0
    if wind == 0:
        return

    # remove records outside the bounding box around Japan
    if lat < 23 or lat > 48:
        return

    if long < 113 or long > 163:
        return

    # In the raw data, the minimum wind speed is 35 and the maximum is 115.
    # Subtracting 25 (rather than 35) and dividing by 90 maps that range to
    # roughly 0.11..1.0, so the minimum wind speed is not 0 (0%)
    # and the maximum wind speed is 1 (100%).
    wind = (wind - 25) / 90
    datatemp = [lat, long, wind]

    # January and February belong to the Winter of the previous year
    if month < 3:
        key = (year - 1)*10 + 3

    # March to May is Spring
    elif month < 6:
        key = year*10

    # June to August is Summer
    elif month < 9:
        key = year*10 + 1

    # September to November is Autumn
    elif month < 12:
        key = year*10 + 2

    # December is Winter
    elif month == 12:
        key = year*10 + 3

    # any other month value is ignored
    else:
        return

    insert_to_dict(data, datatemp, key)

In [8]:
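# Extract data
data = dict()
previd = ""
month = 0
with open('raw_data.txt', 'r') as f:
    for line in f:
        # header line: two-digit year and the number of data lines that follow
        year = int(line[6:8])
        numLine = int(line[12:15])

        # skip cyclones after 2022; any two-digit year above 22 is skipped,
        # which also excludes years before 2000
        if year > 22:
            for _ in range(numLine):
                next(f)
            continue

        storm_id = line[6:10]
        for _ in range(numLine):
            newLine = f.readline()
            # the month is read from the first data line of each cyclone,
            # so every record of a cyclone is assigned to the same season
            if storm_id != previd:
                month = int(newLine[2:4])
                previd = storm_id

            classify_data(year, month, newLine, data, insert_to_dict)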

# Format of Cleaned Output:
- ### Header line for each season of each year, from **Spring 2000 to Winter 2022**.
     - Contains the number of data lines for that season.
     - Each following header line corresponds to the next season.
- ### Data line for the storm in each season.
     - Format is: lat,long,wind.
     - Latitude and longitude are in degrees; wind is a fraction between 0 and 1 (1 = 100%).
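
The wind value can be converted back to the raw wind speed by inverting the scaling used during cleaning. The sketch below assumes the same offset (25) and scale (90) as the cleaning code; the function and constant names are hypothetical.

```python
WIND_OFFSET = 25   # subtracted before scaling, so the raw minimum of 35 does not map to 0
WIND_SCALE = 90    # raw maximum (115) minus the offset


def normalize_wind(raw_wind):
    """Scale a raw wind speed to the fraction stored in the cleaned file."""
    return (raw_wind - WIND_OFFSET) / WIND_SCALE


def denormalize_wind(fraction):
    """Recover the raw wind speed from a stored fraction."""
    return fraction * WIND_SCALE + WIND_OFFSET


print(normalize_wind(115))   # raw maximum
print(normalize_wind(35))    # raw minimum, stays above 0
```

With these constants, the raw maximum of 115 maps to 1.0 and the raw minimum of 35 maps to about 0.11.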

<br>

*The first header line indicates Spring 2000, next header line is the next season, thus it indicates Summer 2000, and next header lines will be: Autumn 2000, Winter 2000, Spring 2001, Summer 2001, and so on. The last header line indicates Winter 2022.*

<br>

Example:
2
23.0,139.5,0.5555555555555556
33.7,162.1,0.1
1
45.5,114.9,1.0

The example data means:
- Spring 2000 has 2 data lines:
     - latitude 23, longitude 139.5, and wind speed percentage of 0.56
     - latitude 33.7, longitude 162.1, and wind speed percentage of 0.1
- Summer 2000 has 1 data line:
     - latitude 45.5, longitude 114.9, and wind percentage of 1 (100%)
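
To show how the format can be consumed, here is a minimal sketch of a reader that walks the header and data lines and labels each block with its season. It is not part of the original notebook: `read_heatmap_data` and `SEASONS` are hypothetical names, and the reader assumes the first header line is Spring 2000 as described above.

```python
import io

SEASONS = ("Spring", "Summer", "Autumn", "Winter")


def read_heatmap_data(f, first_year=2000):
    """Parse the cleaned format into {(year, season): [(lat, long, wind), ...]}."""
    result = {}
    index = 0  # position of the current header line, counted from Spring of first_year
    for header in f:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines
        count = int(header)  # number of data lines that follow this header
        rows = []
        for _ in range(count):
            lat, lon, wind = (float(x) for x in next(f).split(","))
            rows.append((lat, lon, wind))
        year, season = divmod(index, 4)
        result[(first_year + year, SEASONS[season])] = rows
        index += 1
    return result


# The example above, read from an in-memory string instead of heatmap_data.txt
sample = io.StringIO(
    "2\n23.0,139.5,0.5555555555555556\n33.7,162.1,0.1\n1\n45.5,114.9,1.0\n"
)
parsed = read_heatmap_data(sample)
print(parsed)
```

Passing an open file object, for example `open('heatmap_data.txt')`, works the same way as the in-memory sample.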

In [9]:
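# write one header line per season (Spring 2000 to Winter 2022), followed by its data lines
with open('heatmap_data.txt', 'w') as f:
    for i in range(23):        # two-digit years 0..22
        for j in range(4):     # 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter
            key = i*10 + j
            records = data.get(key, [])   # seasons without data get a header of 0
            f.write(str(len(records)))
            f.write('\n')
            for k in records:
                f.write(','.join(str(num) for num in k))
                f.write('\n')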