# Filtering meteorological data

We will use meteorological data from Meteogalicia containing the measurements of a weather station in Santiago during June 2017.

## Load data

Load the data in `datasets/meteogalicia.txt` into an RDD:

In [2]:
rdd = sc.textFile('datasets/meteogalicia.txt')
rdd.take(5)

[u'',
 u'',
 u'ESTACI\ufffdN AUTOM\ufffdTICA:Santiago-EOAS',
 u'CONCELLO:Santiago de Compostela',
 u'PROVINCIA:A Coru\ufffda']
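The `\ufffd` characters in the output above are Unicode replacement characters, which usually appear when bytes are decoded with the wrong encoding. The following is a minimal local sketch of that effect, assuming (this is a guess, not something the dataset states) that the file was saved as Latin-1 and then read as UTF-8. It uses only plain Python strings and no Spark calls.

```python
# Assumption: the source file is Latin-1 encoded; this is not confirmed by the dataset.
# \xd3 and \xc1 are the Latin-1 code points of the accented O and A in the station header.
original = u'ESTACI\xd3N AUTOM\xc1TICA:Santiago-EOAS'
raw_bytes = original.encode('latin-1')

# Decoding Latin-1 bytes as UTF-8 turns each accented letter into U+FFFD.
as_utf8 = raw_bytes.decode('utf-8', errors='replace')

# Decoding with the assumed source encoding recovers the original text.
as_latin1 = raw_bytes.decode('latin-1')

print(repr(as_utf8))
print(as_latin1 == original)
```

The damaged characters only affect labels such as the unit, so the numeric columns used in the rest of the notebook are not affected.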

## Filter temperature data

Filter the RDD, keeping only the lines that contain "Temperatura media" (mean temperature):

In [4]:
temperature_lines = rdd.filter(lambda line: "Temperatura media" in line)
temperature_lines.take(5)

[u'      1          2017-06-01 00:10:00    Temperatura media (\ufffdC)                    13,82',
 u'      1          2017-06-01 00:20:00    Temperatura media (\ufffdC)                    13,71',
 u'      1          2017-06-01 00:30:00    Temperatura media (\ufffdC)                    13,61',
 u'      1          2017-06-01 00:40:00    Temperatura media (\ufffdC)                    13,52',
 u'      1          2017-06-01 00:50:00    Temperatura media (\ufffdC)                    13,33']

## Count the number of points

In [5]:
count_of_temperature_lines = temperature_lines.count()
print(count_of_temperature_lines)


4176


## Find the maximum temperature of the month

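Extract the column with the temperature strings and convert it to floats:

In [15]:
# After splitting on whitespace, field 6 (0-based) holds the temperature string, e.g. "13,82"
temperature_strings = temperature_lines.map(lambda line: line.split()[6])

def safe_float_conversion(temp_str):
    """Convert a decimal-comma string such as "13,82" to a float, or return None on failure."""
    try:
        return float(temp_str.replace(",", "."))
    except ValueError:
        return None  # Handle conversion errors

temperature_floats = temperature_strings.map(safe_float_conversion)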

# To check the first few converted float values
print(temperature_floats.take(5))

[13.82, 13.71, 13.61, 13.52, 13.33]
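As a possible extension, the timestamp can be parsed together with the value so that each measurement keeps its date and time. The sketch below is plain Python and uses a hypothetical helper name, `parse_measurement`, that does not appear in the original notebook. It assumes every line follows the layout of the sample lines shown earlier: station id, date, time, label, value.

```python
from datetime import datetime

def parse_measurement(line):
    """Return (timestamp, value) for one measurement line, or None if it cannot be parsed."""
    fields = line.split()
    try:
        # Fields 1 and 2 hold the date and the time; the last field holds the value.
        timestamp = datetime.strptime(fields[1] + " " + fields[2], "%Y-%m-%d %H:%M:%S")
        # The value uses a decimal comma, so swap it for a point before converting.
        value = float(fields[-1].replace(",", "."))
    except (IndexError, ValueError):
        return None
    return timestamp, value

# One of the lines printed earlier in the notebook.
sample = u'      1          2017-06-01 00:10:00    Temperatura media (\ufffdC)                    13,82'
parsed = parse_measurement(sample)
print(parsed)
```

A function like this could be passed to `temperature_lines.map(...)` in place of the lambda, followed by a filter that drops the `None` results.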


The strings in `temperature_strings` have the form "21,55", so `safe_float_conversion` replaces the "," with a "." before converting them to floats. Collecting the results to the driver confirms that the conversion worked:

In [19]:
values = temperature_floats.collect()
print(values[:5])

[13.82, 13.71, 13.61, 13.52, 13.33]


The collected values are ordinary Python floats, so they can be handled like any local list:

In [20]:
temperatures = temperature_floats.collect()
print(temperatures[:5])

[13.82, 13.71, 13.61, 13.52, 13.33]


Finally, we can calculate the maximum temperature with `reduce`:

In [21]:
max_temperature = temperature_floats.reduce(lambda x, y: max(x, y))
print("Maximum Temperature:", max_temperature)


('Maximum Temperature:', 34.4)
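`reduce` applies a two-argument function repeatedly until a single value remains. The local sketch below shows the idea with `functools.reduce` on the first five values printed earlier. It is only an approximation of `RDD.reduce`: Spark combines partial results from different partitions in no fixed order, so the function should be associative and commutative. Both addition and `max` qualify.

```python
from functools import reduce

# The first five converted values shown earlier in the notebook.
sample = [13.82, 13.71, 13.61, 13.52, 13.33]

# reduce combines the elements two at a time until one value is left.
total = reduce(lambda x, y: x + y, sample)
largest = reduce(lambda x, y: max(x, y), sample)

print(total, largest)
```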


Sometimes it is useful to explore the API to find more direct ways to do what we want.

In this case, the RDD object has a built-in **max()** method that does exactly this, so we can also write:

In [22]:
max_temperature = temperature_floats.max()
print("Maximum Temperature:", max_temperature)

('Maximum Temperature:', 34.4)


## Find the minimum temperature of the month

In [23]:
min_temperature = temperature_floats.reduce(lambda x, y: min(x, y))
print("Minimum Temperature:", min_temperature)


('Minimum Temperature:', -9999.0)


Reading the header of the dataset file, we can see that -9999 is used as a code to indicate N/A values.

So we have to filter out the -9999 values and repeat the calculation:

In [25]:
# Drop the N/A code; also drop None in case a value failed to convert
filtered_temperature_floats = temperature_floats.filter(lambda x: x is not None and x != -9999)

temperatures = filtered_temperature_floats.collect()

min_temperature = min(temperatures) if temperatures else None

print("Minimum Temperature:", min_temperature)

('Minimum Temperature:', 9.09)
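The same N/A code would also distort a sum or a mean, so it can help to keep the validity check in one named function. The sketch below is plain Python. `MISSING_VALUE` and `is_valid` are hypothetical names introduced here, and the sample list is a small illustrative one rather than real output from the dataset.

```python
MISSING_VALUE = -9999.0  # N/A code described in the dataset header

def is_valid(value):
    """Return True for real measurements, False for None and for the N/A code."""
    return value is not None and value != MISSING_VALUE

# Small illustrative list, not taken from the dataset.
sample = [13.82, MISSING_VALUE, 9.09, None, 34.4]
valid = [v for v in sample if is_valid(v)]

print(min(valid), max(valid), sum(valid) / len(valid))
```

With an RDD, the same predicate could be passed directly to `filter`, for example `temperature_floats.filter(is_valid)`, before computing the minimum, maximum, or mean.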
