# Filtering meteorological data

We will use meteorological data from Meteogalicia containing the measurements of a weather station in Santiago during June 2017.

## Load data

Load the data in `datasets/meteogalicia.txt` into an RDD:

In [None]:
# sc -> SparkContext (entry point to the program)

In [1]:
rdd = sc.textFile('datasets/meteogalicia.txt')

In [2]:
rdd.toDebugString()

'(2) datasets/meteogalicia.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n |  datasets/meteogalicia.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []'

In [4]:
rdd.take(5)  # returns the first 5 elements

[u'',
 u'',
 u'ESTACI\ufffdN AUTOM\ufffdTICA:Santiago-EOAS',
 u'CONCELLO:Santiago de Compostela',
 u'PROVINCIA:A Coru\ufffda']

In [6]:
rdd.takeSample(withReplacement=True, num=5)  # returns 5 random samples

[u'      1          2017-06-13 15:10:00    Velocidade do Vento (km/h)                 4,61',
 u'      1          2017-06-06 17:30:00    Visibilidade (m)                          20044',
 u'      1          2017-06-08 19:10:00    Velocidade do Vento (km/h)                 10,4',
 u'      1          2017-06-02 03:30:00    Velocidade do Vento (km/h)                 7,06',
 u'      1          2017-06-27 11:20:00    Chuvia (L/m2)                             0,2']

## Filter temperature data

Filter data from the RDD keeping only "Temperatura media" lines.

In [7]:
temperature_lines = rdd.filter(lambda line: 'Temperatura media' in line)

In [9]:
temperature_lines.count()

4176
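As a plain-Python sketch of what the filter predicate does (no Spark required), the same test can be applied to a small list of lines. The first two lines are copied from the sample output above; the "Temperatura media" line is a hypothetical example modeled on them, and the helper name `is_temperature` is an assumption made for illustration.

```python
# Plain-Python sketch of the predicate passed to rdd.filter (no Spark required).
sample_lines = [
    u'      1          2017-06-13 15:10:00    Velocidade do Vento (km/h)                 4,61',
    u'      1          2017-06-06 17:30:00    Visibilidade (m)                          20044',
    # Hypothetical temperature line, modeled on the sampled lines above
    u'      1          2017-06-01 00:00:00    Temperatura media (\ufffdC)                     13,82',
]


def is_temperature(line):
    """Return True when the line holds a 'Temperatura media' measurement."""
    return 'Temperatura media' in line


# The built-in filter keeps the elements for which the predicate is True,
# which is the same idea rdd.filter applies to every line of the file.
kept = list(filter(is_temperature, sample_lines))
print(len(kept))
```

Only the lines containing the substring survive, so the count that follows is the number of temperature measurements.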

## Count the number of points

In [10]:
temperature_lines.count()

4176

## Find the maximum temperature of the month

Extract the column with the temperature strings:

In [12]:
temperature_strings = temperature_lines.map(lambda line: line.split()[6])  # the value is the 7th whitespace-separated field

In [13]:
temperature_strings.take(5)

[u'13,82', u'13,71', u'13,61', u'13,52', u'13,33']
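To see why index 6 selects the value, it may help to split a single line by hand. The line below is a hypothetical example modeled on the sampled lines shown earlier (the unit token is an assumption), so treat this as a sketch of the layout rather than a copy of the real data.

```python
# Hypothetical temperature line, modeled on the sampled lines shown earlier
line = u'      1          2017-06-01 00:00:00    Temperatura media (\ufffdC)                     13,82'

# split() with no arguments splits on runs of whitespace and drops empty fields
fields = line.split()

# Under this assumed layout the tokens are:
# code, date, time, 'Temperatura', 'media', unit, value
print(len(fields))
print(fields[6])    # the value string, selected by position
print(fields[-1])   # the same token, without hard-coding the index
```

Taking the last field with `[-1]` is one possible alternative when the number of words in the measurement name varies between lines, as it does for other measurements in this file.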

The elements of `temperature_strings` are strings of the form "21,55". To convert them to floats, we first have to replace the "," with a ".":

In [14]:
values = temperature_strings.map(lambda t: t.replace(',', '.'))  # replace the decimal comma with a point in each element

In [15]:
values.take(5)

[u'13.82', u'13.71', u'13.61', u'13.52', u'13.33']

And now we can convert them to floats:

In [16]:
temperatures = values.map(lambda x: float(x))  # convert from string to number

In [17]:
temperatures.take(5)

[13.82, 13.71, 13.61, 13.52, 13.33]
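The two conversion steps can also be folded into one small function. The sketch below runs in plain Python on the five strings shown above; `parse_decimal_comma` is a hypothetical helper name, not something defined in the notebook.

```python
def parse_decimal_comma(text):
    """Convert a decimal-comma string such as '13,82' to a float."""
    return float(text.replace(',', '.'))


# The five strings returned by temperature_strings.take(5) above
strings = ['13,82', '13,71', '13,61', '13,52', '13,33']
temperatures_local = [parse_decimal_comma(s) for s in strings]
print(temperatures_local)

# Since map accepts any function, the same helper could be passed to a
# single RDD map call: temperature_strings.map(parse_decimal_comma)
```

Keeping the steps separate, as the notebook does, makes each intermediate RDD easy to inspect; a single function is shorter once the logic is understood.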

Finally we can calculate the maximum temperature:

In [18]:
temperatures.reduce(lambda x, y: max(x, y))

34.4

In [19]:
temperatures.reduce(lambda x, y: x if x > y else y)

34.4

Sometimes it is useful to explore the API to find more direct ways to do what we want.

In this case we can see that the RDD object has a built-in **max()** method that does exactly this, so we can also write:

In [20]:
temperatures.max()

34.4
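The three approaches agree because `reduce` folds the elements pairwise with the given function. Spark's `reduce` expects a commutative and associative binary operator, and taking the larger of two values satisfies both. A minimal plain-Python sketch using `functools.reduce` on a small illustrative list (the name `values_local` and the list itself are assumptions, with values taken from outputs shown in this notebook):

```python
from functools import reduce

# Small illustrative list; the values appear in outputs in this notebook
values_local = [13.82, 13.71, 34.4, 13.52, 9.09]

# Fold pairwise with the built-in max
with_max = reduce(lambda x, y: max(x, y), values_local)

# Fold pairwise with an explicit comparison
with_compare = reduce(lambda x, y: x if x > y else y, values_local)

# Ask for the maximum directly
direct = max(values_local)

print(with_max, with_compare, direct)
```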

## Find the minimum temperature of the month

In [22]:
temperatures.reduce(lambda x, y: x if x < y else y)

-9999.0

Reading the header of the dataset file, we can see that -9999 is used as a code to indicate N/A values.

So we have to filter out -9999 and repeat:

In [25]:
temperatures.filter(lambda x: x != -9999).reduce(lambda x, y: x if x < y else y)

9.09

In [26]:
temperaturesv2 = temperatures.filter(lambda x: x != -9999)

In [27]:
temperaturesv2.reduce(lambda x, y: x if x < y else y)

9.09
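The same filter-then-reduce idea can be sketched in plain Python, giving the N/A code a name so its meaning is visible in the code. The constant `NA_CODE` and the short list of readings are assumptions made for illustration. The RDD API also provides a `min()` method analogous to `max()`, so the final reduce could be written that way as well.

```python
# N/A code described in the dataset header (name chosen for illustration)
NA_CODE = -9999.0

# Small illustrative list mixing valid readings with the N/A code
readings = [13.82, NA_CODE, 9.09, 34.4, NA_CODE]

# Drop the N/A code before computing any statistic
valid = [x for x in readings if x != NA_CODE]

print(min(readings))  # misleading: the N/A code wins
print(min(valid))     # minimum over real measurements only
```

Filtering once and reusing the cleaned collection, as `temperaturesv2` does, avoids repeating the N/A check in every later computation.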