### Find the minimum temp observed at each station in the year 1800  

#### Sample data from dataset
| station id | date | label | temp (1/10 °C) |
|---|---|---|---|
| ITE00100554 | 18000116 | TMIN | 41 |

```
['ITE00100554,18000116,TMIN,41,,,E,',
 'ITE00100554,18000124,TMAX,85,,,E,',
 'ITE00100554,18000213,TMAX,13,,,E,',
 'EZE00100082,18000305,TMIN,-47,,,E,',
 'GM000010962,18000321,PRCP,6,,,E,',
 'GM000010962,18000419,PRCP,5,,,E,',
 'EZE00100082,18000804,TMAX,316,,,E,']
```

In [1]:
from pyspark import SparkContext, SparkConf

In [6]:
# Use this in standalone scripts:
# conf = SparkConf().setMaster('local').setAppName('Minimum temp')
# sc = SparkContext.getOrCreate(conf=conf)

# In a PySpark JupyterLab session sc already exists, and creating a second
# SparkContext raises an error, so reuse the existing one.
sc = SparkContext.getOrCreate()
sc.appName='Minimum temp'
sc

In [7]:
data = sc.textFile("./datasets/1800-temperatures.csv")

In [20]:
# sample() is a transformation, so nothing is read until collect() runs.
# collect() then returns only the sampled lines, which is handy for a quick look.
data.sample(False, 0.001).collect()

['ITE00100554,18000116,TMIN,41,,,E,',
 'ITE00100554,18000124,TMAX,85,,,E,',
 'ITE00100554,18000213,TMAX,13,,,E,',
 'EZE00100082,18000305,TMIN,-47,,,E,',
 'GM000010962,18000321,PRCP,6,,,E,',
 'GM000010962,18000419,PRCP,5,,,E,',
 'EZE00100082,18000804,TMAX,316,,,E,']

## Remove unwanted columns
## Columns of interest
* Station id
* Label: TMAX or TMIN
* Temperature in tenths of a degree Celsius, converted to Fahrenheit

  \begin{align} F = \frac{9}{5}\,C + 32 \end{align}
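
As a quick check of the conversion, here is a minimal plain-Python sketch (no Spark needed). The function name `tenths_celsius_to_fahrenheit` is made up for this example; the raw values come from the sample rows above. A raw value of 41 is 4.1 °C, which is about 39.38 °F.

```python
def tenths_celsius_to_fahrenheit(raw):
    """Convert a reading in tenths of a degree Celsius to Fahrenheit."""
    celsius = float(raw) * 0.1          # 41 -> 4.1 degrees C
    return celsius * (9.0 / 5.0) + 32.0  # F = (9/5) * C + 32

# Raw temperature fields taken from the sample rows above
for raw in ("41", "-47", "316"):
    print(raw, "->", round(tenths_celsius_to_fahrenheit(raw), 2))
```

The rounding is only for display; the underlying floats can carry tiny binary rounding errors.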

In [46]:
def collectDOI(line): # collect data of interest
    data = line.split(",")
    station_id = data[0]
    label = data[2]
    temp = float(data[3]) * 0.1 * (9.0/5.0) + 32
    return (station_id, label, temp)


In [47]:
tempBySId = data.map(collectDOI)
# e.g. ('ITE00100554', 'TMIN', 39.38), since 41 tenths of a degree C is about 39.38 F

--------------------------- --------------------

### MIN TEMP 

In [33]:
# Keep only the TMIN rows, as required
# e.g. ('ITE00100554', 'TMIN', 39.38)
temp_min = tempBySId.filter(lambda x: x[1] == 'TMIN')
# Drop the TMIN label, which is no longer needed
stationTemps = temp_min.map(lambda x: (x[0], x[2]))
# e.g. ('ITE00100554', 39.38)
# Minimum temperature for each station
minTemps = stationTemps.reduceByKey(lambda x, y: min(x, y))
results = minTemps.collect()
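
To see what the filter, map and reduceByKey steps compute, here is a rough plain-Python sketch of the same logic. It is not how Spark runs it (Spark combines values per partition and then across partitions). The helper `reduce_by_key` and the temperature values are made up for illustration.

```python
def reduce_by_key(pairs, func):
    """Combine values that share a key, two at a time, similar in spirit to reduceByKey."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

# Hypothetical (station_id, label, temp_f) records, shaped like collectDOI's output
records = [
    ("ITE00100554", "TMIN", 39.38),
    ("ITE00100554", "TMAX", 47.30),
    ("EZE00100082", "TMIN", 23.54),
    ("ITE00100554", "TMIN", 25.70),
    ("EZE00100082", "TMIN", 18.50),
]

# filter: keep TMIN rows; map: drop the label and keep (station_id, temp)
tmin_pairs = [(sid, temp) for sid, label, temp in records if label == "TMIN"]

# reduceByKey: lowest temperature per station
min_by_station = reduce_by_key(tmin_pairs, min)
print(min_by_station)
```

Because `min` is associative and commutative, the order in which pairs are combined does not change the result, which is what makes it safe to use with `reduceByKey`.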

In [35]:
for result in results:
    print("%s  \t%.2fF" % (result))

ITE00100554  	5.36F
EZE00100082  	7.70F


----------------------------------

### Find the maximum temp observed at each station in the year 1800  

### MAX TEMP

In [52]:
# Keep only the TMAX rows
temp_max = tempBySId.filter(lambda x: x[1] == 'TMAX')
# e.g. ('ITE00100554', 'TMAX', 47.3), since 85 tenths of a degree C is about 47.3 F

In [58]:
# Drop the TMAX label, which is no longer needed
stationTemps = temp_max.map(lambda x: (x[0], x[2]))
# e.g. ('ITE00100554', 47.3)

In [59]:
# Reduce by key to find the maximum temperature for each station
maxTemps = stationTemps.reduceByKey(lambda temp_a, temp_b: max(temp_a, temp_b))

In [60]:
results = maxTemps.collect()

In [62]:
for result in results:
    print("%s \t %.2fF" % result)

ITE00100554 	 90.14F
EZE00100082 	 90.14F
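
The min and max sections differ only in the label they keep and the function they reduce with. A small plain-Python sketch of that shared pattern follows. The helper name `extreme_by_station` is made up here, and the records are the TMIN/TMAX sample rows shown earlier, already converted to Fahrenheit, so this covers only those five rows and not the full dataset.

```python
def extreme_by_station(records, label, pick):
    """Keep rows with the given label, then reduce each station's temps with pick."""
    result = {}
    for station_id, row_label, temp in records:
        if row_label != label:
            continue  # skip rows for other labels
        result[station_id] = pick(result[station_id], temp) if station_id in result else temp
    return result

# (station_id, label, temp_f) for the TMIN/TMAX sample rows, converted to Fahrenheit
records = [
    ("ITE00100554", "TMIN", 39.38),
    ("ITE00100554", "TMAX", 47.30),
    ("ITE00100554", "TMAX", 34.34),
    ("EZE00100082", "TMIN", 23.54),
    ("EZE00100082", "TMAX", 88.88),
]

lowest = extreme_by_station(records, "TMIN", min)
highest = extreme_by_station(records, "TMAX", max)
print(lowest)
print(highest)
```

In the notebook itself, the same idea would mean passing the label and the reducing function into one helper instead of repeating the filter, map and reduceByKey cells.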


-----------