# Cloud Computing for Distributed Big Data Applications - Practical Session 2

As you are all aware, Big Data and Cloud Computing are two distinct ideas, but the two
concepts are often closely intertwined. Big Data refers to very large amounts of data
that can be structured, semi-structured or unstructured.

It is usually collected from different
sources (e.g., user input, sensors, sales data) for analytical purposes. The two main purposes
of collecting this data are i) to find relevant patterns that can be exploited later (e.g., to build
statistical models) and ii) to process it in order to answer some query.

Your challenge today
(if you accept it :) ) is to dive deep into the data storytelling of one of two datasets, which you
can download from Moodle.

## 1. Datasets

### 1.1 Global Warming Trends

This dataset by the data science nonprofit Berkeley Earth reports on how land
temperature varies by city (the bigger file) and on average across the planet. This
data is (mostly) already cleaned. The idea here is to dive deeper into global
surface temperature anomalies through your analysis. While doing so, please
document the different queries and processing steps you thought of, as well as the
results and observations that came out of them. You will be asked to hand
them in at the end of the course.

As a start you could run a sanity check, e.g., the uniqueness of the (country,
city) pair on a specific day, whether the coordinates match the specified
city and country (to this end, you can use libraries like geopy), implausible
temperature values, and anything relevant that comes to your mind.
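
A minimal sketch of two of these checks, run on a tiny invented frame rather than on the real file. The column names follow GlobalLandTemperaturesByCity.csv, but the rows and the temperature bounds are assumptions chosen for illustration:

```python
import pandas as pd

# Invented sample rows; only the column names come from the city file
sample = pd.DataFrame({
    'dt': ['1900-01-01', '1900-01-01', '1900-02-01', '1900-01-01'],
    'City': ['Paris', 'Paris', 'Paris', 'Dublin'],
    'Country': ['France', 'France', 'France', 'Ireland'],
    'AverageTemperature': [3.1, 3.1, 4.0, 150.0],
})

# uniqueness check: every (Country, City, dt) triple should appear only once
dup_mask = sample.duplicated(subset=['Country', 'City', 'dt'], keep=False)
duplicates = sample[dup_mask]

# plausibility check: the bounds are an arbitrary assumption for monthly means in °C
implausible = sample[~sample['AverageTemperature'].between(-70, 60)]

print(len(duplicates), "duplicated rows,", len(implausible), "implausible rows")
```

On the real data the same two filters would be applied to `city_df` once it is loaded.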

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [3]:
global_df = pd.read_csv('data/GlobalTemperatures.csv')
global_df.head()

Unnamed: 0,dt,LandAverageTemperature,LandAverageTemperatureUncertainty,LandMaxTemperature,LandMaxTemperatureUncertainty,LandMinTemperature,LandMinTemperatureUncertainty,LandAndOceanAverageTemperature,LandAndOceanAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574,,,,,,
1,1750-02-01,3.083,3.702,,,,,,
2,1750-03-01,5.626,3.076,,,,,,
3,1750-04-01,8.49,2.451,,,,,,
4,1750-05-01,11.573,2.072,,,,,,


In [4]:
# plot each temperature with dt as the x axis
fig = px.line(global_df, x='dt', y=['LandAverageTemperature', 'LandMaxTemperature', 'LandMinTemperature', 'LandAndOceanAverageTemperature'])
fig.show()

We notice some data is missing before the 1850s. We will not use this data for our analysis, because I like an easy lifestyle.

In [5]:
city_df = pd.read_csv('data/GlobalLandTemperaturesByCity.csv')
city_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [24]:


# drop rows with missing values; .copy() gives an independent frame,
# so the assignments below do not trigger SettingWithCopyWarning
city_df_clean = city_df.dropna().copy()

# convert the dt column to datetime
city_df_clean['dt'] = pd.to_datetime(city_df_clean['dt'])

# convert the latitude and longitude to signed floats:
# the trailing letter gives the hemisphere, S and W become negative
lat_sign = city_df_clean['Latitude'].str[-1].map({'N': 1, 'S': -1})
lon_sign = city_df_clean['Longitude'].str[-1].map({'E': 1, 'W': -1})
city_df_clean['Latitude'] = pd.to_numeric(city_df_clean['Latitude'].str[:-1]) * lat_sign
city_df_clean['Longitude'] = pd.to_numeric(city_df_clean['Longitude'].str[:-1]) * lon_sign

city_df_clean.to_csv('data/city_df_clean.csv', index=False)
city_df_clean.head()




In [7]:
# plot some temperatures of some cities, there are too many to plot everything
fig = px.line(city_df_clean[:100000], x='dt', y='AverageTemperature', color='City')
fig.show()

We can conclude that, once again, this plot isn't super legible...

## Questions

#### 1. Can global warming be observed on earth’s temperature evolution?

The graph we previously plotted does not help us, because the month-to-month variation of the data is too high. We will try to plot a moving average over a window of 100 months (a bit more than 8 years) instead.


In [8]:
# compute moving averages in global_df
window_size = 100
global_df['LandAverageTemperature_avg'] = global_df['LandAverageTemperature'].rolling(window_size).mean()
global_df['LandMaxTemperature_avg'] = global_df['LandMaxTemperature'].rolling(window_size).mean()
global_df['LandMinTemperature_avg'] = global_df['LandMinTemperature'].rolling(window_size).mean()

fig = px.line(global_df, x='dt', y=['LandAverageTemperature_avg', 'LandMaxTemperature_avg', 'LandMinTemperature_avg'])
fig.show()

Now we can conclude that **yes**, the temperatures are increasing, yeesh.

#### 2. Can the average country temperature be plotted in a compact way? You may take a (logical) sample of countries. You may also get the year’s average temperature for each.

For this I will choose Western Europe only, because it is where I live and I am biased. I will also take the average temperature per year; we don't need to know it for every month.

So yes, it's possible if you put enough time into it and you are willing to sacrifice your sanity.

In [9]:
year_df = city_df_clean.copy()
# use recent data and create year column
year_df['year'] = year_df['dt'].dt.year
year_df = year_df[year_df['year'] > 1800]

# average the monthly values per (country, city, year)
year_df = year_df.groupby(['Country', 'City', 'year'], as_index=False).mean(numeric_only=True)

# 25-year moving average for smoother curves, computed within each city
# so that the window never mixes values from two different cities
year_df['AverageTemperature_avg'] = (
    year_df.groupby(['Country', 'City'])['AverageTemperature']
    .transform(lambda s: s.rolling(25).mean())
)

year_df.to_csv('data/tmp_year_df.csv', index=False)

In [10]:

year_df = pd.read_csv('data/tmp_year_df.csv')
# only keep rows in year_df where the country is 'France', 'Germany', 'Italy', 'Spain' or 'United Kingdom'
euro_df = year_df[year_df['Country'].isin(['France', 'Germany', 'Italy', 'Spain', 'United Kingdom'])]
# average over the cities of each country, per year
euro_df = euro_df.groupby(['Country', 'year'], as_index=False).mean(numeric_only = True)

# plot temperatures per country using px
fig = px.line(euro_df, x='year', y='AverageTemperature', color='Country')
fig.show()

#### 3. What does the comparison of the evolution of the temperature between two drastically different countries (location wise) allow you to observe?

In [11]:
year_df = pd.read_csv('data/tmp_year_df.csv')
compare_df = year_df[year_df['Country'].isin(['France', 'Canada', 'South Africa'])]
# keep even years only to make the plot lighter
compare_df = compare_df[compare_df['year'] % 2 == 0]
# average over the cities of each country, per year
compare_df = compare_df.groupby(['Country', 'year'], as_index=False).mean(numeric_only = True)

# plot temperatures per country using px
fig = px.line(compare_df, x='year', y='AverageTemperature', color='Country')
fig.show()

We notice that while the average temperatures are different, they all seem to increase at roughly the same rate. This is probably because of global warming 🤔

#### 4. Same question for cities


In [12]:
year_df = pd.read_csv('data/tmp_year_df.csv')
compare_df = year_df[year_df['City'].isin(['Baglan', 'Dublin'])]

# plot temperatures per city using px
fig = px.line(compare_df, x=compare_df.year, y='AverageTemperature', color='City')
fig.show()

We notice the same increase over the years.

#### 5. How does one specific country evolve between two distinct years?

We can see from the previous plots that the temperature of one year says little about the temperature of the next: the year-to-year changes look like noise. In order to notice the trend of global warming, we need to look at how the temperature changes over many years.
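
A rough sketch of why two single years are not enough, using a synthetic series rather than the dataset. The trend of 0.01 °C per year and the noise level are made-up numbers, and a least-squares line fitted with `numpy.polyfit` is only one possible way to summarise a long-term change:

```python
import numpy as np

# Synthetic yearly means: a 0.01 °C/year trend plus noise (made-up numbers)
rng = np.random.default_rng(0)
years = np.arange(1900, 2000)
temps = 10.0 + 0.01 * (years - 1900) + rng.normal(0.0, 0.5, size=years.size)

# difference between two consecutive years: dominated by the noise
two_year_diff = temps[51] - temps[50]

# least-squares slope over the whole century, in °C per year
slope, intercept = np.polyfit(years, temps, deg=1)

print(f"1951 vs 1950: {two_year_diff:+.2f} °C")
print(f"fitted trend: {slope * 100:+.2f} °C per century")
```

The same fit could be applied to the yearly averages of one country in `year_df`.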

#### 6. Can the Arctic Ice Melting be observed by looking at the temperature changes in northern cities?

Yes, watch this:



In [23]:
northern_cities = city_df_clean.copy()

# keep cities north of 60° latitude
northern_cities = northern_cities[northern_cities['Latitude'] > 60]

northern_cities['year'] = northern_cities['dt'].dt.year
northern_cities = northern_cities[northern_cities['year'] > 1800]
# keep one year out of five to make the plot lighter
northern_cities = northern_cities[northern_cities['year'] % 5 == 0]

# yearly average per city
northern_cities = northern_cities.groupby(['City', 'year'], as_index=False).mean(numeric_only = True)
# plot temperatures per city using px
fig = px.line(northern_cities, x='year', y='AverageTemperature', color='City')
fig.show()

#### 7. Plot these cities on a map

gg ez


In [28]:
northern_cities = pd.read_csv('data/city_df_clean.csv')

# Select distinct cities with their Longitude and Latitude
northern_cities = northern_cities[['City', 'Latitude', 'Longitude']].drop_duplicates()
# only keep cities whose latitude is greater than 60
northern_cities = northern_cities[northern_cities['Latitude'] > 60]

fig = px.scatter_mapbox(northern_cities, lat="Latitude", lon="Longitude", hover_name="City",
                        color_discrete_sequence=["red"], zoom=3, height=300)
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()


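We notice that cities like Reykjavík, the capital of Iceland, end up in Sweden when longitudes are plotted as unsigned numbers. The dataset itself is probably not at fault: Reykjavík lies at about 22°W, and a longitude whose E/W suffix is stripped without making western values negative lands at about 22°E, which is in northern Sweden.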
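
A minimal sketch of a sign-aware conversion for coordinates written like `57.05N` or `10.33E`, as in the city file. `to_signed` is a hypothetical helper and the sample values are made up for illustration. With it, southern latitudes and western longitudes come out negative, which is what map libraries expect:

```python
import pandas as pd

def to_signed(coord: pd.Series) -> pd.Series:
    """Convert strings such as '21.09W' to signed floats (S and W negative)."""
    sign = coord.str[-1].map({'N': 1.0, 'E': 1.0, 'S': -1.0, 'W': -1.0})
    return pd.to_numeric(coord.str[:-1]) * sign

# Made-up coordinates: one north-west point and one south-east point
coords = pd.DataFrame({
    'Latitude': ['65.09N', '34.56S'],
    'Longitude': ['21.09W', '18.48E'],
})

# apply the helper to each column in turn
signed = coords.apply(to_signed)
print(signed)
```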

#### 8. Quantify the autocorrelation of the average temperature of the country of your choice.
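
One possible way to approach this, sketched on a synthetic monthly series instead of a real country. The seasonal amplitude and the noise are invented; `Series.autocorr` computes the Pearson correlation between a series and a copy of itself shifted by `lag` rows:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a seasonal cycle plus noise (made-up, 30 years)
rng = np.random.default_rng(1)
months = pd.date_range('1980-01-01', periods=360, freq='MS')
month_number = months.month.to_numpy()
seasonal = 10.0 + 8.0 * np.sin(2 * np.pi * (month_number - 1) / 12)
temps = pd.Series(seasonal + rng.normal(0.0, 1.0, size=360), index=months)

# autocorrelation at a few lags, expressed in months
acf = {lag: temps.autocorr(lag=lag) for lag in (1, 6, 12)}
print(acf)
```

On monthly data the seasonal cycle dominates: the value at lag 12 is strongly positive and the value at lag 6 strongly negative. Applying the same call to yearly means would remove that effect and show how much one year resembles the next.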


#### 9. Is the temperature evolution of a northern city correlated with the evolution of a southern one? a correlation heatmap could be interesting.
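
A sketch of how such a correlation matrix could be computed, assuming a long table with `year`, `City` and `AverageTemperature` columns like `year_df`. The three cities and their values below are synthetic and hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic yearly means for three hypothetical cities sharing one warming trend
rng = np.random.default_rng(2)
years = np.arange(1900, 2000)
trend = 0.01 * (years - 1900)
long_df = pd.concat([
    pd.DataFrame({
        'year': years,
        'City': name,
        'AverageTemperature': base + trend + rng.normal(0.0, 0.3, years.size),
    })
    for name, base in [('NorthCity', -2.0), ('MidCity', 11.0), ('SouthCity', 25.0)]
])

# one column per city, one row per year, then pairwise Pearson correlation
wide = long_df.pivot(index='year', columns='City', values='AverageTemperature')
corr = wide.corr()
print(corr.round(2))
```

With the `plotly.express` import used earlier, `px.imshow(corr, zmin=-1, zmax=1)` would draw the matrix as a heatmap.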

#### 10. Can cities be (manually) clusterized over their temperature?
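
A minimal sketch of a manual clustering, assuming one long-run mean temperature per city. The cities, their values and the bin edges are all invented; `pd.cut` simply assigns each city to a hand-picked temperature range:

```python
import pandas as pd

# Invented long-run mean temperatures (°C) for hypothetical cities
city_means = pd.Series(
    {'ColdTown': -3.5, 'CoolTown': 7.2, 'MildTown': 14.8, 'WarmTown': 21.0, 'HotTown': 28.4},
    name='AverageTemperature',
)

# hand-picked bin edges; these thresholds are an assumption, not from the dataset
bins = [-50, 5, 18, 50]
labels = ['cold', 'temperate', 'hot']
clusters = pd.cut(city_means, bins=bins, labels=labels)
print(clusters)
```

On the real data, `city_means` could be obtained with `year_df.groupby('City')['AverageTemperature'].mean()`.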