# Climate Change Data Analysis

The goal of this project is to explore land temperature data in different areas over time, to assess the severity of climate change.

In [1]:
import pickle
import os
import sys
import pandas as pd
import numpy as np
import copy

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

%reload_ext autoreload
%autoreload 2

In [2]:
# Load data

with open('data/GlobalLandTemperaturesByCity.csv', 'r') as f:
    city_data = pd.read_csv(f)

## Data Cleaning

First we'll investigate the various variables in the data and do some cleaning.

In [4]:
city_data.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [5]:
city_data.shape

(8599212, 7)

In [12]:
city_data.dtypes

dt                                object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                              object
Country                           object
Latitude                          object
Longitude                         object
dtype: object

The data is largely self-explanatory: we have average temperatures, each with an uncertainty, for cities over time. We can convert the date column into datetime objects and parse the latitude and longitude by stripping the direction letter and multiplying by -1 for S and W.

In [3]:
city_data['time'] = pd.to_datetime(city_data['dt'], format='%Y-%m-%d')

In [33]:
'34.56S'.split('N')

['34.56S']

In [5]:
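def parse_lat_lon(val, t):
    """Convert a coordinate string such as '57.05N' into a signed float.

    t is 'lat' for latitudes; any other value is treated as a longitude.
    """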
    if t == 'lat':
        if 'N' in val:
            return float(val.split('N')[0])
        elif 'S' in val:
            return -1 * float(val.split('S')[0])
        else:
            return np.nan
    else:
        if 'E' in val:
            return float(val.split('E')[0])
        elif 'W' in val:
            return -1 * float(val.split('W')[0])
        else:
            return np.nan
            

In [11]:
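# Use Series.apply so the result is a Series under both Python 2 and Python 3
# (a bare map() returns a lazy iterator in Python 3)
city_data['Latitude'] = city_data['Latitude'].apply(lambda x: parse_lat_lon(x, 'lat'))
city_data['Longitude'] = city_data['Longitude'].apply(lambda x: parse_lat_lon(x, 'lon'))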

## Data Exploration

Now we'll look into the data further. Counting the unique cities and countries gives the following:

In [8]:
cities = np.unique(city_data['City'])
countries = np.unique(city_data['Country'])

In [15]:
print('Number of unique cities: ' + str(len(cities)))
print('Number of unique countries: ' + str(len(countries)))

Number of unique cities: 3448
Number of unique countries: 159


We'll look at the distribution of cities on a map:

In [12]:
unique_city_data = city_data.groupby('City').first()

In [None]:
mpis = [{'lat': unique_city_data['Latitude'],
  'lon': unique_city_data['Longitude'],
  'marker': {'color': 'rgb(0,116,217)',
   'line': {'color': 'rgb(40,40,40)', 'width': 0.5},
   'size': 5,
   'sizemode': 'diameter'},
  'text': unique_city_data.index,
  'type': 'scattergeo'},
]


layout = go.Layout(
    title = 'City Distribution',
    showlegend = True,
    geo = dict(
            scope='world',
            projection=dict( type = 'natural earth'),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)"
        ),)

fig =  go.Figure(layout=layout, data=mpis)
iplot( fig, validate=False)

Now we look at the amount of temperature data we have for each city - i.e. the distribution of the start year, end year and average sampling rate.

In [105]:
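# Boolean mask: True for rows with no missing value in any column
# (DataFrame has no .none() method, so use notnull().all() instead)
n = city_data.notnull().all(axis=1)
clean_data = city_data.loc[n, :]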

MemoryError: 

In [104]:
clean_data.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude,time
1,1743-12-01,,,Århus,Denmark,57.05,10.33,1743-12-01 00:00:00
2,1744-01-01,,,Århus,Denmark,57.05,10.33,1744-01-01 00:00:00
3,1744-02-01,,,Århus,Denmark,57.05,10.33,1744-02-01 00:00:00
4,1744-03-01,,,Århus,Denmark,57.05,10.33,1744-03-01 00:00:00
9,1744-08-01,,,Århus,Denmark,57.05,10.33,1744-08-01 00:00:00


In [92]:
# Compute the statistics only over rows that actually have a temperature reading
valid = city_data.loc[city_data['AverageTemperature'].notnull(), ['City', 'time']]
times = valid.groupby('City')['time']
time_stats = pd.DataFrame({'end_year': times.max().dt.year,
                           'start_year': times.min().dt.year,
                           'num_samps': times.count(),
                           'ave_samp_rate': times.apply(lambda x: x.diff().mean().total_seconds() / 3600.0)},
                          columns=['end_year', 'start_year', 'num_samps', 'ave_samp_rate'])

MemoryError: 

In [65]:
time_stats.columns = ['end_year', 'start_year', 'num_samps', 'ave_samp_rate']

In [60]:
print('Number of unique start years: ' + str(len(np.unique(time_stats['start_year']))))
print('Number of unique end years: ' + str(len(np.unique(time_stats['end_year']))))

Number of unique start years: 65
Number of unique end years: 1


All data collection appears to end in the same year:

In [61]:
print(time_stats['end_year'].iloc[0])

2013


Below we plot histograms for the start year, number of samples and average sampling rate:

In [54]:
iplot(go.Figure(data = [go.Histogram(x = time_stats['start_year'])],
               layout = go.Layout(title = 'Distribution of Start Year of Temp Samples')))

In [66]:
iplot(go.Figure(data = [go.Histogram(x = time_stats['num_samps'])],
               layout = go.Layout(title = 'Distribution of Number of Temp Samples per City')))

In [58]:
iplot(go.Figure(data = [go.Histogram(x = time_stats['ave_samp_rate'])],
               layout = go.Layout(title = 'Distribution of Average Sampling Rate (Hours)')))

The data collection begins for each city somewhere in the 1700s or 1800s and ends in 2013. The vast majority of cities were sampled about once a month (an average interval of ~730 hours); some were sampled more often, but none less often than once a month. Thus, the majority of cities have about 2000-3000 total temperature samples, and the cities with higher sampling rates have up to 9000 samples.
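
As a rough sanity check on these figures, the arithmetic can be worked through directly. This is only a sketch: it assumes a 365.25-day year, and the start year of 1800 is a hypothetical value chosen for illustration, since the actual start years vary by city.

```python
# Average number of hours in a month, assuming a 365.25-day year
hours_per_month = 365.25 * 24 / 12

# Monthly samples for a hypothetical city recorded from 1800 through 2013
illustrative_start_year = 1800  # assumption, for illustration only
end_year = 2013
expected_samples = (end_year - illustrative_start_year) * 12

print(hours_per_month)   # 730.5
print(expected_samples)  # 2556
```

Both numbers are consistent with the histograms described above: an interval of roughly 730 hours and a sample count between 2000 and 3000.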

This is good since we'll be able to get temperature trends over 200 or so years and likely be able to remove seasonal variations. In addition, we can get these trends for many cities around the world and compare them.
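
One possible way to remove seasonal variation from a monthly series is sketched below. It uses synthetic data rather than the project data, and the variable names (`series`, `smoothed`, `anomalies`) are illustrative assumptions; it is not necessarily the approach this analysis will end up using.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (NOT the project data): a slow warming trend
# plus a seasonal cycle that repeats every 12 months.
months = pd.date_range('1900-01-01', periods=240, freq='MS')
trend = 0.01 * np.arange(240)
seasonal = 10.0 * np.sin(2 * np.pi * np.arange(240) / 12.0)
series = pd.Series(15.0 + trend + seasonal, index=months)

# Option 1: a centered 12-month rolling mean averages over one full
# seasonal cycle, so the seasonal term cancels and the trend remains.
smoothed = series.rolling(window=12, center=True).mean()

# Option 2: monthly anomalies, i.e. subtract each calendar month's
# long-run mean from the readings taken in that month.
anomalies = series - series.groupby(series.index.month).transform('mean')
```

The rolling mean loses a few points at each end of the series, while the anomaly approach keeps every point but only removes a seasonal pattern that is stable from year to year.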

Given this, we can refine the goal of this analysis: assess the rate of change of temperature over the last 200 years for cities around the world and for the planet as a whole. We will also attempt to predict the year in which temperatures will reach 2C and 4C above pre-industrial levels.
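
A minimal sketch of how such a prediction could be made is shown below, assuming the warming is approximated by a straight line fitted with least squares. The helper `year_reaching`, the baseline value and the synthetic data are all hypothetical; a linear extrapolation is a strong simplification and is shown only to make the idea concrete.

```python
import numpy as np

def year_reaching(years, temps, baseline, threshold):
    """Fit temps = slope * year + intercept by least squares and return
    (slope, year at which the fitted line reaches baseline + threshold).

    The year is None when the fitted slope is not positive."""
    slope, intercept = np.polyfit(years, temps, 1)
    if slope <= 0:
        return slope, None  # no warming trend, so the line never crosses
    return slope, (baseline + threshold - intercept) / slope

# Synthetic annual means (NOT the project data): 0.005 C of warming per
# year, starting from a hypothetical baseline of 10 C in 1800.
years = np.arange(1800, 2014)
temps = 10.0 + 0.005 * (years - 1800)

slope, year_2c = year_reaching(years, temps, baseline=10.0, threshold=2.0)
print(round(slope, 4))  # 0.005
print(round(year_2c))   # 2200
```

With real data the fit would be applied to deseasonalized or annually averaged temperatures, and the choice of baseline period would need to be justified.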

**Insert info about why 2C and 4C chosen**

## Trend Analysis

Now we'll investigate any trends present in the temperature data.  First we'll plot the temperature over time for a few cities around the world.  This paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.491.295&rep=rep1&type=pdf identifies a representative major city for each climate type:

- Singapore (tropical)
- Cairo, Egypt (hot-arid)
- Milan, Italy (temperate)
- Fort Smith, Canada (cold)
- Resolute (polar)

The dataset has no records for Fort Smith or Resolute, and in fact none for any polar area. Instead we'll use Norilsk, Russia to represent a cold climate, even though its climate is actually both cold and temperate (https://en.climate-data.org/asia/russian-federation/krasnoyarsk-krai/norilsk-1831/).


In [78]:
data = city_data.loc[city_data['City'] == 'Singapore',:]
iplot(go.Figure(data = [go.Scatter(x = data['time'], y = data['AverageTemperature'])], 
                layout = go.Layout(title = 'Temperature of Singapore Over Time')))

In [75]:
data = city_data.loc[city_data['City'] == 'Cairo',:]
iplot(go.Figure(data = [go.Scatter(x = data['time'], y = data['AverageTemperature'])], 
                layout = go.Layout(title = 'Temperature of Cairo Over Time')))

In [76]:
data = city_data.loc[city_data['City'] == 'Milan',:]
iplot(go.Figure(data = [go.Scatter(x = data['time'], y = data['AverageTemperature'])], 
                layout = go.Layout(title = 'Temperature of Milan Over Time')))

In [77]:
data = city_data.loc[city_data['City'] == 'Norilsk',:]
iplot(go.Figure(data = [go.Scatter(x = data['time'], y = data['AverageTemperature'])], 
                layout = go.Layout(title = 'Temperature of Norilsk Over Time')))