# Data Science and Programming
# Week 7b


# Table of contents
* [Introduction](#Introduction)
 - [Problem](#Problem)
 - [Importing the libraries and data](#Importing-the-libraries-and-data)
* [Pre-processing the data](#Pre-processing-the-data)
 - [Cleaning the rainfall feature](#Cleaning-the-rainfall-feature)
 - [Converting `Year` to a category](#Converting-Year-to-a-category)
* [Exploring the data](#Exploring-the-data)
* [Communicating the results](#Communicating-the-results)
 - [Checkpoint](#Checkpoint)
* [Extension: Using the UK and world data](#Extension:-Using-the-UK-and-world-data)

# Introduction 
This activity uses data from the Edexcel large data set, which features weather data for 1987 and 2015 from eight weather stations. The first part of the activity uses the data from the five UK weather stations. For more information about the data, see the short video at: https://mei.org.uk/introduction-to-data-science/large-data-sets/

## Problem

* ***How does the weather differ at the five different UK stations?***


To answer this question you could calculate statistics and produce 1- and 2-dimensional charts.
 
## Importing the libraries and data

> Run the code box below to import the libraries.

In [None]:
# import pandas for data analysis
import pandas as pd 

# import seaborn for visualisations
import seaborn as sns

In [None]:
# import the csv file to a data set called weather_data
weather_data = pd.read_csv('all-stations-uk.csv')

# display the data to verify it has imported
weather_data

# Pre-processing the data
## Cleaning the rainfall feature
The extension to the activity in lesson 1 included code that created a new feature, `Rainfall`, as a copy of `Daily Total Rainfall` with each value of *tr* replaced by 0.025.

> Run the code below to pre-process the data.

In [None]:
# create a new column called Rainfall which is a copy of Daily Total Rainfall, replacing any instances of 'tr' with 0.025 and changing the type to float
weather_data['Rainfall'] = weather_data['Daily Total Rainfall'].replace({'tr': 0.025}).astype('float')

weather_data
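
To see what the replacement does on a small scale, the sketch below applies the same steps to a hand-made series. The values and the names `raw` and `rainfall` are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Made-up rainfall readings for illustration only (not from the real data set).
# The readings are stored as text because 'tr' cannot be read as a number.
raw = pd.Series(['0.2', 'tr', '1.4', '0'], dtype='object', name='Daily Total Rainfall')

# count how many readings are recorded as 'tr' before cleaning
print((raw == 'tr').sum())

# replace 'tr' with 0.025 and convert every value to a float
rainfall = raw.replace({'tr': 0.025}).astype('float')

# the cleaned series is numeric, so statistics such as the mean can be calculated
print(rainfall.dtype)
print(rainfall.mean())
```

Once the column is numeric, it can be used in the same statistics and charts as the other numerical features.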

## Converting `Year` to a category

In this activity you will explore the difference between 1987 and 2015 at the different weather stations. Many of the built-in commands in Seaborn are easier to use when `Year` is a categorical feature rather than a number. To do this, set the data type to category and assign the result back to the column: `weather_data['Year'] = weather_data['Year'].astype('category')`

> Run the code below to change the data type of `Year` to a category.

In [None]:
# change the data type of Year to category
weather_data['Year'] = weather_data['Year'].astype('category')

# display the data types and info
weather_data.info()
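
As a rough illustration of what the conversion changes, the sketch below converts a small hand-made series of years. The values and the names `years` and `years_cat` are made up for this example.

```python
import pandas as pd

# Made-up years for illustration only
years = pd.Series([1987, 2015, 1987, 2015], name='Year')

# convert the numbers to a categorical feature
years_cat = years.astype('category')

# the data type is now 'category' and each distinct year is a separate label
print(years_cat.dtype)
print(list(years_cat.cat.categories))

# count how many rows fall into each category
print(years_cat.value_counts())
```

After the conversion the two years are treated as labels for groups rather than as points on a numerical scale.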

# Exploring the data
The problems you are exploring are:

***How does the weather differ between 1987 and 2015?***

***How does the weather differ at the five different UK stations?***

To answer this you could:
* Compare the statistics for different features for 1987 and 2015: e.g. use `groupby('Year')` to compare the statistics for `Daily Mean Temperature`.
* Compare charts for different features for 1987 and 2015: e.g. use `catplot` to create box plots for `Daily Mean Temperature` grouped by `Year`, or by `Year` and `Station`.
* Compare scatter plots based on two features to identify any associations: e.g. use `relplot` to create a scatter plot of `Daily Mean Temperature` against `Daily Total Sunshine` and set `hue='Year'`.
* Take slices of the data and see what the differences are for the different weather stations: e.g. create a slice for Camborne and explore this.
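
As a minimal sketch of the first and last suggestions, the code below groups a tiny hand-made table by `Year` and then takes a slice for Camborne. The values in `sample` are invented purely for illustration, and the names `sample`, `temperature_by_year` and `camborne` are hypothetical; with the real data you would use `weather_data` instead.

```python
import pandas as pd

# Made-up values for illustration only (NOT taken from the real data set)
sample = pd.DataFrame({
    'Station': ['Camborne', 'Camborne', 'Camborne', 'Camborne',
                'Heathrow', 'Heathrow', 'Heathrow', 'Heathrow'],
    'Year': [1987, 1987, 2015, 2015, 1987, 1987, 2015, 2015],
    'Daily Mean Temperature': [12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 16.0],
    'Daily Total Sunshine': [3.0, 6.0, 4.0, 8.0, 2.0, 5.0, 6.0, 9.0],
})

# treat Year as a category, as in the pre-processing step
sample['Year'] = sample['Year'].astype('category')

# summary statistics for Daily Mean Temperature, grouped by Year
# (observed=True keeps only the years that actually appear in the data)
temperature_by_year = sample.groupby('Year', observed=True)['Daily Mean Temperature'].describe()
print(temperature_by_year)

# a slice containing only the rows for Camborne
camborne = sample[sample['Station'] == 'Camborne']
print(camborne)

# the same statistics for the Camborne slice only
print(camborne.groupby('Year', observed=True)['Daily Mean Temperature'].mean())
```

The chart suggestions follow the same pattern. Assuming Seaborn is imported as `sns`, calls such as `sns.catplot(data=sample, x='Year', y='Daily Mean Temperature', kind='box')` and `sns.relplot(data=sample, x='Daily Total Sunshine', y='Daily Mean Temperature', hue='Year')` should produce the box plots and scatter plot described above; adding `col='Station'` to `catplot` gives one panel per station.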

# Communicating the results
## Checkpoint
> 
> * Use the statistics and charts produced to answer the initial problem: ***How does the weather differ between 1987 and 2015?***

# Extension: Using the UK and world data
The full data set features weather for 1987 and 2015 from eight different weather stations, including three international stations. You can explore the problem further by comparing the weather for these two years at the international weather stations as well as the UK stations.

## Problem
***How does the weather differ between 1987 and 2015 at the stations?***

> Run the code below to import the data set with all eight weather stations.

In [None]:
# import the csv file to a data set called weather_data_uk_world
weather_data_uk_world = pd.read_csv('../input/weather-data-edexcel-large-data-set/all-stations-uk-world.csv')

# display the data to verify it has imported
weather_data_uk_world

> Run the code below to create a numerical feature for rainfall.

In [None]:
# create a new column called Rainfall which is a copy of Daily Total Rainfall, replacing any instances of 'tr' with 0.025 and changing the type to float
weather_data_uk_world['Rainfall'] = weather_data_uk_world['Daily Total Rainfall'].replace({'tr': 0.025}).astype('float')

weather_data_uk_world

> Run the code below to change the type of the `Year` feature to a category.

In [None]:
# change the data type of Year to category
weather_data_uk_world['Year'] = weather_data_uk_world['Year'].astype('category')

# display the data types and info
weather_data_uk_world.info()

> Repeat your analysis for all eight weather stations using the `weather_data_uk_world` data set.