# Data Exploration Exercise
---
This is an open-ended exercise to explore a given data set.

The data is taken from the [OECD Better Life Index](https://stats.oecd.org/Index.aspx?DataSetCode=BLI) - it only covers OECD countries, so when thinking about the results of your analysis, bear in mind that this excludes a lot of the world's population!

In [None]:
import pandas as pd
bli = pd.read_csv("BLI2.csv")
bli.head()

The data in the file is already in a tidy form, though the website linked above shows a more concise view of the data, which you may find helpful.

---
### Task 1
Do men and women report the same degree of life satisfaction?

*Hint*: Start by making a list of the available values for 'Indicator'.
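
As a minimal sketch of the hint, the distinct values in a column can be listed with `unique()`. The `toy` table below is a hypothetical stand-in for *bli*; its rows and values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical miniature table; values are invented for illustration only
toy = pd.DataFrame({
    "Indicator": ["Life satisfaction", "Life expectancy", "Life satisfaction"],
    "Inequality": ["Men", "Total", "Women"],
    "Value": [6.5, 80.1, 6.6],
})

# Distinct values in the column, in order of first appearance
indicators = toy["Indicator"].unique().tolist()
print(indicators)
```

The same call on `bli['Indicator']` should give the list of indicators available in the real data.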

---
### Task 2
Is happiness related to geographical latitude?

We have some information on the overall 'life satisfaction' of each country:

In [None]:
life_sat = bli.query("Indicator == 'Life satisfaction' and Inequality == 'Total'")
life_sat.head()

We also have some information about the latitude of each country, which we load into the *countries* table:

In [None]:
countries = pd.read_excel("data_geographies_v1.xlsx",
                          sheet_name="list-of-countries-etc")
countries.head()

Can we combine these somehow to answer this question?

How do we work with data that are held in two different DataFrames? This is awkward with the tools we have seen so far, but fortunately pandas has some more tools to help us!

### Joining tables

We will start with the *life_sat* DataFrame (the "left-hand" table) and add the 'Latitude' column from *countries*. To do this, we need to `join()` the two tables together.

Importantly, we **cannot** assume that the countries are listed in the same order, or even that both tables contain the same set of countries. We need to identify a *key*, that is, some information that exists in both tables, that we can use to "look up" the correct row from *countries* (the "right-hand" table).

We could use the country name, but this might be recorded differently in the two tables. The three-letter country code is unique for each country and appears in both tables (as 'LOCATION' in *life_sat* and as 'geo' in *countries*), so it is the best choice of key here.

To join tables in pandas, we make these *key* columns the index of both the left and right tables:

In [None]:
# make a copy of the table
left = life_sat.copy()

# make a new column for 'geo'
# i.e. LOCATION but in lower case to match the other table
left['geo'] = left['LOCATION'].str.lower()

# move the 'geo' column to the index
left = left.set_index('geo')
left.head()

In [None]:
# make the right-hand table
right = countries.set_index('geo')

# keep only the Latitude column
right = right[['Latitude']]  
right.head()

Now we can join them with the `join()` method!

In [None]:
joined = left.join(right)
joined.head()

This is the table we need to proceed with the analysis.
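
Before moving on, it can be worth checking whether every row found a match. By default `join()` keeps every row of the left-hand table, and rows whose key is missing from the right-hand table get `NaN` in the new column. The sketch below uses two hypothetical toy tables (`left_toy`, `right_toy`) with made-up codes and values to show this behaviour.

```python
import pandas as pd

# Hypothetical toy tables; codes and values are invented for illustration
left_toy = pd.DataFrame({"Value": [7.0, 6.0, 5.0]},
                        index=["aaa", "bbb", "ccc"])
# Different row order, and no entry for "bbb"
right_toy = pd.DataFrame({"Latitude": [10.0, 50.0]},
                         index=["ccc", "aaa"])

# Rows are matched on the index, not on their position
joined_toy = left_toy.join(right_toy)
print(joined_toy)

# Count the rows that found no match in the right-hand table
n_missing = int(joined_toy["Latitude"].isna().sum())
print(n_missing)
```

Running the same `isna().sum()` check on `joined['Latitude']` would show how many countries, if any, have no latitude recorded.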

`join()` is just one of several pandas methods for working with **relational data** (i.e. data held in more than one table).
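
One of those other methods is `merge()`, which matches rows on ordinary columns rather than on the index. The following is a rough sketch on hypothetical toy tables (`sat_toy` and `geo_toy`, with made-up values), not a replacement for the steps above.

```python
import pandas as pd

# Hypothetical toy tables; codes and values are invented for illustration
sat_toy = pd.DataFrame({"LOCATION": ["AAA", "BBB"], "Value": [7.0, 6.0]})
geo_toy = pd.DataFrame({"geo": ["bbb", "aaa"], "Latitude": [50.0, 10.0]})

# Build a matching lower-case key column, as in the join() example
sat_toy["geo"] = sat_toy["LOCATION"].str.lower()

# how="left" keeps every row of sat_toy, like the default for join()
merged_toy = sat_toy.merge(geo_toy, on="geo", how="left")
print(merged_toy)
```

With `merge()` the key stays in the body of the table as a normal column, so there is no index to set beforehand.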

If needed, you can move the current index column back into the body of the DataFrame using the method [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?highlight=reset_index#pandas.DataFrame.reset_index). We already have the LOCATION column in the table, so we don't need to do this now.
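
For reference, here is a small sketch of what `reset_index()` does, using a hypothetical toy table (`indexed_toy`) with made-up values.

```python
import pandas as pd

# Hypothetical toy table whose index is named 'geo'
indexed_toy = pd.DataFrame(
    {"Latitude": [10.0, 50.0]},
    index=pd.Index(["aaa", "bbb"], name="geo"),
)

# Move 'geo' back into an ordinary column; rows get a fresh 0, 1, ... index
flat_toy = indexed_toy.reset_index()
print(flat_toy)
```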


Use the *joined* table to investigate the relationship between life satisfaction and latitude.

---
### Any other ideas?

Make a note of any other summary statistics, visualisations and hypotheses that you would be interested to explore in this dataset.