# Analysing Data + Statistics

In this lesson, we will analyse a dataset from the World Health Organization.
Some parts of the code are incomplete and have '__' written instead, which means that you need to complete that part of the code. In other parts, you will need to write your own code.

You will also find hyperlinks to documentation on different functions that we will use. It is recommended to look at them to familiarise yourself with what you are doing.

There are also questions to answer. One or two sentences per question is enough.

In [None]:
#First, we need to import the tools we will be using. Links to documentation of the different
#tools we will be using can be found on the explanation of each point.
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import math

### Activity 1: Getting to know our data

Now, we need to load the data we will be using.

To do this, we build the path to the file and then read it. In this case, the file is called 'who_countries.csv' and sits in the 'datasets' folder inside our working directory. This dataset contains health information on several countries of the world from 2000 to 2015.
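
As a minimal sketch of what reading a CSV produces, the cell below parses a small made-up sample instead of the real file. The country names and values are invented for illustration, and the column names are assumed from the ones used later in this lesson.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'who_countries.csv' (values are made up)
sample_csv = io.StringIO(
    "Country,Year,Life expectancy,GDP\n"
    "Country A,2014,70.1,1200.5\n"
    "Country A,2015,70.4,1250.0\n"
    "Country B,2015,81.2,41000.0\n"
)

# read_csv accepts a path or a file-like object; ',' is the default delimiter
sample_info = pd.read_csv(sample_csv, delimiter=',')

print(sample_info.shape)   # (number of rows, number of columns)
print(sample_info.dtypes)  # numeric columns are detected automatically
```

Each line of the file becomes one row of the DataFrame, and the first line provides the column names.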


In [None]:
#Get the path to the file we will be using. In this case, the dataset is under the datasets folder
path = os.path.join(os.getcwd(), 'datasets', 'who_countries.csv')
#Load the data into the countries_info variable. This results in a DataFrame object.
countries_info = pd.read_csv(path, delimiter = ',')

Now we will look at the dataset that we have loaded.

In [None]:
#Just writing the name of a variable in a cell is enough to print its value
countries_info

This gives us a rough idea of the data we are working with, but we can get more information with the *describe()* function. 
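
To get a feel for the summary before running it on the real data, here is a small sketch on a made-up DataFrame (the names `toy` and `summary` and all values are invented for illustration). By default, only numeric columns are summarised, and missing values are left out of the calculations.

```python
import pandas as pd

# Made-up data: one text column, two numeric columns, one missing value
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Year": [2014, 2015, 2015, 2015],
    "Life expectancy": [60.0, 70.0, 80.0, None],
})

summary = toy.describe()
print(summary)

# 'count' is the number of non-missing values in each numeric column
print(summary.loc["count", "Life expectancy"])
```

The rows of the summary are count, mean, std, min, the 25%, 50% and 75% percentiles, and max.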

In [None]:
#Use the DataFrame function describe() to see a summary of the dataset
__.describe()



***What is the input data used for calculating these measures?***

***Do you think this mean can tell us anything? Why? Is this applicable to the other measures?***

### Activity 2

Now look at the same data, but only taking that from 2015. Comment on the key features you observe.
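
Selecting rows by a condition works in two steps: the comparison builds a column of True/False values, and .loc keeps the rows marked True. The sketch below shows this on made-up data (the names `toy`, `mask` and `subset` are only for illustration).

```python
import pandas as pd

# Made-up data with two countries and two years
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Life expectancy": [60.0, 61.0, 75.0, 76.0],
})

# The comparison gives one True/False value per row
mask = toy["Year"] == 2001
print(mask.tolist())

# .loc keeps only the rows where the mask is True
subset = toy.loc[mask]
print(subset)
```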

In [None]:
#.loc[a] selects the rows of a DataFrame that satisfy a condition a
info_15 = countries_info.loc[countries_info['Year'] == __]
#In this case, we are getting the data that has 2015 as a value on the 'Year' column

#Use the describe() function again to see the summary of our selected data
__

***What does the mean in 'Life expectancy' represent here?***


***Looking at the percentiles, how many countries do you estimate have a life expectancy of about 77 years or more?***



In this activity, we will be working with the GDP and Life expectancy attributes. GDP stands for [Gross Domestic Product](https://www.investopedia.com/terms/g/gdp.asp), and it gives a sense of how rich a country is.

In order to know more about these, we will now plot their [distribution](https://seaborn.pydata.org/generated/seaborn.distplot.html).
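
A distribution plot is built on a histogram: the range of values is split into equal-width bins, and the values falling in each bin are counted. The sketch below does the counting with NumPy on made-up numbers. It is only meant to show what the bins are, and is not necessarily how seaborn computes them internally.

```python
import numpy as np

# Made-up values: most are small, one is much larger
values = [1, 2, 2, 3, 50]

# Split the range from 1 to 50 into 5 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=5)

print(edges)   # the 6 boundaries of the 5 bins
print(counts)  # how many values fall in each bin
```

When a few values are much larger than the rest, almost everything ends up in the first bin.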

In [None]:
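#Plotting the distribution for GDP. Write the name of the correct column
#sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True draws a similar plot
sns.histplot(info_15['__'], bins = 15, kde = True)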


In [None]:
#Write the necessary code for plotting the distribution of the Life expectancy


***Seeing the GDP distribution, would you say that most countries have a high GDP? Why?***

***Having seen the distribution for both life expectancy and GDP, can you guess from these if there is a correlation?***

Now that we know our data a bit better, let's try to find relationships in it. For doing this, we will be looking at the correlation between attributes by plotting pairs of them. 

Our first step is to take the 2015 data for all the countries' GDP and Life expectancy. Then use the function *plt.plot(x, y, 'o')* to plot the points.

Note: The 'o' argument tells the function to draw circular markers instead of a line. Other markers are available; check the [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) for these.
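
As a small sketch of the marker argument, the cell below plots made-up points with 's' (square markers) instead of 'o'. The data and the name `points` are invented for illustration.

```python
import matplotlib.pyplot as plt

# Made-up points
x = [1, 2, 3, 4]
y = [10, 20, 15, 30]

# 's' draws a square at each point and no connecting line
(points,) = plt.plot(x, y, 's')
plt.xlabel("x")
plt.ylabel("y")

print(points.get_marker())  # the marker used for this set of points
```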

If we look at the extracted data, we can see that there are some data points for which we do not have some of the values (they appear as NaN). This means that we need to delete those for our representation.

To drop the NaN values, we use the DataFrame method [.dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). Its *subset* argument takes the names of the columns we want to check: a row is dropped if it has NaN in any of those columns.
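
The effect of the subset argument is easiest to see on a small made-up table (the column names and values below are invented for illustration). Rows are only dropped when the NaN is in one of the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy = pd.DataFrame({
    "height": [1.6, np.nan, 1.8, 1.7],
    "weight": [60.0, 70.0, np.nan, 80.0],
    "age": [np.nan, 30.0, 40.0, 50.0],
})

# Only 'height' and 'weight' are checked, so the NaN in 'age' does not matter here
cleaned = toy.dropna(subset=["height", "weight"])
print(cleaned)

# Without subset, a NaN in any column drops the row
print(len(toy.dropna()))
```

Note that dropna returns a new DataFrame and leaves the original one unchanged.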


In [None]:
#Complete the function with the names of the columns in which we are deleting NaN values
clean_info_15 = countries_info.loc[countries_info['Year'] == 2015].dropna(subset=['__', '__'])
plt.plot(clean_info_15['GDP'], clean_info_15['__'], 'o')
plt.xlabel("GDP")
plt.ylabel("Life expectancy")

***Remember that each point represents a country. Does the distribution of these match the distribution we plotted earlier?***

***From this plot, can you deduce if there is correlation between GDP and life expectancy?***

There are many points close to 0. This is because some GDPs are very big compared to others. So, in order to represent the data properly and see if there is in fact a correlation, we need to change the scale of the GDP axis. What scale would be the most appropriate to do this? Write the necessary code for representing the data in the new scale.

Look at the documentation for the function [.xscale()](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html) to find what argument you need to use for your scale.

In [None]:
plt.plot(clean_info_15['GDP'], clean_info_15['Life expectancy'], 'o')
#Fill in the correct argument
plt.xscale('__')
plt.xlabel("GDP")
plt.ylabel("Life expectancy")
plt.show()

***Do you think there is a correlation here?***

Now, we will plot the fitting line. In this case, we will be using the seaborn [regplot()](https://seaborn.pydata.org/generated/seaborn.regplot.html) function, which fits a Machine Learning model known as linear regression and plots the data points and a fitting line.

In [None]:
#You do not need to do anything here, just run the cell
#x and y are passed by keyword, as required by recent seaborn versions
sns.regplot(x=[math.log10(val) for val in clean_info_15['GDP']], y=clean_info_15['Life expectancy'])


In the code, we take the logarithm of every GDP value with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html) and pass the result to the function, instead of changing the scale of the axis. This is because we want to fit the line to the transformed values. Changing the axis scale only affects how the plot is drawn, so the line would still be fitted to the raw GDP values.
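
The difference between fitting on raw and on transformed values can be checked on made-up numbers. In the sketch below, y is constructed to depend exactly on the logarithm of x, and a straight line is fitted both ways with `np.polyfit`. This is used here only as a simple least-squares line fit for illustration, and is not claimed to be what regplot does internally.

```python
import numpy as np

# Made-up data: y depends exactly on log10(x)
x = np.array([10.0, 100.0, 1000.0, 10000.0])
y = 5.0 * np.log10(x) + 50.0

# Fit a straight line to the log-transformed x values
slope_log, intercept_log = np.polyfit(np.log10(x), y, deg=1)

# Fit a straight line to the raw x values
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)

# Distance between the data and each fitted line
residual_log = y - (slope_log * np.log10(x) + intercept_log)
residual_raw = y - (slope_raw * x + intercept_raw)

print(np.abs(residual_log).max())  # practically zero: the line goes through every point
print(np.abs(residual_raw).max())  # clearly larger: a line on raw x misses the points
```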

Finally, we will calculate the value of the correlation between both attributes. We will use the NumPy function [corrcoef(x,y)](https://realpython.com/numpy-scipy-pandas-correlation-python/). This function returns a 2x2 matrix with the correlation coefficients between the two arguments (i.e. \[x-x, x-y\], \[y-x, y-y\]). We are interested in the x-y or y-x coefficient; the two are equal. We will again apply a logarithmic function to our GDP data before passing it into the function.
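
To see how the returned matrix is laid out, the sketch below runs the function on made-up numbers and compares the off-diagonal value with the Pearson correlation coefficient computed by hand. The variable names and data are invented for illustration.

```python
import math
import numpy as np

# Made-up data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

matrix = np.corrcoef(x, y)
print(matrix)

# Pearson coefficient by hand: covariance divided by the product of the spreads
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
spread_x = sum((a - mean_x) ** 2 for a in x)
spread_y = sum((b - mean_y) ** 2 for b in y)
r_manual = cov / math.sqrt(spread_x * spread_y)

print(r_manual)  # same value as matrix[0, 1] and matrix[1, 0]
```

The diagonal holds the correlation of each variable with itself, which is always 1, so the value of interest is one of the two off-diagonal entries.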

In [None]:
np.corrcoef([math.log10(val) for val in clean_info_15['__']], clean_info_15['__'])

***What correlation coefficient did you obtain? Do you think this is high?***

### Activity 3: Mini challenge

Keeping in mind what we have done so far, take two other attributes that you think might be related and check if there is a correlation. Remember the following:

- Delete the empty values where necessary
- Label all the plots you make
- Use an appropriate scale when necessary

When you have your final result, make sure that you explain why you think there is or is not a correlation.

In [None]:
#Write your code here. Remember you can add any cells you need