# Data Science in Python part 2

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np

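# Compatibility shim: DataFrame.iteritems was removed in pandas 2.0,
# but some older library code still calls it
pd.DataFrame.iteritems = pd.DataFrame.items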

## Fixing data

In the previous notebook you may have found the following things about the data:

- Dropped decimal points: mortality rates, vaccination percentages and GDP had entries which look off by factors of 10, 100, 1000, etc.
- Other bimodal or multimodal distributions: for example, BMI figures look wrong in certain countries, but not by factors of 10. This might be because BMI was calculated from measurements in other units (feet and inches instead of metres, pounds instead of kg) that were not converted properly.
- Infant mortality rates should be per 1000, but the maximum values are 1800!
- Percentage expenditure should be out of 100, but for some countries it was much higher.
- It looks like some countries recorded measles as a rate, and others as total incidence. Note that in Ghana, the drop in measles measurements does coincide with a measles vaccination programme in 2002/2003. 
- There are lots of missing values.

With most of these issues, it's not easy to fix the data in a sensible way. The best thing to do is to find alternative data to compare or use instead.

Sometimes, however, you may be forced to use the data you have available. Here are some things you can do.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/IDEMSInternational/GHAIDEMS-data-training/main/life_expectancy.csv")

### Fill in missing values

The following code fills in the missing values in the "Alcohol" column with the overall mean for that column. Why might this be inappropriate?

In [None]:
df["Alcohol"] = df["Alcohol"].fillna(df["Alcohol"].mean())
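
One thing to think about: an overall mean ignores the differences between countries. As a minimal sketch of one alternative, the code below fills each gap with the mean of the same country's own values. The table here is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, not from the real dataset)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Alcohol": [1.0, np.nan, 3.0, 10.0, 12.0, np.nan],
})

# Overall mean fill: every gap gets the same value, whatever the country
global_fill = toy["Alcohol"].fillna(toy["Alcohol"].mean())

# Per-country mean fill: each gap gets the mean of that country's own values
country_fill = toy["Alcohol"].fillna(
    toy.groupby("Country")["Alcohol"].transform("mean")
)

print(pd.DataFrame({"global": global_fill, "per_country": country_fill}))
```

Whether a per-country mean is good enough depends on the column. If values trend over time, interpolating between neighbouring years may be a better choice.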

### Fix decimal point errors

Here is a basic attempt to fix some decimal point errors in the Ghana GDP column.

In [None]:
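# .copy() gives an independent dataframe, so editing it later does not raise SettingWithCopyWarning
df_ghana = df[df["Country"] == "Ghana"].copy()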

In [None]:
df_ghana["GDP"]

In [None]:
df_ghana.loc[:,"GDP"] = np.where(df_ghana["GDP"] < 200, df_ghana["GDP"]*10, df_ghana["GDP"])

In [None]:
df_ghana["GDP"]

What are the problems with doing this? How could you improve it?
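
One possible improvement is to stop relying on a fixed threshold and a fixed factor of 10. The sketch below compares each value with the median of the series and rescales values that sit roughly a power of ten below it. The helper name `fix_decimal_shift` and the numbers are made up for illustration. It assumes that most entries are correct, that all values are positive, and that errors only make values too small.

```python
import numpy as np
import pandas as pd

def fix_decimal_shift(values: pd.Series) -> pd.Series:
    """Rescale entries that sit roughly a power of ten below the series median.

    Hypothetical helper: assumes most entries are correct (so the median is a
    usable reference), values are positive, and errors only shrink values.
    """
    median = values.median()
    # Number of powers of ten each value lies below the median, rounded
    shift = np.round(np.log10(median / values))
    # Only scale up values that are too small; leave everything else alone
    shift = shift.clip(lower=0).fillna(0)
    return values * 10 ** shift

# Made-up GDP series with two dropped decimal points
gdp = pd.Series([1300.0, 135.0, 1400.0, 14.5, 1500.0])
print(fix_decimal_shift(gdp).tolist())
```

To apply this country by country, something like `df.groupby("Country")["GDP"].transform(fix_decimal_shift)` should work. The rounding is crude: any genuine value below about a third of the median would also be rescaled, so always inspect the result before trusting it.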

## A fixed dataset

Fortunately, someone has already gone through to fix a lot of the dataset errors, including the decimal point errors, sporadic data and missing values.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/IDEMSInternational/GHAIDEMS-data-training/main/life_expectancy_fixed.csv")

In [None]:
df.head()

In [None]:
df.info()

The column names are slightly different, but the data is largely the same. We see there are no missing values.

Unfortunately, the data is not in the same order. The following command orders the data first by Country, alphabetically, and then by Year, numerically. 

In [None]:
df = df.sort_values(["Country", "Year"]).reset_index(drop=True)

In [None]:
df.head()

If you want, you can check the data using the methods in the previous notebook.

## Visualising data in plotly

### Viewing mean data per country

The following command finds the mean value of each column for each country, averaging over that country's yearly rows (2000-2015).

In [None]:
df_mean = df.drop("Year", axis=1).groupby(["Country", "Region"]).mean(numeric_only=True).reset_index()

In [None]:
df_mean.head()
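
To see what the grouping does, here is a minimal sketch of the same chain of calls on a small made-up table. Grouping by both Country and Region keeps Region as a column in the result, which is useful for colouring plots later.

```python
import pandas as pd

# Toy data (made-up values): two countries, two years each
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Region": ["North", "North", "South", "South"],
    "Year": [2000, 2001, 2000, 2001],
    "Life_expectancy": [60.0, 62.0, 70.0, 74.0],
})

toy_mean = (
    toy.drop("Year", axis=1)              # Year should not be averaged
    .groupby(["Country", "Region"])       # one group per country
    .mean(numeric_only=True)              # average the numeric columns
    .reset_index()                        # turn the group keys back into columns
)
print(toy_mean)
```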

Let's make a plot of adult mortality against life expectancy.

In [None]:
fig = px.scatter(
    data_frame=df_mean, 
    x="Life_expectancy", 
    y="Adult_mortality",
)
fig.show()

Let's add country labels and a trendline.

In [None]:
fig = px.scatter(
    data_frame=df_mean, 
    x="Life_expectancy", 
    y="Adult_mortality",
    hover_name="Country",
    trendline="ols",
)
fig.show()
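
The `trendline="ols"` option fits an ordinary least squares line, and plotly needs the `statsmodels` package installed to do this. As a rough sketch of the calculation behind it, `np.polyfit` with degree 1 returns the slope and intercept of the best-fit straight line. The points below are made up so that the answer is easy to check by eye.

```python
import numpy as np

# Made-up points lying exactly on the line y = 500 - 5x
life_expectancy = np.array([50.0, 60.0, 70.0, 80.0])
adult_mortality = 500.0 - 5.0 * life_expectancy

# Degree-1 polynomial fit = ordinary least squares straight line
slope, intercept = np.polyfit(life_expectancy, adult_mortality, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```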

We can use colour, size and shape to visualise multidimensionality.

In [None]:
fig = px.scatter(
    data_frame=df_mean, 
    x="Life_expectancy", 
    y="Adult_mortality", 
    hover_name="Country", 
    color="Region", 
    size="GDP_per_capita", 
    symbol="Developed",
    trendline="ols", 
    trendline_scope="overall")
fig.show()


Maybe that got too messy! We can use faceting to separate out some of the information.

In [None]:
fig = px.scatter(
    data_frame=df_mean, 
    x="Life_expectancy", 
    y="Adult_mortality", 
    hover_name="Country", 
    facet_row="Developed", 
    size="GDP_per_capita", 
    color="Region",
    trendline="ols", 
    trendline_scope="overall")
fig.show()

### Finding other correlations and relationships

Pandas dataframes have a `df.corr()` method which produces a matrix of pairwise correlations between columns. Passing `numeric_only=True` skips text columns such as Country and Region.

In [None]:
corr = df_mean.corr(numeric_only=True)

display(corr)

The seaborn plot `heatmap` is very useful for visualising this.

In [None]:
sns.heatmap(corr)
plt.show()
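
With many columns it can be hard to spot the strongest relationships by eye. As a minimal sketch, the code below lists each pair of columns once, sorted by the size of the correlation. It uses a small made-up table, and the "x vs y" labels are just one way to name the pairs.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so perfectly correlated with a
    "c": [4.0, 1.0, 3.0, 2.0],
})
toy_corr = toy.corr(numeric_only=True)

# Indices of the upper triangle (k=1 skips the diagonal),
# so each pair appears once and self-correlations are dropped
rows, cols = np.triu_indices(len(toy_corr), k=1)
pairs = pd.Series(
    toy_corr.to_numpy()[rows, cols],
    index=[f"{toy_corr.index[i]} vs {toy_corr.columns[j]}" for i, j in zip(rows, cols)],
)

# Sort by absolute value so strong negative correlations rank highly too
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest)
```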

You can use this matrix to produce some plots like the one we did for adult mortality against life expectancy.

In [None]:
fig = px.scatter_matrix(df_mean, dimensions=["Schooling", "Hepatitis_B", "Adult_mortality", "GDP_per_capita"], color="Region", hover_name="Country")
fig.show()

### Some final cool things!

You can animate a plot to visualise evolution over time. We'll return to the dataframe `df` for this, but again look at adult mortality against life expectancy.

In [None]:
fig = px.scatter(
    data_frame=df,
    x="Life_expectancy",
    y="Adult_mortality",
    hover_name="Country",
    color="Region",
    size="GDP_per_capita",
    animation_frame="Year",
)
fig.show()

Almost every plotly plot can be animated in this way. It's a great way to view the time dimension of data.

For the final cool thing, since this data is geographical, we can use plotly to produce a map plot. To do this, we need to supply each country with its three-letter ISO 3166-1 alpha-3 code. For example, Afghanistan = AFG, Ghana = GHA, etc.

The file `iso_3alpha_codes.csv` contains this. To add it to our dataframe we can use a left-merge with `pd.merge()`.

In [None]:
iso = pd.read_csv("https://raw.githubusercontent.com/IDEMSInternational/GHAIDEMS-data-training/main/iso_3alpha_codes.csv")

In [None]:
iso.head()

In [None]:
df_new = pd.merge(df, iso, how="left", on="Country")

In [None]:
df_new.head()
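
A left merge keeps every row of the left dataframe, so any country whose name is spelled differently in the two files ends up with a missing code and will not appear on the map. As a minimal sketch, passing `indicator=True` to `pd.merge()` adds a `_merge` column that shows which rows found a match. The tables below are made up for illustration.

```python
import pandas as pd

# Made-up tables: "Atlantis" has no entry in the codes table
data = pd.DataFrame({
    "Country": ["Ghana", "Afghanistan", "Atlantis"],
    "Measles": [90, 60, 50],
})
codes = pd.DataFrame({
    "Country": ["Ghana", "Afghanistan"],
    "ISO3": ["GHA", "AFG"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = pd.merge(data, codes, how="left", on="Country", indicator=True)

# Countries that did not find a code
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].unique().tolist()
print(unmatched)
```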

Let's look at how measles vaccination rates have evolved over time.

In [None]:
fig = px.choropleth(df_new, locations="ISO3", color="Measles", hover_name="Country", animation_frame="Year")
fig.show()