# Data Science in Python part 1

## Preliminary analysis

We will use the following packages:

- pandas: data storage, manipulation and statistics
- matplotlib: Python's basic plotting package (roughly comparable to base graphics in R)
- seaborn: an interface to matplotlib built around data visualisation
- plotly: a package for interactive data visualisation

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Importing and viewing data

We use pandas to import the dataset `life_expectancy.csv` into a `DataFrame`.

A `DataFrame` (dataframe) is a two-dimensional tabular data format.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/IDEMSInternational/GHAIDEMS-data-training/main/life_expectancy.csv")

The command `df.head()` shows the first five rows of a dataframe.

In [None]:
df.head()

Some other useful commands:

- `df.columns` shows all the columns.
- `df["column_name"]` restricts the data to a specific column.
- `df[["column_name_1", "column_name_2", ...]]` restricts the data to specific columns.
- `df.info()` displays information about the entries (number, non-null, data type).

In [None]:
df.columns

In [None]:
df["Adult Mortality"]

# Try changing "Adult Mortality" to any column name in the list above.

In [None]:
df[["Country", "Adult Mortality", "Schooling", "GDP"]].head()

# Try selecting any 5 columns in the list above.

In [None]:
df.info()
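
`df.info()` reports the non-null count for each column. A complementary check is to count the missing values directly. The sketch below uses a small made-up table (the values are illustrative and not taken from `life_expectancy.csv`) together with the pandas methods `isna()`, `sum()` and `mean()`.

```python
import pandas as pd

# Made-up toy data for illustration only; not taken from life_expectancy.csv.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "GDP": [1200.0, None, 560.0, 610.0],
    "Schooling": [10.1, 10.3, None, None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of rows that are missing in each column (between 0 and 1).
missing_share = toy.isna().mean()
print(missing_share)
```

The same two lines should work on `df` itself, which gives a quick overview of which columns have the most gaps.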

## Basic statistics

pandas can calculate basic statistics for a column or the whole dataframe.

In [None]:
df["GDP"].min()

In [None]:
df["percentage expenditure"].max()

Does this seem sensible as a percentage?

In [None]:
df["BMI"].mean()

Does this seem sensible as an average BMI?

pandas has a very useful method, `df.describe()`, which automatically generates summary statistics. By default it covers only the numeric columns of the dataframe.

In [None]:
df.describe()
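
To illustrate which columns `describe()` covers, here is a minimal sketch on a made-up table (the column names and values are hypothetical, chosen only for illustration). Passing `include="all"` asks pandas to summarise the non-numeric columns as well.

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "BMI": [21.0, 22.0, 55.0, 56.0],
})

# Default behaviour: only numeric columns are summarised.
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" adds text columns, reporting count, unique, top and freq for them.
full_summary = toy.describe(include="all")
print(full_summary)
```

In the second table, statistics that do not apply to a column (such as the mean of a text column) appear as `NaN`.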

## Basic visualisation

The seaborn package is very good at quickly generating plots of your data. The function `sns.displot` produces a histogram of a single column.

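Since the data contains multiple entries per country (one per year), we will look at each country's mean values across years. We first drop the `Year` column, since its average is not meaningful, and then use `groupby("Country").mean(numeric_only=True)`.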

In [None]:
df_mean = df.drop("Year", axis=1).groupby("Country").mean(numeric_only=True).reset_index()

In [None]:
df_mean.head()
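
To see what the group-and-average step does, here is the same chain of calls applied to a tiny made-up table (the countries and values are invented for illustration). Each country ends up with a single row holding the mean of its yearly values.

```python
import pandas as pd

# Made-up toy data for illustration only: two countries, two years each.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Life expectancy": [60.0, 62.0, 80.0, 82.0],
})

toy_mean = (
    toy.drop("Year", axis=1)      # the mean of the years is not meaningful
    .groupby("Country")           # one group per country
    .mean(numeric_only=True)      # average the numeric columns within each group
    .reset_index()                # turn "Country" back into an ordinary column
)
print(toy_mean)
```

Without `reset_index()`, `Country` would become the index of the result rather than a column, so it could not be referred to by name when plotting.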

In [None]:
sns.displot(df_mean, x="Life expectancy")
plt.show()

In [None]:
sns.displot(df_mean, x="BMI")
plt.show()

This looks bimodal. Why do you think that is?

## Over to you

Let's focus on a single country, Ghana. Try to use the above tools to detect as many problems in the data as you can. As you do so, try to think of where these issues might have occurred.

In [None]:
# You can change Ghana to any other country in the dataset.

df_ghana = df[df["Country"] == "Ghana"]

display(df_ghana)

Try to think about:

- Do the values make sense for the column heading? Percentages should lie between 0 and 100, and mortality rates are given per 1000.
- Are the values in a column consistent with each other? Are there outliers?
- Do the missing values make sense?
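
The checks above can also be sketched in code. The example below runs on a small made-up table, not on the real data, and the outlier rule (less than half or more than double the column median) is an arbitrary threshold chosen purely for illustration.

```python
import pandas as pd

# Made-up toy data for illustration only; the values are deliberately suspicious.
toy = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015],
    "percentage expenditure": [4.2, 5.1, 510.0, 4.8],
    "Adult Mortality": [250.0, 24.0, 245.0, None],
})

# 1. Range check: a percentage should lie between 0 and 100.
bad_percent = toy[~toy["percentage expenditure"].between(0, 100)]
print(bad_percent)

# 2. Consistency check: flag values far from the column median.
#    The factor of 2 is an assumed, arbitrary threshold.
median = toy["Adult Mortality"].median()
ratio = toy["Adult Mortality"] / median
outliers = toy[(ratio < 0.5) | (ratio > 2)]
print(outliers)

# 3. Missing values: which years have no recorded value?
missing_years = toy.loc[toy["Adult Mortality"].isna(), "Year"].tolist()
print(missing_years)
```

A value that is roughly ten times smaller than its neighbours, such as 24 next to 250, may hint at a dropped digit or a change of units during data entry.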

You could also look at data from other countries by changing Ghana to any other country in the dataset.

The following code can help visualise trends over years to help check consistency. Just change "Life expectancy" to any other column name.

In [None]:
sns.relplot(data=df_ghana, x="Year", y="Life expectancy", kind="line")
plt.show()
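
As a numerical complement to the line plot, year-on-year differences can highlight sudden jumps. This is a minimal sketch on made-up values, and the threshold of 5 years is an assumption chosen for illustration only.

```python
import pandas as pd

# Made-up toy data for illustration only; one value looks like a misplaced decimal point.
toy = pd.DataFrame({
    "Year": [2015, 2014, 2013, 2012],
    "Life expectancy": [62.4, 6.2, 61.5, 61.0],
})

# Sort by year so that each row is compared with the previous year.
toy_sorted = toy.sort_values("Year").reset_index(drop=True)

# Difference from the previous year (the first row has no predecessor, so it is NaN).
toy_sorted["change"] = toy_sorted["Life expectancy"].diff()

# Flag changes larger than an assumed threshold of 5 years in either direction.
big_jumps = toy_sorted[toy_sorted["change"].abs() > 5]
print(big_jumps)
```

A single bad value typically produces two flagged rows: the jump into it and the jump back out.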