# Skills Correlations
This Notebook analyses the correlations between the completion rates of skills in different categories.

### Imports and Load Data

In [None]:
import utilities.data as ud
import utilities.users as uu
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DATA_DIR = "./data"

In [None]:
# Execute this cell *only* if you wish to fetch the data from the Database.
ud.fetch_data(DATA_DIR)

### Data processing
Note that we're only interested in non-null users (those with more than 0 *xp* points) for the entirety of this Notebook.

In [None]:
users, challenges, items, skills, tasks = ud.read_data(DATA_DIR)
users = uu.process(uu.non_null(users))

In [None]:
# This cell may take some time to execute.
users = uu.add_completions_per_category(users, skills)

### Correlation Matrix
This section analyses the *correlation matrix* of the completion rates of skills per category.

Any negative values in this matrix should be analysed in detail: they indicate that users who complete more skills in one category tend to complete fewer in another. This result is not expected. Keep in mind that a correlation on its own does not show that completing one skill makes users less likely to complete another.

In [None]:
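# Completion-rate columns are named "<category>C", one per skill category.
completion_columns = [category + "C" for category in skills["category"].unique()]
users_completion_rates = users[completion_columns]
corr_matrix = users_completion_rates.corr()
corr_matrix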
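Since negative values are the ones that call for a closer look, it can help to list them explicitly instead of scanning the matrix by eye. The following is a minimal sketch, not part of the original analysis: `negative_pairs` is a hypothetical helper, and the toy columns `aC`, `bC` and `cC` are made up for illustration.

```python
import numpy as np
import pandas as pd


def negative_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each negatively correlated pair of columns once, most negative first."""
    # Keep only the strict upper triangle so that each pair appears once
    # and the diagonal (self-correlation) is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (if any survive stacking) compare as False and are filtered out.
    return pairs[pairs < 0].sort_values()


# Toy completion rates for illustration only (not the real data).
toy = pd.DataFrame(
    {
        "aC": [0.0, 0.2, 0.4, 0.6, 0.8],
        "bC": [0.1, 0.3, 0.4, 0.7, 0.9],
        "cC": [0.9, 0.6, 0.5, 0.2, 0.1],
    }
)
toy_negative = negative_pairs(toy.corr())
print(toy_negative)
```

On the real data, the equivalent call would be `negative_pairs(corr_matrix)`.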

The following heatmap provides a more intuitive visualisation of the *correlation matrix*.

In [None]:
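# Fix the colour scale to [-1, 1] and centre it on 0 so negative correlations stand out.
sns.heatmap(corr_matrix, vmin=-1, vmax=1, center=0, cmap="coolwarm", annot=True)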

We now calculate the *mean correlation* for each category. This is a measure of the *leverage* that each category has: a higher value means that users who complete skills in that category also tend to complete skills in the other ones. It would therefore be a good idea to encourage users to complete high-leverage skills, in order to promote more activity throughout the entire *Skill Tree*.

It is important to interpret these results in context. For example, if the score for *MENTAL HEALTH* is low, it could mean that some of its skills are very easy to complete and that some users have only completed skills in this category. It is also important to note that certain categories do not contain many skills.

It would also be possible to modify this Notebook in order to only analyse users in the upper quartiles of *xp*, to minimise the effect explained above.
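A minimal sketch of such a filter is shown here. It assumes that the users table has a numeric column named `xp`; this column name and the helper `upper_quantile` are assumptions for illustration, not taken from the `utilities` modules.

```python
import pandas as pd


def upper_quantile(users: pd.DataFrame, q: float = 0.75, column: str = "xp") -> pd.DataFrame:
    """Keep only the rows whose `column` value is at or above the q-th quantile."""
    threshold = users[column].quantile(q)
    return users[users[column] >= threshold]


# Toy data for illustration only (the "xp" column name is an assumption).
toy_users = pd.DataFrame(
    {"xp": [10, 20, 30, 40, 50, 60, 70, 80], "aC": [0.1] * 8}
)
top_users = upper_quantile(toy_users)
print(top_users)
```

Passing `q=0.5` would keep the two upper quartiles instead of only the top one. The correlation matrix can then be recomputed on the filtered users in the same way as above.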

In [None]:
corr_matrix.mean()
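Note that `corr_matrix.mean()` also averages in each category's correlation with itself, which is always 1 and therefore inflates every score. A possible variant that excludes the diagonal is sketched below; `mean_off_diagonal` is a hypothetical helper and the toy matrix is made up for illustration.

```python
import numpy as np
import pandas as pd


def mean_off_diagonal(corr: pd.DataFrame) -> pd.Series:
    """Mean correlation of each column with every *other* column."""
    # Mask the diagonal so the self-correlation does not count;
    # DataFrame.mean() skips the resulting NaN values by default.
    off_diag = corr.mask(np.eye(len(corr), dtype=bool))
    return off_diag.mean()


# Toy correlation matrix for illustration only.
labels = ["a", "b", "c"]
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=labels,
    columns=labels,
)
leverage = mean_off_diagonal(toy_corr)
print(leverage)
```

On the real data, the equivalent call would be `mean_off_diagonal(corr_matrix)`.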

Another informative way to view the correlations is relative to the overall mean correlation. A value of 2 in the following matrix means that those two categories correlate twice as strongly as the mean of all entries in the matrix.

In [None]:
corr_matrix / corr_matrix.unstack().mean()

Another way to visualise the completion rates of two categories is the following scatter plot.

In [None]:
# This is an example for mindfulness and screentime.
plt.title("Completions per category")
plt.ylabel("Mindfulness")
plt.xlabel("Screentime")

plt.scatter(users_completion_rates["screentimeC"], users_completion_rates["mindfulnessC"])

We can display all such *scatter plots* at once in the following *pair plot*. The diagonal shows a univariate distribution plot, which gives the marginal distribution of the data in each column.

In [None]:
sns.pairplot(users_completion_rates)