## Which factors most influence an individual's drug consumption? A case study

We have a database of 1885 respondents. For each respondent, 12 attributes are known, each with a score. We'll call these attributes **features**. Five of them come from the NEO-FFI-R personality test, which measures the following personality traits:
- Neuroticism
- Extraversion
- Openness to experience
- Agreeableness
- Conscientiousness

We also have access to the Impulsiveness and the Sensation Seeking scores of each respondent.

Each personality trait has a decimal value as a score. The higher the score, the more strongly the respondent exhibits the corresponding personality trait. For example, a score of 0.8 for the Neuroticism trait means that the respondent is very neurotic.

In addition to these 7 personality attributes, we also have access to the level of education, age, gender, country of residence and ethnicity of each respondent.

_Note that in certain countries, ethnicity statistics are illegal._

Participants were also questioned about their use of 18 legal and illegal drugs: alcohol, amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine and volatile substance abuse. One fictitious drug (Semeron) was added to the list to identify over-claimers.

For each drug, they had to select one of the following answers: never used the drug, used it over a decade ago, or used it in the last decade, year, month, week, or day.
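To run any numeric analysis on these answers, they first need an ordinal encoding. The sketch below is one possible way to do it, not necessarily how the dataset itself is encoded: the answer labels are paraphrased from the list above, and the names `ANSWER_RANK`, `encode_answers` and `is_over_claimer` are made up for illustration. It also shows how the fictitious drug could be used to flag over-claimers.

```python
# Answer categories from the questionnaire, ordered from least to most recent use.
# Labels are paraphrased from the text; the real dataset may use different codes.
ANSWERS = [
    "never used",
    "used over a decade ago",
    "used in last decade",
    "used in last year",
    "used in last month",
    "used in last week",
    "used in last day",
]

# Map each answer to an integer rank: 0 (never used) .. 6 (used in last day).
ANSWER_RANK = {label: rank for rank, label in enumerate(ANSWERS)}


def encode_answers(respondent):
    """Turn a {drug: answer label} dict into a {drug: rank} dict."""
    return {drug: ANSWER_RANK[answer] for drug, answer in respondent.items()}


def is_over_claimer(encoded, fictitious_drug="Semeron"):
    """Flag a respondent who reports any use of the fictitious drug."""
    return encoded.get(fictitious_drug, 0) > 0


# Hypothetical respondents, invented for illustration only.
alice = encode_answers({"Cannabis": "used in last week", "Semeron": "never used"})
bob = encode_answers({"Cannabis": "used in last year", "Semeron": "used in last month"})
```

With this encoding, a higher number means more recent use, which is what makes a correlation with a personality score meaningful.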


### Things to keep in mind during this analysis:
- The database contains ONLY 1885 respondents. This is a very small sample size compared to the world population. Therefore, the results of this analysis are **not** generalizable to the entire population.
- We only give some observations and suppositions about the data and the results. We do not make any definitive statements.
- Some features are not well balanced. For example, most of the respondents are from the UK. This can lead to biased results.
- Correlation does not imply causation. We can only say that two variables are correlated, not that one causes the other.

## Where to start this analysis?

First, we need to look at how each population is distributed within the dataset.

Why? Because we need to know whether the dataset is balanced. If a given feature is not balanced, we need to take this into account when analyzing the data.

sharnalk endpoint repartition/by_population, bar chart to show the repartition of the dataset by population, user must be able to select the population to display

For example, we can see that there are 943 male and 942 female respondents. This feature is well balanced, so the results we get from it are likely to be more reliable.
However, the survey was made in the UK, and the majority (1044) of the respondents are from the UK. This must be taken into account when observing results about different countries.
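The balance check described above boils down to counting respondents per category and looking at each category's share. Here is a minimal sketch using only the counts quoted in the text; the helper name `distribution` is hypothetical, and grouping every non-UK respondent under "other" is a simplification made for this example.

```python
from collections import Counter


def distribution(values):
    """Return {category: (count, share of total)}, sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.most_common()}


# Gender counts quoted in the text: 943 male and 942 female respondents.
gender = ["male"] * 943 + ["female"] * 942
for category, (count, share) in distribution(gender).items():
    print(f"{category}: {count} ({share:.1%})")

# Country: 1044 UK respondents out of 1885; the rest are grouped as "other" here.
country = ["UK"] * 1044 + ["other"] * (1885 - 1044)
uk_count, uk_share = distribution(country)["UK"]
```

A share close to an even split suggests a balanced feature, while a single dominant category (as with the UK) is a warning sign when comparing groups.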

### We found it interesting to start by analyzing the drug consumption of the respondents depending on non-personality-related features.

It may help us understand how drug consumption frequency is distributed within each feature. For example, we could ask ourselves:
- Which age range consumes the most Coke?
- Which gender consumes the most Alcohol?

Before making this analysis, we had some preconceived ideas. For example, we thought that the less educated a person is, the more drugs they consume. We will see if these ideas are confirmed by the data.

sharnalk endpoint repartition/by_population bar chart to show the repartition of the dataset by age range and other features. See notebooks/repartition for more details.
user must be able to change the drug and the feature. create chart for education level and meth at the first load. When user is done, he can switch to the next step of the analysis.

As we can see, Meth (methadone) consumption is almost the same among people who left school at 16 and among those with a professional certificate / diploma. This may be surprising, because we expected people who left school at 16 to be more likely to consume it.
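One simple way to compare groups like this is to average the encoded consumption per category. The sketch below assumes answers are encoded as integers from 0 (never used) to 6 (used in the last day); that encoding, the helper name `mean_use_by_group` and the sample rows are all assumptions made for illustration, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_use_by_group(rows, feature, drug):
    """Average encoded consumption of `drug` for each value of `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[drug])
    return {group: mean(values) for group, values in groups.items()}


# Toy rows, invented for illustration; 0 = never used ... 6 = used in last day.
rows = [
    {"Education": "Left school at 16", "Meth": 0},
    {"Education": "Left school at 16", "Meth": 2},
    {"Education": "Professional certificate / diploma", "Meth": 1},
    {"Education": "Professional certificate / diploma", "Meth": 1},
]
averages = mean_use_by_group(rows, feature="Education", drug="Meth")
```

Two groups can share the same average while having different spreads, so the bar chart of the full distribution remains the more informative view.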

When you're done checking drug consumption for each non-personality related feature, you can move on to the next step.

We saw in the previous step that the correlation between drug consumption and non-personality related features may not be as strong as we thought. Let's now analyze the correlation between drug consumption and personality traits. For that, we'll use a correlation matrix.

sharnalk /api/correlation/drug_and_personality, see notebooks/1_personality_drug_correlation_matrix.ipynb for the example
correlation matrix, with zoom on hover etc ...

This matrix shows the correlation score of each personality trait with each drug. The further the score is from 0, the more strongly the personality trait is correlated with consumption of the drug; a negative score means that consumption tends to decrease as the trait score increases.

For example, the correlation between the openness score and LSD consumption is 0.37, which is among the highest values in the matrix. The higher the correlation, the darker the color.

This helps us understand which personality traits are correlated with the use of a specific drug, and which are not.
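For readers who want to see what a single cell of such a matrix represents, here is a from-scratch sketch of the Pearson correlation coefficient, which is the default method of pandas' `Series.corr`. The helper names `pearson` and `correlation_matrix` and the toy values are invented for this example; they are not the project's actual endpoint code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns, traits, drugs):
    """Build {trait: {drug: r}} from a {column name: list of values} mapping."""
    return {t: {d: pearson(columns[t], columns[d]) for d in drugs} for t in traits}


# Toy data, invented for illustration (not taken from the dataset).
columns = {
    "Oscore": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "LSD": [0, 0, 1, 2, 3],
    "Alcohol": [5, 5, 5, 4, 5],
}
matrix = correlation_matrix(columns, traits=["Oscore"], drugs=["LSD", "Alcohol"])
```

The result always lies between -1 and 1, which is why a single color scale can be used for the whole matrix.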

### As you might expect, the goal of this whole analysis is to identify which characteristics make a person more likely to consume drugs frequently. This can be useful for drug prevention programs, for example.

It would help answer questions like:

- Does the age of an individual influence their drug consumption?
- Does the impulsive behavior of an individual influence their drug consumption?


Let's use our data to isolate the strongest correlations between each feature and each drug.

### We start by observing the highest correlations between each drug and the features

sharnalk show the code only if user clicks on "show me the code !" button

```python
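# `df` is assumed to be the survey DataFrame (pandas);
# `drugs` and `features` are lists of its column names.
correlations = []
for drug in drugs:
    for feature in features:
        # Series.corr computes the Pearson correlation by default
        correlations.append((drug, feature, df[drug].corr(df[feature])))

# Sort by signed value, highest first (negative correlations end up last)
correlations = sorted(correlations, key=lambda x: x[2], reverse=True)

# Print the 20 highest correlations
print(*correlations[:20], sep='\n')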
```
```
('Cannabis', 'SS', 0.45613655450406493)
('Cannabis', 'Oscore', 0.41416262189358455)
('Legalh', 'SS', 0.4055778552531838)
('Ecstasy', 'SS', 0.38818619655992304)
('Mushrooms', 'SS', 0.3782853777625287)
('LSD', 'Oscore', 0.36975911051910304)
('Mushrooms', 'Oscore', 0.36913941481128937)
('LSD', 'SS', 0.36553577743377336)
('Coke', 'SS', 0.3433520664032049)
('Amphet', 'SS', 0.33110522351156163)
('Legalh', 'Oscore', 0.31732226610206776)
('Cannabis', 'Impulsive', 0.3105287456498304)
('Nicotine', 'SS', 0.3056345911961203)
('Ecstasy', 'Oscore', 0.2963057088389688)
('Amphet', 'Impulsive', 0.2894381837447692)
('Benzos', 'Nscore', 0.27222065600294526)
('Legalh', 'Impulsive', 0.2675787698375155)
('Mushrooms', 'Impulsive', 0.26368389461899877)
('Ecstasy', 'Impulsive', 0.2608640467105362)
('Coke', 'Impulsive', 0.2600421406491375)
```

This code snippet computes the 20 highest correlations between each drug and the known features. For example, the first line of the output shows that the correlation between the Sensation Seeking score and Cannabis consumption is about 0.46. This means that the higher the Sensation Seeking score, the more likely the person is to consume Cannabis.

We can see that the 20 highest correlations involve only personality scores, not age, country or ethnicity. Note, however, that the list is sorted by signed value, so negative correlations cannot appear in it, however strong they are.
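Because the ranking above sorts by signed value, a strongly negative correlation would never reach the top of the list. One alternative, sketched here under that assumption, is to rank by absolute value. The function name `rank_by_strength` is hypothetical, and the two negative values in the sample are invented purely to illustrate the effect.

```python
def rank_by_strength(correlations, top=5):
    """Sort (drug, feature, r) tuples by |r| so negative links are not pushed last."""
    return sorted(correlations, key=lambda item: abs(item[2]), reverse=True)[:top]


# The first two values are copied (rounded) from the output above;
# the negative ones are hypothetical, invented to illustrate the effect.
sample = [
    ("Cannabis", "SS", 0.456),
    ("Cannabis", "Oscore", 0.414),
    ("Cannabis", "Age", -0.43),     # hypothetical value
    ("Alcohol", "Country", -0.05),  # hypothetical value
]

for drug, feature, r in rank_by_strength(sample, top=3):
    print(f"{drug:10} {feature:8} {r:+.3f}")
```

Printing the sign alongside the value keeps the direction of the relationship visible even though the ordering ignores it.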

One feature seems to stand out: the SS, or Sensation Seeking score. Within this list, it is the most correlated feature with the consumption of Cannabis, Legal highs, Ecstasy, Mushrooms, Coke, Amphetamines and Nicotine. For LSD, the Openness score (0.370) is marginally ahead of SS (0.366).

We may have found a key feature that influences the drug consumption of an individual.

However, the picture is not clear yet: this output lists the correlation for each feature and drug pair separately, and we are not interested in one particular drug, right?

How could we use this output to find the feature most correlated with drug consumption in general?

sharnalk allow the user to guess on their own; if they can't, they can click on a button to see the answer, or go through the next step. (short multiple-choice quiz)

One way to achieve that would be to compute for each feature the mean of the correlation with all drugs.

sharnalk allow user to see the code or not, if he clicks on the button it show the code, otherwise only the text.

show me the code!

```python
from itertools import groupby
from statistics import mean

correlations.sort(key=lambda x: x[1])

grouped_data = {}

# We group data by the second element of the tuple (feature)
for key, group in groupby(correlations, key=lambda x: x[1]):
    data = list(group)
    grouped_data[key] = {
        "mean": mean([x[2] for x in data]),
        "correlations": data
        }

# Then we sort the entries depending on the mean of each group
sorted_grouped_data = sorted(grouped_data.items(), key=lambda x: x[1]['mean'], reverse=True)

final_data = {feature: data['mean'] for feature, data in sorted_grouped_data}

for key, value in final_data.items():
    print(key, "-->", value)
```

After computing the mean correlation of all drugs for each feature, the output is the following:
```
SS --> 0.24750768760730996
Impulsive --> 0.1835408142616583
Oscore --> 0.18211357240362108
Nscore --> 0.09115772430291505
Ethnicity --> 0.07279538563388822
Escore --> -0.005960554225981149
Ascore --> -0.10339419036206977
Education --> -0.11093035147121946
Cscore --> -0.1526333011851522
Gender --> -0.15812904754418552
Age --> -0.1923893011427491
Country --> -0.25053952952862485
```

sharnalk show the next 3 lines only if user clicks on the "?" button, to help him understand the results.

        How to read this output?
        For each feature, we can see the mean correlation across all drugs. The higher the mean, the more positively the feature is correlated with drug consumption; a negative mean indicates that consumption tends to decrease as the feature value increases.
        The mean correlation between the Sensation Seeking score and drug consumption is 0.25. This is the highest mean correlation.


Once again, the Sensation Seeking score is ahead. It appears to be the feature most positively correlated with the drug consumption of an individual, and the Impulsive score comes second. Features like age or country end up at the bottom of the list, but only because their mean correlations are negative: in magnitude, Country (-0.25) and Age (-0.19) are comparable to Sensation Seeking (0.25), so they should not be read as weakly correlated.
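Averaging signed correlations has one pitfall: positive and negative values for the same feature can cancel each other out. A possible complement, sketched below, is to report the mean of the absolute values next to the signed mean. The function name `mean_strength_by_feature` and the sample tuples are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_strength_by_feature(correlations):
    """Return {feature: (mean r, mean |r|)} from (drug, feature, r) tuples."""
    by_feature = defaultdict(list)
    for _drug, feature, r in correlations:
        by_feature[feature].append(r)
    return {f: (mean(rs), mean(abs(r) for r in rs)) for f, rs in by_feature.items()}


# Toy tuples, invented for illustration: "Trait" has opposite-signed correlations.
sample = [
    ("DrugA", "Trait", 0.3),
    ("DrugB", "Trait", -0.3),
    ("DrugA", "Other", 0.1),
    ("DrugB", "Other", 0.1),
]
summary = mean_strength_by_feature(sample)
for feature, (signed, strength) in summary.items():
    print(f"{feature}: mean r = {signed:+.2f}, mean |r| = {strength:.2f}")
```

In this toy case the signed mean for "Trait" hides a relationship that the mean absolute value reveals, which is the situation the second column is meant to catch.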

### Let's plot this data to visualize it better!

sharnalk endpoint /api/correlation/feature_to_drug_mean, see notebooks/2_all_features_to_drugs_mean_correlation.ipynb.ipynb for the example

### That's it for the analysis, remember that these are only observations and suppositions. We do not make any definitive statements about the data.

Feel free to explore the charts yourself and check the variety of drugs and features that the dataset offers. You may spot some insights that we missed!