# Causes derrière la consommation de drogue 

Vous allez retouver sur ce notebook les différentes analyses effectuées dans l'optique de répondre à la problématique suivante :
- Peut-on prédire la consommation de drogue "dure" selon la personnalité et l'environnement des jeunes? 

URL to our dataset : https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29

Le dataset que nous allons traité a été collecté en 2016. 
Le traitement de notre data passera par un travail de classification afin de déterminer si des profils ressortent dans notre dataset.

Poure ce faire, nous allons passer par les étapes suivantes :
- Data Cleaning 
- Exploration de la data
- Visualisation de la data
- Action (ML models etc )

### I. Chargement et Description des données 

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

from IPython.display import display, HTML 

css = """
.output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(css))

In [2]:
# chargement des données
# noms des colonnes

data =  pd.read_csv('drug_consumption.data', sep=",", header = None)
data.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 32 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1885 non-null   int64  
 1   1       1885 non-null   float64
 2   2       1885 non-null   float64
 3   3       1885 non-null   float64
 4   4       1885 non-null   float64
 5   5       1885 non-null   float64
 6   6       1885 non-null   float64
 7   7       1885 non-null   float64
 8   8       1885 non-null   float64
 9   9       1885 non-null   float64
 10  10      1885 non-null   float64
 11  11      1885 non-null   float64
 12  12      1885 non-null   float64
 13  13      1885 non-null   object 
 14  14      1885 non-null   object 
 15  15      1885 non-null   object 
 16  16      1885 non-null   object 
 17  17      1885 non-null   object 
 18  18      1885 non-null   object 
 19  19      1885 non-null   object 
 20  20      1885 non-null   object 
 21  21      1885 non-null   object 
 22  

Nom de chaque colonne : 

In [4]:
names_column = ['ID','age','gender','education','country','ethnicity','nscore','escore','oscore','ascore','csore','impulsive_bis11','sensation_ImpSS','alcohol','amphet','amyl','benzos','caff','canabis','choc','coke','crack','ecstasy','heroin','ketamine','legalhigh','LSD','meth','mushrooms','nicotine','semer','VSabuse']

In [5]:
columns_dict = dict(zip(list(range(len(data.columns))),names_column))

data.rename(columns = columns_dict, inplace = True)

#data.rename( columns = names_column )

Nous avons trois Grands types de colonnes : 
- Colonnes décrivant le profil de l'individu ( CPRFL )
- Colonnes décrivant la personnalité de l'individu ( CPRSL )
- Colonnes décrivant la fréquence de consommation de drogues ( CFCD )


In [6]:
data_CPROFIL = data.iloc[:,0:6]
data_CPERSONALITY = data.iloc[:,6:13]
data_CDRUG = data.iloc[:,13:31]

Décrivons en premier lieu les colonnes décrivant le profil de l'individu\
Les colonnes CPRFL :
- ID : Categorical variable
- Age : Interval variable 
- Genre : Categorical variable 
- Education : Ordinal Variable 
- Pays : Categorical Variable 
- Ethnicité : Categorical Variable 

Soit 6 colonnes.

In [7]:
for i in range(1,6) :
    print('La colonne \'',names_column[i],'\' a ',len(data_CPROFIL.iloc[:,i].unique()),' valeurs uniques ',data.iloc[:,i].unique(),'\n')
    


La colonne ' age ' a  6  valeurs uniques  [ 0.49788 -0.07854 -0.95197  2.59171  1.09449  1.82213] 

La colonne ' gender ' a  2  valeurs uniques  [ 0.48246 -0.48246] 

La colonne ' education ' a  9  valeurs uniques  [-0.05921  1.98437  1.16365 -1.22751 -1.7379   0.45468 -0.61113 -2.43591
 -1.43719] 

La colonne ' country ' a  7  valeurs uniques  [ 0.96082  0.24923 -0.57009 -0.28519 -0.09765  0.21128 -0.46841] 

La colonne ' ethnicity ' a  7  valeurs uniques  [ 0.126   -0.31685  0.1144  -0.22166 -0.50212 -1.10702  1.90725] 



Nous pouvons observer que les colonnes CPRFL sont dans l'ensemble catégorique (même si à première vue, les valeurs étant numériques, nous aurions pu penser autrement). 

Décrivons en premier lieu les colonnes décrivant le profil de l'individu\
Les colonnes CPRSL:

- les colonnes se basant sur le test NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness),
- la colonne impulsivité selon le test BIS-11 (impulsivity), 
- la colonne de sensation selon le test ImpSS (sensation seeking)

Soit 7 colonnes.

In [8]:
for i in range(0,) :
    print('La colonne \'',names_column[i],'\' a ',len(data.iloc[:,i].unique()),' valeurs uniques ',data.iloc[:,i].unique(),'\n')
    


Nous pouvons observer que les colonnes CPRSL sont dans l'ensemble catégorique (même si à première vue, les valeurs étant numériques, nous aurions pu penser autrement). 

Décrivons en premier lieu les colonnes décrivant le profil de l'individu\
Les colonnes CPRSL:
- les colonnes se basant sur le test NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness),
- la colonne impulsivité selon le test BIS-11 (impulsivity), 
- la colonne de sensation selon le test ImpSS (sensation seeking)

Soit 7 colonnes.

In [9]:
data.describe

<bound method NDFrame.describe of         ID      age   gender  education  country  ethnicity   nscore   escore  \
0        1  0.49788  0.48246   -0.05921  0.96082    0.12600  0.31287 -0.57545   
1        2 -0.07854 -0.48246    1.98437  0.96082   -0.31685 -0.67825  1.93886   
2        3  0.49788 -0.48246   -0.05921  0.96082   -0.31685 -0.46725  0.80523   
3        4 -0.95197  0.48246    1.16365  0.96082   -0.31685 -0.14882 -0.80615   
4        5  0.49788  0.48246    1.98437  0.96082   -0.31685  0.73545 -1.63340   
...    ...      ...      ...        ...      ...        ...      ...      ...   
1880  1884 -0.95197  0.48246   -0.61113 -0.57009   -0.31685 -1.19430  1.74091   
1881  1885 -0.95197 -0.48246   -0.61113 -0.57009   -0.31685 -0.24649  1.74091   
1882  1886 -0.07854  0.48246    0.45468 -0.57009   -0.31685  1.13281 -1.37639   
1883  1887 -0.95197  0.48246   -0.61113 -0.57009   -0.31685  0.91093 -1.92173   
1884  1888 -0.95197 -0.48246   -0.61113  0.21128   -0.31685 -0.46725  2.127

- 1884 rows -> 1884 d'individus 
- 32 columns : 32 caractéristiques d'individus

### II. Data Cleaning 

Cette partie sera dédiée au nettoyage (renommer les colonnes,reset des index, drop des colonnes, ) et à la mise en forme de la donnée pour que cette dernière soit prête à l'exploitation

In [10]:
data.head()

Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


Nous pouvons observer que le nom de nos colonnes n'est pas bonne.\
Nous vérifierons que la première ligne reste la première ligne et que la dernière ligne reste la dernière ligne même après nos différentes mises en forme. 

In [11]:
first_row = data.iloc[0]
last_row = data.iloc[-1]

In [12]:
pd.DataFrame(first_row).T

Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0


In [13]:
pd.DataFrame(last_row).T

Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
1884,1888,-0.95197,-0.48246,-0.61113,0.21128,-0.31685,-0.46725,2.127,1.65653,1.11406,...,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL6,CL0,CL2


First row : " 0.49788	0.48246	-0.05921	0.96082	0.12600	0.31287	-0.57545	-0.58331	-0.91699	...	CL0.4	CL0.5	CL0.6	CL0.7	CL0.8	CL0.9	CL0.10	CL2.2	CL0.11	CL0.12 "

In [14]:
display(data.head(2))
display(data.tail(2))

Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0


Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.6209,...,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0
1884,1888,-0.95197,-0.48246,-0.61113,0.21128,-0.31685,-0.46725,2.127,1.65653,1.11406,...,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL6,CL0,CL2


On peut voir qu'il ya un décalage étrange entre l'écart initial de l'ID et de l'index aux premières lignes et l'écart qu'on voit aux dernières lignes.\
Nous allons donc voir à quel niveau l'aggrandissement de l'écart occurre en comparant les valeurs index et ID

In [15]:
# On commence par décaler les valeurs de la colonne ID 
# afin qu'elle puisse matcher avec les valeurs des index

data['ID'].index = data['ID'].index + 1 
Data_ID = pd.concat([pd.Series([0]),data['ID'][:-1]])
Data_ID

  Data_ID = pd.concat([pd.Series([0]),data['ID'][:-1]])


0          0
1          1
2          2
3          3
4          4
        ... 
1880    1883
1881    1884
1882    1885
1883    1886
1884    1887
Length: 1885, dtype: int64

In [16]:
# et on compare
pd.Series(data.index).compare(Data_ID)

Unnamed: 0,self,other
294,294.0,295.0
295,295.0,296.0
296,296.0,297.0
297,297.0,298.0
298,298.0,300.0
...,...,...
1880,1880.0,1883.0
1881,1881.0,1884.0
1882,1882.0,1885.0
1883,1883.0,1886.0


Avec cette comparaison nous pouvons affirmer que le décalage a commencé avec la ligne 294.

Vérifions qu'il n y ait pas d'anomalie aux alentours de cette ligne.

In [17]:
data.iloc[290:296]

Unnamed: 0,ID,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,...,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer,VSabuse
290,291,0.49788,0.48246,-0.05921,0.96082,-1.10702,-0.92104,0.96248,-0.84732,1.45039,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0
291,292,0.49788,0.48246,-0.61113,0.96082,-0.31685,0.62967,-1.09207,0.44585,-0.76096,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL4,CL0,CL0
292,293,1.09449,0.48246,-0.05921,0.96082,-0.31685,-0.05188,-0.15487,-0.58331,-0.60633,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0
293,295,-0.07854,0.48246,-0.05921,0.96082,-0.31685,1.13281,0.00332,-0.17779,-0.76096,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL6,CL0,CL1
294,296,-0.95197,0.48246,-1.43719,0.96082,-0.31685,-0.34799,0.47617,-0.17779,-0.60633,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0
295,297,0.49788,-0.48246,-0.05921,0.96082,-0.31685,1.37297,-0.43999,0.14143,-0.01729,...,CL0,CL2,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0


Aucune anomalie à priori, il manque seulement un individu dont l'identifiant est '294'.\
On pourra dans la phase 'Action' si besoin le créer et prédire ses scores et sa fréquence de consommation selon un profil.

In [18]:
data.columns

Index(['ID', 'age', 'gender', 'education', 'country', 'ethnicity', 'nscore',
       'escore', 'oscore', 'ascore', 'csore', 'impulsive_bis11',
       'sensation_ImpSS', 'alcohol', 'amphet', 'amyl', 'benzos', 'caff',
       'canabis', 'choc', 'coke', 'crack', 'ecstasy', 'heroin', 'ketamine',
       'legalhigh', 'LSD', 'meth', 'mushrooms', 'nicotine', 'semer',
       'VSabuse'],
      dtype='object')

In [19]:
data.isna().sum()

ID                 0
age                0
gender             0
education          0
country            0
ethnicity          0
nscore             0
escore             0
oscore             0
ascore             0
csore              0
impulsive_bis11    0
sensation_ImpSS    0
alcohol            0
amphet             0
amyl               0
benzos             0
caff               0
canabis            0
choc               0
coke               0
crack              0
ecstasy            0
heroin             0
ketamine           0
legalhigh          0
LSD                0
meth               0
mushrooms          0
nicotine           0
semer              0
VSabuse            0
dtype: int64

- Bon point : Nous n'avons aucune valeur nulle dans le tableau

#### Nous allons désormais drop les colonnes qui ne sont soit
- pas cohérentes avec la problématique 
- pas cohérentes avec le reste des variables

In [20]:
# data_CPROFIL
# data_CPERSONALITY
# data_CDRUG

MAPPING 1

In [21]:
data_CPROFIL_Visual = data_CPROFIL.copy()

map_dict_age = {-0.95197:"18-24", -0.07854:"25-34", 0.49788:"35-44", 1.09449:"45-54", 1.82213:"55-64", 2.59171:"65+"}
data_CPROFIL_Visual["age"] = data_CPROFIL_Visual["age"].map(map_dict_age)

map_dict_gender = {0.48246:"Female", -0.48246:"Male"}
data_CPROFIL_Visual["gender"] = data_CPROFIL_Visual["gender"].map(map_dict_gender)

map_dict_education = {-2.43591:"Left school before 16 years", -1.73790:"Left school at 16 years", -1.43719:"Left school at 17 years", -1.22751:"Left school at 18 years", -0.61113:"Some college or university, no certificate or degree", -0.05921:"Professional certificate/ diploma", 0.45468:"University degree", 1.16365:"Masters degree", 1.98437:"Doctorate degree"}
data_CPROFIL_Visual["education"] = data_CPROFIL_Visual["education"].map(map_dict_education)

map_dict_country = {-0.09765:"Australia", 0.24923:"Canada", -0.46841:"New Zealand", -0.28519:"Other", 0.21128:"Republic of Ireland", 0.96082:"UK", -0.57009:"USA"}
data_CPROFIL_Visual["country"] = data_CPROFIL_Visual["country"].map(map_dict_country)

map_dict_ethnicity = {-0.50212:"Asian", -1.10702:"Black", 1.90725:"Mixed-Black/Asian", 0.12600:"Mixed-White/Asian", -0.22166:"Mixed-White/Black", 0.11440:"Other", -0.31685:"White"}
data_CPROFIL_Visual["ethnicity"] = data_CPROFIL_Visual["ethnicity"].map(map_dict_ethnicity)


data_CPROFIL_Visual

Unnamed: 0,ID,age,gender,education,country,ethnicity
0,1,35-44,Female,Professional certificate/ diploma,UK,Mixed-White/Asian
1,2,25-34,Male,Doctorate degree,UK,White
2,3,35-44,Male,Professional certificate/ diploma,UK,White
3,4,18-24,Female,Masters degree,UK,White
4,5,35-44,Female,Doctorate degree,UK,White
...,...,...,...,...,...,...
1880,1884,18-24,Female,"Some college or university, no certificate or ...",USA,White
1881,1885,18-24,Male,"Some college or university, no certificate or ...",USA,White
1882,1886,25-34,Female,University degree,USA,White
1883,1887,18-24,Female,"Some college or university, no certificate or ...",USA,White


MAPPING 2

In [22]:
data_CPERSONALITY.head(5)

Unnamed: 0,nscore,escore,oscore,ascore,csore,impulsive_bis11,sensation_ImpSS
0,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084
1,-0.67825,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575
2,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,-1.37983,0.40148
3,-0.14882,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084
4,0.73545,-1.6334,-0.45174,-0.30172,1.30612,-0.21712,-0.21575


In [23]:
data_CPERSONALITY_Visual = data_CPERSONALITY.copy()

map_dict_nscore = {-0.95197:"18-24", -0.07854:"25-34", 0.49788:"35-44", 1.09449:"45-54", 1.82213:"55-64", 2.59171:"65+"}
data_CPERSONALITY_Visual["nscore"] = data_CPERSONALITY_Visual["nscore"].map(map_dict_age)





data_CPERSONALITY_Visual

Unnamed: 0,nscore,escore,oscore,ascore,csore,impulsive_bis11,sensation_ImpSS
0,,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084
1,,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575
2,,0.80523,-0.84732,-1.62090,-1.01450,-1.37983,0.40148
3,,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084
4,,-1.63340,-0.45174,-0.30172,1.30612,-0.21712,-0.21575
...,...,...,...,...,...,...,...
1880,,1.74091,1.88511,0.76096,-1.13788,0.88113,1.92173
1881,,1.74091,0.58331,0.76096,-1.51840,0.88113,0.76540
1882,,-1.37639,-1.27553,-1.77200,-1.38502,0.52975,-0.52593
1883,,-1.92173,0.29338,-1.62090,-2.57309,1.29221,1.22470


In [24]:
data_CDRUG.head(5)

Unnamed: 0,alcohol,amphet,amyl,benzos,caff,canabis,choc,coke,crack,ecstasy,heroin,ketamine,legalhigh,LSD,meth,mushrooms,nicotine,semer
0,CL5,CL2,CL0,CL2,CL6,CL0,CL5,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0
1,CL5,CL2,CL2,CL0,CL6,CL4,CL6,CL3,CL0,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0
2,CL6,CL0,CL0,CL0,CL6,CL3,CL4,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0
3,CL4,CL0,CL0,CL3,CL5,CL2,CL4,CL2,CL0,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0
4,CL4,CL1,CL1,CL0,CL6,CL3,CL6,CL0,CL0,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0


In [27]:
data_CDRUG.shape[1]

18

In [None]:
data_CDRUG_Visual = data_CDRUG.copy()

map_dict_drug = {"CL0":"Never Used", "CL1":"Used over a Decade Ago", "CL2":"Used in Last Decade", "CL3":"Used in Last Year", "CL4":"Used in Last Month", "CL5":"Used in Last Week", "CL6":"Used in Last Day"}

for i in range(0, data_CDRUG.shape[1])
    print(i)

data_CDRUG_Visual["nscore"] = data_CDRUG_Visual["nscore"].map(map_dict_drug)





data_CDRUG_Visual