# Drug Vibes meets the European Web Survey on Drugs (EWSD)
### In 2024, the web survey was conducted as part of a Europe-wide project: the European Web Survey on Drugs (EWSD), a web survey conducted in 35 countries, from the EU and beyond.
The third tab of this dashboard displays a selection of the results of this survey, using the answers of people who indicated that they live in Belgium.
The structure of this survey is similar to Drug Vibes', with many questions in common. However, the surveys have substantial differences, which makes direct comparison of these results ill-advised:

- Drug Vibes focuses on drug use in the past month, whereas the EWSD focuses on use in the past year. The analysis populations of the two studies therefore differ meaningfully.
- A number of questions were asked differently in the two studies (e.g. with different answer options, as for the questions about motivations, settings of use, or purchase channels).

The studies, although similar, should therefore be considered and analysed separately.


## Load the dataset and get to know the data (columns, rows, dtypes, null values, etc.)

In [1]:
import pandas as pd

df = pd.read_excel("table_DV_region.xlsx", sheet_name="Sheet 1")
df = df.dropna(subset=["Percentage"])   # remove missing percentages
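
The heading above mentions columns, rows, dtypes and null values, but the cell only loads the file. A minimal sketch of that inspection step follows. It assumes a small stand-in frame (`toy`, a hypothetical name) built from five rows printed later in this notebook, so that it runs without the Excel file; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Stand-in data: five rows copied from the table printed later in this notebook.
cols = ["Year", "Gender", "Age", "Region", "Substance", "n", "tot", "Percentage"]
rows = [
    (2022, "Woman", "18 - 29 yo", "Flanders", "All", 228, 424, 53.77),
    (2022, "Woman", "18 - 29 yo", "Wallonia", "All", 39, 424, 9.20),
    (2022, "Woman", "18 - 29 yo", "Brussels", "All", 65, 424, 15.33),
    (2022, "Woman", "18 - 29 yo", "No answer", "All", 92, 424, 21.70),
    (2023, "Woman", "18 - 29 yo", "Flanders", "All", 663, 1237, 53.60),
]
toy = pd.DataFrame(rows, columns=cols)

n_rows, n_cols = toy.shape        # number of rows and columns
null_counts = toy.isna().sum()    # missing values per column
distinct = toy.nunique()          # distinct values per column

print(f"{n_rows} rows x {n_cols} columns")
print(toy.dtypes)                 # dtype of each column
print(null_counts)
print(distinct)
```

Running the null count before `dropna` shows how many rows the filter on `Percentage` would remove.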


## Create the target column
- For this, I will add a column Addicted, set to 1 when Percentage is above 50 and to 0 otherwise.

In [4]:
df["Addicted"] = (df["Percentage"] > 50).astype(int)
df

Unnamed: 0,Year,Gender,Age,Region,Substance,n,tot,Percentage,Addicted
0,2022,Woman,18 - 29 yo,Flanders,All,228,424,53.77,1
1,2022,Woman,18 - 29 yo,Wallonia,All,39,424,9.20,0
2,2022,Woman,18 - 29 yo,Brussels,All,65,424,15.33,0
3,2022,Woman,18 - 29 yo,No answer,All,92,424,21.70,0
4,2023,Woman,18 - 29 yo,Flanders,All,663,1237,53.60,1
...,...,...,...,...,...,...,...,...,...
943,2023,All,All,No answer,Ketamine,354,730,48.49,0
944,2025,All,All,Flanders,Ketamine,59,260,22.69,0
945,2025,All,All,Wallonia,Ketamine,19,260,7.31,0
946,2025,All,All,Brussels,Ketamine,34,260,13.08,0
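
In the rows printed above, Percentage appears to equal n / tot × 100 (for example 228 / 424 ≈ 53.77), so the label marks rows where a region accounts for more than half of its group's total. The sketch below checks that reading and the resulting labels on the first five printed rows. The frame name `check` is hypothetical, and the check covers only those rows, not the whole file.

```python
import pandas as pd

# First five rows of the printed table (n, tot, Percentage, Addicted as shown).
check = pd.DataFrame(
    {
        "Region": ["Flanders", "Wallonia", "Brussels", "No answer", "Flanders"],
        "n": [228, 39, 65, 92, 663],
        "tot": [424, 424, 424, 424, 1237],
        "Percentage": [53.77, 9.20, 15.33, 21.70, 53.60],
    }
)

# Recompute the percentage from the counts and compare with the stored value.
recomputed = (check["n"] / check["tot"] * 100).round(2)
matches = (recomputed - check["Percentage"]).abs() < 0.01

# Same rule as the notebook: 1 when Percentage is above 50, else 0.
check["Addicted"] = (check["Percentage"] > 50).astype(int)

print(matches.all())
print(check["Addicted"].value_counts())  # class balance of the label
```

On the full dataset, `df["Addicted"].value_counts()` gives the class balance that the stratified split and `class_weight="balanced"` later have to cope with.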


## Split the features and the target

In [5]:
X = df[["Year", "Gender", "Age", "Region", "Substance"]]
y = df["Addicted"]


## Train / Test Split


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


## Preprocessing (encoding categories)

In [10]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


categorical_features = ["Gender", "Age", "Region", "Substance"]
numeric_features = ["Year"]
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("num", "passthrough", numeric_features)
    ]
)


## Model training
- I will use a logistic regression model.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

log_reg_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced"))
])

log_reg_pipeline.fit(X_train, y_train)
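
The pipeline is fitted on the training split, so a natural next step is to score it on the rows held out by `train_test_split`. The following is a minimal sketch of such a check, not the notebook's own evaluation: `evaluate_on_holdout` and the `toy_*` names are hypothetical, and the nine rows are copied from the table printed earlier so that the block runs on its own. With so few rows the scores themselves mean little; the point is the shape of the check.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in data: rows copied from the table printed earlier in this notebook.
cols = ["Year", "Gender", "Age", "Region", "Substance", "Percentage"]
rows = [
    (2022, "Woman", "18 - 29 yo", "Flanders", "All", 53.77),
    (2022, "Woman", "18 - 29 yo", "Wallonia", "All", 9.20),
    (2022, "Woman", "18 - 29 yo", "Brussels", "All", 15.33),
    (2022, "Woman", "18 - 29 yo", "No answer", "All", 21.70),
    (2023, "Woman", "18 - 29 yo", "Flanders", "All", 53.60),
    (2023, "All", "All", "No answer", "Ketamine", 48.49),
    (2025, "All", "All", "Flanders", "Ketamine", 22.69),
    (2025, "All", "All", "Wallonia", "Ketamine", 7.31),
    (2025, "All", "All", "Brussels", "Ketamine", 13.08),
]
toy = pd.DataFrame(rows, columns=cols)
toy_X = toy.drop(columns="Percentage")
toy_y = (toy["Percentage"] > 50).astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Same structure as the notebook's pipeline.
toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["Gender", "Age", "Region", "Substance"]),
        ("num", "passthrough", ["Year"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
toy_pipeline.fit(toy_X_train, toy_y_train)


def evaluate_on_holdout(model, X_holdout, y_holdout):
    """Score a fitted model on rows it did not see during fit."""
    predictions = model.predict(X_holdout)
    return {
        "accuracy": accuracy_score(y_holdout, predictions),
        # labels=[0, 1] keeps the matrix 2x2 even if a class is absent from the hold-out rows
        "confusion_matrix": confusion_matrix(y_holdout, predictions, labels=[0, 1]),
    }


report = evaluate_on_holdout(toy_pipeline, toy_X_test, toy_y_test)
print(report)
```

In the notebook itself the equivalent call would be `evaluate_on_holdout(log_reg_pipeline, X_test, y_test)`.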


### Random Forest
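
A later cell calls `rf_pipeline.predict_proba`, but no cell shown here defines `rf_pipeline`. The sketch below shows one way such a pipeline could be built, reusing the same preprocessing as the logistic regression. It is an assumption, not the original cell: the hyperparameters (`n_estimators=200`, `class_weight="balanced"`, `random_state=42`) and the names `rf_sketch` and `toy_*` are illustrative, and the rows are copied from the table printed earlier so that the block runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in data: rows copied from the table printed earlier in this notebook.
cols = ["Year", "Gender", "Age", "Region", "Substance", "Percentage"]
rows = [
    (2022, "Woman", "18 - 29 yo", "Flanders", "All", 53.77),
    (2022, "Woman", "18 - 29 yo", "Wallonia", "All", 9.20),
    (2022, "Woman", "18 - 29 yo", "Brussels", "All", 15.33),
    (2022, "Woman", "18 - 29 yo", "No answer", "All", 21.70),
    (2023, "Woman", "18 - 29 yo", "Flanders", "All", 53.60),
    (2023, "All", "All", "No answer", "Ketamine", 48.49),
    (2025, "All", "All", "Flanders", "Ketamine", 22.69),
    (2025, "All", "All", "Wallonia", "Ketamine", 7.31),
    (2025, "All", "All", "Brussels", "Ketamine", 13.08),
]
toy = pd.DataFrame(rows, columns=cols)
toy_X = toy.drop(columns="Percentage")
toy_y = (toy["Percentage"] > 50).astype(int)

# Same preprocessing as the logistic regression pipeline, with a forest as classifier.
rf_sketch = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["Gender", "Age", "Region", "Substance"]),
        ("num", "passthrough", ["Year"]),
    ])),
    ("classifier", RandomForestClassifier(
        n_estimators=200,          # assumed value
        class_weight="balanced",   # assumed, mirrors the logistic regression
        random_state=42,           # assumed, for reproducibility
    )),
])
rf_sketch.fit(toy_X, toy_y)

# Probability of the positive class (Addicted = 1) for each row.
toy_probs = rf_sketch.predict_proba(toy_X)[:, 1]
print(toy_probs)
```

Built the same way, fitted on `X_train` and `y_train`, and named `rf_pipeline`, a pipeline like this would let the later prediction cell run.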

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# -------------------------------
# Predict with the logistic regression model (on all rows, training and test)
# -------------------------------
df["Predicted_LogReg"] = log_reg_pipeline.predict(X)

# -------------------------------
# Actual addiction rate (from dataset labels)
# -------------------------------
region_age_actual = df.groupby(["Region", "Age"])["Addicted"].mean().reset_index()
region_age_actual.rename(columns={"Addicted": "Actual"}, inplace=True)

# -------------------------------
# Logistic Regression predictions
# -------------------------------
region_age_log = df.groupby(["Region", "Age"])["Predicted_LogReg"].mean().reset_index()
region_age_log.rename(columns={"Predicted_LogReg": "LogReg_Predicted"}, inplace=True)


# -------------------------------
# Merge results
# -------------------------------
comparison = region_age_actual.merge(region_age_log, on=["Region", "Age"])

print(comparison)






       Region         Age    Actual  LogReg_Predicted
0    Brussels  18 - 29 yo  0.000000          0.000000
1    Brussels     30 - 39  0.000000          0.000000
2    Brussels         40+  0.000000          0.000000
3    Brussels         All  0.000000          0.000000
4    Flanders  18 - 29 yo  0.297872          1.000000
5    Flanders     30 - 39  0.378378          1.000000
6    Flanders         40+  0.700000          1.000000
7    Flanders         All  0.396226          0.981132
8   No answer  18 - 29 yo  0.234043          0.468085
9   No answer     30 - 39  0.081081          0.243243
10  No answer         40+  0.000000          0.333333
11  No answer         All  0.094340          0.169811
12   Wallonia  18 - 29 yo  0.000000          0.000000
13   Wallonia     30 - 39  0.000000          0.000000
14   Wallonia         40+  0.000000          0.000000
15   Wallonia         All  0.000000          0.000000


In [24]:
import pandas as pd

# Example input (one demographic group, not an individual)
# Values must match the training categories exactly; the encoder ignores unmatched ones.
sample = pd.DataFrame([{
    "Year": 2025,               # a year present in the dataset
    "Gender": "all",            # note: the dataset spells this "All"
    "Age": "35 - 44 yo",        # note: the dataset uses "18 - 29 yo", "30 - 39", "40+", "All"
    "Region": "Flanders",       # region
    "Substance": "Ketamine"     # a substance present in the dataset
}])

# Logistic Regression probability
prob_log_reg = log_reg_pipeline.predict_proba(sample)[:,1][0]

# Random Forest probability
prob_rf = rf_pipeline.predict_proba(sample)[:,1][0]

print(f"Logistic Regression: {prob_log_reg:.2f} chance of addiction")
print(f"Random Forest: {prob_rf:.2f} chance of addiction")


Logistic Regression: 0.62 chance of addiction
Random Forest: 0.43 chance of addiction
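
The sample above passes Gender "all" and Age "35 - 44 yo", whereas the dataset rows use "All" and the grouped output lists the ages "18 - 29 yo", "30 - 39", "40+" and "All". Because the encoder is created with `handle_unknown="ignore"`, values it has not seen are encoded as all zeros rather than raising an error, so such a sample is scored as if Gender and Age were absent. The sketch below illustrates this on a two-column stand-in and adds a small check; `unseen_values` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in training categories taken from rows shown in this notebook.
train = pd.DataFrame({
    "Gender": ["Woman", "All"],
    "Age": ["18 - 29 yo", "All"],
})
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# Same Gender and Age values as the sample in the cell above.
sample = pd.DataFrame([{"Gender": "all", "Age": "35 - 44 yo"}])
encoded = encoder.transform(sample).toarray()
print(encoded)  # every column is 0: neither value matched a known category


def unseen_values(fitted_encoder, frame):
    """Return {column: value} for entries the fitted encoder has never seen."""
    unseen = {}
    for column, known in zip(frame.columns, fitted_encoder.categories_):
        known_set = set(known)
        for value in frame[column]:
            if value not in known_set:
                unseen[column] = value
    return unseen


unseen = unseen_values(encoder, sample)
print(unseen)
```

Running a check like this before `predict_proba` makes a misspelled category visible instead of letting it silently change the prediction.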
