# Data analysis of psychoactive substance use

General *imports* <br>Project built with Python 3.12.7

In [1]:
import pandas as pd
pd.set_option('future.no_silent_downcasting', True)

Load the dataset

In [2]:
dconsum = pd.read_csv("drug_consumption.csv")
dconsum.drop("ID", axis = 1, inplace = True)
dconsum.head()

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,...,Ecstasy,Heroin,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA
0,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,-0.14277,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,0.58489,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,1.30612,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


Quantify the categories related to substance use, keeping only the digit:
```
Value	Description
CL0	Never Used
CL1	Used over a Decade Ago
CL2	Used in Last Decade
CL3	Used in Last Year
CL4	Used in Last Month
CL5	Used in Last Week
CL6	Used in Last Day
```

In [83]:
quant_cat = {
    'CL0': 0,
    'CL1': 1,
    'CL2': 2,
    'CL3': 3,
    'CL4': 4,
    'CL5': 5,
    'CL6': 6,
}

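# Cast explicitly: with 'future.no_silent_downcasting' enabled, replace() leaves the columns as object dtype
drug_cols = dconsum.columns[12:]
dconsum[drug_cols] = dconsum[drug_cols].replace(quant_cat).astype(int)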
dconsum.head()

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,...,Ecstasy,Heroin,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA
0,5,1.0,0,2,6,0,5,0,0,0,...,0,0,0,0,0,0,0,2,0,0
1,5,1.0,2,0,6,4,6,3,0,4,...,4,0,2,0,2,3,0,4,0,0
2,6,0.0,0,0,6,3,4,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,4,0.0,0,3,5,2,4,2,0,0,...,0,0,2,0,0,0,0,2,0,0
4,4,1.0,1,0,6,3,6,0,0,1,...,1,0,0,1,0,0,2,2,0,0
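
Because every code has the form `CL` followed by a single digit, the same quantification can also be sketched by slicing off the prefix and casting to integers, which leaves the columns with a numeric dtype. The frame `toy` and its values are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame that mimics two of the substance columns
toy = pd.DataFrame({
    "Amphet": ["CL0", "CL2", "CL6"],
    "Nicotine": ["CL4", "CL0", "CL1"],
})

# Drop the "CL" prefix and cast the remaining digit to int, column by column
quantified = toy.apply(lambda col: col.str[2:].astype(int))

print(quantified.dtypes)   # both columns are integer dtype
print(quantified)
```

Unlike the dictionary mapping, this variant fails loudly if a cell does not end in a digit, which may or may not be desirable.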


We pick the most balanced substance column, i.e. the one with the most similar numbers of users and never-users (*CL0*).

In [84]:
def balanced_col(df, value):
    percentages = {}  # Percentage of rows equal to `value`, per column

    # Compute the percentage for each column
    for column in df.columns:
        count = df[column].eq(value).sum()
        total = len(df[column])
        percentage = (count / total) * 100
        percentages[column] = percentage  # Store the result in the dictionary

    # Find the column with percentage closest to 50%
    closest_column = min(percentages, key=lambda x: abs(percentages[x] - 50))

    return percentages, closest_column

percentages, closest_column = balanced_col(dconsum.iloc[:,12:], 0)
print("The substance with the best balance between users and never-users is " + closest_column)

The substance with the best balance between users and never-users is Amphet
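
The same selection can be written without an explicit loop. As a minimal sketch on a hypothetical toy frame: `DataFrame.eq(value).mean()` gives the share of rows equal to `value` in each column, and `idxmin` on the distance to 0.5 picks the most balanced column.

```python
import pandas as pd

# Hypothetical quantified columns (0 = never used)
toy = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 2],   # 4 of 6 are never-users
    "B": [0, 0, 0, 3, 1, 2],   # 3 of 6 are never-users
    "C": [0, 5, 4, 3, 1, 2],   # 1 of 6 is a never-user
})

share_never = toy.eq(0).mean()                       # fraction of zeros per column
closest_column = (share_never - 0.5).abs().idxmin()  # column closest to a 50/50 split

print(share_never * 100)   # percentages, comparable to those returned by balanced_col
print(closest_column)
```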


We convert `Amphet` into a binary column that indicates whether the person has ever used it (1) or has never used it (0).

In [85]:
def clas_quant(value):
    # 0 = never used (CL0), 1 = used at some point (CL1 to CL6)
    if value == 0:
        return 0
    return 1

dconsum["Amphet"] = dconsum["Amphet"].apply(clas_quant)
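
A vectorized comparison yields the same binary column as `apply`, and `value_counts(normalize=True)` offers a quick check of how balanced the resulting classes are. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical quantified usage levels (0 = never used, 1-6 = used at some point)
amphet = pd.Series([0, 3, 0, 6, 1, 0, 0, 2], name="Amphet")

# True where the person has ever used the substance, then cast to 0/1
amphet_binary = amphet.ne(0).astype(int)

print(amphet_binary.tolist())
print(amphet_binary.value_counts(normalize=True))   # class shares
```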

## Models for classifying amphetamine users

The goal of our models is to decide whether a person is a potential amphetamine user based on the data in `dconsum`, together with a statistical analysis.

For this we will use the following statistical models:
- Logistic regression
- KNN
- SVM
- Linear discriminant analysis

With the following performance metrics:
- F1-score
- Recall
- Accuracy

To do so, we first separate the independent variables from the dependent variable, since this split is used by every model.

In [92]:
X = dconsum.iloc[:, :12]
y = dconsum["Amphet"]
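
As a rough sketch of how the four models and three metrics listed above could be compared in one loop, the following uses scikit-learn estimators on synthetic data from `make_classification`. The data, the default hyperparameters, and the `StandardScaler` step are assumptions for illustration and are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X (12 predictors) and y (binary target)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Linear discriminant": LinearDiscriminantAnalysis(),
}

results = {}
for name, estimator in models.items():
    # Scale the features first; KNN and SVM are sensitive to feature scale
    clf = make_pipeline(StandardScaler(), estimator)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    results[name] = {
        "f1": f1_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "accuracy": accuracy_score(y_te, pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Keeping the estimators in a dictionary means every model is evaluated on the same split with the same metrics, so the scores are directly comparable.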

### Logistic regression model

In [96]:
import statsmodels.api as sm

def clean_dataset(X, y, threshold=0.05):
    # Cast to numeric types and add a constant for the intercept term.
    # With object-dtype columns, sm.Logit raises
    # "ValueError: Pandas data cast to numpy dtype of object."
    X = sm.add_constant(X.astype(float))
    y = y.astype(int)

    # Fit the full model and get the p-values
    model = sm.Logit(y, X).fit(disp=False)
    p_values = model.pvalues.drop('const')

    # Backward elimination: drop variables while any p-value exceeds the threshold
    while p_values.max() > threshold:
        to_drop = p_values.idxmax()  # Variable with the highest p-value

        print(f"\nRemoving variable: {to_drop} with p-value: {p_values.max()}")

        # Remove the variable from the dataset
        X = X.drop(columns=[to_drop])

        # Refit and get the updated p-values
        model = sm.Logit(y, X).fit(disp=False)
        p_values = model.pvalues.drop('const')

    return X

X_reduc = clean_dataset(X, y)
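
statsmodels raises `ValueError: Pandas data cast to numpy dtype of object` when the design matrix contains object-dtype columns. As a small sketch on a hypothetical frame, inspecting `dtypes` and `np.asarray` (as the message suggests) exposes the problem, and an explicit cast to `float` is one way to resolve it.

```python
import numpy as np
import pandas as pd

# Hypothetical predictors stored as Python objects instead of numbers
X_obj = pd.DataFrame({"Age": [5, 6, 4], "Gender": [1.0, 0.0, 1.0]}, dtype=object)

print(X_obj.dtypes)                 # both columns report object
print(np.asarray(X_obj).dtype)      # object: not usable as a numeric design matrix

# Explicit cast to a numeric dtype
X_num = X_obj.astype(float)
print(np.asarray(X_num).dtype)      # float64
```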

We then split the dataset into training and test sets with an 80/20 ratio.

In [97]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
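
For a binary target it can help to keep the class proportions equal in both partitions. A sketch with the `stratify` argument of `train_test_split`, using hypothetical synthetic data rather than `dconsum`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(45)

# Hypothetical data: 1000 rows, 12 predictors, roughly 40% positives
X_demo = rng.normal(size=(1000, 12))
y_demo = (rng.random(1000) < 0.4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=45, stratify=y_demo
)

print(X_tr.shape, X_te.shape)       # 800 training rows, 200 test rows
print(y_tr.mean(), y_te.mean())     # positive-class share in each partition
```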

In [99]:
X_train

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,Impulsive,SS
1806,4,1.0,0,2,6,6,3,5,2,4,2,1
705,5,1.0,2,0,6,5,5,3,0,5,0,3
947,6,1.0,0,3,6,4,6,2,0,3,2,2
934,3,1.0,2,5,5,5,6,3,0,5,0,3
1883,5,0.0,0,0,6,6,5,0,0,3,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
607,6,1.0,4,4,6,6,6,6,2,4,2,3
1568,6,0.0,0,0,6,0,5,0,0,0,0,0
1667,6,0.0,0,0,6,0,5,0,0,0,0,0
414,6,1.0,3,0,6,6,6,3,0,3,0,5


In [98]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

accuracy_score(y_test, y_pred) 

1.0

Performance metrics

In [90]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       217
           1       1.00      1.00      1.00       160

    accuracy                           1.00       377
   macro avg       1.00      1.00      1.00       377
weighted avg       1.00      1.00      1.00       377



In [77]:
confusion_matrix(y_test, y_pred)

array([[177,  40],
       [ 21, 139]])
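
The metrics can also be derived by hand from a confusion matrix, which is a useful cross-check on the report. Assuming scikit-learn's layout (rows are true classes, columns are predicted classes), the matrix shown above gives an accuracy of 316/377 ≈ 0.84 rather than 1.00, so it presumably comes from a different execution than the report (the cell numbers are `In [77]` versus `In [90]`).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[177, 40],
               [21, 139]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()                       # correct predictions over all cases
precision = tp / (tp + fp)                            # predicted users who are users
recall = tp / (tp + fn)                               # actual users that were detected
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```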