# Notebook Regression

The goal of this notebook is to regress the survival time (duration) of clinical trials on our explanatory variables. We consider two main models: a linear regression on the one hand and a parametric duration model on the other.

In [22]:
import statsmodels.api as sm
import pandas as pd
import warnings 
warnings.filterwarnings("ignore")

In [23]:
data_phase3 = pd.read_csv('./data/Data_augmented3.csv')
data_full = pd.read_csv('../data/clini_data.csv')

In [24]:
print(f"Shape of the phase 3 clinical trial dataframe:\n{data_phase3.shape[0]} observations for {data_phase3.shape[1]} variables.\n\nShape of the full dataframe:\n{data_full.shape[0]} observations for {data_full.shape[1]} variables")

Shape of the phase 3 clinical trial dataframe:
6261 observations for 341 variables.

Shape of the full dataframe:
450000 observations for 329 variables


In [25]:
set(data_phase3.columns) - set(data_full.columns)

{'Bin',
 'Conditions',
 'Drug',
 'InclusionCriteria',
 'InclusionReduced',
 'Mood',
 'Observation',
 'Person',
 'Procedure',
 'TimePassed',
 'Unnamed: 0.1',
 'raw_count'}

The first model is a linear regression. Formally, we consider

$(M) : y_i = \alpha_0 + \sum_{j=1}^{3} \alpha_j C_{j,i} + \sum_{j=1}^{3} \beta_j C_{j,i}^2 + \epsilon_i$

where:
* $y_i$ is the duration of clinical trial $i$, i.e. the difference between its start date and its end date
* $C_{j,i}$ is the number of terms associated with criterion type $j$ (Conditions, Procedure, Drug) in clinical trial $i$

We include the squared variables so that the relationship is not constrained to be strictly linear.
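As an illustration of how such a quadratic specification can be written as a formula string, here is a minimal sketch. It assumes the patsy formula syntax used by `sm.OLS.from_formula`, where `I(...)` evaluates an arithmetic expression and an intercept is added by default, so no explicit constant is listed. The helper `build_formula` is hypothetical and is not part of the notebook.

```python
def build_formula(target, features, add_squares=True):
    """Build a patsy-style formula with optional squared terms.

    Hypothetical helper for illustration only. The intercept is left
    implicit because the formula interface adds one by default.
    """
    terms = []
    for feat in features:
        terms.append(feat)
        if add_squares:
            # I(...) tells patsy to evaluate the expression arithmetically.
            terms.append(f"I({feat} ** 2)")
    return f"{target} ~ " + " + ".join(terms)


formula = build_formula("TimePassed", ["Conditions", "Procedure", "Drug"])
print(formula)
```

Written this way, the squared terms do not need to be stored as extra dataframe columns, and the formula could be passed to `sm.OLS.from_formula(formula, data=...)`.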

In [38]:
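# Build the formula: each criterion count enters linearly and squared.
model_string = 'TimePassed ~ '
for feat in ['Conditions', 'Procedure', 'Drug']:
    var = f'{feat}_2'
    data_phase3[var] = data_phase3[feat] ** 2
    model_string += f'{feat} + {var} + '
# 'const' is an explicit constant column, created in the next cell.
model_string += 'const'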

In [39]:
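# Explicit constant column. from_formula already adds an Intercept term,
# so this column is perfectly collinear with it.
data_phase3['const'] = 1
# Same values as 'Conditions_2'; not used in the formula.
data_phase3['Conditions2'] = data_phase3['Conditions'] ** 2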

model = sm.OLS.from_formula(model_string, data=data_phase3).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | TimePassed | R-squared | 0.031 |
| Model | OLS | Adj. R-squared | 0.03 |
| Method | Least Squares | F-statistic | 33.2 |
| Date | Wed, 27 Dec 2023 | Prob (F-statistic) | 1.26e-39 |
| Time | 22:04:55 | Log-Likelihood | -43448.0 |
| No. Observations | 6261 | AIC | 86910.0 |
| Df Residuals | 6254 | BIC | 86960.0 |
| Df Model | 6 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 233.2975 | 2.578 | 90.495 | 0.000 | 228.244 | 238.351 |
| Conditions | 32.7078 | 2.922 | 11.194 | 0.000 | 26.980 | 38.436 |
| Conditions_2 | -2.6896 | 0.277 | -9.695 | 0.000 | -3.233 | -2.146 |
| Procedure | 18.1774 | 4.948 | 3.674 | 0.000 | 8.477 | 27.877 |
| Procedure_2 | -2.7281 | 0.901 | -3.027 | 0.002 | -4.495 | -0.962 |
| Drug | 27.3511 | 5.789 | 4.724 | 0.000 | 16.002 | 38.700 |
| Drug_2 | -2.7616 | 0.875 | -3.156 | 0.002 | -4.477 | -1.046 |
| const | 233.2975 | 2.578 | 90.495 | 0.000 | 228.244 | 238.351 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 1166.705 | Durbin-Watson | 1.934 |
| Prob(Omnibus) | 0.0 | Jarque-Bera (JB) | 243.811 |
| Skew | 0.049 | Prob(JB) | 1.14e-53 |
| Kurtosis | 2.038 | Cond. No. | 1.06e+16 |


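The model remains fairly simple. The R² is very low: we obviously do not expect to predict the duration of a clinical trial perfectly, but we want to know whether the complexity of the eligibility criteria has an effect on that duration. All six criterion coefficients are significant at the 1% level, with positive linear terms and negative squared terms, which suggests a concave relationship between the number of criterion terms and the duration. The very large condition number (1.06e+16) comes from the `const` column, which duplicates the intercept that the formula interface adds automatically: the two rows are identical, and the constant actually used in the fitted values is their sum.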
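To make the shape of the estimated relationship concrete, the sketch below computes where each fitted quadratic stops increasing. For a term $\alpha_j C + \beta_j C^2$ with $\beta_j < 0$, the maximum lies at $C^* = -\alpha_j / (2\beta_j)$. The coefficients are copied from the first summary table; the helper `turning_point` is hypothetical, and the result is only a point estimate that ignores the standard errors.

```python
# Coefficients copied from the first OLS summary: (linear, squared).
coefs = {
    "Conditions": (32.7078, -2.6896),
    "Procedure": (18.1774, -2.7281),
    "Drug": (27.3511, -2.7616),
}


def turning_point(linear, quadratic):
    """Return the count at which linear*c + quadratic*c**2 is stationary."""
    return -linear / (2 * quadratic)


turning_points = {name: turning_point(a, b) for name, (a, b) in coefs.items()}
for name, c_star in turning_points.items():
    print(f"{name}: estimated duration peaks near {c_star:.1f} terms")
```

Under these estimates the fitted duration rises up to roughly 6 condition terms, 3 procedure terms and 5 drug terms, and declines beyond those counts.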

In [54]:
import numpy as np

In [64]:
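# Number of trials per country, computed on the full dataset.
dict_country = dict(data_full['LocationCountry'].value_counts())

def country_to_num(line):
    """Return the log of the country's trial count, or 0 if the country is missing."""
    if line != '':
        # Raises KeyError if the country never appears in data_full.
        return np.log(dict_country[line])
    return 0

data_phase3['CountryCT'] = data_phase3['LocationCountry'].fillna('').apply(country_to_num)
data_phase3['CountryCT']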

0        7.473069
1       11.927555
2       11.927555
3        0.000000
4        8.775086
          ...    
6256    11.927555
6257     0.000000
6258     0.000000
6259    11.927555
6260    11.927555
Name: CountryCT, Length: 6261, dtype: float64
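The cell above encodes each country as the log of its number of trials in the full dataset, with 0 for missing values. A standard-library sketch of the same idea is given below on toy data. The country lists and the helper `log_frequency_encoder` are hypothetical, and mapping unseen countries to 0 is an assumption of this sketch, not the notebook's behaviour.

```python
import math
from collections import Counter


def log_frequency_encoder(reference_values):
    """Return a function mapping a category to log(count in reference_values).

    Missing values ('' or None) and categories absent from the reference
    data are mapped to 0.0 (an assumption of this sketch).
    """
    counts = Counter(v for v in reference_values if v)

    def encode(value):
        count = counts.get(value, 0)
        return math.log(count) if count > 0 else 0.0

    return encode


# Hypothetical toy data standing in for data_full['LocationCountry'].
reference = ["United States"] * 4 + ["France"] * 2 + ["", None]
encode = log_frequency_encoder(reference)

encoded = [encode(c) for c in ["United States", "France", "", "Atlantis"]]
print(encoded)
```

Note that with this encoding a country observed exactly once also maps to 0, since log(1) = 0, so it cannot be distinguished from a missing value.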

In [65]:
data_phase3[['LeadSponsorName','CountryCT']]

| | LeadSponsorName | CountryCT |
|---|---|---|
| 0 | POLYSAN Scientific & Technological Pharmaceuti... | 7.473069 |
| 1 | State University of New York at Buffalo | 11.927555 |
| 2 | University of California, Los Angeles | 11.927555 |
| 3 | Sanofi | 0.000000 |
| 4 | Universidade Federal do Para | 8.775086 |
| ... | ... | ... |
| 6256 | Teva Branded Pharmaceutical Products R&D, Inc. | 11.927555 |
| 6257 | Organon and Co | 0.000000 |
| 6258 | Merck Sharp & Dohme LLC | 0.000000 |
| 6259 | Galderma R&D | 11.927555 |


In [66]:
# Add the country frequency and the healthy-volunteers flag as controls.
model = sm.OLS.from_formula(model_string + ' + CountryCT + HealthyVolunteers', data=data_phase3).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | TimePassed | R-squared | 0.055 |
| Model | OLS | Adj. R-squared | 0.053 |
| Method | Least Squares | F-statistic | 44.99 |
| Date | Wed, 27 Dec 2023 | Prob (F-statistic) | 8.78e-71 |
| Time | 22:33:28 | Log-Likelihood | -43267.0 |
| No. Observations | 6246 | AIC | 86550.0 |
| Df Residuals | 6237 | BIC | 86610.0 |
| Df Model | 8 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 180.0028 | 5.344 | 33.684 | 0.000 | 169.527 | 190.479 |
| HealthyVolunteers[T.No] | 87.7488 | 8.402 | 10.443 | 0.000 | 71.277 | 104.220 |
| Conditions | 25.8063 | 2.968 | 8.696 | 0.000 | 19.989 | 31.624 |
| Conditions_2 | -2.2391 | 0.278 | -8.062 | 0.000 | -2.784 | -1.695 |
| Procedure | 13.4550 | 4.908 | 2.742 | 0.006 | 3.834 | 23.076 |
| Procedure_2 | -2.1421 | 0.892 | -2.402 | 0.016 | -3.890 | -0.394 |
| Drug | 25.9003 | 5.752 | 4.503 | 0.000 | 14.625 | 37.176 |
| Drug_2 | -2.8994 | 0.867 | -3.344 | 0.001 | -4.599 | -1.200 |
| const | 180.0028 | 5.344 | 33.684 | 0.000 | 169.527 | 190.479 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 924.23 | Durbin-Watson | 1.939 |
| Prob(Omnibus) | 0.0 | Jarque-Bera (JB) | 220.734 |
| Skew | 0.048 | Prob(JB) | 1.17e-48 |
| Kurtosis | 2.084 | Cond. No. | 3.7e+15 |
