# IEEE Fraud Detection

Online fraud detection is one of the most common and sensitive problems in many industries, banking in particular. Fraud attempts have risen sharply in recent years, which makes fighting them a priority.

This competition is a binary classification problem: the target variable is a binary attribute (is the transaction fraudulent or not?), and the goal is to separate "fraudulent" from "non-fraudulent" transactions as well as possible.

The goal is to predict the probability that an online transaction is fraudulent.
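
Throughout this notebook the predicted probabilities are scored with ROC AUC (`roc_auc_score`). As a rough illustration of what that number measures, the sketch below computes it by hand as the share of (fraud, non-fraud) pairs that are ranked in the right order. The helper name `pairwise_auc` and the toy scores are made up for this example.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts for one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc_toy = pairwise_auc(labels, scores)
print(auc_toy)
```

Because only the ordering of the scores matters, a model can reach a good AUC without its probabilities being well calibrated.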

# Required packages

In [4]:
import numpy as np     
import pandas as pd   
import matplotlib.pyplot as plt   
import seaborn as sns          # higher-level plotting library built on matplotlib
import pickle as pkl
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import time
import random
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

sns.set()  # nicer-looking default plot style

# Loading the data

Let's start by looking at the sample submission file.

In [5]:
data_sub = pd.read_csv('sample_submission.csv')
data_sub.head()

Unnamed: 0,TransactionID,isFraud
0,3663549,0.5
1,3663550,0.5
2,3663551,0.5
3,3663552,0.5
4,3663553,0.5


In [6]:
del data_sub

Now we load the train and test data. Each set is split into two files: identity and transaction.

In [None]:
train_id = pd.read_csv('train_identity.csv')
train_trans = pd.read_csv('train_transaction.csv')
test_id = pd.read_csv('test_identity.csv')
test_trans = pd.read_csv('test_transaction.csv')

In [None]:
train_id.head()

We merge the transaction and identity tables, for train and for test, on the TransactionID column.

In [None]:
train = pd.merge(train_trans, train_id, on='TransactionID', how='left')
test = pd.merge(test_trans, test_id, on='TransactionID', how='left')
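
To see what `how='left'` does in the merge above, here is a small sketch on two made-up tables: every transaction is kept, and the identity columns are NaN when no identity row exists. The toy values are invented; `indicator=True` is a standard `pd.merge` option that records where each row came from.

```python
import pandas as pd

# Toy tables: four transactions, identity information for only two of them.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [50.0, 20.0, 75.5, 10.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# how="left" keeps every transaction; indicator=True adds a "_merge" column
# telling whether a matching identity row was found.
merged = pd.merge(transactions, identity, on="TransactionID",
                  how="left", indicator=True)

n_without_identity = int((merged["_merge"] == "left_only").sum())
print(merged)
print("transactions without identity:", n_without_identity)
```

This is one reason why the identity columns are expected to contain many missing values after the merge.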

In [None]:
del train_id, train_trans, test_id, test_trans

 - Memory reduction

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its min and max."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer columns: try int8, int16, int32, then int64 (bounds are inclusive).
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Float columns: float16 saves the most memory but rounds the values
                # (about 3 significant decimal digits).
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
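
One caveat with the downcasting above: a value can fit in the float16 range and still be rounded, because float16 only has an 11-bit significand. The sketch below shows the effect on a few made-up amounts; it is an illustration, not a measurement on the competition data.

```python
import numpy as np

# float16 has an 11-bit significand, so not every integer above 2048 is representable.
as_half = np.float16(2049.0)
print(float(as_half))  # nearest representable value

amounts = np.array([117.95, 2049.0, 31937.39], dtype=np.float64)
amounts_half = amounts.astype(np.float16)
abs_error = np.abs(amounts - amounts_half.astype(np.float64))

print(amounts.nbytes, "bytes as float64 ->", amounts_half.nbytes, "bytes as float16")
print("absolute rounding error:", abs_error)
```

If exact amounts matter for a feature, keeping that column in float32 may be a safer trade-off.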

# Data visualization

We start by exploring the data with plots and statistical tests.

**"object" columns**

In [None]:
cat_cols = list(train.select_dtypes(include=['object']).columns)
print(cat_cols)

 Categorical variables:

 - ProductCD
 - emaildomain
 - card1 - card6
 - addr1, addr2
 - P_emaildomain
 - R_emaildomain
 - M1 - M9
 - DeviceType
 - DeviceInfo
 - id_12 - id_38

The remaining variables are numeric.

## Target: isFraud

In [None]:
train.groupby('isFraud') \
    .count()['TransactionID'] \
    .plot(kind='barh',
          title='Distribution of Target in Train',
          figsize=(15, 3))
plt.show()

It is clear that most transactions are not fraudulent. If we used this dataset as is for our predictive models and analyses, we could get many errors and our algorithms would probably overfit, because they would "assume" that most transactions are not fraud. We do not want the model to assume; we want it to detect the patterns that signal fraud!
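
To make the imbalance problem concrete, here is a small sketch on synthetic labels (35 frauds out of 1000 rows, an invented sample close to the rate observed in train). A "model" that always answers "not fraud" looks excellent on accuracy, while its ROC AUC shows that it has no ability to rank transactions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 35 frauds out of 1000 rows.
y_true = np.zeros(1000, dtype=int)
y_true[:35] = 1

# A "model" that always answers "not fraud" with the same score.
always_legit_labels = np.zeros(1000, dtype=int)
always_legit_scores = np.zeros(1000)

accuracy = accuracy_score(y_true, always_legit_labels)
auc = roc_auc_score(y_true, always_legit_scores)
print("accuracy:", accuracy)
print("ROC AUC :", auc)
```

This is why the rest of the notebook reports AUC rather than accuracy.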

## Transaction Amt

This variable is the transaction amount.

In [None]:
plt.style.use('ggplot')
color_pal = [x['color'] for x in plt.rcParams['axes.prop_cycle']]

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 6))
train.loc[train['isFraud'] == 1] \
    ['TransactionAmt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Transaction Amt - Fraud',
          color=color_pal[1],
          xlim=(-3, 10),
         ax= ax1)
train.loc[train['isFraud'] == 0] \
    ['TransactionAmt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Transaction Amt - Not Fraud',
          color=color_pal[2],
          xlim=(-3, 10),
         ax=ax2)
train.loc[train['isFraud'] == 1] \
    ['TransactionAmt'] \
    .plot(kind='hist',
          bins=100,
          title='Transaction Amt - Fraud',
          color=color_pal[1],
         ax= ax3)
train.loc[train['isFraud'] == 0] \
    ['TransactionAmt'] \
    .plot(kind='hist',
          bins=100,
          title='Transaction Amt - Not Fraud',
          color=color_pal[2],
         ax=ax4)
plt.show()


In [None]:
print('Mean transaction amt for fraud is {:.4f}'.format(train.loc[train['isFraud'] == 1]['TransactionAmt'].mean()))
print('Mean transaction amt for non-fraud is {:.4f}'.format(train.loc[train['isFraud'] == 0]['TransactionAmt'].mean()))

In [None]:
from scipy import stats
print(stats.ttest_ind(train.loc[train['isFraud'] == 1] \
    ['TransactionAmt'] ,train.loc[train['isFraud'] == 0] \
    ['TransactionAmt'] ,equal_var=False))

Running a Student's t-test (Welch's version, since `equal_var=False`), we see a significant difference between the two means.
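
For readers who want to see what `stats.ttest_ind(..., equal_var=False)` computes, the sketch below writes Welch's statistic out by hand and compares it with SciPy on synthetic, skewed data. The two samples are randomly generated and only stand in for the fraud and non-fraud amounts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, skewed "amounts" for two groups of very different sizes.
fraud = rng.lognormal(mean=4.4, sigma=1.1, size=300)
legit = rng.lognormal(mean=4.2, sigma=0.9, size=5000)

result = stats.ttest_ind(fraud, legit, equal_var=False)

# Welch's statistic: difference of means over the unpooled standard error.
se = np.sqrt(fraud.var(ddof=1) / fraud.size + legit.var(ddof=1) / legit.size)
t_manual = (fraud.mean() - legit.mean()) / se

print("scipy  t =", result.statistic, "p =", result.pvalue)
print("manual t =", t_manual)
```

With samples this large, even a small difference in means can come out as significant, so the effect size is worth checking alongside the p-value.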

## ProductCD

The product code for each transaction.

In [None]:
train.groupby('ProductCD') \
    ['TransactionID'].count() \
    .sort_index() \
    .plot(kind='barh',
          figsize=(15, 3),
         title='Count of Observations by ProductCD')
plt.show()

In [None]:
train.groupby('ProductCD')['isFraud'] \
    .mean() \
    .sort_index() \
    .plot(kind='barh',
          figsize=(15, 3),
         title='Percentage of Fraud by ProductCD')
plt.show()

We observe that:
 - W has the most observations, C the fewest.
 - C has the highest fraud rate, above 11%.
 - W has the lowest, at about 2%.
## card1 - card6

Information about the payment cards.

In [None]:
card_cols = [c for c in train.columns if 'card' in c]
train[card_cols].head()

In [None]:
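color_idx = 0
for c in card_cols:
    # After reduce_mem_usage the dtypes are no longer float64/int64,
    # so test for "numeric" rather than for specific dtypes.
    if pd.api.types.is_numeric_dtype(train[c]):
        train[c].plot(kind='hist',
                      title=c,
                      bins=50,
                      figsize=(15, 2),
                      color=color_pal[color_idx])
        color_idx += 1
        plt.show()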

In [None]:
train_fr = train.loc[train['isFraud'] == 1]
train_nofr = train.loc[train['isFraud'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_fr.groupby('card4')['card4'].count().plot(kind='barh', ax=ax1, title='Count of card4 fraud')
train_nofr.groupby('card4')['card4'].count().plot(kind='barh', ax=ax2, title='Count of card4 non-fraud')
train_fr.groupby('card6')['card6'].count().plot(kind='barh', ax=ax3, title='Count of card6 fraud')
train_nofr.groupby('card6')['card6'].count().plot(kind='barh', ax=ax4, title='Count of card6 non-fraud')
plt.show()

## DeviceType

In [None]:
train.groupby('DeviceType')['isFraud'] \
    .mean() \
    .sort_values() \
    .plot(kind='barh',
          figsize=(15, 5),
          title='Percentage of Fraud by Device Type')
plt.show()

## DeviceInfo

In [None]:
train.groupby('DeviceInfo') \
    .count()['TransactionID'] \
    .sort_values(ascending=False) \
    .head(20) \
    .plot(kind='barh', figsize=(15, 5), title='Top 20 Devices in Train')
plt.show()

## TransactionDT

In [None]:
plt.hist(train['TransactionDT'], label='train');
plt.hist(test['TransactionDT'], label='test');
plt.legend();
plt.title('Transaction dates');

The plot above shows that the train and test transaction dates do not overlap.
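
Since test comes after train in time, one option (not used in this notebook) is to validate the same way: hold out the most recent part of train instead of a random part. A minimal sketch on made-up data; `time_based_split` is a hypothetical helper written for this example.

```python
import pandas as pd

def time_based_split(df, time_col="TransactionDT", valid_frac=0.2):
    """Hold out the most recent `valid_frac` of rows, ordered by `time_col`."""
    ordered = df.sort_values(time_col, kind="stable")
    n_valid = int(round(len(ordered) * valid_frac))
    cut = len(ordered) - n_valid
    return ordered.iloc[:cut], ordered.iloc[cut:]

toy = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300, 900, 700, 800, 600, 1000],
    "isFraud":       [0,   0,   1,   0,   0,   0,   1,   0,   0,   0],
})
past, recent = time_based_split(toy, valid_frac=0.2)
print(past["TransactionDT"].max(), "<", recent["TransactionDT"].min())
```

A score measured this way is usually closer to what a model would get on later, unseen transactions than a score from a random split.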

# Missing values

**Cleaning up NaN values**

**Train**

In [None]:
missing_values_count = train.isnull().sum()
print(missing_values_count[0:10])
total_cells = np.prod(train.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
print("% of missing data = ", (total_missing / total_cells) * 100)

About 45% of the cells in train are missing values, so let's clean this up!

In [None]:
def get_too_many_null_attr(data):
    many_null_cols = [col for col in data.columns if data[col].isnull().sum() / data.shape[0] > 0.9]
    return many_null_cols

In [None]:
data_null = get_too_many_null_attr(train)

In [None]:
def get_too_many_repeated_val(data):
    # Columns where a single value (NaN included) accounts for more than 90% of the rows.
    big_top_value_cols = [col for col in data.columns if data[col].value_counts(dropna=False, normalize=True).values[0] > 0.9]
    return big_top_value_cols

In [None]:
get_too_many_repeated_val(train)

In [None]:
train['id_03'].value_counts(dropna=False, normalize=True).head()

We can see that 88% of the values are NaN and 10% are zeros. In other words, about 98% of this column is either missing or zero, so it carries almost no information!
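
The two filters used in this section (too many NaN, one value too dominant) can be tried on a small made-up frame. This is a vectorized sketch with the same 90% thresholds as above; the column names are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_nan":  [np.nan] * 19 + [1.0],        # 95% missing
    "mostly_zero": [0.0] * 19 + [3.0],           # one value covers 95% of rows
    "useful":      np.arange(20, dtype=float),   # no missing, no dominant value
})

# Share of missing values per column.
null_share = toy.isnull().mean()
# Share of rows taken by the most frequent value (NaN counted as a value).
top_share = toy.apply(lambda s: s.value_counts(dropna=False, normalize=True).iloc[0])

to_drop = sorted(set(null_share[null_share > 0.9].index)
                 | set(top_share[top_share > 0.9].index))
print(to_drop)
```

Note that a mostly-NaN column is caught by both filters, since NaN is then also its most frequent value.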

## String columns

## One-hot encoding and label encoding

To use the variables that contain strings, we apply one-hot encoding.

In [None]:
def onehot(col, col_name):
    # Fill missing values with the most frequent category, then one-hot encode.
    data = np.array(col.fillna(col.mode()[0]))
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(data)
    onehot_encoder = OneHotEncoder()
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # .toarray() returns a dense matrix and avoids the `sparse` argument,
    # whose name differs between scikit-learn versions.
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
    name = [col_name + str(i) for i in range(onehot_encoded.shape[1])]
    onehot_encoded = pd.DataFrame(onehot_encoded, columns=name)
    return onehot_encoded
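
For comparison, here is a hedged sketch of the two encodings on a toy column. The label-encoding helper used later in this notebook is not shown in the cells, so `label_encode` below is a hypothetical stand-in and not the original function. `pd.get_dummies` is used as a shorter route to the same kind of one-hot output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode(col, col_name):
    """Hypothetical label-encoding helper: one integer code per category."""
    filled = col.fillna(col.mode()[0])  # fill NaN with the most frequent category
    codes = LabelEncoder().fit_transform(filled)
    return pd.DataFrame({col_name: codes})

card4_toy = pd.Series(["visa", None, "mastercard", "visa", "discover"])

labels = label_encode(card4_toy, "card4")
dummies = pd.get_dummies(card4_toy.fillna(card4_toy.mode()[0]), prefix="card4")

print(labels["card4"].tolist())
print(dummies.columns.tolist())
```

Label encoding yields a single column but imposes an arbitrary order on the categories, which a linear model will treat as meaningful; one-hot encoding avoids that at the cost of one column per category.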

In [None]:
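# col_num is not defined in the cells shown; assumed here to be the numeric columns of train.
col_num = list(train.select_dtypes(include=[np.number]).columns)
X_non_num = train[train.columns[~train.columns.isin(col_num)]]
X_non_num.head(5)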

In [None]:
card4 = onehot(train["card4"],"card4")
email = onehot(train["P_emaildomain"],"P_emaildomain")

In [None]:
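# labelencod is not defined in the cells shown; it is assumed to be the label-encoding counterpart of onehot.
email_label = labelencod(train["P_emaildomain"], "P_emaildomain")
card4_label = labelencod(train["card4"], "card4")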

In [None]:
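# Assumption: X (not defined in the cells shown) is the median-imputed numeric features without the target.
X = train[col_num].drop(columns="isFraud").pipe(lambda d: d.fillna(d.median()))
X_card_mail = X.join(card4).join(email)
X_card_mail.shape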

In [None]:
X_card_mail_label = X.join(card4_label).join(email_label)
X_card_mail_label.shape

# Predicting the probability of fraud

## Logistic regression

In [None]:
Y = train["isFraud"] 
X = X.loc[:, X.columns != "isFraud"]

#### Split the data into train (67%) and test (33%)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
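
With a target this imbalanced, a plain random split can leave slightly different fraud rates on each side. A possible refinement, shown here on synthetic data only, is the `stratify` argument of `train_test_split`, which keeps the class proportions aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:35] = 1  # 3.5% positives, an invented rate close to the one in train

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)
print("fraud rate train:", y_tr.mean())
print("fraud rate test :", y_te.mean())
```

On the notebook's data this would amount to adding `stratify=Y` to the call above.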

In [None]:
def temps(second):
    """Print a duration given in seconds as hh:mm:ss."""
    m, s = divmod(second, 60)
    h, m = divmod(m, 60)
    print("time:",'{:02.0f}:{:02.0f}:{:02.0f}'.format(h, m, s))

### Logistic regression on all numeric variables

In [None]:
tstart = time.time()
log = LogisticRegression(random_state=0).fit(X_train, y_train)
pred_train = log.predict_proba(X_train)
print("score auc train :",roc_auc_score(y_train, pred_train[:, 1]))
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
pred = log.predict_proba(X_test)
print("score auc test :",roc_auc_score(y_test, pred[:, 1]))
tend = time.time()
temps(tend-tstart)

## Logistic regression with cross-validation

### Cross-validation on all numeric data

In [None]:
tstart = time.time()
clf = LogisticRegression(random_state=0)
scores = cross_val_score(clf, X, Y, cv=5,scoring='roc_auc')
print("score auc :",scores)
tend = time.time()
temps(tend-tstart)

### Cross-validation on the numeric data plus the one-hot encoded "card4" and "P_emaildomain" variables

In [None]:
tstart = time.time()
clf = LogisticRegression(random_state=0)
scores_card4_email = cross_val_score(clf, X_card_mail, Y, cv=5,scoring='roc_auc')
print("score auc :",scores_card4_email)
tend = time.time()
temps(tend-tstart)

This gives an AUC of 0.7218, the highest obtained so far.

In [None]:
tstart = time.time()
clf = LogisticRegression(random_state=0)
scores_card4_email_label = cross_val_score(clf, X_card_mail_label, Y, cv=5,scoring='roc_auc')
print("score auc :",scores_card4_email_label)
tend = time.time()
temps(tend-tstart)

## Working on a smaller sample

In [None]:
len(train[train["isFraud"]==1])/len(train)

Only 3.49% of the rows are fraudulent.

In [None]:
60000*0.0349

In [None]:
60000-2094

In [None]:
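# DataFrame.sample draws from NumPy's generator, so random.seed() has no effect on it;
# random_state makes the draw reproducible.
train_sample0 = train[train["isFraud"]==0].sample(n=57906, random_state=2)
train_sample1 = train[train["isFraud"]==1].sample(n=2094, random_state=2)
train_sample = pd.concat([train_sample0, train_sample1])  # DataFrame.append was removed in pandas 2.0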
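
A subsample that keeps the original fraud rate can also be drawn in one step. The sketch below uses `DataFrameGroupBy.sample` (available from pandas 1.1) on a synthetic frame, with an explicit `random_state` so that the draw can be repeated; the frame and its sizes are made up.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 2000 rows, 70 of them fraudulent (3.5%).
toy = pd.DataFrame({
    "TransactionID": np.arange(2000),
    "isFraud": [1] * 70 + [0] * 1930,
})

# Sample the same fraction inside each class; random_state makes the draw repeatable.
sample_a = toy.groupby("isFraud").sample(frac=0.5, random_state=2)
sample_b = toy.groupby("isFraud").sample(frac=0.5, random_state=2)

print(len(sample_a), sample_a["isFraud"].mean())
print("same rows on both draws:", sample_a.index.equals(sample_b.index))
```

This avoids computing the per-class counts by hand, as done in the cells above.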

In [None]:
X_sample = train_sample[col_num]
Y_sample = train_sample["isFraud"] 
X_sample = X_sample.loc[:, X_sample.columns != "isFraud"]
X_sample = X_sample.fillna(X_sample.median())

In [None]:
tstart = time.time()
clf = LogisticRegression(random_state=0)
scores_sample = cross_val_score(clf, X_sample, Y_sample, cv=5,scoring='roc_auc')
print("score auc :",scores_sample)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', random_state=0)
scores_sample = cross_val_score(clf, X_sample, Y_sample, cv=5,scoring='roc_auc')
print("score auc :",scores_sample)
tend = time.time()
temps(tend-tstart)

The AUCs obtained with an L1 penalty are much better than the previous ones. Keep in mind, however, that we are only working on 60,000 rows. On the other hand, the run time is much longer with this method.
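
One property of the L1 penalty that may help explain this behaviour is that it drives many coefficients exactly to zero, which acts as a built-in variable selection. The sketch below illustrates this on synthetic data where only two of twenty features carry signal; the data and the value `C=0.02` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rows, n_features = 500, 20
X_toy = rng.normal(size=(n_rows, n_features))
# Only the first two features carry signal; the other 18 are pure noise.
logits = 2.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1]
y_toy = (logits + rng.normal(size=n_rows) > 0).astype(int)

l1_model = LogisticRegression(solver="liblinear", penalty="l1", C=0.02, random_state=0)
l2_model = LogisticRegression(solver="liblinear", penalty="l2", C=0.02, random_state=0)
l1_model.fit(X_toy, y_toy)
l2_model.fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("coefficients set to zero, L1:", n_zero_l1, " L2:", n_zero_l2)
```

The L2 penalty shrinks coefficients but leaves them non-zero, so it does not discard noisy columns in the same way.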

In [None]:
20000*0.0349

In [None]:
20000-698

In [None]:
# random_state (not random.seed) controls DataFrame.sample.
train_sub0 = train[train["isFraud"]==0].sample(n=19302, random_state=2)
train_sub1 = train[train["isFraud"]==1].sample(n=698, random_state=2)
train_sub = pd.concat([train_sub0, train_sub1])
X_sub = train_sub[col_num]
Y_sub = train_sub["isFraud"] 
X_sub = X_sub.loc[:, X_sub.columns != "isFraud"]
X_sub = X_sub.fillna(X_sub.median())

In [None]:
tstart = time.time()
clf = LogisticRegression(random_state=0)
scores_sub = cross_val_score(clf, X_sub, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', random_state=0)
scores_sub = cross_val_score(clf, X_sub, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub)
tend = time.time()
temps(tend-tstart)

TODO: add C = 1/alpha and test.

In [None]:
col_non_num = X_non_num.columns

In [None]:
X_sub_ohe = X_sub.reset_index(drop=True)
for i in col_non_num:
    X_sub_ohe = X_sub_ohe.join(onehot(train_sub[i],i))

In [None]:
print(X_sub_ohe.shape)

In [None]:
X_sub_ohe.head(5)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', random_state=0)
scores_sub_ohe = cross_val_score(clf, X_sub_ohe, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_ohe)
tend = time.time()
temps(tend-tstart)

With one-hot encoding the computation time is lower than with label encoding, and the AUC is better as well.

## Cross-validation for the inverse of regularization strength (C)

In [None]:
X_sub_ohe = X_sub.reset_index(drop=True)
for i in col_non_num:
    X_sub_ohe = X_sub_ohe.join(onehot(train_sub[i],i))

### Standardized data:

In [None]:
scaler = StandardScaler()
X_sub_sc = scaler.fit_transform(X_sub_ohe)

In [None]:
X_sub_train_sc, X_sub_test_sc, y_sub_train_sc, y_sub_test_sc = train_test_split(X_sub_sc, Y_sub, test_size=0.33, 
                                                                                random_state=42)

In [None]:
pred = []
tstart = time.time()
t = np.arange(0.1,1.1,0.1)
for k, i in enumerate(t):
    if k % 2 == 0: print(i)  # progress: print every other value of C
    log = LogisticRegression(solver ='liblinear', penalty = 'l1', C=i, random_state=0).fit(X_sub_train_sc, 
                                                                                           y_sub_train_sc)
    pred_test = log.predict_proba(X_sub_test_sc)  # probabilities on the held-out part
    pred.append(roc_auc_score(y_sub_test_sc, pred_test[:, 1]))
tend = time.time()
temps(tend-tstart)

plt.plot(t,pred)

In [None]:
print("Pour C = ",t[np.argmax(pred)], " ,auc = ",np.max(pred))

In [None]:
pred = []
tstart = time.time()
t = np.arange(0.01,0.11,0.01)
for k, i in enumerate(t):
    if k % 2 == 0: print(i)  # progress: print every other value of C
    log = LogisticRegression(solver ='liblinear', penalty = 'l1', C=i, random_state=0).fit(X_sub_train_sc, 
                                                                                           y_sub_train_sc)
    pred_test = log.predict_proba(X_sub_test_sc)  # probabilities on the held-out part
    pred.append(roc_auc_score(y_sub_test_sc, pred_test[:, 1]))
tend = time.time()
temps(tend-tstart)

plt.plot(t,pred)

In [None]:
print("Pour C = ",t[np.argmax(pred)], " ,auc = ",np.max(pred))

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', C=0.03, penalty = 'l1', random_state=0)
scores_sub_sc = cross_val_score(clf, X_sub_sc, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_sc)
tend = time.time()
temps(tend-tstart)
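
In the cells above the scaler is fitted on the whole subsample before any split, and C is chosen on a single train/test split. A possible alternative, sketched here on synthetic data, is to put the scaler and the model in a `Pipeline` and let `GridSearchCV` pick C by cross-validated AUC. The grid values and the toy data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(600, 10))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=2.0, size=600) > 0).astype(int)

# The scaler is refit on the training part of every fold, so the held-out
# fold never influences the means and standard deviations.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", penalty="l1", random_state=0)),
])
grid = {"logreg__C": [0.01, 0.03, 0.1, 0.3, 1.0]}

search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5)
search.fit(X_toy, y_toy)

print("best C:", search.best_params_["logreg__C"])
print("mean CV AUC:", search.best_score_)
```

Choosing C by cross-validation rather than on one split makes the choice less dependent on which rows happened to land in the test part.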

### Non-standardized data

In [None]:
X_sub_train, X_sub_test, y_sub_train, y_sub_test = train_test_split(X_sub, Y_sub, test_size=0.33, random_state=42)

In [None]:
pred = []
tstart = time.time()
t = np.arange(0.1,1.1,0.1)
for k, i in enumerate(t):
    if k % 2 == 0: print(i)  # progress: print every other value of C
    log = LogisticRegression(solver ='liblinear', penalty = 'l1', C=i, random_state=0).fit(X_sub_train, y_sub_train)
    pred_test = log.predict_proba(X_sub_test)  # probabilities on the held-out part
    pred.append(roc_auc_score(y_sub_test, pred_test[:, 1]))
tend = time.time()
temps(tend-tstart)

plt.plot(t,pred)

In [None]:
pred = []
tstart = time.time()
t = np.arange(0.01,0.11,0.01)
for k, i in enumerate(t):
    if k % 2 == 0: print(i)  # progress: print every other value of C
    log = LogisticRegression(solver ='liblinear', penalty = 'l1', C=i, random_state=0).fit(X_sub_train, 
                                                                                           y_sub_train)
    pred_test = log.predict_proba(X_sub_test)  # probabilities on the held-out part
    pred.append(roc_auc_score(y_sub_test, pred_test[:, 1]))
tend = time.time()
temps(tend-tstart)

plt.plot(t,pred)

In [None]:
print("Pour C = ",t[np.argmax(pred)], " ,auc = ",np.max(pred))

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', C=0.04, random_state=0)
scores_sub_ohe = cross_val_score(clf, X_sub_ohe, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_ohe)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', C=1/11, random_state=0)
scores_sub_ohe = cross_val_score(clf, X_sub_ohe, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_ohe)
tend = time.time()
temps(tend-tstart)

In [None]:
pred = []
tstart = time.time()
t = np.arange(0,1.1,0.1)
for k, i in enumerate(t):
    if k % 2 == 0: print(i)  # progress: print every other value of l1_ratio
    log = LogisticRegression(solver ='saga', penalty = 'elasticnet', random_state=0, l1_ratio=i).fit(X_sub_train, y_sub_train)
    pred_test = log.predict_proba(X_sub_test)  # probabilities on the held-out part
    pred.append(roc_auc_score(y_sub_test, pred_test[:, 1]))
tend = time.time()
temps(tend-tstart)

In [None]:
plt.plot(t,pred)

To try: shuffle the data (all non-fraud rows are at the top of the DataFrame and all fraud rows at the bottom).
For example: df = df.sample(frac=1).reset_index(drop=True)
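
As a side note on this idea: when `cross_val_score` receives an integer `cv` and a classifier, scikit-learn already uses stratified folds, so every fold contains frauds even if the rows are sorted by class. Shuffling can still be made explicit by passing a `StratifiedKFold` with `shuffle=True`, as in this sketch on synthetic, class-ordered data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
# Rows deliberately ordered by class: all non-fraud first, fraud last.
y_toy = np.array([0] * 450 + [1] * 50)
X_toy = rng.normal(size=(500, 4))
X_toy[:, 0] += 1.5 * y_toy  # one informative feature

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positives = [int(y_toy[test_idx].sum()) for _, test_idx in cv.split(X_toy, y_toy)]

clf = LogisticRegression(solver="liblinear", random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="roc_auc")

print("frauds per test fold:", fold_positives)
print("AUC per fold:", scores, "mean:", scores.mean())
```

With this splitter the rows are shuffled inside each class before the folds are built, so there is no need to reorder the DataFrame itself.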

In [None]:
X_shuffle = X_sub_ohe.join(Y_sub.reset_index(drop=True))
X_shuffle = X_shuffle.sample(frac=1).reset_index(drop=True)

In [None]:
Y_sh = X_shuffle["isFraud"]
X_sh = X_shuffle.loc[:, X_shuffle.columns != "isFraud"]

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', C=0.04, random_state=0)
scores_sh = cross_val_score(clf, X_sh, Y_sh, cv=5,scoring='roc_auc')
print("score auc :",scores_sh)
tend = time.time()
temps(tend-tstart)

# Variable selection with Lasso

### Variables selected by Lasso for $\lambda = 1$

In [None]:
clf = linear_model.Lasso(alpha=1)
clf.fit(X_sub_ohe,Y_sub)

In [None]:
coef = clf.coef_
col_ohe = X_sub_ohe.columns
var_ohe = col_ohe[coef!=0]
X_lasso = X_sub_ohe[var_ohe]

In [None]:
print(len(var_ohe))

We can see that most of the coefficients have been set to zero.

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=10,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)

### Variables selected by Lasso for $\lambda = 0.1$

In [None]:
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X_sub_ohe,Y_sub)

In [None]:
coef = clf.coef_
col_ohe = X_sub_ohe.columns

In [None]:
var_ohe = col_ohe[coef!=0]
print(len(var_ohe))

With this lambda, the number of non-zero coefficients has increased.

In [None]:
X_lasso = X_sub_ohe[var_ohe]

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', C=0.04, penalty = 'l1', random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', C=0.04, penalty = 'l1', random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=10,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)

### Variables selected by Lasso for $\lambda = 0.01$

In [None]:
clf = linear_model.Lasso(alpha=0.01)
clf.fit(X_sub_ohe,Y_sub)

In [None]:
coef = clf.coef_
col_ohe = X_sub_ohe.columns
var_ohe = col_ohe[coef!=0]
X_lasso = X_sub_ohe[var_ohe]

In [None]:
print(len(var_ohe))

The number of non-zero coefficients has increased again.
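
The trend seen across the three values of lambda (`alpha` in scikit-learn's `Lasso`) can be reproduced on synthetic data. In the sketch below the features are standardized and the 0/1 target depends on two columns only; both choices are assumptions made for the example, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(2000, 30))
# Binary target driven by the first two columns only (synthetic data).
y_toy = (X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(size=2000) > 0).astype(float)
X_toy = StandardScaler().fit_transform(X_raw)

# Number of non-zero coefficients for each regularization strength.
n_selected = {}
for alpha in [1.0, 0.1, 0.01]:
    model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    n_selected[alpha] = int(np.sum(model.coef_ != 0))

print(n_selected)
```

With standardized features and a 0/1 target, `alpha=1` is strong enough to zero out every coefficient, and smaller values let progressively more variables in, including some noise columns at the lowest setting.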

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', C=0.04, random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=5,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)

In [None]:
tstart = time.time()
clf = LogisticRegression(solver ='liblinear', penalty = 'l1', C=0.04, random_state=0)
scores_sub_lasso = cross_val_score(clf, X_lasso, Y_sub, cv=10,scoring='roc_auc')
print("score auc :",scores_sub_lasso)
tend = time.time()
temps(tend-tstart)