# Fraud Detection in Electricity and Gas Consumption Challenge



**Enabling and testing the GPU**

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:
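
A minimal sketch of such a check, assuming TensorFlow 2.x is available (as it typically is on Colab). The variable name `gpus` is only illustrative, and the check falls back to an empty list when TensorFlow is not installed:

```python
# Assumption: TensorFlow 2.x. Without it, the check is skipped.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
except ImportError:
    gpus = []

if gpus:
    print("GPU(s) found:", [gpu.name for gpu in gpus])
else:
    print("No GPU found; check the runtime's hardware accelerator setting.")
```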

# **Loading the data**

If you already have the data uploaded in Colab, just skip to "Let's begin".

**Here we bring the training and test data into the Colab environment.**


First off, I downloaded the files from Zindi and uploaded them to my Google Drive. Why? Google Colab can access Drive, so I don't have to upload the data to Colab every time I start working.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


Just click the link that shows up and allow access.

You will then get a code. Copy it and paste it into the input box in the output of the cell above.

Now I'm going to copy those files from my Drive to my environment. Feel free to change "drive/My Drive/train.zip" to match your own Drive; you can browse the files panel on the left to see where your train.zip and test.zip are.

In [2]:
!cp "drive/My Drive/train.zip" train.zip
!cp "drive/My Drive/test.zip" test.zip

Then just unzip them

In [3]:
%%capture
!unzip train.zip
!unzip test.zip

Notice that I didn't load the sample submission this way. That's because it's a small file, so I just upload it directly from my computer.

# **Let's begin**

In [4]:
import numpy as np
import pandas as pd
import datetime
import gc
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
np.random.seed(4590)



In [107]:
train_client=pd.read_csv('/content/client_train.csv')
test_client=pd.read_csv('/content/client_test.csv')
train_invoice=pd.read_csv('/content/invoice_train.csv')
test_invoice=pd.read_csv('/content/invoice_test.csv')

#change this according to the name of your SampleSubmission file
sub=pd.read_csv('/content/sample.csv')

In [108]:
train_client.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target
0,60,train_Client_0,11,101,31/12/1994,0.0
1,69,train_Client_1,11,107,29/05/2002,0.0
2,62,train_Client_10,11,301,13/03/1986,0.0
3,69,train_Client_100,11,105,11/07/1996,0.0
4,62,train_Client_1000,11,303,14/10/2014,0.0


In [109]:
train_invoice.head(10)

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,0,0,14302,14384,4,ELEC
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,0,0,12294,13678,4,ELEC
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,0,0,14624,14747,4,ELEC
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,0,0,14747,14849,4,ELEC
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,0,0,15066,15638,12,ELEC
5,train_Client_0,2017-07-17,11,1335667,0,207,9,1,314,0,0,0,15638,15952,8,ELEC
6,train_Client_0,2018-12-07,11,1335667,0,207,9,1,541,0,0,0,15952,16493,12,ELEC
7,train_Client_0,2019-03-19,11,1335667,0,207,9,1,585,0,0,0,16493,17078,8,ELEC
8,train_Client_0,2011-07-22,11,1335667,0,203,9,1,1200,186,0,0,7770,9156,4,ELEC
9,train_Client_0,2011-11-22,11,1335667,0,203,6,1,1082,0,0,0,9156,10238,4,ELEC


In [110]:
#Label encoding the counter_type
d={"ELEC":0,"GAZ":1}
train_invoice['counter_type']=train_invoice['counter_type'].map(d)

In [111]:
train_client['client_catg'] = train_client['client_catg'].astype('object')
train_client['disrict'] = train_client['disrict'].astype('object')

test_client['client_catg'] = test_client['client_catg'].astype('object')
test_client['disrict'] = test_client['disrict'].astype('object')

In [112]:
train_invoice['counter_type'].value_counts()

0    3079406
1    1397343
Name: counter_type, dtype: int64

In [113]:
#changing the invoice date to datetime
for df in [train_invoice,test_invoice]:
    df['invoice_date'] = pd.to_datetime(df['invoice_date'])

Here we group the invoice data by client_id and take the mean of each of the four consommation_level columns for every client.

In [114]:
aggs = {}
aggs['consommation_level_1'] = ['mean']
aggs['consommation_level_2'] = ['mean']
aggs['consommation_level_3'] = ['mean']
aggs['consommation_level_4'] = ['mean']

In [115]:
# Mean of each consumption level per client
agg_trans = train_invoice.groupby(['client_id']).agg(aggs)
# Flatten the two-level column names, e.g. ('consommation_level_1', 'mean') -> 'consommation_level_1_mean'
agg_trans.columns = ['_'.join(col).strip() for col in agg_trans.columns.values]
agg_trans.reset_index(inplace=True)

# Number of invoices per client
df = (train_invoice.groupby('client_id')
      .size()
      .reset_index(name='1transactions_count'))

agg_trans = pd.merge(df, agg_trans, on='client_id', how='left')

In [116]:
agg_trans.head()

Unnamed: 0,client_id,1transactions_count,consommation_level_1_mean,consommation_level_2_mean,consommation_level_3_mean,consommation_level_4_mean
0,train_Client_0,35,352.4,10.571429,0.0,0.0
1,train_Client_1,37,557.540541,0.0,0.0,0.0
2,train_Client_10,18,798.611111,37.888889,0.0,0.0
3,train_Client_100,20,1.2,0.0,0.0,0.0
4,train_Client_1000,14,663.714286,104.857143,117.357143,36.714286


Now, we just merge this with our train data

In [117]:
train = pd.merge(train_client,agg_trans, on='client_id', how='left')

In [118]:
test_client.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date
0,62,test_Client_0,11,307,28/05/2002
1,69,test_Client_1,11,103,06/08/2009
2,62,test_Client_10,11,310,07/04/2004
3,60,test_Client_100,11,101,08/10/1992
4,62,test_Client_1000,11,301,21/07/1977


In [119]:
test_invoice.head()

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC


Now we do the same thing for the test data

In [120]:
d={"ELEC":0,"GAZ":1}
test_invoice['counter_type']=test_invoice['counter_type'].map(d)

In [121]:
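# Mean of each consumption level per client
agg_trans = test_invoice.groupby(['client_id']).agg(aggs)
# Flatten the two-level column names, e.g. ('consommation_level_1', 'mean') -> 'consommation_level_1_mean'
agg_trans.columns = ['_'.join(col).strip() for col in agg_trans.columns.values]
agg_trans.reset_index(inplace=True)

# Number of invoices per client
df = (test_invoice.groupby('client_id')
      .size()
      .reset_index(name='1transactions_count'))

agg_trans = pd.merge(df, agg_trans, on='client_id', how='left')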
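
Since the same aggregation runs once for train and once for test, it could be wrapped in a function. The sketch below is one possible way to do that; the function name `aggregate_invoices` and the toy rows are made up for illustration, while the output column names match the ones used in this notebook:

```python
import pandas as pd

def aggregate_invoices(invoice, count_name="1transactions_count"):
    """Per-client invoice count plus the mean of the four consumption levels."""
    level_cols = [f"consommation_level_{i}" for i in range(1, 5)]
    counts = invoice.groupby("client_id").size().reset_index(name=count_name)
    means = (invoice.groupby("client_id")[level_cols]
             .mean()
             .add_suffix("_mean")
             .reset_index())
    return counts.merge(means, on="client_id", how="left")

# Toy rows (not real data) to show the shape of the result
toy = pd.DataFrame({
    "client_id": ["a", "a", "b"],
    "consommation_level_1": [100, 300, 50],
    "consommation_level_2": [0, 10, 0],
    "consommation_level_3": [0, 0, 0],
    "consommation_level_4": [0, 0, 0],
})
toy_agg = aggregate_invoices(toy)
print(toy_agg)
```

With a helper like this, the train and test steps would each reduce to one call followed by the merge with the client table.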

In [122]:
test = pd.merge(test_client,agg_trans, on='client_id', how='left')

In [123]:
train.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,1transactions_count,consommation_level_1_mean,consommation_level_2_mean,consommation_level_3_mean,consommation_level_4_mean
0,60,train_Client_0,11,101,31/12/1994,0.0,35,352.4,10.571429,0.0,0.0
1,69,train_Client_1,11,107,29/05/2002,0.0,37,557.540541,0.0,0.0,0.0
2,62,train_Client_10,11,301,13/03/1986,0.0,18,798.611111,37.888889,0.0,0.0
3,69,train_Client_100,11,105,11/07/1996,0.0,20,1.2,0.0,0.0,0.0
4,62,train_Client_1000,11,303,14/10/2014,0.0,14,663.714286,104.857143,117.357143,36.714286


In [124]:
test.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,1transactions_count,consommation_level_1_mean,consommation_level_2_mean,consommation_level_3_mean,consommation_level_4_mean
0,62,test_Client_0,11,307,28/05/2002,37,488.135135,3.243243,0.0,0.0
1,69,test_Client_1,11,103,06/08/2009,22,1091.409091,843.136364,182.318182,586.318182
2,62,test_Client_10,11,310,07/04/2004,74,554.040541,37.364865,15.743243,0.162162
3,60,test_Client_100,11,101,08/10/1992,40,244.35,0.0,0.0,0.0
4,62,test_Client_1000,11,301,21/07/1977,53,568.188679,145.056604,33.679245,0.0


In [125]:
train.shape,test.shape

((135493, 11), (58069, 10))

In [126]:
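# creation_date is stored day-first (dd/mm/yyyy), so tell pandas to read the day first
for df in [train,test]:
    df['creation_date'] = pd.to_datetime(df['creation_date'], dayfirst=True)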
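
The creation_date values in the previews above are written day-first (for example 31/12/1994), which matters whenever the day is 12 or lower. A small check on one value taken from the test_client preview shows how the `dayfirst` option of `pd.to_datetime` changes the result:

```python
import pandas as pd

sample = "06/08/2009"  # value from the test_client preview

month_first = pd.to_datetime(sample)               # default: month comes first
day_first = pd.to_datetime(sample, dayfirst=True)  # read as 6 August 2009

print(month_first.date())  # 2009-06-08
print(day_first.date())    # 2009-08-06
```

This has no effect on the model in this notebook, because creation_date is dropped before training, but it matters as soon as date features are built from that column.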

Dropping the columns we won't use as features (client_id and creation_date) from train and test

In [127]:
train.columns

Index(['disrict', 'client_id', 'client_catg', 'region', 'creation_date',
       'target', '1transactions_count', 'consommation_level_1_mean',
       'consommation_level_2_mean', 'consommation_level_3_mean',
       'consommation_level_4_mean'],
      dtype='object')

In [128]:
col_to_drop = ['client_id', 'creation_date']
for col in col_to_drop:
    if col in train.columns:
        train.drop([col], axis=1, inplace=True)
    if col in test.columns:
        test.drop([col], axis=1, inplace=True)

Label Encoding

In [129]:
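from sklearn import preprocessing
# Encode every object-typed column (here disrict and client_catg) as integer codes.
# The encoder is fitted on train and test together so both share one mapping.
for f in test.columns:
    if train[f].dtype=='object' or test[f].dtype=='object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))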
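
The encoder above is fitted on the train and test values combined. A small sketch with made-up category values shows why: a `LabelEncoder` fitted on train alone raises a `ValueError` when it meets a category it has never seen.

```python
from sklearn.preprocessing import LabelEncoder

train_vals = ["11", "12", "11"]  # made-up categories
test_vals = ["12", "51"]         # "51" never appears in train_vals

enc = LabelEncoder().fit(train_vals)
try:
    enc.transform(test_vals)
    unseen_ok = True
except ValueError:
    unseen_ok = False  # an encoder fitted on train only cannot handle "51"

# Fitting on both lists gives one shared mapping: "11"->0, "12"->1, "51"->2
enc_both = LabelEncoder().fit(train_vals + test_vals)
codes = enc_both.transform(test_vals)
print(unseen_ok, list(codes))
```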

Checking if we have any missing data

In [130]:
all_data_na = train.isnull().sum() 
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing ' :all_data_na})
missing_data.head(20)

Unnamed: 0,Missing


In [131]:
target=train['target']
train.drop('target',axis=1,inplace=True)

# Modelling (LightGBM)


In [137]:
from lightgbm import LGBMClassifier
# num_iteration is a LightGBM alias for n_estimators (the number of boosting rounds)
model = LGBMClassifier(boosting_type='gbdt',num_iteration=500, silent=True)

#Fit to training data
%time model.fit(train,target)

CPU times: user 4.6 s, sys: 46.9 ms, total: 4.64 s
Wall time: 4.66 s


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_iteration=500, num_leaves=31,
               objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
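
The fit above uses every training row, so the notebook has no local estimate of how the model scores on unseen clients. Below is a minimal sketch of stratified K-fold validation (StratifiedKFold is already imported at the top). It runs on synthetic data rather than the real train and target, uses ROC AUC purely as an example metric, and falls back to scikit-learn's GradientBoostingClassifier as a stand-in if LightGBM is not installed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

try:
    from lightgbm import LGBMClassifier as Model
except ImportError:  # stand-in with the same fit/predict_proba interface
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic, imbalanced stand-in for (train, target)
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fit_idx, val_idx in skf.split(X, y):
    clf = Model(n_estimators=50)
    clf.fit(X[fit_idx], y[fit_idx])
    # Probability of the positive class on the held-out fold
    val_pred = clf.predict_proba(X[val_idx])[:, 1]
    scores.append(roc_auc_score(y[val_idx], val_pred))

print("AUC per fold:", np.round(scores, 3))
print("Mean AUC:", np.mean(scores))
```

Stratified folds keep the share of fraud cases roughly equal across folds, which helps when the positive class is rare.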

In [138]:
preds = model.predict_proba(test)

In [139]:
preds = pd.DataFrame(preds)

In [143]:
preds.head()

Unnamed: 0,0,1
0,0.9431,0.0569
1,0.897028,0.102972
2,0.936515,0.063485
3,0.962018,0.037982
4,0.962783,0.037217
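
The two columns of preds follow the order of the classifier's `classes_` attribute, so column 0 is the probability of target 0.0 and column 1 the probability of target 1.0. A tiny sketch illustrates the convention, using scikit-learn's LogisticRegression on made-up data as a stand-in for the LightGBM model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # made-up feature
y = np.array([0.0, 0.0, 1.0, 1.0])          # made-up labels

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

print(clf.classes_)       # column order of proba
print(proba.sum(axis=1))  # each row sums to 1
```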


Making your submission

In [140]:

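# preds[1] holds the predicted probability of class 1 (the second predict_proba column).
# This relies on the sample submission listing clients in the same order as test_client.
submission = pd.DataFrame({
    "client_id": sub["client_id"],
    "target": preds[1]
})
submission.to_csv('submission.csv', index=False)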
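
The submission cell pairs the predictions with the sample file's client_id column by row position. If the two orders ever differed, a merge on client_id would be a safer way to line them up. The sketch below assumes the test ids were set aside before client_id was dropped; all names and values in it are made up:

```python
import pandas as pd

# Made-up stand-ins: ids kept aside before dropping client_id, plus model scores
test_ids = pd.Series(["test_Client_1", "test_Client_0", "test_Client_10"],
                     name="client_id")
scores = pd.Series([0.10, 0.05, 0.06], name="target")
scored = pd.concat([test_ids, scores], axis=1)

# Hypothetical sample submission listing the clients in a different order
sample_sub = pd.DataFrame(
    {"client_id": ["test_Client_0", "test_Client_1", "test_Client_10"]})

# Left-merge on client_id so every prediction lands on its own client
submission_aligned = sample_sub[["client_id"]].merge(
    scored, on="client_id", how="left")
print(submission_aligned)
```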



In [141]:
submission.head()

Unnamed: 0,client_id,target
0,test_Client_0,0.0569
1,test_Client_1,0.102972
2,test_Client_10,0.063485
3,test_Client_100,0.037982
4,test_Client_1000,0.037217


# **How can I improve this?**

Well, there are lots of ways to improve this notebook.

I would suggest:

1.   Better feature engineering:
      *   When aggregating the invoice data, you can use more columns and statistics than just the mean of the consumption levels
      *   After converting dates to datetime, you can engineer more columns for your dataset (e.g. year, month, day)
      *   Try to create new columns from one or more existing columns

2.   Creating a better model:

      *   Trying out different models (XGBoost, AdaBoost, CatBoost, random forest, decision trees, etc.)
      *   Fine-tuning your models (you can fine-tune the LightGBM model given above by changing its hyperparameters; for example, increasing the number of iterations might give you a better score)
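
The feature-engineering ideas above can be sketched as follows. The rows are made up in the invoice_train layout, and the new column names (`index_delta`, `invoice_year`, and the aggregated feature names) are illustrative choices, not part of the dataset:

```python
import pandas as pd

invoices = pd.DataFrame({  # made-up rows in the invoice_train layout
    "client_id": ["a", "a", "a", "b"],
    "invoice_date": pd.to_datetime(
        ["2014-03-24", "2015-03-23", "2016-11-17", "2018-03-16"]),
    "consommation_level_1": [82, 123, 572, 755],
    "old_index": [14302, 14624, 15066, 19145],
    "new_index": [14384, 14747, 15638, 19900],
})

# A new column built from two existing ones
invoices["index_delta"] = invoices["new_index"] - invoices["old_index"]
# A date part extracted from the datetime column
invoices["invoice_year"] = invoices["invoice_date"].dt.year

# Several statistics per client instead of only the mean
features = invoices.groupby("client_id").agg(
    level_1_mean=("consommation_level_1", "mean"),
    level_1_std=("consommation_level_1", "std"),
    level_1_max=("consommation_level_1", "max"),
    index_delta_mean=("index_delta", "mean"),
    first_invoice_year=("invoice_year", "min"),
    last_invoice_year=("invoice_year", "max"),
).reset_index()
print(features)
```

Clients with a single invoice get NaN for the standard deviation; LightGBM accepts missing values, but other models may need them filled first.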







**Thank you all! I hope this was useful**

**Have fun in the competition :)**