# Electricity and Gas Consumption Fraud Detection

This challenge was designed specifically for the AI Tunisia Hack 2019, which takes place from 20 to 22 September. Welcome to the AI Tunisia Hack participants!

After AI Hack Tunisia, this competition will be re-opened as a Knowledge Challenge to allow others in the Zindi community to learn and test their skills.

The Tunisian Company of Electricity and Gas (STEG) is a public, non-administrative company responsible for delivering electricity and gas across Tunisia. The company has suffered losses on the order of 200 million Tunisian Dinars due to fraudulent manipulation of meters by consumers.

Using the client’s billing history, the aim of the challenge is to detect and recognize clients involved in fraudulent activities.

The solution will enhance the company’s revenues and reduce the losses caused by such fraudulent activities. 

## Imports

In [1]:
import csv
import pandas as pd

**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Load Training Sets

### Client Train

In [2]:
client_train = pd.read_csv('Datasets/train/client_train.csv')
client_train.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target
0,60,train_Client_0,11,101,31/12/1994,0.0
1,69,train_Client_1,11,107,29/05/2002,0.0
2,62,train_Client_10,11,301,13/03/1986,0.0
3,69,train_Client_100,11,105,11/07/1996,0.0
4,62,train_Client_1000,11,303,14/10/2014,0.0


In [3]:
# Percent distribution of labels
client_train['target'].value_counts(dropna=False, normalize=True) * 100

0.0    94.415948
1.0     5.584052
Name: target, dtype: float64
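
With roughly 5.6% positive labels, plain accuracy is a misleading metric, which is one reason the models later in this notebook are scored with ROC AUC. A minimal sketch on hypothetical labels (1000 clients, 56 fraudulent; these numbers are illustrative, not taken from the dataset) shows a model that never predicts fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 56 fraudulent clients out of 1000 (about 5.6%)
y_true = np.array([0] * 944 + [1] * 56)

# A "model" that always predicts the majority class
y_pred = np.zeros(1000)

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred)
print(acc)  # 0.944
print(auc)  # 0.5
```

The accuracy looks high although no fraud is detected, while the ROC AUC of 0.5 correctly signals a model with no ranking ability.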

In [4]:
client_train.shape

(135493, 6)

### Invoice Train

In [5]:
invoice_train = pd.read_csv('Datasets/train/invoice_train.csv')
invoice_train.head()



Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,0,0,14302,14384,4,ELEC
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,0,0,12294,13678,4,ELEC
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,0,0,14624,14747,4,ELEC
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,0,0,14747,14849,4,ELEC
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,0,0,15066,15638,12,ELEC


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Dataset analysis

### Check for missing values

In [6]:
client_train.isnull().sum()

disrict          0
client_id        0
client_catg      0
region           0
creation_date    0
target           0
dtype: int64

In [7]:
invoice_train.isnull().sum()

client_id               0
invoice_date            0
tarif_type              0
counter_number          0
counter_statue          0
counter_code            0
reading_remarque        0
counter_coefficient     0
consommation_level_1    0
consommation_level_2    0
consommation_level_3    0
consommation_level_4    0
old_index               0
new_index               0
months_number           0
counter_type            0
dtype: int64

### Counter Statue

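**`counter_statue` has discrepancies in its values: it should contain only 1 of 5 values, yet the counts below include duplicated labels and out-of-range entries such as 46, A and 269375.**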

In [8]:
invoice_train['counter_statue'].value_counts(dropna=False)

0         4346960
1           73496
0           32048
5           20495
4            2706
1             540
3             258
5             144
2              32
4              23
46             14
A              13
618            12
769             6
420             1
269375          1
Name: counter_statue, dtype: int64
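
The labels 0, 1, 4 and 5 each appear twice in the counts above, which suggests the column mixes integers and strings. A minimal cleaning sketch follows, under the assumption that the valid statuses are the integers 0 to 5 (the source does not confirm the exact set). The toy series is hypothetical and only mimics the values seen above.

```python
import pandas as pd

# Hypothetical toy column mimicking the mixed int/str values seen above
raw = pd.Series([0, "0", 1, "1", 5, "5", "A", 46, 269375], dtype=object)

# Coerce to numbers; non-numeric entries such as "A" become NaN
status = pd.to_numeric(raw, errors="coerce")

# Assumption: valid statuses are 0..5; anything else is treated as invalid
valid = status.between(0, 5)
cleaned = status.where(valid)

print(cleaned.value_counts(dropna=False))
```

Whether invalid rows should then be dropped or kept as a separate "unknown" category is a modelling choice left open here.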

### Consommation Levels

**The `consommation_level_*` columns refer to consumption levels. Investigating their values suggests that once level 1 reaches its maximum value, the excess accumulates in level 2, and so on.**
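
The tier behaviour described above can be sketched in plain Python. The caps are hypothetical: 1200 is chosen only because it reproduces the invoice shown earlier with 1200 in level 1 and 184 in level 2. The real STEG thresholds are not given in the source and may depend on the billing period.

```python
def split_into_levels(total, caps=(1200, 1200, 1200)):
    """Fill each level up to its cap; the last level takes the remainder."""
    levels = []
    remaining = total
    for cap in caps:
        used = min(remaining, cap)
        levels.append(used)
        remaining -= used
    levels.append(remaining)  # level 4 has no cap in this sketch
    return levels

print(split_into_levels(1384))  # [1200, 184, 0, 0]
print(split_into_levels(82))    # [82, 0, 0, 0]
```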

In [9]:
invoice_train[invoice_train['consommation_level_1']==0]

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
90,train_Client_100,2009-10-22,11,2078,0,413,6,1,0,0,0,0,98,98,4,ELEC
91,train_Client_100,2006-10-10,11,2078,0,413,6,1,0,0,0,0,90,90,4,ELEC
93,train_Client_100,2007-06-18,11,2078,0,413,6,1,0,0,0,0,91,91,4,ELEC
94,train_Client_100,2008-06-19,11,2078,0,413,6,1,0,0,0,0,91,91,4,ELEC
95,train_Client_100,2008-10-17,11,2078,0,413,6,1,0,0,0,0,91,91,4,ELEC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476675,train_Client_99996,2014-06-19,40,0,0,5,6,1,0,0,0,0,0,0,4,GAZ
4476676,train_Client_99996,2013-06-17,40,0,1,5,6,1,0,0,0,0,0,0,4,GAZ
4476677,train_Client_99996,2014-02-21,40,0,0,5,6,1,0,0,0,0,0,0,4,GAZ
4476679,train_Client_99996,2013-10-22,40,0,1,5,6,1,0,0,0,0,0,0,4,GAZ


In [10]:
client_train_0 = client_train[client_train['client_id']=='train_Client_0']
client_train_0

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target
0,60,train_Client_0,11,101,31/12/1994,0.0


In [11]:
# Join client 0's info to client 0's invoices
# 'on' only applies to the first df, the second df needs to have that column as index
client_0_joined = client_train_0.join(invoice_train.set_index('client_id'), on='client_id', how='inner')
client_0_joined

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,invoice_date,tarif_type,counter_number,counter_statue,...,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,60,train_Client_0,11,101,31/12/1994,0.0,2014-03-24,11,1335667,0,...,8,1,82,0,0,0,14302,14384,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2013-03-29,11,1335667,0,...,6,1,1200,184,0,0,12294,13678,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2015-03-23,11,1335667,0,...,8,1,123,0,0,0,14624,14747,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2015-07-13,11,1335667,0,...,8,1,102,0,0,0,14747,14849,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2016-11-17,11,1335667,0,...,9,1,572,0,0,0,15066,15638,12,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2017-07-17,11,1335667,0,...,9,1,314,0,0,0,15638,15952,8,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2018-12-07,11,1335667,0,...,9,1,541,0,0,0,15952,16493,12,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2019-03-19,11,1335667,0,...,9,1,585,0,0,0,16493,17078,8,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2011-07-22,11,1335667,0,...,9,1,1200,186,0,0,7770,9156,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2011-11-22,11,1335667,0,...,6,1,1082,0,0,0,9156,10238,4,ELEC
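
As the comment in the join cell notes, `DataFrame.join` needs the key as the index of the second frame. An alternative sketch uses `DataFrame.merge`, which matches on a column present in both frames. The two miniature tables below are hypothetical stand-ins for `client_train` and `invoice_train`.

```python
import pandas as pd

# Hypothetical miniature versions of the client and invoice tables
clients = pd.DataFrame({"client_id": ["c0", "c1"], "region": [101, 107]})
invoices = pd.DataFrame({
    "client_id": ["c0", "c0", "c1"],
    "new_index": [14384, 14747, 900],
})

# merge matches on a shared column in both frames, so no set_index is needed
joined = clients.merge(invoices, on="client_id", how="inner")
print(joined)
```

Unlike the `join` call above, `merge` returns a fresh 0..n-1 index rather than repeating the left frame's index for every invoice.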


In [12]:
client_invoices = client_0_joined.sort_values('invoice_date')
client_invoices

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,invoice_date,tarif_type,counter_number,counter_statue,...,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,60,train_Client_0,11,101,31/12/1994,0.0,2005-10-17,11,1335667,0,...,6,1,124,0,0,0,3685,3809,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2006-02-24,11,1335667,0,...,6,1,141,0,0,0,3809,3950,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2006-06-23,11,1335667,0,...,6,1,162,0,0,0,3950,4112,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2006-10-18,11,1335667,0,...,6,1,159,0,0,0,4112,4271,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2007-02-26,11,1335667,0,...,6,1,182,0,0,0,4271,4453,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2007-06-27,11,1335667,0,...,6,1,240,0,0,0,4453,4693,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2007-10-25,11,1335667,0,...,6,1,276,0,0,0,4693,4969,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2008-01-04,11,1335667,0,...,6,1,277,0,0,0,4969,5246,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2008-07-28,11,1335667,0,...,6,1,171,0,0,0,5246,5417,4,ELEC
0,60,train_Client_0,11,101,31/12/1994,0.0,2008-11-25,11,1335667,0,...,6,1,174,0,0,0,5417,5591,4,ELEC


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Feature Extraction

### Find how long each client has been a member

The number of days a client has been active must be measured against the dataset, not against today's date, because the dataset ends before today. The latest invoice date is therefore needed as a reference point.

In [13]:
from datetime import datetime

In [14]:
# Convert invoice dates to datetime format and get most recent date
all_invoice_dates_str = invoice_train['invoice_date']
all_invoice_dates_dt = list( map(lambda date_str: datetime.strptime(date_str, '%Y-%m-%d'), all_invoice_dates_str) )
max(all_invoice_dates_dt)

datetime.datetime(2019, 12, 7, 0, 0)
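
Mapping `strptime` over several million strings works but is slow. A possible vectorised alternative uses `pd.to_datetime` on the whole column. The short series below is a hypothetical sample in the same `'%Y-%m-%d'` format.

```python
import pandas as pd

# Hypothetical sample of invoice dates in the same '%Y-%m-%d' format
dates = pd.Series(["2014-03-24", "2019-12-07", "2005-10-17"])

# Parse the whole column at once instead of mapping strptime over each string
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
latest = parsed.max()
print(latest)  # 2019-12-07 00:00:00
```

Note that the result is a pandas `Timestamp` rather than a `datetime.datetime`, so the printed representation differs from the cell output above.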

Use 31-12-2019 as the end date: the latest invoice is dated 2019-12-07, so the dataset appears to span almost all of 2019.

In [15]:
end_date = datetime(2019, 12, 31, 0, 0)
end_date

datetime.datetime(2019, 12, 31, 0, 0)

In [16]:
# Convert creation date column to datetime format from string
client_train['creation_date'] = client_train['creation_date'].apply(lambda row:  datetime.strptime(row, '%d/%m/%Y') )

In [17]:
# Compute number of days active
client_train['days_active'] = end_date - client_train['creation_date']

# Convert timedelta (x days) to int (x) for classifier
client_train['days_active'] = client_train['days_active'].dt.days
client_train

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,days_active
0,60,train_Client_0,11,101,1994-12-31,0.0,9131
1,69,train_Client_1,11,107,2002-05-29,0.0,6425
2,62,train_Client_10,11,301,1986-03-13,0.0,12346
3,69,train_Client_100,11,105,1996-07-11,0.0,8573
4,62,train_Client_1000,11,303,2014-10-14,0.0,1904
...,...,...,...,...,...,...,...
135488,62,train_Client_99995,11,304,2004-07-26,0.0,5636
135489,63,train_Client_99996,11,311,2012-10-25,0.0,2623
135490,63,train_Client_99997,11,311,2011-11-22,0.0,2961
135491,60,train_Client_99998,11,101,1993-12-22,0.0,9505


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

### Get Average Consumption for each client

Total Consumption per invoice = New_Index - Old_Index
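
In the invoice rows shown earlier, the four `consommation_level_*` columns also add up to `new_index - old_index` (for example 1200 + 184 = 13678 - 12294). The sketch below checks this on three rows copied from the head of `invoice_train`. Whether it holds for every invoice is not verified here.

```python
import pandas as pd

# Three invoice rows copied from the head of invoice_train shown earlier
sample = pd.DataFrame({
    "consommation_level_1": [82, 1200, 123],
    "consommation_level_2": [0, 184, 0],
    "consommation_level_3": [0, 0, 0],
    "consommation_level_4": [0, 0, 0],
    "old_index": [14302, 12294, 14624],
    "new_index": [14384, 13678, 14747],
})

level_cols = [f"consommation_level_{i}" for i in range(1, 5)]
from_levels = sample[level_cols].sum(axis=1)
from_index = sample["new_index"] - sample["old_index"]

# Fraction of rows where both definitions of consumption agree
agreement = (from_levels == from_index).mean()
print(agreement)  # 1.0
```

Rows where the two disagree on the full dataset could be worth inspecting, since a mismatch might itself be a useful signal.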

In [18]:
invoice_train['total_consumption'] = invoice_train['new_index'] - invoice_train['old_index']
invoice_train

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type,total_consumption
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,0,0,14302,14384,4,ELEC,82
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,0,0,12294,13678,4,ELEC,1384
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,0,0,14624,14747,4,ELEC,123
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,0,0,14747,14849,4,ELEC,102
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,0,0,15066,15638,12,ELEC,572
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476744,train_Client_99998,2005-08-19,10,1253571,0,202,9,1,400,135,0,0,3197,3732,8,ELEC,535
4476745,train_Client_99998,2005-12-19,10,1253571,0,202,6,1,200,6,0,0,3732,3938,4,ELEC,206
4476746,train_Client_99999,1996-09-25,11,560948,0,203,6,1,259,0,0,0,13884,14143,4,ELEC,259
4476747,train_Client_99999,1996-05-28,11,560948,0,203,6,1,603,0,0,0,13281,13884,4,ELEC,603


In [19]:
avg_consumption = invoice_train.groupby(by='client_id')['total_consumption'].mean()
avg_consumption

client_id
train_Client_0        362.971429
train_Client_1        557.540541
train_Client_10       836.500000
train_Client_100        1.200000
train_Client_1000     922.642857
                         ...    
train_Client_99995      0.000000
train_Client_99996    186.609756
train_Client_99997    273.083333
train_Client_99998    370.500000
train_Client_99999    459.333333
Name: total_consumption, Length: 135493, dtype: float64

In [20]:
# 'on' only applies to the first df, the second df needs to have that column as index
client_train = client_train.join(avg_consumption, on='client_id', how='inner')
client_train.rename(columns={'total_consumption': 'avg_consumption'}, inplace=True)
client_train

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,days_active,avg_consumption
0,60,train_Client_0,11,101,1994-12-31,0.0,9131,362.971429
1,69,train_Client_1,11,107,2002-05-29,0.0,6425,557.540541
2,62,train_Client_10,11,301,1986-03-13,0.0,12346,836.500000
3,69,train_Client_100,11,105,1996-07-11,0.0,8573,1.200000
4,62,train_Client_1000,11,303,2014-10-14,0.0,1904,922.642857
...,...,...,...,...,...,...,...,...
135488,62,train_Client_99995,11,304,2004-07-26,0.0,5636,0.000000
135489,63,train_Client_99996,11,311,2012-10-25,0.0,2623,186.609756
135490,63,train_Client_99997,11,311,2011-11-22,0.0,2961,273.083333
135491,60,train_Client_99998,11,101,1993-12-22,0.0,9505,370.500000


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

### Get average reading score per client

`reading_remarque` holds the notes that the STEG agent takes during a visit to the client (e.g. if the counter shows something wrong, the agent gives a bad score).

In [21]:
reading_score = invoice_train.groupby(by='client_id')['reading_remarque'].mean()
reading_score

client_id
train_Client_0        6.971429
train_Client_1        7.216216
train_Client_10       7.055556
train_Client_100      6.150000
train_Client_1000     8.857143
                        ...   
train_Client_99995    6.000000
train_Client_99996    8.487805
train_Client_99997    9.000000
train_Client_99998    7.500000
train_Client_99999    6.000000
Name: reading_remarque, Length: 135493, dtype: float64

In [22]:
# Add reading score to training set
client_train = client_train.join(reading_score, on='client_id', how='inner')
client_train.rename(columns={'reading_remarque': 'avg_reading_score'}, inplace=True)
client_train

Unnamed: 0,disrict,client_id,client_catg,region,creation_date,target,days_active,avg_consumption,avg_reading_score
0,60,train_Client_0,11,101,1994-12-31,0.0,9131,362.971429,6.971429
1,69,train_Client_1,11,107,2002-05-29,0.0,6425,557.540541,7.216216
2,62,train_Client_10,11,301,1986-03-13,0.0,12346,836.500000,7.055556
3,69,train_Client_100,11,105,1996-07-11,0.0,8573,1.200000,6.150000
4,62,train_Client_1000,11,303,2014-10-14,0.0,1904,922.642857,8.857143
...,...,...,...,...,...,...,...,...,...
135488,62,train_Client_99995,11,304,2004-07-26,0.0,5636,0.000000,6.000000
135489,63,train_Client_99996,11,311,2012-10-25,0.0,2623,186.609756,8.487805
135490,63,train_Client_99997,11,311,2011-11-22,0.0,2961,273.083333,9.000000
135491,60,train_Client_99998,11,101,1993-12-22,0.0,9505,370.500000,7.500000
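
The two per-client averages above were computed with separate `groupby` calls and joins. A possible single-pass alternative uses pandas named aggregation. The miniature invoice table is hypothetical; the output column names match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical miniature invoice table
invoices = pd.DataFrame({
    "client_id": ["c0", "c0", "c1"],
    "total_consumption": [82, 1384, 600],
    "reading_remarque": [8, 6, 9],
})

# Named aggregation: one output column per (input column, function) pair
features = invoices.groupby("client_id").agg(
    avg_consumption=("total_consumption", "mean"),
    avg_reading_score=("reading_remarque", "mean"),
)
print(features)
```

The result is indexed by `client_id`, so it can be joined onto the client table in one step.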


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Compile training set

### Select all useful columns

In [23]:
training_set = client_train[['client_id', 'disrict', 'client_catg', 'region', 'days_active', 'avg_consumption', 'avg_reading_score', 'target']]
training_set

Unnamed: 0,client_id,disrict,client_catg,region,days_active,avg_consumption,avg_reading_score,target
0,train_Client_0,60,11,101,9131,362.971429,6.971429,0.0
1,train_Client_1,69,11,107,6425,557.540541,7.216216,0.0
2,train_Client_10,62,11,301,12346,836.500000,7.055556,0.0
3,train_Client_100,69,11,105,8573,1.200000,6.150000,0.0
4,train_Client_1000,62,11,303,1904,922.642857,8.857143,0.0
...,...,...,...,...,...,...,...,...
135488,train_Client_99995,62,11,304,5636,0.000000,6.000000,0.0
135489,train_Client_99996,63,11,311,2623,186.609756,8.487805,0.0
135490,train_Client_99997,63,11,311,2961,273.083333,9.000000,0.0
135491,train_Client_99998,60,11,101,9505,370.500000,7.500000,0.0


**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Prepare datasets

### Feature encoding

In [24]:
training_set_encoded = pd.get_dummies(training_set, columns=['disrict', 'client_catg', 'region'])
training_set_encoded.set_index('client_id', inplace=True)
training_set_encoded

client_id,days_active,avg_consumption,avg_reading_score,target,disrict_60,disrict_62,disrict_63,disrict_69,client_catg_11,client_catg_12,...,region_308,region_309,region_310,region_311,region_312,region_313,region_371,region_372,region_379,region_399
train_Client_0,9131,362.971429,6.971429,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1,6425,557.540541,7.216216,0.0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_10,12346,836.500000,7.055556,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_100,8573,1.200000,6.150000,0.0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1000,1904,922.642857,8.857143,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
train_Client_99995,5636,0.000000,6.000000,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_99996,2623,186.609756,8.487805,0.0,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99997,2961,273.083333,9.000000,0.0,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99998,9505,370.500000,7.500000,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Split into train / test sets

In [25]:
import numpy as np
from sklearn.model_selection import train_test_split

In [26]:
y = training_set_encoded['target']
y

client_id
train_Client_0        0.0
train_Client_1        0.0
train_Client_10       0.0
train_Client_100      0.0
train_Client_1000     0.0
                     ... 
train_Client_99995    0.0
train_Client_99996    0.0
train_Client_99997    0.0
train_Client_99998    0.0
train_Client_99999    0.0
Name: target, Length: 135493, dtype: float64

In [27]:
X = training_set_encoded.drop(columns=['target'])
X

client_id,days_active,avg_consumption,avg_reading_score,disrict_60,disrict_62,disrict_63,disrict_69,client_catg_11,client_catg_12,client_catg_51,...,region_308,region_309,region_310,region_311,region_312,region_313,region_371,region_372,region_379,region_399
train_Client_0,9131,362.971429,6.971429,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1,6425,557.540541,7.216216,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_10,12346,836.500000,7.055556,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_100,8573,1.200000,6.150000,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1000,1904,922.642857,8.857143,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
train_Client_99995,5636,0.000000,6.000000,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_99996,2623,186.609756,8.487805,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99997,2961,273.083333,9.000000,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99998,9505,370.500000,7.500000,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
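
Because only about 5.6% of the labels are positive, a purely random split can leave the two partitions with slightly different fraud rates. A hedged sketch of a stratified split follows. The data is hypothetical (40 positives out of 1000) and `stratify` is an addition, not something the original cell uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 40 positives out of 1000
y_demo = np.array([1] * 40 + [0] * 960)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify keeps the positive rate (about 4%) the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```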

**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

## Normalization

In [29]:
from sklearn.preprocessing import MinMaxScaler

In [30]:
scaler = MinMaxScaler()

In [31]:
scaler.fit(training_set_encoded[['days_active', 'avg_consumption', 'avg_reading_score']])

MinMaxScaler()

In [32]:
transformed_data = scaler.transform(training_set_encoded[['days_active', 'avg_consumption', 'avg_reading_score']])
transformed_data

array([[0.57973902, 0.4554914 , 0.0023868 ],
       [0.40579803, 0.45566902, 0.00298825],
       [0.78639841, 0.45592368, 0.0025935 ],
       ...,
       [0.18313299, 0.45540934, 0.00737101],
       [0.60377965, 0.45549827, 0.0036855 ],
       [0.78787684, 0.45557936, 0.        ]])

In [33]:
training_set_encoded[['days_active', 'avg_consumption', 'avg_reading_score']] = transformed_data

In [34]:
training_set_encoded

client_id,days_active,avg_consumption,avg_reading_score,target,disrict_60,disrict_62,disrict_63,disrict_69,client_catg_11,client_catg_12,...,region_308,region_309,region_310,region_311,region_312,region_313,region_371,region_372,region_379,region_399
train_Client_0,0.579739,0.455491,0.002387,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1,0.405798,0.455669,0.002988,0.0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_10,0.786398,0.455924,0.002594,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_100,0.543871,0.455161,0.000369,0.0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1000,0.115189,0.456002,0.007020,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
train_Client_99995,0.355081,0.455160,0.000000,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_99996,0.161406,0.455330,0.006113,0.0,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99997,0.183133,0.455409,0.007371,0.0,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99998,0.603780,0.455498,0.003686,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
y_normalized = training_set_encoded['target']
y_normalized

client_id
train_Client_0        0.0
train_Client_1        0.0
train_Client_10       0.0
train_Client_100      0.0
train_Client_1000     0.0
                     ... 
train_Client_99995    0.0
train_Client_99996    0.0
train_Client_99997    0.0
train_Client_99998    0.0
train_Client_99999    0.0
Name: target, Length: 135493, dtype: float64

In [36]:
X_normalized = training_set_encoded.drop(columns=['target'])
X_normalized

client_id,days_active,avg_consumption,avg_reading_score,disrict_60,disrict_62,disrict_63,disrict_69,client_catg_11,client_catg_12,client_catg_51,...,region_308,region_309,region_310,region_311,region_312,region_313,region_371,region_372,region_379,region_399
train_Client_0,0.579739,0.455491,0.002387,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1,0.405798,0.455669,0.002988,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_10,0.786398,0.455924,0.002594,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_100,0.543871,0.455161,0.000369,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_1000,0.115189,0.456002,0.007020,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
train_Client_99995,0.355081,0.455160,0.000000,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
train_Client_99996,0.161406,0.455330,0.006113,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99997,0.183133,0.455409,0.007371,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
train_Client_99998,0.603780,0.455498,0.003686,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
scaler.fit(X_normalized)

MinMaxScaler()

In [38]:
transformed_data = scaler.transform(X_normalized)
transformed_data

array([[0.57973902, 0.4554914 , 0.0023868 , ..., 0.        , 0.        ,
        0.        ],
       [0.40579803, 0.45566902, 0.00298825, ..., 0.        , 0.        ,
        0.        ],
       [0.78639841, 0.45592368, 0.0025935 , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.18313299, 0.45540934, 0.00737101, ..., 0.        , 0.        ,
        0.        ],
       [0.60377965, 0.45549827, 0.0036855 , ..., 0.        , 0.        ,
        0.        ],
       [0.78787684, 0.45557936, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [39]:
X_normalized = transformed_data

In [40]:
X_normalized

array([[0.57973902, 0.4554914 , 0.0023868 , ..., 0.        , 0.        ,
        0.        ],
       [0.40579803, 0.45566902, 0.00298825, ..., 0.        , 0.        ,
        0.        ],
       [0.78639841, 0.45592368, 0.0025935 , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.18313299, 0.45540934, 0.00737101, ..., 0.        , 0.        ,
        0.        ],
       [0.60377965, 0.45549827, 0.0036855 , ..., 0.        , 0.        ,
        0.        ],
       [0.78787684, 0.45557936, 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Model Training

In [41]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

### KNN Classifier

In [42]:
from sklearn.neighbors import KNeighborsClassifier

In [43]:
knn = KNeighborsClassifier(n_neighbors=10)
knn

KNeighborsClassifier(n_neighbors=10)

Test on both the normalized and the unnormalized datasets.

In [44]:
scores_KNN_un = cross_val_score(knn, X, y, cv=5, n_jobs=-1, scoring='roc_auc')
scores_KNN_un.mean()

0.552040654438493

In [45]:
scores_KNN = cross_val_score(knn, X_normalized, y_normalized, cv=5, n_jobs=-1, scoring='roc_auc')
scores_KNN.mean()

0.6091250367969417
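
The normalized score above comes from a scaler that was fitted on the full dataset before cross-validation, so each validation fold influenced the scaling. A sketch that refits the scaler inside every fold, using a scikit-learn pipeline, is shown below. The synthetic data from `make_classification` is a hypothetical stand-in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X and y (hypothetical data, roughly 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=10))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.mean())
```

For min-max scaling the effect of this leak is usually small, but the pipeline form also makes it harder to forget the scaling step at prediction time.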

### Linear Regression

In [46]:
from sklearn.linear_model import LinearRegression

In [47]:
model_LinReg = LinearRegression(n_jobs=10)
model_LinReg

LinearRegression(n_jobs=10)

In [48]:
scores_LinReg_un = cross_val_score(model_LinReg, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_LinReg_un.mean()

0.6472690876313771

In [49]:
scores_LinReg = cross_val_score(model_LinReg, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_LinReg.mean()

0.6471581170852729
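
Scoring a regressor with ROC AUC treats its continuous predictions as ranking scores. The sketch below makes that explicit by collecting out-of-fold predictions and passing them to `roc_auc_score`. The synthetic data is hypothetical, and the pooled AUC computed this way is not identical to the mean of per-fold AUCs reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X and y (hypothetical data, roughly 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

# Out-of-fold continuous predictions; they need not lie in [0, 1]
preds = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=10)

# ROC AUC only uses the ordering of the scores, not their scale
auc = roc_auc_score(y_demo, preds)
print(auc)
```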

### Logistic Regression

In [50]:
from sklearn.linear_model import LogisticRegression

In [51]:
model_LogReg = LogisticRegression(random_state=42, solver='lbfgs', C=100)
model_LogReg

LogisticRegression(C=100, random_state=42)

In [52]:
scores_LogReg_un = cross_val_score(model_LogReg, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_LogReg_un.mean()

0.5003210867132372

In [53]:
scores_LogReg = cross_val_score(model_LogReg, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_LogReg.mean()

0.6439937025825533

### Naive Bayes

In [54]:
from sklearn.naive_bayes import GaussianNB

In [55]:
model_GNB = GaussianNB()
model_GNB

GaussianNB()

In [56]:
scores_NB_un = cross_val_score(model_GNB, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_NB_un.mean()

0.6554633659205935

In [57]:
scores_NB = cross_val_score(model_GNB, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_NB.mean()

0.6346640360546634

### Ridge Regression

In [58]:
from sklearn.linear_model import Ridge

In [59]:
model_Ridge = Ridge(alpha=0.01)
model_Ridge

Ridge(alpha=0.01)

In [60]:
scores_Ridge_un = cross_val_score(model_Ridge, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_Ridge_un.mean()

0.6472690321797316

In [61]:
scores_Ridge= cross_val_score(model_Ridge, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_Ridge.mean()

0.6472281770455608

### Lasso Regression

In [62]:
from sklearn.linear_model import Lasso

In [63]:
model_Lasso = Lasso(alpha=0.01)
model_Lasso

Lasso(alpha=0.01)

In [64]:
scores_Lasso_un = cross_val_score(model_Lasso, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_Lasso_un.mean()

0.5764608167430733

In [65]:
scores_Lasso = cross_val_score(model_Lasso, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_Lasso.mean()

0.5
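
One plausible explanation for the score of exactly 0.5 is that, on features scaled to [0, 1] with a rare target, the L1 penalty at `alpha=0.01` shrinks every coefficient to zero, leaving constant predictions. This is not confirmed for the notebook's data. The sketch below reproduces the effect on hypothetical synthetic data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical data: features in [0, 1], rare positives weakly tied to column 0
X_demo = rng.uniform(size=(5000, 3))
y_demo = (rng.uniform(size=5000) < 0.04 + 0.03 * X_demo[:, 0]).astype(float)

model = Lasso(alpha=0.01).fit(X_demo, y_demo)
preds = model.predict(X_demo)
auc = roc_auc_score(y_demo, preds)

print(model.coef_)  # all zeros when the penalty dominates
print(auc)          # constant predictions give 0.5
```

Inspecting `coef_` on the real model, or trying a much smaller `alpha`, would be a quick way to check whether this is what happened.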

### Polynomial Regression

In [66]:
from sklearn.preprocessing import PolynomialFeatures

In [67]:
poly = PolynomialFeatures(degree=3)

In [68]:
X_poly = poly.fit_transform(X)
X_poly

array([[1.00000000e+00, 9.13100000e+03, 3.62971429e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000000e+00, 6.42500000e+03, 5.57540541e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000000e+00, 1.23460000e+04, 8.36500000e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.00000000e+00, 2.96100000e+03, 2.73083333e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000000e+00, 9.50500000e+03, 3.70500000e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000000e+00, 1.23690000e+04, 4.59333333e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [69]:
X_poly_normalized = poly.fit_transform(X_normalized)
X_poly_normalized

array([[1.        , 0.57973902, 0.4554914 , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.40579803, 0.45566902, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.78639841, 0.45592368, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [1.        , 0.18313299, 0.45540934, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.60377965, 0.45549827, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.78787684, 0.45557936, ..., 0.        , 0.        ,
        0.        ]])
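
A degree-3 expansion grows quickly with the number of input columns: for n inputs it produces C(n + 3, 3) columns, bias term included. The sketch below cross-checks that formula against scikit-learn on a tiny input. The helper `n_poly_features` is a hypothetical name, and 40 is an illustrative column count rather than the exact width of `X`.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_inputs, degree):
    """Number of monomials of total degree <= degree, bias term included."""
    return comb(n_inputs + degree, degree)

# Cross-check the formula against scikit-learn on a tiny input
poly_demo = PolynomialFeatures(degree=3).fit(np.zeros((1, 3)))
print(poly_demo.n_output_features_, n_poly_features(3, 3))  # 20 20

# Growth for wider inputs (40 is a hypothetical column count)
print(n_poly_features(40, 3))  # 12341
```

With one-hot columns included, most of these products are zero, so expanding only the continuous features may be a cheaper option.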

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X_poly_normalized, y_normalized, train_size=0.75, test_size=0.25, random_state=0)

In [71]:
model_Poly = Ridge(alpha=100)
model_Poly

Ridge(alpha=100)

In [72]:
model_Poly.fit(X_train, y_train)

Ridge(alpha=100)

In [73]:
preds = model_Poly.predict(X_test)
preds

array([0.06154437, 0.02636852, 0.02947376, ..., 0.03256814, 0.1005698 ,
       0.11997621])

In [74]:
from sklearn.metrics import roc_auc_score

In [75]:
roc_auc_score(y_test, preds)

0.6663140053247101

### Decision Trees

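Decision trees require no normalization and can split directly on integer-coded categorical columns (scikit-learn treats these codes as ordered numbers), so the raw, un-encoded features are used here.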

In [76]:
from sklearn.tree import DecisionTreeClassifier

In [77]:
raw_training_set = training_set.set_index('client_id')
raw_training_set

client_id,disrict,client_catg,region,days_active,avg_consumption,avg_reading_score,target
train_Client_0,60,11,101,9131,362.971429,6.971429,0.0
train_Client_1,69,11,107,6425,557.540541,7.216216,0.0
train_Client_10,62,11,301,12346,836.500000,7.055556,0.0
train_Client_100,69,11,105,8573,1.200000,6.150000,0.0
train_Client_1000,62,11,303,1904,922.642857,8.857143,0.0
...,...,...,...,...,...,...,...
train_Client_99995,62,11,304,5636,0.000000,6.000000,0.0
train_Client_99996,63,11,311,2623,186.609756,8.487805,0.0
train_Client_99997,63,11,311,2961,273.083333,9.000000,0.0
train_Client_99998,60,11,101,9505,370.500000,7.500000,0.0


In [78]:
y_original = raw_training_set['target']
y_original

client_id
train_Client_0        0.0
train_Client_1        0.0
train_Client_10       0.0
train_Client_100      0.0
train_Client_1000     0.0
                     ... 
train_Client_99995    0.0
train_Client_99996    0.0
train_Client_99997    0.0
train_Client_99998    0.0
train_Client_99999    0.0
Name: target, Length: 135493, dtype: float64

In [79]:
X_original = raw_training_set.drop(columns=['target', 'disrict', 'client_catg'])
X_original

client_id,region,days_active,avg_consumption,avg_reading_score
train_Client_0,101,9131,362.971429,6.971429
train_Client_1,107,6425,557.540541,7.216216
train_Client_10,301,12346,836.500000,7.055556
train_Client_100,105,8573,1.200000,6.150000
train_Client_1000,303,1904,922.642857,8.857143
...,...,...,...,...
train_Client_99995,304,5636,0.000000,6.000000
train_Client_99996,311,2623,186.609756,8.487805
train_Client_99997,311,2961,273.083333,9.000000
train_Client_99998,101,9505,370.500000,7.500000


In [80]:
model_DT = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10, max_leaf_nodes=10)
model_DT

DecisionTreeClassifier(criterion='entropy', max_depth=10, max_leaf_nodes=10,
                       random_state=0)

In [81]:
model_DT.fit(X_original, y_original)

DecisionTreeClassifier(criterion='entropy', max_depth=10, max_leaf_nodes=10,
                       random_state=0)

In [82]:
scores = cross_val_score(model_DT, X_original, y_original, cv=10, n_jobs=-1, scoring='roc_auc')
scores

array([0.68213245, 0.69182598, 0.70130312, 0.70389861, 0.69943669,
       0.69288926, 0.69103391, 0.67754399, 0.6914388 , 0.69854079])

In [83]:
scores.mean()

0.6930043584233111

In [84]:
scores_DT_un = cross_val_score(model_DT, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_DT_un.mean()

0.6928140205298365

In [85]:
scores_DT = cross_val_score(model_DT, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_DT.mean()

0.692807341471577
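
The three cross-validated means above are nearly identical, which is what one would expect: a tree split depends on the ordering of feature values, not on their scale. The sketch below checks this on synthetic stand-in data (an assumption; `X_demo` and `y_demo` are not the STEG features) using the same tree settings as `model_DT`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with deliberately mismatched feature scales
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])

# Rescale every feature to [0, 1]
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

# Same hyperparameters as model_DT in this notebook
params = dict(criterion='entropy', max_depth=10, max_leaf_nodes=10, random_state=0)
tree_raw = DecisionTreeClassifier(**params).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(**params).fit(X_demo_scaled, y_demo)

# Fraction of samples on which both trees predict the same class
agreement = np.mean(tree_raw.predict(X_demo) == tree_scaled.predict(X_demo_scaled))
print(f"Fraction of identical predictions: {agreement:.3f}")
```

Small differences can still appear through floating-point rounding, which may explain why the means above differ in the fifth or sixth decimal place.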

In [86]:
model_DT.feature_importances_

array([0.21405283, 0.29682238, 0.09269651, 0.39642828])

In [87]:
X_original

client_id,region,days_active,avg_consumption,avg_reading_score
train_Client_0,101,9131,362.971429,6.971429
train_Client_1,107,6425,557.540541,7.216216
train_Client_10,301,12346,836.500000,7.055556
train_Client_100,105,8573,1.200000,6.150000
train_Client_1000,303,1904,922.642857,8.857143
...,...,...,...,...
train_Client_99995,304,5636,0.000000,6.000000
train_Client_99996,311,2623,186.609756,8.487805
train_Client_99997,311,2961,273.083333,9.000000
train_Client_99998,101,9505,370.500000,7.500000
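
`feature_importances_` follows the column order of the data the model was fitted on. A small sketch that pairs the values printed in cell 86 with the `X_original` column names, assuming the importances come from the fit on `X_original` in cell 81:

```python
import pandas as pd

# Values copied from the model_DT.feature_importances_ output above
importances = pd.Series(
    [0.21405283, 0.29682238, 0.09269651, 0.39642828],
    index=['region', 'days_active', 'avg_consumption', 'avg_reading_score'],
)

# Rank features from most to least important
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Under that assumption, `avg_reading_score` contributes most to the tree's splits and `avg_consumption` least. In the notebook itself the same ranking could be built with `pd.Series(model_DT.feature_importances_, index=X_original.columns)`.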


### Random Forest

In [88]:
from sklearn.ensemble import RandomForestClassifier

In [89]:
model_RF = RandomForestClassifier(criterion='entropy', random_state=0, max_depth=11)
model_RF

RandomForestClassifier(criterion='entropy', max_depth=11, random_state=0)

In [90]:
scores_RF = cross_val_score(model_RF, X_original, y_original, cv=10, n_jobs=-1, scoring='roc_auc')
scores_RF

array([0.71952292, 0.72600764, 0.72867118, 0.73251415, 0.73786596,
       0.74298672, 0.73510848, 0.71522616, 0.72274743, 0.73573327])

In [91]:
scores_RF.mean()

0.7296383893960754

In [92]:
scores_RF_un = cross_val_score(model_RF, X, y, cv=10, n_jobs=-1, scoring='roc_auc')
scores_RF_un.mean()

0.7307562320179404

In [93]:
scores_RF = cross_val_score(model_RF, X_normalized, y_normalized, cv=10, n_jobs=-1, scoring='roc_auc')
scores_RF.mean()

0.7305791265544075

### Support Vector Machines

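SVMs with the default RBF kernel are resource intensive on a dataset of this size, so fewer cross-validation folds are used here.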

In [94]:
from sklearn.svm import SVC

In [95]:
model_SVC = SVC(C=1, gamma=1)
model_SVC

SVC(C=1, gamma=1)

In [None]:
scores_SVC_un = cross_val_score(model_SVC, X, y, cv=3, n_jobs=-1, scoring='roc_auc')
scores_SVC_un.mean()

In [97]:
scores_SVC = cross_val_score(model_SVC, X_normalized, y_normalized, cv=5, n_jobs=-1, scoring='roc_auc')
scores_SVC.mean()

0.5185112783468773
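
Kernel SVM training time grows at least quadratically with the number of samples, so one way to keep experiments manageable is to tune on a stratified subsample first. The sketch below is only an illustration under stated assumptions: it uses synthetic imbalanced data (`X_demo`, `y_demo`), a hypothetical subsample size of 500, standardized features, and `gamma='scale'` rather than the `gamma=1` used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in data (roughly 90% negatives, 10% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified subsample keeps the class ratio while cutting training cost
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=500, stratify=y_demo, random_state=0
)

# Scaling inside the pipeline is refitted on each training fold
svc_pipeline = make_pipeline(StandardScaler(), SVC(C=1, gamma='scale'))
scores_demo = cross_val_score(svc_pipeline, X_sub, y_sub, cv=3, scoring='roc_auc')
print(scores_demo.mean())
```

Putting the scaler inside the pipeline avoids leaking statistics from the validation folds into training. Whether a subsample is representative enough for the fraud data would need to be checked separately.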