## Spark ML - credit card fraud analysis

Data source: https://www.kaggle.com/mlg-ulb/creditcardfraud

#### Loading the data

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import DoubleType

In [2]:
spark = SparkSession.builder.master('local').appName('MiNADZD').getOrCreate()

df = spark.read.csv('./dataset/creditcard.csv', header=True, inferSchema=True)
df = df.withColumn('Class', df['Class'].cast(DoubleType()))

Size of the data after loading the dataset:

In [3]:
'Data size: ' + str(df.count()) + ' rows x ' + str(len(df.columns)) + ' columns.'

'Data size: 284807 rows x 31 columns.'

In [28]:
df.show(5)

+----+------------------+-------------------+----------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------+-----+--------------------+
|Time|                V1|                 V2|              V3|                V4|                 V5|                 V6|                 V7|                V8|                V9|                V10|               V11|               V12|               V13|               V14|               V15|               V16|               V17|                V18|               V19|                V20|                 V21|    

Assembling all features into a single vector column

In [4]:
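listColumns = df.columns

# Assemble only once: skip if 'features' already exists (e.g. the cell was re-run)
if 'features' not in listColumns:
    # Input columns: Time, Amount and the 28 features V1..V28
    inputColumns = ['Time', 'Amount']
    for i in range(1, 29):
        inputColumns.append('V' + str(i))

    assembler = VectorAssembler(inputCols=inputColumns, outputCol='features')
    df = assembler.transform(df)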
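
The list of input column names can also be built in a single expression. A minimal sketch that produces the same 30 names as the loop in the cell; `input_columns` is a name introduced here only for illustration:

```python
# Build the same 30 input column names: Time, Amount, V1..V28
input_columns = ['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)]

print(len(input_columns))   # 30
print(input_columns[:4])    # ['Time', 'Amount', 'V1', 'V2']
```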

Normalization (with `p=1.0`, each feature vector is scaled to unit L1 norm)

In [5]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
liNormData = normalizer.transform(df)
liNormData.select('normFeatures').show(5)

+--------------------+
|        normFeatures|
+--------------------+
|[0.0,0.9190068131...|
|[0.0,0.2234773306...|
|[0.00247767129574...|
|[0.00708831461517...|
|[0.02296611443072...|
+--------------------+
only showing top 5 rows
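
With `p=1.0`, `Normalizer` divides each row's feature vector by its L1 norm (the sum of the absolute values of its entries), so the absolute values in every output vector sum to 1. A minimal pure-Python sketch of that operation on a made-up vector; `l1_normalize` is a hypothetical helper, not part of Spark:

```python
def l1_normalize(vector):
    """Scale a vector so that the absolute values of its entries sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        return list(vector)  # assumption of this sketch: a zero vector is left unchanged
    return [x / norm for x in vector]

example = [3.0, -4.0, 1.0]          # made-up values, L1 norm = 8
normalized = l1_normalize(example)
print(normalized)                    # [0.375, -0.5, 0.125]
```

Note that this rescales each row, not each column, so it is a different operation from standardizing every feature to a common scale.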



Splitting into training and test data

In [6]:
(trainingNormData, testNormData) = liNormData.randomSplit([0.8,0.2])
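
`randomSplit` assigns rows to the splits at random, so the split sizes are only approximately 80/20 and the results below change between runs unless a `seed` is passed. A pure-Python sketch of the idea on a made-up list of row ids; `random_split` is a hypothetical helper and not Spark's implementation:

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test, using a fixed seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))                      # stand-in for DataFrame rows
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(train_a == train_b)                     # True: same seed, same split
print(len(train_a) + len(test_a))             # 1000: every row lands in exactly one split
```

In the notebook, the analogous call would be `liNormData.randomSplit([0.8, 0.2], seed=42)`; the seed value itself is arbitrary.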

### RandomForestClassifier - training & testing

In [7]:
algo = RandomForestClassifier(featuresCol='normFeatures', labelCol='Class')
model = algo.fit(trainingNormData)

In [8]:
predictions = model.transform(testNormData)

Displaying the first 20 classification results:

In [9]:
predictions.select(['Class','prediction', 'probability']).show()

+-----+----------+--------------------+
|Class|prediction|         probability|
+-----+----------+--------------------+
|  0.0|       0.0|[0.98584846610895...|
|  0.0|       0.0|[0.99501517779238...|
|  0.0|       0.0|[0.99122397700143...|
|  0.0|       0.0|[0.99932419959270...|
|  0.0|       0.0|[0.96398032250162...|
|  0.0|       0.0|[0.99345809862081...|
|  0.0|       0.0|[0.99508633174788...|
|  0.0|       0.0|[0.99508605598980...|
|  0.0|       0.0|[0.99949164073313...|
|  0.0|       0.0|[0.99545200029915...|
|  0.0|       0.0|[0.99946677853171...|
|  0.0|       0.0|[0.99359088179585...|
|  0.0|       0.0|[0.99960312913487...|
|  0.0|       0.0|[0.92472114154392...|
|  0.0|       0.0|[0.99944328947565...|
|  0.0|       0.0|[0.99543909780981...|
|  0.0|       0.0|[0.92057682489152...|
|  0.0|       0.0|[0.99955210469369...|
|  0.0|       0.0|[0.99506528893611...|
|  0.0|       0.0|[0.96285893836944...|
+-----+----------+--------------------+
only showing top 20 rows



### Evaluating classifier quality

Classification accuracy (error = 1 - accuracy)

In [10]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)

0.9990055826936497
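
Accuracy is a weak signal on data this imbalanced. The sketch below compares the model with a trivial baseline that always predicts class 0, using the counts from the confusion matrix reported later in this section. It assumes that rows of that matrix are actual classes and columns are predicted classes:

```python
# Test-split counts taken from the RandomForest confusion matrix in this section
# (assumes rows = actual class, columns = predicted class)
actual_genuine = 57214 + 12   # transactions with actual class 0
actual_fraud = 45 + 49        # transactions with actual class 1
total = actual_genuine + actual_fraud

# Baseline: predict "not fraud" for every transaction
baseline_accuracy = actual_genuine / total
model_accuracy = (57214 + 49) / total

print(f"baseline: {baseline_accuracy:.6f}")   # baseline: 0.998360
print(f"model:    {model_accuracy:.6f}")      # model:    0.999006
```

Under that assumption, the baseline already reaches about 99.84% accuracy, so accuracy alone says little about how many frauds are caught.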

F1-score

In [11]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)

0.9988998686865757
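
With `metricName='f1'`, `MulticlassClassificationEvaluator` reports the F1 score averaged over the classes and weighted by class frequency, so the dominant class 0 drives the result. The sketch below recomputes it from the confusion-matrix counts reported later in this section, assuming rows are actual classes; `f1_from_counts` is a hypothetical helper:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN) for a single class."""
    return 2 * tp / (2 * tp + fp + fn)

# RandomForest counts, treating fraud (class 1) as the positive class
tn, fp, fn, tp = 57214, 12, 45, 49

f1_fraud = f1_from_counts(tp, fp, fn)
# For class 0 the roles swap: its true positives are tn, its FP are fn, its FN are fp
f1_genuine = f1_from_counts(tn, fn, fp)

total = tn + fp + fn + tp
weighted_f1 = (f1_genuine * (tn + fp) + f1_fraud * (fn + tp)) / total

print(round(f1_fraud, 4))     # 0.6323
print(round(weighted_f1, 4))  # 0.9989
```

The weighted value agrees with the evaluator output above, while the F1 score of the fraud class alone is much lower.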

AUC

In [12]:
evaluator = BinaryClassificationEvaluator(labelCol='Class', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)

0.9253303995877488
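
`BinaryClassificationEvaluator` defaults to the area under the ROC curve, which can be read as the probability that a randomly chosen fraud receives a higher score than a randomly chosen genuine transaction. A minimal pairwise sketch on made-up labels and scores; `auc_pairwise` is a hypothetical helper with quadratic cost, meant only as an illustration:

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 0, 1]               # made-up labels
scores = [0.1, 0.4, 0.35, 0.2, 0.9]    # made-up classifier scores
auc = auc_pairwise(labels, scores)
print(round(auc, 4))                   # 0.8333 (5 of 6 pairs ranked correctly)
```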

Confusion Matrix

In [13]:
metrics = MulticlassMetrics(predictions.select('prediction','Class').rdd)
metrics.confusionMatrix()

DenseMatrix(2, 2, [57214.0, 45.0, 12.0, 49.0], 0)
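
The `DenseMatrix` output lists its values in column-major order. Assuming rows are actual classes and columns are predicted classes (as the Spark documentation for `confusionMatrix` describes), the sketch below unpacks the four counts and derives precision and recall for the fraud class; `unpack_confusion` is a hypothetical helper:

```python
def unpack_confusion(values):
    """Unpack a 2x2 column-major confusion matrix into (tn, fp, fn, tp)."""
    m00, m10, m01, m11 = values   # column-major: first column, then second column
    return m00, m01, m10, m11     # tn, fp, fn, tp with class 1 as the positive class

tn, fp, fn, tp = unpack_confusion([57214.0, 45.0, 12.0, 49.0])

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were detected

print(f"precision={precision:.3f} recall={recall:.3f}")   # precision=0.803 recall=0.521
```

Under this reading, the model misses 45 of the 94 frauds in the test split despite an accuracy of about 99.9%.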

### LogisticRegression - training & testing

In [14]:
algo = LogisticRegression(featuresCol='normFeatures', labelCol='Class')
model = algo.fit(trainingNormData)

In [15]:
predictions = model.transform(testNormData)

Displaying the first 20 classification results:

In [16]:
predictions.select(['Class','prediction']).show()

+-----+----------+
|Class|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows



### Evaluating classifier quality

Classification accuracy (error = 1 - accuracy)

In [17]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)

0.9984647592463364

F1-score

In [18]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)

0.9980499316079973

AUC

In [19]:
evaluator = BinaryClassificationEvaluator(labelCol='Class', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)

0.8884689930406596

Confusion Matrix

In [20]:
metrics = MulticlassMetrics(predictions.select('prediction','Class').rdd)
metrics.confusionMatrix()

DenseMatrix(2, 2, [57215.0, 77.0, 11.0, 17.0], 0)

### Gradient-Boosted Tree Classifier - training & testing

In [21]:
algo = GBTClassifier(featuresCol='normFeatures', labelCol='Class')
model = algo.fit(trainingNormData)

In [22]:
predictions = model.transform(testNormData)

Displaying the first 20 classification results:

In [23]:
predictions.select(['Class','prediction']).show()

+-----+----------+
|Class|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows



### Evaluating classifier quality

Classification accuracy (error = 1 - accuracy)

In [24]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)

0.9993021632937893

F1-score

In [25]:
evaluator = MulticlassClassificationEvaluator(labelCol='Class', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)

0.9992606863559548

AUC

In [26]:
evaluator = BinaryClassificationEvaluator(labelCol='Class', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)

0.9556063082470323

Confusion Matrix

In [27]:
metrics = MulticlassMetrics(predictions.select('prediction','Class').rdd)
metrics.confusionMatrix()

DenseMatrix(2, 2, [57216.0, 30.0, 10.0, 64.0], 0)
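
The three confusion matrices can be compared directly on the fraud class. The sketch below assumes that the `DenseMatrix` values are column-major, with rows as actual classes and columns as predicted classes; the dictionary names are introduced here only for illustration:

```python
# Column-major confusion-matrix values copied from the three outputs in this notebook
matrices = {
    'RandomForest': [57214.0, 45.0, 12.0, 49.0],
    'LogisticRegression': [57215.0, 77.0, 11.0, 17.0],
    'GBT': [57216.0, 30.0, 10.0, 64.0],
}

fraud_metrics = {}
for name, (tn, fn, fp, tp) in matrices.items():
    # Assumed layout: [actual0/pred0, actual1/pred0, actual0/pred1, actual1/pred1]
    fraud_metrics[name] = {
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

for name, m in fraud_metrics.items():
    print(f"{name:<20} precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

On this split, the gradient-boosted model has the highest fraud recall and logistic regression the lowest, which matches the ordering of the AUC values reported above.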