## Spark ML - credit card fraud analysis

Data source: https://www.kaggle.com/mlg-ulb/creditcardfraud

#### Loading the data

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import DoubleType

In [3]:
# Start (or reuse) a local Spark session for this notebook
spark = SparkSession.builder.master('local').appName('MiNADZD').getOrCreate()

In [None]:
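df = spark.read.csv('./dataset/finalDataset.csv', header=True, inferSchema=True)

# Drop the index/identifier columns; they are not used as features
df = df.drop('ind', '_C0', 'id')

# Cast every remaining column to double so it can be placed in a feature vector
for feature in df.columns:
    df = df.withColumn(feature, df[feature].cast(DoubleType()))

df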

Size of the data after loading the dataset:

In [None]:
f'Data size: {df.count()} rows x {len(df.columns)} columns.'

In [None]:
df.show(5)

Assembling all features into a single vector column

In [None]:
# Every column except the label ('satisfaction') is used as an input feature
inputColumns = [feature for feature in df.columns if feature != 'satisfaction']

assembler = VectorAssembler(inputCols=inputColumns, outputCol='features')
df = assembler.transform(df)

In [None]:
df = df.select('satisfaction', 'features')
df

Normalization

In [None]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
liNormData = normalizer.transform(df)
liNormData.select('normFeatures').show(5)
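
With `p=1.0`, `Normalizer` rescales each row vector on its own so that the absolute values of its entries sum to 1. The arithmetic can be sketched in plain Python, without Spark. The helper name `l1_normalize` is hypothetical, and returning an all-zero row unchanged is a choice made for this sketch:

```python
def l1_normalize(row):
    """Scale one feature row so that its absolute values sum to 1 (L1 norm)."""
    norm = sum(abs(x) for x in row)
    if norm == 0:
        # Sketch assumption: leave an all-zero row untouched
        return list(row)
    return [x / norm for x in row]

example = l1_normalize([1.0, -3.0, 4.0])
print(example)  # [0.125, -0.375, 0.5]
```

Note that this scales each row, not each column, so features measured on very different scales still keep their relative magnitudes within a row.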

Splitting into training and test data

In [None]:
(trainingNormData, testNormData) = liNormData.randomSplit([0.8,0.2])
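
`randomSplit` assigns rows to the splits at random, so the 80/20 proportions are only approximate and, without a seed, the split differs between runs (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea only. `random_split` is a hypothetical helper, not Spark's actual sampling code:

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```

Fixing the seed makes the metrics of the three classifiers comparable across notebook runs.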

### RandomForestClassifier - training & testing

In [None]:
algo = RandomForestClassifier(featuresCol='normFeatures', labelCol='satisfaction')
model = algo.fit(trainingNormData)

In [None]:
predictions = model.transform(testNormData)

Displaying the first 5 classification results:

In [None]:
predictions.select(['satisfaction','prediction', 'probability']).show(5)

### Evaluating classifier quality

Accuracy and classification error

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)
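
The evaluator above returns only the accuracy; the classification error is its complement, `1 - accuracy`. A minimal plain-Python sketch on made-up labels (the helper `accuracy` and the sample values are illustrative, not taken from the dataset):

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

labels      = [1.0, 0.0, 1.0, 1.0]
predictions = [1.0, 0.0, 0.0, 1.0]

acc = accuracy(labels, predictions)
error = 1.0 - acc
print(acc, error)  # 0.75 0.25
```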

F1-score

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)
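
With `metricName='f1'`, `MulticlassClassificationEvaluator` reports an F1 score averaged over the classes, weighted by each class's share of the true labels. The plain-Python sketch below shows that calculation on made-up data; `weighted_f1` is a hypothetical helper, not part of Spark:

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's label frequency."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        predicted = sum(1 for p in predictions if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * count / total
    return score

labels      = [1, 1, 1, 0]
predictions = [1, 1, 0, 0]
f1_score = weighted_f1(labels, predictions)
print(round(f1_score, 4))  # 0.7667
```

Because the weights follow class frequency, on a heavily imbalanced dataset this score is dominated by the majority class.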

AUC

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='satisfaction', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)
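
By default `BinaryClassificationEvaluator` reports the area under the ROC curve. That area equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition on made-up scores; it is an illustration of the metric, not how Spark computes it, and the helper `auc` is hypothetical:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc_value = auc(labels, scores)
print(auc_value)  # 0.75
```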

Confusion Matrix

In [None]:
# MulticlassMetrics expects an RDD of (prediction, label) pairs
metrics = MulticlassMetrics(predictions.select('prediction','satisfaction').rdd)
metrics.confusionMatrix()
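
In the matrix returned by `confusionMatrix()`, rows correspond to the actual classes and columns to the predicted classes, both in ascending label order. The plain-Python sketch below builds a matrix with the same layout from made-up data. `confusion_matrix` is a hypothetical helper, and taking the class list from the union of labels and predictions is an assumption of this sketch:

```python
def confusion_matrix(labels, predictions):
    """Rows are actual classes, columns are predicted classes, sorted ascending."""
    classes = sorted(set(labels) | set(predictions))
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for y, p in zip(labels, predictions):
        matrix[index[y]][index[p]] += 1
    return matrix

labels      = [0, 0, 1, 1, 1]
predictions = [0, 1, 1, 1, 0]
cm = confusion_matrix(labels, predictions)
print(cm)  # [[1, 1], [1, 2]]
```

For a binary label encoded as 0/1, the top-right cell counts false positives and the bottom-left cell counts false negatives.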

### LogisticRegression - training & testing

In [None]:
algo = LogisticRegression(featuresCol='normFeatures', labelCol='satisfaction')
model = algo.fit(trainingNormData)

In [None]:
predictions = model.transform(testNormData)

Displaying the first 5 classification results:

In [None]:
predictions.select(['satisfaction','prediction']).show(5)

### Evaluating classifier quality

Accuracy and classification error

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)

F1-score

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)

AUC

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='satisfaction', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)

Confusion Matrix

In [None]:
metrics = MulticlassMetrics(predictions.select('prediction','satisfaction').rdd)
metrics.confusionMatrix()

### Gradient-Boosted Tree Classifier - training & testing

In [None]:
algo = GBTClassifier(featuresCol='normFeatures', labelCol='satisfaction')
model = algo.fit(trainingNormData)

In [None]:
predictions = model.transform(testNormData)

Displaying the first 20 classification results:

In [None]:
predictions.select(['satisfaction','prediction']).show()

### Evaluating classifier quality

Accuracy and classification error

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='accuracy')
evaluator.evaluate(predictions)

F1-score

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='satisfaction', predictionCol="prediction", metricName='f1')
evaluator.evaluate(predictions)

AUC

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='satisfaction', rawPredictionCol='rawPrediction')
evaluator.evaluate(predictions)

Confusion Matrix

In [None]:
metrics = MulticlassMetrics(predictions.select('prediction','satisfaction').rdd)
metrics.confusionMatrix()