# Cover

ESTBarreiro - IPS
Bioinformatics - January 2023
Year 3, semester 1

Course unit (UC): Big Data

Duarte Valente - 202000053
Guilherme Sa   - 202000201

Instructors: Raquel Barreira and António Gonçalves

# Project

## Install the required packages (via pip)

In [None]:
%pip install pyspark

In [1]:
# Spark Session imports
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Spark ML imports
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier

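## Create the Spark environment
Here we create a Spark environment with the application name CreditCardPrediction, using the local machine as the master (`local[*]`, which runs Spark locally with as many worker threads as there are cores).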

In [3]:
# Run Spark locally, using all available cores
conf = SparkConf().setAppName('CreditCardPrediction').setMaster('local[*]')
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

## Import the data (creditcard.csv)

### Import the data from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = spark.read.format("csv").option("header","true")\
.option("inferSchema","true").load("../content/drive/MyDrive/Colab Notebooks/creditcard.csv")

### Import the data from a local file

In [5]:
df = spark.read.csv("creditcard.csv", header=True, inferSchema=True)

## Inspect the data
We simply look at the data, first with pandas and then with Spark, and check its size and how unevenly the classes are distributed.

In [6]:
df.toPandas().head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [7]:
df.show(5)

[Output truncated: first 5 rows of the Spark DataFrame, with columns Time, V1, ..., V28, Amount, Class]

In [8]:
counts = df.groupBy("Class").count()
counts.show()

+-----+------+
|Class| count|
+-----+------+
|    1|   492|
|    0|284315|
+-----+------+



## Conclusions about the data
After inspecting the data, we draw the following conclusions:
- The dataset has 284,807 rows (492 + 284,315).
- It contains a very large number of non-fraud cases and very few fraud cases (492, about 0.17% of the rows), which can hurt fraud detection because there are too few positive examples to learn from.
- It has 31 columns: Time, 28 features we have no information about (V1 to V28), Amount, and Class, which is the target variable.
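
To make the imbalance concrete, the following sketch recomputes the totals from the `groupBy("Class").count()` output above. Only the two counts come from the notebook; the variable names are illustrative.

```python
# Class counts taken from the groupBy("Class").count() output above.
fraud_count = 492
non_fraud_count = 284315

total = fraud_count + non_fraud_count
fraud_rate = fraud_count / total

# A trivial model that always predicts "not fraud" is right on every
# non-fraud row, so its accuracy equals the non-fraud share of the data.
baseline_accuracy = non_fraud_count / total

print(f"total rows: {total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-negative baseline accuracy: {baseline_accuracy:.4%}")
```

Because the always-negative baseline is already correct on more than 99.8% of the rows, plain accuracy says very little about how well a model detects fraud on this dataset.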

## Data preparation
Here we take every column except the target variable and combine them into a single feature vector for each row.

In [9]:
cols = df.columns
cols.remove('Class')  # every column except the target is a model input

In [10]:
assembler = VectorAssembler(inputCols=cols, outputCol='features')

In [11]:
df = assembler.transform(df)

Now we look at the data after this step. Instead of 31 columns we now have only 2, and we use the `features` column to predict `Class`, since `features` is nothing more than the other 30 columns combined.

In [12]:
df = df.select(['features', 'Class'])
df.show(5)

+--------------------+-----+
|            features|Class|
+--------------------+-----+
|[0.0,-1.359807133...|    0|
|[0.0,1.1918571113...|    0|
|[1.0,-1.358354061...|    0|
|[1.0,-0.966271711...|    0|
|[2.0,-1.158233093...|    0|
+--------------------+-----+
only showing top 5 rows
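
Conceptually, the assembly step concatenates every non-target column of a row into one vector. The plain-Python sketch below mimics that for a single row. `assemble_row` is a hypothetical helper written for illustration, not part of PySpark, and the sample row is shortened to four columns.

```python
def assemble_row(row, target="Class"):
    """Return (features, label): all non-target values in column order, plus the target."""
    features = [float(value) for name, value in row.items() if name != target]
    return features, row[target]


# Shortened example row; the real dataset has Time, V1..V28, Amount and Class.
row = {"Time": 0.0, "V1": -1.359807, "Amount": 149.62, "Class": 0}
features, label = assemble_row(row)
print(features, label)
```

The column order matters: the model only sees positions in the vector, so the same order has to be used for training and for prediction.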



## Apply machine learning models
This project uses two machine learning models, logistic regression and random forest. They were chosen because they are generally good candidates for binary classification, which is the task posed by our dataset.

### Define the train/test split
We split the data randomly, with approximately 70% for training and the remaining 30% for testing.

In [13]:
training, testing = df.randomSplit([0.7, 0.3])
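
Because `randomSplit` is called without a seed, each run produces a different partition, and with only 492 fraud rows the number of frauds that land in the test set can vary between runs. One possible alternative, sketched below in plain Python under the assumption that the labels fit in memory, is to split each class separately so that both sets keep roughly the same fraud ratio. `stratified_split` is a hypothetical helper, not a PySpark function.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Split row indices into train/test so each class keeps about the same share."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy labels with a similar kind of imbalance: 10 frauds among 1000 rows.
labels = [1] * 10 + [0] * 990
train, test = stratified_split(labels)
print(len(train), len(test), sum(labels[i] for i in test))
```

Within the notebook itself, passing a `seed` argument to `randomSplit` would already make the split repeatable, although it would not guarantee the class proportions.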

### Model - Logistic regression

#### Train and test the model

In [35]:
lr = LogisticRegression(labelCol='Class', featuresCol='features')
model = lr.fit(training)  # fit on the training data
predictions = model.transform(testing)  # apply the model to the test data

#### Check the test results

In [36]:
evaluator = BinaryClassificationEvaluator(labelCol='Class', rawPredictionCol='rawPrediction')
auc = evaluator.evaluate(predictions)  # default metric is areaUnderROC, not accuracy
auc

0.9843709695541113

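This gives us a logistic regression model with an area under the ROC curve (AUC) of approximately 0.98 on the test set. This is the default metric of `BinaryClassificationEvaluator`; it is not the classification accuracy.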
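
The value returned by `BinaryClassificationEvaluator` with its default settings is the area under the ROC curve, which measures how well the scores rank fraud above non-fraud rather than the share of correct predictions. A minimal plain-Python sketch of that ranking interpretation follows. `auc_from_scores` is a hypothetical helper and the toy scores are invented for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: every score is below 0.5, so a 0.5 threshold predicts "no fraud"
# for all rows, yet most positives are still ranked above the negatives.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.20, 0.10, 0.30]

auc = auc_from_scores(labels, scores)
accuracy = sum((s >= 0.5) == (y == 1) for y, s in zip(labels, scores)) / len(labels)
print(auc, accuracy)
```

The two numbers answer different questions, which is why a high AUC on its own does not show how many frauds are caught at a given threshold. The pairwise loop is quadratic and only meant for small illustrations.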

#### Prediction table

In [37]:
predictions.toPandas()

Unnamed: 0,features,Class,rawPrediction,probability,prediction
0,"[1.0, -0.966271711572087, -0.185226008082898, ...",0,"[8.490171792654719, -8.490171792654719]","[0.9997945642550379, 0.00020543574496212358]",0.0
1,"[2.0, -1.15823309349523, 0.877736754848451, 1....",0,"[8.416240817831296, -8.416240817831296]","[0.9997788041474247, 0.0002211958525752955]",0.0
2,"[2.0, -0.425965884412454, 0.960523044882985, 1...",0,"[8.611205846484701, -8.611205846484701]","[0.9998179788891919, 0.00018202111080811711]",0.0
3,"[7.0, -0.644269442348146, 1.41796354547385, 1....",0,"[6.683145041097031, -6.683145041097031]","[0.9987497304274873, 0.00125026957251273]",0.0
4,"[10.0, 0.38497821518095, 0.616109459176472, -0...",0,"[8.546848696358913, -8.546848696358913]","[0.9998058817082615, 0.00019411829173854311]",0.0
...,...,...,...,...,...
85345,"[172769.0, -1.02971891697297, -1.1106695791175...",0,"[8.350576559177306, -8.350576559177306]","[0.9997637955445959, 0.00023620445540406543]",0.0
85346,"[172770.0, -0.446950896052929, 1.3022123695422...",0,"[8.027573218039668, -8.027573218039668]","[0.9996737672596948, 0.0003262327403051879]",0.0
85347,"[172775.0, 1.97100223582511, -0.69906730104192...",0,"[9.363038141967401, -9.363038141967401]","[0.9999141684563302, 8.583154366981205e-05]",0.0
85348,"[172778.0, -12.5167318477287, 10.1878180402793...",0,"[27.76378210961446, -27.76378210961446]","[0.9999999999991243, 8.757439218243235e-13]",0.0
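
For binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to the `rawPrediction` margins. The sketch below checks this against the first row of the table above. The numbers are copied from that row, and `sigmoid` is a small helper defined here.

```python
import math


def sigmoid(margin):
    """Logistic function mapping a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


# rawPrediction of the first row in the table above (class 0, class 1).
raw_prediction = [8.490171792654719, -8.490171792654719]
probability = [sigmoid(m) for m in raw_prediction]
print(probability)
```

The result agrees with the `probability` values shown in the table for that row, and the two entries sum to 1.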


### Model - Random forest

#### Train and test the model
Here we limit the trees to a maximum depth of 5 to reduce the risk of overfitting.

In [39]:
rf = RandomForestClassifier(labelCol="Class", featuresCol="features", maxDepth=5)
model = rf.fit(training)  # fit on the training data
predictions = model.transform(testing)  # apply the model to the test data
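
The depth cap bounds how large each tree can grow: a binary tree of depth d has at most 2^d leaves. The following sketch computes those bounds for the `maxDepth` used above; the helper names are illustrative.

```python
def max_leaves(depth):
    """Upper bound on the leaf count of a binary tree of the given depth."""
    return 2 ** depth


def max_nodes(depth):
    """Upper bound on the total node count (internal nodes plus leaves)."""
    return 2 ** (depth + 1) - 1


print(max_leaves(5), max_nodes(5))
```

With at most 32 leaves, each tree can only carve the feature space into a limited number of regions, which is what keeps it from memorizing individual training rows.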

#### Check the test results

In [40]:
evaluator = BinaryClassificationEvaluator(labelCol='Class', rawPredictionCol='rawPrediction')
auc = evaluator.evaluate(predictions)  # default metric is areaUnderROC, not accuracy
auc

0.9596884214755986

This gives us a Random Forest model with an area under the ROC curve (AUC) of approximately 0.96 on the test set; as before, this is the evaluator's default metric rather than the accuracy.
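
Since a single ranking score does not show how many frauds are actually caught, one option is to count the confusion-matrix cells from the `Class` and `prediction` columns and derive precision and recall. The sketch below does this in plain Python on invented toy pairs. `precision_recall` is a hypothetical helper, and the values are not results from this notebook.

```python
def precision_recall(pairs):
    """Compute precision and recall for the positive class from (label, prediction) pairs."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy data: 4 frauds, of which 3 are detected, plus 1 false alarm.
pairs = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
precision, recall = precision_recall(pairs)
print(precision, recall)
```

Recall is the share of real frauds that were flagged, and precision is the share of flagged transactions that were real frauds; both are more informative than accuracy when one class is this rare.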

#### Prediction table

In [41]:
predictions.toPandas()

Unnamed: 0,features,Class,rawPrediction,probability,prediction
0,"[1.0, -0.966271711572087, -0.185226008082898, ...",0,"[19.994562590140994, 0.005437409859006456]","[0.9997281295070497, 0.0002718704929503228]",0.0
1,"[2.0, -1.15823309349523, 0.877736754848451, 1....",0,"[19.995453778181545, 0.004546221818454511]","[0.9997726889090772, 0.00022731109092272557]",0.0
2,"[2.0, -0.425965884412454, 0.960523044882985, 1...",0,"[19.995453778181545, 0.004546221818454511]","[0.9997726889090772, 0.00022731109092272557]",0.0
3,"[7.0, -0.644269442348146, 1.41796354547385, 1....",0,"[19.994571482131782, 0.005428517868218554]","[0.9997285741065891, 0.0002714258934109277]",0.0
4,"[10.0, 0.38497821518095, 0.616109459176472, -0...",0,"[19.995453778181545, 0.004546221818454511]","[0.9997726889090772, 0.00022731109092272557]",0.0
...,...,...,...,...,...
85345,"[172769.0, -1.02971891697297, -1.1106695791175...",0,"[19.99503091576794, 0.0049690842320611935]","[0.9997515457883969, 0.0002484542116030596]",0.0
85346,"[172770.0, -0.446950896052929, 1.3022123695422...",0,"[19.995453778181545, 0.004546221818454511]","[0.9997726889090772, 0.00022731109092272557]",0.0
85347,"[172775.0, 1.97100223582511, -0.69906730104192...",0,"[19.995453778181545, 0.004546221818454511]","[0.9997726889090772, 0.00022731109092272557]",0.0
85348,"[172778.0, -12.5167318477287, 10.1878180402793...",0,"[19.992289348194863, 0.0077106518051382085]","[0.9996144674097431, 0.0003855325902569104]",0.0
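
For the random forest, the two `rawPrediction` entries in the table above sum to 20 in each row, which is consistent with PySpark's default of 20 trees (`numTrees`), and `probability` is that vector divided by its sum. The sketch below checks this on the first row, using numbers copied from the table.

```python
# rawPrediction of the first row in the random forest table above.
raw_prediction = [19.994562590140994, 0.005437409859006456]

total_votes = sum(raw_prediction)  # close to 20, the default number of trees
probability = [value / total_votes for value in raw_prediction]
print(total_votes, probability)
```

Many rows in the table share exactly the same values, which suggests that they fall into the same leaves across the trees.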


# Conclusion
In this project we used PySpark to load, process, and analyze a dataset, and then used MLlib to train and evaluate machine learning models. The models were logistic regression and Random Forest. Both achieved a performance that we consider adequate at predicting whether a transaction was fraudulent. Overall we are happy with our work, although we were not able to present plots explaining the results.