# 1. Introduction
This test is intended for candidates applying to Settlement Analyst positions at
CloudWalk. If you get here, we already like you and see you as a good fit with our
company. Now, we propose a challenge similar to the ones that we face on a daily
basis.
The challenges were created with the objective of helping you build the knowledge base
needed to implement the technical assessment in the end. Enjoy!
  - The first challenge will help you understand better how the payments industry
works.
  - The second challenge is a real world problem.
  - The third challenge is an analysis of hypothetical data.
We expect you to understand our role and challenges facing the financial industry, and
to bring data driven solutions.


# 2. Tasks
## 2.1. Understand the Industry


  

### 1. Explain the money flow, the information flow and the role of the main players in the payment industry.
  - The money flow is the journey of a payment. When a customer buys something with a credit card, the payment gateway encrypts the customer's data and passes the transaction to the payment processor. The processor verifies that the customer's data and the transaction information are correct, then sends the transaction to the customer's bank or card issuer, which approves or denies the transfer of funds. Once the bank approves the transaction, the payment processor relays that information to the payment gateway and requests that the funds be transferred from the customer's bank to the merchant's bank. When the merchant's bank receives the funds, the customer is notified through the payment gateway that the transaction is complete.



  
 


### 2. Explain the difference between acquirer, sub-acquirer and payment gateway, and how the flow explained in the previous question changes for these players.
  - A payment gateway is a system that transmits purchase data from the store's checkout to the companies that process the payment. As the first player in the flow, it sends this information to acquirers, card brands and issuing banks, and then receives a response saying whether the process continues or is cancelled.
  - An acquirer (also called an acquiring bank) is a company that specializes in processing payments, meaning that it processes credit or debit card payments on behalf of a merchant. The acquirer receives the payment information, processes it and passes it to the card brand (when the payment method is credit card) and the issuing bank.
  - A sub-acquirer is a company that processes payments and transmits the generated data to the other players involved in the payment flow. It acts as an intermediary between the merchant and the acquirer.

### 3. Explain what chargebacks are, how they differ from a cancellation and what is their connection with fraud in the acquiring world.
  - A chargeback is the return to the buyer of credit card funds used to make a purchase. It can occur when a consumer disputes a purchase made with their credit card, claiming that it was fraudulent or made without their knowledge or permission.
  - A cancellation happens when the client or the merchant makes a mistake and asks for the payment to be cancelled. Chargebacks, in contrast, were originally created as a means of protecting consumers from fraud.
  - In the acquiring world, the players work to protect clients from purchases made without the buyer’s knowledge or consent.


## 2.2. Solve the problem
  - A client sends you an email asking for a chargeback status. You check the system, and
see that we have received his defense documents and sent them to the issuer, but the
issuer has not accepted our defense. They claim that the cardholder continued to affirm
that she did not receive the product, and our documents were not sufficient to prove
otherwise.
  - You respond to our client informing that the issuer denied the defense, and the next day
he emails you back, extremely angry and disappointed, claiming the product was
delivered and that this chargeback is not right.
  - Considering that the chargeback reason is “Product/Service not provided”, what would
you do in this situation?



# 3. Get your hands dirty
## 3.1 Attached there’s a spreadsheet with hypothetical transactional data. Imagine that you are trying to understand if there is any kind of suspicious behavior.
  1. Analyze the data provided and present your conclusions.
  2. In addition to the spreadsheet data, what other data would you look at to try to find patterns of possible frauds?
  3. Considering your conclusions, what could you do to prevent frauds and/or
chargebacks?

## 3.2 - Solve the problem

*Stop credit card fraud: Implement the concept of a simple anti-fraud.*

An Anti-fraud works by receiving information about a transaction and inferring whether it is a fraudulent transaction or not before authorizing it. 
We work mostly with Ruby and Python, but you can use any programming language that you want. 

Please use the data provided on challenge 2 to test your solution. Consider that transactions with the flag ```has_cbk = true``` are transactions with fraud chargebacks.

Your Anti-fraud must have at least:
1 endpoint that receives transaction data and returns a recommendation to “approve/deny” the transaction.

Example payload:
```json
{
"transaction_id" : 2342357,
"merchant_id" : 29744,
"user_id" : 97051,
"card_number" : "434505******9116",
"transaction_date" : "2019-11-31T23:16:32.812632",
"transaction_amount" : 373,
"device_id" : 285475
}
```
Example response:
```json
{ 
"transaction_id" : 2342357,
"recommendation" : "approve"
}
```
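
As a rough illustration of the request/response contract above, here is a minimal sketch of the handler logic that could sit behind such an endpoint, using only the Python standard library. The names `handle_request`, `always_approve` and `REQUIRED_FIELDS`, and the shape of the error responses, are assumptions made for this example; the challenge does not prescribe them.

```python
import json

# Fields the sketch treats as mandatory (an assumption). device_id is left
# optional because it is missing for some transactions in the sample data.
REQUIRED_FIELDS = (
    "transaction_id", "merchant_id", "user_id", "card_number",
    "transaction_date", "transaction_amount",
)


def always_approve(transaction: dict) -> str:
    """Hypothetical placeholder; real rules or a model would go here."""
    return "approve"


def handle_request(body: str, decide=always_approve) -> str:
    """Parse a JSON payload and return the JSON response body as a string."""
    try:
        transaction = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(transaction, dict):
        return json.dumps({"error": "payload must be a JSON object"})
    missing = [name for name in REQUIRED_FIELDS if name not in transaction]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})
    return json.dumps({
        "transaction_id": transaction["transaction_id"],
        "recommendation": decide(transaction),
    })


payload = json.dumps({
    "transaction_id": 2342357, "merchant_id": 29744, "user_id": 97051,
    "card_number": "434505******9116",
    "transaction_date": "2019-12-01T23:16:32.812632",
    "transaction_amount": 373, "device_id": 285475,
})
print(handle_request(payload))
```

Keeping the decision logic behind a plain function makes it easy to test without a web framework; any HTTP layer would only need to pass the request body in and send the returned string back.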

You are free to determine the methods to approve/deny the transactions, but a few ways to do it are:

- rule-based - you define which cases get approved/denied based on predefined rules;
- score-based - you create a method/model (you could use machine learning models here if you want) to determine the risk score of a transaction and make your decision based on it;
- a combination of both.
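
The options listed above can be combined roughly as follows: hard rules run first, and a score decides the remaining cases. This is only a sketch. The point values, the night-time window, the threshold of 70 and the `blocked_users` set are illustrative assumptions, not values derived from the data.

```python
from datetime import datetime


def risk_score(transaction: dict) -> int:
    """Toy risk score from 0 to 100; every weight here is an assumption."""
    score = 0
    if transaction["transaction_amount"] >= 1000:
        score += 50                      # large ticket
    if transaction.get("device_id") is None:
        score += 20                      # no device information
    hour = datetime.fromisoformat(transaction["transaction_date"]).hour
    if hour >= 22 or hour < 6:
        score += 30                      # late-night purchase
    return score


def recommend(transaction: dict, blocked_users: set, threshold: int = 70) -> str:
    # Rule-based part: a blocked user is denied regardless of the score.
    if transaction["user_id"] in blocked_users:
        return "deny"
    # Score-based part: deny when the score reaches the threshold.
    return "deny" if risk_score(transaction) >= threshold else "approve"


tx = {
    "user_id": 97051,
    "transaction_date": "2019-12-01T23:16:32.812632",
    "transaction_amount": 373,
    "device_id": 285475,
}
print(recommend(tx, blocked_users=set()))  # approve (score is 30)
```

Integer points are used instead of floats so that threshold comparisons are exact.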
 
Things to watch for:
- Latency
- Security
- Architecture
- Coding style

#### Antifraud Requirements

- Reject transaction if user is trying too many transactions in a row;
- Reject transactions above a certain amount in a given period;
- Reject transaction if a user had a chargeback before (note that this information does not come in the payload. The chargeback data is received **days after the transaction was approved**)
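
The three requirements above can be sketched with a small in-memory rule engine. Everything specific here is an assumption for illustration: the class name `RuleEngine`, the limits (3 transactions in 10 minutes, 2000 in 24 hours), and the idea that chargebacks are registered later through a separate `register_chargeback` call. The sketch also assumes transactions arrive in chronological order; a real service would need persistent, shared storage.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class RuleEngine:
    """In-memory sketch of the three rules; all limits are assumed values."""

    def __init__(self, max_tx=3, tx_window=timedelta(minutes=10),
                 max_amount=2000.0, amount_window=timedelta(hours=24)):
        self.max_tx = max_tx
        self.tx_window = tx_window
        self.max_amount = max_amount
        self.amount_window = amount_window
        # user_id -> deque of (timestamp, amount, approved)
        self.history = defaultdict(deque)
        # Filled days after approval, outside the transaction payload.
        self.chargeback_users = set()

    def register_chargeback(self, user_id):
        self.chargeback_users.add(user_id)

    def recommend(self, tx: dict) -> str:
        user = tx["user_id"]
        now = datetime.fromisoformat(tx["transaction_date"])
        amount = float(tx["transaction_amount"])
        history = self.history[user]

        # Forget entries older than the longest window.
        horizon = max(self.tx_window, self.amount_window)
        while history and now - history[0][0] > horizon:
            history.popleft()

        # Every attempt counts for velocity; only approved amounts count as spent.
        attempts = sum(1 for ts, _, _ in history if now - ts <= self.tx_window)
        spent = sum(a for ts, a, ok in history
                    if ok and now - ts <= self.amount_window)

        if user in self.chargeback_users:
            decision = "deny"        # user had a chargeback before
        elif attempts >= self.max_tx:
            decision = "deny"        # too many transactions in a row
        elif spent + amount > self.max_amount:
            decision = "deny"        # above the amount limit for the period
        else:
            decision = "approve"

        history.append((now, amount, decision == "approve"))
        return decision
```

Because the chargeback list is updated independently of the request path, the lookup stays a constant-time set membership check, which helps with the latency concern mentioned earlier.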



# 4. Deliverables

You are expected to submit a compressed git repository with your answers and your project.

We hope you have fun, learn and challenge yourself during this task :)

# Data Analysis

In [4]:
import pandas as pd

# libraries to standardize the data and search for the most representative variables
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE

# train/test split and classification models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score,precision_score

In [5]:
# I used my personal Google Drive to load the provided data
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# reading the CSV file
df = pd.read_csv('/content/drive/MyDrive/Processos Seletivos/transactional-sample.csv')

### Column Meanings
  - transaction_id: identification of the transaction
  - merchant_id: identification of the merchant
  - user_id: identification of the user
  - card_number: number of the card used in the transaction
  - transaction_date: when the transaction took place
  - transaction_amount: the amount of the transaction
  - device_id: the identification of the device used in the transaction (some of them are missing)
  - has_cbk: tells us whether a chargeback happened or not.

In [7]:
df.head()

Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk
0,21320398,29744,97051,434505******9116,2019-12-01T23:16:32.812632,374.56,285475.0,False
1,21320399,92895,2708,444456******4210,2019-12-01T22:45:37.873639,734.87,497105.0,True
2,21320400,47759,14777,425850******7024,2019-12-01T22:22:43.021495,760.36,,False
3,21320401,68657,69758,464296******3991,2019-12-01T21:59:19.797129,2556.13,,True
4,21320402,54075,64367,650487******6116,2019-12-01T21:30:53.347051,55.36,860232.0,False


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   transaction_id      3199 non-null   int64  
 1   merchant_id         3199 non-null   int64  
 2   user_id             3199 non-null   int64  
 3   card_number         3199 non-null   object 
 4   transaction_date    3199 non-null   object 
 5   transaction_amount  3199 non-null   float64
 6   device_id           2369 non-null   float64
 7   has_cbk             3199 non-null   bool   
dtypes: bool(1), float64(2), int64(3), object(2)
memory usage: 178.2+ KB


In [9]:
df[df.device_id.isnull()].has_cbk.describe()

count       830
unique        2
top       False
freq        763
Name: has_cbk, dtype: object
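
To read the summary above as a rate, the arithmetic below compares the chargeback share of transactions without a `device_id` against the rest. The full-dataset counts (3199 rows, 2808 of them without chargeback) are taken from the `has_cbk` summary printed further down in this notebook; the variable names are just for this sketch.

```python
total, total_no_cbk = 3199, 2808        # whole dataset: rows, rows with has_cbk == False
null_dev, null_dev_no_cbk = 830, 763    # rows with missing device_id, from the output above

cbk_total = total - total_no_cbk        # 391 chargebacks overall
cbk_null = null_dev - null_dev_no_cbk   # 67 chargebacks with missing device_id
known_dev = total - null_dev            # 2369 rows with a device_id
cbk_known = cbk_total - cbk_null        # 324 chargebacks with a device_id

print(f"overall chargeback rate:   {cbk_total / total:.1%}")    # 12.2%
print(f"missing device_id:         {cbk_null / null_dev:.1%}")  # 8.1%
print(f"device_id present:         {cbk_known / known_dev:.1%}")  # 13.7%
```

In this sample, a missing `device_id` does not by itself look riskier than a known one, so it is probably better treated as one signal among several than as a rejection rule.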

In [10]:
print("Formato:", df.shape)
print("\n\n\n")
for coluna in df.columns:
  print(coluna + ": ")
  print(df[coluna].unique())
  print(df[coluna].describe())
  display (df[df[coluna].isnull()])
  print("\n\n\n")

Formato: (3199, 8)




transaction_id: 
[21320398 21320399 21320400 ... 21323594 21323595 21323596]
count    3.199000e+03
mean     2.132200e+07
std      9.236161e+02
min      2.132040e+07
25%      2.132120e+07
50%      2.132200e+07
75%      2.132280e+07
max      2.132360e+07
Name: transaction_id, dtype: float64


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






merchant_id: 
[29744 92895 47759 ...   656 52493  9603]
count     3199.000000
mean     48771.128790
std      29100.360839
min         16.000000
25%      23426.000000
50%      48752.000000
75%      73915.000000
max      99799.000000
Name: merchant_id, dtype: float64


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






user_id: 
[97051  2708 14777 ... 59275     7     8]
count     3199.000000
mean     50891.077212
std      29515.282827
min          6.000000
25%      24267.500000
50%      52307.000000
75%      76837.000000
max      99974.000000
Name: user_id, dtype: float64


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






card_number: 
['434505******9116' '444456******4210' '425850******7024' ...
 '528052******3611' '544315******7773' '650487******9884']
count                 3199
unique                2925
top       554482******7640
freq                    10
Name: card_number, dtype: object


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






transaction_date: 
['2019-12-01T23:16:32.812632' '2019-12-01T22:45:37.873639'
 '2019-12-01T22:22:43.021495' ... '2019-11-01T10:23:50.555604'
 '2019-11-01T01:29:45.799767' '2019-11-01T01:27:15.811098']
count                           3199
unique                          3199
top       2019-12-01T23:16:32.812632
freq                               1
Name: transaction_date, dtype: object


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






transaction_amount: 
[3.7456e+02 7.3487e+02 7.6036e+02 ... 1.5500e+00 3.5968e+02 2.4167e+03]
count    3199.000000
mean      767.812904
std       889.095904
min         1.220000
25%       205.235000
50%       415.940000
75%       981.680000
max      4097.210000
Name: transaction_amount, dtype: float64


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk






device_id: 
[2.85475e+05 4.97105e+05         nan ... 2.00000e+00 6.11488e+05
 4.00000e+00]
count      2369.000000
mean     493924.859856
std      283785.584545
min           2.000000
25%      259344.000000
50%      495443.000000
75%      733243.000000
max      999843.000000
Name: device_id, dtype: float64


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk
2,21320400,47759,14777,425850******7024,2019-12-01T22:22:43.021495,760.36,,False
3,21320401,68657,69758,464296******3991,2019-12-01T21:59:19.797129,2556.13,,True
18,21320416,97583,6434,498401******3796,2019-12-01T19:56:26.207496,396.84,,False
21,21320419,3788,18009,550209******5149,2019-12-01T19:44:40.414039,296.27,,False
32,21320430,56977,69758,464296******3991,2019-12-01T19:17:21.731168,2803.32,,True
...,...,...,...,...,...,...,...,...
3194,21323592,50493,49581,650486******4139,2019-11-01T13:05:34.054967,744.15,,False
3195,21323593,9603,59275,528052******3611,2019-11-01T11:47:02.404963,1403.67,,False
3196,21323594,57997,84486,522688******9874,2019-11-01T10:23:50.555604,1.55,,False
3197,21323595,35930,7,544315******7773,2019-11-01T01:29:45.799767,359.68,,False






has_cbk: 
[False  True]
count      3199
unique        2
top       False
freq       2808
Name: has_cbk, dtype: object


Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk








In [11]:
# replacing null values with 0
df.device_id = df.device_id.fillna(0)

In [12]:
# parsing 'transaction_date' into datetime and deriving the date/time parts from it
df['transaction_date_dt'] = pd.to_datetime(df['transaction_date'], errors='coerce')    # converts to datetime; values that cannot be parsed become NaT
df['day'] = df['transaction_date_dt'].dt.day
df['month'] = df['transaction_date_dt'].dt.month
df['year'] = df['transaction_date_dt'].dt.year
df['time'] = df['transaction_date_dt'].dt.time
df['hour'] = df['transaction_date_dt'].dt.hour

In [13]:
# removing the '*' mask from card_number, since the model needs purely numeric values
df.card_number = df.card_number.str.translate({ord(i): None for i in '*'})
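
Removing the asterisks joins the first six digits and the last four digits into a single 10-digit value, which has no numeric meaning of its own. One alternative, shown here only as a sketch, is to keep the two parts separate so they can be treated as categories. The helper name `split_masked_card` is hypothetical and it assumes the masked format seen in this dataset.

```python
def split_masked_card(card_number: str) -> tuple:
    """Split a masked card number such as '434505******9116' into its
    leading digits and its trailing digits."""
    leading, _, rest = card_number.partition("*")
    return leading, rest.lstrip("*")


print(split_masked_card("434505******9116"))  # ('434505', '9116')
```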

# ML models

In [14]:
# separating the target from the feature columns used to test the model
X = df.drop(["has_cbk","transaction_date","transaction_date_dt","day","month","year","time"], axis=1)
y = df.has_cbk

In [15]:
dic_has_cbk = {False: 0, True: 1}
y = y.map(dic_has_cbk)

In [16]:
# standardizing X (zero mean, unit variance)
X_std = StandardScaler().fit_transform(X)
# holding out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size = 0.2, random_state=10)
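
One caveat about the cell above: the scaler is fitted on all rows before the split, so the test rows influence the mean and standard deviation used for training. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below shows the idea with plain Python and made-up numbers; with scikit-learn the equivalent would be calling `fit` on the training data and `transform` on both sets.

```python
from statistics import mean, pstdev

# Made-up amounts, purely for illustration.
train = [100.0, 200.0, 300.0]
test = [400.0]

# Statistics come from the training rows only.
# pstdev is the population standard deviation, which is what StandardScaler uses.
mu, sigma = mean(train), pstdev(train)

train_std = [(x - mu) / sigma for x in train]
test_std = [(x - mu) / sigma for x in test]   # reuses mu and sigma, no refit

print(train_std)
print(test_std)
```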


In [17]:
#first model - DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# confusion_matrix puts the actual classes on the rows and the predicted classes on the columns
df_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, y_pred), 
            columns=["Predicted False","Predicted True"],
            index  =["Actual False", "Actual True"])
df_confusion_matrix

Unnamed: 0,Predicted False,Predicted True
Actual False,510,47
Actual True,33,50


In [18]:
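metrics = pd.DataFrame({"Metrics":["Accuracy","Precision","Specificity","Sensitivity"]})

def finals_metrics(df,m):
  # Uses the global y_test, y_pred and df_confusion_matrix left by the latest model cell;
  # the df argument is not used.
  ac = accuracy_score(y_test, y_pred)
  pre = precision_score(y_test, y_pred)
  # specificity = TN / (TN + FP), taken from the first row of the confusion matrix
  esp = df_confusion_matrix.iloc[0,0] / (df_confusion_matrix.iloc[0,1] + df_confusion_matrix.iloc[0,0])
  # sensitivity (recall) = TP / (TP + FN), taken from the second row
  sens = df_confusion_matrix.iloc[1,1] / (df_confusion_matrix.iloc[1,1] + df_confusion_matrix.iloc[1,0])
  metrics[m] = [ac,pre,esp,sens]
  return metrics

finals_metrics(df,'DecisionTree')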

Unnamed: 0,Metrics,DecisionTree
0,Accuracy,0.875
1,Precision,0.515464
2,Specificity,0.915619
3,Sensitivity,0.60241


In [19]:
##second model - LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# rows are the actual classes, columns are the predicted classes
df_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, y_pred), 
            columns=["Predicted False","Predicted True"],
            index  =["Actual False", "Actual True"])
df_confusion_matrix

Unnamed: 0,Predicted False,Predicted True
Actual False,552,5
Actual True,75,8


In [20]:
finals_metrics(df,'LogisticRegression')

Unnamed: 0,Metrics,DecisionTree,LogisticRegression
0,Accuracy,0.875,0.875
1,Precision,0.515464,0.615385
2,Specificity,0.915619,0.991023
3,Sensitivity,0.60241,0.096386


In [21]:
##third model - GaussianNaiveBayes
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

# rows are the actual classes, columns are the predicted classes
df_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, y_pred), 
            columns=["Predicted False","Predicted True"],
            index  =["Actual False", "Actual True"])
df_confusion_matrix

Unnamed: 0,Predicted False,Predicted True
Actual False,532,25
Actual True,69,14


In [22]:
finals_metrics(df,'GaussianNB')

Unnamed: 0,Metrics,DecisionTree,LogisticRegression,GaussianNB
0,Accuracy,0.875,0.875,0.853125
1,Precision,0.515464,0.615385,0.358974
2,Specificity,0.915619,0.991023,0.955117
3,Sensitivity,0.60241,0.096386,0.168675
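
Accuracy alone is hard to interpret here because the classes are imbalanced: 557 of the 640 test rows have no chargeback. The sketch below recomputes the metrics directly from the three confusion matrices reported above and adds two extra figures, the F1 score and the accuracy of a baseline that always predicts "no chargeback". The function name `summarize` is just for this example.

```python
def summarize(tn, fp, fn, tp):
    """Metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # same as sensitivity
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "baseline_accuracy": (tn + fp) / total,  # always predicting "no chargeback"
    }


# (TN, FP, FN, TP) copied from the confusion matrices above.
matrices = {
    "DecisionTree": (510, 47, 33, 50),
    "LogisticRegression": (552, 5, 75, 8),
    "GaussianNB": (532, 25, 69, 14),
}

for name, cells in matrices.items():
    m = summarize(*cells)
    print(f"{name:20s} acc={m['accuracy']:.3f} "
          f"baseline={m['baseline_accuracy']:.3f} "
          f"recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

Read this way, logistic regression matches the decision tree on accuracy (0.875, against a baseline of about 0.870) but catches only 8 of the 83 chargebacks, while the decision tree catches 50. For a fraud filter, recall and F1 are likely more informative than accuracy.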


In [23]:
#testing with transaction_id
# The scaler is fitted on the full feature matrix, as in the training cell.
# Fitting it on a single row would turn every feature into 0.
scaler = StandardScaler().fit(X)

def transaction_id_pred(id):
   teste = X[X.transaction_id == id]
   X_teste = scaler.transform(teste)
   y_pred = lr.predict(X_teste)
   if y_pred[0] == 0:
    print('Transaction_id:',id,'\nRecommendation:' , 'approve')
   else :
    print('Transaction_id:',id,'\nRecommendation:' , 'deny')


In [24]:
transaction_id_pred(21320398)

Transaction_id: 21320398 
Recommendation: approve


# Anti-Fraud Method

In [25]:
df.head()

Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk,transaction_date_dt,day,month,year,time,hour
0,21320398,29744,97051,4345059116,2019-12-01T23:16:32.812632,374.56,285475.0,False,2019-12-01 23:16:32.812632,1,12,2019,23:16:32.812632,23
1,21320399,92895,2708,4444564210,2019-12-01T22:45:37.873639,734.87,497105.0,True,2019-12-01 22:45:37.873639,1,12,2019,22:45:37.873639,22
2,21320400,47759,14777,4258507024,2019-12-01T22:22:43.021495,760.36,0.0,False,2019-12-01 22:22:43.021495,1,12,2019,22:22:43.021495,22
3,21320401,68657,69758,4642963991,2019-12-01T21:59:19.797129,2556.13,0.0,True,2019-12-01 21:59:19.797129,1,12,2019,21:59:19.797129,21
4,21320402,54075,64367,6504876116,2019-12-01T21:30:53.347051,55.36,860232.0,False,2019-12-01 21:30:53.347051,1,12,2019,21:30:53.347051,21


In [26]:
df.hour[3]

21

In [27]:
df_users = df[df.user_id.duplicated()]

In [28]:
df_card = df_users[df_users.card_number.duplicated()]

In [29]:
df_card

Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk,transaction_date_dt,day,month,year,time,hour
9,21320407,56107,81152,6505169201,2019-12-01T21:04:55.066909,345.68,486.0,True,2019-12-01 21:04:55.066909,1,12,2019,21:04:55.066909,21
34,21320432,49710,5541,6062823381,2019-12-01T19:12:42.641216,2515.13,656429.0,True,2019-12-01 19:12:42.641216,1,12,2019,19:12:42.641216,19
131,21320529,93520,77959,4329577262,2019-12-01T10:59:39.973715,2.34,589318.0,False,2019-12-01 10:59:39.973715,1,12,2019,10:59:39.973715,10
141,21320539,42356,74585,5368057429,2019-12-01T01:50:36.735006,591.58,126381.0,False,2019-12-01 01:50:36.735006,1,12,2019,01:50:36.735006,1
143,21320541,27220,9005,5549063672,2019-12-01T01:49:35.507757,80.45,0.0,False,2019-12-01 01:49:35.507757,1,12,2019,01:49:35.507757,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3120,21323518,62052,18227,5502093098,2019-11-03T22:14:24.384898,963.47,0.0,False,2019-11-03 22:14:24.384898,3,11,2019,22:14:24.384898,22
3141,21323539,8942,76819,5522898870,2019-11-03T16:09:21.333406,2040.02,0.0,True,2019-11-03 16:09:21.333406,3,11,2019,16:09:21.333406,16
3142,21323540,8942,76819,5522898870,2019-11-03T16:08:01.904202,1551.77,0.0,True,2019-11-03 16:08:01.904202,3,11,2019,16:08:01.904202,16
3165,21323563,41354,19820,6062826581,2019-11-02T16:33:21.333131,4031.00,0.0,True,2019-11-02 16:33:21.333131,2,11,2019,16:33:21.333131,16
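
Among the rows printed above, user 76819 made two chargeback transactions with the same card at the same merchant a little over a minute apart. A rough way to surface this kind of burst is to group transactions by user and card and look at the smallest gap between consecutive timestamps. The sketch below uses only the standard library and three rows copied from the output above; the helper name `min_gap_seconds` is made up for this example.

```python
from collections import defaultdict
from datetime import datetime

# (user_id, card_number, transaction_date) copied from the rows above.
rows = [
    (76819, "5522898870", "2019-11-03T16:09:21.333406"),
    (76819, "5522898870", "2019-11-03T16:08:01.904202"),
    (19820, "6062826581", "2019-11-02T16:33:21.333131"),
]

by_user_card = defaultdict(list)
for user_id, card_number, ts in rows:
    by_user_card[(user_id, card_number)].append(datetime.fromisoformat(ts))


def min_gap_seconds(timestamps):
    """Smallest gap between consecutive transactions, or None if only one."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else None


bursts = {key: min_gap_seconds(ts) for key, ts in by_user_card.items()}
for key, gap in bursts.items():
    print(key, gap)
```

Note that `duplicated()` with its default settings marks only the second and later occurrences, so the first transaction of each repeated user or card is not part of `df_users` or `df_card`.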


In my view, we need code that is fast and uses as little information as possible, so it should work only with recent data: perhaps the last 24 hours, perhaps 48 hours, I can't say for sure. It should be very restrictive, a true fine-tooth comb. For transactions that pass this first filter, we can then write code for a broader search that looks up customer histories and builds profiles, so that in the future those profiles can be added to the fast daily check. I am still working on this process. I believe I lack a lot of business knowledge to make some of these ideas concrete, and I also lack some technical knowledge of how this information is actually collected in practice and how long that takes.

In [38]:
transacao_teste = df.iloc[0:1,:]
transacao_teste

Unnamed: 0,transaction_id,merchant_id,user_id,card_number,transaction_date,transaction_amount,device_id,has_cbk,transaction_date_dt,day,month,year,time,hour
0,21320398,29744,97051,4345059116,2019-12-01T23:16:32.812632,374.56,285475.0,False,2019-12-01 23:16:32.812632,1,12,2019,23:16:32.812632,23


In [None]:
for i in range(len(df)):
    

In [30]:
for i in range(len(df_card)):
  if transaction_date_dt[i]

SyntaxError: ignored

In [None]:
from datetime import datetime

In [31]:
a = (df.transaction_date_dt[0] - df.transaction_date_dt[1])

In [32]:
a.seconds

1854
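
A note on the value above: `.seconds` holds only the seconds component of the difference (whole days and microseconds are stored separately), so it is safe for short gaps but misleading for gaps longer than a day. The sketch below reproduces the same difference with the standard library and compares it with `total_seconds()`, using timestamps that appear in the dataset output earlier in this notebook.

```python
from datetime import datetime

t0 = datetime.fromisoformat("2019-12-01T23:16:32.812632")
t1 = datetime.fromisoformat("2019-12-01T22:45:37.873639")

gap = t0 - t1
print(gap.seconds)          # 1854
print(gap.total_seconds())  # 1854.938993

# A gap of about a month: the seconds component ignores the 30 whole days.
oldest = datetime.fromisoformat("2019-11-01T01:27:15.811098")
long_gap = t0 - oldest
print(long_gap.days, long_gap.seconds)
print(long_gap.total_seconds())
```

For any rule based on elapsed time, `total_seconds()` is the safer choice.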

In [None]:
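def testes(index, window=pd.Timedelta(hours=1), max_repeats=3):
  # window and max_repeats are assumed values, not tuned on the data
  row = df.loc[index]
  # Reject transactions above a certain amount in a given period
  if (7 <= row.hour <= 21) and (row.transaction_amount >= 1000):
      print('Transaction_id:', row.transaction_id, '\nRecommendation:', 'deny')
      return
  # Same user_id and same card_number more than 3 times in a short period
  same = df[(df.user_id == row.user_id) & (df.card_number == row.card_number)]
  elapsed = row.transaction_date_dt - same.transaction_date_dt
  recent = same[(elapsed >= pd.Timedelta(0)) & (elapsed <= window)]
  if len(recent) > max_repeats:
      print('Transaction_id:', row.transaction_id, '\nRecommendation:', 'deny')
  else:
      print("ok")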

In [None]:
testes(3)

In [None]:
##second model - DecisionTreeClassifier
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
def transaction_id_pred(id):
   teste = X[X.transaction_id == id]
   X_teste = StandardScaler().fit_transform(teste)
   y_pred = lr.predict(X_teste)
   if y_pred == 0:
    return print('Transaction_id:',id,'\nRecommendation:' , 'approve')
   else :
    return print('Transaction_id:',id,'\nRecommendation:' , 'repprove')

Now we can work at a larger scale.