#  Evaluation

This notebook trains several models and evaluates how effectively they predict fraud cases, using data based on [this Kaggle dataset](https://www.kaggle.com/dalpozz/creditcardfraud).

Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` to `V28` and `Amount`, which is the value of the transaction.

The target is stored in the `Class` column, where the value 1 corresponds to a fraud instance and 0 corresponds to a non-fraud instance.

In [1]:
import numpy as np
import pandas as pd


Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [None]:
def answer_one():
    data_frame = pd.read_csv('fraud_data.csv')
    y = data_frame['Class']

    # Fraction of transactions labelled as fraud (Class == 1)
    result = float((y == 1).mean())

    return result
answer_one()

In [2]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)



Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [None]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score,accuracy_score

    
    # The target is already binary and imbalanced:
    # negative class (0) is 'not fraud', positive class (1) is 'fraud'.
    # The negative class (0) is the most frequent, so the 'most_frequent'
    # dummy classifier always predicts class 0.
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_dummy_predictions = dummy_majority.predict(X_test)

    # Score the constant predictions against the true test labels
    accuracy = accuracy_score(y_test, y_dummy_predictions)
    recall = recall_score(y_test, y_dummy_predictions)
    
    
    return (accuracy,recall)
answer_two()

#### Accuracy ####
# Accuracy is the fraction of all predictions that the model gets right.
# 0.985 looks almost perfect, HOWEVER accuracy is misleading when one
# class makes up the vast majority of the data: always predicting the
# majority class already scores this high.

#### Recall ####
# Recall is the proportion of actual positives (here, actual fraud cases)
# that the model identifies as positive.
# To minimize false negatives, recall should be as close to 100% as possible.
# Here recall is 0: the dummy classifier never flags a single fraud case.
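
The gap between accuracy and recall noted in these comments can be reproduced without the fraud data. The following is a minimal sketch on synthetic labels (1,000 samples with 15 positives, numbers chosen only for illustration and not taken from `fraud_data.csv`), with both metrics computed by hand in NumPy:

```python
import numpy as np

# Synthetic, illustrative labels: 1,000 samples, 15 of them positive (fraud)
y_true = np.zeros(1000, dtype=int)
y_true[:15] = 1

# A majority-class "classifier" predicts 0 for every sample
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of predictions that match the true label
accuracy = (y_pred == y_true).mean()

# Recall: true positives divided by all actual positives
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy, recall)
```

The constant predictor is right on every negative sample, which is nearly all of them, yet it finds none of the positives.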



Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [None]:
def answer_three():
    from sklearn.metrics import accuracy_score, recall_score, precision_score
    from sklearn.svm import SVC
    
    # SVC with default parameters; when no kernel is specified, the default is 'rbf'
    svc_model = SVC().fit(X_train, y_train)
    
    # predict() assigns class 1 where the decision function is positive (threshold 0)
    svm_predicted = svc_model.predict(X_test)

    #Scores
    accuracy = accuracy_score(y_test, svm_predicted)
    recall = recall_score(y_test, svm_predicted)
    precision = precision_score(y_test, svm_predicted)
                  
    
    return (accuracy,recall,precision)
answer_three()

## Precision ##
# Precision measures how many of the predicted positives are correct. So even if we
# managed to capture only one positive case, and captured it correctly, we are 100% precise.
# TO MINIMIZE FALSE POSITIVES: aim for PRECISION AS CLOSE TO 100% AS POSSIBLE.

## Recall ##
# TO MINIMIZE FALSE NEGATIVES: aim for RECALL AS CLOSE TO 100% AS POSSIBLE.
# Our recall is low, which means the model produces false negatives.

## Accuracy ##
# Percentage of correct classifications; it should not be relied on with imbalanced classes.
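
The point about a single correct capture can be checked on a toy example. This sketch uses hypothetical hand-written labels (not the fraud data) and plain Python counts, so it only assumes the standard definitions of precision and recall:

```python
# Hypothetical labels: 4 actual positives among 10 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A very cautious model that flags only one sample, and gets it right
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of flagged samples that are truly positive
recall = tp / (tp + fn)     # share of true positives that were flagged

print(precision, recall)
```

Precision is perfect because nothing was flagged wrongly, while recall is low because three of the four positives were missed.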




Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [3]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    # When no kernel is specified, the default is 'rbf'
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    
    # decision_function returns a confidence score for each test point,
    # indicating how confidently the classifier predicts the positive class:
    # large positive scores for points predicted as positive and
    # large-magnitude negative scores for points predicted as negative.
    y_score = svc.decision_function(X_test)
    
    # Apply a threshold of -220: scores above it are labelled fraud (1)
    y_pred = np.where(y_score > -220, 1, 0)
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    
    

    
    #### Threshold ####
    # The threshold is applied after the model has been trained.
    # It is the cut-off on the model's score that separates the two classes.
    # See https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65
    # For models with predict_proba:
    #   - predict_proba returns the class probabilities
    #   - predict chooses the most likely class
    #   - in the binary case this amounts to a default cut-off of 0.5:
    #     class 1 is predicted if its probability is greater than 0.5,
    #     class 0 if it is less than 0.5
    
    return conf_matrix
answer_four()

array([[5320,   24],
       [  14,   66]])
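
The matrix printed above can be turned back into the metrics used elsewhere in this notebook. A short sketch, assuming scikit-learn's layout for binary problems (rows are true classes, columns are predicted classes, so the cells read `[[TN, FP], [FN, TP]]`):

```python
import numpy as np

# Confusion matrix copied from the output above
conf_matrix = np.array([[5320, 24],
                        [14, 66]])

# Flatten row by row: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
recall = tp / (tp + fn)      # fraud cases caught out of all fraud cases
precision = tp / (tp + fp)   # correct fraud alerts out of all fraud alerts

print(accuracy, recall, precision)
```

Of the 80 fraud cases in the test set, 66 are caught at this threshold, at the cost of 24 false alarms.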



Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (probability it is fraud).

Looking at the precision recall curve, what is the recall when the precision is `0.75`?

Looking at the roc curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [30]:


def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, precision_recall_curve

    lr = LogisticRegression().fit(X_train, y_train)
    # decision_function scores are a monotonic transform of the fraud probability,
    # so they rank the test samples in the same order
    y_score = lr.decision_function(X_test)
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    fpr, tpr, _ = roc_curve(y_test, y_score)

    # Recall when precision is 0.75: an exact floating-point match may not exist,
    # so take the curve point whose precision is closest to 0.75
    recall_at_precision = float(recall[np.argmin(np.abs(precision - 0.75))])

    # True positive rate when the false positive rate is 0.16: there may be no
    # point with fpr exactly 0.16, so either interpolate or, as here, choose
    # the closest value (np.abs takes the absolute value of its argument)
    tpr_at_fpr = float(tpr[np.argmin(np.abs(fpr - 0.16))])

    return (recall_at_precision, tpr_at_fpr)

answer_five()



###ROC CURVES####


### Precision-recall curve ####
# Precision is the number of true positives divided by the sum of
# true positives and false positives. It describes how good the model
# is at predicting the positive class.
# Precision is also referred to as the positive predictive value.

# Recall is the number of true positives divided by the sum of
# true positives and false negatives. Recall is the same as sensitivity.


# Examining precision and recall is useful when the observations are imbalanced
# between the two classes; specifically, when there are many examples of
# no event (class 0) and only a few examples of an event (class 1).

# A precision-recall curve is a plot of precision (y-axis) against recall (x-axis) for different thresholds.

(0.825, 0.9375)
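
When no ROC point sits exactly at the requested false positive rate, the two options are the closest point or linear interpolation. The sketch below compares them on a handful of hypothetical ROC points (made up for illustration, not produced by the model above) and relies on `np.interp`, which expects the x-coordinates to be increasing:

```python
import numpy as np

# Hypothetical ROC points, sorted by increasing false positive rate
fpr = np.array([0.0, 0.1, 0.2, 1.0])
tpr = np.array([0.0, 0.8, 0.9, 1.0])
target_fpr = 0.16

# Option 1: take the TPR of the point whose FPR is closest to the target
tpr_nearest = tpr[np.argmin(np.abs(fpr - target_fpr))]

# Option 2: interpolate linearly between the two neighbouring points
tpr_interpolated = np.interp(target_fpr, fpr, tpr)

print(tpr_nearest, tpr_interpolated)
```

The two answers differ whenever the target falls between points, so it is worth stating which convention a reported value uses.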



Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*

In [26]:
def answer_six():    
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

    # The 'liblinear' solver supports both the 'l1' and 'l2' penalties
    # (the default solver in recent scikit-learn versions does not support 'l1')
    lr = LogisticRegression(solver='liblinear')

    # Use GridSearchCV to find the combination of C and penalty that maximizes recall
    grid_lr_recall = GridSearchCV(lr, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_recall.fit(X_train, y_train)

    # The recall scores for every combination of the parameters in param_grid,
    # per fold and averaged, are stored in cv_results_
    cv_results = pd.DataFrame(grid_lr_recall.cv_results_)

    # Average the three per-fold test scores; rows are values of C, columns are penalties
    split_test_scores = np.vstack((cv_results['split0_test_score'], cv_results['split1_test_score'], cv_results['split2_test_score']))
    mean_scores = split_test_scores.mean(axis=0).reshape(5, 2)
    
    
        
    # The L1 penalty (as in Lasso) adds the L1 norm of the coefficient vector to the loss,
    # i.e. the sum of the absolute values of the coefficients.
    # It tends to drive some coefficients to exactly zero.
    # The L1 norm of a vector can be calculated in NumPy with numpy.linalg.norm and the
    # order set to 1: l1 = norm(a, 1), where a is the array.
    
    
    # The L2 penalty (as in Ridge) adds the squared L2 norm of the coefficient vector to the loss;
    # the L2 norm is the square root of the sum of the squared coefficients.
    # The L2 norm of a vector can be calculated in NumPy with numpy.linalg.norm
    # and its default parameters: l2 = norm(a), where a is the array.
    
    
    return mean_scores
answer_six()

array([[ 0.66666667,  0.76086957],
       [ 0.80072464,  0.80434783],
       [ 0.8115942 ,  0.8115942 ],
       [ 0.80797101,  0.8115942 ],
       [ 0.80797101,  0.80797101]])
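
The `reshape(5, 2)` step only yields the requested table if the flat scores arrive with `C` varying slowest and `penalty` fastest. The sketch below mimics that ordering with `itertools.product`. It rests on an assumption about how scikit-learn enumerates a parameter grid (keys sorted by name, last key varying fastest) and uses placeholder scores 0 to 9 in place of real recall values:

```python
from itertools import product

import numpy as np

grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

# Assumed enumeration order: keys sorted by name, last key varying fastest
keys = sorted(grid_values)  # 'C' sorts before 'penalty'
combos = list(product(*(grid_values[k] for k in keys)))

# Placeholder scores standing in for the mean test scores, in enumeration order
flat_scores = np.arange(len(combos), dtype=float)

# One row per value of C, one column per penalty
table = flat_scores.reshape(5, 2)

for (c, penalty), score in zip(combos, flat_scores):
    print(c, penalty, score)
```

If the enumeration order were different, the reshape would still succeed but the cells would land in the wrong rows and columns, so checking the `params` entry of `cv_results_` against the table is a reasonable safeguard.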