# Predicting Fraudulent Transactions

You’ll try logistic regression on the real-world problem of fraud detection and find out how it fares.

## Instructions:

The instructions for this activity are divided into the following stages:

1. Prepare the data

2. Split the data into training and testing sets

3. Model and fit the data to a logistic regression

4. Predict the testing labels 

5. Calculate the performance metrics

#### Prepare the Data

1. Load the `transaction_fraud_data.csv` file from the `Resources` folder into a Pandas DataFrame. Set the “id” column as the index.

2. Note that you want to predict the `fraud` variable. Answer the following question: Using `value_counts`, how many fraudulent transactions exist in this dataset?

#### Split the Data into Training and Testing Sets

1. Using the `transaction_fraud_data` DataFrame, separate the data into training and testing data. Start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

2. Split the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.

#### Model and Fit the Data to a Logistic Regression

1. Declare a `LogisticRegression` model.

2. Fit the training data to the model, and save the model.

#### Predict the Testing Labels

1. Make predictions about fraud by using the testing dataset, and save those predictions.

#### Calculate the Performance Metrics

1. Calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`. 
2. Answer the following question: For this dataset, how well did the model predict the actually fraudulent transactions?


## Resources:

The following links point to the scikit-learn documentation for the classes and functions used in this activity:

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)


In [23]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression


# Prepare the Data

### Step 1: Load the `transaction_fraud_data.csv` file from the `Resources` folder into a Pandas DataFrame. Set the “id” column as the index.

In [24]:
# Read the transaction_fraud_data.csv file into a Pandas DataFrame.
transaction_fraud_data = Path('../Resources/transaction_fraud_data.csv')

fraud_data = pd.read_csv(
    transaction_fraud_data, 
    index_col='id',     
    )

fraud_data.head()


| id | Z_0 | Z_1 | Z_2 | Z_3 | Z_4 | Z_5 | Z_6 | Z_7 | Z_8 | Z_9 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.346302 | -1.026583 | -10.363716 | -8.05426 | 7.519907 | 1.860217 | 9.056866 | 0.392113 | -12.937505 | -0.801264 | 0 |
| 1 | -3.296776 | -8.4877 | 9.175655 | 1.097409 | -1.766353 | -2.293392 | -2.247549 | -0.041269 | 8.216953 | 8.883102 | 0 |
| 2 | 12.83961 | 4.475612 | -5.213528 | -5.72266 | -4.07339 | -5.661766 | 5.967037 | -9.826743 | -17.443248 | 5.26647 | 0 |
| 3 | 13.237325 | 13.605183 | -5.958039 | 4.392244 | 4.763587 | 3.781628 | -2.722725 | -5.814775 | 11.236515 | 2.582494 | 0 |
| 4 | 4.161311 | 2.520646 | 7.17165 | 1.301273 | -5.40819 | 4.651314 | 9.639546 | 4.648132 | 3.928619 | 2.358164 | 0 |
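
The loading step above can be tried without the CSV file on disk. The following is a minimal sketch that assumes a hypothetical two-row, three-column excerpt of the data held in memory; the name `sample` and the shortened column list are illustrative only. It shows how `index_col='id'` moves the "id" column out of the data columns and into the index.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for transaction_fraud_data.csv;
# only three of the columns are reproduced here.
csv_text = io.StringIO(
    "id,Z_0,Z_1,fraud\n"
    "0,-2.346302,-1.026583,0\n"
    "1,-3.296776,-8.4877,0\n"
)

# index_col="id" uses the "id" column as the row index instead of a data column.
sample = pd.read_csv(csv_text, index_col="id")

print(sample.index.name)     # name of the index
print(list(sample.columns))  # remaining data columns
```

Because "id" becomes the index, it is not available as a feature later on, which is usually what you want for an identifier.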


### Step 2: Answer the following question:

Note that you want to predict the `fraud` variable. Answer the following question: Using `value_counts`, how many fraudulent transactions exist in this dataset?

In [25]:
# The 'fraud' column is the thing you want to predict.
# Class 0 indicates non-fraudulent transactions and class 1 indicates fraudulent transactions.
# Using value_counts, how many fraudulent transactions are in this dataset?
fraud_data['fraud'].value_counts()


0    993
1      7
Name: fraud, dtype: int64
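
The counts above can also be read as proportions, which makes the class imbalance easier to see. This is a small sketch that rebuilds a label column from the two reported counts (993 and 7) instead of loading the file; the name `fraud_labels` is illustrative only.

```python
import pandas as pd

# Rebuild a label column with the class counts reported above (993 and 7).
fraud_labels = pd.Series([0] * 993 + [1] * 7, name="fraud")

# normalize=True returns each class as a fraction of all rows.
class_share = fraud_labels.value_counts(normalize=True)
print(class_share)

fraud_share = class_share.loc[1]
print(f"Fraud share: {fraud_share:.1%}")
```

With only 7 fraudulent rows in 1,000, fraud makes up 0.7% of the data, so a metric that is dominated by the majority class can look good while saying little about fraud.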

# Split the Data into Training and Testing Sets

### Step 1: Using the `transaction_fraud_data` DataFrame, separate the data into training and testing data. Start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

In [26]:
# The target column should be the binary `fraud` column.
target = fraud_data['fraud']


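# The features should be all of the columns except `fraud`.
# Note that this list leaves out `Z_0`; `fraud_data.drop(columns='fraud')` would select every feature column.
features = fraud_data[['Z_1', 'Z_2', 'Z_3', 'Z_4', 'Z_5', 'Z_6', 'Z_7', 'Z_8', 'Z_9']]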


### Step 2: Split the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.

In [27]:
# Split the dataset using the train_test_split function
training_features, testing_features, training_targets, testing_targets = train_test_split(
                                                    features, 
                                                    target, 
                                                    random_state=1, 
                                                    stratify=target)
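
The `stratify=target` argument matters here because fraud is so rare. The sketch below uses synthetic data (random features and a label array with the same 993/7 class counts), so the names `X` and `y` and the feature values are assumptions, not the activity's data. It shows that a stratified split keeps the class proportions roughly equal in both sets, so the few fraud rows are divided between the training and testing sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with 7 positives, mirroring the class counts above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 993 + [1] * 7)

# With no test_size given, train_test_split holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

print("train rows:", len(y_train), "fraud rows:", int(y_train.sum()))
print("test rows:", len(y_test), "fraud rows:", int(y_test.sum()))
```

Without stratification, a random split of so few positives could place all of them in one set, which would leave the other set with nothing to learn from or evaluate against.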


# Model and Fit the Data to a Logistic Regression

### Step 1: Declare a `LogisticRegression` model.

In [28]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(solver='lbfgs', random_state=7)
logistic_regression_model


LogisticRegression(random_state=7)

### Step 2: Fit the training data to the model, and save the model.

In [29]:
# Fit and save the logistic regression model using the training data
lr_model = logistic_regression_model.fit(training_features, training_targets)
lr_model

LogisticRegression(random_state=7)
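
Once a model is fitted, it helps to keep two methods apart: `predict` returns one label per row, while `score` returns a single mean-accuracy number. The following sketch illustrates the difference on a tiny synthetic dataset; the values and the names `model`, `predicted_labels`, and `mean_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset with one feature (hypothetical values).
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(random_state=7).fit(X, y)

predicted_labels = model.predict(X)  # array with one predicted label per row
mean_accuracy = model.score(X, y)    # single float: fraction of correct labels

print(type(predicted_labels).__name__, predicted_labels.shape)
print(type(mean_accuracy).__name__)
```

Metric functions such as `accuracy_score` compare two label arrays, so they need the output of `predict`, not the float returned by `score`.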

# Predict the Testing Labels

### Step 1: Make predictions about fraud by using the testing dataset, and save those predictions.

In [30]:
# Make and save testing predictions with the fitted logistic regression model using the test data
testing_predictions = logistic_regression_model.predict(testing_features)

# Review the model's mean accuracy on the full dataset (training and testing rows combined)
full_data_score = logistic_regression_model.score(features, target)
print(f"Full Data Score: {full_data_score}")

Full Data Score: 0.996


# Calculate the Performance Metrics

### Step 1: Calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`.

In [31]:
# Display the accuracy score for the test dataset.
# accuracy_score expects the true labels first and the predicted labels second.
accuracy_score(testing_targets, testing_predictions)


### Step 2: Answer the following question

**Question:** For this dataset, how well did the model predict the actual fraudulent transactions?

**Answer:** # YOUR ANSWER HERE
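
One way to approach the question above is to compare accuracy with a metric that looks only at the fraud class. The sketch below does not use the fitted model. It assumes a hypothetical baseline that labels every transaction as non-fraud, applied to labels rebuilt from the class counts reported earlier (993 and 7), and shows how high accuracy and zero recall can coexist.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Labels rebuilt from the class counts reported earlier (993 non-fraud, 7 fraud).
true_labels = np.array([0] * 993 + [1] * 7)

# Hypothetical baseline that never flags fraud; this is NOT the fitted model's output.
baseline_predictions = np.zeros_like(true_labels)

baseline_accuracy = accuracy_score(true_labels, baseline_predictions)
fraud_recall = recall_score(true_labels, baseline_predictions)
matrix = confusion_matrix(true_labels, baseline_predictions)

print(f"Accuracy: {baseline_accuracy:.3f}")     # 0.993
print(f"Recall for fraud: {fraud_recall:.3f}")  # 0.000
print(matrix)  # rows are actual classes, columns are predicted classes
```

The same comparison can be made for the real model by passing `testing_targets` and `testing_predictions` to `classification_report`, which lists precision and recall for each class separately.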