DSML investigation:

You are part of the Suisse Impossible Mission Force, or SIMF for short. You need to uncover a rogue agent that is trying to steal sensitive information.

Your mission, should you choose to accept it, is to find that agent before stealing any classified information. Good luck!

# Assignement part four
#### Identifying the suspects' credit score
We received informations that the rogue agent has a good credit score.

Our spies at SIMF have managed to collect financial information relating to our suspects as well as a training dataset.

Create a Neural Network over the training dataset `df` to identify which of the suspects have a good Credit_Mix


## Getting to know our data

* Age: a users age
* Occupation: a users employment field
* Annual_Income: a users annual income
* Monthly_Inh_Salary: the calculated salary received by a given user on a monthly basis
* Num_Bank_Accounts: the number of bank accounts possessed by a given user
* Num_Credit_Cards: the number of credit card given user possesses
* Interest_Rate: The interest rate on those cards (if multiple then its the average)
* Num_of_Loans: The number of loans of each user
* Delay_from_due_date: payment tardiness of user
* Num_of_Delayed_Payment: the count of delayed payments
* Changed_Credit_Limit: changes made to the credit limit for each user's account
* Num_Credit_Inquiries: number of credit inquiries
* Credit_Mix: The users credit score
* Outsting_Debt: Outstanding debt
* Credit_Utilization_Ratio: the percentage of borrowed money over borrowing allowance
* Payment_of_Min_Amount: does the user usually pay the minimal amount (categorical)
* Total_EMI_per_month: Monthly repayments to be made
* Amount_invested_monthly: The amout put in an investment fund by the user on a monthly basis
* Payment_Behaviour: the users payment behavior (categorical)
* Monthly_Balance: The users end of the month balance
* AutoLoan: If the user has an active loan for their vehicule
* Credit-BuilderLoan: If the user has a loan to increase their credit score
* DebtConsolidationLoan, HomeEquityLoan, MortgageLoan, NotSpecified, PaydayLoan, PersonalLoan, StudentLoan: different types of loans(categorical features)



In [577]:
# Import required packages
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

%matplotlib inline

In [578]:
df = pd.read_csv("https://raw.githubusercontent.com/michalis0/MGT-502-Data-Science-and-Machine-Learning/main/data/train_classification.csv", index_col='Unnamed: 0').dropna()
suspects = pd.read_csv("https://raw.githubusercontent.com/michalis0/MGT-502-Data-Science-and-Machine-Learning/main/data/suspects.csv", index_col='Unnamed: 0').dropna()

In [579]:
df.head()

Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inh_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,...,Monthly_Balance,AutoLoan,Credit-BuilderLoan,DebtConsolidationLoan,HomeEquityLoan,MortgageLoan,NotSpecified,PaydayLoan,PersonalLoan,StudentLoan
0,23,Scientist,19114.12,1824.843333,3,4,3,4,3,7,...,186.266702,1,1,0,1,0,0,0,1,0
1,24,Scientist,19114.12,1824.843333,3,4,3,4,3,9,...,361.444004,1,1,0,1,0,0,0,1,0
3,24,Scientist,19114.12,4182.004291,3,4,3,4,4,5,...,343.826873,1,1,0,1,0,0,0,1,0
5,28,Teacher,34847.84,3037.986667,2,4,6,1,3,3,...,303.355083,0,1,0,0,0,0,0,0,0
8,35,Engineer,143162.64,4182.004291,1,5,8,3,8,1942,...,854.226027,2,0,0,0,0,1,0,0,0


In [580]:
df["Credit_Mix"].unique()

array(['Good', 'Standard', 'Bad'], dtype=object)

# 1. Preparing the data
## 1.1 Data cleaning
 Perform One-Hot Encoding over the "Occupation" feature.

 Then, perform Label Encoding over "Payment_of_Min_Amount" and "Payment_Behaviour".

 After performing the one-hot and label encoding, add the encoded features to the data frame and remove the corresponding categorical features.

In [581]:
# One-hot encoding:
df = pd.concat([df, pd.get_dummies(df['Occupation'], prefix='Occupation')], axis=1)
df=df.drop(['Occupation'], axis=1)

#Label encoding
le_min_amount = LabelEncoder()
le_behaviour = LabelEncoder()

# Encoding columns
df['Payment_of_Min_Amount'] = le_min_amount.fit_transform(df['Payment_of_Min_Amount'])
df['Payment_Behaviour'] = le_behaviour.fit_transform(df['Payment_Behaviour'])

In [582]:
df.columns

Index(['Age', 'Annual_Income', 'Monthly_Inh_Salary', 'Num_Bank_Accounts',
       'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
       'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
       'Num_Credit_Inquiries', 'Credit_Mix', 'Outsting_Debt',
       'Credit_Utilization_Ratio', 'Payment_of_Min_Amount',
       'Total_EMI_per_month', 'Amount_invested_monthly', 'Payment_Behaviour',
       'Monthly_Balance', 'AutoLoan', 'Credit-BuilderLoan',
       'DebtConsolidationLoan', 'HomeEquityLoan', 'MortgageLoan',
       'NotSpecified', 'PaydayLoan', 'PersonalLoan', 'StudentLoan',
       'Occupation_Accountant', 'Occupation_Architect', 'Occupation_Developer',
       'Occupation_Doctor', 'Occupation_Engineer', 'Occupation_Entrepreneur',
       'Occupation_Journalist', 'Occupation_Lawyer', 'Occupation_Manager',
       'Occupation_Mechanic', 'Occupation_MediaManager', 'Occupation_Musician',
       'Occupation_Scientist', 'Occupation_Teacher', 'Occupation_Writer'],
      

In [583]:
suspects.columns

Index(['Age', 'Annual_Income', 'Monthly_Inh_Salary', 'Num_Bank_Accounts',
       'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
       'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
       'Num_Credit_Inquiries', 'Outsting_Debt', 'Credit_Utilization_Ratio',
       'Payment_of_Min_Amount', 'Total_EMI_per_month',
       'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance',
       'AutoLoan', 'Credit-BuilderLoan', 'DebtConsolidationLoan',
       'HomeEquityLoan', 'MortgageLoan', 'NotSpecified', 'PaydayLoan',
       'PersonalLoan', 'StudentLoan', 'Occupation_Accountant',
       'Occupation_Architect', 'Occupation_Developer', 'Occupation_Doctor',
       'Occupation_Engineer', 'Occupation_Entrepreneur',
       'Occupation_Journalist', 'Occupation_Lawyer', 'Occupation_Manager',
       'Occupation_Mechanic', 'Occupation_MediaManager', 'Occupation_Musician',
       'Occupation_Scientist', 'Occupation_Teacher', 'Occupation_Writer',
       'userID'],
   

## 1.2 Dataset splitting and rescaling

a) Split the dataset in two, first X with your independent features and then y with the dependent feature **CreditMix**.

b) Split X and y into training and test sets. The training set should contain 80% of the observations, and the test set should contain the remaining 20%. Set random state equal to 42.

c) Then perform :
* Label Encoding over the **CreditMix** feature.
* A MinMaxScaller over all the independent features.

In [584]:
# Split the dataset
X=df[['Age', 'Annual_Income', 'Monthly_Inh_Salary', 'Num_Bank_Accounts',
       'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
       'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
       'Num_Credit_Inquiries', 'Outsting_Debt',
       'Credit_Utilization_Ratio', 'Payment_of_Min_Amount',
       'Total_EMI_per_month', 'Amount_invested_monthly', 'Payment_Behaviour',
       'Monthly_Balance', 'AutoLoan', 'Credit-BuilderLoan',
       'DebtConsolidationLoan', 'HomeEquityLoan', 'MortgageLoan',
       'NotSpecified', 'PaydayLoan', 'PersonalLoan', 'StudentLoan',
       'Occupation_Accountant', 'Occupation_Architect', 'Occupation_Developer',
       'Occupation_Doctor', 'Occupation_Engineer', 'Occupation_Entrepreneur',
       'Occupation_Journalist', 'Occupation_Lawyer', 'Occupation_Manager',
       'Occupation_Mechanic', 'Occupation_MediaManager', 'Occupation_Musician',
       'Occupation_Scientist', 'Occupation_Teacher', 'Occupation_Writer']]
y = df['Credit_Mix'].copy()

#Label encoding over creditmix
le_creditmix = LabelEncoder()
y = le_creditmix.fit_transform(y)
y = pd.DataFrame(y, columns=['Credit_Mix'])

# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 42)

#MinMaxscaler
scaler=MinMaxScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.fit_transform(X_test)
X_scaled=scaler.fit_transform(X)
suspects_scaled=scaler.fit_transform(suspects)
y_scaled=scaler.fit_transform(y)



### 1.2.2 Final touches
Convert your datasets to `Torch tensors` of type `torch.float` for X and `torch.long` for y.

In [585]:
#Your code here:
X_train_scaled_tensor = torch.tensor(X_train_scaled, dtype=torch.float)
X_test_scaled_tensor = torch.tensor(X_test_scaled, dtype=torch.float)
X_scaled_tensor = torch.tensor(X_scaled, dtype=torch.float)
#X_tensor = torch.tensor(X.values, dtype=torch.float)
y_tensor = torch.tensor(y.values, dtype=torch.float)
y_scaled_tensor = torch.tensor(y_scaled, dtype=torch.float)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float)
suspects_scaled_tensor = torch.tensor(suspects_scaled, dtype=torch.float)


In [586]:
suspects2 = suspects.drop(columns=['userID'])
suspects2_scaled=scaler.fit_transform(suspects2)
suspects2_scaled_tensor = torch.tensor(suspects2_scaled, dtype=torch.float)



In [587]:
print(X_train_scaled_tensor.size(), y_train_tensor.size())

torch.Size([23378, 42]) torch.Size([23378, 1])


# 2 Model preparation:

## 2.1 Define a Neural network model and instantiate it.
Set the following parameters:
* `hidden layer` : 1 with 150 nurons;
* `activation function` : ReLU
* `criterion` : [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)

In [588]:
# Define the neural network class
class Net(nn.Module):
    def __init__(self, D_in, H1, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H1)  # Linear transformation for the hidden layer
        self.linear2 = nn.Linear(H1, D_out)  # Linear transformation for the output layer
        self.activation = nn.ReLU()  # Activation function for the hidden layer

    def forward(self, x):
        y_pred = self.activation(self.linear1(x))   # Hidden layer: linear transformation + ReLU
        y_pred = self.linear2(y_pred)               # Output layer: linear transformation
        return y_pred

D_in = X_train_scaled_tensor.shape[1]  # Input dimension (number of features)
H1 = 150  # Number of neurons in the hidden layer
D_out = len(torch.unique(y_train_tensor))  # Output dimension (number of classes)

model = Net(D_in, H1, D_out)

# Initialize the CrossEntropyLoss criterion
criterion = nn.CrossEntropyLoss()

model

Net(
  (linear1): Linear(in_features=42, out_features=150, bias=True)
  (linear2): Linear(in_features=150, out_features=3, bias=True)
  (activation): ReLU()
)

## 2.2 Finding the best model:
Identify, amongst the following options the best parameters for your model:

* `criterion` : [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
* `iterations` : 150, 250, 500, 1000
* `learning rate` : 0.00005, 0.001, 1.049, 12.031

Set `random seed` as torch.manual_seed(42).


_Hint: restart your runtime between each execution to ensure that previous neural networks dont interfere with your current one_

_You can evaluate your model based on it's accuracy over the test set_

In [589]:
if y_train_tensor.dim() > 1:
    y_train_tensor = torch.argmax(y_train_tensor, dim=1)

torch.manual_seed(42)

# Define your Net class here as before

# Initialization parameters
iterations_options = [150, 250, 500, 1000]
learning_rate_options = [0.00005]
best_accuracy = 0
best_params = {}

for iterations in iterations_options:
    for lr in learning_rate_options:
        # Reinitialize your model and optimizer
        model = Net(D_in, 150, D_out)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        for _ in range(iterations):
            optimizer.zero_grad()
            outputs = model(X_train_scaled_tensor)  # Use your entire training dataset
            loss = criterion(outputs, y_train_tensor)
            loss.backward()
            optimizer.step()

        # Calculate accuracy
        with torch.no_grad():
            outputs = model(X_test_scaled_tensor)
            _, predicted_indices = torch.max(outputs, 1)
            prediction = predicted_indices.cpu().numpy()
            correct = y_test_tensor.cpu().numpy()
            accuracy = accuracy_score(correct, prediction)

        print(f'Iterations: {iterations}, Learning Rate: {lr}, Accuracy: {accuracy:.4f}')

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'iterations': iterations, 'learning rate': lr}

print(f'Best Parameters: {best_params}, Best Accuracy: {best_accuracy:.2f}%')

Iterations: 150, Learning Rate: 5e-05, Accuracy: 0.2306
Iterations: 250, Learning Rate: 5e-05, Accuracy: 0.3610
Iterations: 500, Learning Rate: 5e-05, Accuracy: 0.2322
Iterations: 1000, Learning Rate: 5e-05, Accuracy: 0.2296
Best Parameters: {'iterations': 250, 'learning rate': 5e-05}, Best Accuracy: 0.36%


In [590]:
if y_train_tensor.dim() > 1:
    y_train_tensor = torch.argmax(y_train_tensor, dim=1)

torch.manual_seed(42)

# Initialization parameters
iterations_options = [150, 250, 500, 1000]
learning_rate_options = [0.001]
best_accuracy = 0
best_params = {}

for iterations in iterations_options:
    for lr in learning_rate_options:
        # Reinitialize your model and optimizer
        model = Net(D_in, 150, D_out)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        for _ in range(iterations):
            optimizer.zero_grad()
            outputs = model(X_train_scaled_tensor)  # Use your entire training dataset
            loss = criterion(outputs, y_train_tensor)
            loss.backward()
            optimizer.step()

        # Calculate accuracy
        with torch.no_grad():
            outputs = model(X_test_scaled_tensor)
            _, predicted_indices = torch.max(outputs, 1)
            prediction = predicted_indices.cpu().numpy()
            correct = y_test_tensor.cpu().numpy()
            accuracy = accuracy_score(correct, prediction)

        print(f'Iterations: {iterations}, Learning Rate: {lr}, Accuracy: {accuracy:.4f}')

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'iterations': iterations, 'learning rate': lr}

print(f'Best Parameters: {best_params}, Best Accuracy: {best_accuracy:.2f}%')

Iterations: 150, Learning Rate: 0.001, Accuracy: 0.2310
Iterations: 250, Learning Rate: 0.001, Accuracy: 0.2310
Iterations: 500, Learning Rate: 0.001, Accuracy: 0.2310
Iterations: 1000, Learning Rate: 0.001, Accuracy: 0.2310
Best Parameters: {'iterations': 150, 'learning rate': 0.001}, Best Accuracy: 0.23%


In [591]:
if y_train_tensor.dim() > 1:
    y_train_tensor = torch.argmax(y_train_tensor, dim=1)

torch.manual_seed(42)


# Initialization parameters
iterations_options = [150, 250, 500, 1000]
learning_rate_options = [1.049]
best_accuracy = 0
best_params = {}

for iterations in iterations_options:
    for lr in learning_rate_options:
        # Reinitialize your model and optimizer
        model = Net(D_in, 150, D_out)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        for _ in range(iterations):
            optimizer.zero_grad()
            outputs = model(X_train_scaled_tensor)  # Use your entire training dataset
            loss = criterion(outputs, y_train_tensor)
            loss.backward()
            optimizer.step()

        # Calculate accuracy
        with torch.no_grad():
            outputs = model(X_test_scaled_tensor)
            _, predicted_indices = torch.max(outputs, 1)
            prediction = predicted_indices.cpu().numpy()
            correct = y_test_tensor.cpu().numpy()
            accuracy = accuracy_score(correct, prediction)

        print(f'Iterations: {iterations}, Learning Rate: {lr}, Accuracy: {accuracy:.4f}')

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'iterations': iterations, 'learning rate': lr}

print(f'Best Parameters: {best_params}, Best Accuracy: {best_accuracy:.2f}%')

Iterations: 150, Learning Rate: 1.049, Accuracy: 0.2310
Iterations: 250, Learning Rate: 1.049, Accuracy: 0.2310
Iterations: 500, Learning Rate: 1.049, Accuracy: 0.2310
Iterations: 1000, Learning Rate: 1.049, Accuracy: 0.2310
Best Parameters: {'iterations': 150, 'learning rate': 1.049}, Best Accuracy: 0.23%


In [592]:
if y_train_tensor.dim() > 1:
    y_train_tensor = torch.argmax(y_train_tensor, dim=1)

torch.manual_seed(42)


# Initialization parameters
iterations_options = [150, 250, 500, 1000]
learning_rate_options = [12.031]
best_accuracy = 0
best_params = {}

for iterations in iterations_options:
    for lr in learning_rate_options:
        # Reinitialize your model and optimizer
        model = Net(D_in, 150, D_out)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        for _ in range(iterations):
            optimizer.zero_grad()
            outputs = model(X_train_scaled_tensor)  # Use your entire training dataset
            loss = criterion(outputs, y_train_tensor)
            loss.backward()
            optimizer.step()

        # Calculate accuracy
        with torch.no_grad():
            outputs = model(X_test_scaled_tensor)
            _, predicted_indices = torch.max(outputs, 1)
            prediction = predicted_indices.cpu().numpy()
            correct = y_test_tensor.cpu().numpy()
            accuracy = accuracy_score(correct, prediction)

        print(f'Iterations: {iterations}, Learning Rate: {lr}, Accuracy: {accuracy:.4f}')

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'iterations': iterations, 'learning rate': lr}

print(f'Best Parameters: {best_params}, Best Accuracy: {best_accuracy:.2f}%')

Iterations: 150, Learning Rate: 12.031, Accuracy: 0.2310
Iterations: 250, Learning Rate: 12.031, Accuracy: 0.2310
Iterations: 500, Learning Rate: 12.031, Accuracy: 0.2310
Iterations: 1000, Learning Rate: 12.031, Accuracy: 0.2310
Best Parameters: {'iterations': 150, 'learning rate': 12.031}, Best Accuracy: 0.23%


In [593]:
learning_rate_options = [0.00005, 0.001, 1.049, 12.031]
iterations_list= 150
best_accuracy = 0
best_params = {}



for iterations in range(iterations_list):
    for lr in learning_rate_options:
        # Reinitialize your model and optimizer
        model = Net(D_in, 150, D_out)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

    # Forward pass: compute prediction on training set
    y_pred = model(X_train_scaled_tensor)

    # Compute accuracy
    with torch.no_grad():
            outputs = model(X_test_scaled_tensor)
            _, predicted_indices = torch.max(outputs, 1)
            prediction = predicted_indices.cpu().numpy()
            correct = y_test_tensor.cpu().numpy()
            accuracy = accuracy_score(correct, prediction)

    if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'iterations': iterations, 'learning rate': lr}
    #print(f'Iterations: {iterations}, Learning Rate: {lr}, Accuracy: {accuracy:.4f}')


print(f'Best Parameters: {best_params}, Best Accuracy: {best_accuracy:.2f}%')



Best Parameters: {'iterations': 140, 'learning rate': 12.031}, Best Accuracy: 0.52%


*Question 1:*

**Could we use BCELoss instead of CrossEntropyLoss?**

# 3. Predict over the suspects dataset

Now it's time to use the model to make predictions over the suspect dataset!

Use the following parameters:


* `hidden layer` : 1 with 150 neurons
* `output layer` : 3 neurons
* `optimizer` : [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* `criterion` : [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
* `iterations` : 1000
* `learning rate` : 1.049

Set `random seed` as np.random.seed(42)



*Question 2:*

**Why does our model has 3 neurons in the output layer?**

In [594]:
np.random.seed(42)

y_tensor = y_tensor.squeeze().long()

In [595]:
# Define the neural network class
class NeuralNetwork(nn.Module):
    def __init__(self, D_in, H1, D_out):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(D_in, H1)  # Linear transformation for the hidden layer
        self.linear2 = nn.Linear(H1, D_out)  # Linear transformation for the output layer
        self.activation = nn.ReLU()  # Activation function for the hidden layer

    def forward(self, x):
        x = self.activation(self.linear1(x))  # Hidden layer: linear transformation + ReLU
        x = self.linear2(x)  # Output layer: linear transformation
        return x

# Parameters for the model
D_in = 42
H1 = 150
D_out = 3

# Instantiate the model
model = NeuralNetwork(D_in, H1, D_out)

# Define the optimizer and the loss function
optimizer = optim.SGD(model.parameters(), lr=1.049)
criterion = nn.CrossEntropyLoss()

np.random.seed(42)
torch.manual_seed(42)

# Training loop
for epoch in range(1000):  # 1000 iterations
    # Zero the parameter gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_scaled_tensor)
    loss = criterion(outputs, y_tensor)

    # Backward and optimize
    loss.backward()
    optimizer.step()

print('Finished Training')


Finished Training


In [596]:
# After training, set the model to evaluation mode for prediction
model.eval()
with torch.no_grad():
    predictions = model(suspects2_scaled_tensor)
 #_, class_labels = torch.max(y_pred, dim=1)

#class_labels
probabilities = torch.softmax(predictions, dim=1)
predicted_classes = torch.argmax(probabilities, dim=1)
predicted_classes

tensor([2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
        2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 1, 1,
        1, 2, 2, 2, 0, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 0, 0, 2, 2, 0, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 0, 0, 2, 1, 2, 1, 2, 2, 2, 0, 0, 0, 2, 2,
        2, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 2, 2, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 2, 2, 2, 2, 2, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0,
        0, 0, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 1, 2, 2, 2,
        2, 0, 0, 0, 2, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
        2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
        0, 0, 2, 2, 2, 2, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0,
        0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2,

In [597]:
suspects2_scaled_tensor.shape

torch.Size([714, 42])

In [598]:
suspects['Predicted_Credit_Mix'] = predicted_classes
suspects['Predicted_Credit_Mix']

0       2
1       2
3       2
5       2
8       1
       ..
1231    0
1233    2
1235    2
1236    0
1237    0
Name: Predicted_Credit_Mix, Length: 714, dtype: int64

Display the table with two columns: 'userID' and the corresponding predicted Credit_Mix.

In [599]:
final_sus=suspects[['userID', 'Predicted_Credit_Mix']]
final_sus

Unnamed: 0,userID,Predicted_Credit_Mix
0,317991,2
1,241892,2
3,303376,2
5,761992,2
8,373318,1
...,...,...
1231,458293,0
1233,218415,2
1235,173906,2
1236,178685,0


In [600]:
value_counts_percent = y['Credit_Mix'].value_counts(normalize=True) * 100

# Print the percentages
print('Training dataset:', value_counts_percent)

value_counts_percent2 = final_sus['Predicted_Credit_Mix'].value_counts(normalize=True) * 100

# Print the percentages
print('Predicted_Credit_Mix:', value_counts_percent2)

Training dataset: Credit_Mix
2    45.926154
1    30.671047
0    23.402799
Name: proportion, dtype: float64
Predicted_Credit_Mix: Predicted_Credit_Mix
2    57.142857
0    34.313725
1     8.543417
Name: proportion, dtype: float64


As mentioned in the beginning, we have reasons to believe that the suspect had a very good credit score. But we must make no errors, because a lot is at stake. We must be consident in our predictions.

Therefore, we need to analyze not just the predicted category but also how certain the model is about each prediction. Display the probabilities of observations in the 'suspects' dataset falling within the given classes.

_Hint: you can display the probabilities simply as a dataframe, but for better overview you can use visualization tools_

In [601]:
# Your code here
final_sus_filtered = final_sus[final_sus['Predicted_Credit_Mix'] == 2]
final_sus_filtered

Unnamed: 0,userID,Predicted_Credit_Mix
0,317991,2
1,241892,2
3,303376,2
5,761992,2
10,676003,2
...,...,...
1217,866017,2
1218,887305,2
1219,459020,2
1233,218415,2


*Question 3:*

**Which of the following suspects have a good credit mix according to your model's predictions?**


*   200865
*   761992
*   858566
*   862880
*   526987


In [602]:
final_sus_filtered['userID'].isin([200865]).any()

False

In [603]:
final_sus_filtered['userID'].isin([761992]).any()

True

In [604]:
final_sus_filtered['userID'].isin([858566]).any()

False

In [605]:
final_sus_filtered['userID'].isin([862880]).any()

False

In [606]:
final_sus_filtered['userID'].isin([526987]).any()

True