### Introduction

You are a very inexperienced adventurer, just starting off on your first of what is sure to
be many, many adventures! Soon after excitedly leaving your first tavern, you find the flash
drive of a past adventurer who met an unfortunate demise.
On it, there is a calculator (privoro_monster_calc.py) that appears to calculate the
probability of you winning a fight against a monster, given the monster’s information and
the dead adventurer. You also find what appears to be a datasheet containing past examples
(monsters_list.csv) on which the calculator was trained.
While you are happy to have come across this data, given how you found this calculator,
you decide to give the calculator code a once over before using it for yourself, suspecting
that it may have materially contributed to the...deadness of the adventurer you’ve stumbled
across.

### Problem

Your goal is to fix the calculator in any way you see fit. You may change any code and you
are permitted to use numpy, pandas, or scipy if you wish. When changing code, you should
try to make the code more pythonic and concise where possible, yet also maintain readability.


Please do not use libraries like PyTorch, scikit-learn, or tensorflow here, as they are not
needed. Note that there is no single correct answer. The goal is to make the code clean and
readable.
Estimated time to complete this problem: 1-2 hours

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm

In [2]:
def create_dist(data):
    # 1. Commented out the unnecessary creation of two variables
    # mu = np.mean(data)
    # std = np.std(data)
    return norm(np.mean(data), np.std(data))

In [3]:
def prob_of_class(prior, dists, data):
    prob = prior
    for i in range(len(dists)):
        prob *= dists[i].pdf(data[i])

    return prob

In [4]:
def predict(p_w, dists_win, data):
    p_winning = prob_of_class(p_w, dists_win, data)
    p_losing = 1 - p_winning

    answer = 0
    if p_winning > p_losing:
        answer = 1
    
    return answer

In [5]:
def calc_acc(p_w, dists_win, win, lost):
    correct = 0
    total = len(win) + len(lost)
    for d in win:
        guess = predict(p_w, dists_win, d)
        if guess == 1:
            correct += 1
    
    for d in lost:
        guess = predict(p_w, dists_win, d)
        if guess == 0:
            correct += 1

    print(f'Acc: {correct/total}')

In [6]:
#importing the data
monster_df = pd.read_csv("monster_list.csv")

# calculate the prior
# 2. Corrected the column name
winning = monster_df['monster_defeated'].to_numpy()

# 3. Simplified the process of obtaining the winning probability 
# p_winning = 0
# for i in winning:
#     if i == 1:
#         p_winning += 1

# p_winning = p_winning / len(winning)
p_winning = winning.mean()

#split the data
data = monster_df.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
train_win = data[winning == 1]

# 4. Replaced assignment operator with relational operator 
train_lost = data[winning == 0]

# 5. Replaced the identical per-feature blocks with a single comprehension over the columns
win_pdfs = [create_dist(train_win[:, j]) for j in range(train_win.shape[1])]

In [7]:
test_monsters = pd.read_csv('monster_list_test.csv')

# 6. Replaced to_list() with to_numpy() as to_list() is for pd.Series
test_data = test_monsters.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
test_winning = test_monsters['monster_defeated'].to_numpy()

test_win = test_data[test_winning == 1]
test_lost = test_data[test_winning == 0]

#test it out
calc_acc(p_w = p_winning, dists_win = win_pdfs, win = test_win, lost = test_lost)

Acc: 0.42857142857142855


The fixes I made are marked with numbered comments in the code above. Beyond those changes, I also improved readability by adding spacing around operators.

## Bonus

### Fixing Calculator Logic

Note: Please do not complete this section until the previous part is complete; this is a bonus
and not required.
The logic behind the calculator is broken. There is a logical error in the way certain
probabilities are calculated and after fixing it, the test set (monster_list_test.csv) accuracy should be 100%. Note that you do not need to compute priors for each type of monster,
just the prior for winning overall. Fixing the logical error is sufficient to get 100% accuracy
on the test set.

• Fix the logic in the calculator.

• Please write a (very) short description of what you changed and why. 2-3 sentences is
sufficient.

In [8]:
import numpy as np
import pandas as pd
from scipy.stats import norm

In [9]:
def create_dist(data):
    return norm(np.mean(data), np.std(data))

In [10]:
def prob_of_class(prior, dists, data):
    prob = prior
    for i in range(len(dists)):
        prob *= dists[i].pdf(data[i])

    return prob

In [11]:
def predict(p_w, dists_win, dists_lost, data):
    p_winning = prob_of_class(p_w, dists_win, data)
    
    # 1. Modified the way the probability of losing is calculated
    # For more details, please refer to the description below
    p_losing = prob_of_class(1 - p_w, dists_lost, data)

    answer = 0
    if p_winning > p_losing:
        answer = 1
    
    return answer

In [12]:
def calc_acc(p_w, dists_win, dists_lost, win, lost):
    correct = 0
    total = len(win) + len(lost)
    for d in win:
        guess = predict(p_w, dists_win, dists_lost, d)
        if guess == 1:
            correct += 1
    
    for d in lost:
        guess = predict(p_w, dists_win, dists_lost, d)
        if guess == 0:
            correct += 1

    print(f'Acc: {correct/total}')

In [13]:
#importing the data
monster_df = pd.read_csv("monster_list.csv")

# calculate the prior
winning = monster_df['monster_defeated'].to_numpy()
p_winning = winning.mean()

#split the data
data = monster_df.drop(columns = ['monster_name', 'monster_defeated']).to_numpy()
train_win = data[winning == 1]
train_lost = data[winning == 0]

win_pdfs = [create_dist(train_win[:, j]) for j in range(train_win.shape[1])]

# 2. Creating dists for the training data where the adventurer lost
lost_pdfs = [create_dist(train_lost[:, j]) for j in range(train_lost.shape[1])]

In [14]:
test_monsters = pd.read_csv('monster_list_test.csv')
test_data = test_monsters.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
test_winning = test_monsters['monster_defeated'].to_numpy()

test_win = test_data[test_winning == 1]
test_lost = test_data[test_winning == 0]

#test it out
calc_acc(p_w = p_winning, dists_win = win_pdfs, dists_lost = lost_pdfs, win = test_win, lost = test_lost)

Acc: 1.0


The block above contains the fixed calculator logic. The problem with the earlier approach is that `norm.pdf(x)` returns a probability density for each feature, not a probability. To turn a density into a probability we would integrate it over an interval. A simple approximation is to multiply the density by the length of the interval, although this is not exact because the density varies over the interval. For a continuous variable, `P(X = x) = 0` for any single real number x, so the density at a data point is better read as how likely X is to fall close to that value.
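To make the density-versus-probability distinction concrete, here is a minimal sketch. It uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` only to keep the snippet dependency-free, and the mean, standard deviation, and interval are made-up values.

```python
from statistics import NormalDist

# A narrow normal distribution (made-up parameters, not taken from the monster data).
dist = NormalDist(mu=0.0, sigma=0.1)

# The density at the mean is larger than 1, so it cannot be a probability.
density_at_mean = dist.pdf(0.0)

# The probability of landing in a small interval comes from integrating the density,
# which for a normal distribution is a difference of CDF values.
interval_width = 0.1
prob_near_mean = dist.cdf(0.05) - dist.cdf(-0.05)

# Density times interval width is only an approximation of that integral.
approx_prob = density_at_mean * interval_width

print(f"pdf(0)         = {density_at_mean:.3f}")
print(f"P(|X| <= 0.05) = {prob_near_mean:.3f}")
print(f"pdf(0) * width = {approx_prob:.3f}")
```

The approximation overshoots slightly here because the density is at its peak at the center of the interval and lower everywhere else in it.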

Consequently, the `p_winning` computed earlier is not the actual probability of winning. It is the overall prior for winning multiplied by the density of each feature, which is an unnormalized score, and that is why `1 - p_winning` does not give `p_losing`. Since the prior is below 0.5, the product stays below 0.5 unless the densities are unusually large, so all the results were classified as a loss.
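A small numeric sketch of this point follows. The scores are made-up numbers of the form prior times densities, and `posterior_win` is a hypothetical helper that is not part of the calculator.

```python
def posterior_win(score_win, score_lose):
    """Normalize two unnormalized class scores into P(win | features)."""
    return score_win / (score_win + score_lose)

# Made-up scores: prior * product of per-feature densities.
score_win = 0.4 * 0.02 * 0.5     # prior of winning = 0.4
score_lose = 0.6 * 0.001 * 0.1   # prior of losing = 0.6

# Original rule: compare the win score against 1 - win score.
original_prediction = int(score_win > 1 - score_win)

# Fixed rule: compare the win score against the loss score.
fixed_prediction = int(score_win > score_lose)

print("original rule predicts:", original_prediction)
print("fixed rule predicts:   ", fixed_prediction)
print("normalized P(win):     ", round(posterior_win(score_win, score_lose), 3))
```

Only after dividing by the sum of both scores do the two values add up to 1, which is the property the original `1 - p_winning` shortcut assumed.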

To obtain `p_losing`, I fit the mean and standard deviation of each feature on the rows where the adventurer lost to the monster, then repeat the steps used for `p_winning` with the prior set to the overall probability of losing. After computing both `p_winning` and `p_losing`, I compare the two values and assign whichever outcome has the higher one as my prediction.
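The same comparison can be sketched in log space, which avoids underflow when many small densities are multiplied. This is an illustrative variant under stated assumptions, not the notebook's code: `fit_class`, `log_score`, and `predict_outcome` are hypothetical names, and the training data is synthetic.

```python
import numpy as np

def fit_class(samples):
    """Return the per-feature mean and standard deviation for one class."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), samples.std(axis=0)

def log_score(prior, mean, std, x):
    """log(prior) plus the sum of per-feature normal log-densities at x."""
    log_pdf = -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)
    return np.log(prior) + log_pdf.sum()

def predict_outcome(x, prior_win, win_params, lost_params):
    """Return 1 (win) if the win score is higher than the loss score, else 0."""
    win = log_score(prior_win, *win_params, x)
    lost = log_score(1 - prior_win, *lost_params, x)
    return int(win > lost)

# Synthetic data: two features, wins cluster near 1 and losses near 5.
rng = np.random.default_rng(0)
toy_win = rng.normal(loc=1.0, scale=0.5, size=(20, 2))
toy_lost = rng.normal(loc=5.0, scale=0.5, size=(30, 2))
prior_win = len(toy_win) / (len(toy_win) + len(toy_lost))

win_params, lost_params = fit_class(toy_win), fit_class(toy_lost)
near_wins = predict_outcome(np.array([1.2, 0.8]), prior_win, win_params, lost_params)
near_losses = predict_outcome(np.array([4.9, 5.3]), prior_win, win_params, lost_params)
print(near_wins, near_losses)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products.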

### Picking a new classifier

Note: Please do not complete this section until the other parts are complete; this is a bonus
and not required.
In this section, suppose you are given the freedom to implement any kind of classifier
you wish. All libraries are now fair game, including PyTorch, scikit-learn, and tensorflow.

• What method would you use to predict your chances of survival against a given monster,
and how would you (at a high level) set it up?

• Why is that method superior to the (attempted) method in the original calculator?

#### Response

**What method would you use to predict your chances of survival against a given monster, and how would you (at a high level) set it up?**

If we are going to use machine learning models, the objective is to learn a pattern that generalizes beyond the training data. With a small dataset the risk of overfitting is high, which can result in poor test accuracy. To avoid overfitting, we should prefer simple models over complex neural networks, which typically need a large training set to perform well.
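One way to check how well a simple model generalizes on a dataset this small is leave-one-out cross-validation. The sketch below is a suggestion rather than part of the original workflow, and it runs on synthetic data standing in for the monster table (`toy_X` and `toy_y` are made up).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the training table: 30 monsters, 5 numeric features.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(30, 5))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)  # synthetic win/loss labels

# Each fold trains on 29 rows and tests on the single held-out row.
model = LogisticRegression(C=0.1)
scores = cross_val_score(model, toy_X, toy_y, cv=LeaveOneOut())

print(f"Leave-one-out accuracy: {scores.mean():.2f} over {len(scores)} folds")
```

Averaging over every held-out row gives a steadier estimate than a single score on a handful of test monsters.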

I would start with something as simple as logistic regression, a tree-based model like XGBoost, or a combination of the two. We might also model this data with Naive Bayes or a linear SVC.

**Why is that method superior to the (attempted) method in the original calculator?**

The original calculator fits a normal density to each feature and multiplies the densities to score a win or a loss against a monster. Multiplying per-feature densities treats the features as independent, and the fitted mean and standard deviation are sensitive to outliers, so the result can be inaccurate.

Logistic regression, on the other hand, is less affected by this problem. It models the probability of winning directly, and because it is a simple model it is less likely to fit the noise or errors present in the data. L2 regularization helps further by shrinking the coefficients, which keeps them stable when features are correlated.
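The effect of the regularization strength can be seen by fitting the same model with different values of `C`, scikit-learn's inverse regularization strength. This is a minimal sketch on synthetic data (`toy_X` and `toy_y` are made up), using the default L2 penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on the first feature plus some noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)

# Smaller C means stronger regularization and therefore smaller coefficients.
coef_norms = {}
for C in (0.01, 0.1, 1.0, 100.0):
    coef = LogisticRegression(C=C).fit(toy_X, toy_y).coef_
    coef_norms[C] = np.linalg.norm(coef)
    print(f"C={C:>6}: ||coef|| = {coef_norms[C]:.3f}")
```

A small `C` pulls all coefficients toward zero, trading some fit on the training data for a more conservative model.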

A tree-based classifier like XGBoost, on the other hand, can be used with a small maximum depth to avoid overfitting. Its advantages include flexibility, the ability to run in parallel, and built-in handling of missing data.

We may also combine the logistic regression and XGBoost models. Averaging their predictions can reduce variance and so may generalize better.

I'll illustrate the aforementioned approaches below:

#### Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [16]:
monster_df = pd.read_csv("monster_list.csv")
winning = monster_df['monster_defeated'].to_numpy()
data = monster_df.drop(columns = ['monster_name', 'monster_defeated']).to_numpy()

#Using L2 Regularization along with inverse of regularization strength as parameters
#Setting a small value for 'C' will yield stronger regularization
lr = LogisticRegression(penalty = 'l2', C = 0.1)
lr.fit(data, winning)

LogisticRegression(C=0.1)

In [17]:
test_monsters = pd.read_csv('monster_list_test.csv')
test_data = test_monsters.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
test_winning = test_monsters['monster_defeated'].to_numpy()

x = lr.predict(test_data)

In [18]:
accuracy_score(test_winning, x)

1.0

#### XGBoost

In [19]:
from xgboost import XGBClassifier

In [20]:
monster_df = pd.read_csv("monster_list.csv")
winning = monster_df['monster_defeated'].to_numpy()
data = monster_df.drop(columns = ['monster_name', 'monster_defeated']).to_numpy()

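# Using a small max_depth to avoid overfitting; a high gamma plus L1/L2 regularization
# (reg_alpha, reg_lambda) makes the model more conservative. Note that eta=0.8 is a large
# learning rate, which makes boosting less conservative, not more.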
xgb = XGBClassifier(max_depth=2, gamma=2, eta=0.8, reg_alpha=0.5, reg_lambda=0.5)
xgb.fit(data, winning)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eta=0.8, gamma=2, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.800000012,
              max_delta_step=0, max_depth=2, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0.5, reg_lambda=0.5, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [21]:
test_monsters = pd.read_csv('monster_list_test.csv')
test_data = test_monsters.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
test_winning = test_monsters['monster_defeated'].to_numpy()

x = xgb.predict(test_data)

In [22]:
accuracy_score(test_winning, x)

0.7142857142857143

#### Combination of models

In [23]:
monster_df = pd.read_csv("monster_list.csv")
winning = monster_df['monster_defeated'].to_numpy()
data = monster_df.drop(columns = ['monster_name', 'monster_defeated']).to_numpy()

In [24]:
test_monsters = pd.read_csv('monster_list_test.csv')
test_data = test_monsters.drop(columns=['monster_name', 'monster_defeated']).to_numpy()
test_winning = test_monsters['monster_defeated'].to_numpy()

In [25]:
models = [LogisticRegression(penalty = 'l2', C = 0.1),
          XGBClassifier(max_depth=2, gamma=2, eta=0.8, reg_alpha=0.5, reg_lambda=0.5)]

preds = pd.DataFrame()
for i, m in enumerate(models):
    m.fit(data, winning)
    preds[i] = m.predict_proba(test_data)[:,1]

weights = [1, 0.4]
# Calculating the weighted probabilities with weight 1 for Logistic Regression
# and 0.4 for XGBoost
preds['weighted_pred'] = (preds * weights).sum(axis=1) / sum(weights)

#Taking all predictions probabilities over 0.5 as label 1
x = preds['weighted_pred'] > 0.5
x = x.values.astype(int)
accuracy_score(test_winning, x)





1.0