# Generating MAP@K Submissions

Unlike many multiclass classification metrics, MAP@K evaluates multiple predictions for a single row.
It's often used for recommendation systems, where several results are recommended, but it also makes sense for our
vector-borne disease problem: narrowing the possible diagnoses down to 2 or 3 might let us run additional tests efficiently
to arrive at a final diagnosis.

Under MAP@K, if a "relevant" result (in our case, the true prognosis) is surfaced anywhere in the top K predictions (in our case, 3),
the score will be greater than 0. The earlier the true prognosis is surfaced, the higher the score will be.
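
Because each row here has exactly one true prognosis, average precision at K reduces to the reciprocal of the rank at which the true label appears, or 0 if it is missing from the top K. A minimal sketch of that special case follows; the function name `ap_at_k_single` is hypothetical and is not part of any library.

```python
def ap_at_k_single(true_label, ranked_predictions, k=3):
    """AP@K when there is exactly one relevant label per row."""
    for rank, label in enumerate(ranked_predictions[:k], start=1):
        if label == true_label:
            return 1.0 / rank  # earlier hits score higher
    return 0.0  # true label not in the top k


print(ap_at_k_single("Zika", ["Zika", "Dengue", "Malaria"]))    # 1.0
print(ap_at_k_single("Zika", ["Dengue", "Zika", "Malaria"]))    # 0.5
print(ap_at_k_single("Zika", ["Dengue", "Malaria", "Plague"]))  # 0.0
```

MAP@K is then the mean of these per-row values over all rows.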

In this notebook we use a simple LogisticRegression model to predict probabilities for all possible prognoses, and then include the
top 3 most probable prognoses as our predictions. This is just one way to generate predictions for MAP@K, and not all the steps are
strictly necessary (e.g. you _can_ submit more or fewer than 3 predictions, but we always surface 3 here to make it clear that's the
maximum number that will be considered in scoring).

# Set Up 

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
train_df = pd.read_csv('/kaggle/input/playground-series-s3e13/train.csv')
train_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lyme_disease
1,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tungiasis
2,2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,Lyme_disease
3,3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,Rift_Valley_fever


In [3]:
# Encode our prognosis labels as integers for easier decoding later
enc = OrdinalEncoder()
train_df['prognosis'] = enc.fit_transform(train_df[['prognosis']])
train_df['prognosis'][:10] # Note the values here are now ordinal (floats)

0     3.0
1     7.0
2     3.0
3    10.0
4     6.0
5     3.0
6     8.0
7     7.0
8     4.0
9     0.0
Name: prognosis, dtype: float64

In [4]:
# We split out our own test set so we can calculate an example MAP@K
X_train, X_test, y_train, y_test = train_test_split(train_df.drop('prognosis', axis=1), train_df['prognosis'], random_state=42)

# Training a LogisticRegression model and predicting probabilities

In [5]:
# We'll train a simple LogisticRegression model on our training split
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

# Normally lr_clf.predict would just predict the most likely prognosis, but with predict_proba we can get probabilities for all possible prognoses.
predictions = lr_clf.predict_proba(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
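
The warning above means the solver stopped before converging. As a hedged sketch of the remedy the message suggests, the snippet below raises `max_iter` and also leaves an unscaled, ID-like column out of the features. It runs on small synthetic data, not the competition data; the column layout and the value `max_iter=1000` are assumptions for illustration, and whether either change helps on the real data would need checking.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: an id-like first column plus 5 binary symptom-like features
row_ids = np.arange(200, dtype=float).reshape(-1, 1)
X_binary = rng.integers(0, 2, size=(200, 5)).astype(float)
X_full = np.hstack([row_ids, X_binary])
y = rng.integers(0, 3, size=200)  # 3 synthetic classes

# Drop the large-range id-like column so all remaining features share a 0/1 scale
X_features = X_full[:, 1:]

# Allow more iterations than the default of 100
clf = LogisticRegression(max_iter=1000)
clf.fit(X_features, y)

proba = clf.predict_proba(X_features)
print(proba.shape)  # one column per class
```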


# Generating Top K for a single row

In [6]:
# Let's take a look at getting our top 3 prognoses for just a single prediction first
print("Output of predict_proba:")
print(predictions[0])

Output of predict_proba:
[0.12681672 0.1958262  0.28140433 0.04834938 0.07772    0.02590886
 0.06871412 0.04280739 0.05009817 0.04869181 0.03366302]


In [7]:
# We can get the indices of the highest probabilities with argsort
# Note: argsort sorts in ascending order, so we negate the probabilities to get the highest ones first
sorted_prediction_ids = np.argsort(-predictions[0])
print("Indices sorted by probabilities:")
print(sorted_prediction_ids)
# 2 is our first id, and the probability at index 2 from predictions[0] is the highest (0.28140433)

Indices sorted by probabilities:
[ 2  1  0  4  6  8  9  3  7 10  5]
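
As a quick check of the negation trick, the sketch below reuses the probabilities printed above and compares `np.argsort(-p)` with the equivalent "sort ascending, then reverse" spelling. The two agree when there are no tied probabilities; with ties they can order the tied entries differently.

```python
import numpy as np

# Probabilities copied from the predict_proba output above
p = np.array([0.12681672, 0.1958262, 0.28140433, 0.04834938, 0.07772, 0.02590886,
              0.06871412, 0.04280739, 0.05009817, 0.04869181, 0.03366302])

by_negation = np.argsort(-p)       # negate so the largest value sorts first
by_reversal = np.argsort(p)[::-1]  # sort ascending, then reverse

print(by_negation[:3])  # indices of the three most probable classes
print(np.array_equal(by_negation, by_reversal))
```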


In [8]:
# We can grab the top 3 predictions and then use our encoder to turn them back into string labels
top_3_prediction_ids = sorted_prediction_ids[:3]
# inverse_transform expects a 2D array, so we reshape our vector
# (this won't be necessary when we run on multiple predictions at once)
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions
# We got a list of 3 predictions! Great!

array([['Japanese_encephalitis'],
       ['Dengue'],
       ['Chikungunya']], dtype=object)

# Generating Top K for all rows

In [9]:
# Now let's look at doing the above for a whole set of predictions at once:
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column), we save the original shape and restore it after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Japanese_encephalitis', 'Dengue', 'Chikungunya'],
       ['Dengue', 'West_Nile_fever', 'Rift_Valley_fever'],
       ['Rift_Valley_fever', 'Dengue', 'West_Nile_fever'],
       ['Zika', 'Yellow_Fever', 'Japanese_encephalitis'],
       ['Chikungunya', 'Tungiasis', 'Dengue'],
       ['Tungiasis', 'Dengue', 'Rift_Valley_fever'],
       ['West_Nile_fever', 'Plague', 'Zika'],
       ['Rift_Valley_fever', 'West_Nile_fever', 'Japanese_encephalitis'],
       ['Dengue', 'Chikungunya', 'Rift_Valley_fever'],
       ['Zika', 'Yellow_Fever', 'Plague']], dtype=object)
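
An alternative to the reshape round trip is to index an array of labels directly with the 2D array of ids. With a fitted `OrdinalEncoder`, that label array is available as `enc.categories_[0]`. The sketch below uses a hard-coded stand-in for it so that it runs on its own; the label order is inferred from the outputs shown in this notebook and should be treated as an assumption. Like the approach above, it also assumes that column `i` of `predict_proba` corresponds to encoded class `i`, which holds when every class appears in the training split (this can be checked against `lr_clf.classes_`).

```python
import numpy as np

# Stand-in for enc.categories_[0] (order inferred from the notebook outputs)
labels = np.array([
    "Chikungunya", "Dengue", "Japanese_encephalitis", "Lyme_disease",
    "Malaria", "Plague", "Rift_Valley_fever", "Tungiasis",
    "West_Nile_fever", "Yellow_Fever", "Zika",
], dtype=object)

# Two example rows of top-3 class ids
top_3_ids = np.array([[2, 1, 0],
                      [10, 9, 2]])

# Integer-array indexing keeps the (n_rows, 3) shape, so no reshape is needed
top_3_labels = labels[top_3_ids]
print(top_3_labels)
```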

# Calculating MAP@K on our validation set

So now we have our top K (3) predictions, but what is our MAP@K?
To calculate this we'll use the mapk function from the ml_metrics library. Both mapk and its helper apk are pasted below to avoid having to install the package.

In [10]:
# Sourced from the ml_metrics package at https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

In [11]:
# Our MAP@K score here is ~0.3456
mapk(y_test.values.reshape(-1, 1), top_3_prediction_ids, k=3)

0.34557438794726925
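
One caveat with the call above: each element of `actual` is a one-element NumPy array, and `apk` checks `if not actual`. For a one-element array, truthiness follows the value it holds, so a row whose encoded label is `0.0` is treated as empty and scores 0 even when the prediction is correct. The sketch below shows the behaviour and one possible workaround, converting the labels to plain lists first. `apk_compact` is a shortened, hypothetical restatement of `apk` so that the snippet runs on its own. If the held-out split contains rows with label `0.0`, the score reported above may be an underestimate.

```python
import numpy as np

def apk_compact(actual, predicted, k=3):
    """Shortened restatement of apk, for illustration only."""
    predicted = list(predicted)[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:  # same emptiness check as the original apk
        return 0.0
    return score / min(len(actual), k)

y_true = np.array([0.0, 3.0]).reshape(-1, 1)  # encoded labels, one per row
top_ids = np.array([[0, 1, 2], [3, 1, 2]])    # both rows rank the truth first

# Rows passed as one-element arrays vs. as plain Python lists
as_arrays = [apk_compact(a, p) for a, p in zip(y_true, top_ids)]
as_lists = [apk_compact(a, p) for a, p in zip(y_true.tolist(), top_ids)]

print(as_arrays)  # the row with label 0.0 is scored 0
print(as_lists)   # both rows are scored 1
```

With the notebook's variables, the same workaround would be to pass `y_test.values.reshape(-1, 1).tolist()` as `actual`.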

# Generating Test Predictions


In [12]:
test_df = pd.read_csv('/kaggle/input/playground-series-s3e13/test.csv')
test_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
0,707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,708,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,709,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,710,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,711,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# Make predictions
predictions = lr_clf.predict_proba(test_df)

# Get the sorted indices of predictions and take the top 3
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column), we save the original shape and restore it after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Tungiasis', 'Rift_Valley_fever', 'West_Nile_fever'],
       ['Chikungunya', 'Dengue', 'Malaria'],
       ['West_Nile_fever', 'Japanese_encephalitis', 'Tungiasis'],
       ['Japanese_encephalitis', 'Tungiasis', 'Rift_Valley_fever'],
       ['West_Nile_fever', 'Japanese_encephalitis', 'Plague'],
       ['Zika', 'Yellow_Fever', 'Plague'],
       ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'],
       ['Tungiasis', 'Dengue', 'West_Nile_fever'],
       ['Yellow_Fever', 'Zika', 'Japanese_encephalitis'],
       ['Yellow_Fever', 'Plague', 'Zika']], dtype=object)

In [14]:
# Now, to get our array of labels into a single column for our submission, we can just join on a space across axis 1
test_df['prognosis'] = np.apply_along_axis(lambda x: np.array(' '.join(x), dtype="object"), 1, top_3_predictions)
test_df['prognosis'][:10] # Spot check our first 10 values

0          Tungiasis Rift_Valley_fever West_Nile_fever
1                           Chikungunya Dengue Malaria
2      West_Nile_fever Japanese_encephalitis Tungiasis
3    Japanese_encephalitis Tungiasis Rift_Valley_fever
4         West_Nile_fever Japanese_encephalitis Plague
5                             Zika Yellow_Fever Plague
6           Japanese_encephalitis Yellow_Fever Malaria
7                     Tungiasis Dengue West_Nile_fever
8              Yellow_Fever Zika Japanese_encephalitis
9                             Yellow_Fever Plague Zika
Name: prognosis, dtype: object
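
An equivalent and arguably simpler way to build the space-separated column is a list comprehension over the rows. The sketch below uses two rows copied from the output above and also checks that no row repeats a label or exceeds 3 labels; the variable name `prognosis_column` is illustrative.

```python
import numpy as np

# Two rows copied from the decoded test predictions above
top_3_predictions = np.array([
    ["Tungiasis", "Rift_Valley_fever", "West_Nile_fever"],
    ["Chikungunya", "Dengue", "Malaria"],
], dtype=object)

# Join each row's labels with a single space
prognosis_column = [" ".join(row) for row in top_3_predictions]
print(prognosis_column[0])  # Tungiasis Rift_Valley_fever West_Nile_fever

# Sanity check: at most 3 labels per row, none repeated
for line in prognosis_column:
    parts = line.split(" ")
    assert len(parts) <= 3 and len(set(parts)) == len(parts)
```

The resulting list can be assigned to `test_df['prognosis']` directly.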

In [15]:
test_df.to_csv('submission.csv', columns=['id', 'prognosis'], index=False)