# Generating MAP@K Submissions

Unlike many multiclass classification metrics, MAP@K permits the evaluation of multiple predictions for a single row.
It's often used for recommendation systems, where multiple results are recommended, but it also makes sense for our
vector-borne disease problem, as narrowing possible diagnoses down to 2 or 3 might allow us to efficiently run additional tests
to arrive at a final diagnosis.

Under MAP@K, if a "relevant" result (in our case, the true prognosis) is surfaced anywhere in the top K predictions (in our case, 3),
the score will be greater than 0. The earlier the true prognosis is surfaced, the higher the score will be.
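
As a quick illustration of that ranking effect: when each row has exactly one true label, as in this problem, average precision at K reduces to 1/rank of the true label within the top K predictions, or 0 if it is absent. The helper name `single_label_ap_at_k` below is hypothetical and exists only for this sketch; it is not part of any library.

```python
def single_label_ap_at_k(true_label, ranked_predictions, k=3):
    """AP@K when exactly one label is relevant: 1/rank if found in the top k, else 0."""
    for rank, label in enumerate(ranked_predictions[:k], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0


# The true prognosis is 'Zika' in every case; only its position changes.
scores = [
    single_label_ap_at_k("Zika", ["Zika", "Dengue", "Chikungunya"]),             # rank 1
    single_label_ap_at_k("Zika", ["Dengue", "Zika", "Chikungunya"]),             # rank 2
    single_label_ap_at_k("Zika", ["Dengue", "Chikungunya", "Zika"]),             # rank 3
    single_label_ap_at_k("Zika", ["Dengue", "Chikungunya", "Fiebre_Amarilla"]),  # missing
]
print(scores)  # [1.0, 0.5, 0.3333333333333333, 0.0]
```

MAP@K is then the mean of these per-row scores over all rows.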

In this notebook we use a simple LogisticRegression model to predict probabilities for all possible prognoses, and then include the
top 3 most probable prognoses as our predictions. This is just one way to generate predictions for MAP@K, and not all the steps are
strictly necessary (e.g. you _can_ submit more or fewer than 3 predictions, but we always surface 3 here to make it clear that's the
maximum number that will be considered in scoring).

source: https://www.kaggle.com/code/wlifferth/generating-map-k-predictions

# Set Up 

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
import numpy as np

In [18]:
train_df = pd.read_csv('./Data/enfermedades/train-solo-aedes-aegypti.csv')
train_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,3,0,0,1,1,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,Zika
1,11,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Zika
2,16,0,1,1,1,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,Zika
3,17,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fiebre_Amarilla
4,49,1,1,1,0,1,0,1,1,1,...,0,0,0,0,0,0,0,0,0,Fiebre_Amarilla


In [19]:
# Encode our prognosis labels as integers for easier decoding later
enc = OrdinalEncoder()
train_df['prognosis'] = enc.fit_transform(train_df[['prognosis']])
train_df['prognosis'][:10] # Note the values here are now ordinal (floats)

0    3.0
1    3.0
2    3.0
3    2.0
4    2.0
5    2.0
6    3.0
7    3.0
8    2.0
9    1.0
Name: prognosis, dtype: float64
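
To see which label each code stands for, it can help to inspect the fitted encoder directly. The sketch below uses a small toy label array in place of `train_df[['prognosis']]` (the toy values and the name `toy_enc` are assumptions for illustration). `OrdinalEncoder` sorts the categories, so the codes follow alphabetical order, which is consistent with Zika appearing as 3.0 above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy labels standing in for train_df[['prognosis']] (assumed values for illustration)
labels = np.array(
    [["Zika"], ["Fiebre_Amarilla"], ["Dengue"], ["Chikungunya"], ["Zika"]],
    dtype=object,
)

toy_enc = OrdinalEncoder()
codes = toy_enc.fit_transform(labels)

# categories_[0] lists the labels in code order: position 0 is code 0.0, and so on
print(toy_enc.categories_[0])
print(codes.ravel())

# inverse_transform maps codes back to the original strings
decoded = toy_enc.inverse_transform(np.array([[3.0], [2.0]]))
print(decoded.ravel())
```

scikit-learn also provides `LabelEncoder`, which is intended for target labels and returns integers rather than floats; either works for the decoding step used later in this notebook.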

In [20]:
# We split out our own test set so we can calculate an example MAP@K
X_train, X_test, y_train, y_test = train_test_split(train_df.drop('prognosis', axis=1), train_df['prognosis'], random_state=42)

# Training a LogisticRegression model and predicting probabilities

In [21]:
# We'll train a simple LogisticRegression model on our training split
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

# Normally lr_clf.predict would just predict the most likely prognosis, but with predict_proba we can get probabilities for all possible prognoses.
predictions = lr_clf.predict_proba(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
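
The warning above means the solver hit its iteration limit before converging. As the message itself suggests, raising `max_iter` or scaling the features are the usual remedies. One possible contributor in this notebook is that the unscaled `id` column is kept among the features, though that is not verified here. A minimal sketch on synthetic data, where both the data and the value `max_iter=1000` are assumptions rather than tuned settings for this dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the symptom matrix: 200 rows of 10 binary features, 4 classes
X = rng.integers(0, 2, size=(200, 10)).astype(float)
y = rng.integers(0, 4, size=200)

# max_iter=1000 is an illustrative value, not a tuned setting
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

proba = clf.predict_proba(X)
print(clf.n_iter_)   # iterations actually used by the solver
print(proba.shape)   # one probability per class per row
```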


# Generating Top K for a single row

In [22]:
# Let's take a look at getting our top 3 prognoses for just a single prediction first
print("Output of predict_proba:")
print(predictions[0])

Output of predict_proba:
[0.02980985 0.00428756 0.35521485 0.61068775]


In [23]:
# We can get the indices of the highest probabilities with argsort.
# Note argsort sorts in ascending order, but by negating all our values we end up with the highest probabilities first.
sorted_prediction_ids = np.argsort(-predictions[0])
print("Indices sorted by probabilities:")
print(sorted_prediction_ids)
# 3 is our first id, and the probability at index 3 of predictions[0] is the highest (0.61068775)

Indices sorted by probabilities:
[3 2 0 1]
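
The indices returned by `argsort` are column positions in the `predict_proba` output, and scikit-learn orders those columns according to the fitted model's `classes_` attribute. In this notebook the classes are the encoder codes 0.0 to 3.0, so position and code coincide. A hedged sketch of the more general mapping, with a hand-written `classes` array standing in for `lr_clf.classes_`:

```python
import numpy as np

# Probabilities for one row, copied from the predict_proba output shown earlier
row = np.array([0.02980985, 0.00428756, 0.35521485, 0.61068775])

# Stand-in for lr_clf.classes_ (assumed here to be the encoder codes 0.0 to 3.0)
classes = np.array([0.0, 1.0, 2.0, 3.0])

order = np.argsort(-row)           # column positions, most probable first
top_3_codes = classes[order[:3]]   # map positions to class labels explicitly

print(order)        # [3 2 0 1]
print(top_3_codes)  # [3. 2. 0.]
```

Going through `classes_` matters when the class labels are not simply 0..n-1, for example if a class is missing from the training split.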


In [24]:
# We can grab the top 3 predictions and then use our encoder to turn them back into string labels
top_3_prediction_ids = sorted_prediction_ids[:3]
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1)) # inverse_transform expects a 2D array, so we reshape our vector (but this won't be necessary when we run on multiple predictions at once)
top_3_predictions
# We got a list of 3 predictions! Great!

array([['Zika'],
       ['Fiebre_Amarilla'],
       ['Chikungunya']], dtype=object)

# Generating Top K for all rows

In [25]:
# Now let's look at doing the above for a whole set of predictions at once:
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column) we can save the original shape to reshape to after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Zika', 'Fiebre_Amarilla', 'Chikungunya'],
       ['Zika', 'Fiebre_Amarilla', 'Dengue'],
       ['Chikungunya', 'Zika', 'Dengue'],
       ['Chikungunya', 'Dengue', 'Zika'],
       ['Dengue', 'Fiebre_Amarilla', 'Chikungunya'],
       ['Chikungunya', 'Dengue', 'Fiebre_Amarilla'],
       ['Zika', 'Fiebre_Amarilla', 'Dengue'],
       ['Zika', 'Fiebre_Amarilla', 'Chikungunya'],
       ['Fiebre_Amarilla', 'Zika', 'Dengue'],
       ['Chikungunya', 'Fiebre_Amarilla', 'Zika']], dtype=object)
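
An alternative to the reshape, decode, reshape sequence is to index the array of category names directly with the 2D array of ids. The sketch below uses a hand-written `categories` array standing in for `enc.categories_[0]`; its order is an assumption inferred from the decoded outputs above, and the ids are hypothetical.

```python
import numpy as np

# Stand-in for enc.categories_[0]; the order is assumed from the decoded outputs above
categories = np.array(["Chikungunya", "Dengue", "Fiebre_Amarilla", "Zika"], dtype=object)

# Two rows of hypothetical top-3 class codes, shaped like top_3_prediction_ids
top_ids = np.array([[3, 2, 0],
                    [0, 3, 1]])

# Indexing a 1D array with a 2D integer array returns a 2D array of the same shape
top_labels = categories[top_ids]
print(top_labels)
```

This keeps the (rows, 3) shape throughout, so no `original_shape` bookkeeping is needed.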

# Calculating MAP@K on our validation set

So now we have our top K (3) predictions, but what is our MAP@K?
To calculate this we'll use the mapk function from the ml_metrics library. The function is pasted below to avoid having to install the package.

In [26]:
# Sourced from the ml_metrics package at https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

In [27]:
# Our MAP@K score here is ~0.4839
mapk(y_test.values.reshape(-1, 1), top_3_prediction_ids, k=3)

0.4838709677419355
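
One detail worth checking in the call above: `apk` tests `if not actual:`, and here each `actual` is a one-element NumPy array rather than a list. A one-element array holding 0.0 is falsy, so rows whose true class code is 0.0 appear to score 0 regardless of the prediction. The sketch below reproduces this with a compact copy of `apk` on made-up inputs; it is an observation about array truthiness, not a re-run of the notebook.

```python
import numpy as np

def apk(actual, predicted, k=10):
    # Compact copy of the apk function above, same logic
    if len(predicted) > k:
        predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

predicted = np.array([0, 1, 2])  # class code 0 ranked first

as_array = apk(np.array([0.0]), predicted, k=3)  # true label passed as a NumPy array
as_list = apk([0.0], list(predicted), k=3)       # true label passed as a plain list

print(as_array, as_list)  # 0.0 1.0
```

One possible workaround is to pass plain lists, for example `mapk([[v] for v in y_test], top_3_prediction_ids.tolist(), k=3)`. The reported score may change if this is applied.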

# Generating Test Predictions


In [28]:
test_df = pd.read_csv('./Data/enfermedades/test.csv')
test_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
0,707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,708,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,709,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,710,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,711,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# Make predictions
predictions = lr_clf.predict_proba(test_df)

# Get the sorted indices of predictions and take the top 3
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column) we can save the original shape to reshape to after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Fiebre_Amarilla', 'Zika', 'Chikungunya'],
       ['Chikungunya', 'Dengue', 'Zika'],
       ['Fiebre_Amarilla', 'Zika', 'Dengue'],
       ['Fiebre_Amarilla', 'Dengue', 'Zika'],
       ['Zika', 'Fiebre_Amarilla', 'Dengue'],
       ['Zika', 'Fiebre_Amarilla', 'Chikungunya'],
       ['Fiebre_Amarilla', 'Zika', 'Dengue'],
       ['Fiebre_Amarilla', 'Chikungunya', 'Dengue'],
       ['Zika', 'Fiebre_Amarilla', 'Chikungunya'],
       ['Fiebre_Amarilla', 'Zika', 'Chikungunya']], dtype=object)

In [30]:
# Now to get our array of labels into a single column for our submission we can just join on a space across axis 1
test_df['prognosis'] = np.apply_along_axis(lambda x: np.array(' '.join(x), dtype="object"), 1, top_3_predictions)
test_df['prognosis'][:10] # Spot check our first 10 values

0      Fiebre_Amarilla Zika Chikungunya
1               Chikungunya Dengue Zika
2           Fiebre_Amarilla Zika Dengue
3           Fiebre_Amarilla Dengue Zika
4           Zika Fiebre_Amarilla Dengue
5      Zika Fiebre_Amarilla Chikungunya
6           Fiebre_Amarilla Zika Dengue
7    Fiebre_Amarilla Chikungunya Dengue
8      Zika Fiebre_Amarilla Chikungunya
9      Fiebre_Amarilla Zika Chikungunya
Name: prognosis, dtype: object
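
A plain list comprehension is a simpler way to build the same space-separated strings, and it sidesteps the `dtype="object"` handling that `np.apply_along_axis` needs for strings of differing lengths. A small sketch, using two rows copied from the output above in place of the full `top_3_predictions` array:

```python
import numpy as np

# Two rows standing in for top_3_predictions (values taken from the output above)
top_3 = np.array([["Fiebre_Amarilla", "Zika", "Chikungunya"],
                  ["Chikungunya", "Dengue", "Zika"]], dtype=object)

# Join each row's labels with single spaces
joined = [" ".join(row) for row in top_3]
print(joined)
```

The resulting list can be assigned to a DataFrame column directly, as long as its length matches the number of rows.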

In [31]:
#test_df.to_csv('submission.csv', columns=['id', 'prognosis'], index=False)
joblib.dump(enc,'./encoders/enfermedades-solo-encoder.pkl')
joblib.dump(lr_clf,'./modelos/enfermedades-solo-regression-model.pkl')

['./modelos/enfermedades-solo-regression-model.pkl']

In [32]:
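# Reviewing the coefficients of the variables
# Get the coefficients of the trained model. coef_ has one row per class,
# so [0] selects only the first class (encoded 0.0, i.e. Chikungunya).
coefficients = lr_clf.coef_[0]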

# Get the names of the features (variables) in the same order as the coefficients
feature_names = X_train.columns

# Create a dictionary with feature names as keys and their corresponding coefficients as values
coefficients_dict = dict(zip(feature_names, coefficients))

# Sort by absolute value, ascending, so the coefficients closest to zero (the least influential features for this class) come first
sorted_coefficients = sorted(coefficients_dict.items(), key=lambda x: abs(x[1]))

# Print the feature names and their corresponding coefficients in order of increasing magnitude
for feature, coefficient in sorted_coefficients:
    print(f"{feature}: {coefficient}")

yellow_eyes: 0.0007981812300670256
id: 0.0022899037435425893
headache: -0.004615096178124839
pleural_effusion: -0.009259094323878267
breathing_restriction: -0.03965811613396146
finger_inflammation: -0.05687377420240714
gum_bleed: -0.06820831071464888
fatigue: 0.07119257486458486
red_eyes: -0.07385727313337413
orbital_pain: -0.08896545056495594
muscle_pain: -0.11187689623816145
joint_pain: -0.11703711606242378
digestion_trouble: -0.11767179234607923
paralysis: -0.13492150924330998
swelling: -0.13569558443677443
gastro_bleeding: -0.13687511931080562
weakness: -0.13852758466285214
speech_problem: -0.14029326641250175
slow_heart_rate: -0.15271756248071772
microcephaly: -0.15373336279832436
back_pain: -0.1813583375008872
myalgia: -0.18793596443354574
bitter_tongue: -0.19850412245900736
chills: -0.20364087121576593
inflammation: 0.21377637409731481
confusion: -0.21405898419949435
bullseye_rash: -0.21642776299141508
convulsion: -0.21652343853804745
weight_loss: -0.21746221461244164
abdominal_