# Generating MAP@K Submissions

Unlike many multiclass classification metrics, MAP@K permits the evaluation of multiple predictions for a single row.
It's often used for recommendation systems, where multiple results are recommended, but it also makes sense for our
vector-borne disease problem, as narrowing possible diagnoses down to 2 or 3 might allow us to efficiently run additional tests
to arrive at a final diagnosis.

Under MAP@K, if a "relevant" result (in our case, the true prognosis) is surfaced anywhere in the top K predictions (in our case, 3),
the score will be > 0. The earlier the true prognosis is surfaced, the higher the score will be.
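To make the scoring concrete, here is a minimal sketch of how a single row is scored when there is exactly one true label, as in this problem. The helper name `single_label_apk` is hypothetical, and it assumes the predictions contain no duplicates. Under those assumptions the row score reduces to 1/rank of the true label within the top K, or 0 if it is missing.

```python
def single_label_apk(true_label, predicted, k=3):
    """Average precision at k when exactly one label is relevant.

    Assumes `predicted` has no duplicates. Returns 1/rank if the true
    label appears within the first k predictions, otherwise 0.0.
    """
    for rank, label in enumerate(predicted[:k], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0


row_scores = [
    single_label_apk("Zika", ["Zika", "Dengue", "Malaria"]),    # hit at rank 1
    single_label_apk("Zika", ["Dengue", "Zika", "Malaria"]),    # hit at rank 2
    single_label_apk("Zika", ["Dengue", "Malaria", "Zika"]),    # hit at rank 3
    single_label_apk("Zika", ["Dengue", "Malaria", "Plague"]),  # no hit
]
print(row_scores)
```

MAP@K is then the mean of these per-row scores over all rows.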

In this notebook we use a simple LogisticRegression model to predict probabilities for all possible prognoses, and then include the
top 3 most probable prognoses as our predictions. This is just one way to generate predictions for MAP@K, and not all the steps are
strictly necessary (e.g. you _can_ submit more or fewer than 3 predictions, but we always surface 3 here to make it clear that's the
maximum number that will be considered in scoring).

source: https://www.kaggle.com/code/wlifferth/generating-map-k-predictions

# Set Up 

In [369]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
import numpy as np

In [370]:
train_df = pd.read_csv('./Data/enfermedades/train.csv')
train_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,0,1,1,0,1,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,Lyme_disease
1,1,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,Tungiasis
2,2,0,1,1,1,0,1,1,1,1,...,1,1,1,1,1,0,1,1,1,Lyme_disease
3,3,0,0,1,1,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,Zika
4,4,0,0,0,0,0,0,0,0,1,...,0,1,0,0,1,1,1,0,0,Rift_Valley_fever


In [371]:
# Encode our prognosis labels as integers for easier decoding later
enc = OrdinalEncoder()
train_df['prognosis'] = enc.fit_transform(train_df[['prognosis']])
train_df['prognosis'][:10] # Note the values here are now ordinal (floats)

0     3.0
1     7.0
2     3.0
3    10.0
4     6.0
5     3.0
6     8.0
7     7.0
8     4.0
9     0.0
Name: prognosis, dtype: float64
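
The encoding step can be checked in isolation. The sketch below uses toy labels rather than the competition data, and `enc_demo` is an illustrative name. It shows that `OrdinalEncoder` sorts the categories, so each code is the position of its label in `categories_[0]`, and that `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy labels (illustrative only, not the competition data)
labels = np.array([["Zika"], ["Dengue"], ["Malaria"], ["Dengue"]])

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(labels)  # 2D float array with one column

# Categories are stored sorted; a label's code is its index in this array
print(enc_demo.categories_[0])
print(codes.ravel())

# Round trip back to the original string labels
decoded = enc_demo.inverse_transform(codes)
print(decoded.ravel())
```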

In [372]:
# We split out our own test set so we can calculate an example MAP@K
X_train, X_test, y_train, y_test = train_test_split(train_df.drop('prognosis', axis=1), train_df['prognosis'], random_state=20)

# Training a LogisticRegression model and predicting probabilities

In [373]:
# We'll train a simple LogisticRegression model on our training split
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

# Normally lr_clf.predict would just predict the most likely prognosis, but with predict_proba we can get probabilities for all possible prognoses.
predictions = lr_clf.predict_proba(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
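
The warning above means the solver hit its iteration limit before converging. One remedy the message itself suggests is raising `max_iter`. The sketch below does that on synthetic data rather than the competition data; the value 1000 and the `*_demo` names are assumptions for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary features and 4 arbitrary classes (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(float)
y_demo = rng.integers(0, 4, size=200)

# Allow more solver iterations than the default of 100
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(X_demo, y_demo)

# One probability column per class, each row summing to 1
proba_demo = clf_demo.predict_proba(X_demo)
print(proba_demo.shape)
```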


# Generating Top K for a single row

In [374]:
# Let's take a look at getting our top 3 prognoses for just a single prediction first
print("Output of predict_proba:")
print(predictions[0])

Output of predict_proba:
[0.01527665 0.0524777  0.23635498 0.00236329 0.03374978 0.04641326
 0.1760735  0.0096713  0.00691568 0.16684116 0.2538627 ]


In [375]:
# We can get the indices of the highest probabilities with argsort
sorted_prediction_ids = np.argsort(-predictions[0]) # Note argsort sorts in ascending order, but by making all our values negative we'll end up with the highest probabilities first
print("Indices sorted by probabilities:")
print(sorted_prediction_ids)
# 10 is our first id, and the probability at index 10 from predictions[0] is the highest (0.2538627)

Indices sorted by probabilities:
[10  2  6  9  1  5  4  0  7  8  3]
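
The ranking can be reproduced by hand from the probabilities printed earlier. This is a small standalone sketch; `row`, `order`, and `top_3` are illustrative names.

```python
import numpy as np

# Probabilities copied from the predict_proba output printed above
row = np.array([0.01527665, 0.0524777, 0.23635498, 0.00236329, 0.03374978,
                0.04641326, 0.1760735, 0.0096713, 0.00691568, 0.16684116,
                0.2538627])

# Negating the values turns argsort's ascending order into descending order
order = np.argsort(-row)
top_3 = order[:3]

print(top_3)       # indices 10, 2, 6
print(row[top_3])  # their probabilities, largest first
```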


In [376]:
# We can grab the top 3 predictions and then use our encoder to turn them back into string labels
top_3_prediction_ids = sorted_prediction_ids[:3]
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1)) # inverse_transform expects a 2D array, so we reshape our vector (but this won't be necessary when we run on multiple predictions at once)
top_3_predictions
# We got a list of 3 predictions! Great!

array([['Zika'],
       ['Japanese_encephalitis'],
       ['Rift_Valley_fever']], dtype=object)

# Generating Top K for all rows

In [377]:
# Now let's look at doing the above for a whole set of predictions at once:
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column) we can save the original shape to reshape to after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Zika', 'Japanese_encephalitis', 'Rift_Valley_fever'],
       ['Malaria', 'Lyme_disease', 'Plague'],
       ['Chikungunya', 'Dengue', 'Rift_Valley_fever'],
       ['Tungiasis', 'Dengue', 'Rift_Valley_fever'],
       ['Malaria', 'Plague', 'Lyme_disease'],
       ['Dengue', 'Tungiasis', 'Rift_Valley_fever'],
       ['West_Nile_fever', 'Malaria', 'Lyme_disease'],
       ['Yellow_Fever', 'Plague', 'Malaria'],
       ['Lyme_disease', 'Malaria', 'West_Nile_fever'],
       ['Chikungunya', 'Dengue', 'West_Nile_fever']], dtype=object)
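
As an alternative to the reshape round trip, the fitted encoder's `categories_[0]` array can be indexed directly with the integer ids, since each ordinal code is the position of its label in that array. This is a sketch on toy labels; `enc_demo` and `top_ids` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc_demo = OrdinalEncoder()
enc_demo.fit(np.array([["Dengue"], ["Malaria"], ["Plague"], ["Zika"]]))

# Pretend these are the top-3 class indices for two rows
top_ids = np.array([[3, 0, 1],
                    [2, 1, 0]])

# Integer-array indexing keeps the (n_rows, 3) shape, so no reshape is needed
top_labels = enc_demo.categories_[0][top_ids]

# Same result as the reshape / inverse_transform / reshape approach
via_inverse = enc_demo.inverse_transform(top_ids.reshape(-1, 1)).reshape(top_ids.shape)
print(top_labels)
```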

## Calculating MAP@K on our validation set
So now we have our top K (3) predictions, but what is our MAP@K?
To calculate this we'll use the mapk function from the ml_metrics library. The function is pasted below to avoid having to install the package.

In [378]:
# Sourced from the ml_metrics package at https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

In [379]:
# Our MAP@K score here is ~0.307
mapk(y_test.values.reshape(-1, 1), top_3_prediction_ids, k=3)

0.30696798493408667
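
One detail worth checking when calling `apk` with NumPy inputs: the `if not actual` guard is written for Python lists. When `actual` is a one-element array holding the encoded label `0.0`, the guard evaluates to true and the row scores 0 even if the prediction is correct, so rows whose true class is encoded as 0 may be under-counted. The sketch below condenses `apk` and compares array and list inputs. Passing labels as lists is one possible workaround, not necessarily what the original author intended.

```python
import numpy as np

def apk(actual, predicted, k=10):
    # Same logic as the ml_metrics function, condensed for this demo
    if len(predicted) > k:
        predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

predicted = [0, 5, 2]  # class 0 ranked first, which is the true label below

# `not np.array([0.0])` is True, so the guard fires and the hit is discarded
as_array = apk(np.array([0.0]), predicted, k=3)

# A non-empty list is always truthy, so the hit is counted
as_list = apk([0.0], predicted, k=3)

print(as_array, as_list)
```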

# Generating Test Predictions


In [380]:
test_df = pd.read_csv('./Data/enfermedades/test.csv')
test_df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
0,707,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,708,1,1,0,1,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,709,1,1,0,1,1,1,1,0,1,...,0,0,0,0,0,1,0,0,0,0
3,710,0,1,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,711,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [381]:
# Make predictions
predictions = lr_clf.predict_proba(test_df)

# Get the sorted indices of predictions and take the top 3
sorted_prediction_ids = np.argsort(-predictions, axis=1)
top_3_prediction_ids = sorted_prediction_ids[:,:3]

# Because enc.inverse_transform expects a specific shape (a 2D array with 1 column) we can save the original shape to reshape to after decoding
original_shape = top_3_prediction_ids.shape
top_3_predictions = enc.inverse_transform(top_3_prediction_ids.reshape(-1, 1))
top_3_predictions = top_3_predictions.reshape(original_shape)
top_3_predictions[:10] # Spot check our first 10 values

array([['Rift_Valley_fever', 'Zika', 'Japanese_encephalitis'],
       ['Dengue', 'Chikungunya', 'Rift_Valley_fever'],
       ['West_Nile_fever', 'Rift_Valley_fever', 'Tungiasis'],
       ['Japanese_encephalitis', 'Yellow_Fever', 'Rift_Valley_fever'],
       ['West_Nile_fever', 'Plague', 'Japanese_encephalitis'],
       ['Zika', 'Yellow_Fever', 'Plague'],
       ['Malaria', 'Japanese_encephalitis', 'Yellow_Fever'],
       ['Yellow_Fever', 'Dengue', 'Rift_Valley_fever'],
       ['Yellow_Fever', 'Zika', 'Japanese_encephalitis'],
       ['Zika', 'Plague', 'Yellow_Fever']], dtype=object)

In [382]:
# Now to get our array of labels into a single column for our submission, we can just join on a space across axis 1
test_df['prognosis'] = np.apply_along_axis(lambda x: np.array(' '.join(x), dtype="object"), 1, top_3_predictions)
test_df['prognosis'][:10] # Spot check our first 10 values

0         Rift_Valley_fever Zika Japanese_encephalitis
1                 Dengue Chikungunya Rift_Valley_fever
2          West_Nile_fever Rift_Valley_fever Tungiasis
3    Japanese_encephalitis Yellow_Fever Rift_Valley...
4         West_Nile_fever Plague Japanese_encephalitis
5                             Zika Yellow_Fever Plague
6           Malaria Japanese_encephalitis Yellow_Fever
7                Yellow_Fever Dengue Rift_Valley_fever
8              Yellow_Fever Zika Japanese_encephalitis
9                             Zika Plague Yellow_Fever
Name: prognosis, dtype: object
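
The same space-joined column can also be built with a plain list comprehension, which some readers may find easier to follow than `np.apply_along_axis`. This is a sketch on toy data; `top_demo` and `submission_demo` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the (n_rows, 3) array of decoded labels
top_demo = np.array([["Zika", "Plague", "Dengue"],
                     ["Malaria", "Zika", "Plague"]], dtype=object)

submission_demo = pd.DataFrame({"id": [707, 708]})

# Join each row's three labels into one space-separated string
submission_demo["prognosis"] = [" ".join(row) for row in top_demo]
print(submission_demo)
```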

In [383]:
#test_df.to_csv('submission.csv', columns=['id', 'prognosis'], index=False)
joblib.dump(enc,'./encoders/enfermedades-encoder.pkl')
joblib.dump(lr_clf,'./modelos/enfermedades-regression-model.pkl')

['./modelos/enfermedades-regression-model.pkl']
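
To reuse the saved artifacts later, they can be loaded back with `joblib.load`. The sketch below shows the round trip on a toy encoder in a temporary directory; the file name is hypothetical. Note that `joblib.dump` writes to the given path as-is, so target directories such as `./encoders/` need to exist beforehand.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy encoder standing in for the fitted `enc`
enc_demo = OrdinalEncoder().fit(np.array([["Dengue"], ["Zika"]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoder-demo.pkl")  # hypothetical file name
    joblib.dump(enc_demo, path)
    restored = joblib.load(path)

# The restored encoder decodes ids exactly like the original
restored_labels = restored.inverse_transform(np.array([[1], [0]])).ravel().tolist()
print(restored_labels)
```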

In [384]:
# Reviewing the variable coefficients
# coef_ has one row per class; row 0 holds the coefficients for the first class only
coefficients = lr_clf.coef_[0]

# Get the names of the features (variables) in the same order as the coefficients
feature_names = X_train.columns

# Create a dictionary with feature names as keys and their corresponding coefficients as values
coefficients_dict = dict(zip(feature_names, coefficients))

# Sort by absolute coefficient value, ascending, so the features with the least weight come first
sorted_coefficients = sorted(coefficients_dict.items(), key=lambda x: abs(x[1]))

# Print the feature names and their corresponding coefficients in ascending order
for feature, coefficient in sorted_coefficients:
    print(f"{feature}: {coefficient}")

id: 0.001437811652981233
chills: -0.027130389882745687
speech_problem: -0.08175832796469958
bullseye_rash: -0.1121411939371994
gum_bleed: -0.1137043664652582
breathing_restriction: -0.12008118917789878
finger_inflammation: -0.12693585705820074
lips_irritation: -0.16609191081923605
stiff_neck: -0.19950454564881756
swelling: -0.21179341200857557
myalgia: -0.2158798468837443
toe_inflammation: -0.22201589945433192
rigor: -0.23316534567207872
vomiting: 0.23347558418656747
microcephaly: -0.24261443510712472
diarrhea: 0.24691998386401406
fatigue: -0.2478096498284009
hypoglycemia: -0.249416272180911
inflammation: -0.25277922614536263
convulsion: -0.25687604109162093
prostraction: -0.2719473900731455
orbital_pain: -0.28060184839052393
confusion: -0.2881573383769369
pleural_effusion: -0.2953873645336351
lymph_swells: -0.2988994994586052
paralysis: -0.30348776423842433
cocacola_urine: -0.30561543134759145
back_pain: -0.31088878137854853
tremor: -0.3119030115033647
gastro_bleeding: -0.316408615307