# Evaluating the performance of a ranking model in Search Everywhere

In this project, we will evaluate the effectiveness of a ranking model using a sample dataset.
We'll start by exploring the differences between the two experiment groups: 0 and 1.
Then we'll assess the model performance using two metrics: MRR and Time-To-Click. Let's start!

## Testing the Differences Between Groups

In [1]:
import pandas as pd
import json

### Loading and organizing the data

Let's start by loading the dataset. We'll parse the 'event_data' column from JSON and create a unique identifier for each (device, session) pair. Then we'll extract the experiment group from 'event_data' and split the dataset into two groups: 0 and 1.

In [2]:
# Load the dataset
df = pd.read_csv('../data/2024InternshipData.csv')

# Parse the event_data JSON column
df['event_data'] = df['event_data'].apply(json.loads)

# Extract session ID and create a unique identifier for each (device, session) pair
df['session_id'] = df['event_data'].apply(lambda x: x['session_id'])
df['unique_id'] = df['device_id'] + '_' + df['session_id']

# Extract experiment groups into a new column
df['experimentGroup'] = df['event_data'].apply(lambda x: x['experimentGroup'])

# Split the data into experiment groups
groups = {
    0: df[df['experimentGroup'] == 0],
    1: df[df['experimentGroup'] == 1]
}
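
Before comparing the groups, it can be worth checking that every (device, session) pair belongs to exactly one experiment group. The sketch below shows one way to do that on a small, hypothetical toy frame (`toy` and its rows are made up for illustration and only mimic the columns used above); the same `groupby(...).nunique()` check could be run on `df`.

```python
import json
import pandas as pd

# Hypothetical toy events that mimic the columns used above (not real data)
toy = pd.DataFrame({
    'device_id': ['d1', 'd1', 'd2', 'd2'],
    'event_data': [
        '{"session_id": "s1", "experimentGroup": 0}',
        '{"session_id": "s1", "experimentGroup": 0}',
        '{"session_id": "s1", "experimentGroup": 1}',
        '{"session_id": "s2", "experimentGroup": 1}',
    ],
})

# Same preparation steps as in the cell above
toy['event_data'] = toy['event_data'].apply(json.loads)
toy['session_id'] = toy['event_data'].apply(lambda x: x['session_id'])
toy['unique_id'] = toy['device_id'] + '_' + toy['session_id']
toy['experimentGroup'] = toy['event_data'].apply(lambda x: x['experimentGroup'])

# Number of distinct groups seen per session; a value above 1 means a
# session appears in both experiment groups
groups_per_session = toy.groupby('unique_id')['experimentGroup'].nunique()
mixed_sessions = groups_per_session[groups_per_session > 1]

print(f'Sessions: {len(groups_per_session)}, mixed sessions: {len(mixed_sessions)}')
```

Note that the same session ID ("s1") on two different devices yields two distinct `unique_id` values, which is the reason for combining the device and session identifiers.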

### Functions for Analysis
Now, let's define a few functions to help us with our analysis:
* `calculate_successful_searches`: for calculating the percentage of successful searches (the ones that ended with the user choosing a result)
* `calculate_average_session_time`: for computing average session durations, with an option to calculate only successful or unsuccessful ones
* `is_session_successful`: a helper function to check if a session was successful

These functions will allow us to compare the behaviors of users in each experiment group effectively.


In [11]:
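def calculate_successful_searches(df):
    # Count events that carry a selection (the user picked a result)
    successful_searches = len(
        df[df['event_data'].apply(lambda x: x['selectedIndexes'] is not None)]
    )
    # Count finished search sessions
    finished_searches = len(df[df['event_id'] == 'sessionFinished'])
    # Guard against division by zero when there are no finished sessions
    success_rate = successful_searches / finished_searches if finished_searches > 0 else 0
    return successful_searches, success_rate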

def calculate_average_session_time(df, which_ones="all"):
    if which_ones == "all":
        session_durations = df.groupby('unique_id')['time_epoch'].agg(['min', 'max'])
    elif which_ones == "successful":
        successful_sessions = df.groupby('unique_id').filter(is_session_successful)
        session_durations = successful_sessions.groupby('unique_id')['time_epoch'].agg(['min', 'max'])
    elif which_ones == "unsuccessful":
        unsuccessful_sessions = df.groupby('unique_id').filter(lambda x: not is_session_successful(x))
        session_durations = unsuccessful_sessions.groupby('unique_id')['time_epoch'].agg(['min', 'max'])
    else:
        raise ValueError("Invalid value for 'which_ones'. Use 'all', 'successful', or 'unsuccessful'.")
    
    session_durations['duration'] = session_durations['max'] - session_durations['min']
    return round(session_durations['duration'].mean() / 1000, 4)  # Convert ms to seconds and round to 4 decimal places

def is_session_successful(session):
    # Check if any of the actions in the session had selectedIndexes (the session was successful)
    return any(session['event_data'].apply(lambda x: x['selectedIndexes'] is not None))
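
The lambdas above index `x['selectedIndexes']` directly, which raises a `KeyError` if an event's payload has no such key. Whether that can happen depends on the dataset. If it can, a tolerant check is one option. The helper name `has_selection` below is hypothetical and not part of the original analysis:

```python
def has_selection(event_data):
    """Return True if a parsed event_data dict holds a non-null 'selectedIndexes'.

    dict.get returns None for a missing key, so events without the field
    are treated the same way as events where it is explicitly null.
    """
    return event_data.get('selectedIndexes') is not None


# Hypothetical payloads for illustration
print(has_selection({'selectedIndexes': [2]}))    # selection recorded
print(has_selection({'selectedIndexes': None}))   # field present but null
print(has_selection({'session_id': 's1'}))        # field missing entirely
```

With such a helper, the filter could be written as `df['event_data'].apply(has_selection)`.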

### Analyzing the Groups 
Let’s iterate through each group, calculate some statistics, and print them out to compare the two groups. 


In [12]:
for group_id, group_df in groups.items():
    print(f'\nGroup {group_id} size: {len(group_df)}')

    # Successful searches
    successful_searches, success_rate = calculate_successful_searches(group_df)
    print(f'Group {group_id} successful searches: {successful_searches}')
    print(f'Group {group_id} percentage of successful searches: {success_rate:.2%}')

    # Average session duration
    avg_time_spent = calculate_average_session_time(group_df)
    print(f'Group {group_id} average time spent on the Search Everywhere tab: {avg_time_spent}s')
    
    # Average successful session duration
    avg_successful_time_spent = calculate_average_session_time(group_df, which_ones="successful")
    print(f'Group {group_id} average time spent on successful searches: {avg_successful_time_spent}s')
    
    # Average unsuccessful session duration
    avg_unsuccessful_time_spent = calculate_average_session_time(group_df, which_ones="unsuccessful")
    print(f'Group {group_id} average time spent on unsuccessful searches: {avg_unsuccessful_time_spent}s')


Group 0 size: 51012
Group 0 successful searches: 4193
Group 0 percentage of successful searches: 57.58%
Group 0 average time spent on the Search Everywhere tab: 25.7688s
Group 0 average time spent on successful searches: 25.6835s
Group 0 average time spent on unsuccessful searches: 25.8836s

Group 1 size: 56332
Group 1 successful searches: 4535
Group 1 percentage of successful searches: 56.66%
Group 1 average time spent on the Search Everywhere tab: 25.7726s
Group 1 average time spent on successful searches: 25.5114s
Group 1 average time spent on unsuccessful searches: 26.1109s


### Observations
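As we can see, the two groups look very similar on these statistics. Group 1 has about 10% more events (56,332 vs. 51,012), but the success rates (57.58% vs. 56.66%) and the average session durations (about 25.8 s overall in both groups) are nearly identical. Keep in mind that "similar" here is a descriptive judgment based on these numbers rather than the result of a significance test.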
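
One way to check whether a small gap in success rates is more than noise is a two-proportion z-test. The sketch below is not part of the original analysis: the function name `two_proportion_z_test` and the counts in the usage line are assumptions for illustration. On the real data, the totals would be the per-group counts of `sessionFinished` events. The test also assumes independent trials, which may not hold exactly if one device contributes many sessions.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). Uses the pooled proportion under the null
    hypothesis that both groups share the same success rate.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("Both totals must be positive.")

    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

    if std_err == 0:
        # Every trial succeeded (or every trial failed) in both groups
        return 0.0, 1.0

    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not taken from the dataset
z, p = two_proportion_z_test(580, 1000, 560, 1000)
print(f'z = {z:.3f}, p = {p:.3f}')
```

A large p-value would be consistent with the groups behaving alike; it would not prove that they are identical.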

## Evaluating the Ranking Model

Now, let's evaluate the ranking model using two metrics: Mean Reciprocal Rank (MRR) and Time-To-Click (TTC). We'll calculate these metrics for each group and compare the results. Then we'll calculate the overall MRR and TTC for the entire dataset.

### Mean Reciprocal Rank (MRR)
We'll start by creating a function to calculate the Reciprocal Rank for each unique session.

In [26]:
def calculate_rr(session):
    selected_indexes = session['event_data'].apply(lambda x: x['selectedIndexes'])
    
    valid_indexes = selected_indexes[selected_indexes.notnull()]
    
    # Extract the valid list of indexes and the first selected index from it
    return 1 / (valid_indexes.iloc[0][0] + 1) if not valid_indexes.empty else 0 
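
To make the arithmetic concrete, here is a small worked example on hypothetical sessions (the helper `reciprocal_rank` and the list `first_clicks` are made up for illustration). It mirrors the convention used in `calculate_rr`: indexes are 0-based, so the rank is the index plus one, and a session without a selection scores 0.

```python
def reciprocal_rank(first_selected_index):
    """Reciprocal rank for a 0-based index, or 0 if nothing was selected."""
    if first_selected_index is None:
        return 0.0
    return 1 / (first_selected_index + 1)


# Hypothetical sessions: first selected index per session (None = no selection)
first_clicks = [0, 3, None, 1]

rr_values = [reciprocal_rank(i) for i in first_clicks]

# Mean over all sessions, counting sessions without a selection as 0
mrr_all = sum(rr_values) / len(rr_values)

# Mean over sessions that ended with a selection only
clicked = [rr for rr, i in zip(rr_values, first_clicks) if i is not None]
mrr_clicked = sum(clicked) / len(clicked)

print(rr_values)
print(f'MRR (all sessions): {mrr_all:.4f}')
print(f'MRR (sessions with a selection): {mrr_clicked:.4f}')
```

The four sessions score 1, 0.25, 0 and 0.5, so the mean over all sessions is 1.75 / 4 = 0.4375. Because sessions without a selection contribute 0, a mean over all sessions is always at most the mean over successful sessions only, which is worth remembering when reading MRR values.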

Now, let's create a function to calculate the Mean Reciprocal Rank (MRR) for a given dataframe.

In [27]:
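def calculate_mrr(df):
    # Select only 'event_data' so the grouping column is not passed to calculate_rr
    # (newer pandas versions warn when apply operates on the grouping columns)
    return df.groupby('unique_id')[['event_data']].apply(calculate_rr).mean()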

print(f'MRR for Group 0: {calculate_mrr(groups[0]):.4f}')
print(f'\nMRR for Group 1: {calculate_mrr(groups[1]):.4f}')
print(f'\nOverall MRR: {calculate_mrr(df):.4f}')

MRR for Group 0: 0.1953


MRR for Group 1: 0.2164

Overall MRR: 0.2063


