# KL - Notebook Review  5/31/2022

Link to the original notebook reviewed here: https://www.kaggle.com/code/ryanholbrook/getting-started-with-ai4code/notebook

I reviewed this notebook as part of a personal improvement project to refresh and grow my data science skills through exposure to, and review of, cutting-edge publicly available notebooks.  This specific notebook is part of a Kaggle competition by Google to create a model that can, given all the cells in a randomly shuffled data science notebook, correctly rank and reorder the cells.

The top 2 markdown cells in this notebook were created by me to house my general notes. Throughout the rest of the notebook, my comments are marked with 'KL' to differentiate them from the comments provided by the notebook's author (Google).  The primary purpose of this exercise is my own learning, so my comments are informal and focus on what I found interesting at the time.

-------

This initial notebook provided by Google's team is what you would expect.  It uses a fairly basic model (XGBoost)
and a small training set of 10k notebooks out of the 139k provided.

## Initial Thoughts after notebook review:


## Quick list to increase accuracy 


 * Try reducing `min_df` below 0.01 in the tf-idf vectorizer
 * Use the full training data set
 * Switch from XGBoost to LGBM to decrease training time while maintaining the same level of accuracy
 * Try a grid search on the LGBM hyperparameters.


## More intensive list to increase accuracy


 * Add more features, or use a different method of tokenizing/converting the text to features
 * Use 2- or 3-grams, and split up the words using periods as separators
 * Add custom features for common libraries to indicate whether they are present in that code block.
 * Use whatever the most common version of BERT is

I don't think training a neural net from scratch makes sense for this; if you are going to use a deep learning NN, you
might as well use BERT.
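
The "custom features for common libraries" idea from the list above could look something like the following. This is only a minimal sketch: the library list, the `imports_` feature names, and the `library_flags` helper are all made up for illustration, and the regular expression only catches the first module named in an `import` or `from` statement.

```python
import re

# Hypothetical library list and feature names, chosen for illustration only.
COMMON_LIBRARIES = ("numpy", "pandas", "matplotlib", "sklearn", "xgboost")

# Captures the top-level module in "import x", "import x.y as z", "from x.y import z".
IMPORT_PATTERN = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)


def library_flags(source):
    """Return {feature_name: 0 or 1} for whether a cell imports each library."""
    imported = set(IMPORT_PATTERN.findall(source))
    return {f"imports_{lib}": int(lib in imported) for lib in COMMON_LIBRARIES}


cell = "import numpy as np\nfrom sklearn.model_selection import train_test_split\nprint(np.pi)"
flags = library_flags(cell)
print(flags)
```

Flags like these could be stacked next to the tf-idf columns in the same way the notebook later appends its code cell ordering feature.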


## VERY INTENSE method to increase accuracy


 Use a BERT derivative to generate features for the code, combine those with my custom features, then train a NN to rank.




## KL - Notes after examining top open source competition code. 6/31/2022


## Detailed EDA
https://www.kaggle.com/code/odins0n/ai4code-detailed-eda

This notebook has some great EDA on the dataset, and is a reminder to always look around for available resources and documentation on data as part of any project.
 
It also has some good code to work from for any in-notebook EDA in the future. Personally, I generally prefer
doing my own EDA within the notebook, then exporting an aggregation to Microsoft Excel and building
visualizations for other parties there, due to my familiarity with advanced Excel visualization techniques.

## DistilBert
https://www.kaggle.com/code/aerdem4/ai4code-pytorch-distilbert-baseline

This is a great notebook that matches my initial intuition that a BERT derivative should be suitable for this
project.  I did my capstone project for my Master's degree using BERT, so I knew it would be a good start for any complex analysis of text due to the sophisticated way it learns and represents text.

In this case, the author started from the initial getting-started-with-ai4code notebook (the notebook I'm writing this in)
but used a DistilBert model.  While other BERT derivatives are going to be more powerful, DistilBert is going to be more
efficient, which is important considering the competition requirement that the code run on Kaggle's cloud in a
reasonable time.

Sadly, the author did not provide comments in the code, but it would be a good starting place for future work on this competition.

### Overall

The main lessons here are some great methods for efficiently preparing text data.  It's also a fantastic example of using itertuples over iterrows, which can save a great deal of run time.
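
As a small illustration of the itertuples pattern mentioned above, here is a sketch on a toy frame with made-up values. `iterrows` constructs a pandas Series for every row, while `itertuples` yields lightweight named tuples, which is where the run-time saving comes from.

```python
import pandas as pd

# Toy frame with made-up values, standing in for a per-notebook table.
toy = pd.DataFrame(
    {"cell_order": [["a", "b"], ["c"]], "cell_id": [["b", "a"], ["c"]]},
    index=pd.Index(["nb1", "nb2"], name="id"),
)

# iterrows builds a pandas Series for every row.
lengths_rows = {idx: len(row["cell_order"]) for idx, row in toy.iterrows()}

# itertuples yields named tuples with the fields (Index, cell_order, cell_id).
lengths_tuples = {row.Index: len(row.cell_order) for row in toy.itertuples()}

print(lengths_rows == lengths_tuples)  # True
```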


 









# Welcome to the Google AI4Code Competition! #

In this competition you're challenged to reconstruct the order of Kaggle notebooks whose cells have been shuffled. Check out the [Competition Pages](https://www.kaggle.com/competitions/AI4Code/overview) for a complete overview.

This notebook will walk you through making a submission with a simple ranking model. We'll look at how to:
- Wrangle the competition data and create validation splits,
- Represent the code cell orders with a feature,
- Build a ranking model with XGBoost,
- Evaluate predictions with a Python implementation of the competition metric, and,
- Format predictions to make a successful submission.

Our model will be able to learn roughly where a cell should go in a notebook based on what words it contains -- that, for example, cells containing "Introduction" or `import` should usually be near the beginning, while cells containing "Submit" or `submission.csv` should usually be near the end. These simple features are effective at reconstructing the global order of typical data science workflows. An understanding of the *interactions* or *relationships between cells*, however, will be required of the most successful solutions. We encourage you therefore to explore things like modern neural network language models for learning the relationships between natural language and computer code.

# Setup #

In [23]:
#KL - This code is provided by Google to assist with starting the project.
#All code comments marked KL were written by me.

#Standard library imports.  The notebooks are stored as JSON files.
import json
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse
from tqdm import tqdm

#KL this is an option that I wish I had learned when I first started.
#Widening the default dataframe display width and column width can be extremely helpful during data exploration.
pd.options.display.width = 180
pd.options.display.max_colwidth = 120

data_dir = Path('YOURPATHHERE')

# Load Data #

The notebooks are stored as individual JSON files. They've been cleaned of the usual metadata present in Jupyter notebooks, leaving only the `cell_type` and `source`. The [Data](https://www.kaggle.com/competitions/AI4Code/data) page on the competition website has the full documentation of this dataset.

We'll load the notebooks here and join them into a dataframe for easier processing. The full set of training data takes quite a while to load, so we'll just use a subset for this demonstration.

In [24]:
#KL as this is an example notebook, only 10k samples are chosen.
#this would obviously be an important variable to change for the real deal.

NUM_TRAIN = 10000

#KL this function takes a given file path and reads in the JSON with an explicit data type for each column.
#It then assigns the stem of the file path (the file name without its extension) as the notebook ID

def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )

#KL this code uses 'glob' to do a wildcard search for all files in the folder 'train' that end in '.json'
#it then converts that to a list and keeps the first NUM_TRAIN paths
paths_train = list((data_dir / 'train').glob('*.json'))[:NUM_TRAIN]

#KL we use the function we defined earlier, for each file path we created
#TQDM is used to provide the helpful progress bar

notebooks_train = [
    read_notebook(path) for path in tqdm(paths_train, desc='Train NBs')
]

#KL We create our master dataframe by concatenating the notebooks we just read in.
#The notebook 'id' is added to the index next to 'cell_id', which leaves us with a two-level index (id, cell_id)
#and the columns 'cell_type' (code or markdown) and 'source', which is the actual text of the cell.
#Note that many rows share the same 'id', because the ID refers to the notebook as a whole.
#We also sort by notebook id while keeping the cells within each notebook in their original order.

df = (
    pd.concat(notebooks_train)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

df

Train NBs: 100%|████████████████████████████████████████████████████████████████| 10000/10000 [01:01<00:00, 162.43it/s]


id,cell_id,cell_type,source
00001756c60be8,1862f0a6,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
00001756c60be8,2a9e43d6,code,"import numpy as np\nimport pandas as pd\nimport random\n\nfrom sklearn.model_selection import train_test_split, cros..."
00001756c60be8,038b763d,code,import warnings\nwarnings.filterwarnings('ignore')
00001756c60be8,2eefe0ef,code,matplotlib.rcParams.update({'font.size': 14})
00001756c60be8,0beab1cd,code,"def evaluate_preds(train_true_values, train_pred_values, test_true_values, test_pred_values):\n print(""Train R2:\..."
...,...,...,...
125ef3d1595c5d,653dad94,markdown,## 10. Making predictions and evaluating performance\n<p>But how well does our model perform? </p>\n<p>We will now e...
125ef3d1595c5d,19e283d7,markdown,## 5. Handling the missing values (part iii)\n<p>We have successfully taken care of the missing values present in th...
125ef3d1595c5d,c4d70f32,markdown,"## 9. Fitting a logistic regression model to the train set\n<p>Essentially, predicting if a credit card application ..."
125ef3d1595c5d,7981b5de,markdown,<p>In this small project I will build a supervised machine learning model to predict if a credit card application w...


Each notebook has all the code cells given first with the markdown cells following. The code cells are in the correct relative order, while the markdown cells are shuffled. In the next section, we'll see how to recover the correct orderings for notebooks in the training set.

In [21]:
# Get an example notebook
nb_id = df.index.unique('id')[6]
print('Notebook:', nb_id)

print("The disordered notebook:")
nb = df.loc[nb_id, :]
display(nb)
print()

Notebook: 00038c2941faa0
The disordered notebook:


cell_id,cell_type,source
3e551fb7,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
45049ad8,code,"train_data = pd.read_csv(""/kaggle/input/titanic/train.csv"")\ntest_data = pd.read_csv(""/kaggle/input/titanic/test.csv"")"
123b4f4c,code,import plotly.express as px
0b92cb59,code,train_data.head(20)
df963df4,code,train_data.isnull().sum() #checking out which column has most no. of NaN Values
0f3db81b,code,"px.bar(data_frame=train_data, x='Sex', y='Survived',color='Sex',facet_row_spacing=0, title=""Relation between Gender ..."
33ff3073,code,"total_passengers = train_data['Sex'].count()\ncount_males = 0\ncount_females = 0\nfor i,j in zip(train_data['Sex'], ..."
818c4c15,code,"from sklearn.ensemble import RandomForestClassifier\n\n\ny = train_data[""Survived""]\n\nfeatures = [""Pclass"", ""Sex"", ..."
6cfbe868,markdown,## Survival Rate for Male Passenger is : 12.235 %\n\n## Survival Rate for Female Passenger is : 26.150 %
eadf5c66,markdown,## Who has more luck in here? \n\n\nFrom the above data we can find out that females had more survival rate on Titan...





Your task in this competition is to predict the correct order of the notebook cells, both code and markdown. Since you're given the relative ordering of the code cells among themselves, you could also think of this as predicting where the markdown cells should be placed among the code cells.

For example, a disordered notebook might be:
```
code_1
code_2
code_3
markdown_1
markdown_2
```
and the correctly ordered notebook might be:
```
code_1
markdown_2
code_2
code_3
markdown_1
```
The markdown cells can be in any order, but you would never see `code_2` before `code_1`, for instance.
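
The constraint described above can be checked mechanically. The sketch below uses the toy cell names from the example; the helper name `preserves_code_order` is made up for illustration and is not part of the competition code.

```python
def preserves_code_order(given_code_order, predicted_order):
    """True if the code cells appear in predicted_order in their given relative order."""
    code_cells = set(given_code_order)
    predicted_code = [cell for cell in predicted_order if cell in code_cells]
    return predicted_code == list(given_code_order)


code_cells = ["code_1", "code_2", "code_3"]
valid = ["code_1", "markdown_2", "code_2", "code_3", "markdown_1"]
invalid = ["code_2", "markdown_2", "code_1", "code_3", "markdown_1"]

print(preserves_code_order(code_cells, valid))    # True
print(preserves_code_order(code_cells, invalid))  # False
```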

# Ordering the Cells #

In the `train_orders.csv` file we have, for notebooks in the training set, the correct ordering of cells in terms of the cell ids.

In [28]:
#KL read the csv file 'train_orders.csv'
#after reviewing this file manually, it consists of 2 columns, 'id' and 'cell_order'
#the id column has 1 value, the unique ID of the notebook.
#the cell_order column has multiple values separated by a single space; these are the cell_ids of all cells in that notebook.
#note that while the input csv file has 2 columns, we are using 1 for the index.
#this leaves a single data column, which squeezing converts into a Series (not a list).
#the original notebook passed squeeze=True to read_csv; that argument was deprecated in pandas 1.4
#and removed in 2.0, so the equivalent .squeeze('columns') call is used here instead.
df_orders = (
    pd.read_csv(data_dir / 'train_orders.csv', index_col='id')
    .squeeze('columns')
    .str.split()  # Split the string representation of cell_ids into a list
)

df_orders

id
00001756c60be8    [1862f0a6, 448eb224, 2a9e43d6, 7e2f170a, 038b763d, 77e56113, 2eefe0ef, 1ae087ab, 0beab1cd, 8ffe0b25, 9a78ab76, 0d136...
00015c83e2717b    [2e94bd7a, 3e99dee9, b5e286ea, da4f7550, c417225b, 51e3cd89, 2600b4eb, 75b65993, cf195f8b, 25699d02, 72b3201a, f2c75...
0001bdd4021779    [3fdc37be, 073782ca, 8ea7263c, 80543cd8, 38310c80, 073e27e5, 015d52a4, ad7679ef, 7fde4f04, 07c52510, 0a1a7a39, 0bcd3...
0001daf4c2c76d    [97266564, a898e555, 86605076, 76cc2642, ef279279, df6c939f, 2476da96, 00f87d0a, ae93e8e6, 58aadb1d, d20b0094, 986fd...
0002115f48f982                                 [9ec225f0, 18281c6c, e3b6b115, 4a044c54, 365fe576, a3188e54, b3f6e12d, ee7655ca, 84125b7a]
                                                                           ...                                                           
fffc30d5a0bc46    [09727c0c, ff1ea6a0, ddfef603, a01ce9b3, 3ba953ee, bf92a015, f4a0492a, 095812e6, 53125cfe, aa32a700, 63340e73, 06d8c...
fffc3b44869198    [978a5137, fa

In [32]:
# Get the correct order
cell_order = df_orders.loc[nb_id]

print("The ordered notebook:")
nb.loc[cell_order, :]

The ordered notebook:


cell_id,cell_type,source
3e551fb7,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
45049ad8,code,"train_data = pd.read_csv(""/kaggle/input/titanic/train.csv"")\ntest_data = pd.read_csv(""/kaggle/input/titanic/test.csv"")"
8bb41691,markdown,### Checking out the Titanic Dataset \n
123b4f4c,code,import plotly.express as px
0b92cb59,code,train_data.head(20)
5a8b6e2d,markdown,## EDA is all about asking the right questions\n\nWhat are some questions that come to your mind when you're checkin...
df963df4,code,train_data.isnull().sum() #checking out which column has most no. of NaN Values
3c7d19bc,markdown,From the above inference Cabin needs to be either dropped or needs to be filled with Appropriate values
0f3db81b,code,"px.bar(data_frame=train_data, x='Sex', y='Survived',color='Sex',facet_row_spacing=0, title=""Relation between Gender ..."
eadf5c66,markdown,## Who has more luck in here? \n\n\nFrom the above data we can find out that females had more survival rate on Titan...


The correct numeric position of a cell we will call the **rank** of the cell. We can find the ranks of the cells within a notebook by referencing the true ordering of cell ids as given in `train_orders.csv`.

In [34]:
#KL helpful little function.
#for each cell id in 'derived', it returns that cell's position in 'base' (the correctly ordered list of cell ids)

def get_ranks(base, derived):
    return [base.index(d) for d in derived]

#KL then we use the function on the cell_order list, which returns the rank of each cell within a specific notebook id

cell_ranks = get_ranks(cell_order, list(nb.index))
nb.insert(0, 'rank', cell_ranks)

nb

cell_id,rank,cell_type,source
3e551fb7,0,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
45049ad8,1,code,"train_data = pd.read_csv(""/kaggle/input/titanic/train.csv"")\ntest_data = pd.read_csv(""/kaggle/input/titanic/test.csv"")"
123b4f4c,3,code,import plotly.express as px
0b92cb59,4,code,train_data.head(20)
df963df4,6,code,train_data.isnull().sum() #checking out which column has most no. of NaN Values
0f3db81b,8,code,"px.bar(data_frame=train_data, x='Sex', y='Survived',color='Sex',facet_row_spacing=0, title=""Relation between Gender ..."
33ff3073,10,code,"total_passengers = train_data['Sex'].count()\ncount_males = 0\ncount_females = 0\nfor i,j in zip(train_data['Sex'], ..."
818c4c15,13,code,"from sklearn.ensemble import RandomForestClassifier\n\n\ny = train_data[""Survived""]\n\nfeatures = [""Pclass"", ""Sex"", ..."
6cfbe868,11,markdown,## Survival Rate for Male Passenger is : 12.235 %\n\n## Survival Rate for Female Passenger is : 26.150 %
eadf5c66,9,markdown,## Who has more luck in here? \n\n\nFrom the above data we can find out that females had more survival rate on Titan...


Sorting a notebook by the cell ranks is another way to order the notebook.
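
To make the link between ranks and ordering concrete, here is a minimal sketch on toy ids. `get_ranks_fast` is a hypothetical variant of `get_ranks` that builds a dictionary once instead of calling `list.index` for every cell; it assumes cell ids are unique within a notebook.

```python
def get_ranks_fast(base, derived):
    """Same result as base.index(d) for each d, using a single dictionary pass."""
    position = {cell_id: i for i, cell_id in enumerate(base)}
    return [position[d] for d in derived]


# Toy ids: the true order, and the shuffled order (code cells first, then markdown).
true_order = ["c1", "m2", "c2", "c3", "m1"]
shuffled = ["c1", "c2", "c3", "m1", "m2"]

ranks = get_ranks_fast(true_order, shuffled)
print(ranks)  # [0, 2, 3, 4, 1]

# Sorting the shuffled cells by rank recovers the true order.
recovered = [cell for _, cell in sorted(zip(ranks, shuffled))]
print(recovered == true_order)  # True
```

The dictionary lookup matters mostly for long notebooks, where repeated `list.index` calls grow quadratically with the number of cells.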

In [36]:
#KL assert_frame_equal raises an AssertionError if the left and right dataframes differ, and returns None if they match
from pandas.testing import assert_frame_equal
#KL I added the print statement; it prints None when the assertion passes
print(assert_frame_equal(nb.loc[cell_order, :], nb.sort_values('rank')))

None


The algorithm we'll be using for our baseline model uses the cell ranks as the target, so let's create a dataframe of the ranks for each notebook.

In [47]:
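#KL to_frame converts a Series object to a dataframe.
#Here we convert df_orders (the true cell orders read from train_orders.csv) and join on the list of cell_ids
#for each notebook as it appears in the shuffled training data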

df_orders_ = df_orders.to_frame().join(
    df.reset_index('cell_id').groupby('id')['cell_id'].apply(list),
    how='right',
)

ranks = {}
#KL itertuples is a way of getting a named tuple for each row, and is much faster than using iterrows
#in this case the tuple is the notebook id, the true cell order, and the cell ids in their shuffled order.
#we then reapply the get_ranks function we defined earlier.

for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {'cell_id': cell_id, 'rank': get_ranks(cell_order, cell_id)}

#the end result is a dictionary of the cell_id and rank within the notebook that cell_id came from.    
    
df_ranks = (
    pd.DataFrame
    .from_dict(ranks, orient='index')
    .rename_axis('id')
    .apply(pd.Series.explode)
    .set_index('cell_id', append=True)
)


#KL The end result is a pandas dataframe with a two-level index (id, cell_id) and 1 column.
df_ranks

id,cell_id,rank
00001756c60be8,1862f0a6,0
00001756c60be8,2a9e43d6,2
00001756c60be8,038b763d,4
00001756c60be8,2eefe0ef,6
00001756c60be8,0beab1cd,8
...,...,...
125ef3d1595c5d,653dad94,20
125ef3d1595c5d,19e283d7,10
125ef3d1595c5d,c4d70f32,18
125ef3d1595c5d,7981b5de,0


# Splits #

The `train_ancestors.csv` file identifies groups of notebooks derived from a common origin, that is, notebooks belonging to the same forking tree.

In [48]:
df_ancestors = pd.read_csv(data_dir / 'train_ancestors.csv', index_col='id')
df_ancestors

id,ancestor_id,parent_id
00001756c60be8,945aea18,
00015c83e2717b,aa2da37e,317b65d12af9df
0001bdd4021779,a7711fde,
0001daf4c2c76d,090152ca,
0002115f48f982,272b483a,
...,...,...
fffc30d5a0bc46,6aed207b,
fffc3b44869198,a6aaa8d7,
fffc63ff750064,0a1b5b65,
fffcd063cda949,d971e960,


To prevent leakage, the test set has no notebook with an ancestor in the training set. We therefore form a validation split using `ancestor_id` as a grouping factor.
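
The idea of a group-aware split can be sketched without any libraries. The function below is a simplified, hand-rolled stand-in (not scikit-learn's `GroupShuffleSplit` implementation), and the notebook ids and ancestor ids are made up: it shuffles the groups, sends a fraction of them to validation, and keeps every member of a group on the same side.

```python
import random


def group_split(ids, groups, valid_fraction=0.1, seed=0):
    """Split ids so that every group lands wholly in train or wholly in valid."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_valid = max(1, round(valid_fraction * len(unique_groups)))
    valid_groups = set(unique_groups[:n_valid])
    train = [i for i, g in zip(ids, groups) if g not in valid_groups]
    valid = [i for i, g in zip(ids, groups) if g in valid_groups]
    return train, valid


# Made-up notebook ids and ancestor ids; nb1/nb2 and nb4/nb5 are forks of each other.
ids = ["nb1", "nb2", "nb3", "nb4", "nb5", "nb6"]
ancestors = ["A", "A", "B", "C", "C", "D"]

train_ids, valid_ids = group_split(ids, ancestors, valid_fraction=0.25)
ancestor_of = dict(zip(ids, ancestors))
train_groups = {ancestor_of[i] for i in train_ids}
valid_groups = {ancestor_of[i] for i in valid_ids}
print(train_groups.isdisjoint(valid_groups))  # True
```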

In [50]:
#KL - here, we want to make sure our training set and validation set DO NOT have data leakage.
#this is an issue because many notebooks are from the same fork.  If we include notebooks from the same fork in both training
#and validation, an algorithm may simply memorize the commonalities to predict, which we want to avoid.

#Therefore, when we split the notebooks into training and validation, we want to make sure ALL of a fork goes into only
#validation OR training.  To do this, we use GroupShuffleSplit and feed in the information from the 'ancestors' file.

from sklearn.model_selection import GroupShuffleSplit

NVALID = 0.1  # size of validation set

splitter = GroupShuffleSplit(n_splits=1, test_size=NVALID, random_state=0)

# Split, keeping notebooks with a common origin (ancestor_id) together
ids = df.index.unique('id')
ancestors = df_ancestors.loc[ids, 'ancestor_id']
ids_train, ids_valid = next(splitter.split(ids, groups=ancestors))
ids_train, ids_valid = ids[ids_train], ids[ids_valid]

df_train = df.loc[ids_train, :]
df_valid = df.loc[ids_valid, :]

# Feature Engineering #

Let's generate [tf-idf features](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to use with our ranking model. These features will help our model learn what kinds of words tend to occur most often at various positions within a notebook.

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Training set
tfidf = TfidfVectorizer(min_df=0.01)
X_train = tfidf.fit_transform(df_train['source'].astype(str))
# Rank of each cell within the notebook
y_train = df_ranks.loc[ids_train].to_numpy()
# Number of cells in each notebook
groups = df_ranks.loc[ids_train].groupby('id').size().to_numpy()

In [68]:
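#KL I added this codeblock to help illustrate the results of TF-IDF

#In (0, 162), the 0 refers to the document (row) number and the 162 refers to the word (column).  The 0.0454 is the
#tf-idf weight of that word in that document.  This is the feature vector that we will use.
#note that with min_df=0.01 it discards all words that are not in at least 1% of all documents.
#note also that this cell appears to have been run after the code cell ordering column was appended (In [64] below),
#which would explain the shape of 280 columns and the whole-number values in column 279.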

print(X_train.shape)

print(X_train)

(416743, 280)
  (0, 162)	0.045459803822181045
  (0, 35)	0.06738644517234817
  (0, 234)	0.0873809155041615
  (0, 38)	0.07890611225025879
  (0, 16)	0.08053085489967879
  (0, 254)	0.07368027830383203
  (0, 54)	0.07884171815489997
  (0, 263)	0.08585969572872988
  (0, 171)	0.08355245628758512
  (0, 228)	0.06197821898023017
  (0, 250)	0.08904813742242577
  (0, 40)	0.13213356082602562
  (0, 278)	0.22268566155583808
  (0, 119)	0.08246795502416421
  (0, 174)	0.08193964232301745
  (0, 183)	0.05345768806988965
  (0, 167)	0.23350814043737944
  (0, 15)	0.14022887907750722
  (0, 133)	0.07429332663299502
  (0, 267)	0.06841562311142822
  (0, 166)	0.07112099202943088
  (0, 199)	0.17823644861549026
  (0, 165)	0.08009917775638646
  (0, 190)	0.09086365271147324
  (0, 109)	0.13288019028200737
  :	:
  (416654, 279)	33.0
  (416655, 279)	34.0
  (416656, 279)	35.0
  (416657, 279)	36.0
  (416658, 279)	37.0
  (416659, 279)	38.0
  (416660, 279)	39.0
  (416661, 279)	40.0
  (416662, 279)	41.0
  (416663, 279)	42.0
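
To show what the `min_df=0.01` setting does to the vocabulary, here is a plain-Python sketch of document-frequency pruning on four made-up cells. It is a simplified illustration of the idea rather than scikit-learn's code; the `\w\w+` token pattern is chosen to resemble the vectorizer's default of dropping single-character tokens.

```python
import re
from collections import Counter


def prune_vocabulary(documents, min_df):
    """Keep tokens whose document frequency (as a fraction) is at least min_df."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(re.findall(r"\w\w+", doc.lower())))
    n_docs = len(documents)
    return sorted(tok for tok, df in doc_freq.items() if df / n_docs >= min_df)


docs = [
    "import pandas as pd",
    "import numpy as np",
    "df = pd.read_csv(path)",
    "model.fit(X, y)",
]
print(prune_vocabulary(docs, min_df=0.5))   # ['as', 'import', 'pd']
print(prune_vocabulary(docs, min_df=0.25))  # every token of 2+ characters
```

Lowering the threshold keeps rarer tokens, which is why reducing `min_df` grows the feature matrix.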
 

Now let's add the code cell ordering as a feature. We'll append a column that enumerates the code cells in the correct order, like `1, 2, 3, 4, ...`, while having the dummy value `0` for all markdown cells. This feature will help the model learn to put the code cells in the correct order.

In [64]:
# Add code cell ordering
X_train = sparse.hstack((
    X_train,
    np.where(
        df_train['cell_type'] == 'code',
        df_train.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))
print(X_train.shape)

(416743, 280)
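
The code cell ordering feature can be easier to follow on a toy input. The sketch below reproduces the same idea in plain Python (a running count of code cells per notebook, with `0` for markdown cells); the notebook ids are made up and `code_order_feature` is a hypothetical helper, not part of the original notebook.

```python
from collections import defaultdict


def code_order_feature(rows):
    """rows: (notebook_id, cell_type) pairs in dataset order -> list of feature values."""
    seen_code = defaultdict(int)
    feature = []
    for notebook_id, cell_type in rows:
        if cell_type == "code":
            seen_code[notebook_id] += 1
            feature.append(seen_code[notebook_id])
        else:
            feature.append(0)  # dummy value for markdown cells
    return feature


rows = [
    ("nb1", "code"), ("nb1", "code"), ("nb1", "markdown"),
    ("nb2", "code"), ("nb2", "markdown"), ("nb2", "markdown"),
]
print(code_order_feature(rows))  # [1, 2, 0, 1, 0, 0]
```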


# Train #

We'll use the ranking algorithm provided by XGBoost.

In [74]:
#KL Create an XG ranker model and train it.

from xgboost import XGBRanker

model = XGBRanker(
    min_child_weight=10,
    subsample=0.5,
    tree_method='hist',
)
model.fit(X_train, y_train, group=groups)

XGBRanker(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1,
          colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=None,
          enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1,
          grow_policy='depthwise', importance_type=None,
          interaction_constraints='', learning_rate=0.300000012, max_bin=256,
          max_cat_to_onehot=4, max_delta_step=0, max_depth=6, max_leaves=0,
          min_child_weight=10, missing=nan, monotone_constraints='()',
          n_estimators=100, n_jobs=0, num_parallel_tree=1,
          objective='rank:pairwise', predictor='auto', random_state=0,
          reg_alpha=0, ...)

# Evaluate #

Now let's see how well our model learned to order Kaggle notebook cells. We'll evaluate predictions on the validation set with a variant of the Kendall tau correlation.

## Validation set ##

First we'll create features for the validation set just like we did for the training set.

In [75]:
# Validation set
X_valid = tfidf.transform(df_valid['source'].astype(str))
# The metric uses cell ids
y_valid = df_orders.loc[ids_valid]

X_valid = sparse.hstack((
    X_valid,
    np.where(
        df_valid['cell_type'] == 'code',
        df_valid.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

Here we'll use the model to predict the rank of each cell within its notebook and then convert these ranks into a list of ordered cell ids.
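
The conversion from predicted scores to an ordering only depends on how the scores compare within each notebook. A minimal plain-Python sketch with made-up scores (the helper name `scores_to_orders` is hypothetical):

```python
from collections import defaultdict


def scores_to_orders(rows):
    """rows: (notebook_id, cell_id, predicted_score) -> {notebook_id: ordered cell ids}."""
    by_notebook = defaultdict(list)
    for notebook_id, cell_id, score in rows:
        by_notebook[notebook_id].append((score, cell_id))
    # Only the relative order of the scores within a notebook matters.
    return {nb: [cell for _, cell in sorted(cells)] for nb, cells in by_notebook.items()}


# Made-up scores; they need not be integers or lie in any particular range.
rows = [
    ("nb1", "c1", -1.3), ("nb1", "c2", 0.8), ("nb1", "m1", 0.1),
    ("nb2", "c1", 2.0), ("nb2", "m1", -0.5),
]
orders = scores_to_orders(rows)
print(orders)
# {'nb1': ['c1', 'm1', 'c2'], 'nb2': ['m1', 'c1']}
```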

In [76]:
y_pred = pd.DataFrame({'rank': model.predict(X_valid)}, index=df_valid.index)
y_pred = (
    y_pred
    .sort_values(['id', 'rank'])  # Sort the cells in each notebook by their rank.
                                  # The cell_ids are now in the order the model predicted.
    .reset_index('cell_id')  # Convert the cell_id index into a column.
    .groupby('id')['cell_id'].apply(list)  # Group the cell_ids for each notebook into a list.
)
y_pred.head(10)

id
000757b90aaca0    [8f84d7a9, 93cceeef, eb6ca769, 3cb3d383, bc595bc2, 1c301fa2, abc159f0, b20690ef, 6e3a3d90, a32974f3, 4e2b2854, 336fd...
0015b3a7b1d090    [f9d1d049, 6b839e2c, d0bed918, f7f7add5, a3b5283b, 152bae08, 8b28582f, 37117c2f, 83592431, f66232c0, 91d77777, f9e7d...
001f67f7a77619    [0b66280b, 7f81821a, 46329b3c, 7a1ee188, 8d996785, a1265d02, 118cd02c, 96b9c492, eb9aa0df, 34457700, 72934658, 95722...
0020e93727e09c    [2ae0558f, c99eb39c, 0395ecde, 3de50426, f7501ae7, 660e2f96, a496bff8, 735a621c, c6ae803e, 4fbe373e, 82d91b5d, d3726...
002132326fc1cd    [15bb3783, 37f42952, a935d196, 58c61e7c, f96000e2, 48d204ea, 6a3b489e, 8faaf8eb, 76c4ba1f, d3c00b4a, ffd91295, 1b033...
0030ea6c6281ce    [b8ff09de, 532dd206, 6d1c9755, 1d0804d1, 0aa598c5, ca7507bb, c0bc83de, b0202ee4, 699baeea, c74e572f, 42586e3f, 338d3...
003387e4d35cd9    [1871bc80, 9758b959, 8ef3c7b5, 4bfc51d6, bdfe4baa, f8b34ee9, eef4482a, 3c0e68ab, 5a2f4395, 936256c6, 9edac9dc, 6a7cd...
0035bf6f9e264f    [1de4b0ca, 91

Now let's examine a notebook to see how the model did.

In [77]:
nb_id = df_valid.index.get_level_values('id').unique()[8]

display(df.loc[nb_id])
display(df.loc[nb_id].loc[y_pred.loc[nb_id]])

cell_id,cell_type,source
f88724cf,code,"print(14 * "" >"", ""\t n.B.a. \t"", ""< "" * 14, ""\n\n\n"")\n\n# This Python 3 environment comes with many helpful analyti..."
0887cba0,code,"# Define dictionary\ndictionary = {""column1"":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],\n ""c..."
b3e68875,code,data_missingno.head(10)
9d87e5c8,code,# import missingno library\n\nimport missingno as msno\n\nmsno.matrix(data_missingno)\nplt.show()
16e1f068,code,msno.bar(data_missingno)\nplt.show()
b04c54ec,code,data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')\ndata.head()
46189238,code,"data.rename(columns = {'fixed acidity': 'fixed_acidity', 'volatile acidity': 'volatile_acidity', 'citric acid': 'cit..."
a0c17f96,code,data.info()
4520ba37,code,"# Make the plot\nplt.figure(figsize=(15,10))\nparallel_coordinates(data, 'quality', colormap=plt.get_cmap(""Set1""))\n..."
242e9c42,code,"# Calculate the correlation between individuals.\ncorr = data.iloc[:,0:10].corr()\ncorr"


cell_id,cell_type,source
f88724cf,code,"print(14 * "" >"", ""\t n.B.a. \t"", ""< "" * 14, ""\n\n\n"")\n\n# This Python 3 environment comes with many helpful analyti..."
0887cba0,code,"# Define dictionary\ndictionary = {""column1"":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],\n ""c..."
b3e68875,code,data_missingno.head(10)
9d87e5c8,code,# import missingno library\n\nimport missingno as msno\n\nmsno.matrix(data_missingno)\nplt.show()
b04c54ec,code,data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')\ndata.head()
16e1f068,code,msno.bar(data_missingno)\nplt.show()
46189238,code,"data.rename(columns = {'fixed acidity': 'fixed_acidity', 'volatile acidity': 'volatile_acidity', 'citric acid': 'cit..."
a0c17f96,code,data.info()
4520ba37,code,"# Make the plot\nplt.figure(figsize=(15,10))\nparallel_coordinates(data, 'quality', colormap=plt.get_cmap(""Set1""))\n..."
1a269618,code,# import networkx library\nimport networkx as nx\n\n# Transform it in a links data frame (3 columns only):\nlinks = ...


## Metric ##

This competition uses a variant of the [Kendall tau correlation](https://www.kaggle.com/competitions/AI4Code/overview/evaluation), which will measure how close to the correct order our predicted orderings are. See this notebook for more on this metric: [Competition Metric - Kendall Tau Correlation](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation/notebook).

In [78]:
#KL Kendall tau is basically a measure of how many times you have to swap 2 adjacent cells to get the right order.
#as such it penalizes predictions that are drastically out of position more than those which are only 1 or 2 places off.
#they implement their own custom method for this below

from bisect import bisect


def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def kendall_tau(ground_truth, predictions):
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max
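
A few toy inputs help show how the metric behaves. The sketch below repeats the inversion counter so that it runs on its own, adds a brute-force counter as a cross-check, and scores a single notebook with 1 - 4 * inversions / (n * (n - 1)). The `tau_single` and `count_inversions_brute` helpers are additions for illustration only.

```python
from bisect import bisect
from itertools import combinations


def count_inversions(a):
    """Inversion count via binary insertion, as in the cell above."""
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def count_inversions_brute(a):
    """Reference count: number of pairs (i < j) with a[i] > a[j]."""
    return sum(1 for x, y in combinations(a, 2) if x > y)


def tau_single(ground_truth, prediction):
    """Metric for one notebook: 1 - 4 * inversions / (n * (n - 1))."""
    ranks = [ground_truth.index(x) for x in prediction]
    n = len(ground_truth)
    return 1 - 4 * count_inversions(ranks) / (n * (n - 1))


truth = ["a", "b", "c", "d"]
print(tau_single(truth, ["a", "b", "c", "d"]))  # 1.0  (perfect order)
print(tau_single(truth, ["d", "c", "b", "a"]))  # -1.0 (fully reversed)
print(tau_single(truth, ["a", "c", "b", "d"]))  # one adjacent swap, roughly 0.667
```

A perfect ordering scores 1, a fully reversed one scores -1, and each additional inversion lowers the score by the same amount.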

Let's test the metric with a dummy submission created from the ids of the shuffled notebooks.

In [79]:
y_dummy = df_valid.reset_index('cell_id').groupby('id')['cell_id'].apply(list)
kendall_tau(y_valid, y_dummy)

0.37616093940244355

Comparing this to the score on the predictions, we can see that our model was indeed able to improve the cell ordering somewhat.

In [80]:
kendall_tau(y_valid, y_pred)

0.585791847404556

# Submission #

To create a submission for this competition, we'll apply our model to the notebooks in the test set. Note that this is a **Code Competition**, which means that the test data we see here is only a small sample. When we submit our notebook for scoring, this example data will be replaced with the full test set of about 20,000 notebooks.

First we load the data.

In [81]:
paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [
    read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
]
df_test = (
    pd.concat(notebooks_test)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

Test NBs: 100%|██████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 78.64it/s]


Then create the tf-idf and code cell features.

In [82]:
X_test = tfidf.transform(df_test['source'].astype(str))
X_test = sparse.hstack((
    X_test,
    np.where(
        df_test['cell_type'] == 'code',
        df_test.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

And then create predictions on the test set.

In [83]:
y_infer = pd.DataFrame({'rank': model.predict(X_test)}, index=df_test.index)
y_infer = y_infer.sort_values(['id', 'rank']).reset_index('cell_id').groupby('id')['cell_id'].apply(list)
y_infer

id
0009d135ece78d    [ddfd239c, c6cd22db, 1372ae9b, 7f388a41, 90ed07ab, 8cb8d28a, f9893819, 2843a25a, 06dbf8cf, 0a226b6a, 39e937ec, ba55e...
0010483c12ba9b                       [54c7cab3, fe66203e, 7844d5f8, 5ce8863c, 7f270e34, 4a32c095, 02a0be6d, 865ad516, 4a0777c4, 4703bb6d]
0010a919d60e4f    [aafc3d23, b7578789, 80e077ec, b190ebb4, ed415c3c, 322850af, 8ce62db4, 5115ebe5, 868c4eae, 4ae17669, 23607d04, 80433...
0028856e09c5b7                                                                                   [012c9d02, d22526d1, 3ae7ece3, eb293dfc]
Name: cell_id, dtype: object

The `sample_submission.csv` file shows what a correctly formatted submission must look like. We'll just use it as a visual check, but you might like to directly modify the values of sample submission instead. (This would help prevent failed submissions due to missing notebook ids or incorrectly named columns, for instance.)

In [84]:
y_sample = pd.read_csv(data_dir / 'sample_submission.csv', index_col='id').squeeze('columns')  # squeeze=True was removed from read_csv in pandas 2.0
y_sample

id
0009d135ece78d       ddfd239c c6cd22db 1372ae9b 90ed07ab 7f388a41 2843a25a 06dbf8cf f9893819 ba55e576 39e937ec e25aa9bd 0a226b6a 8cb8d28a
0010483c12ba9b                                  54c7cab3 fe66203e 7844d5f8 5ce8863c 4a0777c4 4703bb6d 4a32c095 865ad516 02a0be6d 7f270e34
0010a919d60e4f    aafc3d23 80e077ec b190ebb4 ed415c3c 322850af c069ed33 868c4eae 80433cf3 bd8fbd76 0e2529e8 1345b8b2 cdae286f 4907b9ef...
0028856e09c5b7                                                                                        012c9d02 d22526d1 3ae7ece3 eb293dfc
Name: cell_order, dtype: object

We can see that a correctly formatted submission needs the index named `id` and the column of cell orders named `cell_order`. Moreover, we need to convert the list of cell ids into a space-delimited string of cell ids.
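
The required file layout can be illustrated with the standard library alone. This sketch writes made-up predictions to an in-memory buffer instead of a real `submission.csv`, using the `id` and `cell_order` column names described above.

```python
import csv
import io

# Made-up predictions: notebook id -> predicted list of cell ids.
predicted = {"nb1": ["c1", "m1", "c2"], "nb2": ["m1", "c1"]}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "cell_order"])  # required column names
for notebook_id, cells in predicted.items():
    writer.writerow([notebook_id, " ".join(cells)])  # space-delimited cell ids

print(buffer.getvalue())
```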

In [85]:
y_submit = (
    y_infer
    .apply(' '.join)  # list of ids -> string of ids
    .rename_axis('id')
    .rename('cell_order')
)
y_submit

id
0009d135ece78d       ddfd239c c6cd22db 1372ae9b 7f388a41 90ed07ab 8cb8d28a f9893819 2843a25a 06dbf8cf 0a226b6a 39e937ec ba55e576 e25aa9bd
0010483c12ba9b                                  54c7cab3 fe66203e 7844d5f8 5ce8863c 7f270e34 4a32c095 02a0be6d 865ad516 4a0777c4 4703bb6d
0010a919d60e4f    aafc3d23 b7578789 80e077ec b190ebb4 ed415c3c 322850af 8ce62db4 5115ebe5 868c4eae 4ae17669 23607d04 80433cf3 c069ed33...
0028856e09c5b7                                                                                        012c9d02 d22526d1 3ae7ece3 eb293dfc
Name: cell_order, dtype: object

And finally we'll write out the formatted submissions to a file `submission.csv`. When we submit our notebook, it will be rerun on the full test data to create the submission file that's actually scored.

In [86]:
y_submit.to_csv('submission.csv')