
<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px;
            text-align:center">
<br>
<h3><center><b>Google AI4Code – Understand Code in Python Notebooks</b></center></h3>
<h5><center>Predict the relationship between code and comments</center></h5>
    <br>
<p><center><b>Research teams across Google and Alphabet are exploring new ways that machine learning can assist software developers, and want to rally more members of the developer community to help explore this area too. Python notebooks provide a unique learning opportunity, because unlike a lot of standard source code, notebooks often follow narrative format, with comment cells implemented in markdown that explain a programmer's intentions for corresponding code cells. An understanding of the relationships between code and markdown could lend to fresh improvements across many aspects of AI-assisted development, such as the construction of better data filtering and preprocessing pipelines for model training, or automatic assessments of a notebook's readability.

The dataset consists of approximately 160,000 public Python notebooks from Kaggle. The competition was designed together with X, the moonshot factory, and challenges participants to use this dataset of published notebooks to build creative techniques aimed at better understanding the relationship between comment cells and code cells.</b></center></p>

<br>

<h6> Sources: https://www.kaggle.com/competitions/AI4Code/data<br></h6>
</div>
    


<div class='alert alert-info'>
<br>
<h3> <center><b> THE GOAL 🥅</b> </center></h3>    

<h4>The goal of this competition is to understand the relationship between code and comments in Python notebooks. You are challenged to reconstruct the order of markdown cells in a given notebook based on the order of the code cells, demonstrating comprehension of which natural language references which code.</h4>

<br>
<h3> <center><b> THE TASK ☑️</b> </center></h3>
<br>
<h4> The task is to predict the correct ordering of the cells in a given notebook whose markdown cells have been shuffled.</h4>
<br>

<h3><center><b> THE DATA 📚</b></center></h3>

* Total notebooks in the dataset: 160,000 Notebooks
* Training dataset: 140,000 Notebooks as 140,000 JSON files
* Test dataset: 20,000 Notebooks as 20,000 JSON files
* train_order.csv - Gives the correct order of the cells for each notebook in the training dataset
* train_ancestors.csv - On Kaggle, a user may "fork" (that is, copy) the notebook of another user to create their own version. This file contains the forking history of notebooks in the training set. Note: There is no corresponding file for the test set.

<br>

<h3><center><b>Predictions are evaluated by the Kendall tau correlation 🧪</b></center></h3>
</div>
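To make the metric concrete: for a single notebook with n cells and no ties, the Kendall tau correlation between the predicted and the true cell order can be written as 1 - 4 * inversions / (n * (n - 1)), where an inversion is a pair of cells that appears in the wrong relative order. The sketch below is a minimal illustration of that formula. The function names and the toy cell ids are hypothetical, and how the competition aggregates the score across notebooks is not covered here.

```python
def count_inversions(ranks):
    """Count pairs (i, j) with i < j whose ranks are out of order (O(n^2), kept simple for clarity)."""
    n = len(ranks)
    return sum(1 for i in range(n) for j in range(i + 1, n) if ranks[i] > ranks[j])


def kendall_tau(true_order, predicted_order):
    """Kendall tau between two orderings of the same cell ids (no ties assumed)."""
    position = {cell: idx for idx, cell in enumerate(true_order)}
    ranks = [position[cell] for cell in predicted_order]
    n = len(ranks)
    if n < 2:
        return 1.0  # a notebook with a single cell cannot be mis-ordered
    return 1 - 4 * count_inversions(ranks) / (n * (n - 1))


# Toy notebook with four made-up cell ids
true_order = ["a", "b", "c", "d"]
perfect = kendall_tau(true_order, ["a", "b", "c", "d"])
reversed_score = kendall_tau(true_order, ["d", "c", "b", "a"])
one_swap = kendall_tau(true_order, ["a", "c", "b", "d"])
print(perfect, reversed_score, round(one_swap, 4))
```

A perfect ordering scores 1 and a fully reversed ordering scores -1, so every misplaced pair of cells lowers the score by the same amount.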


 

<div class='alert alert-info'>
<h4><center>Markdown (comment) cell and code cell</center></h4>
</div>

![](https://storage.googleapis.com/kaggle-media/Images/notebook_cell_examples.png)

In [None]:
import gc
import re
import time
import warnings
from collections import Counter

import numpy as np
import pandas as pd
from tqdm import tqdm

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Plotting
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from PIL import Image
from plotly.colors import n_colors
from plotly.subplots import make_subplots

warnings.filterwarnings("ignore")

# English stop words from NLTK (requires the 'stopwords' corpus to be available)
stop = set(stopwords.words('english'))

<div class='alert alert-info'>
<h4><center>For EDA purposes, this notebook uses the dataset prepared by Darien Schettler - https://www.kaggle.com/dschettler8845<br><br>
    Dataset Link - https://www.kaggle.com/datasets/dschettler8845/ai4code-train-dataframe/ </center></h4>

</div>

In [None]:
train_data=pd.read_csv('../input/ai4code-train-dataframe/train.csv')
#train_data.drop(columns=['Unnamed: 0'],inplace=True)
print('Shape of the train data (rows, columns):',train_data.shape)
print('Number of Notebooks:',train_data['id'].nunique())
print('Number of different cell types:',train_data['cell_type'].nunique(),'(Code and Markdown cell)')

gc.collect()

In [None]:
train_data.info(memory_usage='deep')

In [None]:
train_data.head()



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
sns.set(rc={'figure.figsize':(10,6)})
custom_colors = ["#4e89ae", "#c56183","#ed6663","#ffa372"]

def triple_plot(x, title,c): # Triple plot (histogram + KDE, boxplot, violin plot) for a numeric feature
    fig, ax = plt.subplots(3,1,figsize=(20,10),sharex=True)
    # histplot(kde=True) replaces the deprecated sns.distplot; bins are capped at 50
    sns.histplot(x=x, bins=50, kde=True, ax=ax[0],color=c)
    ax[0].set(xlabel=None)
    ax[0].set_title('Histogram + KDE')
    sns.boxplot(x=x, ax=ax[1],color=c)
    ax[1].set(xlabel=None)
    ax[1].set_title('Boxplot')
    sns.violinplot(x=x, ax=ax[2],color=c)
    ax[2].set(xlabel=None)
    ax[2].set_title('Violin plot')
    #fig.suptitle(title, fontsize=30)
    #plt.tight_layout(pad=3.0)
    plt.show()

#Aggregating by notebook ids to get the count
counter=pd.DataFrame(train_data.groupby(['id']).agg({'cell_id':'count','cell_type':'nunique'}))
counter.rename(columns={'cell_id':'Cell_count'},inplace=True)
counter=counter[counter['Cell_count']!=0]

triple_plot(counter['Cell_count'],'Distribution of cells in Notebooks',custom_colors[2])

<div class='alert alert-info'>

<h4>The graph above shows the distribution of the number of cells per notebook in the training dataset.</h4><br>
<h4>The distribution is right-skewed (roughly log-normal), so the median is a better measure of central tendency here than the mean.</h4><br>
<h4>There are also outliers, but the violin plot shows that only a small share of the notebooks lies in that tail.</h4>
</div>

In [None]:
print('Median number of cells in the notebooks:',counter['Cell_count'].median())

In [None]:
counter.describe()

<div class='alert alert-info'>

<h4>As expected, there are only two cell types: code cells and markdown cells.</h4><br>
<h4>About 75% of the notebooks have 57 cells or fewer, and 50% have 35 cells or fewer.</h4><br>
    </div>

<div class='alert alert-info'>
<h4><center>Let's look at the training examples</center></h4>
</div>

In [None]:
# I have just shown the cells for one single notebook (8a2564b730a575)
train_data[train_data['id']=='8a2564b730a575']

<div class='alert alert-info'>
<h4><center>Let's look at another example, similar to the one above but printed cell by cell in more detail</center></h4>
</div>

In [None]:
example_id = train_data['id'].iloc[2]
for j, src in enumerate(train_data[train_data['id']==example_id]['source']):
    print('\033[1m'+'This is Cell {}:\n \t\n'.format(j+1)+'\033[0m'+str(src) + '\n'+'---------------------------------------------------------\n\n')

<div class='alert alert-info'>

<h3><center><b>Please Note:</b></center></h3>

<h4>The code cells are in their original (correct) order. The markdown cell (Cell 12 above) has been shuffled and placed after the code cells.</h4><br><br>

<h4>Your task is to predict the correct position of each shuffled markdown cell in a given notebook.</h4><br>
</div>

<div class='alert alert-info'>

<h3><center><b>Binning the cell count</b></center></h3>

<h4>For easier interpretation, I have binned the notebooks' cell counts into five categories</h4><br>

<br>
    
* Very small notebooks - up to 10 cells
* Small notebooks - 11 to 20 cells
* Medium notebooks - 21 to 50 cells
* Large notebooks - 51 to 100 cells
* Very large notebooks - more than 100 cells

<h6>You can change these boundaries to suit your needs in the cell_binner list below</h6>
</div>
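As a quick check of where the category boundaries fall, the sketch below bins a few made-up cell counts with `pd.cut`. The edges and labels mirror the notebook's binning cell, while `toy_counts` is invented for illustration. By default `pd.cut` uses right-closed intervals, so a notebook with exactly 10 cells lands in the first bin and one with 11 cells in the second.

```python
import pandas as pd

# Same bin edges and labels as the binning cell; toy_counts is made-up data.
cell_binner = [0, 10, 20, 50, 100, 99999]
labels = ['Very Small NB', 'Small NB', 'Medium NB', 'Large NB', 'Very Large NB']
toy_counts = pd.Series([1, 10, 11, 20, 21, 50, 51, 100, 101])

# pd.cut uses right-closed intervals by default: (0, 10], (10, 20], ...
toy_bins = pd.cut(toy_counts, bins=cell_binner, labels=labels)
binned = toy_bins.astype(str).tolist()
print(pd.DataFrame({'cells': toy_counts, 'bin': toy_bins}))
```

Each boundary value (10, 20, 50, 100) is assigned to the lower of its two neighbouring categories.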

In [None]:
#Binning conditions
cell_binner=[0,10,20,50,100,99999]
labels=['Very Small NB','Small NB','Medium NB','Large NB','Very Large NB']

#Lets bin the cell count into five different categories as explained above 
counter['cells_bin']=pd.cut(counter['Cell_count'],bins=cell_binner,labels=labels)

#Also merge the binned results with the train data
train_data=train_data.merge(counter.reset_index()[['id','cells_bin']],on=['id'],how='left')

#Distribution calculation of the bins 
nb_split=pd.DataFrame(counter['cells_bin'].value_counts()).reset_index()
nb_split.columns=['Category','Number of NBs']
nb_split['Percentage(%)']=round(((nb_split['Number of NBs']/nb_split['Number of NBs'].sum())*100),2)
nb_split

<div class='alert alert-info'>
<h4> <center>Nearly 44% of the notebooks fall into the medium category (21-50 cells), followed by large notebooks at around 23%</center></h4>
</div>

In [None]:
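def build_wordcloud(df, title, max_cells=50_000):
    # str(df) only yields the truncated Series repr, so join the actual cell texts instead.
    # max_cells is a cap added here (an assumption) to keep memory use manageable.
    cells = df.dropna().astype(str)
    if len(cells) > max_cells:
        cells = cells.sample(n=max_cells, random_state=666)
    wordcloud = WordCloud(
        background_color='black',colormap="Oranges", 
        stopwords=set(STOPWORDS), 
        max_words=50, 
        max_font_size=40, 
        random_state=666
    ).generate(' '.join(cells))

    fig = plt.figure(1, figsize=(14,14))
    plt.axis('off')
    fig.suptitle(title, fontsize=16)
    fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()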
    


<div class='alert alert-info'>
    <h4><center> Let's see the word clouds for the different categories</center> </h4>
</div>

In [None]:
build_wordcloud(train_data['source'], 'Prevalent words in all the Notebooks')

In [None]:
build_wordcloud(train_data[train_data['cells_bin']=='Small NB']['source'], 'Prevalent words in all the Small Notebooks')

In [None]:
build_wordcloud(train_data[train_data['cells_bin']=='Very Small NB']['source'], 'Prevalent words in all the Very Small Notebooks')

In [None]:
build_wordcloud(train_data[train_data['cells_bin']=='Medium NB']['source'], 'Prevalent words in all the Medium Notebooks')

In [None]:
build_wordcloud(train_data[train_data['cells_bin']=='Large NB']['source'], 'Prevalent words in all the Large Notebooks (51-100 cells)')

In [None]:
build_wordcloud(train_data[train_data['cells_bin']=='Very Large NB']['source'], 'Prevalent words in all the Very Large Notebooks (>100 cells)')

In [None]:
build_wordcloud(train_data[train_data['cell_type']=='code']['source'], 'Prevalent words in all the CODE cells')

In [None]:
build_wordcloud(train_data[train_data['cell_type']=='markdown']['source'], 'Prevalent words in all the Markdown cells')

In [None]:
%%time
train_data['text_length']=train_data['source'].apply(lambda x: len(str(x).split()))

<div class='alert alert-info'>
    <h4><center> Let's observe the text length in markdowns</center></h4>
    </div>

In [None]:
triple_plot(train_data[(train_data['cell_type']=='markdown')]['text_length'],'Distribution of text length in markdowns',custom_colors[2])


<div class='alert alert-info'>
<h4><center>The extreme outliers make the plot hard to interpret, so we use the interquartile range (IQR) to replace the outliers with a NULL value:</center></h4>
<br>
    
* Calculate the first and third quartiles (Q1 and Q3).
* Compute the interquartile range, IQR = Q3 - Q1.
* Compute the lower bound, lower bound = Q1 - 1.5 × IQR
* Compute the upper bound, upper bound = Q3 + 1.5 × IQR
* Replace the data points that lie outside the lower and upper bounds with a NULL value

</div>
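As a small worked illustration of this IQR rule, with fences at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, the sketch below applies it to a made-up series of ten text lengths. The variable names and values are hypothetical and are not taken from the training data.

```python
import numpy as np
import pandas as pd

# Made-up text lengths with one obvious outlier
toy_lengths = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(toy_lengths, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep values inside the fences; everything else becomes NaN
masked = toy_lengths.where(toy_lengths.between(lower_bound, upper_bound))
print(lower_bound, upper_bound)
print(masked.tolist())
```

Here only the value 100 falls outside the fences and is replaced, while the other nine values are kept unchanged.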

In [None]:
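for x in ['text_length']:
    # Quartiles are computed over all cells (code and markdown)
    q75,q25 = np.percentile(train_data.loc[:,x],[75,25])
    intr_qr = q75-q25
 
    # Outlier fences; named so they do not shadow the built-ins max and min
    upper_bound = q75+(1.5*intr_qr)
    lower_bound = q25-(1.5*intr_qr)
 
    train_data.loc[train_data[x] < lower_bound,x] = np.nan
    train_data.loc[train_data[x] > upper_bound,x] = np.nan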

In [None]:
triple_plot(train_data[(train_data['cell_type']=='markdown')]['text_length'],'Distribution of text length in markdowns',custom_colors[2])

<div class='alert alert-info'>
    <h4> <center><br> N-GRAM</center></h4>
    <h4><center><br>The most frequent n-grams (sequences of n consecutive words) in the markdown cells are listed below</center></h4>
</div>
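To illustrate what an n-gram count measures, here is a simplified, pure-Python sketch. It is a stand-in for the idea only: the tokenizer and the tiny stop-word set are assumptions made for this example and differ from what scikit-learn's `CountVectorizer` uses. Stop words are dropped first, so an n-gram can bridge over a removed word.

```python
import re
from collections import Counter

# Tiny, made-up stop-word list for illustration only (not the list scikit-learn uses).
TOY_STOP_WORDS = {"the", "a", "is", "of"}


def top_ngrams(texts, n, k):
    """Return the k most common n-grams across texts as (ngram, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in TOY_STOP_WORDS]
        # zip over n shifted copies of the token list yields consecutive n-token windows
        counts.update(" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Made-up markdown cell texts
toy_cells = ["Load the data", "Load the data again", "Plot the data"]
top_unigrams = top_ngrams(toy_cells, n=1, k=1)
top_bigrams = top_ngrams(toy_cells, n=2, k=2)
print(top_unigrams)
print(top_bigrams)
```

In this toy corpus the bigram "load data" is counted twice even though "the" sits between the two words in the raw text.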



In [None]:
%%time
def ngram_df(corpus,nrange,n=None):
    # Count n-grams in the given size range, ignoring English stop words
    vec = CountVectorizer(stop_words = 'english',ngram_range=nrange).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each n-gram over the corpus
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    total_list = words_freq[:n]  # keep the n most frequent n-grams
    df = pd.DataFrame(total_list,columns=['text','count'])
    return df

# Filter the markdown cells once and reuse them for all three n-gram sizes
markdown_source = train_data[(train_data['cell_type']=='markdown')]['source']
unigram_df=ngram_df(markdown_source,(1,1),20)
bigram_df=ngram_df(markdown_source,(2,2),20)
trigram_df=ngram_df(markdown_source,(3,3),20)

In [None]:
fig = make_subplots(
    rows=3, cols=1,subplot_titles=("Unigram","Bigram",'Trigram'),
    specs=[[{"type": "scatter"}],
           [{"type": "scatter"}],
           [{"type": "scatter"}]
          ])

# Bars and their text labels are reversed together so that each label matches its bar
fig.add_trace(go.Bar(
    y=unigram_df['text'][::-1],
    x=unigram_df['count'][::-1],
    marker={'color': "blue"},
    text=unigram_df['count'][::-1],
    textposition = "outside",
    orientation="h",
    name="Unigram",
),row=1,col=1)

fig.add_trace(go.Bar(
    y=bigram_df['text'][::-1],
    x=bigram_df['count'][::-1],
    marker={'color': "blue"},
    text=bigram_df['count'][::-1],
    textposition = "outside",
    orientation="h",
    name="Bigram",
),row=2,col=1)

fig.add_trace(go.Bar(
    y=trigram_df['text'][::-1],
    x=trigram_df['count'][::-1],
    marker={'color': "blue"},
    text=trigram_df['count'][::-1],
    textposition = "outside",
    orientation="h",
    name="Trigram",
),row=3,col=1)

fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Top N Grams',xaxis_title=" ",yaxis_title=" ",
                  showlegend=False,title_x=0.5,height=1200,template="plotly_dark")
fig.show()

<div class='alert alert-info'>
<h4> <center> Stay tuned for further EDA and Prediction⏳ </center></h4>
</div>