**Created by Sanskar Hasija**

**AI4Code Detailed EDA📊**

**12 May 2021**


  # <center> <span style="color:#00BFC4;"> AI4CODE DETAILED EDA📊</span> </center>
## <center><span style="color:#00BFC4;">If you find this notebook useful, support with an upvote👍</span></center>

# <center><span style="color:#e76f51;">Table of Contents</span></center>
<a id="toc"></a>
- [1. Introduction](#1)
- [2. Imports](#2)
- [3. EDA](#3)
    - [3.1 Train Data](#3.1)
        - [3.1.1 Train Data Distribution ](#3.1.1)
    - [3.2 Code Cell Analysis](#3.2)
        - [3.2.1 Code Cells Length Distribution ](#3.2.1)
        - [3.2.2 Code Cells WordCloud ](#3.2.2)
    - [3.3 Markdown Cell Analysis](#3.3)
        - [3.3.1 Markdown Cells Length Distribution ](#3.3.1)
        - [3.3.2 Markdown Cells WordCloud ](#3.3.2)
    - [3.4 Notebooks Analysis](#3.4)
        - [3.4.1 Code Cell Count Analysis ](#3.4.1)
        - [3.4.2 Markdown Cell Count Analysis ](#3.4.2)
        - [3.4.3 Minimum Cell Count Analysis ](#3.4.3)
    

<a id="1"></a>
# **<center><span style="color:#00BFC4;">Introduction </span></center>**

![](https://raw.githubusercontent.com/sanskar-hasija/kaggle/main/images/ai4code_image.png)

<b>The goal of this competition is to understand the relationship between code and comments in Python notebooks. You are challenged to reconstruct the order of markdown cells in a given notebook based on the order of the code cells, demonstrating comprehension of which natural language references which code.</b><br>

<b>Predictions are evaluated by the Kendall tau correlation between predicted cell orders and ground truth cell orders accumulated across the entire collection of test set notebooks.</b><br>

<b>Read more about the Kendall tau correlation: https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient</b>
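
To make the metric concrete, here is a minimal sketch of how such a score could be computed. It assumes the metric has the form `1 - 4 * (total inversions) / (sum of n * (n - 1))`, where an inversion is a pair of cells placed in the wrong relative order and `n` is the number of cells in a notebook. The function names `count_inversions` and `kendall_tau` and the toy cell ids are illustrative assumptions, not part of any competition API.

```python
def count_inversions(ranks):
    """Count pairs (i, j) with i < j whose ranks are out of order."""
    inversions = 0
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[i] > ranks[j]:
                inversions += 1
    return inversions


def kendall_tau(ground_truth, predictions):
    """Score predicted cell orders against true orders, pooled over notebooks."""
    total_inversions = 0
    total_pairs = 0  # accumulates n * (n - 1) over all notebooks
    for gt, pred in zip(ground_truth, predictions):
        # Map each cell id to its true position, then read the ranks
        # in the predicted order.
        position = {cell_id: idx for idx, cell_id in enumerate(gt)}
        ranks = [position[cell_id] for cell_id in pred]
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_pairs += n * (n - 1)
    return 1 - 4 * total_inversions / total_pairs


# Toy notebook with three cells (hypothetical ids).
truth = [["a", "b", "c"]]
perfect = kendall_tau(truth, [["a", "b", "c"]])
one_swap = kendall_tau(truth, [["a", "c", "b"]])
reversed_order = kendall_tau(truth, [["c", "b", "a"]])
print(perfect, round(one_swap, 3), reversed_order)
```

Under this assumed form, a perfect ordering scores 1 and a fully reversed ordering scores -1. The quadratic inversion count is only meant for illustration; a merge-sort or bisect-based count would be preferable for notebooks with hundreds of cells.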

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="2"></a>
# **<center><span style="color:#00BFC4;">IMPORTS </span></center>**

In [None]:
import os
import json
import wordcloud
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from tqdm.notebook import tqdm, trange

## <span style="color:#e76f51;"> Loading Data : </span>


The pandas DataFrame used here was created by [Darien Schettler](https://www.kaggle.com/dschettler8845). Link to the dataset: https://www.kaggle.com/datasets/dschettler8845/ai4code-train-dataframe

In [None]:
## train dataframes
df = pd.read_csv("../input/ai4code-train-dataframe/train.csv", index_col= [0,1])
df.dropna(inplace = True)

df_ancestors = pd.read_csv('../input/AI4Code/train_ancestors.csv', index_col='id')
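# `squeeze=True` was removed from read_csv in pandas 2.0; squeeze the single column into a Series instead
df_orders = pd.read_csv("../input/AI4Code/train_orders.csv", index_col='id').squeeze("columns").str.split()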

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3"></a>
# **<center><span style="color:#00BFC4;">EDA</span></center>**

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations :</u></b><br>

* <i> A total of <b><u>139256</u></b> notebooks are provided in the <b><u>train</u></b> set.</i><br>
* <i> A total of <b><u>4</u></b> notebooks are provided in the <b><u>test</u></b> set. These will be replaced with a <b><u>hidden test</u></b> set for scoring.</i><br>
* <i> There are a total of <b><u>146300</u></b> cells in the constructed <b><u>train</u></b> dataframe, which contains two values of cell_type.</i><br>
* <i> The two cell types are <b><u>code</u></b> and <b><u>markdown</u></b>.</i>
* <i> Almost <b><u>two-thirds</u></b> of the training data consists of Code Cells, and the remaining <b><u>one-third</u></b> consists of Markdown Cells.</i>
</div>
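
The code/markdown split can be read off directly as fractions. The sketch below uses a small toy frame as a stand-in for the train dataframe (the rows are made up for illustration), with only the `cell_type` column reproduced.

```python
import pandas as pd

# Toy stand-in for the train DataFrame; the real one has far more rows.
toy = pd.DataFrame(
    {"cell_type": ["code", "code", "markdown", "code", "markdown", "code"]}
)

# normalize=True returns fractions of the total instead of raw counts.
shares = toy["cell_type"].value_counts(normalize=True)
print(shares.round(3))
```

On the real data the same call would be `df["cell_type"].value_counts(normalize=True)`.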

In [None]:
print(f"\033[94mNumber of notebooks present in train set  = ",len(os.listdir("../input/AI4Code/train")))
print(f"\033[94mNumber of notebooks present in test set  = ",len(os.listdir("../input/AI4Code/test")))

<a id="3.1"></a>
### <span style="color:#e76f51;"> Quick View of train data  : </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b>`df` is a pandas DataFrame with a two-level MultiIndex: id and cell_id. </b><br>
Learn more about multi-level indexing in pandas: https://pandas.pydata.org/docs/user_guide/advanced.html
</div>
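
A short sketch of how a two-level index like this can be queried. The frame below is a toy stand-in with hypothetical ids (`nb_a`, `c1`, ...); only the index names `id` and `cell_id` and the columns `cell_type` and `source` mirror the real dataframe.

```python
import pandas as pd

# Hypothetical ids and sources, for illustration only.
toy = pd.DataFrame(
    {
        "id": ["nb_a", "nb_a", "nb_a", "nb_b", "nb_b"],
        "cell_id": ["c1", "c2", "c3", "c1", "c2"],
        "cell_type": ["code", "markdown", "code", "code", "markdown"],
        "source": ["import os", "# Title", "print(1)", "x = 2", "Some notes"],
    }
).set_index(["id", "cell_id"])

# All cells of one notebook: select on the first index level.
notebook_a = toy.loc["nb_a"]

# A single cell: pass a tuple of (id, cell_id).
one_cell = toy.loc[("nb_a", "c2")]

# Number of cells per notebook, grouped on the first level.
cells_per_notebook = toy.groupby(level="id").size()

print(notebook_a)
print(one_cell["source"])
print(cells_per_notebook)
```

Selecting on the first level drops it from the result, so `notebook_a` is indexed by `cell_id` alone.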

In [None]:
df.head()

<a id="3.1.1"></a>
### <span style="color:#e76f51;"> Train data distribution  : </span>

In [None]:
code_df = df[df["cell_type"] == "code"]
mkd_df = df[df["cell_type"] == "markdown"]


print(f'\033[94mNumber of Code Cells: {len(code_df)}')
print(f'\033[94mNumber of Markdown Cells: {len(mkd_df)}')

labels=['Code Cells', 'Markdown Cells']
values= [len(code_df), len(mkd_df)]
colors = ['#DE3163', '#58D68D']

fig = go.Figure(data=[go.Pie(
    labels=labels, 
    values=values, 
    pull=[0.1, 0 ],
    marker=dict(colors=colors, 
                line=dict(color='#000000', 
                          width=2))
)])
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3.2"></a>
## <span style="color:#e76f51;"> Code cells analysis  : </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations:</u></b><br>

* <i> Mean Length for Code Cells is <b><u>25 words</u></b></i><br>
* <i> Max words in a Code Cell is <b><u>74589 words</u></b></i><br>
* <i> There are many outliers in Code Cells</i><br>
</div>
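
One way to put a number on "many outliers" is the 1.5 × IQR rule, which is also what matplotlib's box plot uses for its whiskers by default. The sketch below applies it to a handful of made-up cell sources; length is measured as the number of whitespace-separated tokens, matching the word counts in this section.

```python
import numpy as np

# Hypothetical cell sources; the real ones come from code_df["source"].
sources = [
    "import os",
    "x = 1",
    "print(x)",
    "a = b + c",
    "for i in range(3): print(i)",
    "y = 2",
    " ".join(["token"] * 500),  # one very long cell
]

# Length in words = number of whitespace-separated tokens.
lengths = np.array([len(s.split()) for s in sources])

# Points above Q3 + 1.5 * IQR are drawn as outliers in a box plot.
q1, q3 = np.percentile(lengths, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = lengths[lengths > upper_fence]

print("mean:", round(lengths.mean(), 2), "median:", np.median(lengths))
print("upper fence:", upper_fence, "outliers:", outliers)
```

A single very long cell pulls the mean far above the median, which is why the mean alone says little about a typical cell.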

### <span style="color:#e76f51;"> Sample Code Cell: </span>

In [None]:
print(f'\033[94m')
print(code_df.iloc[0]["source"])

<a id="3.2.1"></a>
### <span style="color:#e76f51;"> Code cells Length Distribution  : </span>

In [None]:
# Word count per cell; the vectorized .str accessor avoids integer lookups on the MultiIndex
code_lengths = code_df["source"].str.split().str.len().to_numpy()
print(f'\033[94m Min Code Cells Length = ', min(code_lengths))
print(f'\033[94m Mean Code cells Length = ', round(np.mean(code_lengths),2))
print(f'\033[94m Max Code Cells Length = ', max(code_lengths))

In [None]:
fig,ax= plt.subplots(figsize= (18,6))
plt.boxplot(code_lengths, vert = False)
plt.xlabel("Length of Code Cells");

<a id="3.2.2"></a>
### <span style="color:#e76f51;"> Code cells WordCloud  : </span>

In [None]:
# Join cells with a space so the last word of one cell does not merge with the first word of the next
wordcloud_notes = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(code_df["source"].iloc[:1000]))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_notes, interpolation='bilinear')
ax.set_axis_off()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3.3"></a>
## <span style="color:#e76f51;"> Markdown cells analysis  : </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations:</u></b><br>

* <i> Mean Length for Markdown Cells is <b><u>29 words</u></b></i><br>
* <i> Max words in a Markdown Cell is <b><u>38939 words</u></b></i><br>
* <i> There are many outliers in Markdown Cells as well.</i><br>
</div>

### <span style="color:#e76f51;"> Sample Markdown Cell: </span>

In [None]:
print(f'\033[94m')
print(mkd_df.iloc[59]["source"])

<a id="3.3.1"></a>
### <span style="color:#e76f51;"> Markdown Cells Length Distribution  : </span>

In [None]:
# Word count per cell; the vectorized .str accessor avoids integer lookups on the MultiIndex
mkd_lengths = mkd_df["source"].str.split().str.len().to_numpy()
print(f'\033[94m Min Markdown Cells Length = ', min(mkd_lengths))
print(f'\033[94m Mean Markdown cells Length = ', round(np.mean(mkd_lengths),2))
print(f'\033[94m Max Markdown Cells Length = ', max(mkd_lengths))

In [None]:
fig,ax= plt.subplots(figsize= (18,6))
plt.boxplot(mkd_lengths, vert = False)
plt.xlabel("Length of Markdown Cells");

<a id="3.3.2"></a>
### <span style="color:#e76f51;"> Markdown Cells WordCloud  : </span>

In [None]:
# Join cells with a space so the last word of one cell does not merge with the first word of the next
wordcloud_notes = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(mkd_df["source"].iloc[:1000]))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_notes, interpolation='bilinear')
ax.set_axis_off()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3.4"></a>
## <span style="color:#e76f51;"> Notebooks Analysis: </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations:</u></b><br>

* <i> Minimum count for both Code Cells and Markdown cells across all notebooks is <b><u>1</u></b>.</i><br>
* <i> Mean <b><u>Code cells</u></b> count across all notebooks is  <b><u>30 cells</u></b></i><br>
* <i> Mean <b><u>Markdown cells</u></b> count across all notebooks is  <b><u>15 cells</u></b></i><br>
* <i> Max count of <b><u>Code cells</u></b> and  <b><u>Markdown cells</u></b> across all notebooks is <b><u>809 cells</u></b> and <b><u>537 cells</u></b> respectively.</i><br>
  
</div>
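
Per-notebook counts like the ones summarized above can also be obtained without an explicit loop, by grouping on the first index level. This is a sketch on a toy frame with hypothetical ids, not the code used to produce the numbers in this notebook.

```python
import pandas as pd

# Toy stand-in for the train DataFrame (hypothetical ids).
toy = pd.DataFrame(
    {
        "id": ["nb_a", "nb_a", "nb_a", "nb_b", "nb_b"],
        "cell_id": ["c1", "c2", "c3", "c1", "c2"],
        "cell_type": ["code", "markdown", "code", "code", "markdown"],
    }
).set_index(["id", "cell_id"])

# Count each cell type per notebook, then pivot the types into columns.
counts = (
    toy.groupby(level="id")["cell_type"]
    .value_counts()
    .unstack(fill_value=0)
)
counts["total"] = counts["code"] + counts["markdown"]

print(counts)
print("mean code cells per notebook:", counts["code"].mean())
```

`fill_value=0` keeps notebooks that lack one of the two cell types in the table with a count of zero.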

In [None]:
## loading code_cell counts from notebooks 
notebook_ids = [notebook[:-5] for notebook in os.listdir("../input/AI4Code/train")]
code_counts= []
markdown_counts= [] 
for i in trange(len(notebook_ids)):
    temp_df = df.loc[(notebook_ids[i])]
    code_counts.append((temp_df["cell_type"] == "code").sum())
    markdown_counts.append((temp_df["cell_type"] == "markdown").sum())

# Building from a dict keeps the counts as integers, so no string round-trip is needed
counts_df = pd.DataFrame({"notebook_id": notebook_ids, "code_count": code_counts, "markdown_count": markdown_counts})
counts_df["total_count"] = counts_df["code_count"] + counts_df["markdown_count"]
print(f'\033[94m Minimum Cell count in any notebook', counts_df["total_count"].min())
print(f'\033[94m Maximum Cell count in any notebook', counts_df["total_count"].max())
print(f'\033[94m Mean of Cell counts across all notebooks', round(counts_df["total_count"].mean(), 2 ))
counts_df.head()

### <span style="color:#e76f51;"> Outlier Notebooks Analysis: </span>

In [None]:
k = 100
top_k = counts_df.sort_values(by = ["total_count"], ascending=False)[:k]
fig = px.bar(data_frame=top_k, 
             x = "notebook_id" ,
             y = [ "code_count", "markdown_count"], 
             color_discrete_sequence=['#DE3163', '#58D68D']
         
            )
fig.update_layout(
    title={
        'text': f"Cell Type Count analysis for top {k} cell count notebooks (OUTLIERS)",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Notebook ID",
    yaxis_title="Count",
    template="plotly_white"
    
)
fig.update_traces(marker_line_color='black',
                  marker_line_width=0.9,opacity = 0.9)
fig.show()

<a id="3.4.1"></a>
### <span style="color:#e76f51;"> Code Cell Count Analysis: </span>

In [None]:
print(f'\033[94m Minimum Code Cell count in any notebook', counts_df["code_count"].min())
print(f'\033[94m Maximum Code Cell count in any notebook', counts_df["code_count"].max())
print(f'\033[94m Mean of Code Cell counts across all notebooks', round(counts_df["code_count"].mean(), 2 ))

### <span style="color:#e76f51;"> Code Cell Count Distribution across all notebooks: </span>

In [None]:
fig = px.histogram(data_frame=counts_df, 
                   x= "code_count",
                   color_discrete_sequence=["#DE3163"],
                   marginal="violin")
fig.update_layout(
    title={
        'text': "Code Cell Count Distribution",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Code Cells",
    yaxis_title="Count",
    showlegend=False,
    template="plotly_white"
)
fig.show()
fig = px.histogram(data_frame=counts_df[counts_df["code_count"]<100], 
                   x= "code_count",
                   color_discrete_sequence=["#58D68D"],
                   marginal="violin")
fig.update_layout(
    title={
        'text': "Code Cell Count Distribution (COUNTS < 100 )",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Code Cells",
    yaxis_title="Count",
    showlegend=False,
    template="plotly_white"
)
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3.4.2"></a>
### <span style="color:#e76f51;"> Markdown Cell Count Analysis: </span>

In [None]:
print(f'\033[94m Minimum Markdown Cell count in any notebook', counts_df["markdown_count"].min())
print(f'\033[94m Maximum Markdown Cell count in any notebook', counts_df["markdown_count"].max())
print(f'\033[94m Mean of Markdown Cell counts across all notebooks', round(counts_df["markdown_count"].mean(), 2 ))

### <span style="color:#e76f51;"> Markdown Cell Count Distribution across all notebooks: </span>

In [None]:
fig = px.histogram(data_frame=counts_df, 
                   x= "markdown_count",
                   color_discrete_sequence=["#DE3163"],
                   marginal="violin")
fig.update_layout(
    title={
        'text': "Markdown Cell Count Distribution",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Markdown Cells",
    yaxis_title="Count",
    showlegend=False,
    template="plotly_white"
)

fig.show()
fig = px.histogram(data_frame=counts_df[counts_df["markdown_count"]<100], 
                   x= "markdown_count",
                   color_discrete_sequence=["#58D68D"],
                   marginal="violin")
fig.update_layout(
    title={
        'text': "Markdown Cell Count Distribution (COUNTS < 100 )",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Markdown Cells",
    yaxis_title="Count",
    showlegend=False,
    template="plotly_white"
)
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3.4.3"></a>
### <span style="color:#e76f51;"> Minimum Cell Count Analysis: </span>

In [None]:
either_one = counts_df[(counts_df["markdown_count"] == 1) | (counts_df["code_count"] == 1 )] 
both_one = counts_df[(counts_df["markdown_count"] == 1) & (counts_df["code_count"] == 1 )] 
code_count_one = counts_df[counts_df["code_count"] == 1 ]
markdown_count_one  = counts_df[counts_df["markdown_count"] == 1 ]

print(f"\033[94mNotebooks with either 1 code cell or 1 markdown cell = ", len(either_one))
print(f"\033[94mNotebooks with exactly 1 code cell and 1 markdown cell = ", len(both_one ))
print(f"\033[94mNotebooks with only 1 code cell  = " ,len(code_count_one))
print(f"\033[94mNotebooks with only 1 markdown cell  = " ,len(markdown_count_one))

### <span style="color:#e76f51;"> Code cells count vs Markdown cells count: </span>

In [None]:
fig = px.scatter(data_frame=counts_df, 
                 x = "code_count", 
                 y = "markdown_count", 
                 size = "code_count",
                 color_discrete_sequence=["#DE3163"])
fig.add_shape(type='line',
                x0=0,
                y0=0,
                x1=800,
                y1=800,
                line=dict(color='Black'),
                xref='x',
                yref='y',name = "X=Y line"
             )
fig.update_layout(
    title={
        'text': "Code Cell Counts vs Markdown Cell Counts",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Code Cell Counts",
    yaxis_title="Markdown Cell Counts",
    showlegend=False,
    template="plotly_white"
)
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    
    
### <center>Thank you for reading🙂</center>
### <center>If you have any feedback or find anything wrong, please let me know!</center>

</div>
