# Visualising Cricket - Manhattan and Worm
Two of the common statistical visualisations that you see in cricket reporting and broadcasts are the Manhattan plot and 'The Worm'. The Manhattan plot is a breakdown of the runs scored and wickets taken in each over of an innings, displayed as a bar chart. 'The Worm' is a line graph that plots the cumulative runs of both teams against one another, so for any given point in the game you can see how many runs each team had scored.

Both involve aggregating the data to over level, so I will do this first and show how to turn the available ball-by-ball data into a usable format for these graphics.

In [None]:
# Import libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from IPython.display import HTML, display

# Import the ball by ball data and match information
raw_ball_data = pd.read_csv('/kaggle/input/ipl-complete-dataset-20082020/IPL Ball-by-Ball 2008-2020.csv')
raw_match_data = pd.read_csv('/kaggle/input/ipl-complete-dataset-20082020/IPL Matches 2008-2020.csv')

In [None]:
# Filter to games since start of 2020
match_mask = (pd.to_datetime(raw_match_data.date) > pd.to_datetime('2020-01-01'))
match_data = raw_match_data[match_mask]

# Join in the match information into the ball-level data
combined_data = match_data.set_index('id').join(raw_ball_data.set_index('id'), how='inner', lsuffix='match_')

# Maintain just the bare data required for the plots 
ball_data = combined_data.loc[:, ['inning', 'over', 'ball', 'total_runs', 'is_wicket']].sort_values(['id', 'inning', 'over', 'ball'])
ball_data.head(5)

Above you can see `ball_data`, the filtered ball-level data for the IPL matches played since the start of 2020. This gives us everything we need to aggregate the data to the level required for the final plots.
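
As a rough illustration of the filter-and-join step, the sketch below uses two small made-up tables (`toy_matches` and `toy_balls` are hypothetical and are not the IPL files). It keeps only the matches after a cut-off date and then inner-joins the ball rows on the shared `id` index, so balls from older matches drop out.

```python
import pandas as pd

# Hypothetical match and ball tables sharing an 'id' column
toy_matches = pd.DataFrame(
    {"id": [1, 2], "date": ["2019-05-12", "2020-09-19"], "venue": ["A", "B"]}
)
toy_balls = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "over": [0, 0, 0, 0],
        "ball": [1, 2, 1, 2],
        "total_runs": [0, 4, 1, 6],
    }
)

# Keep matches played after the cut-off date
recent = toy_matches[pd.to_datetime(toy_matches["date"]) > pd.to_datetime("2020-01-01")]

# Inner join on the 'id' index: only balls from the retained matches survive
toy_combined = recent.set_index("id").join(toy_balls.set_index("id"), how="inner")
print(toy_combined)
```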

## Building a set of plotting functions 
In order to go from `ball_data` to the final graphics, we need to do some aggregation and formatting. It is best to pull these steps out into distinct functions so that the code is easier to maintain and, hopefully, easier to follow. We start with one that takes the ball-level data and aggregates it to over level, as that is what the graphics use.

In [None]:
def over_breakdown(
    ball_data,
    match_id,
    inning,
    over_col = 'over', 
    runs_col = 'total_runs', 
    wicket_col = 'is_wicket'
    ):
    """
    Aggregate ball-level data to over-level for a given match innings
    
    Parameters
    ----------
    ball_data: pd.DataFrame
        Ball-level data to aggregate
    match_id: numeric
        The id of the match (stored as the index of ball_data)
    inning: numeric
        The inning of the match to aggregate 
    over_col: str
        Name of the column that represents the over number
    runs_col: str
        Name of the column that represents the total runs scored from the ball
    wicket_col: str
        Name of the column that represents the number of wickets taken 

    Returns
    -------
    pandas.DataFrame
        One row per over, with the summed runs and wickets for that over
    """
    # Limit ball_data to the match and inning given, and apply grouping by right columns
    mask = (ball_data.index == match_id) & (ball_data.inning == inning)
    grouped_data = ball_data.loc[mask, [over_col, runs_col, wicket_col]].groupby(over_col)
    return grouped_data.sum().reset_index()

In [None]:
for inn in [1,2]:
    overs = over_breakdown(ball_data, 1216492, inn)
    display(HTML(overs.head().to_html()))
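
To make the aggregation concrete, here is a minimal sketch on a tiny made-up data set (`toy_balls` and its values are hypothetical, not taken from the IPL data). It applies the same filter, group and sum steps as `over_breakdown()`.

```python
import pandas as pd

# Hypothetical ball-level data: one match (id 1), one innings, two overs
toy_balls = pd.DataFrame(
    {
        "inning": [1, 1, 1, 1, 1, 1],
        "over": [0, 0, 0, 1, 1, 1],
        "ball": [1, 2, 3, 1, 2, 3],
        "total_runs": [1, 4, 0, 6, 0, 2],
        "is_wicket": [0, 0, 1, 0, 1, 1],
    },
    index=pd.Index([1] * 6, name="id"),
)

# Same steps as over_breakdown(): filter to the innings, group by over, sum
mask = (toy_balls.index == 1) & (toy_balls["inning"] == 1)
toy_overs = (
    toy_balls.loc[mask, ["over", "total_runs", "is_wicket"]]
    .groupby("over")
    .sum()
    .reset_index()
)
print(toy_overs)
```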

Now that we have an over-level view of the match innings that we are interested in, we have what we need to generate a Manhattan plot. A feature of the Manhattan plot is that the wickets often 'sit' on top of the bars, so an over in which 4 runs were scored and 2 wickets were lost would be a bar of height 4 with 2 points stacked vertically above it. To do this, I'm going to define a function that works out the x and y positions of each wicket point, given an over-level data set, that we can call within our Manhattan function.

In [None]:
def wicket_scatter_points(
    over_data, 
    over_col = 'over', 
    runs_col = 'total_runs', 
    wicket_col = 'is_wicket', 
    base_wicket_spacing = 0.5,
    inter_wicket_spacing = 0.7
    ):
    """
    Calculate x and y position for wicket scatter points for plt.bar()
    
    Parameters
    ----------
    over_data: pd.DataFrame
        Aggregated data at over-level (result of over_breakdown())
    over_col: str
        Name of the column that represents the over number
    runs_col: str
        Name of the column that represents the total runs scored in the over
    wicket_col: str
        Name of the column that represents the number of wickets taken 
    base_wicket_spacing: numeric
        The numeric spacing to be placed between the top of the bar and the first wicket point
    inter_wicket_spacing: numeric
        The numeric spacing to be placed between wicket points on the graph

    Returns
    -------
    pandas.DataFrame
        Contains the coordinates for the scatter points for wickets
    """
    # Only keep those overs that contain wickets 
    wickets_mask = (over_data[wicket_col] > 0)
    wickets_data = over_data.loc[wickets_mask, [over_col, runs_col, wicket_col]]
    wickets_data_array = np.array(wickets_data)
    
    # Create a list of y coordinates for each wicket in the over-level array and then `explode` that into multiple rows
    # Naming the columns up front keeps this working when no over contains a wicket
    wickets = pd.DataFrame(
        [[x[0], x[1], np.arange(x[2]) * inter_wicket_spacing] for x in wickets_data_array],
        columns=['x', 'runs', 'offset']
    ).explode('offset')
    
    # Calculate the y position using the base spacing and the offset calculated above 
    wickets['y'] = wickets['runs'] + base_wicket_spacing + wickets['offset']
    return wickets.loc[:, ['x', 'y']]

In [None]:
over_example = over_breakdown(ball_data, 1216492, 1)
display(HTML(over_example.tail(2).to_html()))
wicket_scatter_points(over_example, 'over', 'total_runs', 'is_wicket', 0.7, 1)

You can see above that overs containing multiple wickets now have multiple coordinates. The first point sits `base_wicket_spacing` runs above the bar and each further point sits `inter_wicket_spacing` runs above the previous one, so both can be adjusted manually if required. For example, in the penultimate over 2 wickets were lost and 5 runs were scored, so the first wicket has a y value of _runs + base_wicket_spacing_ and the second has a y value of _runs + base_wicket_spacing + inter_wicket_spacing_.
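
The arithmetic for that over can be checked by hand. This small sketch uses the figures quoted in the text (5 runs, 2 wickets) and the spacings passed in the call above (0.7 and 1); the variable names are only for illustration.

```python
import numpy as np

# Figures from the example: 5 runs and 2 wickets in the over
runs, n_wickets = 5, 2
base_wicket_spacing, inter_wicket_spacing = 0.7, 1

# One offset per wicket: 0 for the first, then multiples of the inter-wicket spacing
offsets = np.arange(n_wickets) * inter_wicket_spacing

# y = runs + base spacing + offset, matching wicket_scatter_points()
y_positions = runs + base_wicket_spacing + offsets
print(y_positions)
```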

## Manhattan plot
Now that we have the primary functions that we need for formatting the data in the right way we can think about putting that together and plotting the data on a given `axes`. 

In [None]:
def innings_manhattan(
    ax, 
    ball_data, 
    match_id, 
    inning,
    over_col = 'over', 
    runs_col = 'total_runs', 
    wicket_col = 'is_wicket', 
    base_wicket_spacing = 0.5,
    inter_wicket_spacing = 0.7
    ):
    """
    Plot a Manhattan plot for a given match's innings
    
    Parameters
    ----------
    ax: Axes
        the axes object to plot the graphic on
    ball_data: pd.DataFrame
        ball-level data with match_id as the index column
    match_id: numeric
        id for the match that you want to plot
    inning: numeric
        the innings in the given match to plot
    over_col: str
        Name of the column that represents the over number
    runs_col: str
        Name of the column that represents the total runs scored from the ball
    wicket_col: str
        Name of the column that represents the number of wickets taken 
    base_wicket_spacing: numeric
        The numeric spacing to be placed between the top of the bar and the first wicket point
    inter_wicket_spacing: numeric
        The numeric spacing to be placed between wicket points on the graph

    Returns
    -------
    axes
        Returns the input axes with additions for plot
    """
    breakdown = over_breakdown(ball_data, match_id, inning, over_col, runs_col, wicket_col)
    wickets = wicket_scatter_points(breakdown, over_col, runs_col, wicket_col, base_wicket_spacing, inter_wicket_spacing)
    wicket_colour = '#e74c3c'
    run_colour = '#3498db'
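    # Use `color` (not `fill`, which is a boolean flag) so the bar colour is applied,
    # and respect the column names passed in rather than hard-coding them
    ax.bar(breakdown[over_col] + 1, breakdown[runs_col], color=run_colour)
    ax.set_xticks(range(1, 21))
    ax.set_yticks(range(0, np.max(breakdown[runs_col]) + 1, 2))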
    ax.set_xlabel('Over')
    ax.set_ylabel('Runs from over')
    ax.scatter(wickets.x + 1, wickets.y, s=50, c=wicket_colour)
    
    return ax

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(10, 6), dpi=180, sharex=True)
# Note spacing adjustment here to account for the figsize and axes proportions
innings_manhattan(ax[0], ball_data, 1216492, 1, 'over', 'total_runs', 'is_wicket', 0.7, 1.1) 

# Remove the additions for the top axes as sharex=True
ax[0].set_xlabel('')
ax[0].tick_params(axis='x', which='both', bottom=False)

innings_manhattan(ax[1], ball_data, 1216492, 2, 'over', 'total_runs', 'is_wicket', 0.7, 1.1)
plt.show()

That looks pretty good! We could probably add the functionality to pass in additional attributes that set the colour of the bars and points, but the functionality is there. 
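
One way the colour options could be exposed is sketched below. `coloured_manhattan` is a hypothetical helper, not part of the notebook above, and it assumes the over-level data and wicket coordinates have already been prepared; the toy data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
import pandas as pd


def coloured_manhattan(ax, over_data, wickets, run_colour="#3498db", wicket_colour="#e74c3c"):
    """Hypothetical helper: draw bars and wicket points with caller-chosen colours."""
    bars = ax.bar(over_data["over"] + 1, over_data["total_runs"], color=run_colour)
    points = ax.scatter(wickets["x"] + 1, wickets["y"], s=50, c=wicket_colour)
    ax.set_xlabel("Over")
    ax.set_ylabel("Runs from over")
    return bars, points


# Made-up over-level data and wicket coordinates
toy_overs = pd.DataFrame({"over": [0, 1], "total_runs": [5, 8], "is_wicket": [1, 0]})
toy_wickets = pd.DataFrame({"x": [0], "y": [5.5]})

fig, ax = plt.subplots()
bars, points = coloured_manhattan(ax, toy_overs, toy_wickets, run_colour="#9b59b6")
```

Keeping the colours as keyword arguments with defaults means existing calls would not need to change.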

## Worm plot
We have already done some of the data cleaning required for this; we just need the over-level data to be cumulative instead so that we can track the progress of the game overall.

In [None]:
over_example['cumsum_runs'] = over_example['total_runs'].cumsum()
over_example['cumsum_wickets'] = over_example['is_wicket'].cumsum()
over_example

We actually only need the cumulative version of the **runs** for this particular application, but the running wicket count (and thus the score) would be useful in other places, so I have included it above.
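
As a sketch of how the running wickets might be used, the example below builds a conventional "runs/wickets" score after each over. The data is made up and the `score` column name is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical over-level data in the same shape that over_breakdown() returns
toy_overs = pd.DataFrame(
    {"over": [0, 1, 2], "total_runs": [5, 8, 3], "is_wicket": [1, 0, 2]}
)
toy_overs["cumsum_runs"] = toy_overs["total_runs"].cumsum()
toy_overs["cumsum_wickets"] = toy_overs["is_wicket"].cumsum()

# Combine the two running totals into a score string for each over
toy_overs["score"] = (
    toy_overs["cumsum_runs"].astype(str) + "/" + toy_overs["cumsum_wickets"].astype(str)
)
print(toy_overs[["over", "score"]])
```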

In [None]:
def add_innings_worm(
    ax, 
    ball_data, 
    match_id, 
    inning,
    over_col = 'over', 
    runs_col = 'total_runs', 
    wicket_col = 'is_wicket', 
    base_wicket_spacing = 0.5,
    inter_wicket_spacing = 0.7, 
    line_colour='#3498db',
    wicket_colour='#e74c3c'
    ):
    """
    Add a worm (cumulative runs line) for a given match's innings
    
    Parameters
    ----------
    ax: Axes
        the axes object to plot the graphic on
    ball_data: pd.DataFrame
        ball-level data with match_id as the index column
    match_id: numeric
        id for the match that you want to plot
    inning: numeric
        the innings in the given match to plot
    over_col: str
        Name of the column that represents the over number
    runs_col: str
        Name of the column that represents the total runs scored from the ball
    wicket_col: str
        Name of the column that represents the number of wickets taken 
    base_wicket_spacing: numeric
        The numeric spacing to be placed between the line and the first wicket point
    inter_wicket_spacing: numeric
        The numeric spacing to be placed between wicket points on the graph

    Returns
    -------
    axes
        Returns the input axes with additions for plot
    """
    breakdown = over_breakdown(ball_data, match_id, inning, over_col, runs_col, wicket_col)
    breakdown['cumsum_runs'] = breakdown[runs_col].cumsum()
    wickets = wicket_scatter_points(breakdown, over_col, 'cumsum_runs', wicket_col, base_wicket_spacing, inter_wicket_spacing)
    ax.plot(breakdown.index + 1, breakdown.cumsum_runs, c=line_colour)
    ax.set_xticks(range(1, 21))
    ax.set_yticks(range(0, np.max(breakdown.cumsum_runs) + 1, 10))
    ax.set_xlabel('Over')
    ax.set_ylabel('Runs')
    ax.scatter(wickets.x + 1, wickets.y, s=25, c=wicket_colour)
    
    return ax

We can use this function to add a line to the generated `Axes` for each innings. As it is likely that we are going to be using the same `Axes` for this plot, I have added the ability to define the colours of both the points and the line so that you are able to differentiate between the two lines. 

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5), dpi=180, sharex=True)
add_innings_worm(ax, ball_data, 1216492, 1, 'over', 'total_runs', 'is_wicket', 3, 3, '#e67e22', '#e67e22') 
add_innings_worm(ax, ball_data, 1216492, 2, 'over', 'total_runs', 'is_wicket', 3, 3, '#9b59b6', '#9b59b6') 
plt.show()

- - -

If you have any suggestions, ideas or tips about how I can improve this kernel, then I would love to hear them.

If you've enjoyed reading this then please consider **upvoting** this kernel!