# <center>IPL Lovers</center>

![IPL](https://www.bharatarmy.com/Docs/Banner_ipl2020.jpg)

<br>
Are you a Cricket Lover?<br>
Do you love watching Cricket?<br>
Are you interested in reading commentary on Cricbuzz?

<br><br>
This dataset is for you…


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os


import plotly.express as px
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

from PIL import Image
import requests
from io import BytesIO

DATA_DIR = "/kaggle/input/can-generate-automatic-commentary-for-ipl-cricket/"

for dirname, _, filenames in os.walk(DATA_DIR):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Problem:

Generate automated cricket commentary based on the scoring event (Four, Six or Out). This dataset provides the history of IPL highlights commentary from Cricbuzz.


The dataset contains two files.

- IPL Schedule 2008 - 2020 (Current)
- IPL Match Highlights commentary

More than 10K commentary entries are available.


In [None]:
ipl_schedule_df = pd.read_csv(os.path.join(DATA_DIR, "IPL_SCHEDULE_2008_2020.csv"))
ipl_highlights_df = pd.read_csv(os.path.join(DATA_DIR, "IPL_Match_Highlights_Commentary.csv"))
print(f"IPL Schedule: {ipl_schedule_df.shape}, IPL Highlights {ipl_highlights_df.shape}")

Observation:

- Match_id is a unique identifier that maps each match in the IPL schedule sheet to its highlights in the separate highlights sheet.
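
The mapping described in this observation can be checked before joining the two sheets. The following is a minimal sketch on made-up toy frames (the values are illustrative, not taken from the dataset): it confirms that `Match_id` is unique on the schedule side and asks pandas to validate the many-to-one relationship during the merge.

```python
import pandas as pd

# Toy stand-ins for the two sheets; the values are made up for illustration.
schedule = pd.DataFrame({
    "Match_id": [1, 2, 3],
    "IPL_year": [2019, 2019, 2020],
})
highlights = pd.DataFrame({
    "Match_id": [1, 1, 3],
    "Run_scored": ["FOUR", "OUT", "SIX"],
})

# Match_id should identify exactly one row in the schedule sheet.
assert schedule["Match_id"].is_unique

# validate="many_to_one" makes pandas raise a MergeError if the key
# on the schedule side turns out not to be unique.
merged = highlights.merge(
    schedule, on="Match_id", how="inner", validate="many_to_one"
)
print(merged)
```

On the real data, the same check would be run on `ipl_schedule_df` and `ipl_highlights_df`.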

<strong>IPL Schedule Data</strong>

- Match_id, unique identifier for every match
- IPL_year, the year the season was held
- Match_Info, the match number, or whether it is a qualifier, eliminator or final
- Match_Date, the date the match was held (YYYY-mm-dd format)
- Match_Team, the teams that played each other
- Match_Result, the match outcome, such as which team won
- Match_Cricbuzz_URL, the Cricbuzz link for the match highlights
- Stadium, the stadium where the match was played
- Location, the location where the match took place
- Highlights_available, whether match highlights are available on Cricbuzz

<strong>IPL Highlights Data</strong>

- Match_id, unique identifier for every match
- Team, the teams that played each other
- Over_num, the over of the ball for which commentary is available
- Commentary, the ball commentary
- Bowler_Onstrike, the names of the bowler and the on-strike batsman
- Run_scored, the run scored on the ball
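
Because Bowler_Onstrike packs two names into one field, it may be convenient to split it into separate columns. The sketch below assumes a `"<bowler> to <batsman>"` layout; that separator is an assumption and should be checked against the actual values first. The sample names and the new `Bowler` and `Batsman` columns are hypothetical.

```python
import pandas as pd

# Hypothetical values; the " to " separator is an assumption,
# not something the dataset description confirms.
sample = pd.DataFrame({
    "Bowler_Onstrike": ["Bowler A to Batsman B", "Bowler C to Batsman D"],
})

# Split once on the separator and keep both halves as new columns.
parts = sample["Bowler_Onstrike"].str.split(" to ", n=1, expand=True)
sample["Bowler"] = parts[0].str.strip()
sample["Batsman"] = parts[1].str.strip()
print(sample)
```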

In [None]:
ipl_schedule_df.head(5)

In [None]:
ipl_highlights_df.head(5)

## Number of matches held each year

In [None]:
match_distribution = ipl_schedule_df["IPL_year"].value_counts()

fig = px.bar(match_distribution, title="Number of matches held each year")
fig.update_layout(
    xaxis_title="IPL year",
    yaxis_title="Number of matches")
fig.show()

## Most-played stadiums

In [None]:
stadium_distribution = ipl_schedule_df["Stadium"].value_counts()

fig = px.bar(stadium_distribution, title="Stadium vs number of matches")
fig.update_layout(
    xaxis_title="Stadium",
    yaxis_title="Number of matches held")

fig.show()

Observation:

- M. Chinnaswamy Stadium hosted the most matches across the 2008 to 2020 IPL seasons.
- Eden Gardens hosted the second-most matches.

## Location vs number of stadiums

In [None]:
(
    ipl_schedule_df.groupby(["Location"])["Stadium"]
    .nunique()
    .reset_index()
    .sort_values("Stadium", ascending=False)
    .reset_index(drop=True)
)

In [None]:
## Mumbai stadiums
ipl_schedule_df[ipl_schedule_df["Location"] == "Mumbai"]["Stadium"].unique()

## Location vs number of matches

In [None]:
location_distribution = ipl_schedule_df["Location"].value_counts()

fig = px.bar(location_distribution, title="Location vs number of matches")
fig.update_layout(
    xaxis_title="Location",
    yaxis_title="Number of matches held")

fig.show()

Observation:

- Although M. Chinnaswamy Stadium hosted the most matches of any single stadium, Mumbai and Bengaluru lead when matches are counted by location.

- In fact, Mumbai's matches are spread across only 3 stadiums.
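
The difference between counting per stadium and counting per location can be seen on a small example. This is a sketch on made-up data (the city and stadium names are placeholders): one city has the busiest single stadium, while another city leads on total matches because its matches are split over several stadiums.

```python
import pandas as pd

# Made-up schedule rows: City X splits its matches over three stadiums,
# City Y plays everything at one stadium.
toy = pd.DataFrame({
    "Location": ["City X"] * 4 + ["City Y"] * 3,
    "Stadium": ["X1", "X1", "X2", "X3", "Y1", "Y1", "Y1"],
})

# Matches and distinct stadiums per location, busiest location first.
summary = (
    toy.groupby("Location")
    .agg(matches=("Stadium", "size"), stadiums=("Stadium", "nunique"))
    .sort_values("matches", ascending=False)
)

# Busiest single stadium, regardless of location.
top_stadium = toy["Stadium"].value_counts().idxmax()

print(summary)
print("Busiest stadium:", top_stadium)
```

Applied to `ipl_schedule_df`, the same aggregation would put the per-location totals and stadium counts side by side.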

## Matches with highlights commentary available

In [None]:
ipl_schedule_df["Highlights_available"].value_counts()

In [None]:
match_highlights_date = ipl_schedule_df[ipl_schedule_df["Highlights_available"] == True]["Match_Date"]
print(f"The match highlights available from {match_highlights_date.min()} to {match_highlights_date.max()}")

- Only 188 matches have highlights commentary on Cricbuzz.
- Match highlights are available from 2017 to 2020 (current).
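
To see how the available highlights are distributed over the seasons, the dates can be parsed and counted per year. The sketch below uses made-up rows with the same `Match_Date` and `Highlights_available` columns; the dates are illustrative only.

```python
import pandas as pd

# Made-up rows using the schedule's column names.
toy_schedule = pd.DataFrame({
    "Match_Date": ["2016-05-01", "2017-04-05", "2017-04-06", "2020-09-19"],
    "Highlights_available": [False, True, True, True],
})

# Keep only matches that have highlights, then parse the YYYY-mm-dd dates.
available = toy_schedule[toy_schedule["Highlights_available"]]
dates = pd.to_datetime(available["Match_Date"], format="%Y-%m-%d")

# Number of matches with highlights per calendar year, in year order.
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)
```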

In [None]:
## Join the Schedule info with Highlights info using match_id

## Highlights Data

In [None]:
ipl_schedule_highlights_available = ipl_schedule_df[ipl_schedule_df["Highlights_available"] == True]

In [None]:
ipl_schedule_commentary_df = pd.merge(ipl_highlights_df, ipl_schedule_highlights_available, on="Match_id")
ipl_schedule_commentary_df.columns

## Commentary wordcloud

In [None]:
stopwords = set(STOPWORDS)


wordcloud = WordCloud(stopwords=stopwords, contour_width=3, contour_color='steelblue', background_color="white", max_words=1000)
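# Drop missing commentary and cast to str so that join does not fail on NaN values
wordcloud.generate(",".join(ipl_schedule_commentary_df["Commentary"].dropna().astype(str).tolist()))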

plt.figure(figsize = (20,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("English word distribution",fontdict={"fontsize":20}, pad=2)
plt.show()

## Run Scored Distribution

In [None]:
score_distribution = ipl_schedule_commentary_df["Run_scored"].value_counts() 

fig = px.bar(score_distribution, title="Run Distribution")
fig.update_layout(
    xaxis_title="Run",
    yaxis_title="Frequency")

fig.show()

Observation:

- Four, Six and Out have a significant number of match highlights available.

**Can we build an automatic way of generating commentary for a score based on the historical data? I'm interested to see how a Transformer-based model would generate commentary for the runs scored.**
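
As a possible first step towards that goal, the commentary could be arranged into input/target pairs for a text-generation model. This is only a sketch under assumptions: the `FOUR`/`SIX`/`OUT` label spellings, the `event: ...` prompt format, the helper name `build_pairs` and the sample commentary lines are all hypothetical, and no particular model or library is implied.

```python
import pandas as pd

# Made-up rows using the highlights column names.
toy_commentary = pd.DataFrame({
    "Run_scored": ["FOUR", "SIX", "OUT", "1"],
    "Commentary": [
        "Driven through covers.",
        "Launched over long-on.",
        "Bowled him!",
        "Pushed to mid-on.",
    ],
})

# Assumed label spellings; check Run_scored.unique() on the real data.
EVENTS = {"FOUR", "SIX", "OUT"}


def build_pairs(df):
    """Return prompt/target dicts for rows whose event is in EVENTS."""
    mask = df["Run_scored"].astype(str).str.upper().isin(EVENTS)
    subset = df[mask]
    return [
        {"prompt": f"event: {run}", "target": text}
        for run, text in zip(subset["Run_scored"], subset["Commentary"])
    ]


pairs = build_pairs(toy_commentary)
for pair in pairs:
    print(pair)
```

Pairs in this shape could then be fed to whichever sequence-to-sequence or causal language model is chosen for the experiment.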

<font color="blue"><strong>Thanks for reading the basic exploration notebook.</strong></font>