# Exploratory Data Analysis on IPL Cricket Matches

Key steps to demonstrate in your Python notebook:
1. Load data into DataFrames. (Integrate SQL and Python if required.)
2. Check the Data Types of your data columns.
3. Drop any NULL, missing values or unwanted columns.
4. Drop duplicate values.
5. Check for outliers using a box plot or histogram.
6. Plot features against each other using a pair plot.
7. Use a heatmap to find the correlation between features (feature to feature).
8. Use a scatter plot to show the relationship between two variables.
9. Merge two DataFrames.
10. Slice data by a particular column value (such as year or month, or filter on categorical values).
11. Represent data in matrix form.
12. Load data into NumPy arrays.
13. Select a slice or part of the data and display it.
14. Use conditions to segregate the data (for example, show rows where a feature (column) is >, <, or = a number).
15. Apply mathematical and statistical functions from libraries.
16. Select data based on a category (categorical data).
17. Libraries expected (minimum 4 required): Pandas, NumPy, Seaborn, Matplotlib.
18. Write your own functions and handle exceptions in them.
19. Use `*args` and `**kwargs`.
20. Use of data functions. 
21. Create classes.

Website - https://data.world/vijayabhaskar/ipl-all-match-complete-data

In [None]:
from google.colab import drive
drive.mount('/content/drive')
#To be done

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Problem Statement  <a class="anchor" id="section1">
In this Python notebook, we analyze IPL matches from 2008 to 2020 using Python packages such as pandas, matplotlib, and seaborn. We conducted an exploratory data analysis on two datasets to determine which factors may affect a team's chances of winning. The factors we chose to analyze are win rate, number of games played, the result of the pre-game coin toss, and the venue. We were also curious about additional factors, such as the highest-performing players.

## 2. Data Loading and Description <a class="anchor" id="section2">

**Some Background Information**

The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, usually contested during April and May each year
by teams representing Indian cities and some states. This notebook covers the seasons from 2008 to 2020, using two files: `IPL Matches 2008-2020.csv` and `IPL Ball-by-Ball 2008-2020.csv`.

**Importing Packages**

In [None]:
import numpy as np                  # Implements multi-dimensional arrays and matrices
import pandas as pd                 # For data manipulation and analysis
import matplotlib.pyplot as plt     # Plotting library for Python, built on NumPy
import seaborn as sns               # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

from subprocess import check_output

import warnings                                            # Ignore warning related to pandas_profiling
warnings.filterwarnings('ignore') 

# To be added in our 2nd file
def annot_plot(ax,w,h):                                    # function to add data to plot
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
        ax.annotate('{0:.1f}'.format(p.get_height()), (p.get_x()+w, p.get_height()+h))

In [None]:
!ls /content/drive/MyDrive/PythonMidterm/

ls: cannot access '/content/drive/MyDrive/PythonMidterm/': No such file or directory


In [None]:
import os
os.chdir('/content/drive/My Drive/Final-Baseball/')
import BaseballMethods as cri   # alias inferred from the cri.* calls used later in this notebook

**Importing the Dataset**

In [None]:
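try:
  ball_by_ball = pd.read_csv("/IPL Ball-by-Ball 2008-2020.csv")
except FileNotFoundError:   # read_csv raises FileNotFoundError, not ValueError, for a missing path
  print("file not found in directory")
  raise                     # stop here; ball_by_ball would be undefined below
ball_by_ball.head(2)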


In [None]:
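try:
  matches_data = pd.read_csv("/IPL Matches 2008-2020.csv")
except FileNotFoundError:   # read_csv raises FileNotFoundError, not ValueError, for a missing path
  print("file not found in directory")
  raise                     # stop here; matches_data would be undefined below
matches_data.head(2)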
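
The two loading cells repeat the same pattern, so it could be factored into a small function. The sketch below is illustrative only: `load_csv` is a hypothetical helper, not part of this notebook's `BaseballMethods` module, and any extra keyword arguments are simply forwarded to `pd.read_csv`.

```python
import pandas as pd


def load_csv(path, **kwargs):
    """Hypothetical helper: return a DataFrame, or None if the file is missing."""
    try:
        return pd.read_csv(path, **kwargs)
    except FileNotFoundError:
        print(f"File not found: {path}")
        return None


# A missing file yields None instead of an unhandled exception.
missing = load_csv("/no/such/dir/matches.csv")
```

Callers would then check for `None` before using the result, for example `if matches_data is not None: ...`.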

In [None]:
print(f"This dataset has {ball_by_ball.shape[0]} rows and {ball_by_ball.shape[1]} columns.")

In [None]:
print(f"This dataset has {matches_data.shape[0]} rows and {matches_data.shape[1]} columns.")

## 3. Data Profiling

- __Understanding the dataset__ using various pandas functions.
- __Pandas profiling__ to find which columns of the dataset need preprocessing.
- __Preprocessing__ to deal with erroneous and missing values in columns.
- __Pandas profiling__ again, to see how preprocessing has transformed the dataset.

### 3.1 - Understanding the Dataset

Observing a few rows and columns of the data, from both the start and the end.


In [None]:
ball_by_ball.shape

IPL matches ball by ball has 193468 rows and 18 columns

In [None]:
matches_data.shape

IPL matches data has 816 rows and 17 columns

In [None]:
ball_by_ball.columns

In [None]:
matches_data.columns

In [None]:
ball_by_ball.head(2)

In [None]:
matches_data.head(2)

In [None]:
ball_by_ball.tail(2)

In [None]:
matches_data.tail(2)

In [None]:
ball_by_ball.info()

In [None]:
matches_data.info()

In [None]:
ball_by_ball.describe()

In [None]:
matches_data.describe()

### 3.2 - Preprocessing

In [None]:
#checking null values
cri.find_null_values(ball_by_ball)

In [None]:
##checking null values
cri.find_null_values(matches_data)
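
The body of `cri.find_null_values` is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it simply reports per-column null counts, is below; `count_nulls` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def count_nulls(df):
    """Return the number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)


# Toy data for illustration only; not taken from the IPL files.
toy = pd.DataFrame({
    "city": ["Mumbai", np.nan, "Delhi"],
    "winner": ["A", "B", np.nan],
    "id": [1, 2, 3],
})
null_counts = count_nulls(toy)
```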

- Dealing with missing and inconsistent values<br/>
    - Replacing missing entries of __City__ using the Venue column.
    - Replacing __Rising Pune Supergiant with Rising Pune Supergiants__.
    - Replacing city __Bengaluru with Bangalore__.

In [None]:
matches_data.columns.unique()

The same team appears under two names (Rising Pune Supergiants and Rising Pune Supergiant), so all occurrences are replaced with a single name, Rising Pune Supergiants.

In [None]:
#Replacing Rising Pune Supergiant with Rising Pune Supergiants
cri.replace_values_for_consistency(matches_data)
matches_data.head(2)

**Replacing Null values in CITY column from VENUE**

In [None]:
cri.replace_values_for_consistency(matches_data)
matches_data.head(10)
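
How `cri.replace_values_for_consistency` fills the missing cities is not shown here. One possible approach, sketched on a toy frame with an assumed venue-to-city lookup (`venue_to_city` is hypothetical), is to map each venue to a city and use that value only where `city` is null:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "venue": ["Wankhede Stadium", "Dubai International Cricket Stadium", "Eden Gardens"],
    "city": ["Mumbai", np.nan, "Kolkata"],
})

# Assumed lookup table; extend it for every venue that has a missing city.
venue_to_city = {"Dubai International Cricket Stadium": "Dubai"}

# fillna with a Series aligns on the index, so only null cities are replaced.
toy["city"] = toy["city"].fillna(toy["venue"].map(venue_to_city))
```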


**Missing values of winner and player_of_match columns.**

In [None]:
matches_data[matches_data['winner'].isnull()]

Because these matches had no result, the winner and player_of_match columns are legitimately blank and do not need to be replaced with any values.

**Replace city Bengaluru to Bangalore**

In [None]:
matches_data.replace( 'Bengaluru', 'Bangalore',inplace = True)
matches_data['city'].unique()

### 3.3 - Post Profiling<a class="anchor" id="section304">

In [None]:
#display team names and column names
cri.display_post_profiling(matches_data)

## 4. Questions

### 4.1 - Total matches played by each team  <a class="anchor" id="section403">

In [None]:
matches_played = matches_data['team1'].value_counts()+ matches_data['team2'].value_counts()
matches_played
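
One caveat when adding two `value_counts()` results with `+`: a team that appears in only one of the two columns gets `NaN` rather than a count. A sketch of the safer `Series.add(..., fill_value=0)` form, on made-up data:

```python
import pandas as pd

# Toy fixtures: team A never appears in team2, team C never appears in team1.
toy = pd.DataFrame({
    "team1": ["A", "A", "B"],
    "team2": ["B", "C", "C"],
})

first = toy["team1"].value_counts()
second = toy["team2"].value_counts()

naive = first + second                              # NaN for A and C
played = first.add(second, fill_value=0).astype(int)  # missing counts treated as 0
```

In the IPL data every team may well appear in both columns, in which case the two forms agree; the `fill_value` version just removes the dependency on that.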


### 4.2 - How many matches were won by each team?  <a class="anchor" id="section404">

In [None]:
matches_won = matches_data.groupby('winner').count()
matches_won["id"]

### 4.3 - Comparison between Number of matches won by each team and total matches played <a class="anchor" id="section405">

In [None]:
cri.matches_won_total_matches(matches_data, matches_won, matches_played)

### 4.4 - Success Rate of each team <a class="anchor" id="section406">

In [None]:
match_succes_rate = (matches_won["id"]/matches_played)*100
#print(match_succes_rate)

data = match_succes_rate.sort_values(ascending = False)
plt.figure(figsize=(7,3))
ax = sns.barplot(x = data.index, y = data, palette="Set2");
plt.ylabel('Success rate of each team')
plt.xticks(rotation=80)
annot_plot(ax,0.08,1)


__Chennai Super Kings__ have the highest success rate __(59.6%)__, followed by Mumbai Indians __(59.1%)__. Without a full statistical analysis, the chart suggests that teams that have played more games tend to have a higher win percentage.
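
The link between games played and success rate can be checked directly rather than read off the chart. The sketch below uses invented numbers, not the notebook's results, to show the calculation with `Series.corr` (Pearson by default).

```python
import pandas as pd

# Hypothetical per-team totals, for illustration only.
played = pd.Series({"A": 200, "B": 180, "C": 60, "D": 30})
won = pd.Series({"A": 120, "B": 100, "C": 27, "D": 12})

success_rate = won / played * 100
r = played.corr(success_rate)   # Pearson correlation coefficient, between -1 and 1
```

In the notebook, `matches_played` and `match_succes_rate` would take the place of the toy series. A value of `r` near zero would undercut the observation; with only a handful of teams, even a large `r` is weak evidence.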

### 4.5 - Top 10 high performing Players <a class="anchor" id="section415">

In [None]:
plt.figure(figsize=(5,3))

ax = matches_data['player_of_match'].value_counts()[:10].plot.bar()
plt.title('Top 10 high performing Players')
annot_plot(ax,0.08,1)

__AB de Villiers__ has won Player of the Match __23__ times, followed by __CH Gayle__ with __22__.

### 4.6 - Toss winning success rate of each team <a class="anchor" id="section416">

In [None]:
toss_won = matches_data['toss_winner'].value_counts()
toss_win_rate = (toss_won/matches_played)*100
data = toss_win_rate.sort_values(ascending = False)
plt.figure(figsize=(5,3))
ax = sns.barplot(x = data.index, y = data, palette="Set2");
plt.ylabel('Toss win rate of each team')
plt.xticks(rotation=90)
annot_plot(ax,0.08,1)
plt.title('Toss winning success rate of each team')

__Delhi Capitals__ have the highest toss win rate at __60.6%__, and __Rising Pune Supergiants__ have the lowest at __43.3%__.

### 4.7 - Checking Correlation in the Ball-by-Ball Dataset <a class="anchor" id="section406">

In [None]:
corelation = ball_by_ball.corr(numeric_only=True)   # restrict to numeric columns; text columns raise an error in pandas >= 2.0

In [None]:
sns.heatmap(corelation, xticklabels=corelation.columns, yticklabels=corelation.columns)

### 4.8 - Most Runs Scored

In [None]:
most_runs = ball_by_ball.groupby(['id'])['total_runs'].sum().reset_index()

# Ascending Order
cri.runs_scored_ascending(most_runs)

# Descending Order
cri.runs_scored_descending(most_runs)

In [None]:
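# asc_most_runs is not defined anywhere in this notebook, so build it here
asc_most_runs = most_runs.sort_values('total_runs')
sns.pairplot(asc_most_runs)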

### 4.9 - Toss Decision across matches <a class="anchor" id="section417">

In [None]:
toss=matches_data['toss_decision'].value_counts()
labels=np.array(toss.index)
sizes = toss.values
colors = ['blue', 'red']

# Plot
plt.figure(figsize=(5,3))
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True,startangle=90)

plt.title('Toss Decision of all the matches')
plt.axis('equal')
plt.show()

__60.8%__ of the toss-winning teams opted for __fielding__, while __39.2%__ opted for __batting__.

### 4.10 - How toss winning affects the match winner <a class="anchor" id="section420">

In [None]:
tosswin_win = matches_data['id'][matches_data['toss_winner'] == matches_data['winner']].count()
total_matches=matches_data['id'].count()
Success_rate = ((matches_data[matches_data['toss_winner'] == matches_data['winner']].count())/(total_matches))*100

print("Number of matches in which Toss winner is the game winner is :",tosswin_win, "out of",total_matches," ie.,", Success_rate["id"],"%" )
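
The same count and percentage can be computed more directly, because the sum of a boolean Series is the number of `True` values and its mean is their fraction. A sketch on a toy frame:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "toss_winner": ["A", "B", "A", "C"],
    "winner": ["A", "A", "A", "B"],
})

toss_and_match = toy["toss_winner"] == toy["winner"]   # True where the toss winner also won
n_both = int(toss_and_match.sum())
pct_both = toss_and_match.mean() * 100
```

Matches with no result have a null `winner`, so they compare as not equal and count against the percentage, just as they do in the cell above.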


###  4.11 - Toss Decision  in which Toss winner is the game winner<a class="anchor" id="section421">

In [None]:
tosswin_winner = matches_data['toss_decision'][matches_data['toss_winner'] == matches_data['winner']].value_counts()
labels=np.array(tosswin_winner.index)
sizes = tosswin_winner.values
colors = ['red', 'lightskyblue']

plt.figure(figsize=(5,3))
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True,startangle=90)

plt.title('Toss decision of toss winner to win the game')
plt.axis('equal')
plt.show()

Among matches in which the toss winner also won the match, __65.3%__ of teams had chosen to __field__ first and __34.7%__ had chosen to __bat__ first.

### 4.12 - Top 10 Cities to hold match <a class="anchor" id="section422">

In [None]:
plt.figure(figsize=(5,3))

ax=matches_data['city'].value_counts()[:10].plot.bar()
plt.title('Top 10 Cities to hold match')
plt.xticks(rotation=70)
annot_plot(ax,0.08,1)

In [None]:
plt.figure(figsize=(15,10))
matches_data['city'].value_counts().plot.pie(autopct="%0.2f%%")

__Mumbai__ has hosted the highest number of matches (__101__), followed by __Bangalore (80)__.

### 4.13 - In which city has each team won the most matches?<a class="anchor" id="section423">

In [None]:
a = matches_data.groupby(['winner','city']).size().reset_index(name='win_counts')
a = a.sort_values("win_counts",ascending=False)
a.groupby("winner").head(1)
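
An alternative to sorting and taking the first row per group is `idxmax` on the grouped counts. A sketch with toy data (the team and city names are placeholders):

```python
import pandas as pd

# Toy results for illustration only.
toy = pd.DataFrame({
    "winner": ["A", "A", "A", "B", "B"],
    "city": ["X", "X", "Y", "Y", "Y"],
})

counts = toy.groupby(["winner", "city"]).size().reset_index(name="win_counts")

# idxmax returns the row label of each team's largest count
best = counts.loc[counts.groupby("winner")["win_counts"].idxmax()]
```

Both approaches keep a single city per team, so ties between cities are resolved by row order rather than reported.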

Teams have generally won the most matches at their home grounds.

### 4.14 - Top 10 venue to hold matches <a class="anchor" id="section424">

In [None]:
#top 10 venue to hold max number of matches
plt.figure(figsize=(5,3))
venue=matches_data.groupby('venue')["id"].count()
ax =venue.sort_values(ascending=False).head(10).plot.bar(figsize=(5,3))
plt.title('Top 10 venue to hold matches')
plt.xticks(rotation=90)
annot_plot(ax,0.08,1)

In [None]:
plt.figure(figsize=(14,8))
sns.countplot(y = 'venue',data = matches_data)

__Eden Gardens__ has hosted the highest number of matches (__77__), followed by __Feroz Shah Kotla (74)__.

### 4.15 - Identify if each Venue is Best Suited to opt for batting or fielding based on previous matches won on that venue <a class="anchor" id="section425">

In [None]:
venue_suit_for = matches_data[matches_data['toss_winner'] == matches_data['winner']]
sns.countplot(x='venue',hue='toss_decision',data=venue_suit_for)
plt.xlabel('Venue ')
plt.title('Venue is Best Suited for')
plt.xticks(rotation=90)

__M Chinnaswamy Stadium and Eden Gardens__ are best suited for fielding first, and __MA Chidambaram Stadium, Chepauk__ is best suited for batting first.

### 4.16 - Match Results: Normal, Tie and No Result <a class="anchor" id="section426">

In [None]:
result=matches_data['result'].value_counts().tolist()
names='Normal - '+str(result[0]), 'Tie - '+str(result[1]), 'No result - '+str(result[2]), 

fig, ax = plt.subplots(figsize=(3.5,3.5))  
# Create a pieplot
explode = (0, 0.01, 0.01)
ax1,text=ax.pie(result,labeldistance=2,explode=explode,radius=0.1, startangle=180,colors=['skyblue','green','red'])
#plt.show()
ax.axis('equal')
ax.set_title("Match Results") 

# add a circle at the center
my_circle=plt.Circle( (0,0), 0.07, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.legend(ax1, names,  bbox_to_anchor=(.9,.8), loc=2)
plt.tight_layout()
plt.show()

### 4.17 - Combining 2 Data Frames

In [None]:
IplData = ball_by_ball[['id']].merge(matches_data, left_on = 'id',right_on = 'id',how = 'left')
IplData.head(2)
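
Because each ball belongs to exactly one match, this merge should be many-to-one on `id`. pandas can check that assumption through the `validate` argument of `DataFrame.merge`. A sketch on toy frames:

```python
import pandas as pd

# Toy frames for illustration only.
balls = pd.DataFrame({"id": [1, 1, 2], "total_runs": [4, 0, 6]})
matches = pd.DataFrame({"id": [1, 2], "city": ["Mumbai", "Delhi"]})

# validate="many_to_one" raises MergeError if "id" is duplicated in matches
merged = balls.merge(matches, on="id", how="left", validate="many_to_one")
```

With a left merge the result has one row per ball, so its row count should match the ball-by-ball frame.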

### 4.18 - Match Result Distribution in the Combined Data Frame

In [None]:
plt.figure(figsize=(6,6))
sizes = IplData.result.value_counts()
labels = IplData.result.value_counts().index
plt.pie(sizes,colors = ['b','g','r'],
         labels=labels,
         autopct='%1.1f%%',
         startangle=90,
         pctdistance=0.75,
         )

#draw white circle
centre_circle = plt.Circle((0,0),0.60,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

### 4.19 - Histogram and PairPlot in Data Frames

In [None]:
ball_by_ball.hist(figsize=(10,10))
plt.show()
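
Histograms show the shape of each column. For a numeric outlier check, a common rule of thumb flags values more than 1.5 times the interquartile range beyond the quartiles, which is also the default whisker rule in a Matplotlib box plot. A sketch with a hypothetical helper `iqr_outliers` on made-up match totals:

```python
import pandas as pd


def iqr_outliers(s, k=1.5):
    """Return the values of s lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]


# Invented run totals; only the last one is far from the rest.
runs = pd.Series([150, 160, 155, 165, 158, 320])
flagged = iqr_outliers(runs)
```

A flagged value is a candidate for inspection, not automatically an error; a very high match total may be genuine.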

In [None]:
sns.pairplot(matches_data)

### 4.20 - Matrix Graph for Data Frame

In [None]:
matrix = ball_by_ball.corr(numeric_only=True)   # restrict to numeric columns; text columns raise an error in pandas >= 2.0
f, ax = plt.subplots(figsize=(25, 12)) 
sns.heatmap(matrix, vmax=.8, square=True, cmap="RdYlGn",annot = True);
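
The numeric data can also be handled as a plain NumPy array, which covers the matrix-form, slicing, and conditional-selection steps from the checklist. A sketch on a toy numeric frame (the column names mirror the ball-by-ball data, but the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "over": [0, 0, 1, 1],
    "batsman_runs": [1, 4, 0, 6],
    "total_runs": [1, 4, 1, 6],
})

arr = toy.to_numpy()               # 2-D array of shape (rows, columns)
first_two = arr[:2, :]             # slice: first two rows, all columns
boundaries = arr[arr[:, 1] >= 4]   # condition: rows with batsman_runs of 4 or more
col_means = arr.mean(axis=0)       # column-wise mean
```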

### 4.21 - Top 10 Umpires

In [None]:
top_umpire1 = matches_data['umpire1'].value_counts().head(10)   # ten most frequent first umpires
sns.barplot(x = top_umpire1.values, y = top_umpire1.index)
plt.title("Umpire 1")
plt.xlabel("Match Count")
plt.ylabel("Top 10 Umpire 1")

In [None]:
top_umpire2 = matches_data['umpire2'].value_counts().head(10)   # ten most frequent second umpires
sns.barplot(x = top_umpire2.values, y = top_umpire2.index)
plt.title("Umpire 2")
plt.xlabel("Match Count")
plt.ylabel("Top 10 Umpire 2")

### 4.22 - Cross Tab Plot

Toss Winner Vs Toss Decision

In [None]:
pd.crosstab(matches_data['toss_winner'],matches_data['toss_decision']).style.background_gradient(cmap = 'spring')

Team1 Vs Team2

In [None]:
pd.crosstab(matches_data['team1'],matches_data['team2']).style.background_gradient(cmap = 'autumn')

Batsman vs. whether a wicket fell on the delivery

In [None]:
pd.crosstab(ball_by_ball['batsman'],ball_by_ball['is_wicket']).style.background_gradient(cmap = 'seismic')

Batsman vs. runs scored off each ball

In [None]:
pd.crosstab(ball_by_ball['batsman'],ball_by_ball['batsman_runs']).style.background_gradient(cmap = 'PRGn')

### 4.23 - Cat Plot

#### The team won both match and toss

In [None]:
sns.catplot(x = "result", y = "winner", hue = "toss_decision", data = matches_data)

## 5. Conclusion

*   Among the newer teams, Rising Pune Supergiants have a comparatively good success rate.
*   Royal Challengers Bangalore and Chennai Super Kings are the best defending teams.
*   Since 2014, most teams have opted to field after winning the toss and have also been successful in winning matches.
*   Overall, Chennai Super Kings and Mumbai Indians have the highest success rates.
*   Based on this analysis, Chennai Super Kings and Mumbai Indians appear more likely to win upcoming IPL seasons.



To conclude, we found that there is likely a correlation between the number of games played and the number of games won. Additionally, winning the coin toss at the beginning of each game appears to be beneficial.
