# INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKET 


## Using Dataset IPL Player Stats - 2016 till 2019 taken from Kaggle
https://www.kaggle.com/cclayford/cricinfo-statsguru-data

### Table of Contents

1. [Dataset Description](#cricket)
2. [Importing the packages and dataset](#packages)
3. [Exploring the dataset](#explore)
4. [Data Cleaning](#clean)
5. [Creating Additional Columns (or) Derived Attributes](#add)
6. [Treating Category and Object values](#le)
7. [Splitting data for Runs prediction](#rsplit)
8. [Models for Runs Prediction](#rm)
     - 8.1 [Decision Tree Classifier](#dtr)
     - 8.2 [Pruned Decision Tree Classifier](#dtrprune)
     - 8.3 [Random Forest Classifier](#rfr)
     - 8.4 [Pruned Random Forest Classifier](#rfrprune)
     - 8.5 [Gradient Boost Classifier](#gbr)
     - 8.6 [Pruned Gradient Boosting Classifier](#gbrprune)
     - 8.7 [Support Vector Machine Classifier](#svmr)
     - 8.8 [KNearest Neighbors Classifier](#knnr)
     - 8.9 [Bagging Classifier](#bagr)
     - 8.10 [XG Boost Classifier](#xgbr)
     - 8.11 [Pruned XG Boost Classifier](#xgbrprune)
     - 8.12 [Ada Boost Classifier](#abr)
     - 8.13 [Pruned Ada Boost Classifier](#abrprune)
     - 8.14 [Logistic Regression](#lrr)
     - 8.15 [Stacking  for Logistic Regression using Mlxtend Classifier](#strmlx)
     - 8.16 [Stacking using Voting Classifier](#strvote)
     - 8.17 [Comparison Table of all models for Runs Prediction](#ctr)
9. [Splitting Data for Wickets Prediction](#wsplit)
10. [Models for Wickets Prediction](#wm)
     - 10.1 [Decision Tree Classifier](#dtw)
     - 10.2 [Pruned Decision Tree Classifier](#dtwprune)
     - 10.3 [Random Forest Classifier](#rfw)
     - 10.4 [Gradient Boost Classifier](#gbw)
     - 10.5 [Pruned Gradient Boosting Classifier](#gbwprune)
     - 10.6 [Support Vector Machine Classifier](#svmw)
     - 10.7 [KNearest Neighbors Classifier](#knnw)
     - 10.8 [Bagging Classifier](#bagw)
     - 10.9 [XG Boost Classifier](#xgbw)
     - 10.10 [Pruned XG Boost Classifier](#xgbwprune)
     - 10.11 [Ada Boost Classifier](#abw)
     - 10.12 [Pruned Ada Boost Classifier](#abwprune)
     - 10.13 [Logistic Regression](#lrw)
     - 10.14 [Stacking  for Logistic Regression using Mlxtend Classifier](#stwmlx)
     - 10.15 [Stacking using Voting Classifier](#stwvote)
     - 10.16 [Comparison Table of all models for Runs Prediction](#ctw)

### 1. Dataset Description  <a id='cricket'>
    
**Objective**
Our objective is to predict the performance of players using the columns Runs and Wickets.

**Columns**

1. Team - The Player Teams

2. Player - Name of Players

3. Tournament - Tournament Name

4. Matches - No of Matches

5. Batting Innings - No of innings the batsmen batted

6. Not Out - No of times the batsmen was not out

7. Runs Scored - Runs scored by the batsmen

8. Highest Score - Highest score of the batsmen

9. Batting Average -  Average number of runs scored by batsmen

10. Balls Faced - Balls Faced by the batsmen

11. Batting Strike Rate - Average no of Runs scored by batsmen per 100 balls

12. 100 - No of Centuries scored by batsmen

13. 50 - No of Fifties scored by batsmen

14. 0 - No of zeroes scored by batsmen

15. 4s - No of Fours batsmen scored

16. 6s - No of Sixes batsmen scored

17. Bowling Innings - No of innings the bowler bowled

18. Overs Bowled - No of Overs the bowler bowled

19. Maidens Bowled - No of Maidens the bowler bowled

20. Runs Conceded - Total no of runs scored by opponent when bowler bowled

21. Wickets Taken - No of Wickets taken by the bowler

22. 4+ Innings Wickets - No of innings where bowler took more than four wickets

23. 5+ Innings Wickets - No of innings where bowler took more than five wickets

24. Catches Taken - No of catches taken

25. Stumpings made - No of Stumpings made

### 2. Importing the packages and dataset  <a id='packages'>

In [10]:
# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [565]:
# Reading the dataset
cricket = pd.read_csv('IPL Player Stats - 2016 till 2019.csv')

### 3. Exploring the dataset  <a id='explore'>

In [566]:
# Head of the dataset
cricket.head()

Unnamed: 0,Team,Player,Tournament,Matches,Batting Innings,Not Out,Runds Scored,Highest Score,Batting Average,Balls Faced,...,Runs Conceded,Wickets Taken,Best Bowling Figures,Bowling Average,Bowling Economy Rate,Bowling Strike Rate,4+ Innings Wickets,5+ Innings Wickets,Catches Taken,Stumpings Made
0,Delhi Daredevils,CH Morris,IPL 2016,12,7,4,195,82*,65.0,109,...,308,13,2/30,23.69,7.00,20.3,0,0,8,0
1,Delhi Daredevils,CH Morris,IPL 2017,9,9,4,154,52*,30.8,94,...,240,12,4/26,20.00,7.74,15.5,1,0,5,0
2,Delhi Daredevils,CH Morris,IPL 2018,4,4,3,46,27*,46.0,26,...,143,3,2/41,47.66,10.21,28.0,0,0,2,0
3,Delhi Daredevils,JP Duminy,IPL 2016,10,8,3,191,49*,38.2,156,...,55,2,1/4,27.50,7.85,21.0,0,0,3,0
4,Delhi Daredevils,Q de Kock,IPL 2016,13,13,1,445,108,37.08,327,...,-,-,-,-,-,-,-,-,2,2


In [567]:
# Shape of the dataset
cricket.shape

(631, 29)

In [568]:
# Head of the dataset
cricket.columns

Index(['Team', 'Player', 'Tournament', 'Matches', 'Batting Innings', 'Not Out',
       'Runds Scored', 'Highest Score', 'Batting Average', 'Balls Faced',
       'Batting Strike Rate', '100', '50', '0', '4s', '6s', 'Bowling Innings',
       'Overs Bowled', 'Maidens Bowled', 'Runs Conceded', 'Wickets Taken',
       'Best Bowling Figures', 'Bowling Average', 'Bowling Economy Rate',
       'Bowling Strike Rate', '4+ Innings Wickets', '5+ Innings Wickets',
       'Catches Taken', 'Stumpings Made'],
      dtype='object')

In [569]:
# Renaming the columns
cricket = cricket.rename(columns={'Runds Scored':'Runs Scored','50':'Fifties','100':'Centuries','0':'Zeroes','Overs Bowled':'Overs','4s':'Fours','6s':'Sixes'})

In [570]:
# Checking for null values
cricket.isnull().sum()

Team                    0
Player                  0
Tournament              0
Matches                 0
Batting Innings         0
Not Out                 0
Runs Scored             0
Highest Score           0
Batting Average         0
Balls Faced             0
Batting Strike Rate     0
Centuries               0
Fifties                 0
Zeroes                  0
Fours                   0
Sixes                   0
Bowling Innings         0
Overs                   0
Maidens Bowled          0
Runs Conceded           0
Wickets Taken           0
Best Bowling Figures    0
Bowling Average         0
Bowling Economy Rate    0
Bowling Strike Rate     0
4+ Innings Wickets      0
5+ Innings Wickets      0
Catches Taken           0
Stumpings Made          0
dtype: int64

In [571]:
# Info of the dataset
cricket.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 29 columns):
Team                    631 non-null object
Player                  631 non-null object
Tournament              631 non-null object
Matches                 631 non-null int64
Batting Innings         631 non-null object
Not Out                 631 non-null object
Runs Scored             631 non-null object
Highest Score           631 non-null object
Batting Average         631 non-null object
Balls Faced             631 non-null object
Batting Strike Rate     631 non-null object
Centuries               631 non-null object
Fifties                 631 non-null object
Zeroes                  631 non-null object
Fours                   631 non-null object
Sixes                   631 non-null object
Bowling Innings         631 non-null object
Overs                   631 non-null object
Maidens Bowled          631 non-null object
Runs Conceded           631 non-null object
Wickets Taken   

In [572]:
# Value counts for Team column
cricket['Team'].value_counts()

Royal Challengers Bangalore    84
Kings XI Punjab                79
Sunrisers Hyderabad            77
Kolkata Knight Riders          74
Mumbai Indians                 73
Delhi Daredevils               59
Gujarat Lions                  43
Rajasthan Royals               40
Chennai Super Kings            39
Rising Pune Supergiants        23
Delhi Capitals                 20
Rising Pune Supergiant         20
Name: Team, dtype: int64

In [573]:
# Value counts for Batting Innings column
cricket['Batting Innings'].value_counts().head()

1    103
2     62
-     62
3     60
4     52
Name: Batting Innings, dtype: int64

In [574]:
# Value counts for Not Out column
cricket['Not Out'].value_counts().head()

1    179
0    173
2     88
3     68
-     62
Name: Not Out, dtype: int64

In [575]:
# Value counts for Highest Score column
cricket['Highest Score'].value_counts().head()

-     62
0*    20
1*    17
0     16
5*    12
Name: Highest Score, dtype: int64

We could see * present in the values and all the columns we have seen has - value present which needs to be treated

In [576]:
# Value counts for Batting Average column
cricket['Batting Average'].value_counts().head()

-       130
0.00     21
1.00     16
8.00     12
6.00     12
Name: Batting Average, dtype: int64

In [577]:
# Value counts for Balls Faced column
cricket['Balls Faced'].value_counts().head()

-    62
3    20
1    19
4    18
2    16
Name: Balls Faced, dtype: int64

In [578]:
# Value counts for Batting Strike Rate column
cricket['Batting Strike Rate'].value_counts().head()

-         70
0.00      28
100.00    22
50.00     14
33.33      7
Name: Batting Strike Rate, dtype: int64

In [579]:
# Value counts for Centuries column
cricket['Centuries'].value_counts()

0    551
-     62
1     15
2      2
4      1
Name: Centuries, dtype: int64

In [580]:
# Value counts for Fifties column
cricket['Fifties'].value_counts()

0    393
1     71
-     62
2     41
3     32
4     14
5      9
6      5
8      2
7      1
9      1
Name: Fifties, dtype: int64

In [581]:
# Value counts for Zeroes column
cricket['Zeroes'].value_counts()

0    369
1    154
-     62
2     32
3     14
Name: Zeroes, dtype: int64

In [582]:
# Value counts for Fours column (taken head due to presence of lot of values)
cricket['Fours'].value_counts().head()

0    151
1     67
-     62
2     38
3     26
Name: Fours, dtype: int64

In [583]:
# Value counts for Sixes column (taken head due to presence of lot of values)
cricket['Sixes'].value_counts().head()

0    199
1     81
-     62
4     35
2     29
Name: Sixes, dtype: int64

In [584]:
# Value counts for Bowling Innings column (taken head due to presence of lot of values)
cricket['Bowling Innings'].value_counts().head()

-    218
1     54
2     46
4     33
3     33
Name: Bowling Innings, dtype: int64

In [585]:
# Value counts for Overs column (taken head due to presence of lot of values)
cricket['Overs'].value_counts().head()

-      218
2.0     20
4.0     20
3.0     16
1.0     13
Name: Overs, dtype: int64

In [586]:
# Value counts for Maidens Bowled column (taken head due to presence of lot of values)
cricket['Maidens Bowled'].value_counts().head()

0    362
-    218
1     40
2      8
3      2
Name: Maidens Bowled, dtype: int64

In [587]:
# Value counts for Runs Conceded column (taken head due to presence of lot of values)
cricket['Runs Conceded'].value_counts().head()

-     218
42      6
49      6
31      5
33      4
Name: Runs Conceded, dtype: int64

In [588]:
# Value counts for Wickets Taken column (taken head due to presence of lot of values)
cricket['Wickets Taken'].value_counts().head()

-    218
0     67
1     47
2     45
3     35
Name: Wickets Taken, dtype: int64

In [589]:
# Value counts for Bowling Average column (taken head due to presence of lot of values)
cricket['Bowling Average'].value_counts().head()

-        285
21.00      5
27.00      5
35.50      4
33.00      4
Name: Bowling Average, dtype: int64

In [590]:
# Value counts for Bowling Economy Rate column (taken head due to presence of lot of values)
cricket['Bowling Economy Rate'].value_counts().head()

-        218
10.50      8
9.00       8
8.00       7
13.00      5
Name: Bowling Economy Rate, dtype: int64

In [591]:
# Value counts for Bowling Strike Rate column (taken head due to presence of lot of values)
cricket['Bowling Strike Rate'].value_counts().head()

-       285
12.0     16
24.0     16
30.0     11
21.0     10
Name: Bowling Strike Rate, dtype: int64

In [592]:
# Value counts for 4+ Innings Wickets column (taken head due to presence of lot of values)
cricket['4+ Innings Wickets'].value_counts().head()

0    381
-    218
1     28
2      3
3      1
Name: 4+ Innings Wickets, dtype: int64

In [593]:
# Value counts for 5+ Innings Wickets column (taken head due to presence of lot of values)
cricket['5+ Innings Wickets'].value_counts().head()

0    407
-    218
1      6
Name: 5+ Innings Wickets, dtype: int64

### 4. Data Cleaning  <a id='clean'>

In [594]:
# Removing the '*' values from the Highest Score column
cricket['Highest Score'] = cricket['Highest Score'].apply(lambda x:x.replace('*',''))

In [595]:
# Value counts of Highest Score column
cricket['Highest Score'].value_counts()

-      62
0      36
1      28
7      21
5      19
6      16
2      15
4      12
3      12
12     11
9      11
15     10
8      10
10     10
19      9
16      9
35      8
27      7
24      7
65      7
67      7
31      7
36      7
11      7
13      7
52      6
21      6
34      6
50      6
14      6
       ..
90      2
104     2
85      2
80      2
48      2
29      2
58      2
73      2
102     2
23      1
113     1
126     1
79      1
101     1
108     1
71      1
128     1
129     1
86      1
103     1
42      1
117     1
114     1
94      1
87      1
78      1
38      1
49      1
105     1
88      1
Name: Highest Score, Length: 113, dtype: int64

We could see that the '*' values have been removed

In [596]:
# Creating a function to remove all the '-' values present in all the columns
def removedash(a):
    a = a.replace('-',np.nan)
    a = a.astype(float)
    a = a.fillna(value=a.median())
    return a

In [597]:
# Assigning the function for each columns to remove the '-' value
cricket['5+ Innings Wickets'] = removedash(cricket['5+ Innings Wickets'])

cricket['4+ Innings Wickets'] = removedash(cricket['4+ Innings Wickets'])

cricket['Bowling Strike Rate'] = removedash(cricket['Bowling Strike Rate'])

cricket['Bowling Economy Rate'] = removedash(cricket['Bowling Economy Rate'])

cricket['Bowling Average'] = removedash(cricket['Bowling Average'])

cricket['Wickets Taken'] = removedash(cricket['Wickets Taken'])

cricket['Runs Conceded'] = removedash(cricket['Runs Conceded'])

cricket['Maidens Bowled'] = removedash(cricket['Maidens Bowled'])

cricket['Overs'] = removedash(cricket['Overs'])

cricket['Bowling Innings'] = removedash(cricket['Bowling Innings'])

cricket['Fours'] = removedash(cricket['Fours'])

cricket['Sixes'] = removedash(cricket['Sixes'])

cricket['Zeroes'] = removedash(cricket['Zeroes'])

cricket['Fifties'] = removedash(cricket['Fifties'])

cricket['Centuries'] = removedash(cricket['Centuries'])

cricket['Batting Strike Rate'] = removedash(cricket['Batting Strike Rate'])

cricket['Balls Faced'] = removedash(cricket['Balls Faced'])

cricket['Batting Average'] = removedash(cricket['Batting Average'])

cricket['Not Out'] = removedash(cricket['Not Out'])

cricket['Highest Score'] = removedash(cricket['Highest Score'])

cricket['Batting Innings'] = removedash(cricket['Batting Innings'])

cricket['Runs Scored'] = removedash(cricket['Runs Scored'])

In [598]:
# Shape of the dataset after Data Cleaning
cricket.shape

(631, 29)

In [599]:
# Removing the column as I am not able to understand
cricket = cricket.drop('Best Bowling Figures',1)

### 5. Creating Additional Columns (or) Derived Attributes<a id='add'>

Now creating additional columns from the existing columns to derive more insights, the columns to be created are as follows:
    
    1. Batting Consistency - Tells how consistent the batsmen are
    
    2. Bowling Consistency - Tells how consistent the bowlers are
    
    3. Batting Form - Tells what form the batsmen are
    
    4. Bowling Form - Tells what form the bowlers are
    
    5. Batting Opposition - Describes the batsmens performance against a opposite team
    
    6. Bowling Opposition - Describes the bowlers performance against a opposite team
    
    7. Batting Venue - Describes the batsmen performance at a venue
    
    8. Bowling Venue - Describes the bowlers performance at a venue

In [600]:
# Creating Batting Consistency column using the formula
cricket['Batting Consistency'] = (0.2566*cricket['Batting Innings'])+(0.1510*cricket['Batting Strike Rate'])+(0.0787*cricket['Centuries'])+(0.0556*cricket['Fifties'])-(0.0328*cricket['Zeroes']) 

In [601]:
# Creating Bowling Consistency column using the formula
cricket['Bowling Consistency'] = 0.4174 * cricket['Overs']+ 0.2634 * cricket['Bowling Innings'] + 0.1602 * cricket['Bowling Strike Rate'] + 0.0975*cricket['Bowling Average'] + 0.0615*(cricket['4+ Innings Wickets'] + cricket['5+ Innings Wickets'])

In [602]:
# Creating Batting Form column using the formula
cricket['Batting Form'] = (0.4262*cricket['Batting Average'])+(0.2566*cricket['Batting Innings'])+(0.1510*cricket['Batting Strike Rate'])+(0.0787*cricket['Centuries'])+(0.0556*cricket['Fifties'])-(0.0328*cricket['Zeroes'])

In [603]:
# Creating Bowling Form column using the formula
cricket['Bowling Form'] =  0.3269*cricket['Overs'] + 0.2846*cricket['Bowling Innings'] + 0.1877*cricket['Bowling Strike Rate'] + 0.1210*cricket['Bowling Average'] + 0.0798*(cricket['4+ Innings Wickets']+cricket['5+ Innings Wickets']) 

In [604]:
# Creating Batting Opposition column using the formula
cricket['Batting Opposition'] = (0.4262*cricket['Batting Average'])+(0.2566*cricket['Batting Innings'])+(0.1510*cricket['Batting Strike Rate'])+(0.0787*cricket['Centuries'])+(0.0556*cricket['Fifties'])-(0.0328*cricket['Zeroes'])

In [605]:
# Creating Bowling Opposition column using the formula
cricket['Bowling Opposition'] =  (0.3177*cricket['Overs'])+(0.3177*cricket['Bowling Innings'])+(0.1933*cricket['Bowling Strike Rate'])+(0.1465*cricket['Bowling Average'])+(0.0943*(cricket['4+ Innings Wickets']+cricket['5+ Innings Wickets']))  

In [606]:
# Creating Batting Venue column using the formula
cricket['Batting Venue'] = (0.4262*cricket['Batting Average'])+(0.2566*cricket['Batting Innings'])+(0.1510*cricket['Batting Strike Rate'])+(0.0787*cricket['Centuries'])+(0.0556*cricket['Fifties'])+(0.0328*cricket['Highest Score']) 

In [607]:
# Creating Bowling Venue column using the formula
cricket['Bowling Venue'] = (0.3018*cricket['Overs'])+(0.2783*cricket['Bowling Innings'])+(0.1836*cricket['Bowling Strike Rate'])+(0.1391*cricket['Bowling Average'])+(0.0972*(cricket['4+ Innings Wickets']+cricket['5+ Innings Wickets'])) 

In [608]:
# Binning the Batting Average columns to create a new column Batting Average Rating
bins = [0.00,29.99,49.99,69.99,89.99]
names = ['0.00-29.99','30.00-49.99','50.00-69.99','70.00-89.99']

cricket['Batting Average Rating'] = pd.cut(cricket['Batting Average'], bins, labels=names,include_lowest=True)

replace_batting = {'0.00-29.99':'1','30.00-49.99':'2','50.00-69.99':'3','70.00-89.99':'4'}

cricket = cricket.replace({'Batting Average Rating':replace_batting})

In [609]:
# Value counts for Batting Average Rating column
cricket['Batting Average Rating'].value_counts()

1    511
2     96
3     20
4      4
Name: Batting Average Rating, dtype: int64

In [610]:
# Binning the Batting Strike Rate column to create a new column Batting SR Rating
bins = [0,100.00,200.00,300.00,400.00]
names = ['0.00-100.00','101.00-200.00','201.00-300.00','301.00-400.00']

cricket['Batting SR Rating'] = pd.cut(cricket['Batting Strike Rate'],bins,labels=names,include_lowest=True,)

replace_battingstrike = {'0.00-100.00':'1','101.00-200.00':'2','201.00-300.00':'3','301.00-400.00':'4'}

cricket = cricket.replace({'Batting SR Rating':replace_battingstrike})

In [611]:
# Value counts for Batting SR Rating column
cricket['Batting SR Rating'].value_counts()

2    424
1    194
3     12
4      1
Name: Batting SR Rating, dtype: int64

In [612]:
# Binning the Bowling Average column to create a new column Bowling Average Rating
bins = [0,50.00,100.00,150.00,200.00,250.00]
names = ['0.00-50.00','51.00-100.00','101.00-150.00','151.00-200.00','201.00-250.00']

cricket['Bowling Average Rating'] = pd.cut(cricket['Bowling Average'],bins,labels=names)

replace_bowling = {'0.00-50.00':'1','51.00-100.00':'2','101.00-150.00':'3','151.00-200.00':'4','201.00-250.00':'5'}

cricket = cricket.replace({'Bowling Average Rating':replace_bowling})

In [613]:
# Value counts for Bowling Average Rating column
cricket['Bowling Average Rating'].value_counts()

1    570
2     50
3      9
5      1
4      1
Name: Bowling Average Rating, dtype: int64

In [614]:
# Binning the Bowling Strike Rate column to create a new column Bowling SR Rating
bins = [0.00,50.00,100.00,150.00,200.00]
names = ['0.00-50.00','51.00-100.00','101.00-150.00','151.00-200.00']

cricket['Bowling SR Rating'] = pd.cut(cricket['Bowling Strike Rate'],bins,labels=names)

replace_bowlingSR = {'0.00-50.00':'1','51.00-100.00':'2','101.00-150.00':'3','151.00-200.00':'4'}

cricket = cricket.replace({'Bowling SR Rating':replace_bowlingSR})

In [615]:
# Value counts for Bowling SR Rating column
cricket['Bowling SR Rating'].value_counts()

1    610
2     18
3      3
Name: Bowling SR Rating, dtype: int64

In [616]:
# As the values present in these columns are with small differences and most of the values are same, we have binned to rate them
cricket = cricket.drop('Bowling Strike Rate',1)
cricket = cricket.drop('Batting Average',1)
cricket = cricket.drop('Bowling Average',1)
cricket = cricket.drop('Batting Strike Rate',1)

In [617]:
# Binning the Runs Scored column to create a new column Runs
bins = [0.0,500.00,1000.00]
names = ['0.0-600.0','601.0-1200.0']

cricket['Runs'] = pd.cut(cricket['Runs Scored'],bins,labels=names,include_lowest=True)

replace_runs = {'0.0-600.0':'1','601.0-1200.0':'2'}

cricket = cricket.replace({'Runs':replace_runs})

In [618]:
# Value counts for Runs column
cricket['Runs'].value_counts()

1    612
2     19
Name: Runs, dtype: int64

In [619]:
# Binning the Wickets Taken to create a new column Wickets
bins = [0.0,15.0,30.0]
names = ['0.0-15.0','16.0-30.0']

cricket['Wickets'] = pd.cut(cricket['Wickets Taken'],bins,labels=names,include_lowest=True)

replace_wickets = {'0.0-15.0':'1','16.0-30.0':'2'}

cricket = cricket.replace({'Wickets':replace_wickets})

In [620]:
# Value counts for Wickets column
cricket['Wickets'].value_counts()

1    592
2     39
Name: Wickets, dtype: int64

In [621]:
# Dropping the Runs Scored and Wickets Taken column as we have created new columns Runs and Wickets by rating them
cricket = cricket.drop('Runs Scored',1)
cricket = cricket.drop('Wickets Taken',1)

In [622]:
# Info of the dataset
cricket.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 36 columns):
Team                      631 non-null object
Player                    631 non-null object
Tournament                631 non-null object
Matches                   631 non-null int64
Batting Innings           631 non-null float64
Not Out                   631 non-null float64
Highest Score             631 non-null float64
Balls Faced               631 non-null float64
Centuries                 631 non-null float64
Fifties                   631 non-null float64
Zeroes                    631 non-null float64
Fours                     631 non-null float64
Sixes                     631 non-null float64
Bowling Innings           631 non-null float64
Overs                     631 non-null float64
Maidens Bowled            631 non-null float64
Runs Conceded             631 non-null float64
Bowling Economy Rate      631 non-null float64
4+ Innings Wickets        631 non-null float64
5+ Inni

### 6. Treating Category and Object values <a id='le'>

In [623]:
# Converting Runs and Wickets from Object to Integer data type
cricket['Runs'] = cricket['Runs'].astype(int)

cricket['Wickets'] = cricket['Wickets'].astype(int)

In [624]:
# Importing Label Encoder and encoding Team, Tournament and Player columns using Label Encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

cricket['Team'] = le.fit_transform(cricket['Team'])

cricket['Tournament'] = le.fit_transform(cricket['Tournament'])

cricket['Player'] = le.fit_transform(cricket['Player'])

In [625]:
# Converting Batting Average Rating, Batting SR Rating, Bowling SR Rating columns from object to integer data type
cricket['Batting Average Rating'] = cricket['Batting Average Rating'].astype(int)

cricket['Batting SR Rating'] = cricket['Batting SR Rating'].astype(int)

cricket['Bowling Average Rating'] = cricket['Bowling Average Rating'].astype(int)

cricket['Bowling SR Rating'] = cricket['Bowling SR Rating'].astype(int)

In [626]:
# Info of the dataset
cricket.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 36 columns):
Team                      631 non-null int32
Player                    631 non-null int32
Tournament                631 non-null int32
Matches                   631 non-null int64
Batting Innings           631 non-null float64
Not Out                   631 non-null float64
Highest Score             631 non-null float64
Balls Faced               631 non-null float64
Centuries                 631 non-null float64
Fifties                   631 non-null float64
Zeroes                    631 non-null float64
Fours                     631 non-null float64
Sixes                     631 non-null float64
Bowling Innings           631 non-null float64
Overs                     631 non-null float64
Maidens Bowled            631 non-null float64
Runs Conceded             631 non-null float64
Bowling Economy Rate      631 non-null float64
4+ Innings Wickets        631 non-null float64
5+ Innings

### 7. Splitting data for Runs <a id = 'rsplit'>

In [627]:
# Assigning X and y for train test split
X_runs = cricket.drop('Runs',1)
y_runs = cricket['Runs']

In [628]:
# Importing train test split and splitting the data with test size 0.33
from sklearn.model_selection import train_test_split

X_train_runs,X_test_runs,y_train_runs,y_test_runs = train_test_split(X_runs,y_runs,random_state=2,test_size=0.33)

### 8. Models for Runs Prediction <a id='rm'>

### 8.1 Decision Tree Classiifier <a id = dtr>

In [629]:
# Importing Decision Tree Classifier and fitting the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(class_weight='balanced')

here given class weight as balanced to balance the classes


In [630]:
# Training and testing the model
dt.fit(X_train_runs,y_train_runs)

dt_pred_runs = dt.predict(X_test_runs)

***Model Evaluation***

In [631]:
# Importing Metrics for model evaluation
from sklearn import metrics

In [632]:
# Accuracy Score
dt_runs_accuracy = metrics.accuracy_score(dt_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',dt_runs_accuracy)

# Precision Score
dt_runs_precision = metrics.precision_score(dt_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',dt_runs_precision)

# Recall Score
dt_runs_recall = metrics.recall_score(dt_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',dt_runs_recall)

# F1 Score
dt_runs_f1score = metrics.f1_score(dt_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',dt_runs_f1score)

# ROC AUC Score
dt_runs_aucrocscore = metrics.roc_auc_score(dt_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',dt_runs_aucrocscore)

# Train Score for Decision Tree Classifier
dt_runs_train = dt.score(X_train_runs,y_train_runs)

# Printing Train Score for Decision Tree Classifier
print('Train Score for Decision Tree Classifier',dt_runs_train)

# Test Score for Decision Tree Classifier
dt_runs_test = dt.score(X_test_runs,y_test_runs)

# Printing Test Score for Decision Tree Classifier
print('Test Score for Decision Tree Classifier',dt_runs_test)

Accuracy Score: 0.9617224880382775
Precision Score: 0.9900497512437811
Recall Score: 0.9707317073170731
F1 Score: 0.9802955665024631
ROC AUC Score: 0.7353658536585366
Train Score for Decision Tree Classifier 1.0
Test Score for Decision Tree Classifier 0.9617224880382775


We could see that the scores are good but as the Train score is 1.0 we can confirm that the model is Overfitting

To eliminate Overfitting we need to perform Pruning Techniques (ie) tune the parameters of the model

The tuning of the parameters are done by Randomized Search CV

### 8.2 Pruned Decision Tree Classifier<a id ='dtrprune'>

In [633]:
# Criterion for tree
criterion = ['gini','entropy']
# Splitter for tree
splitter = ['random','best']
# Maximum levels of  trees
max_depth = [2,5,10,15]
# Maximum number of samples required to split a node
max_leaf_nodes = [2, 5, 10]
# Maximum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree

# Parameter grid assigned for performing Randomized Sear
random_grid = {'criterion': criterion,
               'splitter': splitter,
               'max_depth': max_depth,
               'max_leaf_nodes': max_leaf_nodes,
               'min_samples_leaf': min_samples_leaf}

In [634]:
# Importing Randomized Search CV
from sklearn.model_selection import RandomizedSearchCV
# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
dt_random = RandomizedSearchCV(estimator = dt, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
dt_random.fit(X_runs,y_runs)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    0.8s finished


RandomizedSearchCV(cv=3,
                   estimator=DecisionTreeClassifier(class_weight='balanced'),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [2, 5, 10, 15],
                                        'max_leaf_nodes': [2, 5, 10],
                                        'min_samples_leaf': [1, 2, 4],
                                        'splitter': ['random', 'best']},
                   random_state=42, verbose=2)

In [635]:
# Best parameters
dt_random.best_params_

{'splitter': 'random',
 'min_samples_leaf': 1,
 'max_leaf_nodes': 10,
 'max_depth': 10,
 'criterion': 'entropy'}

In [648]:
# Fitting the model again after parameter tuning by assigning the parameters given by Randomized Search CV
dt_prune = DecisionTreeClassifier(class_weight='balanced',criterion = "entropy", splitter = 'random', max_leaf_nodes = 10, min_samples_leaf = 1,max_depth= 10)

dt_prune.fit(X_train_runs,y_train_runs)
dt_prune_runs_pred = dt_prune.predict(X_test_runs)

***Model Evaluation***

In [649]:
# Accuracy Score after Pruning
dt_prune_runs_accuracy = metrics.accuracy_score(dt_prune_runs_pred,y_test_runs)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',dt_prune_runs_accuracy)

# Precision Score after Pruning
dt_prune_runs_precision = metrics.precision_score(dt_prune_runs_pred,y_test_runs)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',dt_prune_runs_precision)

# Recall Score after Pruning
dt_prune_runs_recall = metrics.recall_score(dt_prune_runs_pred,y_test_runs)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',dt_prune_runs_recall)

# F1 Score after Pruning
dt_prune_runs_f1score = metrics.f1_score(dt_prune_runs_pred,y_test_runs)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',dt_prune_runs_f1score)

# ROC AUC Score after Pruning
dt_prune_runs_aucrocscore = metrics.roc_auc_score(dt_prune_runs_pred,y_test_runs)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',dt_prune_runs_aucrocscore)

# Train Score for Decision Tree Classifier after Pruning
dt_prune_runs_train = dt_prune.score(X_train_runs,y_train_runs)

# Printing Train Score for Decision Tree Classifier after Pruning
print('Train Score for Decision Tree Classifier after Pruning',dt_prune_runs_train)

# Test Score for Decision Tree Classifier after Pruning
dt_prune_runs_test = dt_prune.score(X_test_runs,y_test_runs)

# Printing Test Score for Decision Tree Classifier after Pruning
print('Test Score for Decision Tree Classifier ',dt_prune_runs_test)

Accuracy Score after Pruning: 0.9617224880382775
Precision Score after Pruning: 0.9900497512437811
Recall Score after Pruning: 0.9707317073170731
F1 Score after Pruning: 0.9802955665024631
ROC AUC Score after Pruning: 0.7353658536585366
Train Score for Decision Tree Classifier after Pruning 0.9976303317535545
Test Score for Decision Tree Classifier  0.9617224880382775


We could see after Pruning its not Overfitting like before

### 8.3 Random Forest Classifier <a id='rfr'>

In [650]:
# Importing Random Forest Classifier and fitting, training and testing the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight='balanced')

rf.fit(X_train_runs,y_train_runs)
rf_pred_runs = rf.predict(X_test_runs)

***Model Evaluation***

In [651]:
# Accuracy Score
rf_runs_accuracy = metrics.accuracy_score(rf_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',rf_runs_accuracy)

# Precision Score
rf_runs_precision = metrics.precision_score(rf_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',rf_runs_precision)

# Recall Score
rf_runs_recall = metrics.recall_score(rf_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',rf_runs_recall)

# F1 Score
rf_runs_f1score = metrics.f1_score(rf_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',rf_runs_f1score)

# ROC AUC Score
rf_runs_aucrocscore = metrics.roc_auc_score(rf_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',rf_runs_aucrocscore)

# Train Score for Random Forest Classifier
rf_runs_train = rf.score(X_train_runs,y_train_runs)

# Printing Train Score for Random Forest Classifier
print('Train Score for Random Forest Classifier',rf_runs_train)

# Test Score for Random Forest Classifier
rf_runs_test = rf.score(X_test_runs,y_test_runs)

# Printing Test Score for Random Forest Classifier
print('Test Score for Random Forest Classifier',rf_runs_test)

Accuracy Score: 0.9760765550239234
Precision Score: 1.0
Recall Score: 0.9757281553398058
F1 Score: 0.9877149877149877
ROC AUC Score: 0.9878640776699029
Train Score for Random Forest Classifier 1.0
Test Score for Random Forest Classifier 0.9760765550239234


Here too we could see Overfitting problem and also here for Precision the score is 1.0, maybe for this classifier there are no False Positives

### 8.4 Pruned Random Forest Classifier <a id='rfrprune'>

In [652]:
# No of estimators
n_estimators=[1, 2, 4, 8, 16, 32, 64, 100, 200]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [2,5,10,15]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Grid of parameters for Randomized Search CV
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [656]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_runs,y_runs)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:    8.5s
[Parallel(n_jobs=-1)]: Done 285 out of 300 | elapsed:    8.9s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   10.1s finished


RandomizedSearchCV(cv=3,
                   estimator=RandomForestClassifier(class_weight='balanced'),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [2, 5, 10, 15],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [1, 2, 4, 8, 16, 32, 64,
                                                         100, 200]},
                   random_state=42, verbose=2)

In [657]:
# Best parameters
rf_random.best_params_

{'n_estimators': 8,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 10,
 'bootstrap': True}

In [658]:
# Fitting the model,training and testing after parameter by assigning the parameters
rf_prune = RandomForestClassifier(class_weight='balanced',n_estimators=8,min_samples_split=2,min_samples_leaf=4,max_features='auto',max_depth=10,bootstrap='True')

rf_prune.fit(X_train_runs,y_train_runs)

rf_prune_pred_runs = rf_prune.predict(X_test_runs)

***Model Evaluation***

In [659]:
# Accuracy Score after Pruning
rf_prune_runs_accuracy = metrics.accuracy_score(rf_prune_pred_runs,y_test_runs)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',rf_prune_runs_accuracy)

# Precision Score after Pruning
rf_prune_runs_precision = metrics.precision_score(rf_prune_pred_runs,y_test_runs)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',rf_prune_runs_precision)

# Recall Score after Pruning
rf_prune_runs_recall = metrics.recall_score(rf_prune_pred_runs,y_test_runs)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',rf_prune_runs_recall)

# F1 Score after Pruning
rf_prune_runs_f1score = metrics.f1_score(rf_prune_pred_runs,y_test_runs)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',rf_prune_runs_f1score)

# ROC AUC Score after Pruning
rf_prune_runs_aucrocscore = metrics.roc_auc_score(rf_prune_pred_runs,y_test_runs)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',rf_prune_runs_aucrocscore)

# Train Score for Random Forest Classifier after Pruning
rf_prune_runs_train = rf_prune.score(X_train_runs,y_train_runs)

# Printing Train Score for Random Forest Classifier after Pruning
print('Train Score for Random Forest Classifier after Pruning',rf_prune_runs_train)

# Test Score for Random Forest Classifier after Pruning
rf_prune_runs_test = rf_prune.score(X_test_runs,y_test_runs)

# Printing Test Score for Random Forest Classifier after Pruning
print('Test Score for Random Forest Classifier after Pruning',rf_prune_runs_test)

Accuracy Score after Pruning: 0.9856459330143541
Precision Score after Pruning: 0.9900497512437811
Recall Score after Pruning: 0.995
F1 Score after Pruning: 0.9925187032418954
ROC AUC Score after Pruning: 0.8863888888888888
Train Score for Random Forest Classifier after Pruning 0.990521327014218
Test Score for Random Forest Classifier after Pruning 0.9856459330143541


We could see after Pruning its not Overfitting life before

### 8.5 Gradient Boosting Classifier <a id='gbr'>

In [660]:
# Importing Gradient Booosting Classifier and fitting, training and testing the model
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()

gb.fit(X_train_runs,y_train_runs)
gb_pred_runs = gb.predict(X_test_runs)

***Model Evaluation***

In [661]:
# Accuracy Score
gb_runs_accuracy = metrics.accuracy_score(gb_pred_runs,y_test_runs)

# Printing Accuracy Score 
print('Accuracy Score:',gb_runs_accuracy)

# Precision Score 
gb_runs_precision = metrics.precision_score(gb_pred_runs,y_test_runs)

# Printing Precision Score 
print('Precision Score:',gb_runs_precision)

# Recall Score
gb_runs_recall = metrics.recall_score(gb_pred_runs,y_test_runs)

# Printing Recall Score 
print('Recall Score',gb_runs_recall)

# F1 Score 
gb_runs_f1score = metrics.f1_score(gb_pred_runs,y_test_runs)

# Printing F1 Score 
print('F1 Score:',gb_runs_f1score)

# ROC AUC Score 
gb_runs_aucrocscore = metrics.roc_auc_score(gb_pred_runs,y_test_runs)

# Printing ROC AUC Score 
print('ROC AUC Score:',gb_runs_aucrocscore)

# Train Score for Gradient Boosting Classifier 
gb_runs_train = gb.score(X_train_runs,y_train_runs)

# Printing Train Score for Gradient Boosting Classifier 
print('Train Score for Gradient Boosting Classifier',gb_runs_train)

# Test Score for Gradient Boosting Classifier 
gb_runs_test = gb.score(X_test_runs,y_test_runs)

# Printing Test Score for Gradient Boosting Classifier
print('Test Score for Gradient Boosting Classifier',gb_runs_test)

Accuracy Score: 0.9856459330143541
Precision Score: 0.9950248756218906
Recall Score 0.9900990099009901
F1 Score: 0.9925558312655087
ROC AUC Score: 0.9236209335219236
Train Score for Gradient Boosting Classifier 1.0
Test Score for Gradient Boosting Classifier 0.9856459330143541


Here too there is Overfitting which needs to be corrected

### 8.6 Pruned Gradient Boosting Classifier <a id='gbrprune'>

In [662]:
# Loss
loss = ['deviance','exponential']
# Learning rate
learning_rate = [1, 0.5, 0.25, 0.1, 0.05, 0.01]
# No of estimators
n_estimators = [1, 2, 4, 8, 16, 32, 64]
# Max features
max_features = ['auto', 'sqrt', 'log2']
# No of levels of trees
max_depths = [2,5,10,15]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Critetion 
criterion = ['friedman mse','mse','mae']

# Parameters in the Grid for Randomized Search CV
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
                'learning_rate':learning_rate,
                'criterion':criterion}

In [667]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
gb_random.fit(X_runs,y_runs)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 285 out of 300 | elapsed:   21.0s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   30.0s finished


RandomizedSearchCV(cv=3, estimator=GradientBoostingClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'criterion': ['friedman mse', 'mse',
                                                      'mae'],
                                        'learning_rate': [1, 0.5, 0.25, 0.1,
                                                          0.05, 0.01],
                                        'max_depth': [2, 5, 10, 15],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [1, 2, 4, 8, 16, 32,
                                                         64]},
                   random_state=42, verbose=2)

In [668]:
# Best parameters
gb_random.best_params_

{'n_estimators': 32,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 10,
 'learning_rate': 0.25,
 'criterion': 'mae'}

In [669]:
# Fitting, training and testing the model again after Pruning
gb_prune = GradientBoostingClassifier(criterion='mae',n_estimators=32,min_samples_split=4,min_samples_leaf=5,max_features='auto',max_depth=10,learning_rate=0.25)

gb_prune.fit(X_train_runs,y_train_runs)

gb_prune_pred_runs = gb_prune.predict(X_test_runs)

***Model Evaluation***

In [670]:
# Accuracy Score after Pruning
gb_prune_runs_accuracy = metrics.accuracy_score(gb_prune_pred_runs,y_test_runs)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',gb_prune_runs_accuracy)

# Precision Score after Pruning
gb_prune_runs_precision = metrics.precision_score(gb_prune_pred_runs,y_test_runs)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',gb_prune_runs_precision)

# Recall Score after Pruning
gb_prune_runs_recall = metrics.recall_score(gb_prune_pred_runs,y_test_runs)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',gb_prune_runs_recall)

# F1 Score after Pruning
gb_prune_runs_f1score = metrics.f1_score(gb_prune_pred_runs,y_test_runs)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',gb_prune_runs_f1score)

# ROC AUC Score after Pruning
gb_prune_runs_aucrocscore = metrics.roc_auc_score(gb_prune_pred_runs,y_test_runs)

# Printing ROC AUC Score after Pruning 
print('ROC AUC Score after Pruning:',gb_prune_runs_aucrocscore)

# Train Score for Gradient Boosting Classifier after Pruning
gb_prune_runs_train = gb_prune.score(X_train_runs,y_train_runs)

# Printing Train Score for Gradient Boosting Classifier 
print('Train Score for Gradient Boosting Classifier after Pruning',gb_prune_runs_train)

# Test Score for Gradient Boosting Classifier after Pruning
gb_prune_runs_test = gb_prune.score(X_test_runs,y_test_runs)

# Printing Test Score for Gradient Boosting Classifier after Pruning
print('Test Score for Gradient Boosting Classifier after Pruning',gb_runs_test)

Accuracy Score after Pruning: 0.9856459330143541
Precision Score after Pruning: 1.0
Recall Score after Pruning: 0.9852941176470589
F1 Score after Pruning: 0.9925925925925926
ROC AUC Score after Pruning: 0.9926470588235294
Train Score for Gradient Boosting Classifier after Pruning 0.9928909952606635
Test Score for Gradient Boosting Classifier after Pruning 0.9856459330143541


Here we could see we have corrected Overfitting

### 8.7 Support Vector Machine Classifier <a id='svmr'>

In [671]:
# Importing Support Vector Machine Classifier and fitting,training and testing the model
from sklearn.svm import SVC
svc = SVC()

svc.fit(X_train_runs,y_train_runs)
svc_pred_runs = svc.predict(X_test_runs)

***Model Evaluation***

In [672]:
# Accuracy Score
svc_runs_accuracy = metrics.accuracy_score(svc_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',svc_runs_accuracy)

# Precision Score
svc_runs_precision = metrics.precision_score(svc_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',svc_runs_precision)

# Recall Score
svc_runs_recall = metrics.recall_score(svc_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',svc_runs_recall)

# F1 Score
svc_runs_f1score = metrics.f1_score(svc_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',svc_runs_f1score)

# ROC AUC Score
svc_runs_aucrocscore = metrics.roc_auc_score(svc_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',svc_runs_aucrocscore)

# Train Score for Support Machine Classifier
svc_runs_train = svc.score(X_train_runs,y_train_runs)

# Printing Train Score for Support Machine Classifier
print('Train Score for Support Machine Classifier',svc_runs_train)

# Test Score for Support Machine Classifier
svc_runs_test = svc.score(X_test_runs,y_test_runs)

# Printing Test Score for Support Machine Classifier
print('Test Score for Gradient Boosting Classifier',svc_runs_test)

Accuracy Score: 0.9760765550239234
Precision Score: 1.0
Recall Score: 0.9757281553398058
F1 Score: 0.9877149877149877
ROC AUC Score: 0.9878640776699029
Train Score for Support Machine Classifier 0.990521327014218
Test Score for Gradient Boosting Classifier 0.9760765550239234


Support Vector Machine draws a hyperplane to reduce overfitting unlike other algorithms, and here too Precision is 1.0 through which we can tell there are no False Positives

### 8.8 KNearest Neighbors Classifier<a id='knnr'>

In [673]:
# Importing the KNearest Neighbors model and fitting,training and testing
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

knn.fit(X_train_runs,y_train_runs)
knn_pred_runs = knn.predict(X_test_runs)

***Model Evaluation***

In [674]:
# Accuracy Score
knn_runs_accuracy = metrics.accuracy_score(knn_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',knn_runs_accuracy)

# Precision Score
knn_runs_precision = metrics.precision_score(knn_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',knn_runs_precision)

# Recall Score
knn_runs_recall = metrics.recall_score(knn_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',knn_runs_recall)

# F1 Score
knn_runs_f1score = metrics.f1_score(knn_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',knn_runs_f1score)

# ROC AUC Score
knn_runs_aucrocscore = metrics.roc_auc_score(knn_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',knn_runs_aucrocscore)

# Train Score for KNearest Neighbors Classifier
knn_runs_train = knn.score(X_train_runs,y_train_runs)

# Printing Train Score for KNearest Neighbors Classifier
print('Train Score for Support Machine Classifier',knn_runs_train)

# Test Score for KNearest Neighbors Classifier
knn_runs_test = knn.score(X_test_runs,y_test_runs)

# Printing Test Score for KNearest Neighbors Classifier
print('Test Score for Gradient Boosting Classifier',knn_runs_test)

Accuracy Score: 0.9760765550239234
Precision Score: 1.0
Recall Score: 0.9757281553398058
F1 Score: 0.9877149877149877
ROC AUC Score: 0.9878640776699029
Train Score for Support Machine Classifier 0.990521327014218
Test Score for Gradient Boosting Classifier 0.9760765550239234


As KNN works with distance criteria, this algorithm too prevents Overfitting, and here too there are no False Positives as Precision is 1.0

### 8.9 Bagging Classifier <a id='bagr'>

In [675]:
# Importing Bagging Classifier and fitting,training and testing
from sklearn.ensemble import BaggingClassifier
bc = BaggingClassifier()

bc.fit(X_train_runs,y_train_runs)
bc_pred_runs = bc.predict(X_test_runs)

***Model Evaluation***

In [676]:
# Accuracy Score
bc_runs_accuracy = metrics.accuracy_score(bc_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',bc_runs_accuracy)

# Precision Score
bc_runs_precision = metrics.precision_score(bc_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',bc_runs_precision)

# Recall Score
bc_runs_recall = metrics.recall_score(bc_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',bc_runs_recall)

# F1 Score
bc_runs_f1score = metrics.f1_score(bc_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',bc_runs_f1score)

# ROC AUC Score
bc_runs_aucrocscore = metrics.roc_auc_score(bc_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',bc_runs_aucrocscore)

# Train Score for Bagging Classifier
bc_runs_train = bc.score(X_train_runs,y_train_runs)

# Printing Train Score for Bagging Classifier
print('Train Score for Bagging Classifier',bc_runs_train)

# Test Score for Bagging Classifier
bc_runs_test = bc.score(X_test_runs,y_test_runs)

# Printing Test Score for Bagging Classifier
print('Test Score for Bagging Classifier',bc_runs_test)

Accuracy Score: 0.9856459330143541
Precision Score: 0.9950248756218906
Recall Score: 0.9900990099009901
F1 Score: 0.9925558312655087
ROC AUC Score: 0.9236209335219236
Train Score for Bagging Classifier 0.9976303317535545
Test Score for Bagging Classifier 0.9856459330143541


As Bagging Classifier reduces variance it thereby prevents Overfitting

### 8.10 XG Boost Classifier <a id='xgbr'>

In [677]:
# Importing XG Boost Classifier and fitting,training and testing
from xgboost.sklearn import XGBClassifier

xgb = XGBClassifier()

xgb.fit(X_train_runs,y_train_runs)
xgb_pred_runs = xgb.predict(X_test_runs)

***Model Evaluation***

In [678]:
# Accuracy Score
xgb_runs_accuracy = metrics.accuracy_score(xgb_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',xgb_runs_accuracy)

# Precision Score
xgb_runs_precision = metrics.precision_score(xgb_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',xgb_runs_precision)

# Recall Score
xgb_runs_recall = metrics.recall_score(xgb_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',xgb_runs_recall)

# F1 Score
xgb_runs_f1score = metrics.f1_score(xgb_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',xgb_runs_f1score)

# ROC AUC Score
xgb_runs_aucrocscore = metrics.roc_auc_score(xgb_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',xgb_runs_aucrocscore)

# Train Score for XG Boost Classifier
xgb_runs_train = xgb.score(X_train_runs,y_train_runs)

# Printing Train Score for XG Boost Classifier
print('Train Score for XG Boost Classifier',xgb_runs_train)

# Test Score for XG Boost Classifier
xgb_runs_test = xgb.score(X_test_runs,y_test_runs)

# Printing Test Score for XG Boost Classifier
print('Test Score for XG Boost Classifier',xgb_runs_test)

Accuracy Score: 0.9808612440191388
Precision Score: 0.9950248756218906
Recall Score: 0.9852216748768473
F1 Score: 0.9900990099009901
ROC AUC Score: 0.9092775041050905
Train Score for XG Boost Classifier 1.0
Test Score for XG Boost Classifier 0.9808612440191388


We need to prevent the Overfitting through Pruning

### 8.11 Pruned XG Boost Classifier<a id='xgbrprune'>

In [679]:
# Assigning Grid for Randomized Search CV
params_xgb_GS = {"max_depth": [3,5,6,7,8],
              "min_child_weight" : [5,6,7,8],
            'learning_rate':[0.05,0.1,0.2],
            'n_estimators': [10,30,50,70]}

# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = params_xgb_GS, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [680]:
# Fit the random search model
xgb_random.fit(X_runs,y_runs)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 285 out of 300 | elapsed:    2.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    2.4s finished


RandomizedSearchCV(cv=3,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           gpu_id=-1, importance_type='gain',
                                           interaction_constraints='',
                                           learning_rate=0.300000012,
                                           max_delta_step=0, max_depth=6,
                                           min_child_weight=1, missing=nan,
                                           monotone_constraints='()',
                                           n_estimators=100, n_jobs=0,
                                           num_parallel_tree=1, random_state=0,
                                           reg_alpha=0, reg_lambda=1,
                                       

In [681]:
# Best parameters
xgb_random.best_params_

{'n_estimators': 50,
 'min_child_weight': 5,
 'max_depth': 6,
 'learning_rate': 0.1}

In [682]:
# Fitting,training and testing the model after Pruning
xgb_prune = XGBClassifier(n_estimators=50,min_child_weight=5,max_depth=6,learning_rate=0.1)

xgb_prune.fit(X_train_runs,y_train_runs)

xgb_prune_pred_runs = xgb_prune.predict(X_test_runs)

***Model Evaluation***

In [683]:
# Accuracy Score after Pruning
xgb_prune_runs_accuracy = metrics.accuracy_score(xgb_prune_pred_runs,y_test_runs)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',xgb_prune_runs_accuracy)

# Precision Score after Pruning
xgb_prune_runs_precision = metrics.precision_score(xgb_prune_pred_runs,y_test_runs)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',xgb_prune_runs_precision)

# Recall Score after Pruning
xgb_prune_runs_recall = metrics.recall_score(xgb_prune_pred_runs,y_test_runs)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',xgb_prune_runs_recall)

# F1 Score after Pruning
xgb_prune_runs_f1score = metrics.f1_score(xgb_prune_pred_runs,y_test_runs)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',xgb_prune_runs_f1score)

# ROC AUC Score after Pruning
xgb_prune_runs_aucrocscore = metrics.roc_auc_score(xgb_prune_pred_runs,y_test_runs)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',xgb_prune_runs_aucrocscore)

# Train Score for XG Boost Classifier after Pruning
xgb_prune_runs_train = xgb_prune.score(X_train_runs,y_train_runs)

# Printing Train Score for XG Boost Classifier after Pruning
print('Train Score for XG Boost Classifier after Pruning',xgb_prune_runs_train)

# Test Score for XG Boost Classifier after Pruning
xgb_prune_runs_test = xgb_prune.score(X_test_runs,y_test_runs)

# Printing Test Score for XG Boost Classifier after Pruning
print('Test Score for XG Boost Classifier after Pruning',xgb_prune_runs_test)

Accuracy Score after Pruning: 0.9808612440191388
Precision Score after Pruning: 0.9950248756218906
Recall Score after Pruning: 0.9852216748768473
F1 Score after Pruning: 0.9900990099009901
ROC AUC Score after Pruning: 0.9092775041050905
Train Score for XG Boost Classifier after Pruning 0.990521327014218
Test Score for XG Boost Classifier after Pruning 0.9808612440191388


Here we have corrected Overfitting

### 8.12 Ada Boost Classifier <a id='abr'>

In [762]:
# Importing Ada Boost Classifier and fitting,training and testing the model
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier()

ab.fit(X_train_runs,y_train_runs)

ab_pred_runs = ab.predict(X_test_runs)

***Model Evaluation***

In [763]:
# Accuracy Score
ab_runs_accuracy = metrics.accuracy_score(ab_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',ab_runs_accuracy)

# Precision Score
ab_runs_precision = metrics.precision_score(ab_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',ab_runs_precision)

# Recall Score
ab_runs_recall = metrics.recall_score(ab_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',ab_runs_recall)

# F1 Score
ab_runs_f1score = metrics.f1_score(ab_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',ab_runs_f1score)

# ROC AUC Score
ab_runs_aucrocscore = metrics.roc_auc_score(ab_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',ab_runs_aucrocscore)

# Train Score for Ada Boost Classifier
ab_runs_train = ab.score(X_train_runs,y_train_runs)

# Printing Train Score for Ada Boost Classifier
print('Train Score for Ada Boost Classifier',ab_runs_train)

# Test Score for Ada Boost Classifier
ab_runs_test = ab.score(X_test_runs,y_test_runs)

# Printing Test Score for Ada Boost Classifier
print('Test Score for Ada Boost Classifier',ab_runs_test)

Accuracy Score: 0.9856459330143541
Precision Score: 0.9950248756218906
Recall Score: 0.9900990099009901
F1 Score: 0.9925558312655087
ROC AUC Score: 0.9236209335219236
Train Score for Ada Boost Classifier 1.0
Test Score for Ada Boost Classifier 0.9856459330143541


The Overfitting should be corrected

### 8.13 Pruned Ada Boost Classifier <a id='abrprune'>

In [686]:
# Assigning the parameters to a Grid for performing Randomized Search CV
params_Adb_GS = {'learning_rate':[0.05,0.1,0.2,1],'n_estimators':[10,30,50,60,70,75],'algorithm':['SAMME', 'SAMME.R']}

# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
ab_random = RandomizedSearchCV(estimator = ab,param_distributions=params_Adb_GS,n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [687]:
# Fit the Randomized Search CV
ab_random.fit(X_runs,y_runs)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:    4.9s finished


RandomizedSearchCV(cv=3, estimator=AdaBoostClassifier(), n_iter=100, n_jobs=-1,
                   param_distributions={'algorithm': ['SAMME', 'SAMME.R'],
                                        'learning_rate': [0.05, 0.1, 0.2, 1],
                                        'n_estimators': [10, 30, 50, 60, 70,
                                                         75]},
                   random_state=42, verbose=2)

In [688]:
# Best parameters
ab_random.best_params_

{'n_estimators': 10, 'learning_rate': 0.2, 'algorithm': 'SAMME'}

In [689]:
# Fitting,training and testing the model after Pruning
ab_prune = AdaBoostClassifier(n_estimators=10,learning_rate=0.2,algorithm='SAMME')

ab_prune.fit(X_train_runs,y_train_runs)

ab_prune_pred_runs = ab_prune.predict(X_test_runs)

***Model Evaluation***

In [690]:
# Accuracy Score after Pruning
ab_prune_runs_accuracy = metrics.accuracy_score(ab_prune_pred_runs,y_test_runs)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',ab_prune_runs_accuracy)

# Precision Score after Pruning
ab_prune_runs_precision = metrics.precision_score(ab_prune_pred_runs,y_test_runs)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',ab_prune_runs_precision)

# Recall Score after Pruning
ab_prune_runs_recall = metrics.recall_score(ab_prune_pred_runs,y_test_runs)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',ab_prune_runs_recall)

# F1 Score after Pruning
ab_prune_runs_f1score = metrics.f1_score(ab_prune_pred_runs,y_test_runs)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',ab_prune_runs_f1score)

# ROC AUC Score after Pruning
ab_prune_runs_aucrocscore = metrics.roc_auc_score(ab_prune_pred_runs,y_test_runs)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',ab_prune_runs_aucrocscore)

# Train Score for Ada Boost Classifier after Pruning
ab_prune_runs_train = ab_prune.score(X_train_runs,y_train_runs)

# Printing Train Score for Ada Boost Classifier after Pruning
print('Train Score for Ada Boost Classifier after Pruning',ab_prune_runs_train)

# Test Score for Ada Boost Classifier after Pruning
ab_prune_runs_test = ab_prune.score(X_test_runs,y_test_runs)

# Printing Test Score for Ada Boost Classifier after Pruning
print('Test Score for Ada Boost Classifier after Pruning',ab_prune_runs_test)

Accuracy Score after Pruning: 0.9808612440191388
Precision Score after Pruning: 0.9950248756218906
Recall Score after Pruning: 0.9852216748768473
F1 Score after Pruning: 0.9900990099009901
ROC AUC Score after Pruning: 0.9092775041050905
Train Score for XG Boost Classifier after Pruning 0.995260663507109
Test Score for XG Boost Classifier after Pruning 0.9808612440191388


We have prevented Overfitting

### 8.14 Logistic Regression <a id='lrr'>

In [691]:
# Importing Logistic Regression and fitting,training and testing the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

lr.fit(X_train_runs,y_train_runs)

lr_pred_runs = lr.predict(X_test_runs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


***Model Evaluation***

In [692]:
# Accuracy Score
lr_runs_accuracy = metrics.accuracy_score(lr_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',lr_runs_accuracy)

# Precision Score
lr_runs_precision = metrics.precision_score(lr_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',lr_runs_precision)

# Recall Score
lr_runs_recall = metrics.recall_score(lr_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',lr_runs_recall)

# F1 Score
lr_runs_f1score = metrics.f1_score(lr_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',lr_runs_f1score)

# ROC AUC Score
lr_runs_aucrocscore = metrics.roc_auc_score(lr_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',lr_runs_aucrocscore)

# Train Score for Logistic Regression
lr_runs_train = lr.score(X_train_runs,y_train_runs)

# Printing Train Score for Logistic Regression
print('Train Score for Logistic Regression',lr_runs_train)

# Test Score for Logistic Regression
lr_runs_test = lr.score(X_test_runs,y_test_runs)

# Printing Test Score for Logistic Regression
print('Test Score for Logistic Regression',lr_runs_test)

Accuracy Score: 0.9808612440191388
Precision Score: 1.0
Recall Score: 0.9804878048780488
F1 Score: 0.9901477832512315
ROC AUC Score: 0.9902439024390244
Train Score for Logistic Regression 1.0
Test Score for Logistic Regression 0.9808612440191388


We could see Overfitting, as Logistic Regression is a base model lets try using it as a meta classifier for Stacking to stack other models into it

### 8.15 Stacking for Logistic Regression using Mlxtend Classifier <a id='strmlx'>

In [693]:
# Importing Mlxtend classifier and fitting,training and testing
from mlxtend.classifier import StackingClassifier

# Stacking KNN,SVM,XG Boost,Ada Boost and Gradient Boosting models into Logistic Regression for better results
lrstack = StackingClassifier(classifiers=[knn,svc,xgb_prune,ab_prune,gb_prune],meta_classifier=lr)

lrstack.fit(X_train_runs,y_train_runs)

lrstack_pred_runs = st.predict(X_test_runs)

***Model Evaluation***

In [694]:
# Accuracy Score after Stacking for Logistic Regression
lrstack_runs_accuracy = metrics.accuracy_score(lrstack_pred_runs,y_test_runs)

# Printing Accuracy Score after Stacking for Logistic Regression
print('Accuracy Score after Stacking for Logistic Regression:',lrstack_runs_accuracy)

# Precision Score after Stacking for Logistic Regression
lrstack_runs_precision = metrics.precision_score(lrstack_pred_runs,y_test_runs)

# Printing Precision Score after Stacking for Logistic Regression
print('Precision Score after Stacking for Logistic Regression:',lrstack_runs_precision)

# Recall Score after Stacking for Logistic Regression
lrstack_runs_recall = metrics.recall_score(lrstack_pred_runs,y_test_runs)

# Printing Recall Score after Stacking for Logistic Regression
print('Recall Score after Stacking for Logistic Regression:',lrstack_runs_recall)

# F1 Score after Stacking for Logistic Regression
lrstack_runs_f1score = metrics.f1_score(lrstack_pred_runs,y_test_runs)

# Printing F1 Score after Stacking for Logistic Regression
print('F1 Score after Stacking for Logistic Regression:',lrstack_runs_f1score)

# ROC AUC Score after Stacking for Logistic Regression
lrstack_runs_aucrocscore = metrics.roc_auc_score(lrstack_pred_runs,y_test_runs)

# Printing ROC AUC Score after Stacking for Logistic Regression
print('ROC AUC Score after Stacking for Logistic Regression:',lrstack_runs_aucrocscore)

# Train Score for Logistic Regression after Stacking for Logistic Regression
lrstack_runs_train = lrstack.score(X_train_runs,y_train_runs)

# Printing Train Score for Logistic Regression after Stacking for Logistic Regression
print('Train Score for Logistic Regression after Stacking for Logistic Regression',lrstack_runs_train)

# Test Score for Logistic Regression after Stacking for Logistic Regression
lrstack_runs_test = lrstack.score(X_test_runs,y_test_runs)

# Printing Test Score for Logistic Regression after Stacking for Logistic Regression
print('Test Score for Logistic Regression after Stacking for Logistic Regression',lrstack_runs_test)

Accuracy Score after Stacking for Logistic Regression: 0.9760765550239234
Precision Score after Stacking for Logistic Regression: 1.0
Recall Score after Stacking for Logistic Regression: 0.9757281553398058
F1 Score after Stacking for Logistic Regression: 0.9877149877149877
ROC AUC Score after Stacking for Logistic Regression: 0.9878640776699029
Train Score for Logistic Regression after Stacking for Logistic Regression 0.9928909952606635
Test Score for Logistic Regression after Stacking for Logistic Regression 0.9760765550239234


Now Logistic Regression give better results

### 8.16 Stacking using Voting Classifier <a id='strvote'>

In [695]:
# Assigning estimator models for voting classifier
vote_est = [('knn',knn),('xgb',xgb_prune),('SVM',svc)]

In [696]:
# Importing Voting Classifier and fitting the model
from sklearn.ensemble import VotingClassifier
vote = VotingClassifier(estimators=vote_est)

In [697]:
# Training and testing the model
vote.fit(X_train_runs,y_train_runs)

vote_pred_runs = vote.predict(X_test_runs)

***Model Evaluation***

In [698]:
# Accuracy Score
vote_runs_accuracy = metrics.accuracy_score(vote_pred_runs,y_test_runs)

# Printing Accuracy Score
print('Accuracy Score:',vote_runs_accuracy)

# Precision Score
vote_runs_precision = metrics.precision_score(vote_pred_runs,y_test_runs)

# Printing Precision Score
print('Precision Score:',vote_runs_precision)

# Recall Score
vote_runs_recall = metrics.recall_score(vote_pred_runs,y_test_runs)

# Printing Recall Score
print('Recall Score:',vote_runs_recall)

# F1 Score
vote_runs_f1score = metrics.f1_score(vote_pred_runs,y_test_runs)

# Printing F1 Score
print('F1 Score:',vote_runs_f1score)

# ROC AUC Score
vote_runs_aucrocscore = metrics.roc_auc_score(vote_pred_runs,y_test_runs)

# Printing ROC AUC Score
print('ROC AUC Score:',vote_runs_aucrocscore)

# Train Score
vote_runs_train = vote.score(X_train_runs,y_train_runs)

# Printing Train Score
print('Train Score',vote_runs_train)

# Test Score
vote_runs_test = vote.score(X_test_runs,y_test_runs)

# Printing Test Score
print('Test Score',vote_runs_test)

Accuracy Score: 0.9760765550239234
Precision Score: 1.0
Recall Score: 0.9757281553398058
F1 Score: 0.9877149877149877
ROC AUC Score: 0.9878640776699029
Train Score 0.990521327014218
Test Score 0.9760765550239234


There is no Overfitting

### 8.17 Comparison Table of all models for Runs Prediction <a id='ctr'>

In [699]:
# Creating dictionary with all the metrics
runs_metrics = {'Classifier': ['Decision Tree','Pruned Decision Tree','Random Forest','Pruned Random Forest','Gradient Boosting','Pruned Gradient Boosting','Support Vector Machine','KNN','Bagging','XG Boost','Pruned XG Boost','Ada Boost','Pruned Ada Boost','Logistic Regression','Stacking for Logistic Regression using Mlxtend','Stacking using Voting'],
                'Accuracy':[dt_runs_accuracy,dt_prune_runs_accuracy,rf_runs_accuracy,rf_prune_runs_accuracy,gb_runs_accuracy,gb_prune_runs_accuracy,svc_runs_accuracy,knn_runs_accuracy,bc_runs_accuracy,xgb_runs_accuracy,xgb_prune_runs_accuracy,ab_runs_accuracy,ab_prune_runs_accuracy,lr_runs_accuracy,lrstack_runs_accuracy,vote_runs_accuracy],
                'Precision':[dt_runs_precision,dt_prune_runs_precision,rf_runs_precision,rf_prune_runs_precision,gb_runs_precision,gb_prune_runs_precision,svc_runs_precision,knn_runs_precision,bc_runs_precision,xgb_runs_precision,xgb_prune_runs_precision,ab_runs_precision,ab_prune_runs_precision,lr_runs_precision,lrstack_runs_precision,vote_runs_precision],
                'Recall':[dt_runs_recall,dt_prune_runs_recall,rf_runs_recall,rf_prune_runs_recall,gb_runs_recall,gb_prune_runs_recall,svc_runs_recall,knn_runs_recall,bc_runs_recall,xgb_runs_recall,xgb_prune_runs_recall,ab_runs_recall,ab_prune_runs_recall,lr_runs_recall,lrstack_runs_recall,vote_runs_recall],
                'F1 Score':[dt_runs_f1score,dt_prune_runs_f1score,rf_runs_f1score,rf_prune_runs_f1score,gb_runs_f1score,gb_prune_runs_f1score,svc_runs_f1score,knn_runs_f1score,bc_runs_f1score,xgb_runs_f1score,xgb_prune_runs_f1score,ab_runs_f1score,ab_prune_runs_f1score,lr_runs_f1score,lrstack_runs_f1score,vote_runs_f1score],
                'AUCROC Score':[dt_runs_aucrocscore,dt_prune_runs_aucrocscore,rf_runs_aucrocscore,rf_prune_runs_aucrocscore,gb_runs_aucrocscore,gb_prune_runs_aucrocscore,svc_runs_aucrocscore,knn_runs_aucrocscore,bc_runs_aucrocscore,xgb_runs_aucrocscore,xgb_prune_runs_aucrocscore,ab_runs_aucrocscore,ab_prune_runs_aucrocscore,lr_runs_aucrocscore,lrstack_runs_aucrocscore,vote_runs_aucrocscore],
                'Train Score':[dt_runs_train,dt_prune_runs_train,rf_runs_train,rf_prune_runs_train,gb_runs_train,gb_prune_runs_train,svc_runs_train,knn_runs_train,xgb_runs_train,bc_runs_train,xgb_prune_runs_train,ab_runs_train,ab_prune_runs_train,lr_runs_train,lrstack_runs_train,vote_runs_train],
                'Test Score':[dt_runs_test,dt_prune_runs_test,rf_runs_test,rf_prune_runs_test,gb_runs_test,gb_prune_runs_test,svc_runs_test,knn_runs_test,xgb_runs_test,bc_runs_test,xgb_prune_runs_test,ab_runs_test,ab_prune_runs_test,lr_runs_test,lrstack_runs_test,vote_runs_test]}

In [700]:
# Making the dictionary as Dataframe
runs_metrics = pd.DataFrame(runs_metrics)

In [701]:
# Printing the dataframe
runs_metrics

Unnamed: 0,Classifier,Accuracy,Precision,Recall,F1 Score,AUCROC Score,Train Score,Test Score
0,Decision Tree,0.961722,0.99005,0.970732,0.980296,0.735366,1.0,0.961722
1,Pruned Decision Tree,0.961722,0.99005,0.970732,0.980296,0.735366,0.99763,0.961722
2,Random Forest,0.976077,1.0,0.975728,0.987715,0.987864,1.0,0.976077
3,Pruned Random Forest,0.985646,0.99005,0.995,0.992519,0.886389,0.990521,0.985646
4,Gradient Boosting,0.985646,0.995025,0.990099,0.992556,0.923621,1.0,0.985646
5,Pruned Gradient Boosting,0.985646,1.0,0.985294,0.992593,0.992647,0.992891,0.985646
6,Support Vector Machine,0.976077,1.0,0.975728,0.987715,0.987864,0.990521,0.976077
7,KNN,0.976077,1.0,0.975728,0.987715,0.987864,0.990521,0.976077
8,Bagging,0.985646,0.995025,0.990099,0.992556,0.923621,1.0,0.980861
9,XG Boost,0.980861,0.995025,0.985222,0.990099,0.909278,0.99763,0.985646


Among all the models we could see Pruned XG Boost,Pruned Gradient Boosting are best models with least Train and Test score difference

### 9. Splitting data for Wickets prediction <a id='wsplit'>

In [702]:
# Assigning X and y for train test split
X_wickets = cricket.drop('Wickets',1)
y_wickets = cricket['Wickets']

In [703]:
# Splitting data into train and test using train test split
X_train_wickets,X_test_wickets,y_train_wickets,y_test_wickets = train_test_split(X_wickets,y_wickets,test_size=0.33,random_state=2)

### 10. Models for Wicket Predictions <a id ='wm'>

### 10.1 Decision Tree <a id='dtw'>

In [794]:
# Fitting,training and testing the model

dtw = DecisionTreeClassifier(class_weight='balanced')

dtw.fit(X_train_wickets,y_train_wickets)

dt_pred_wickets = dtw.predict(X_test_wickets)

***Model Evaluation***

In [795]:
# Accuracy Score
dt_wickets_accuracy = metrics.accuracy_score(dt_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',dt_wickets_accuracy)

# Precision Score
dt_wickets_precision = metrics.precision_score(dt_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',dt_wickets_precision)

# Recall Score
dt_wickets_recall = metrics.recall_score(dt_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',dt_wickets_recall)

# F1 Score
dt_wickets_f1score = metrics.f1_score(dt_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',dt_wickets_f1score)

# ROC AUC Score
dt_wickets_aucrocscore = metrics.roc_auc_score(dt_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',dt_wickets_aucrocscore)

# Train Score for Decision Tree Classifier
dt_wickets_train = dtw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Decision Tree Classifier
print('Train Score for Decision Tree Classifier',dt_wickets_train)

# Test Score for Decision Tree Classifier
dt_wickets_test = dtw.score(X_test_runs,y_test_wickets)

# Printing Test Score for Decision Tree Classifier
print('Test Score for Decision Tree Classifier',dt_wickets_test)

Accuracy Score: 0.9234449760765551
Precision Score: 0.953125
Recall Score: 0.9631578947368421
F1 Score: 0.9581151832460733
ROC AUC Score: 0.7447368421052631
Train Score for Decision Tree Classifier 1.0
Test Score for Decision Tree Classifier 0.9234449760765551


We could see that the scores are good but as the Train score is 1.0 we can confirm that the model is Overfitting

To eliminate Overfitting we need to perform Pruning Techniques (ie) tune the parameters of the model

The tuning of the parameters are done by Randomized Search CV

### 10.2 Pruned Decision Tree Classifier<a id ='dtwprune'>

In [819]:
# Criterion for tree
criterion = ['gini','entropy']
# Splitter for tree
splitter = ['random','best']
# Maximum levels of  trees
max_depth = [2,5,10,15]
# Maximum number of samples required to split a node
max_leaf_nodes = [2, 5, 10]
# Maximum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree

# Parameter grid assigned for performing Randomized Sear
random_grid = {'criterion': criterion,
               'splitter': splitter,
               'max_depth': max_depth,
               'max_leaf_nodes': max_leaf_nodes,
               'min_samples_leaf': min_samples_leaf}

In [820]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
dt_random = RandomizedSearchCV(estimator = dtw, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
dt_random.fit(X_wickets,y_wickets)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    3.6s finished


RandomizedSearchCV(cv=3,
                   estimator=DecisionTreeClassifier(class_weight='balanced'),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [2, 5, 10, 15],
                                        'max_leaf_nodes': [2, 5, 10],
                                        'min_samples_leaf': [1, 2, 4],
                                        'splitter': ['random', 'best']},
                   random_state=42, verbose=2)

In [821]:
# Best parameters
dt_random.best_params_

{'splitter': 'random',
 'min_samples_leaf': 4,
 'max_leaf_nodes': 10,
 'max_depth': 15,
 'criterion': 'entropy'}

In [822]:
# Fitting the model again after parameter tuning by assigning the parameters given by Randomized Search CV
dtw_prune = DecisionTreeClassifier(class_weight='balanced',criterion = "entropy", splitter = 'random', max_leaf_nodes = 10, min_samples_leaf = 4,max_depth= 15)

dtw_prune.fit(X_train_wickets,y_train_wickets)
dt_prune_wickets_pred = dtw_prune.predict(X_test_wickets)

***Model Evaluation***

In [882]:
# Accuracy Score after Pruning
dt_prune_wickets_accuracy = metrics.accuracy_score(dt_prune_wickets_pred,y_test_wickets)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',dt_prune_wickets_accuracy)

# Precision Score after Pruning
dt_prune_wickets_precision = metrics.precision_score(dt_prune_wickets_pred,y_test_wickets)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',dt_prune_wickets_precision)

# Recall Score after Pruning
dt_prune_wickets_recall = metrics.recall_score(dt_prune_wickets_pred,y_test_wickets)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',dt_prune_wickets_recall)

# F1 Score after Pruning
dt_prune_wickets_f1score = metrics.f1_score(dt_prune_wickets_pred,y_test_wickets)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',dt_prune_wickets_f1score)

# ROC AUC Score after Pruning
dt_prune_wickets_aucrocscore = metrics.roc_auc_score(dt_prune_wickets_pred,y_test_wickets)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',dt_prune_wickets_aucrocscore)

# Train Score for Decision Tree Classifier after Pruning
dt_prune_wickets_train = dtw_prune.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Decision Tree Classifier after Pruning
print('Train Score for Decision Tree Classifier after Pruning',dt_prune_wickets_train)

# Test Score for Decision Tree Classifier after Pruning
dt_prune_wickets_test = dtw_prune.score(X_test_runs,y_test_wickets)

# Printing Test Score for Decision Tree Classifier after Pruning
print('Test Score for Decision Tree Classifier ',dt_prune_wickets_test)

Accuracy Score after Pruning: 0.9138755980861244
Precision Score after Pruning: 0.9270833333333334
Recall Score after Pruning: 0.978021978021978
F1 Score after Pruning: 0.9518716577540107
ROC AUC Score after Pruning: 0.7297517297517297
Train Score for Decision Tree Classifier after Pruning 0.9478672985781991
Test Score for Decision Tree Classifier  0.9138755980861244


We could see after Pruning its not Overfitting that much like before

### 10.3 Random Forest Classifier <a id='rfw'>

In [824]:
# Fitting, training and testing the model
rfw = RandomForestClassifier(class_weight='balanced')

rfw.fit(X_train_runs,y_train_wickets)
rf_pred_wickets = rfw.predict(X_test_wickets)

***Model Evaluation***

In [884]:
# Accuracy Score
rf_wickets_accuracy = metrics.accuracy_score(rf_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',rf_wickets_accuracy)

# Precision Score
rf_wickets_precision = metrics.precision_score(rf_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',rf_wickets_precision)

# Recall Score
rf_wickets_recall = metrics.recall_score(rf_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',rf_wickets_recall)

# F1 Score
rf_wickets_f1score = metrics.f1_score(rf_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',rf_wickets_f1score)

# ROC AUC Score after Pruning
rf_wickets_aucrocscore = metrics.roc_auc_score(rf_prune_pred_wickets,y_test_wickets)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',rf_wickets_aucrocscore)


# Train Score for Random Forest Classifier
rf_wickets_train = rfw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Random Forest Classifier
print('Train Score for Random Forest Classifier',rf_wickets_train)

# Test Score for Random Forest Classifier
rf_wickets_test = rfw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Random Forest Classifier
print('Test Score for Random Forest Classifier',rf_wickets_test)

Accuracy Score: 0.9186602870813397
Precision Score: 1.0
Recall Score: 0.9186602870813397
F1 Score: 0.9576059850374066
ROC AUC Score after Pruning: 0.8013136288998357
Train Score for Random Forest Classifier 0.9478672985781991
Test Score for Random Forest Classifier 0.9186602870813397


We could see a bit of Overfitting here, and here Precision Score is 1.0 so there are no False Positives

### 10.4 Gradient Boosting Classifier <a id='gbw'>

In [826]:
# Fitting, training and testing the model
gbw = GradientBoostingClassifier()

gbw.fit(X_train_wickets,y_train_wickets)
gb_pred_wickets = gbw.predict(X_test_wickets)

***Model Evaluation***

In [827]:
# Accuracy Score
gb_wickets_accuracy = metrics.accuracy_score(gb_pred_wickets,y_test_wickets)

# Printing Accuracy Score 
print('Accuracy Score:',gb_wickets_accuracy)

# Precision Score 
gb_wickets_precision = metrics.precision_score(gb_pred_wickets,y_test_wickets)

# Printing Precision Score 
print('Precision Score:',gb_wickets_precision)

# Recall Score
gb_wickets_recall = metrics.recall_score(gb_pred_wickets,y_test_wickets)

# Printing Recall Score 
print('Recall Score',gb_wickets_recall)

# F1 Score 
gb_wickets_f1score = metrics.f1_score(gb_pred_wickets,y_test_wickets)

# Printing F1 Score 
print('F1 Score:',gb_wickets_f1score)

# ROC AUC Score 
gb_wickets_aucrocscore = metrics.roc_auc_score(gb_pred_wickets,y_test_wickets)

# Printing ROC AUC Score 
print('ROC AUC Score:',gb_wickets_aucrocscore)

# Train Score for Gradient Boosting Classifier 
gb_wickets_train = gbw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Gradient Boosting Classifier 
print('Train Score for Gradient Boosting Classifier',gb_wickets_train)

# Test Score for Gradient Boosting Classifier 
gb_wickets_test = gbw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Gradient Boosting Classifier
print('Test Score for Gradient Boosting Classifier',gb_wickets_test)

Accuracy Score: 0.9569377990430622
Precision Score: 0.9791666666666666
Recall Score 0.9740932642487047
F1 Score: 0.9766233766233766
ROC AUC Score: 0.8620466321243524
Train Score for Gradient Boosting Classifier 1.0
Test Score for Gradient Boosting Classifier 0.9569377990430622


Here too there is Overfitting which needs to be corrected

### 10.5 Pruned Gradient Boosting Classifier <a id='gbwprune'>

In [833]:
# Loss
loss = ['deviance','exponential']
# Learning rate
learning_rate = [1, 0.5, 0.25, 0.1, 0.05, 0.01]
# No of estimators
n_estimators = [1, 2, 4, 8, 16, 32, 64]
# Max features
max_features = ['auto', 'sqrt', 'log2']
# No of levels of trees
max_depths = np.linspace(1, 32, 32, endpoint=True)
# Minimum number of samples required to split a node
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
# Minimum number of samples required at each leaf node
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
# Critetion 
criterion = ['friedman mse','mse','mae']

# Parameters in the Grid for Randomized Search CV
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
                'learning_rate':learning_rate,
                'criterion':criterion}

In [888]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
gb_random = RandomizedSearchCV(estimator = gbw, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
gb_random.fit(X_wickets,y_wickets)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   30.5s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   33.5s
[Parallel(n_jobs=-1)]: Done 285 out of 300 | elapsed:   51.4s remaining:    2.6s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   58.1s finished


RandomizedSearchCV(cv=3, estimator=GradientBoostingClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'criterion': ['friedman mse', 'mse',
                                                      'mae'],
                                        'learning_rate': [1, 0.5, 0.25, 0.1,
                                                          0.05, 0.01],
                                        'max_depth': [2, 5, 10, 15],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [1, 2, 4, 8, 16, 32,
                                                         64]},
                   random_state=42, verbose=2)

In [889]:
# Best parameters
gb_random.best_params_

{'n_estimators': 16,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 15,
 'learning_rate': 1,
 'criterion': 'mae'}

In [890]:
# Fitting, training and testing the model again after Pruning
gbw_prune = GradientBoostingClassifier(criterion='mae',n_estimators=16,min_samples_split=5,min_samples_leaf=4,max_features='auto',max_depth=15,learning_rate=1)

gbw_prune.fit(X_train_wickets,y_train_wickets)

gbw_prune_pred_wickets = gbw_prune.predict(X_test_wickets)

***Model Evaluation***

In [891]:
# Accuracy Score after Pruning
gb_prune_wickets_accuracy = metrics.accuracy_score(gb_prune_pred_wickets,y_test_wickets)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',gb_prune_wickets_accuracy)

# Precision Score after Pruning
gb_prune_wickets_precision = metrics.precision_score(gb_prune_pred_wickets,y_test_wickets)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',gb_prune_wickets_precision)

# Recall Score after Pruning
gb_prune_wickets_recall = metrics.recall_score(gb_prune_pred_wickets,y_test_wickets)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',gb_prune_wickets_recall)

# F1 Score after Pruning
gb_prune_wickets_f1score = metrics.f1_score(gb_prune_pred_wickets,y_test_wickets)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',gb_prune_wickets_f1score)

# ROC AUC Score after Pruning
gb_prune_wickets_aucrocscore = metrics.roc_auc_score(gb_prune_pred_wickets,y_test_wickets)

# Printing ROC AUC Score after Pruning 
print('ROC AUC Score after Pruning:',gb_prune_wickets_aucrocscore)

# Train Score for Gradient Boosting Classifier after Pruning
gb_prune_wickets_train = gbw_prune.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Gradient Boosting Classifier 
print('Train Score for Gradient Boosting Classifier after Pruning',gb_prune_wickets_train)

# Test Score for Gradient Boosting Classifier after Pruning
gb_prune_wickets_test = gbw_prune.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Gradient Boosting Classifier after Pruning
print('Test Score for Gradient Boosting Classifier after Pruning',gb_wickets_test)

Accuracy Score after Pruning: 0.9521531100478469
Precision Score after Pruning: 0.984375
Recall Score after Pruning: 0.9642857142857143
F1 Score after Pruning: 0.9742268041237113
ROC AUC Score after Pruning: 0.8667582417582417
Train Score for Gradient Boosting Classifier after Pruning 1.0
Test Score for Gradient Boosting Classifier after Pruning 0.9569377990430622


Well even after Pruning we are not able to recitfy Overfitting so lets focus on other models

### 10.6 Support Vector Machine Classifier <a id='svmw'>

In [850]:
# Fitting,training and testing the model
svcw = SVC()

svcw.fit(X_train_wickets,y_train_wickets)
svc_pred_wickets = svcw.predict(X_test_wickets)

***Model Evaluation***

In [851]:
# Accuracy Score
svc_wickets_accuracy = metrics.accuracy_score(svc_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',svc_wickets_accuracy)

# Precision Score
svc_wickets_precision = metrics.precision_score(svc_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',svc_wickets_precision)

# Recall Score
svc_wickets_recall = metrics.recall_score(svc_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',svc_wickets_recall)

# F1 Score
svc_wickets_f1score = metrics.f1_score(svc_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',svc_wickets_f1score)

# ROC AUC Score
svc_wickets_aucrocscore = metrics.roc_auc_score(svc_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',svc_wickets_aucrocscore)

# Train Score for Support Machine Classifier
svc_wickets_train = svcw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Support Machine Classifier
print('Train Score for Support Machine Classifier',svc_wickets_train)

# Test Score for Support Machine Classifier
svc_wickets_test = svcw.score(X_test_runs,y_test_wickets)

# Printing Test Score for Support Machine Classifier
print('Test Score for Gradient Boosting Classifier',svc_runs_test)

Accuracy Score: 0.9234449760765551
Precision Score: 1.0
Recall Score: 0.9230769230769231
F1 Score: 0.9600000000000001
ROC AUC Score: 0.9615384615384616
Train Score for Support Machine Classifier 0.957345971563981
Test Score for Gradient Boosting Classifier 0.9760765550239234


Support Vector Machine draws a hyperplane to reduce overfitting unlike other algorithms, and here too Precision is 1.0 through which we can tell there are no False Positives

### 10.7 KNearest Neighbors Classifier <a id='knnw'>

In [852]:
# Fitting,training and testing the model
knnw = KNeighborsClassifier()

knnw.fit(X_train_wickets,y_train_wickets)
knn_pred_wickets = knnw.predict(X_test_wickets)

***Model Evaluation***

In [853]:
# Accuracy Score
knn_wickets_accuracy = metrics.accuracy_score(knn_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',knn_wickets_accuracy)

# Precision Score
knn_wickets_precision = metrics.precision_score(knn_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',knn_wickets_precision)

# Recall Score
knn_wickets_recall = metrics.recall_score(knn_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',knn_wickets_recall)

# F1 Score
knn_wickets_f1score = metrics.f1_score(knn_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',knn_wickets_f1score)

# ROC AUC Score
knn_wickets_aucrocscore = metrics.roc_auc_score(knn_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',knn_wickets_aucrocscore)

# Train Score for KNearest Neighbors Classifier
knn_wickets_train = knnw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for KNearest Neighbors Classifier
print('Train Score for Support Machine Classifier',knn_wickets_train)

# Test Score for KNearest Neighbors Classifier
knn_wickets_test = knnw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for KNearest Neighbors Classifier
print('Test Score for Gradient Boosting Classifier',knn_wickets_test)

Accuracy Score: 0.9425837320574163
Precision Score: 1.0
Recall Score: 0.9411764705882353
F1 Score: 0.9696969696969697
ROC AUC Score: 0.9705882352941176
Train Score for Support Machine Classifier 0.9597156398104265
Test Score for Gradient Boosting Classifier 0.9425837320574163


As KNN works with distance criteria, this algorithm too prevents Overfitting, and here too there are no False Positives as Precision is 1.0

### 10.8 Bagging Classifier <a id='bagw'>

In [899]:
# Fitting,training and testing the model
bcw = BaggingClassifier()

bcw.fit(X_train_wickets,y_train_wickets)
bc_pred_wickets = bcw.predict(X_test_wickets)

***Model Evaluation***

In [900]:
# Accuracy Score
bc_wickets_accuracy = metrics.accuracy_score(bc_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',bc_wickets_accuracy)

# Precision Score
bc_wickets_precision = metrics.precision_score(bc_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',bc_wickets_precision)

# Recall Score
bc_wickets_recall = metrics.recall_score(bc_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',bc_wickets_recall)

# F1 Score
bc_wickets_f1score = metrics.f1_score(bc_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',bc_wickets_f1score)

# ROC AUC Score
bc_wickets_aucrocscore = metrics.roc_auc_score(bc_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',bc_wickets_aucrocscore)

# Train Score for Bagging Classifier
bc_wickets_train = bcw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Bagging Classifier
print('Train Score for Bagging Classifier',bc_wickets_train)

# Test Score for Bagging Classifier
bc_wickets_test = bcw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Bagging Classifier
print('Test Score for Bagging Classifier',bc_wickets_test)

Accuracy Score: 0.9473684210526315
Precision Score: 0.9791666666666666
Recall Score: 0.9641025641025641
F1 Score: 0.9715762273901809
ROC AUC Score: 0.8391941391941392
Train Score for Bagging Classifier 0.9976303317535545
Test Score for Bagging Classifier 0.9473684210526315


As Bagging Classifier reduces variance it thereby prevents Overfitting

### 10.9 XG Boost Classifier <a id='xgbw'>

In [856]:
# Fitting,training and testing the model
xgbw= XGBClassifier()

xgbw.fit(X_train_wickets,y_train_wickets)
xgb_pred_wickets = xgbw.predict(X_test_wickets)

***Model Evaluation***

In [857]:
# Accuracy Score
xgb_wickets_accuracy = metrics.accuracy_score(xgb_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',xgb_wickets_accuracy)

# Precision Score
xgb_wickets_precision = metrics.precision_score(xgb_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',xgb_wickets_precision)

# Recall Score
xgb_wickets_recall = metrics.recall_score(xgb_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',xgb_wickets_recall)

# F1 Score
xgb_wickets_f1score = metrics.f1_score(xgb_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',xgb_wickets_f1score)

# ROC AUC Score
xgb_wickets_aucrocscore = metrics.roc_auc_score(xgb_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',xgb_wickets_aucrocscore)

# Train Score for XG Boost Classifier
xgb_wickets_train = xgbw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for XG Boost Classifier
print('Train Score for XG Boost Classifier',xgb_wickets_train)

# Test Score for XG Boost Classifier
xgb_wickets_test = xgbw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for XG Boost Classifier
print('Test Score for XG Boost Classifier',xgb_wickets_test)

Accuracy Score: 0.9521531100478469
Precision Score: 0.9739583333333334
Recall Score: 0.9739583333333334
F1 Score: 0.9739583333333334
ROC AUC Score: 0.8399203431372549
Train Score for XG Boost Classifier 1.0
Test Score for XG Boost Classifier 0.9521531100478469


We need to prevent the Overfitting through Pruning

### 10.10 Pruned XG Boost Classifier <a id='xgbwprune'>

In [752]:
# Assigning Grid for Randomized Search CV
params_xgb_GS = {"max_depth": [3,5,6,7,8],
              "min_child_weight" : [5,6,7,8],
            'learning_rate':[0.05,0.1,0.2],
            'n_estimators': [10,30,50,70]}

# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = params_xgb_GS, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [753]:
# Fit the random search model
xgb_random.fit(X_wickets,y_wickets)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   40.9s
[Parallel(n_jobs=-1)]: Done 216 tasks      | elapsed:   43.8s
[Parallel(n_jobs=-1)]: Done 285 out of 300 | elapsed:   44.7s remaining:    2.3s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   44.9s finished


RandomizedSearchCV(cv=3,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           gpu_id=-1, importance_type='gain',
                                           interaction_constraints='',
                                           learning_rate=0.300000012,
                                           max_delta_step=0, max_depth=6,
                                           min_child_weight=1, missing=nan,
                                           monotone_constraints='()',
                                           n_estimators=100, n_jobs=0,
                                           num_parallel_tree=1, random_state=0,
                                           reg_alpha=0, reg_lambda=1,
                                       

In [755]:
# Best parameters
xgb_random.best_params_

{'n_estimators': 10,
 'min_child_weight': 7,
 'max_depth': 3,
 'learning_rate': 0.2}

In [858]:
# Fitting,training and testing the model after Pruning
xgbw_prune = XGBClassifier(n_estimators=10,min_child_weight=7,max_depth=3,learning_rate=0.3)

xgbw_prune.fit(X_train_wickets,y_train_wickets)

xgb_prune_pred_wickets = xgbw_prune.predict(X_test_wickets)

***Model Evaluation***

In [859]:
# Accuracy Score after Pruning
xgb_prune_wickets_accuracy = metrics.accuracy_score(xgb_prune_pred_wickets,y_test_wickets)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',xgb_prune_wickets_accuracy)

# Precision Score after Pruning
xgb_prune_wickets_precision = metrics.precision_score(xgb_prune_pred_wickets,y_test_wickets)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',xgb_prune_wickets_precision)

# Recall Score after Pruning
xgb_prune_wickets_recall = metrics.recall_score(xgb_prune_pred_wickets,y_test_wickets)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',xgb_prune_wickets_recall)

# F1 Score after Pruning
xgb_prune_wickets_f1score = metrics.f1_score(xgb_prune_pred_wickets,y_test_wickets)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',xgb_prune_wickets_f1score)

# ROC AUC Score after Pruning
xgb_prune_wickets_aucrocscore = metrics.roc_auc_score(xgb_prune_pred_wickets,y_test_wickets)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',xgb_prune_wickets_aucrocscore)

# Train Score for XG Boost Classifier after Pruning
xgb_prune_wickets_train = xgbw_prune.score(X_train_wickets,y_train_wickets)

# Printing Train Score for XG Boost Classifier after Pruning
print('Train Score for XG Boost Classifier after Pruning',xgb_prune_wickets_train)

# Test Score for XG Boost Classifier after Pruning
xgb_prune_wickets_test = xgbw_prune.score(X_test_wickets,y_test_wickets)

# Printing Test Score for XG Boost Classifier after Pruning
print('Test Score for XG Boost Classifier after Pruning',xgb_prune_wickets_test)

Accuracy Score after Pruning: 0.9282296650717703
Precision Score after Pruning: 0.9479166666666666
Recall Score after Pruning: 0.9732620320855615
F1 Score after Pruning: 0.9604221635883904
ROC AUC Score after Pruning: 0.7593582887700534
Train Score for XG Boost Classifier after Pruning 0.966824644549763
Test Score for XG Boost Classifier after Pruning 0.9282296650717703


Here we have somewhat corrected Overfitting

### 10.11 Ada Boost Classifier <a id='abw'>

In [860]:
# Fitting,training and testing the model
abw = AdaBoostClassifier()

abw.fit(X_train_wickets,y_train_wickets)

ab_pred_wickets = abw.predict(X_test_wickets)

***Model Evaluation***

In [861]:
# Accuracy Score
ab_wickets_accuracy = metrics.accuracy_score(ab_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',ab_wickets_accuracy)

# Precision Score
ab_wickets_precision = metrics.precision_score(ab_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',ab_wickets_precision)

# Recall Score
ab_wickets_recall = metrics.recall_score(ab_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',ab_wickets_recall)

# F1 Score
ab_wickets_f1score = metrics.f1_score(ab_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',ab_wickets_f1score)

# ROC AUC Score
ab_wickets_aucrocscore = metrics.roc_auc_score(ab_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',ab_wickets_aucrocscore)

# Train Score for Ada Boost Classifier
ab_wickets_train = abw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Ada Boost Classifier
print('Train Score for Ada Boost Classifier',ab_wickets_train)

# Test Score for Ada Boost Classifier
ab_wickets_test = abw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Ada Boost Classifier
print('Test Score for Ada Boost Classifier',ab_wickets_test)

Accuracy Score: 0.9856459330143541
Precision Score: 1.0
Recall Score: 0.9846153846153847
F1 Score: 0.9922480620155039
ROC AUC Score: 0.9923076923076923
Train Score for Ada Boost Classifier 1.0
Test Score for Ada Boost Classifier 0.9856459330143541


### 10.12 Pruned Ada Boost Classifier <a id='abwprune'>

In [776]:
# Assigning the parameters to a Grid for performing Randomized Search CV
params_Adb_GS = {'learning_rate':[0.05,0.1,0.2,1],'n_estimators':[10,20,30,40],'algorithm':['SAMME', 'SAMME.R']}

# Use the random grid to search for best hyperparameters
# First create the base model to tune
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
ab_random = RandomizedSearchCV(estimator = ab,param_distributions=params_Adb_GS,n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [777]:
# Fit the Randomized Search CV
ab_random.fit(X_wickets,y_wickets)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:    2.5s finished


RandomizedSearchCV(cv=3, estimator=AdaBoostClassifier(), n_iter=100, n_jobs=-1,
                   param_distributions={'algorithm': ['SAMME', 'SAMME.R'],
                                        'learning_rate': [0.05, 0.1, 0.2, 1],
                                        'n_estimators': [10, 20, 30, 40]},
                   random_state=42, verbose=2)

In [778]:
# Best parameters
ab_random.best_params_

{'n_estimators': 30, 'learning_rate': 0.2, 'algorithm': 'SAMME'}

In [862]:
# Fitting,training and testing the model after Pruning
abw_prune = AdaBoostClassifier(n_estimators=30,learning_rate=0.2,algorithm='SAMME')

abw_prune.fit(X_train_wickets,y_train_wickets)

ab_prune_pred_wickets = abw_prune.predict(X_test_wickets)

***Model Evaluation***

In [863]:
# Accuracy Score after Pruning
ab_prune_wickets_accuracy = metrics.accuracy_score(ab_prune_pred_wickets,y_test_wickets)

# Printing Accuracy Score after Pruning
print('Accuracy Score after Pruning:',ab_prune_wickets_accuracy)

# Precision Score after Pruning
ab_prune_wickets_precision = metrics.precision_score(ab_prune_pred_wickets,y_test_wickets)

# Printing Precision Score after Pruning
print('Precision Score after Pruning:',ab_prune_wickets_precision)

# Recall Score after Pruning
ab_prune_wickets_recall = metrics.recall_score(ab_prune_pred_wickets,y_test_wickets)

# Printing Recall Score after Pruning
print('Recall Score after Pruning:',ab_prune_wickets_recall)

# F1 Score after Pruning
ab_prune_wickets_f1score = metrics.f1_score(ab_prune_pred_wickets,y_test_wickets)

# Printing F1 Score after Pruning
print('F1 Score after Pruning:',ab_prune_wickets_f1score)

# ROC AUC Score after Pruning
ab_prune_wickets_aucrocscore = metrics.roc_auc_score(ab_prune_pred_wickets,y_test_wickets)

# Printing ROC AUC Score after Pruning
print('ROC AUC Score after Pruning:',ab_prune_wickets_aucrocscore)

# Train Score for Ada Boost Classifier after Pruning
ab_prune_wickets_train = abw_prune.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Ada Boost Classifier after Pruning
print('Train Score for Ada Boost Classifier after Pruning',ab_prune_wickets_train)

# Test Score for Ada Boost Classifier after Pruning
ab_prune_wickets_test = abw_prune.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Ada Boost Classifier after Pruning
print('Test Score for Ada Boost Classifier after Pruning',ab_prune_wickets_test)

Accuracy Score after Pruning: 0.9330143540669856
Precision Score after Pruning: 0.953125
Recall Score after Pruning: 0.973404255319149
F1 Score after Pruning: 0.9631578947368422
ROC AUC Score after Pruning: 0.7724164133738602
Train Score for Ada Boost Classifier after Pruning 0.966824644549763
Test Score for Ada Boost Classifier after Pruning 0.9330143540669856


We have prevented Overfitting

### 10.13 Logistic Regression <a id='lrw'>

In [864]:
# Fitting,training and testing the model
lrw = LogisticRegression()

lrw.fit(X_train_wickets,y_train_wickets)

lr_pred_wickets = lrw.predict(X_test_wickets)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


***Model Evaluation***

In [865]:
# Accuracy Score
lr_wickets_accuracy = metrics.accuracy_score(lr_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',lr_wickets_accuracy)

# Precision Score
lr_wickets_precision = metrics.precision_score(lr_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',lr_wickets_precision)

# Recall Score
lr_wickets_recall = metrics.recall_score(lr_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',lr_wickets_recall)

# F1 Score
lr_wickets_f1score = metrics.f1_score(lr_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',lr_wickets_f1score)

# ROC AUC Score
lr_wickets_aucrocscore = metrics.roc_auc_score(lr_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',lr_wickets_aucrocscore)

# Train Score for Logistic Regression
lr_wickets_train = lrw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Logistic Regression
print('Train Score for Logistic Regression',lr_wickets_train)

# Test Score for Logistic Regression
lr_wickets_test = lrw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Logistic Regression
print('Test Score for Logistic Regression',lr_wickets_test)

Accuracy Score: 0.9904306220095693
Precision Score: 0.9947916666666666
Recall Score: 0.9947916666666666
F1 Score: 0.9947916666666666
ROC AUC Score: 0.9679840686274509
Train Score for Logistic Regression 1.0
Test Score for Logistic Regression 0.9904306220095693


We could see Overfitting, as Logistic Regression is a base model lets try using it as a meta classifier for Stacking to stack other models into it

### 10.14 Stacking for Logistic Regression using Mlxtend Classifier <a id='stwmlx>

In [869]:
# Stacking KNN,SVM,XG Boost,Ada Boost and Gradient Boosting models into Logistic Regression for better results
lrstackw = StackingClassifier(classifiers=[knnw,svcw,xgbw_prune,abw_prune,gbw_prune],meta_classifier=lrw)

lrstackw.fit(X_train_wickets,y_train_wickets)

lrstack_pred_wickets = lrstackw.predict(X_test_wickets)

***Model Evaluation***

In [870]:
# Accuracy Score after Stacking for Logistic Regression
lrstack_wickets_accuracy = metrics.accuracy_score(lrstack_pred_wickets,y_test_wickets)

# Printing Accuracy Score after Stacking for Logistic Regression
print('Accuracy Score after Stacking for Logistic Regression:',lrstack_wickets_accuracy)

# Precision Score after Stacking for Logistic Regression
lrstack_wickets_precision = metrics.precision_score(lrstack_pred_wickets,y_test_wickets)

# Printing Precision Score after Stacking for Logistic Regression
print('Precision Score after Stacking for Logistic Regression:',lrstack_wickets_precision)

# Recall Score after Stacking for Logistic Regression
lrstack_wickets_recall = metrics.recall_score(lrstack_pred_wickets,y_test_wickets)

# Printing Recall Score after Stacking for Logistic Regression
print('Recall Score after Stacking for Logistic Regression:',lrstack_wickets_recall)

# F1 Score after Stacking for Logistic Regression
lrstack_wickets_f1score = metrics.f1_score(lrstack_pred_wickets,y_test_wickets)

# Printing F1 Score after Stacking for Logistic Regression
print('F1 Score after Stacking for Logistic Regression:',lrstack_wickets_f1score)

# ROC AUC Score after Stacking for Logistic Regression
lrstack_wickets_aucrocscore = metrics.roc_auc_score(lrstack_pred_wickets,y_test_wickets)

# Printing ROC AUC Score after Stacking for Logistic Regression
print('ROC AUC Score after Stacking for Logistic Regression:',lrstack_wickets_aucrocscore)

# Train Score for Logistic Regression after Stacking for Logistic Regression
lrstack_wickets_train = lrstackw.score(X_train_wickets,y_train_wickets)

# Printing Train Score for Logistic Regression after Stacking for Logistic Regression
print('Train Score for Logistic Regression after Stacking for Logistic Regression',lrstack_wickets_train)

# Test Score for Logistic Regression after Stacking for Logistic Regression
lrstack_wickets_test = lrstackw.score(X_test_wickets,y_test_wickets)

# Printing Test Score for Logistic Regression after Stacking for Logistic Regression
print('Test Score for Logistic Regression after Stacking for Logistic Regression',lrstack_wickets_test)

Accuracy Score after Stacking for Logistic Regression: 0.9473684210526315
Precision Score after Stacking for Logistic Regression: 0.984375
Recall Score after Stacking for Logistic Regression: 0.9593908629441624
F1 Score after Stacking for Logistic Regression: 0.9717223650385605
ROC AUC Score after Stacking for Logistic Regression: 0.8546954314720813
Train Score for Logistic Regression after Stacking for Logistic Regression 0.9881516587677726
Test Score for Logistic Regression after Stacking for Logistic Regression 0.9473684210526315


Now Logistic Regression give better results

### 10.15 Stacking using Voting Classifier <a id='stwvote'>

In [871]:
# Assigning estimator models for voting classifier
vote_est = [('knn',knnw),('xgb',xgbw_prune),('SVM',svcw)]

In [873]:
# Fitting the model
votew = VotingClassifier(estimators=vote_est)

In [874]:
# Training and testing the model
votew.fit(X_train_wickets,y_train_wickets)

vote_pred_wickets = votew.predict(X_test_wickets)

***Model Evaluation***

In [875]:
# Accuracy Score
vote_wickets_accuracy = metrics.accuracy_score(vote_pred_wickets,y_test_wickets)

# Printing Accuracy Score
print('Accuracy Score:',vote_wickets_accuracy)

# Precision Score
vote_wickets_precision = metrics.precision_score(vote_pred_wickets,y_test_wickets)

# Printing Precision Score
print('Precision Score:',vote_wickets_precision)

# Recall Score
vote_wickets_recall = metrics.recall_score(vote_pred_wickets,y_test_wickets)

# Printing Recall Score
print('Recall Score:',vote_wickets_recall)

# F1 Score
vote_wickets_f1score = metrics.f1_score(vote_pred_wickets,y_test_wickets)

# Printing F1 Score
print('F1 Score:',vote_wickets_f1score)

# ROC AUC Score
vote_wickets_aucrocscore = metrics.roc_auc_score(vote_pred_wickets,y_test_wickets)

# Printing ROC AUC Score
print('ROC AUC Score:',vote_wickets_aucrocscore)

# Train Score
vote_wickets_train = votew.score(X_train_wickets,y_train_wickets)

# Printing Train Score
print('Train Score',vote_wickets_train)

# Test Score
vote_wickets_test = votew.score(X_test_wickets,y_test_wickets)

# Printing Test Score
print('Test Score',vote_wickets_test)

Accuracy Score: 0.937799043062201
Precision Score: 1.0
Recall Score: 0.9365853658536586
F1 Score: 0.9672544080604535
ROC AUC Score: 0.9682926829268292
Train Score 0.9691943127962085
Test Score 0.937799043062201


There is no Overfitting

### 10.16 Comparison Table of all models for Wickets Prediction <a id='ctw'>

In [901]:
# Creating dictionary with all the metrics
wickets_metrics = {'Classifier': ['Decision Tree','Pruned Decision Tree','Random Forest','Gradient Boosting','Pruned Gradient Boosting','Support Vector Machine','KNN','Bagging','XG Boost','Pruned XG Boost','Ada Boost','Pruned Ada Boost','Logistic Regression','Stacking for Logistic Regression using Mlxtend','Stacking using Voting'],
                'Accuracy':[dt_wickets_accuracy,dt_prune_wickets_accuracy,rf_wickets_accuracy,gb_wickets_accuracy,gb_prune_wickets_accuracy,svc_wickets_accuracy,knn_wickets_accuracy,bc_wickets_accuracy,xgb_wickets_accuracy,xgb_prune_wickets_accuracy,ab_wickets_accuracy,ab_prune_wickets_accuracy,lr_wickets_accuracy,lrstack_wickets_accuracy,vote_wickets_accuracy],
                'Precision':[dt_wickets_precision,dt_prune_wickets_precision,rf_wickets_precision,gb_wickets_precision,gb_prune_wickets_precision,svc_wickets_precision,knn_wickets_precision,bc_wickets_precision,xgb_wickets_precision,xgb_prune_wickets_precision,ab_wickets_precision,ab_prune_wickets_precision,lr_wickets_precision,lrstack_wickets_precision,vote_wickets_precision],
                'Recall':[dt_wickets_recall,dt_prune_wickets_recall,rf_wickets_recall,gb_wickets_recall,gb_prune_wickets_recall,svc_wickets_recall,knn_wickets_recall,bc_wickets_recall,xgb_wickets_recall,xgb_prune_wickets_recall,ab_wickets_recall,ab_prune_wickets_recall,lr_wickets_recall,lrstack_wickets_recall,vote_wickets_recall],
                'F1 Score':[dt_wickets_f1score,dt_prune_wickets_f1score,rf_wickets_f1score,gb_wickets_f1score,gb_prune_wickets_f1score,svc_wickets_f1score,knn_wickets_f1score,bc_wickets_f1score,xgb_wickets_f1score,xgb_prune_wickets_f1score,ab_wickets_f1score,ab_prune_wickets_f1score,lr_wickets_f1score,lrstack_wickets_f1score,vote_wickets_f1score],
                'AUCROC Score':[dt_wickets_aucrocscore,dt_prune_wickets_aucrocscore,rf_wickets_aucrocscore,gb_wickets_aucrocscore,gb_prune_wickets_aucrocscore,svc_wickets_aucrocscore,knn_wickets_aucrocscore,bc_wickets_aucrocscore,xgb_wickets_aucrocscore,xgb_prune_wickets_aucrocscore,ab_wickets_aucrocscore,ab_prune_wickets_aucrocscore,lr_wickets_aucrocscore,lrstack_wickets_aucrocscore,vote_wickets_aucrocscore],
                'Train Score':[dt_wickets_train,dt_prune_wickets_train,rf_wickets_train,gb_wickets_train,gb_prune_wickets_train,svc_wickets_train,knn_wickets_train,xgb_wickets_train,bc_wickets_train,xgb_prune_wickets_train,ab_wickets_train,ab_prune_wickets_train,lr_wickets_train,lrstack_wickets_train,vote_wickets_train],
                'Test Score':[dt_wickets_test,dt_prune_wickets_test,rf_wickets_test,gb_wickets_test,gb_prune_wickets_test,svc_wickets_test,knn_wickets_test,xgb_runs_test,bc_wickets_test,xgb_prune_wickets_test,ab_wickets_test,ab_prune_wickets_test,lr_wickets_test,lrstack_wickets_test,vote_wickets_test]}

In [902]:
# Creating dataframe with the dictionary
wickets_metrics = pd.DataFrame(wickets_metrics)

In [903]:
wickets_metrics

Unnamed: 0,Classifier,Accuracy,Precision,Recall,F1 Score,AUCROC Score,Train Score,Test Score
0,Decision Tree,0.923445,0.953125,0.963158,0.958115,0.744737,1.0,0.923445
1,Pruned Decision Tree,0.913876,0.927083,0.978022,0.951872,0.729752,0.947867,0.913876
2,Random Forest,0.91866,1.0,0.91866,0.957606,0.801314,0.947867,0.91866
3,Gradient Boosting,0.956938,0.979167,0.974093,0.976623,0.862047,1.0,0.956938
4,Pruned Gradient Boosting,0.952153,0.984375,0.964286,0.974227,0.866758,1.0,0.942584
5,Support Vector Machine,0.923445,1.0,0.923077,0.96,0.961538,0.957346,0.923445
6,KNN,0.942584,1.0,0.941176,0.969697,0.970588,0.959716,0.942584
7,Bagging,0.947368,0.979167,0.964103,0.971576,0.839194,1.0,0.980861
8,XG Boost,0.952153,0.973958,0.973958,0.973958,0.83992,0.99763,0.947368
9,Pruned XG Boost,0.92823,0.947917,0.973262,0.960422,0.759358,0.966825,0.92823


Comparing all the models KNN is the best model with least train test difference