# T20 WORLD_CUP PREDICTION 2022

# Goals

Use Machine Learning to predict the winner of ICC 2022 T20 Cricket World Cup.

Predict the outcome of individual matches for the entire competition.

Run simulation of the next matches i.e semi finals and finals.

These goals present a unique real-world Machine Learning prediction problem and involve solving various Machine Learning tasks: data wrangling, feature extraction and outcome prediction.

# DATASET

I used data sets from Kaggle - Results of the matches since 2007 and 2022. I might not be that accurate but still I believe this gives a fairly good intuition.  For the rest of data files I used the crickbuzz website.

In [2]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.ticker as plticker
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

In [3]:
#Load Datasets
world_cup = pd.read_excel('C:/Users/natu/Downloads/World Cup 2022 Dataset (2).xlsx')
results = pd.read_excel('C:/Users/natu/Downloads/T20_ Results (3).xlsx')

In [4]:
world_cup

Unnamed: 0,Team,Group,Previous Appearance,Previous Titles,Previous Finals,Previous Semi-Finals,Current Ranking
0,Afghanistan,A,5,0,0,0,10
1,Australia,A,7,1,2,4,6
2,Bangladesh,B,7,0,0,0,9
3,England,A,7,1,2,3,2
4,India,B,7,1,2,3,1
5,ireland,A,7,0,0,0,12
6,Netherlands,B,5,0,0,0,17
7,New Zealand,A,7,0,0,3,5
8,Pakistan,B,7,1,2,5,3
9,South Africa,B,7,0,0,1,4


In [5]:
results.head()

Unnamed: 0,Date,Team1,Team2,Winner,Venue
0,2007-01-09,Australia,England,Australia,\nSydney Cricket Ground
1,2007-02-02,New Zealand,Bangladesh,New Zealand,McLean Park
2,2007-06-28,New Zealand,Bangladesh,Scotland,Bay Oval
3,2007-06-29,New Zealand,Bangladesh,Afghanistan,Bay Oval
4,2007-09-01,Hong Kong,Scotland,Netherlands,Sheikh Zayed Stadium


In [6]:
results.shape

(1847, 5)

In [105]:
df = results[(results['Team1'] == 'India') | (results['Team2'] == 'India')]
india = df.iloc[:]
india.head()

Unnamed: 0,Date,Team1,Team2,Winner,Venue
19,2007-09-16,India,England,India,Green Park
20,2007-09-16,India,England,Sri Lanka,Vidarbha Cricket Association Stadium
21,2007-09-16,India,England,South Africa,M Chinnaswamy Stadium
44,2008-06-20,West Indies,India,Pakistan,Sabina Park
45,2008-08-02,Sri Lanka,India,ICC,R Premadasa Stadium


In [106]:
# combining the teams participating in the worldcup

Worldcup_teams = ['Australia','New Zealand','Afghanistan','England','Sri Lanka','ireland','India','Pakistan','Bangladesh','South Africa','Netherlands','Zimbabwe']
df_teams_1 = results[results['Team1'].isin(Worldcup_teams)] 
df_teams_2 = results[results['Team2'].isin(Worldcup_teams)] 
df_teams = pd.concat((df_teams_1,df_teams_2))
df_teams.drop_duplicates
df_teams.count()


Date      1553
Team1     1553
Team2     1553
Winner    1553
Venue     1553
dtype: int64

In [107]:
df_teams.head()

Unnamed: 0,Date,Team1,Team2,Winner,Venue
0,2007-01-09,Australia,England,Australia,\nSydney Cricket Ground
1,2007-02-02,New Zealand,Bangladesh,New Zealand,McLean Park
2,2007-06-28,New Zealand,Bangladesh,Scotland,Bay Oval
3,2007-06-29,New Zealand,Bangladesh,Afghanistan,Bay Oval
5,2007-09-02,Afghanistan,Ireland,Oman,Sheikh Zayed Stadium


In [108]:
#dropping columns that wll not affect match outcomes
df_teams = df_teams.drop(['Date', 'Venue'], axis=1)
df_teams.head()

Unnamed: 0,Team1,Team2,Winner
0,Australia,England,Australia
1,New Zealand,Bangladesh,New Zealand
2,New Zealand,Bangladesh,Scotland
3,New Zealand,Bangladesh,Afghanistan
5,Afghanistan,Ireland,Oman


In [109]:
#Building the model
#the prediction label: The winning_team column will show "1" Team 1 has won and "2" if the away team has won.

df_teams = df_teams.reset_index(drop=True)
df_teams.loc[df_teams.Winner == df_teams.Team1,'winning_team']=1
df_teams.loc[df_teams.Winner == df_teams.Team2, 'winning_team']=2
df_teams = df_teams.drop(['winning_team'], axis=1)

df_teams.head()

Unnamed: 0,Team1,Team2,Winner
0,Australia,England,Australia
1,New Zealand,Bangladesh,New Zealand
2,New Zealand,Bangladesh,Scotland
3,New Zealand,Bangladesh,Afghanistan
4,Afghanistan,Ireland,Oman


In [110]:
#convert team-1 and team-2 from categorical variables to continous inputs 
# Get dummy variables
final = pd.get_dummies(df_teams, prefix=['Team1', 'Team2'], columns=['Team1', 'Team2'])

# Separate X and y sets
X = final.drop(['Winner'], axis=1)
y = final["Winner"]

In [111]:
# Separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)


In [112]:
final.head()

Unnamed: 0,Winner,Team1_Afghanistan,Team1_Australia,Team1_Bangladesh,Team1_Bermuda,Team1_Canada,Team1_England,Team1_Hong Kong,Team1_India,Team1_Ireland,...,Team2_Singapore,Team2_South Africa,Team2_SouthAfrica,Team2_Sri Lanka,Team2_UAE,Team2_Uganda,Team2_United,Team2_West Indies,Team2_Zimbabwe,Team2_Zimbabwebabwe
0,Australia,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,New Zealand,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Scotland,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Afghanistan,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Oman,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Random Forest Classifier

In [113]:
rf = RandomForestClassifier(n_estimators=100, max_depth=20,
                              random_state=0)
rf.fit(X_train, y_train) 


score = rf.score(X_train, y_train)
score2 = rf.score(X_test, y_test)


print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.489
Test set accuracy:  0.361


# Compare several machine learning models on a performance metric

I used Logistic Regression, Support Vector Machines, Random Forests and K Nearest Neighbours for training the model.

Random Forest outperformed all other algorithms with 49% training accuracy and 36% test accuracy.becoz the data are too dirty.here my accuracy was low but my prediction atleast close to this world cup results.

In [114]:
#adding ICC rankings
#the team which is positioned higher on the ICC Ranking will be considered "favourite" for the match
#and therefore, will be positioned under the "Team_1" column

# Loading new datasets
ranking = pd.read_excel('C:/Users/natu/Downloads/icc_rankings (1).xlsx') 
fixtures = pd.read_excel('C:/Users/natu/Downloads/fixtures 1 (3).xlsx')

# List for storing the group stage games
pred_set = []

In [115]:
fixtures

Unnamed: 0,Group stage,Date,Location,Team1,Team2,Result
0,1,2022-10-22,"Sydney Cricket Ground, Sydney",Australia,New zealand,
1,1,2022-10-22,"Perth Stadium, Perth",England,Afghanistan,
2,1,2022-10-23,"Bellerive Oval, Hobart",Sri Lanka,Ireland,
3,1,2022-10-23,"Melbourne Cricket Ground, Melbourne",India,Pakistan,
4,1,2022-10-24,"Bellerive Oval, Hobart",Bangladesh,Netherland,
5,1,2022-10-24,"Bellerive Oval, Hobart",South Africa,Zimbabwe,
6,1,2022-10-25,"Perth Stadium, Perth",Australia,Sri Lanka,
7,1,2022-10-26,"Melbourne Cricket Ground, Melbourne",England,Ireland,
8,1,2022-10-26,"Melbourne Cricket Ground, Melbourne",Afghanistan,New zealand,
9,1,2022-10-27,"Sydney Cricket Ground, Sydney",South Africa,Bangladesh,


In [116]:
ranking

Unnamed: 0,Rank,Team,Points
0,1,India,14760
1,2,England,11063
2,3,Pakistan,12415
3,4,New Zealand,9544
4,5,South Africa,10865
5,6,Australia,10554
6,7,Sri Lanka,10356
7,8,Bangladesh,10220
8,9,Afghanistan,5919
9,10,Zimbabwe,7792


In [117]:
# Create new columns with ranking position of each team
fixtures.insert(1, 'first_position', fixtures['Team1'].map(ranking.set_index('Team')['Rank']))
fixtures.insert(2, 'second_position', fixtures['Team2'].map(ranking.set_index('Team')['Rank']))

# We only need the group stage games, so we have to slice the dataset
fixtures = fixtures.iloc[:30, :]
fixtures.tail()

Unnamed: 0,Group stage,first_position,second_position,Date,Location,Team1,Team2,Result
25,1,6.0,9.0,2022-11-07,"Adelaide Oval, Adelaide",Australia,Afghanistan,
26,1,2.0,7.0,2022-11-08,"Sydney Cricket Ground, Sydney",England,Sri Lanka,
27,1,5.0,,2022-11-09,"Adelaide Oval, Adelaide",South Africa,Netherlands,
28,1,3.0,8.0,2022-11-10,"Adelaide Oval, Adelaide",Pakistan,Bangladesh,
29,1,1.0,10.0,2022-11-11,"Melbourne Cricket Ground, Melbourne",India,Zimbabwe,


In [118]:
# Loop to add teams to new prediction dataset based on the ranking position of each team
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'Team1': row['Team1'], 'Team2': row['Team2'], 'winning_team': None})
    else:
        pred_set.append({'Team1': row['Team2'], 'Team2': row['Team1'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set
pred_set.head()

Unnamed: 0,Team1,Team2,winning_team
0,New zealand,Australia,
1,England,Afghanistan,
2,Sri Lanka,Ireland,
3,India,Pakistan,
4,Bangladesh,Netherland,


In [119]:
pred_set = pd.get_dummies(pred_set, prefix=['Team1', 'Team2'], columns=['Team1', 'Team2'])

missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]


pred_set = pred_set.drop(['Winner'], axis=1)
pred_set.head()

Unnamed: 0,Team1_Afghanistan,Team1_Australia,Team1_Bangladesh,Team1_Bermuda,Team1_Canada,Team1_England,Team1_Hong Kong,Team1_India,Team1_Ireland,Team1_Kenya,...,Team2_Singapore,Team2_South Africa,Team2_SouthAfrica,Team2_Sri Lanka,Team2_UAE,Team2_Uganda,Team2_United,Team2_West Indies,Team2_Zimbabwe,Team2_Zimbabwebabwe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [120]:
#group matches 
predictions = rf.predict(pred_set)
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
    if predictions[i] == 1:
        print("Winner: " + backup_pred_set.iloc[i, 1])
    
    else:
        print("Winner: " + backup_pred_set.iloc[i, 0])
    print("")

Australia and New zealand
Winner: New zealand

Afghanistan and England
Winner: England

Ireland and Sri Lanka
Winner: Sri Lanka

Pakistan and India
Winner: India

Netherland and Bangladesh
Winner: Bangladesh

Zimbabwe and South Africa
Winner: South Africa

Sri Lanka and Australia
Winner: Australia

Ireland and England
Winner: England

Afghanistan and New zealand
Winner: New zealand

Bangladesh and South Africa
Winner: South Africa

Netherland and India
Winner: India

Zimbabwe and Pakistan
Winner: Pakistan

Ireland and Afghanistan
Winner: Afghanistan

Australia and England
Winner: England

Sri Lanka and New Zealand
Winner: New Zealand

Zimbabwe and Bangladesh
Winner: Bangladesh

Netherland and Pakistan
Winner: Pakistan

India and  South Africa
Winner:  South Africa

Ireland and Australia
Winner: Australia

Afghanistan and Sri Lanka
Winner: Sri Lanka

England and New zealand
Winner: New zealand

Netherland and Zimbabwe
Winner: Zimbabwe

Bangladesh and India
Winner: India

South Africa an

In [127]:
semi = [('New Zealand', 'Australia','England'),
            ('India', 'South Africa','Pakistan')]

In [128]:
def clean_and_predict(matches, ranking, final, logreg):

    # Initialization of auxiliary list for data cleaning
    positions = []

    # Loop to retrieve each team's position according to ICC ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Rank'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Rank'].iloc[0])
    
    # Creating the DataFrame for prediction
    pred_set = []

    # Initializing iterators for while loop
    i = 0
    j = 0

    # 'i' will be the iterator for the 'positions' list, and 'j' for the list of matches (list of tuples)
    while i < len(positions):
        dict1 = {}

        # If position of first team is better then this team will be the 'Team_1' team, and vice-versa
        if positions[i] < positions[i + 1]:
            dict1.update({'Team1': matches[j][0], 'Team2': matches[j][1]})
        else:
            dict1.update({'Team1': matches[j][1], 'Team2': matches[j][0]})

        # Append updated dictionary to the list, that will later be converted into a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1
        
        # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['Team1', 'Team2'], columns=['Team1', 'Team2'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    pred_set = pred_set.drop(['Winner'], axis=1)

    # Predict!
    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 1:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        else:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        print("")

In [132]:
clean_and_predict(semi, ranking, final, rf)

Australia and New Zealand
Winner: New Zealand

South Africa and India
Winner: India



In [130]:
# Finals
finals = [('India', 'New Zealand')]

In [131]:
clean_and_predict(finals, ranking, final, rf)

New Zealand and India
Winner: India



# So, Here I predict India and New Zealand have a lot of Chances to won theT20 World_cup 2022 in Australia