# Who will win the ICC Cricket World Cup 2019? 

So far the ICC Cricket World Cup 2019 has lived up to its billing, giving us some exciting matches to watch as well as a few disappointing washouts. Now that the group stage is close to complete and the skies have cleared, this is a naive attempt at predicting who will win the 2019 World Cup using historical match and player data going back to 1999. Admittedly, the model demonstrated here suffers from high dimensionality, and not much effort was spent on feature engineering. That said, it trains on historical aggregate key performance indicators (KPIs) for the teams participating in the World Cup and predicts the winner of each match.

## 1. Data Acquisition

The code block below loads the utility functions that crawl the stats.espncricinfo.com portal and gather all the relevant data.

In [190]:
import requests
url="https://raw.githubusercontent.com/Pradeep39/cricket_analytics/master/utilities/cricket_data_wrangling.py"
sc.addPyFile(url)
exec(requests.get(url).text)

## 1.1 Get Historical Match Data
The code block below fetches One Day International (ODI) match data from 1999 to the present: match URLs, the two countries playing, the ground, and so on.

In [212]:
pool=ThreadPool()
all_matches_df_list=list()
def getMatchResultCallBack(resultDF):
    if not resultDF.empty:
        all_matches_df_list.append(resultDF)
for i in range(1999, 2020):
    pool.apply_async(getMatchResults, args=(str(i),), callback=getMatchResultCallBack)
pool.close()
pool.join()
# Concatenate the per-year results in one step (DataFrame.append is no longer available in recent pandas)
all_matches_df=pd.concat(all_matches_df_list)
    
match_schema = StructType([StructField('Team1', StringType()), StructField('Team2',StringType()),StructField('Winner',StringType()),StructField('Margin',StringType()),StructField('Ground',StringType()),StructField('match_date',StringType()),StructField('match_no',StringType()),StructField('match_url',StringType())])

match_results_spark_df=sqlContext.createDataFrame(all_matches_df,schema=match_schema)
match_results_spark_df.limit(5).toPandas()

Unnamed: 0,Team1,Team2,Winner,Margin,Ground,match_date,match_no,match_url
0,Asia XI,ICC World XI,ICC World XI,112 runs,Melbourne,"Jan 10, 2005",2203,/ci/engine/match/66387.html
1,Australia,West Indies,Australia,116 runs,Melbourne,"Jan 14, 2005",2204,/ci/engine/match/65657.html
2,Australia,Pakistan,Australia,4 wickets,Hobart,"Jan 16, 2005",2205,/ci/engine/match/65658.html
3,Pakistan,West Indies,Pakistan,6 wickets,Brisbane,"Jan 19, 2005",2206,/ci/engine/match/65659.html
4,Bangladesh,Zimbabwe,Zimbabwe,22 runs,Dhaka,"Jan 20, 2005",2207,/ci/engine/match/64918.html
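
The fetch pattern used above (submit one task per year to a thread pool and collect non-empty results in a callback) can be tried in isolation. The sketch below is only an illustration: `fetch_year_stub` is a hypothetical stand-in for `getMatchResults` and returns made-up rows instead of scraping anything.

```python
from multiprocessing.pool import ThreadPool


def fetch_year_stub(year):
    # Hypothetical stand-in for getMatchResults: one fake row per year.
    return [{"year": year, "match_no": year + "-1"}]


def collect_by_year(first_year, last_year, fetch=fetch_year_stub):
    collected = []

    def on_result(rows):
        # Mirrors getMatchResultCallBack: keep only non-empty results.
        if rows:
            collected.append(rows)

    pool = ThreadPool(4)
    for year in range(first_year, last_year + 1):
        pool.apply_async(fetch, args=(str(year),), callback=on_result)
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait for all tasks and their callbacks to finish

    # Tasks can complete in any order, so sort before using the rows.
    flat = [row for rows in collected for row in rows]
    return sorted(flat, key=lambda row: row["year"])


rows = collect_by_year(1999, 2001)
print([row["year"] for row in rows])
```

One thing to keep in mind with this pattern: an exception raised inside the fetch function is not re-raised in the main thread, so a failed year simply goes missing unless an `error_callback` is also passed to `apply_async`.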


## 1.2 Acquire Players Data for each corresponding match data record.
For each match record, the code block below acquires the profiles of the players who took part in that match.

In [192]:
# Rebuild the combined match frame and renumber its rows
all_matches_df=pd.concat(all_matches_df_list).reset_index(drop=True)

pool=ThreadPool()
all_players_df_list=list()
for index, row in all_matches_df.iterrows():
    def getMatchDataCallBack(resultDF):
        if not resultDF.empty:
            all_players_df_list.append(resultDF)
            #print(resultDF)
            #print("###################################")
    pool.apply_async(getMatchData, args=(row['match_url'],), callback=getMatchDataCallBack)
pool.close()
pool.join()


# Combine the per-match player frames and renumber the rows
all_players_df=pd.concat(all_players_df_list).reset_index(drop=True)

player_schema = StructType([StructField('player_page_href', StringType()), StructField('team',StringType()),StructField('match_no',StringType()),StructField('match_url',StringType()),StructField('player_profile',StringType())])
players_spark_df=sqlContext.createDataFrame(all_players_df,schema=player_schema)
players_spark_df.limit(5).toPandas()

Unnamed: 0,player_page_href,team,match_no,match_url,player_profile
0,/ci/content/player/51880.html,ICC World XI,2203,/ci/engine/match/66387.html,51880
1,/ci/content/player/5390.html,ICC World XI,2203,/ci/engine/match/66387.html,5390
2,/ci/content/player/7133.html,ICC World XI,2203,/ci/engine/match/66387.html,7133
3,/ci/content/player/52337.html,ICC World XI,2203,/ci/engine/match/66387.html,52337
4,/ci/content/player/36597.html,ICC World XI,2203,/ci/engine/match/66387.html,36597


## 1.3 Join Players data with corresponding match records.
The code below joins the player profiles to the matches they played in, using match_no as the join key.

In [193]:
match_players_spark_df=match_results_spark_df.join(players_spark_df,"match_no")
match_players_spark_df.limit(5).toPandas()

Unnamed: 0,match_no,Team1,Team2,Winner,Margin,Ground,match_date,match_url,player_page_href,team,match_url.1,player_profile
0,1436,West Indies,Australia,Australia,46 runs,St George's,"Apr 14, 1999",/ci/engine/match/64619.html,/ci/content/player/5390.html,Australia,/ci/engine/match/64619.html,5390
1,1436,West Indies,Australia,Australia,46 runs,St George's,"Apr 14, 1999",/ci/engine/match/64619.html,/ci/content/player/8189.html,Australia,/ci/engine/match/64619.html,8189
2,1436,West Indies,Australia,Australia,46 runs,St George's,"Apr 14, 1999",/ci/engine/match/64619.html,/ci/content/player/6513.html,Australia,/ci/engine/match/64619.html,6513
3,1436,West Indies,Australia,Australia,46 runs,St George's,"Apr 14, 1999",/ci/engine/match/64619.html,/ci/content/player/6285.html,Australia,/ci/engine/match/64619.html,6285
4,1436,West Indies,Australia,Australia,46 runs,St George's,"Apr 14, 1999",/ci/engine/match/64619.html,/ci/content/player/8192.html,Australia,/ci/engine/match/64619.html,8192


## 1.4 Acquire Key performance indicators for all player profiles acquired so far.

In [194]:
player_profile_set = set(all_players_df['player_profile'].tolist())
    
pool=ThreadPool()
players_career_summ_df_list=list()

for player_profile in player_profile_set:
    def getPlayerSummaryCallBack(resultDF):
        if not resultDF.empty:
            players_career_summ_df_list.append(resultDF)
            #print(resultDF)
            #print("###################################")
    pool.apply_async(getPlayerSummary, args=(player_profile,), callback=getPlayerSummaryCallBack)
pool.close()
pool.join()

# Combine the per-player career summaries and renumber the rows
players_career_summ_df=pd.concat(players_career_summ_df_list).reset_index(drop=True)
players_career_summ_df[:5]

Unnamed: 0,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,...,b_Wkts,b_BBI,b_BBM,b_Ave,b_Econ,b_SR,b_4w,b_5w,b_10,player_profile
0,16,16,2,307,57*,21.92,440,69.77,0,3,...,-,-,-,-,-,-,-,-,-,703323
1,30,22,6,161,41,10.06,218,73.85,0,0,...,33,4/23,4/23,30.54,4.67,39.2,3,0,0,24943
2,27,18,4,110,18,7.85,201,54.72,0,0,...,51,5/46,5/46,26.66,5.87,27.2,2,1,0,348059
3,65,45,12,363,36,11,441,82.31,0,0,...,92,6/27,6/27,26.46,4.12,38.4,4,2,0,230558
4,1,1,1,0,0*,-,0,-,0,0,...,1,1/40,1/40,40,10,24,0,0,0,355267


## 2. Data Wrangling/ETL
## 2.1 Cleanse and Curate Data

In [195]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

player_schema = StructType([StructField("Mat", StringType(), True),StructField("Inns", StringType(), True),StructField("NO", StringType(), True),StructField("Runs", StringType(), True),StructField("HS", StringType(), True),StructField("Ave", StringType(), True),StructField("BF", StringType(), True),StructField("SR", StringType(), True),StructField("bat_100", StringType(), True),StructField("bat_50", StringType(), True),StructField("bat_4s", StringType(), True),StructField("bat_6s", StringType(), True),StructField("Ct", StringType(), True),StructField("St", StringType(), True),StructField("b_Mat", StringType(), True),StructField("b_Inns", StringType(), True),StructField("b_Overs", StringType(), True),StructField("b_Mdns", StringType(), True),StructField("b_Runs", StringType(), True),StructField("b_Wkts", StringType(), True),StructField("b_BBI", StringType(), True),StructField("b_BBM", StringType(), True),StructField("b_Ave", StringType(), True),StructField("b_Econ", StringType(), True),StructField("b_SR", StringType(), True),StructField("b_4w", StringType(), True),StructField("b_5w", StringType(), True),StructField("player_profile", StringType(), True)])

players_spark_df=sqlContext.createDataFrame(players_career_summ_df,schema=player_schema)

clean_players_spark_df = players_spark_df.select(*(regexp_replace(col(c),'-','0').alias(c) for c in players_spark_df.columns))

all_player_match_spark_df=match_players_spark_df.join(clean_players_spark_df,"player_profile")

all_player_match_spark_df.createOrReplaceTempView("all_player_match_data")

all_player_match_spark_df.limit(5).toPandas()

Unnamed: 0,player_profile,match_no,Team1,Team2,Winner,Margin,Ground,match_date,match_url,player_page_href,...,b_Mdns,b_Runs,b_Wkts,b_BBI,b_BBM,b_Ave,b_Econ,b_SR,b_4w,b_5w
0,15555,2258,England,Australia,tied,,Lord's,"Jul 2, 2005",/ci/engine/match/212452.html,/ci/content/player/15555.html,...,275,7,2/43,2/43,39.28,4.74,49.7,0.0,0,0
1,15555,2256,England,Australia,no result,,Birmingham,"Jun 28, 2005",/ci/engine/match/212020.html,/ci/content/player/15555.html,...,275,7,2/43,2/43,39.28,4.74,49.7,0.0,0,0
2,15555,2259,England,Australia,England,9 wickets,Leeds,"Jul 7, 2005",/ci/engine/match/212795.html,/ci/content/player/15555.html,...,275,7,2/43,2/43,39.28,4.74,49.7,0.0,0,0
3,15555,2260,England,Australia,Australia,7 wickets,Lord's,"Jul 10, 2005",/ci/engine/match/213080.html,/ci/content/player/15555.html,...,275,7,2/43,2/43,39.28,4.74,49.7,0.0,0,0
4,15555,2197,Zimbabwe,England,England,74 runs,Bulawayo,"Dec 5, 2004",/ci/engine/match/64913.html,/ci/content/player/15555.html,...,275,7,2/43,2/43,39.28,4.74,49.7,0.0,0,0
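
The cleansing step above replaces every `-` with `0`. A stricter per-value cleanser might look like the sketch below. `clean_stat` is a hypothetical helper, not part of the notebook. Besides mapping the "no data" marker to zero, it strips the `*` that marks a not-out score (for example `57*` in the HS column), which may otherwise fail to parse as a number when the column is averaged.

```python
def clean_stat(raw):
    """Convert a scraped stat string to a float, treating '-' (no data) as 0."""
    text = str(raw).strip()
    if text in ("", "-"):
        return 0.0
    # A trailing '*' marks a not-out score such as '57*'.
    return float(text.rstrip("*"))


print(clean_stat("57*"), clean_stat("-"), clean_stat("21.92"))
```

Columns such as b_BBI and b_BBM hold values like `4/23` and would raise a `ValueError` here; the aggregation query in this notebook does not use them, so the sketch leaves them out of scope.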


## 2.2 Summarize all player data as aggregated key performance indicators for the entire team

In [196]:
all_df=sqlContext.sql("""select match_no,team,Team1,Team2,Winner,Margin,avg(Mat) as Mat_avg, avg(Inns) as Inns_avg,avg(NO) NO_avg,avg(Runs) Runs_ave,max(CAST(Runs as Double)) Runs_max,avg(HS) HS_avg,max(CAST(HS as Double)) HS_max, avg(Ave) Ave_avg, max(CAST(Ave as Double)) Ave_max, avg(BF) BF_avg,avg(SR) SR_avg,max(CAST(SR as Double)) SR_max,avg(bat_100) bat_100_avg,max(bat_100) bat_100_max,avg(bat_50) bat_50_ave,max(bat_50) bat_50_max,avg(bat_4s) bat_4s_avg,max(bat_4s) bat_4s_max, avg(bat_6s) bat_6s_avg,max(bat_6s) bat_6s_max,avg(Ct)  Ct_ave,max(CAST(Ct as Double))  Ct_max,avg(St)  St_Avg,max(CAST(St as Double))  St_max,avg(b_Mat) bowl_Mat_Avg,avg(b_Inns) as  bowl_Inns_avg,avg(b_Mdns) bowl_maidens_avg,max(CAST(b_Mdns as Double)) bowl_maidens_max,avg(b_Runs) bowl_Runs_avg,max(CAST(b_Runs as Double)) bowl_Runs_max,avg(b_Wkts) bowl_Wkts_avg,max(CAST(b_Wkts as Double)) bowl_Wkts_max,avg(b_Ave) b_Ave_Avg,avg(b_Econ) bowl_Econ_Avg,min(CAST(b_Econ as Double)) bowl_Econ_min,avg(b_SR) bowl_SR_Avg,min(cast(b_SR as Double)) bowl_SR_min,avg(b_4w) as bowl_4w_avg,max(CAST(b_4w as Double)) as bowl_4w_max,avg(b_5w) bowl_5w_avg, max(CAST(b_5w as Double)) bowl_5w_max, IF(team==winner, 1, 0) label
 from all_player_match_data group by match_no,team,Team1,Team2,Winner,Margin order by match_no,team""")
 
all_df=all_df.cache()
all_df.createOrReplaceTempView("all_aggregated_data")
all_df.limit(5).toPandas()

Unnamed: 0,match_no,team,Team1,Team2,Winner,Margin,Mat_avg,Inns_avg,NO_avg,Runs_ave,...,b_Ave_Avg,bowl_Econ_Avg,bowl_Econ_min,bowl_SR_Avg,bowl_SR_min,bowl_4w_avg,bowl_4w_max,bowl_5w_avg,bowl_5w_max,label
0,1378,India,New Zealand,India,New Zealand,5 wickets,217.0,182.090909,30.272727,5525.090909,...,4.451818,44.081818,0.0,1.454545,0.0,1.0,3.0,0.0,0.0,0
1,1378,New Zealand,New Zealand,India,New Zealand,5 wickets,163.363636,135.0,25.181818,2978.363636,...,3.914545,39.309091,0.0,1.454545,0.0,0.363636,2.0,0.0,0.0,1
2,1379,Australia,Australia,England,England,7 runs,178.636364,138.454545,25.818182,4201.727273,...,4.172727,43.090909,0.0,2.363636,0.0,0.909091,7.0,0.0,0.0,0
3,1379,England,Australia,England,England,7 runs,76.545455,62.636364,11.636364,1536.909091,...,4.120909,29.518182,0.0,1.363636,0.0,0.454545,2.0,0.0,0.0,1
4,1380,England,England,Sri Lanka,England,4 wickets,76.909091,62.545455,12.0,1526.090909,...,4.13,31.927273,0.0,1.363636,0.0,0.454545,2.0,0.0,0.0,1
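
To make the aggregation concrete, here is a small plain-Python sketch of the same idea for a single KPI: group the player rows by match and team, take the average and maximum of career runs, and set the label to 1 when that team won. The player rows are made up for illustration and `aggregate_team_kpis` is a hypothetical name; the notebook itself does this in Spark SQL across all KPIs.

```python
from collections import defaultdict
from statistics import mean


def aggregate_team_kpis(player_rows):
    # Group player rows by (match_no, team, Winner), as the SQL GROUP BY does.
    groups = defaultdict(list)
    for row in player_rows:
        groups[(row["match_no"], row["team"], row["Winner"])].append(row)

    aggregated = []
    for (match_no, team, winner), rows in sorted(groups.items()):
        runs = [float(r["Runs"]) for r in rows]
        aggregated.append({
            "match_no": match_no,
            "team": team,
            "Runs_ave": mean(runs),
            "Runs_max": max(runs),
            "label": 1 if team == winner else 0,  # 1 when this team won
        })
    return aggregated


# Made-up career run totals for two players per side.
sample_players = [
    {"match_no": "1378", "team": "India", "Winner": "New Zealand", "Runs": "100"},
    {"match_no": "1378", "team": "India", "Winner": "New Zealand", "Runs": "300"},
    {"match_no": "1378", "team": "New Zealand", "Winner": "New Zealand", "Runs": "50"},
    {"match_no": "1378", "team": "New Zealand", "Winner": "New Zealand", "Runs": "150"},
]
team_rows = aggregate_team_kpis(sample_players)
for row in team_rows:
    print(row)
```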


## 2.3 Acquire the playing 11 aggregated KPIs for the future matches of each team. 
* Note: this makes the naive assumption that the 11 who played a team's most recent match will also play its future matches.

In [197]:
# The ten teams taking part in the 2019 World Cup
world_cup_teams=['India','Pakistan','Afghanistan','Australia','New Zealand',
                 'Sri Lanka','West Indies','England','Bangladesh','South Africa']

def latest_team_metrics(team_name):
    # Rank the team's matches from most recent to oldest and keep only the most recent one (rnk = 1)
    return sqlContext.sql("""select * from (
                        select *, rank() over (order by int(match_no) desc) as rnk
                        from all_aggregated_data where (team=Team1 or team=Team2) and team='{0}'
                        ) ranked where rnk = 1""".format(team_name))

# Stack the latest-match KPIs of all ten teams into one DataFrame
all_teams_metrics_df=None
for team_name in world_cup_teams:
    team_df=latest_team_metrics(team_name)
    all_teams_metrics_df=team_df if all_teams_metrics_df is None else all_teams_metrics_df.union(team_df)
all_teams_metrics_df=all_teams_metrics_df.drop('match_no').drop('Winner')\
                        .drop('Margin').drop('label').drop('Team1').drop('Team2')
all_teams_metrics_df.limit(5).toPandas()


Unnamed: 0,team,Mat_avg,Inns_avg,NO_avg,Runs_ave,Runs_max,HS_avg,HS_max,Ave_avg,Ave_max,...,b_Ave_Avg,bowl_Econ_Avg,bowl_Econ_min,bowl_SR_Avg,bowl_SR_min,bowl_4w_avg,bowl_4w_max,bowl_5w_avg,bowl_5w_max,rnk
0,India,104.727273,81.727273,19.363636,3041.909091,11225.0,104.285714,264.0,29.523636,59.7,...,4.292727,42.918182,0.0,1.636364,0.0,0.454545,2.0,0.0,0.0,1
1,Pakistan,70.0,58.272727,10.363636,1745.090909,6587.0,110.0,151.0,35.61,53.06,...,4.378182,39.809091,0.0,1.181818,0.0,0.272727,1.0,0.0,0.0,1
2,Afghanistan,63.0,54.363636,7.545455,1232.909091,2697.0,69.0,116.0,22.398182,34.8,...,4.649091,27.109091,0.0,1.272727,0.0,0.727273,4.0,0.0,0.0,1
3,Australia,66.818182,55.909091,7.818182,1801.0,4859.0,98.0,179.0,29.816364,45.41,...,4.587273,31.318182,0.0,1.727273,0.0,0.818182,7.0,0.0,0.0,1
4,New Zealand,89.909091,75.818182,14.0,2480.0,8259.0,79.0,148.0,28.962727,48.06,...,4.343636,26.545455,0.0,1.181818,0.0,0.727273,5.0,0.0,0.0,1
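
The "latest playing 11" selection boils down to keeping, for each team, the row with the highest match number. A plain-Python sketch of that step is below; the rows are made up and `latest_row_per_team` is a hypothetical name. It also shows why the queries cast match_no to an integer: compared as strings, `'999'` sorts after `'1000'`.

```python
def latest_row_per_team(team_rows):
    # Keep, for each team, the row with the numerically largest match_no.
    latest = {}
    for row in team_rows:
        team = row["team"]
        if team not in latest or int(row["match_no"]) > int(latest[team]["match_no"]):
            latest[team] = row
    return latest


# Made-up rows; match_no is a string, as in the scraped data.
sample_rows = [
    {"team": "India", "match_no": "999", "Mat_avg": 90.0},
    {"team": "India", "match_no": "1000", "Mat_avg": 104.7},
    {"team": "Pakistan", "match_no": "998", "Mat_avg": 70.0},
]
latest = latest_row_per_team(sample_rows)
print(latest["India"]["match_no"])

# String comparison would pick the wrong "latest" match.
print(max("999", "1000"))
```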


## 2.4 Calculate the KPI differentials between the two teams in each historical match.

In [231]:
team1_agg_df=sqlContext.sql("select * from all_aggregated_data where team=Team1")
team2_agg_df=sqlContext.sql("select * from all_aggregated_data where team=Team2")
team1_agg_df.createOrReplaceTempView("team1_agg_df")
team2_agg_df.createOrReplaceTempView("team2_agg_df")
match_team_diff_df=sqlContext.sql("""select t1.match_no,t1.team,t1.Team1,t1.Team2,t1.Winner,t1.Margin,
    t1.Mat_avg - t2.Mat_avg diff_Mat_avg,
    t1.Inns_avg - t2.Inns_avg diff_Inns_avg,
    t1.NO_avg - t2.NO_avg as diff_NO_avg,
    t1.Runs_ave - t2.Runs_ave diff_Runs_avg,
    t1.Runs_max - t2.Runs_max diff_Runs_max,
    t1.HS_avg - t2.HS_avg diff_HS_avg,
    t1.HS_max - t2.HS_max diff_HS_max,
    t1.Ave_max - t2.Ave_max diff_Ave_max,
    t1.Ave_avg - t2.Ave_avg diff_Ave_avg,
    t1.BF_avg - t2.BF_avg diff_BF_avg,
    t1.SR_avg - t2.SR_avg diff_SR_avg,
    t1.SR_max - t2.SR_max diff_SR_max,
    t1.bat_100_avg - t2.bat_100_avg diff_100_avg,
    t1.bat_100_max - t2.bat_100_max diff_100_max,
    t1.bat_50_ave - t2.bat_50_ave diff_50_ave,
    t1.bat_50_max - t2.bat_50_max diff_50_max,
    t1.bat_4s_avg - t2.bat_4s_avg diff_4s_avg,
    t1.bat_4s_max - t2.bat_4s_max diff_4s_max,
    t1.bat_6s_avg - t2.bat_6s_avg diff_6s_avg,
    t1.bat_6s_max - t2.bat_6s_max diff_6s_max,
    t1.Ct_ave - t2.Ct_ave diff_Ct_ave,
    t1.Ct_max - t2.Ct_max diff_Ct_max,
    t1.St_Avg - t2.St_Avg diff_St_Avg,
    t1.St_max - t2.St_max diff_St_max,
    t1.bowl_Mat_Avg - t2.bowl_Mat_Avg diff_bowl_Mat_Avg,
    t1.bowl_Inns_avg - t2.bowl_Inns_avg diff_bowl_Inns_avg,
    t1.bowl_maidens_avg - t2.bowl_maidens_avg diff_bowl_maidens_avg,
    t1.bowl_maidens_max - t2.bowl_maidens_max diff_bowl_maidens_max,
    t1.bowl_Runs_avg - t2.bowl_Runs_avg diff_bowl_Runs_avg,
    t1.bowl_Runs_max - t2.bowl_Runs_max diff_bowl_Runs_max,
    t1.bowl_Wkts_avg - t2.bowl_Wkts_avg diff_bowl_Wkts_avg,
    t1.bowl_Wkts_max - t2.bowl_Wkts_max diff_bowl_Wkts_max,
    t1.b_Ave_Avg - t2.b_Ave_Avg diff_b_Ave_Avg,
    t1.bowl_Econ_Avg - t2.bowl_Econ_Avg diff_bowl_Econ_Avg,
    t1.bowl_Econ_min - t2.bowl_Econ_min diff_bowl_Econ_min,
    t1.bowl_SR_Avg - t2.bowl_SR_Avg diff_bowl_SR_Avg,
    t1.bowl_SR_min - t2.bowl_SR_min diff_bowl_SR_min,
    t1.bowl_4w_avg - t2.bowl_4w_avg diff_bowl_4w_avg,
    t1.bowl_4w_max - t2.bowl_4w_max diff_bowl_4w_max,
    t1.bowl_5w_avg - t2.bowl_5w_avg diff_bowl_5w_avg,
    t1.bowl_5w_max - t2.bowl_5w_max diff_bowl_5w_max,
    t1.label
    from team1_agg_df t1,team2_agg_df t2 where t1.match_no=t2.match_no""")
match_team_diff_df.limit(5).toPandas()

Unnamed: 0,match_no,team,Team1,Team2,Winner,Margin,diff_Mat_avg,diff_Inns_avg,diff_NO_avg,diff_Runs_avg,...,diff_b_Ave_Avg,diff_bowl_Econ_Avg,diff_bowl_Econ_min,diff_bowl_SR_Avg,diff_bowl_SR_min,diff_bowl_4w_avg,diff_bowl_4w_max,diff_bowl_5w_avg,diff_bowl_5w_max,label
0,1378,New Zealand,New Zealand,India,New Zealand,5 wickets,-53.636364,-47.090909,-5.090909,-2546.727273,...,-0.537273,-4.772727,0.0,0.0,0.0,-0.636364,-1.0,0.0,0.0,1
1,1379,Australia,Australia,England,England,7 runs,102.090909,75.818182,14.181818,2664.818182,...,0.051818,13.572727,0.0,1.0,0.0,0.454545,5.0,0.0,0.0,0
2,1380,England,England,Sri Lanka,England,4 wickets,-162.363636,-128.181818,-21.727273,-2754.818182,...,-1.500909,0.781818,0.0,-2.272727,0.0,-1.363636,-8.0,0.0,0.0,1
3,1381,New Zealand,New Zealand,India,India,2 wickets,-94.909091,-76.272727,-10.727273,-3194.272727,...,-1.045455,-7.681818,0.0,-0.454545,0.0,-0.818182,-1.0,0.0,0.0,0
4,1382,Australia,Australia,Sri Lanka,Australia,8 wickets,-94.909091,-86.818182,-10.181818,-1202.272727,...,-1.314545,25.536364,0.0,-1.090909,0.0,-0.818182,-3.0,0.0,0.0,1
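
Each differential is simply Team1's aggregated KPI minus Team2's, so a positive value means Team1 is ahead on that indicator. The sketch below reproduces two of the values from match 1378 above (New Zealand as Team1, India as Team2) in plain Python. `kpi_differentials` is a hypothetical helper, and its uniform `diff_` prefix is a simplification: the hand-written aliases in the query do not always follow it (for example, Runs_ave becomes diff_Runs_avg).

```python
def kpi_differentials(team1_kpis, team2_kpis):
    # Team1 minus Team2 for every KPI present on both sides.
    return {
        "diff_" + name: team1_kpis[name] - team2_kpis[name]
        for name in team1_kpis
        if name in team2_kpis
    }


# Aggregated KPIs for match 1378, taken from the table in section 2.2.
new_zealand = {"Mat_avg": 163.363636, "Inns_avg": 135.0}
india = {"Mat_avg": 217.0, "Inns_avg": 182.090909}

diffs = kpi_differentials(new_zealand, india)
print({name: round(value, 6) for name, value in diffs.items()})
```

Rounded to six decimals, these agree with the diff_Mat_avg and diff_Inns_avg values in the first row of the output above.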


## 3. Model Training & Validation

Now that we have differential KPIs for the two teams in each match, we will train a logistic regression model on these features using the Spark MLlib pipeline APIs, holding out 30% of the matches for testing.

In [232]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import PCA

featureAttributeArray=['diff_Mat_avg', 'diff_Inns_avg', 'diff_NO_avg', 'diff_Runs_avg', 'diff_Runs_max', 'diff_HS_avg', 'diff_HS_max', 'diff_Ave_avg', 'diff_Ave_max', 'diff_BF_avg', 'diff_SR_avg', 'diff_SR_max', 'diff_100_avg', 'diff_100_max', 'diff_50_ave', 'diff_50_max', 'diff_4s_avg', 'diff_4s_max', 'diff_6s_avg', 'diff_6s_max', 'diff_Ct_ave', 'diff_Ct_max', 'diff_St_Avg', 'diff_St_max', 'diff_bowl_Mat_Avg', 'diff_bowl_Inns_avg', 'diff_bowl_maidens_avg', 'diff_bowl_maidens_max', 'diff_bowl_Runs_avg', 'diff_bowl_Runs_max', 'diff_bowl_Wkts_avg', 'diff_bowl_Wkts_max', 'diff_b_Ave_Avg', 'diff_bowl_Econ_Avg', 'diff_bowl_Econ_min', 'diff_bowl_SR_Avg', 'diff_bowl_SR_min', 'diff_bowl_4w_avg', 'diff_bowl_4w_max', 'diff_bowl_5w_avg', 'diff_bowl_5w_max']

assembler = VectorAssembler(
    inputCols=featureAttributeArray,
    outputCol="features", handleInvalid="skip")
                         
pca = PCA(k=25, inputCol="features", outputCol="pcaFeatures")
    
#lr = LogisticRegression(labelCol="label", featuresCol="pcaFeatures", maxIter=50, regParam=0.01)
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=50, regParam=0.01)
                                   
(trainingData, testData) = match_team_diff_df.randomSplit([0.7, 0.3], seed=11)

lr_Pipeline=Pipeline().setStages([assembler, lr])

lr_PipelineModel = lr_Pipeline.fit(trainingData)

predictions = lr_PipelineModel.transform(testData)

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
print("AUROC:%s"%evaluator.evaluate(predictions))

evaluator.setMetricName("areaUnderPR")

print("AUPRC:%s"%evaluator.evaluate(predictions))


AUROC:0.7458716664012445
AUPRC:0.7801610368842407
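
One way to read the AUROC figure: it is the probability that a randomly chosen match Team1 won gets a higher score than a randomly chosen match Team1 lost, so 0.5 is coin-flipping and 1.0 is a perfect ranking. The sketch below computes that pairwise definition in plain Python on made-up scores. It is an illustration of the metric only, not how Spark's evaluator computes it internally.

```python
def auroc(scores, labels):
    # Pairwise definition: fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half a win.
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: three of the four (win, loss) pairs are ordered correctly.
example_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example_auroc)
```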


## 4. Model Scoring
## 4.1 Acquire schedules for the rest of the tournament and compute differential KPIs

In [262]:
import datetime
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)

now=datetime.datetime.now().date()
now_dt64=np.datetime64(now)
df_list=pd.read_html("https://www.icccricketschedule.com/icc-world-cup-2019-schedule/")
df=df_list[0]
df['Date']=df['Date'].str.replace('- ','-')
df['Date']=pd.to_datetime(df['Date'], format='%B %d-%Y')
#filtered_df=filtered_df[(~df['Match Center'].str.contains('Reserve'))]
#filtered_df=filtered_df[(df['Date']>now_dt64) ]
#knockoffs_df=filtered_df[(df['No']>=46) & (df['No']<49)]
filtered_df=df[(df['Date']>now_dt64) & (~df['Match Center'].str.contains('Reserve'))]
knockoffs_df=filtered_df[(filtered_df['No']>=46) & (filtered_df['No']<49)]
knockoffs_df=knockoffs_df.reset_index(drop=True)
final_df=df[df['No']==50]
final_df=final_df.reset_index(drop=True)
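# Keep only the remaining group-stage fixtures; build the mask from filtered_df itself
# and copy so the new columns are added to an independent frame
filtered_df=filtered_df[filtered_df['No']<46].copy()
teams=filtered_df['Match Center'].str.split(pat=' vs ',expand=True)
filtered_df['Team1']=teams[0]
filtered_df['Team2']=teams[1]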

to_be_played_match_df=sqlContext.createDataFrame(filtered_df)
to_be_played_team1_df=to_be_played_match_df.join(all_teams_metrics_df,to_be_played_match_df.Team1 == all_teams_metrics_df.team,how='inner')
to_be_played_team2_df=to_be_played_match_df.join(all_teams_metrics_df,to_be_played_match_df.Team2 == all_teams_metrics_df.team,how='inner')
to_be_played_team1_df.createOrReplaceTempView("to_be_played_team1_data")
to_be_played_team2_df.createOrReplaceTempView("to_be_played_team2_data")
to_be_played_team_diff_df=sqlContext.sql("""select t1.Date, t1.Venue, t1.No as match_no,t1.team,t1.Team1,t1.Team2,
    t1.Mat_avg - t2.Mat_avg diff_Mat_avg,
    t1.Inns_avg - t2.Inns_avg diff_Inns_avg,
    t1.NO_avg - t2.NO_avg as diff_NO_avg,
    t1.Runs_ave - t2.Runs_ave diff_Runs_avg,
    t1.Runs_max - t2.Runs_max diff_Runs_max,
    t1.HS_avg - t2.HS_avg diff_HS_avg,
    t1.HS_max - t2.HS_max diff_HS_max,
    t1.Ave_max - t2.Ave_max diff_Ave_max,
    t1.Ave_avg - t2.Ave_avg diff_Ave_avg,
    t1.BF_avg - t2.BF_avg diff_BF_avg,
    t1.SR_avg - t2.SR_avg diff_SR_avg,
    t1.SR_max - t2.SR_max diff_SR_max,
    t1.bat_100_avg - t2.bat_100_avg diff_100_avg,
    t1.bat_100_max - t2.bat_100_max diff_100_max,
    t1.bat_50_ave - t2.bat_50_ave diff_50_ave,
    t1.bat_50_max - t2.bat_50_max diff_50_max,
    t1.bat_4s_avg - t2.bat_4s_avg diff_4s_avg,
    t1.bat_4s_max - t2.bat_4s_max diff_4s_max,
    t1.bat_6s_avg - t2.bat_6s_avg diff_6s_avg,
    t1.bat_6s_max - t2.bat_6s_max diff_6s_max,
    t1.Ct_ave - t2.Ct_ave diff_Ct_ave,
    t1.Ct_max - t2.Ct_max diff_Ct_max,
    t1.St_Avg - t2.St_Avg diff_St_Avg,
    t1.St_max - t2.St_max diff_St_max,
    t1.bowl_Mat_Avg - t2.bowl_Mat_Avg diff_bowl_Mat_Avg,
    t1.bowl_Inns_avg - t2.bowl_Inns_avg diff_bowl_Inns_avg,
    t1.bowl_maidens_avg - t2.bowl_maidens_avg diff_bowl_maidens_avg,
    t1.bowl_maidens_max - t2.bowl_maidens_max diff_bowl_maidens_max,
    t1.bowl_Runs_avg - t2.bowl_Runs_avg diff_bowl_Runs_avg,
    t1.bowl_Runs_max - t2.bowl_Runs_max diff_bowl_Runs_max,
    t1.bowl_Wkts_avg - t2.bowl_Wkts_avg diff_bowl_Wkts_avg,
    t1.bowl_Wkts_max - t2.bowl_Wkts_max diff_bowl_Wkts_max,
    t1.b_Ave_Avg - t2.b_Ave_Avg diff_b_Ave_Avg,
    t1.bowl_Econ_Avg - t2.bowl_Econ_Avg diff_bowl_Econ_Avg,
    t1.bowl_Econ_min - t2.bowl_Econ_min diff_bowl_Econ_min,
    t1.bowl_SR_Avg - t2.bowl_SR_Avg diff_bowl_SR_Avg,
    t1.bowl_SR_min - t2.bowl_SR_min diff_bowl_SR_min,
    t1.bowl_4w_avg - t2.bowl_4w_avg diff_bowl_4w_avg,
    t1.bowl_4w_max - t2.bowl_4w_max diff_bowl_4w_max,
    t1.bowl_5w_avg - t2.bowl_5w_avg diff_bowl_5w_avg,
    t1.bowl_5w_max - t2.bowl_5w_max diff_bowl_5w_max
    from to_be_played_team1_data t1,to_be_played_team2_data t2 where t1.No=t2.No order by t1.No""")

to_be_played_team_diff_df.toPandas()

Unnamed: 0,Date,Venue,match_no,team,Team1,Team2,diff_Mat_avg,diff_Inns_avg,diff_NO_avg,diff_Runs_avg,...,diff_bowl_Wkts_max,diff_b_Ave_Avg,diff_bowl_Econ_Avg,diff_bowl_Econ_min,diff_bowl_SR_Avg,diff_bowl_SR_min,diff_bowl_4w_avg,diff_bowl_4w_max,diff_bowl_5w_avg,diff_bowl_5w_max
0,2019-07-02,Edgbaston- Birmingham,40,Bangladesh,Bangladesh,India,-164.909091,-135.545455,-23.363636,-4828.0,...,0.0,-0.218182,-9.209091,0.0,-1.0,0.0,-0.909091,-2.0,0.0,0.0
1,2019-07-03,The Riverside- Chester-le-Street,41,England,England,New Zealand,-86.818182,-72.363636,-13.545455,-1441.454545,...,0.0,0.206364,-9.790909,0.0,-0.090909,0.0,0.090909,0.0,0.0,0.0
2,2019-07-04,Headingley- Leeds,42,Afghanistan,Afghanistan,West Indies,-80.545455,-65.454545,-16.090909,-1888.545455,...,0.0,-0.055455,5.781818,0.0,-0.727273,0.0,-0.272727,-3.0,0.0,0.0
3,2019-07-05,Lord’s- London,43,Pakistan,Pakistan,Bangladesh,165.727273,134.727273,24.545455,3419.454545,...,0.0,0.281818,5.954545,0.0,3.272727,0.0,2.454545,8.0,0.0,0.0
4,2019-07-06,Headingley- Leeds,44,Sri Lanka,Sri Lanka,India,22.272727,8.636364,3.454545,-1244.181818,...,0.0,1.179091,-12.936364,0.0,2.181818,0.0,0.818182,7.0,0.0,0.0
5,2019-07-06,Old Trafford- Manchester,45,Australia,Australia,South Africa,-23.181818,-34.272727,-9.272727,-744.272727,...,0.0,0.833636,16.445455,0.0,0.818182,0.0,-0.363636,1.0,0.0,0.0


## 4.2 Score Predictions for the remaining Group Stage Matches

In [265]:
predictions = lr_PipelineModel.transform(to_be_played_team_diff_df)
predictions_df=predictions.select("Date","Venue","match_no","Team1","Team2","rawPrediction","probability", "prediction")
win_predictions=predictions_df.toPandas()
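# prediction 0.0 means Team2 is predicted to win; 1.0 means Team1 is predicted to win
win_list1=win_predictions[(win_predictions.prediction==0.0)]['Team2'].value_counts()
win_list2=win_predictions[(win_predictions.prediction==1.0)]['Team1'].value_counts()
# Add the two tallies so a team that wins as both Team1 and Team2 gets a single combined count
win_list=win_list1.add(win_list2, fill_value=0).astype(int)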
predictions_df.toPandas()

Unnamed: 0,Date,Venue,match_no,Team1,Team2,rawPrediction,probability,prediction
0,2019-07-02,Edgbaston- Birmingham,40,Bangladesh,India,"[2.533546317491155, -2.533546317491155]","[0.9264603353309613, 0.07353966466903869]",0.0
1,2019-07-03,The Riverside- Chester-le-Street,41,England,New Zealand,"[0.6398962993169302, -0.6398962993169302]","[0.6547300185490335, 0.34526998145096655]",0.0
2,2019-07-04,Headingley- Leeds,42,Afghanistan,West Indies,"[0.12343074545109278, -0.12343074545109278]","[0.5308185691700561, 0.46918143082994385]",0.0
3,2019-07-05,Lord’s- London,43,Pakistan,Bangladesh,"[-3.023279346659099, 3.023279346659099]","[0.04638520118683787, 0.953614798813162]",1.0
4,2019-07-06,Headingley- Leeds,44,Sri Lanka,India,"[0.2001680895464068, -0.2001680895464068]","[0.5498756019123112, 0.4501243980876887]",0.0
5,2019-07-06,Old Trafford- Manchester,45,Australia,South Africa,"[0.24976479214895309, -0.24976479214895309]","[0.5621186073707369, 0.4378813926292631]",0.0
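
A note on reading the rawPrediction and probability columns above: for a binary logistic regression, the second probability entry is the logistic (sigmoid) function of the second raw entry, and it is the model's estimate that Team1 wins (label 1). The sketch below checks this against match 40 in plain Python, assuming the default decision threshold of 0.5. `win_probability` is an illustrative name, not part of the notebook.

```python
import math


def win_probability(raw_margin_team1):
    # Logistic (sigmoid) function: maps the raw margin for label 1 to a probability.
    return 1.0 / (1.0 + math.exp(-raw_margin_team1))


# Match 40 (Bangladesh vs India): second entry of rawPrediction from the table above.
p_team1 = win_probability(-2.533546317491155)

# Assumed default threshold of 0.5: above it Team1 is predicted to win, otherwise Team2.
predicted_winner = "Bangladesh" if p_team1 > 0.5 else "India"
print(round(p_team1, 4), predicted_winner)
```

The closer the probability is to 0.5, the less confident the call: the Afghanistan vs West Indies prediction (about 0.47 for Team1) is close to a toss-up, while Pakistan vs Bangladesh (about 0.95) is not.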


## 4.3 Acquire current points table 
Using the current World Cup points table (standings), compute each team's projected points from the predictions for the remaining group-stage matches.

In [266]:
df_list = pd.read_html("https://www.cricketworldcup.com/")
points_table_df=df_list[0]
# Split e.g. "India IND" into the team name and its short code
team_parts=points_table_df['Team'].str.rsplit(n=1, expand=True)
points_table_df['Team']=team_parts[0]
points_table_df['Team_Cd']=team_parts[1]
points_table_df['Projected PTS']=points_table_df['PTS']
points_table=points_table_df.set_index('Team')
# Each predicted win is worth 2 points
for i in points_table_df['Team']:
    if i in win_list:
        points_table.loc[i, 'Projected PTS']=points_table.loc[i,'PTS']+2*win_list[i]

points_table=points_table.sort_values(by='Projected PTS', ascending=False).reset_index()
points_table

Unnamed: 0,Team,Pos,PLD,PTS,NRR,Team_Cd,Projected PTS
0,India,2,7,11,0.854,IND,15
1,Australia,1,8,14,1.0,AUS,14
2,New Zealand,3,8,11,0.572,NZ,13
3,Pakistan,5,8,9,-0.792,PAK,11
4,England,4,8,10,1.0,ENG,10
5,Sri Lanka,6,8,8,-0.934,SL,8
6,Bangladesh,7,7,7,-0.133,BAN,7
7,South Africa,8,8,5,-0.08,SA,7
8,West Indies,9,8,3,-0.335,WI,5
9,Afghanistan,10,8,0,-1.418,AFG,0
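
The projection itself is simple arithmetic: current points plus 2 for every predicted win. The sketch below reproduces the Projected PTS column in plain Python from the PTS column above and the winners predicted in section 4.2. `project_points` is a hypothetical helper name.

```python
def project_points(points, predicted_winners, points_per_win=2):
    # Start from the current standings and add the points for each predicted win.
    projected = dict(points)
    for team in predicted_winners:
        projected[team] = projected.get(team, 0) + points_per_win
    return projected


# PTS column from the table above.
current_points = {
    "Australia": 14, "India": 11, "New Zealand": 11, "England": 10, "Pakistan": 9,
    "Sri Lanka": 8, "Bangladesh": 7, "South Africa": 5, "West Indies": 3, "Afghanistan": 0,
}
# Predicted winners of matches 40 to 45, read off the section 4.2 output.
predicted_winners = ["India", "New Zealand", "West Indies", "Pakistan", "India", "South Africa"]

projected = project_points(current_points, predicted_winners)
print(projected["India"], projected["New Zealand"], projected["Pakistan"])
```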


In [267]:
knockoffs_df

Unnamed: 0,No,Date,Match Center,Venue,Timing
0,46,2019-07-09,First semifinal (1st place vs 4th place),Old Trafford- Manchester,Day
1,48,2019-07-11,Second semi-final (2nd place vs 3rd place),Edgbaston- Birmingham,Day
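
The seeding rule in the schedule above (1st vs 4th, 2nd vs 3rd) can be sketched in plain Python as follows, using the projected points from section 4.3. `semifinal_pairings` is a hypothetical helper. It breaks ties by input order only, which is an assumption of this sketch; a real table would need a proper tie-breaker such as the NRR column.

```python
def semifinal_pairings(standings):
    # Sort by points, highest first; Python's sort is stable, so ties keep input order.
    ranked = sorted(standings, key=lambda item: item[1], reverse=True)
    top4 = [team for team, _ in ranked[:4]]
    # First semi-final: 1st vs 4th. Second semi-final: 2nd vs 3rd.
    return [(top4[0], top4[3]), (top4[1], top4[2])]


# Projected points of the top five teams from section 4.3.
projected_standings = [
    ("Australia", 14), ("India", 15), ("New Zealand", 13), ("England", 10), ("Pakistan", 11),
]
pairings = semifinal_pairings(projected_standings)
print(pairings)
```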


## 4.4 Score predictions for Semi-Finals.

In [268]:
knockoffs_df['Team1']=''
knockoffs_df['Team2']=''
knockoffs_df.loc[0,'Team1'] = points_table.loc[0,'Team']
knockoffs_df.loc[0,'Team2'] = points_table.loc[3,'Team']
knockoffs_df.loc[1,'Team1'] = points_table.loc[1,'Team']
knockoffs_df.loc[1,'Team2'] = points_table.loc[2,'Team']
knockoffs_df

knockoff_match_df=sqlContext.createDataFrame(knockoffs_df)
knockoff_team1_df=knockoff_match_df.join(all_teams_metrics_df,knockoff_match_df.Team1 == all_teams_metrics_df.team,how='inner')
knockoff_team2_df=knockoff_match_df.join(all_teams_metrics_df,knockoff_match_df.Team2 == all_teams_metrics_df.team,how='inner')
knockoff_team1_df.createOrReplaceTempView("knockoff_team1_data")
knockoff_team2_df.createOrReplaceTempView("knockoff_team2_data")
knockoff_team_diff_df=sqlContext.sql("""select t1.Date, t1.Venue, t1.No as match_no,t1.team,t1.Team1,t1.Team2,
    t1.Mat_avg - t2.Mat_avg diff_Mat_avg,
    t1.Inns_avg - t2.Inns_avg diff_Inns_avg,
    t1.NO_avg - t2.NO_avg as diff_NO_avg,
    t1.Runs_ave - t2.Runs_ave diff_Runs_avg,
    t1.Runs_max - t2.Runs_max diff_Runs_max,
    t1.HS_avg - t2.HS_avg diff_HS_avg,
    t1.HS_max - t2.HS_max diff_HS_max,
    t1.Ave_max - t2.Ave_max diff_Ave_max,
    t1.Ave_avg - t2.Ave_avg diff_Ave_avg,
    t1.BF_avg - t2.BF_avg diff_BF_avg,
    t1.SR_avg - t2.SR_avg diff_SR_avg,
    t1.SR_max - t2.SR_max diff_SR_max,
    t1.bat_100_avg - t2.bat_100_avg diff_100_avg,
    t1.bat_100_max - t2.bat_100_max diff_100_max,
    t1.bat_50_ave - t2.bat_50_ave diff_50_ave,
    t1.bat_50_max - t2.bat_50_max diff_50_max,
    t1.bat_4s_avg - t2.bat_4s_avg diff_4s_avg,
    t1.bat_4s_max - t2.bat_4s_max diff_4s_max,
    t1.bat_6s_avg - t2.bat_6s_avg diff_6s_avg,
    t1.bat_6s_max - t2.bat_6s_max diff_6s_max,
    t1.Ct_ave - t2.Ct_ave diff_Ct_ave,
    t1.Ct_max - t2.Ct_max diff_Ct_max,
    t1.St_Avg - t2.St_Avg diff_St_Avg,
    t1.St_max - t2.St_max diff_St_max,
    t1.bowl_Mat_Avg - t2.bowl_Mat_Avg diff_bowl_Mat_Avg,
    t1.bowl_Inns_avg - t2.bowl_Inns_avg diff_bowl_Inns_avg,
    t1.bowl_maidens_avg - t2.bowl_maidens_avg diff_bowl_maidens_avg,
    t1.bowl_maidens_max - t2.bowl_maidens_max diff_bowl_maidens_max,
    t1.bowl_Runs_avg - t2.bowl_Runs_avg diff_bowl_Runs_avg,
    t1.bowl_Runs_max - t2.bowl_Runs_max diff_bowl_Runs_max,
    t1.bowl_Wkts_avg - t2.bowl_Wkts_avg diff_bowl_Wkts_avg,
    t1.bowl_Wkts_max - t2.bowl_Wkts_max diff_bowl_Wkts_max,
    t1.b_Ave_Avg - t2.b_Ave_Avg diff_b_Ave_Avg,
    t1.bowl_Econ_Avg - t2.bowl_Econ_Avg diff_bowl_Econ_Avg,
    t1.bowl_Econ_min - t2.bowl_Econ_min diff_bowl_Econ_min,
    t1.bowl_SR_Avg - t2.bowl_SR_Avg diff_bowl_SR_Avg,
    t1.bowl_SR_min - t2.bowl_SR_min diff_bowl_SR_min,
    t1.bowl_4w_avg - t2.bowl_4w_avg diff_bowl_4w_avg,
    t1.bowl_4w_max - t2.bowl_4w_max diff_bowl_4w_max,
    t1.bowl_5w_avg - t2.bowl_5w_avg diff_bowl_5w_avg,
    t1.bowl_5w_max - t2.bowl_5w_max diff_bowl_5w_max
    from knockoff_team1_data t1,knockoff_team2_data t2 where t1.No=t2.No order by t1.No""")

semis_predictions = lr_PipelineModel.transform(knockoff_team_diff_df)
semis_predictions_df=semis_predictions.select("Date","Venue","match_no","Team1","Team2","rawPrediction","probability", "prediction")
semis_win_predictions=semis_predictions_df.toPandas()
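# prediction == 1.0 means Team1 is predicted to win; 0.0 means Team2.
semis_win_list1=semis_win_predictions[(semis_win_predictions.prediction==0.0)]['Team2'].value_counts()
semis_win_list2=semis_win_predictions[(semis_win_predictions.prediction==1.0)]['Team1'].value_counts()
# Series.append was removed in pandas 2.0; pd.concat gives the same result.
semis_win_list=pd.concat([semis_win_list1, semis_win_list2])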
semis_win_predictions

| | Date | Venue | match_no | Team1 | Team2 | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-07-09 | Old Trafford- Manchester | 46 | India | Pakistan | [-0.13916002482050224, 0.13916002482050224] | [0.46526602914041715, 0.5347339708595829] | 1.0 |
| 1 | 2019-07-11 | Edgbaston- Birmingham | 48 | Australia | New Zealand | [-1.1524829690753933, 1.1524829690753933] | [0.240035850885297, 0.759964149114703] | 1.0 |
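
The `rawPrediction` and `probability` columns are linked by the logistic function, assuming the last stage of `lr_PipelineModel` is a binary logistic regression (the name suggests this, but the pipeline definition is not shown in this section). Under that assumption, the second entry of `rawPrediction` is the margin in favour of class 1 (Team1 wins), and passing it through a sigmoid should reproduce the second entry of `probability`. A minimal check using the margins copied from the semi-final output; `sigmoid` and the dictionary keys are illustrative names only:

```python
import math

def sigmoid(margin):
    """Logistic function: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Class-1 margins (Team1 wins) copied from the rawPrediction column above.
raw_margins = {
    "India v Pakistan": 0.13916002482050224,
    "Australia v New Zealand": 1.1524829690753933,
}

# Probability that Team1 wins each semi-final, recomputed from the margin.
team1_win_prob = {match: sigmoid(m) for match, m in raw_margins.items()}

for match, p in team1_win_prob.items():
    print("%s: P(Team1 wins) = %.4f" % (match, p))
```

A margin close to zero, as in the first semi-final, therefore means the model is close to undecided, while the larger margin in the second semi-final reflects a more confident call.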


## 4.5 Score Predictions for the Final

In [269]:
# The two predicted semi-final winners meet in the final.
final_df['Team1']=semis_win_list.keys()[0]
final_df['Team2']=semis_win_list.keys()[1]


final_match_df=sqlContext.createDataFrame(final_df)
final_match_team1_df=final_match_df.join(all_teams_metrics_df,final_match_df.Team1 == all_teams_metrics_df.team,how='inner')
final_match_team2_df=final_match_df.join(all_teams_metrics_df,final_match_df.Team2 == all_teams_metrics_df.team,how='inner')
final_match_team1_df.createOrReplaceTempView("final_team1_data")
final_match_team2_df.createOrReplaceTempView("final_team2_data")
# Same Team1-minus-Team2 difference query as for the semi-finals, run against the final's views.
final_team_diff_df=sqlContext.sql("""
select t1.Date, t1.Venue, t1.No as match_no, t1.team, t1.Team1, t1.Team2,
t1.Mat_avg - t2.Mat_avg diff_Mat_avg, t1.Inns_avg - t2.Inns_avg diff_Inns_avg, t1.NO_avg - t2.NO_avg as diff_NO_avg, t1.Runs_ave - t2.Runs_ave diff_Runs_avg, t1.Runs_max - t2.Runs_max diff_Runs_max, t1.HS_avg - t2.HS_avg diff_HS_avg, t1.HS_max - t2.HS_max diff_HS_max,
t1.Ave_max - t2.Ave_max diff_Ave_max, t1.Ave_avg - t2.Ave_avg diff_Ave_avg, t1.BF_avg - t2.BF_avg diff_BF_avg, t1.SR_avg - t2.SR_avg diff_SR_avg, t1.SR_max - t2.SR_max diff_SR_max,
t1.bat_100_avg - t2.bat_100_avg diff_100_avg, t1.bat_100_max - t2.bat_100_max diff_100_max, t1.bat_50_ave - t2.bat_50_ave diff_50_ave, t1.bat_50_max - t2.bat_50_max diff_50_max,
t1.bat_4s_avg - t2.bat_4s_avg diff_4s_avg, t1.bat_4s_max - t2.bat_4s_max diff_4s_max, t1.bat_6s_avg - t2.bat_6s_avg diff_6s_avg, t1.bat_6s_max - t2.bat_6s_max diff_6s_max,
t1.Ct_ave - t2.Ct_ave diff_Ct_ave, t1.Ct_max - t2.Ct_max diff_Ct_max, t1.St_Avg - t2.St_Avg diff_St_Avg, t1.St_max - t2.St_max diff_St_max,
t1.bowl_Mat_Avg - t2.bowl_Mat_Avg diff_bowl_Mat_Avg, t1.bowl_Inns_avg - t2.bowl_Inns_avg diff_bowl_Inns_avg, t1.bowl_maidens_avg - t2.bowl_maidens_avg diff_bowl_maidens_avg, t1.bowl_maidens_max - t2.bowl_maidens_max diff_bowl_maidens_max,
t1.bowl_Runs_avg - t2.bowl_Runs_avg diff_bowl_Runs_avg, t1.bowl_Runs_max - t2.bowl_Runs_max diff_bowl_Runs_max, t1.bowl_Wkts_avg - t2.bowl_Wkts_avg diff_bowl_Wkts_avg, t1.bowl_Wkts_max - t2.bowl_Wkts_max diff_bowl_Wkts_max,
t1.b_Ave_Avg - t2.b_Ave_Avg diff_b_Ave_Avg, t1.bowl_Econ_Avg - t2.bowl_Econ_Avg diff_bowl_Econ_Avg, t1.bowl_Econ_min - t2.bowl_Econ_min diff_bowl_Econ_min, t1.bowl_SR_Avg - t2.bowl_SR_Avg diff_bowl_SR_Avg, t1.bowl_SR_min - t2.bowl_SR_min diff_bowl_SR_min,
t1.bowl_4w_avg - t2.bowl_4w_avg diff_bowl_4w_avg, t1.bowl_4w_max - t2.bowl_4w_max diff_bowl_4w_max, t1.bowl_5w_avg - t2.bowl_5w_avg diff_bowl_5w_avg, t1.bowl_5w_max - t2.bowl_5w_max diff_bowl_5w_max
from final_team1_data t1, final_team2_data t2 where t1.No=t2.No order by t1.No
""")
predictions = lr_PipelineModel.transform(final_team_diff_df)
predictions_df=predictions.select("Date","Venue","match_no","Team1","Team2","rawPrediction","probability", "prediction")
win_predictions=predictions_df.toPandas()
# prediction == 1.0 means Team1 is predicted to win; 0.0 means Team2.
win_list1=win_predictions[(win_predictions.prediction==0.0)]['Team2'].value_counts()
win_list2=win_predictions[(win_predictions.prediction==1.0)]['Team1'].value_counts()
# Series.append was removed in pandas 2.0; pd.concat gives the same result.
win_list=pd.concat([win_list1, win_list2])
win_predictions


| | Date | Venue | match_no | Team1 | Team2 | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-07-14 | Lord’s- London | 50 | India | Australia | [0.40132330367119945, -0.40132330367119945] | [0.5990055564776708, 0.4009944435223292] | 0.0 |
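
The difference query for the final repeats the semi-final query with only the view names changed, so it could be generated rather than pasted twice. The sketch below builds the same shape of query from a list of (column, alias) pairs. `build_diff_query` and `metric_aliases` are hypothetical names that do not exist in the original notebook, and only the first four metrics are listed here for brevity; the full list would mirror the 41 difference columns in the query above.

```python
def build_diff_query(view1, view2, metric_aliases):
    """Build the Team1-minus-Team2 difference query for two temp views.

    metric_aliases is a list of (column, alias) pairs. The alias is kept
    explicit because it does not always follow from the column name
    (for example Runs_ave is aliased to diff_Runs_avg).
    """
    id_columns = ["t1.Date", "t1.Venue", "t1.No AS match_no",
                  "t1.team", "t1.Team1", "t1.Team2"]
    diff_columns = ["t1.%s - t2.%s AS %s" % (col, col, alias)
                    for col, alias in metric_aliases]
    return ("SELECT %s FROM %s t1, %s t2 WHERE t1.No = t2.No ORDER BY t1.No"
            % (", ".join(id_columns + diff_columns), view1, view2))

# Illustrative subset of the metrics; extend to the full list before use.
metric_aliases = [
    ("Mat_avg", "diff_Mat_avg"),
    ("Inns_avg", "diff_Inns_avg"),
    ("NO_avg", "diff_NO_avg"),
    ("Runs_ave", "diff_Runs_avg"),
]

final_query = build_diff_query("final_team1_data", "final_team2_data", metric_aliases)
print(final_query)
```

The resulting string could then be passed to `sqlContext.sql(...)` in place of the literal query, with `knockoff_team1_data` and `knockoff_team2_data` as the view names for the semi-finals.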


## 5. Conclusion (Winner of the World Cup)

In [270]:
print("Winner of Cricket World Cup 2019 is %s"%(win_list.keys()[0]))

Winner of Cricket World Cup 2019 is Australia
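
`win_list.keys()[0]` picks the right team here because exactly one match is scored. A more explicit mapping from the `prediction` label to a team name is sketched below. `predicted_winner` is a hypothetical helper, not part of the original notebook, and it assumes that 1.0 means Team1 wins and 0.0 means Team2 wins, matching the filters used in the cells above. The rows are copied from the prediction tables in this section, reduced to the columns needed.

```python
def predicted_winner(row):
    """Return Team1 when the predicted label is 1.0, otherwise Team2."""
    return row["Team1"] if row["prediction"] == 1.0 else row["Team2"]

# Rows taken from the semi-final and final prediction tables above.
semi_finals = [
    {"Team1": "India", "Team2": "Pakistan", "prediction": 1.0},
    {"Team1": "Australia", "Team2": "New Zealand", "prediction": 1.0},
]
final = {"Team1": "India", "Team2": "Australia", "prediction": 0.0}

finalists = [predicted_winner(row) for row in semi_finals]
champion = predicted_winner(final)
print("Finalists: %s; predicted champion: %s" % (", ".join(finalists), champion))
```

With the pandas frames from the cells above, `win_predictions.to_dict("records")` yields rows in this shape, so the helper can be applied to them directly.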
