# ICC Men’s Cricket World Cup - A Journey Through History

## Introduction

This case study analyses all the ICC Men’s Cricket World Cup matches held from 1975 to 2023.
- Data Preparation:

### 1. Reading and combining data
Load all the CSV files and concatenate them into a single DataFrame.

In [289]:
import pandas as pd

In [290]:
import glob   # glob returns the list of files matching a pattern
csv_files = glob.glob("D:/WorldCup_Stats/*.csv")  # all CSV files in the WorldCup_Stats folder
wcinfo_list = []

for file in csv_files:    # read each CSV and append its DataFrame to the list
    df = pd.read_csv(file)
    wcinfo_list.append(df)

crick_df = pd.concat(wcinfo_list, ignore_index=True)
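A variant of the loading step, offered as a sketch rather than as the method this notebook used: it sorts the file list so the row order is reproducible between runs, and it records which file each row came from. The helper name `load_csv_folder` and the `source_file` column are hypothetical additions for illustration.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every CSV in `folder` and stack them into one DataFrame.

    `load_csv_folder` and the `source_file` column are illustrative
    names, not part of the original notebook.
    """
    # glob returns files in arbitrary order, so sort for reproducibility
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")

    frames = []
    for path in csv_files:
        frame = pd.read_csv(path)
        # remember the originating file for later debugging
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)

    # ignore_index=True renumbers the combined rows 0..n-1
    return pd.concat(frames, ignore_index=True)
```

With the folder from the cell above, the call would be `crick_df = load_csv_folder("D:/WorldCup_Stats")`. Raising on an empty file list avoids the less obvious error that `pd.concat` gives for an empty list.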


### 2. Initial data exploration and cleaning 

In [291]:
crick_df.drop_duplicates(inplace=True)
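Before dropping duplicates it can help to know how many there are. The sketch below uses a small made-up frame (not the World Cup data) to count repeated rows with `duplicated()` and then remove them by reassignment instead of `inplace=True`.

```python
import pandas as pd

# Made-up rows standing in for crick_df; the first two are identical
demo = pd.DataFrame({
    "team_1": ["PAK", "PAK", "ENG"],
    "team_2": ["SL", "SL", "AUS"],
    "world_cup_year": [1975, 1975, 1975],
})

# duplicated() flags every repeat of an earlier row
n_dupes = int(demo.duplicated().sum())
print(f"{n_dupes} duplicate row(s) found")

# Reassigning avoids inplace=True; ignore_index renumbers the rows 0..n-1
deduped = demo.drop_duplicates(ignore_index=True)
print(len(deduped))
```

Note that two rows only count as duplicates when every column matches, including helper columns such as `Unnamed: 0.1`.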

In [292]:
crick_df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,date,venue,match_category,team_1,team_2,team_1_runs,team_1_wickets,team_2_runs,team_2_wickets,result,pom,best_batters,best_bowlers,commentary_line,world_cup_year,host_country
0,0,11,,Nottingham,League-Match,PAK,SL,330.0,6.0,138.0,0.0,Pakistan won by 192 runs,Zaheer Abbas,,,,1975,England
1,1,5,,Leeds,League-Match,EAf,IND,120.0,0.0,123.0,0.0,India won by 10 wickets (with 181 balls remain...,Farokh Engineer,,,,1975,England
2,2,12,1975-06-18,Leeds,Semi-Final,ENG,AUS,93.0,0.0,94.0,6.0,Australia won by 4 wickets (with 188 balls rem...,Gary Gilmour,,,,1975,England
3,3,8,1975-06-14,Birmingham,League-Match,ENG,EAf,290.0,5.0,94.0,0.0,England won by 196 runs,John Snow,,,,1975,England
4,4,13,,The Oval,Semi-Final,NZ,WI,158.0,0.0,159.0,5.0,West Indies won by 5 wickets (with 119 balls r...,Alvin Kallicharran,,,,1975,England


In [293]:
crick_df.info()  # inspect the DataFrame structure

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0.1     528 non-null    int64  
 1   Unnamed: 0       528 non-null    int64  
 2   date             364 non-null    object 
 3   venue            528 non-null    object 
 4   match_category   528 non-null    object 
 5   team_1           528 non-null    object 
 6   team_2           528 non-null    object 
 7   team_1_runs      518 non-null    float64
 8   team_1_wickets   518 non-null    float64
 9   team_2_runs      513 non-null    float64
 10  team_2_wickets   513 non-null    float64
 11  result           528 non-null    object 
 12  pom              510 non-null    object 
 13  best_batters     250 non-null    object 
 14  best_bowlers     250 non-null    object 
 15  commentary_line  83 non-null     object 
 16  world_cup_year   528 non-null    int64  
 17  host_country    

## Note:
- Some columns contain null values; only 9 of the 18 columns are fully populated. If I removed every column that contains nulls, many columns would be eliminated. I also cannot remove the affected rows, because each row is an important match record. So I decided not to change anything.
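The decision above rests on how many values are missing in each column. A minimal sketch of that tally follows, using a tiny made-up frame in place of `crick_df`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for crick_df (values are illustrative only)
sample = pd.DataFrame({
    "venue": ["Leeds", "Lord's", "The Oval", "Mumbai"],
    "team_1_runs": [120.0, np.nan, 158.0, 250.0],
    "commentary_line": [None, None, None, "text"],
})

# Count and percentage of missing values per column, worst column first
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)

# Columns with no missing values at all
complete_cols = summary.index[summary["missing"] == 0].tolist()
print(complete_cols)
```

Running the same two expressions on `crick_df` would show which columns are sparse enough to be candidates for removal later.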

Convert the 'date' column to a datetime type.

In [294]:
crick_df['date']=pd.to_datetime(crick_df['date'])   
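A more defensive form of the conversion, as a sketch: it assumes the dates are written as `YYYY-MM-DD`, which is how they appear in the preview above, and uses `errors="coerce"` so that a malformed entry becomes `NaT` instead of raising. The sample strings are made up.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately malformed
raw = pd.Series(["1975-06-18", None, "2023-11-19", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())  # missing and malformed entries both count as NaT
```

Comparing the number of `NaT` values before and after the conversion shows whether any non-empty strings failed to parse.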

In [295]:
crick_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Unnamed: 0.1     528 non-null    int64         
 1   Unnamed: 0       528 non-null    int64         
 2   date             364 non-null    datetime64[ns]
 3   venue            528 non-null    object        
 4   match_category   528 non-null    object        
 5   team_1           528 non-null    object        
 6   team_2           528 non-null    object        
 7   team_1_runs      518 non-null    float64       
 8   team_1_wickets   518 non-null    float64       
 9   team_2_runs      513 non-null    float64       
 10  team_2_wickets   513 non-null    float64       
 11  result           528 non-null    object        
 12  pom              510 non-null    object        
 13  best_batters     250 non-null    object        
 14  best_bowlers     250 non-null    object   

In [296]:
print(crick_df)

     Unnamed: 0.1  Unnamed: 0       date        venue match_category team_1  \
0               0          11        NaT   Nottingham   League-Match    PAK   
1               1           5        NaT        Leeds   League-Match    EAf   
2               2          12 1975-06-18        Leeds     Semi-Final    ENG   
3               3           8 1975-06-14   Birmingham   League-Match    ENG   
4               4          13        NaT     The Oval     Semi-Final     NZ   
..            ...         ...        ...          ...            ...    ...   
523            45          18 2023-10-21      Lucknow   League-Match    NED   
524            46          40 2023-11-09    Bengaluru   League-Match     SL   
525            47           1 2023-10-06    Hyderabad   League-Match    PAK   
526            48          44 2023-11-12    Bengaluru   League-Match    IND   
527            49           6 2023-10-10   Dharamsala   League-Match    ENG   

    team_2  team_1_runs  team_1_wickets  team_2_run

### 3. Handle outliers and missing values 

- Sort the rows by `world_cup_year` and `Unnamed: 0` so the data is easier to read

In [297]:
crick_df = crick_df.sort_values(by=['world_cup_year', 'Unnamed: 0'],
    ascending=[True, True])

In [298]:
print(crick_df)

     Unnamed: 0.1  Unnamed: 0       date          venue match_category team_1  \
12             12           0 1975-06-07         Lord's   League-Match    ENG   
5               5           1        NaT     Birmingham   League-Match     NZ   
7               7           2        NaT          Leeds   League-Match    AUS   
10             10           3        NaT     Manchester   League-Match     SL   
9               9           4 1975-06-11     Nottingham   League-Match    ENG   
..            ...         ...        ...            ...            ...    ...   
515            37          43        NaT   Eden Gardens   League-Match    ENG   
526            48          44 2023-11-12      Bengaluru   League-Match    IND   
521            43          45 2023-11-15       Wankhede     Semi-Final    IND   
484             6          46 2023-11-16   Eden Gardens     Semi-Final     SA   
498            20          47 2023-11-19      Ahmedabad          Final    IND   

    team_2  team_1_runs  te

In [299]:
crick_df.set_index('Unnamed: 0', inplace=True)  
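One thing to watch with this index: in the output above, `Unnamed: 0` appears to restart for each tournament (the value 1 occurs in both the 1975 and the 2023 rows), so the resulting index is not unique. The sketch below uses made-up rows to show the check, plus one possible alternative, a two-level index on year and match number. It is an illustration, not a step this notebook performs.

```python
import pandas as pd

# Made-up rows: the per-tournament number restarts at 0 for each year
demo = pd.DataFrame({
    "world_cup_year": [1975, 1975, 2023, 2023],
    "Unnamed: 0": [0, 1, 0, 1],
    "venue": ["Lord's", "Birmingham", "Ahmedabad", "Hyderabad"],
})

# Single-level index: the same label occurs once per tournament
single = demo.set_index("Unnamed: 0")
print(single.index.is_unique)

# Indexing by (year, number) keeps every row individually addressable
multi = demo.set_index(["world_cup_year", "Unnamed: 0"]).sort_index()
print(multi.index.is_unique)
print(multi.loc[(2023, 0), "venue"])
```

With a non-unique index, `crick_df.loc[1]` returns one row per tournament instead of a single match, which may or may not be what later steps expect.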

In [300]:
print(crick_df)

            Unnamed: 0.1       date          venue match_category team_1  \
Unnamed: 0                                                                 
0                     12 1975-06-07         Lord's   League-Match    ENG   
1                      5        NaT     Birmingham   League-Match     NZ   
2                      7        NaT          Leeds   League-Match    AUS   
3                     10        NaT     Manchester   League-Match     SL   
4                      9 1975-06-11     Nottingham   League-Match    ENG   
...                  ...        ...            ...            ...    ...   
43                    37        NaT   Eden Gardens   League-Match    ENG   
44                    48 2023-11-12      Bengaluru   League-Match    IND   
45                    43 2023-11-15       Wankhede     Semi-Final    IND   
46                     6 2023-11-16   Eden Gardens     Semi-Final     SA   
47                    20 2023-11-19      Ahmedabad          Final    IND   

           

### 4. Adding new columns to the DataFrame: 

### 5. Column Removal 