## Data Preprocessing on ODI Bowlers Dataset
Done by: Audity Ghosh
<br>3rd Year, CSE, RUET

#### Objectives:
- Reading a specific sheet from an Excel file.
- Getting insights into the data.
- Extracting new information from a column.
- Removing redundant information from the dataset.


## Importing necessary modules

In [40]:
import numpy as np
import pandas as pd

## Read the data sheet into dataframe

In [41]:
df = pd.read_excel("ODI_cricket.xlsx", sheet_name="bowler", engine="openpyxl")

display(df.head())

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,4,5
0,M Muralitharan (Asia/ICC/SL),1993-2011,350,341,18811,12326,534,23.08,3.93,35.2,15,10
1,Wasim Akram (PAK),1984-2003,356,351,18186,11812,502,23.52,3.89,36.2,17,6
2,Waqar Younis (PAK),1989-2003,262,258,12698,9919,416,23.84,4.68,30.5,14,13
3,WPUJC Vaas (Asia/SL),1994-2008,322,320,15775,11014,400,27.53,4.18,39.4,9,4
4,Shahid Afridi (Asia/ICC/PAK),1996-2015,398,372,17670,13632,395,34.51,4.62,44.7,4,9


The Excel file contains two sheets, so we specify which one to load into the dataframe with the sheet_name argument.

## First 10 rows of the dataframe

In [42]:
display(df.head(10))

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,4,5
0,M Muralitharan (Asia/ICC/SL),1993-2011,350,341,18811,12326,534,23.08,3.93,35.2,15,10
1,Wasim Akram (PAK),1984-2003,356,351,18186,11812,502,23.52,3.89,36.2,17,6
2,Waqar Younis (PAK),1989-2003,262,258,12698,9919,416,23.84,4.68,30.5,14,13
3,WPUJC Vaas (Asia/SL),1994-2008,322,320,15775,11014,400,27.53,4.18,39.4,9,4
4,Shahid Afridi (Asia/ICC/PAK),1996-2015,398,372,17670,13632,395,34.51,4.62,44.7,4,9
5,SM Pollock (Afr/ICC/SA),1996-2008,303,297,15712,9631,393,24.5,3.67,39.9,12,5
6,GD McGrath (AUS/ICC),1993-2007,250,248,12970,8391,381,22.02,3.88,34.0,9,7
7,B Lee (AUS),2000-2012,221,217,11185,8877,380,23.36,4.76,29.4,14,9
8,SL Malinga (SL),2004-2019,226,220,10936,9760,338,28.87,5.35,32.3,11,8
9,A Kumble (Asia/INDIA),1990-2007,271,265,14496,10412,337,30.89,4.3,43.0,8,2


## The meaning of each column

    Player: Name of the bowler, followed by the team(s) in brackets.
    Span: The active years of the bowler.
    Mat: The number of matches the bowler has played.
    Inns: The number of innings in which the bowler has bowled.
    Balls: The number of balls the bowler has bowled.
    Wkts: The number of wickets the bowler has taken.
    Runs: The number of runs conceded.
    Ave: The average number of runs conceded per wicket. (Ave = Runs/Wkts)
    Econ: The average number of runs conceded per over. (Econ = Runs/(Balls/6))
    SR: The average number of balls bowled per wicket taken. (SR = Balls/Wkts)
    4: The number of innings in which the bowler took exactly four wickets.
    5: The number of innings in which the bowler took five or more wickets.
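The three derived columns can be cross-checked against the raw counts. The sketch below is an illustrative check, not part of the original notebook: it recomputes Ave, Econ and SR for M Muralitharan, with the values copied by hand from the first row of the output above into a one-row frame (the name `check` is an assumption).

```python
import pandas as pd

# Raw counts for M Muralitharan, copied from the first row of the output above.
check = pd.DataFrame({"Balls": [18811], "Runs": [12326], "Wkts": [534]})

check["Ave"] = check["Runs"] / check["Wkts"]          # runs conceded per wicket
check["Econ"] = check["Runs"] / (check["Balls"] / 6)  # runs conceded per six-ball over
check["SR"] = check["Balls"] / check["Wkts"]          # balls bowled per wicket

print(check[["Ave", "Econ", "SR"]].round(2))
```

The recomputed values should agree with the published 23.08, 3.93 and 35.2 up to the precision shown in the sheet.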

## Number of rows and columns in the dataset

In [43]:
print("No of rows ", df.shape[0])
print("No of columns ", df.shape[1])

No of rows  77
No of columns  12


## Data statistics and data types

In [44]:
df.describe()

Unnamed: 0,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,4,5
count,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
mean,194.402597,181.194805,8839.402597,6671.714286,233.805195,28.958052,4.596753,37.909091,6.350649,2.87013
std,82.485606,67.958393,3316.055457,2245.839029,84.406603,4.826768,0.515814,6.060901,3.556929,2.530606
min,80.0,76.0,4074.0,2821.0,151.0,18.68,3.3,26.1,1.0,0.0
25%,136.0,128.0,6182.0,5058.0,173.0,24.97,4.28,33.0,3.0,1.0
50%,170.0,164.0,8054.0,6192.0,199.0,29.29,4.66,37.8,6.0,2.0
75%,227.0,218.0,10750.0,8021.0,272.0,31.9,4.92,41.4,8.0,4.0
max,463.0,372.0,18811.0,13632.0,534.0,44.48,5.83,52.5,17.0,13.0


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  77 non-null     object 
 1   Span    77 non-null     object 
 2   Mat     77 non-null     int64  
 3   Inns    77 non-null     int64  
 4   Balls   77 non-null     int64  
 5   Runs    77 non-null     int64  
 6   Wkts    77 non-null     int64  
 7   Ave     77 non-null     float64
 8   Econ    77 non-null     float64
 9   SR      77 non-null     float64
 10  4       77 non-null     int64  
 11  5       77 non-null     int64  
dtypes: float64(3), int64(7), object(2)
memory usage: 7.3+ KB


#### Here we can see that there are no null values: every column's non-null count equals the total number of rows (77).
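The same conclusion can be reached directly, without reading the `info()` table by eye. The sketch below is a minimal illustration on a hypothetical three-row frame (`toy`, with one value deliberately missing); on the real dataframe the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing value, for illustration only.
toy = pd.DataFrame({"Player": ["A", "B", "C"], "Wkts": [534, np.nan, 416]})

missing_per_column = toy.isna().sum()   # number of missing cells in each column
has_missing = toy.isna().any().any()    # True if any cell at all is missing

print(missing_per_column)
print("Any missing values:", has_missing)
```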

## Renaming the columns and splitting Player and Span

In [46]:
# renaming the column names
df = df.rename(columns={ 
                        'Mat':'Matches_Played',
                        'Inns': 'Innings_Played',
                        'Balls': 'Balls_Bowled',
                        'Runs': 'Runs_Yielded',
                        'Wkts': 'Wickets_Taken',
                        'Ave': 'Bowling_Average',
                        'Econ': 'Economy_Rate',
                        'SR': 'Strike_Rate',
                        4: "Four_Wickets_in_an_innings",
                        5: "Five_Wickets_in_an_innings"
})

# splitting the 'Player' column to get the information about 'Country'
df[["Player_Name", "Country"]] = df['Player'].str.split("(", expand=True)

# dropping the 'Player' column
df = df.drop('Player', axis=1)

# remove the ")" from the 'Country' column (regex=False treats ")" as a literal character)
df['Country'] = df['Country'].str.replace(")", "", regex=False)

# splitting the 'Span' column based on the "-"
df[['Start_year', 'End_year']] = df['Span'].str.split("-", expand=True)

# removing the "Span" column
df = df.drop("Span", axis=1)

# rearrange the columns
new_col_sequence = ['Player_Name', 'Country', 'Start_year', 'End_year', 'Matches_Played', 'Innings_Played', 'Balls_Bowled', 
                    'Runs_Yielded', 'Wickets_Taken','Bowling_Average', 'Economy_Rate', 'Strike_Rate', 
                    'Four_Wickets_in_an_innings', 'Five_Wickets_in_an_innings']
df = df[new_col_sequence]

display(df.head(10))

Unnamed: 0,Player_Name,Country,Start_year,End_year,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings
0,M Muralitharan,Asia/ICC/SL,1993,2011,350,341,18811,12326,534,23.08,3.93,35.2,15,10
1,Wasim Akram,PAK,1984,2003,356,351,18186,11812,502,23.52,3.89,36.2,17,6
2,Waqar Younis,PAK,1989,2003,262,258,12698,9919,416,23.84,4.68,30.5,14,13
3,WPUJC Vaas,Asia/SL,1994,2008,322,320,15775,11014,400,27.53,4.18,39.4,9,4
4,Shahid Afridi,Asia/ICC/PAK,1996,2015,398,372,17670,13632,395,34.51,4.62,44.7,4,9
5,SM Pollock,Afr/ICC/SA,1996,2008,303,297,15712,9631,393,24.5,3.67,39.9,12,5
6,GD McGrath,AUS/ICC,1993,2007,250,248,12970,8391,381,22.02,3.88,34.0,9,7
7,B Lee,AUS,2000,2012,221,217,11185,8877,380,23.36,4.76,29.4,14,9
8,SL Malinga,SL,2004,2019,226,220,10936,9760,338,28.87,5.35,32.3,11,8
9,A Kumble,Asia/INDIA,1990,2007,271,265,14496,10412,337,30.89,4.3,43.0,8,2
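Splitting on "(" keeps the space that precedes the bracket at the end of each name, and the closing ")" has to be removed in a second step. As a possible alternative (not the approach used in this notebook), a single regular expression with named groups can pull out both parts at once. The two sample strings are taken from the data above; the names `players`, `pattern` and `parts` are assumptions.

```python
import pandas as pd

players = pd.Series(["M Muralitharan (Asia/ICC/SL)", "Wasim Akram (PAK)"], name="Player")

# Named groups become column names; \s* swallows the space before the bracket.
pattern = r"^(?P<Player_Name>.*?)\s*\((?P<Country>[^)]*)\)$"
parts = players.str.extract(pattern)

print(parts)
```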


### Find the player who played for the longest time

##### The "Span" column holds strings, so Start_year and End_year are strings too after the split. They must be converted to integers before their difference can be computed.

In [47]:
df['Start_year'] = df['Start_year'].astype('int') 
df['End_year'] = df['End_year'].astype('int')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Player_Name                 77 non-null     object 
 1   Country                     77 non-null     object 
 2   Start_year                  77 non-null     int32  
 3   End_year                    77 non-null     int32  
 4   Matches_Played              77 non-null     int64  
 5   Innings_Played              77 non-null     int64  
 6   Balls_Bowled                77 non-null     int64  
 7   Runs_Yielded                77 non-null     int64  
 8   Wickets_Taken               77 non-null     int64  
 9   Bowling_Average             77 non-null     float64
 10  Economy_Rate                77 non-null     float64
 11  Strike_Rate                 77 non-null     float64
 12  Four_Wickets_in_an_innings  77 non-null     int64  
 13  Five_Wickets_in_an_innings  77 non-nu
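The split-then-convert step can be seen in isolation on two Span values taken from the data above. This is a minimal sketch with assumed names (`spans`, `years`); it converts both new columns in one `astype` call rather than one at a time.

```python
import pandas as pd

spans = pd.Series(["1993-2011", "1984-2003"], name="Span")

years = spans.str.split("-", expand=True)   # two columns of strings
years.columns = ["Start_year", "End_year"]
years = years.astype(int)                   # convert both columns at once

print(years["End_year"] - years["Start_year"])
```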

##### Years_played is derived from End_year and Start_year, which makes those two columns redundant, so we drop them afterwards.

In [48]:
df['Years_played'] = df['End_year'] - df['Start_year']

df = df.drop(['Start_year', "End_year"], axis=1)

# rearrange the columns
new_col_sequence = ['Player_Name', 'Country', 'Years_played', 'Matches_Played', 'Innings_Played', 'Balls_Bowled', 
                    'Runs_Yielded', 'Wickets_Taken','Bowling_Average', 'Economy_Rate', 'Strike_Rate', 
                    'Four_Wickets_in_an_innings', 'Five_Wickets_in_an_innings']
df = df[new_col_sequence]

display(df.head(10))

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings
0,M Muralitharan,Asia/ICC/SL,18,350,341,18811,12326,534,23.08,3.93,35.2,15,10
1,Wasim Akram,PAK,19,356,351,18186,11812,502,23.52,3.89,36.2,17,6
2,Waqar Younis,PAK,14,262,258,12698,9919,416,23.84,4.68,30.5,14,13
3,WPUJC Vaas,Asia/SL,14,322,320,15775,11014,400,27.53,4.18,39.4,9,4
4,Shahid Afridi,Asia/ICC/PAK,19,398,372,17670,13632,395,34.51,4.62,44.7,4,9
5,SM Pollock,Afr/ICC/SA,12,303,297,15712,9631,393,24.5,3.67,39.9,12,5
6,GD McGrath,AUS/ICC,14,250,248,12970,8391,381,22.02,3.88,34.0,9,7
7,B Lee,AUS,12,221,217,11185,8877,380,23.36,4.76,29.4,14,9
8,SL Malinga,SL,15,226,220,10936,9760,338,28.87,5.35,32.3,11,8
9,A Kumble,Asia/INDIA,17,271,265,14496,10412,337,30.89,4.3,43.0,8,2


##### zip is used to traverse two columns together, and df[column].max() returns the maximum value of a column.

In [49]:
longest = df["Years_played"].max()  # compute the maximum once, outside the loop
print("The player who played for the longest period: ")
for player, active in zip(df["Player_Name"], df["Years_played"]):
    if active == longest:
        print(player, active)


The player who played for the longest period: 
SR Tendulkar  23


#### Alternative Solution

In [50]:
df[df['Years_played']==df['Years_played'].max()]['Player_Name']

72    SR Tendulkar 
Name: Player_Name, dtype: object

Here 72 is the row's index label in the dataframe.

## The player who played for the shortest period

In [51]:
shortest = df["Years_played"].min()  # compute the minimum once, outside the loop
print("The player who played for the shortest period: ")
for player, active in zip(df["Player_Name"], df["Years_played"]):
    if active == shortest:
        print(player, active)


The player who played for the shortest period: 
BKV Prasad  7
Saeed Ajmal  7
BAW Mendis  7
Rashid Khan  7


#### Alternative Solution

In [52]:
df[df['Years_played']==df['Years_played'].min()]['Player_Name']

40     BKV Prasad 
49    Saeed Ajmal 
73     BAW Mendis 
74    Rashid Khan 
Name: Player_Name, dtype: object

The left column shows each row's index label in the dataframe.
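The boolean-mask form matters when several players tie, as happens for the shortest period. The sketch below contrasts it with `idxmin`, which reports only the first matching row. It uses a hypothetical four-row frame (`toy`) whose values are copied from the outputs above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player_Name": ["SR Tendulkar", "BKV Prasad", "Saeed Ajmal", "M Muralitharan"],
    "Years_played": [23, 7, 7, 18],
})

# idxmin returns the index label of the first minimum only.
first_shortest = toy.loc[toy["Years_played"].idxmin(), "Player_Name"]

# A boolean mask keeps every row that ties for the minimum.
shortest = toy["Years_played"].min()
all_shortest = toy.loc[toy["Years_played"] == shortest, "Player_Name"].tolist()

print(first_shortest)
print(all_shortest)
```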

### Count how many players played for ICC

In [53]:
def icc_check(x):
    if "ICC" in x:
        return "Yes"
    else:
        return "No"

In [54]:
df['played_for_ICC'] = df['Country'].apply(icc_check)

In [55]:
print(df['played_for_ICC'].value_counts())

No     64
Yes    13
Name: played_for_ICC, dtype: int64


##### Here, we can see that 13 players played for ICC.
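The same check can be written without a helper function. The sketch below shows two variants on a few Country values taken from the data above: a plain substring test, and a stricter test that splits on "/" and looks for an exact "ICC" entry. The names `country`, `by_substring` and `by_token` are assumptions, and the result is boolean rather than "Yes"/"No".

```python
import pandas as pd

country = pd.Series(["Asia/ICC/SL", "PAK", "AUS/ICC", "ICC/NZ", "INDIA"], name="Country")

# Substring test, the vectorised counterpart of the `"ICC" in x` check.
by_substring = country.str.contains("ICC", regex=False)

# Token test: split on "/" and require an exact "ICC" entry.
by_token = country.str.split("/").apply(lambda tokens: "ICC" in tokens)

print(by_substring.sum(), by_token.sum())
```

On these labels both variants agree; the token test would only differ if some team code happened to contain "ICC" as part of a longer name.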

### Count how many Australian bowlers there are

In [56]:
# function for checking Australian bowlers
def austrailia_check(x):
    if "AUS" in x:
        return "Yes"
    else:
        return "No"


In [57]:
df['played_for_Austrailia'] = df['Country'].apply(austrailia_check) #function applied to specific column values.

display(df.head(10))

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings,played_for_ICC,played_for_Austrailia
0,M Muralitharan,Asia/ICC/SL,18,350,341,18811,12326,534,23.08,3.93,35.2,15,10,Yes,No
1,Wasim Akram,PAK,19,356,351,18186,11812,502,23.52,3.89,36.2,17,6,No,No
2,Waqar Younis,PAK,14,262,258,12698,9919,416,23.84,4.68,30.5,14,13,No,No
3,WPUJC Vaas,Asia/SL,14,322,320,15775,11014,400,27.53,4.18,39.4,9,4,No,No
4,Shahid Afridi,Asia/ICC/PAK,19,398,372,17670,13632,395,34.51,4.62,44.7,4,9,Yes,No
5,SM Pollock,Afr/ICC/SA,12,303,297,15712,9631,393,24.5,3.67,39.9,12,5,Yes,No
6,GD McGrath,AUS/ICC,14,250,248,12970,8391,381,22.02,3.88,34.0,9,7,Yes,Yes
7,B Lee,AUS,12,221,217,11185,8877,380,23.36,4.76,29.4,14,9,No,Yes
8,SL Malinga,SL,15,226,220,10936,9760,338,28.87,5.35,32.3,11,8,No,No
9,A Kumble,Asia/INDIA,17,271,265,14496,10412,337,30.89,4.3,43.0,8,2,No,No


In [58]:
print(df['played_for_Austrailia'].value_counts())

No     67
Yes    10
Name: played_for_Austrailia, dtype: int64


#### Here, we can see there are 10 Australian bowlers in the dataset.
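The ICC and Australia checks differ only in the tag they look for, so one parameterised helper could serve both. The sketch below is one possible way to do that; `add_team_flag` and `toy` are hypothetical names, and the column names simply reuse the ones already created in this notebook.

```python
import pandas as pd

def add_team_flag(frame, tag, column):
    """Add a Yes/No column telling whether `tag` occurs in frame['Country']."""
    frame[column] = frame["Country"].apply(lambda label: "Yes" if tag in label else "No")
    return frame

# Hypothetical four-row frame using Country labels that appear in the data.
toy = pd.DataFrame({"Country": ["AUS/ICC", "AUS", "Asia/ICC/SL", "PAK"]})

for tag, column in [("ICC", "played_for_ICC"), ("AUS", "played_for_Austrailia")]:
    toy = add_team_flag(toy, tag, column)

print(toy)
```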

### Check if there is any Bangladeshi Bowler

In [59]:
#function to find bd player
def Bangladesh_check(x):
    if "BAN" in x:
        return "Yes"
    else:
        return "No"

In [60]:
df['played_for_Bangladesh'] = df['Country'].apply(Bangladesh_check) # function applied to specific column

display(df.sample(10))

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings,played_for_ICC,played_for_Austrailia,played_for_Bangladesh
44,L Klusener,SA,8,171,164,7336,5751,192,29.95,4.7,38.2,1,6,No,No,No
15,AB Agarkar,INDIA,9,191,188,9484,8021,288,27.85,5.07,32.9,10,2,No,No,No
25,N Kapil Dev,INDIA,16,225,221,11202,6945,253,27.45,3.71,44.2,3,1,No,No,No
65,RJ Hadlee,NZ,17,115,112,6182,3407,158,21.56,3.3,39.1,1,5,No,No,No
5,SM Pollock,Afr/ICC/SA,12,303,297,15712,9631,393,24.5,3.67,39.9,12,5,Yes,No,No
23,Harbhajan Singh,Asia/INDIA,17,236,227,12479,8973,269,33.35,4.31,46.3,2,3,No,No,No
47,RA Jadeja,INDIA,11,168,164,8557,7024,188,37.36,4.92,45.5,7,1,No,No,No
74,Rashid Khan,AFG,7,80,76,4074,2821,151,18.68,4.15,26.9,5,4,No,No,No
70,GB Hogg,AUS,12,123,113,5564,4188,156,26.84,4.51,35.6,3,2,No,Yes,No
18,JH Kallis,Afr/ICC/SA,18,328,283,10750,8680,273,31.79,4.84,39.3,2,2,Yes,No,No


In [61]:
print(df['played_for_Bangladesh'].value_counts())

No     74
Yes     3
Name: played_for_Bangladesh, dtype: int64


##### Here, we can observe that there are three Bangladeshi bowlers.

In [62]:
found_bd_player=False

for state in df['played_for_Bangladesh']:
    if state == "Yes":
        print("Yee! We found a Bangladeshi Bowler")
        found_bd_player=True
        break
if not found_bd_player:
    print("No, there is no Bangladeshi Bowler")
    

Yee! We found a Bangladeshi Bowler
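The loop with a flag and `break` can be condensed into a single vectorised test. The sketch below is an assumed equivalent on a hypothetical four-value column (`flags`): `eq("Yes")` builds a boolean Series and `any()` reports whether at least one entry is True.

```python
import pandas as pd

# Hypothetical flag column, standing in for df['played_for_Bangladesh'].
flags = pd.Series(["No", "No", "Yes", "No"], name="played_for_Bangladesh")

found_bd_player = flags.eq("Yes").any()   # True if at least one "Yes" exists
if found_bd_player:
    print("Found a Bangladeshi bowler")
else:
    print("No Bangladeshi bowler in the data")
```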


## Find the player with the lowest Economy Rate

In [63]:
lowest_econ = df["Economy_Rate"].min()  # compute the minimum once, outside the loop
print("The player who has the lowest economy rate: ")
for player, eco in zip(df["Player_Name"], df["Economy_Rate"]):
    if eco == lowest_econ:
        print(player, eco)


The player who has the lowest economy rate: 
RJ Hadlee  3.3


#### Alternative

In [64]:
df[df['Economy_Rate']==df['Economy_Rate'].min()]['Player_Name']

65    RJ Hadlee 
Name: Player_Name, dtype: object

## Find the player with the lowest Strike Rate

In [65]:
lowest_sr = df["Strike_Rate"].min()  # compute the minimum once, outside the loop
print("The player who has the lowest strike rate: ")
for player, strike in zip(df["Player_Name"], df["Strike_Rate"]):
    if strike == lowest_sr:
        print(player, strike)

The player who has the lowest strike rate: 
MA Starc  26.1


#### Alternative

In [66]:
df[df['Strike_Rate']==df['Strike_Rate'].min()]['Player_Name']

41    MA Starc 
Name: Player_Name, dtype: object

The left column shows the row's index label in the dataframe.

### Find the player with the lowest bowling average

In [67]:
lowest_avg = df["Bowling_Average"].min()  # compute the minimum once, outside the loop
print("The player who has the lowest bowling average: ")
for player, avg in zip(df["Player_Name"], df["Bowling_Average"]):
    if avg == lowest_avg:
        print(player, avg)

The player who has the lowest bowling average: 
Rashid Khan  18.68


#### Alternative

In [68]:
df[df['Bowling_Average']==df['Bowling_Average'].min()]['Player_Name']

74    Rashid Khan 
Name: Player_Name, dtype: object

The left column shows the row's index label in the dataframe.
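When more than the single best value is of interest, `nsmallest` returns the top few rows already sorted. The sketch below ranks a hypothetical five-row subset (`toy`) by economy rate; the five players and their rates are copied from outputs shown in this notebook, so the ranking applies to this subset only, not to the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player_Name": ["Wasim Akram", "SM Pollock", "GD McGrath", "N Kapil Dev", "RJ Hadlee"],
    "Economy_Rate": [3.89, 3.67, 3.88, 3.71, 3.30],
})

# nsmallest orders the rows by the given column, ascending, and keeps the first n.
top3 = toy.nsmallest(3, "Economy_Rate")
print(top3)
```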

In [69]:
df.sample(20)

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings,played_for_ICC,played_for_Austrailia,played_for_Bangladesh
29,HH Streak,Afr/ZIM,12,189,185,9468,7129,239,29.82,4.51,39.6,7,1,No,No,No
53,SCJ Broad,ENG,10,121,121,6109,5364,178,30.13,5.26,34.3,9,1,No,No,No
44,L Klusener,SA,8,171,164,7336,5751,192,29.95,4.7,38.2,1,6,No,No,No
51,Aaqib Javed,PAK,10,163,159,8012,5721,182,31.43,4.28,44.0,2,4,No,No,No
46,M Morkel,Afr/SA,11,117,114,5760,4761,188,25.32,4.95,30.6,7,2,No,No,No
68,M Prabhakar,INDIA,12,130,127,6360,4534,157,28.87,4.27,40.5,4,2,No,No,No
36,CL Cairns,ICC/NZ,15,215,186,8168,6594,201,32.8,4.84,40.6,3,1,Yes,No,No
62,CH Gayle,ICC/WI,20,301,199,7424,5926,167,35.48,4.78,44.4,3,1,Yes,No,No
23,Harbhajan Singh,Asia/INDIA,17,236,227,12479,8973,269,33.35,4.31,46.3,2,3,No,No,No
70,GB Hogg,AUS,12,123,113,5564,4188,156,26.84,4.51,35.6,3,2,No,Yes,No


## Remove redundant information

In [70]:
def asia_check(x):
    if "Asia" in x:
        return "Yes"
    else:
        return "No"

In [71]:
def africa_check(x):
    if "Afr" in x:
        return "Yes"
    else:
        return "No"

In [72]:

df['played_for_Asia'] = df['Country'].apply(asia_check)
df['played_for_Africa'] = df['Country'].apply(africa_check)


In [73]:

print(df['played_for_Asia'].value_counts())
print(df['played_for_Africa'].value_counts())

No     65
Yes    12
Name: played_for_Asia, dtype: int64
No     72
Yes     5
Name: played_for_Africa, dtype: int64


In [74]:
df['Country'] = df['Country'].str.replace("ICC/", "")
df['Country'] = df['Country'].str.replace("/ICC", "")
df['Country'] = df['Country'].str.replace("Asia/", "")
df['Country'] = df['Country'].str.replace("Afr/", "")

display(df.sample(10))

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings,played_for_ICC,played_for_Austrailia,played_for_Bangladesh,played_for_Asia,played_for_Africa
69,A Nehra,INDIA,10,120,120,5751,4981,157,31.72,5.19,36.6,5,2,No,No,No,Yes,No
58,IK Pathan,INDIA,8,120,118,5855,5142,173,29.72,5.26,33.8,5,2,No,No,No,No,No
13,SK Warne,AUS,12,194,191,10642,7541,293,25.73,4.25,36.3,12,1,Yes,Yes,No,No,No
17,Z Khan,INDIA,12,200,197,10097,8301,282,29.43,4.93,35.8,7,1,No,No,No,Yes,No
37,DJ Bravo,WI,10,164,150,6511,5874,199,29.51,5.41,32.7,6,1,No,No,No,No,No
52,Umar Gul,PAK,13,130,128,6064,5253,179,29.34,5.19,33.8,4,2,No,No,No,No,No
36,CL Cairns,NZ,15,215,186,8168,6594,201,32.8,4.84,40.6,3,1,Yes,No,No,No,No
16,Shakib Al Hasan,BAN,16,221,218,11351,8401,285,29.47,4.44,39.8,9,3,No,No,Yes,No,No
21,JM Anderson,ENG,13,194,191,9584,7861,269,29.22,4.92,35.6,11,2,No,No,No,No,No
40,BKV Prasad,INDIA,7,161,160,8129,6332,196,32.3,4.67,41.4,3,1,No,No,No,No,No


str.replace substitutes every occurrence of the given substring with the replacement, which here is an empty string.
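The four chained replacements depend on where the "/" sits around each tag. A possible alternative, sketched below under assumed names (`NON_NATIONAL_TAGS`, `national_team`, `country`, `cleaned`), splits each label on "/" and drops the tags by exact match, so their position no longer matters. The sample labels are taken from the data above.

```python
import pandas as pd

NON_NATIONAL_TAGS = {"ICC", "Asia", "Afr"}   # tags already captured in the flag columns

def national_team(label):
    """Drop the composite-team tags and keep whatever remains of the label."""
    kept = [token for token in label.split("/") if token not in NON_NATIONAL_TAGS]
    return "/".join(kept)

country = pd.Series(["Asia/ICC/SL", "AUS/ICC", "Afr/ICC/SA", "ICC/NZ", "PAK"])
cleaned = country.apply(national_team)

print(cleaned.tolist())
```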

### Checking the unique values for country column

In [75]:
df['Country'].value_counts()

INDIA    13
PAK      12
AUS      10
SL        9
NZ        8
SA        8
WI        6
ENG       6
BAN       3
AFG       1
ZIM       1
Name: Country, dtype: int64

In [76]:
# count the number of unique values in the Country column
n = df['Country'].nunique()

print("Number of unique values in Country column:", n)

Number of unique values in Country column: 11


##### Here, we can see there are 11 different countries in the Country column.

In [77]:
display(df.sample(20))

Unnamed: 0,Player_Name,Country,Years_played,Matches_Played,Innings_Played,Balls_Bowled,Runs_Yielded,Wickets_Taken,Bowling_Average,Economy_Rate,Strike_Rate,Four_Wickets_in_an_innings,Five_Wickets_in_an_innings,played_for_ICC,played_for_Austrailia,played_for_Bangladesh,played_for_Asia,played_for_Africa
2,Waqar Younis,PAK,14,262,258,12698,9919,416,23.84,4.68,30.5,14,13,No,No,No,No,No
19,AA Donald,SA,12,164,162,8561,5926,272,21.78,4.15,31.4,11,2,No,No,No,No,No
39,DW Steyn,SA,14,125,124,6256,5087,196,25.95,4.87,31.9,4,3,No,No,No,No,Yes
40,BKV Prasad,INDIA,7,161,160,8129,6332,196,32.3,4.67,41.4,3,1,No,No,No,No,No
45,TG Southee,NZ,12,143,141,7195,6558,190,34.51,5.46,37.8,4,3,No,No,No,No,No
55,NW Bracken,AUS,8,116,116,5759,4240,174,24.36,4.41,33.0,5,2,No,Yes,No,No,No
33,Abdur Razzak,BAN,10,153,152,7965,6065,207,29.29,4.56,38.4,5,4,No,No,Yes,No,No
48,CRD Fernando,SL,11,147,141,6507,5648,187,30.2,5.2,34.7,3,1,No,No,No,Yes,No
70,GB Hogg,AUS,12,123,113,5564,4188,156,26.84,4.51,35.6,3,2,No,Yes,No,No,No
3,WPUJC Vaas,SL,14,322,320,15775,11014,400,27.53,4.18,39.4,9,4,No,No,No,Yes,No


Because separate columns already record whether a player played for ICC, Asia or Africa, those tags no longer need to appear in the Country column. Removing them leaves one national team per row and eliminates the redundant information from the dataset.