<a href="https://colab.research.google.com/github/SKawsar/data_preprocessing_for_ML_with_Python/blob/main/Lecture_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 3: Data Preprocessing with Pandas (Part 2)

Instructor: Md Shahidullah Kawsar
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- How to extract new information from a column?
- How to create a column based on a condition or function?
- Removing a string from a column
- Checking the unique values for each column
- Performing calculations on dataframe columns
- Sorting a dataframe

#### References:
[1] Data Source: https://stats.espncricinfo.com/ci/content/records/223646.html
<br>[2] pandas split: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
<br>[3] pandas concatenation: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
<br>[4] pandas replace: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
<br>[5] pandas column rename: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
<br>[6] pandas sorting: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
<br>[7] pandas counting unique values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
<br>[8] pandas drop: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
<br>[9] pandas data type conversion: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
<br>[10] **Self Study:** difference between .loc and .iloc: https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

In [None]:
import numpy as np
import pandas as pd

# display up to 100 rows of a dataframe before the output is truncated
pd.options.display.max_rows = 100

#### Reading an excel file

In [None]:
df = pd.read_excel("ODI_cricket.xlsx", sheet_name="batsman", engine="openpyxl")

display(df.head())

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,4s,6s
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76


#### Code from Lecture 2

In [None]:
# renaming the column names
df = df.rename(columns={'Mat':'Match', 
                        'Inns':'Innings',
                        'NO': 'NotOut',
                        'HS': 'Highest_score',
                        'Ave': 'Average',
                        'BF': 'Balls_Faced',
                        'SR': 'Strike_Rate',
                        100: 'Centuries',
                        50: 'Half_centuries',
                        0: 'Ducks',
                        "4s": "Fours",
                        "6s": "Sixes"})

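# splitting the 'Player' column to get the information about 'Country'
df[["Player_Name", "Country"]] = df['Player'].str.split("(", expand=True)

# the split leaves a trailing space after the name, so strip it
df['Player_Name'] = df['Player_Name'].str.strip()

# dropping the 'Player' column
df = df.drop('Player', axis=1)

# remove the ")" from the 'Country' column; regex=False treats ")" as a literal character
df['Country'] = df['Country'].str.replace(")", "", regex=False)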

# rearrange the columns
new_col_sequence = ['Player_Name', 'Country', 'Span', 'Match', 'Innings', 'NotOut', 'Runs', 'Highest_score',
       'Average', 'Balls_Faced', 'Strike_Rate', 'Centuries', 'Half_centuries', 'Ducks', 'Fours', 'Sixes']
df = df[new_col_sequence]

display(df.head())

Unnamed: 0,Player_Name,Country,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes
0,SR Tendulkar,INDIA,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara,Asia/ICC/SL,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting,AUS/ICC,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya,Asia/SL,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene,Asia/SL,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76
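
The split-and-clean step above can also be sketched on a small toy Series, which makes the intermediate results easy to inspect. The two values are copied from the table above; the variable names `players`, `parts`, `names`, and `countries` are illustrative only. The sketch assumes each entry has the form "Name (Country)".

```python
import pandas as pd

# toy data in the same "Name (Country)" layout as the 'Player' column
players = pd.Series(["SR Tendulkar (INDIA)", "RT Ponting (AUS/ICC)"])

# n=1 splits only at the first "(", so the result always has two columns
parts = players.str.split("(", n=1, expand=True)

# strip the space left before "(" and drop the closing ")" as a literal character
names = parts[0].str.strip()
countries = parts[1].str.replace(")", "", regex=False)

print(names.tolist())
print(countries.tolist())
```

Passing `n=1` is a defensive choice: without it, a second "(" anywhere in a value would produce a third column and the two-column assignment would fail.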


#### How to create a column based on a condition or function?

In [None]:
df['Country'].value_counts()

AUS               13
PAK                9
WI                 8
SA                 8
INDIA              8
SL                 6
ZIM                5
NZ                 5
AUS/ICC            3
BAN                3
Asia/INDIA         3
ENG                3
Asia/PAK           3
Asia/SL            3
ICC/WI             2
Asia/ICC/INDIA     2
Afr/SA             2
ENG/IRE            1
ICC/NZ             1
Asia/ICC/PAK       1
Asia/ICC/SL        1
Afr/ICC/SA         1
IRE                1
Name: Country, dtype: int64

In [None]:
def icc_check(x):
    if "ICC" in x:
        return "Yes"
    else:
        return "No"

In [None]:
def asia_check(x):
    if "Asia" in x:
        return "Yes"
    else:
        return "No"

In [None]:
def africa_check(x):
    if "Afr" in x:
        return "Yes"
    else:
        return "No"

In [None]:
df['played_for_ICC'] = df['Country'].apply(icc_check)
df['played_for_Asia'] = df['Country'].apply(asia_check)
df['played_for_Africa'] = df['Country'].apply(africa_check)

display(df.head(10))

Unnamed: 0,Player_Name,Country,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa
0,SR Tendulkar,INDIA,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No
1,KC Sangakkara,Asia/ICC/SL,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No
2,RT Ponting,AUS/ICC,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No
3,ST Jayasuriya,Asia/SL,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No
4,DPMD Jayawardene,Asia/SL,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No
5,V Kohli,INDIA,2008-2022,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No
6,Inzamam-ul-Haq,Asia/PAK,1991-2007,378,350,53,11739,137*,39.52,15812,74.24,10,83,20,971,144,No,Yes,No
7,JH Kallis,Afr/ICC/SA,1996-2014,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes
8,SC Ganguly,Asia/INDIA,1992-2007,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No
9,R Dravid,Asia/ICC/INDIA,1996-2011,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No


In [None]:
print(df['played_for_ICC'].value_counts())
print(df['played_for_Asia'].value_counts())
print(df['played_for_Africa'].value_counts())

No     81
Yes    11
Name: played_for_ICC, dtype: int64
No     79
Yes    13
Name: played_for_Asia, dtype: int64
No     89
Yes     3
Name: played_for_Africa, dtype: int64
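
The three checker functions above differ only in the text they look for, so the same flags can probably be produced without `apply`. The following is a minimal sketch on toy data, assuming the "Yes"/"No" labels used above; the names `countries`, `is_icc`, and `played_for_icc` are illustrative.

```python
import numpy as np
import pandas as pd

# a few 'Country' values taken from the value_counts() output above
countries = pd.Series(["INDIA", "Asia/ICC/SL", "AUS/ICC", "Afr/ICC/SA"])

# str.contains returns a boolean Series; regex=False matches the text literally
is_icc = countries.str.contains("ICC", regex=False)

# np.where maps True/False to the same "Yes"/"No" labels used above
played_for_icc = pd.Series(np.where(is_icc, "Yes", "No"), index=countries.index)

print(played_for_icc.tolist())
```

Keeping the boolean Series (`is_icc`) instead of "Yes"/"No" strings may be more convenient later, since booleans can be used directly for filtering and counting.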


#### Removing the "ICC", "Asia", and "Afr" tags from the 'Country' column

In [None]:
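# regex=False makes each pattern a literal string, independent of the pandas version's default
df['Country'] = df['Country'].str.replace("ICC/", "", regex=False)
df['Country'] = df['Country'].str.replace("/ICC", "", regex=False)
df['Country'] = df['Country'].str.replace("Asia/", "", regex=False)
df['Country'] = df['Country'].str.replace("Afr/", "", regex=False)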

display(df.head())

Unnamed: 0,Player_Name,Country,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa
0,SR Tendulkar,INDIA,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No
1,KC Sangakkara,SL,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No
2,RT Ponting,AUS,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No
3,ST Jayasuriya,SL,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No
4,DPMD Jayawardene,SL,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No
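
The four replacements above can also be expressed as one regular expression. This is a sketch on toy values copied from the earlier `value_counts()` output, not a replacement for the lecture's code; the variable names and the pattern itself are illustrative.

```python
import pandas as pd

countries = pd.Series(
    ["Asia/ICC/SL", "AUS/ICC", "Afr/ICC/SA", "ICC/WI", "ENG/IRE", "INDIA"]
)

# one pattern covers the four literal replacements:
# a tag followed by "/" ("ICC/", "Asia/", "Afr/") or a trailing "/ICC"
pattern = r"(?:ICC|Asia|Afr)/|/ICC"
cleaned = countries.str.replace(pattern, "", regex=True)

print(cleaned.tolist())
```

Values without any of these tags, such as "ENG/IRE", pass through unchanged, which is why that entry still appears as its own category afterwards.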


#### Checking the unique values for each column

In [None]:
df['Country'].value_counts()

AUS        16
PAK        13
INDIA      13
SA         11
SL         10
WI         10
NZ          6
ZIM         5
ENG         3
BAN         3
ENG/IRE     1
IRE         1
Name: Country, dtype: int64
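
`value_counts()` inspects one column at a time. To get an overview of every column, one possible approach is `DataFrame.nunique()` combined with a loop over the columns. The sketch below uses a hypothetical four-row dataframe named `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["INDIA", "SL", "AUS", "SL"],
    "played_for_ICC": ["No", "Yes", "Yes", "No"],
})

# number of distinct values in every column at once
print(toy.nunique())

# the distinct values themselves, column by column
unique_values = {col: sorted(toy[col].unique()) for col in toy.columns}
print(unique_values)
```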

#### Find number of years played

In [None]:
# df['start_year'] = df['Span'].str[0:4]

# df['end_year'] = df['Span'].str[5:]

# display(df.head(10))

In [None]:
# splitting the 'Span' column based on the "-"
df[['start_year', 'end_year']] = df['Span'].str.split("-", expand=True)

# removing the "Span" column
df = df.drop("Span", axis=1)

display(df.head())

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,start_year,end_year
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,1989,2012
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,2000,2015
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,1995,2012
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,1989,2011
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,1998,2015


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player_Name        92 non-null     object 
 1   Country            92 non-null     object 
 2   Match              92 non-null     int64  
 3   Innings            92 non-null     int64  
 4   NotOut             92 non-null     int64  
 5   Runs               92 non-null     int64  
 6   Highest_score      92 non-null     object 
 7   Average            92 non-null     float64
 8   Balls_Faced        92 non-null     int64  
 9   Strike_Rate        92 non-null     float64
 10  Centuries          92 non-null     int64  
 11  Half_centuries     92 non-null     int64  
 12  Ducks              92 non-null     int64  
 13  Fours              92 non-null     object 
 14  Sixes              92 non-null     object 
 15  played_for_ICC     92 non-null     object 
 16  played_for_Asia    92 non-nu

#### Data type conversion

In [None]:
df['start_year'] = df['start_year'].astype('int') 
df['end_year'] = df['end_year'].astype('int')

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player_Name        92 non-null     object 
 1   Country            92 non-null     object 
 2   Match              92 non-null     int64  
 3   Innings            92 non-null     int64  
 4   NotOut             92 non-null     int64  
 5   Runs               92 non-null     int64  
 6   Highest_score      92 non-null     object 
 7   Average            92 non-null     float64
 8   Balls_Faced        92 non-null     int64  
 9   Strike_Rate        92 non-null     float64
 10  Centuries          92 non-null     int64  
 11  Half_centuries     92 non-null     int64  
 12  Ducks              92 non-null     int64  
 13  Fours              92 non-null     object 
 14  Sixes              92 non-null     object 
 15  played_for_ICC     92 non-null     object 
 16  played_for_Asia    92 non-nu
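
`astype('int')` works for the year columns because every value is a clean number. The `Fours` and `Sixes` columns are listed as `object`, and some of their entries carry a trailing "+" (for example "445+"), so the same call would raise an error there. A hedged sketch of how such values could be located with `pd.to_numeric`, using a toy Series named `fours`:

```python
import pandas as pd

# a mix of integers and text, as in the 'Fours' column
fours = pd.Series([2016, "445+", 603, "712+"])

# errors="coerce" turns anything that is not a number into NaN instead of raising
as_numbers = pd.to_numeric(fours, errors="coerce")

# the rows that became NaN are the ones astype('int') would fail on
problem_rows = fours[as_numbers.isna()]
print(problem_rows.tolist())
```

How to treat the "+" entries (strip the sign, or mark them as missing) is a modelling decision that this sketch leaves open.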

In [None]:
df['years_played'] = df['end_year'] - df['start_year']

df = df.drop(['start_year', "end_year"], axis=1)

display(df.head(10))

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,17
5,V Kohli,INDIA,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No,14
6,Inzamam-ul-Haq,PAK,378,350,53,11739,137*,39.52,15812,74.24,10,83,20,971,144,No,Yes,No,16
7,JH Kallis,SA,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes,18
8,SC Ganguly,INDIA,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No,15
9,R Dravid,INDIA,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No,15
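
The split, conversion, and subtraction steps above can be chained. The following is a compact sketch on two `Span` values from the table; the names `span` and `years` are illustrative.

```python
import pandas as pd

span = pd.Series(["1989-2012", "2000-2015"])

# split into two string columns, then convert both to integers in one call
years = span.str.split("-", expand=True).astype(int)
years.columns = ["start_year", "end_year"]

# difference between the last and the first year of the career
years_played = years["end_year"] - years["start_year"]
print(years_played.tolist())
```

Note that this difference counts the gap between the two years; a span of 1989-2012 touches 24 calendar years but yields 23.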


#### Checking the average 

In [None]:
# df['avg'] = df['Runs']/(df['Innings'] - df['NotOut'])
# df['avg'] = np.round(df['avg'], 2)

# display(df.head(10))
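
The commented-out cell above recomputes the batting average as runs divided by the number of dismissals (`Innings - NotOut`). A small sketch of that check, using two rows copied from the table; the columns `avg` and `matches_published` are hypothetical helper names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player_Name": ["SR Tendulkar", "V Kohli"],
    "Innings": [452, 251],
    "NotOut": [41, 39],
    "Runs": [18426, 12311],
    "Average": [44.83, 58.07],
})

# batting average = runs scored / number of dismissals
dismissals = toy["Innings"] - toy["NotOut"]
toy["avg"] = (toy["Runs"] / dismissals).round(2)

# compare with the published 'Average' column, allowing for rounding
toy["matches_published"] = (toy["avg"] - toy["Average"]).abs() < 0.01
print(toy[["Player_Name", "avg", "matches_published"]])
```

A player who was never dismissed would have zero dismissals, and the division would give `inf`; that case does not occur in these two rows but is worth checking on the full data.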

#### Top 10 batsmen: Highest batting average

In [None]:
df.sort_values(by='Average', ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
5,V Kohli,INDIA,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No,14
45,MG Bevan,AUS,232,196,67,6912,108*,53.58,9320,74.16,6,46,5,450,21,No,No,No,10
16,AB de Villiers,SA,228,218,39,9577,176,53.5,9473,101.09,25,53,7,840,204,No,No,Yes,13
60,JE Root,ENG,152,142,23,6109,133*,51.33,7034,86.84,16,35,5,491,44,No,No,No,8
10,MS Dhoni,INDIA,350,297,84,10773,183*,50.57,12303,87.56,10,73,10,826,229,No,Yes,No,15
28,HM Amla,SA,181,178,14,8113,159,49.46,9178,88.39,27,39,4,822,53,No,No,No,11
19,RG Sharma,INDIA,230,223,32,9283,264,48.6,10428,89.01,29,44,13,845,245,No,No,No,15
24,LRPL Taylor,NZ,233,217,39,8581,181*,48.2,10287,83.41,21,51,9,712,146,No,No,No,15
77,MEK Hussey,AUS,185,157,44,5442,109*,48.15,6243,87.16,3,39,3,383,80,No,No,No,8
58,KS Williamson,NZ,151,144,14,6173,148,47.48,7551,81.75,13,39,5,563,49,No,No,No,10


#### Top 10 batsmen: Highest number of centuries

In [None]:
df.sort_values(by=['Centuries', 'Half_centuries'], ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
5,V Kohli,INDIA,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No,14
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
19,RG Sharma,INDIA,230,223,32,9283,264,48.6,10428,89.01,29,44,13,845,245,No,No,No,15
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
28,HM Amla,SA,181,178,14,8113,159,49.46,9178,88.39,27,39,4,822,53,No,No,No,11
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
11,CH Gayle,WI,301,294,17,10480,215,37.83,12019,87.19,25,54,25,1128,331,Yes,No,No,20
16,AB de Villiers,SA,228,218,39,9577,176,53.5,9473,101.09,25,53,7,840,204,No,No,Yes,13
8,SC Ganguly,INDIA,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No,15


#### Top 10 batsmen: Highest number of half centuries

In [None]:
df.sort_values(by="Half_centuries", ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
7,JH Kallis,SA,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes,18
6,Inzamam-ul-Haq,PAK,378,350,53,11739,137*,39.52,15812,74.24,10,83,20,971,144,No,Yes,No,16
9,R Dravid,INDIA,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No,15
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,17
10,MS Dhoni,INDIA,350,297,84,10773,183*,50.57,12303,87.56,10,73,10,826,229,No,Yes,No,15
8,SC Ganguly,INDIA,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No,15
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22


#### Top 10 batsmen: Highest number of years played

In [None]:
df.sort_values(by="years_played", ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
38,Javed Miandad,PAK,233,218,41,7381,119*,41.7,11014,67.01,8,50,8,445+,44+,No,No,No,21
36,Shoaib Malik,PAK,287,258,40,7534,143,34.55,9199,81.9,9,44,15,603,113,No,No,No,20
11,CH Gayle,WI,301,294,17,10480,215,37.83,12019,87.19,25,54,25,1128,331,Yes,No,No,20
30,Shahid Afridi,PAK,398,369,27,8064,124,23.57,6892,117.0,6,39,30,730,351,Yes,Yes,No,19
18,PA de Silva,SL,308,296,30,9284,145,34.9,11443,81.13,11,64,17,712+,102+,No,No,No,19
73,MN Samuels,WI,207,196,26,5606,133*,32.97,7463,75.11,10,30,11,526,118,No,No,No,18
7,JH Kallis,SA,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes,18
53,GW Flower,ZIM,221,214,18,6571,142*,33.52,9723,67.58,6,40,18,557+,37+,No,No,No,18


#### Top 10 batsmen: Highest number of matches played

In [None]:
df.sort_values(by="Match", ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,17
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
30,Shahid Afridi,PAK,398,369,27,8064,124,23.57,6892,117.0,6,39,30,730,351,Yes,Yes,No,19
6,Inzamam-ul-Haq,PAK,378,350,53,11739,137*,39.52,15812,74.24,10,83,20,971,144,No,Yes,No,16
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
10,MS Dhoni,INDIA,350,297,84,10773,183*,50.57,12303,87.56,10,73,10,826,229,No,Yes,No,15
9,R Dravid,INDIA,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No,15
17,M Azharuddin,INDIA,334,308,54,9378,153*,36.92,12669,74.02,7,58,9,622+,77+,No,No,No,15


In [None]:
display(df.head(10))

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,17
5,V Kohli,INDIA,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No,14
6,Inzamam-ul-Haq,PAK,378,350,53,11739,137*,39.52,15812,74.24,10,83,20,971,144,No,Yes,No,16
7,JH Kallis,SA,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes,18
8,SC Ganguly,INDIA,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No,15
9,R Dravid,INDIA,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No,15


#### Removing the * symbol from the Highest_score column

In [None]:
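def star_remover(x):
    # the column mixes integers and strings, so convert to text first
    x = str(x)
    # a trailing "*" marks a not-out innings; remove it so the score is a plain number
    if "*" in x:
        return x.replace("*", "")
    else:
        return x

df['Highest_score'] = df['Highest_score'].apply(star_remover)
# with the "*" gone, every value can be converted to an integer
df['Highest_score'] = df['Highest_score'].astype('int')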

display(df.head(10))
print(df.info())

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
0,SR Tendulkar,INDIA,463,452,41,18426,200,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
1,KC Sangakkara,SL,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,Yes,Yes,No,15
2,RT Ponting,AUS,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,Yes,No,No,17
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
4,DPMD Jayawardene,SL,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,No,Yes,No,17
5,V Kohli,INDIA,260,251,39,12311,183,58.07,13249,92.92,43,64,15,1153,125,No,No,No,14
6,Inzamam-ul-Haq,PAK,378,350,53,11739,137,39.52,15812,74.24,10,83,20,971,144,No,Yes,No,16
7,JH Kallis,SA,328,314,53,11579,139,44.36,15885,72.89,17,86,17,911,137,Yes,No,Yes,18
8,SC Ganguly,INDIA,311,300,23,11363,183,41.02,15416,73.7,22,72,16,1122,190,No,Yes,No,15
9,R Dravid,INDIA,344,318,40,10889,153,39.16,15285,71.23,12,83,13,950,42,Yes,Yes,No,15


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player_Name        92 non-null     object 
 1   Country            92 non-null     object 
 2   Match              92 non-null     int64  
 3   Innings            92 non-null     int64  
 4   NotOut             92 non-null     int64  
 5   Runs               92 non-null     int64  
 6   Highest_score      92 non-null     int32  
 7   Average            92 non-null     float64
 8   Balls_Faced        92 non-null     int64  
 9   Strike_Rate        92 non-null     float64
 10  Centuries          92 non-null     int64  
 11  Half_centuries     92 non-null     int64  
 12  Ducks              92 non-null     int64  
 13  Fours              92 non-null     object 
 14  Sixes              92 non-null     object 
 15  played_for_ICC     92 non-null     object 
 16  played_for_Asia    92 non-nu

In [None]:
df.sort_values(by="Highest_score", ascending=False).head(10)

Unnamed: 0,Player_Name,Country,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,played_for_ICC,played_for_Asia,played_for_Africa,years_played
19,RG Sharma,INDIA,230,223,32,9283,264,48.6,10428,89.01,29,44,13,845,245,No,No,No,15
44,MJ Guptill,NZ,186,183,19,6927,237,42.23,7896,87.72,16,37,15,702,181,No,No,No,12
27,V Sehwag,INDIA,251,245,9,8273,219,35.05,7929,104.33,15,38,14,1132,136,Yes,Yes,No,14
11,CH Gayle,WI,301,294,17,10480,215,37.83,12019,87.19,25,54,25,1128,331,Yes,No,No,20
0,SR Tendulkar,INDIA,463,452,41,18426,200,44.83,21368,86.23,49,96,20,2016,195,No,No,No,23
20,Saeed Anwar,PAK,247,244,19,8824,194,39.21,10938,80.67,20,43,15,938,97,No,No,No,14
49,IVA Richards,WI,187,167,24,6721,189,47.0,7451,90.2,11,45,7,600+,126+,No,No,No,16
3,ST Jayasuriya,SL,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,No,Yes,No,22
46,G Kirsten,SA,185,185,19,6798,188,40.95,9436,72.04,13,45,11,659,20,No,No,No,10
75,F du Plessis,SA,143,136,20,5507,185,47.47,6215,88.6,12,35,3,495,66,No,No,No,8
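
The star removal can also be written with vectorized string methods. The sketch below additionally keeps the information that the "*" carried before it is dropped. The Series `hs_not_out` is a hypothetical helper and is not part of the lecture's dataframe; the four scores are copied from the earlier tables.

```python
import pandas as pd

# highest scores as they appear before cleaning: text with "*" mixed with integers
highest = pd.Series(["200*", 169, "137*", 264])

# convert everything to text first, because the column mixes strings and integers
as_text = highest.astype(str)

# remember which scores were marked with "*" before the marker is removed
hs_not_out = as_text.str.endswith("*")

# regex=False removes "*" as a literal character rather than a regex quantifier
scores = as_text.str.replace("*", "", regex=False).astype(int)

print(scores.tolist())
print(hs_not_out.tolist())
```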
