# Data Exploration

In this notebook, the data and the columns in the dataset is examined before deciding on the config settings for cleaning the data and creating the database.

In [4]:
import pandas as pd

In [15]:
df = pd.read_csv("csv_files/cricket_data.csv")

for (col, col_type) in zip(df.dtypes.index, df.dtypes):
    print("{col}: {col_type}".format(col=col, col_type=col_type))

ID: int64
NAME: object
COUNTRY: object
Full name: object
Born: object
Died: object
Current age: object
Major teams: object
Education: object
Height: object
Nickname: float64
Playing role: object
Batting style: object
Bowling style: object
Other: object
Relation: object
In a nutshell: object
DESCRIPTION: object
AWARDS: object
BATTING_Tests_Mat: float64
BATTING_Tests_Inns: object
BATTING_Tests_NO: object
BATTING_Tests_Runs: object
BATTING_Tests_HS: object
BATTING_Tests_Ave: object
BATTING_Tests_BF: object
BATTING_Tests_SR: object
BATTING_Tests_100: object
BATTING_Tests_50: object
BATTING_Tests_4s: object
BATTING_Tests_6s: object
BATTING_Tests_Ct: object
BATTING_Tests_St: object
BATTING_ODIs_Mat: float64
BATTING_ODIs_Inns: object
BATTING_ODIs_NO: object
BATTING_ODIs_Runs: object
BATTING_ODIs_HS: object
BATTING_ODIs_Ave: object
BATTING_ODIs_BF: object
BATTING_ODIs_SR: object
BATTING_ODIs_100: object
BATTING_ODIs_50: object
BATTING_ODIs_4s: object
BATTING_ODIs_6s: object
BATTING_ODIs_Ct: ob

In [182]:
# Names of required columns and their types
column_types = {"ID": "int64",
"NAME": "object",
"Born": "object",
"Died": "object",
"Major teams": "object",
"Playing role": "object",
"Batting style": "object",
"Bowling style": "object",
"BATTING_Tests_Runs": "float64",
"BATTING_Tests_BF": "float64",
"BATTING_Tests_SR": "float64",
"BATTING_Tests_100": "float64",
"BATTING_Tests_50": "float64",
"BATTING_Tests_4s": "float64",
"BATTING_Tests_6s": "float64",
"BATTING_ODIs_Runs": "float64",
"BATTING_ODIs_BF": "float64",
"BATTING_ODIs_SR": "float64",
"BATTING_ODIs_100": "float64",
"BATTING_ODIs_50": "float64",
"BATTING_ODIs_4s": "float64",
"BATTING_ODIs_6s": "float64",
"BATTING_T20Is_Runs": "float64",
"BATTING_T20Is_BF": "float64",
"BATTING_T20Is_SR": "float64",
"BATTING_T20Is_100": "float64",
"BATTING_T20Is_50": "float64",
"BATTING_T20Is_4s": "float64",
"BATTING_T20Is_6s": "float64",
"BOWLING_Tests_Balls": "float64",
"BOWLING_Tests_Runs": "float64",
"BOWLING_Tests_Wkts": "float64",
"BOWLING_Tests_Econ": "float64",
"BOWLING_ODIs_Balls": "float64",
"BOWLING_ODIs_Runs": "float64",
"BOWLING_ODIs_Wkts": "float64",
"BOWLING_ODIs_Econ": "float64",
"BOWLING_T20Is_Balls": "float64",
"BOWLING_T20Is_Runs": "float64",
"BOWLING_T20Is_Wkts": "float64",
"BOWLING_T20Is_Econ": "float64"}

In [11]:
df = pd.read_csv("csv_files/cricket_data.csv", 
                 usecols=list(column_types.keys()), 
                 dtype=column_types,
                 na_values=['-', 'NA'])

ValueError: could not convert string to float: '5098+'

From the above error, it is evident that the data types of the columns cannot be forced upon them, since there are a couple of string entries like '5098+' under one of the numeric columns. In that case, the plus sign can be removed after reading the csv files, and then the data types of the columns can be enforced.

In [183]:
df = pd.read_csv("csv_files/cricket_data.csv", 
                 usecols=list(column_types.keys()))
# Just for the purpose of examining the occurrences of NaN,
# since the count() function cannot count the number of NaNs
df.fillna("-", inplace=True) 

The values in some columns need to be checked before deciding on how to form the tables out of this dataset.

In [184]:
df.groupby("Playing role")["Playing role"].count()

Playing role
-                             86590
12th man                         17
Allrounder                      735
Batsman                         369
Batsman, Batsman                  1
Batting allrounder               35
Bowler                         1083
Bowling allrounder               34
Middle-order batsman            297
Opening batsman                 321
Top-order batsman               236
Wicketkeeper                    328
Wicketkeeper batsman            261
Wicketkeeper, Wicketkeeper        1
Name: Playing role, dtype: int64

In [185]:
df.shape[0] - df.groupby("Batting style")["Batting style"].get_group("-").count()

57252

In [186]:
df.shape[0] - df.groupby("Bowling style")["Bowling style"].get_group("-").count()

44085

In [187]:
df.shape

(90308, 41)

In [188]:
for (teams, row) in zip(df.groupby("Major teams")["Major teams"].count().index, df.groupby("Major teams")["Major teams"].count()):
    print("{team} = {count}".format(team=teams, count=row))

- = 6271
A Lyth's XI = 1
A to K = 2
AA Jasdenwala's XI = 1
AA Priestley's XI = 1
AB St Hill's XII = 7
ACB Chairman's XI = 1
ACC = 8
AFS = 23
AJ Webbe's XI = 1
AJK Jaguars = 7
AL Ghurair Construction = 20
AL Naboodah Tigers = 15
AM Wood's XI = 1
AMC Abbottabad = 16
AMCC Ajman = 25
AP Institute of Information Technology = 17
APS = 15
AU Finja = 9
AWR 1 = 15
AWR 2 = 15
AWR 3 = 12
Abahani Limited = 4
Abbottabad = 19
Abbottabad Falcons = 11
Abbottabad Region = 4
Abbottabad Rhinos = 18
Abdul Wali Khan University Nowshera = 11
Abela and Co = 27
Aboriginal & Torres Strait Islander Commission XI = 1
Aboriginal XI = 4
Aboriginal XI Women = 9
Abu Dhabi United = 9
Abu Dhabi XI = 9
Adam Global Dubai = 18
Adelaide Strikers Women = 3
Adnoc Al Ruwais = 10
Affies = 5
Afghan Cheetas = 1
Afghan Rangers Cricket Club = 15
Afghan Wireless Cricket Club = 12
Afghanistan = 27
Afghanistan Disable = 18
Afghanistan Under-17s = 2
Afghanistan Under-19s = 32
Afghanistan Under-25s = 1
Agrani Bank Cricket Club = 17
Ag

['Afghanistan,', 'Afghan Cheetas,', 'Afghanistan Under-19s,', 'Band-e-Amir Region,', 'Boost Region,', 'Montreal Tigers'] = 1
['Afghanistan,', 'Afghan Cheetas,', 'Boost Defenders,', 'Boost Region'] = 1
['Afghanistan,', 'Afghan Cheetas,', 'Boost Region'] = 2
['Afghanistan,', 'Afghan Cheetas,', 'Mis Ainak Region,', 'Mohammedan Sporting Club'] = 1
['Afghanistan,', 'Afghan Cheetas,', 'Peshawar,', 'Peshawar Panthers'] = 1
['Afghanistan,', 'Afghan Wireless Cricket Club'] = 1
['Afghanistan,', 'Afghanistan A,', 'Afghanistan Under-19s,', 'Amo Region,', 'Balkh Province,', 'Logar Province,', 'Mis Ainak Region'] = 1
['Afghanistan,', 'Afghanistan A,', 'Afghanistan Under-19s,', 'Amo Region,', 'Band-e-Amir Region,', 'Speen Ghar Region'] = 1
['Afghanistan,', 'Afghanistan A,', 'Afghanistan Under-19s,', 'Band-e-Amir Region,', 'Kabul Eagles,', 'Kabul Green,', 'Speen Ghar Region'] = 1
['Afghanistan,', 'Afghanistan A,', 'Afghanistan Under-19s,', 'Band-e-Amir Region,', 'Mis Ainak Knights'] = 1
['Afghanistan,

['Border,', 'Eastern Province,', 'Griqualand West'] = 4
['Border,', 'Eastern Province,', 'Lancashire,', 'New South Wales'] = 1
['Border,', 'Eastern Province,', 'Marylebone Cricket Club'] = 1
['Border,', 'Eastern Province,', 'Marylebone Cricket Club,', 'North West'] = 1
['Border,', 'Eastern Province,', 'Northern Transvaal,', 'Nottinghamshire'] = 1
['Border,', 'Eastern Province,', 'Northern Transvaal,', 'Transvaal'] = 1
['Border,', 'Eastern Province,', 'Orange Free State'] = 4
['Border,', 'Eastern Province,', 'South Africa Schools XI'] = 1
['Border,', 'Eastern Province,', 'South African Universities,', 'Transvaal,', 'Western Province'] = 1
['Border,', 'Eastern Province,', 'Transvaal'] = 1
['Border,', 'Eastern Province,', 'University Sports South Africa XI,', 'Warriors'] = 1
['Border,', 'Eastern Province,', 'Western Province'] = 2
['Border,', 'Easterns Under-15s'] = 1
['Border,', 'Easterns,', 'Transvaal'] = 1
['Border,', 'Free State,', 'Orange Free State'] = 1
['Border,', 'Gauteng Under-1

['Devon,', 'Dorset,', 'Oxford University'] = 1
['Devon,', 'Durham MCCU'] = 1
['Devon,', 'England Under-19s,', 'Gloucestershire,', 'Marylebone Cricket Club,', 'Marylebone Cricket Club Young Cricketers,', 'Somerset,', 'Somerset 2nd XI'] = 1
['Devon,', 'England Under-19s,', 'Somerset,', 'Somerset 2nd XI'] = 1
['Devon,', 'England Under-19s,', 'Somerset,', 'Somerset 2nd XI,', 'Surrey,', 'Surrey 2nd XI'] = 1
['Devon,', 'Essex 2nd XI,', 'Minor Counties,', 'Warwickshire,', 'Warwickshire 2nd XI'] = 1
['Devon,', 'Essex 2nd XI,', 'Surrey 2nd XI'] = 1
['Devon,', 'Exeter University,', 'Kent 2nd XI,', 'Sussex 2nd XI'] = 1
['Devon,', 'Faisalabad,', 'Water and Power Development Authority'] = 1
['Devon,', 'Glamorgan'] = 1
['Devon,', 'Glamorgan,', 'Minor Counties'] = 1
['Devon,', 'Glamorgan,', 'Wales Minor Counties'] = 1
['Devon,', 'Gloucestershire Cricket Board'] = 1
['Devon,', 'Gloucestershire'] = 2
['Devon,', 'Great Britain'] = 1
['Devon,', 'Hampshire,', 'Hampshire 2nd XI,', 'Somerset'] = 1
['Devon,'

['India,', 'Baroda,', 'Central India,', 'Maharashtra'] = 1
['India,', 'Baroda,', 'Chennai Super Kings,', 'Delhi Daredevils,', 'Kings XI Punjab,', 'Middlesex,', 'Rising Pune Supergiants,', 'Sunrisers Hyderabad'] = 1
['India,', 'Baroda,', 'Chennai Super Kings,', 'Hyderabad (India),', 'Hyderabad Heroes,', 'ICL India XI,', 'India A,', 'Mumbai Indians'] = 1
['India,', 'Baroda,', 'Delhi,', 'Durham,', 'Punjab,', 'Wiltshire'] = 1
['India,', 'Baroda,', 'Gujarat'] = 1
['India,', 'Baroda,', 'Gujarat,', 'Hindus,', 'Services'] = 1
['India,', 'Baroda,', 'Gujarat,', 'Sind,', 'Western India'] = 1
['India,', 'Baroda,', 'Hindus,', 'Maharashtra'] = 1
['India,', 'Baroda,', 'Hindus,', 'Maharashtra,', 'Mumbai'] = 2
['India,', 'Baroda,', 'India A,', 'India Green,', 'Kolkata Knight Riders,', 'Rajasthan Royals,', 'Sunrisers Hyderabad'] = 1
['India,', 'Baroda,', 'India A,', 'Mumbai Indians'] = 1
['India,', 'Baroda,', 'Maharashtra'] = 1
['India,', 'Baroda,', 'Tamil Nadu'] = 1
['India,', 'Bedfordshire,', 'Karnata

['Moratuwa Sports Club,', 'Rio Sports Club'] = 1
['Moratuwa Sports Club,', 'Saracens Sports Club,', 'Sebastianites Cricket and Athletic Club'] = 1
['Moratuwa Sports Club,', 'Sebastianites Cricket and Athletic Club'] = 10
['Moratuwa Sports Club,', 'Sebastianites Cricket and Athletic Club,', 'Seeduwa Raddoluwa Cricket Club,', 'Sri Lanka Air Force Sports Club,', 'Sri Lanka Under-17s'] = 1
['Moratuwa Sports Club,', 'Seeduwa Raddoluwa Cricket Club'] = 1
['Moratuwa Sports Club,', 'Singha Sports Club,', 'Sri Lanka Air Force Sports Club'] = 1
['Moratuwa Sports Club,', 'Sri Lanka Air Force Sports Club'] = 1
['Moratuwa Sports Club,', 'Sri Lanka Army Sports Club'] = 1
['Moratuwa Sports Club,', 'St. Sebastians College'] = 1
['Moratuwa Sports Club,', 'Tamil Union Cricket and Athletic Club'] = 2
['Mosman,', 'New South Wales Country'] = 1
['Mosman,', 'North Sydney'] = 1
['Mount Lawley,', 'Western Australia Under-19s'] = 2
['Mountaineers B,', 'Zimbabwe Under-19s'] = 1
['Mountaineers,', 'Mountaineers B

['Sri Lanka,', 'Basnahira South,', 'Combined Provinces,', 'Cricket Coaching School,', 'Sri Lanka A,', 'Tamil Union Cricket and Athletic Club'] = 1
['Sri Lanka,', 'Basnahira South,', 'Deccan Chargers,', 'Khulna Division,', 'Sinhalese Sports Club'] = 1
['Sri Lanka,', 'Basnahira South,', 'Galle Cricket Club,', 'Gloucestershire,', 'Kalutara Town Club,', 'Kent,', 'Nondescripts Cricket Club,', 'Ragama Cricket Club,', 'Tamil Union Cricket and Athletic Club'] = 1
['Sri Lanka,', 'Basnahira South,', 'Kandurata Maroons,', 'Prime Bank Cricket Club,', 'Ragama Cricket Club,', 'Sri Lanka A,', "Sri Lanka Board President's XI,", "Sri Lanka Board President's XI,", 'Sri Lanka Cricket Development XI,', 'Sri Lanka Under-19s'] = 1
['Sri Lanka,', 'Basnahira South,', 'Matara Sports Club,', 'Sri Lanka A,', 'Tamil Union Cricket and Athletic Club'] = 1
['Sri Lanka,', 'Basnahira,', 'Royal College,', 'Sinhalese Sports Club,', "Sri Lanka Board President's XI,", 'Sri Lanka Cricket Combined XI,', 'Sri Lanka Developme

In [189]:
# List of teams needed
teams = ["India",
        "Australia",
        "South Africa",
        "New Zealand",
        "Sri Lanka",
        "England",
        "Pakistan",
        "West Indies",
        "Bangladesh",
        "Zimbawbe",
        "Afghanistan",
        "Ireland",
        "Scotland",
        "Hong Kong",
        "Kenya",
        "Oman",
        "India Under-19s",
        "Australia Under-19s",
        "South Africa Under-19s",
        "New Zealand Under-19s",
        "Sri Lanka Under-19s",
        "England Under-19s",
        "Pakistan Under-19s",
        "West Indies Under-19s",
        "Bangladesh Under-19s",
        "Zimbawbe Under-19s",
        "Afghanistan Under-19s",
        "Ireland Under-19s",
        "Scotland Under-19s",
        "Hong Kong Under-19s",
        "Kenya Under-19s",
        "Oman Under-19s",
        "India Women",
        "Australia Women",
        "South Africa Women",
        "New Zealand Women",
        "Sri Lanka Women",
        "England Women",
        "Pakistan Women",
        "West Indies Women",
        "Bangladesh Women",
        "Zimbawbe Women",
        "Afghanistan Women",
        "Ireland Women",
        "Scotland Women",
        "Hong Kong Women",
        "Kenya Women",
        "Oman Women"]

In [190]:
df = df.loc[df["Major teams"].isin(teams)]

In [191]:
df.shape

(2304, 41)

In [193]:
df.groupby("Playing role")["Playing role"].count()

Playing role
-                       2150
Allrounder                31
Batsman                   18
Bowler                    52
Bowling allrounder         2
Middle-order batsman      11
Opening batsman           13
Top-order batsman          7
Wicketkeeper              11
Wicketkeeper batsman       9
Name: Playing role, dtype: int64

In [194]:
# People with only bowling records in ODI
df.loc[((df["BATTING_ODIs_Runs"] == "-") | (df["BATTING_ODIs_Runs"] == 0)) & ((df["BOWLING_ODIs_Balls"] != "-") | (df["BOWLING_ODIs_Balls"] != 0))].shape

(2007, 41)

In [195]:
# People with only batting records in ODI
df.loc[((df["BATTING_ODIs_Runs"] != "-") | (df["BATTING_ODIs_Runs"] != 0)) & ((df["BOWLING_ODIs_Balls"] == "-") | (df["BOWLING_ODIs_Balls"] == 0))].shape

(2104, 41)

In [196]:
df["Died"].head(20)

23                                                     -
30                                                     -
34                                                     -
41                                                     -
218                                                    -
332                                                    -
334                                                    -
335                                                    -
336                                                    -
337                                                    -
338                                                    -
347                                                    -
493                                                    -
523                                                    -
627                                                    -
676                                                    -
688                                                    -
722    June 2, 1853, Tortworth,

In [197]:
df["Born"].head(20)

23                 \nNovember 28, 1991, Pakistan 
30                            \nOctober 18, 1988 
34      \nApril 22, 1937, Falkirk, Stirlingshire 
41      \nJuly 22, 1979, Karachi, Sind, Pakistan 
218                           \nOctober 24, 1987 
332                               \ndate unknown 
334                           \nOctober 30, 1985 
335                          \nDecember 12, 1995 
336                              \nJuly 22, 2000 
337                               \ndate unknown 
338                          \nDecember 26, 1998 
347                 \nOctober 10, 1987, Pakistan 
493                 \nDecember 26, 1996, Chakwal 
523                 \nSeptember 30, 1962, Lahore 
627           \nDecember 20, 1989, Karachi, Sind 
676    \nSeptember 6, 1995, Jalgaon, Maharashtra 
688                               \ndate unknown 
722           \nMay 8, 1802, Westminster, London 
872     \nJuly 8, 2001, Birmingham, Warwickshire 
935                               \ndate unknown 


In [198]:
df["Born"].tail(20)

89791                                      \ndate unknown 
89812               \nOctober 10, 2000, Trinidad & Tobago 
89856                         \nDecember 2, 1998, Jamaica 
89886                         \nJanuary 30, 1966, Jamaica 
89952                            \nJune 15, 1962, Grenada 
89955           \nNovember 5, 1961, Sandy Point, St Kitts 
89982    \nSeptember 23, 1983, New Amsterdam, West Bank...
90056                                      \ndate unknown 
90067                                      \ndate unknown 
90086                           \nMarch 28, 2000, Antigua 
90098                                      \ndate unknown 
90100               \nOctober 22, 1962, Trinidad & Tobago 
90127                                      \ndate unknown 
90146                          \nMarch 16, 1989, Barbados 
90173                                      \ndate unknown 
90231                       \nSeptember 1, 1989, Barbados 
90233               \nJuly 3, 1997, Five Rivers, Trinida

In [152]:
from dateutil.parser import parse

#for val in df["Born"].drop_duplicates():
#    print(val)

df["Born"] = df["Born"].apply(lambda x: '{dd}-{mm}-{yyyy}'.format(dd=parse(x, fuzzy=True).day, mm=parse(x, fuzzy=True).month, yyyy=parse(x, fuzzy=True).year if not x in ['\ndate unknown', '-', 'date unknown'] else x))

ParserError: String does not contain a date: 
date unknown 

In [202]:
import numpy as np

df[["BATTING_Tests_Runs",
    "BATTING_Tests_BF",
    "BATTING_Tests_SR",
    "BATTING_Tests_100",
    "BATTING_Tests_50",
    "BATTING_Tests_4s",
    "BATTING_Tests_6s",
    "BATTING_ODIs_Runs",
    "BATTING_ODIs_BF",
    "BATTING_ODIs_SR",
    "BATTING_ODIs_100",
    "BATTING_ODIs_50",
    "BATTING_ODIs_4s",
    "BATTING_ODIs_6s",
    "BATTING_T20Is_Runs",
    "BATTING_T20Is_BF",
    "BATTING_T20Is_SR",
    "BATTING_T20Is_100",
    "BATTING_T20Is_50",
    "BATTING_T20Is_4s",
    "BATTING_T20Is_6s",
    "BOWLING_Tests_Balls",
    "BOWLING_Tests_Runs",
    "BOWLING_Tests_Wkts",
    "BOWLING_Tests_Econ",
    "BOWLING_ODIs_Balls",
    "BOWLING_ODIs_Runs",
    "BOWLING_ODIs_Wkts",
    "BOWLING_ODIs_Econ",
    "BOWLING_T20Is_Balls",
    "BOWLING_T20Is_Runs",
    "BOWLING_T20Is_Wkts",
    "BOWLING_T20Is_Econ"]] = df[["BATTING_Tests_Runs",
                                "BATTING_Tests_BF",
                                "BATTING_Tests_SR",
                                "BATTING_Tests_100",
                                "BATTING_Tests_50",
                                "BATTING_Tests_4s",
                                "BATTING_Tests_6s",
                                "BATTING_ODIs_Runs",
                                "BATTING_ODIs_BF",
                                "BATTING_ODIs_SR",
                                "BATTING_ODIs_100",
                                "BATTING_ODIs_50",
                                "BATTING_ODIs_4s",
                                "BATTING_ODIs_6s",
                                "BATTING_T20Is_Runs",
                                "BATTING_T20Is_BF",
                                "BATTING_T20Is_SR",
                                "BATTING_T20Is_100",
                                "BATTING_T20Is_50",
                                "BATTING_T20Is_4s",
                                "BATTING_T20Is_6s",
                                "BOWLING_Tests_Balls",
                                "BOWLING_Tests_Runs",
                                "BOWLING_Tests_Wkts",
                                "BOWLING_Tests_Econ",
                                "BOWLING_ODIs_Balls",
                                "BOWLING_ODIs_Runs",
                                "BOWLING_ODIs_Wkts",
                                "BOWLING_ODIs_Econ",
                                "BOWLING_T20Is_Balls",
                                "BOWLING_T20Is_Runs",
                                "BOWLING_T20Is_Wkts",
                                "BOWLING_T20Is_Econ"]].replace("-", np.nan)


df = df.astype(column_types)

#df[["BATTING_Tests_Runs"]].drop_duplicates().loc[type(df["BATTING_Tests_Runs"]) == "str"]

In [203]:
test_df = df[["NAME", "Major teams"]].groupby(["NAME"]).count()
test_df = test_df.loc[test_df["Major teams"] > 1]
test_df.index

people = list(test_df.index)
people

['Aimee Maria',
 'Alan Spiers',
 'David Brown',
 'David Simpson',
 'Greg Francois',
 'Harold Sheppard',
 'James Graham',
 'James Jones',
 'James Mitchell',
 'Kapila Perera',
 'Michael Williams',
 'Mohammad Asif',
 'Munir Khan',
 'Peter Thompson',
 'Riswan Farouq',
 'Robert Wilson',
 'Ross Mitchinson',
 'Russell Emmins',
 'Ryan Watson']

In [204]:
df[["NAME", "Born", "Major teams", "BOWLING_ODIs_Balls", "BOWLING_ODIs_Econ", "BATTING_ODIs_Runs", "BATTING_ODIs_BF"]].loc[df["NAME"].isin(people)]

Unnamed: 0,NAME,Born,Major teams,BOWLING_ODIs_Balls,BOWLING_ODIs_Econ,BATTING_ODIs_Runs,BATTING_ODIs_BF
9597,Mohammad Asif,"\nJanuary 1, 1970, Karachi, Pakistan",Oman,,,,
10700,Michael Williams,"\nSeptember 14, 1910, Ireland",Ireland,,,,
10777,David Brown,"\nJune 14, 1941, Insch, Aberdeenshire",Scotland,,,,
10778,David Brown,"\nDecember 24, 1849, Dunfermline, Fife",Scotland,,,,
10779,David Brown,"\nJuly 29, 1900, Dunfermline, Fife",Scotland,,,,
10943,James Graham,"\nNovember 2, 1874, Ayr",Scotland,,,,
11011,James Jones,"\nJanuary 9, 1911, Larbert, Stirlingshire",Scotland,,,,
11012,James Jones,"\nAugust 19, 1910, Larbert, Stirlingshire",Scotland,,,,
11079,Ross Mitchinson,"\nApril 14, 1978, Kirkaldy, Fifeshire",Scotland Under-19s,,,,
11080,Ross Mitchinson,"\nApril 14, 1978, Scotland",Scotland Under-19s,,,,


The dataset has some identical sets of player names, who may or may not correspond to the same person (as is evident from the values of the "Born" and "Major teams" fields). This would have required appending an extra indicator (year of birth, team name) at the end of these player's names in order to be able to distinguish them.

However, all these players (except for one) seem to have missing values in all batting and bowling stats columns, so the rows with identical names will anyway get eliminated once the dropna() function is applied to the dataframe to get rid of entries that do not have the required details.
Hence, no extra logic will be applied in trying to distinguish these names.

In [205]:
df["sport"] = "Cricket"

In [207]:
df.groupby(["NAME", "Major teams"])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x783111ba2ba8>