# Data Cleaning in Pandas 2

In this notebook, I analyze ODI cricket batting data obtained from Cricinfo.

#### Task
1. How many rows and columns are present in this dataset? 

2. Are there any missing values present in this dataset? If so, in which columns? 

3. What are the data types in this dataset? 

5. Rename the columns as follows: 'Mat' to 'Match', 'Inns' to 'Innings', 'NO' to 'NotOut', 'HS' to 'Highest_score', 'Ave' to 'Average', 'BF' to 'Balls_Faced', 'SR' to 'Strike_Rate'.

6. Remove the columns: 'BF', '0', '4s', and '6s'.

7. Show the top 10 batsmen with the highest batting average. If players have the same average, order them by the highest number of centuries. Is there any Bangladeshi player in the top 10?

8. Which player(s) had played for the longest and the shortest period of time in this dataset?

9. Based on the country column, how many players played for "Asia XI"?

10. Save the cleaned file in a csv file named "batsmen".


#### My Approach

To answer the questions above, I first wrote a function that does all of the data wrangling. I then used the cleaned DataFrame it returns to answer each question.

In [54]:
# importing necessary libraries
import pandas as pd

In [55]:
# Import Data
def data_prep(filename):
    
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filename)
    
    # Rename the columns ('Inns' is left as-is; 'BF' is dropped below)
    df = df.rename(columns={"Mat": "Match", "NO": "NotOut", "HS": "Highest_score",
                            "Ave": "Average", "SR": "Strike_Rate"})

    # Split the Span column to calculate years played
    df[['Start_career', 'End_career']] = df['Span'].str.split("-", expand=True).astype(int)
    df['Years_active'] = df['End_career'] - df['Start_career']
    
    # Creating columns for Name and Teams/Country represented 
    
    df[['Player', 'Country']] = df['Player'].str.split("(", expand=True)
    df['Country'] = df["Country"].str.replace(")","", regex=False)
    df[["Team 1", "Team 2", "Team 3"]] = df['Country'].str.split("/", expand=True).fillna("None")
    
    # dropping the columns
    df.drop(columns = ['0', '4s', '6s', 'Span', 'Country', 'BF'], inplace=True)
   
    
    
    return df
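
The two string splits inside `data_prep` can be tried in isolation. The following is a minimal sketch on a made-up two-row sample; the raw `Player` format ("Name (TEAM/TEAM)") is inferred from the splitting code above rather than taken from the file, and the `.str.strip()` and `reindex` calls are additions for the sketch, not part of the original function.

```python
import pandas as pd

# Hypothetical two-row sample; the raw "Player" format is an assumption
sample = pd.DataFrame({
    "Player": ["SR Tendulkar (INDIA)", "KC Sangakkara (Asia/ICC/SL)"],
    "Span": ["1989-2012", "2000-2015"],
})

# "1989-2012" -> two integer columns, then the difference in years
sample[["Start_career", "End_career"]] = (
    sample["Span"].str.split("-", expand=True).astype(int)
)
sample["Years_active"] = sample["End_career"] - sample["Start_career"]

# "Name (A/B/C)" -> the name and a slash-separated team string
parts = sample["Player"].str.split("(", expand=True)
sample["Player"] = parts[0].str.strip()  # drop the space left before "("
teams = parts[1].str.replace(")", "", regex=False).str.split("/", expand=True)

# Pad to three columns and fill empty slots with the placeholder string "None"
teams = teams.reindex(columns=range(3)).fillna("None")
sample[["Team 1", "Team 2", "Team 3"]] = teams

print(sample.drop(columns="Span"))
```

Note that `fillna("None")` stores the text "None", not a missing value, so these slots count as non-null later on.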

In [56]:
# Preview the cleaned data
df = data_prep('batsman.csv')
df.head()

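| | Player | Match | Inns | NotOut | Runs | Highest_score | Average | Strike_Rate | 100 | 50 | Start_career | End_career | Years_active | Team 1 | Team 2 | Team 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SR Tendulkar | 463 | 452 | 41 | 18426 | 200* | 44.83 | 86.23 | 49 | 96 | 1989 | 2012 | 23 | INDIA | None | None |
| 1 | KC Sangakkara | 404 | 380 | 41 | 14234 | 169 | 41.98 | 78.86 | 25 | 93 | 2000 | 2015 | 15 | Asia | ICC | SL |
| 2 | RT Ponting | 375 | 365 | 39 | 13704 | 164 | 42.03 | 80.39 | 30 | 82 | 1995 | 2012 | 17 | AUS | ICC | None |
| 3 | ST Jayasuriya | 445 | 433 | 18 | 13430 | 189 | 32.36 | 91.2 | 28 | 68 | 1989 | 2011 | 22 | Asia | SL | None |
| 4 | DPMD Jayawardene | 448 | 418 | 39 | 12650 | 144 | 33.37 | 78.96 | 19 | 77 | 1998 | 2015 | 17 | Asia | SL | None |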


In [57]:
# Check the basic information of the data (df.info() prints its report and returns None)
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Player         92 non-null     object 
 1   Match          92 non-null     int64  
 2   Inns           92 non-null     int64  
 3   NotOut         92 non-null     int64  
 4   Runs           92 non-null     int64  
 5   Highest_score  92 non-null     object 
 6   Average        92 non-null     float64
 7   Strike_Rate    92 non-null     float64
 8   100            92 non-null     int64  
 9   50             92 non-null     int64  
 10  Start_career   92 non-null     int32  
 11  End_career     92 non-null     int32  
 12  Years_active   92 non-null     int32  
 13  Team 1         92 non-null     object 
 14  Team 2         92 non-null     object 
 15  Team 3         92 non-null     object 
dtypes: float64(2), int32(3), int64(6), object(5)
memory usage: 10.5+ KB


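After adding a few derived columns, the cleaned DataFrame has 92 rows and 16 columns. No column contains missing values; note that empty team slots in 'Team 2' and 'Team 3' were filled with the placeholder string "None" inside `data_prep`, so they count as non-null. The data types are object (strings), float64, and integer (int64 and int32). 'Highest_score' is stored as object because not-out scores carry an asterisk, for example "200*".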
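
The first three tasks can also be answered directly with `shape`, `isna()` and `dtypes`. This is a small sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, including the deliberately missing average), not output from the real data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame, with one missing value
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "Average": [44.83, None, 42.03],
    "Runs": [18426, 14234, 13704],
})

n_rows, n_cols = toy.shape                  # task 1: rows and columns
missing = toy.isna().sum()                  # task 2: missing values per column
cols_with_missing = missing[missing > 0].index.tolist()
kinds = toy.dtypes                          # task 3: dtype of each column

print(n_rows, n_cols)
print(cols_with_missing)
print(kinds)
```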

In [58]:
# The command adds up the null values in each column.
df.isnull().sum()

Player           0
Match            0
Inns             0
NotOut           0
Runs             0
Highest_score    0
Average          0
Strike_Rate      0
100              0
50               0
Start_career     0
End_career       0
Years_active     0
Team 1           0
Team 2           0
Team 3           0
dtype: int64

In [59]:
df[['Player', 'Years_active']].sort_values(by='Years_active', ascending=False).head(1)

| | Player | Years_active |
|---|---|---|
| 0 | SR Tendulkar | 23 |


Sachin Tendulkar played for the longest period in this dataset: 23 years (1989 to 2012).

In [60]:
df[['Player', 'Years_active']].sort_values(by='Years_active', ascending=False).tail(1)

| | Player | Years_active |
|---|---|---|
| 82 | AJ Finch | 7 |


AJ Finch has the shortest career in this dataset, at 7 years.
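
Because `head(1)` and `tail(1)` each return a single row, any players tied for the longest or shortest career would not show up. One way to list every tied player is to compare against the maximum and minimum directly. A minimal sketch on made-up data (the `careers` frame and player labels are hypothetical):

```python
import pandas as pd

# Hypothetical sample; only the column names match the notebook
careers = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Years_active": [23, 7, 23, 12],
})

# Keep every row that equals the maximum (or minimum), so ties are preserved
longest = careers[careers["Years_active"] == careers["Years_active"].max()]
shortest = careers[careers["Years_active"] == careers["Years_active"].min()]

print(longest)
print(shortest)
```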

In [61]:
df[['Player', 'Team 1', 'Team 2','Team 3', 'Average']].sort_values(by='Average', ascending=False).head(10)

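| | Player | Team 1 | Team 2 | Team 3 | Average |
|---|---|---|---|---|---|
| 5 | V Kohli | INDIA | None | None | 58.07 |
| 45 | MG Bevan | AUS | None | None | 53.58 |
| 16 | AB de Villiers | Afr | SA | None | 53.5 |
| 60 | JE Root | ENG | None | None | 51.33 |
| 10 | MS Dhoni | Asia | INDIA | None | 50.57 |
| 28 | HM Amla | SA | None | None | 49.46 |
| 19 | RG Sharma | INDIA | None | None | 48.6 |
| 24 | LRPL Taylor | NZ | None | None | 48.2 |
| 77 | MEK Hussey | AUS | None | None | 48.15 |
| 58 | KS Williamson | NZ | None | None | 47.48 |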


These are the top 10 batsmen by batting average. No Bangladeshi batsman appears in the top 10.
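
The task also asks for ties on average to be broken by the number of centuries. The ten averages above are all distinct, so the ordering shown would stay the same, but the tie-break can be made explicit by sorting on two columns. A sketch on made-up data follows; the team label `"BDESH"` for Bangladesh is an assumption and should be checked against the labels actually present in the file.

```python
import pandas as pd

# Hypothetical sample; values are made up, only the column names match the notebook
batting = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Team 1": ["AUS", "BDESH", "Asia", "NZ"],
    "Team 2": ["None", "None", "INDIA", "None"],
    "Average": [50.0, 48.2, 50.0, 41.5],
    "100": [10, 5, 20, 8],
})

# Sort by average first, then by centuries, both descending
top = batting.sort_values(by=["Average", "100"], ascending=[False, False]).head(3)

# Assumed label for Bangladesh; verify against the real team values
bangladesh_label = "BDESH"
team_cols = ["Team 1", "Team 2"]
has_bangladeshi = bool(top[team_cols].eq(bangladesh_label).any(axis=1).any())

print(top[["Player", "Average", "100"]])
print(has_bangladeshi)
```

Here P1 and P3 share an average of 50.0, and P3 is ranked first because of the higher century count.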

To find out how many players have played for Asia XI, check each of the three team columns for the value "Asia".

In [62]:
(df['Team 1'] == "Asia").value_counts()

Team 1
False    79
True     13
Name: count, dtype: int64

In [63]:
(df['Team 2'] == "Asia").value_counts()

Team 2
False    92
Name: count, dtype: int64

In [64]:
(df['Team 3'] == "Asia").value_counts()

Team 3
False    92
Name: count, dtype: int64

"Asia" appears only in the 'Team 1' column, so 13 players in this dataset have played for Asia XI.
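
The three separate checks can be combined into one expression that tests all team columns at once. A minimal sketch on made-up rows (the `teams` frame is hypothetical; the column names and the "Asia" label follow the notebook):

```python
import pandas as pd

# Hypothetical sample using the same three team columns
teams = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "Team 1": ["Asia", "AUS", "Afr"],
    "Team 2": ["INDIA", "ICC", "Asia"],
    "Team 3": ["None", "None", "SA"],
})

team_cols = ["Team 1", "Team 2", "Team 3"]

# True for a row if any of its team columns equals "Asia"
played_for_asia = teams[team_cols].eq("Asia").any(axis=1)
asia_count = int(played_for_asia.sum())

print(asia_count)
```

This counts each player once, regardless of which column the label lands in.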

In [65]:
# Save the cleaned data as 'batsmen.csv' (index=False leaves out the row index)
df.to_csv('batsmen.csv', index=False)
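
Passing `index=False` matters when the file is read back: without it, the row index is written as an unnamed first column and reappears as "Unnamed: 0". The sketch below shows the difference using in-memory buffers instead of real files; the `frame` DataFrame is a made-up stand-in.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
frame = pd.DataFrame({"Player": ["P1", "P2"], "Average": [50.0, 48.2]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```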
