
# Lecture 02: Data Preprocessing with Pandas (Part 1)

Instructor: **Md Shahidullah Kawsar**
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- Challenges of reading a CSV or Excel file
- Choose columns by name when reading a CSV file
- Choose columns by position when reading a CSV file
- Read only the first n rows

#### References:
[1] Data source: https://stats.espncricinfo.com/ci/content/records/83548.html and https://stats.espncricinfo.com/ci/content/records/283193.html
<br>[2] Reading a CSV file in pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
<br>[3] Reading an Excel file in pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

#### Import required libraries

In [None]:
import pandas as pd

#### Choose columns by name when reading a CSV file

In [None]:
col_names = ['Player', 'Mat', 'Runs', '100']
df_usecols = pd.read_csv("batsman_most_runs_ODI.csv", usecols=col_names)

display(df_usecols.head(10))

Unnamed: 0,Player,Mat,Runs,100
0,SR Tendulkar (INDIA),463,18426,49
1,KC Sangakkara (Asia/ICC/SL),404,14234,25
2,RT Ponting (AUS/ICC),375,13704,30
3,ST Jayasuriya (Asia/SL),445,13430,28
4,DPMD Jayawardene (Asia/SL),448,12650,19
5,V Kohli (INDIA),260,12311,43
6,Inzamam-ul-Haq (Asia/PAK),378,11739,10
7,JH Kallis (Afr/ICC/SA),328,11579,17
8,SC Ganguly (Asia/INDIA),311,11363,22
9,R Dravid (Asia/ICC/INDIA),344,10889,12
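
The same idea can be tried without the course file. The sketch below is only an illustration: `csv_text` is a two-row excerpt of the table above held in memory, and the names `wanted`, `df_small`, and `df_ordered` are hypothetical. It also shows that `usecols` only filters columns: the result follows the order in the file, not the order of the list.

```python
import io
import pandas as pd

# Two-row excerpt of the ODI batting table, kept in memory for illustration
csv_text = """Player,Span,Mat,Runs,100
SR Tendulkar (INDIA),1989-2012,463,18426,49
RT Ponting (AUS/ICC),1995-2012,375,13704,30
"""

# Ask for the columns in a different order than they appear in the file
wanted = ["Runs", "Player", "100"]
df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# usecols only filters: the columns come back in file order
print(list(df_small.columns))

# Index with the list afterwards to get the requested order
df_ordered = df_small[wanted]
print(list(df_ordered.columns))
```

Note that headers read from a CSV file are strings, so the centuries column is selected as `'100'`, not `100`.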


#### Choose columns by position when reading a CSV file

In [None]:
col_nums = [0, 2, 5, 10]
df_usecols_index = pd.read_csv("batsman_most_runs_ODI.csv", usecols=col_nums)

display(df_usecols_index.head(10))
print(df_usecols_index.shape)

Unnamed: 0,Player,Mat,Runs,100
0,SR Tendulkar (INDIA),463,18426,49
1,KC Sangakkara (Asia/ICC/SL),404,14234,25
2,RT Ponting (AUS/ICC),375,13704,30
3,ST Jayasuriya (Asia/SL),445,13430,28
4,DPMD Jayawardene (Asia/SL),448,12650,19
5,V Kohli (INDIA),260,12311,43
6,Inzamam-ul-Haq (Asia/PAK),378,11739,10
7,JH Kallis (Afr/ICC/SA),328,11579,17
8,SC Ganguly (Asia/INDIA),311,11363,22
9,R Dravid (Asia/ICC/INDIA),344,10889,12


(92, 4)
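
Column positions are counted from 0 in file order. As a small check, assuming an in-memory excerpt of the first six columns (`csv_text`, `by_pos`, and `by_name` are illustrative names), selecting by position and selecting by name should give the same DataFrame:

```python
import io
import pandas as pd

# Excerpt with the first six columns of the ODI batting table
csv_text = """Player,Span,Mat,Inns,NO,Runs
SR Tendulkar (INDIA),1989-2012,463,452,41,18426
KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234
"""

# Positions 0, 2, 5 correspond to Player, Mat, Runs
by_pos = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2, 5])
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Player", "Mat", "Runs"])

print(by_pos.equals(by_name))
print(by_pos.shape)
```

Selecting by name is usually easier to read; positions help when the header is missing or the names are awkward.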


#### Reading only the first n rows

In [None]:
df = pd.read_csv("batsman_most_runs_ODI.csv", nrows=50)

display(df.head())
print(df.shape)

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,4s,6s
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76


(50, 15)
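
`nrows` can be combined with `skiprows` to read a later slice of the file. The following is a minimal sketch on an in-memory excerpt (the names `csv_text`, `first_two`, and `next_two` are illustrative); `skiprows=range(1, 3)` skips file lines 1 and 2, which are the first two data rows, while keeping the header on line 0.

```python
import io
import pandas as pd

# Five-row excerpt (Player and Runs only) of the ODI batting table
csv_text = """Player,Runs
SR Tendulkar (INDIA),18426
KC Sangakkara (Asia/ICC/SL),14234
RT Ponting (AUS/ICC),13704
ST Jayasuriya (Asia/SL),13430
DPMD Jayawardene (Asia/SL),12650
"""

# Read only the first two data rows
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip the first two data rows (lines 1-2), then read the next two
next_two = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3), nrows=2)

print(first_two.shape)
print(next_two["Runs"].tolist())
```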


In [None]:
# show 2 randomly chosen rows
df.sample(2)

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,4s,6s
44,MJ Guptill (NZ),2009-2021,186,183,19,6927,237*,42.23,7896,87.72,16,37,15,702,181
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
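
`sample` returns different rows on every run. If the same rows are needed each time (for example, to share results), a fixed `random_state` can be passed. A small sketch, with `df_demo` as a hypothetical stand-in for `df`:

```python
import pandas as pd

# Small stand-in for df, built from rows shown earlier
df_demo = pd.DataFrame({
    "Player": ["SR Tendulkar (INDIA)", "KC Sangakkara (Asia/ICC/SL)",
               "RT Ponting (AUS/ICC)", "ST Jayasuriya (Asia/SL)",
               "DPMD Jayawardene (Asia/SL)"],
    "Runs": [18426, 14234, 13704, 13430, 12650],
})

# The same seed gives the same two rows on every call
s1 = df_demo.sample(2, random_state=42)
s2 = df_demo.sample(2, random_state=42)

print(s1.equals(s2))
```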


#### Reading an Excel file

In [None]:
df = pd.read_excel("ODI_cricket.xlsx", sheet_name="batsman", engine="openpyxl")

display(df.head())

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,4s,6s
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76


In [None]:
# Optional: select columns after importing the data
# print(df.shape)
# col_names = ['Player', 'Mat', 'Runs', 'SR']
# df = df[col_names]
# print(df.shape)

#### How to rename columns?

In [None]:
print(df.columns)

Index(['Player',   'Span',    'Mat',   'Inns',     'NO',   'Runs',     'HS',
          'Ave',     'BF',     'SR',      100,       50,        0,     '4s',
           '6s'],
      dtype='object')
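
In the output above, `100`, `50`, and `0` appear without quotes: the Excel reader returned these headers as integers, whereas a CSV header is always read as text. If mixed label types are inconvenient, one option is to convert every column label to a string. A minimal sketch, with `df_demo` as a hypothetical stand-in for `df`:

```python
import pandas as pd

# Stand-in with mixed string and integer column labels, as read from Excel
df_demo = pd.DataFrame(
    [["SR Tendulkar (INDIA)", 49, 96, 20]],
    columns=["Player", 100, 50, 0],
)

# Convert every column label to a string
df_demo.columns = df_demo.columns.map(str)

print(list(df_demo.columns))
print(df_demo["100"].iloc[0])
```

After this conversion the rename keys would have to be strings as well (`'100'` instead of `100`).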


In [None]:
df = df.rename(columns={'Mat':'Match', 
                        'Inns':'Innings',
                        'NO': 'NotOut',
                        'HS': 'Highest_score',
                        'Ave': 'Average',
                        'BF': 'Balls_Faced',
                        'SR': 'Strike_Rate',
                        100: 'Centuries',
                        50: 'Half_centuries',
                        0: 'Ducks',
                        "4s": "Fours",
                        "6s": "Sixes"})

display(df.head())

Unnamed: 0,Player,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76
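
One pitfall worth knowing: by default `rename` silently ignores keys that do not match any column. The sketch below (with a hypothetical `df_demo`) uses the string `'100'` against an integer label `100`, which changes nothing unless `errors="raise"` is passed.

```python
import pandas as pd

# Stand-in with one string label and one integer label
df_demo = pd.DataFrame([[463, 49]], columns=["Mat", 100])

# The string '100' does not match the integer label 100: nothing is renamed
unchanged = df_demo.rename(columns={"100": "Centuries"})
print(list(unchanged.columns))

# errors="raise" turns the silent mismatch into a KeyError
try:
    df_demo.rename(columns={"100": "Centuries"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)

# Matching keys of the right type are renamed as expected
renamed = df_demo.rename(columns={"Mat": "Match", 100: "Centuries"})
print(list(renamed.columns))
```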


#### How to split a column and create two new columns?

In [None]:
df_player = df['Player'].str.split("(", expand=True)

display(df_player.head(10))

Unnamed: 0,0,1
0,SR Tendulkar,INDIA)
1,KC Sangakkara,Asia/ICC/SL)
2,RT Ponting,AUS/ICC)
3,ST Jayasuriya,Asia/SL)
4,DPMD Jayawardene,Asia/SL)
5,V Kohli,INDIA)
6,Inzamam-ul-Haq,Asia/PAK)
7,JH Kallis,Afr/ICC/SA)
8,SC Ganguly,Asia/INDIA)
9,R Dravid,Asia/ICC/INDIA)


In [None]:
df[["Player_Name", "Country"]] = df['Player'].str.split("(", expand=True)

display(df.head())

Unnamed: 0,Player,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,Player_Name,Country
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,SR Tendulkar,INDIA)
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,KC Sangakkara,Asia/ICC/SL)
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,RT Ponting,AUS/ICC)
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,ST Jayasuriya,Asia/SL)
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,DPMD Jayawardene,Asia/SL)
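
Splitting on `"("` keeps the space before the bracket at the end of the name and the closing `)` in the country. As an alternative sketch (not the approach used in this lecture), `str.extract` with a regular expression can produce both clean columns in one step. The pattern and the names `players` and `parts` are assumptions made for this example.

```python
import pandas as pd

# Two values in the same "Name (Country)" layout as the Player column
players = pd.Series(
    ["SR Tendulkar (INDIA)", "KC Sangakkara (Asia/ICC/SL)"], name="Player"
)

# Named groups become the column names of the result
pattern = r"^(?P<Player_Name>.+?)\s*\((?P<Country>[^)]*)\)\s*$"
parts = players.str.extract(pattern)

print(parts["Player_Name"].tolist())
print(parts["Country"].tolist())
```

Rows that do not match the pattern would come back as missing values, so it is worth checking `parts.isna().sum()` on real data.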


#### How to remove a column?

In [None]:
# Option 1: reassign the result
# df = df.drop('Player', axis=1)

# Option 2: modify df in place
df.drop('Player', axis=1, inplace=True)

# Both options leave df without the 'Player' column

display(df.head())

Unnamed: 0,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,Player_Name,Country
0,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,SR Tendulkar,INDIA)
1,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,KC Sangakkara,Asia/ICC/SL)
2,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,RT Ponting,AUS/ICC)
3,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,ST Jayasuriya,Asia/SL)
4,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,DPMD Jayawardene,Asia/SL)
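
Re-running a cell that drops a column raises a `KeyError`, because the column is already gone. A small sketch of two related options, using a hypothetical `df_demo`: the `columns=` keyword (equivalent to `axis=1`) and `errors="ignore"`, which skips labels that are not present.

```python
import pandas as pd

# Small stand-in for df
df_demo = pd.DataFrame({"Player": ["SR Tendulkar (INDIA)"], "Runs": [18426]})

# drop returns a new DataFrame; df_demo itself is not modified
dropped = df_demo.drop(columns=["Player"])
print(list(df_demo.columns))
print(list(dropped.columns))

# Dropping a missing column normally raises KeyError; errors="ignore" skips it
safe = dropped.drop(columns=["Player"], errors="ignore")
print(list(safe.columns))
```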


#### How to replace or remove characters in a pandas column?

In [None]:
# regex=False treats ")" as a literal character, not a regular expression
df['Country'] = df['Country'].str.replace(")", "", regex=False)

display(df.head())

Unnamed: 0,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes,Player_Name,Country
0,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195,SR Tendulkar,INDIA
1,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88,KC Sangakkara,Asia/ICC/SL
2,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162,RT Ponting,AUS/ICC
3,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270,ST Jayasuriya,Asia/SL
4,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76,DPMD Jayawardene,Asia/SL
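
Depending on the pandas version, the pattern given to `str.replace` may be interpreted as a regular expression, in which `)` is a special character. The sketch below, on a hypothetical `country` Series, shows two ways to avoid that ambiguity: an explicit `regex=False`, or `str.rstrip`, which only removes the character from the right end.

```python
import pandas as pd

# Values as they look right after splitting on "("
country = pd.Series(["INDIA)", "Asia/ICC/SL)"])

# Literal replacement: removes every ")" in the string
via_replace = country.str.replace(")", "", regex=False)

# Strip only trailing ")" characters
via_rstrip = country.str.rstrip(")")

print(via_replace.tolist())
print(via_replace.equals(via_rstrip))
```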


#### How to reorder the columns?

In [None]:
print(df.columns)

new_col_sequence = ['Player_Name', 'Country', 'Span', 'Match', 'Innings', 'NotOut', 'Runs', 'Highest_score',
       'Average', 'Balls_Faced', 'Strike_Rate', 'Centuries', 'Half_centuries', 'Ducks', 'Fours', 'Sixes']

Index(['Span', 'Match', 'Innings', 'NotOut', 'Runs', 'Highest_score',
       'Average', 'Balls_Faced', 'Strike_Rate', 'Centuries', 'Half_centuries',
       'Ducks', 'Fours', 'Sixes', 'Player_Name', 'Country'],
      dtype='object')


In [None]:
df = df[new_col_sequence]

display(df.head())

Unnamed: 0,Player_Name,Country,Span,Match,Innings,NotOut,Runs,Highest_score,Average,Balls_Faced,Strike_Rate,Centuries,Half_centuries,Ducks,Fours,Sixes
0,SR Tendulkar,INDIA,1989-2012,463,452,41,18426,200*,44.83,21368,86.23,49,96,20,2016,195
1,KC Sangakkara,Asia/ICC/SL,2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,1385,88
2,RT Ponting,AUS/ICC,1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,1231,162
3,ST Jayasuriya,Asia/SL,1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,1500,270
4,DPMD Jayawardene,Asia/SL,1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,1119,76
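
Typing the full column list by hand is error-prone for wide tables. As a sketch of an alternative, the columns to move can be listed once and the rest kept in their current order (`df_demo`, `front`, and `rest` are illustrative names):

```python
import pandas as pd

# Small stand-in for df with the new columns at the end
df_demo = pd.DataFrame({
    "Span": ["1989-2012"],
    "Runs": [18426],
    "Player_Name": ["SR Tendulkar"],
    "Country": ["INDIA"],
})

# Columns to move to the front; everything else keeps its current order
front = ["Player_Name", "Country"]
rest = [c for c in df_demo.columns if c not in front]
df_demo = df_demo[front + rest]

print(list(df_demo.columns))
```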
