## Hi, I am Panda! Hope you are doing great!
![Panda](img/panda.jpg)

#### Fun aside, we are not going to look at this panda but at pandas, the library that helps us a lot with data crunching!
![Pandas](img/pandas.png)

## Why do we need pandas?
* Fast and efficient DataFrame object for data manipulation
* Reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
* Intelligent data alignment and integrated handling of missing data
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Columns can be inserted into and deleted from data structures
* Aggregating or transforming data with a powerful group-by engine that allows split-apply-combine operations on data sets
* High-performance merging and joining of data sets
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure
* Time-series functionality
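To make "data alignment" and "handling of missing data" from the list above a bit more concrete, here is a minimal sketch. The two Series and their labels are made up for illustration and are not part of the IPL data used later.

```python
import pandas as pd

# Two small Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on the index labels, not on position.
# "x" has no partner in b, so the result for "x" is NaN.
total = a + b

# Missing values can then be handled explicitly.
filled = a.add(b, fill_value=0)   # treat the missing side as 0
dropped = total.dropna()          # keep only labels present in both

print(total)
print(filled)
print(dropped)
```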

In [3]:
import pandas as pd
import numpy as np

## What are Series?
* It is an array-like object
* It is one-dimensional
* It can store data of any type (integers, floats, strings, dates, etc.)
* It has an index that holds the axis labels
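The points above can be sketched with a tiny Series that has string labels instead of the default integer index. The names and values are hypothetical sample data, chosen only to show label-based versus position-based lookup.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index.
scores = pd.Series([15, 40, 43], index=["Rohit", "Ishan", "Surya"])

by_label = scores["Ishan"]      # lookup by index label
by_position = scores.iloc[0]    # lookup by integer position

# A Series can also hold strings or dates; the dtype reflects the contents.
names = pd.Series(["a", "b"])
dates = pd.Series(pd.to_datetime(["2018-04-07", "2018-05-27"]))

print(scores.index.tolist())
print(by_label, by_position)
print(names.dtype, dates.dtype)
```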

In [209]:
arr = np.array([1,2,3,4,5])
type(arr)

numpy.ndarray

In [14]:
ser1 = pd.Series(arr)
ser1
# the default index is equivalent to np.arange(len(arr))
# looking up a row by its index label is similar to SQL:
# select * from sometable where someid = something

0    1
1    2
2    3
3    4
4    5
dtype: int32

In [22]:
ser2 = pd.Series(data=[11,12,13,14,15])
ser2

0    11
1    12
2    13
3    14
4    15
dtype: int64

In [23]:
df1 = pd.DataFrame({'col1':ser1,'col2':ser2})
df1

Unnamed: 0,col1,col2
0,1,11
1,2,12
2,3,13
3,4,14
4,5,15


In [30]:
team_data = pd.read_csv('data/team.csv')
team_data

Unnamed: 0,team_name,team_captain
0,Chennai Super Kings,MS Dhoni
1,Royal Challengers Bangalore,Virat Kohli
2,Kolkata Knight Riders,Dinesh Karthik
3,Rajasthan Royals,Ajinkya Rahane
4,Sunrisers Hyderabad,Kane Williamson
5,Kings XI Punjab,Ravichandran Ashwin
6,Mumbai Indians,Rohit Sharma
7,Delhi Daredevils,Shreyas Iyer


In [34]:
player_data = pd.read_csv('data/iplfinal.csv',sep='|')
player_data
# player_data.head()
#player_data.tail()

Unnamed: 0,match_description,match_date,match_venue,match_location,match_result,team_name,innings_order,batsman_name,dismissal_mode,runs,balls,fours,sixes
958,"Chennai Super Kings vs Sunrisers Hyderabad, Fi...",2018-05-27,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 8 wkts,Sunrisers Hyderabad Innings,innings_1,Carlos Brathwaite,c Rayudu b SN Thakur,21,11,0,3
959,"Chennai Super Kings vs Sunrisers Hyderabad, Fi...",2018-05-27,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 8 wkts,Chennai Super Kings Innings,innings_2,Shane Watson,not out,117,57,1,8
960,"Chennai Super Kings vs Sunrisers Hyderabad, Fi...",2018-05-27,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 8 wkts,Chennai Super Kings Innings,innings_2,Faf du Plessis,c & b Sandeep Sharma,10,11,1,0
961,"Chennai Super Kings vs Sunrisers Hyderabad, Fi...",2018-05-27,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 8 wkts,Chennai Super Kings Innings,innings_2,Suresh Raina,c Goswami b C Brathwaite,32,24,3,1
962,"Chennai Super Kings vs Sunrisers Hyderabad, Fi...",2018-05-27,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 8 wkts,Chennai Super Kings Innings,innings_2,Ambati Rayudu,not out,16,19,1,1


In [35]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 963 entries, 0 to 962
Data columns (total 13 columns):
match_description    963 non-null object
match_date           963 non-null object
match_venue          963 non-null object
match_location       963 non-null object
match_result         963 non-null object
team_name            963 non-null object
innings_order        963 non-null object
batsman_name         963 non-null object
dismissal_mode       963 non-null object
runs                 963 non-null int64
balls                963 non-null int64
fours                963 non-null int64
sixes                963 non-null int64
dtypes: int64(4), object(9)
memory usage: 97.9+ KB


In [36]:
player_data.describe()

Unnamed: 0,runs,balls,fours,sixes
count,963.0,963.0,963.0,963.0
mean,20.516096,14.865005,1.665628,0.937695
std,21.900609,13.28922,1.990187,1.559046
min,0.0,0.0,0.0,0.0
25%,4.0,5.0,0.0,0.0
50%,13.0,11.0,1.0,0.0
75%,31.5,22.0,3.0,1.0
max,128.0,70.0,9.0,11.0


In [113]:
# let's see what all information we can fetch from a series
ser3 = player_data['runs']
# ser3.size
# ser3.axes
# ser3.ndim
# ser3.values
ser3.empty

False
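The commented-out attributes in the cell above can be tried on a small stand-in Series. This is only a sketch: `runs` here holds made-up values, not the real `player_data['runs']` column.

```python
import pandas as pd

runs = pd.Series([15, 0, 40, 43, 22], name="runs")  # made-up sample values

print(runs.size)    # number of elements
print(runs.ndim)    # a Series always has one dimension
print(runs.axes)    # list holding the single index
print(runs.values)  # the underlying NumPy array
print(runs.empty)   # True only when the Series has no elements
```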

In [40]:
player_data.match_venue

0                                  Wankhede Stadium,
1                                  Wankhede Stadium,
2                                  Wankhede Stadium,
3                                  Wankhede Stadium,
4                                  Wankhede Stadium,
5                                  Wankhede Stadium,
6                                  Wankhede Stadium,
7                                  Wankhede Stadium,
8                                  Wankhede Stadium,
9                                  Wankhede Stadium,
10                                 Wankhede Stadium,
11                                 Wankhede Stadium,
12                                 Wankhede Stadium,
13                                 Wankhede Stadium,
14                                 Wankhede Stadium,
15                                 Wankhede Stadium,
16                                 Wankhede Stadium,
17     Punjab Cricket Association IS Bindra Stadium,
18     Punjab Cricket Association IS Bindra St

In [120]:
# aggregations on a data frame
# player_data.count()
# player_data.sum()
# player_data.mean()
# player_data.median()
# player_data.std()
player_data.max()

match_description    Sunrisers Hyderabad vs Royal Challengers Banga...
match_date                                                  2018-05-27
match_venue                                          Wankhede Stadium,
match_location                                                    Pune
match_result                         Sunrisers Hyderabad won by 9 wkts
team_name                                  Sunrisers Hyderabad Innings
innings_order                                                innings_2
batsman_name                                          Yuzvendra Chahal
dismissal_mode                          st de Kock b Washington Sundar
runs                                                               128
balls                                                               70
fours                                                                9
sixes                                                               11
dtype: object

In [58]:
x = player_data[['match_description','runs']]
type(x)

pandas.core.frame.DataFrame

In [64]:
y = player_data[['runs','sixes']]
type(y)

pandas.core.frame.DataFrame

In [95]:
# fetching data
# player_data.loc[2:4]
# player_data.iloc[2:4]
# player_data.loc[2:4,'match_description':'match_location']
player_data.loc[player_data['runs']>100,'batsman_name']
# player_data.iat[2,3]

274          Chris Gayle
285         Shane Watson
680    Rishabh Pant (wk)
959         Shane Watson
Name: batsman_name, dtype: object
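The accessors in the cell above differ mainly in whether they work on labels or on positions. A minimal sketch on a made-up five-row frame (the column names mirror the IPL data, the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"batsman": ["A", "B", "C", "D", "E"], "runs": [15, 104, 40, 128, 22]}
)

by_label = df.loc[2:4]       # label slice: rows 2, 3 and 4 (end label included)
by_position = df.iloc[2:4]   # position slice: rows 2 and 3 (end excluded)

# A boolean mask plus a column label selects matching rows of one column.
centuries = df.loc[df["runs"] > 100, "batsman"]

# .iat reads a single cell by integer position (row 2, column 1).
cell = df.iat[2, 1]

print(len(by_label), len(by_position))
print(centuries.tolist())
print(cell)
```

With the default integer index the labels and positions look the same, which is why the inclusive end of `.loc` versus the exclusive end of `.iloc` is easy to miss.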

In [103]:
# setting data
# player_data.loc[0]
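# the assignment below overwrites the matching cells in place and displays nothing
# player_data.loc[player_data['runs']>100,'runs'] = 200
player_data.loc[player_data['runs']>100,'runs']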

274    104
285    106
680    128
959    117
Name: runs, dtype: int64
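Assigning through `.loc` with a boolean mask changes the frame in place. If the original values are still needed, one option is to work on a copy first. A small sketch with made-up rows; the name `capped` and the cap of 100 are assumptions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({"batsman": ["A", "B", "C"], "runs": [15, 104, 128]})

capped = df.copy()  # work on a copy so df keeps its original values
capped.loc[capped["runs"] > 100, "runs"] = 100  # overwrite only the matching cells

print(df["runs"].tolist())
print(capped["runs"].tolist())
```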

In [104]:
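# creating calculated columns
# note: this is runs per ball (multiply by 100 for the conventional strike rate);
# rows with balls == 0 give NaN or inf
player_data['str_rate'] = player_data['runs']/player_data['balls']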
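Since `describe()` above shows a minimum of 0 balls, the division can hit zero. One possible way to guard against that, sketched on made-up rows (the `* 100` scaling and the rounding are assumptions, not something the notebook does):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"runs": [15, 0, 117], "balls": [18, 0, 57]})  # made-up rows

# Replace 0 balls with NaN so the division yields NaN instead of inf.
balls = df["balls"].replace(0, np.nan)
df["str_rate"] = (df["runs"] / balls * 100).round(2)

print(df)
```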

In [107]:
# remove columns
del player_data['str_rate']
player_data.head()

Unnamed: 0,match_description,match_date,match_venue,match_location,match_result,team_name,innings_order,batsman_name,dismissal_mode,runs,balls,fours,sixes
0,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Rohit Sharma (c),c Rayudu b Watson,15,18,1,1
1,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Evin Lewis,lbw b D Chahar,0,2,0,0
2,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Ishan Kishan (wk),c Mark Wood b Tahir,40,29,4,1
3,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Suryakumar Yadav,c Harbhajan b Watson,43,29,6,1
4,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Hardik Pandya,not out,22,20,2,0


In [208]:
#working with grouping functions
grp1 = player_data.groupby(['match_location'])
# grp1.groups
# grp1.get_group('Mumbai')
# grp1.runs.agg(np.size)
# grp1.runs.agg(np.median)
# grp1.sixes.agg(np.max)
# grp1.runs.agg(np.std)
# grp1.runs.agg([np.size,np.sum,np.mean,np.median])
# grp1[['runs','balls']].transform(lambda x:(x-x.mean())/x.std())  # numeric columns only
grp1.filter(lambda x: x.runs.sum()>=100)

Unnamed: 0,match_description,match_date,match_venue,match_location,match_result,team_name,innings_order,batsman_name,dismissal_mode,runs,balls,fours,sixes,run_category,boundaries
0,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Rohit Sharma (c),c Rayudu b Watson,15,18,1,1,others,2
1,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Evin Lewis,lbw b D Chahar,0,2,0,0,others,0
2,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Ishan Kishan (wk),c Mark Wood b Tahir,40,29,4,1,others,5
3,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Suryakumar Yadav,c Harbhajan b Watson,43,29,6,1,others,7
4,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Hardik Pandya,not out,22,20,2,0,others,2
5,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Mumbai Indians Innings,innings_1,Krunal Pandya,not out,41,22,5,2,others,7
6,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Chennai Super Kings Innings,innings_2,Shane Watson,c Lewis b Hardik Pandya,16,14,1,1,others,2
7,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Chennai Super Kings Innings,innings_2,Ambati Rayudu,lbw b Markande,22,19,4,0,others,4
8,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Chennai Super Kings Innings,innings_2,Suresh Raina,c Krunal Pandya b Hardik Pandya,4,6,0,0,others,0
9,"Mumbai Indians vs Chennai Super Kings, 1st Mat...",2018-04-07,"Wankhede Stadium,",Mumbai,Chennai Super Kings won by 1 wkt,Chennai Super Kings Innings,innings_2,Kedar Jadhav,not out,24,22,1,2,others,3
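The three group-by styles used above (`agg`, `transform`, `filter`) are easier to compare on a frame small enough to check by hand. This is a sketch with made-up locations and runs; it uses string aggregator names such as `"sum"` instead of the NumPy functions, which is an equivalent spelling.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "match_location": ["Mumbai", "Mumbai", "Pune", "Pune", "Delhi"],
        "runs": [60, 50, 30, 20, 10],
    }
)
grp = df.groupby("match_location")

# agg: one row per group.
summary = grp["runs"].agg(["size", "sum", "mean"])

# transform: same length as the input, here each value minus its group mean.
centered = grp["runs"].transform(lambda s: s - s.mean())

# filter: keep every row of the groups whose total passes the test.
big_venues = grp.filter(lambda g: g["runs"].sum() >= 100)

print(summary)
print(centered.tolist())
print(big_venues)
```

Only Mumbai reaches a total of 100 in this sample, so `filter` returns its two rows and drops the rest.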


In [None]:
player_data.groupby('match_location')

In [142]:
def run_category(r):
    if r>=100:
        return 'century'
    elif r>=50:
        return 'half-century'
    else:
        return 'others'
    
player_data['run_category'] = player_data.runs.map(run_category)
player_data.loc[:,'runs':]

Unnamed: 0,runs,balls,fours,sixes,run_category
0,15,18,1,1,others
1,0,2,0,0,others
2,40,29,4,1,others
3,43,29,6,1,others
4,22,20,2,0,others
5,41,22,5,2,others
6,16,14,1,1,others
7,22,19,4,0,others
8,4,6,0,0,others
9,24,22,1,2,others


In [147]:
player_data['boundaries'] = player_data['sixes'] + player_data['fours']
player_data.loc[:,'runs':]

Unnamed: 0,runs,balls,fours,sixes,run_category,boundaries
0,15,18,1,1,others,2
1,0,2,0,0,others,0
2,40,29,4,1,others,5
3,43,29,6,1,others,7
4,22,20,2,0,others,2
5,41,22,5,2,others,7
6,16,14,1,1,others,2
7,22,19,4,0,others,4
8,4,6,0,0,others,0
9,24,22,1,2,others,3
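As an alternative to mapping a Python function over every value, the same categories can be built with `pd.cut`. This is a sketch on made-up rows, not the notebook's method; note that it produces a categorical column rather than plain strings, and that it assumes runs are never negative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "runs": [15, 50, 99, 100, 128],
        "fours": [1, 4, 9, 8, 9],
        "sixes": [1, 2, 3, 5, 11],
    }
)

# right=False makes each bin closed on the left: [0, 50), [50, 100), [100, inf).
df["run_category"] = pd.cut(
    df["runs"],
    bins=[0, 50, 100, np.inf],
    labels=["others", "half-century", "century"],
    right=False,
)

# Column arithmetic is vectorized, so no row-wise apply is needed.
df["boundaries"] = df["fours"] + df["sixes"]

print(df)
```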
