## **Questions based on Titanic Dataset:**

To read the dataset as a CSV, use the code below:

```python
import pandas as pd

url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv"
titanic_df = pd.read_csv(url)
```

### `Q-1:` Using `groupby` make groups using the `"Pclass"` column and find out the average age and total number of missing values in the `"Age"` column for every group.

In [1]:
# code here
import pandas as pd

url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv"
titanic_df = pd.read_csv(url)
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
titanic_df.groupby('Pclass')['Age'].mean()

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [5]:
titanic_df.groupby('Pclass')['Age'].hasna()

AttributeError: 'SeriesGroupBy' object has no attribute 'hasna'

In [13]:
# Correct solution
pclass_group = titanic_df.groupby('Pclass')

for group in list(pclass_group.groups.keys()):
    print(
        f"Pclass group: {group} : Avg Age {pclass_group.get_group(group)['Age'].mean()} and total missing values are : {pclass_group.get_group(group)['Age'].isna().sum()}")

Pclass group: 1 : Avg Age 38.233440860215055 and total missing values are : 30
Pclass group: 2 : Avg Age 29.87763005780347 and total missing values are : 11
Pclass group: 3 : Avg Age 25.14061971830986 and total missing values are : 136
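
The same per-class summary can also be produced in a single `agg` call instead of a loop. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not the Titanic data); the column names `avg_age` and `missing_age` are likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 20.0, 30.0, np.nan, np.nan, np.nan, 10.0],
})

# Named aggregation: one output column per keyword argument.
age_summary = toy_df.groupby("Pclass")["Age"].agg(
    avg_age="mean",                          # mean() skips NaN by default
    missing_age=lambda s: s.isna().sum(),    # count of missing ages per group
)
print(age_summary)
```

Replacing `toy_df` with `titanic_df` should give one row per `Pclass` with both values side by side.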


### `Q-2:` Using `groupby` make groups using the `"Pclass"` column and fill every group's `"Embarked"` column's missing values with the mode value of that group. After that, print every group's `"Embarked"` column's value counts in ascending order.

In [21]:
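# code here

for group, data in pclass_group:
    # fillna returns a new Series, so keep the result;
    # mode() also returns a Series, so take its first value
    embarked = data['Embarked'].fillna(data['Embarked'].mode()[0])
    print(embarked.value_counts(ascending=True))

Embarked
Q      2
C     85
S    129
Name: count, dtype: int64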
Embarked
Q      3
C     17
S    164
Name: count, dtype: int64
Embarked
C     66
Q     72
S    353
Name: count, dtype: int64
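
If the filled values should be written back to the DataFrame rather than only printed, one option is `groupby` with `transform`. This is a minimal sketch on a made-up toy frame (`toy_df` and the helper `fill_with_mode` are assumptions for illustration); it assumes every group has at least one non-missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Embarked": ["S", "S", np.nan, "C", np.nan, "C"],
})

def fill_with_mode(s):
    # mode() returns a Series (ties are possible); take the first value.
    # This would raise IndexError for a group whose values are all missing.
    return s.fillna(s.mode().iloc[0])

# transform returns a Series aligned with the original rows,
# so it can be assigned straight back to the column.
toy_df["Embarked"] = toy_df.groupby("Pclass")["Embarked"].transform(fill_with_mode)

for pclass, data in toy_df.groupby("Pclass"):
    print(pclass, data["Embarked"].value_counts(ascending=True).to_dict())
```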


In [23]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### `Q-3:` Make groups based on the `"Embarked"` column. Then, within each Embarked group, group again by `"Pclass"` and find the average fare (rounded to 2 decimal places) for each `"Pclass"` within each `"Embarked"` group.

**Sample Output:**

```python
{'C': {1: 105, 2: 25, 3: 11},
 'Q': {1: 90, 2: 12, 3: 11},
 'S': {1: 70, 2: 20, 3: 15}}
```

In [28]:
# code here
p_dict = {}
embarked_groups = titanic_df.groupby('Embarked')

for embarked_group in embarked_groups.groups.keys():
    pclass_groups = embarked_groups.get_group(embarked_group).groupby("Pclass")
    p_dict[embarked_group] = {}

    for pclass_group in pclass_groups.groups.keys():
        p_dict[embarked_group][pclass_group] = round(pclass_groups.get_group(pclass_group)['Fare'].mean(), 2)

p_dict



{'C': {1: 104.72, 2: 25.36, 3: 11.21},
 'Q': {1: 90.0, 2: 12.35, 3: 11.18},
 'S': {1: 70.36, 2: 20.33, 3: 14.64}}
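
The nested loops can also be replaced by grouping on both columns at once and then folding the result into a nested dictionary. This is a minimal sketch on a made-up toy frame (`toy_df`, `avg_fare` and `fare_dict` are names assumed for illustration).

```python
import pandas as pd

# Toy stand-in for titanic_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S"],
    "Pclass": [1, 1, 2, 1, 2, 2],
    "Fare": [100.0, 50.0, 20.5, 70.0, 10.0, 15.0],
})

# One mean per (Embarked, Pclass) combination, rounded to 2 decimals.
avg_fare = toy_df.groupby(["Embarked", "Pclass"])["Fare"].mean().round(2)

# Fold the MultiIndex Series into {embarked: {pclass: fare}}.
fare_dict = {}
for (embarked, pclass), fare in avg_fare.items():
    fare_dict.setdefault(embarked, {})[int(pclass)] = float(fare)

print(fare_dict)
```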

## **Questions Based on FIFA World Cup 2022 Dataset:**

You can read the dataset using the sample code below:

```python
import pandas as pd

fifa_df = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT3D_x_4DS6d51LKJ7ze1sxT5WpV5uiSVOFYHLwBiGru6vFyVv5h5-83AwFjxWYiWfCDjDAaarHAV-k/pub?gid=0&single=true&output=csv")
```

### `Q-4:` Perform `groupby` on the `"Team"` column and then apply Z-normalization to the following columns within each group:
1. Passes
2. Passes Completed
3. Attempted Line Breaks
4. Completed Line Breaks

You have to write a Python function named `z_normalization` which takes two arguments:

1. *group:* Each group that you have created
2. *cols_to_perform:* A list of the columns on which to perform the Z-normalization.

For this problem, you have to use the `apply()` method.

$\Large Z_i = \frac{X_i - \mu}{\sigma}$

where $\mu$ and $\sigma$ are the mean and standard deviation of the column within the group.

After that, find the following values for each group:
- minimum "Passes"
- maximum "Passes"
- minimum "Yellow Cards"
- maximum "Yellow Cards"
- average "Yellow Cards"
- maximum "Attempted Line Breaks"
- minimum "Attempted Line Breaks"
- standard deviation of "Attempted Line Breaks"
- average "Possession (%)"
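
As a small worked instance of the Z-normalization formula above, the sketch below normalizes three made-up values (the Series `values` is an assumption for illustration). Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default, and that the subtraction must be parenthesized before dividing.

```python
import pandas as pd

# Made-up values, not taken from the dataset.
values = pd.Series([2.0, 4.0, 6.0])

mu = values.mean()     # (2 + 4 + 6) / 3 = 4.0
sigma = values.std()   # sample std (ddof=1): sqrt((4 + 0 + 4) / 2) = 2.0

# Parentheses matter: values - mu / sigma would divide mu by sigma first.
z = (values - mu) / sigma
print(z.tolist())  # [-1.0, 0.0, 1.0]
```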

In [29]:
# code here
import pandas as pd

fifa_df = pd.read_csv(
    "https://docs.google.com/spreadsheets/d/e/2PACX-1vT3D_x_4DS6d51LKJ7ze1sxT5WpV5uiSVOFYHLwBiGru6vFyVv5h5-83AwFjxWYiWfCDjDAaarHAV-k/pub?gid=0&single=true&output=csv")
fifa_df.head()

Unnamed: 0,Sl. No,Match No.,Team,Against,Group,Goal,Possession (%),Inside Penalty Area,Outside Penalty Area,Assists,...,Fouls Against,Offsides,Passes,Passes Completed,Crosses,Crosses Completed,Corners,Free Kicks,Penalties Scored,Pts
0,1,1,Qatar,Ecuador,A,0,40,0,0,0,...,15,3,453,387,10,5,1,19,0,0
1,2,1,Ecuador,Qatar,A,2,46,2,0,1,...,15,4,484,419,26,10,3,17,1,3
2,3,2,England,Iran,B,6,69,6,0,6,...,9,2,810,733,29,9,8,16,0,3
3,4,2,Iran,England,B,2,20,2,0,1,...,14,2,232,156,11,3,0,10,1,0
4,5,3,Senegal,Netherlands,A,0,39,0,0,0,...,13,2,391,326,22,8,6,14,0,0


In [98]:
# My attempt

def z_normalization(group, cols_to_perform):
    # for col in cols_to_perform:
    pass
    # print(group['Passes'] - group['Passes'].mean() / group['Passes'].std()) 


team_groups = fifa_df.groupby('Team')

for group, data in team_groups:
    z_normalization(data, ['Passes', 'Passes Completed', 'Attempted Line Breaks', 'Completed Line Breaks'])

In [99]:
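# Complete Solution

def z_normalization(group, cols_to_perform):
    group = group.copy()  # work on a copy so the original rows are not modified
    for col in cols_to_perform:
        std = group[col].std()
        mean = group[col].mean()
        group[f"{col}_z_norm"] = (group[col] - mean) / std
    return group


cols_to_perform = ['Passes', 'Passes Completed', 'Attempted Line Breaks', 'Completed Line Breaks']

# group_keys=False keeps the original row index instead of adding "Team" as an index level.
# Without it, the result has "Team" as both an index level and a column, and a second
# groupby('Team') raises "ValueError: 'Team' is both an index level and a column label".
normalized_df = fifa_df.groupby('Team', group_keys=False).apply(
    z_normalization, cols_to_perform=cols_to_perform)

# Group by the original Team values (aligned on the row index), which also works
# if apply() does not pass the grouping column through to the result.
groups = normalized_df.groupby(fifa_df['Team'])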

In [100]:
print(groups.agg(
    {
        "Passes": ['min', 'max'],
        "Yellow Cards": ['min', "max", 'mean'],
        "Attempted Line Breaks": ['max', 'min', 'std'],
        "Possession (%)": 'mean'
    }
))

               Passes       Yellow Cards               Attempted Line Breaks  \
                  min   max          min max      mean                   max   
Team                                                                           
Argentina         408   862            0   8  2.285714                   249   
Australia         286   466            0   3  1.750000                   171   
Belgium           512   685            1   3  1.666667                   195   
Brazil            548   695            0   3  1.200000                   193   
Cameroon          295   500            1   5  2.666667                   182   
Canada            448   536            2   4  2.666667                   176   
Costa Rica        231   454            1   3  2.000000                   154   
Croatia           461   724            0   2  1.142857                   259   
Denmark           537   650            1   2  1.666667                   241   
Ecuador           429   484            0

## **Questions on the IPL Dataset**

Ball-by-ball dataset: https://drive.google.com/file/d/1-kvv_9KCSAFWcrhS9WgTxSrURkRh6GNt/view?usp=share_link

### `Q-5:` Find the batsman in each of the categories below:
* Highest score while chasing
* Best strike rate while chasing, having faced 100+ balls

> Chasing means the team is batting in the second innings.

In [64]:
# code here
balls = pd.read_csv('ipl_deliveries.csv')
balls.head()

Unnamed: 0,ID,Team,innings,overs,ballnumber,batter,bowler,non-striker,extra_type,batsman_run,extras_run,total_run,non_boundary,isWicketDelivery,player_out,kind,fielders_involved,BattingTeam,BowlingTeam
0,1312200,Rajasthan RoyalsGujarat Titans,1,0,1,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans
1,1312200,Rajasthan RoyalsGujarat Titans,1,0,2,YBK Jaiswal,Mohammed Shami,JC Buttler,legbyes,0,1,1,0,0,,,,Rajasthan Royals,Gujarat Titans
2,1312200,Rajasthan RoyalsGujarat Titans,1,0,3,JC Buttler,Mohammed Shami,YBK Jaiswal,,1,0,1,0,0,,,,Rajasthan Royals,Gujarat Titans
3,1312200,Rajasthan RoyalsGujarat Titans,1,0,4,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans
4,1312200,Rajasthan RoyalsGujarat Titans,1,0,5,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans


In [65]:
df = balls[balls['innings'] == 2]
# Select the runs column before summing so that text columns are not aggregated
df.groupby(['ID', 'batter'])['batsman_run'].sum().sort_values(ascending=False).head()

ID       batter       
501206   PC Valthaty      120
501243   V Sehwag         119
1254061  SV Samson        119
1136620  SR Watson        117
336018   ST Jayasuriya    114
Name: batsman_run, dtype: int64

In [66]:
temp_df = df[~(df.extra_type == 'wides')]
temp_df = temp_df.groupby('batter').agg(
    {
        'batsman_run': 'sum',
        'ballnumber': 'count'
    }
)
temp_df['strike_rate'] = (temp_df['batsman_run'] / temp_df['ballnumber']) * 100
temp_df[temp_df['ballnumber'] >= 100].sort_values('strike_rate', ascending=False).reset_index().head()

Unnamed: 0,batter,batsman_run,ballnumber,strike_rate
0,PJ Cummins,222,114,194.736842
1,AD Russell,986,570,172.982456
2,LS Livingstone,182,107,170.093458
3,SP Narine,599,356,168.258427
4,SO Hetmyer,330,200,165.0


### `Q-6:` Most successful bowler against any batsman. Find that bowler-batsman pair.
> Most successful in terms of dismissals: the bowler who has dismissed a given batsman the most times. If two pairs have the same number of dismissals, compare the runs conceded by the bowler to that batsman; the one who conceded fewer runs is more successful.

In [68]:
balls.head()

Unnamed: 0,ID,Team,innings,overs,ballnumber,batter,bowler,non-striker,extra_type,batsman_run,extras_run,total_run,non_boundary,isWicketDelivery,player_out,kind,fielders_involved,BattingTeam,BowlingTeam
0,1312200,Rajasthan RoyalsGujarat Titans,1,0,1,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans
1,1312200,Rajasthan RoyalsGujarat Titans,1,0,2,YBK Jaiswal,Mohammed Shami,JC Buttler,legbyes,0,1,1,0,0,,,,Rajasthan Royals,Gujarat Titans
2,1312200,Rajasthan RoyalsGujarat Titans,1,0,3,JC Buttler,Mohammed Shami,YBK Jaiswal,,1,0,1,0,0,,,,Rajasthan Royals,Gujarat Titans
3,1312200,Rajasthan RoyalsGujarat Titans,1,0,4,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans
4,1312200,Rajasthan RoyalsGujarat Titans,1,0,5,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans


In [75]:
# code here
balls['isBatterOut'] = (balls.batter == balls.player_out) & (
    ~balls.kind.isin(['run out', 'retired hurt', 'retired out']))
balls.groupby(['batter', 'bowler']).agg(
    {
        'isBatterOut': 'sum',
        'batsman_run': 'sum'
    }
).sort_values(by=['isBatterOut', 'batsman_run'], ascending=[False, True]).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,isBatterOut,batsman_run
batter,bowler,Unnamed: 2_level_1,Unnamed: 3_level_1
MS Dhoni,Z Khan,7,74
V Kohli,Sandeep Sharma,7,78
RG Sharma,A Mishra,7,87
RV Uthappa,R Ashwin,7,123
RG Sharma,SP Narine,7,137
RG Sharma,R Vinay Kumar,6,22
Q de Kock,YS Chahal,6,44
RR Pant,JJ Bumrah,6,48
GJ Maxwell,RA Jadeja,6,49
AT Rayudu,MM Sharma,6,52


### `Q-7:` Most successful batting pair in the IPL: the batting pair who have scored the most runs playing together.


In [76]:
balls.head()

Unnamed: 0,ID,Team,innings,overs,ballnumber,batter,bowler,non-striker,extra_type,batsman_run,extras_run,total_run,non_boundary,isWicketDelivery,player_out,kind,fielders_involved,BattingTeam,BowlingTeam,isBatterOut
0,1312200,Rajasthan RoyalsGujarat Titans,1,0,1,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans,False
1,1312200,Rajasthan RoyalsGujarat Titans,1,0,2,YBK Jaiswal,Mohammed Shami,JC Buttler,legbyes,0,1,1,0,0,,,,Rajasthan Royals,Gujarat Titans,False
2,1312200,Rajasthan RoyalsGujarat Titans,1,0,3,JC Buttler,Mohammed Shami,YBK Jaiswal,,1,0,1,0,0,,,,Rajasthan Royals,Gujarat Titans,False
3,1312200,Rajasthan RoyalsGujarat Titans,1,0,4,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans,False
4,1312200,Rajasthan RoyalsGujarat Titans,1,0,5,YBK Jaiswal,Mohammed Shami,JC Buttler,,0,0,0,0,0,,,,Rajasthan Royals,Gujarat Titans,False


In [81]:
import numpy as np

def func(x):
    return '-'.join(list(np.sort(x.values)))

balls['batter_pair'] = balls[['batter', 'non-striker']].apply(func, axis=1)

In [84]:
balls.groupby('batter_pair')['total_run'].sum().sort_values(ascending=False).head()

batter_pair
AB de Villiers-V Kohli    3134
CH Gayle-V Kohli          2802
DA Warner-S Dhawan        2357
G Gambhir-RV Uthappa      1906
KL Rahul-MA Agarwal       1731
Name: total_run, dtype: int64
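
The row-wise `apply` above can be slow on a ball-by-ball dataset, and joining names with `-` would be ambiguous for any name that itself contains a hyphen. One possible alternative is to sort the two name columns with NumPy and keep them as two separate columns. This is a minimal sketch on a made-up toy frame (`toy_balls`, `batsman_1` and `batsman_2` are names assumed for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the balls DataFrame (values are made up for illustration).
toy_balls = pd.DataFrame({
    "batter": ["V Kohli", "AB de Villiers", "V Kohli"],
    "non-striker": ["AB de Villiers", "V Kohli", "CH Gayle"],
    "total_run": [4, 1, 6],
})

# Sort the two names within each row so (A, B) and (B, A) map to the same pair.
pairs = np.sort(toy_balls[["batter", "non-striker"]].to_numpy(dtype=object), axis=1)
toy_balls["batsman_1"] = pairs[:, 0]
toy_balls["batsman_2"] = pairs[:, 1]

pair_runs = (
    toy_balls.groupby(["batsman_1", "batsman_2"])["total_run"]
    .sum()
    .sort_values(ascending=False)
)
print(pair_runs)
```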

### `Q-8:` Make a dataframe of all batting pairs who have played together.
```
Batsman1 Batsman2 Runs Avg StrikeRate
```

> To keep this question simple, you can count wide balls when computing the strike rate.

In [89]:
temp_df = balls.groupby('batter_pair').agg(
    {
        'total_run': 'sum',
        'ballnumber': 'count',
        'isWicketDelivery': 'sum'
    }
).reset_index()
temp_df.head()

Unnamed: 0,batter_pair,total_run,ballnumber,isWicketDelivery
0,A Ashish Reddy-A Mishra,40,31,1
1,A Ashish Reddy-AA Jhunjhunwala,4,5,1
2,A Ashish Reddy-CL White,13,9,1
3,A Ashish Reddy-DB Ravi Teja,10,7,0
4,A Ashish Reddy-DJG Sammy,45,34,2


In [94]:
temp_df['Batsman 1'] = temp_df['batter_pair'].apply(lambda x: x.split("-")[0])
temp_df['Batsman 2'] = temp_df['batter_pair'].apply(lambda x: x.split('-')[1])
temp_df.rename(columns={'total_run':'Runs'}, inplace=True)
temp_df['StrikeRate'] = (temp_df['Runs'] / temp_df['ballnumber']) * 100
temp_df['Avg'] = temp_df['Runs'] / temp_df['isWicketDelivery']
temp_df.sort_values('Runs', ascending=False, inplace=True)
temp_df[['Batsman 1', 'Batsman 2','Runs', 'Avg', 'StrikeRate']]

Unnamed: 0,Batsman 1,Batsman 2,Runs,Avg,StrikeRate
302,AB de Villiers,V Kohli,3134,44.140845,152.209811
1251,CH Gayle,V Kohli,2802,52.867925,142.017233
1508,DA Warner,S Dhawan,2357,48.102041,136.637681
1954,G Gambhir,RV Uthappa,1906,39.708333,133.754386
2803,KL Rahul,MA Agarwal,1731,52.454545,142.939719
...,...,...,...,...,...
2203,Harshit Rana,UT Yadav,0,,0.000000
2978,LH Ferguson,SN Thakur,0,0.000000,0.000000
2975,LH Ferguson,R Tewatia,0,0.000000,0.000000
2973,LH Ferguson,M Prasidh Krishna,0,,0.000000
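
Pairs with no dismissals make the `Avg` division divide by zero, which in pandas yields `inf` (runs scored, no wicket) or `NaN` (no runs, no wicket). If a uniform missing value is preferred, one option is to replace zero dismissals with `NaN` before dividing. This is a minimal sketch on a made-up toy frame (`toy_pairs` and `dismissals` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in for temp_df (values are made up for illustration).
toy_pairs = pd.DataFrame({
    "Runs": [100, 30, 0],
    "isWicketDelivery": [4, 0, 0],
})

# Treat "never dismissed" as an undefined average instead of inf.
dismissals = toy_pairs["isWicketDelivery"].replace(0, np.nan)
toy_pairs["Avg"] = toy_pairs["Runs"] / dismissals
print(toy_pairs)
```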
