## CASE STUDY: TITANIC DATA ANALYSIS

- DataFrames
    - loading & basic analysis
    - sorting & subsetting
    - creating new columns
- Aggregating Data
    - summary statistics
    - counting
    - grouped summary statistics

## 1. Data loading & Basic Analysis

1. Import Libraries

In [1]:
import numpy as np
import pandas as pd

2. Read the Titanic comma-separated (CSV) file as `titanic_df`

In [2]:
titanic_df = pd.read_csv('./dataset/titanic.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


3. Basic Analysis with `.info()`, `.describe()` methods.

In [3]:
titanic_df.info() # row count, column names, non-null counts, dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
titanic_df.describe() # summarizes only the numeric columns by default

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
titanic_df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


4. Basic Analysis with `.shape`, `.columns`, `.values`, `.index`

In [6]:
titanic_df.shape

(891, 12)

In [7]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
titanic_df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
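
The heading for this step also lists `.index`, which the cells above do not run. A minimal sketch on a small made-up frame follows; `toy_df` and its values are illustrative assumptions, not rows read from the Titanic file.

```python
import pandas as pd

# Hypothetical three-row frame standing in for titanic_df
toy_df = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]})

print(toy_df.index)       # row labels: a RangeIndex starting at 0
print(toy_df.shape)       # (number of rows, number of columns)
print(toy_df.to_numpy())  # same data as .values, returned as a NumPy array
```

`.to_numpy()` is the method pandas recommends over `.values` in newer code; both return the underlying data without the row and column labels.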

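## 2. Sorting & Subsetting

5. Sorting the dataset by a single column with `.sort_values()`
- `by`: name of the column to sort on
- `ascending`: `True` (default) or `False`
- `na_position`: `"first"` or `"last"` (default), where missing values are placed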

In [9]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
titanic_df.sort_values(by='Pclass')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
445,446,1,1,"Dodge, Master. Washington",male,4.0,0,2,33638,81.8583,A34,S
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C
309,310,1,1,"Francatelli, Miss. Laura Mabel",female,30.0,0,0,PC 17485,56.9292,E36,C
307,308,1,1,"Penasco y Castellana, Mrs. Victor de Satode (M...",female,17.0,1,0,PC 17758,108.9000,C65,C
306,307,1,1,"Fleming, Miss. Margaret",female,,0,0,17421,110.8833,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
379,380,0,3,"Gustafsson, Mr. Karl Gideon",male,19.0,0,0,347069,7.7750,,S
381,382,1,3,"Nakid, Miss. Maria (""Mary"")",female,1.0,0,2,2653,15.7417,,C
382,383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.9250,,S
371,372,0,3,"Wiklund, Mr. Jakob Alfred",male,18.0,1,0,3101267,6.4958,,S


In [11]:
titanic_df.sort_values(by='Age', ascending=False, na_position="first")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0000,,S
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5000,,S


6. Sorting the dataset by multiple columns with `.sort_values()`

In [12]:
titanic_df.sort_values(by=['Pclass', 'Age'], ascending=[True,False])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
745,746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0000,B22,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


7. Subsetting Single column & Multiple columns

In [13]:
titanic_df['Age'] # a single column name returns a Series

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [14]:
titanic_df[['Age','Pclass']] # a list of column names returns a DataFrame

Unnamed: 0,Age,Pclass
0,22.0,3
1,38.0,1
2,26.0,3
3,35.0,1
4,35.0,3
...,...,...
886,27.0,2
887,19.0,1
888,,3
889,26.0,1


8. Subsetting based on specific requirements
- Passengers older than 30
- Male passengers only (`Sex == 'male'`)
- Male passengers who survived
- Passengers whose `Pclass` is in `[1, 2]`
- Give a discount of 10 percent on the fare

In [15]:
titanic_df.Age.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [16]:
titanic_df.Age > 30

0      False
1       True
2      False
3       True
4       True
       ...  
886    False
887    False
888    False
889    False
890     True
Name: Age, Length: 891, dtype: bool

In [17]:
condition = titanic_df.Age > 30
titanic_df[condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q


In [18]:
condition1 = titanic_df.Sex == 'male'
titanic_df[condition1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [19]:
titanic_df.Sex == 'male'

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 891, dtype: bool

In [20]:
titanic_df.Survived == 1

0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Survived, Length: 891, dtype: bool

In [21]:
# Python's `and` cannot combine two boolean Series, so this line raises the ValueError shown below
condition_1 = (titanic_df.Sex == 'male') and (titanic_df.Survived == 1)
titanic_df[condition_1]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [None]:
# Use the element-wise `&` operator instead of `and`
survived_cond = titanic_df.Survived == 1
sex_cond = titanic_df.Sex == 'male'
titanic_df[survived_cond & sex_cond]
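
The `ValueError` above appears because `and` asks Python for one single truth value per Series, while a mask needs one value per row. The element-wise operators `&` (and), `|` (or) and `~` (not) work row by row. A minimal sketch on made-up rows; `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for titanic_df
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": [1, 1, 0, 0],
})

# Wrap each comparison in parentheses: & binds tighter than ==
male_survivors = toy_df[(toy_df["Sex"] == "male") & (toy_df["Survived"] == 1)]
male_or_survived = toy_df[(toy_df["Sex"] == "male") | (toy_df["Survived"] == 1)]
not_male = toy_df[~(toy_df["Sex"] == "male")]

print(male_survivors)
```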

In [22]:
titanic_df.isin([1,2])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,False,False,False,False,False,True,False,False,False,False,False
1,True,True,True,False,False,False,True,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,False,False
3,False,True,True,False,False,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,True,False,False,False,False,False,False,False,False,False
887,False,True,True,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,False,True,True,False,False,False,False
889,False,True,True,False,False,False,False,False,False,False,False,False
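
Calling `.isin()` on the whole DataFrame, as in the cell above, tests every cell of every column, which is why `PassengerId` and `SibSp` also show `True`. To keep only the passengers in class 1 or 2, one option is to call `.isin()` on the `Pclass` column and use the result as a mask. A sketch on made-up rows (`toy_df` is a hypothetical stand-in for `titanic_df`):

```python
import pandas as pd

# Hypothetical rows standing in for titanic_df
toy_df = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Pclass": [3, 1, 2, 3]})

# Boolean Series: True where Pclass is 1 or 2
in_first_or_second = toy_df["Pclass"].isin([1, 2])

# Use the mask to keep only the matching rows
upper_classes = toy_df[in_first_or_second]
print(upper_classes)
```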


In [23]:
titanic_df['discount'] = titanic_df['Fare']*0.10
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,discount
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0.72500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,7.12833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0.79250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,5.31000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0.80500
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1.30000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,3.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,2.34500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,3.00000
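
The `discount` column above holds the amount of the discount, not the reduced fare. If the fare after the discount is also needed, it could be derived as below. This is a sketch on made-up fares; the `fare_after_discount` column name and `discount_rate` variable are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical fares standing in for titanic_df['Fare']
toy_df = pd.DataFrame({"Fare": [10.0, 50.0, 80.0]})

discount_rate = 0.10  # 10 percent, as in the bullet list above
toy_df["discount"] = toy_df["Fare"] * discount_rate
toy_df["fare_after_discount"] = toy_df["Fare"] - toy_df["discount"]
print(toy_df)
```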


## 3. Aggregating Data

#### Summarizing numerical data
- .mean()
- .median()
- .min()
- .max()
- .var()
- .std()
- .sum()
- .quantile()

In [24]:
titanic_df["Age"].mean()

29.69911764705882

In [25]:
np.mean(titanic_df["Age"])

29.69911764705882

In [26]:
titanic_df["Age"].median()

28.0

In [27]:
titanic_df["Age"].min()

0.42

In [28]:
titanic_df["Age"].max()

80.0

In [29]:
titanic_df["Age"].quantile(0.15)

17.0

In [30]:
titanic_df["Age"].var()

211.0191247463081

In [31]:
titanic_df["Age"].std()

14.526497332334044

#### .agg() method

- Applies one or more operations to a single column or to multiple columns
- Defining a function uses parameters, e.g. `def pct30(column): return column.quantile(0.3)`
- Calling a function passes arguments, e.g. `titanic_df['Age'].agg(pct30)`

In [32]:
def pct30(column):
    return column.quantile(0.3)

titanic_df['Age'].agg(pct30)

22.0

In [33]:
titanic_df[['Age', 'Pclass']].agg(['mean', 'median', pct30]) # built-in statistics by name, custom function by reference

Unnamed: 0,Age,Pclass
mean,29.699118,2.308642
median,28.0,3.0
pct30,22.0,2.0


#### Characteristics of a lambda function

- Written in a single line (one expression)
- Has no name (anonymous)
- Does not need to be defined before it is used
- Is not meant to be reused afterwards

In [34]:
titanic_df.Age.agg(lambda x:x.quantile(0.3))

22.0

## 4. Cumulative statistics
- .cumsum()
- .cummax()
- .cummin()
- .cumprod()

lst = [1,2,3,4,5,6]
cumsum = [1,3,6,10,15,21] # each entry is the running total of all values up to that position
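
The same running total can be checked with pandas, together with how missing values behave: with the default `skipna=True`, a `NaN` stays `NaN` at its own position but the accumulation carries on afterwards, which matches the `NaN` at row 888 in the `Age` outputs. A minimal sketch; the `with_gap` Series is a made-up example.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6])
print(values.cumsum().tolist())  # [1, 3, 6, 10, 15, 21]

# Hypothetical Series with one missing value
with_gap = pd.Series([1.0, 2.0, np.nan, 4.0])
running = with_gap.cumsum()  # NaN is kept in place, the total continues after it
print(running.tolist())
```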

In [35]:
titanic_df.Age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [36]:
titanic_df.Age.cumsum()

0         22.00
1         60.00
2         86.00
3        121.00
4        156.00
         ...   
886    21128.17
887    21147.17
888         NaN
889    21173.17
890    21205.17
Name: Age, Length: 891, dtype: float64

In [37]:
titanic_df.Age.cummin()

0      22.00
1      22.00
2      22.00
3      22.00
4      22.00
       ...  
886     0.42
887     0.42
888      NaN
889     0.42
890     0.42
Name: Age, Length: 891, dtype: float64

In [38]:
titanic_df.Age.cumprod() # the running product of hundreds of ages exceeds the float64 range, so later values become inf

0            22.0
1           836.0
2         21736.0
3        760760.0
4      26626600.0
          ...    
886           inf
887           inf
888           NaN
889           inf
890           inf
Name: Age, Length: 891, dtype: float64

In [39]:
titanic_df.Age.cummax()

0      22.0
1      38.0
2      38.0
3      38.0
4      38.0
       ... 
886    80.0
887    80.0
888     NaN
889    80.0
890    80.0
Name: Age, Length: 891, dtype: float64

## 5. Counting

- So far in this chapter, you've learned how to **summarize numeric variables**. In this part of the notebook, you'll learn how to **summarize categorical data** using counting.

- Categorical variables represent types of **data which may be divided into groups**. Examples of categorical variables are race, sex, age group, and education level.

In [40]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,discount
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0.72500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,7.12833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0.79250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,5.31000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0.80500
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1.30000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,3.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,2.34500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,3.00000


In [41]:
titanic_df.drop(columns=['Embarked','Parch'], inplace = True) # inplace=True modifies titanic_df itself, so both columns are gone for the rest of the notebook

In [42]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Ticket,Fare,Cabin,discount
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,A/5 21171,7.2500,,0.72500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,PC 17599,71.2833,C85,7.12833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,STON/O2. 3101282,7.9250,,0.79250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,113803,53.1000,C123,5.31000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,373450,8.0500,,0.80500
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,211536,13.0000,,1.30000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,112053,30.0000,B42,3.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,W./C. 6607,23.4500,,2.34500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,111369,30.0000,C148,3.00000


#### Drop Duplicates by Single & Multiple Columns
- drop_duplicates(subset= "")
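
No cell in this notebook runs `drop_duplicates`, so here is a minimal sketch of the single-column and multiple-column cases. The `toy_df` frame and its names are made up for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical rows with repeated names
toy_df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob"],
    "Pclass": [1, 1, 2, 3],
    "Fare": [30.0, 35.0, 10.0, 8.0],
})

# Single column: keep the first row seen for each Name
by_name = toy_df.drop_duplicates(subset="Name")

# Multiple columns: rows count as duplicates only if Name AND Pclass both match
by_name_and_class = toy_df.drop_duplicates(subset=["Name", "Pclass"])

print(by_name)
print(by_name_and_class)
```

By default the first occurrence is kept (`keep="first"`); the original row labels are preserved in the result.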

#### Getting Count Stats using `.value_counts()`
- `sort=False`: do not sort the result by count
- `normalize=True`: return proportions instead of counts

In [43]:
titanic_df.Age.value_counts(sort=False).sort_values(ascending=True)

Age
74.0     1
14.5     1
70.5     1
12.0     1
36.5     1
        ..
30.0    25
19.0    25
18.0    26
22.0    27
24.0    30
Name: count, Length: 88, dtype: int64

In [44]:
titanic_df.Age.value_counts(normalize=True) # normalize=True returns each value's share of the non-missing rows instead of raw counts

Age
24.00    0.042017
22.00    0.037815
18.00    0.036415
19.00    0.035014
28.00    0.035014
           ...   
36.50    0.001401
55.50    0.001401
0.92     0.001401
23.50    0.001401
74.00    0.001401
Name: proportion, Length: 88, dtype: float64
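
The cells above count a numeric column, while this section is about categorical data. A sketch of the same calls on a categorical column follows. The `embarked` Series is made up for illustration (the real `Embarked` column was dropped earlier in this notebook), and `dropna=False` is shown as one way to include missing values in the count.

```python
import pandas as pd

# Hypothetical port codes with one missing value
embarked = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = embarked.value_counts()                      # missing values are excluded by default
counts_with_na = embarked.value_counts(dropna=False)  # missing values counted as their own group
shares = embarked.value_counts(normalize=True)        # fractions of the non-missing values

print(counts)
print(shares)
```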

## 6. Group summary statistics

- Average age of Males & Females Using subsetting
- Average age of Males & Females Using `.groupby()`
- Apply different statistics such as mean, count, min and max to each group.

In [45]:
cond = titanic_df['Sex'] == "male"
titanic_df[cond]['Age'].mean()

30.72664459161148

In [46]:
titanic_df[titanic_df['Sex'] == "female"]['Age'].mean()

27.915708812260537

In [47]:
# groupby first finds the categories of Sex (male, female),
# then selects the Age column and computes the mean within each category
titanic_df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
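
To see that `.groupby()` gives the same numbers as subsetting each category by hand, here is a minimal sketch on made-up rows (`toy_df` is a hypothetical stand-in for `titanic_df`). Missing ages are skipped by `.mean()` in both approaches.

```python
import pandas as pd

# Hypothetical rows, including one missing age
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [20.0, 30.0, 40.0, None],
})

# One call computes the mean for every category
by_sex = toy_df.groupby("Sex")["Age"].mean()

# Manual subsetting needs one line per category
manual_male = toy_df[toy_df["Sex"] == "male"]["Age"].mean()
manual_female = toy_df[toy_df["Sex"] == "female"]["Age"].mean()

print(by_sex)
```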

In [48]:
pd.DataFrame(titanic_df.groupby(['Sex','Survived'])['Age'].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Sex,Survived,Unnamed: 2_level_1
female,0,25.046875
female,1,28.847716
male,0,31.618056
male,1,27.276022


In [49]:
pd.DataFrame(titanic_df.groupby(['Sex','Survived'])['Age'].count())

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Sex,Survived,Unnamed: 2_level_1
female,0,64
female,1,197
male,0,360
male,1,93


In [50]:
pd.DataFrame(titanic_df.groupby(['Sex','Survived'])['Age'].agg(['mean', 'count'])) # Survived: 0 = died, 1 = survived

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Sex,Survived,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0,25.046875,64
female,1,28.847716,197
male,0,31.618056,360
male,1,27.276022,93


In [52]:
pd.DataFrame(titanic_df.groupby(['Survived', 'Sex'])[['Age', 'SibSp']].agg(['count','min','max']))

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Age,Age,SibSp,SibSp,SibSp
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,max,count,min,max
Survived,Sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,female,64,2.0,57.0,81,0,8
0,male,360,1.0,74.0,468,0,8
1,female,197,0.75,63.0,233,0,4
1,male,93,0.42,80.0,109,0,4


## Pivot tables

In [55]:
titanic_df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

In [53]:
titanic_df.pivot_table(index='Sex', values = 'Age')

Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.915709
male,30.726645


- The `values` argument is the column that you want to summarize, and the `index` argument is the column that you want to group by.
- By default, `pivot_table` takes the **mean** value for each group.

In [58]:
# pivot without aggfunc: the default aggregation is the mean
titanic_df.pivot_table(values = 'Age', index='Sex')

Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.915709
male,30.726645


In [62]:
# explicitly choose the statistic, e.g. the median
titanic_df.pivot_table(values='Age', index='Sex', aggfunc='median')


Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.0
male,29.0


In [65]:
# multiple statistics
titanic_df.pivot_table(values='Age', index='Sex', aggfunc=['std', 'median'])


Unnamed: 0_level_0,std,median
Unnamed: 0_level_1,Age,Age
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2
female,14.110146,27.0
male,14.678201,29.0


#### Pivot on two variables
To group by two variables, we can pass the second variable name to the `columns` argument.

In [66]:
# equivalent with groupby:
# titanic_df.groupby(['Sex','Survived'])['Age'].mean().unstack()

# pivot on two variables
titanic_df.pivot_table(values='Age', index='Sex', columns='Survived')


Survived,0,1
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,25.046875,28.847716
male,31.618056,27.276022


In [68]:
titanic_df.pivot_table(index='Sex', values = 'Age', columns='Survived' ,fill_value = 0) # default aggfunc=mean

Survived,0,1
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,25.046875,28.847716
male,31.618056,27.276022


In [60]:
titanic_df.pivot_table(index='Sex',
                       values = 'Age',
                       columns='Survived',
                       aggfunc=['count','mean'])

Unnamed: 0_level_0,count,count,mean,mean
Survived,0,1,0,1
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,64,197,25.046875,28.847716
male,360,93,31.618056,27.276022


#### Summarizing with pivot table margins
Setting `margins=True` adds an `All` row and column, so we can see the summary statistic at several levels of the dataset: for the entire dataset, grouped by one variable, grouped by the other variable, and grouped by both variables.


In [69]:
titanic_df.pivot_table(values='Age', 
                    index='Sex', 
                    columns='Survived',
                    fill_value=0,
                    margins=True)

Survived,0,1,All
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,25.046875,28.847716,27.915709
male,31.618056,27.276022,30.726645
All,30.626179,28.34369,29.699118


In [71]:
titanic_df.pivot_table(values='Age', 
                    index='Sex', 
                    columns='Survived',
                    fill_value=0,
                    margins=True,
                    margins_name='mean')

Survived,0,1,mean
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,25.046875,28.847716,27.915709
male,31.618056,27.276022,30.726645
mean,30.626179,28.34369,29.699118
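
One detail worth checking is how the margin values are computed: each margin is the statistic over all underlying rows of that group, not the average of the cells next to it. A minimal sketch on made-up rows with unequal group sizes (`toy_df` is a hypothetical stand-in for `titanic_df`):

```python
import pandas as pd

# Hypothetical rows: three males, one female
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female"],
    "Survived": [0, 0, 1, 1],
    "Age": [40.0, 50.0, 20.0, 30.0],
})

table = toy_df.pivot_table(values="Age", index="Sex",
                           columns="Survived", margins=True)
print(table)

male_margin = table.loc["male", "All"]   # mean of all three male ages
overall = table.loc["All", "All"]        # mean of all four ages
mean_of_cells = (table.loc["male", 0] + table.loc["male", 1]) / 2
```

Here `male_margin` differs from `mean_of_cells` because two of the three males fall in the `Survived == 0` cell, so that cell carries more weight in the margin.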


#### Thanks:)
- Assignment work on Dogs Dataset