In [1]:
import pandas as pd 



In [2]:
titanic = pd.read_csv(r"C:\Users\Casper\Desktop\pandada\titanic.csv")

# How to calculate summary statistics

In [3]:
# What is the average age of the Titanic passengers?
titanic["Age"].mean()

29.69911764705882

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default, so each column yields a single value.
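
To make the two defaults concrete, here is a minimal sketch on a small made-up DataFrame (the values are hypothetical and not taken from the Titanic file). It contrasts the default behaviour with `skipna=False` and `axis=1`, both of which are standard parameters of `DataFrame.mean`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})

# Default: missing values are skipped and the statistic is computed
# down the rows, giving one result per column.
col_means = df.mean()

# skipna=False lets a missing value propagate into the result.
strict_means = df.mean(skipna=False)

# axis=1 computes across the columns instead, giving one result per row.
row_means = df.mean(axis=1)

print(col_means)
print(strict_means)
print(row_means)
```

With the defaults, the missing age is simply left out of the calculation, which is why the Titanic average age above is based on fewer values than there are passengers.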

In [6]:
# What is the median age and ticket fare (price) of the Titanic passengers?
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

An aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function from the first tutorial?

In [7]:
titanic[["Age", "Fare"]].describe()

              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200


Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:

In [8]:
titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)

              Age        Fare
min      0.420000    0.000000
max     80.000000  512.329200
median  28.000000   14.454200
skew     0.389108         NaN
mean          NaN   32.204208
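
A statistic that is requested for only one of the columns shows up as a missing value (NaN) for the other column. The following sketch illustrates this on a small made-up DataFrame (hypothetical values, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# "min" is requested only for Age and "mean" only for Fare;
# "max" is requested for both columns.
result = df.agg({"Age": ["min", "max"], "Fare": ["max", "mean"]})

print(result)
```

The result has one row per requested statistic, and a cell is only filled where that statistic was asked for in that column.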


# Aggregating statistics grouped by category 

In [9]:
# What is the average age for male versus female Titanic passengers? 
titanic[["Sex", "Age"]].groupby("Sex").mean()

              Age
Sex
female  27.915709
male    30.726645


Calculating a given statistic for each category in a column is a common pattern.
The groupby method supports this type of operation. It fits in the more general split-apply-combine pattern:
* Split the data into groups 
* Apply a function to each group independently 
* Combine the results into a data structure 
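
The three steps above can be spelled out by hand to see what groupby does. This is only an illustrative sketch on made-up data (the values and the names `groups`, `means`, and `manual` are hypothetical), not how pandas implements it internally:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male"],
        "Age": [30.0, 20.0, 40.0, 60.0],
    }
)

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
groups = {key: group for key, group in df.groupby("Sex")}

# Apply: compute the statistic for each group independently.
means = {key: group["Age"].mean() for key, group in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="Age")
manual.index.name = "Sex"

# The usual one-liner performs all three steps at once.
direct = df.groupby("Sex")["Age"].mean()

print(manual)
print(direct)
```

Both Series hold the same mean age per group; the one-liner is shorter and is the form normally used.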

The apply and combine steps are typically done together in pandas.
In the previous example, we explicitly selected the two columns first. If we do not, passing numeric_only=True applies the mean method to every column that contains numerical data:

In [10]:
titanic.groupby("Sex").mean(numeric_only=True)

        PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
Sex
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682  44.479818
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702  25.523893


It does not make much sense to get the average value of Pclass. If we are only interested in the average age for each gender, column selection (square brackets [] as usual) is supported on the grouped data as well:

In [11]:
titanic.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
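
Selecting before or after grouping gives the same numbers, but the type of the result can differ. A minimal sketch on made-up data (hypothetical values, not the Titanic file) shows the three common variants:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male"],
        "Age": [30.0, 20.0, 40.0, 60.0],
        "Pclass": [1, 3, 2, 3],
    }
)

# Select the columns first, then group: the result is a DataFrame.
as_frame = df[["Sex", "Age"]].groupby("Sex").mean()

# Group first, then select a single column: the result is a Series.
as_series = df.groupby("Sex")["Age"].mean()

# Selecting with a list on the grouped data keeps a DataFrame.
as_frame_again = df.groupby("Sex")[["Age"]].mean()

print(type(as_frame).__name__, type(as_series).__name__)
```

Selecting on the grouped object avoids computing statistics for columns such as Pclass that are not of interest.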

![groupby with column selection](https://pandas.pydata.org/docs/_images/06_groupby_select_detail.svg)