# 3: Functions and descriptive statistics

Last week we learned how to select rows, columns and elements from a dataframe. In this week's tutorial, we will explore some common summary functions that let us quickly draw insights about the different features in a dataframe.

As with last week, we will be working with the [Titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle.

## Import pandas library

In [1]:
#from pandas.core.computation.check import NUMEXPR_INSTALLED
import pandas as pd

## Import data

In [2]:
data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
data.shape

(891, 12)

## Summary functions

Summary functions like describe and info give a high-level summary of our data.

Let's see how they work.

In [4]:
data["Parch"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Parch
Non-Null Count  Dtype
--------------  -----
891 non-null    int64
dtypes: int64(1)
memory usage: 7.1 KB


In [5]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
data.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Dooley, Mr. Patrick",male,,,,347082.0,,G6,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [7]:
# Describe function on numerical variable

data[['Fare']].describe()

Unnamed: 0,Fare
count,891.0
mean,32.204208
std,49.693429
min,0.0
25%,7.9104
50%,14.4542
75%,31.0
max,512.3292


In [8]:
# Describe function on text variable

data['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [9]:
data["Sex"].describe()

count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object

In [10]:
data['Embarked']

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

## Unique and value counts function

In [11]:
# How many unique Embarked values are there?

data['Embarked'].nunique(dropna=False)

4

In [12]:
# What are the unique Embarked values?

data['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [13]:
# What are the counts of those individual values?

data['Embarked'].value_counts(ascending= True, dropna=False)

Embarked
NaN      2
Q       77
C      168
S      644
Name: count, dtype: int64

In [14]:
data['Embarked'].value_counts(dropna= False)

Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64

In [15]:
data['Embarked'].value_counts(normalize=True)  # proportions instead of counts; NaN is excluded by default

Embarked
S    0.724409
C    0.188976
Q    0.086614
Name: proportion, dtype: float64
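
To make the `normalize=True` option concrete, here is a minimal sketch on a toy series (the values are illustrative and are not taken from `train.csv`). It suggests that the proportions are simply the counts divided by the number of non-missing values.

```python
import pandas as pd

# Toy stand-in for the Embarked column (illustrative values only)
ports = pd.Series(["S", "S", "C", "Q", None, "S"])

counts = ports.value_counts()                     # missing values are dropped by default
proportions = ports.value_counts(normalize=True)  # share of each value among non-missing entries

# Dividing the counts by the number of non-missing values gives the same numbers
manual = counts / ports.count()

print(proportions)
print(manual)
```

The same arithmetic applies to the real column: 644 of the 889 non-missing Embarked values are S, which is about 0.724.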

In [16]:
data['Embarked'].count()

np.int64(889)

## Descriptive statistics

In [17]:
# What is the oldest age?

print(data['Age'].max())

80.0


In [18]:
print(data['Age'].min())

0.42


In [19]:
# Who is that passenger?
# Recall loc function from last week

data.loc[data['Age'] == 80, :]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S


In [20]:
# Who is that passenger?

data.loc[data['Age'] == 0.42, :]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C


In [21]:
# Who is that passenger?

data.loc[data['Age'] == data['Age'].min(), :]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C


In [22]:
# What is the average age?

print(data['Age'].mean())   # sum of the non-missing values / number of non-missing values

29.69911764705882


In [23]:
data.loc[data['Age'] == 29, :].count()

PassengerId    20
Survived       20
Pclass         20
Name           20
Sex            20
Age            20
SibSp          20
Parch          20
Ticket         20
Fare           20
Cabin           5
Embarked       20
dtype: int64

In [24]:
# What is the median fare?

print(data['Fare'].median())

14.4542


In [25]:
# What is the most frequent Age value?
# mode() returns a Series rather than a single number because several values can tie
# For a text column such as Embarked, the result can be cross-checked with value_counts above

print(data['Age'].mode())

0    24.0
Name: Age, dtype: float64
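
Because `mode()` returns a Series, it can be helpful to see what happens with a text column and with ties. The sketch below uses toy values (assumptions for illustration, not the real dataset).

```python
import pandas as pd

# Toy stand-in for the Embarked column (illustrative values only)
ports = pd.Series(["S", "C", "S", "Q", "S", None])

print(ports.mode())      # a Series holding the most frequent value(s)
print(ports.mode()[0])   # index with [0] to get the first mode as a plain value

# When two values tie, mode() returns both of them
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())
```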


There are more functions for descriptive statistics than what I have shown here. If you are interested, you can have a look at [this page](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm).

## Map and apply function

Both map and apply help us transform our data. map is a Series method, so it works on a single column, whereas apply works on a single column as well as on an entire dataframe.
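
As a rough illustration of that difference, the sketch below runs both methods on a small toy dataframe (the column names `a` and `b` are made up for this example). It uses lambda functions, which are short inline functions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # toy dataframe

# Series.map: the function is called on each value of one column
doubled = df["a"].map(lambda x: x * 2)

# Series.apply: also called on each value of one column
plus_one = df["b"].apply(lambda x: x + 1)

# DataFrame.apply: by default the function receives one whole column at a time
col_range = df.apply(lambda col: col.max() - col.min())

# With axis=1 the function receives one row at a time
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(doubled.tolist(), plus_one.tolist())
print(col_range.to_dict(), row_sum.tolist())
```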

Because this is a beginner's course to pandas as well as Python, I want to first go over some basics about functions before we get into how we can use map and apply functions.

So what is a function? The easiest way to think about a function is that it takes in one or more variables and returns an output. For example, y = x + 1 is a function. It takes in a number x and returns that number plus one.

The methods for descriptive statistics in the section above, such as max, min and mean, are all examples of functions that are already built into pandas, so we don't have to write them ourselves. But what if we have come up with our own transformation that we would like to apply to our dataframe? This is where map and apply come in.

So what's the game plan?
1. First, we have to write out our desired function.
2. Then, we need to apply that function over a series in our dataframe (via map) or over the entire dataframe (via apply).
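
The two steps above can be sketched as follows. The dataframe is a toy one that borrows Titanic-style column names, and `is_child` and `family_size` are hypothetical helper functions invented for this illustration.

```python
import pandas as pd

# Toy rows using Titanic-style column names (values are illustrative)
df = pd.DataFrame({"Age": [22.0, 4.0, 35.0], "SibSp": [1, 0, 1], "Parch": [0, 2, 0]})

# Step 1: write the functions (hypothetical helpers)
def is_child(age):
    return age < 18

def family_size(row):
    return row["SibSp"] + row["Parch"] + 1  # the +1 counts the passenger

# Step 2a: map the first function over a single column
df["Is Child"] = df["Age"].map(is_child)

# Step 2b: apply the second function over the dataframe, one row at a time
df["Family Size"] = df.apply(family_size, axis=1)

print(df)
```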

In Python, there are two ways to write functions that you should know. The first is with def, and the second is with a lambda function, which is a slightly quicker way to write short functions. In this next section, I will show you both methods.

In [26]:
# Say we want to write a function which computes the cube of a number
# Method 1: def

def cube(n):
    output = n ** 3
    return output

cube(2)

8

In [27]:
# Method 2: lambda function

cube = lambda n: n ** 3
cube(3)

27

Now that we have learned how to write functions, let's move on to applying functions to our dataframe.

Suppose we would like to extract the last name from the Name column of our dataset. This requires a string method called split, but don't worry, I will explain it clearly in the video tutorial.

In [28]:
str1 = "Ahmed Hassan Amr Mohamed"
str2 = str1.split(" ")
print(str2[0])

Ahmed
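
The same idea carries over to the names in our dataset, which follow the pattern "Last name, Title. First names". As a small sketch using the first name in the dataset:

```python
name = "Braund, Mr. Owen Harris"

# Splitting on the comma separates the last name from the rest
parts = name.split(",")
last_name = parts[0]

# Splitting the rest on the full stop isolates the title;
# strip() removes the space left over after the comma
title = parts[1].split(".")[0].strip()

print(parts)
print(last_name, title)
```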


In [29]:
data["Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

## Difference between apply and map functions

## apply
apply is used when you want to run a function on each value of a Series (a single variable or column).


In [30]:
def extractTitle(name):
    # "Braund, Mr. Owen Harris" -> ["Braund", " Mr. Owen Harris"]
    token = name.split(',')
    # " Mr. Owen Harris" -> [" Mr", " Owen Harris"]
    token2 = token[1].split('.')
    # strip() removes the leading space left over from the split
    return token2[0].strip()

# Apply the function to the Name column and assign the result to a new column called Titles
data['Titles'] = data['Name'].apply(extractTitle)  # using a user-defined function
data['Titles'] = data['Name'].apply(lambda name: name.split('.')[0].split(',')[1].strip())  # same result using a lambda function
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Titles
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr


In [31]:
# Define our function
def extractLastName(name):
    token = name.split(',')
    #print(token)
    return token[0]

# Apply the function to the Name column and assign the result to a new column called Last Name
data['Last Name'] = data['Name'].apply(extractLastName)  # using a user-defined function
data['Last Name'] = data['Name'].apply(lambda x: x.split(',')[0])  # same result using a lambda function

# Let's have a look at the first 5 rows
data.loc[:4, ['Last Name','Name']]
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Titles,Last Name
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,Heikkinen
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,Futrelle
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,Allen



## map
map is used to substitute each value in a Series with another value, for example by looking it up in a dictionary.
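
As a minimal sketch of substitution with a dictionary, the example below replaces the Embarked port codes with the port names on a toy series (the values are illustrative). It also hints at one thing to watch for: values that are missing from the dictionary come back as NaN.

```python
import pandas as pd

# Toy stand-in for the Embarked column (illustrative values only)
embarked = pd.Series(["S", "C", "Q", None, "S"])

# Dictionary that says which value replaces which
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
full = embarked.map(port_names)

# A value with no entry in the dictionary becomes NaN
partial = embarked.map({"S": "Southampton"})

print(full)
print(partial)
```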

## Bonus tip

You can also use the map function to encode categorical variables. This is particularly useful when you are preparing your dataset for machine learning. Most machine learning algorithms cannot learn from non-numeric inputs, so we have to turn our categorical variables into numbers before fitting a model to our data.

Examples of categorical variables in our Titanic dataset are the Pclass, Sex and Embarked columns.

Don't worry if you do not know any machine learning. This section is merely to illustrate how you can encode using the map function.

Suppose we want to encode the Sex column such that male is assigned 1 and female is assigned 0.

In [32]:
# Encode male as 1 and female as 0
data['Encoded Sex'] = data['Sex'].map({'male':1, 'female':0})

# Show the first 5 rows of Sex and Encoded Sex
data.loc[:4, ['Sex', 'Encoded Sex']]

Unnamed: 0,Sex,Encoded Sex
0,male,1
1,female,0
2,female,0
3,female,0
4,male,1


An alternative way to accomplish this is via a pandas function called get_dummies.

In [33]:
pd.get_dummies(data['Sex'])

Unnamed: 0,female,male
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True
...,...,...
886,False,True
887,True,False
888,True,False
889,False,True
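
By default get_dummies returns one True/False column per category. If a single 0/1 column like the one produced by map is preferred, one possible approach (a sketch on toy values, using the `drop_first` and `dtype` parameters of `pd.get_dummies`) is:

```python
import pandas as pd

# Toy stand-in for the Sex column (illustrative values only)
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One 0/1 column per category
dummies = pd.get_dummies(sex, dtype=int)

# drop_first=True drops the first category (female), leaving a single male column
single = pd.get_dummies(sex, drop_first=True, dtype=int)

# The remaining column matches the dictionary-based encoding
mapped = sex.map({"male": 1, "female": 0})

print(dummies)
print(single["male"].tolist(), mapped.tolist())
```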


In [34]:
pd.get_dummies(data['Pclass'])

Unnamed: 0,1,2,3
0,False,False,True
1,True,False,False
2,False,False,True
3,True,False,False
4,False,False,True
...,...,...,...
886,False,True,False
887,True,False,False
888,False,False,True
889,True,False,False


In [35]:
pd.get_dummies(data['Embarked'])

Unnamed: 0,C,Q,S
0,False,False,True
1,True,False,False
2,False,False,True
3,False,False,True
4,False,False,True
...,...,...,...
886,False,False,True
887,False,False,True
888,False,False,True
889,True,False,False


In [36]:
pd.get_dummies(data['Age'])

Unnamed: 0,0.42,0.67,0.75,0.83,0.92,1.00,2.00,3.00,4.00,5.00,...,62.00,63.00,64.00,65.00,66.00,70.00,70.50,71.00,74.00,80.00
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
887,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
889,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [37]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Titles,Last Name,Encoded Sex
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,Braund,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Cumings,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,Heikkinen,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,Futrelle,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,Allen,1


In [38]:
numeric = data.loc[:, ["PassengerId","Age", "Fare", "Parch", "SibSp", "Survived"]]
numeric

Unnamed: 0,PassengerId,Age,Fare,Parch,SibSp,Survived
0,1,22.0,7.2500,0,1,0
1,2,38.0,71.2833,0,1,1
2,3,26.0,7.9250,0,0,1
3,4,35.0,53.1000,0,1,1
4,5,35.0,8.0500,0,0,0
...,...,...,...,...,...,...
886,887,27.0,13.0000,0,0,0
887,888,19.0,30.0000,0,0,1
888,889,,23.4500,2,1,0
889,890,26.0,30.0000,0,0,1


In [39]:
numeric.corr()

Unnamed: 0,PassengerId,Age,Fare,Parch,SibSp,Survived
PassengerId,1.0,0.036847,0.012658,-0.001652,-0.057527,-0.005007
Age,0.036847,1.0,0.096067,-0.189119,-0.308247,-0.077221
Fare,0.012658,0.096067,1.0,0.216225,0.159651,0.257307
Parch,-0.001652,-0.189119,0.216225,1.0,0.414838,0.081629
SibSp,-0.057527,-0.308247,0.159651,0.414838,1.0,-0.035322
Survived,-0.005007,-0.077221,0.257307,0.081629,-0.035322,1.0
