# Getting more value from Pandas’ value_counts()

Data exploration is an important part of the Machine Learning pipeline. Before we decide which models to train and how many, we need an idea of what our data contains. The Pandas library has a number of useful functions for this purpose, and value_counts is one of them. It returns the count of each unique value in a pandas Series, such as a single column of a DataFrame. Most of the time, however, we end up using value_counts with its default parameters. So in this short article, I’ll show you how to achieve more by changing those defaults.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
train = pd.read_csv('../meta/titanic-train.csv')

Let's look at the first few rows to get an idea of the dataset.

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Calculating the number of null values

In [4]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age, Cabin and Embarked columns have null values.

# value_counts()

The [value_counts() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) returns an object containing counts of unique values. In other words, it lets us count how many times each unique value occurs in a column of a pandas DataFrame.

### Syntax

`Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)`

### Parameters

![](https://miro.medium.com/max/597/1*j5Gi_-E-b4h6tqtbYsxTrA.png)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

Let's see how we can use it in our analysis

# 1. value_counts() with default parameters

Let’s call value_counts() on the Sex and Embarked columns of the dataset. This returns the number of times each unique value occurs in the column.

In [5]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [6]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The function returns the count of each unique value in the given column in descending order, excluding null values. We can quickly see that most passengers embarked from Southampton, followed by Cherbourg and then Queenstown.
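
To make the default behaviour easy to check by hand, here is a minimal sketch on a tiny, made-up Series. The `ports` data is hypothetical and only stands in for a column like Embarked.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a categorical column (not the real Titanic data).
ports = pd.Series(["S", "C", "S", "Q", "S", np.nan, "C"])

counts = ports.value_counts()
print(counts)

# The NaN entry is left out, so the counts cover 6 of the 7 rows.
print(counts.sum(), "of", len(ports), "rows counted")

# The most frequent value comes first.
print(counts.index[0])
```

Because the result is itself a Series indexed by the unique values, a single count can be looked up with `counts["S"]`.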

# 2. value_counts() with relative frequencies of the unique values

Sometimes a share of the total is a better criterion than the raw count. With `normalize=True`, the returned object contains the relative frequencies of the unique values, here as fractions of the 889 non-null entries. `normalize` is set to `False` by default.

In [7]:
train['Embarked'].value_counts(normalize=True)

S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64
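
The relative frequencies are fractions that sum to 1. If percentages are easier to read, one possible approach is to scale and round the result, as sketched below on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column.
ports = pd.Series(["S", "S", "S", "C"])

# Relative frequencies are fractions of the non-null rows.
fractions = ports.value_counts(normalize=True)

# Multiplying by 100 and rounding turns them into percentages.
percentages = fractions.mul(100).round(1)
print(percentages)
```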

# 3. value_counts() in ascending order

To sort the results in ascending order instead, set the `ascending` parameter to `True`. It is `False` by default.

In [8]:
train['Embarked'].value_counts(ascending=True)

Q     77
C    168
S    644
Name: Embarked, dtype: int64
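
Ascending order is handy when the rare values are the interesting ones. The sketch below, on hypothetical toy data, keeps only the two least frequent values by chaining `head()`.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column.
ports = pd.Series(["S", "S", "S", "C", "C", "Q"])

# Least frequent values first; head(2) keeps the two rarest.
rarest = ports.value_counts(ascending=True).head(2)
print(rarest)
```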

# 4. value_counts() with NaN values

By default, null values are excluded from the counts. This can be changed by setting `dropna=False`.

In [9]:
train['Embarked'].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

This shows that there are 2 null values in the `Embarked` column, which matches the `isnull().sum()` result above.
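
`dropna=False` can also be combined with `normalize=True`, which changes the denominator from the non-null rows to all rows. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
ports = pd.Series(["S", "S", "C", np.nan])

# Counts including the NaN entry.
with_nan = ports.value_counts(dropna=False)
print(with_nan)

# The NaN count agrees with what isnull().sum() reports.
n_missing = int(ports.isnull().sum())
print("missing:", n_missing)

# Share of non-null rows (default) vs. share of all rows.
share_non_null = ports.value_counts(normalize=True)
share_all_rows = ports.value_counts(normalize=True, dropna=False)
print(share_non_null["S"], share_all_rows["S"])
```

Which denominator is appropriate depends on whether missing entries should count as part of the population being described.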

# 5. value_counts() with bins

value_counts() can also bin continuous data into discrete intervals with the help of the `bins` parameter. Rather than counting each distinct value, it groups the values into bins and counts how many fall into each one. This option works only with numerical data.

In [10]:
# applying value_counts on a numerical column
train['Fare'].value_counts()

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
8.4583      1
9.8375      1
8.3625      1
14.1083     1
17.4000     1
Name: Fare, Length: 248, dtype: int64

This doesn't convey much since the function above has given a count of every available Fare amount. Instead, let's group them into 7 bins.

In [11]:
train['Fare'].value_counts(bins=7)

(-0.513, 73.19]       789
(73.19, 146.38]        71
(146.38, 219.57]       15
(219.57, 292.76]       13
(439.139, 512.329]      3
(365.949, 439.139]      0
(292.76, 365.949]       0
Name: Fare, dtype: int64

Binning makes the distribution much easier to read. We can see that most passengers (789 of 891) paid 73.19 or less for their ticket.
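
With an integer, `bins` produces equal-width intervals over the range of the data, so a few very large values can push almost everything into the first bin. One possible alternative, sketched here on hypothetical fares with hand-picked edges (both are assumptions for illustration), is to bin explicitly with `pd.cut` and then count.

```python
import pandas as pd

# Hypothetical fares, not taken from the Titanic data.
fares = pd.Series([5.0, 7.5, 8.0, 12.0, 30.0, 80.0, 300.0])

# Equal-width bins: the single expensive ticket stretches the range,
# so most values land in the first interval.
equal_width = fares.value_counts(bins=3)
print(equal_width)

# Hand-picked edges (an assumption for illustration) via pd.cut,
# counted and then shown in interval order rather than by frequency.
edges = [0, 10, 50, 100, 500]
custom = pd.cut(fares, bins=edges).value_counts().sort_index()
print(custom)
```

Sorting by index keeps the intervals in their natural order, which is usually easier to read for binned data than sorting by count.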

value_counts() is a very useful method and helps to get a sense of data easily.

## References

* [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)
* [Five ways to use value_counts()](https://www.kaggle.com/parulpandey/five-ways-to-use-value-counts)