# Pandas DataFrames



---

### Table of Contents

1 - [Manipulating Columns ](#section1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Indexing &  Slicing in Pandas](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Uniqueness](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Frequencies](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Sorting](#subsection4)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.5 - [Min, Max, Range](#subsection5)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.6 - [Missing Values](#subsection6)<br>


2 - [Booleans & Boolean Indexing](#section2)<br>


---

In [45]:
import numpy as np
import pandas as pd

Let's begin by reading a new data set from the file `internet_world_data.csv`. We will use `pd.read_csv` just as we did before, with the keyword `index_col` to set the column 'Country' as our index. In addition, we will drop the column 'S.NO' and convert the columns 'Internet users' and 'Population' to floats so they are easier to work with in this notebook. Both columns are stored as text containing commas, so we strip the commas before converting.

You can learn more about the data [here](https://www.kaggle.com/datasets/ramjasmaurya/1-gb-internet-price).

In [46]:
internet = pd.read_csv('data/internet_world_data.csv', index_col='Country')

internet = internet.drop('S.NO', axis=1)
internet['Internet users'] = internet['Internet users'].str.replace(',','')
internet['Internet users'] = internet['Internet users'].astype(float)
internet['Population'] = internet['Population'].str.replace(',','')
internet['Population'] = internet['Population'].astype(float)
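
The comma-stripping step above can be tried on a small stand-in before running it on the real file. This is only a sketch: the Series `raw` and its values are made up for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a text column read from a CSV file.
raw = pd.Series(['1,234,567', '89,012', np.nan], index=['A', 'B', 'C'])

# Remove the comma separators, then convert the remaining text to floats.
# The missing entry stays missing (NaN) through both steps.
cleaned = raw.str.replace(',', '', regex=False).astype(float)
print(cleaned)
```

Passing `regex=False` makes it explicit that the comma is treated as a literal character rather than a pattern.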

In the five cells below, use some functions on the dataframe `internet` to do some **Exploratory Data Analysis** to explore the data!

_You can always look back in Notebook 06 for ideas._

In [47]:
#Exploratory Data Analysis 1

#This can be head, tail, info, columns, describe, mean, median, mode, etc.

internet.head()

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Israel,IL,NEAR EAST,27.0,0.05,0.02,20.95,0.11,0.9,6788737.0,8381516.0,28.01
Kyrgyzstan,KG,CIS (FORMER USSR),20.0,0.15,0.1,7.08,0.21,0.27,2309235.0,6304030.0,16.3
Fiji,FJ,OCEANIA,18.0,0.19,0.05,0.85,0.59,3.57,452479.0,883483.0,25.99
Italy,IT,WESTERN EUROPE,29.0,0.27,0.09,3.54,0.43,1.73,50540000.0,60627291.0,37.15
Sudan,SD,SUB-SAHARAN AFRICA,33.0,0.27,0.03,0.92,0.63,0.68,12512639.0,41801533.0,9.5


In [48]:
#Exploratory Data Analysis 2

internet.tail()

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Vatican City (Holy See),VA,Europe,,NO PROVIDERS,,,,,,,
Venezuela,VE,SOUTH AMERICA,,HYPERINFLATION,,,,,20564451.0,28887118.0,5.68
Wallis and Futuna,WF,OCEANIA,,NO PROVIDERS,,,,,1383.0,11661.0,
Democratic Republic of the Congo,CD,SUB-SAHARAN AFRICA,,Prices listed in non-convertible 'units',,,,,7011507.0,84068091.0,12.08
Zimbabwe,ZW,SUB-SAHARAN AFRICA,,UNRELIABLE EXCHANGE RATES,,,,,4472992.0,14438802.0,13.99


In [49]:
#Exploratory Data Analysis 3

internet.describe(include='all')

Unnamed: 0,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
count,242,242,231.0,242,231.0,231.0,231.0,231.0,212.0,210.0,141.0
unique,239,14,,198,,,192.0,213.0,,,
top,CW,SUB-SAHARAN AFRICA,,NO PROVIDERS,,,1.12,5.56,,,
freq,2,49,,8,,,3.0,3.0,,,
mean,,,26.61039,,1.718442,29.664848,,,23444870.0,36238860.0,37.275674
std,,,16.457446,,4.764523,59.235115,,,95213180.0,140488000.0,28.669584
min,,,1.0,,0.0,0.63,,,1034.0,1620.0,4.89
25%,,,15.0,,0.215,5.905,,,353509.5,1003262.0,16.73
50%,,,22.0,,0.63,12.5,,,2661349.0,7003837.0,27.84
75%,,,36.0,,1.37,34.29,,,9671491.0,25026460.0,49.66


In [50]:
#Exploratory Data Analysis 4

internet.mean(numeric_only=True)
#internet['NO. OF Internet Plans'].mean()


NO. OF Internet Plans             2.661039e+01
Cheapest 1GB for 30 days (USD)    1.718442e+00
Most expensive 1GB (USD)          2.966485e+01
Internet users                    2.344487e+07
Population                        3.623886e+07
Avg \n(Mbit/s)Ookla               3.727567e+01
dtype: float64

In [51]:
#Exploratory Data Analysis 5

internet.info()

<class 'pandas.core.frame.DataFrame'>
Index: 242 entries, Israel to Zimbabwe
Data columns (total 11 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Country code                                      242 non-null    object 
 1   Continental region                                242 non-null    object 
 2   NO. OF Internet Plans                             231 non-null    float64
 3   Average price of 1GB (USD)                        242 non-null    object 
 4   Cheapest 1GB for 30 days (USD)                    231 non-null    float64
 5   Most expensive 1GB (USD)                          231 non-null    float64
 6   Average price of 1GB (USD  at the start of 2021)  231 non-null    object 
 7   Average price of 1GB (USD – at start of 2020)     231 non-null    object 
 8   Internet users                                    212 non-null    float64
 9   Population      

## 1. Manipulating Columns
<a id='section1'></a>

### 1.1 Indexing &  Slicing  <a id='subsection1'></a>

There are two main ways of indexing DataFrames. We can still use our old friend, the square brackets `[ : ]`, or we can combine them with two indexers: **loc** and **iloc**.

**loc**: uses the names (labels) of rows and columns.

**iloc**: uses the integer positions of rows and columns. You can think of *iloc* as *integer-loc*.


Let's start with loc:

#### .loc[rows-label(s), columns-label(s)]
`.loc` helps us view and index our DataFrame. 
* It works with labels. Column labels are usually strings, but row labels are often numbers, in which case you pass the number itself as the row label. In `internet`, the row labels are country names because we set 'Country' as the index.
* It can take 
    * one label __(df.loc[row-label, 'col-label-1'])__
    * a list of labels __(df.loc[[row-label-1, row-label-2, row-label-4],['col-label-1',  'col-label-2', 'col-label-4']])__
    * or a _slice_ of labels __(df.loc[row-label-50 : row-label-100, 'col-label-1' : 'col-label-8'])__


**Remember!** `loc` is **inclusive**, `iloc` is **exclusive** for its stop index.
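
The three forms of `loc` and the inclusive/exclusive difference can be sketched on a tiny table. The DataFrame `toy`, its country-like row labels, and its values are invented for this illustration.

```python
import pandas as pd

# Hypothetical toy table with made-up row labels and values.
toy = pd.DataFrame(
    {'code': ['AA', 'BB', 'CC', 'DD'], 'plans': [10, 20, 30, 40]},
    index=['Aland', 'Borduria', 'Cascadia', 'Doria'],
)

# One label for the row and one for the column returns a single value.
single = toy.loc['Borduria', 'plans']

# Lists of labels return a DataFrame with exactly those rows and columns.
some = toy.loc[['Aland', 'Doria'], ['plans']]

# A label slice with loc includes the stop label: 3 rows here.
by_label = toy.loc['Aland':'Cascadia', 'code':'plans']

# A position slice with iloc excludes the stop position: 2 rows here.
by_position = toy.iloc[0:2, 0:2]

print(single)
print(by_label.shape, by_position.shape)
```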

We can still select a column of our DataFrame (aka table) with square brackets, by giving the column name.

In [52]:
# EXAMPLE

internet["Average price of 1GB (USD)"]

Country
Israel                                                                  0.05
Kyrgyzstan                                                              0.15
Fiji                                                                    0.19
Italy                                                                   0.27
Sudan                                                                   0.27
                                                      ...                   
Vatican City (Holy See)                                         NO PROVIDERS
Venezuela                                                     HYPERINFLATION
Wallis and Futuna                                               NO PROVIDERS
Democratic Republic of the Congo    Prices listed in non-convertible 'units'
Zimbabwe                                           UNRELIABLE EXCHANGE RATES
Name: Average price of 1GB (USD), Length: 242, dtype: object

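So why would we switch from plain square brackets to `loc` and `iloc`? There are a few reasons, and a main one is compute time. A single `loc` or `iloc` call can select rows and columns together, whereas chaining square brackets (for example `internet['Population']['Japan']`) first builds an intermediate object and then indexes it again. With the examples we use in this notebook, it will be impossible to notice the difference, but once we get to DataFrames with hundreds of thousands or millions of values, this will become important! 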

On a climate care note, increased compute time leads to increased electricity and data server use, which contributes to climate change! And that's part of the reason we need to consider compute time. So let's dive into learning how to use our helpers `loc` and `iloc` to be more climate conscious.
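
As a small sketch of what a single-step selection looks like, the cell below reads one value three ways: with two chained bracket lookups, with one `loc` call, and with one `iloc` call. The table `toy` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table with made-up row labels and values.
toy = pd.DataFrame(
    {'code': ['AA', 'BB', 'CC'], 'plans': [10, 20, 30]},
    index=['Aland', 'Borduria', 'Cascadia'],
)

# Two steps: first pull out the whole column, then look up the row in it.
cell_chained = toy['plans']['Borduria']

# One step with labels: row label and column label together.
cell_loc = toy.loc['Borduria', 'plans']

# One step with positions: second row, second column (counting from 0).
cell_iloc = toy.iloc[1, 1]

print(cell_chained, cell_loc, cell_iloc)
```

All three return the same value here; the single-call forms simply avoid creating the intermediate column object.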

#### Rows

Let's use `loc` to see the values in the row for 'Japan' in our DataFrame.

In [53]:
# EXAMPLE

internet.loc['Japan']

Country code                                                          JP
Continental region                                  ASIA (EX. NEAR EAST)
NO. OF Internet Plans                                               35.0
Average price of 1GB (USD)                                          3.38
Cheapest 1GB for 30 days (USD)                                      0.88
Most expensive 1GB (USD)                                           45.53
Average price of 1GB (USD  at the start of 2021)                    3.91
Average price of 1GB (USD – at start of 2020)                       10.4
Internet users                                               117400000.0
Population                                                   127202192.0
Avg \n(Mbit/s)Ookla                                                44.05
Name: Japan, dtype: object

In [54]:
#Try using loc to find data on the United States
internet.loc['United States']

Country code                                                      US
Continental region                                  NORTHERN AMERICA
NO. OF Internet Plans                                           45.0
Average price of 1GB (USD)                                      3.33
Cheapest 1GB for 30 days (USD)                                   1.0
Most expensive 1GB (USD)                                        30.0
Average price of 1GB (USD  at the start of 2021)                   8
Average price of 1GB (USD – at start of 2020)                   8.34
Internet users                                           312320000.0
Population                                               327096265.0
Avg \n(Mbit/s)Ookla                                            61.12
Name: United States, dtype: object

In [55]:
#Use loc to find data on the country of your choice!
internet.loc['Poland']

Country code                                                    PL
Continental region                                  EASTERN EUROPE
NO. OF Internet Plans                                         60.0
Average price of 1GB (USD)                                    0.64
Cheapest 1GB for 30 days (USD)                                0.03
Most expensive 1GB (USD)                                     23.02
Average price of 1GB (USD  at the start of 2021)               0.7
Average price of 1GB (USD – at start of 2020)                 1.32
Internet users                                          34697848.0
Population                                              37921592.0
Avg \n(Mbit/s)Ookla                                          40.14
Name: Poland, dtype: object

`iloc` uses **integer positions** instead of labels, counting from 0. Try running the cell below:

In [56]:
internet.iloc[10]

Country code                                                    FR
Continental region                                  WESTERN EUROPE
NO. OF Internet Plans                                         45.0
Average price of 1GB (USD)                                    0.41
Cheapest 1GB for 30 days (USD)                                0.09
Most expensive 1GB (USD)                                     118.2
Average price of 1GB (USD  at the start of 2021)              0.81
Average price of 1GB (USD – at start of 2020)                 1.21
Internet users                                          59470000.0
Population                                              64990511.0
Avg \n(Mbit/s)Ookla                                          60.94
Name: France, dtype: object

You can pass in the positions as a list to return a DataFrame instead of a Series, as you can see in the cell below:

In [57]:
internet.iloc[[10]]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
France,FR,WESTERN EUROPE,45.0,0.41,0.09,118.2,0.81,1.21,59470000.0,64990511.0,60.94
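
The difference between passing a single position and a one-element list can be checked directly by looking at the type of the result. The table `toy` below is a made-up stand-in.

```python
import pandas as pd

# Hypothetical toy table with made-up row labels and values.
toy = pd.DataFrame(
    {'code': ['AA', 'BB', 'CC'], 'plans': [10, 20, 30]},
    index=['Aland', 'Borduria', 'Cascadia'],
)

row_as_series = toy.iloc[1]    # a single position returns a Series
row_as_frame = toy.iloc[[1]]   # a list of positions returns a DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__, row_as_frame.shape)
```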


In [58]:
#EXERCISE - Use iloc to grab information on the country found at index 50

internet.iloc[[50]] #or internet.iloc[50]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Nicaragua,NI,CENTRAL AMERICA,30.0,0.94,0.02,2.82,1.71,65.83,1732218.0,6465501.0,18.3


You can also grab specific rows by passing a list of indices:

In [59]:
internet.iloc[[1,3,6,8,9]]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Kyrgyzstan,KG,CIS (FORMER USSR),20.0,0.15,0.1,7.08,0.21,0.27,2309235.0,6304030.0,16.3
Italy,IT,WESTERN EUROPE,29.0,0.27,0.09,3.54,0.43,1.73,50540000.0,60627291.0,37.15
Moldova,MD,EASTERN EUROPE,18.0,0.32,0.07,2.79,1.12,2.82,3083783.0,4051944.0,29.46
Sri Lanka,LK,ASIA (EX. NEAR EAST),60.0,0.38,0.0,5.53,0.51,0.78,7121116.0,21228763.0,13.15
Chile,CL,SOUTH AMERICA,59.0,0.39,0.24,1.83,0.71,1.87,14864456.0,18729160.0,22.49


Recall __start:stop:step__ from lists? We can also select a range of rows with a specified step value in our DataFrame. Below we take every 2nd row from row 0 to row 10.

**Remember `iloc` is exclusive for the stop index! In other words, it only goes until stopIndex-1**

In [60]:
internet.iloc[0:11:2]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Israel,IL,NEAR EAST,27.0,0.05,0.02,20.95,0.11,0.9,6788737.0,8381516.0,28.01
Fiji,FJ,OCEANIA,18.0,0.19,0.05,0.85,0.59,3.57,452479.0,883483.0,25.99
Sudan,SD,SUB-SAHARAN AFRICA,33.0,0.27,0.03,0.92,0.63,0.68,12512639.0,41801533.0,9.5
Moldova,MD,EASTERN EUROPE,18.0,0.32,0.07,2.79,1.12,2.82,3083783.0,4051944.0,29.46
Sri Lanka,LK,ASIA (EX. NEAR EAST),60.0,0.38,0.0,5.53,0.51,0.78,7121116.0,21228763.0,13.15
France,FR,WESTERN EUROPE,45.0,0.41,0.09,118.2,0.81,1.21,59470000.0,64990511.0,60.94
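
To see that `iloc` slicing follows the same start:stop:step rule as list slicing, here is a sketch that applies the same slice to a plain list and to a made-up one-column DataFrame named `toy`.

```python
import pandas as pd

# Hypothetical toy data: the numbers 0 to 11 as a list and as a DataFrame column.
numbers = list(range(12))
toy = pd.DataFrame({'value': numbers})

# Same slice on both: start at 0, stop before 11, take every 2nd item.
from_list = numbers[0:11:2]
from_frame = toy.iloc[0:11:2]

print(from_list)
print(list(from_frame['value']))
```

Position 10 is included because the stop value is 11; with a stop of 10 the last row returned would be position 8.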


In [61]:
#EXERCISE - use iloc to get every 5th element from row 0 to row 150

internet.iloc[0:151:5]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Israel,IL,NEAR EAST,27.0,0.05,0.02,20.95,0.11,0.9,6788737.0,8381516.0,28.01
Russia,RU,CIS (FORMER USSR),22.0,0.29,0.13,1.86,0.52,0.91,124000000.0,145734038.0,20.46
France,FR,WESTERN EUROPE,45.0,0.41,0.09,118.2,0.81,1.21,59470000.0,64990511.0,60.94
Algeria,DZ,NORTHERN AFRICA,20.0,0.51,0.16,2.24,0.65,5.15,26350000.0,42228408.0,12.44
Uzbekistan,UZ,CIS (FORMER USSR),60.0,0.6,0.01,23.75,1.34,3.27,16692456.0,32476244.0,13.27
Poland,PL,EASTERN EUROPE,60.0,0.64,0.03,23.02,0.7,1.32,34697848.0,37921592.0,40.14
Ukraine,UA,CIS (FORMER USSR),19.0,0.75,0.14,35.88,0.46,5.93,31100000.0,44246156.0,15.62
Myanmar,MM,ASIA (EX. NEAR EAST),35.0,0.78,0.0,14.15,0.78,0.87,16374103.0,53708320.0,24.06
Bhutan,BT,ASIA (EX. NEAR EAST),42.0,0.83,0.33,1.06,1.16,1.49,388541.0,754388.0,
Malaysia,MY,ASIA (EX. NEAR EAST),60.0,0.89,0.12,7.26,1.12,1.66,25343685.0,31528033.0,25.87


We can also use `loc` to grab columns. Don't forget that we are still using `loc`, so we will have to use column labels.

In [62]:
# EXAMPLE

internet.loc[:,'Most expensive 1GB (USD)']

Country
Israel                              20.95
Kyrgyzstan                           7.08
Fiji                                 0.85
Italy                                3.54
Sudan                                0.92
                                    ...  
Vatican City (Holy See)               NaN
Venezuela                             NaN
Wallis and Futuna                     NaN
Democratic Republic of the Congo      NaN
Zimbabwe                              NaN
Name: Most expensive 1GB (USD), Length: 242, dtype: float64

Another way to select a single column is to pass its label inside a list. This returns a one-column DataFrame, rather than a Series, because we passed a list. 

In [63]:
# EXAMPLE

internet.loc[:,['Most expensive 1GB (USD)']]

Country,Most expensive 1GB (USD)
Israel,20.95
Kyrgyzstan,7.08
Fiji,0.85
Italy,3.54
Sudan,0.92
...,...
Vatican City (Holy See),
Venezuela,
Wallis and Futuna,
Democratic Republic of the Congo,


Notice that here we had to specify the range of rows that we want to index that column by. We used `:` in order to return all values in the column.

`iloc` works just like `loc`, but instead of labels we use integer positions. How would you use iloc to get the 'Most expensive 1GB (USD)' column?

_Hint: If you don't remember the order of columns, create a new cell and use `.columns` on your dataframe. Remember we start counting from 0!_



In [64]:
# EXERCISE

#internet.columns
internet.iloc[:,5]

Country
Israel                              20.95
Kyrgyzstan                           7.08
Fiji                                 0.85
Italy                                3.54
Sudan                                0.92
                                    ...  
Vatican City (Holy See)               NaN
Venezuela                             NaN
Wallis and Futuna                     NaN
Democratic Republic of the Congo      NaN
Zimbabwe                              NaN
Name: Most expensive 1GB (USD), Length: 242, dtype: float64
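
Instead of counting columns by eye, the position of a column can also be looked up with `columns.get_loc`. The sketch below uses a made-up table `toy`; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical toy table with made-up column names and values.
toy = pd.DataFrame({'code': ['AA', 'BB'], 'plans': [10, 20], 'price': [0.5, 1.5]})

# Look up the integer position of a column from its label.
position = toy.columns.get_loc('price')

# Selecting by that position gives the same column as selecting by label.
by_position = toy.iloc[:, position]
by_label = toy.loc[:, 'price']

print(position)
print(by_position.equals(by_label))
```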

Just as we sliced rows, we can do the same with columns. In the cell below, use `loc` and return all rows for columns *Country code* through *NO. OF Internet Plans* (inclusive of the last column).



In [65]:
# EXERCISE

internet.loc[:,'Country code':'NO. OF Internet Plans']

Country,Country code,Continental region,NO. OF Internet Plans
Israel,IL,NEAR EAST,27.0
Kyrgyzstan,KG,CIS (FORMER USSR),20.0
Fiji,FJ,OCEANIA,18.0
Italy,IT,WESTERN EUROPE,29.0
Sudan,SD,SUB-SAHARAN AFRICA,33.0
...,...,...,...
Vatican City (Holy See),VA,Europe,
Venezuela,VE,SOUTH AMERICA,
Wallis and Futuna,WF,OCEANIA,
Democratic Republic of the Congo,CD,SUB-SAHARAN AFRICA,


Now do the same thing, but with `iloc`.

_Remember that iloc is exclusive!_

In [66]:
#EXERCISE

internet.iloc[:,0:3]

Country,Country code,Continental region,NO. OF Internet Plans
Israel,IL,NEAR EAST,27.0
Kyrgyzstan,KG,CIS (FORMER USSR),20.0
Fiji,FJ,OCEANIA,18.0
Italy,IT,WESTERN EUROPE,29.0
Sudan,SD,SUB-SAHARAN AFRICA,33.0
...,...,...,...
Vatican City (Holy See),VA,Europe,
Venezuela,VE,SOUTH AMERICA,
Wallis and Futuna,WF,OCEANIA,
Democratic Republic of the Congo,CD,SUB-SAHARAN AFRICA,


### 1.2 Uniqueness  <a id='subsection2'></a>

Suppose that we want to find out the number of unique continental regions in our data. The `.unique()` method allows us to check this. 

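There are two ways to accomplish this: one uses "dot" notation, and the other uses brackets. For the most part, we will stick to the second method, because dot notation fails for column labels that contain spaces (such as 'Continental region') or that clash with the name of a DataFrame method.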

Method 1: **df.column_label.unique( )**

Method 2: **df['column_label'].unique( )**


In [67]:
internet['Continental region'].unique()

array(['NEAR EAST', 'CIS (FORMER USSR)', 'OCEANIA', 'WESTERN EUROPE',
       'SUB-SAHARAN AFRICA', 'EASTERN EUROPE', 'ASIA (EX. NEAR EAST)',
       'SOUTH AMERICA', 'NORTHERN AFRICA', 'CARIBBEAN', 'CENTRAL AMERICA',
       'BALTICS', 'NORTHERN AMERICA', 'Europe'], dtype=object)

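We can also use the method `.nunique()` to tell us how _many_ unique items we have rather than which items. An alternative way to compute this is __len(df['column_label'].unique())__. Note that `.nunique()` skips missing values by default while `.unique()` includes them, so the two counts can differ when the column contains NaN.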

In [68]:
# EXAMPLE

internet['Continental region'].nunique()

14
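
The relationship between `.unique()` and `.nunique()` can be sketched on a short made-up Series that contains one missing value. The Series `regions` and its entries are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a repeated value and one missing value.
regions = pd.Series(['NORTH', 'SOUTH', 'NORTH', np.nan, 'EAST'])

distinct_values = regions.unique()               # the distinct values, NaN included
count_via_len = len(distinct_values)             # counts NaN as a value
count_default = regions.nunique()                # skips NaN by default
count_with_nan = regions.nunique(dropna=False)   # counts NaN as a value

print(distinct_values)
print(count_via_len, count_default, count_with_nan)
```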

In the cell below, figure out how many unique values there are in the column *NO. OF Internet Plans*.

In [69]:
# EXERCISE

internet['NO. OF Internet Plans'].nunique()

55

### 1.3 Frequencies  <a id='subsection3'></a>

Say we want to find out how many instances of each continental region exist. In this case, we would use the `.value_counts()` method. This method returns the count of each unique value in our column. 

In [70]:
# EXAMPLE

internet['Continental region'].value_counts()

SUB-SAHARAN AFRICA      49
CARIBBEAN               32
WESTERN EUROPE          30
ASIA (EX. NEAR EAST)    28
OCEANIA                 24
NEAR EAST               16
EASTERN EUROPE          14
SOUTH AMERICA           14
CIS (FORMER USSR)       11
NORTHERN AFRICA          8
CENTRAL AMERICA          8
NORTHERN AMERICA         4
BALTICS                  3
Europe                   1
Name: Continental region, dtype: int64
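
Two optional arguments of `.value_counts()` are often handy: `normalize=True` returns shares instead of counts, and `dropna=False` also counts missing values. A minimal sketch on a made-up Series `regions`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a repeated value and one missing value.
regions = pd.Series(['NORTH', 'SOUTH', 'NORTH', np.nan, 'EAST'])

counts = regions.value_counts()                     # counts, missing values skipped
shares = regions.value_counts(normalize=True)       # fractions of the non-missing values
with_missing = regions.value_counts(dropna=False)   # counts, with NaN as its own row

print(counts)
print(shares)
print(with_missing)
```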

Now let's try to find out the counts for the number of internet plans.

In [71]:
# EXERCISE

internet['NO. OF Internet Plans'].value_counts()

60.0    20
18.0    11
17.0    11
21.0    10
9.0      9
22.0     9
42.0     8
13.0     8
35.0     7
19.0     7
14.0     7
27.0     6
24.0     6
20.0     6
11.0     6
15.0     6
25.0     5
16.0     5
23.0     5
3.0      4
7.0      4
12.0     4
30.0     4
4.0      4
45.0     4
34.0     4
58.0     3
36.0     3
2.0      3
40.0     3
28.0     3
52.0     3
33.0     2
31.0     2
37.0     2
8.0      2
32.0     2
5.0      2
29.0     2
6.0      2
46.0     2
44.0     2
59.0     1
54.0     1
48.0     1
50.0     1
26.0     1
38.0     1
47.0     1
49.0     1
53.0     1
39.0     1
10.0     1
51.0     1
1.0      1
Name: NO. OF Internet Plans, dtype: int64

### 1.4 Sorting  <a id='subsection4'></a>

Notice that `.value_counts()` sorts its result by count, in decreasing order. What if you wanted a different ordering? Maybe you want to sort by index, that is, by alphabetical order. In this case you would want to use the `sort_index()` method as seen below. 

In [72]:
# EXAMPLE

internet['Continental region'].value_counts().sort_index()

ASIA (EX. NEAR EAST)    28
BALTICS                  3
CARIBBEAN               32
CENTRAL AMERICA          8
CIS (FORMER USSR)       11
EASTERN EUROPE          14
Europe                   1
NEAR EAST               16
NORTHERN AFRICA          8
NORTHERN AMERICA         4
OCEANIA                 24
SOUTH AMERICA           14
SUB-SAHARAN AFRICA      49
WESTERN EUROPE          30
Name: Continental region, dtype: int64

If instead you wanted to sort by counts, but in ascending (from smallest to largest) order, you can use the `.sort_values()` method instead with the argument __ascending = True__.

In [73]:
# EXAMPLE

internet['Continental region'].value_counts().sort_values(ascending=True)

Europe                   1
BALTICS                  3
NORTHERN AMERICA         4
NORTHERN AFRICA          8
CENTRAL AMERICA          8
CIS (FORMER USSR)       11
EASTERN EUROPE          14
SOUTH AMERICA           14
NEAR EAST               16
OCEANIA                 24
ASIA (EX. NEAR EAST)    28
WESTERN EUROPE          30
CARIBBEAN               32
SUB-SAHARAN AFRICA      49
Name: Continental region, dtype: int64

We can also use `sort_values()` without calling `value_counts()` first. In the cell below, try using `sort_values()` for the column 'Most expensive 1GB (USD)'. Have it so that it sorts in increasing order (lowest to highest):

In [74]:
# EXERCISE

internet['Most expensive 1GB (USD)'].sort_values(ascending=True)

Country
San Marino                          0.63
Fiji                                0.85
Sudan                               0.92
Bhutan                              1.06
China                               1.21
                                    ... 
Vatican City (Holy See)              NaN
Venezuela                            NaN
Wallis and Futuna                    NaN
Democratic Republic of the Congo     NaN
Zimbabwe                             NaN
Name: Most expensive 1GB (USD), Length: 242, dtype: float64

You might notice that we are getting some **NaN** values! This will be revisited in section 1.6.
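
To sort from highest to lowest, pass `ascending=False`. Missing values are placed at the end by default in either direction, and the `na_position` argument can move them to the front. The Series `prices` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with one missing value.
prices = pd.Series([3.5, np.nan, 0.9, 12.0], index=['A', 'B', 'C', 'D'])

# Highest to lowest; the NaN row still ends up last.
descending = prices.sort_values(ascending=False)

# Same order, but with the NaN row moved to the top.
nan_first = prices.sort_values(ascending=False, na_position='first')

print(descending)
print(nan_first)
```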

### 1.5 Min, Max, Range  <a id='subsection5'></a>

Say that for our analysis we want to find out which country has the highest number of internet users and which country has the lowest number of internet users.

A good starting point might be to see what the __min__ and the __max__ are for our data. We can do this by using the methods `.min()` and `.max()`, respectively. 

In [75]:
# EXAMPLE

print('Min number of users is :',internet['Internet users'].min())
print('Max number of users is :',internet['Internet users'].max())

Min number of users is : 1034.0
Max number of users is : 1010740000.0
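
`.min()` and `.max()` return the values themselves. To find out which row label they belong to, one option is `.idxmin()` and `.idxmax()`, which return the index label of the smallest and largest value. A sketch on a made-up Series `users`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy user counts with one missing value.
users = pd.Series([1200.0, np.nan, 560000.0, 89.0], index=['A', 'B', 'C', 'D'])

smallest_label = users.idxmin()   # label of the smallest value, NaN is skipped
largest_label = users.idxmax()    # label of the largest value, NaN is skipped

print(smallest_label, users.min())
print(largest_label, users.max())
```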


To get the range, all you need to do is subtract the min from the max! Let's do this in the cell below:

In [76]:
# EXERCISE

internet_users_max = internet['Internet users'].max()
internet_users_min = internet['Internet users'].min()
internet_users_range = internet_users_max - internet_users_min

print("The range is ", internet_users_range) 

The range is  1010738966.0


Now try to find the range for *Most expensive 1GB (USD)*

In [77]:
#EXERCISE

expensive_GB_max = internet['Most expensive 1GB (USD)'].max()
expensive_GB_min = internet['Most expensive 1GB (USD)'].min()

expensive_GB_range = expensive_GB_max - expensive_GB_min

print("The range is ", expensive_GB_range) 

The range is  768.24


### 1.6 Missing Values  <a id='subsection6'></a>

A common problem that you will come across when analyzing data is **missing** data. You can check whether your data set contains missing data by using the method `.isnull()`. This method returns **True** wherever a value is missing (Null, NaN) and **False** wherever it is not. We can combine it with `.sum()` to count the True values, that is, the number of missing values.

*In Python (as in most programming languages), True is represented by 1, and False by 0. So using the `.sum()` function allows us to treat these True/False as numerical values.*

In [78]:
# EXAMPLE

internet.isnull().sum()

Country code                                          0
Continental region                                    0
NO. OF Internet Plans                                11
Average price of 1GB (USD)                            0
Cheapest 1GB for 30 days (USD)                       11
Most expensive 1GB (USD)                             11
Average price of 1GB (USD  at the start of 2021)     11
Average price of 1GB (USD – at start of 2020)        11
Internet users                                       30
Population                                           32
Avg \n(Mbit/s)Ookla                                 101
dtype: int64
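
The two steps, building a True/False table and then summing it, can be looked at separately on a small made-up DataFrame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with three missing values in total.
toy = pd.DataFrame({'plans': [10, np.nan, 30], 'speed': [np.nan, np.nan, 5.5]})

# Step 1: True wherever a value is missing, False elsewhere.
mask = toy.isnull()

# Step 2: summing counts the True values, column by column.
missing_per_column = mask.sum()

# Summing once more gives the total for the whole table.
total_missing = missing_per_column.sum()

print(mask)
print(missing_per_column)
print(total_missing, True + True)
```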

Notice that in the example above we counted the missing values in every column. What if you only wanted to do it for one? You can use the same notations we discussed earlier, that is, bracket and dot notation.

In [79]:
# EXAMPLE

internet['NO. OF Internet Plans'].isnull().sum()

11

In the cell below, find the number of null values in the column *Population*. 

In [80]:
# EXERCISE

internet['Population'].isnull().sum()

32

However, **Null** and **NaN** are not the only ways to represent missing values. Sometimes, they can appear as 999 (commonly used in census or other data involving humans) or even -1.

In this data set, you might notice that instead of numerical values there are some notes used - for example, if you look at the column *Average price of 1GB (USD)*, you will notice some countries with 'NO PROVIDERS' among other things.

**Discuss:** Would these count as null values / missing data? Why or why not?
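
If you decide to treat such notes as missing, one possible approach (an assumption for illustration, not a step from this notebook) is `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be read as a number into NaN. The Series `price` below is a made-up stand-in for a text column that mixes numbers and notes.

```python
import pandas as pd

# Hypothetical toy column mixing numeric text with explanatory notes.
price = pd.Series(['0.05', '3.38', 'NO PROVIDERS', 'HYPERINFLATION'])

# Text that cannot be parsed as a number becomes NaN.
numeric_price = pd.to_numeric(price, errors='coerce')

print(numeric_price)
print(numeric_price.isnull().sum())
```

Keep in mind that this discards the reason a value is absent, which may matter for your analysis.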

## 2. Booleans & Boolean Indexing <a id='section2'></a>

Suppose we only want to look at the countries that have fewer than 10 internet plans. We will use **boolean indexing** to create a DataFrame that meets this criterion. 

Boolean indexing allows us to define what kind of data we want to output. For example, we can only select rows that correspond to a specific continental region or choose the rows that are below a select number of internet plans.

We will use **comparison operators** in boolean indexing - below is the table from notebook 03:

|Operator| Meaning|
|--------|---------|
|< | less than |
|<= | less than or equal to|
|> | greater than |
|>= | greater than or equal to|
|!= | not equal to|
|== | equal to|

People often use the term **filtering data** for boolean indexing. It's easier to break up boolean indexing into steps by first creating a filter that specifies your criteria, then passing that filter to your DataFrame.

For example, let's say we only wanted to look at countries that have **less than 10 internet plans.**

First, let's create a filter using the appropriate comparison operator:

In [81]:
# EXAMPLE

internet_plan_filter = internet['NO. OF Internet Plans'] < 10
internet_plan_filter

Country
Israel                              False
Kyrgyzstan                          False
Fiji                                False
Italy                               False
Sudan                               False
                                    ...  
Vatican City (Holy See)             False
Venezuela                           False
Wallis and Futuna                   False
Democratic Republic of the Congo    False
Zimbabwe                            False
Name: NO. OF Internet Plans, Length: 242, dtype: bool

Next, we will pass this filter into our original dataframe:

In [82]:
#EXAMPLE

internet[internet_plan_filter]

Country,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
San Marino,SM,WESTERN EUROPE,2.0,0.43,0.24,0.63,1.16,6.86,20100.0,33785.0,
Guadeloupe,GP,CARIBBEAN,9.0,0.89,0.49,3.55,2.42,6.06,,,
Monaco,MC,WESTERN EUROPE,3.0,1.08,0.84,2.37,0.98,1.21,37553.0,38682.0,
Djibouti,DJ,SUB-SAHARAN AFRICA,7.0,1.12,0.47,28.01,1.12,37.92,532849.0,958923.0,
Tonga,TO,OCEANIA,2.0,1.28,1.09,1.46,3.41,2.92,44558.0,103197.0,
French Guiana,GF,SOUTH AMERICA,9.0,1.58,0.82,5.91,3.61,13.41,,,
Palau,PW,OCEANIA,6.0,1.67,1.0,2.5,2.5,8.34,,,
Ethiopia,ET,SUB-SAHARAN AFRICA,3.0,1.71,1.41,4.17,2.44,2.06,19543075.0,109224414.0,21.08
Brunei,BN,ASIA (EX. NEAR EAST),9.0,2.23,1.78,7.43,2.64,8.51,406705.0,428963.0,71.38
Eswatini,SZ,SUB-SAHARAN AFRICA,9.0,2.24,0.69,2.8,13.31,5.25,414278.0,1136281.0,


With the filter applied, we see only the rows for countries that fulfill the condition of having less than 10 internet plans.
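
To see the two steps in miniature, here is a sketch on a hypothetical three-row DataFrame called `toy` (its column names and values are invented for illustration). Rows where the filter is `True` are kept, and rows where it is `False` are dropped. The original DataFrame is not modified.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame(
    {'plans': [2, 10, 60], 'region': ['X', 'Y', 'X']},
    index=['A', 'B', 'C'],
)

# Step 1: build the filter (a boolean Series, one value per row)
toy_filter = toy['plans'] < 10

# Step 2: pass the filter to the DataFrame to keep only the True rows
small = toy[toy_filter]

print(small.index.tolist())  # ['A']
print(len(toy))              # 3, the original DataFrame is unchanged
```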

In the cell below, try changing the filter so that it now only looks at countries with 50 or more internet plans:

In [83]:
# EXERCISE

internet_plan_filter = internet['NO. OF Internet Plans'] >= 50
internet[internet_plan_filter]

Unnamed: 0_level_0,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Bangladesh,BD,ASIA (EX. NEAR EAST),60.0,0.34,0.11,2.22,0.7,0.99,129180000.0,166303500.0,10.43
Sri Lanka,LK,ASIA (EX. NEAR EAST),60.0,0.38,0.0,5.53,0.51,0.78,7121116.0,21228760.0,13.15
Chile,CL,SOUTH AMERICA,59.0,0.39,0.24,1.83,0.71,1.87,14864456.0,18729160.0,22.49
Indonesia,ID,ASIA (EX. NEAR EAST),53.0,0.42,0.17,2.94,0.64,2.99,196000000.0,267670500.0,17.7
Pakistan,PK,ASIA (EX. NEAR EAST),60.0,0.59,0.06,8.59,0.69,1.85,118800000.0,213756300.0,16.73
Uzbekistan,UZ,CIS (FORMER USSR),60.0,0.6,0.01,23.75,1.34,3.27,16692456.0,32476240.0,13.27
Turkey,TR,NEAR EAST,60.0,0.63,0.05,2.26,0.72,2.25,69945905.0,82340090.0,30.48
Poland,PL,EASTERN EUROPE,60.0,0.64,0.03,23.02,0.7,1.32,34697848.0,37921590.0,40.14
India,IN,ASIA (EX. NEAR EAST),58.0,0.68,0.05,2.73,0.09,0.26,833710000.0,1352642000.0,13.67
Tanzania,TZ,SUB-SAHARAN AFRICA,60.0,0.75,0.28,4.31,0.73,3.71,30000000.0,56313440.0,10.39


How many countries have 50 or more internet plans?

**Hint:** you can use our old friend `len()`.

In [84]:
# EXERCISE

len(internet[internet_plan_filter])

31
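
As an alternative sketch, you can count matches directly from the filter itself: `True` counts as 1 and `False` as 0, so summing a boolean Series gives the number of rows that meet the condition. The DataFrame `toy` below is hypothetical and only serves to show that both approaches agree.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({'plans': [2, 10, 60, 55]}, index=['A', 'B', 'C', 'D'])
toy_filter = toy['plans'] >= 50

count_with_len = len(toy[toy_filter])    # filter the rows, then count them
count_with_sum = int(toy_filter.sum())   # count the True values directly

print(count_with_len)  # 2
print(count_with_sum)  # 2
```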

We can also check whether rows of data fulfill multiple conditions at once. We use `&` (read as "and") to combine the conditions in our filter.

**Note**: if we want to combine two or more conditions, we need to wrap each of them in its own set of parentheses. The structure should look like this:

`(condition 1) & (condition 2)`
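
The parentheses matter because Python evaluates `&` before comparison operators such as `>=` and `>`. The sketch below uses a hypothetical DataFrame `toy` with invented columns to show the parenthesized form working and the unparenthesized form raising an error.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame(
    {'plans': [2, 60, 55], 'speed': [5, 20, 30]},
    index=['A', 'B', 'C'],
)

# Correct: each comparison is wrapped in its own parentheses
both = (toy['plans'] >= 50) & (toy['speed'] > 25)
print(both.tolist())  # [False, False, True]

# Without parentheses, Python groups the expression as
# toy['plans'] >= (50 & toy['speed']) > 25, which pandas cannot evaluate.
try:
    toy['plans'] >= 50 & toy['speed'] > 25
    raised = False
except (TypeError, ValueError):
    raised = True

print(raised)  # True
```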

In the example below, let's find countries in the continental region of Sub-Saharan Africa that have 50 or more internet plans:

In [85]:
# EXAMPLE

internet_plan_filter = (internet['NO. OF Internet Plans'] >= 50) & (internet['Continental region'] == 'SUB-SAHARAN AFRICA')

internet[internet_plan_filter]

Unnamed: 0_level_0,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Tanzania,TZ,SUB-SAHARAN AFRICA,60.0,0.75,0.28,4.31,0.73,3.71,30000000.0,56313438.0,10.39
Nigeria,NG,SUB-SAHARAN AFRICA,60.0,0.88,0.03,5.25,1.39,7.91,136203231.0,195874685.0,18.92
Zambia,ZM,SUB-SAHARAN AFRICA,60.0,1.13,0.01,6.8,1.36,2.25,4760715.0,17351708.0,10.36
Uganda,UG,SUB-SAHARAN AFRICA,60.0,1.56,0.45,22.71,1.62,5.02,10162807.0,42729036.0,15.99
Burundi,BI,SUB-SAHARAN AFRICA,54.0,2.1,0.09,5.12,2.12,18.79,607311.0,11175374.0,
Kenya,KE,SUB-SAHARAN AFRICA,50.0,2.25,0.26,10.93,1.05,2.73,8861485.0,51392565.0,16.93
South Africa,ZA,SUB-SAHARAN AFRICA,60.0,2.67,0.12,34.95,4.3,7.77,31858027.0,57792518.0,33.62
Gabon,GA,SUB-SAHARAN AFRICA,52.0,4.82,1.06,18.06,4.89,3.39,1040000.0,2119275.0,


In the cell below, find countries in the continental region Western Europe that have a value less than 0.5 USD for the cheapest 1GB of internet for 30 days:

In [86]:
# EXERCISE

internet_price_filter = (internet['Cheapest 1GB for 30 days (USD)'] < 0.5) & (internet['Continental region'] == 'WESTERN EUROPE')

internet[internet_price_filter]

Unnamed: 0_level_0,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Italy,IT,WESTERN EUROPE,29.0,0.27,0.09,3.54,0.43,1.73,50540000.0,60627291.0,37.15
France,FR,WESTERN EUROPE,45.0,0.41,0.09,118.2,0.81,1.21,59470000.0,64990511.0,60.94
San Marino,SM,WESTERN EUROPE,2.0,0.43,0.24,0.63,1.16,6.86,20100.0,33785.0,
Denmark,DK,WESTERN EUROPE,34.0,0.79,0.0,2.23,0.8,1.36,5407278.0,5752126.0,105.65
Finland,FI,WESTERN EUROPE,18.0,0.97,0.26,1.63,2.14,1.16,4831170.0,5522576.0,71.23
Austria,AT,WESTERN EUROPE,60.0,1.17,0.24,23.43,1.08,1.88,7681957.0,8891388.0,56.6
Iceland,IS,WESTERN EUROPE,20.0,1.23,0.16,17.29,1.46,3.78,329196.0,336713.0,
United Kingdom,GB,WESTERN EUROPE,60.0,1.42,0.11,71.29,1.39,6.66,65001016.0,67141684.0,48.1
Ireland,IE,WESTERN EUROPE,12.0,1.42,0.13,11.64,1.36,3.95,4024552.0,4818690.0,30.16
Sweden,SE,WESTERN EUROPE,58.0,1.45,0.2,75.08,2.07,3.66,9554907.0,9971638.0,73.61


In the cell below, find countries in the region South America that have a population greater than 10 million (10000000) and have less than 50 internet plans.

In [87]:
# EXERCISE

internet_filter = (internet['Continental region'] == 'SOUTH AMERICA') & (internet['Population'] > 10000000) & (internet['NO. OF Internet Plans'] < 50)

internet[internet_filter]

Unnamed: 0_level_0,Country code,Continental region,NO. OF Internet Plans,Average price of 1GB (USD),Cheapest 1GB for 30 days (USD),Most expensive 1GB (USD),Average price of 1GB (USD at the start of 2021),Average price of 1GB (USD – at start of 2020),Internet users,Population,Avg \n(Mbit/s)Ookla
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Ecuador,EC,SOUTH AMERICA,19.0,1.06,0.63,3.0,3.24,6.93,9521056.0,17084358.0,17.7
Peru,PE,SOUTH AMERICA,49.0,1.15,0.85,15.58,2.13,2.48,15674241.0,31989260.0,15.64
Bolivia,BO,SOUTH AMERICA,48.0,2.18,0.87,14.44,5.09,5.99,4843916.0,11353142.0,14.1
Argentina,AR,SOUTH AMERICA,28.0,2.38,0.44,11.47,1.45,7.4,33561876.0,44361150.0,20.64


---
Notebook developed by: Kseniya Usovich, Karla Palos, Alisa Bettale