# Pandas DataFrames



---

### Table of Contents

1 - [Manipulating Columns](#section1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Indexing & Slicing in Pandas](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Uniqueness](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Frequencies](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Sorting](#subsection4)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.5 - [Min, Max, Range](#subsection5)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.6 - [Missing Values](#subsection6)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.7 - [GroupBy](#subsection7)<br>

2 - [Booleans & Boolean Indexing](#section2)<br>


---

As in our previous notebook, we first need to run some setup code to make sure everything works correctly. Please run the cells below.

In [None]:
# RUN THIS CELL;

###DO NOT MODIFY###

import otter
grader = otter.Notebook(tests_dir="data/07_intro_to_python")

In [None]:
import numpy as np
import pandas as pd

Let's begin by reading a new data set from the file `internet_world_data.csv`. We will use `pd.read_csv` just as we did before, with the keyword `index_col` to set the column 'Country' as our index. In addition, we will drop the column 'S.NO', remove the commas from the columns 'Internet users' and 'Population', and convert both columns to floats so they are easier to work with in this notebook.

You can learn more about the data [here](https://www.kaggle.com/datasets/ramjasmaurya/1-gb-internet-price).

In [None]:
internet = pd.read_csv('data/internet_world_data.csv')
internet.head()

In [None]:
internet = pd.read_csv('data/internet_world_data.csv', index_col='Country')

internet = internet.drop('S.NO', axis=1)
internet['Internet users'] = internet['Internet users'].str.replace(',','')
internet['Internet users'] = internet['Internet users'].astype(float)
internet['Population'] = internet['Population'].str.replace(',','')
internet['Population'] = internet['Population'].astype(float)
internet.head()
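To see what the cleaning steps above do in isolation, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are hypothetical and are not part of the internet data set.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text with comma separators
toy = pd.DataFrame(
    {"Population": ["1,000", "25,300,000", "742"]},
    index=["A", "B", "C"],
)

# Remove the commas, then convert the cleaned strings to floats
toy["Population"] = toy["Population"].str.replace(",", "").astype(float)

print(toy["Population"].dtype)
print(toy)
```

Without the `str.replace` step, `astype(float)` would fail on values such as `'1,000'`, because the comma is not part of a valid number.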

In the five cells below, use some functions on the dataframe `internet` to do some **Exploratory Data Analysis** to explore the data!

_You can always look back in Notebook 06 for ideas._

In [None]:
#Exploratory Data Analysis 1



In [None]:
#Exploratory Data Analysis 2



In [None]:
#Exploratory Data Analysis 3



In [None]:
#Exploratory Data Analysis 4



In [None]:
#Exploratory Data Analysis 5



## 1. Manipulating Columns
 <a name='section1'></a>

### 1.1 Indexing &  Slicing  <a name='subsection1'></a>

There are two main ways of indexing into DataFrames. We can still use our old friend, the square brackets `[:]`, or we can use two indexers: **`loc`** and **`iloc`**.

**`loc`**: uses names or labels of rows and columns.  
**`iloc`**: uses integer positions (indices) of rows and columns. You can think of *iloc* as *index-loc*.


Let's start with loc:

#### `.loc[rows-label(s), columns-label(s)]`
`.loc` helps us view and index our DataFrame, i.e. locate the data that we want.
* It works with labels. Most of the time columns have specific names, but rows are often labeled with numbers, in which case the row label is a number. (In our `internet` DataFrame, the row labels are country names.)
* It can take
    * one label __(`df.loc[row-label, 'col-label-1']`)__
    * a list of labels __(`df.loc[[row-label-1, row-label-2, row-label-4], ['col-label-1', 'col-label-2', 'col-label-4']]`)__
    * or a _slice_ of labels __(`df.loc[row-label-50:row-label-100, 'col-label-1':'col-label-8']`)__


**Remember!** `loc` is **inclusive**, `iloc` is **exclusive** for its stop index.
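The inclusive/exclusive difference is easiest to see side by side. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical columns `users` and `plans`), not the internet data.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"users": [10, 20, 30, 40], "plans": [1, 2, 3, 4]},
    index=["A", "B", "C", "D"],
)

by_label = toy.loc["A":"C"]    # label slice: includes the stop label "C"
by_position = toy.iloc[0:2]    # position slice: stops before position 2

print(len(by_label))     # 3
print(len(by_position))  # 2
```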

We can still select a column of our DataFrame (aka table) with square brackets, by specifying the column name.

In [None]:
# EXAMPLE

internet["Average price of 1GB (USD)"]

So why would we want to switch from plain square brackets to `loc` and `iloc`? There are a few reasons, and a main one is compute time: `loc` and `iloc` select rows and columns in a single step, whereas chaining square brackets (for example `df[column][row]`) performs two separate lookups. With the examples we use in this notebook, it will be impossible to notice the difference, but once we get to DataFrames with hundreds of thousands or millions of values, this will become important!

On a climate care note, increased compute time leads to increased electricity and data server use, which contributes to climate change! And that's part of the reason we need to consider compute time. So let's dive into learning how to use our helpers `loc` and `iloc` to be more climate conscious.

#### Rows

Let's use `loc` to see the values in the row for 'Japan' in our DataFrame.

In [None]:
# EXAMPLE

internet.loc['Japan']

In [None]:
#Try using loc to find data on the United States
# internet.loc['...']

In [None]:
#Use loc to find data on the country of your choice!
...

`iloc` uses **indices** instead of labels. Try running the cell below:

In [None]:
internet.iloc[10]

You can pass in indices as a list to return a dataframe instead of a series, as you can see in the cell below:

In [None]:
internet.iloc[[10]]

In [None]:
#EXERCISE - Use iloc to grab information on the country found at index 50

# internet.iloc[...]

You can also grab specific rows by passing a list of indices:

In [None]:
internet.iloc[[1,3,6,8,9]]

Recall __start:stop:step__ from lists? We can also select a range of rows with a specified step value in our DataFrame. Below we will take every 2nd row from row 0 to row 10.

**Remember `iloc` is exclusive for the stop index! In other words, it only goes until stopIndex-1**

In [None]:
internet.iloc[0:11:2]
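As a small illustration of how `start:stop:step` behaves with `iloc`, here is a sketch on a made-up ten-row DataFrame (`toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical toy data: ten rows holding the values 0..9
toy = pd.DataFrame({"value": list(range(10))}, index=list("abcdefghij"))

# Start at position 0, stop before position 7, take every 3rd row
every_third = toy.iloc[0:7:3]

print(every_third)
```

The result contains the rows at positions 0, 3 and 6. Position 7 is never reached because the stop index is exclusive.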

In [None]:
#EXERCISE - use iloc to get every 5th element from row 0 to row 150

internet.iloc[...]

#### Columns

We can also use `loc` to grab columns. Don't forget that we are still using `loc`, so we will have to use column labels.

In [None]:
# EXAMPLE

internet.loc[:,'Most expensive 1GB (USD)']

Another way to index by only one column is to put the column label in a list. This returns a one-column DataFrame because we passed a list.

In [None]:
# EXAMPLE

internet.loc[:,['Most expensive 1GB (USD)']]

Notice that here we had to specify the range of rows that we want to index that column by. We used `:` in order to return all values in the column.
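The difference between passing a single label and passing a list can be checked with `type()`. This is a minimal sketch on made-up data (`toy` and its column names are hypothetical).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"price": [1.5, 2.0, 0.7], "plans": [10, 20, 30]},
    index=["A", "B", "C"],
)

as_series = toy.loc[:, "price"]     # single label -> Series
as_frame = toy.loc[:, ["price"]]    # list of labels -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```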

`iloc` works just like `loc`, but instead of labels we use integer positions. How would you use `iloc` to get the 'Most expensive 1GB (USD)' column?

_Hint: If you don't remember the order of columns, create a new cell and use `.columns` on your dataframe. Remember we start counting from 0!_



In [None]:
# EXERCISE

# internet.iloc[...]

Just as we sliced rows, we can do the same with columns. In the cell below, use `loc` and return all rows for columns *Country code* through *NO. OF Internet Plans* (inclusive of the last column).



In [None]:
# EXERCISE

# internet.loc[...]

Now do the same thing, but with `iloc`.

_Remember that iloc is exclusive!_

In [None]:
#EXERCISE

internet.iloc[...]

### 1.2 Uniqueness  <a name='subsection2'></a>

Suppose that we want to find out the number of unique continental regions in our data. The `.unique()` method allows us to check this.

There are two ways to accomplish this: one uses "dot" notation, and the other uses brackets. For the most part, we will stick to the second method, because dot notation can easily run into errors (for example, it does not work for column names that contain spaces).

Method 1: **`df.column_label.unique( )`**

Method 2: **`df['column_label'].unique( )`**


In [None]:
internet['Continental region'].unique()

We can also use the method `.nunique()` to tell us how _many_ unique items we have rather than which items; by default, this doesn't count missing or null values. An alternative way to compute this is __`len(df['column_label'].unique())`__, which **will** include missing values as a unique value.

In [None]:
# EXAMPLE

internet['Continental region'].nunique()
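To make the difference in how missing values are counted concrete, here is a minimal sketch on a made-up Series (`toy` is a hypothetical name).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two distinct labels plus one missing value
toy = pd.Series(["x", "y", "x", np.nan])

print(toy.nunique())              # 2 -> missing value is ignored
print(len(toy.unique()))          # 3 -> missing value counts as its own entry
print(toy.nunique(dropna=False))  # 3 -> same count, using nunique directly
```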

In the cell below, use pandas to find how many unique values there are in the "NO. OF Internet Plans" column, and assign it to `n_unique`. (For this exercise, assume that `nan` values count as their own unique value).

In [None]:
n_unique = ...

In [None]:
grader.check("q1")

### 1.3 Frequencies  <a name='subsection3'></a>

Say we want to find out how many instances of each continental region exist. In this case, we would use the `.value_counts()` method. This method returns the counts for the unique values in our column.

In [None]:
# EXAMPLE

internet['Continental region'].value_counts()

Now let's try to find out the counts for the number of internet plans.

In [None]:
# EXERCISE

# internet['...']....

### 1.4 Sorting  <a name='subsection4'></a>

Notice that `.value_counts()` sorts its counts in decreasing order. What if you wanted a different ordering? Maybe you want to sort by index, that is, by alphabetical order. In this case you would use the `sort_index()` method as seen below.

In [None]:
# EXAMPLE

internet['Continental region'].value_counts().sort_index()

If instead you wanted to sort by counts, but in ascending (from smallest to largest) order, you can use the `.sort_values()` method instead with the argument __ascending = True__.

In [None]:
# EXAMPLE

internet['Continental region'].value_counts().sort_values(ascending=True)
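The three orderings can be compared on a tiny made-up Series. In this sketch the name `toy` and its values are hypothetical, and the counts are deliberately all different so the order is unambiguous.

```python
import pandas as pd

# Hypothetical toy data: "b" appears 3 times, "a" twice, "c" once
toy = pd.Series(["b", "a", "b", "c", "b", "a"])

counts = toy.value_counts()                      # largest count first
by_label = counts.sort_index()                   # alphabetical by label
ascending = counts.sort_values(ascending=True)   # smallest count first

print(counts)
print(by_label)
print(ascending)
```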

We can also use `sort_values()` without calling `value_counts()` first. In the cell below, try using `sort_values()` for the column 'Most expensive 1GB (USD)'. Have it so that it sorts in decreasing order (highest to lowest):

In [None]:
# EXERCISE

# internet['...']....

You might notice that we are getting some **NaN** values! This will be revisited in section 1.6.

### 1.5 Min, Max, Range  <a name='subsection5'></a>

Say that for our analysis we want to find out which country has the highest number of internet users and which country has the lowest number of internet users.

A good starting point might be to see what the __min__ and the __max__ are for our data. We can do this by using the methods `.min()` and `.max()` respectively.

In [None]:
# EXAMPLE

print('Min number of users is :',internet['Internet users'].min())
print('Max number of users is :',internet['Internet users'].max())
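`.min()` and `.max()` return the values themselves. To find out *which* row holds them, one option (not used elsewhere in this notebook) is the pandas methods `.idxmin()` and `.idxmax()`, which return the row label. A minimal sketch on made-up data, where `toy` is a hypothetical name:

```python
import pandas as pd

# Hypothetical toy data: user counts indexed by a made-up label
toy = pd.Series([120.0, 15.0, 980.0], index=["A", "B", "C"], name="users")

print(toy.min(), "is found at", toy.idxmin())
print(toy.max(), "is found at", toy.idxmax())
```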

To get the range, all you need to do is subtract the min from the max! In the cell below, find the min, max and range of the number of internet users:

In [None]:
internet_users_max = ...
internet_users_min = ...
internet_users_range = ...

print("The range is ", internet_users_range)

In [None]:
grader.check("q2")

Now try to find the range for *Most expensive 1GB (USD)*

In [None]:
#EXERCISE

#YOUR CODE HERE


### 1.6 Missing Values  <a name='subsection6'></a>

A common problem that you will come across when analyzing data is **missing** data. You can check if your data set contains missing data by using the method `.isnull()`. This method returns **True** wherever a value is missing (`Null`, `NaN`) and **False** wherever it is not. We can combine it with `.sum()` to count the **True** values, i.e. the number of missing entries in each column.

*In Python (as in most programming languages), True is represented by 1, and False by 0. So using the `.sum()` function allows us to treat these True/False as numerical values.*

In [None]:
# EXAMPLE

internet.isnull().sum()
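To see the two steps separately, the sketch below builds the True/False table first and then sums it. The DataFrame `toy` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

mask = toy.isnull()              # True where a value is missing
missing_per_column = mask.sum()  # each True counts as 1

print(mask)
print(missing_per_column)
```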

Notice that in the example above we checked the number of missing values in each of the columns. What if you only wanted to do it for one? You can use the same notations we discussed previously, that is, bracket and dot notation.

In the cell below, find the number of null values in the columns *Population* and *NO. OF Internet Plans*.

In [None]:
n_null_pop = ...
n_null_num_plans = ...

In [None]:
grader.check("q3")

However, **`Null`** and **`NaN`** are not the only ways to represent missing values. Sometimes, the data providers may choose special numerical values to represent missing data. For example, a missing value could appear as `999` (commonly used in census or other data involving humans) or even `-1`. This can help many more types of code to read the data as the correct data type (like `int`), but also requires that users understand what these strange values mean.
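If you know which special values stand for missing data, one possible approach is to convert them to `NaN` so that `.isnull()` can find them. The sketch below assumes that `999` and `-1` are the placeholders, and the Series `toy` is made up. Always check the data documentation before treating a value as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 999 and -1 are assumed to mean "missing"
toy = pd.Series([34, 999, 27, -1, 45], name="age")

# Replace the placeholder values with NaN
cleaned = toy.replace([999, -1], np.nan)

print(cleaned.isnull().sum())  # 2
```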

In this data set, you might notice that instead of numerical values there are some notes - for example, if you look at the column *Average price of 1GB (USD)*, you will notice some countries with 'NO PROVIDERS' among other things.

**Discuss:** Would these count as null values / missing data? Why or why not?

### 1.7 GroupBy  <a name='subsection7'></a>

Dataframes can oftentimes be overwhelming due to the sheer quantity of data presented before you. Groupby is a function that helps you break down your dataframe and see patterns in your data.

**Note:** When you use the groupby function, you also have to select a function in order to indicate what you would like to be done to the grouped data.

In [None]:
# EXAMPLE
internet.groupby("Continental region").count()

Notice that in the example above we not only had to refer to the column that we wanted to group by, but we also applied the count function so that we would get an output.

In the example below, take a look at how you can group by multiple columns in your internet dataframe.

In [None]:
# EXAMPLE

internet.groupby(["Continental region", "Country"])[["NO. OF Internet Plans"]].min()
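The split between "grouping" and "doing something to the groups" can be seen in the minimal sketch below. It uses a made-up DataFrame (`toy`, with hypothetical columns `region` and `price`) and the `.mean()` method as the aggregation.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "price": [1.0, 4.0, 3.0, 6.0, 2.0],
    }
)

grouped = toy.groupby("region")       # only defines the groups, no result yet
mean_price = grouped["price"].mean()  # the aggregation produces the output

print(mean_price)
```

Here `grouped` on its own is a GroupBy object rather than a table. The values only appear once a method such as `.mean()`, `.count()`, or `.min()` is applied.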

In the cell below, find the highest number of internet plans for each Continental region.

In [None]:
# EXERCISE

# internet...("...")["..."]...

## 2. Booleans & Boolean Indexing <a name='section2'></a>

Suppose we only want to look at the countries that have fewer than 10 internet plans. We will use **boolean indexing** to create a DataFrame that meets this criterion.

Boolean indexing allows us to define what kind of data we want to output. For example, we can only select rows that correspond to a specific continental region or choose the rows that are below a selected number of internet plans.

We will use **comparison operators** in boolean indexing - below is the table from notebook 03:

|Operator| Meaning|
|--------|---------|
|< | less than |
|<= | less than or equal to|
|> | greater than |
|>= | greater than or equal to|
|!= | not equal to|
|== | equal to|

People often use the term **filtering data** when using boolean indexing. It's easier to break up boolean indexing into steps by first creating a filter that specifies your criteria, then passing that filter to your dataframe.

For example, let's say we only wanted to look at countries that have **less than 10 internet plans.**

First, let's create a filter using the appropriate comparison operator:

In [None]:
# EXAMPLE

internet_plan_filter = internet['NO. OF Internet Plans'] < 10
internet_plan_filter

Next, we will pass this filter into our original dataframe:

In [None]:
#EXAMPLE

internet[internet_plan_filter]

You can see that with the filter applied, we are only looking at data of the countries that fulfill the condition of having less than 10 internet plans.
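The same two steps can be traced on a few made-up rows. In this sketch `toy` and its column `plans` are hypothetical. One row is deliberately missing, to show that a comparison with `NaN` evaluates to False, so rows with missing values are dropped by the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"plans": [5, np.nan, 8, 30]}, index=["A", "B", "C", "D"])

mask = toy["plans"] < 10   # Series of True/False, one entry per row
small = toy[mask]          # keeps only the rows where the mask is True

print(mask)
print(small)
```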

In the cell below, create a filter to represent the rows in which there are at least 50 `NO. OF Internet Plans`. How many countries have 50 or more internet plans?

**Hint:** you can use our old friend `len()`.

In [None]:
internet_plan_filter = ...

n_at_least_50 = ...

In [None]:
grader.check("q4")

We can also check whether rows of data fulfill multiple conditions. We can use `&` to combine them in our filter.

**Note**: if we want to use two or more conditions, we need to wrap each of them in its own set of parentheses. The structure should look like this:

`(condition 1) & (condition 2)`

In the example below, let's find countries that have 50 or more internet plans located in the continental region of Sub-Saharan Africa:

In [None]:
# EXAMPLE

internet_plan_filter = (internet['NO. OF Internet Plans'] >= 50) & (internet['Continental region'] == 'SUB-SAHARAN AFRICA')

internet[internet_plan_filter]
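The parentheses matter because Python evaluates `&` before comparison operators such as `>=` and `==`. Without them, the expression is grouped differently from what we intend. A minimal sketch on made-up data (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"plans": [60, 20, 75, 55], "region": ["X", "X", "Y", "X"]},
    index=["A", "B", "C", "D"],
)

# Each condition sits in its own parentheses, then & combines them row by row
both = (toy["plans"] >= 50) & (toy["region"] == "X")
result = toy[both]

print(result)
```

Only rows A and D satisfy both conditions: row B has too few plans, and row C is in a different region.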

In the cell below, find countries in the continental region Western Europe that have a value less than 0.5 USD for the cheapest 1GB of internet for 30 days:

In [None]:
# EXERCISE

# internet_price_filter = (internet['...'] ...) & (internet['...'] ...)

# internet[...]

In the cell below, find countries in the region South America that have a population greater than 10 million (10000000) and fewer than 50 internet plans.

In [None]:
# EXERCISE

internet_filter = ...

# internet[...]

---
Notebook developed by: Kseniya Usovich, Karla Palos, Alisa Bettale, Arianna Formenti, Sage Miller, Evan Neill