<a href="https://colab.research.google.com/github/dbro-dev/DataQuest_Courses/blob/master/019__Exploring_Data_with_pandas_-_Intermediate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MISSION 5: Exploring Data with pandas: Intermediate**

*Learn more techniques for selecting and analyzing data in pandas.*

In this mission, we will learn how to:

* Select columns, rows and individual items using their integer location.
* Use `pd.read_csv()` to read CSV files in pandas.
* Work with integer axis labels.
* Use pandas methods to produce boolean arrays.
* Use boolean operators to combine boolean comparisons to perform more complex analysis.
* Use index labels to align data.
* Use aggregation to perform advanced analysis using loops.

## **1. Introduction**

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's 2017 [Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible. [Click here](https://github.com/dbro-dev/DataQuest_Courses/blob/master/datasets/f500.csv) or [here](https://drive.google.com/file/d/1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut/view?usp=sharing) for the current version used in this notebook (*as my Github username may change in the future*).

![Fortune_500_logo](https://s3.amazonaws.com/dq-content/291/fortune-500.jpg)

Below is the code to import pandas and use the `pandas.read_csv()` function to read the CSV into a dataframe and assign it to the variable name `f500`.

```
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None
```

In Google Colab, however, loading a .csv file takes a few more steps. The cells below show one way to do it, using PyDrive:

In [2]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it, select "Get shareable link", and copy the unique file ID from the link.
# (Named file_id rather than id to avoid shadowing Python's built-in id() function.)
file_id = "1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut"

In [4]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('f500.csv')

In [5]:
# Import code which resembles the original code above
import pandas as pd
import numpy as np

f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None

In [6]:
# replace 0 values in the "previous_rank" column with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

Select the `rank`, `revenues`, and `revenue_change` columns in `f500`. Then, use the `DataFrame.head()` method to select the first five rows. Assign the result to `f500_selection`.


In [7]:
f500_selection = f500[['rank', 'revenues', 'revenue_change']].head()
print(f500_selection)

                          rank  revenues  revenue_change
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7


Note that a few steps earlier we used this code to load the .csv file:



```
f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None
```
With this code, the index axis labels are the values from the first column in the data set, `company`.



The `index_col` parameter is optional and specifies which column to use as the row labels for the dataframe. When we used a value of `0`, we specified that we wanted to use the first column as the row labels.

The second line, `f500.index.name = None`, deals with the name of the index axis. Both the column and index axes can have names assigned to them, and when a column is used as the index, pandas names the index axis after that column (here, `company`). Setting `f500.index.name` to `None` removes that name, so our dataframe doesn't have a name for the index axis.
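To see the effect of `index_col` and `index.name` without the full data set, here is a minimal sketch. The three-row CSV text is a hypothetical stand-in for `f500.csv`, and `demo` and `name_before` are names made up for this illustration:

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for f500.csv (not the real file).
csv_text = (
    "company,rank,revenues\n"
    "Walmart,1,485873\n"
    "State Grid,2,315199\n"
    "Sinopec Group,3,267518\n"
)

# index_col=0 turns the first column into the row labels.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The index axis keeps the name of the column it came from.
name_before = demo.index.name
print(name_before)

# Setting the name to None removes it; the labels themselves are unchanged.
demo.index.name = None
print(demo.index.name)
print(list(demo.index))
```

Only the name of the index axis changes here; the row labels and the remaining columns stay the same.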

## **2. Reading CSV files with pandas**

The more conventional way to read in a dataframe is this:

In [8]:
f500 = pd.read_csv("f500.csv")

In [9]:
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan



There are two differences with this approach:

* The company column is now included as a regular column, instead of being used for the index.
* The index labels are now integers starting from 0.
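Both differences can be checked on a small example. This is a sketch only: the two-row CSV text and the name `default_demo` are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for f500.csv.
csv_text = "company,rank\nWalmart,1\nState Grid,2\n"

# No index_col argument: pandas builds a default integer index.
default_demo = pd.read_csv(io.StringIO(csv_text))

print(default_demo.columns.tolist())  # company is an ordinary column
print(default_demo.index.tolist())    # integer labels starting at 0
```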


## **3. Using `iloc` to select by integer position**

Recall that when we worked with a dataframe with string index labels, we used `loc[]` to select data:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_loc.svg)

Just like in NumPy, we can also use integer positions to select data using `DataFrame.iloc[]` and `Series.iloc[]`. It's easy to get `loc[]` and `iloc[]` confused at first, but the easiest way to tell them apart is to remember the first letter of each method:

* `loc`: **l**abel based selection
* `iloc`: **i**nteger position based selection


Using `iloc[]` is almost identical to indexing with NumPy, with integer positions starting at `0` like ndarrays and Python lists. This is how we would perform the selection above using `iloc[]`:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_iloc.svg)

As you can see, `DataFrame.iloc[]` behaves similarly to `DataFrame.loc[]`. The full syntax for `DataFrame.iloc[]`, in pseudocode, is:



```
df.iloc[row_index, column_index]
```
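As a rough illustration of the two styles side by side, the sketch below selects the same value once by label and once by integer position. The dataframe `demo` is a made-up miniature with string index labels, using the first three companies shown earlier:

```python
import pandas as pd

# Miniature dataframe with string index labels (illustration only).
demo = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# loc: row label, then column label.
by_label = demo.loc["State Grid", "revenues"]

# iloc: row position, then column position (both start at 0).
by_position = demo.iloc[1, 1]

# A single integer selects a whole row, returned as a Series.
second_row = demo.iloc[1]

print(by_label, by_position)
print(second_row)
```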



Practice:

1. Select just the fifth row of the `f500` dataframe. Assign the result to `fifth_row`.

In [10]:
fifth_row = f500.iloc[4]

2. Select the value in the first row of the `company` column. Assign the result to `company_value`.

In [11]:
company_value = f500.iloc[0, 0]

## **4. Using `iloc` to select by integer position continued**

To select just the first column from our `f500` dataframe, we use `:` (a colon) to specify all rows, and then use the integer `0` to specify the first column:



In [12]:
first_column = f500.iloc[:,0]
print(first_column)

0                             Walmart
1                          State Grid
2                       Sinopec Group
3            China National Petroleum
4                        Toyota Motor
                    ...              
495    Teva Pharmaceutical Industries
496          New China Life Insurance
497         Wm. Morrison Supermarkets
498                               TUI
499                        AutoNation
Name: company, Length: 500, dtype: object


Slicing works as well. To select the rows at index positions one to four (inclusive), that is, the second to fifth rows:

In [13]:
second_to_fifth_rows = f500[1:5]



```
             company  rank  revenues ... employees  total_stockholder_equity
1         State Grid     2    315199 ...    926067                    209456
2      Sinopec Group     3    267518 ...    713288                    106523
3  China National...     4    262573 ...   1512048                    301893
4       Toyota Motor     5    254694 ...    364445                    157210
```



In the example above, the row at index position `5` is not included, just as if we were slicing with a Python list or NumPy ndarray. Recall that `loc[]` handles slicing differently:

* With `loc[]`, the end of the slice is included.
* With `iloc[]`, the end of the slice is not included.
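The difference between the two slicing rules is easy to verify on a toy dataframe. In this sketch, `demo` and its labels `a` to `e` are invented for the purpose:

```python
import pandas as pd

# Toy dataframe with string labels (illustration only).
demo = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# Label slice: the end label "d" is included.
label_slice = demo.loc["b":"d"]

# Position slice: position 3 (label "d") is not included.
position_slice = demo.iloc[1:3]

print(list(label_slice.index))
print(list(position_slice.index))
```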


The table below summarizes how we can use `DataFrame.iloc[]` and `Series.iloc[]` to select by integer position:


| Select by integer position | Explicit Syntax | Shorthand Convention |
| --- | --- | --- |
| Single column from dataframe | `df.iloc[:,3]` | |
| List of columns from dataframe | `df.iloc[:,[3,5,6]]` | |
| Slice of columns from dataframe | `df.iloc[:,3:7]` | |
| Single row from dataframe | `df.iloc[20]` | |
| List of rows from dataframe | `df.iloc[[0,3,8]]` | |
| Slice of rows from dataframe | `df.iloc[3:5]` | `df[3:5]` |
| Single item from series | `s.iloc[8]` | `s[8]` |
| List of items from series | `s.iloc[[2,8,1]]` | `s[[2,8,1]]` |
| Slice of items from series | `s.iloc[5:10]` | `s[5:10]` |






**Practice:**
1. Select the first three rows of the `f500` dataframe. Assign the result to `first_three_rows`.


In [14]:
first_three_rows = f500.iloc[:3]
print(first_three_rows)

         company  rank  ...  employees  total_stockholder_equity
0        Walmart     1  ...    2300000                     77798
1     State Grid     2  ...     926067                    209456
2  Sinopec Group     3  ...     713288                    106523

[3 rows x 17 columns]


2. Select the first and seventh rows and the first five columns of the `f500` dataframe. Assign the result to `first_seventh_row_slice`.

In [15]:
first_seventh_row_slice = f500.iloc[[0,6], :5]
print(first_seventh_row_slice)

             company  rank  revenues  revenue_change  profits
0            Walmart     1    485873             0.8  13643.0
6  Royal Dutch Shell     7    240033           -11.8   4575.0


## **5. Using pandas methods to create boolean masks**

Besides the comparison operators `>`, `<`, and `==`, pandas has methods that return boolean masks. Two examples are:
* `Series.isnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html)
* `Series.notnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html)

First, let's use the `Series.isnull()` method to view rows with null values in the `revenue_change` column:

In [16]:
rev_is_null = f500["revenue_change"].isnull()
print(rev_is_null.head())

0    False
1    False
2    False
3    False
4    False
Name: revenue_change, dtype: bool


Using `Series.isnull()` resulted in a **boolean series**. Just like in NumPy, we can use this series to filter our dataframe, `f500`:

In [17]:
rev_change_null = f500[rev_is_null]
print(rev_change_null[["company","country","sector"]])

                        company  country      sector
90                       Uniper  Germany      Energy
180  Hewlett Packard Enterprise      USA  Technology


We can confirm that the two companies with missing values for the `revenue_change` column are Uniper, a German energy company, and Hewlett Packard Enterprise, an American technology company.
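`Series.notnull()` works the same way but returns the opposite mask. A minimal sketch on made-up data (the companies `A`, `B`, `C` and their values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value (illustration only).
demo = pd.DataFrame(
    {"company": ["A", "B", "C"], "revenue_change": [0.8, np.nan, -4.4]}
)

is_null = demo["revenue_change"].isnull()    # True where the value is missing
not_null = demo["revenue_change"].notnull()  # True where the value is present

missing = demo[is_null]    # rows with a missing revenue_change
present = demo[not_null]   # rows with a revenue_change value

print(missing["company"].tolist())
print(present["company"].tolist())
```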

Let's use what we've learned to find the null values in the `previous_rank` column next:

Use the `Series.isnull()` method to select all rows from `f500` that have a null value for the `previous_rank` column. Select only the `company`, `rank`, and `previous_rank` columns. Assign the result to `null_previous_rank`.



In [18]:
null_previous_rank = f500[f500["previous_rank"].isnull()][["company","rank", "previous_rank"]]

print(null_previous_rank.head())

                    company  rank  previous_rank
48    Legal & General Group    49            NaN
90                   Uniper    91            NaN
123       Dell Technologies   124            NaN
138  Anbang Insurance Group   139            NaN
140         Albertsons Cos.   141            NaN


## **6. Working with Integer Labels**

Always think carefully about whether you want to select by *label* or *integer position*. Use `DataFrame.loc[]` or `DataFrame.iloc[]` accordingly. 

![.iloc vs .loc](https://s3.amazonaws.com/dq-content/292/integer_labels_2.svg)
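The following sketch shows why the choice matters once rows have been filtered out. The dataframe `demo` is invented for illustration; after filtering, the same integer `1` refers to different rows depending on whether it is read as a label or as a position:

```python
import pandas as pd

# Invented data with the default integer index 0..3.
demo = pd.DataFrame(
    {"company": ["A", "B", "C", "D"], "employees": [5, 6, 7, 8]}
)

# Filtering keeps the original labels: only labels 1 and 3 remain.
subset = demo[demo["employees"] % 2 == 0]
print(subset.index.tolist())

# Position 1 is the second remaining row (label 3).
row_by_position = subset.iloc[1]

# Label 1 is the row that had label 1 in the original dataframe.
row_by_label = subset.loc[1]

print(row_by_position["company"], row_by_label["company"])
```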

Practice:

Assign the first five rows of the `null_previous_rank` dataframe to the variable `top5_null_prev_rank` by choosing the correct method out of either `loc[]` or `iloc[]`.

In [19]:
null_previous_rank = f500[f500["previous_rank"].isnull()]

top5_null_prev_rank = null_previous_rank.iloc[:5]

print(top5_null_prev_rank)

                    company  rank  ...  employees  total_stockholder_equity
48    Legal & General Group    49  ...       8939                      8579
90                   Uniper    91  ...      12890                     12889
123       Dell Technologies   124  ...     138000                     13243
138  Anbang Insurance Group   139  ...      40707                     20372
140         Albertsons Cos.   141  ...     273000                      1371

[5 rows x 17 columns]


## **7. Pandas Index Alignment**

Now that we've identified the rows with null values in the `previous_rank` column, let's use the `Series.notnull()` method to exclude them from the next part of our analysis.

In [20]:
previously_ranked = f500[f500["previous_rank"].notnull()]

We can then create a `rank_change` column by subtracting the `rank` column from the `previous_rank` column:

In [21]:
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]

f500["rank_change"] = rank_change

print(rank_change.shape)
print(rank_change.tail(3))

(467,)
496   -70.0
497   -61.0
498   -32.0
dtype: float64


Above, we can see that our `rank_change` series has 467 rows. Since the last integer index label is 498, we know that our index labels no longer align with the integer positions. 

In the cell above, we also added the `rank_change` series to the `f500` dataframe as a new column, even though the series has fewer rows than `f500` and its index labels no longer match the integer positions. How does this work?

Simple. In pandas almost every operation will align on the index labels. Let's look at an example – below we have a dataframe named `food` and a series named `alt_name`:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_1_updated.svg)

The `food` dataframe and the `alt_name` series not only have a different number of items, but they also only have two of the same index labels - `corn` and `eggplant` - and they're in different orders. If we wanted to add `alt_name` as a new column in our `food` dataframe, we can use the following code:


```
food["alt_name"] = alt_name
```



When we do this, pandas will ignore the order of the `alt_name` series, and align on the index labels:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_2_updated.svg)

Pandas will also:

* **Discard** any items that have an **index that doesn't match the dataframe** (like arugula).
* Fill any remaining rows with **NaN**.

Below is the result:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_5_updated.svg)

The **pandas library will align on index at every opportunity**, no matter if our index labels are strings or integers - this makes working with data from different sources or working with data when we have removed, added, or reordered rows much easier than it would be otherwise.
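The `food` and `alt_name` example can be reproduced in a few lines. This is a sketch: the index labels `corn`, `eggplant`, and `arugula` come from the description above, while the `tomato` row, the `colour` column, and all values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical food dataframe; "tomato" has no match in alt_name.
food = pd.DataFrame(
    {"colour": ["yellow", "purple", "red"]},
    index=["corn", "eggplant", "tomato"],
)

# Hypothetical series in a different order, with an extra label (arugula).
alt_name = pd.Series(
    {"arugula": "rocket", "eggplant": "aubergine", "corn": "maize"}
)

# Assignment aligns on index labels, not on order.
food["alt_name"] = alt_name
print(food)
```

The `arugula` entry is dropped because `food` has no such label, and `tomato` gets `NaN` because `alt_name` has no value for it.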

## **8. Using Boolean Operators**

Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values. However, to answer more complex questions, we need to learn how to combine boolean arrays.

To recap, boolean arrays are created using any of the Python standard comparison operators: `==` (equal), `>` (greater than), `<` (less than), `!=` (not equal).

We combine boolean arrays using boolean operators. In Python, these boolean operators are `and`, `or`, and `not`. In pandas, the operators are slightly different:

pandas |	Python equivalent |	Meaning
--- | --- | ---
a & b |	a and b |	True if both a and b are True, else False
a \| b |	a or b |	True if either a or b is True
~a |	not a |	True if a is False, else False

Let's look at an example using `f500_sel`, a small selection of our `f500` dataframe:

![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_1.svg)

Suppose we wanted to find the companies in `f500_sel` with more than 265 billion in revenue that are headquartered in China. We'll start by performing two boolean comparisons to produce two separate boolean arrays (the revenue column is already in millions).

![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_2.svg)

We then use the `&` operator to combine the two boolean arrays using boolean "and" logic:
![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_3.svg)

Lastly, we use the combined boolean array to perform selection on our dataframe:
![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_4.svg)

The result gives us the two companies from `f500_sel` that are both Chinese and have over 265 billion in revenue.
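The same steps can be written out in code. In this sketch, `demo` is a four-row stand-in for `f500_sel` built from values that appear elsewhere in this notebook; the variable names are chosen for illustration:

```python
import pandas as pd

# Four-row stand-in for f500_sel (revenues are in millions).
demo = pd.DataFrame(
    {
        "company": ["Walmart", "State Grid", "Sinopec Group", "Toyota Motor"],
        "revenues": [485873, 315199, 267518, 254694],
        "country": ["USA", "China", "China", "Japan"],
    }
)

over_265bil = demo["revenues"] > 265000   # boolean series
china = demo["country"] == "China"        # boolean series

both = over_265bil & china     # "and": True only where both are True
either = over_265bil | china   # "or": True where at least one is True
not_china = ~china             # "not": flips True and False

print(demo[both]["company"].tolist())
```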

**Let's practice more complex selection using boolean operators:**


Select all companies with **revenues over 100 billion and negative profits** from the `f500` dataframe. The result should include all columns.

* Create a boolean array that selects the companies with revenues greater than 100 billion. Assign the result to `large_revenue`.
* Create a boolean array that selects the companies with profits less than 0. Assign the result to `negative_profits`.
* Combine `large_revenue` and `negative_profits`. Assign the result to `combined`.
* Use `combined` to filter `f500`, and show the result for `company`, `revenues`, and `profits`. Assign the result to `big_rev_neg_profit`.


In [24]:
large_revenue = f500["revenues"] > 100000

negative_profits = f500["profits"] < 0

combined = large_revenue & negative_profits

final_cols = ["company","revenues", "profits"]

big_rev_neg_profit = f500.loc[combined,final_cols]

print(big_rev_neg_profit)

                company  revenues  profits
32  Japan Post Holdings    122990   -267.4
44              Chevron    107567   -497.0


## **9. Using Boolean Operators Continued**

Just like when we use a single boolean array to perform selection, **we don't need to use intermediate variables**. We can shorten our code by combining our two boolean arrays in a single line, instead of assigning them to the intermediate `large_revenue` and `negative_profits` variables first:



```
combined = (f500["revenues"] > 100000) & (f500["profits"] < 0)
```

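Notice that we used parentheses around each of our boolean comparisons. This is very important: **our boolean operation will fail without parentheses**. In Python, `&` binds more tightly than comparison operators such as `>` and `<`, so without parentheses Python would try to evaluate `100000 & f500["profits"]` first.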
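A small experiment makes the failure concrete. This is a sketch on invented integer data; the exact exception type can depend on the column dtypes, so the code catches both `TypeError` and `ValueError`:

```python
import pandas as pd

# Invented two-row data (illustration only).
demo = pd.DataFrame({"revenues": [122990, 90000], "profits": [-267, 500]})

# With parentheses: each comparison is evaluated first, then combined.
combined = (demo["revenues"] > 100000) & (demo["profits"] < 0)
print(combined.tolist())

# Without parentheses: Python evaluates 100000 & demo["profits"] first,
# and the resulting chained comparison cannot be reduced to a single bool.
try:
    demo["revenues"] > 100000 & demo["profits"] < 0
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print(type(err).__name__)
```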



Lastly, instead of assigning the boolean arrays to `combined`, we can insert the comparison directly into our selection:



```
big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]
```
Whether to perform this final step is very much a matter of taste. As always, your decision should be driven by what will make your code more readable.

Again:

pandas |	Python equivalent |	Meaning
--- | --- | ---
a & b |	a and b |	True if both a and b are True, else False
a \| b |	a or b |	True if either a or b is True
~a |	not a |	True if a is False, else False


**Practice:**


* Select all rows for companies whose `country` value is either Brazil or Venezuela. Assign the result to `brazil_venezuela`.

In [25]:
brazil_venezuela = f500[(f500["country"] == "Brazil") | (f500["country"] == "Venezuela")]

print(brazil_venezuela)

                             company  ...  rank_change
74                         Petrobras  ...        -17.0
112            Itau Unibanco Holding  ...         46.0
150                  Banco do Brasil  ...        -36.0
153                   Banco Bradesco  ...         55.0
190                              JBS  ...         -6.0
369                             Vale  ...         47.0
441  Mercantil Servicios Financieros  ...          NaN
486                Ultrapar Holdings  ...        -13.0

[8 rows x 18 columns]


* Select the first five companies in the Technology sector for which the country is not the USA from the `f500` dataframe. Assign the result to `tech_outside_usa`.

In [26]:
tech_outside_usa = f500[(f500["sector"] == "Technology") & ~(f500["country"] == "USA")].head()

print(tech_outside_usa)

                         company  rank  ...  total_stockholder_equity  rank_change
14           Samsung Electronics    15  ...                    154376         -2.0
26    Hon Hai Precision Industry    27  ...                     33476         -2.0
70                       Hitachi    71  ...                     26632          8.0
82   Huawei Investment & Holding    83  ...                     20159         46.0
104                         Sony   105  ...                     22415          8.0

[5 rows x 18 columns]


## **10. Sorting Values**

Suppose we wanted to find the company that employs the most people in China. We can accomplish this by first selecting all of the rows where the `country` column equals `China`:

In [27]:
selected_rows = f500[f500["country"] == "China"]

Then, we can use the `DataFrame.sort_values()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) to sort the rows on the `employees` column. To do so, we pass the column name to the method:

In [28]:
sorted_rows = selected_rows.sort_values("employees")
print(sorted_rows[["company", "country", "employees"]].head())

                                company country  employees
204                         Noble Group   China       1000
458             Yango Financial Holding   China      10234
438  China National Aviation Fuel Group   China      11739
128                         Tewoo Group   China      17353
182            Amer International Group   China      17852


By default, the `sort_values()` method will sort the rows in *ascending* order — from smallest to largest.

To sort the rows in *descending* order instead, so the company with the largest number of employees appears first, we can set the `ascending` parameter to `False`:


In [29]:
sorted_rows = selected_rows.sort_values("employees", ascending=False)

print(sorted_rows[["company", "country", "employees"]].head())

                        company country  employees
3      China National Petroleum   China    1512048
118            China Post Group   China     941211
1                    State Grid   China     926067
2                 Sinopec Group   China     713288
37   Agricultural Bank of China   China     501368


Now, we can see that the Chinese company that employs the most people is China National Petroleum. Let's find the Japanese company with the most employees next.

In [33]:
japan = f500[f500["country"] == "Japan"]

japan_sorted = japan.sort_values("employees", ascending=False)

first_row = japan_sorted.iloc[0]

top_japanese_employer = first_row["company"]

# Alternatively, the last 2 steps can be combined
# top_japanese_employer = japan_sorted.iloc[0]["company"]

print(f"The top Japanese employer is {top_japanese_employer}.")

The top Japanese employer is Toyota Motor.
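One detail worth noting in this pattern is that sorting keeps the original index labels, which is why the first row is taken with `iloc[0]` rather than `loc[0]`. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data with the default integer index.
demo = pd.DataFrame(
    {"company": ["A", "B", "C"], "employees": [100, 300, 200]}
)

ascending = demo.sort_values("employees")                    # smallest first
descending = demo.sort_values("employees", ascending=False)  # largest first

# Rows are reordered, but each row keeps its original label.
print(descending.index.tolist())

top_by_position = descending.iloc[0]["company"]  # first row after sorting
row_with_label_0 = descending.loc[0]["company"]  # row whose label is 0

print(top_by_position, row_with_label_0)
```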


## **11. Using Loops with pandas**

Suppose we wanted to find the company that employs the most people in each of the 34 countries. Repeating the method from the last screen by hand for every country would be very inefficient, so we'll rely on a technique we haven't used yet with pandas: loops.

We've explicitly avoided using loops in pandas because one of the key benefits of pandas is that it has vectorized methods to work with data more efficiently. We'll learn more advanced techniques in later courses, but for now, we'll learn how to use loops for aggregation.

**Aggregation** is **where we apply a statistical operation to groups of our data**. 

Let's say that we wanted to calculate the average revenue for each country in the data set. Our process might look like this:

* Identify each unique country in the data set.
* For each country:
  * Select only the rows corresponding to that country.
  * Calculate the average revenue for those rows.

To identify the unique countries, we can use the `Series.unique()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html). This method returns an array of unique values from any series. Then, we can loop over that array and perform our operation. Here's what that looks like:

In [34]:
# Create an empty dictionary to store the results
avg_rev_by_country = {}

# Create an array of unique countries
countries = f500["country"].unique()

# Use a for loop to iterate over the countries
for c in countries:
    # Use boolean comparison to select only rows that
    # correspond to a specific country
    selected_rows = f500[f500["country"] == c]
    # Calculate the mean average revenue for just those rows
    mean = selected_rows["revenues"].mean()
    # Assign the mean value to the dictionary, using the
    # country name as the key
    avg_rev_by_country[c] = mean

The resulting dictionary is below:

In [36]:
print(avg_rev_by_country)

{'USA': 64218.371212121216, 'China': 55397.880733944956, 'Japan': 53164.03921568627, 'Germany': 63915.0, 'Netherlands': 61708.92857142857, 'Britain': 51588.708333333336, 'South Korea': 49725.6, 'Switzerland': 51353.57142857143, 'France': 55231.793103448275, 'Taiwan': 46364.666666666664, 'Singapore': 54454.333333333336, 'Italy': 51899.57142857143, 'Russia': 65247.75, 'Spain': 40600.666666666664, 'Brazil': 52024.57142857143, 'Mexico': 54987.5, 'Luxembourg': 56791.0, 'India': 39993.0, 'Malaysia': 49479.0, 'Thailand': 48719.0, 'Australia': 33688.71428571428, 'Belgium': 45905.0, 'Norway': 45873.0, 'Canada': 31848.0, 'Ireland': 32819.5, 'Indonesia': 36487.0, 'Denmark': 35464.0, 'Saudi Arabia': 35421.0, 'Sweden': 27963.666666666668, 'Finland': 26113.0, 'Venezuela': 24403.0, 'Turkey': 23456.0, 'U.A.E': 22799.0, 'Israel': 21903.0}
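To check the logic of the loop, it can help to run it on data small enough to verify by hand. The sketch below uses an invented five-row dataframe. As a sanity check it compares the result with `DataFrame.groupby()`, one of the more advanced techniques not covered in this mission:

```python
import pandas as pd

# Invented data, small enough to verify by hand.
demo = pd.DataFrame(
    {
        "country": ["USA", "China", "USA", "China", "Japan"],
        "revenues": [100, 200, 300, 400, 500],
    }
)

avg_rev = {}
for c in demo["country"].unique():
    # Select the revenues for one country and average them.
    avg_rev[c] = demo.loc[demo["country"] == c, "revenues"].mean()

print(avg_rev)

# Sanity check against groupby (not covered in this mission).
check = demo.groupby("country")["revenues"].mean().to_dict()
print(check == avg_rev)
```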


We'll practice this pattern to find the company that employs the most people in each country. Create a dictionary of the top employer in each country:

1. Create an empty dictionary, `top_employer_by_country` to store the results of the exercise.
2. Use the `Series.unique()` method to create an array of unique values from the `country` column.
3. Use a for loop to iterate over the array of unique countries. In each iteration:

* Select only the rows that have a country name equal to the current iteration.
* Use `DataFrame.sort_values()` to sort those rows by the `employees` column in descending order.
* Select the first row from the sorted dataframe.
* Extract the company name from the first row using the `company` index label.
* Assign the results to the `top_employer_by_country` dictionary, using the country name as the key, and the company name as the value.



In [37]:
top_employer_by_country = {}

countries = f500["country"].unique()

for c in countries:
    selected_rows = f500[f500["country"] == c]
    sorted_rows = selected_rows.sort_values("employees", ascending=False)
    top_company = sorted_rows.iloc[0]["company"]
    top_employer_by_country[c] = top_company

print(top_employer_by_country)

{'USA': 'Walmart', 'China': 'China National Petroleum', 'Japan': 'Toyota Motor', 'Germany': 'Volkswagen', 'Netherlands': 'EXOR Group', 'Britain': 'Compass Group', 'South Korea': 'Samsung Electronics', 'Switzerland': 'Nestle', 'France': 'Sodexo', 'Taiwan': 'Hon Hai Precision Industry', 'Singapore': 'Flex', 'Italy': 'Poste Italiane', 'Russia': 'Gazprom', 'Spain': 'Banco Santander', 'Brazil': 'JBS', 'Mexico': 'America Movil', 'Luxembourg': 'ArcelorMittal', 'India': 'State Bank of India', 'Malaysia': 'Petronas', 'Thailand': 'PTT', 'Australia': 'Wesfarmers', 'Belgium': 'Anheuser-Busch InBev', 'Norway': 'Statoil', 'Canada': 'George Weston', 'Ireland': 'Accenture', 'Indonesia': 'Pertamina', 'Denmark': 'Maersk Group', 'Saudi Arabia': 'SABIC', 'Sweden': 'H & M Hennes & Mauritz', 'Finland': 'Nokia', 'Venezuela': 'Mercantil Servicios Financieros', 'Turkey': 'Koc Holding', 'U.A.E': 'Emirates Group', 'Israel': 'Teva Pharmaceutical Industries'}


## **12. Challenge: Calculating Return on Assets by Country**

Now it's time for a challenge to bring everything together! In this challenge we're going to add a new column to our dataframe, and then perform some aggregation using that new column.

The column we create is going to contain a metric called [return on assets](https://www.inc.com/encyclopedia/return-on-assets-roa.html) (ROA). ROA is a business-specific metric which indicates a company's ability to make profit using its available assets.

ROA = (profit) / (assets)
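Because pandas divides columns element by element, the formula translates directly into a single column operation. A minimal sketch with invented numbers (the companies `A` and `B` and their figures are hypothetical):

```python
import pandas as pd

# Invented figures (illustration only).
demo = pd.DataFrame(
    {
        "company": ["A", "B"],
        "profits": [50.0, -10.0],
        "assets": [1000.0, 200.0],
    }
)

# Element-wise division: one ROA value per row.
demo["roa"] = demo["profits"] / demo["assets"]
print(demo)
```

A company with negative profits gets a negative ROA, as in the second row here.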

Once we've created the new column, we'll aggregate by sector, and find the company with the highest ROA from each sector. Like previous challenges, we'll provide some guidance in the hints, but try to complete it without them if you can.

Don't be discouraged if this challenge takes a few attempts to get correct. Working iteratively is a great way to work, and this challenge is more difficult than exercises you have previously completed.

**Instructions:**

1. Create a new column `roa` in the `f500` dataframe, containing the return on assets metric for each company.
2. Aggregate the data by the `sector` column, and create a dictionary `top_roa_by_sector`, with:
  * Dictionary keys with the sector name.
  * Dictionary values with the company name with the highest ROA value from that sector.


In [38]:
f500["roa"] = f500["profits"]/f500["assets"]

top_roa_by_sector = {}

sectors = f500["sector"].unique()

for s in sectors:
    selected_sectors = f500[f500["sector"] == s]
    sorted_sectors = selected_sectors.sort_values("roa",ascending=False)
    top_roa = sorted_sectors.iloc[0]["company"]
    top_roa_by_sector[s] = top_roa
    
print(top_roa_by_sector)

{'Retailing': 'H & M Hennes & Mauritz', 'Energy': 'National Grid', 'Motor Vehicles & Parts': 'Subaru', 'Financials': 'Berkshire Hathaway', 'Technology': 'Accenture', 'Wholesalers': 'McKesson', 'Health Care': 'Gilead Sciences', 'Telecommunications': 'KDDI', 'Engineering & Construction': 'Pacific Construction Group', 'Industrials': '3M', 'Food & Drug Stores': 'Publix Super Markets', 'Aerospace & Defense': 'Lockheed Martin', 'Food, Beverages & Tobacco': 'Philip Morris International', 'Household Products': 'Unilever', 'Transportation': 'Delta Air Lines', 'Materials': 'CRH', 'Chemicals': 'LyondellBasell Industries', 'Media': 'Disney', 'Apparel': 'Nike', 'Hotels, Restaurants & Leisure': 'McDonald’s', 'Business Services': 'Adecco Group'}
