<a href="https://colab.research.google.com/github/dbro-dev/DataQuest_Courses/blob/master/017__Introduction_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **COURSE 1/6: Pandas and NumPy Fundamentals**

*Learn how to use the pandas library to work with data.*

## **MISSION 3: Introduction to pandas**

In this mission, we learn:

* How pandas and NumPy combine to make working with data easier.
* About the two core pandas types: series and dataframes.
* How to select data from pandas objects using axis labels.

**1. Understanding pandas and NumPy**

NumPy is useful, but it has its shortcomings:
* The lack of support for column names forces us to frame questions as multi-dimensional array operations.
* Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
* There are lots of low-level methods, but many common analysis patterns don't have pre-built methods.

The pandas library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as **an extension of NumPy**. The underlying code for pandas uses the NumPy library extensively, which means the NumPy concepts you've been learning will come in handy as you begin to learn more about pandas.

The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:

* Axis values can have **string labels**, not just numeric ones.
* Dataframes **can contain columns with multiple data types**, including integer, float, and string.
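
The two differences above can be seen in a minimal sketch. The names `mixed_array` and `toy_df` are made up for illustration, and the two rows are a toy stand-in rather than the real data set:

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing numbers and text
# coerces every value to a string.
mixed_array = np.array([[1, "Walmart"], [2, "State Grid"]])
print(mixed_array.dtype)  # a Unicode string dtype

# A dataframe keeps one dtype per column and labels both axes with strings.
toy_df = pd.DataFrame(
    {"rank": [1, 2], "revenues": [485873.0, 315199.0]},
    index=["Walmart", "State Grid"],
)
print(toy_df.dtypes)
print(toy_df.loc["State Grid", "revenues"])
```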

![alt text](https://s3.amazonaws.com/dq-content/291/df_anatomy_static_resized.svg)

**2. Introduction to the Data**

For this course, we will work with a data set from [Fortune](https://fortune.com/) magazine's 2017 [Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible.

![Fortune_500_logo](https://s3.amazonaws.com/dq-content/291/fortune-500.jpg)

Below is the code to import pandas and use the `pandas.read_csv()` function to read the CSV into a dataframe and assign it to the variable name `f500`.

```
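import pandas as pd

# Use the first CSV column (the company names) as the row labels.
f500 = pd.read_csv('f500.csv', index_col=0)
# Clear the index name so it is not displayed above the row labels.
f500.index.name = None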
```





In Google Colab, however, loading a .csv file takes a few extra steps. The cells below show one way to do it:

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# After authenticating, right-click the CSV file in Google Drive, select "Get shareable link", and copy the unique file id out of the link.
# The variable is named file_id so that it does not shadow Python's built-in id() function.
file_id = "1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut"

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('f500.csv')

In [None]:
# Read the downloaded file with the same code as in the original example above
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None

Like NumPy's ndarrays, pandas' dataframes have a `.shape` attribute which returns a tuple representing the dimensions of each axis of the object. We'll use that and Python's `type()` function to inspect the `f500` dataframe.

In [None]:
f500_type = type(f500)
print(f500_type)

f500_shape = f500.shape
print(f500_shape)

<class 'pandas.core.frame.DataFrame'>
(500, 16)


This lets us know our data has 500 rows and 16 columns, and is stored as a `pandas.core.frame.DataFrame` object — or just dataframe, the primary pandas data structure.

**3. Introducing DataFrames**

To view the first few rows of our dataframe, we can use the `DataFrame.head()` method. By default, it will return the first five rows of our dataframe. However, it also accepts an optional integer parameter, which specifies the number of rows.

The `DataFrame.tail()` method works the same way for the last rows of our dataframe: it returns the last five rows by default and accepts the same optional integer parameter.

In [None]:
f500_head = f500.head(6)
print(f500_head)

f500_tail = f500.tail(8)
print(f500_tail)

                          rank  revenues  ...  employees  total_stockholder_equity
Walmart                      1    485873  ...    2300000                     77798
State Grid                   2    315199  ...     926067                    209456
Sinopec Group                3    267518  ...     713288                    106523
China National Petroleum     4    262573  ...    1512048                    301893
Toyota Motor                 5    254694  ...     364445                    157210
Volkswagen                   6    240264  ...     626715                     97753

[6 rows x 16 columns]
                                       rank  ...  total_stockholder_equity
Telecom Italia                          493  ...                     22366
Xiamen ITG Holding Group                494  ...                      1066
Xinjiang Guanghui Industry Investment   495  ...                      4563
Teva Pharmaceutical Industries          496  ...                     33337
New China Life Insura

**4. Introducing DataFrames Continued**

We can use the `DataFrame.dtypes` attribute (similar to NumPy's `ndarray.dtype` attribute) to return information about the types of each column.

If we wanted an overview of all the dtypes used in our dataframe, along with its shape and other information, we could use the `DataFrame.info()` method. Note that `DataFrame.info()` prints the information, rather than returning it, so we can't assign it to a variable.
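
The difference between the attribute and the method can be checked on a toy dataframe. This is only a sketch: `toy_df` and its values are made up for illustration and are not part of the f500 data.

```python
import pandas as pd

# Hypothetical toy data; the None becomes a missing value (NaN).
toy_df = pd.DataFrame(
    {"rank": [1, 2, 3], "profits": [10.5, 7.25, None]},
    index=["Company A", "Company B", "Company C"],
)

# .dtypes is an attribute that returns a series: one dtype per column label.
column_types = toy_df.dtypes
print(column_types)

# .info() prints its report and returns None, so there is nothing to keep.
info_result = toy_df.info()
print(info_result)
```

Because `info_result` is `None`, assigning the result of `DataFrame.info()` to a variable does not store the report.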

In [None]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

**5. Selecting a Column From a DataFrame by Label**

Because our axes in pandas have labels, we can select data using those labels — unlike in NumPy, where we needed to know the exact index location. To do this, we can use the `DataFrame.loc[]` attribute. The syntax for `DataFrame.loc[]` is:
```
df.loc[row_label, column_label]
```
Notice that we use brackets (`[ ]`) instead of parentheses (`( )`) when selecting with `loc[]`.


```
f500.loc[:, "rank"]
```



We can also use the following shortcut to select a single column:
```
rank_col = f500["rank"]
```
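
To check that the explicit syntax and the shortcut select the same data, here is a small self-contained sketch. The dataframe `toy_df` and its values are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical stand-in for a small piece of the f500 data.
toy_df = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [300, 200, 100]},
    index=["Company A", "Company B", "Company C"],
)

explicit = toy_df.loc[:, "rank"]   # every row, the "rank" column
shorthand = toy_df["rank"]         # same selection, shorter spelling

# Both selections hold the same values with the same row labels.
print(explicit.equals(shorthand))
```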



In [None]:
# Select the industry column. Assign the result to the variable name industries.

industries = f500["industry"]
industries_type = type(industries)
print(industries_type)

<class 'pandas.core.series.Series'>


Selecting a single column gives us a `Series` rather than a `DataFrame`. A 1D pandas object is called a **series**, and a 2D pandas object is called a **dataframe**.

**6. Introduction to Series**

You can think of a dataframe as a **collection of series objects**, which is similar to how pandas stores the data behind the scenes.

![alt text](https://s3.amazonaws.com/dq-content/291/df_exploded_resized.svg)

As we continue learning how to select data, pay attention to which objects are dataframes and which objects are series.
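
The "collection of series" idea can be sketched by building a dataframe from two series that share the same labels. All names and values here are made up for illustration:

```python
import pandas as pd

labels = ["Company A", "Company B", "Company C"]
rank = pd.Series([1, 2, 3], index=labels)
sector = pd.Series(["Retailing", "Energy", "Energy"], index=labels)

# Each series becomes one column; the shared index becomes the row labels.
toy_df = pd.DataFrame({"rank": rank, "sector": sector})

one_column = toy_df["rank"]               # a single label gives a series
two_columns = toy_df[["rank", "sector"]]  # a list of labels gives a dataframe

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

Note that a list of labels returns a dataframe even when the list contains only one label.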

**7. Selecting Columns From a DataFrame by Label Continued**

A summary of the techniques we've learned so far is below:

|Select by Label |	Explicit Syntax |	Common Shorthand|
| --- | --- | ---|
| Single column from dataframe |	`df.loc[:,"col1"]` |	`df["col1"]` |
| List of columns from dataframe |	`df.loc[:,["col1", "col7"]]` |	`df[["col1", "col7"]]` |
| Slice of columns from dataframe|	`df.loc[:,"col1":"col4"]` |	- |
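
One detail worth checking on a toy example is that a slice by label includes both endpoints, unlike a Python list slice. The dataframe below is hypothetical and uses the placeholder labels from the table:

```python
import pandas as pd

toy_df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=["row1", "row2"],
    columns=["col1", "col2", "col3", "col4"],
)

# A label slice includes BOTH endpoints.
label_slice = toy_df.loc[:, "col2":"col4"]
print(list(label_slice.columns))   # ['col2', 'col3', 'col4']

# A list selects exactly the named columns, in the order given.
picked = toy_df[["col4", "col1"]]
print(list(picked.columns))        # ['col4', 'col1']
```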


Instructions:
1. Select the `country` column. Assign the result to the variable name `countries`.
2. In order, select the `revenues` and `years_on_global_500_list` columns. Assign the result to the variable name `revenues_years`.
3. In order, select all columns from `ceo` up to and including `sector`. Assign the result to the variable name `ceo_to_sector`.



In [None]:
countries = f500["country"] # Or f500.loc[:,"country"]
revenues_years = f500[["revenues", "years_on_global_500_list"]] # Or f500.loc[:,["revenues", "years_on_global_500_list"]]
ceo_to_sector = f500.loc[:,"ceo":"sector"]

**8. Selecting Rows From a DataFrame by Label**

|Select by Label |	Explicit Syntax |	Common Shorthand|
| --- | --- | ---|
| Single row from dataframe |	`df.loc["row4"]` 	| |
| List of rows from dataframe |	`df.loc[["row1", "row8"]]` 	| |
|Slice of rows from dataframe |	`df.loc["row3":"row5"]` |	`df["row3":"row5"]`|



```
# Select a single row
single_row = f500.loc["Sinopec Group"]

# Select a list of rows
list_rows = f500.loc[["Toyota Motor", "Walmart"]]

# Select a slice of rows by label
slice_rows = f500["State Grid":"Toyota Motor"]
```
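
The three row selections return different kinds of objects, which a small sketch makes visible. The dataframe `toy_df` and its values are hypothetical:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"rank": [1, 2, 3, 4], "country": ["USA", "China", "China", "Japan"]},
    index=["Company A", "Company B", "Company C", "Company D"],
)

single_row = toy_df.loc["Company C"]                # series, indexed by column labels
list_rows = toy_df.loc[["Company D", "Company A"]]  # dataframe, rows in list order
slice_rows = toy_df["Company B":"Company D"]        # dataframe, both ends included

print(type(single_row).__name__)
print(list(single_row.index))
print(list(list_rows.index))
print(list(slice_rows.index))
```

A single row comes back as a series whose index is the dataframe's column labels, while a list or slice of rows comes back as a dataframe.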




1. Create a new variable `toyota`, with:
* Just the row with index `Toyota Motor`.
* All columns.
2. Create a new variable, `drink_companies`, with:
* Rows with indices `Anheuser-Busch InBev`, `Coca-Cola`, and `Heineken Holding`, in that order.
* All columns.
3. Create a new variable, `middle_companies`, with:
* All rows with indices from `Tata Motors` to `Nationwide`, inclusive.
* All columns from `rank` to `country`, inclusive.


In [None]:
toyota = f500.loc["Toyota Motor"]

drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"]]

middle_companies = f500.loc["Tata Motors":"Nationwide", "rank":"country"]

**9. Series vs Dataframes**

![alt text](https://s3.amazonaws.com/dq-content/291/df_series_s_updated.svg)
![alt text](https://s3.amazonaws.com/dq-content/291/df_series_df_updated.svg)

**10. Value Counts Method**

Because series and dataframes are two distinct objects, they have their own unique methods. Let's look at an example of a series method: `Series.value_counts()`, which counts how many times each unique value occurs in the series.

In [None]:
sectors = f500["sector"]
sectors_value_counts = sectors.value_counts()
print(sectors_value_counts)

Financials                       118
Energy                            80
Technology                        44
Motor Vehicles & Parts            34
Wholesalers                       28
Health Care                       27
Food & Drug Stores                20
Transportation                    19
Telecommunications                18
Retailing                         17
Materials                         16
Food, Beverages & Tobacco         16
Industrials                       15
Aerospace & Defense               14
Engineering & Construction        13
Chemicals                          7
Hotels, Restaurants & Leisure      3
Household Products                 3
Business Services                  3
Media                              3
Apparel                            2
Name: sector, dtype: int64


In [None]:
countries = f500['country']
countries_counts = countries.value_counts()
print(countries_counts)

USA             132
China           109
Japan            51
France           29
Germany          29
Britain          24
South Korea      15
Switzerland      14
Netherlands      14
Canada           11
Spain             9
Australia         7
Italy             7
India             7
Brazil            7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Turkey            1
Venezuela         1
Denmark           1
Malaysia          1
Luxembourg        1
Norway            1
Thailand          1
U.A.E             1
Finland           1
Belgium           1
Israel            1
Saudi Arabia      1
Indonesia         1
Name: country, dtype: int64


**11. Selecting Items from a Series by Label**

As with dataframes, we can use `Series.loc[]` to select items from a series using single labels, a list, or a slice object. We can also omit `loc[]` and use bracket shortcuts for all three:

|Select by Label |	Explicit Syntax |	Shorthand Convention |
| --- | --- | --- |
|Single item from series |	`s.loc["item8"]` |	`s["item8"]` |
| List of items from series |	`s.loc[["item1","item7"]]` |	`s[["item1","item7"]]`|
|Slice of items from series |	`s.loc["item2":"item4"]` |	`s["item2":"item4"]` |
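
As a sketch of the three selections in the table, the series below reuses four of the country counts from the output above. The variable names are made up for illustration:

```python
import pandas as pd

# Four of the country counts shown earlier, rebuilt by hand.
counts = pd.Series([132, 109, 51, 29], index=["USA", "China", "Japan", "France"])

single = counts.loc["Japan"]             # a single value
several = counts.loc[["France", "USA"]]  # a series, in list order
sliced = counts["China":"France"]        # a series, both ends included

print(single)
print(list(several.index))
print(list(sliced.index))
```

As with dataframes, the slice follows the order of the labels in the series and includes the end label.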






In [None]:
india = countries_counts["India"]
print(india)

7


In [None]:
north_america = countries_counts[["USA", "Canada", "Mexico"]]
print(north_america)

USA       132
Canada     11
Mexico      2
Name: country, dtype: int64


**12. Summary Challenge**

|Select by Label |	Explicit Syntax |	Common Shorthand|
| --- | --- | ---|
| Single column from dataframe |	`df.loc[:,"col1"]` |	`df["col1"]` |
| List of columns from dataframe |	`df.loc[:,["col1", "col7"]]` |	`df[["col1", "col7"]]` |
| Slice of columns from dataframe|	`df.loc[:,"col1":"col4"]` |	- |
| --- | --- | ---|
| Single row from dataframe |	`df.loc["row4"]` 	| |
| List of rows from dataframe |	`df.loc[["row1", "row8"]]` 	| |
|Slice of rows from dataframe |	`df.loc["row3":"row5"]` |	`df["row3":"row5"]`|
| --- | --- | --- |
|Single item from series |	`s.loc["item8"]` |	`s["item8"]` |
| List of items from series |	`s.loc[["item1","item7"]]` |	`s[["item1","item7"]]`|
|Slice of items from series |	`s.loc["item2":"item4"]` |	`s["item2":"item4"]` |

Instructions:

By selecting data from f500:

1. Create a new variable `big_movers`, with:
* Rows with indices `Aviva`, `HP`, `JD.com`, and `BHP Billiton`, in that order.
* The `rank` and `previous_rank` columns, in that order.
2. Create a new variable, `bottom_companies` with:
* All rows with indices from `National Grid` to `AutoNation`, inclusive.
* The `rank`, `sector`, and `country` columns.

In [None]:
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]

print(big_movers)

              rank  previous_rank
Aviva           90            279
HP             194             48
JD.com         261            366
BHP Billiton   350            168


In [None]:
bottom_companies = f500.loc["National Grid":"AutoNation",["rank", "sector", "country"]]

print(bottom_companies)

                                       rank              sector  country
National Grid                           491              Energy  Britain
Dollar General                          492           Retailing      USA
Telecom Italia                          493  Telecommunications    Italy
Xiamen ITG Holding Group                494         Wholesalers    China
Xinjiang Guanghui Industry Investment   495         Wholesalers    China
Teva Pharmaceutical Industries          496         Health Care   Israel
New China Life Insurance                497          Financials    China
Wm. Morrison Supermarkets               498  Food & Drug Stores  Britain
TUI                                     499   Business Services  Germany
AutoNation                              500           Retailing      USA




---

In the next mission, we'll continue to learn about exploring data in pandas, including:

* How to select data from pandas objects using boolean arrays.
* How to assign data using labels and boolean arrays.
* How to create new rows and columns in pandas.
* New methods to make data analysis easier in pandas.
