# Pandas basics
---

`NumPy` provides fundamental structures and tools that make working with data easier, but several things limit its usefulness as a standalone tool:

* The lack of support for column names forces us to frame the questions we want to answer as multi-dimensional array operations.
* Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
* NumPy offers many low-level methods, but many common analysis patterns have no pre-built method.

The `pandas` library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an extension of NumPy.
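
To make the first two limitations concrete, here is a minimal sketch on a toy dataset. The two companies and their revenues are taken from the f500 data used below; the variable names `mixed` and `df` are just illustrative.

```python
import numpy as np
import pandas as pd

# Mixing text and numbers in one ndarray forces a single dtype for every element.
mixed = np.array([["Walmart", 485873], ["State Grid", 315199]])
print(mixed.dtype.kind)  # 'U': every value, including the numbers, is now a string

# A DataFrame keeps one dtype per column and lets us refer to rows and columns by label.
df = pd.DataFrame(
    {"revenues": [485873, 315199], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)
print(df.dtypes)
print(df.loc["Walmart", "revenues"])  # 485873
```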

In [1]:
import pandas as pd

We'll use pandas to inspect the `f500.csv` dataset. The [`pandas.read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function reads a CSV file into a dataframe, and dataframes have a `.shape` attribute, just like NumPy arrays.

In [2]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
f500_type = type(f500)
f500_shape = f500.shape

In [3]:
print(f500_type, f500_shape)

<class 'pandas.core.frame.DataFrame'> (500, 16)


[`pandas.DataFrame()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) objects, or just dataframes, are the primary pandas data structure (the second one is the series). Dataframes have two dimensions: rows and columns.

We can use the [`DataFrame.dtypes`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute (similar to NumPy's [`ndarray.dtype`](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype) attribute) to see the type of each column. Let's see what it returns for the dataframe we just loaded:

In [4]:
print(f500.dtypes)

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object


A few handy methods we can use to get some high-level information about our dataframe:

* To view the first few rows of our dataframe, we can use the [`DataFrame.head()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which returns the first five rows by default. It also accepts an optional integer parameter that specifies the number of rows, so `f500.head(10)` returns the first 10 rows of our f500 dataframe.
* Similarly, the [`DataFrame.tail()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) method shows us the last rows of our dataframe. It accepts an optional integer parameter to specify the number of rows, defaulting to five.
* To get an overview of all the dtypes used in our dataframe, along with its shape and some extra information, we can use the [`DataFrame.info()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) method. Note that `DataFrame.info()` prints the information rather than returning it, so assigning its result to a variable only stores `None`.
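
The difference between returning and printing can be checked on a small toy dataframe. This is only a sketch: the data is made up for illustration, and the `buf` parameter of `info()` is used here to capture the report in a string instead of sending it to the screen.

```python
import io
import pandas as pd

df = pd.DataFrame({"rank": [1, 2, 3], "country": ["USA", "China", "China"]})

first_two = df.head(2)  # a new dataframe holding the first two rows
last_one = df.tail(1)   # a new dataframe holding the last row

# info() writes its report to a text stream (the screen by default) and returns None.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()  # the text that would otherwise have been printed

print(result)  # None
print(report)
```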

Let's practice using these three new methods:

In [5]:
f500_head = f500.head(6)
f500_tail = f500.tail(8)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 66.4+ KB


In [6]:
f500_head

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753


In [7]:
f500_tail

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
Xiamen ITG Holding Group,494,21930,34.3,35.6,12161,-25.1,Xu Xiaoxi,Trading,Wholesalers,0,China,"Xiamen, China",http://www.itgholding.com.cn,1,18454,1066
Xinjiang Guanghui Industry Investment,495,21919,31.1,251.8,31957,49.9,Shang Jiqiang,Trading,Wholesalers,0,China,"Urumqi, China",http://www.guanghui.com,1,65616,4563
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


Because the axes in pandas have labels, we can select data using those labels, unlike in NumPy, where we needed to know the exact index location. To do this, we use [`DataFrame.loc[]`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc), which is used with square brackets rather than called like a regular method.

The syntax for `DataFrame.loc[]` is:

```python
df.loc[row, column]
```

Where row and column refer to row and column labels respectively, and can be one of:

* A single label
* A list or array of labels
* A slice object with labels
* A boolean array

Here's an example of using this method:

In [8]:
industries = f500.loc[:,"industry"]
previous = f500.loc[:,["rank","previous_rank","years_on_global_500_list"]]
financial_data = f500.loc[:,"revenues":"profit_change"]

In [9]:
industries.head()

Walmart                        General Merchandisers
State Grid                                 Utilities
Sinopec Group                     Petroleum Refining
China National Petroleum          Petroleum Refining
Toyota Motor                Motor Vehicles and Parts
Name: industry, dtype: object

In [10]:
previous.head()

Unnamed: 0,rank,previous_rank,years_on_global_500_list
Walmart,1,1,23
State Grid,2,2,17
Sinopec Group,3,4,19
China National Petroleum,4,3,17
Toyota Motor,5,8,23


In [11]:
financial_data.head()

Unnamed: 0,revenues,revenue_change,profits,assets,profit_change
Walmart,485873,0.8,13643.0,198825,-7.2
State Grid,315199,-4.4,9571.3,489838,-6.2
Sinopec Group,267518,-9.1,1257.9,310726,-65.0
China National Petroleum,262573,-12.3,1867.5,585619,-73.7
Toyota Motor,254694,7.7,16899.3,437575,-12.3
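
One detail worth checking for ourselves is that slices of labels include both endpoints, which is why `financial_data` contains the `profit_change` column. The sketch below reproduces this on a toy dataframe (values copied from the first three rows above) and also shows the boolean-array form of selection.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "profits": [13643.0, 9571.3, 1257.9],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Label slices include BOTH endpoints, unlike positional slices in Python and NumPy.
by_label = df.loc["Walmart":"State Grid", "revenues":"profits"]
print(by_label.shape)  # (2, 2)

# A boolean array (here a boolean series) selects the rows where it is True.
in_china = df.loc[df["country"] == "China", "rank"]
print(list(in_china.index))  # ['State Grid', 'Sinopec Group']
```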


Instead of `df.loc[:,"col1"]` you can use `df["col1"]` to select columns. This works for single columns and lists of columns but not for column slices.

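Instead of `df.loc[:,"col1"]` you can use `df.col1`. This shortcut does not work for labels that contain spaces or special characters, nor for labels that clash with an existing dataframe attribute or method (for example, `f500.rank` is the `DataFrame.rank()` method, not our `rank` column).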

Here are some examples:

In [12]:
countries = f500["country"]
revenues_years = f500[["revenues", "years_on_global_500_list"]]
ceo_to_sector = f500.loc[:,"ceo":"sector"]
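
The sketch below compares the three ways of selecting a single column on a toy dataframe. The column name `hq location` (with a space) is a made-up variant used only to show where the dot shortcut stops working.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2],
        "country": ["USA", "China"],
        "hq location": ["Bentonville, AR", "Beijing, China"],  # hypothetical label with a space
    },
    index=["Walmart", "State Grid"],
)

# All three spellings return the same series for a simple column name.
same = df.loc[:, "country"].equals(df["country"]) and df["country"].equals(df.country)
print(same)  # True

# `df.hq location` is not valid Python syntax, so we need the bracket form.
hq = df["hq location"]

# `rank` is also the name of a DataFrame method, so the dot form returns that method.
print(callable(df.rank))  # True
ranks = df["rank"]        # the bracket form always returns the column
```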

**Series** is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.

A dataframe can be seen as a collection of series objects, which is similar to how pandas stores the data behind the scenes. Any single row or single column is a series; if its values have mixed types, the series gets the `object` dtype. Series objects also have an index, so we can use `.loc[]` with them as well (though plain square brackets are often more convenient). Here's an example:

In [13]:
ceos = f500["ceo"]
walmart = ceos["Walmart"]
apple_to_samsung = ceos["Apple":"Samsung Electronics"]
oil_companies = ceos[["Exxon Mobil", "BP", "Chevron"]]

Let's see the type of a row and a column from our dataset:

In [22]:
a_row = f500.loc["Walmart"]
a_column = f500.loc[:, "revenues"]
print(type(a_row))
print(type(a_column))
print(a_row)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
rank                                             1
revenues                                    485873
revenue_change                                 0.8
profits                                      13643
assets                                      198825
profit_change                                 -7.2
ceo                            C. Douglas McMillon
industry                     General Merchandisers
sector                                   Retailing
previous_rank                                    1
country                                        USA
hq_location                        Bentonville, AR
website                     http://www.walmart.com
years_on_global_500_list                        23
employees                                  2300000
total_stockholder_equity                     77798
Name: Walmart, dtype: object


We can see that `a_row` is a series object with the dtype `object`. Because the row mixes integers, floats, and strings, pandas falls back to the generic `object` dtype so that the series has a single dtype; the individual values are stored as Python objects and are not converted to strings. Let's use all of these techniques to extract some rows, columns, and combinations of them:

In [23]:
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"], :]
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]
middle_companies = f500.loc["Tata Motors":"Nationwide", "rank":"country"]
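
To see what a row with the `object` dtype actually holds, we can inspect its values on a toy dataframe (two rows copied from the data above). This is only a sketch; the exact numeric types we get back may differ between pandas versions, so the code just checks that numbers still behave like numbers.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2],
        "revenue_change": [0.8, -4.4],
        "ceo": ["C. Douglas McMillon", "Kou Wei"],
    },
    index=["Walmart", "State Grid"],
)

row = df.loc["Walmart"]
print(row.dtype)  # object

# The numeric values are still numbers, so arithmetic works on them.
next_rank = row["rank"] + 1
doubled_change = row["revenue_change"] * 2

# Only the value that was text to begin with is a string.
print(isinstance(row["rank"], str))  # False
print(isinstance(row["ceo"], str))   # True
```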

How can we quickly analyze a dataset? First, we can use the [`Series.describe()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html) method, which returns descriptive statistics for the data in a pandas series. Let's look at an example:

In [24]:
revs = f500["revenues"]
revs.describe()

count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenues, dtype: float64

The method tells us how many non-null values are contained in the series, the mean and [standard deviation](https://www.quora.com/What-is-standard-deviation-1), along with the minimum, maximum and [quartile](https://en.wikipedia.org/wiki/Quartile) values.

We can also skip the intermediate variable and call `.describe()` directly on the column we select with `.loc[]` (or just square brackets):

```python
f500["revenues"].describe()
```

In [26]:
f500["assets"].describe()

count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64

Because the values in this column are too large to display neatly, pandas shows them in **E-notation**, a form of scientific notation.

In the pattern `N.nnnnnne+PP`:

* `N` is the digit before the decimal point
* `nnnnnn` are the digits after the decimal point
* `e` stands for *times ten to the power of*
* `+PP` is the exponent (the *power* of ten)

So `2.436323e+05` is the same as `2.436323 * 10 ** 5`.
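
Plain Python understands the same notation, so we can verify the conversion in both directions. A small sketch using the mean and maximum of the `assets` column shown above:

```python
import math

value = float("2.436323e+05")  # Python parses E-notation directly
print(value)                   # 243632.3

# Compare with the explicit power-of-ten form (isclose guards against rounding noise).
print(math.isclose(value, 2.436323 * 10 ** 5))  # True

# Going the other way: format a number in E-notation.
as_e_notation = format(243632.3, "e")
print(as_e_notation)           # 2.436323e+05
print(format(3473238, ".6e"))  # 3.473238e+06
```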

We can also use the `.describe()` method on a series that contains strings, but the result will be different:

In [27]:
f500["country"].describe()

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object

DataFrame objects also have a [`DataFrame.describe()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method that returns these same statistics column by column. By default, `DataFrame.describe()` returns statistics only for numeric columns, so we need to ask explicitly if we want the non-numeric ones. To get just the object columns, we pass the `include=['O']` parameter to the dataframe version of describe:

In [28]:
f500.describe(include=["O"])

Unnamed: 0,ceo,industry,sector,country,hq_location,website
count,500,500,500,500,500,500
unique,500,58,21,34,235,500
top,Zhang Zongyan,Banks: Commercial and Savings,Financials,USA,"Beijing, China",http://www.lenovo.com
freq,1,51,118,132,56,1


In [29]:
f500.describe()

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,previous_rank,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500.0,500.0,500.0,500.0
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,222.134,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,146.941961,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,0.0,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,92.75,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,219.5,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,347.25,23.0,168917.2,37828.5
max,500.0,485873.0,442.3,45687.0,3473238.0,8909.5,500.0,23.0,2300000.0,301893.0


In [30]:
f500.describe(include="all")

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500,500,500,500.0,500,500,500,500.0,500.0,500.0
unique,,,,,,,500,58,21,,34,235,500,,,
top,,,,,,,Zhang Zongyan,Banks: Commercial and Savings,Financials,,USA,"Beijing, China",http://www.lenovo.com,,,
freq,,,,,,,1,51,118,,132,56,1,,,
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,,,,222.134,,,,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,,,,146.941961,,,,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,,,,0.0,,,,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,,,,92.75,,,,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,,,,219.5,,,,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,,,,347.25,,,,23.0,168917.2,37828.5
max,500.0,485873.0,442.3,45687.0,3473238.0,8909.5,,,,500.0,,,,23.0,2300000.0,301893.0


One more thing: the `DataFrame.describe()` method returns a DataFrame object, while the `Series.describe()` method returns a Series object.

Here are some examples of using these methods:

In [31]:
profits_desc = f500["profits"].describe()
revenue_and_employees_desc = f500[["revenues", "employees"]].describe()
all_desc = f500.describe(include="all")
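
Because the results of `describe()` are ordinary series and dataframes, we can pull individual statistics out of them by label. A minimal sketch on a toy dataframe (three rows copied from the data above):

```python
import pandas as pd

df = pd.DataFrame(
    {"revenues": [485873, 315199, 267518], "country": ["USA", "China", "China"]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

series_desc = df["revenues"].describe()  # a series indexed by statistic name
frame_desc = df.describe()               # a dataframe: statistics x numeric columns

max_revenue = series_desc["max"]
mean_revenue = frame_desc.loc["mean", "revenues"]

# For a column of strings the statistics are count, unique, top and freq.
text_desc = df["country"].describe()
print(text_desc["top"], text_desc["freq"])  # China 2
```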

Because pandas is built on top of NumPy, pandas objects have many similar methods (like `.max()`, `.min()`, `.mean()`, `.median()`, `.mode()`, `.sum()`, etc.). Vectorized operations work with series as well.
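
As a quick illustration, here is a sketch of a vectorized calculation on two toy series built from the first three rows of the data. The name `margin_pct` is made up for this example.

```python
import pandas as pd

companies = ["Walmart", "State Grid", "Sinopec Group"]
revenues = pd.Series([485873, 315199, 267518], index=companies)
profits = pd.Series([13643.0, 9571.3, 1257.9], index=companies)

# Arithmetic is applied element by element; values are matched up by index label.
margin_pct = profits / revenues * 100

print(revenues.max())  # 485873
print(revenues.sum())  # 1068590
print(margin_pct.round(2))
```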

[`Series.value_counts()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) is a useful method that returns each unique non-null value from a series along with the number of times that value occurs:

In [32]:
top5_countries = f500["country"].value_counts().head()
top5_previous_rank = f500["previous_rank"].value_counts().head()
max_f500 = f500.max()
similar_max_f500 = f500.max(axis=0, numeric_only=True)  # like max_f500, but limited to the numeric columns

Assigning values in pandas works almost the same as in NumPy. Here are some practical examples:

In [33]:
f500.loc[:, "revenues_b"] = f500.loc[:, "revenues"] / 1000 # adding a new column
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"

In [37]:
kr_bool = f500["country"] == "South Korea"  # creates a boolean Series object
top_5_kr = f500[kr_bool].head()  # first five South Korean companies (the rows are ordered by rank)
top_5_kr_ranks = f500.loc[kr_bool, "rank"].head()  # ranks of these companies
top_5_kr_ranks = top_5_kr["rank"]  # same result, reusing the already filtered dataframe
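
The same boolean-mask pattern can be tried end to end on a toy dataframe. This sketch uses three rows copied from the data above and filters on Japan instead of South Korea, since no South Korean company appears in the rows shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 5],
        "country": ["USA", "China", "Japan"],
        "revenues": [485873, 315199, 254694],
    },
    index=["Walmart", "State Grid", "Toyota Motor"],
)

is_japan = df["country"] == "Japan"  # boolean series aligned with df's index
print(is_japan.tolist())             # [False, False, True]

japan_rows = df[is_japan]               # all columns, rows where the mask is True
japan_ranks = df.loc[is_japan, "rank"]  # same rows, a single column
print(japan_ranks)
```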

By the way, remember `f500_tail`? Let's take another look at its `previous_rank` column, because some of its values are problematic:

In [39]:
f500_tail

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
Xiamen ITG Holding Group,494,21930,34.3,35.6,12161,-25.1,Xu Xiaoxi,Trading,Wholesalers,0,China,"Xiamen, China",http://www.itgholding.com.cn,1,18454,1066
Xinjiang Guanghui Industry Investment,495,21919,31.1,251.8,31957,49.9,Shang Jiqiang,Trading,Wholesalers,0,China,"Urumqi, China",http://www.guanghui.com,1,65616,4563
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


We can see that some of the companies have 0 as their previous rank, which probably means that they were not on the previous year's list. These zeros would distort any statistics we calculate from the column, so it is wiser to replace them with missing values (`None` or `np.nan`).

In [40]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
# Select the column as well as the rows, or every column in those rows would be overwritten!
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

In [41]:
prev_rank_after  # the ranks are now floats, because NaN is a float value and the column was upcast

NaN       33
 471.0     1
 234.0     1
 125.0     1
 166.0     1
Name: previous_rank, dtype: int64

In [42]:
prev_rank_before

0      33
159     1
147     1
148     1
149     1
Name: previous_rank, dtype: int64
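
To see why the zeros matter, we can compare the mean before and after the replacement on a tiny made-up series. This is a sketch rather than the exact steps above: the explicit `astype(float)` is a choice made here so that the change of dtype is visible instead of happening implicitly.

```python
import numpy as np
import pandas as pd

# Toy data: two of the five companies have the placeholder value 0.
previous_rank = pd.Series([1, 2, 0, 0, 427], index=list("ABCDE"))
print(previous_rank.dtype)  # an integer dtype

mean_with_zeros = previous_rank.mean()  # zeros are counted as real ranks
print(mean_with_zeros)                  # 86.0

# NaN is a float, so we convert to float before marking the zeros as missing.
cleaned = previous_rank.astype(float)
cleaned.loc[cleaned == 0] = np.nan

mean_without_zeros = cleaned.mean()  # NaN values are skipped by default
print(mean_without_zeros)
print(cleaned.value_counts(dropna=False).head())
```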

Now, let's summarize everything we've learned by calculating a different statistic for each of the three most common countries in our f500 dataframe.

First, let's see the three most common countries from the dataframe:

In [43]:
top_3_countries = f500["country"].value_counts().head(3)
top_3_countries

USA      132
China    109
Japan     51
Name: country, dtype: int64

In [44]:
# Counts of the five most common headquarters locations for companies based in the USA
cities_usa = f500.loc[f500["country"] == "USA", "hq_location"].value_counts().head()
# Counts of the three most common sectors for companies based in China
sector_china = f500.loc[f500["country"] == "China", "sector"].value_counts().head(3)
# Mean number of employees (a float) for companies based in Japan
mean_employees_japan = f500.loc[f500["country"] == "Japan", "employees"].mean()
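
Since the first two calculations follow the same pattern (filter by country, pick a column, count values), they could be wrapped in a small function. The helper `most_common()` below is hypothetical, not part of pandas, and the toy dataframe reuses a few rows from the data above.

```python
import pandas as pd


def most_common(df, country, column, n=3):
    """Hypothetical helper: counts of the n most common values of `column` for one country."""
    return df.loc[df["country"] == country, column].value_counts().head(n)


df = pd.DataFrame(
    {
        "country": ["USA", "China", "China", "China", "Japan"],
        "sector": ["Retailing", "Energy", "Energy", "Financials", "Motor Vehicles & Parts"],
        "employees": [2300000, 926067, 713288, 54378, 364445],
    }
)

china_sectors = most_common(df, "China", "sector")
print(china_sectors)

mean_employees_japan = df.loc[df["country"] == "Japan", "employees"].mean()
print(mean_employees_japan)  # 364445.0
```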

That's it for now!