# 1. Understanding pandas and NumPy

Although NumPy provides fundamental structures and tools that make working with data easier, there are several things that limit its usefulness:

* The lack of support for column names forces us to frame questions as multi-dimensional array operations.
* Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
* There are lots of low level methods, but there are many common analysis patterns that don't have pre-built methods.

* Pandas is not so much a replacement for NumPy as an extension of it. The underlying code for pandas uses the NumPy library extensively.

**The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:**

* Axis values can have string labels, not just numeric ones.
* Dataframes can contain columns with different data types, including integer, float, and string.
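As a rough illustration of these differences, the sketch below builds the same small data set as a NumPy array and as a dataframe. The three rows are a hand-typed toy sample, not a load of the real file, and the variable names are only placeholders.

```python
import numpy as np
import pandas as pd

# Toy sample typed by hand for illustration; it is not read from f500.csv.
names = ["Walmart", "State Grid", "Toyota Motor"]
ranks = [1, 2, 5]

# NumPy allows one dtype per ndarray, so mixing text and numbers
# converts every element to a string.
arr = np.array([names, ranks])
print(arr.dtype.kind)  # 'U' stands for a unicode string dtype

# A dataframe keeps a separate dtype per column and labels both axes.
df = pd.DataFrame(
    {"rank": ranks, "country": ["USA", "China", "Japan"]},
    index=names,
)
print(df.dtypes)
print(df.loc["Toyota Motor", "country"])
```

The last line looks a value up by its row and column labels, which is the kind of question that would need positional indexing in NumPy.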

# 2. Introduction to the Data

We'll work with a data set from [Fortune](https://fortune.com/) magazine's [2017 Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017).

In [1]:
# import the pandas module
import pandas as pd

# read the f500 dataset, using the first column as the row index
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None  # remove the name of the index
f500_type = type(f500)
f500_shape = f500.shape

# 3. Introducing DataFrames

* To view the first few rows of our dataframe, we can use the `DataFrame.head(no_of_rows)` method.
* To view the last few rows of our dataframe, we can use the `DataFrame.tail(no_of_rows)` method.
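The two methods can be tried on a small stand-in frame. The ten-row dataframe below is hypothetical and only exists to make the row counts easy to check.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for f500.
df = pd.DataFrame(
    {"rank": range(1, 11)},
    index=[f"company_{i}" for i in range(1, 11)],
)

first_three = df.head(3)   # the first three rows
last_two = df.tail(2)      # the last two rows
default_head = df.head()   # with no argument, five rows are returned

print(first_three)
print(last_two)
```

Both methods return a new dataframe, so the result can be assigned to a variable as in the cell that follows.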

In [2]:
# store the first six rows and the last seven rows
f500_head=f500.head(6)
f500_tail=f500.tail(7)

In [3]:
f500_head

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753


In [4]:
f500_tail

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Xiamen ITG Holding Group,494,21930,34.3,35.6,12161,-25.1,Xu Xiaoxi,Trading,Wholesalers,0,China,"Xiamen, China",http://www.itgholding.com.cn,1,18454,1066
Xinjiang Guanghui Industry Investment,495,21919,31.1,251.8,31957,49.9,Shang Jiqiang,Trading,Wholesalers,0,China,"Urumqi, China",http://www.guanghui.com,1,65616,4563
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


# 4. Introducing DataFrames Continued

* Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.

* We can use the `DataFrame.dtypes` attribute (similar to NumPy's `ndarray.dtype` attribute) to return the type of each column.
* Pandas uses NumPy dtypes for numeric columns, including `int64` and `float64`.
* There is also a type we haven't seen before, `object`, which is used for columns whose data doesn't fit any other dtype. This is almost always used for columns containing string values.
* When we import data, pandas will attempt to guess the correct dtype for each column.
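A minimal sketch of this type guessing, assuming a tiny CSV held in a string rather than the real `f500.csv`. The two rows reuse values from the output above, and `csv_text` is just an illustrative name.

```python
import io

import pandas as pd

# Small CSV text standing in for a file on disk.
csv_text = """company,rank,profit_change,country
Walmart,1,-7.2,USA
Volkswagen,6,,Germany
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# rank holds whole numbers, so pandas picks an integer dtype.
# profit_change has decimals and a missing value, so it becomes a float
# column with NaN in the empty cell.
# country holds text; it is commonly reported as object, although newer
# pandas versions may show a dedicated string dtype instead.
print(df.dtypes)
```

A missing value is also what would push an otherwise whole-number column to a float dtype, because NaN is a floating-point value.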

In [5]:
f500.dtypes

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object

* If we wanted an overview of all the dtypes used in our dataframe, along with its shape and other information, we could use the DataFrame.info() method. 
* Note that DataFrame.info() prints the information, rather than returning it, so we can't assign it to a variable.
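The point about printing versus returning can be checked on a toy frame. The sketch below also shows one possible way to keep the text, by passing a buffer through the `buf` parameter of `DataFrame.info()`. The frame and variable names are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"rank": [1, 2], "country": ["USA", "China"]})

# info() prints its summary; the value it returns is None.
returned = df.info()
print(returned)

# To keep the text, direct the output into a buffer instead of the screen.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
```

`summary_text` is then an ordinary string that can be searched or written to a file.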

In [6]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 66.4+ KB


# 5. Selecting a Column From a DataFrame by Label

* Because our axes in pandas have labels, we can select data using those labels, unlike in NumPy, where we needed to know the exact index location.
* To do this, we can use the DataFrame.loc[] attribute. The syntax for DataFrame.loc[] is:

`df.loc[row_label, column_label]`
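Here is a short sketch of that syntax on a three-row toy frame; the values are copied from the output shown earlier, and the frame is built by hand rather than read from the file.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": [1, 2, 5], "country": ["USA", "China", "Japan"]},
    index=["Walmart", "State Grid", "Toyota Motor"],
)

# One row label and one column label give a single value.
one_value = df.loc["State Grid", "country"]

# A bare colon in the row position means "every row".
one_column = df.loc[:, "rank"]

print(one_value)
print(one_column)
```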

## TODO:
* Select the industry column. Assign the result to the variable name industries.
* Use Python's type() function to assign the type of industries to industries_type.

In [7]:
industries = f500.loc[:, 'industry']
industries_type = type(industries)
industries_type

pandas.core.series.Series

# 6. Introduction to Series

**Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.**

* In fact, you can think of a dataframe as a collection of series objects, which is similar to how pandas stores the data behind the scenes.
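One way to picture the "collection of series" idea is to assemble a dataframe from two series that share an index. This is only a sketch with hand-typed values, not a description of pandas internals.

```python
import pandas as pd

index = ["Walmart", "State Grid"]

# Two one-dimensional objects that share the same labels.
rank = pd.Series([1, 2], index=index)
country = pd.Series(["USA", "China"], index=index)

# Combining them column by column gives a two-dimensional object.
df = pd.DataFrame({"rank": rank, "country": country})

# Pulling one column back out gives a series again.
col = df["country"]

print(type(df).__name__, df.ndim)
print(type(col).__name__, col.ndim)
```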

# 7. Selecting Columns From a DataFrame by Label Continued

* To select specific columns, we pass a list of labels.


<block><pre>

A summary of the techniques we've learned so far is below:

Select by Label      Explicit Syntax                 Common Shorthand
Single column        df.loc[:,"col1"]                df["col1"]
List of columns      df.loc[:,["col1", "col7"]]      df[["col1", "col7"]]
Slice of columns     df.loc[:,"col1":"col4"]         -
<block></pre>
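The three column selections in the summary can be sketched on a two-row toy frame. Its values are copied from the output shown earlier, and the frame is typed by hand.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2],
        "revenues": [485873, 315199],
        "ceo": ["C. Douglas McMillon", "Kou Wei"],
        "sector": ["Retailing", "Energy"],
    },
    index=["Walmart", "State Grid"],
)

# Single column: the result is a series.
single = df["rank"]

# List of columns: the order of the list sets the order of the result.
several = df[["revenues", "rank"]]

# Slice of columns: unlike positional slices, a label slice includes its end label.
sliced = df.loc[:, "revenues":"sector"]

print(list(several.columns))
print(list(sliced.columns))
```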

## TODO:
* Select the country column. Assign the result to the variable name countries.
* In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.
* In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.

In [8]:
countries=f500['country']
revenues_years=f500[['revenues','years_on_global_500_list']]
ceo_to_sector=f500.loc[:,'ceo':'sector']

In [9]:
countries

Walmart                                                 USA
State Grid                                            China
Sinopec Group                                         China
China National Petroleum                              China
Toyota Motor                                          Japan
Volkswagen                                          Germany
Royal Dutch Shell                               Netherlands
Berkshire Hathaway                                      USA
Apple                                                   USA
Exxon Mobil                                             USA
McKesson                                                USA
BP                                                  Britain
UnitedHealth Group                                      USA
CVS Health                                              USA
Samsung Electronics                             South Korea
Glencore                                        Switzerland
Daimler                                 

In [10]:
revenues_years

Unnamed: 0,revenues,years_on_global_500_list
Walmart,485873,23
State Grid,315199,17
Sinopec Group,267518,19
China National Petroleum,262573,17
Toyota Motor,254694,23
Volkswagen,240264,23
Royal Dutch Shell,240033,23
Berkshire Hathaway,223604,21
Apple,215639,15
Exxon Mobil,205004,23


In [11]:
ceo_to_sector

Unnamed: 0,ceo,industry,sector
Walmart,C. Douglas McMillon,General Merchandisers,Retailing
State Grid,Kou Wei,Utilities,Energy
Sinopec Group,Wang Yupu,Petroleum Refining,Energy
China National Petroleum,Zhang Jianhua,Petroleum Refining,Energy
Toyota Motor,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts
Volkswagen,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts
Royal Dutch Shell,Ben van Beurden,Petroleum Refining,Energy
Berkshire Hathaway,Warren E. Buffett,Insurance: Property and Casualty (Stock),Financials
Apple,Timothy D. Cook,"Computers, Office Equipment",Technology
Exxon Mobil,Darren W. Woods,Petroleum Refining,Energy


# 8. Selecting Rows From a DataFrame by Label

`df.loc[row_label, column_label]`
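Applied to rows, the same syntax might look like the sketch below. The four-row frame is a hand-typed sample whose values are copied from the output shown earlier.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": [1, 2, 3, 5], "country": ["USA", "China", "China", "Japan"]},
    index=["Walmart", "State Grid", "Sinopec Group", "Toyota Motor"],
)

# Single row: the column part can be left out, and the result is a series.
single_row = df.loc["Toyota Motor"]

# List of rows: returned in the order the labels are listed.
row_list = df.loc[["Toyota Motor", "Walmart"]]

# Slice of rows: both the start and the end label are included.
row_slice = df.loc["State Grid":"Toyota Motor"]

print(single_row)
print(row_list)
print(row_slice)
```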

## TODO:
By selecting data from f500:
* Create a new variable toyota, with:
  * Just the row with index Toyota Motor.
  * All columns.
* Create a new variable, drink_companies, with:
  * Rows with indices Anheuser-Busch InBev, Coca-Cola, and Heineken Holding, in that order.
  * All columns.
* Create a new variable, middle_companies with:
  * All rows with indices from Tata Motors to Nationwide, inclusive.
  * All columns from rank to country, inclusive.

In [12]:
# select a single row
toyota=f500.loc['Toyota Motor',:]

#select list of rows
drink_companies=f500.loc[['Anheuser-Busch InBev','Coca-Cola','Heineken Holding']]

# select slice of rows
middle_companies=f500.loc['Tata Motors':'Nationwide','rank':'country']

In [13]:
toyota

rank                                                   5
revenues                                          254694
revenue_change                                       7.7
profits                                          16899.3
assets                                            437575
profit_change                                      -12.3
ceo                                          Akio Toyoda
industry                        Motor Vehicles and Parts
sector                            Motor Vehicles & Parts
previous_rank                                          8
country                                            Japan
hq_location                                Toyota, Japan
website                     http://www.toyota-global.com
years_on_global_500_list                              23
employees                                         364445
total_stockholder_equity                          157210
Name: Toyota Motor, dtype: object

In [14]:
drink_companies

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Anheuser-Busch InBev,206,45905,5.3,1241.0,258381,-85.0,Carlos Brito,Beverages,"Food, Beverages & Tobacco",211,Belgium,"Leuven, Belgium",http://www.ab-inbev.com,12,206633,71339
Coca-Cola,235,41863,-5.5,6527.0,87270,-11.2,James B. Quincey,Beverages,"Food, Beverages & Tobacco",206,USA,"Atlanta, GA",http://www.coca-colacompany.com,23,100300,23062
Heineken Holding,468,23044,-0.7,861.5,41469,-18.9,Jean-Francois van Boxmeer,Beverages,"Food, Beverages & Tobacco",459,Netherlands,"Amsterdam, Netherlands",http://www.theheinekencompany.com,11,73525,6958


In [15]:
middle_companies

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country
Tata Motors,247,40329,-4.2,1111.6,42162,-34.0,Guenter Butschek,Motor Vehicles and Parts,Motor Vehicles & Parts,226,India
Aluminum Corp. of China,248,40278,6.0,-282.5,75089,,Yu Dehui,Metals,Materials,262,China
Mitsui,249,40275,1.6,2825.3,103231,,Tatsuo Yasunaga,Trading,Wholesalers,245,Japan
Manulife Financial,250,40238,49.4,2209.7,537461,28.9,Donald A. Guloien,"Insurance: Life, Health (stock)",Financials,394,Canada
China Minsheng Banking,251,40234,-5.2,7201.6,848389,-1.8,Zheng Wanchun,Banks: Commercial and Savings,Financials,221,China
China Pacific Insurance (Group),252,40193,2.2,1814.9,146873,-35.7,Huo Lianhong,"Insurance: Life, Health (stock)",Financials,251,China
American Airlines Group,253,40180,-2.0,2676.0,51274,-64.8,W. Douglas Parker,Airlines,Transportation,236,USA
Nationwide,254,40074,-0.4,334.3,197790,-42.4,Stephen S. Rasmussen,Insurance: Property and Casualty (Mutual),Financials,241,USA


# 9. Series vs Dataframes

<block><pre>
                          Column                           Row
 1. Select single         df['col']                        df.loc['row']
 2. Select list of        df[['col1','col2','col3']]       df.loc[['row1','row2','row3']]
 3. Select slice of       df.loc[:,'col1':'col5']          df['row1':'row3']

<block></pre>
* Selecting a single column or a single row returns a Series; selecting a list or a slice of columns or rows returns a DataFrame.
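The sketch below checks the returned types on a toy frame. It also illustrates a common stumbling block: plain square brackets treat a single label as a column name and a slice as a row selection. The frame and the `lookup_failed` flag are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": [1, 2, 5], "country": ["USA", "China", "Japan"]},
    index=["Walmart", "State Grid", "Toyota Motor"],
)

# One column or one row: a series.
print(type(df["rank"]).__name__)
print(type(df.loc["Walmart"]).__name__)

# A list or a slice: a dataframe, even when only one column is listed.
print(type(df[["rank"]]).__name__)
row_slice = df["State Grid":"Toyota Motor"]   # slice shorthand works on rows
print(type(row_slice).__name__)

# A single row label inside plain brackets is looked up among the columns,
# so it raises KeyError here; df.loc["Walmart"] is the row version.
try:
    df["Walmart"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```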

# 10. Value Counts Method

* Because series and dataframes are two distinct objects, each has its own methods.

* One example is the `Series.value_counts()` method, which returns each unique non-null value in a column along with its count, sorted from the most frequent value to the least frequent.
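As a small sketch of this behaviour, the made-up series below contains one missing entry, so the effect of ignoring nulls is visible. The `dropna=False` call is an optional extra that keeps the missing entry in the tally.

```python
import pandas as pd

# Made-up values; None marks a missing entry.
countries = pd.Series(["USA", "China", "USA", "Japan", "China", "USA", None])

# Counts of each unique non-null value, most frequent first.
counts = countries.value_counts()
print(counts)

# Passing dropna=False also counts the missing entry.
with_missing = countries.value_counts(dropna=False)
print(with_missing)
```

The result is itself a series whose index holds the unique values, which is why it can be queried by label afterwards.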

## TODO:
* Select the country column in the f500 dataframe. Assign it to a variable named countries.
* Use the Series.value_counts() method to return the value counts for countries. Assign the results to country_counts.

In [16]:
countries=f500['country']
country_counts=countries.value_counts()

In [17]:
country_counts

USA             132
China           109
Japan            51
Germany          29
France           29
Britain          24
South Korea      15
Switzerland      14
Netherlands      14
Canada           11
Spain             9
Italy             7
Brazil            7
Australia         7
India             7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Venezuela         1
Thailand          1
Turkey            1
Saudi Arabia      1
Norway            1
U.A.E             1
Finland           1
Belgium           1
Denmark           1
Malaysia          1
Luxembourg        1
Israel            1
Indonesia         1
Name: country, dtype: int64

# 11. Selecting Items from a Series by Label

<block><pre>

Select by Label                 Explicit Syntax               Shorthand Convention
Single item from series         s.loc["item8"]                s["item8"]
List of items from series       s.loc[["item1","item7"]]      s[["item1","item7"]]
Slice of items from series      s.loc["item2":"item4"]        s["item2":"item4"]

<block></pre>
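The same three patterns can be tried on a short series. The four counts below are copied from the value counts output above, and the series is typed by hand rather than computed.

```python
import pandas as pd

s = pd.Series([132, 109, 51, 7], index=["USA", "China", "Japan", "India"])

# Single item: a plain value.
single = s.loc["India"]

# List of items: a series, in the order the labels are listed.
pair = s[["India", "USA"]]

# Slice of items: both end labels are included.
run = s.loc["China":"India"]

print(single)
print(pair)
print(run)
```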

## TODO
From the pandas series country_counts:
* Select the item at index label India. Assign the result to the variable name india.
* In order, select the items with index labels USA, Canada, and Mexico. Assign the result to the variable name north_america.

In [18]:
india=country_counts['India']

In [19]:
north_america=country_counts[['USA','Canada','Mexico']]

In [20]:
india

7

In [21]:
north_america

USA       132
Canada     11
Mexico      2
Name: country, dtype: int64

# 12. Summary Challenge

<block><pre>
Select by Label                    Explicit Syntax                Shorthand Convention
Single column from dataframe       df.loc[:,"col1"]               df["col1"]
List of columns from dataframe     df.loc[:,["col1","col7"]]      df[["col1","col7"]]
Slice of columns from dataframe    df.loc[:,"col1":"col4"]        -
Single row from dataframe          df.loc["row4"]                 -
List of rows from dataframe        df.loc[["row1", "row8"]]       -
Slice of rows from dataframe       df.loc["row3":"row5"]          df["row3":"row5"]
Single item from series            s.loc["item8"]                 s["item8"]
List of items from series          s.loc[["item1","item7"]]       s[["item1","item7"]]
Slice of items from series         s.loc["item2":"item4"]         s["item2":"item4"]


<block></pre>
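Row and column selectors from the table can be combined in a single `df.loc[]` call. The sketch below does this on a three-row toy frame whose values are copied from the bottom of the list; it is typed by hand rather than read from the file.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [491, 492, 493],
        "sector": ["Energy", "Retailing", "Telecommunications"],
        "country": ["Britain", "USA", "Italy"],
    },
    index=["National Grid", "Dollar General", "Telecom Italia"],
)

# A list of rows with a list of columns: both come back in the listed order.
subset = df.loc[["Telecom Italia", "National Grid"], ["country", "rank"]]

# A slice of rows (from a label to the end) with a list of columns.
tail_rows = df.loc["Dollar General":, ["rank", "sector"]]

print(subset)
print(tail_rows)
```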

## TODO
By selecting data from f500:

* Create a new variable big_movers, with:
  * Rows with indices Aviva, HP, JD.com, and BHP Billiton, in that order.
  * The rank and previous_rank columns, in that order.
* Create a new variable, bottom_companies with:
   * All rows with indices from National Grid to AutoNation, inclusive.
   * The rank, sector, and country columns.

In [22]:
big_movers=f500.loc[['Aviva','HP','JD.com','BHP Billiton'],['rank','previous_rank']]
bottom_companies=f500.loc['National Grid':'AutoNation',['rank','sector','country']]

In [23]:
big_movers

Unnamed: 0,rank,previous_rank
Aviva,90,279
HP,194,48
JD.com,261,366
BHP Billiton,350,168


In [24]:
bottom_companies

Unnamed: 0,rank,sector,country
National Grid,491,Energy,Britain
Dollar General,492,Retailing,USA
Telecom Italia,493,Telecommunications,Italy
Xiamen ITG Holding Group,494,Wholesalers,China
Xinjiang Guanghui Industry Investment,495,Wholesalers,China
Teva Pharmaceutical Industries,496,Health Care,Israel
New China Life Insurance,497,Financials,China
Wm. Morrison Supermarkets,498,Food & Drug Stores,Britain
TUI,499,Business Services,Germany
AutoNation,500,Retailing,USA


**In this mission, we learned:**

* How pandas and NumPy combine to make working with data easier.
* About the two core pandas types: series and dataframes.
* How to select data from pandas objects using axis labels.