<a href="https://colab.research.google.com/github/MathewLipman/Work-Samples/blob/main/NumPY_and_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NumPy Limitations:

It doesn't support column names, so we must frame questions as multi-dimensional array operations.

It only allows for one data type per ndarray, complicating the handling of mixed numeric and string data.

While there are many low-level methods, some common analysis patterns lack pre-built methods.
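
A minimal sketch of the single-dtype limitation, using illustrative values that are not taken from the dataset: when numbers and a string share one ndarray, NumPy converts everything to a common type.

```python
import numpy as np

# Illustrative values only: two numbers and one string in the same array.
mixed = np.array([1, 2.5, "Walmart"])

# NumPy settles on a single dtype for the whole array, here a Unicode string type.
print(mixed.dtype.kind)  # 'U' means Unicode string
print(mixed)             # the numbers are now stored as text
```

Once the numbers are stored as text, arithmetic on them no longer works without converting them back, which is the kind of friction pandas is meant to remove.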

Pandas:

Pandas is a versatile Python library designed to make data exploration and analysis efficient.

Its name comes from the econometrics term "panel data." With pandas, we can manipulate, clean, and visualize data.

Pandas is not a replacement for NumPy, but rather an extension that builds upon its strengths. Since pandas' underlying code relies heavily on NumPy, our newly acquired skills will be invaluable as we explore this exciting new library.

Pandas DataFrames are pandas' answer to NumPy's 2D ndarrays, with two key differences: axis values can have string labels, not just numeric ones, and a DataFrame can contain columns of different data types, including integer, float, and string.
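
As a small sketch of that difference, the frame below is built by hand from a few values that appear later in this notebook. The variable name sample is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2],                  # integer column
        "revenue_change": [0.8, -4.4],   # float column
        "country": ["USA", "China"],     # string column
    },
    index=["Walmart", "State Grid"],     # string row labels
)

print(sample.dtypes)                         # one dtype per column
print(sample.loc["State Grid", "country"])   # lookup by row and column label
```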

We are importing our Fortune Global 500 dataset, 'f500.csv', which includes the following columns:

company: The company's name.

rank: Global 500 rank for the company.

revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).

revenue_change: Percentage change in revenue between the current and prior fiscal year.

profits: Net income for the fiscal year, in millions of dollars (USD).

ceo: Company's Chief Executive Officer.

industry: The company's industry of operation.

sector: Sector in which the company operates.

previous_rank: Global 500 rank for the company for the prior year.

country: Country of the company's headquarters.


In [1]:
import pandas as pd
f500 = pd.read_csv('/content/drive/MyDrive/Dataset/f500.csv', index_col=0)
f500.index.name = None

The pandas module reads our CSV from its file path. If we check the type of f500 (the variable holding the result of pd.read_csv), we can see it is a pandas.core.frame.DataFrame.

In [3]:
f500_type = type(f500)
print(f500_type)

<class 'pandas.core.frame.DataFrame'>
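
The effect of index_col=0 can be sketched without the real file. The CSV text below is a made-up two-row stand-in, and the header name company is an assumption about how the first column is labelled in f500.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a header row plus two data rows.
csv_text = "company,rank,revenues\nWalmart,1,485873\nState Grid,2,315199\n"

# index_col=0 turns the first CSV column into the row labels.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo.index.name)   # the index keeps the header of that column

# Clearing the name stops it from being displayed above the row labels.
demo.index.name = None
print(demo)
```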


By printing the shape, we can see we have 500 rows and 16 columns.

In [6]:
f500_shape = f500.shape
print(f500_shape)


(500, 16)


Now we print the first 3 rows of our dataset (the column headers are shown separately).

In [10]:
print(f500.head(3))

               rank  revenues  revenue_change  profits  assets  profit_change  \
Walmart           1    485873             0.8  13643.0  198825           -7.2   
State Grid        2    315199            -4.4   9571.3  489838           -6.2   
Sinopec Group     3    267518            -9.1   1257.9  310726          -65.0   

                               ceo               industry     sector  \
Walmart        C. Douglas McMillon  General Merchandisers  Retailing   
State Grid                 Kou Wei              Utilities     Energy   
Sinopec Group            Wang Yupu     Petroleum Refining     Energy   

               previous_rank country      hq_location                 website  \
Walmart                    1     USA  Bentonville, AR  http://www.walmart.com   
State Grid                 2   China   Beijing, China  http://www.sgcc.com.cn   
Sinopec Group              4   China   Beijing, China  http://www.sinopec.com   

               years_on_global_500_list  employees  total_sto

To view rows from the bottom of the dataset, we use tail instead of head:

In [12]:
print(f500.tail(3))

                           rank  revenues  revenue_change  profits  assets  \
Wm. Morrison Supermarkets   498     21741           -11.3    406.4   11630   
TUI                         499     21655            -5.5   1151.7   16247   
AutoNation                  500     21609             3.6    430.5   10060   

                           profit_change                 ceo  \
Wm. Morrison Supermarkets           20.4      David T. Potts   
TUI                                195.5   Friedrich Joussen   
AutoNation                          -2.7  Michael J. Jackson   

                                       industry              sector  \
Wm. Morrison Supermarkets  Food and Drug Stores  Food & Drug Stores   
TUI                             Travel Services   Business Services   
AutoNation                  Specialty Retailers           Retailing   

                           previous_rank  country          hq_location  \
Wm. Morrison Supermarkets            437  Britain    Bradford, Britain 

In [22]:
f500_top_6 = f500.head(6)
f500_bottom_8 = f500.tail(8)
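
A short sketch of how head and tail behave, using a hypothetical ten-row frame named numbers: called without an argument, both return five rows, and neither modifies the original.

```python
import pandas as pd

# Hypothetical frame with ten rows holding the values 0 through 9.
numbers = pd.DataFrame({"value": range(10)})

first_default = numbers.head()    # no argument: first 5 rows
last_three = numbers.tail(3)      # last 3 rows

print(first_default.shape)
print(last_three["value"].tolist())
print(numbers.shape)              # the original frame is unchanged
```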


To learn the type of each column, we can use the DataFrame.dtypes attribute, similar to NumPy's ndarray.dtype attribute.

In [23]:
print(f500_top_6.dtypes)

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object


In [24]:
print(f500_bottom_8.dtypes)

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object


For an overview of all the dtypes used in our dataframe, along with its shape (rows and columns), we can use the DataFrame.info() method. This example uses only the top 6 rows. Note that info() prints its output itself, so we do not need to wrap it in print().

In [26]:
f500_top_6.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Walmart to Volkswagen
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      6 non-null      int64  
 1   revenues                  6 non-null      int64  
 2   revenue_change            6 non-null      float64
 3   profits                   6 non-null      float64
 4   assets                    6 non-null      int64  
 5   profit_change             5 non-null      float64
 6   ceo                       6 non-null      object 
 7   industry                  6 non-null      object 
 8   sector                    6 non-null      object 
 9   previous_rank             6 non-null      int64  
 10  country                   6 non-null      object 
 11  hq_location               6 non-null      object 
 12  website                   6 non-null      object 
 13  years_on_global_500_list  6 non-null      int64  
 14  empl

The same example, now with the full dataset:

In [27]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

In [28]:
float64_dtype = 3
int64_dtype = 7
object_dtype = 6
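
The counts above were tallied by reading the info() output. One way to get the same kind of tally programmatically, shown here on a small hand-built frame (sample is a hypothetical name), is to count the values of the dtypes Series:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2],
        "revenues": [485873, 315199],
        "revenue_change": [0.8, -4.4],
        "country": ["USA", "China"],
    },
    index=["Walmart", "State Grid"],
)

# dtypes is itself a Series, so value_counts() can tally it.
# Converting to str first gives plain text labels such as "int64".
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)
```

The label shown for the string column may differ between pandas versions, so the sketch only relies on the numeric labels.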

Using the DataFrame.loc[] indexer, we can pass a row label, a column label, or both, and get back the matching data. The general form is df.loc[row_label, column_label].

In [32]:

f500_top_6.loc[:, "rank"]

Walmart                     1
State Grid                  2
Sinopec Group               3
China National Petroleum    4
Toyota Motor                5
Volkswagen                  6
Name: rank, dtype: int64

In [35]:
f500_top_6.loc["Walmart", :]

rank                                             1
revenues                                    485873
revenue_change                                 0.8
profits                                    13643.0
assets                                      198825
profit_change                                 -7.2
ceo                            C. Douglas McMillon
industry                     General Merchandisers
sector                                   Retailing
previous_rank                                    1
country                                        USA
hq_location                        Bentonville, AR
website                     http://www.walmart.com
years_on_global_500_list                        23
employees                                  2300000
total_stockholder_equity                     77798
Name: Walmart, dtype: object
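
Passing both a row label and a column label to .loc narrows the result to a single value. A minimal sketch on a hand-built frame (sample is a hypothetical name; the values are copied from the outputs above):

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

one_value = sample.loc["State Grid", "revenues"]   # row label, then column label
one_column = sample.loc[:, "rank"]                 # ":" means every row

print(one_value)
print(type(one_column))
```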

In [41]:
rank_col = f500_top_6["rank"]
print(rank_col)

Walmart                     1
State Grid                  2
Sinopec Group               3
China National Petroleum    4
Toyota Motor                5
Volkswagen                  6
Name: rank, dtype: int64


In [45]:
revenue_col = f500_top_6["revenues"]
print(revenue_col)

Walmart                     485873
State Grid                  315199
Sinopec Group               267518
China National Petroleum    262573
Toyota Motor                254694
Volkswagen                  240264
Name: revenues, dtype: int64


In [46]:
industries = f500_top_6["industry"]

In [47]:
industries_type = type(industries)

A one-dimensional pandas object is a Series; a two-dimensional object is a DataFrame.

In [48]:
print(industries)
print(industries_type)

Walmart                        General Merchandisers
State Grid                                 Utilities
Sinopec Group                     Petroleum Refining
China National Petroleum          Petroleum Refining
Toyota Motor                Motor Vehicles and Parts
Volkswagen                  Motor Vehicles and Parts
Name: industry, dtype: object
<class 'pandas.core.series.Series'>
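
The number of brackets decides which of the two we get back. In this sketch (sample is a hypothetical hand-built frame), a single label returns a Series, while a one-element list of labels returns a DataFrame:

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)

as_series = sample["country"]      # single label -> 1D Series
as_frame = sample[["country"]]     # list of labels -> 2D DataFrame

print(type(as_series).__name__, as_series.ndim)
print(type(as_frame).__name__, as_frame.shape)
```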


Using .loc on our selection (f500_top_6), we can choose both rows and columns. Here we take all rows (only the top 6 from the dataset) and the specific columns we want ("country" and "rank").

In [52]:
f500_top_6.loc[:,["country","rank"]]

Unnamed: 0,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5
Volkswagen,Germany,6


In [53]:
f500_top_6.loc[:,["country","revenues"]]

Unnamed: 0,country,revenues
Walmart,USA,485873
State Grid,China,315199
Sinopec Group,China,267518
China National Petroleum,China,262573
Toyota Motor,Japan,254694
Volkswagen,Germany,240264


In [54]:
f500_top_6[["country","rank"]]

Unnamed: 0,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5
Volkswagen,Germany,6


A label slice lets us select the columns "rank" through "profits", with both endpoints included:

In [59]:
f500_top_6.loc[:,"rank":"profits"]

Unnamed: 0,rank,revenues,revenue_change,profits
Walmart,1,485873,0.8,13643.0
State Grid,2,315199,-4.4,9571.3
Sinopec Group,3,267518,-9.1,1257.9
China National Petroleum,4,262573,-12.3,1867.5
Toyota Motor,5,254694,7.7,16899.3
Volkswagen,6,240264,1.5,5937.3


In [61]:
f500_top_6[["rank","profits"]]

Unnamed: 0,rank,profits
Walmart,1,13643.0
State Grid,2,9571.3
Sinopec Group,3,1257.9
China National Petroleum,4,1867.5
Toyota Motor,5,16899.3
Volkswagen,6,5937.3


In [65]:
countries = f500_top_6[["country"]]
print(countries)

                          country
Walmart                       USA
State Grid                  China
Sinopec Group               China
China National Petroleum    China
Toyota Motor                Japan
Volkswagen                Germany


In [66]:
revenue_years = f500_top_6[["revenues", "years_on_global_500_list"]]
print(revenue_years)

                          revenues  years_on_global_500_list
Walmart                     485873                        23
State Grid                  315199                        17
Sinopec Group               267518                        19
China National Petroleum    262573                        17
Toyota Motor                254694                        23
Volkswagen                  240264                        23


In [70]:
ceo_to_sector = f500_top_6.loc[:,"ceo":"sector"]
print(ceo_to_sector)

                                          ceo                  industry  \
Walmart                   C. Douglas McMillon     General Merchandisers   
State Grid                            Kou Wei                 Utilities   
Sinopec Group                       Wang Yupu        Petroleum Refining   
China National Petroleum        Zhang Jianhua        Petroleum Refining   
Toyota Motor                      Akio Toyoda  Motor Vehicles and Parts   
Volkswagen                    Matthias Muller  Motor Vehicles and Parts   

                                          sector  
Walmart                                Retailing  
State Grid                                Energy  
Sinopec Group                             Energy  
China National Petroleum                  Energy  
Toyota Motor              Motor Vehicles & Parts  
Volkswagen                Motor Vehicles & Parts  


Using a Row Label to Output All Columns

This syntax works because when we omit the column selector, pandas treats it as [:] (all columns).

In [73]:
f500_top_6.loc["Walmart"]

rank                                             1
revenues                                    485873
revenue_change                                 0.8
profits                                    13643.0
assets                                      198825
profit_change                                 -7.2
ceo                            C. Douglas McMillon
industry                     General Merchandisers
sector                                   Retailing
previous_rank                                    1
country                                        USA
hq_location                        Bentonville, AR
website                     http://www.walmart.com
years_on_global_500_list                        23
employees                                  2300000
total_stockholder_equity                     77798
Name: Walmart, dtype: object
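
A quick check that .loc["Walmart"] and .loc["Walmart", :] return the same result, on a hypothetical hand-built frame:

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)

short_form = sample.loc["Walmart"]       # column selector omitted
long_form = sample.loc["Walmart", :]     # column selector spelled out

print(short_form.equals(long_form))
# The row mixes an integer and a string, so the resulting Series uses dtype object.
print(short_form.dtype)
```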

In [77]:
f500_top_6.loc["Toyota Motor", :]

rank                                                   5
revenues                                          254694
revenue_change                                       7.7
profits                                          16899.3
assets                                            437575
profit_change                                      -12.3
ceo                                          Akio Toyoda
industry                        Motor Vehicles and Parts
sector                            Motor Vehicles & Parts
previous_rank                                          8
country                                            Japan
hq_location                                Toyota, Japan
website                     http://www.toyota-global.com
years_on_global_500_list                              23
employees                                         364445
total_stockholder_equity                          157210
Name: Toyota Motor, dtype: object

In [78]:
f500_top_6.loc["Sinopec Group"]

rank                                             3
revenues                                    267518
revenue_change                                -9.1
profits                                     1257.9
assets                                      310726
profit_change                                -65.0
ceo                                      Wang Yupu
industry                        Petroleum Refining
sector                                      Energy
previous_rank                                    4
country                                      China
hq_location                         Beijing, China
website                     http://www.sinopec.com
years_on_global_500_list                        19
employees                                   713288
total_stockholder_equity                    106523
Name: Sinopec Group, dtype: object

In [79]:
single_row = f500_top_6.loc["Sinopec Group"]
print(type(single_row))
print(single_row)

<class 'pandas.core.series.Series'>
rank                                             3
revenues                                    267518
revenue_change                                -9.1
profits                                     1257.9
assets                                      310726
profit_change                                -65.0
ceo                                      Wang Yupu
industry                        Petroleum Refining
sector                                      Energy
previous_rank                                    4
country                                      China
hq_location                         Beijing, China
website                     http://www.sinopec.com
years_on_global_500_list                        19
employees                                   713288
total_stockholder_equity                    106523
Name: Sinopec Group, dtype: object


Selecting specific rows by label, with all columns

In [83]:
list_rows = f500_top_6.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))
print(list_rows)

<class 'pandas.core.frame.DataFrame'>
              rank  revenues  revenue_change  profits  assets  profit_change  \
Toyota Motor     5    254694             7.7  16899.3  437575          -12.3   
Walmart          1    485873             0.8  13643.0  198825           -7.2   

                              ceo                  industry  \
Toyota Motor          Akio Toyoda  Motor Vehicles and Parts   
Walmart       C. Douglas McMillon     General Merchandisers   

                              sector  previous_rank country      hq_location  \
Toyota Motor  Motor Vehicles & Parts              8   Japan    Toyota, Japan   
Walmart                    Retailing              1     USA  Bentonville, AR   

                                   website  years_on_global_500_list  \
Toyota Motor  http://www.toyota-global.com                        23   
Walmart             http://www.walmart.com                        23   

              employees  total_stockholder_equity  
Toyota Motor     3644

Selecting a range of rows, with all columns

In [89]:
slice_rows = f500_top_6["State Grid":"Toyota Motor"]
print(type(slice_rows))
print(slice_rows)

<class 'pandas.core.frame.DataFrame'>
                          rank  revenues  revenue_change  profits  assets  \
State Grid                   2    315199            -4.4   9571.3  489838   
Sinopec Group                3    267518            -9.1   1257.9  310726   
China National Petroleum     4    262573           -12.3   1867.5  585619   
Toyota Motor                 5    254694             7.7  16899.3  437575   

                          profit_change            ceo  \
State Grid                         -6.2        Kou Wei   
Sinopec Group                     -65.0      Wang Yupu   
China National Petroleum          -73.7  Zhang Jianhua   
Toyota Motor                      -12.3    Akio Toyoda   

                                          industry                  sector  \
State Grid                               Utilities                  Energy   
Sinopec Group                   Petroleum Refining                  Energy   
China National Petroleum        Petroleum Refining 

The shorthand syntax only works when we select rows alone or columns alone, and in practice we select specific columns more often than rows. Whenever we select specific rows and columns together, we need to use .loc.

| Select by Label | Explicit Syntax | Common Shorthand |
| --- | --- | --- |
| Single row | df.loc["row1"] | None |
| List of rows | df.loc[["row1", "row5"]] | None |
| Slice of rows | df.loc["row1":"row5"] | df["row1":"row5"] |

In [94]:
toyota = f500_top_6.loc["Toyota Motor"]
print(type(toyota))

<class 'pandas.core.series.Series'>


In [104]:
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"]]
print(type(drink_companies))

<class 'pandas.core.frame.DataFrame'>


In [99]:
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"]]
print(drink_companies)

                      rank  revenues  revenue_change  profits  assets  \
Anheuser-Busch InBev   206     45905             5.3   1241.0  258381   
Coca-Cola              235     41863            -5.5   6527.0   87270   
Heineken Holding       468     23044            -0.7    861.5   41469   

                      profit_change                        ceo   industry  \
Anheuser-Busch InBev          -85.0               Carlos Brito  Beverages   
Coca-Cola                     -11.2           James B. Quincey  Beverages   
Heineken Holding              -18.9  Jean-Francois van Boxmeer  Beverages   

                                         sector  previous_rank      country  \
Anheuser-Busch InBev  Food, Beverages & Tobacco            211      Belgium   
Coca-Cola             Food, Beverages & Tobacco            206          USA   
Heineken Holding      Food, Beverages & Tobacco            459  Netherlands   

                                 hq_location  \
Anheuser-Busch InBev         Leuv

Since we are now selecting an inclusive range of rows with a slice, we do not need the double brackets [[ ]] that we use when selecting a list of rows.

In [108]:
middle_companies = f500.loc["Tata Motors":"Nationwide"]
print(middle_companies)

                                 rank  revenues  revenue_change  profits  \
Tata Motors                       247     40329            -4.2   1111.6   
Aluminum Corp. of China           248     40278             6.0   -282.5   
Mitsui                            249     40275             1.6   2825.3   
Manulife Financial                250     40238            49.4   2209.7   
China Minsheng Banking            251     40234            -5.2   7201.6   
China Pacific Insurance (Group)   252     40193             2.2   1814.9   
American Airlines Group           253     40180            -2.0   2676.0   
Nationwide                        254     40074            -0.4    334.3   

                                 assets  profit_change                   ceo  \
Tata Motors                       42162          -34.0      Guenter Butschek   
Aluminum Corp. of China           75089            NaN              Yu Dehui   
Mitsui                           103231            NaN       Tatsuo Yasunag
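
Label slices with .loc include both endpoints, unlike ordinary Python list slices, which stop before the end position. A minimal sketch (the frame and the list are hypothetical stand-ins using names from the output above):

```python
import pandas as pd

names = ["Tata Motors", "Aluminum Corp. of China", "Mitsui", "Manulife Financial"]
sample = pd.DataFrame({"rank": [247, 248, 249, 250]}, index=names)

label_slice = sample.loc["Tata Motors":"Mitsui"]   # both endpoints included
list_slice = names[0:2]                            # stop position excluded

print(label_slice.index.tolist())
print(list_slice)
```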

Counting values in a Series (1D) with the Series.value_counts() method.
By default, the output is sorted in descending order, with the most frequent values first.

In [109]:
sectors = f500["sector"]
print(type(sectors))

<class 'pandas.core.series.Series'>


In [111]:
sectors_value_counts = sectors.value_counts()
print(sectors_value_counts)

Financials                       118
Energy                            80
Technology                        44
Motor Vehicles & Parts            34
Wholesalers                       28
Health Care                       27
Food & Drug Stores                20
Transportation                    19
Telecommunications                18
Retailing                         17
Food, Beverages & Tobacco         16
Materials                         16
Industrials                       15
Aerospace & Defense               14
Engineering & Construction        13
Chemicals                          7
Household Products                 3
Media                              3
Hotels, Restaurants & Leisure      3
Business Services                  3
Apparel                            2
Name: sector, dtype: int64
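
A small sketch of value_counts() on a hand-built Series (the values are illustrative). The sort_values call at the end is only there to show that ascending order is also available if wanted:

```python
import pandas as pd

# Illustrative values: three distinct labels with different frequencies.
sectors_demo = pd.Series(
    ["Energy", "Financials", "Energy", "Retailing", "Energy", "Financials"]
)

counts = sectors_demo.value_counts()   # most frequent first
print(counts)

ascending = counts.sort_values()       # least frequent first
print(ascending.index[0])
```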


In [115]:
countries = f500["country"]
countries_value_count = countries.value_counts()
print(countries_value_count)

USA             132
China           109
Japan            51
Germany          29
France           29
Britain          24
South Korea      15
Netherlands      14
Switzerland      14
Canada           11
Spain             9
Australia         7
Brazil            7
India             7
Italy             7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Malaysia          1
Thailand          1
Belgium           1
Norway            1
Luxembourg        1
Indonesia         1
Denmark           1
Saudi Arabia      1
Finland           1
Venezuela         1
Turkey            1
U.A.E             1
Israel            1
Name: country, dtype: int64


Now we look at value_counts() on a DataFrame (rather than a 1D Series). We include the sector and industry columns for all rows in our Fortune Global 500 dataset.

In [116]:
sector_industries = f500[["sector", "industry"]]
print(type(sector_industries))

<class 'pandas.core.frame.DataFrame'>


The resulting output looks different because DataFrame.value_counts() returns a Series that uses a MultiIndex instead of the single index we saw with Series.value_counts().
The counts are based on each unique combination of non-null values found in the sector and industry columns.

In [118]:
si_value_counts = sector_industries.value_counts()
print(si_value_counts)

sector                         industry                                      
Financials                     Banks: Commercial and Savings                     51
Motor Vehicles & Parts         Motor Vehicles and Parts                          34
Energy                         Petroleum Refining                                28
Financials                     Insurance: Life, Health (stock)                   24
Food & Drug Stores             Food and Drug Stores                              20
Energy                         Mining, Crude-Oil Production                      18
Financials                     Insurance: Property and Casualty (Stock)          18
Telecommunications             Telecommunications                                18
Energy                         Utilities                                         18
Wholesalers                    Trading                                           15
Health Care                    Pharmaceuticals                                   1
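
Because the result is indexed by (sector, industry) pairs, a single count can be looked up with a tuple. A minimal sketch on a hypothetical four-row frame; DataFrame.value_counts() requires pandas 1.1 or later:

```python
import pandas as pd

# Hypothetical rows: one combination appears twice, the others once.
pairs = pd.DataFrame(
    {
        "sector": ["Energy", "Energy", "Energy", "Financials"],
        "industry": ["Utilities", "Petroleum Refining", "Petroleum Refining", "Banks"],
    }
)

pair_counts = pairs.value_counts()
print(type(pair_counts.index).__name__)                    # MultiIndex
print(pair_counts.loc[("Energy", "Petroleum Refining")])   # count for one combination
```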

In [121]:
countries = f500["country"]
country_counts = countries.value_counts()
print(country_counts)
top_country = "USA"

USA             132
China           109
Japan            51
Germany          29
France           29
Britain          24
South Korea      15
Netherlands      14
Switzerland      14
Canada           11
Spain             9
Australia         7
Brazil            7
India             7
Italy             7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Malaysia          1
Thailand          1
Belgium           1
Norway            1
Luxembourg        1
Indonesia         1
Denmark           1
Saudi Arabia      1
Finland           1
Venezuela         1
Turkey            1
U.A.E             1
Israel            1
Name: country, dtype: int64


In [123]:
hq_locations = f500["hq_location"]
hql_counts = hq_locations.value_counts()
print(hql_counts)
top_hq_city = "Beijing"

Beijing, China         56
Tokyo, Japan           36
Paris, France          17
New York, NY           15
London, Britain        14
                       ..
Bangkok, Thailand       1
Midland, MI             1
Perth, Australia        1
Leuven, Belgium         1
Fort Lauderdale, FL     1
Name: hq_location, Length: 235, dtype: int64
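
top_country and top_hq_city above were typed in after reading the printed counts. As a sketch of an alternative, the most frequent label can be read from the first position of the value_counts() index. The Series below is a hypothetical stand-in, and splitting on the comma assumes every label follows the "City, Region" pattern seen in the output:

```python
import pandas as pd

# Hypothetical stand-in for the hq_location column.
hq_demo = pd.Series(
    ["Beijing, China", "Tokyo, Japan", "Beijing, China", "Paris, France", "Beijing, China"]
)

hq_counts = hq_demo.value_counts()
top_label = hq_counts.index[0]        # most frequent label comes first
top_city = top_label.split(",")[0]    # keep the part before the comma

print(top_label)
print(top_city)
```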


Selecting the count for a single item or several items, rather than outputting the count for every value in a column:

In [125]:
countries = f500["country"]
country_counts = countries.value_counts()
india = country_counts["India"]
print(india)

7


In [127]:
north_america = country_counts[["USA", "Canada", "Mexico"]]
print(north_america)

USA       132
Canada     11
Mexico      2
Name: country, dtype: int64


In [128]:
print(type(north_america))

<class 'pandas.core.series.Series'>


In [130]:
japan_to_spain = country_counts["Japan":"Spain"]
print(type(japan_to_spain))

<class 'pandas.core.series.Series'>


In [131]:
print(japan_to_spain)

Japan          51
Germany        29
France         29
Britain        24
South Korea    15
Netherlands    14
Switzerland    14
Canada         11
Spain           9
Name: country, dtype: int64


| Select by Label | Explicit Syntax | Common Shorthand |
| --- | --- | --- |
| Single column from DataFrame | df.loc[:, "col1"] | df["col1"] |
| List of columns from DataFrame | df.loc[:, ["col1", "col7"]] | df[["col1", "col7"]] |
| Slice of columns from DataFrame | df.loc[:, "col1":"col4"] | None |
| Single row from DataFrame | df.loc["row1"] | None |
| List of rows from DataFrame | df.loc[["row1", "row5"]] | None |
| Slice of rows from DataFrame | df.loc["row1":"row5"] | df["row1":"row5"] |
| Single item from Series | s.loc["item8"] | s["item8"] |
| List of items from Series | s.loc[["item1", "item7"]] | s[["item1", "item7"]] |
| Slice of items from Series | s.loc["item2":"item4"] | s["item2":"item4"] |
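
The shorthand forms can be checked against their explicit .loc equivalents. This sketch uses a hypothetical three-row frame and compares each pair with equals():

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)
s = df["col1"]

# Each shorthand (second expression) should match the explicit .loc form (first).
checks = {
    "single column": df.loc[:, "col1"].equals(df["col1"]),
    "list of columns": df.loc[:, ["col1", "col3"]].equals(df[["col1", "col3"]]),
    "slice of rows": df.loc["row1":"row2"].equals(df["row1":"row2"]),
    "list of items": s.loc[["row1", "row3"]].equals(s[["row1", "row3"]]),
    "slice of items": s.loc["row2":"row3"].equals(s["row2":"row3"]),
}
print(checks)
print(s.loc["row2"] == s["row2"])   # single item from a Series
```

Spelling out .loc is usually the safer habit for rows, since only the slice form has a row shorthand.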

In [136]:
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]
print(big_movers)

              rank  previous_rank
Aviva           90            279
HP             194             48
JD.com         261            366
BHP Billiton   350            168


In [138]:
bottom_companies = f500.loc["National Grid":"AutoNation", ["rank", "sector", "country"]]
print(bottom_companies)

                                       rank              sector  country
National Grid                           491              Energy  Britain
Dollar General                          492           Retailing      USA
Telecom Italia                          493  Telecommunications    Italy
Xiamen ITG Holding Group                494         Wholesalers    China
Xinjiang Guanghui Industry Investment   495         Wholesalers    China
Teva Pharmaceutical Industries          496         Health Care   Israel
New China Life Insurance                497          Financials    China
Wm. Morrison Supermarkets               498  Food & Drug Stores  Britain
TUI                                     499   Business Services  Germany
AutoNation                              500           Retailing      USA


In [141]:
revenue_giants = f500.loc[["Apple", "Industrial & Commercial Bank of China", "China Construction Bank", "Agricultural Bank of China"], "revenues" : "profit_change"]
print(revenue_giants)

                                       revenues  revenue_change  profits  \
Apple                                    215639            -7.7  45687.0   
Industrial & Commercial Bank of China    147675           -11.7  41883.9   
China Construction Bank                  135093            -8.7  34840.9   
Agricultural Bank of China               117275           -12.1  27687.8   

                                        assets  profit_change  
Apple                                   321686          -14.4  
Industrial & Commercial Bank of China  3473238           -5.0  
China Construction Bank                3016578           -4.0  
Agricultural Bank of China             2816039           -3.6  


In [142]:
print(f500.head(5))

                          rank  revenues  revenue_change  profits  assets  \
Walmart                      1    485873             0.8  13643.0  198825   
State Grid                   2    315199            -4.4   9571.3  489838   
Sinopec Group                3    267518            -9.1   1257.9  310726   
China National Petroleum     4    262573           -12.3   1867.5  585619   
Toyota Motor                 5    254694             7.7  16899.3  437575   

                          profit_change                  ceo  \
Walmart                            -7.2  C. Douglas McMillon   
State Grid                         -6.2              Kou Wei   
Sinopec Group                     -65.0            Wang Yupu   
China National Petroleum          -73.7        Zhang Jianhua   
Toyota Motor                      -12.3          Akio Toyoda   

                                          industry                  sector  \
Walmart                      General Merchandisers               Retailing

In [143]:
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

series_a + series_b - Addition

series_a - series_b - Subtraction

series_a * series_b - Multiplication (element-wise; this is unrelated to the matrix multiplication used in linear algebra).

series_a / series_b - Division

In this example we are calculating the rank change by subtracting the current rank from the previous rank. A positive value means the company moved up in the ranking.

In [146]:
rank_change = f500["previous_rank"] - f500["rank"]
print(rank_change)

Walmart                             0
State Grid                          0
Sinopec Group                       1
China National Petroleum           -1
Toyota Motor                        3
                                 ... 
Teva Pharmaceutical Industries   -496
New China Life Insurance          -70
Wm. Morrison Supermarkets         -61
TUI                               -32
AutoNation                       -500
Length: 500, dtype: int64
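
Series arithmetic like the subtraction above pairs values by index label rather than by position. The sketch below uses small made-up Series (the labels and numbers are hypothetical, not taken from f500.csv) to show this:

```python
import pandas as pd

# Hypothetical toy data, not taken from f500.csv
previous_rank = pd.Series([3, 1, 2], index=["A", "B", "C"])
# Same labels as above, but stored in a different order
rank = pd.Series([1, 2, 3], index=["C", "B", "A"])

# Subtraction pairs values by index label, not by position
rank_change = previous_rank - rank
print(rank_change)
```

Because both f500 columns share the same index, the labels already line up and the subtraction is done company by company.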


We can also use the following Series methods to calculate summary values from a series:
Series.max()

Series.min()

Series.mean()

Series.median()

Series.mode()

Series.sum()

In [148]:
print(f500["revenues"].sum())

27708179
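
As a quick illustration of the methods listed above, here is a minimal sketch on a small made-up Series (the values are hypothetical and only chosen to make the results easy to check by hand):

```python
import pandas as pd

# Hypothetical revenues, not taken from f500.csv
revenues = pd.Series([10, 20, 20, 50], index=["A", "B", "C", "D"])

print(revenues.max())     # largest value
print(revenues.min())     # smallest value
print(revenues.mean())    # arithmetic mean
print(revenues.median())  # middle value
print(revenues.sum())     # total

# mode() returns a Series rather than a single value, since ties are possible
mode_values = revenues.mode()
print(mode_values)
```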


Now we can find the maximum and minimum change in rank. Based on the rank_change output above, though, there must be some incorrect data in the rank or previous_rank columns

In [155]:
rank_change_max = rank_change.max()
rank_change_min = rank_change.min()

Using the describe method we can find even more information about rank_change, which was created by subtracting the rank series from the previous_rank series

In [156]:
rank_change.describe()

count    500.000000
mean     -28.366000
std      108.602823
min     -500.000000
25%      -28.250000
50%       -4.000000
75%        8.250000
max      226.000000
dtype: float64

Notice that the output of the next cell is in e-notation, a type of scientific notation:

Original Notation	Expanded Formula	Result

5.000000E+02	5.000000 * 10 ** 2	500

2.436323E+05	2.436323 * 10 ** 5	243632.3

In [158]:
assets = f500["assets"]
print(assets.describe())

count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64
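
To read e-notation values more comfortably, note that Python accepts them directly as float literals. The sketch below uses plain Python (no pandas needed) and the mean value from the output above; the variable name is only illustrative:

```python
# Python floats accept e-notation literals directly
mean_assets = 2.436323e+05

print(mean_assets)            # the same number in ordinary decimal notation
print(f"{mean_assets:,.1f}")  # with a thousands separator and one decimal place
print(f"{500:e}")             # going the other way: 500 written in e-notation
```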


The "country" column only contains string values, but describe still reports the count, the number of unique values, the most common value (top) and how often it occurs (freq)

In [160]:
print(countries.describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object
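
A minimal sketch of the same idea on a tiny made-up Series of strings (the values are hypothetical) makes each of the four statistics easy to verify by eye:

```python
import pandas as pd

# Hypothetical country values, not taken from f500.csv
countries = pd.Series(["USA", "China", "USA", "Japan", "USA"], name="country")

summary = countries.describe()
print(summary)
# count  -> number of non-null values
# unique -> number of distinct values
# top    -> most common value
# freq   -> how many times the most common value occurs
```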


In [162]:
rank = f500["rank"]
print(rank.describe())
rank_desc = rank.describe()

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64


In [164]:
previous_rank = f500["previous_rank"]
prev_rank_desc = previous_rank.describe()
print(prev_rank_desc)

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64


Rather than: 
countries = f500["country"]
countries_counts = countries.value_counts()

We can assign the counts directly using method chaining (a way to combine multiple methods together in a single line):

In [166]:
countries_count = f500["country"].value_counts()
print(countries_count)

USA             132
China           109
Japan            51
Germany          29
France           29
Britain          24
South Korea      15
Netherlands      14
Switzerland      14
Canada           11
Spain             9
Australia         7
Brazil            7
India             7
Italy             7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Malaysia          1
Thailand          1
Belgium           1
Norway            1
Luxembourg        1
Indonesia         1
Denmark           1
Saudi Arabia      1
Finland           1
Venezuela         1
Turkey            1
U.A.E             1
Israel            1
Name: country, dtype: int64


Another example of method chaining, which lets us select the count for a single country in one line:

In [169]:
print(f500["country"].value_counts().loc["China"])

109
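
To make the equivalence explicit, the sketch below computes the same count step by step and as a single chain. The DataFrame is a small made-up stand-in for f500, so the names and values are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for f500 with only a country column
df = pd.DataFrame(
    {"country": ["USA", "China", "USA", "Japan", "China", "USA"]},
    index=["A", "B", "C", "D", "E", "F"],
)

# Step by step, with an intermediate variable for each result
countries = df["country"]
counts = countries.value_counts()
china_stepwise = counts.loc["China"]

# The same work as one chain: each method is called on the previous result
china_chained = df["country"].value_counts().loc["China"]

print(china_stepwise, china_chained)
```

Chaining saves intermediate variables, but long chains can be harder to debug, so splitting them up again is sometimes worthwhile.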


Now we are going to determine whether the value zero occurs in previous_rank and, if so, how often (this would explain why our rank change calculation did not work properly)

In [176]:
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
print(zero_previous_rank)

33
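
One possible way to deal with these zeros is sketched below on made-up data. It assumes that a zero means "no previous rank available", which is an assumption and not something the dataset description states. Note also that value_counts().loc[0] raises a KeyError when no zeros are present, so a boolean comparison is a more forgiving way to count them:

```python
import pandas as pd

# Hypothetical toy data, not taken from f500.csv
previous_rank = pd.Series([1, 0, 3, 0], index=["A", "B", "C", "D"])
rank = pd.Series([1, 2, 4, 3], index=["A", "B", "C", "D"])

# Count the zeros; this also works when there are none
zero_count = (previous_rank == 0).sum()
print(zero_count)

# Assumption: a zero means the previous rank is missing, so replace it with NaN
cleaned_previous = previous_rank.mask(previous_rank == 0)

# NaN propagates, so rows without a previous rank get no rank change
rank_change = cleaned_previous - rank
print(rank_change)
```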


Series.max() and DataFrame.max()

Series.min() and DataFrame.min()

Series.mean() and DataFrame.mean()

Series.median() and DataFrame.median()

Series.mode() and DataFrame.mode()

Series.sum() and DataFrame.sum()

DataFrame.method(axis=0) calculates the result for each column (moving down the row axis)
DataFrame.method(axis=1) calculates the result for each row (moving across the column axis)

We are now using the median method to calculate the median values of the columns revenues and profits:

In [177]:
medians = f500[["revenues", "profits"]].median(axis=0)
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64


The default value of the axis parameter is 0, so we get the same result without passing it:

In [184]:
medians = f500[["revenues", "profits"]].median()
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64
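
The difference between the two axis values is easier to see on a tiny made-up DataFrame (hypothetical values, chosen so the results can be checked by hand):

```python
import pandas as pd

# Hypothetical toy data, not taken from f500.csv
df = pd.DataFrame(
    {"revenues": [100, 200, 300], "profits": [10, 20, 60]},
    index=["A", "B", "C"],
)

# axis=0: one result per column
col_medians = df.median(axis=0)
print(col_medians)

# axis=1: one result per row
row_sums = df.sum(axis=1)
print(row_sums)
```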


In [185]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

In [187]:
max_f500 = f500[["rank", "revenues", "revenue_change", "profits", "assets", "profit_change", "previous_rank", "years_on_global_500_list", "employees", "total_stockholder_equity"]].max()
print(max_f500)

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64


Rather than typing out every column we want the maximum of, we can pass numeric_only=True to max, which restricts the calculation to columns with integer or float values

In [190]:
max_f500 = f500.max(numeric_only=True)
print(max_f500)

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64
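
A small sketch of numeric_only=True on a made-up DataFrame with mixed column types (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one string column
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "profits": [13.5, 9.5, 1.25],
        "country": ["USA", "China", "China"],
    },
    index=["A", "B", "C"],
)

# Only the numeric columns take part; "country" is left out of the result
max_numeric = df.max(numeric_only=True)
print(max_numeric)
```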


By default, describe only includes statistics for columns with numeric values

In [192]:
f500.describe()

             rank      revenues  revenue_change      profits     assets  \
count       500.0         500.0           498.0        499.0      500.0
mean        250.5     55416.358        4.538353  3055.203206   243632.3
std    144.481833  45725.478963       28.549067  5171.981071   485193.7
min           1.0       21609.0           -67.3     -13038.0     3717.0
25%        125.75       29003.0            -5.9       556.95    36588.5
50%         250.5       40236.0            0.55       1761.6    73261.5
75%        375.25      63926.75           6.975       3954.0   180564.0
max         500.0      485873.0           442.3      45687.0  3473238.0

       profit_change  previous_rank  years_on_global_500_list  employees  total_stockholder_equity
count          436.0          500.0                     500.0      500.0                     500.0
mean       24.152752        222.134                    15.036   133998.3                 30628.076
std       437.509566     146.941961                  7.932752   170087.8              43642.576833
min           -793.7            0.0                       1.0      328.0                  -59909.0
25%          -22.775          92.75                       7.0    42932.5                   7553.75
50%            -0.35          219.5                      17.0    92910.5                   15809.5
75%             17.7         347.25                      23.0   168917.2                   37828.5
max           8909.5          500.0                      23.0  2300000.0                  301893.0


If we want describe to return statistics for the non-numeric columns instead, we pass include=['O'] (that is the letter O, not a zero)

In [200]:
f500.describe(include=['O'])

                        ceo                       industry      sector  \
count                   500                            500         500
unique                  500                             58          21
top     C. Douglas McMillon  Banks: Commercial and Savings  Financials
freq                      1                             51         118

        country     hq_location                 website
count       500             500                     500
unique       34             235                     500
top         USA  Beijing, China  http://www.walmart.com
freq        132              56                       1
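
The two describe calls can be compared side by side on a small made-up DataFrame. The sketch below builds the string columns with an explicit object dtype; that is a choice made here so the example does not depend on how a given pandas version infers string columns, not something the notebook itself does:

```python
import pandas as pd

# Hypothetical toy data; string columns are given an explicit object dtype
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "country": pd.Series(["USA", "China", "China"], dtype=object),
        "sector": pd.Series(["Retailing", "Energy", "Energy"], dtype=object),
    }
)

# Default: only numeric columns are summarized
numeric_desc = df.describe()
print(numeric_desc)

# include=["O"]: only object (string) columns are summarized
object_desc = df.describe(include=["O"])
print(object_desc)
```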


In [202]:
f500_desc = f500.describe()

In [209]:
print(f500.max(numeric_only=True))

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64
