# **INTRODUCTION**

- Earlier, in the introduction to pandas, we explored why pandas is often preferred over NumPy and how it makes coding a whole lot easier.

- Just as a reminder, the **Fortune 500** dataset used in this course can be found [here](https://github.com/Tess-hacker/THE-ULTIMATE-GUIDE-TO-UNDERSTANDING-NumPy-AND-PANDAS/blob/master/FORTUNE's%20500%20LIST.csv). Feel free to explore it!

- If you have been following the [previous introductory lesson](https://github.com/Tess-hacker/THE-ULTIMATE-GUIDE-TO-UNDERSTANDING-NumPy-AND-PANDAS/blob/master/INTRODUCTION%20TO%20PANDAS.ipynb), then you might not need this. Otherwise, let's do a bit of revision before we start.

- Let us print out the first 10 rows of the dataset and get the data type contained within our dataset. Ready? **LET'S GO!**

In [3]:
import pandas as pd
f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

## **VECTORIZED OPERATIONS**

1. Remember vectorized operations in NumPy? We can also do that for pandas and it can be done in an even smoother way.

2. We can do vectorization in pandas series the following way:

    - series_a + series_b - Addition
    - series_a - series_b - Subtraction
    - series_a * series_b - Multiplication (element by element; this is not the matrix multiplication used in linear algebra).
    - series_a / series_b - Division
    
3. Remember that our dataset contains the previous and current year rank list. We can find the difference in the two categories of ranks using the vectorization method.


In [4]:
rank_column = f500['rank']
previous_rank = f500['previous_rank']
rank_change = previous_rank - rank_column
print (rank_change)

Walmart                             0
State Grid                          0
Sinopec Group                       1
China National Petroleum           -1
Toyota Motor                        3
                                 ... 
Teva Pharmaceutical Industries   -496
New China Life Insurance          -70
Wm. Morrison Supermarkets         -61
TUI                               -32
AutoNation                       -500
Length: 500, dtype: int64


- Based on the result above, we can see that from the first five ranking differences:

    - **Walmart and State Grid** are still on the same ranking spot as there is no difference
    - **Sinopec and Toyota Motor** went up on the rankings by the numbers shown
    - **China National** dropped in their ranking.
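
- The same subtraction can be tried on a tiny hand-made example. This is only a sketch: the company names and ranks below are invented and are not taken from the Fortune 500 file. It also shows that pandas pairs values by their label rather than by their position.

```python
import pandas as pd

# Hypothetical mini-dataset: the company names and ranks below are invented.
current_rank = pd.Series([1, 2, 3], index=["Alpha", "Beta", "Gamma"])
previous_rank = pd.Series([2, 1, 3], index=["Alpha", "Beta", "Gamma"])

# Element-wise subtraction: each value is paired with the value that has the same label.
rank_change = previous_rank - current_rank
print(rank_change)

# Pairing is done by label, not by position, so reordering one Series
# still gives every company the same answer.
reordered = current_rank.loc[["Gamma", "Alpha", "Beta"]]
rank_change_reordered = previous_rank - reordered
print(rank_change_reordered.sort_index())
```

- A positive number means the company climbed, a negative number means it dropped, and zero means it stayed put.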
    
- As nice as it is to individually analyse the results of the ranking, it is not a feasible plan especially when we have an overwhelming volume of data to analyse.

- So, if we just want the biggest rise and the biggest drop in rank across the whole dataset, there is a much quicker way of getting that. Are you thinking what I am thinking? *wink*

- Let me show you!

In [5]:
rank_change = f500["previous_rank"] - f500["rank"]  # calculate the rank change as we did earlier
rank_change_max = rank_change.max()  # the largest rise in rank
rank_change_min = rank_change.min()  # the largest drop in rank
print (rank_change_max)
print (rank_change_min)
# now let's run the code

226
-500


- The result above seems to show that the firm that climbed the most went up by 226 places, while the firm that fell the most dropped by 500 places. That's what you would think, right?

- Oh well.... Think again

- So, according to the data dictionary, our ranks should only fall between 1 and 500. Even if a company fell all the way from first place to last place, the change would be 1 - 500 = -499 and not -500, correct?

- To investigate this issue and get more familiar with our data, we need a new approach: the `Series.describe()` method. This method tells us **how many non-null values are contained in the series, plus the mean, minimum, maximum, and other statistics**.

- A major thing you should take note of is that the **`Series.describe()` method returns different statistics for numeric and non-numeric data**.

- Using this method, let us see the nature of the dataset we are working with. Ready?


In [6]:
# for the numeric column in the dataset
assets = f500["assets"]
print ("The numeric assets column gives us:")
print(assets.describe())
print ('\n')

# for the non-numeric column in the dataset
country = f500["country"]
print ("The country non-numeric column gives us:")
print(country.describe())

The numeric assets column gives us:
count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64


The country non-numeric column gives us:
count     500
unique     34
top       USA
freq      132
Name: country, dtype: object


- The first statistic, `count`, is the same for both numeric and non-numeric columns: it shows the number of non-null values.

- The other three statistics are new:

    - `unique`: Number of unique values in the series. In this case, it tells us that there are 34 different countries represented in the Fortune 500.
    - `top`: Most common value in the series. The **USA** is the country that headquarters the most Fortune 500 companies.
    - `freq`: Frequency of the most common value. Exactly 132 companies from the Fortune 500 are headquartered in the USA.
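
- To see the two flavours of `describe()` side by side without loading the file, here is a small sketch. The values are made up for illustration only.

```python
import pandas as pd

# Invented values, used only to illustrate the two kinds of summary.
countries = pd.Series(["USA", "China", "USA", "Japan", "USA"])
amounts = pd.Series([10, 20, 30])

# Non-numeric data: count, unique, top and freq.
text_summary = countries.describe()
print(text_summary)

# Numeric data: count, mean, std, min, quartiles and max.
numeric_summary = amounts.describe()
print(numeric_summary)
```

- Here `top` is "USA" because it appears three times out of five, which is also the value reported by `freq`.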

- Using this same approach, let us find out the nature of the data contained within the `rank` and the `previous_rank` columns in the dataset.

In [7]:
rank = f500["rank"]
rank_desc =rank.describe()
print ("The RANK column data contains:")
print (rank_desc)
print ('\n')
previous_rank = f500["previous_rank"]
prev_rank_desc = previous_rank.describe()
print ("The PREVIOUS RANK column data contains:")
print (previous_rank)

The RANK column data contains:
count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64


The PREVIOUS RANK column data contains:
Walmart                             1
State Grid                          2
Sinopec Group                       4
China National Petroleum            3
Toyota Motor                        8
                                 ... 
Teva Pharmaceutical Industries      0
New China Life Insurance          427
Wm. Morrison Supermarkets         437
TUI                               467
AutoNation                          0
Name: previous_rank, Length: 500, dtype: int64


## **METHOD CHAINING**

- The PREVIOUS RANK output above contains values of zero (0), for example for Teva Pharmaceutical Industries and AutoNation, which shouldn't be the case. With 500 companies in the list, the ranks should run from 1 to 500, so a value of zero should not appear in the dataset.

- So, the next step is to confirm how many zero (0) values we have in the `previous_rank` column. One way to do this is with the `Series.value_counts()` method. Written out step by step on the `country` column, for example, the code looks like this:

    - `countries = f500["country"]`
    - `countries_counts = countries.value_counts()`

- If you run this code, you'll get the result we want. However, we can make the code shorter using **method chaining**, which is the focus of this section. **Method chaining** is **a way to combine multiple methods together in a single line.**

- Let us see how we can reduce the codes above using this method:

In [8]:
# instead of initially assigning the 'country' column to a variable and then finding the value count, we can do this:
countries_counts = f500["country"].value_counts()
# we can even extend the code to select the country we want to get the value count for using the '.loc' method
print ("The number of times China appears is:")
print(f500["country"].value_counts().loc["China"])
print ('\n')
# we can use the same idea to count how many previous ranks are zero
# first, step by step (without method chaining):
previous_rank = f500["previous_rank"]
previous_rank_counts = previous_rank.value_counts()
zero_previous_rank = previous_rank_counts.loc[0]
print ("The number of previous rank zero values without the method chaining approach is:")
print (zero_previous_rank)
print ('\n')
print ("The number of previous rank zero values with the method chaining approach is:")
print(f500["previous_rank"].value_counts().loc[0]) #with the method chaining approach

The number of times China appears is:
109


The number of previous rank zero values without the method chaining approach is:
33


The number of previous rank zero values with the method chaining approach is:
33
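
- Here is the same comparison on a tiny made-up Series, as a sketch you can run without the dataset. The numbers are invented, with 0 standing in for "no rank". The last line shows another common way to count matches, which is an alternative and not what the cell above uses.

```python
import pandas as pd

# Invented previous-rank values where 0 stands for "no rank".
ranks = pd.Series([0, 3, 0, 7, 0])

# Step by step: keep every intermediate result in its own variable.
counts = ranks.value_counts()
zeros_stepwise = counts.loc[0]

# Method chaining: the same two steps written on one line.
zeros_chained = ranks.value_counts().loc[0]

# Alternative: compare with 0 and add up the True values.
zeros_by_mask = (ranks == 0).sum()

print(zeros_stepwise, zeros_chained, zeros_by_mask)
```

- Chaining saves a variable, but the step by step version can be easier to debug because each intermediate result can be printed.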


- Based on the analysis done above, we can see that the `previous_rank` series contains zeros that should not be there. However, we need to look beyond the **series** scope and start looking at the **dataframe** scope to ascertain all we need to know about our data.

- Recall that in the last lesson, [INTRODUCTION TO PANDAS](https://github.com/Tess-hacker/THE-ULTIMATE-GUIDE-TO-UNDERSTANDING-NumPy-AND-PANDAS/blob/master/INTRODUCTION%20TO%20PANDAS.ipynb), which I believe you must have gone through, we learnt about the differences between Series and DataFrames.

- Getting a statistical summary differs slightly between a Series and a DataFrame. While the median of the former can be calculated using `Series.median()`, calculating the same for the latter has an adjustment to it: the addition of an *axis* parameter.

- In addition, when calculating these values for a dataframe, the *axis* parameter accepts one of two kinds of value:

    - To calculate along the **index** (down the rows, giving one result per column), pass either the number **0** or the string **index**. This is the default.
    - To calculate along the **columns** (across each row, giving one result per row), pass either the number **1** or the string **columns**.

- Let us learn the differences in the table below:

  | **SERIES** | **DATAFRAME** |
  | --- | --- |
  | `Series.max()` | `DataFrame.max(axis=0 or 1)` |
  | `Series.min()` | `DataFrame.min(axis=0 or 1)` |
  | `Series.median()` | `DataFrame.median(axis=0 or 1)` |
  | `Series.mean()` | `DataFrame.mean(axis=0 or 1)` |
  | `Series.mode()` | `DataFrame.mode(axis=0 or 1)` |
  | `Series.sum()` | `DataFrame.sum(axis=0 or 1)` |
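
- The effect of the *axis* parameter is easiest to see on a very small dataframe. The sketch below uses invented numbers and made-up labels (`a`, `b`, `x`, `y`, `z`) purely for illustration.

```python
import pandas as pd

# A small invented dataframe with two columns and three labelled rows.
df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]}, index=["x", "y", "z"])

# axis=0 (or axis="index"): work down the rows, one result per column.
col_max = df.max(axis=0)
print(col_max)

# axis=1 (or axis="columns"): work across the columns, one result per row.
row_max = df.max(axis=1)
print(row_max)
```

- So `col_max` is labelled with the column names, while `row_max` is labelled with the row labels.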

- Now, let us use the dataframe to find the maximum value for **only numeric columns** within the dataset.

In [9]:
max_f500 = f500.max(numeric_only=True)
print (max_f500)

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64


- When using the `DataFrame.describe()` method, by default, it only returns statistics for the numeric columns. However, if we want it to describe just the **object columns**, we need to pass the `include` parameter.

- Our describe function will look like this (note that `'O'` is a capital letter O for "object", not a zero):

    - `DataFrame.describe(include=['O'])`

- Let's try getting the information for both the numeric and object columns below.

In [10]:
# for the object columns
print ("The outcome for the OBJECT columns are as follows:")
print (f500.describe(include = ['O']))
print ('\n')
print ("The outcome for the NUMERIC columns are as follows:")
print (f500.describe())


The outcome for the OBJECT columns are as follows:
             ceo                       industry      sector country  \
count        500                            500         500     500   
unique       500                             58          21      34   
top     Yu Dehui  Banks: Commercial and Savings  Financials     USA   
freq           1                             51         118     132   

           hq_location                 website  
count              500                     500  
unique             235                     500  
top     Beijing, China  http://www.siemens.com  
freq                56                       1  


The outcome for the NUMERIC columns are as follows:
             rank       revenues  revenue_change       profits        assets  \
count  500.000000     500.000000      498.000000    499.000000  5.000000e+02   
mean   250.500000   55416.358000        4.538353   3055.203206  2.436323e+05   
std    144.481833   45725.478963       28.549067   517

## **ASSIGNMENT WITH PANDAS**

- From the result above for the numeric columns, we can conclude that no awkward looking value stands out except for the zero values under the previous rank column.

- We do not want these values within our data and would like to replace the zeros with null values. As we concluded earlier in the lesson, a rank of zero is not a real rank. So, assigning a null value helps to show that **there should be a rank for the firm(s) involved but it is missing**.

- In doing this, we will be learning the process of **assigning with pandas**. Just like NumPy, we can assign values to a row or a column using the column or row title. Wanna know how this is done?

- Let's use the following examples:

In [11]:
# first, let us take the first five rows of the rank and revenues columns
# (as a copy, so that the assignments below do not touch f500 itself)
top5_rank_revenue = f500[["rank", "revenues"]].head().copy()
print ("The first five ranked companies and their revenue are:")
print(top5_rank_revenue)
print ('\n')
# then let us assign values to the revenue column
top5_rank_revenue["revenues"] = 0
print ("The newly assigned values to revenue column are:")
print(top5_rank_revenue)
print ('\n')
# we can also assign values based on a particular row and column location. Let's see:
top5_rank_revenue.loc["State Grid", "revenues"] = 999
print ("The newly assigned values for State Grid are:")
print(top5_rank_revenue)

The first five ranked companies and their revenue are:
                          rank  revenues
Walmart                      1    485873
State Grid                   2    315199
Sinopec Group                3    267518
China National Petroleum     4    262573
Toyota Motor                 5    254694


The newly assigned values to revenue column are:
                          rank  revenues
Walmart                      1         0
State Grid                   2         0
Sinopec Group                3         0
China National Petroleum     4         0
Toyota Motor                 5         0


The newly assigned values for State Grid are:
                          rank  revenues
Walmart                      1         0
State Grid                   2       999
Sinopec Group                3         0
China National Petroleum     4         0
Toyota Motor                 5         0


- Now for a real challenge:

    - The company "Dow Chemical" has named a new CEO. Update the value where the row label is Dow Chemical and for the ceo column to Jim Fitterling in the f500 dataframe.
    
- Are you up for it? *winks*

In [12]:
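f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"  # assign the new CEO by row and column label
newdowceo = f500.loc["Dow Chemical", "ceo"]  # read the value back to confirm the update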
print ("The new CEO for Dow Chemical is:")
print(newdowceo)
print ('\n')
print ("The updated CEO database is:")
print (f500[["ceo"]])

The new CEO for Dow Chemical is:
Jim Fitterling


The updated CEO database is:
                                                ceo
Walmart                         C. Douglas McMillon
State Grid                                  Kou Wei
Sinopec Group                             Wang Yupu
China National Petroleum              Zhang Jianhua
Toyota Motor                            Akio Toyoda
...                                             ...
Teva Pharmaceutical Industries    Yitzhak Peterburg
New China Life Insurance                   Wan Feng
Wm. Morrison Supermarkets            David T. Potts
TUI                               Friedrich Joussen
AutoNation                       Michael J. Jackson

[500 rows x 1 columns]


## **BOOLEAN INDEXING WITH PANDAS**

- Now that we know how to assign values with pandas, we have one more step to go: understanding how to do boolean indexing with pandas.

- Looks easy to assign values to different rows and columns in a dataset, right? But looking forward, you'll see that the process becomes cumbersome when you are dealing with a larger volume of data. With the help of boolean indexing, **we can assign desired values to specific rows and/or columns at once**.
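
- As a quick sketch of that idea, the example below builds a boolean mask and uses it both to select and to assign. The company names, industries and countries are invented for illustration.

```python
import pandas as pd

# Invented companies, used only to illustrate boolean indexing.
df = pd.DataFrame(
    {
        "industry": ["Motor", "Banks", "Motor"],
        "country": ["Japan", "USA", "Germany"],
    },
    index=["A Co", "B Co", "C Co"],
)

# Comparing a column with a value gives one True/False per row.
is_motor = df["industry"] == "Motor"

# Selecting: keep only the rows where the mask is True, and only the country column.
motor_countries = df.loc[is_motor, "country"]
print(motor_countries)

# Assigning: update every row where the mask is True in one go.
df.loc[is_motor, "industry"] = "Motor Vehicles"
print(df)
```

- However many rows match, the assignment is still a single line.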

- Using boolean indexing and the `df.loc[]` indexer, let us identify companies belonging to the "Motor Vehicles and Parts" industry in our f500 dataset.

In [16]:
# First, Create a boolean series that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts"
motor_bool = f500["industry"] == "Motor Vehicles and Parts"
print ("The Fortune 500 companies under the Motor Vehicles and Parts category are:")
print (motor_bool)
print ('\n')
# then let us use the motor_bool series to index the country column; this way, we find out countries under the "Motor Vehicles & Parts" category
motor_countries = f500.loc[motor_bool,"country"]
print ("The countries under the Motor Vehicles and Parts category are:")
print (motor_countries)
print ('\n')
# let us count the number of companies under this category (count() counts rows, so a country can be counted more than once):
print ("The total number of companies in the Motor Vehicles and Parts category is:")
print (motor_countries.count())

The Fortune 500 companies under the Motor Vehicles and Parts category are:
Walmart                           False
State Grid                        False
Sinopec Group                     False
China National Petroleum          False
Toyota Motor                       True
                                  ...  
Teva Pharmaceutical Industries    False
New China Life Insurance          False
Wm. Morrison Supermarkets         False
TUI                               False
AutoNation                        False
Name: industry, Length: 500, dtype: bool


The countries under the Motor Vehicles and Parts category are:
Toyota Motor                                 Japan
Volkswagen                                 Germany
Daimler                                    Germany
General Motors                                 USA
Ford Motor                                     USA
Honda Motor                                  Japan
SAIC Motor                                   China
Nissan Motor          

- With our newly found knowledge, we can now take care of the main reason we learned this in the first place: to replace the zero values in previous rank with null values as we consider them to be missing. Right?

- Let us combine boolean indexing and pandas assignment in a single line, in the same spirit as **method chaining**, to assign null values wherever the previous rank is zero:

In [19]:
import numpy as np
# counting each value in the previous rank column; dropna=False means null values are counted too, and head() keeps the first 5 counts
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
print ("The PREVIOUS RANK data before method chaining is as follows:")
print (prev_rank_before)
print ('\n')
# assigning null values to the rows where previous_rank is 0 (boolean indexing and assignment in one line)
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
prev_rank_after= f500["previous_rank"].value_counts(dropna=False).head()
print ("The PREVIOUS RANK data after method chaining is as follows:")
print (prev_rank_after)

The PREVIOUS RANK data before method chaining is as follows:
NaN      33
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64


The PREVIOUS RANK data after method chaining is as follows:
NaN      33
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64
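
- Notice that the ranks above are printed as `471.0` and not `471`. The sketch below, on invented numbers, shows why: `NaN` is a floating point value, so a column that holds it is stored as floats. As an assumption on my part, recent pandas versions may warn or raise when `NaN` is written straight into an integer column, so this sketch converts the column to float first.

```python
import numpy as np
import pandas as pd

# Invented previous-rank values where 0 stands for "no rank".
df = pd.DataFrame({"previous_rank": [1, 0, 3, 0]}, index=["A", "B", "C", "D"])
dtype_before = df["previous_rank"].dtype

# Convert to float first, because NaN cannot be stored in an integer column.
df["previous_rank"] = df["previous_rank"].astype("float64")

# Boolean indexing plus assignment: replace every 0 with NaN.
df.loc[df["previous_rank"] == 0, "previous_rank"] = np.nan

# dropna=False makes value_counts() include the NaN values in the tally.
counts = df["previous_rank"].value_counts(dropna=False)
print(dtype_before, "->", df["previous_rank"].dtype)
print(counts)
```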


## **CREATING NEW COLUMNS**

- Now that we have learnt value assignment and boolean indexing, we need to understand how to create new columns within the dataframe. A new column is created by assigning values to a column label that does not exist yet.

- Let us create a `rank_change` column within our dataset.

In [23]:
# create the rank change column which will be calculated by subtracting current rank from previous rank
f500["rank_change"] = f500["previous_rank"] - f500["rank"] 
print (f500)
print ('\n')
rank_change_desc = f500["rank_change"].describe()
print ("The description of the rank change column is as follows:")
print (rank_change_desc)

                                rank  revenues  revenue_change  profits  \
Walmart                            1    485873             0.8  13643.0   
State Grid                         2    315199            -4.4   9571.3   
Sinopec Group                      3    267518            -9.1   1257.9   
China National Petroleum           4    262573           -12.3   1867.5   
Toyota Motor                       5    254694             7.7  16899.3   
...                              ...       ...             ...      ...   
Teva Pharmaceutical Industries   496     21903            11.5    329.0   
New China Life Insurance         497     21796           -13.3    743.9   
Wm. Morrison Supermarkets        498     21741           -11.3    406.4   
TUI                              499     21655            -5.5   1151.7   
AutoNation                       500     21609             3.6    430.5   

                                assets  profit_change                  ceo  \
Walmart              
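
- Here is a small sketch of the same step on invented data. It also shows what happens to the missing previous ranks: any calculation involving `NaN` gives `NaN`, and `describe()` leaves those rows out of its `count`.

```python
import numpy as np
import pandas as pd

# Invented ranks; company B has no previous rank.
df = pd.DataFrame(
    {"rank": [1, 2, 3], "previous_rank": [2.0, np.nan, 1.0]},
    index=["A", "B", "C"],
)

# Assigning to a label that does not exist yet creates the column.
df["rank_change"] = df["previous_rank"] - df["rank"]
print(df)

# The missing value is ignored by describe(), so count is 2 and not 3.
rank_change_desc = df["rank_change"].describe()
print(rank_change_desc)
```

- With the zeros replaced by `NaN`, an impossible value such as -500 can no longer turn up as the minimum rank change.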

## **CLOSING EXERCISES**

- Let us now put what we have learnt so far into practice. 

    - First, let us find the 2 most common industries among the companies headquartered in the USA.
    - Then, let us find the 3 most common sectors among the companies headquartered in China.

In [24]:
# to perform the two tasks above, we will need the value_counts() method, boolean indexing and (optionally) method chaining
usa_industries = f500["industry"][f500["country"] == "USA"] # selecting the industry values for companies in the USA
industry_usa = usa_industries.value_counts().head(2) # counting each industry and keeping the 2 most common
print ("The two most common industries in the USA are:")
print (industry_usa)
print ('\n')
china_sectors = f500["sector"][f500["country"] == "China"] # selecting the sector values for companies in China
sector_china = china_sectors.value_counts().head(3) # counting each sector and keeping the 3 most common
print ("The three most common sectors in China are:")
print (sector_china)

The two most common industries in the USA are:
Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Name: industry, dtype: int64


The three most common sectors in China are:
Financials     25
Energy         22
Wholesalers     9
Name: sector, dtype: int64


# **SUMMARY**

- In this lesson, we learned:

    - How to select data from pandas objects using boolean arrays.
    - How to assign data using labels and boolean arrays.
    - How to create new columns in pandas.
    - Many new methods to make data analysis easier in pandas.
    
- In the next lesson, we will be learning more advanced techniques in pandas.

- Remember: PRACTICE! PRACTICE!! PRACTICE!!! until you get it!

- Till the next lesson, **HAPPY CODING**
