# 1. Introduction to the Data

* Apart from supporting string `labels` for axis values and `multiple data types`, pandas has many built-in methods and functions for common exploration and analysis tasks.

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's Global [500 list 2017](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible.

In [1]:
# import pandas module
import pandas as pd


#import dataset file 
f500=pd.read_csv('f500.csv')

### TODO
* Use the `DataFrame.head()` and `DataFrame.info()` methods to refamiliarize ourselves with the data.

In [2]:
f500.head(7)

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
5,Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753
6,Royal Dutch Shell,7,240033,-11.8,4575.0,411275,135.9,Ben van Beurden,Petroleum Refining,Energy,5,Netherlands,"The Hague, Netherlands",http://www.shell.com,23,89000,186646


In [3]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
company                     500 non-null object
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(7)
memory usage: 66.5+ KB


# 2. Vectorized Operations

**Vectorization not only improves our code's performance, but also enables us to write code more quickly.**

* **Because pandas is an extension of NumPy, it also supports vectorized operations.**

Just like with NumPy, we can use any of the standard Python numeric operators with series, including:

* `series_a + series_b`: Addition
* `series_a - series_b`: Subtraction
* `series_a * series_b`: Multiplication (element-wise; this is unrelated to the matrix multiplication used in linear algebra)
* `series_a / series_b`: Division
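
As a minimal sketch of these operators, the example below applies each one to two small series. The values are hypothetical toy data, not taken from `f500.csv`, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
series_a = pd.Series([10, 20, 30])
series_b = pd.Series([1, 2, 3])

# Each operator works element by element, pairing values with the same index label
total = series_a + series_b
difference = series_a - series_b
product = series_a * series_b
quotient = series_a / series_b  # division always produces floats

print(total.tolist())       # [11, 22, 33]
print(difference.tolist())  # [9, 18, 27]
print(product.tolist())     # [10, 40, 90]
print(quotient.tolist())    # [10.0, 10.0, 10.0]
```

No explicit loop is needed: the operation is applied to the whole series at once.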

## TODO:
* Subtract the values in the rank column from the values in the previous_rank column. Assign the result to rank_change.

In [4]:
rank_change=f500['previous_rank']-f500['rank']

In [5]:
rank_change[:5]

0    0
1    0
2    1
3   -1
4    3
dtype: int64

# 3. Series Data Exploration Methods

Like NumPy, pandas supports many descriptive statistics methods that can help us explore a series. Here are a few of the most useful ones:

* Series.max()
* Series.min()
* Series.mean()
* Series.median()
* Series.mode()
* Series.sum()
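
The sketch below calls each of these methods on a small series. The values are hypothetical and chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
changes = pd.Series([3, -1, 0, 3, 5])

largest = changes.max()
smallest = changes.min()
average = changes.mean()
middle = changes.median()
most_common = changes.mode()  # returns a Series, because there can be ties
total = changes.sum()

print(largest, smallest, average, middle, most_common.tolist(), total)
```

Note that `Series.mode()` returns a series rather than a single value, since several values can share the highest frequency.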

## TODO:
* Use the Series.max() method to find the maximum value for the rank_change series. Assign the result to the variable rank_change_max.
* Use the Series.min() method to find the minimum value for the rank_change series. Assign the result to the variable rank_change_min.

In [6]:
rank_change_max=rank_change.max()
rank_change_min=rank_change.min()

In [7]:
rank_change_max

226

In [8]:
rank_change_min

-500

### Observation:
* The largest rise in rank is 226 places, and the largest drop is 500 places.
* However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the rank column or previous_rank column.

# 4. Series Describe Method

* The `Series.describe()` method can help us investigate this issue more quickly.
* This method tells us how many non-null values the series contains, along with the mean, minimum, maximum, and other statistics.


* **If we use describe() on a column that contains non-numeric values, we get some different statistics.**
  * count  : number of non-null values
  * unique : number of unique values
  * top    : most common value
  * freq   : frequency of the most common value
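
To illustrate the difference, here is a small sketch that calls `describe()` on one numeric and one non-numeric series. Both series are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
ranks = pd.Series([1, 2, 3, 4])
countries = pd.Series(["USA", "China", "USA", "Japan"])

rank_stats = ranks.describe()         # count, mean, std, min, quartiles, max
country_stats = countries.describe()  # count, unique, top, freq

print(rank_stats)
print(country_stats)
```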

## TODO:
* Return a series of descriptive statistics for the rank column in f500.
  * Select the rank column. Assign it to a variable named rank.
  * Use the Series.describe() method to return a series of statistics for rank. Assign the result to rank_desc.
* Return a series of descriptive statistics for the previous_rank column in f500.
  * Select the previous_rank column. Assign it to a variable named prev_rank.
  * Use the Series.describe() method to return a series of statistics for prev_rank. Assign the result to prev_rank_desc.

In [9]:
rank=f500['rank']
rank_desc=rank.describe()

In [10]:
prev_rank=f500['previous_rank']
prev_rank_desc=prev_rank.describe()

In [11]:
rank_desc

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64

In [12]:
prev_rank_desc

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64

### Observation:
The minimum value in the previous_rank column is 0. However, this column should only have values between 1 and 500 (inclusive), so a value of 0 doesn't make sense. To investigate the possible cause of this issue, let's confirm the number of 0 values that appear in the previous_rank column.

# 5. Method Chaining

**Method chaining is a way to combine multiple methods together in a single line.**
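
As a rough sketch, the example below gets the same result twice: once with an intermediate variable and once with a chain. The series is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
countries = pd.Series(["USA", "China", "USA", "Japan", "USA"])

# Step by step, storing the intermediate result
counts = countries.value_counts()
usa_count_steps = counts.loc["USA"]

# Chained: each call operates on the result of the previous one
usa_count_chained = countries.value_counts().loc["USA"]

print(usa_count_steps, usa_count_chained)  # 3 3
```

Chaining avoids throwaway variables, although very long chains can become hard to read.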

## TODO
* Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe.
* Assign the results to zero_previous_rank.

In [13]:
zero_previous_rank=f500['previous_rank'].value_counts().loc[0]
zero_previous_rank

33

# 6. Dataframe Exploration Methods

Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. Below are some examples:

* Series.max() and DataFrame.max()
* Series.min() and DataFrame.min()
* Series.mean() and DataFrame.mean()
* Series.median() and DataFrame.median()
* Series.mode() and DataFrame.mode()
* Series.sum() and DataFrame.sum()

**Unlike their series counterparts, dataframe methods take an `axis` parameter that determines which axis to calculate across.** It defaults to `0`, which returns one result per column.

* While you can use the integers `0` and `1` to refer to the first and second axis, **pandas dataframe methods also accept the strings `"index"` and `"columns"` for the axis parameter.**
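
The following sketch shows what each axis value does on a small dataframe. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis="index" (same as axis=0) calculates down the rows: one result per column
col_max = df.max(axis="index")

# axis="columns" (same as axis=1) calculates across the columns: one result per row
row_max = df.max(axis="columns")

print(col_max.to_dict())  # {'a': 3, 'b': 30}
print(row_max.tolist())   # [10, 20, 30]
```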

## TODO:
* Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.

In [14]:
max_f500=f500.max(numeric_only=True)

In [15]:
max_f500

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64

# 7. Dataframe Describe Method

* One difference between the series and dataframe `describe()` methods is that we need to specify manually if we want to see the statistics for the non-numeric columns.
* By default, `DataFrame.describe()` returns statistics for only the numeric columns.
* To get just the object columns, we use the `include=['O']` parameter.
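
A minimal sketch of both calls on a small dataframe follows. The data is hypothetical, and the text columns are stored explicitly as `object` dtype to match the `f500.info()` output shown earlier.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv); text columns forced to object dtype
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "country": ["USA", "USA", "China"],
    "revenues": [100, 200, 300],
}).astype({"company": "object", "country": "object"})

numeric_desc = df.describe()              # numeric columns only (the default)
object_desc = df.describe(include=["O"])  # object columns only

print(numeric_desc)
print(object_desc)
```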

## TODO
* Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.

In [16]:
# includes numeric columns

f500_desc=f500.describe()

In [17]:
f500_desc

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,previous_rank,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500.0,500.0,500.0,500.0
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,222.134,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,146.941961,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,0.0,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,92.75,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,219.5,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,347.25,23.0,168917.2,37828.5
max,500.0,485873.0,442.3,45687.0,3473238.0,8909.5,500.0,23.0,2300000.0,301893.0


In [18]:
# for object type columns only

f500_obj=f500.describe(include=['O'])
f500_obj

Unnamed: 0,company,ceo,industry,sector,country,hq_location,website
count,500,500,500,500,500,500,500
unique,500,500,58,21,34,235,500
top,Deutsche Bank,Lloyd C. Blankfein,Banks: Commercial and Savings,Financials,USA,"Beijing, China",http://www.altice.net
freq,1,1,51,118,132,56,1


# 8. Assignment with pandas

Previously, we concluded that companies with a previous rank of zero didn't have a previous rank at all. Next, we'll replace these values with a null value to clearly indicate that the value is missing.

We'll learn how to do two things so we can correct these values:

* **Perform assignment in pandas**
* **Use boolean indexing in pandas.**

* Just like in NumPy, the same techniques that we use to select data can be used for assignment. When we select a whole column by label and use assignment, we assign the value to every item in that column.

* By providing labels for both axes, we can assign a value to a single cell within our dataframe.
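
The sketch below shows both kinds of assignment on a small dataframe. The company names, CEO names, and the choice to use company names as the index are all assumptions made for illustration. It also shows what happens when the row label does not exist yet.

```python
import pandas as pd

# Hypothetical toy data; company names are used as the row labels (index)
df = pd.DataFrame(
    {"ceo": ["Ann", "Bob"], "rank": [1, 2]},
    index=["Alpha Corp", "Beta Inc"],
)

# Selecting a whole column by label assigns the value to every row
df["previous_rank"] = 0

# Labels for both axes assign a value to a single cell
df.loc["Beta Inc", "ceo"] = "Cara"

# If the row label does not exist, .loc appends a new row instead;
# the other columns of that row are filled with NaN
rows_before = len(df)
df.loc["Gamma Ltd", "ceo"] = "Dan"

print(df)
```

Because of this last behavior, label-based assignment only updates an existing row when the label is actually present in the index.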

## TODO:
* The company "Dow Chemical" has named a new CEO. Update the value where the row label is Dow Chemical and for the ceo column to Jim Fitterling in the f500 dataframe.

In [19]:
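# Caution: f500 has a default RangeIndex here, so 'Dow Chemical' is not an existing row label and this appends a new row; with company names as the index it would update the existing row.
f500.loc['Dow Chemical','ceo']='Jim Fitterling'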

# 9. Using Boolean Indexing with pandas Objects

* While it's helpful to be able to replace specific values when we know the row label ahead of time, this can be cumbersome when we need to replace many values. Instead, we can `use boolean indexing to change all rows that meet the same criteria`, just like we did with NumPy.
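
As a small sketch of this idea, the example below builds a boolean series, uses it to select values, and then uses it to assign a new value to every matching row. The dataframe is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv)
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "industry": ["Motor", "Banks", "Motor", "Energy"],
    "country": ["Japan", "USA", "Germany", "China"],
})

# The comparison returns one True/False value per row
motor_bool = df["industry"] == "Motor"

# Select: keep only the rows where the boolean series is True
motor_countries = df.loc[motor_bool, "country"]

# Assign: update every matching row in one statement
df.loc[motor_bool, "industry"] = "Motor Vehicles and Parts"

print(motor_countries.tolist())  # ['Japan', 'Germany']
print(df["industry"].tolist())
```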

## TODO:
* Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts".
* Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.


In [20]:
motor_bool=f500['industry'] =='Motor Vehicles and Parts'

In [21]:
motor_countries=f500.loc[motor_bool,'country'].value_counts()

In [22]:
motor_countries

Japan          10
China           7
Germany         6
France          3
South Korea     3
USA             2
India           1
Sweden          1
Canada          1
Name: country, dtype: int64

# 10. Using Boolean Arrays to Assign Values

**The `dropna=False` parameter stops the `Series.value_counts()` method from excluding null values when it counts, e.g. `Series.value_counts(dropna=False)`.**
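
The sketch below replaces zeros with `np.nan` using a boolean mask and then compares the counts with and without `dropna=False`. The series is hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from f500.csv)
prev = pd.Series([0, 5, 0, 7], dtype="float64")

# Replace every 0 with a null value using a boolean mask
prev.loc[prev == 0] = np.nan

without_nulls = prev.value_counts()           # NaN is excluded by default
with_nulls = prev.value_counts(dropna=False)  # NaN is counted like any other value

print(without_nulls.sum())  # 2
print(with_nulls.sum())     # 4
```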

## TODO
* Use boolean indexing to update values in the previous_rank column of the f500 dataframe:
  * There should now be a value of np.nan where there previously was a value of 0.
  * It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.
* Create a new pandas series, prev_rank_after, using the same syntax that was used to create the prev_rank_before series.

In [23]:
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
prev_rank_before

0.0      33
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64

In [24]:
import numpy as np
f500.loc[f500['previous_rank']==0,'previous_rank']=np.nan

In [25]:
prev_rank_after=f500['previous_rank'].value_counts(dropna=False).head()
print(prev_rank_after)

NaN      34
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64


# 11. Creating New Columns

## TODO
* Add a new column named rank_change to the f500 dataframe by subtracting the values in the rank column from the values in the previous_rank column.
* Use the Series.describe() method to return a series of descriptive statistics for the rank_change column. Assign the result to rank_change_desc.

In [26]:
f500['rank_change']=f500['previous_rank']-f500['rank']
rank_change_desc=f500['rank_change'].describe()
# Preview of the standalone rank_change series from section 2, not the new column
rank_change.head(8)

0    0
1    0
2    1
3   -1
4    3
5    1
6   -2
7    3
dtype: int64
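
The column-creation step in this section can be sketched on a small dataframe as follows. The values are hypothetical, with one missing previous rank included to show how null values carry through the subtraction.

```python
import pandas as pd

# Hypothetical toy data (not from f500.csv); None becomes NaN
df = pd.DataFrame({"rank": [1, 2, 3], "previous_rank": [2.0, None, 1.0]})

# Assigning to a column label that does not exist yet creates the column
df["rank_change"] = df["previous_rank"] - df["rank"]

# describe() skips null values, so count reflects only the valid rows
rank_change_desc = df["rank_change"].describe()

print(df)
print(rank_change_desc)
```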

# 12. Challenge: Top Performers by Country

In [27]:
top_3_performs=f500['country'].value_counts().head(3)
top_3_performs

USA      132
China    109
Japan     51
Name: country, dtype: int64

## TODO
* Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA.
* Create a series, sector_china, containing counts of the three most common values in the sector column for companies headquartered in China.

In [28]:
industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
sector_china = f500["sector"][f500["country"] == "China"].value_counts().head(3)
mean_employees_japan = f500["employees"][f500["country"] == "Japan"].mean()

In [29]:
industry_usa

Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Name: industry, dtype: int64

In [30]:
sector_china

Financials     25
Energy         22
Wholesalers     9
Name: sector, dtype: int64

In this mission, we learned:

* How to select data from pandas objects using boolean arrays.
* How to assign data using labels and boolean arrays.
* How to create new rows and columns in pandas.
* Many new methods to make data analysis easier in pandas.