#  Pandas - Exploring Data

Dataquest

In [1]:
import pandas as pd
import numpy as np

**Note:**
<br>
The previous notebook used an f500 dataset that was cleaned as part of the study.
<br>
This notebook uses a slightly different dataset with the same name, f500, that is already clean.

### How indexing works when importing CSV files with [pd.read_csv()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

✽ The index_col parameter specifies which column to use as the row labels. We use a value of 0 to specify that we want to use the first column.
<br> &emsp; This is different from last time, since the CSV no longer has a leading column that numbers the rows.
<br> &emsp; This way the company names become the index; without index_col, the index labels are integers starting from 0.
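To make the effect of index_col concrete, here is a minimal sketch that reads the same CSV text twice. The three-row sample is a hypothetical stand-in for f500.csv (only the values shown in this notebook are used), and the variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for f500.csv
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# Default import: pandas builds an integer index starting from 0
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (company) becomes the row labels
company_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())   # [0, 1, 2]
print(company_index.index.tolist())   # ['Walmart', 'State Grid', 'Sinopec Group']

# With string labels, rows can be selected by company name
print(company_index.loc["State Grid", "revenues"])  # 315199
```

With the company names as labels, `.loc` selects rows by name, while `.iloc` still selects by position in both versions.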

In [2]:
f500_test_string_indexing = pd.read_csv("f500.csv", index_col=0)

f500_test_string_indexing.iloc[:3,:3]

company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1


Notice that above the index labels is the text **company**.
<br>This is the value from the start of the first row of the CSV, effectively **the name of the first column**. 
<br>Pandas has used this value as the axis name for the index axis. Both the column and index axes can have names assigned to them. 
<br>**The next line of code removes that name:**

In [4]:
f500_test_string_indexing.index.name = None
f500_test_string_indexing.iloc[:3,:3]

Unnamed: 0,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
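Since both axes can carry a name, a short sketch may help show where those names live. The small DataFrame and the column-axis name "metric" below are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame whose index is named "company"
df = pd.DataFrame(
    {"rank": [1, 2], "revenues": [485873, 315199]},
    index=pd.Index(["Walmart", "State Grid"], name="company"),
)

# The column axis can be named too ("metric" is an illustrative name)
df.columns.name = "metric"

print(df.index.name)    # company
print(df.columns.name)  # metric

# Setting a name to None removes it; the labels themselves are unchanged
df.index.name = None
df.columns.name = None

print(df.index.name)       # None
print(df.index.tolist())   # ['Walmart', 'State Grid']
```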


---

**Normal importing**
<br>The rest of this notebook works with a numeric index, so the dataset is now imported the way it will actually be used:

In [5]:
f500 = pd.read_csv("f500.csv")

# Replace 0 in the 'previous_rank' column with NaN so it is treated as missing data
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan 

f500.head(3)

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
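The zero-to-NaN replacement matters because a placeholder 0 takes part in calculations, whereas NaN is skipped. The following sketch shows this on a hypothetical three-row frame; the column is created as float here, as an assumption, so that NaN can be stored without changing the column type.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0 stands for "no previous rank"
ranks = pd.DataFrame(
    {"company": ["A", "B", "C"], "previous_rank": [1.0, 0.0, 4.0]}
)

# The placeholder 0 is included in the mean
mean_with_zero = ranks["previous_rank"].mean()

# Boolean mask selects the rows, the column label selects the cell to overwrite
ranks.loc[ranks["previous_rank"] == 0, "previous_rank"] = np.nan

# NaN values are skipped by default, so only 1.0 and 4.0 are averaged
mean_without_zero = ranks["previous_rank"].mean()

print(mean_without_zero)                      # 2.5
print(ranks["previous_rank"].isnull().sum())  # 1
```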


<br>

### Comparison Operators in Pandas

<img src="comparison_operators_inpandas.jpg">
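As a quick illustration of how these operators behave, the sketch below applies them to two small Series built from values shown earlier in this notebook. Each comparison returns a boolean Series, and `&`, `|` and `~` combine or negate those Series element by element. The variable names are illustrative.

```python
import pandas as pd

companies = ["Walmart", "State Grid", "Sinopec Group"]
revenues = pd.Series([485873, 315199, 267518], index=companies)
countries = pd.Series(["USA", "China", "China"], index=companies)

# Each comparison yields one True/False value per row
over_300k = revenues > 300000
is_china = countries == "China"

# Parentheses are needed when comparisons are written inline,
# because & and | bind more tightly than > and ==
both = (revenues > 300000) & (countries == "China")
either = over_300k | is_china
not_china = ~is_china

print(revenues[both].index.tolist())       # ['State Grid']
print(revenues[not_china].index.tolist())  # ['Walmart']
```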

✽  Companies in f500 with more than 265 billion in revenue that are headquartered in China:

In [19]:
# If we were working with the dataset from the previous notebook,
# the country would be combined with the city in a column named location,
# so a good way to handle that case would be: f500["location"].str.endswith("China")

# Avoid naming the mask `bool`, which would shadow the Python built-in
china_big_rev = (f500["country"] == "China") & (f500["revenues"] > 265000)
final = f500.loc[china_big_rev, "company"]
final.head()

1       State Grid
2    Sinopec Group
Name: company, dtype: object

✽ Companies with revenues over 100 billion and negative profits:

In [21]:
big_rev_neg_profit = f500[
    (f500["revenues"] > 100000)
    & (f500["profits"] < 0)
]
big_rev_neg_profit.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
32,Japan Post Holdings,33,122990,3.6,-267.4,2631385,-107.5,Masatsugu Nagato,"Insurance: Life, Health (stock)",Financials,37.0,Japan,"Tokyo, Japan",http://www.japanpost.jp,21,248384,91532
44,Chevron,45,107567,-18.0,-497.0,260078,-110.8,John S. Watson,Petroleum Refining,Energy,31.0,USA,"San Ramon, CA",http://www.chevron.com,23,55200,145556


✽ The first 5 companies in the Technology sector that are not headquartered in the USA:

In [25]:
tech_outside_usa = f500[(f500["country"] != "USA") & (f500["sector"]=="Technology")].head(5)
tech_outside_usa

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
14,Samsung Electronics,15,173957,-2.0,19316.5,217104,16.8,Oh-Hyun Kwon,"Electronics, Electrical Equip.",Technology,13.0,South Korea,"Suwon, South Korea",http://www.samsung.com,23,325000,154376
26,Hon Hai Precision Industry,27,135129,-4.3,4608.8,80436,-0.4,Terry Gou,"Electronics, Electrical Equip.",Technology,25.0,Taiwan,"New Taipei City, Taiwan",http://www.foxconn.com,13,726772,33476
70,Hitachi,71,84558,1.2,2134.3,86742,48.8,Toshiaki Higashihara,"Electronics, Electrical Equip.",Technology,79.0,Japan,"Tokyo, Japan",http://www.hitachi.com,23,303887,26632
82,Huawei Investment & Holding,83,78511,24.9,5579.4,63837,-5.0,Ren Zhengfei,Network and Other Communications Equipment,Technology,129.0,China,"Shenzhen, China",http://www.huawei.com,8,180000,20159
104,Sony,105,70170,3.9,676.4,158519,-45.1,Kazuo Hirai,"Electronics, Electrical Equip.",Technology,113.0,Japan,"Tokyo, Japan",http://www.sony.net,23,128400,22415


<br>

### Adding a Series to a DataFrame by Index

First we create a series (by performing vectorized subtraction only on rows without a null value):<br>
* This time, the series is the difference between the rank the company had last year and its rank this year.

In [33]:
previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
rank_change.head()

0    0.0
1    0.0
2    1.0
3   -1.0
4    3.0
dtype: float64

Now, to assign this new series to our dataframe, we index the dataframe with a new column name and assign the series to it:

In [35]:
f500["rank_change"] = rank_change
f500.tail(2)

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467.0,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006,-32.0
499,AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310,
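The last row above has no rank_change because the series was built only from rows with a previous rank, and pandas matches values to rows by index label when assigning. A minimal sketch of that alignment, on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the row labelled 1 has no previous rank
df = pd.DataFrame({"rank": [1, 2, 3], "previous_rank": [1.0, np.nan, 2.0]})

# Keep only rows with a previous rank; their index labels are 0 and 2
ranked = df[df["previous_rank"].notnull()]
change = ranked["previous_rank"] - ranked["rank"]

# Assignment aligns on index labels, not on position:
# label 1 is missing from `change`, so that row receives NaN
df["rank_change"] = change

print(change.index.tolist())  # [0, 2]
print(df["rank_change"].isnull().tolist())  # [False, True, False]
```

Because alignment is by label, the shorter series does not need to be padded or reordered before it is assigned.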


f500.index.name = None  



The first point worth mentioning is:
    pandas uses NumPy objects behind the scenes to store the data.
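One way to see this, at least for a plain numeric column, is to ask a Series for its underlying values. The sketch below assumes a small integer Series built from the revenue figures shown earlier; other column types may be stored differently.

```python
import numpy as np
import pandas as pd

revenues = pd.Series([485873, 315199, 267518])

# to_numpy() returns the values as a NumPy array
values = revenues.to_numpy()

print(isinstance(values, np.ndarray))              # True
print(np.issubdtype(revenues.dtype, np.integer))   # True

# Vectorized arithmetic on the Series gives the same result as on the array
print((revenues * 2).to_numpy().tolist() == (values * 2).tolist())  # True
```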