# Introduction to Pandas

This notebook corresponds to mission 15 of [dataquest](https://www.dataquest.io).
<br>There is a short introduction to pandas [here](http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html) as well.

In [1]:
import pandas as pd
import numpy as np

We've been reading CSV files in several different ways, but the **most used** one, and the one we will be using from now on, is the [following](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

In [2]:
f500 = pd.read_csv('f500.csv')  #most used way

It is **important** to note that the function above returns a DataFrame object:

In [3]:
type(f500)

pandas.core.frame.DataFrame
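
As a minimal sketch of what `read_csv` does, the example below parses a small in-memory CSV instead of `f500.csv`. The sample rows and the name `sample_csv` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
sample_csv = io.StringIO(
    "company,revenue\n"
    "A,100\n"
    "B,250\n"
)

# read_csv accepts a path or a file-like object and returns a DataFrame
sample_df = pd.read_csv(sample_csv)
print(type(sample_df))
print(sample_df.shape)  # (rows, columns)
```

The first line of the CSV is used as the column labels, and the rows get the default integer index.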

### Data cleaning:

To make our data easier to understand, we will drop some of the columns.<br>
To do that we will use [df.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)<br>
**Syntax:**<br>
* _Basic:_ **df = df.drop('column_name', axis=1)**
* _To delete the column without having to reassign df:_ **df.drop('column_name', axis=1, inplace=True)**
* _Finally, to drop by column number instead of by column label:_ **df = df.drop(df.columns[[0, 1, 3]], axis=1)**
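
The three syntaxes can be tried on a toy dataframe. This is only a sketch: the dataframe `demo` and its column names are hypothetical and have nothing to do with `f500`.

```python
import pandas as pd

# Hypothetical data used only to illustrate the three syntaxes
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Basic: drop returns a new dataframe, the original is left untouched
without_a = demo.drop("a", axis=1)

# By position: look the labels up in demo.columns first
without_b_and_d = demo.drop(demo.columns[[1, 3]], axis=1)

# In place: modifies the dataframe itself and returns None
in_place = demo.copy()
in_place.drop("c", axis=1, inplace=True)

print(list(without_a.columns))
print(list(without_b_and_d.columns))
print(list(in_place.columns))
```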

In [4]:
list_of_drops = [0,1] + list(range(3,8)) + list(range(9,25)) + list(range(26,31)) + list(range(32,35)) + list(range(36,39)) + [40] + list(range(42,59)) + [60,61]

f500.drop(f500.columns[list_of_drops], axis=1, inplace=True)

<br>**Now let's rename the columns**<br>
[df.rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

In [5]:
f500 = f500.rename(columns={
                    f500.columns.values[0]:"revenue",
                    f500.columns.values[1]:"profit",
                    f500.columns.values[2]:"industry",
                    f500.columns.values[3]:"sector",
                    f500.columns.values[4]:"location",
                    f500.columns.values[5]:"years_on_top500",
                    f500.columns.values[6]:"employees"                   
})
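
A smaller sketch of the same idea, assuming a hypothetical dataframe `raw` with made-up column names:

```python
import pandas as pd

# Hypothetical dataframe with unhelpful column names
raw = pd.DataFrame({"col_0": [1, 2], "col_1": [3, 4]})

# rename maps old labels to new ones and returns a new dataframe;
# labels missing from the mapping are left as they are
renamed = raw.rename(columns={"col_0": "revenue"})

print(list(raw.columns))
print(list(renamed.columns))
```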

<br>

### Data FastView
To verify that it worked we can use a really handy method, [head()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html):<br>
**OBS:** it automatically shows the first 5 rows, but we can pass the number of rows we want as a parameter, e.g. df.head(10)

In [6]:
f500.head()

| | revenue | profit | industry | sector | location | years_on_top500 | employees | title |
|---|---|---|---|---|---|---|---|---|
| 0 | 485873 | 13643.0 | General Merchandisers | Retailing | Bentonville, AR | 23 | 2300000 | Walmart |
| 1 | 315199 | 9571.3 | Utilities | Energy | Beijing, China | 17 | 926067 | State Grid |
| 2 | 267518 | 1257.9 | Petroleum Refining | Energy | Beijing, China | 19 | 713288 | Sinopec Group |
| 3 | 262573 | 1867.5 | Petroleum Refining | Energy | Beijing, China | 17 | 1512048 | China National Petroleum |
| 4 | 254694 | 16899.3 | Motor Vehicles and Parts | Motor Vehicles & Parts | Toyota, Japan | 23 | 364445 | Toyota Motor |


✽ While head() shows us the first rows, we can use [tail()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) to see the last rows.

In [7]:
f500.tail(3)

| | revenue | profit | industry | sector | location | years_on_top500 | employees | title |
|---|---|---|---|---|---|---|---|---|
| 497 | 21741 | 406.4 | Food and Drug Stores | Food & Drug Stores | Bradford, Britain | 13 | 77210 | Wm. Morrison Supermarkets |
| 498 | 21655 | 1151.7 | Travel Services | Business Services | Hanover, Germany | 23 | 66779 | TUI |
| 499 | 21609 | 430.5 | Specialty Retailers | Retailing | Fort Lauderdale, FL | 12 | 26000 | AutoNation |


---

## Pandas Structures:

**DataFrame & Series**
Dataframes are two-dimensional pandas objects, the pandas equivalent of a NumPy 2D ndarray. Unlike NumPy, pandas does not use the same type for 1D and 2D arrays: the one-dimensional type is called a series.

## • DataFrame

<img src="df_anatomy.svg" style="width:500px; float:left;">

* In Red: Just like a 2D ndarray, there are two axes, however each axis of a dataframe has a specific name. The first axis is called index, and the second axis is called columns.
* In Blue: Our axis values have string labels, not just numeric locations.
* In Green: Our dataframe contains columns with multiple dtypes: integer, float, and string.

### Information

✽ **Types** [df.dtypes](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype) shows us the type of each column:

In [8]:
print(f500.dtypes)

revenue              int64
profit             float64
industry            object
sector              object
location            object
years_on_top500      int64
employees            int64
title               object
dtype: object


✽ We have seen the float64 dtype before in NumPy. Pandas uses NumPy dtypes for numeric columns, including int64. There is also a type we haven't seen before, object, which is used for columns whose data doesn't fit into any other dtype. This is almost always used for columns containing string values.
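
A quick sketch of how dtypes are inferred, using a hypothetical dataframe `types_demo`. The dtype reported for the text column may differ from `object` in pandas versions that use a dedicated string dtype by default, so treat the numeric columns as the reliable part of this example.

```python
import pandas as pd

# Hypothetical columns, one per kind of data
types_demo = pd.DataFrame({
    "whole": [1, 2, 3],          # integers
    "decimal": [1.5, 2.5, 3.5],  # floats
    "text": ["x", "y", "z"],     # strings
})

# One dtype per column
print(types_demo.dtypes)

# A single column is a series, which has one dtype available as .dtype
print(types_demo["whole"].dtype)
```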

✽ **INFO** Just as dtypes showed us the types, we can see the types and more information using [df.info()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)<br>
**OBS:** this is a common way to inspect the data as soon as we read it.

In [9]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
revenue            500 non-null int64
profit             499 non-null float64
industry           500 non-null object
sector             500 non-null object
location           500 non-null object
years_on_top500    500 non-null int64
employees          500 non-null int64
title              500 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 31.3+ KB


<br>

## Indexing

In a DataFrame there are 2 ways to select a specific element:<br>

* **Integer Indexing - iloc**<br>
Rows and columns are both indexed by integer position

In [10]:
f500.iloc[9, 2]


'Petroleum Refining'

* **Name Indexing - loc**<br>
Rows are indexed by their index labels (here the default integers 0 to 499)<br>
Columns are indexed by their names<br>
Unlike iloc, a loc slice includes its end label

In [11]:
f500.loc[0:3,"title"]

0                     Walmart
1                  State Grid
2               Sinopec Group
3    China National Petroleum
Name: title, dtype: object

In [12]:
f500.loc[0:3,["industry","title"]]

| | industry | title |
|---|---|---|
| 0 | General Merchandisers | Walmart |
| 1 | Utilities | State Grid |
| 2 | Petroleum Refining | Sinopec Group |
| 3 | Petroleum Refining | China National Petroleum |


In [13]:
f500.loc[0:3,"industry":"title"]

| | industry | sector | location | years_on_top500 | employees | title |
|---|---|---|---|---|---|---|
| 0 | General Merchandisers | Retailing | Bentonville, AR | 23 | 2300000 | Walmart |
| 1 | Utilities | Energy | Beijing, China | 17 | 926067 | State Grid |
| 2 | Petroleum Refining | Energy | Beijing, China | 19 | 713288 | Sinopec Group |
| 3 | Petroleum Refining | Energy | Beijing, China | 17 | 1512048 | China National Petroleum |



* **Obs1:** When selecting just 1 row or just 1 column in a DataFrame, the result is a series.
* **Obs2:** There is a third way, **.ix**, but it is deprecated and was removed in pandas 1.0.
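
The difference between the two indexers is easier to see when the row labels are not integers. The sketch below assumes a hypothetical dataframe `labeled` with string row labels.

```python
import pandas as pd

# Hypothetical dataframe with string row labels to make the difference visible
labeled = pd.DataFrame(
    {"score": [10, 20, 30, 40], "grade": ["d", "c", "b", "a"]},
    index=["w", "x", "y", "z"],
)

# loc selects by label and includes the end of the slice
by_label = labeled.loc["w":"y", "score"]

# iloc selects by position and excludes the end of the slice
by_position = labeled.iloc[0:2, 0]

# A single cell, selected both ways
same_cell = labeled.loc["x", "grade"] == labeled.iloc[1, 1]

print(by_label.tolist())     # three rows: w, x and y
print(by_position.tolist())  # two rows: positions 0 and 1
print(same_cell)
```

With the default 0..n-1 index, labels and positions look identical, which is why the end-of-slice behaviour is the main visible difference on `f500`.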

<br>

## • Series

Series is the pandas type for one-dimensional objects.

In [14]:
print(type(f500["title"]))

<class 'pandas.core.series.Series'>


Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.<br>
You might like to think of a dataframe as being a collection of series objects, which is similar to how pandas stores the data behind the scenes.

<img src="df_exploded.svg" style="width:500px; float:mid;">

<img src="seriesDataframes.jpg" >
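
To make the "collection of series" idea concrete, here is a small sketch that builds a dataframe from two hypothetical series and then takes one back out.

```python
import pandas as pd

# Two hypothetical series that share the same default index
names = pd.Series(["Kylie", "Rahul"], name="name")
ages = pd.Series([12, 8], name="age")

# Combining them column-wise gives a dataframe
people = pd.concat([names, ages], axis=1)

# Selecting one column gives a series back
print(type(people))
print(type(people["age"]))
print(people["age"].equals(ages))
```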

✽ Selecting a single row

In [15]:
single_row = f500.loc[2]
print(type(single_row))
print(single_row)

<class 'pandas.core.series.Series'>
revenue                        267518
profit                         1257.9
industry           Petroleum Refining
sector                         Energy
location               Beijing, China
years_on_top500                    19
employees                      713288
title                   Sinopec Group
Name: 2, dtype: object


✽ To select a list of rows:

In [16]:
list_rows = f500.loc[[1,2]]
print(type(list_rows))
print(list_rows)

<class 'pandas.core.frame.DataFrame'>
   revenue  profit            industry  sector        location  \
1   315199  9571.3           Utilities  Energy  Beijing, China   
2   267518  1257.9  Petroleum Refining  Energy  Beijing, China   

   years_on_top500  employees          title  
1               17     926067     State Grid  
2               19     713288  Sinopec Group  


✽ Selecting rows using slices:

In [17]:
slice_rows = f500[1:5]
print(type(slice_rows))
print(slice_rows)

<class 'pandas.core.frame.DataFrame'>
   revenue   profit                  industry                  sector  \
1   315199   9571.3                 Utilities                  Energy   
2   267518   1257.9        Petroleum Refining                  Energy   
3   262573   1867.5        Petroleum Refining                  Energy   
4   254694  16899.3  Motor Vehicles and Parts  Motor Vehicles & Parts   

         location  years_on_top500  employees                     title  
1  Beijing, China               17     926067                State Grid  
2  Beijing, China               19     713288             Sinopec Group  
3  Beijing, China               17    1512048  China National Petroleum  
4   Toyota, Japan               23     364445              Toyota Motor  


---

## Indexing summary for DataFrame and Series:

<img src="indexcomplet.jpg" >

**OBS:**<br>
While the _"Other Shorthand"_ presented in the table is rarely used,<br>
the _"Common Shorthand"_ is **frequently used**.

# Methods:

## Basic Methods:

* Series.max() and DataFrame.max()
* Series.min() and DataFrame.min()
* Series.mean() and DataFrame.mean()
* Series.median() and DataFrame.median()
* Series.mode() and DataFrame.mode()
* Series.sum() and DataFrame.sum()

Example:
For instance, to find the median (middle) value of the revenue and profit columns, we could use the following code:<br>
~~~
medians = f500[["revenue", "profit"]].median(axis=0)
# we could also use .median(axis="index")
~~~
_OBS:_ The default value for the axis parameter with these methods is axis=0, so in the case above a plain .median() gives the same result. As a refresher on the axis parameter:

<img src="axis_param.svg" style="width:480px"/>
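
The two axis values can be compared on a small example. This is a sketch with a hypothetical dataframe `grid`, whose numbers were chosen so the medians are easy to check by hand.

```python
import pandas as pd

# Hypothetical numbers chosen so the medians are easy to check by hand
grid = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# axis=0 (or "index"): walk down the rows, one result per column
per_column = grid.median(axis=0)

# axis=1 (or "columns"): walk across the columns, one result per row
per_row = grid.median(axis=1)

print(per_column.to_dict())
print(per_row.tolist())
```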

In [18]:
#Each of the methods above takes parameters that help solve a wide variety of problems, for example:
f500.max(numeric_only=True)  #numeric_only returns the max of the int and float columns only

revenue             485873.0
profit               45687.0
years_on_top500         23.0
employees          2300000.0
dtype: float64

## Index of the max value
As we can see above, it's easy to find the max value of a column, but if we want to analyse the company with the most employees, for example, we need to know its row index to access that information, and [.idxmax()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) is up to the task.

In [19]:
max_employees_index = f500["employees"].idxmax()
#idxmax returns an index label, so we look the row up with loc
f500.loc[max_employees_index, :]

revenue                           485873
profit                             13643
industry           General Merchandisers
sector                         Retailing
location                 Bentonville, AR
years_on_top500                       23
employees                        2300000
title                            Walmart
Name: 0, dtype: object
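
Note that idxmax returns an index label rather than an integer position. On `f500` the two coincide because of the default 0..n-1 index, but in general the label should be looked up with loc. A minimal sketch, assuming a hypothetical dataframe `staff` indexed by company name:

```python
import pandas as pd

# Hypothetical companies indexed by name instead of by 0..n-1
staff = pd.DataFrame(
    {"employees": [50, 900, 300]},
    index=["Alpha", "Beta", "Gamma"],
)

# idxmax returns the index label of the largest value
largest_label = staff["employees"].idxmax()
print(largest_label)

# A label must be looked up with loc; iloc expects an integer position
print(staff.loc[largest_label, "employees"])
```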

## Describe:
Returns descriptive statistics for the data contained in a pandas series or dataframe:<br>
[series.describe()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html)<br>
[df.describe()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

In [20]:
#First we select a series (a column)
revenue = f500["revenue"]
#Now the method can be used
revenue.describe()

#Or it can be done in one line:
#f500["revenue"].describe()

count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenue, dtype: float64

By default, DataFrame.describe() returns statistics for the numeric columns only. To get statistics for just the object columns, we need to pass the include=['O'] parameter to the dataframe version of describe:

In [21]:
print(f500.describe(include=['O']))

                             industry      sector        location  \
count                             500         500             500   
unique                             58          21             235   
top     Banks: Commercial and Savings  Financials  Beijing, China   
freq                               51         118              56   

                    title  
count                 500  
unique                500  
top     China Electronics  
freq                    1  


Another difference is that Series.describe() returns a series object, whereas DataFrame.describe() returns a dataframe object.

In [22]:
profits_desc = f500["profit"].describe()
revenue_and_employees_desc = f500[["revenue", "employees"]].describe()

print(profits_desc)
print()
print(revenue_and_employees_desc)

count      499.000000
mean      3055.203206
std       5171.981071
min     -13038.000000
25%        556.950000
50%       1761.600000
75%       3954.000000
max      45687.000000
Name: profit, dtype: float64

             revenue     employees
count     500.000000  5.000000e+02
mean    55416.358000  1.339983e+05
std     45725.478963  1.700878e+05
min     21609.000000  3.280000e+02
25%     29003.000000  4.293250e+04
50%     40236.000000  9.291050e+04
75%     63926.750000  1.689172e+05
max    485873.000000  2.300000e+06


In [23]:
#Descriptive statistics for every column in the f500 dataframe (include parameter)
all_desc = f500.describe(include="all")
all_desc

| | revenue | profit | industry | sector | location | years_on_top500 | employees | title |
|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 499.0 | 500 | 500 | 500 | 500.0 | 500.0 | 500 |
| unique | | | 58 | 21 | 235 | | | 500 |
| top | | | Banks: Commercial and Savings | Financials | Beijing, China | | | China Electronics |
| freq | | | 51 | 118 | 56 | | | 1 |
| mean | 55416.358 | 3055.203206 | | | | 15.036 | 133998.3 | |
| std | 45725.478963 | 5171.981071 | | | | 7.932752 | 170087.8 | |
| min | 21609.0 | -13038.0 | | | | 1.0 | 328.0 | |
| 25% | 29003.0 | 556.95 | | | | 7.0 | 42932.5 | |
| 50% | 40236.0 | 1761.6 | | | | 17.0 | 92910.5 | |
| 75% | 63926.75 | 3954.0 | | | | 23.0 | 168917.2 | |


## Counts
This method displays each unique non-null value from a series, with a count of the number of times that value occurs.<br>
[Series.value_counts()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)<br>
**Here we call it on a series** (recent pandas versions also provide DataFrame.value_counts(), which counts unique rows)

In [24]:
#First we select the sector column
#Selecting a single column gives us a series, which allows the use of value_counts
#value_counts counts the number of times each sector appears and sorts the counts in descending order
#head then selects the first 5 elements of the series, which have the largest counts
#Finally the result is printed

print(f500["sector"].value_counts().head())


Financials                118
Energy                     80
Technology                 44
Motor Vehicles & Parts     34
Wholesalers                28
Name: sector, dtype: int64


An important parameter of this method is **dropna**, which controls whether NaN (not a number) values are counted: by default (dropna=True) they are left out, and dropna=False includes them.
<br>Sometimes tables use zero to represent missing data; we can replace those zeros with np.nan and then use the counts/head technique like this:
<br> **df["someColumn"].value_counts(dropna=False).head()**

It is important to notice that, although pandas supports most NumPy types, NaN is a float value, so pandas converts an integer column to float when it comes across a NaN.
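
Both points can be seen in a short sketch. The series `visits` is hypothetical, and it assumes that 0 was used to mean "no data".

```python
import numpy as np
import pandas as pd

# Hypothetical column where 0 was used to mean "no data"
visits = pd.Series([3, 0, 3, 5, 0])
print(visits.dtype)

# Replacing the zeros with NaN forces the integers to become floats
visits_clean = visits.replace(0, np.nan)
print(visits_clean.dtype)

# By default NaN is left out of the counts; dropna=False includes it
counted = visits_clean.value_counts()
counted_with_nan = visits_clean.value_counts(dropna=False)
print(counted_with_nan)
```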

---

# Assigning values

In [25]:
#Creating a dataFrame that we can modify (copy() keeps it independent from f500)
top5_rank_revenue = f500[["title", "revenue"]].head().copy()
print(top5_rank_revenue)
print(type(top5_rank_revenue))

                      title  revenue
0                   Walmart   485873
1                State Grid   315199
2             Sinopec Group   267518
3  China National Petroleum   262573
4              Toyota Motor   254694
<class 'pandas.core.frame.DataFrame'>


<br>✽ Assigning a value to an entire column:

In [26]:
top5_rank_revenue["revenue"] = 0
print(top5_rank_revenue)

                      title  revenue
0                   Walmart        0
1                State Grid        0
2             Sinopec Group        0
3  China National Petroleum        0
4              Toyota Motor        0


<br>✽ By providing labels for both axes, we can assign to a single value within our dataframe:
<br> In this example we will use the [**.index**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html) attribute

In [27]:
#Let's say we want to change the 'State Grid' revenue to 999
#For this we need to pass 2 parameters (row and column)
#The column is easy, it's revenue
#For the row, we need the index value of the row whose title is 'State Grid'
#For this we will use .index[]
top5_rank_revenue.loc[top5_rank_revenue.index[top5_rank_revenue['title'] == 'State Grid'], "revenue"] = 999
print(top5_rank_revenue)

                      title  revenue
0                   Walmart        0
1                State Grid      999
2             Sinopec Group        0
3  China National Petroleum        0
4              Toyota Motor        0


<br>✽  **Important observation**: assigning to a label that does not exist creates a new column or row. If we assign a value using an index or column label that does not exist, pandas will create a new row or column in our dataframe.

In [28]:
top5_rank_revenue["WRONG_Column"] = 7
print(top5_rank_revenue)

                      title  revenue  WRONG_Column
0                   Walmart        0             7
1                State Grid      999             7
2             Sinopec Group        0             7
3  China National Petroleum        0             7
4              Toyota Motor        0             7


In [29]:
top5_rank_revenue.loc["WRONG_ROW", "revenue"] = 7
print(top5_rank_revenue)

                              title  revenue  WRONG_Column
0                           Walmart      0.0           7.0
1                        State Grid    999.0           7.0
2                     Sinopec Group      0.0           7.0
3          China National Petroleum      0.0           7.0
4                      Toyota Motor      0.0           7.0
WRONG_ROW                       NaN      7.0           NaN


<br>✽ **Boolean indexing**: comparing a series with a value gives a boolean series that can be used to select rows

In [30]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"
print(motor_bool.head())

0    False
1    False
2    False
3    False
4     True
Name: industry, dtype: bool


In [31]:
#Creating a dataFrame as an example:
df = pd.DataFrame({'name': ['Kylie', 'Rahul','Michael', 'Sarah'],
                            'age': [12,8,5,8]})
print(df.head())

      name  age
0    Kylie   12
1    Rahul    8
2  Michael    5
3    Sarah    8


In [32]:
#Who is 8 years old?
num_bool = df["age"] == 8
print(num_bool)
print("----------------------------")

result = df[num_bool]
print(result)
print("----------------------------")

#What if I only wanted their names?
names_eight = df.loc[num_bool, 'name']
print(names_eight)

0    False
1     True
2    False
3     True
Name: age, dtype: bool
----------------------------
    name  age
1  Rahul    8
3  Sarah    8
----------------------------
1    Rahul
3    Sarah
Name: name, dtype: object
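
A boolean series can also be used on the left-hand side of an assignment. The sketch below rebuilds the small name/age example under the hypothetical name `kids` so that it runs on its own.

```python
import pandas as pd

# Same small example as above, rebuilt so this block runs on its own
kids = pd.DataFrame({"name": ["Kylie", "Rahul", "Michael", "Sarah"],
                     "age": [12, 8, 5, 8]})

# Boolean series: True where the age is 8
is_eight = kids["age"] == 8

# loc[rows, column] with a boolean mask assigns only to the matching rows
kids.loc[is_eight, "age"] = 9
print(kids)
```

Rows where the mask is False keep their original values.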


<br>_Example_: We want to find the 5 most common cities for companies belonging to the 'Motor Vehicles and Parts' industry

In [33]:
#First we separate by industry
motor_industry_bool = f500["industry"] == 'Motor Vehicles and Parts' 
only_motors = f500[motor_industry_bool]


#Now we use the count/head technique to see the top 5 cities
print(only_motors["location"].value_counts().head())

Tokyo, Japan          3
Seoul, South Korea    3
Kariya, Japan         2
Stuttgart, Germany    2
Wuhan, China          1
Name: location, dtype: int64


<br>

## Shortcut to some analyses
To simplify this kind of task we can use attribute access to select a column: df.column_name

✽ Imagine we want the mean number of employees in Beijing, China.

In [34]:
f500.employees[f500["location"] == "Beijing, China"].mean()

241890.19642857142

<br>✽ Now the 3 most common sectors among companies in Beijing:

In [35]:
f500.sector[f500["location"] == "Beijing, China"].value_counts().head(3)

Financials                    13
Energy                        10
Engineering & Construction     7
Name: sector, dtype: int64

<br>✽ What is the mean number of years that Beijing companies have spent on the top 500 list?

In [36]:
f500.years_on_top500[f500["location"] == "Beijing, China"].mean()


8.696428571428571
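
One caveat with the attribute shorthand used in this section: it only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method. A minimal sketch, assuming a hypothetical dataframe `shops` in which the column `count` deliberately collides with `DataFrame.count`:

```python
import pandas as pd

# Hypothetical data; "count" deliberately clashes with a dataframe method
shops = pd.DataFrame({
    "city": ["Lyon", "Lyon", "Nice"],
    "staff": [10, 30, 20],
    "count": [1, 2, 3],
})

# Attribute access and bracket access return the same series
print(shops.staff.equals(shops["staff"]))

# The shorthand combined with a boolean filter, as in the cells above
lyon_mean = shops.staff[shops["city"] == "Lyon"].mean()
print(lyon_mean)

# shops.count is the DataFrame.count method, not the "count" column
print(callable(shops.count))
print(shops["count"].tolist())
```

Bracket access always works, so it is the safer choice when a column name is ambiguous.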