## Pandas - DataFrame
Probably the most important data structure of pandas is the DataFrame. It's a tabular structure tightly integrated with Series.

In [21]:
import pandas as pd
import numpy as np

In [22]:
df = pd.DataFrame( {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
                   'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
                   'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
                   'HDI': [0.913, 0.888, 0.0916, 0.873, 0.891, 0.907, 0.915],
                   'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America']})
#                  ,columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'] )
# the columns argument is optional; it selects and orders the columns

In [23]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europe
2,80.94,3874437,357114,0.0916,Europe
3,60.665,2167744,301336,0.873,Europe
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


DataFrames also have an index. Pandas automatically assigned an auto-incrementing integer index to each 'row' in our DataFrame. To use
the country names as the index instead, assign to the index attribute, just as with a Series.

In [24]:
df.index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US']

In [25]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.0916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
UK,64.511,2950039,242495,0.907,Europe
US,318.523,17348075,9525067,0.915,America
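The row labels can also be supplied when the DataFrame is built, through the constructor's `index` argument. A minimal sketch, assuming a hypothetical three-country subset named `small` (not part of the original notebook):

```python
import pandas as pd

# Hypothetical three-country subset of the data above, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'Continent': ['America', 'Europe', 'Europe']},
    index=['Canada', 'France', 'Germany'],  # row labels set at construction time
)

print(small.index.tolist())
```

This gives the same kind of labeled index as assigning to `df.index` afterwards.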


In [26]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [28]:
df.size

35

In [30]:
df.shape # (rows, columns); the index is not counted as a column

(7, 5)
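The relationship between `size`, `shape` and `len` can be checked on a small frame. This is only a sketch; `demo` is a made-up DataFrame used for illustration:

```python
import pandas as pd

# Made-up 3x2 frame, only to show how the attributes relate.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}, index=['x', 'y', 'z'])

n_rows, n_cols = demo.shape      # shape is a (rows, columns) tuple
total_cells = demo.size          # size is rows * columns
row_count = len(demo)            # len() counts rows only

print(n_rows, n_cols, total_cells, row_count)
```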

In [31]:
df.describe()

Unnamed: 0,Population,GDP,Surface Area,HDI
count,7.0,7.0,7.0,7.0
mean,107.302571,5080248.0,3061327.0,0.782657
std,97.24997,5494020.0,4576187.0,0.305102
min,35.467,1785387.0,242495.0,0.0916
25%,62.308,2500716.0,329225.0,0.8805
50%,64.511,2950039.0,377930.0,0.891
75%,104.0005,4238402.0,5082873.0,0.91
max,318.523,17348080.0,9984670.0,0.915


In [32]:
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [34]:
df.dtypes.value_counts()

int64      2
float64    2
object     1
dtype: int64

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to US
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


### Indexing, Selection and Slicing
Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a Series.
* **df[]** accesses the DataFrame by column.
* **loc[first_arg]** and **iloc[first_arg]** access the DataFrame by row: loc by index label, iloc by integer position.
* **loc[first_arg, second_arg]** and **iloc[first_arg, second_arg]** access the DataFrame by row and column at once.
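The three access styles listed above can be compared side by side. A minimal sketch, assuming a hypothetical subset named `small` built from the first three countries:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94], 'GDP': [1785387, 2833687, 3874437]},
    index=['Canada', 'France', 'Germany'],
)

col = small['GDP']                   # column by name -> Series
row_by_label = small.loc['France']   # row by index label -> Series
row_by_pos = small.iloc[1]           # row by integer position -> Series
cell = small.loc['France', 'GDP']    # row label + column label -> scalar
same_cell = small.iloc[1, 1]         # row position + column position -> scalar

print(cell, same_cell)
```

Here `loc` and `iloc` reach the same row and the same cell; they differ only in whether you address it by label or by position.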

In [44]:
df['Population']

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: Population, dtype: float64

Note that the index of the returned Series is the same as the DataFrame's, and its name is the name of the column. If you're working
in a notebook and want to see a more DataFrame-like format, you can use the 'to_frame' method.

In [41]:
df['Population'].to_frame()

Unnamed: 0,Population
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
UK,64.511
US,318.523


Multiple columns can be chosen by passing a list of names, the same way as with NumPy arrays and Series. In this case the result is another DataFrame (as with to_frame()).

In [43]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
UK,64.511,2950039
US,318.523,17348075


Slicing works differently: it acts at **'row level'**, which can be counterintuitive.

In [45]:
df[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.0916,Europe


**Row level** selection works better with **loc** and **iloc**, which are recommended over regular 'direct' slicing (df[:]).

When a single row is selected, the result is a Series whose index holds the DataFrame's column names, so it looks transposed compared with the DataFrame.

In [78]:
df.loc['Italy'] # a single row comes back as a Series, so it looks transposed

Population       60.665
GDP             2167744
Surface Area     301336
HDI               0.873
Continent        Europe
Name: Italy, dtype: object
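If you want a single row to keep its tabular shape, passing a one-element list to `loc` returns a DataFrame instead of a Series. A small sketch on a hypothetical subset named `small`:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94], 'GDP': [1785387, 2833687, 3874437]},
    index=['Canada', 'France', 'Germany'],
)

as_series = small.loc['France']     # scalar label -> Series (columns become the index)
as_frame = small.loc[['France']]    # list of labels -> DataFrame with one row

print(type(as_series).__name__, type(as_frame).__name__)
```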

In [79]:
df.loc['Italy':'US']

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
UK,64.511,2950039,242495,0.907,Europe
US,318.523,17348075,9525067,0.915,America
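Note that the label slice above includes its end label ('US'). Position-based slices with `iloc` follow the usual Python rule and exclude the end position. A minimal sketch of the difference, using a hypothetical subset named `small`:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

by_label = small.loc['Canada':'France']   # label slice: 'France' is included
by_pos = small.iloc[0:1]                  # position slice: position 1 is excluded

print(by_label.index.tolist(), by_pos.index.tolist())
```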


And with the second argument, you can add the column(s) that you want to get:

In [52]:
df.loc['Italy', 'Population'], df.loc['Italy':'US', ['Population', 'GDP']]

(60.665,
        Population       GDP
 Italy      60.665   2167744
 Japan     127.061   4602367
 UK         64.511   2950039
 US        318.523  17348075)

In [85]:
df.iloc[0,3]

0.913

In [89]:
df.iloc[[0,3,-1]] # pass a list to pick several rows by position; iloc[0, 3] would mean row 0, column 3

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
Italy,60.665,2167744,301336,0.873,Europe
US,318.523,17348075,9525067,0.915,America


In [90]:
df.iloc[0:3,3], df.iloc[0:3, 0:3], df.iloc[0:3]

(Canada     0.9130
 France     0.8880
 Germany    0.0916
 Name: HDI, dtype: float64,
          Population      GDP  Surface Area
 Canada       35.467  1785387       9984670
 France       63.951  2833687        640679
 Germany      80.940  3874437        357114,
          Population      GDP  Surface Area     HDI Continent
 Canada       35.467  1785387       9984670  0.9130   America
 France       63.951  2833687        640679  0.8880    Europe
 Germany      80.940  3874437        357114  0.0916    Europe)

In [75]:
df.iloc[0::2] # start at position 0 and take every 2nd row

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
Germany,80.94,3874437,357114,0.0916,Europe
Japan,127.061,4602367,377930,0.891,Asia
US,318.523,17348075,9525067,0.915,America


> **RECOMMENDED**: Always use loc and iloc to avoid ambiguity, especially with numeric indexes in a DataFrame

### Conditional Selection (boolean arrays)
We saw conditional selection applied to Series, and it works the same way for a DataFrame, since a DataFrame is a collection of Series.

In [93]:
df['Population'] > 70

Canada     False
France     False
Germany     True
Italy      False
Japan       True
UK         False
US          True
Name: Population, dtype: bool

In [96]:
df.loc[df['Population']>70]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.0916,Europe
Japan,127.061,4602367,377930,0.891,Asia
US,318.523,17348075,9525067,0.915,America


In [101]:
df.loc[df['Population']>80, ['Population', 'GDP']]

Unnamed: 0,Population,GDP
Germany,80.94,3874437
Japan,127.061,4602367
US,318.523,17348075
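Several conditions can be combined into one boolean mask with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a hypothetical subset named `small`:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'Continent': ['America', 'Europe', 'Europe']},
    index=['Canada', 'France', 'Germany'],
)

# Rows that are in Europe AND have a population above 60 (millions).
mask = (small['Population'] > 60) & (small['Continent'] == 'Europe')
selected = small.loc[mask, ['Population']]

# Rows that are NOT in Europe.
not_europe = small.loc[~(small['Continent'] == 'Europe')]

print(selected.index.tolist(), not_europe.index.tolist())
```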


### Dropping stuff
As opposed to selection, we have 'dropping'. Instead of pointing out which values you'd like to select, you point out
which ones you'd like to **drop**.

In [102]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.0916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
UK,64.511,2950039,242495,0.907,Europe
US,318.523,17348075,9525067,0.915,America


In [113]:
df.drop('Canada')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.0916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
UK,64.511,2950039,242495,0.907,Europe
US,318.523,17348075,9525067,0.915,America


In [120]:
df.drop(columns='Population'), df.drop(columns= ['Population', 'GDP'])

(              GDP  Surface Area     HDI Continent
 Canada    1785387       9984670  0.9130   America
 France    2833687        640679  0.8880    Europe
 Germany   3874437        357114  0.0916    Europe
 Italy     2167744        301336  0.8730    Europe
 Japan     4602367        377930  0.8910      Asia
 UK        2950039        242495  0.9070    Europe
 US       17348075       9525067  0.9150   America,
          Surface Area     HDI Continent
 Canada        9984670  0.9130   America
 France         640679  0.8880    Europe
 Germany        357114  0.0916    Europe
 Italy          301336  0.8730    Europe
 Japan          377930  0.8910      Asia
 UK             242495  0.9070    Europe
 US            9525067  0.9150   America)

In [124]:
df.drop(['Italy', 'Japan'], axis=0) # or axis='rows'

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.0916,Europe
UK,64.511,2950039,242495,0.907,Europe
US,318.523,17348075,9525067,0.915,America


In [125]:
df.drop(['Population', 'GDP'], axis=1) # or axis='columns'

Unnamed: 0,Surface Area,HDI,Continent
Canada,9984670,0.913,America
France,640679,0.888,Europe
Germany,357114,0.0916,Europe
Italy,301336,0.873,Europe
Japan,377930,0.891,Asia
UK,242495,0.907,Europe
US,9525067,0.915,America


All these drop examples return a new DataFrame. If you'd like to modify the original DataFrame **in place**, you can pass the **inplace** parameter:

In [None]:
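# Assumes a 'Language' column exists (one is added under "Modifying DataFrames" below);
# without it, this call raises a KeyError.
df.drop(columns='Language', inplace=True)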
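An alternative to `inplace=True` is to keep the returned DataFrame under a name of your choice. The sketch below also uses `errors='ignore'`, which makes `drop` skip labels that are not present instead of raising. The frame `small` is a hypothetical subset used only for illustration:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

trimmed = small.drop(columns='GDP')   # new DataFrame; `small` is untouched

# 'Language' is not a column here; errors='ignore' avoids the KeyError.
unchanged = small.drop(columns='Language', errors='ignore')

print(small.columns.tolist(), trimmed.columns.tolist())
```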

### Operations
Normally, operations don't change the underlying DataFrame; they return a new object. If you want to keep the result, assign it to a name, e.g. new_name = df[['Population', 'GDP']] * 100

In [127]:
df[['Population', 'GDP']] * 100 # other operators work the same way: + - / ** ...

Unnamed: 0,Population,GDP
Canada,3546.7,178538700
France,6395.1,283368700
Germany,8094.0,387443700
Italy,6066.5,216774400
Japan,12706.1,460236700
UK,6451.1,295003900
US,31852.3,1734807500


Operations with a Series work at the column level: the Series' index labels are matched against the column names, and each value is broadcast down the rows of its column (which can be counterintuitive).

In [129]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

In [132]:
df[['GDP', 'HDI']] + crisis

Unnamed: 0,GDP,HDI
Canada,785387.0,0.613
France,1833687.0,0.588
Germany,2874437.0,-0.2084
Italy,1167744.0,0.573
Japan,3602367.0,0.591
UK,1950039.0,0.607
US,16348075.0,0.615
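To match the Series against the row labels instead of the column names, the method form `add` accepts `axis=0`. A minimal sketch with made-up numbers; `small`, `per_column` and `per_row` are illustrative names:

```python
import pandas as pd

# Made-up values, chosen so the arithmetic is easy to follow.
small = pd.DataFrame({'GDP': [100.0, 200.0], 'HDI': [0.9, 0.8]}, index=['A', 'B'])

per_column = pd.Series({'GDP': -10.0, 'HDI': -0.1})   # labels match the columns
per_row = pd.Series({'A': 1.0, 'B': 2.0})             # labels match the index

by_col = small + per_column            # default: align on column labels
by_row = small.add(per_row, axis=0)    # align on row labels instead

print(by_col)
print(by_row)
```

In `by_col` every GDP value drops by 10; in `by_row` every value in row 'B' rises by 2.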


### Modifying DataFrames
It's simple and intuitive: you can add columns or replace the values of a column without issues.

In [135]:
langs = pd.Series(['French', 'German', 'Italian'], index= ['France', 'Germany', 'Italy'], name = 'language')
langs

France      French
Germany     German
Italy      Italian
Name: language, dtype: object

In [136]:
df['Language'] = langs

In [137]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.0916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
UK,64.511,2950039,242495,0.907,Europe,
US,318.523,17348075,9525067,0.915,America,
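The empty cells above are missing values (NaN): when a Series is assigned as a column, pandas matches it to the rows by index label, not by position, and rows without a matching label get NaN. A small sketch on a hypothetical subset named `small`:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame({'Population': [35.467, 63.951, 80.94]},
                     index=['Canada', 'France', 'Germany'])

# Different order than the DataFrame, and no entry for Canada.
langs = pd.Series({'Germany': 'German', 'France': 'French'})

small['Language'] = langs   # matched by label, not by position

print(small)
```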


### Replacing values per column

In [147]:
df['Language'] = 'English'
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,English
Germany,80.94,3874437.0,357114.0,0.0916,Europe,English
Italy,60.665,2167744.0,301336.0,0.873,Europe,English
Japan,127.061,4602367.0,377930.0,0.891,Asia,English
UK,64.511,2950039.0,242495.0,0.907,Europe,English
US,318.523,17348075.0,9525067.0,0.915,America,English
0,,,,,,English


In [145]:
df.loc['Canada', 'Language'] = 'Canadian' # without the column label, the whole row would be overwritten

In [146]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,Canadian
France,63.951,2833687.0,640679.0,0.888,Europe,English
Germany,80.94,3874437.0,357114.0,0.0916,Europe,English
Italy,60.665,2167744.0,301336.0,0.873,Europe,English
Japan,127.061,4602367.0,377930.0,0.891,Asia,English
UK,64.511,2950039.0,242495.0,0.907,Europe,English
US,318.523,17348075.0,9525067.0,0.915,America,English
0,,,,,,Canadian


In [149]:
df.loc['Canada','Language'] = 'English'
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,English
Germany,80.94,3874437.0,357114.0,0.0916,Europe,English
Italy,60.665,2167744.0,301336.0,0.873,Europe,English
Japan,127.061,4602367.0,377930.0,0.891,Asia,English
UK,64.511,2950039.0,242495.0,0.907,Europe,English
US,318.523,17348075.0,9525067.0,0.915,America,English
0,,,,,,English


### Renaming Columns

In [167]:
df.rename(columns = {'HDI':'Human Development Index',
                    'Annual Popcorn Consumption' :'APC'},  # no such column: silently ignored
                      index = {'US': 'United States',
                              'UK': 'United Kingdom',
                              'Argentina': 'AG'})           # no such row label: silently ignored

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Continent
Canada,35.467,1785387.0,9984670.0,0.913,America
France,63.951,2833687.0,640679.0,0.888,Europe
Germany,80.94,3874437.0,357114.0,0.0916,Europe
Italy,60.665,2167744.0,301336.0,0.873,Europe
Japan,127.061,4602367.0,377930.0,0.891,Asia
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe
United States,318.523,17348075.0,9525067.0,0.915,America


In [169]:
df.rename(index = lambda x:x.lower())

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
canada,35.467,1785387.0,9984670.0,0.913,America
france,63.951,2833687.0,640679.0,0.888,Europe
germany,80.94,3874437.0,357114.0,0.0916,Europe
italy,60.665,2167744.0,301336.0,0.873,Europe
japan,127.061,4602367.0,377930.0,0.891,Asia
uk,64.511,2950039.0,242495.0,0.907,Europe
us,318.523,17348075.0,9525067.0,0.915,America


In [171]:
df.rename(index = str.upper)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
CANADA,35.467,1785387.0,9984670.0,0.913,America
FRANCE,63.951,2833687.0,640679.0,0.888,Europe
GERMANY,80.94,3874437.0,357114.0,0.0916,Europe
ITALY,60.665,2167744.0,301336.0,0.873,Europe
JAPAN,127.061,4602367.0,377930.0,0.891,Asia
UK,64.511,2950039.0,242495.0,0.907,Europe
US,318.523,17348075.0,9525067.0,0.915,America
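By default `rename` skips mapping keys that match nothing, which can hide typos. Passing `errors='raise'` turns an unmatched key into a `KeyError`. A minimal sketch, assuming a hypothetical frame `small` and a deliberately non-existent column name 'Missing':

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame({'HDI': [0.913, 0.888]}, index=['Canada', 'France'])

# 'Missing' matches no column, so it is skipped silently by default.
renamed = small.rename(columns={'HDI': 'Human Development Index', 'Missing': 'X'})

# errors='raise' reports the unmatched key instead.
try:
    small.rename(columns={'Missing': 'X'}, errors='raise')
    unmatched_detected = False
except KeyError:
    unmatched_detected = True

print(renamed.columns.tolist(), unmatched_detected)
```

As with the examples above, `rename` returns a new DataFrame and leaves `small` unchanged.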


### Adding values

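Using the append() method (deprecated in pandas 1.4 and removed in pandas 2.0, so this cell only runs on older versions):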

In [180]:
df.append(pd.Series({'Population':3, 'GDP':5}, name='China'))

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387.0,9984670.0,0.913,America
France,63.951,2833687.0,640679.0,0.888,Europe
Germany,80.94,3874437.0,357114.0,0.0916,Europe
Italy,60.665,2167744.0,301336.0,0.873,Europe
Japan,127.061,4602367.0,377930.0,0.891,Asia
UK,64.511,2950039.0,242495.0,0.907,Europe
US,318.523,17348075.0,9525067.0,0.915,America
China,3.0,5.0,,,
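On pandas 2.0 and later, where `DataFrame.append` no longer exists, `pd.concat` can give an equivalent result. The sketch below is one possible translation, not the notebook's original code; `small` is a hypothetical subset and the values 3 and 5 are the same placeholders used above:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame({'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
                     index=['Canada', 'France'])

new_row = pd.Series({'Population': 3, 'GDP': 5}, name='China')

# to_frame().T turns the Series into a one-row DataFrame labeled by its name.
extended = pd.concat([small, new_row.to_frame().T])

print(extended)
```

Like `append`, `pd.concat` returns a new DataFrame and leaves `small` as it was.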


Or you can directly assign the new row to the DataFrame with loc, using the new index label:

In [190]:
df.loc['China'] = pd.Series({'Population':1_400_000_000, 'Continent': 'Asia'})

In [191]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387.0,9984670.0,0.913,America
France,63.951,2833687.0,640679.0,0.888,Europe
Germany,80.94,3874437.0,357114.0,0.0916,Europe
Italy,60.665,2167744.0,301336.0,0.873,Europe
Japan,127.061,4602367.0,377930.0,0.891,Asia
UK,64.511,2950039.0,242495.0,0.907,Europe
US,318.523,17348075.0,9525067.0,0.915,America
China,1400000000.0,,,,Asia


In [192]:
df.drop('China', inplace=True)

In [194]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387.0,9984670.0,0.913,America
France,63.951,2833687.0,640679.0,0.888,Europe
Germany,80.94,3874437.0,357114.0,0.0916,Europe
Italy,60.665,2167744.0,301336.0,0.873,Europe
Japan,127.061,4602367.0,377930.0,0.891,Asia
UK,64.511,2950039.0,242495.0,0.907,Europe
US,318.523,17348075.0,9525067.0,0.915,America


### More radical index changes

In [195]:
df.reset_index()

Unnamed: 0,index,Population,GDP,Surface Area,HDI,Continent
0,Canada,35.467,1785387.0,9984670.0,0.913,America
1,France,63.951,2833687.0,640679.0,0.888,Europe
2,Germany,80.94,3874437.0,357114.0,0.0916,Europe
3,Italy,60.665,2167744.0,301336.0,0.873,Europe
4,Japan,127.061,4602367.0,377930.0,0.891,Asia
5,UK,64.511,2950039.0,242495.0,0.907,Europe
6,US,318.523,17348075.0,9525067.0,0.915,America


In [196]:
df.set_index('Population')

Population,GDP,Surface Area,HDI,Continent
35.467,1785387.0,9984670.0,0.913,America
63.951,2833687.0,640679.0,0.888,Europe
80.94,3874437.0,357114.0,0.0916,Europe
60.665,2167744.0,301336.0,0.873,Europe
127.061,4602367.0,377930.0,0.891,Asia
64.511,2950039.0,242495.0,0.907,Europe
318.523,17348075.0,9525067.0,0.915,America


In [197]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387.0,9984670.0,0.913,America
France,63.951,2833687.0,640679.0,0.888,Europe
Germany,80.94,3874437.0,357114.0,0.0916,Europe
Italy,60.665,2167744.0,301336.0,0.873,Europe
Japan,127.061,4602367.0,377930.0,0.891,Asia
UK,64.511,2950039.0,242495.0,0.907,Europe
US,318.523,17348075.0,9525067.0,0.915,America
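As the unchanged `df` above shows, `reset_index` and `set_index` return new DataFrames. They can be chained to move the labels into a column and back again. A minimal sketch; the column name 'Country' is an illustrative choice and `small` is a hypothetical subset:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame({'Population': [35.467, 63.951]}, index=['Canada', 'France'])

flat = small.reset_index()                         # unnamed index becomes a column called 'index'
flat = flat.rename(columns={'index': 'Country'})   # give that column a clearer name
restored = flat.set_index('Country')               # use the column as the index again

print(flat.columns.tolist(), restored.index.tolist())
```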


### Creating columns from other column
Altering a DataFrame often involves combining different columns into another. For example, in our countries analysis, we could try to calculate the 'GDP per capita', which is just GDP / Population.

In [198]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387.0
France,63.951,2833687.0
Germany,80.94,3874437.0
Italy,60.665,2167744.0
Japan,127.061,4602367.0
UK,64.511,2950039.0
US,318.523,17348075.0


The regular pandas way of expressing that is just dividing one Series by the other:

In [200]:
df['GDP']/ df['Population']

Canada     50339.385908
France     44310.284437
Germany    47868.013343
Italy      35733.025633
Japan      36221.712406
UK         45729.239975
US         54464.120330
dtype: float64

The result of that operation is just another Series that you can add to the original DataFrame.

In [201]:
df['GDP per capita'] = df['GDP'] / df['Population']

In [202]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,GDP per capita
Canada,35.467,1785387.0,9984670.0,0.913,America,50339.385908
France,63.951,2833687.0,640679.0,0.888,Europe,44310.284437
Germany,80.94,3874437.0,357114.0,0.0916,Europe,47868.013343
Italy,60.665,2167744.0,301336.0,0.873,Europe,35733.025633
Japan,127.061,4602367.0,377930.0,0.891,Asia,36221.712406
UK,64.511,2950039.0,242495.0,0.907,Europe,45729.239975
US,318.523,17348075.0,9525067.0,0.915,America,54464.12033
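If you would rather not modify the original DataFrame, `assign` returns a copy with the extra column. Because 'GDP per capita' contains spaces, it has to be passed through dictionary unpacking. A sketch on a hypothetical subset named `small`:

```python
import pandas as pd

# Hypothetical subset of the countries data, for illustration only.
small = pd.DataFrame({'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
                     index=['Canada', 'France'])

# assign() builds a new DataFrame; `small` keeps its two original columns.
with_ratio = small.assign(**{'GDP per capita': small['GDP'] / small['Population']})

print(with_ratio.columns.tolist())
```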


### Statistical info

You've already seen the describe method, which gives you a good summary of the DataFrame. Let's explore other methods in more detail:

In [206]:
df.head()

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,GDP per capita
Canada,35.467,1785387.0,9984670.0,0.913,America,50339.385908
France,63.951,2833687.0,640679.0,0.888,Europe,44310.284437
Germany,80.94,3874437.0,357114.0,0.0916,Europe,47868.013343
Italy,60.665,2167744.0,301336.0,0.873,Europe,35733.025633
Japan,127.061,4602367.0,377930.0,0.891,Asia,36221.712406
