# Pandas


## Import convention

In [None]:
import pandas as pd
import numpy as np

In [None]:
!pip install pandas --upgrade --user

In [None]:
pd.__version__

# What is Pandas?

Pandas can be thought of as an enhanced version of NumPy arrays: rows and columns can be identified with labels instead of just simple integer indices.

There are **three** main pandas elements we **need** to understand.
1. Pandas Series
2. Pandas DataFrame
3. Index
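
As a quick orientation, the three elements listed above can be sketched as follows. The names and numbers here are made up for illustration only; they are not part of the notebook's data.

```python
import pandas as pd

# A Series: 1-D values plus an index of labels (labels here are hypothetical)
grades = pd.Series([7.5, 9.0, 6.0], index=["ana", "bruno", "carla"])

# A DataFrame: a 2-D table whose rows and columns both carry labels
table = pd.DataFrame(
    {"grade": [7.5, 9.0, 6.0], "age": [21, 25, 30]},
    index=["ana", "bruno", "carla"],
)

# The Index is the object that stores those labels
print(type(grades).__name__)        # Series
print(type(table).__name__)         # DataFrame
print(type(grades.index).__name__)  # Index
print(grades["bruno"])              # 9.0
```

The same `Index` idea is used for the row labels of the Series, the row labels of the DataFrame, and the DataFrame's column names.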

# The Pandas Series

A pandas series is a one-dimensional (**1-D**) indexed array.

In [None]:
np.random.randint(1,10,size=2)

In [None]:
np.array([1,2,3])

In [None]:
pd

In [None]:
pd.Series()

## Creating a pandas Series from a list

You'll start to recognize a pandas Series by its visual representation: the index on the left, the values on the right, and the `dtype` on the last line.

In [None]:
np.array([1,2,3])

In [None]:
pd.Series([5,8,3])

`dtype` means the `data type` of what is inside your pandas Series.

In [None]:
# As with lists, the elements of a pandas Series don't all need to have the same type

pd.Series(['a', 2, 3])

When you see `dtype: object`, it usually means your `Series` holds `str` values or a mix of types.
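
A small sketch of how the `dtype` follows the contents of the Series. The exact integer width can depend on platform and pandas version, so treat the printed names as typical rather than guaranteed.

```python
import pandas as pd

ints = pd.Series([5, 8, 3])        # only integers
floats = pd.Series([5, 8.5, 3])    # one float upgrades the whole Series
mixed = pd.Series(["a", 2, 3])     # str and int together

print(ints.dtype)    # typically int64
print(floats.dtype)  # float64
print(mixed.dtype)   # object

# In an object Series each element keeps its own Python type
print([type(v).__name__ for v in mixed])  # ['str', 'int', 'int']
```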

In [None]:
data = pd.Series([10,23,3,43,25,136])

In [None]:
data

In [None]:
type(data)

So the `type` of `data` is `pandas.Series`, and the type of the values stored inside it is `int`.

## Accessing elements 

Elements can be accessed just as in a NumPy array.

In [None]:
data

In [None]:
data[0]

In [None]:
data[0:6]

In [None]:
data[:]

In short: a pandas Series can be thought of as a 1-D NumPy array.

### What is the difference then? Numpy array vs Pandas Series

Mostly the index notation.

NumPy arrays only have the **implicit** index given by each element's position. By using an **explicit** index, pandas Series are much more flexible. For example:

## Index labels don't need to be numbers

In [31]:
my_series = pd.Series(index=['Andre','Raiana','Guilherme','Gisele','Raiana','Gabriel'],data=[1,2,3,5,7,9]) #index argument
my_series

Andre        1
Raiana       2
Guilherme    3
Gisele       5
Raiana       7
Gabriel      9
dtype: int64

## Index labels don't need to be in sequence

In [24]:
data = pd.Series(data=[1,2,3,4], 
                 index=[1,7,4313,19])

In [25]:
data

1       1
7       2
4313    3
19      4
dtype: int64

### Then how can I access these pandas series?

In [26]:
my_series

Andre        1
Raiana       2
Guilherme    3
Gisele       5
Raiana       7
Gabriel      9
dtype: int64

In [27]:
my_series['Raiana']

Raiana    2
Raiana    7
dtype: int64

In [28]:
my_series['Raiana'].mean()

4.5
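
To make the difference between label and position explicit, here is a minimal sketch using `.loc` (by label) and `.iloc` (by position) on Series shaped like the ones above. The shortened `names` Series is an assumption made to keep the example small.

```python
import pandas as pd

data = pd.Series(data=[1, 2, 3, 4], index=[1, 7, 4313, 19])

print(data.loc[7])      # 2: the value stored under label 7
print(data.iloc[0])     # 1: the value at the first position
print(data.loc[4313])   # 3

# With repeated labels, a label lookup returns every match
names = pd.Series([1, 2, 7], index=["Andre", "Raiana", "Raiana"])
print(names.loc["Andre"])            # 1 (unique label, single value)
print(names.loc["Raiana"].tolist())  # [2, 7]
```

Using `.loc` and `.iloc` instead of plain `[]` avoids ambiguity when the labels are themselves integers.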

**NOTE:** One can therefore think of a pandas Series as a kind of dictionary, in which the index labels are the keys and the stored data are the values. Unlike dictionary keys, index labels may repeat, as `'Raiana'` does above.
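
The dictionary analogy can be checked directly. This is a small sketch with a two-element Series; it only uses dictionary-style operations that a Series also supports.

```python
import pandas as pd

s = pd.Series({"RODRIGO": 20, "ANDRE": 10})

print("ANDRE" in s)     # True: membership tests the index labels (the "keys")
print(20 in s)          # False: the values are not searched
print(list(s.keys()))   # ['RODRIGO', 'ANDRE']

# items() yields (label, value) pairs, like dict.items()
for name, value in s.items():
    print(name, value)

print(s.to_dict())      # {'RODRIGO': 20, 'ANDRE': 10}
```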

In [32]:
my_series

Andre        1
Raiana       2
Guilherme    3
Gisele       5
Raiana       7
Gabriel      9
dtype: int64

In [33]:
my_series.keys()

Index(['Andre', 'Raiana', 'Guilherme', 'Gisele', 'Raiana', 'Gabriel'], dtype='object')

In [37]:
my_series.values

array([1, 2, 3, 5, 7, 9], dtype=int64)

In [35]:
my_series * 3

Andre         3
Raiana        6
Guilherme     9
Gisele       15
Raiana       21
Gabriel      27
dtype: int64

## Creating a pandas Series from a dict

In [38]:
my_dict = {'RODRIGO': 20, 
           'ANDRE':10}

In [39]:
my_dict

{'RODRIGO': 20, 'ANDRE': 10}

In [40]:
pd.Series(my_dict)

RODRIGO    20
ANDRE      10
dtype: int64

But what about data with more than one dimension?


# Pandas DataFrame


Pandas DataFrames can be thought of as a generalization of **2-D** NumPy arrays. Again, they add flexibility through labels on both the rows (the index) and the columns.
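
A minimal sketch of that generalization: the same 2-D values, first as a plain NumPy array and then as a DataFrame with row and column labels. The numbers mirror the age/height values used later in this notebook.

```python
import numpy as np
import pandas as pd

arr = np.array([[25, 177],
                [28, 175]])

frame = pd.DataFrame(arr, index=["RODRIGO", "ANDRE"], columns=["idade", "altura"])

print(arr[1, 0])                    # 28: position only
print(frame.iloc[1, 0])             # 28: position, NumPy style
print(frame.loc["ANDRE", "idade"])  # 28: the same cell, addressed by labels
print(frame.shape == arr.shape)     # True
```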

In [41]:
pd.DataFrame()

In [42]:
type(pd.DataFrame())

pandas.core.frame.DataFrame

## A pandas DataFrame can be thought of as a group of pandas Series

In [43]:
my_dict = {'RODRIGO': 25, 
           'ANDRE':28}

data = pd.Series(my_dict)

In [44]:
data

RODRIGO    25
ANDRE      28
dtype: int64

In [45]:
another_dict = {'RODRIGO': 177,'ANDRE': 175}

data_2 = pd.Series(another_dict)

In [46]:
data_2

RODRIGO    177
ANDRE      175
dtype: int64

In [47]:
my_series

Andre        1
Raiana       2
Guilherme    3
Gisele       5
Raiana       7
Gabriel      9
dtype: int64

In [48]:
my_series2 = pd.Series(['Professor','TA','Prof','Aluno','Aluna','Aluna'], index=['Andre','Raiana','Guilherme','Gisele','Raiana','Gabriel'])
my_series2

Andre        Professor
Raiana              TA
Guilherme         Prof
Gisele           Aluno
Raiana           Aluna
Gabriel          Aluna
dtype: object

In [49]:
my_dict = {'nota': my_series, 'cargo': my_series2}
my_dict

{'nota': Andre        1
 Raiana       2
 Guilherme    3
 Gisele       5
 Raiana       7
 Gabriel      9
 dtype: int64,
 'cargo': Andre        Professor
 Raiana              TA
 Guilherme         Prof
 Gisele           Aluno
 Raiana           Aluna
 Gabriel          Aluna
 dtype: object}

In [156]:
df_2=pd.DataFrame(my_dict)#.describe()

In [53]:
pd.DataFrame(my_dict).describe()

Unnamed: 0,nota
count,6.0
mean,4.5
std,3.082207
min,1.0
25%,2.25
50%,4.0
75%,6.5
max,9.0


# Creating a DataFrame as a collection of Series

In [54]:
{'idade':data, 'altura':data_2}

{'idade': RODRIGO    25
 ANDRE      28
 dtype: int64,
 'altura': RODRIGO    177
 ANDRE      175
 dtype: int64}

In [77]:
my_dataframe = pd.DataFrame({'idade':data, 'altura muito louca ':data_2})
my_dataframe 

Unnamed: 0,idade,altura muito louca
RODRIGO,25,177
ANDRE,28,175


**NOTE:** A DataFrame can be thought of as a dictionary in which the `keys` are the `column names` and the `values` are the `pandas Series` themselves.
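
The dictionary view of a DataFrame can be sketched like this. The variable names are chosen for the example; the data match the age and height Series built above.

```python
import pandas as pd

idade = pd.Series({"RODRIGO": 25, "ANDRE": 28})
altura = pd.Series({"RODRIGO": 177, "ANDRE": 175})
frame = pd.DataFrame({"idade": idade, "altura": altura})

print(list(frame.keys()))   # ['idade', 'altura']: the keys are the column names
print("altura" in frame)    # True
print("RODRIGO" in frame)   # False: row labels are not keys

# items() yields (column name, Series) pairs
for col, series in frame.items():
    print(col, type(series).__name__, series.to_dict())
```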

In [71]:
my_dataframe['altura']

RODRIGO    177
ANDRE      175
Name: altura, dtype: int64

In [66]:
my_dataframe.altura

RODRIGO    177
ANDRE      175
Name: altura, dtype: int64

In [80]:
my_dataframe.columns= ['idade', 'altura']

In [78]:
new_columns=[]
for c in my_dataframe.columns:
    new_columns.append(c.strip())
my_dataframe.columns=new_columns

In [79]:
new_columns

['idade', 'altura muito louca']
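
The loop above can also be written without an explicit `for`. The sketch below is one possible alternative, not the notebook's own method: it strips whitespace with the `.str` accessor of the column index and then renames a single column with `rename`. The small frame is rebuilt here so the example runs on its own.

```python
import pandas as pd

frame = pd.DataFrame(
    {"idade": [25, 28], "altura muito louca ": [177, 175]},
    index=["RODRIGO", "ANDRE"],
)

# Strip leading/trailing whitespace from every column name at once
frame.columns = frame.columns.str.strip()
print(list(frame.columns))   # ['idade', 'altura muito louca']

# Rename one column by name; rename returns a new DataFrame
frame = frame.rename(columns={"altura muito louca": "altura"})
print(list(frame.columns))   # ['idade', 'altura']

# Attribute access only works for names that are valid Python identifiers
print(frame.altura.tolist())  # [177, 175]
```

Renaming by name is less fragile than assigning a full list to `columns`, because it does not depend on the column order.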

# `Access` Methods: Accessing DataFrame rows and columns

`.loc` and `.iloc` are the recommended way to access data in a DataFrame. You can specify both row and column, or only the row.

In [81]:
my_dataframe

Unnamed: 0,idade,altura
RODRIGO,25,177
ANDRE,28,175


## `dataframe.loc[row_name, col_name]`

In [116]:
my_dataframe.loc['RODRIGO', 'idade']

25

In [83]:
my_dataframe.loc[:, 'idade']

RODRIGO    25
ANDRE      28
Name: idade, dtype: int64

In [84]:
my_dataframe.loc['ANDRE', 'altura']

175

In [88]:
my_dataframe['altura']['ANDRE']  # chained indexing: select the column first, then the row label

175

## `dataframe.iloc[row_number, col_number]`

In [89]:
my_dataframe

Unnamed: 0,idade,altura
RODRIGO,25,177
ANDRE,28,175


In [92]:
my_dataframe.iloc[0, 0]

25

In [93]:
my_dataframe.iloc[1, 1]

175

In [97]:
my_dataframe.iloc[-1, -1]

175

What is the difference between selecting a column via `my_dataframe['idade']` and via `my_dataframe.loc[:, 'idade']`?

# Creating dataframes

## From a list in 1-D

In [100]:
my_list = [1,2,3]

In [101]:
np.array(my_list)

array([1, 2, 3])

In [102]:
np.array(my_list).shape

(3,)

In [103]:
pd.DataFrame(data=my_list)

Unnamed: 0,0
0,1
1,2
2,3


In [104]:
pd.DataFrame(data=my_list, columns=['notas'], index=['Andre','Rai','Rodrigo'])

Unnamed: 0,notas
Andre,1
Rai,2
Rodrigo,3


## From a list in > 1-D (let's remember numpy arrays here!)

In [105]:
my_list = [[1,2,3],
           [-5,-6,-7]]

In [106]:
np.array(my_list)

array([[ 1,  2,  3],
       [-5, -6, -7]])

In [107]:
np.array(my_list).shape

(2, 3)

In [108]:
df = pd.DataFrame(data=my_list, columns=['idade','peso','altura'])
df

Unnamed: 0,idade,peso,altura
0,1,2,3
1,-5,-6,-7


In [109]:
df.shape

(2, 3)

## From a dictionary of lists

In [110]:
pd.DataFrame({'ironhack_students': ['a','b','c'],
              'NOTA':[10, 10, 0]})

Unnamed: 0,ironhack_students,NOTA
0,a,10
1,b,10
2,c,0


## From a numpy array

In [111]:
a = np.random.random(size=(5, 4))
a

array([[0.31640789, 0.3495478 , 0.04542596, 0.90165774],
       [0.80883859, 0.32951372, 0.86457096, 0.13683244],
       [0.16568874, 0.69845172, 0.62149579, 0.30688947],
       [0.2058912 , 0.28072697, 0.50766449, 0.74809371],
       [0.49087636, 0.79384728, 0.20301596, 0.26710354]])

In [124]:
data = pd.DataFrame(a, columns=['altura', 'peso', 'idade', 'largura'])

In [113]:
data

Unnamed: 0,altura,peso,idade,largura
0,0.316408,0.349548,0.045426,0.901658
1,0.808839,0.329514,0.864571,0.136832
2,0.165689,0.698452,0.621496,0.306889
3,0.205891,0.280727,0.507664,0.748094
4,0.490876,0.793847,0.203016,0.267104


### Accessing rows and columns:

#### `.loc`

Remember: `.loc` receives `[row_name, column_name]`, so rows are selected by their index label, not by their position.

In [121]:
data

Unnamed: 0,altura,peso,idade,largura
67,0.316408,0.349548,0.045426,0.901658
423,0.808839,0.329514,0.864571,0.136832
214,0.165689,0.698452,0.621496,0.306889
532,0.205891,0.280727,0.507664,0.748094
321,0.490876,0.793847,0.203016,0.267104


In [123]:
# get the row labeled 423 and column `peso`

data.loc[423, 'peso']

0.3295137230817692

In [125]:
# get entire third row

data.loc[2, :]

altura     0.165689
peso       0.698452
idade      0.621496
largura    0.306889
Name: 2, dtype: float64

In [126]:
# get entire `idade` column

# data['idade']
# data.idade
data.loc[:, 'idade']

0    0.045426
1    0.864571
2    0.621496
3    0.507664
4    0.203016
Name: idade, dtype: float64

In [128]:
data.loc[:,:]

Unnamed: 0,altura,peso,idade,largura
0,0.316408,0.349548,0.045426,0.901658
1,0.808839,0.329514,0.864571,0.136832
2,0.165689,0.698452,0.621496,0.306889
3,0.205891,0.280727,0.507664,0.748094
4,0.490876,0.793847,0.203016,0.267104


In [127]:
# get all rows from column `peso` up to `largura`

data.loc[:, 'peso':'largura']

Unnamed: 0,peso,idade,largura
0,0.349548,0.045426,0.901658
1,0.329514,0.864571,0.136832
2,0.698452,0.621496,0.306889
3,0.280727,0.507664,0.748094
4,0.793847,0.203016,0.267104


In [129]:
data.loc[:, ['peso','idade','largura']]

Unnamed: 0,peso,idade,largura
0,0.349548,0.045426,0.901658
1,0.329514,0.864571,0.136832
2,0.698452,0.621496,0.306889
3,0.280727,0.507664,0.748094
4,0.793847,0.203016,0.267104


In [130]:
data.loc[:3, ['peso','idade','largura']]

Unnamed: 0,peso,idade,largura
0,0.349548,0.045426,0.901658
1,0.329514,0.864571,0.136832
2,0.698452,0.621496,0.306889
3,0.280727,0.507664,0.748094


In [131]:
data.loc[:-1, ['peso','idade','largura']]

Unnamed: 0,peso,idade,largura


In [132]:
data.loc[:, 'idade']

0    0.045426
1    0.864571
2    0.621496
3    0.507664
4    0.203016
Name: idade, dtype: float64

In [134]:
data.loc[:, 'idade'].shape

(5,)

In [133]:
data.loc[:, ['idade']]

Unnamed: 0,idade
0,0.045426
1,0.864571
2,0.621496
3,0.507664
4,0.203016


In [135]:
data.loc[:, ['idade']].shape

(5, 1)
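
The two shapes above differ because a single column label returns a 1-D Series, while a list of labels (even a list of one) returns a 2-D DataFrame. A minimal sketch with deterministic values:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=["altura", "peso", "idade", "largura"],
)

as_series = data.loc[:, "idade"]    # single label
as_frame = data.loc[:, ["idade"]]   # list containing one label

print(type(as_series).__name__, as_series.shape)  # Series (5,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (5, 1)
print(as_series.name)                             # idade
print(as_frame.columns.tolist())                  # ['idade']
```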

#### `.iloc`

In [136]:
data.iloc[0:4, :]

Unnamed: 0,altura,peso,idade,largura
0,0.316408,0.349548,0.045426,0.901658
1,0.808839,0.329514,0.864571,0.136832
2,0.165689,0.698452,0.621496,0.306889
3,0.205891,0.280727,0.507664,0.748094


In [137]:
data.iloc[0:4, 1:4]

Unnamed: 0,peso,idade,largura
0,0.349548,0.045426,0.901658
1,0.329514,0.864571,0.136832
2,0.698452,0.621496,0.306889
3,0.280727,0.507664,0.748094


In [138]:
data.iloc[-1, :]

altura     0.490876
peso       0.793847
idade      0.203016
largura    0.267104
Name: 4, dtype: float64

## Math operations

In [139]:
data = np.random.random(size=(8, 4))

In [140]:
df = pd.DataFrame(data, columns=['Andre','Rai','Rodrigo','Vamp'])

In [141]:
df

Unnamed: 0,Andre,Rai,Rodrigo,Vamp
0,0.599211,0.908881,0.358237,0.179483
1,0.367232,0.137365,0.492369,0.261942
2,0.427096,0.101282,0.706888,0.675554
3,0.025222,0.503995,0.120277,0.230367
4,0.700838,0.538928,0.477258,0.153079
5,0.516804,0.059523,0.44505,0.762666
6,0.308967,0.211154,0.567479,0.580374
7,0.57525,0.138184,0.979351,0.929731


In [144]:
df#.transpose()

Unnamed: 0,Andre,Rai,Rodrigo,Vamp
0,0.599211,0.908881,0.358237,0.179483
1,0.367232,0.137365,0.492369,0.261942
2,0.427096,0.101282,0.706888,0.675554
3,0.025222,0.503995,0.120277,0.230367
4,0.700838,0.538928,0.477258,0.153079
5,0.516804,0.059523,0.44505,0.762666
6,0.308967,0.211154,0.567479,0.580374
7,0.57525,0.138184,0.979351,0.929731


In [146]:
df.mean()

Andre      0.440078
Rai        0.324914
Rodrigo    0.518364
Vamp       0.471649
dtype: float64

In [149]:
df.mean(axis=1)

0    0.511453
1    0.314727
2    0.477705
3    0.219965
4    0.467526
5    0.446011
6    0.416993
7    0.655629
dtype: float64
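
The `axis` argument decides the direction of the reduction. The tiny frame below is an illustrative assumption, chosen so the means are easy to verify by hand.

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Default (axis=0): collapse the rows, one result per column
print(df_small.mean().to_dict())               # {'a': 2.0, 'b': 20.0}

# axis=1: collapse the columns, one result per row
print(df_small.mean(axis=1).tolist())          # [5.5, 11.0, 16.5]

# The string alias "columns" means the same as axis=1
print(df_small.mean(axis="columns").tolist())  # [5.5, 11.0, 16.5]
```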

In [152]:
df.std()

Andre      0.211203
Rai        0.298165
Rodrigo    0.251372
Vamp       0.301655
dtype: float64

In [163]:
df_2#.describe()

Unnamed: 0,nota,cargo
Andre,1,Professor
Raiana,2,TA
Guilherme,3,Prof
Gisele,5,Aluno
Raiana,7,Aluna
Gabriel,9,Aluna


In [162]:
df

Unnamed: 0,Andre,Rai,Rodrigo,Vamp
0,0.599211,0.908881,0.358237,0.179483
1,0.367232,0.137365,0.492369,0.261942
2,0.427096,0.101282,0.706888,0.675554
3,0.025222,0.503995,0.120277,0.230367
4,0.700838,0.538928,0.477258,0.153079
5,0.516804,0.059523,0.44505,0.762666
6,0.308967,0.211154,0.567479,0.580374
7,0.57525,0.138184,0.979351,0.929731


In [164]:
df.transpose().describe()

Unnamed: 0,0,1,2,3,4,5,6,7
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.511453,0.314727,0.477705,0.219965,0.467526,0.446011,0.416993,0.655629
std,0.315874,0.15117,0.280431,0.207078,0.229858,0.291351,0.185633,0.389075
min,0.179483,0.137365,0.101282,0.025222,0.153079,0.059523,0.211154,0.138184
25%,0.313548,0.230798,0.345642,0.096513,0.396213,0.348669,0.284513,0.465984
50%,0.478724,0.314587,0.551325,0.175322,0.508093,0.480927,0.438223,0.752491
75%,0.676629,0.398516,0.683388,0.298774,0.579405,0.57827,0.570703,0.942136
max,0.908881,0.492369,0.706888,0.503995,0.700838,0.762666,0.580374,0.979351


In [165]:
df

Unnamed: 0,Andre,Rai,Rodrigo,Vamp
0,0.599211,0.908881,0.358237,0.179483
1,0.367232,0.137365,0.492369,0.261942
2,0.427096,0.101282,0.706888,0.675554
3,0.025222,0.503995,0.120277,0.230367
4,0.700838,0.538928,0.477258,0.153079
5,0.516804,0.059523,0.44505,0.762666
6,0.308967,0.211154,0.567479,0.580374
7,0.57525,0.138184,0.979351,0.929731


In [166]:
df['Total'] = df['Andre'] + df['Vamp']
df

Unnamed: 0,Andre,Rai,Rodrigo,Vamp,Total
0,0.599211,0.908881,0.358237,0.179483,0.778694
1,0.367232,0.137365,0.492369,0.261942,0.629174
2,0.427096,0.101282,0.706888,0.675554,1.10265
3,0.025222,0.503995,0.120277,0.230367,0.255589
4,0.700838,0.538928,0.477258,0.153079,0.853917
5,0.516804,0.059523,0.44505,0.762666,1.27947
6,0.308967,0.211154,0.567479,0.580374,0.88934
7,0.57525,0.138184,0.979351,0.929731,1.504981
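
Column arithmetic like this is element-wise and aligned on the index labels, not on position. The sketch below uses small made-up numbers to show the same `Total` computed two ways, and what alignment means when two Series list their labels in a different order.

```python
import pandas as pd

scores = pd.DataFrame({"Andre": [1.0, 2.0], "Vamp": [0.5, 0.25]})

# New column from two existing ones
scores["Total"] = scores["Andre"] + scores["Vamp"]
print(scores["Total"].tolist())   # [1.5, 2.25]

# Same result by summing the selected columns row by row
row_sum = scores[["Andre", "Vamp"]].sum(axis=1)
print(row_sum.tolist())           # [1.5, 2.25]

# Alignment: values are matched by label before being added
x = pd.Series([1, 2], index=["a", "b"])
y = pd.Series([10, 20], index=["b", "a"])
print((x + y).to_dict())          # {'a': 21, 'b': 12}
```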


In [167]:
df[['Rai','Rodrigo']]

Unnamed: 0,Rai,Rodrigo
0,0.908881,0.358237
1,0.137365,0.492369
2,0.101282,0.706888
3,0.503995,0.120277
4,0.538928,0.477258
5,0.059523,0.44505
6,0.211154,0.567479
7,0.138184,0.979351


In [None]:
df[['Andre','Vamp','Total']]

Unnamed: 0,Andre,Vamp,Total
0,0.599211,0.179483,0.778694
1,0.367232,0.261942,0.629174
2,0.427096,0.675554,1.10265
3,0.025222,0.230367,0.255589
4,0.700838,0.153079,0.853917
5,0.516804,0.762666,1.27947
6,0.308967,0.580374,0.88934
7,0.57525,0.929731,1.504981


In [None]:
df.mean()

In [None]:
df['Andre']

# Pandas Index

In [169]:
pd.Index([1,2,3])

Int64Index([1, 2, 3], dtype='int64')

In [170]:
df

Unnamed: 0,Andre,Rai,Rodrigo,Vamp,Total
0,0.599211,0.908881,0.358237,0.179483,0.778694
1,0.367232,0.137365,0.492369,0.261942,0.629174
2,0.427096,0.101282,0.706888,0.675554,1.10265
3,0.025222,0.503995,0.120277,0.230367,0.255589
4,0.700838,0.538928,0.477258,0.153079,0.853917
5,0.516804,0.059523,0.44505,0.762666,1.27947
6,0.308967,0.211154,0.567479,0.580374,0.88934
7,0.57525,0.138184,0.979351,0.929731,1.504981


In [172]:
list(df.index)

[0, 1, 2, 3, 4, 5, 6, 7]

In [173]:
df.index.values

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)
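
Two properties make an `Index` more than a plain array of labels: it is immutable, and it supports set-like operations. A short sketch under those assumptions (the exact error message may vary between pandas versions):

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Individual labels cannot be changed in place
try:
    idx[0] = 10
except TypeError as exc:
    print("Index is immutable:", exc)

# Set-like operations between two Index objects
other = pd.Index([2, 3, 4])
print(idx.intersection(other).tolist())  # [2, 3]
print(idx.union(other).tolist())         # [1, 2, 3, 4]

# Replacing the whole index of a DataFrame is allowed
frame = pd.DataFrame({"x": [1, 2]})
frame.index = ["a", "b"]
print(frame.loc["b", "x"])               # 2
```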

In [174]:
df.loc[0,'Rai']

0.90888139263271

In [175]:
df.iloc[0,1]

0.90888139263271

In [176]:
df['Rai'][0]

0.90888139263271

In [178]:
df.T.describe()

Unnamed: 0,0,1,2,3,4,5,6,7
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,0.564901,0.377616,0.602694,0.22709,0.544804,0.612703,0.511463,0.825499
std,0.298523,0.192132,0.37026,0.180041,0.263601,0.450105,0.265456,0.507754
min,0.179483,0.137365,0.101282,0.025222,0.153079,0.059523,0.211154,0.138184
25%,0.358237,0.261942,0.427096,0.120277,0.477258,0.44505,0.308967,0.57525
50%,0.599211,0.367232,0.675554,0.230367,0.538928,0.516804,0.567479,0.929731
75%,0.778694,0.492369,0.706888,0.255589,0.700838,0.762666,0.580374,0.979351
max,0.908881,0.629174,1.10265,0.503995,0.853917,1.27947,0.88934,1.504981


In [179]:
df.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7
Andre,0.599211,0.367232,0.427096,0.025222,0.700838,0.516804,0.308967,0.57525
Rai,0.908881,0.137365,0.101282,0.503995,0.538928,0.059523,0.211154,0.138184
Rodrigo,0.358237,0.492369,0.706888,0.120277,0.477258,0.44505,0.567479,0.979351
Vamp,0.179483,0.261942,0.675554,0.230367,0.153079,0.762666,0.580374,0.929731
Total,0.778694,0.629174,1.10265,0.255589,0.853917,1.27947,0.88934,1.504981


In [180]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7
Andre,0.599211,0.367232,0.427096,0.025222,0.700838,0.516804,0.308967,0.57525
Rai,0.908881,0.137365,0.101282,0.503995,0.538928,0.059523,0.211154,0.138184
Rodrigo,0.358237,0.492369,0.706888,0.120277,0.477258,0.44505,0.567479,0.979351
Vamp,0.179483,0.261942,0.675554,0.230367,0.153079,0.762666,0.580374,0.929731
Total,0.778694,0.629174,1.10265,0.255589,0.853917,1.27947,0.88934,1.504981
