[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rodreras/geopy_minicurso/HEAD)

# Introduction to Libraries - Part 1

In this notebook, we will quickly go over some important libraries that are used in data analysis. They are:
- `Pandas`, for data manipulation
- `NumPy`, for creating vectors (arrays) and applying mathematical functions
- `Matplotlib`, for plotting the data
- `Seaborn`, for making the plots more attractive

This part of the course was translated from the free course [Python para Geológos](https://github.com/kevinalexandr19/manual-python-geologia), developed by the geologist [Kevin Alexander Gómez](https://www.linkedin.com/in/kevin-alexander-g%C3%B3mez-2b0263111/) from Peru. The course is released under the [MIT License](https://github.com/kevinalexandr19/manual-python-geologia/blob/main/LICENSE), which allows distribution and modification of the material.
To learn more, visit the [Python para Geológos](https://github.com/kevinalexandr19/manual-python-geologia) repository.

___

## Pandas

Pandas is a library, in other words a collection of reusable code that extends what Python can do. Its main goal is to make data manipulation simple and fast.
Data is always held in tabular form, as `DataFrame` or `Series` objects.

Because it is such a powerful library, it supports a wide range of tasks:

- Explore, clean, and process data quickly and programmatically;
- Integrate various data source formats (`json`, `csv`, `sql`, etc.);
- Filter and select specific columns or rows;
- Perform basic quick calculations;
- Group and ungroup data;
- Among many other things.


___
**Attention!**

- The cell below is the strategy I used to install the libraries. I tried other methods, without success.

- Essentially, it's a function I created that installs all the libraries needed for the project, so it must be executed before anything else.

- It will take about 2 minutes to run. In the meantime, have a cup of coffee ;)

- In other notebooks, you may see an import error (for example, `ImportError: No module named request`). To solve it:

    - Add a new code cell
    - Type `!pip install library_name`
    - Wait for the installation to finish, then import the library again. That usually resolves the error.

In [1]:
from subprocesses import install_libraries

# Example usage
libraries = "pandas matplotlib earthpy rasterio folium mplstereo striplog geopandas pyrolite contextily"
install_libraries(libraries)
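
The source of `install_libraries` is not shown in this notebook. As a rough idea of what such a helper could look like, here is a minimal sketch that assumes it simply runs `pip install` for every name in the string. The names `build_pip_command` and `install_libraries_sketch` are hypothetical and are not the course's actual code.

```python
import subprocess
import sys


def build_pip_command(libraries: str) -> list[str]:
    """Build a `pip install` command from a space-separated string of package names."""
    # Use the interpreter that runs the notebook so packages land in the same environment.
    return [sys.executable, "-m", "pip", "install", *libraries.split()]


def install_libraries_sketch(libraries: str) -> None:
    """Hypothetical stand-in for the course helper: install every listed package."""
    subprocess.check_call(build_pip_command(libraries))


# Only build and print the command here; call install_libraries_sketch(...) to really install.
command = build_pip_command("pandas matplotlib")
print(" ".join(command[1:]))
```

Running pip through `sys.executable -m pip` is a common way to make sure the packages are installed for the same Python that the notebook kernel uses.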

_____

Now let's get back to what matters: Pandas!
Run the next cell to perform the import.

In [2]:
#to import the library
import pandas as pd
import os

### 1. Pandas Series

Let's create a `Series` object.

In [3]:
au = pd.Series([5.0, 6.1, 4.2, 2.4, 8.3], index=["A", "B", "C", "D", "E"])


ag = pd.Series({"A": 51.2, "B": 62.7, "C": 54.8, "D": 47.1, "E": 40.3})
au, ag

(A    5.0
 B    6.1
 C    4.2
 D    2.4
 E    8.3
 dtype: float64,
 A    51.2
 B    62.7
 C    54.8
 D    47.1
 E    40.3
 dtype: float64)

In [4]:
#now get the indices of the series 
au.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [5]:
# this time, the values.
au.values

array([5. , 6.1, 4.2, 2.4, 8.3])

**CHALLENGE**

- Now create a Series for the Cu values.

| Sample | Value |
|-------|-------|
| A | 3.2 |
| B | 4.5 |
| C | 2.1 |
| D | 4.8 |
| E | 5.4 |


In [6]:
#write your series for Cu
cu = pd.Series([3.2, 4.5, 2.1, 4.8, 5.4], index=["A", "B", "C", "D", "E"])
#now, create the output for the series.
cu

A    3.2
B    4.5
C    2.1
D    4.8
E    5.4
dtype: float64

In [7]:
# we can create a copy using the copy() method
copy = cu.copy()
copy

A    3.2
B    4.5
C    2.1
D    4.8
E    5.4
dtype: float64

In [8]:
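# Just like with a Python dictionary, we can add and modify the values of a Series.
# Use .iloc for positions: recent pandas versions warn when an integer key is used on a label-indexed Series.

copy.iloc[0] = 0
copy["F"] = 10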
copy

A     0.0
B     4.5
C     2.1
D     4.8
E     5.4
F    10.0
dtype: float64

In [9]:
index = ["A", "B", "C", "D", "E"]

dict(zip(index, [4.0, 6.1, 3.5, 6.4, 8.9]))

{'A': 4.0, 'B': 6.1, 'C': 3.5, 'D': 6.4, 'E': 8.9}

In [10]:
# Now, in addition to Zn, create one for Pb

index = ["A", "B", "C", "D", "E"]

zn = pd.Series(dict(zip(index, [4.0, 6.1, 3.5, 6.4, 8.9])))
pb = pd.Series(dict(zip(index, [4.6, 3.4, 9.8, 1.2, 3.3])))

In [11]:
#showing the results
display(zn)
display(pb)

A    4.0
B    6.1
C    3.5
D    6.4
E    8.9
dtype: float64

A    4.6
B    3.4
C    9.8
D    1.2
E    3.3
dtype: float64

In [12]:
#Within a series, we can slice the information
zn[1:3]

B    6.1
C    3.5
dtype: float64

In [13]:
pb["C": "E"]

C    9.8
D    1.2
E    3.3
dtype: float64

Two very important Pandas indexers are `.loc` and `.iloc`. `.loc` looks a value up by its label, while `.iloc` looks it up by its integer position. Let's see below:

In [14]:
# Look up the value with the label "B"

zn.loc['B']

6.1

In [15]:
#Search for what is in position 1
zn.iloc[1]

6.1
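
The two slices shown earlier behave differently at the end point, and `.loc` and `.iloc` follow the same rule. The short sketch below rebuilds the `zn` Series so that it runs on its own and makes the difference explicit.

```python
import pandas as pd

zn = pd.Series([4.0, 6.1, 3.5, 6.4, 8.9], index=["A", "B", "C", "D", "E"])

# .loc slices by label and includes BOTH end points.
by_label = zn.loc["B":"D"]

# .iloc slices by position and excludes the stop position, like a Python list.
by_position = zn.iloc[1:3]

print(by_label.index.tolist())     # ['B', 'C', 'D']
print(by_position.index.tolist())  # ['B', 'C']
```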

### 2. Pandas DataFrame

A DataFrame, often abbreviated as `df`, groups a set of Series into a table with rows and columns. Let's use the Series we created to build a DataFrame.

In [16]:
df = pd.DataFrame({'Au':au,
                   'Ag':ag,
                   'Cu':cu,
                   'Zn':zn,
                   'Pb': pb
                   }
                  )
# show the whole DataFrame
df

,Au,Ag,Cu,Zn,Pb
A,5.0,51.2,3.2,4.0,4.6
B,6.1,62.7,4.5,6.1,3.4
C,4.2,54.8,2.1,3.5,9.8
D,2.4,47.1,4.8,6.4,1.2
E,8.3,40.3,5.4,8.9,3.3


In [17]:
#using the head function
df.head()

,Au,Ag,Cu,Zn,Pb
A,5.0,51.2,3.2,4.0,4.6
B,6.1,62.7,4.5,6.1,3.4
C,4.2,54.8,2.1,3.5,9.8
D,2.4,47.1,4.8,6.4,1.2
E,8.3,40.3,5.4,8.9,3.3


In [18]:
df.head(2)

,Au,Ag,Cu,Zn,Pb
A,5.0,51.2,3.2,4.0,4.6
B,6.1,62.7,4.5,6.1,3.4


In [19]:
# Challenge:
# How do we get the last rows of the df?
df.tail(2)

,Au,Ag,Cu,Zn,Pb
D,2.4,47.1,4.8,6.4,1.2
E,8.3,40.3,5.4,8.9,3.3


In [20]:
#We can also make a copy of a DataFrame
df.copy()

,Au,Ag,Cu,Zn,Pb
A,5.0,51.2,3.2,4.0,4.6
B,6.1,62.7,4.5,6.1,3.4
C,4.2,54.8,2.1,3.5,9.8
D,2.4,47.1,4.8,6.4,1.2
E,8.3,40.3,5.4,8.9,3.3


#### 2.1 Indexes and Columns

Sometimes it's necessary to rename a column or an index label. For that, we use the `rename()` method.

In [21]:
df.rename(
          columns={"Au": "Ouro"},
          index={"A": "AM1"}
          )

,Ouro,Ag,Cu,Zn,Pb
AM1,5.0,51.2,3.2,4.0,4.6
B,6.1,62.7,4.5,6.1,3.4
C,4.2,54.8,2.1,3.5,9.8
D,2.4,47.1,4.8,6.4,1.2
E,8.3,40.3,5.4,8.9,3.3


In [22]:
# In some cases, it's necessary to remove the index to perform operations.
# For this, we use the reset_index() method

df.reset_index()

,index,Au,Ag,Cu,Zn,Pb
0,A,5.0,51.2,3.2,4.0,4.6
1,B,6.1,62.7,4.5,6.1,3.4
2,C,4.2,54.8,2.1,3.5,9.8
3,D,2.4,47.1,4.8,6.4,1.2
4,E,8.3,40.3,5.4,8.9,3.3


In [23]:
#If we add the parameter drop = True, the old index will be discarded

df.reset_index(drop=True)

,Au,Ag,Cu,Zn,Pb
0,5.0,51.2,3.2,4.0,4.6
1,6.1,62.7,4.5,6.1,3.4
2,4.2,54.8,2.1,3.5,9.8
3,2.4,47.1,4.8,6.4,1.2
4,8.3,40.3,5.4,8.9,3.3


In [24]:
#If we want the change to be permanent, we need to use the parameter
## inplace = True

df.reset_index(drop = True, inplace = True)

In [25]:
df.head()

,Au,Ag,Cu,Zn,Pb
0,5.0,51.2,3.2,4.0,4.6
1,6.1,62.7,4.5,6.1,3.4
2,4.2,54.8,2.1,3.5,9.8
3,2.4,47.1,4.8,6.4,1.2
4,8.3,40.3,5.4,8.9,3.3
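
As an alternative to `inplace=True`, the result of the method can be assigned back to the same variable. The sketch below uses a small two-row table, made up for illustration, to show that the original DataFrame only changes once the result is reassigned.

```python
import pandas as pd

df = pd.DataFrame({"Au": [5.0, 6.1], "Ag": [51.2, 62.7]}, index=["A", "B"])

# reset_index() returns a new DataFrame; the original keeps its labels...
df.reset_index(drop=True)
print(df.index.tolist())  # ['A', 'B']

# ...until the result is assigned back to the same name.
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1]
```

Both styles end with the same table; reassignment simply makes the change visible in the code that produces it.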


In [26]:
#However, we can add an index back:

df.index = ["AM1", "AM2", "AM3", "AM4", "AM5"]
df

,Au,Ag,Cu,Zn,Pb
AM1,5.0,51.2,3.2,4.0,4.6
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2
AM5,8.3,40.3,5.4,8.9,3.3


In [27]:
#If needed, we can give a name to the index
df.index.name = 'Amostras'
df

Amostras,Au,Ag,Cu,Zn,Pb
AM1,5.0,51.2,3.2,4.0,4.6
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2
AM5,8.3,40.3,5.4,8.9,3.3


In [28]:
# To see all the columns of a df, we use the columns attribute

df.columns

Index(['Au', 'Ag', 'Cu', 'Zn', 'Pb'], dtype='object')

In [29]:
# To know the data type of each column, use the dtypes attribute, which is very important when exploring data

df.dtypes

Au    float64
Ag    float64
Cu    float64
Zn    float64
Pb    float64
dtype: object

In [30]:
#Sometimes, it might be necessary to sort values in ascending or descending order

df.sort_values(by=['Ag'], ascending = True)

Amostras,Au,Ag,Cu,Zn,Pb
AM5,8.3,40.3,5.4,8.9,3.3
AM4,2.4,47.1,4.8,6.4,1.2
AM1,5.0,51.2,3.2,4.0,4.6
AM3,4.2,54.8,2.1,3.5,9.8
AM2,6.1,62.7,4.5,6.1,3.4


In [31]:
# Try the previous code again, but with ascending=False instead of ascending=True
# What changes?

df.sort_values(by=['Ag'], ascending = False)

Amostras,Au,Ag,Cu,Zn,Pb
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM1,5.0,51.2,3.2,4.0,4.6
AM4,2.4,47.1,4.8,6.4,1.2
AM5,8.3,40.3,5.4,8.9,3.3


#### 2.2 Selection of Rows and Columns

We can select columns by their name. See below:

In [32]:
df['Au']  # equivalent to df.Au

Amostras
AM1    5.0
AM2    6.1
AM3    4.2
AM4    2.4
AM5    8.3
Name: Au, dtype: float64

In [33]:
## and we can also select several with a list
df[['Au','Pb','Ag']]

Amostras,Au,Pb,Ag
AM1,5.0,4.6,51.2
AM2,6.1,3.4,62.7
AM3,4.2,9.8,54.8
AM4,2.4,1.2,47.1
AM5,8.3,3.3,40.3


In [34]:
# Slicing also works on DataFrames

df[1:]

Amostras,Au,Ag,Cu,Zn,Pb
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2
AM5,8.3,40.3,5.4,8.9,3.3


In [35]:
df['AM3':]

Amostras,Au,Ag,Cu,Zn,Pb
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2
AM5,8.3,40.3,5.4,8.9,3.3


In [36]:
# The loc and iloc indexers are great tools for DataFrames
df.loc[["AM1", "AM3"], ["Ag", "Pb"]]

Amostras,Ag,Pb
AM1,51.2,4.6
AM3,54.8,9.8


In [37]:
## Challenge: Select only sample AM5 and column Au

df.loc[['AM5'], ['Au']]

Amostras,Au
AM5,8.3


In [38]:
df.loc["AM1": "AM4", :]

Amostras,Au,Ag,Cu,Zn,Pb
AM1,5.0,51.2,3.2,4.0,4.6
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2


In [39]:
df.iloc[[0, 2], [1, 4]]

Amostras,Ag,Pb
AM1,51.2,4.6
AM3,54.8,9.8


In [40]:
df.iloc[0:4, :]

Amostras,Au,Ag,Cu,Zn,Pb
AM1,5.0,51.2,3.2,4.0,4.6
AM2,6.1,62.7,4.5,6.1,3.4
AM3,4.2,54.8,2.1,3.5,9.8
AM4,2.4,47.1,4.8,6.4,1.2


In [41]:
#Logical expressions can also be used here...
df["Au"] > 5.0


Amostras
AM1    False
AM2     True
AM3    False
AM4    False
AM5     True
Name: Au, dtype: bool

In [42]:
# To get the matching rows back as a DataFrame, use the condition as a filter:
df[df["Au"] > 5.0]

Amostras,Au,Ag,Cu,Zn,Pb
AM2,6.1,62.7,4.5,6.1,3.4
AM5,8.3,40.3,5.4,8.9,3.3
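
Conditions can also be combined. The sketch below rebuilds only the `Au` and `Ag` columns of the table so that it runs on its own; the 50.0 threshold for `Ag` is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Au": [5.0, 6.1, 4.2, 2.4, 8.3], "Ag": [51.2, 62.7, 54.8, 47.1, 40.3]},
    index=["AM1", "AM2", "AM3", "AM4", "AM5"],
)

# & means "and", | means "or"; each condition needs its own parentheses.
rich_in_both = df[(df["Au"] > 5.0) & (df["Ag"] > 50.0)]
rich_in_either = df[(df["Au"] > 5.0) | (df["Ag"] > 50.0)]

print(rich_in_both.index.tolist())    # ['AM2']
print(rich_in_either.index.tolist())  # ['AM1', 'AM2', 'AM3', 'AM5']
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.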


#### 2.3 Modifying Rows and Columns

It's possible to remove rows and columns with the `drop()` method. It returns a new DataFrame, so pass `inplace=True` (or assign the result back to `df`) if you want to keep the change.

In [43]:
df.drop(columns=["Au", "Ag"], index=["AM1", "AM2"])

Amostras,Cu,Zn,Pb
AM3,2.1,3.5,9.8
AM4,4.8,6.4,1.2
AM5,5.4,8.9,3.3


In [44]:
#We can also add new columns
df["LigaMetalica"] = df["Cu"] + df["Zn"] + df["Pb"]
df

Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica
AM1,5.0,51.2,3.2,4.0,4.6,11.8
AM2,6.1,62.7,4.5,6.1,3.4,14.0
AM3,4.2,54.8,2.1,3.5,9.8,15.4
AM4,2.4,47.1,4.8,6.4,1.2,12.4
AM5,8.3,40.3,5.4,8.9,3.3,17.6


#### 2.4 Empty Values in Pandas

Let's create an empty column. For that, we'll need the help of the NumPy library.

In [45]:
import numpy as np

df['Null'] = np.nan
df

Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica,Null
AM1,5.0,51.2,3.2,4.0,4.6,11.8,
AM2,6.1,62.7,4.5,6.1,3.4,14.0,
AM3,4.2,54.8,2.1,3.5,9.8,15.4,
AM4,2.4,47.1,4.8,6.4,1.2,12.4,
AM5,8.3,40.3,5.4,8.9,3.3,17.6,


In [46]:
# To count the null values (NaN) in each column, we combine the isna() and sum() methods

df.isna().sum()

Au              0
Ag              0
Cu              0
Zn              0
Pb              0
LigaMetalica    0
Null            5
dtype: int64

In [47]:
# To replace all null values, use the fillna() method.
df.fillna("")


Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica,Null
AM1,5.0,51.2,3.2,4.0,4.6,11.8,
AM2,6.1,62.7,4.5,6.1,3.4,14.0,
AM3,4.2,54.8,2.1,3.5,9.8,15.4,
AM4,2.4,47.1,4.8,6.4,1.2,12.4,
AM5,8.3,40.3,5.4,8.9,3.3,17.6,
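
Filling a numeric column with an empty string mixes text into it, which can get in the way of later calculations. A numeric fill value is often a safer choice. The sketch below uses a small Series with two gaps, made up for illustration, and fills it first with zero and then with the mean of the known values.

```python
import numpy as np
import pandas as pd

# Illustrative values only: two of the five measurements are missing.
cu = pd.Series([3.2, np.nan, 2.1, np.nan, 5.4], index=["A", "B", "C", "D", "E"])

# Replace every NaN with a fixed number...
filled_zero = cu.fillna(0.0)

# ...or with a statistic computed from the known values (mean() skips NaN by default).
filled_mean = cu.fillna(cu.mean())

print(filled_zero.tolist())
print(filled_mean.round(2).tolist())
```

Which fill value is appropriate depends on the data; zero and the mean are only two common options.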


In [48]:
# Finally, if you want to delete all NaN values (not recommended from a statistical point of view),
# just use the dropna() method.
# axis=1 drops the columns that contain NaN
# axis=0 drops the rows that contain NaN

df.dropna(axis = 1)

Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica
AM1,5.0,51.2,3.2,4.0,4.6,11.8
AM2,6.1,62.7,4.5,6.1,3.4,14.0
AM3,4.2,54.8,2.1,3.5,9.8,15.4
AM4,2.4,47.1,4.8,6.4,1.2,12.4
AM5,8.3,40.3,5.4,8.9,3.3,17.6


In [49]:
df.dropna(axis=1, inplace=True)
df

Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica
AM1,5.0,51.2,3.2,4.0,4.6,11.8
AM2,6.1,62.7,4.5,6.1,3.4,14.0
AM3,4.2,54.8,2.1,3.5,9.8,15.4
AM4,2.4,47.1,4.8,6.4,1.2,12.4
AM5,8.3,40.3,5.4,8.9,3.3,17.6


#### 2.5 Saving and Loading Files

To save the information, we first need to check whether Pandas supports exporting to the desired format (it most likely does).

It's very common to export data directly to `csv` format. For this, there's the `to_csv()` method.

In [50]:
#Let's define the folder name and the path where the file will be saved
path = r"C:\Users\USER\PYTHON FOR DATASCIENCE UDEMY\YOUTUBE MACHINE LEARNING PROJECT\GEO_DATASCIENCE\LinkedIn_Geopy_Translation\Python_introduction\notebooks1"

In [51]:
# Now, saving the file
df.to_csv(os.path.join(path, 'chemical_analysis.csv'))

In [52]:
# To read, the process is quite similar

csv = pd.read_csv(os.path.join(path, 'chemical_analysis.csv'))
csv.head()

,Amostras,Au,Ag,Cu,Zn,Pb,LigaMetalica
0,AM1,5.0,51.2,3.2,4.0,4.6,11.8
1,AM2,6.1,62.7,4.5,6.1,3.4,14.0
2,AM3,4.2,54.8,2.1,3.5,9.8,15.4
3,AM4,2.4,47.1,4.8,6.4,1.2,12.4
4,AM5,8.3,40.3,5.4,8.9,3.3,17.6
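
Notice that, after reading, `Amostras` comes back as an ordinary column and a new numeric index is created. If the original index is wanted, `read_csv` accepts an `index_col` argument. The sketch below uses a temporary folder in place of the notebook's `path` variable and a two-row table, both chosen only so that the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Au": [5.0, 6.1], "Ag": [51.2, 62.7]}, index=["AM1", "AM2"])
df.index.name = "Amostras"

# A temporary folder stands in for the notebook's `path` variable.
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "chemical_analysis.csv")
    df.to_csv(file_path)
    # index_col tells pandas which column to use as the index again.
    restored = pd.read_csv(file_path, index_col="Amostras")

print(restored.index.tolist())  # ['AM1', 'AM2']
```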


#### 2.6 General Analysis

Sometimes we just want to take a general look at the numbers and statistics. For that, there's the `describe()` method.

We'll also use another library to deal with folders and files on our computer. The library is called `os`. We'll use the `os.path.join` function to join the folder name stored in the variable `path` with the filename.

When Pandas reads the file, it will receive a path of the form `./pathname/rochas.csv`.

If you want to better understand how this library works, which is quite important, [watch this video](https://www.youtube.com/watch?v=bgrRKmvP8As)

In [53]:
#Importing the os library
import os

In [54]:
# Reading the file 
rocks = pd.read_csv(os.path.join(path,
                                  'rochas.csv'))
rocks.head()

,Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
0,Peridotita,45.16,1.56,8.79,0.97,44.47,0.1,0.02,0.13,0.1
1,Peridotita,45.97,2.94,8.9,2.83,39.89,0.17,0.04,0.13,0.19
2,Peridotita,46.91,3.62,8.23,2.73,39.55,0.14,0.01,0.13,0.11
3,Peridotita,44.96,2.01,9.04,1.1,43.39,0.1,0.02,0.13,0.11
4,Peridotita,45.24,0.73,7.92,0.42,46.79,0.01,0.01,0.11,0.03


In [55]:
rocks.describe()

,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
count,4564.0,4564.0,4564.0,4564.0,4564.0,4564.0,4564.0,4564.0,4564.0
mean,58.539551,11.066517,5.398737,3.084427,14.704665,2.521468,1.99712,0.098935,0.403606
std,11.57278,6.377444,3.003889,1.945397,18.636226,1.841039,1.62922,0.056822,0.353953
min,23.49,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,44.7,3.23,3.17,1.87,1.26,0.24,0.09,0.06,0.14
50%,64.05,14.819279,4.56,2.97,2.103881,3.175,2.27,0.09635,0.38
75%,67.62,15.85,7.66,4.0325,38.5525,3.97,3.29,0.13,0.56
max,79.9,25.4,20.72,25.265,53.1,7.9,8.13,1.21,4.78


In [56]:
#To obtain the unique values of the column, we use the unique() function

rocks.Nome.unique()

array(['Peridotita', 'Granodiorita'], dtype=object)

In [57]:
rocks.shape

(4564, 10)

In [58]:
rocks.sample(5)

,Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
3474,Granodiorita,52.8,16.7,11.61,6.46,2.83,2.49,3.48,0.18,1.35
2553,Granodiorita,69.33,15.26,2.46,1.7,1.07,3.84,3.34,0.07,0.39
3130,Granodiorita,66.523276,14.345839,4.99,3.889787,1.654728,5.451999,0.37119,0.067563,0.751795
4166,Granodiorita,69.8,15.83,2.38,2.33,0.01,4.11,1.87,1.21,0.44
4525,Granodiorita,66.2,16.74,2.88,3.74,2.16,3.27,4.22,0.05,0.52


In [59]:
rocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4564 entries, 0 to 4563
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Nome    4564 non-null   object 
 1   SiO2    4564 non-null   float64
 2   Al2O3   4564 non-null   float64
 3   FeOT    4564 non-null   float64
 4   CaO     4564 non-null   float64
 5   MgO     4564 non-null   float64
 6   Na2O    4564 non-null   float64
 7   K2O     4564 non-null   float64
 8   MnO     4564 non-null   float64
 9   TiO     4564 non-null   float64
dtypes: float64(9), object(1)
memory usage: 356.7+ KB


#### 2.7 Grouping Data

To group data by one or more columns, we can use the `groupby()` method.
It lets us perform operations on each group independently.

In [60]:
rocks.groupby(['Nome'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019499ADFB50>

In [61]:
#Number of elements per group
rocks.groupby(['Nome']).count()

Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
Granodiorita,2993,2993,2993,2993,2993,2993,2993,2993,2993
Peridotita,1571,1571,1571,1571,1571,1571,1571,1571,1571


In [62]:
#Mean per group
rocks.groupby(['Nome']).mean()

Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
Granodiorita,66.478101,15.494279,3.736919,3.448311,1.618093,3.725909,2.983752,0.078362,0.517362
Peridotita,43.415374,2.630939,8.564758,2.391172,39.636627,0.226822,0.117431,0.138131,0.186882


In [63]:
# Mean FeOT per group
rocks.groupby(['Nome'])['FeOT'].mean()

Nome
Granodiorita    3.736919
Peridotita      8.564758
Name: FeOT, dtype: float64

In [64]:
#Median per group
rocks.groupby(['Nome']).median()

Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
Granodiorita,66.52,15.53,3.6,3.37,1.51,3.72,2.98,0.07,0.49
Peridotita,43.85,2.13,8.06,1.98,41.02,0.14,0.03,0.13,0.09


In [65]:
#Coefficient of variation per group
rocks.groupby(['Nome']).std() / rocks.groupby(['Nome']).mean()

Nome,SiO2,Al2O3,FeOT,CaO,MgO,Na2O,K2O,MnO,TiO
Granodiorita,0.062405,0.079269,0.408062,0.446731,0.590787,0.245766,0.361306,0.657902,0.474438
Peridotita,0.06275,0.987772,0.297652,1.002265,0.194094,2.061584,2.805062,0.32168,2.255234
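
The coefficient of variation is the standard deviation divided by the mean, so a value above 1 means the spread is larger than the mean itself. The same calculation can be written with a single `groupby` object that is reused for both statistics. The sketch below uses four toy rows that are not taken from `rochas.csv`.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from rochas.csv.
rocks = pd.DataFrame(
    {
        "Nome": ["Peridotita", "Peridotita", "Granodiorita", "Granodiorita"],
        "SiO2": [44.0, 46.0, 65.0, 67.0],
        "MgO": [40.0, 44.0, 1.0, 3.0],
    }
)

grouped = rocks.groupby("Nome")      # build the groups once
cv = grouped.std() / grouped.mean()  # sample standard deviation divided by the mean
print(cv.round(3))
```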


In [66]:
# Finally, to apply several functions to specific columns,
# we use the agg method. Passing the names "max" and "min" as strings avoids the
# warning that recent pandas versions raise for np.max and np.min.

rocks.groupby('Nome').agg({"SiO2": ["max", "min"], "Al2O3": ["max", "min"]})

,SiO2,SiO2,Al2O3,Al2O3
Nome,max,min,max,min
Granodiorita,79.9,35.3,22.71,5.4
Peridotita,58.16,23.49,25.4,0.01
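
When the two-level column header is inconvenient, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, which gives a flat table with the chosen names. The sketch below uses toy rows that are not taken from `rochas.csv`, and the output column names are arbitrary.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from rochas.csv.
rocks = pd.DataFrame(
    {
        "Nome": ["Peridotita", "Peridotita", "Granodiorita", "Granodiorita"],
        "SiO2": [44.0, 46.0, 65.0, 67.0],
        "Al2O3": [1.5, 2.5, 15.0, 16.0],
    }
)

# Each keyword becomes one output column: name=(source column, function name).
summary = rocks.groupby("Nome").agg(
    sio2_max=("SiO2", "max"),
    sio2_min=("SiO2", "min"),
    al2o3_mean=("Al2O3", "mean"),
)
print(summary)
```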
