## Data exploration and tidy data

### Objectives
* Tidy data, from theory to practice
* Exploring raw data
* Clean, tidy and preprocess the data

### Notes
* Online course resource in R [link](https://rmagno.eu/tdvr.oct.22/).



In [1]:
import os
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

## 1. Raw data

In [10]:
# !rm -r 2025-tidy-python
!git clone https://github.com/Py-ualg/2025-tidy-python.git

Cloning into '2025-tidy-python'...
remote: Enumerating objects: 196, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (186/186), done.
remote: Total 196 (delta 14), reused 185 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (196/196), 3.81 MiB | 9.19 MiB/s, done.
Resolving deltas: 100% (14/14), done.


In [11]:
# alternative: download a single file (use the /raw/ URL; the /blob/ URL returns the HTML page, not the CSV)
# !wget https://github.com/Py-ualg/2025-tidy-python/raw/main/r2py/raw-data-python/2020-01-18_area1.csv

In [3]:
os.listdir()

['.config', '2020-01-18_area1.csv', 'sample_data']

In [18]:
data_raw_path = '2025-tidy-python/r2py/raw-data-python/'

In [19]:
!ls

2020-01-18_area1.csv  2025-tidy-python	data-raw-python  sample_data


In [21]:
# quadrats01 <- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))
quadrats01 = pd.read_excel(os.path.join(data_raw_path, 'quadrats.xlsx'))
quadrats01.head()

Unnamed: 0.1,Unnamed: 0,Ria Formosa-q1,Ria Formosa-q2
0,Area (m2),250,360


In [22]:
# df1_q1 <- readr::read_csv(file.path(data_raw_path, "2020-01-04_q1.csv"))
df1_q1 = pd.read_csv(os.path.join(data_raw_path, '2020-01-04_q1.csv'))
df1_q1.head()

Unnamed: 0.1,Unnamed: 0,cl [cm],lcl [cm],fw [cm],species,longitude,is_gravid,rcl [cm],stage,sex,id,cw [cm],latitude,associated_species,depth [m],behaviour
0,0,9,9.29841,2,A. Farensis,-8.01873,False,1.311285,adult,female,1,23.307874,37.02606,"['Crab (Other Species)', 'Sea Sponge', 'Jellyf...",10,Crab moving quickly across rocks.
1,1,22,9.853438,4,u. olhanen.,,True,3.003715,juvenile,female,2,26.794421,,"['Small Fish', 'Crab (Other Species)', 'Algae'...",10,Pausing frequently during exploration.
2,2,28,10.614677,3,u. olhanen.,,False,0.97047,sub_adult,male,3,26.056546,,"['Shrimp', 'Polychaete Worm', 'Coral Fragment'...",8,Escaping from aggressive fish.
3,3,19,11.076178,3,A farensis,,False,2.991345,pre_puberty,male,4,27.184946,,"['Jellyfish', 'Polychaete Worm', 'Algae', 'Mus...",5,Displaying dominance by lifting body.
4,4,18,8.147143,2,A farensis,,True,0.711986,adult,female,5,23.136318,,"['Mussel', 'Algae', 'Hermit Crab', 'Crab (Othe...",7,Feeding on algae scraped from rocks.


In [23]:
df1_q1.describe()

Unnamed: 0.1,Unnamed: 0,cl [cm],lcl [cm],fw [cm],longitude,rcl [cm],id,cw [cm],latitude,depth [m]
count,369.0,369.0,369.0,369.0,1.0,369.0,369.0,369.0,1.0,369.0
mean,184.0,15.555556,9.423223,4.170732,-8.01873,5.095983,185.0,23.789286,37.02606,8.219512
std,106.665365,7.850005,1.590672,1.994724,,11.731126,106.665365,2.678329,,2.275352
min,0.0,5.0,5.145881,2.0,-8.01873,0.01258,1.0,16.880697,37.02606,1.0
25%,92.0,9.0,8.370462,3.0,-8.01873,0.574676,93.0,21.918503,37.02606,7.0
50%,184.0,15.0,9.487846,4.0,-8.01873,1.597585,185.0,23.693945,37.02606,8.0
75%,276.0,21.0,10.495103,5.0,-8.01873,4.714856,277.0,25.589346,37.02606,10.0
max,368.0,46.0,15.273053,12.0,-8.01873,129.663461,369.0,34.877735,37.02606,15.0
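The `count` row above shows that `longitude` and `latitude` each hold a single non-missing value, while the other numeric columns are complete. One way to make this explicit is `isna().sum()`. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not the course data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1_q1; the values are invented for illustration only
toy = pd.DataFrame({
    "cl [cm]": [9, 22, 28],
    "longitude": [-8.01873, np.nan, np.nan],
    "latitude": [37.02606, np.nan, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# describe() counts only non-missing values, which is why the
# coordinate columns report count == 1
non_missing = toy.describe().loc["count"]
print(non_missing)
```

Running the same two calls on `df1_q1` should show which columns need attention before any statistics are trusted.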


Transposing a table is not something you typically do in `pandas`; we show it here because it is the closest counterpart to R's `dplyr::glimpse(df1_q1)`. In Python you would usually call `df1_q1.head()`, but for wide tables a transposed view can be easier to read.

In [24]:
# dplyr::glimpse(df1_q1) prints one line per column; transposing approximates that view
df1_q1.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,359,360,361,362,363,364,365,366,367,368
Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,359,360,361,362,363,364,365,366,367,368
cl [cm],9,22,28,19,18,14,15,35,46,24,...,29,27,11,19,15,6,11,23,5,16
lcl [cm],9.29841,9.853438,10.614677,11.076178,8.147143,7.832342,9.61202,6.511678,11.071921,8.794032,...,10.777038,7.559897,10.608741,12.039913,8.40014,9.03658,10.150895,10.815358,9.493213,8.895865
fw [cm],2,4,3,3,2,3,3,4,7,7,...,2,7,4,2,7,4,3,2,2,5
species,A. Farensis,u. olhanen.,u. olhanen.,A farensis,A farensis,U. olhanensis,A farensis,U. olhanensis,A. Farensis,u. olhanen.,...,u. olhanen.,A. Farensis,A farensis,A farensis,A farensis,A farensis,u. olhanen.,u. olhanen.,A. Farensis,U. olhanensis
longitude,-8.01873,,,,,,,,,,...,,,,,,,,,,
is_gravid,False,True,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,True
rcl [cm],1.311285,3.003715,0.97047,2.991345,0.711986,0.180093,2.12908,16.014595,0.479633,1.006575,...,2.577693,0.365068,3.109068,2.563043,1.260648,14.955587,0.608139,0.762496,2.181457,1.791813
stage,adult,juvenile,sub_adult,pre_puberty,adult,sub_adult,pre_puberty,sub_adult,juvenile,sub_adult,...,pre_puberty,pre_puberty,pre_puberty,adult,sub_adult,sub_adult,pre_puberty,sub_adult,sub_adult,pre_puberty
sex,female,female,male,male,female,male,female,male,male,,...,,male,female,male,,female,male,female,,female
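The full transpose prints every row, which gets unwieldy for larger tables. A closer approximation of `glimpse` lists each column with its dtype and only the first few values. The helper below is a hypothetical sketch (the name `glimpse` and its `n` parameter are not part of `pandas`), shown on a small made-up frame:

```python
import pandas as pd

def glimpse(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Hypothetical helper: one row per column, with dtype and first n values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "first_values": [df[col].head(n).tolist() for col in df.columns],
    })

# Toy stand-in for df1_q1; the values are invented for illustration only
toy = pd.DataFrame({
    "cl [cm]": [9, 22, 28],
    "species": ["A. Farensis", "u. olhanen.", "u. olhanen."],
    "is_gravid": [False, True, False],
})

summary = glimpse(toy, n=2)
print(summary)
```

The built-in `df1_q1.info()` gives a similar per-column overview (dtype and non-null count) without showing the values.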


In [26]:
# colnames(df1_q1)
df1_q1.columns  # returns an Index object; for a plain Python list use list(df1_q1.columns)

Index(['Unnamed: 0', 'cl [cm]', 'lcl [cm]', 'fw [cm]', 'species', 'longitude',
       'is_gravid', 'rcl [cm]', 'stage', 'sex', 'id', 'cw [cm]', 'latitude',
       'associated_species', 'depth [m]', 'behaviour'],
      dtype='object')
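Column names such as `cl [cm]` contain spaces and brackets, which makes them awkward to type and rules out attribute access. One possible clean-up, sketched below, rewrites them as lower-case names joined by underscores. The helper `clean_name` is hypothetical, and whether to keep the units in the names is a design choice rather than a rule:

```python
import re
import pandas as pd

def clean_name(name: str) -> str:
    """Hypothetical helper: drop brackets, join words with underscores, lower-case."""
    name = re.sub(r"[\[\]]", "", name)           # remove the square brackets around units
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)   # any other run of symbols becomes "_"
    return name.strip("_").lower()

# Empty toy frame that only carries a few of the column names seen above
toy = pd.DataFrame(columns=["Unnamed: 0", "cl [cm]", "depth [m]", "species"])

# rename() accepts a function and applies it to every column label
renamed = toy.rename(columns=clean_name)
print(list(renamed.columns))
```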

In [27]:
# in R: nrow() and ncol()
df1_q1.shape

(369, 16)

In [28]:
# Value counts in column
# table(df1_q1$stage)
df1_q1['stage'].value_counts()

stage,count
pre_puberty,112
sub_adult,107
adult,77
juvenile,73
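By default `value_counts()` leaves missing values out, much like R's `table()` does unless `useNA` is set. The sketch below, on a made-up Series, shows the `dropna` and `normalize` parameters that change this behaviour:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1_q1['stage']; the values are invented for illustration only
stage = pd.Series(["adult", "juvenile", "adult", np.nan, "sub_adult"], name="stage")

counts = stage.value_counts()                       # NaN is excluded by default
counts_with_na = stage.value_counts(dropna=False)   # NaN is counted as its own category
shares = stage.value_counts(normalize=True)         # proportions among non-missing values

print(counts)
print(counts_with_na)
print(shares)
```

Comparing `counts.sum()` with the number of rows is a quick check for hidden missing values in a column.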


In [29]:
# Unique values in column
# unique(df1_q1$sex)
df1_q1['sex'].unique()

array(['female', 'male', ' ', 'male or female', '?', '-', 'N/R', nan],
      dtype=object)
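The `sex` column mixes two real categories with several placeholders for "unknown" (`' '`, `'?'`, `'-'`, `'N/R'`) and an ambiguous `'male or female'`. One possible way to handle this, sketched below, is to keep only an explicit set of valid labels and turn everything else into a missing value. Treating `'male or female'` as missing is an assumption made for this illustration, not a decision taken from the course material:

```python
import numpy as np
import pandas as pd

# The unique values shown above, reused as a toy Series
sex = pd.Series(
    ["female", "male", " ", "male or female", "?", "-", "N/R", np.nan],
    name="sex",
)

# Assumption: only these two labels count as valid observations
valid = {"female", "male"}

# where() keeps values where the condition is True and sets the rest to NaN
cleaned = sex.where(sex.isin(valid))
print(cleaned.unique())
```

An allow-list like `valid` is safer than listing the placeholders to remove, since new spellings of "unknown" in later files would also become missing instead of passing through.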

## 2. Data tidying

In [30]:
# add region, season, quadrat
df1_q1['region'] = "Ria Formosa"
df1_q1['quadrat'] = "q1"