## Data exploration and tidy data

### Objectives
* Tidy data, from theory to practice
* Exploring raw data

### Notes
* The original online course, written in R, is available [here](https://rmagno.eu/tdvr.oct.22/). For some cells, the equivalent R command is given as a comment at the top of the cell.



In [1]:
import os
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

## 1. Raw data

In [2]:
# !rm -r 2025-tidy-python
!git clone https://github.com/Py-ualg/2025-tidy-python.git

Cloning into '2025-tidy-python'...
remote: Enumerating objects: 447, done.
remote: Counting objects: 100% (447/447), done.
remote: Compressing objects: 100% (430/430), done.
remote: Total 447 (delta 31), reused 421 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (447/447), 8.07 MiB | 14.10 MiB/s, done.
Resolving deltas: 100% (31/31), done.


In [None]:
# alternative: download a single file (use the /raw/ URL; the /blob/ URL returns an HTML page, not the CSV)
# !wget https://github.com/Py-ualg/2025-tidy-python/raw/main/r2py/raw-data-python/2020-01-18_area1.csv

In [3]:
data_raw_path = '2025-tidy-python/r2py/raw-data-python/'

In [4]:
!ls

2025-tidy-python  sample_data


In [5]:
# quadrats01 <- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))
quadrats01  = pd.read_excel(os.path.join(data_raw_path, 'quadrats.xlsx'))
quadrats01.head()

,Unnamed: 0,Ria Formosa-rf,Ria Alvor-ra
0,Area (m2),250,360


In [6]:
# df1_ra <- readr::read_csv(file.path(data_raw_path, "2020-01-18_ra.csv"))
df1_ra = pd.read_csv(os.path.join(data_raw_path, '2020-01-18_ra.csv')).reset_index(drop=True)
# df1_rf <- readr::read_csv(file.path(data_raw_path, "2020-01-18_rf.csv"))
df1_rf = pd.read_csv(os.path.join(data_raw_path, '2020-01-18_rf.csv')).reset_index(drop=True)
df1_ra.head()

,Unnamed: 0,cl [mm],lcl [mm],fw [mm],species_name,longitude,is_gravid,rcl [mm],stage,sex,id,cw [mm],latitude,associated_species,depth [m],is_gravid?,behavior
0,0,350,126.518848,30,A farensis,-7.99163,False,377.889722,j,-,1,178.223501,37.01025,"['Shrimp', 'Coral Fragment', 'Sea Star']",7,False,Hiding under floating debris.
1,1,230,136.916206,20,u. olhanen.,,False,207.588759,s,?,2,187.290602,,"['Sea Urchin', 'Sea Sponge', 'Shrimp', 'Polych...",7,False,Guarding a small hole in sediment.
2,2,340,118.133667,60,A farensis,,False,33.971185,s,female,3,176.398011,,"['Polychaete Worm', 'Shrimp']",9,False,Hiding under floating debris.
3,3,150,136.691492,30,A farensis,,True,13.317438,s,male,4,166.202287,,"['Sea Anemone', 'Snail', 'Sea Star']",7,True,Searching for food in tidal pools.
4,4,430,121.840322,50,A farensis,,False,88.511637,p,?,5,158.550653,,"['Coral Fragment', 'Sea Sponge']",9,False,Resting in shaded rock crevice.


In [8]:
df1_ra.describe()

,Unnamed: 0,cl [mm],lcl [mm],fw [mm],longitude,rcl [mm],id,cw [mm],latitude,depth [m]
count,244.0,244.0,244.0,244.0,1.0,244.0,244.0,244.0,1.0,244.0
mean,121.5,262.336066,123.900276,42.540984,-7.99163,128.267084,122.5,163.879722,37.01025,7.971311
std,70.580923,114.488816,16.086648,19.607737,,227.456539,70.580923,18.337642,,2.01108
min,0.0,80.0,74.951513,20.0,-7.99163,0.689898,1.0,118.113491,37.01025,3.0
25%,60.75,187.5,113.821548,30.0,-7.99163,20.580134,61.75,151.855312,37.01025,7.0
50%,121.5,250.0,124.288266,40.0,-7.99163,49.459356,122.5,163.902512,37.01025,8.0
75%,182.25,330.0,134.098711,60.0,-7.99163,129.351979,183.25,176.867931,37.01025,9.0
max,243.0,640.0,163.033734,110.0,-7.99163,1500.0,244.0,208.258212,37.01025,14.0


**Exercise**: remove the "Unnamed: 0" column
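One possible approach to the exercise, shown as a minimal sketch on a toy frame (the `toy` frame and its values are made up for illustration; they only mimic the first columns of `df1_ra`):

```python
import pandas as pd

# Toy stand-in for df1_ra; the values are illustrative only.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cl [mm]": [350, 230, 340],
    "stage": ["j", "s", "s"],
})

# drop() returns a new frame and leaves `toy` unchanged.
# errors="ignore" keeps the call safe to re-run when the column is already gone.
toy_clean = toy.drop(columns=["Unnamed: 0"], errors="ignore")

print(list(toy_clean.columns))
```

On the real data the same call would be `df1_ra.drop(columns=['Unnamed: 0'])`. Alternatively, passing `index_col=0` to `pd.read_csv` should stop the column from being created in the first place, assuming it holds the row index that was saved with the file.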

In [9]:
# dplyr::glimpse(df1_ra); a transposed print is the closest quick equivalent
df1_ra.T

,0,1,2,3,4,5,6,7,8,9,...,234,235,236,237,238,239,240,241,242,243
Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,234,235,236,237,238,239,240,241,242,243
cl [mm],350,230,340,150,430,110,140,170,250,260,...,210,210,310,280,250,310,150,210,200,220
lcl [mm],126.518848,136.916206,118.133667,136.691492,121.840322,116.580744,126.996538,93.195881,125.565445,127.799161,...,115.160258,129.070878,153.073095,133.05555,124.892633,142.076687,148.89262,110.948542,118.346816,110.727005
fw [mm],30,20,60,30,50,20,60,30,30,30,...,30,30,30,40,20,50,20,60,40,30
species_name,A farensis,u. olhanen.,A farensis,A farensis,A farensis,A farensis,A farensis,A farensis,A. Farensis,U. olhanensis,...,A. Farensis,A. Farensis,A farensis,U. olhanensis,u. olhanen.,u. olhanen.,A. Farensis,A. Farensis,A. Farensis,A farensis
longitude,-7.99163,,,,,,,,,,...,,,,,,,,,,
is_gravid,False,False,False,True,False,False,False,False,False,False,...,True,False,True,False,False,False,False,False,False,False
rcl [mm],377.889722,207.588759,33.971185,13.317438,88.511637,476.401997,10.350024,1049.575661,69.16084,81.372097,...,525.126136,217.318233,89.714085,68.234667,338.765479,48.628353,137.532903,9.450327,248.140647,30.425664
stage,j,s,s,s,p,p,s,a,a,a,...,p,p,p,a,p,s,a,s,a,a
sex,-,?,female,male,?,,male,male,female,male or female,...,male,female,male,male,male,male,male,male,male,


Transposing is not something you typically do in `pandas`; we show it here as a counterpart to R's `dplyr::glimpse(df1_ra)`. In Python you would normally call `df1_ra.head()`, but for wide tables a transposed view is a reasonable choice, because columns are listed as rows and are less likely to be truncated.
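If a closer match to `glimpse` is wanted, a small helper can list each column with its dtype and first few values. The function below is a hypothetical sketch (neither `glimpse` nor the `toy` frame is part of `pandas` or of the course material), and it assumes column names are unique:

```python
import pandas as pd

def glimpse(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return one row per column: its dtype and its first n values as text."""
    rows = []
    for col in df.columns:
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "first_values": ", ".join(str(v) for v in df[col].head(n)),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy stand-in for df1_ra; the values are illustrative only.
toy = pd.DataFrame({
    "cl [mm]": [350, 230, 340],
    "fw [mm]": [30, 20, 60],
    "stage": ["j", "s", "s"],
})

print(glimpse(toy, n=2))
```

The built-in `df1_ra.info()` gives a similar per-column summary (dtype and non-null count) without the sample values.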

In [10]:
# colnames(df1_ra)
df1_ra.columns  # returns an Index object; for a plain Python list use list(df1_ra.columns)

Index(['Unnamed: 0', 'cl [mm]', 'lcl [mm]', 'fw [mm]', 'species_name',
       'longitude', 'is_gravid', 'rcl [mm]', 'stage', 'sex', 'id', 'cw [mm]',
       'latitude', 'associated_species', 'depth [m]', 'is_gravid?',
       'behavior'],
      dtype='object')

In [11]:
# in R: nrow() and ncol()
df1_ra.shape

(244, 17)

In [12]:
# Value counts in column
# table(df1_ra$stage)
df1_ra['stage'].value_counts()

stage,count
s,96
p,74
a,49
j,25


In [14]:
df1_rf['stage'].value_counts()

stage,count
sub_adult,115
pre_puberty,111
juvenile,74
adult,62
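The two files encode `stage` differently: one-letter codes in `df1_ra`, full labels in `df1_rf`. The sketch below shows one way to harmonise them on toy data. The mapping from letters to labels and the ordering of the stages are assumptions inferred from the labels, not something the data confirms:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ASSUMPTION: each one-letter code abbreviates the long label with the same initial.
stage_map = {"j": "juvenile", "p": "pre_puberty", "s": "sub_adult", "a": "adult"}

# ASSUMPTION: developmental order from youngest to oldest.
stage_dtype = CategoricalDtype(
    categories=["juvenile", "pre_puberty", "sub_adult", "adult"],
    ordered=True,
)

# Toy stand-ins for df1_ra['stage'] and df1_rf['stage'].
ra_stage = pd.Series(["j", "s", "s", "p", "a"])
rf_stage = pd.Series(["sub_adult", "juvenile", "adult"])

def tidy_stage(stage: pd.Series) -> pd.Series:
    """Translate short codes to long labels, then cast to the ordered categorical."""
    # Values that are not keys of stage_map (already-long labels) pass through unchanged.
    long_labels = stage.map(lambda v: stage_map.get(v, v))
    return long_labels.astype(stage_dtype)

ra_tidy = tidy_stage(ra_stage)
rf_tidy = tidy_stage(rf_stage)

print(ra_tidy.value_counts(sort=False))
```

Casting to an ordered categorical means any value outside the four known labels becomes missing, which makes unexpected codes easy to spot with `isna()`.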


In [13]:
# Unique values in column
# unique(df1_ra$sex)
df1_ra['sex'].unique()

array(['-', '?', 'female', 'male', nan, 'male or female', ' ', 'N/R'],
      dtype=object)
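The unique values show several different spellings of "unknown" (`'-'`, `'?'`, `' '`, `'N/R'`, `'male or female'`) next to a real missing value. A minimal sketch of one way to collapse them, assuming that anything other than a clear `male` or `female` should be treated as not recorded (the toy series reuses the values printed above):

```python
import numpy as np
import pandas as pd

# Toy stand-in containing the values returned by df1_ra['sex'].unique().
sex = pd.Series(["-", "?", "female", "male", np.nan, "male or female", " ", "N/R"])

# ASSUMPTION: only these two labels carry information; everything else is "unknown".
valid = ["male", "female"]

# Strip surrounding whitespace first, then keep valid labels and blank out the rest.
stripped = sex.str.strip()
sex_clean = stripped.where(stripped.isin(valid))

print(sex_clean.value_counts(dropna=False))
```

Whether `'male or female'` should become missing or get its own category depends on how the data were recorded, so that choice is worth checking with whoever collected them.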

**Perhaps add something here**