### Data Wrangling

1. **Combining Data Frames**
    - Concatenating
    - Merging
2. **Data types** 
    - Type conversion
    - Replacing values
3. **Sorting**

### 1a. Load fragmented penguin data set.
- One `.csv` file for each species
- How can we combine them all into one DataFrame?

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
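# read every per-species file in the working directory into its own DataFrame
df_list = []
for file in os.listdir():
    # the sex data file starts with 'sex', so it is skipped here
    if file.startswith('penguins'):
        df = pd.read_csv(file)
        df_list.append(df)

# stack the per-species DataFrames on top of each other (row-wise)
penguins_df = pd.concat(df_list)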
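
A side note on row labels: `pd.concat` keeps each input frame's own index unless `ignore_index=True` is passed, so labels can repeat in the combined frame. The sketch below uses two made-up toy frames (not the real penguin files) to show the difference.

```python
import pandas as pd

# Toy stand-ins for the per-species files (values are made up for illustration).
adelie = pd.DataFrame({"Species": ["Adelie", "Adelie"], "Sample ID": ["A_1", "A_2"]})
gentoo = pd.DataFrame({"Species": ["Gentoo"], "Sample ID": ["G_1"]})

# Default behaviour: each frame keeps its own row labels, so they repeat.
stacked = pd.concat([adelie, gentoo])

# ignore_index=True renumbers the rows of the combined frame from 0.
renumbered = pd.concat([adelie, gentoo], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the merge later in this notebook, but they can be surprising when selecting rows with `.loc`.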

In [4]:
# number of rows in the separate csv files
df_lengths = list(map(len, df_list))
df_lengths

[68, 124, 152]

In [5]:
# total number of rows in the separate files
sum(df_lengths)

344

In [6]:
# number of rows in the big dataframe 
len(penguins_df)

344

In [7]:
# number of distinct Sample IDs; equal to the row count, so every ID is unique
penguins_df['Sample ID'].nunique()

344
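
Comparing `nunique()` with the row count works, but pandas also offers a direct check. The sketch below, on a made-up Series with one deliberate repeat, uses `is_unique` and `duplicated` to find the offending values.

```python
import pandas as pd

# Made-up IDs with one deliberate repeat, for illustration only.
ids = pd.Series(["C_1", "C_2", "C_2", "G_1"], name="Sample ID")

# True only if no value occurs more than once.
print(ids.is_unique)

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = ids[ids.duplicated(keep=False)]
print(repeated.tolist())
```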

In [8]:
penguins_df.head()

Unnamed: 0,studyName,Sample Number,Species,Sample ID,Region,Island,Individual ID,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Comments
0,PAL0708,1,Chinstrap,C_1,Anvers,Dream,N61A1,46.5,17.9,192.0,3500.0,
1,PAL0708,2,Chinstrap,C_2,Anvers,Dream,N61A2,50.0,19.5,196.0,3900.0,
2,PAL0708,3,Chinstrap,C_3,Anvers,Dream,N62A1,51.3,19.2,193.0,3650.0,
3,PAL0708,4,Chinstrap,C_4,Anvers,Dream,N62A2,45.4,18.7,188.0,3525.0,
4,PAL0708,5,Chinstrap,C_5,Anvers,Dream,N64A1,52.7,19.8,197.0,3725.0,


### 1b. Uh-oh! We're missing the sex data!
- Fortunately, that data exists in another `.csv` file. 
- How can we "merge" the sex data into our current DataFrame?

In [9]:
sex_df = pd.read_csv('sex_data_penguins.csv')

In [10]:

len(sex_df)

344

In [11]:
sex_df.head()

Unnamed: 0,Sex,Sample Number,Species
0,MALE,1,Adelie
1,FEMALE,2,Adelie
2,FEMALE,3,Adelie
3,,4,Adelie
4,FEMALE,5,Adelie


In [12]:
# Sample ID = first character of species + _ + sample number 
sex_df['Sample ID'] = sex_df['Species'].str[0] + '_' + sex_df['Sample Number'].astype(str)

In [13]:
sex_df['Sample ID'].nunique()

344

In [14]:
# drop the two columns we used to create the id 
sex_df.drop(['Sample Number', 'Species'], axis=1, inplace=True)

In [15]:
sex_df.head()

Unnamed: 0,Sex,Sample ID
0,MALE,A_1
1,FEMALE,A_2
2,FEMALE,A_3
3,,A_4
4,FEMALE,A_5


In [16]:
# how='inner' is also the default for pd.merge; it is spelled out here for clarity
final_df = pd.merge(left=penguins_df, right=sex_df, on='Sample ID', how='inner')

In [17]:
final_df.head()

Unnamed: 0,studyName,Sample Number,Species,Sample ID,Region,Island,Individual ID,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Comments,Sex
0,PAL0708,1,Chinstrap,C_1,Anvers,Dream,N61A1,46.5,17.9,192.0,3500.0,,FEMALE
1,PAL0708,2,Chinstrap,C_2,Anvers,Dream,N61A2,50.0,19.5,196.0,3900.0,,MALE
2,PAL0708,3,Chinstrap,C_3,Anvers,Dream,N62A1,51.3,19.2,193.0,3650.0,,MALE
3,PAL0708,4,Chinstrap,C_4,Anvers,Dream,N62A2,45.4,18.7,188.0,3525.0,,FEMALE
4,PAL0708,5,Chinstrap,C_5,Anvers,Dream,N64A1,52.7,19.8,197.0,3725.0,,MALE
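
An inner join silently drops rows whose key appears in only one of the two frames. One way to check for that, sketched here on made-up toy frames, is an outer merge with `indicator=True`, plus `validate='one_to_one'` to raise an error if a key repeats on either side.

```python
import pandas as pd

# Toy frames (made-up values): "C_1" has no sex record, "G_1" has no measurements.
measurements = pd.DataFrame({
    "Sample ID": ["A_1", "A_2", "C_1"],
    "Body Mass (g)": [3750.0, 3800.0, 3500.0],
})
sexes = pd.DataFrame({
    "Sample ID": ["A_1", "A_2", "G_1"],
    "Sex": ["MALE", "FEMALE", "FEMALE"],
})

# The outer merge keeps unmatched rows; the added _merge column records
# whether each row came from the left frame, the right frame, or both.
checked = pd.merge(
    measurements, sexes,
    on="Sample ID", how="outer",
    validate="one_to_one", indicator=True,
)
print(checked["_merge"].value_counts())

# The inner merge keeps only the keys present in both frames.
inner = pd.merge(measurements, sexes, on="Sample ID", how="inner")
print(len(inner))
```

In this notebook both frames hold 344 unique IDs, so the check is a safeguard, not a fix.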


### 2. Data types

In [21]:
# check the data types of the data frame 
final_df.dtypes

studyName               object
Sample Number            int64
Species                 object
Sample ID               object
Region                  object
Island                  object
Individual ID           object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Comments                object
Sex                     object
dtype: object

In [22]:
# check the memory usage
final_df['Sex'].memory_usage()

5504

In [23]:
# check the memory usage after converting to the category dtype
final_df['Sex'].astype('category').memory_usage()

3228
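
By default, `memory_usage()` counts only the 8-byte pointers of an `object` column, not the Python strings they point to. Passing `deep=True` includes the strings as well. The sketch below uses a made-up Series to show that the saving from `category` is larger under this fuller measure.

```python
import pandas as pd

# Made-up column of repeated labels, forced to object dtype for the comparison.
sex = pd.Series(["MALE", "FEMALE"] * 500, name="Sex", dtype="object")

# Shallow: pointer array plus index, ignoring the string objects themselves.
shallow = sex.memory_usage()

# Deep: also counts the memory held by each string object.
deep = sex.memory_usage(deep=True)

# Category: small integer codes plus one copy of each distinct label.
as_category = sex.astype("category").memory_usage(deep=True)

print(shallow, deep, as_category)
```

The exact byte counts depend on the pandas and Python versions, so none are quoted here.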

In [24]:
# check the time performance before conversion
%timeit final_df.groupby(['Sex'])['Sex'].count()

423 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [25]:
# Enforce data type 
final_df['Sex'] = final_df['Sex'].astype('category')

In [26]:
# check the time performance after conversion
%timeit final_df.groupby(['Sex'])['Sex'].count()

314 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [27]:
# replace the '.' placeholder with NaN so it is treated as a missing value
final_df['Sex'] = final_df['Sex'].replace('.', np.nan)
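
Because `Sex` is already categorical at this point, `'.'` is one of its categories, and some recent pandas versions may warn when `replace` is used on a categorical column. Two alternatives are sketched below on a made-up Series: clean the values before converting, or drop the unwanted category afterwards with `cat.remove_categories`.

```python
import numpy as np
import pandas as pd

# Made-up column containing the '.' placeholder, for illustration only.
raw = pd.Series(["MALE", "FEMALE", ".", "FEMALE"], name="Sex")

# Option 1: clean while the column is still plain strings, then convert.
cleaned = raw.replace(".", np.nan).astype("category")

# Option 2: the column is already categorical; removing the category
# turns every value that belonged to it into NaN.
as_cat = raw.astype("category")
trimmed = as_cat.cat.remove_categories(["."])

print(list(cleaned.cat.categories))
print(list(trimmed.cat.categories))
```

Either way, `'.'` no longer appears among the categories, so it does not show up as an empty group in later `groupby` results.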

In [28]:
# replace also accepts a dictionary mapping old values to new ones
# (the result is only displayed here, not assigned back to final_df)
final_df['Sample Number'].replace({1 : '001', 2 : '002'})

0      001
1      002
2        3
3        4
4        5
      ... 
339    148
340    149
341    150
342    151
343    152
Name: Sample Number, Length: 344, dtype: object
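
The dictionary only pads the values it lists, so the result mixes strings (`'001'`) with integers (`3`), hence the `object` dtype. If the goal is a zero-padded label for every row, one option is to convert to strings and use `str.zfill`, as sketched here on a made-up Series.

```python
import pandas as pd

# Made-up sample numbers, for illustration only.
sample_number = pd.Series([1, 2, 3, 152], name="Sample Number")

# Convert to strings, then left-pad each one with zeros to width 3.
padded = sample_number.astype(str).str.zfill(3)

print(padded.tolist())
```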

### 3. Sorting

In [29]:
# sort by culmen length, ascending (the default); rows with NaN are placed last
final_df[['studyName', 'Culmen Length (mm)']].sort_values(by=['Culmen Length (mm)'])

Unnamed: 0,studyName,Culmen Length (mm)
334,PAL0910,32.1
290,PAL0809,33.1
262,PAL0809,33.5
284,PAL0809,34.0
200,PAL0708,34.1
...,...,...
169,PAL0910,55.9
17,PAL0708,58.0
101,PAL0708,59.6
187,PAL0910,


In [30]:
# sort by culmen length, descending; rows with NaN are still placed last
final_df[['studyName', 'Culmen Length (mm)']].sort_values(by=['Culmen Length (mm)'], ascending=False)

Unnamed: 0,studyName,Culmen Length (mm)
101,PAL0708,59.6
17,PAL0708,58.0
169,PAL0910,55.9
63,PAL0910,55.8
183,PAL0910,55.1
...,...,...
262,PAL0809,33.5
290,PAL0809,33.1
334,PAL0910,32.1
187,PAL0910,
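
In both outputs the row with a missing culmen length (index 187) ends up at the bottom, because `sort_values` puts NaN last by default whatever the sort direction. The sketch below, on a made-up toy frame, shows the `na_position` parameter and sorting by more than one column with a direction per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing measurement (values chosen for illustration).
df = pd.DataFrame({
    "studyName": ["PAL0708", "PAL0910", "PAL0809", "PAL0910"],
    "Culmen Length (mm)": [46.5, np.nan, 33.1, 55.9],
})

# Put rows with a missing culmen length at the top instead of the bottom.
nan_first = df.sort_values(by="Culmen Length (mm)", na_position="first")

# Sort by study name ascending, then by culmen length descending within each study.
by_two = df.sort_values(
    by=["studyName", "Culmen Length (mm)"],
    ascending=[True, False],
)

print(nan_first.index.tolist())
print(by_two.index.tolist())
```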
