# Working with the Birds Dataset: Data Preparation

To explore this functionality, we will import the pandas library and load the **birds** dataset from `birds.csv`.

In [8]:
import pandas as pd

birds_df = pd.read_csv('birds.csv')

- **DataFrame.info()**: To start, the `info()` method prints a concise summary of a `DataFrame`: the index range, each column's name, non-null count, and dtype, and the memory usage. Let's see what this dataset contains:

In [9]:
birds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443 entries, 0 to 442
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                443 non-null    object 
 1   ScientificName      443 non-null    object 
 2   Category            443 non-null    object 
 3   Order               443 non-null    object 
 4   Family              443 non-null    object 
 5   Genus               443 non-null    object 
 6   ConservationStatus  443 non-null    object 
 7   MinLength           443 non-null    float64
 8   MaxLength           443 non-null    float64
 9   MinBodyMass         443 non-null    float64
 10  MaxBodyMass         443 non-null    float64
 11  MinWingspan         443 non-null    float64
 12  MaxWingspan         443 non-null    float64
dtypes: float64(6), object(7)
memory usage: 45.1+ KB

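Because the birds dataset has no missing values, every column above reports 443 non-null entries. As a minimal sketch of how `info()` exposes gaps, the snippet below builds a small hypothetical `DataFrame` (the name `toy_df` and its values are invented for illustration and are not taken from `birds.csv`) and captures the summary through the `buf` parameter:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data (not from birds.csv): one wingspan is missing.
toy_df = pd.DataFrame({
    "Name": ["Snow goose", "Dickcissel", "Indigo bunting"],
    "MaxWingspan": [165.0, np.nan, 23.0],
})

# info() prints to stdout by default; passing a buffer captures the text instead.
buffer = io.StringIO()
toy_df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

In the printed summary, the `MaxWingspan` column shows a lower non-null count than the number of index entries, which is the signal to look for in a real dataset.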

- **DataFrame.head()**: Next, to inspect the actual content of the `DataFrame`, we use the `head()` method, which returns the first five rows by default. Let's see what the first few rows of `birds_df` look like:

In [10]:
birds_df.head()

| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47.0 | 56.0 | 652.0 | 1020.0 | 76.0 | 94.0 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45.0 | 53.0 | 712.0 | 1050.0 | 85.0 | 93.0 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64.0 | 79.0 | 2050.0 | 4050.0 | 135.0 | 165.0 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64.0 | 1066.0 | 1567.0 | 113.0 | 116.0 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64.0 | 81.0 | 1930.0 | 3310.0 | 130.0 | 165.0 |


- **DataFrame.tail()**: Conversely, to inspect the last rows of the `DataFrame`, we use the `tail()` method, which returns the last five rows by default:

In [11]:
birds_df.tail()

| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 438 | Blue grosbeak | Passerina caerulea | Cardinals/Allies | Passeriformes | Cardinalidae | Passerina | LC | 14.0 | 19.0 | 26.0 | 31.5 | 26.0 | 29.0 |
| 439 | Lazuli bunting | Passerina amoena | Cardinals/Allies | Passeriformes | Cardinalidae | Passerina | LC | 13.0 | 15.0 | 13.0 | 18.0 | 22.0 | 22.0 |
| 440 | Indigo bunting | Passerina cyanea | Cardinals/Allies | Passeriformes | Cardinalidae | Passerina | LC | 11.5 | 15.0 | 11.2 | 21.4 | 18.0 | 23.0 |
| 441 | Painted bunting | Passerina ciris | Cardinals/Allies | Passeriformes | Cardinalidae | Passerina | LC | 12.0 | 14.0 | 13.0 | 19.0 | 21.0 | 23.0 |
| 442 | Dickcissel | Spiza americana | Cardinals/Allies | Passeriformes | Cardinalidae | Spiza | LC | 14.0 | 16.0 | 25.6 | 38.4 | 24.8 | 26.0 |

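Both `head()` and `tail()` accept an optional row count. The sketch below illustrates this on a small hypothetical `DataFrame` (`toy_df` and its placeholder values are assumptions made for the example, not rows from `birds.csv`):

```python
import pandas as pd

# Hypothetical toy data (not from birds.csv): ten placeholder rows.
toy_df = pd.DataFrame({
    "Name": [f"bird_{i}" for i in range(10)],
    "MaxLength": [float(i) for i in range(10)],
})

first_three = toy_df.head(3)   # rows with index 0, 1, 2
last_two = toy_df.tail(2)      # rows with index 8, 9
default_head = toy_df.head()   # no argument: the default of five rows

print(first_three)
print(last_two)
```

Passing a small `n` is handy for wide tables like the birds data, where five full rows already take up a lot of screen space.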

## Dealing with Missing Data
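- **Detecting null values**: In `pandas`, the `isnull()` and `notnull()` methods are the primary tools for detecting null data. Both return Boolean masks over your data, with missing entries (`NaN` or `None`) flagged as `True` by `isnull()`. Summing the mask counts the missing values in each column; here every count is zero: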

In [12]:
null_values = birds_df.isnull().sum()

print(null_values)

Name                  0
ScientificName        0
Category              0
Order                 0
Family                0
Genus                 0
ConservationStatus    0
MinLength             0
MaxLength             0
MinBodyMass           0
MaxBodyMass           0
MinWingspan           0
MaxWingspan           0
dtype: int64

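Since the birds data contains no nulls, the masks themselves are hard to see in action. As a small sketch, assuming a hypothetical `toy_df` with two deliberately missing entries (values invented for illustration), the two methods behave like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from birds.csv) with two missing entries.
toy_df = pd.DataFrame({
    "Name": ["Snow goose", "Dickcissel", None],
    "MinBodyMass": [2050.0, np.nan, 13.0],
})

null_mask = toy_df.isnull()        # True where a value is missing
present_mask = toy_df.notnull()    # element-wise inverse of isnull()

# Count missing values per column, as done for birds_df above.
nulls_per_column = null_mask.sum()

# Select the rows that contain at least one missing value.
rows_with_nulls = toy_df[null_mask.any(axis=1)]

print(nulls_per_column)
print(rows_with_nulls)
```

Reducing the mask with `any(axis=1)` is one way to go from "how many values are missing" to "which rows need attention".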

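- **Dropping null values**: Beyond identifying missing values, pandas provides `dropna()` to remove them from a `Series` or `DataFrame`. For a `DataFrame`, it drops by default every row that contains at least one null value. Because this dataset has no nulls, the result keeps all 443 rows: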

In [14]:
birds_df_cleaned = birds_df.dropna()

print(birds_df_cleaned)

                             Name          ScientificName  \
0    Black-bellied whistling-duck  Dendrocygna autumnalis   
1          Fulvous whistling-duck     Dendrocygna bicolor   
2                      Snow goose      Anser caerulescens   
3                    Ross's goose            Anser rossii   
4     Greater white-fronted goose         Anser albifrons   
..                            ...                     ...   
438                 Blue grosbeak      Passerina caerulea   
439                Lazuli bunting        Passerina amoena   
440                Indigo bunting        Passerina cyanea   
441               Painted bunting         Passerina ciris   
442                    Dickcissel         Spiza americana   

                  Category          Order        Family        Genus  \
0    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
1    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
2    Ducks/Geese/Waterfowl   Anseriformes      Anat

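The default row-wise behavior is only one option. The following sketch, run on a hypothetical `toy_df` with invented values rather than the birds data, contrasts dropping rows, dropping columns, and restricting the check to chosen columns with `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from birds.csv) with gaps in two columns.
toy_df = pd.DataFrame({
    "Name": ["Snow goose", "Dickcissel", "Lazuli bunting"],
    "MinLength": [64.0, np.nan, 13.0],
    "MaxWingspan": [165.0, 26.0, np.nan],
})

# Default: drop every row that has at least one missing value.
rows_dropped = toy_df.dropna()

# Drop every column that has at least one missing value instead.
cols_dropped = toy_df.dropna(axis="columns")

# Only consider MinLength when deciding which rows to drop.
subset_dropped = toy_df.dropna(subset=["MinLength"])

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
```

Each call returns a new object and leaves `toy_df` untouched, which matches the pattern above of assigning the result to `birds_df_cleaned`.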
- **Filling null values**: pandas provides `fillna()`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced by a value of your choosing. Here we fill with `0`; since this dataset has no nulls, the values are unchanged:

In [16]:
birds_df_filled = birds_df.fillna(0)

print(birds_df_filled)

                             Name          ScientificName  \
0    Black-bellied whistling-duck  Dendrocygna autumnalis   
1          Fulvous whistling-duck     Dendrocygna bicolor   
2                      Snow goose      Anser caerulescens   
3                    Ross's goose            Anser rossii   
4     Greater white-fronted goose         Anser albifrons   
..                            ...                     ...   
438                 Blue grosbeak      Passerina caerulea   
439                Lazuli bunting        Passerina amoena   
440                Indigo bunting        Passerina cyanea   
441               Painted bunting         Passerina ciris   
442                    Dickcissel         Spiza americana   

                  Category          Order        Family        Genus  \
0    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
1    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
2    Ducks/Geese/Waterfowl   Anseriformes      Anat

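A single constant such as `0` is not always a sensible replacement, for example for a body mass. As an illustrative sketch on a hypothetical `toy_df` (invented values, and the `"Unknown"` label is an assumption for the example), `fillna()` also accepts a dictionary that maps column names to fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from birds.csv) with one gap per column.
toy_df = pd.DataFrame({
    "ConservationStatus": ["LC", None, "LC"],
    "MaxBodyMass": [4050.0, np.nan, 18.0],
})

# Choose a fill value per column: a label for the text column,
# and the column mean for the numeric column.
fill_values = {
    "ConservationStatus": "Unknown",
    "MaxBodyMass": toy_df["MaxBodyMass"].mean(),
}
filled = toy_df.fillna(fill_values)

print(filled)
```

Whether the mean, the median, or some other value is appropriate depends on the analysis; the point here is only that the replacement can differ per column.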
## Removing Duplicate Data
- **Identifying duplicates: `duplicated`**: You can spot duplicate rows using the `duplicated()` method in pandas, which returns a Boolean mask indicating whether a row in a `DataFrame` is a duplicate of an earlier one. Indexing with that mask selects the duplicate rows; this dataset has none, so the result is empty:

In [18]:
duplicate_rows = birds_df[birds_df.duplicated()]

print(duplicate_rows)

Empty DataFrame
Columns: [Name, ScientificName, Category, Order, Family, Genus, ConservationStatus, MinLength, MaxLength, MinBodyMass, MaxBodyMass, MinWingspan, MaxWingspan]
Index: []

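To see a non-empty mask, here is a minimal sketch on a hypothetical `toy_df` that repeats one row on purpose (the data is invented for illustration). It also shows the `keep` and `subset` parameters of `duplicated()`:

```python
import pandas as pd

# Hypothetical toy data (not from birds.csv): "Snow goose" appears three times.
toy_df = pd.DataFrame({
    "Name": ["Snow goose", "Snow goose", "Ross's goose", "Snow goose"],
    "Genus": ["Anser", "Anser", "Anser", "Anser"],
})

# Default keep="first": later repeats are flagged, the first occurrence is not.
first_kept = toy_df.duplicated()

# keep=False flags every member of a duplicate group.
all_marked = toy_df.duplicated(keep=False)

# subset compares only the listed columns.
by_genus = toy_df.duplicated(subset=["Genus"])

print(first_kept.tolist())   # [False, True, False, True]
print(all_marked.tolist())   # [True, True, False, True]
print(by_genus.tolist())     # [False, True, True, True]
```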

- **Dropping duplicates: `drop_duplicates`**: The `drop_duplicates()` method returns a copy of the data containing only the rows for which `duplicated()` is `False`. This dataset has no duplicate rows, so all 443 rows are kept:

In [19]:
birds_df_cleaned = birds_df.drop_duplicates()

print(birds_df_cleaned)

                             Name          ScientificName  \
0    Black-bellied whistling-duck  Dendrocygna autumnalis   
1          Fulvous whistling-duck     Dendrocygna bicolor   
2                      Snow goose      Anser caerulescens   
3                    Ross's goose            Anser rossii   
4     Greater white-fronted goose         Anser albifrons   
..                            ...                     ...   
438                 Blue grosbeak      Passerina caerulea   
439                Lazuli bunting        Passerina amoena   
440                Indigo bunting        Passerina cyanea   
441               Painted bunting         Passerina ciris   
442                    Dickcissel         Spiza americana   

                  Category          Order        Family        Genus  \
0    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
1    Ducks/Geese/Waterfowl   Anseriformes      Anatidae  Dendrocygna   
2    Ducks/Geese/Waterfowl   Anseriformes      Anat
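
As with `duplicated()`, the effect is easier to see on data that actually contains repeats. The sketch below uses a hypothetical `toy_df` (invented values, not from `birds.csv`) to show the default behavior along with the `keep` and `ignore_index` parameters:

```python
import pandas as pd

# Hypothetical toy data (not from birds.csv): "Snow goose" appears three times.
toy_df = pd.DataFrame({
    "Name": ["Snow goose", "Snow goose", "Ross's goose", "Snow goose"],
    "Genus": ["Anser", "Anser", "Anser", "Anser"],
})

# Default: keep the first occurrence of each row (index 0 and 2 survive).
deduped = toy_df.drop_duplicates()

# keep="last": keep the last occurrence instead (index 2 and 3 survive).
last_kept = toy_df.drop_duplicates(keep="last")

# ignore_index=True relabels the surviving rows 0..n-1.
reindexed = toy_df.drop_duplicates(ignore_index=True)

print(deduped)
print(last_kept)
print(reindexed)
```

The default result is the same as filtering with the inverted mask, `toy_df[~toy_df.duplicated()]`.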