# CDCS Summer School
# A Gentle Introduction to Coding for Data Analysis
## Session 11: Handle with Caution

---------------

### Learning objectives for this session:


At the end of this notebook you will know:

1. What the pandas library is.
2. The basics of the Palmer Penguins dataset.
3. How to load data using pandas.
4. What a dataframe structure is.

--------

## 1. Pandas? I thought we were on about Penguins?

If the heading above made you ask that question, you are absolutely right to. A lot of Python-related packages and features have animal-related names, although in this case the name comes from "panel data" rather than from the animal. Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language. It is widely used in data science, machine learning, and artificial intelligence because of its flexible data structures and data manipulation capabilities.

Key features of pandas include:

- **DataFrames**: A 2-dimensional labeled data structure with columns of potentially different types.
- **Series**: A 1-dimensional labeled array capable of holding any data type.
- **Indexing**: Easy and flexible selection and manipulation of data.
- **Data alignment**: Automatic and explicit data alignment for managing missing data.
- **Group by**: Splitting data into groups based on some criteria.
- **Data wrangling**: Combining, merging, and reshaping data sets.
- **I/O tools**: Tools for loading data from various file formats and databases.
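
To make the first two bullets concrete, here is a minimal sketch of a Series and a DataFrame. The variable names (`bill_lengths`, `small_table`) and the handful of values are only for illustration; they are not the real dataset.

```python
import pandas as pd

# A Series: one labelled column of values (illustrative numbers only).
bill_lengths = pd.Series([39.1, 48.7, 50.0], name="bill_length_mm")

# A DataFrame: several columns, which may hold different types of data.
small_table = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "bill_length_mm": [39.1, 48.7, 50.0],
})

print(type(bill_lengths).__name__)            # Series
print(type(small_table).__name__)             # DataFrame

# Each column of a DataFrame is itself a Series.
print(type(small_table["species"]).__name__)  # Series
```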

Like other packages we have seen, we have to import pandas in each notebook where we want to use it. The convention is to import `pandas` with the 'nickname' `pd`.

In [2]:
# Import the package.
import pandas as pd

Below we start to explore the insides of the pandas package. It's not 100% necessary to understand every single one of the functions within the package, but in your pair try to get a general understanding of the different functions that are now at your disposal.

In [3]:
# Here we define a function that looks inside the package 'pd' (i.e. pandas) and
# prints the name of everything it contains: functions, classes and submodules.
def print_pandas_contents():
    for option in dir(pd):
        print(option)

# Call the function to look.
print_pandas_contents()

ArrowDtype
BooleanDtype
Categorical
CategoricalDtype
CategoricalIndex
DataFrame
DateOffset
DatetimeIndex
DatetimeTZDtype
ExcelFile
ExcelWriter
Flags
Float32Dtype
Float64Dtype
Grouper
HDFStore
Index
IndexSlice
Int16Dtype
Int32Dtype
Int64Dtype
Int8Dtype
Interval
IntervalDtype
IntervalIndex
MultiIndex
NA
NaT
NamedAgg
Period
PeriodDtype
PeriodIndex
RangeIndex
Series
SparseDtype
StringDtype
Timedelta
TimedeltaIndex
Timestamp
UInt16Dtype
UInt32Dtype
UInt64Dtype
UInt8Dtype
__all__
__builtins__
__cached__
__doc__
__docformat__
__file__
__git_version__
__loader__
__name__
__package__
__path__
__spec__
__version__
_built_with_meson
_config
_is_numpy_dev
_libs
_pandas_datetime_CAPI
_pandas_parser_CAPI
_testing
_typing
_version_meson
annotations
api
array
arrays
bdate_range
compat
concat
core
crosstab
cut
date_range
describe_option
errors
eval
factorize
from_dummies
get_dummies
get_option
infer_freq
interval_range
io
isna
isnull
json_normalize
lreshape
melt
merge
merge_asof
merge_ordered
notna
not
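
The listing above mixes the tools you are meant to use with internal names that start with an underscore. As a rough sketch (the variable name `public_names` is just an illustration), you could filter those out and peek at the first few entries instead of printing everything:

```python
import pandas as pd

# Keep only names that do not start with an underscore;
# by Python convention, underscore-prefixed names are internal.
public_names = [name for name in dir(pd) if not name.startswith("_")]

print(len(public_names))   # how many public names this pandas version exposes
print(public_names[:10])   # a first peek rather than the full listing
```

The exact count depends on which version of pandas you have installed.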

-----

## 2. The Palmer Penguins Dataset

As has been alluded to throughout the week so far, we will be using the Palmer Penguins dataset today and tomorrow to look at using data in practice with everything we have done so far. 

We are going to learn how to do a lot of things with the data, but a key step when working with data is to understand it well before you even load it into Python. We are therefore going to take some time to understand what the Palmer Penguins data is.

The Palmer Penguins dataset was collected as part of research conducted by Dr. Kristen Gorman and colleagues at the Palmer Station, a United States research station located on Anvers Island in Antarctica. The data collection was part of the Long Term Ecological Research (LTER) program, which aims to study the long-term effects of climate change and ecological processes in the region. The primary goal of the research was to understand the foraging ecology and reproductive success of the three penguin species. The measurements were taken during the breeding seasons to monitor changes in the physical characteristics and population dynamics of the penguins.

Key publications related to the Palmer Penguins dataset include:

Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLOS ONE, 9(3), e90081.
- This paper explores the ecological differences between male and female penguins and how environmental variability affects these differences.

Gorman, K. B., Williams, T. D., & Fraser, W. R. (2010). Post-fledging survival of Adélie penguins at Palmer Station, Antarctica. Marine Ecology Progress Series, 405, 273-285.
- This paper focuses on the survival rates of Adélie penguins after they leave the nest and how various factors influence their chances of survival.

### Detailed Description of Each Feature

1. Species

The dataset includes three species of penguins: Adelie, Chinstrap, and Gentoo. Each species has distinct physical characteristics and behaviors.


2. Island

Penguins in the dataset were observed on three different islands: Biscoe, Dream, and Torgersen. These islands are part of the Palmer Archipelago in Antarctica.


3. Bill Length (mm)

This feature measures the length of a penguin's bill in millimeters. Bill length can vary significantly between species and is an important characteristic for identifying species.


4. Bill Depth (mm)

This feature measures the depth of a penguin's bill in millimeters. Bill depth, like bill length, is an important characteristic for species identification.


5. Flipper Length (mm)

This feature measures the length of a penguin's flipper in millimeters. Flipper length is related to a penguin's swimming ability and varies between species.


6. Body Mass (g)

This feature measures the body mass of a penguin in grams. Body mass can provide insights into the health and nutrition of the penguins.


7. Sex

This feature indicates the sex of the penguin, which can be either male or female. Understanding the sex distribution in the dataset can help in studying sex-related differences.


8. Year

This feature indicates the year when the observation was made. Analyzing data over multiple years can help identify trends and changes in penguin populations over time.

-----

## 3. Loading in the data with and without pandas.

As you will find with most programming languages, there are many ways to load data. Luckily, pandas makes this quite streamlined.

Loading data efficiently is a critical step in any data analysis process. While Python's built-in functions can handle basic data loading tasks, the pandas library offers a more powerful and flexible approach. In this section, we will explore different ways to load data using both built-in functions and pandas, highlighting the advantages of using pandas.

Python's built-in open function, together with csv.reader from the standard-library csv module, can be used to load data from files. Let's see how we can load a CSV file this way.

In [None]:
# Using built-in functions to load a CSV file
import csv

file_path = 'data/palmer_penguins.csv'

# Method 1: Using csv.reader
# (the csv module documentation recommends opening files with newline='')
with open(file_path, mode='r', newline='') as file:
    csv_reader = csv.reader(file)
    data = [row for row in csv_reader]

# Display the first few rows of the dataset
for row in data[:5]:
    print(row)

Pandas provides a much simpler and more efficient way to load data from various sources. Here, we will demonstrate how to load a CSV file using the pd.read_csv function.

In [None]:
# Using pandas to load a CSV file
import pandas as pd

file_path = 'data/palmer_penguins.csv'

# Method 2: Using pd.read_csv
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

### Comparison of Methods

Let's compare the two methods in terms of simplicity, readability, and functionality.

- Simplicity: The pandas method is more concise and requires fewer lines of code.
- Readability: The pandas method is easier to read and understand, especially for large datasets.
- Functionality: Pandas provides additional features such as handling missing values, parsing dates, and setting column names.
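
One difference that is easy to miss is what each method does with types. The sketch below uses a tiny in-memory CSV (via `io.StringIO`) in place of the real file, so the two rows are illustrative only. It suggests that `csv.reader` returns every cell as text, while `pd.read_csv` turns the header into column names and parses numeric columns as numbers.

```python
import csv
import io

import pandas as pd

# A tiny in-memory CSV standing in for the real file (illustrative rows only).
csv_text = "species,bill_length_mm,year\nAdelie,39.1,2007\nGentoo,50.0,2009\n"

# csv.reader: every cell comes back as a string, and the header is just row 0.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])            # ['Adelie', '39.1', '2007']
print(type(rows[1][1]))   # <class 'str'>

# pd.read_csv: the header becomes column names and numeric columns are parsed.
frame = pd.read_csv(io.StringIO(csv_text))
print(frame.dtypes)
print(frame["bill_length_mm"].mean())  # arithmetic works straight away
```

With the built-in approach you would have to convert `'39.1'` to a number yourself before doing any calculations.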

### Handling Different Data Formats

Pandas can handle various data formats such as CSV, Excel, JSON, and SQL databases. Let's look at how to load data from these different formats using pandas.

In [None]:
# Loading an Excel file using pandas
file_path = 'data/palmer_penguins.xlsx'

# Method 3: Using pd.read_excel
data = pd.read_excel(file_path)

# Display the first few rows of the dataset
data.head()

In [None]:
# Loading a JSON file using pandas
file_path = 'data/palmer_penguins.json'

# Method 4: Using pd.read_json
# lines=True assumes the file is in JSON Lines format (one object per line)
data = pd.read_json(file_path, lines=True)

# Display the first few rows of the dataset
data.head()
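
The `lines=True` argument in the cell above is worth a closer look. A minimal sketch, assuming the file is stored as JSON Lines; whether the course's JSON file actually uses that layout is an assumption, and the two records below are illustrative only:

```python
import io

import pandas as pd

# Two illustrative records written as JSON Lines: one JSON object per line.
json_lines_text = (
    '{"species": "Adelie", "island": "Torgersen", "year": 2007}\n'
    '{"species": "Gentoo", "island": "Biscoe", "year": 2009}\n'
)

# lines=True makes pandas parse each line as a separate record (row).
records = pd.read_json(io.StringIO(json_lines_text), lines=True)
print(records)

# The same records stored as one JSON array are read without lines=True.
json_array_text = '[{"species": "Adelie"}, {"species": "Gentoo"}]'
array_records = pd.read_json(io.StringIO(json_array_text))
print(array_records)
```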

In [None]:
# Loading data from a SQL database using pandas
import sqlite3

db_path = 'data/palmer_penguins.db'
query = 'SELECT * FROM penguins'

# Method 5: Using pd.read_sql_query
conn = sqlite3.connect(db_path)
data = pd.read_sql_query(query, conn)
conn.close()  # close the connection once the data is loaded

# Display the first few rows of the dataset
data.head()
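
If you do not have a database file to hand, the sketch below shows the same idea with a temporary in-memory SQLite database. The rows and the `body_mass_g > 3760` filter are made up for illustration; the table name `penguins` simply mirrors the query used above.

```python
import sqlite3

import pandas as pd

# Illustrative rows only; not the real dataset.
sample = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "body_mass_g": [3750, 3800, 5000],
})

# ':memory:' creates a temporary database that lives only in RAM.
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame into a table, then read part of it back with SQL.
    sample.to_sql("penguins", conn, index=False)
    heavy = pd.read_sql_query(
        "SELECT * FROM penguins WHERE body_mass_g > 3760", conn
    )
finally:
    conn.close()  # always release the connection, even if a step fails

print(heavy)
```

One advantage of SQL sources is that the filtering happens in the database, so only the rows you ask for are loaded into the DataFrame.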

### Error Handling in Data Loading

Things can go wrong when loading data, for example when a file is missing or in the wrong format. Pandas reports these problems by raising exceptions, which you can catch with a try/except block, as covered in session 10. A try/except block is not a loop, although you may want to combine it with the loops you already know, for example when loading several files. These can sometimes be a little fiddly, so definitely look more at this after the summer school.

In [None]:
# Error handling during data loading with pandas
file_path = 'path/to/your/nonexistent_file.csv'

# Method 6: Using try-except block
try:
    data = pd.read_csv(file_path)
    print(data.head())
except FileNotFoundError:
    print(f"The file at {file_path} was not found.")
except pd.errors.ParserError:
    print(f"There was a parsing error while reading the file at {file_path}.")
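
If you load files in several places, you may prefer to wrap this pattern in a function. The helper below, `load_csv_or_none`, is a hypothetical name invented for this sketch and is not part of pandas. It returns `None` when loading fails, so the calling code can decide what to do next.

```python
import pandas as pd


def load_csv_or_none(file_path):
    """Return a DataFrame, or None if the file cannot be loaded.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"The file at {file_path} was not found.")
    except pd.errors.EmptyDataError:
        print(f"The file at {file_path} is empty.")
    except pd.errors.ParserError:
        print(f"The file at {file_path} could not be parsed as CSV.")
    return None


result = load_csv_or_none("path/to/your/nonexistent_file.csv")
print(result is None)  # True when the file does not exist
```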

-----

## 4. What is a dataframe structure?

A pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (i.e. it has rows and columns). It's similar to a spreadsheet or SQL table and is one of the most commonly used data structures for data manipulation in pandas.

In [None]:
file_path = 'data/palmer_penguins.csv'
penguins = pd.read_csv(file_path)
penguins.head()

Each column in a DataFrame has a specific data type. You can view the data types of all columns using the dtypes attribute.

In [None]:
# Viewing data types of each column
penguins.dtypes

You can use various methods to explore the structure of a DataFrame, such as shape, columns, and index.

In [None]:
# Viewing the shape of the DataFrame (number of rows and columns)
penguins.shape

In [None]:
# Viewing the column names
penguins.columns

In [None]:
# Viewing the index (row labels)
penguins.index
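
To see what these three attributes return without loading the full dataset, here is a small sketch on a three-row DataFrame. The name `tiny` and its values are illustrative only.

```python
import pandas as pd

# Three illustrative rows with three columns.
tiny = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "flipper_length_mm": [181, 195, 210],
    "body_mass_g": [3750, 3800, 5000],
})

n_rows, n_cols = tiny.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)         # 3 3
print(list(tiny.columns))     # ['species', 'flipper_length_mm', 'body_mass_g']
print(list(tiny.index))       # [0, 1, 2]
print(len(tiny))              # 3, the number of rows
```

Note that `shape`, `columns`, and `index` are attributes, so they are written without brackets.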

The info method provides a concise summary of the DataFrame, including the number of non-null entries and the data type of each column. In the next session we will look in much more detail at how to get summaries and overviews.

In [None]:
# Viewing concise summary of the DataFrame
penguins.info()

### Creating a DataFrame from Scratch

Let's create a small DataFrame from scratch with some penguin data to illustrate how to combine data structures.

In [None]:
# Creating a small DataFrame from scratch
data = {
    'species': ['Adelie', 'Chinstrap', 'Gentoo'],
    'island': ['Torgersen', 'Dream', 'Biscoe'],
    'bill_length_mm': [39.1, 48.7, 50.0],
    'bill_depth_mm': [18.7, 17.4, 15.3],
    'flipper_length_mm': [181, 195, 210],
    'body_mass_g': [3750, 3800, 5000],
    'sex': ['male', 'female', 'male'],
    'year': [2007, 2008, 2009]
}

new_penguins = pd.DataFrame(data)
new_penguins

In [None]:
# Concatenating DataFrames
combined_penguins = pd.concat([penguins, new_penguins], ignore_index=True)
combined_penguins

In [None]:
# Viewing data types of the combined DataFrame
combined_penguins.dtypes

In [None]:
# Changing the data type of the 'year' column to string
combined_penguins['year'] = combined_penguins['year'].astype(str)
combined_penguins.dtypes
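
Changing a data type changes what you can do with a column. The sketch below uses a stand-alone Series called `years` (an illustrative name and values) to show how `+` behaves before and after `astype(str)`, and how `astype(int)` converts the text back.

```python
import pandas as pd

years = pd.Series([2007, 2008, 2009], name="year")

# As integers, adding 1 moves each year forward.
print((years + 1).tolist())      # [2008, 2009, 2010]

# After astype(str), each value is text, so '+' joins strings instead.
years_as_text = years.astype(str)
labelled = "Year " + years_as_text
print(labelled.tolist())         # ['Year 2007', 'Year 2008', 'Year 2009']

# astype(int) converts the text back to whole numbers.
years_again = years_as_text.astype(int)
print(years_again.tolist())      # [2007, 2008, 2009]
```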

### Accessing Specific Columns in a DataFrame
Understanding how to access and manipulate specific columns in a DataFrame is crucial for data analysis. Here, we will cover various methods to select and work with columns in a DataFrame.

You can select a single column by using the column name as a key. This will return a Series.

In [None]:
# Selecting a single column
species = penguins['species']
species.head()

To select multiple columns, you can pass a list of column names. This will return a DataFrame with the selected columns.

In [None]:
# Selecting multiple columns
subset = penguins[['species', 'island', 'bill_length_mm']]
subset.head()
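
The difference between single and double square brackets is a common source of confusion, so here is a small sketch on an illustrative three-row DataFrame (`tiny` is a made-up name). Single brackets with one name return a Series; a list of names, even a list with only one name, returns a DataFrame.

```python
import pandas as pd

tiny = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "bill_length_mm": [39.1, 48.7, 50.0],
})

one_column = tiny["species"]                # single brackets -> Series
one_column_frame = tiny[["species"]]        # list with one name -> DataFrame
two_columns = tiny[["species", "island"]]   # list of names -> DataFrame

print(type(one_column).__name__)        # Series
print(type(one_column_frame).__name__)  # DataFrame
print(two_columns.shape)                # (3, 2)
```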

------

## ⭐️⭐️⭐️💥 What you learned in this session: Three stars and a wish.
**In your own words** write in the Markdown cell below:

- 3 things you would like to remember from this notebook.
- 1 thing you wish to understand better in the future or a question you'd like to ask.

*Add your reflections here.*

--------------

## Topic Overview

In [None]:
import pandas as pd

# Load the Palmer Penguins dataset
file_path = 'data/palmer_penguins.csv'
penguins = pd.read_csv(file_path)
penguins.head()

In [None]:
# Viewing data types of each column
penguins.dtypes

In [None]:
# Creating a small DataFrame from scratch
new_data = {
    'species': ['Adelie', 'Chinstrap', 'Gentoo'],
    'island': ['Torgersen', 'Dream', 'Biscoe'],
    'bill_length_mm': [39.1, 48.7, 50.0],
    'bill_depth_mm': [18.7, 17.4, 15.3],
    'flipper_length_mm': [181, 195, 210],
    'body_mass_g': [3750, 3800, 5000],
    'sex': ['male', 'female', 'male'],
    'year': [2007, 2008, 2009]
}

new_penguins = pd.DataFrame(new_data)
new_penguins

In [None]:
# Concatenating the new DataFrame with a subset of the Palmer Penguins dataset
subset_penguins = penguins.head(3)  # Taking a subset of the first 3 rows
combined_penguins = pd.concat([subset_penguins, new_penguins], ignore_index=True)
combined_penguins

In [None]:
# Selecting a single column
species = penguins['species']
species.head()

-----------

# ⛏ Exercise: Print the penguins.

Write a function print_records to print each record from the Palmer Penguins dataset row by row. Use a loop to iterate through each row and print the records in a formatted string.

Instructions:

1. Load the Palmer Penguins dataset.
2. Define a function print_records that takes a DataFrame as an argument.
3. Inside the function, use a loop to iterate through each row of the DataFrame.
4. Print each record in a formatted string.

In [None]:
# try to solve the task here

# ⛏ Exercise: Formatting the Year Using the datetime Package

Write a function format_year to format the year column using the datetime package. Convert each year into a date representing January 1st of that year and add the result as a new column formatted_year.

Instructions:

1. Load the Palmer Penguins dataset.
2. Convert the year column to a string if it is not already.
3. Define a function format_year that takes a DataFrame as an argument.
4. Inside the function, use the datetime package to format the year column to include a specific date (e.g., January 1st of each year).
5. Add a new column formatted_year with the formatted date.

In [None]:
# try to solve the task here