# Penguin Dataset Analysis

This notebook loads the penguins dataset from a Parquet file stored in S3 and runs a basic exploratory data analysis.

In [2]:
import boto3
import pandas as pd
import matplotlib.pyplot as plt
from io import BytesIO

In [3]:
def read_s3_parquet(bucket='985803916100-proposal-processed',
                    key='penguins.parquet',
                    profile_name='ros-sandbox'):
    """Download a parquet file from S3 and return it as a pandas DataFrame.

    The defaults point at the penguins dataset used in this notebook.
    pd.read_parquet needs a parquet engine (pyarrow or fastparquet) installed.
    """
    # Create a session using the named AWS profile
    session = boto3.Session(profile_name=profile_name)

    # Create an S3 client using the session
    s3_client = session.client('s3')

    try:
        # Get the parquet file from S3
        response = s3_client.get_object(Bucket=bucket, Key=key)

        # Load the whole object into an in-memory buffer
        parquet_buffer = BytesIO(response['Body'].read())

        # Read the parquet file into a pandas DataFrame using the buffer
        return pd.read_parquet(parquet_buffer)

    except Exception as e:
        # Report the failure, then re-raise so the notebook stops here
        print(f"Error reading parquet file: {e}")
        raise
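
The loader above needs AWS credentials and network access, which makes it awkward to exercise in isolation. One possible approach, sketched below, is to pass the client in as an argument so a stand-in can be substituted. The names `fetch_object_bytes`, `FakeS3Client`, and `example-bucket` are hypothetical and not part of this notebook; the fake only mimics the `get_object(Bucket=..., Key=...)` call shape used above.

```python
from io import BytesIO


def fetch_object_bytes(s3_client, bucket, key):
    """Return the raw bytes of one S3 object (hypothetical helper)."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


class FakeS3Client:
    """Stand-in for a boto3 S3 client that serves objects from a dict."""

    def __init__(self, objects):
        # Maps (bucket, key) -> bytes
        self._objects = objects

    def get_object(self, Bucket, Key):
        # Mirror the part of the real response that the helper reads
        return {"Body": BytesIO(self._objects[(Bucket, Key)])}


# Placeholder payload: these four bytes are not a valid parquet file
fake = FakeS3Client({("example-bucket", "penguins.parquet"): b"PAR1"})
payload = fetch_object_bytes(fake, "example-bucket", "penguins.parquet")
print(payload)
```

With a real client, the returned bytes could be wrapped in `BytesIO` and handed to `pd.read_parquet`, as the function above does.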

In [4]:
# Read the data
df = read_s3_parquet()

# Display the first few rows
print("First few rows of the penguins dataset:")
df.head()

First few rows of the penguins dataset:


| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | | | | | | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
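
The row at index 3 in the output above has no measurements and no recorded sex. A minimal sketch of how such rows could be located with `isna()` follows. It rebuilds only the five rows shown, with the empty cells entered as `None`, so the counts apply to this sample and not to the full dataset; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the output above; empty cells are entered as None
sample = pd.DataFrame(
    {
        "rowid": [1, 2, 3, 4, 5],
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, None, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, None, 193.0],
        "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0],
        "sex": ["male", "female", "female", None, "female"],
        "year": [2007] * 5,
    }
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Rows that have at least one missing value
incomplete_rows = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(incomplete_rows["rowid"].tolist())
```

The same two expressions can be applied to `df` to find incomplete rows across the whole dataset.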


## Basic Data Exploration

In [5]:
# Display basic information about the dataset
print("Dataset Info:")
df.info()

print("\nBasic Statistics:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB

Basic Statistics:


| | rowid | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|---|
| count | 344.0 | 342.0 | 342.0 | 342.0 | 342.0 | 344.0 |
| mean | 172.5 | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | 2008.02907 |
| std | 99.448479 | 5.459584 | 1.974793 | 14.061714 | 801.954536 | 0.818356 |
| min | 1.0 | 32.1 | 13.1 | 172.0 | 2700.0 | 2007.0 |
| 25% | 86.75 | 39.225 | 15.6 | 190.0 | 3550.0 | 2007.0 |
| 50% | 172.5 | 44.45 | 17.3 | 197.0 | 4050.0 | 2008.0 |
| 75% | 258.25 | 48.5 | 18.7 | 213.0 | 4750.0 | 2009.0 |
| max | 344.0 | 59.6 | 21.5 | 231.0 | 6300.0 | 2009.0 |
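
The two summaries above can be read a little further with plain arithmetic. The sketch below copies numbers from the `df.info()` and `df.describe()` outputs, derives the missing-value count per column, and checks the `body_mass_g` range against Tukey's 1.5 × IQR fences. The fence rule is a common convention chosen here for illustration, not something the notebook itself applies.

```python
# Values copied from the df.info() output above
total_rows = 344
non_null = {
    "rowid": 344,
    "species": 344,
    "island": 344,
    "bill_length_mm": 342,
    "bill_depth_mm": 342,
    "flipper_length_mm": 342,
    "body_mass_g": 342,
    "sex": 333,
    "year": 344,
}

# Missing values per column = total rows minus non-null count
missing = {
    col: total_rows - count
    for col, count in non_null.items()
    if count < total_rows
}
print(missing)

# Quartiles, min, and max for body_mass_g copied from df.describe() above
q1, q3 = 3550.0, 4750.0
observed_min, observed_max = 2700.0, 6300.0

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
has_outliers = observed_min < lower_fence or observed_max > upper_fence
print(iqr, lower_fence, upper_fence, has_outliers)
```

Under these numbers, the four measurement columns each lack 2 values and `sex` lacks 11, and the observed body mass range falls inside the fences. Note also that `rowid` and `year` appear in the statistics table only because they are numeric; their means and standard deviations carry little meaning.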
