<a href="https://colab.research.google.com/github/Skidmark156/username-DataScience-2025/blob/main/completed/_03_loading_and_exploring_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📂 Notebook 03: Loading and Exploring Data

Welcome to the real world, where data lives in messy CSVs and your job is to make sense of it.

In this notebook, you’ll:
- Load a CSV file using `pd.read_csv()`
- Use `.head()`, `.tail()`, `.info()`, and `.describe()` to explore your data
- Identify potential issues (missing values, bad types, oddball rows)

Let’s get our hands dirty.

---

In [None]:
import pandas as pd

## 📥 Load a CSV

The cell below loads one of seaborn's built-in datasets so that it runs without any local files. To work with your own data, swap that line for `pd.read_csv()` with a CSV path or URL.

In [None]:
# Example with seaborn's Penguins dataset (the one shown in the outputs below)
import seaborn as sns
df = sns.load_dataset("penguins")
df.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
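
The cell above uses a built-in dataset. For a CSV of your own, a minimal sketch of `pd.read_csv()` could look like the following. The CSV text and the names `csv_text` and `csv_df` are made up for illustration; with a real file you would pass its path or URL instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk or at a URL.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,
Gentoo,Biscoe,5400
"""

# pd.read_csv accepts a path, a URL, or any file-like object.
# With a real file this would be something like pd.read_csv("my_data.csv").
csv_df = pd.read_csv(io.StringIO(csv_text))

print(csv_df.head())
print(csv_df.dtypes)  # the empty field becomes NaN, so body_mass_g is read as float
```

Checking `dtypes` right after loading is a quick way to spot a numeric column that was read as text.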


## 📊 Peek at the Data

In [None]:
# First and last few rows
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

First 5 rows:
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  

Last 5 rows:
    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
339  Gentoo  Biscoe             NaN            NaN                NaN   
340  Gentoo  Biscoe            46.8           14.3              215.0   
341  Gentoo  Biscoe            50.4           15.7              222.0   
342  Gentoo  Biscoe            45.2           14.8              212.0   
343  Gentoo 

## 🧠 Understand Structure

In [None]:
# Dimensions and columns
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

Shape: (344, 7)

Columns: ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']
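
Beyond shape and column names, it often helps to split columns by type. The sketch below uses a tiny made-up frame (`demo`, not the dataset loaded above) to show `select_dtypes`, which is one way to do that.

```python
import pandas as pd

# Tiny made-up frame for illustration only.
demo = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.8],
    "body_mass_g": [3750, 5000],
})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple

# Separate numeric columns from everything else.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
```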


## 🧼 Data Types & Missing Values

In [None]:
# Data types and null counts
print("\nInfo:")
df.info()

# How many missing values per column?
print("\nMissing values per column:")
print(df.isnull().sum())


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

Missing values per column:
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64
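
Raw counts are easier to judge as percentages. Here is a small sketch on a made-up frame (`demo` is hypothetical) that ranks columns by share of missing values and keeps only the complete rows.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps.
demo = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["Male", None, "Female", None],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_pct = demo.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# dropna() with no arguments keeps only rows that have no missing values.
complete_rows = demo.dropna()
print(len(complete_rows), "complete rows out of", len(demo))
```

Whether dropping rows is appropriate depends on the analysis; this only shows how to measure the gap before deciding.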


## 📈 Summary Statistics

In [None]:
df.describe(include="all")
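
To see what `include="all"` changes, the sketch below compares it with the default on a tiny made-up frame (`demo` is hypothetical). By default `describe()` summarizes numeric columns only; with `include="all"` text columns get `count`, `unique`, `top`, and `freq` rows as well.

```python
import pandas as pd

# Made-up data: one text column, one numeric column.
demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, 5000.0],
})

numeric_summary = demo.describe()            # numeric columns only
full_summary = demo.describe(include="all")  # adds unique/top/freq for text columns

# For a single text column, value_counts() gives the full frequency table.
species_counts = demo["species"].value_counts()

print(numeric_summary)
print(full_summary)
print(species_counts)
```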

## ✏️ Renaming Columns (Optional but fun)

In [None]:
# Rename 'sex' to 'gender' for clarity
df.rename(columns={"sex": "gender"}, inplace=True)
df.head(2)

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  gender
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
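
An alternative to `inplace=True` is to assign the result of `rename`, which leaves the original frame untouched. The sketch below uses a made-up frame and a hypothetical clean-up rule (lowercase, underscores, no parentheses) to show both a dictionary and a function as the `columns` argument.

```python
import pandas as pd

# Made-up frame with one awkward column name.
demo = pd.DataFrame({"sex": ["Male", "Female"], "Body Mass (g)": [3750.0, 3800.0]})

# rename returns a new DataFrame; demo keeps its original column names.
renamed = demo.rename(columns={"sex": "gender"})

# A function is applied to every column name.
def tidy_name(name):
    return name.lower().replace(" ", "_").replace("(", "").replace(")", "")

tidy = renamed.rename(columns=tidy_name)

print(demo.columns.tolist())
print(renamed.columns.tolist())
print(tidy.columns.tolist())
```

Note that by default `rename` silently ignores keys that match no column, so a typo in the old name produces no error and no change.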


---
## 🔍 Your Turn

1. Load a dataset of your choice (`pd.read_csv()` or `sns.load_dataset()`).
2. Print the first and last 5 rows.
3. Show `.info()` and `.describe()` results.
4. Print the column names. Rename one of them.

🎯 **Bonus:** What percentage of rows have *any* missing values?

```python
# HINT:
df.isnull().any(axis=1).mean() * 100  # percent of rows with any NaNs
```
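
To see why the hint works, here is the same expression taken apart on a tiny made-up frame (`demo` is hypothetical): `any(axis=1)` yields one boolean per row, and the mean of booleans is the fraction that are `True`.

```python
import numpy as np
import pandas as pd

# Made-up data: rows 1 and 2 each contain one missing value.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})

row_has_nan = demo.isnull().any(axis=1)      # one True/False per row
pct_rows_missing = row_has_nan.mean() * 100  # fraction of True values, as a percent

print(row_has_nan.tolist())  # [False, True, True, False]
print(pct_rows_missing)      # 50.0
```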


In [None]:
# Step 1: Import libraries
import pandas as pd
import seaborn as sns

# Step 2: Load a dataset (using Seaborn's "flights" dataset)
df = sns.load_dataset("flights")

# Step 3: Peek at the data
print("First 5 rows of the flights dataset:")
print(df.head())

print("\nLast 5 rows of the flights dataset:")
print(df.tail())

# Step 4: Basic structure info
print("\nShape of DataFrame:", df.shape)
print("\nColumn names:", df.columns.tolist())

print("\n--- DataFrame Info ---")
df.info()

# Step 5: Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Step 6: Summary statistics
print("\n--- Summary Statistics ---")
print(df.describe(include="all"))

# Step 7: Rename a column
df.rename(columns={"passengers": "num_passengers"}, inplace=True)
print("\nColumns after renaming:")
print(df.columns.tolist())
print(df.head(2))

# Bonus: What % of rows have any missing values?
missing_percent = df.isnull().any(axis=1).mean() * 100
print(f"\nPercentage of rows with missing values: {missing_percent:.2f}%")

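# Step 8 (optional): find the busiest month across all years
# "month" is categorical, so observed is set explicitly; newer pandas versions warn when it is left unset
busiest = df.groupby("month", observed=True)["num_passengers"].mean().sort_values(ascending=False)
print("\nAverage passengers by month (descending):")
print(busiest)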



First 5 rows of the flights dataset:
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121

Last 5 rows of the flights dataset:
     year month  passengers
139  1960   Aug         606
140  1960   Sep         508
141  1960   Oct         461
142  1960   Nov         390
143  1960   Dec         432

Shape of DataFrame: (144, 3)

Column names: ['year', 'month', 'passengers']

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   year        144 non-null    int64   
 1   month       144 non-null    category
 2   passengers  144 non-null    int64   
dtypes: category(1), int64(2)
memory usage: 2.9 KB

Missing values per column:
year          0
month         0
passengers    0
dtype: int64

--- Summary Statistics ---
               year


---
## 🎓 Why This Matters

Every data science project begins here: importing and inspecting data. If your data is a mess (spoiler: it always is), you need to know how to check it, clean it, and prep it.

Next up: slicing and dicing with `.loc[]`, `.iloc[]`, and boolean masks to select what you want and ignore what you don't.