# Workshop Week 5: Introduction to Pandas

This week, we are starting to explore **pandas**, the most popular Python library for data analysis. Pandas provides powerful, easy-to-use data structures and data analysis tools that are essential for any data analyst.

## Why Pandas?

Pandas is built on top of another library called **NumPy**. We will look at NumPy in more detail later in the course, but for now, all you need to know is that NumPy provides the high-performance arrays that pandas uses to store data. This is a large part of why pandas stays fast and efficient, even with large datasets.

## The DataFrame: Your Data's Home

The most important data structure in pandas is the **DataFrame**. A DataFrame is a two-dimensional table of data with rows and columns, much like a spreadsheet. Here are the key organisational principles:

- **Rows represent observations:** Each row is a single data record. For example, in a dataset of students, each row would represent one student.
- **Columns represent variables/features:** Each column is a specific piece of information about the observations. For our student dataset, columns might include `name`, `age`, and `grade`.
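
As a small illustration of this layout, the sketch below builds a DataFrame by hand. The student names and values are made up for the example and are not part of the workshop dataset.

```python
import pandas as pd

# Hypothetical student records: each dictionary key becomes a column,
# and each position in the lists becomes a row.
students = pd.DataFrame({
    "name": ["Ada", "Ben", "Chloe"],
    "age": [19, 21, 20],
    "grade": [72.5, 64.0, 88.0],
})

print(students)
print(students.shape)  # (3, 3): three observations, three variables
```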

### Data Types in a DataFrame

A key rule in pandas is that **each column has a single data type** (its *dtype*). For example, a column for `age` would normally hold only numbers, and a column for `name` only text. If a column mixes types, pandas stores it under the generic `object` dtype instead. Keeping one type per column matters for two reasons:

1. **Data Consistency:** It ensures that all the data in a column is of the same type, which prevents errors in your analysis.
2. **Performance:** It allows pandas to use efficient, specialised operations for each data type.
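
To see what this looks like in practice, here is a minimal sketch using made-up values. The column names `age` and `score` are assumptions for the example only. It suggests that a purely numeric column gets a numeric dtype, while a column mixing numbers and text is still given one dtype, the catch-all `object`.

```python
import pandas as pd

# Hypothetical data: 'age' holds only integers, 'score' mixes numbers and text.
df_types = pd.DataFrame({
    "age": [19, 21, 20],
    "score": [4, "TBC", 3.5],
})

# 'age' is stored with an integer dtype; 'score' falls back to 'object'
print(df_types.dtypes)
```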

## Loading a Dataset

Let's get our hands dirty with some data. We'll be using a small artificial employee dataset, stored in `employee_data.csv`. It was created for this workshop and contains the kinds of problems you meet in real-world data, such as missing values and an impossible outlier.

In [None]:
import pandas as pd
import numpy as np

# Load the artificial dataset we created
df = pd.read_csv("employee_data.csv")

# Display the entire DataFrame to see our data
df

### Attributes vs Methods in pandas

In pandas, **DataFrames** have *attributes* and *methods*.  
Both help you understand or work with your data, but they behave differently:

- **Attributes** describe something *about* the DataFrame — for example, its size or column names.  
  You access them **without parentheses** because they simply *store information*.

  `df.shape`     → (number of rows, number of columns)  
  `df.columns`   → column names  
  `df.dtypes`    → data type of each column  

- **Methods** perform an *action* or *operation* on the DataFrame.  
  You call them **with parentheses**, sometimes including extra options.

  `df.head()`       → shows the first 5 rows  
  `df.describe()`   → summary statistics  

Let's start by exploring the DataFrame.

In [None]:
# Get the dimensions of the DataFrame (rows, columns)
df.shape

In [None]:
# Get the names of the columns
df.columns

In [None]:
# Get a concise summary of the DataFrame
df.info()

## Summary Statistics

Pandas provides powerful tools for quickly summarising your data. The `describe()` method is a great way to get an overview of the numerical columns in your DataFrame.

In [None]:
# Get summary statistics for numerical columns
df.describe()

### Interpreting the Summary Statistics

The `describe()` output gives us a great first look at our numerical data. Let's break down what each row means:

- **count:** The number of non-missing (non-NaN) values. Notice that `age` and `salary` have fewer than 20, which tells us there are missing values.
- **mean:** The average value. For `age`, the mean is very high because of the outlier (999). This shows how the mean is sensitive to extreme values.
- **std (Standard Deviation):** A measure of how spread out the data is. A high standard deviation (like in `salary`) means the values are widely distributed.
- **min:** The smallest value in the column.
- **25% (1st Quartile):** 25% of the data points are smaller than this value.
- **50% (Median):** The middle value of the dataset. 50% of the data is below this value, and 50% is above. The median is often a better measure of central tendency than the mean when there are outliers.
- **75% (3rd Quartile):** 75% of the data points are smaller than this value.
- **max:** The largest value in the column. The `max` for `age` (999) is clearly a data quality issue.
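
The difference between the mean and the median can be checked on a tiny example. The ages below are invented for illustration; only the outlier value 999 mirrors the issue described above.

```python
import pandas as pd

# Hypothetical ages, including one obviously wrong entry (999).
ages = pd.Series([25, 31, 38, 42, 999])

print(ages.mean())    # 227.0, pulled upwards by the outlier
print(ages.median())  # 38.0, the middle value, not affected by how extreme 999 is
```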

## Handling Missing Data

Real-world data is often messy and contains missing values. Pandas represents missing values as `NaN` (Not a Number). It's crucial to identify and handle these.

### Checking for Missing Values

We can use the `.isnull()` method, which returns a DataFrame of boolean values, followed by `.sum()` to count the number of missing values in each column.

In [None]:
# Count missing values in each column
df.isnull().sum()

As we saw in the `describe()` output, `age` and `salary` each have one missing value. `performance_score` also has a missing value.

### How Pandas Handles NaNs in Calculations

By default, pandas **ignores** `NaN` values when performing calculations like `mean()`, `sum()`, etc. This is why the `count` in `describe()` was lower for columns with missing data—pandas simply doesn't include them in the calculation, which is usually the desired behavior.
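
A short sketch of this behaviour, using made-up salary values. The `skipna` parameter is the pandas option that controls whether missing values are skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value.
salaries = pd.Series([30000, np.nan, 50000])

print(salaries.count())             # 2: only non-missing values are counted
print(salaries.mean())              # 40000.0: the NaN is skipped
print(salaries.mean(skipna=False))  # nan: the missing value propagates
```

Passing `skipna=False` is rarely what you want for a summary, but it can be a handy check that a column really has no gaps.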

## Exploring Columns

Let's dive deeper into individual columns to understand their characteristics.

### Value Counts

The `.value_counts()` method is great for categorical data. It shows how many times each unique value appears in a column.

In [None]:
# Get the value counts for the 'department' column
df['department'].value_counts()

### Unique Values

You can get the number of unique values with `.nunique()`, or an array of the unique values themselves with `.unique()`.

In [None]:
unique_depts = df['department'].nunique()

print(f"Number of unique departments: {unique_depts}")

### Data Type Inference and Casting

Pandas tries to infer the data type of each column when you load data. However, sometimes the result is not what you want, especially if a column contains mixed types. For example, our `performance_score` column contains numbers alongside the strings 'TBC' and 'Full score', so pandas has classified the whole column as `object` (which usually means string).

We can fix this by first replacing 'TBC' with a proper missing value (`np.nan`) and 'Full score' with the number 5, and then casting the column to a numeric type using `.astype()`.

In [None]:
# First, check the data types
print("Original data types:")
print(df.dtypes)

# Create a dictionary of values to replace in the performance_score column
score_cleaning = {'TBC': np.nan,
                  'Full score': 5}

# Replace 'TBC' with NaN and 'Full score' with 5
df['performance_score'] = df['performance_score'].replace(score_cleaning)

# Cast column to float data type
df['performance_score'] = df['performance_score'].astype(float)

print("New data types:")
print(df.dtypes)
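
The order of the two steps matters. The sketch below uses a small made-up Series (not the workshop data) to suggest why: casting to `float` fails while a value such as 'TBC' is still present, and succeeds once the text values have been replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores as they might arrive from a CSV: all text.
raw_scores = pd.Series(["3.5", "TBC", "Full score"], dtype=object)

try:
    raw_scores.astype(float)
    cast_failed = False
except ValueError:
    # 'TBC' cannot be interpreted as a number, so the cast is rejected
    cast_failed = True

# Replace the text values first, then cast
cleaned = raw_scores.replace({"TBC": np.nan, "Full score": 5}).astype(float)

print(cast_failed)       # True
print(cleaned.tolist())  # [3.5, nan, 5.0]
```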

## Subsetting and Indexing

Often, you only want to look at a specific part of your DataFrame. Pandas provides powerful tools for this.

### Selecting Columns

You can select a single column using square brackets `[]`, which returns a pandas Series.

In [None]:
# Select the 'name' column
df['name']
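
Passing a list of column names inside the brackets selects several columns at once and returns a DataFrame rather than a Series. A minimal sketch on a made-up table:

```python
import pandas as pd

# Hypothetical two-employee table.
staff = pd.DataFrame({"name": ["Ada", "Ben"], "salary": [30000, 45000]})

one_column = staff["name"]               # single label -> Series
two_columns = staff[["name", "salary"]]  # list of labels -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```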

### Subsetting with `.loc` (Label-Based)

`.loc` is used for selecting data by row and column **labels**. The syntax is `df.loc[row_labels, column_labels]`.

In [None]:
# Select row with index 3 and the 'name' and 'salary' columns
df.loc[3, ['name', 'salary']]

### Subsetting with `.iloc` (Position-Based)

`.iloc` is used for selecting data by row and column **integer positions**. The syntax is `df.iloc[row_positions, column_positions]`.

In [None]:
# Select the first 3 rows and the first 2 columns
df.iloc[0:3, 0:2]
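
The difference between labels and positions is easiest to see when the row labels are not simply 0, 1, 2. The sketch below uses a made-up table with the labels 10, 20 and 30, chosen only for illustration. Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical table whose row labels (10, 20, 30) differ from the positions (0, 1, 2).
staff = pd.DataFrame(
    {"name": ["Ada", "Ben", "Chloe"], "salary": [30000, 45000, 52000]},
    index=[10, 20, 30],
)

by_label = staff.loc[10:20, "name"]  # labels 10 and 20: the end label is included
by_position = staff.iloc[0:2, 0]     # positions 0 and 1: the end position is excluded

print(by_label.tolist())     # ['Ada', 'Ben']
print(by_position.tolist())  # ['Ada', 'Ben']
```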

## Modifying the DataFrame

Let's see how we can add, remove, and change data in our DataFrame.

### Dropping a Column

You can remove a column using the `.drop()` method. You need to specify `axis=1` to indicate you're dropping a column.

In [None]:
# Drop the 'employee_id' column
df_dropped = df.drop('employee_id', axis=1)
df_dropped.head()
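
Two details are worth checking on a small example: `.drop()` returns a new DataFrame and leaves the original alone, and the `columns=` keyword is an equivalent way of writing `axis=1`. The table below is made up for the sketch.

```python
import pandas as pd

# Hypothetical table with an identifier column.
staff = pd.DataFrame({"employee_id": [1, 2], "name": ["Ada", "Ben"]})

# Same effect as staff.drop("employee_id", axis=1)
without_id = staff.drop(columns="employee_id")

print(list(without_id.columns))  # ['name']
print(list(staff.columns))       # ['employee_id', 'name']: the original is untouched
```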

### Adding a Column

You can create a new column by simply assigning values to a new column name. Let's create an `age_when_joined` column.

In [None]:
# Add a new calculated column
df['age_when_joined'] = df['age'] - df['years_at_company']
df.head()

### Replacing Values

The `.replace()` method is useful for updating specific values. Let's say we want to give our departments more descriptive names.

In [None]:
# Replace department names
df['department'] = df['department'].replace({'HR': 'Human Resources', 'Engineering': 'Tech'})
df.head()

## Filtering for Data Quality

Filtering is one of the most common tasks in data analysis. You can create a boolean condition and use it to select rows from your DataFrame.

This is also a great way to identify rows with data quality issues. For example, we know that an age of 999 is impossible. Let's find that row.

In [None]:
# Filter for rows where age is greater than 100
df[df['age'] > 100]

We can now see the problematic row and decide how to handle it (e.g., correct it if we know the right age, or remove it).
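
Both options can be sketched with the same boolean condition. The example below uses made-up data rather than the workshop file, and which option is appropriate depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one impossible age.
staff = pd.DataFrame({"name": ["Ada", "Ben", "Chloe"], "age": [34.0, 999.0, 41.0]})

is_bad_age = staff["age"] > 100  # boolean Series: one True/False per row

# Option 1: keep the row but mark the age as missing
marked = staff.copy()
marked.loc[is_bad_age, "age"] = np.nan

# Option 2: remove the row by keeping only rows where the condition is False
removed = staff[~is_bad_age]

print(marked["age"].isnull().sum())  # 1
print(len(removed))                  # 2
```

Working on a copy (`staff.copy()`) keeps the original table available in case you change your mind.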

## Important: DataFrames are In-Memory Only

It's crucial to understand that when you load data into a pandas DataFrame, it exists only in your computer's memory (RAM). This means:

- **Changes are temporary:** Any modifications you make (cleaning data, adding columns, fixing typos) only exist while your program is running.
- **Original file unchanged:** The original CSV file remains exactly as it was when you loaded it.
- **Data lost when program ends:** If you close your notebook or restart your Python session, all your changes disappear.
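
The points above can be checked with a small experiment. The sketch below writes a made-up CSV to a temporary folder (the file name `demo.csv` is an assumption for the example), changes the loaded DataFrame, and then reads the file again.

```python
import os
import tempfile

import pandas as pd

# Write a small hypothetical CSV to a temporary folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    pd.DataFrame({"name": ["Ada", "Ben"], "age": [34, 41]}).to_csv(csv_path, index=False)

    loaded = pd.read_csv(csv_path)
    loaded["age"] = loaded["age"] + 1  # change the in-memory copy only

    reloaded = pd.read_csv(csv_path)   # read the file on disk again

print(loaded["age"].tolist())    # [35, 42]
print(reloaded["age"].tolist())  # [34, 41]: the file was never modified
```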

### When to Save Your Work

You should save your DataFrame to a new file when:
- You've cleaned the data and want to preserve those improvements
- You've added new calculated columns that took time to create
- You want to share the cleaned dataset with others
- You're finished with your analysis and want to keep the final version

Let's see how to save our cleaned employee dataset.

In [None]:
# Save the cleaned DataFrame to a new CSV file
df.to_csv('employee_data_cleaned.csv', index=False)

# You can also save just specific columns if needed
df[['name', 'department', 'salary']].to_csv('employee_summary.csv', index=False)

# Note: index=False prevents pandas from saving the row numbers as a separate column

## Final Thoughts

You've covered a lot of ground! This notebook provides a solid foundation for the most common pandas operations. Natural next steps, which could be covered in a future workshop, are:

- **Grouping and Aggregation:** Using `.groupby()` to perform calculations on specific groups within the data (e.g., finding the average salary per department).
- **Merging and Joining:** Combining multiple DataFrames, similar to SQL joins.
- **More advanced plotting:** Using libraries like Matplotlib or Seaborn to create more complex visualisations from the data.