# A Comprehensive Course on Pandas DataFrames

Welcome to the world of Pandas! The **DataFrame** is the single most important data structure in the Pandas library. It's a powerful, two-dimensional labeled data structure, similar to a spreadsheet, a SQL table, or a dictionary of Series objects.

Think of it as a **supercharged spreadsheet** right inside your Python code. It's the primary tool for data cleaning, transformation, analysis, and manipulation.

### Why use a DataFrame?
*   **Tabular Data**: It's designed for handling structured, table-like data.
*   **Powerful Indexing**: Easily select, slice, and filter data.
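*   **Data Alignment**: Operations between DataFrames and Series align on row and column labels automatically; labels with no match produce missing values (`NaN`) instead of errors.
*   **Performance**: Column-wise operations are vectorized and run in compiled NumPy code, which is typically much faster than looping over rows in Python.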
*   **Rich Functionality**: Built-in tools for grouping, aggregation, merging, and reshaping data.

### Table of Contents
1. [Setup & Creating a DataFrame](#creating)
2. [Inspecting Your Data](#inspecting)
3. [Selection & Indexing: Getting Data Out](#selecting)
4. [Conditional Filtering: Asking Questions](#filtering)
5. [Modifying the DataFrame](#modifying)
6. [Handling Missing Data](#missing-data)
7. [Grouping & Aggregation: The Power of `groupby`](#grouping)
8. [Saving a DataFrame to a File](#saving)

<a id='creating'></a>
## 1. Setup & Creating a DataFrame

First, we need to import the pandas library. The standard convention is to import it as `pd`.

You can create a DataFrame from various sources, but the most common are dictionaries, lists of lists, and, most importantly, reading from a file like a CSV.

In [1]:
!pip install pandas -q

In [2]:
import pandas as pd
import numpy as np # Often used with pandas

# Method 1: From a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df_from_dict = pd.DataFrame(data)
print("--- DataFrame from Dictionary ---")
print(df_from_dict)


# Method 2: From a file (most common!)
# First, let's create a sample employees.csv file to work with.
csv_content = """
ID,Name,Department,Salary,HireDate,City
101,Alice,Engineering,90000,2020-03-15,New York
102,Bob,Marketing,65000,2019-07-20,Los Angeles
103,Charlie,Engineering,110000,2018-01-10,New York
104,David,HR,55000,2021-05-30,Chicago
105,Eve,Marketing,70000,2021-11-01,Los Angeles
106,Frank,Engineering,,2022-02-12,Chicago
"""
with open('employees.csv', 'w') as f:
    f.write(csv_content.strip())

# Now, read the CSV into a DataFrame
df = pd.read_csv('employees.csv')
print("\n--- DataFrame from CSV File ---")
print(df)

--- DataFrame from Dictionary ---
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

--- DataFrame from CSV File ---
    ID     Name   Department    Salary    HireDate         City
0  101    Alice  Engineering   90000.0  2020-03-15     New York
1  102      Bob    Marketing   65000.0  2019-07-20  Los Angeles
2  103  Charlie  Engineering  110000.0  2018-01-10     New York
3  104    David           HR   55000.0  2021-05-30      Chicago
4  105      Eve    Marketing   70000.0  2021-11-01  Los Angeles
5  106    Frank  Engineering       NaN  2022-02-12      Chicago
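
The text above also mentions lists of lists as a source. A minimal sketch of that route, with a list of dictionaries added as a closely related variant (the names `rows`, `records`, `df_from_rows`, and `df_from_records` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately.
rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago'],
]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])

# A list of dictionaries also works; the keys become column names.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_from_records = pd.DataFrame(records)

print(df_from_rows)
print(df_from_records)
```

The dictionary-of-lists form shown earlier is column-oriented, while these two forms are row-oriented, which tends to be convenient when rows arrive one at a time.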


<a id='inspecting'></a>
## 2. Inspecting Your Data

Once you've loaded your data, the first step is always to inspect it to understand its structure, types, and content.

In [3]:
# See the first 5 rows
print("--- .head() ---")
display(df.head())

# See the last 3 rows
print("\n--- .tail(3) ---")
display(df.tail(3))

# Get a concise summary of the DataFrame (VERY useful!)
# Shows the index type, column names, dtypes, non-null counts, and memory usage.
print("\n--- .info() ---")
df.info()

# Get descriptive statistics for numerical columns
print("\n--- .describe() ---")
display(df.describe())

# Get the dimensions of the DataFrame (rows, columns)
print(f"\nShape of the DataFrame: {df.shape}")

# Get the column names
print(f"Columns: {df.columns.tolist()}")

--- .head() ---


    ID     Name   Department    Salary    HireDate         City
0  101    Alice  Engineering   90000.0  2020-03-15     New York
1  102      Bob    Marketing   65000.0  2019-07-20  Los Angeles
2  103  Charlie  Engineering  110000.0  2018-01-10     New York
3  104    David           HR   55000.0  2021-05-30      Chicago
4  105      Eve    Marketing   70000.0  2021-11-01  Los Angeles



--- .tail(3) ---


    ID   Name   Department   Salary    HireDate         City
3  104  David           HR  55000.0  2021-05-30      Chicago
4  105    Eve    Marketing  70000.0  2021-11-01  Los Angeles
5  106  Frank  Engineering      NaN  2022-02-12      Chicago



--- .info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          6 non-null      int64  
 1   Name        6 non-null      object 
 2   Department  6 non-null      object 
 3   Salary      5 non-null      float64
 4   HireDate    6 non-null      object 
 5   City        6 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 420.0+ bytes

--- .describe() ---


               ID         Salary
count    6.000000       5.000000
mean   103.500000   78000.000000
std      1.870829   21965.882636
min    101.000000   55000.000000
25%    102.250000   65000.000000
50%    103.500000   70000.000000
75%    104.750000   90000.000000
max    106.000000  110000.000000



Shape of the DataFrame: (6, 6)
Columns: ['ID', 'Name', 'Department', 'Salary', 'HireDate', 'City']
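
The `.info()` output above lists `HireDate` as `object`, meaning the dates were read as plain text. A small sketch of checking dtypes and converting such a column, using a hypothetical stand-in frame `df_demo` rather than the notebook's `df`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'ID': [101, 102, 103],
    'Salary': [90000.0, None, 110000.0],
    'HireDate': ['2020-03-15', '2019-07-20', '2018-01-10'],
})

# Dates supplied as strings are stored as text, not as datetimes.
print(df_demo.dtypes)

# Convert the text column to a real datetime column.
df_demo['HireDate'] = pd.to_datetime(df_demo['HireDate'])
print(df_demo.dtypes)

# Datetime columns expose date parts through the .dt accessor.
print(df_demo['HireDate'].dt.year.tolist())  # [2020, 2019, 2018]

# A missing numeric value is stored as NaN, so the column is float64.
print(df_demo['Salary'].isna().sum())  # 1
```

Checking dtypes early helps catch columns that look numeric or date-like on screen but are actually stored as text.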


<a id='selecting'></a>
## 3. Selection & Indexing: Getting Data Out

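Pandas offers several ways to select subsets of your data. The two main indexers are `.loc` (label-based) and `.iloc` (integer-position-based). Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as in ordinary Python slicing.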

In [4]:
# --- Selecting Columns ---
print("--- Selecting a single column (returns a Series) ---")
names = df['Name']
display(names.head())
print(f"Type of a single column: {type(names)}")

print("\n--- Selecting multiple columns (returns a DataFrame) ---")
# Note the double square brackets!
subset = df[['Name', 'Salary', 'City']]
display(subset.head())

# --- Selecting Rows with .loc and .iloc ---

# .loc is for selecting by LABEL (index name, column name)
print("\n--- Using .loc ---")
# Get the row with index label 2
display(df.loc[2])
# Get rows with index labels 1 through 3
display(df.loc[1:3])

# .iloc is for selecting by INTEGER POSITION (0, 1, 2...)
print("\n--- Using .iloc ---")
# Get the row at position 0 (the first row)
display(df.iloc[0])
# Get rows from position 1 up to (but not including) 4
display(df.iloc[1:4])

# --- Combining Row and Column Selection ---
# Get 'Name' and 'Salary' for the rows labeled 0 through 2 (using .loc)
print("\n--- Slicing rows and columns with .loc ---")
display(df.loc[0:2, ['Name', 'Salary']])

# Get the value at row 0, column 1 (using .iloc)
alice_name = df.iloc[0, 1]
print(f"Value at [0, 1]: {alice_name}")

--- Selecting a single column (returns a Series) ---


0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

Type of a single column: <class 'pandas.core.series.Series'>

--- Selecting multiple columns (returns a DataFrame) ---


      Name    Salary         City
0    Alice   90000.0     New York
1      Bob   65000.0  Los Angeles
2  Charlie  110000.0     New York
3    David   55000.0      Chicago
4      Eve   70000.0  Los Angeles



--- Using .loc ---


ID                    103
Name              Charlie
Department    Engineering
Salary           110000.0
HireDate       2018-01-10
City             New York
Name: 2, dtype: object

    ID     Name   Department    Salary    HireDate         City
1  102      Bob    Marketing   65000.0  2019-07-20  Los Angeles
2  103  Charlie  Engineering  110000.0  2018-01-10     New York
3  104    David           HR   55000.0  2021-05-30      Chicago



--- Using .iloc ---


ID                    101
Name                Alice
Department    Engineering
Salary            90000.0
HireDate       2020-03-15
City             New York
Name: 0, dtype: object

    ID     Name   Department    Salary    HireDate         City
1  102      Bob    Marketing   65000.0  2019-07-20  Los Angeles
2  103  Charlie  Engineering  110000.0  2018-01-10     New York
3  104    David           HR   55000.0  2021-05-30      Chicago



--- Slicing rows and columns with .loc ---


      Name    Salary
0    Alice   90000.0
1      Bob   65000.0
2  Charlie  110000.0


Value at [0, 1]: Alice
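
With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical frame `staff` indexed by employee ID to make the distinction visible:

```python
import pandas as pd

staff = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
     'Salary': [90000, 65000, 110000, 55000]},
    index=[101, 102, 103, 104],  # labels are no longer 0, 1, 2, ...
)

# .loc slices by label and INCLUDES the end label: rows 101, 102, 103.
by_label = staff.loc[101:103]

# .iloc slices by position and EXCLUDES the end position: positions 0 and 1.
by_position = staff.iloc[0:2]

# A single cell: label-based vs position-based access to the same value.
print(staff.loc[102, 'Name'])   # Bob
print(staff.iloc[1, 0])         # Bob
print(len(by_label), len(by_position))  # 3 2
```

Here `staff.loc[0]` would raise a `KeyError`, because no row carries the label 0 even though a row sits at position 0.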


<a id='filtering'></a>
## 4. Conditional Filtering: Asking Questions

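Filtering lets you keep only the rows that satisfy a condition. A comparison such as `df['Salary'] > 100000` produces a boolean Series (a mask) with one `True`/`False` per row; indexing the DataFrame with that mask returns only the rows where it is `True`.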

In [5]:
# Single Condition: Find all employees in the Engineering department
engineers = df[df['Department'] == 'Engineering']
print("--- Engineers ---")
display(engineers)

# Multiple Conditions: Find all engineers who earn more than 100,000
# IMPORTANT: Use `&` for AND, `|` for OR, and wrap each condition in parentheses `()`!
high_earning_engineers = df[(df['Department'] == 'Engineering') & (df['Salary'] > 100000)]
print("\n--- High-Earning Engineers ---")
display(high_earning_engineers)

# Using `.isin()` to filter by a list of values
# Find all employees in New York or Chicago
ny_or_chicago = df[df['City'].isin(['New York', 'Chicago'])]
print("\n--- Employees in New York or Chicago ---")
display(ny_or_chicago)

--- Engineers ---


    ID     Name   Department    Salary    HireDate      City
0  101    Alice  Engineering   90000.0  2020-03-15  New York
2  103  Charlie  Engineering  110000.0  2018-01-10  New York
5  106    Frank  Engineering       NaN  2022-02-12   Chicago



--- High-Earning Engineers ---


    ID     Name   Department    Salary    HireDate      City
2  103  Charlie  Engineering  110000.0  2018-01-10  New York



--- Employees in New York or Chicago ---


    ID     Name   Department    Salary    HireDate      City
0  101    Alice  Engineering   90000.0  2020-03-15  New York
2  103  Charlie  Engineering  110000.0  2018-01-10  New York
3  104    David           HR   55000.0  2021-05-30   Chicago
5  106    Frank  Engineering       NaN  2022-02-12   Chicago
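
The filters above rely on boolean masks. The following sketch, built on a hypothetical frame `df_demo`, prints a mask directly and shows the `|` and `~` operators, plus `.query()` as an alternative spelling:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Frank'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Engineering'],
    'Salary': [90000.0, 65000.0, 110000.0, np.nan],
})

# The condition alone is a boolean Series, one True/False per row.
# A comparison against NaN evaluates to False, so Frank is excluded.
mask = df_demo['Salary'] > 80000
print(mask.tolist())  # [True, False, True, False]

# OR with `|`, negation with `~`.
marketing_or_high = df_demo[(df_demo['Department'] == 'Marketing') | mask]
not_engineering = df_demo[~(df_demo['Department'] == 'Engineering')]

# .query() expresses the same kind of filter as a string.
high_engineers = df_demo.query("Department == 'Engineering' and Salary > 100000")

print(marketing_or_high['Name'].tolist())  # ['Alice', 'Bob', 'Charlie']
print(not_engineering['Name'].tolist())    # ['Bob']
print(high_engineers['Name'].tolist())     # ['Charlie']
```

Python's `and`/`or` keywords do not work on Series outside of `.query()`, which is why the element-wise operators `&`, `|`, and `~` are needed in bracket-style filters.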


<a id='modifying'></a>
## 5. Modifying the DataFrame

You can easily add new columns, modify existing ones, or remove them.

In [6]:
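# Add a new column
# Note: the reference year 2023 is hardcoded and only calendar years are compared,
# so the result is approximate (months and days are ignored).
df['YearsOfService'] = 2023 - pd.to_datetime(df['HireDate']).dt.year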
print("--- Added 'YearsOfService' column ---")
display(df.head())

# Modify a column using .apply() and a lambda function
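# Let's give everyone a 5% raise
# (For simple arithmetic like this, the vectorized form df['Salary'] * 1.05 gives the same result.)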
df['Salary'] = df['Salary'].apply(lambda x: x * 1.05)
print("\n--- Increased salary by 5% ---")
display(df.head())

# Drop a column
# `axis=1` means we are dropping a column. `axis=0` is for rows.
df_dropped = df.drop('HireDate', axis=1)
print("\n--- Dropped 'HireDate' column ---")
display(df_dropped.head())

--- Added 'YearsOfService' column ---


    ID     Name   Department    Salary    HireDate         City  YearsOfService
0  101    Alice  Engineering   90000.0  2020-03-15     New York               3
1  102      Bob    Marketing   65000.0  2019-07-20  Los Angeles               4
2  103  Charlie  Engineering  110000.0  2018-01-10     New York               5
3  104    David           HR   55000.0  2021-05-30      Chicago               2
4  105      Eve    Marketing   70000.0  2021-11-01  Los Angeles               2



--- Increased salary by 5% ---


    ID     Name   Department    Salary    HireDate         City  YearsOfService
0  101    Alice  Engineering   94500.0  2020-03-15     New York               3
1  102      Bob    Marketing   68250.0  2019-07-20  Los Angeles               4
2  103  Charlie  Engineering  115500.0  2018-01-10     New York               5
3  104    David           HR   57750.0  2021-05-30      Chicago               2
4  105      Eve    Marketing   73500.0  2021-11-01  Los Angeles               2



--- Dropped 'HireDate' column ---


    ID     Name   Department    Salary         City  YearsOfService
0  101    Alice  Engineering   94500.0     New York               3
1  102      Bob    Marketing   68250.0  Los Angeles               4
2  103  Charlie  Engineering  115500.0     New York               5
3  104    David           HR   57750.0      Chicago               2
4  105      Eve    Marketing   73500.0  Los Angeles               2
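
Two details of the cell above are easy to miss: simple arithmetic does not need `.apply()`, and `drop()` leaves the original DataFrame untouched. A short sketch on a hypothetical frame `df_demo` (the `Bonus` column is invented for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [90000.0, 65000.0],
})

# Arithmetic on a column is applied element-wise, so no lambda is needed.
vectorized = df_demo['Salary'] * 1.05
with_apply = df_demo['Salary'].apply(lambda x: x * 1.05)
print(vectorized.equals(with_apply))  # True

# drop() returns a new DataFrame; the original keeps its columns.
without_salary = df_demo.drop(columns='Salary')
print(df_demo.columns.tolist())         # ['Name', 'Salary']
print(without_salary.columns.tolist())  # ['Name']

# assign() likewise returns a copy with the extra column added.
with_bonus = df_demo.assign(Bonus=df_demo['Salary'] * 0.10)
print(with_bonus.columns.tolist())      # ['Name', 'Salary', 'Bonus']
```

`.apply()` remains useful when the per-value logic cannot be written as column arithmetic, but the vectorized form is usually shorter and faster.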


<a id='missing-data'></a>
## 6. Handling Missing Data

Real-world data is often messy and contains missing values (represented in Pandas as `NaN`, short for "Not a Number"). Pandas provides tools to detect, drop, and fill them.

In [7]:
# First, let's see where our missing data is
print("--- Count of missing values per column ---")
print(df.isnull().sum())

# Method 1: Dropping rows with any missing values
df_dropped_na = df.dropna()
print("\n--- DataFrame after dropping rows with NaN ---")
display(df_dropped_na)

# Method 2: Filling missing values (imputation)
# Let's fill the missing salary with the average salary of the department.
# This is a more advanced but common technique.
df['Salary'] = df.groupby('Department')['Salary'].transform(lambda x: x.fillna(x.mean()))
print("\n--- DataFrame after filling NaN salary with department average ---")
display(df)

--- Count of missing values per column ---
ID                0
Name              0
Department        0
Salary            1
HireDate          0
City              0
YearsOfService    0
dtype: int64

--- DataFrame after dropping rows with NaN ---


    ID     Name   Department    Salary    HireDate         City  YearsOfService
0  101    Alice  Engineering   94500.0  2020-03-15     New York               3
1  102      Bob    Marketing   68250.0  2019-07-20  Los Angeles               4
2  103  Charlie  Engineering  115500.0  2018-01-10     New York               5
3  104    David           HR   57750.0  2021-05-30      Chicago               2
4  105      Eve    Marketing   73500.0  2021-11-01  Los Angeles               2



--- DataFrame after filling NaN salary with department average ---


    ID     Name   Department    Salary    HireDate         City  YearsOfService
0  101    Alice  Engineering   94500.0  2020-03-15     New York               3
1  102      Bob    Marketing   68250.0  2019-07-20  Los Angeles               4
2  103  Charlie  Engineering  115500.0  2018-01-10     New York               5
3  104    David           HR   57750.0  2021-05-30      Chicago               2
4  105      Eve    Marketing   73500.0  2021-11-01  Los Angeles               2
5  106    Frank  Engineering  105000.0  2022-02-12      Chicago               1
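
Dropping every incomplete row and filling by group mean are two ends of a range. As a sketch of two middle options, the example below drops rows based on one column only and fills each column with its own replacement; the frame `df_demo` and the `'Unknown'` placeholder are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Frank'],
    'Salary': [94500.0, 115500.0, np.nan],
    'City': ['New York', None, 'Chicago'],
})

# Drop a row only when a specific column is missing.
has_salary = df_demo.dropna(subset=['Salary'])

# Fill different columns with different replacement values.
filled = df_demo.fillna({'Salary': df_demo['Salary'].median(), 'City': 'Unknown'})

print(len(has_salary))            # 2
print(filled['Salary'].tolist())  # [94500.0, 115500.0, 105000.0]
print(filled['City'].tolist())    # ['New York', 'Unknown', 'Chicago']
```

Both calls return new DataFrames, so `df_demo` still contains its missing values afterwards.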


<a id='grouping'></a>
## 7. Grouping & Aggregation: The Power of `groupby`

The `groupby` operation is one of the most powerful features of Pandas. It splits the data into groups based on some criteria, applies a function to each group independently, and combines the results into a new Series or DataFrame. This is the "split-apply-combine" pattern.

In [8]:
# Group by 'Department' and calculate the mean salary for each
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()
print("--- Average Salary by Department ---")
display(avg_salary_by_dept)

# Group by 'City' and apply multiple aggregations
# Find the count of employees, average salary, and max years of service per city
city_stats = df.groupby('City').agg(
    EmployeeCount=('ID', 'count'),
    AverageSalary=('Salary', 'mean'),
    MaxYearsOfService=('YearsOfService', 'max')
)
print("\n--- Detailed Statistics by City ---")
display(city_stats)

--- Average Salary by Department ---


Department
Engineering    105000.0
HR              57750.0
Marketing       70875.0
Name: Salary, dtype: float64


--- Detailed Statistics by City ---


             EmployeeCount  AverageSalary  MaxYearsOfService
City
Chicago                  2        81375.0                  2
Los Angeles              2        70875.0                  4
New York                 2       105000.0                  5
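
To make the split-apply-combine pattern concrete, the sketch below performs the three steps by hand on a hypothetical frame `df_demo` and compares the result with a single `groupby` call. It is meant as an illustration of the idea, not of how pandas implements it internally:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Marketing'],
    'Salary': [90000.0, 65000.0, 110000.0, 70000.0],
})

# Split: iterating a groupby yields (group key, sub-DataFrame) pairs.
# Apply: compute the mean salary within each piece.
manual = {}
for dept, group in df_demo.groupby('Department'):
    manual[dept] = float(group['Salary'].mean())

# Combine: groupby does all three steps in one call.
combined = df_demo.groupby('Department')['Salary'].mean()

print(manual)  # {'Engineering': 100000.0, 'Marketing': 67500.0}
print(manual == combined.to_dict())  # True

# transform() returns one value per original row instead of one per group.
df_demo['DeptMean'] = df_demo.groupby('Department')['Salary'].transform('mean')
print(df_demo['DeptMean'].tolist())  # [100000.0, 67500.0, 100000.0, 67500.0]
```

Aggregations such as `mean()` shrink the data to one row per group, while `transform()` keeps the original shape, which is why it was suitable for filling missing salaries in the previous section.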


<a id='saving'></a>
## 8. Saving a DataFrame to a File

After all your hard work cleaning and transforming the data, you'll want to save it.

In [9]:
# Let's save our final, cleaned DataFrame to a new CSV file
# `index=False` is important to prevent pandas from writing the DataFrame index as a column.
df.to_csv('employees_cleaned.csv', index=False)

print("Cleaned DataFrame saved to 'employees_cleaned.csv'")

Cleaned DataFrame saved to 'employees_cleaned.csv'
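
The effect of `index=False` is easiest to see by comparing the first line of the CSV text with and without it. The sketch below uses a hypothetical frame `df_demo` and keeps everything in memory, relying on the fact that `to_csv()` returns a string when no path is given:

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    'ID': [101, 102],
    'HireDate': pd.to_datetime(['2020-03-15', '2019-07-20']),
})

# Calling to_csv() without a path returns the CSV text as a string.
with_index = df_demo.to_csv()                # index written as an unnamed first column
without_index = df_demo.to_csv(index=False)

print(with_index.splitlines()[0])     # ,ID,HireDate
print(without_index.splitlines()[0])  # ID,HireDate

# CSV stores text only, so dates must be parsed again when reading back.
restored = pd.read_csv(io.StringIO(without_index), parse_dates=['HireDate'])
print(restored['HireDate'].dt.year.tolist())  # [2020, 2019]
```

The unnamed leading column produced by the default setting is what shows up as `Unnamed: 0` when such a file is read back with `read_csv`.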


## Conclusion

You have now learned the fundamentals of the Pandas DataFrame! You can:

- Create a DataFrame from scratch or a file.
- Inspect its contents and structure.
- Select and filter data to find exactly what you need.
- Modify and add new information.
- Handle common issues like missing data.
- Perform powerful aggregations with `groupby`.
- Save your results.

This is the foundation for almost all data analysis work in Python. The next steps are to explore merging/joining DataFrames, time-series analysis, and advanced plotting with libraries like Matplotlib and Seaborn.