
# Pandas: DataFrames and Data Manipulation

---

## Introduction

Pandas is a Python library for data analysis and manipulation. Its two core data structures, Series and DataFrame, are designed for handling structured (tabular) data.

In this notebook, you will learn about:

- Creating and exploring Pandas Series and DataFrames
- Manipulating data: Filtering, sorting, and grouping
- Handling missing data
- Merging and joining DataFrames

Each section includes exercises for practice, and a data science use case near the end shows the techniques working together.
    


## Pandas Series and DataFrames

### Overview

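- **Series**: A one-dimensional labeled array; each value is paired with an index label.
- **DataFrame**: A two-dimensional labeled data structure with rows and columns; each column can hold a different data type.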

### Examples
    

In [None]:

import pandas as pd

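# Creating a Series from a list, with custom index labels
data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
# sep="\n" starts the object on its own line so the output stays aligned
print("Pandas Series:", series, sep="\n")

# Creating a DataFrame from a dict: keys become column names
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print("Pandas DataFrame:", df, sep="\n")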
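
Once a DataFrame exists, a few attributes and methods give a quick overview of it. The sketch below rebuilds the same example data so that it runs on its own; the `Bonus` column and its 10% rate are made up for illustration.

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for the numeric columns

# Select a single column (returns a Series) and a single row by position
ages = df['Age']
first_row = df.iloc[0]

# Hypothetical derived column: a 10% bonus computed from Salary
df['Bonus'] = df['Salary'] * 0.10
print(df)
```

Arithmetic on a column is applied element by element, so no explicit loop is needed to build the derived column.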
    


### Exercise

1. Create a Pandas Series representing the population of four cities.
2. Create a DataFrame with columns for "Product", "Price", and "Quantity".
3. Add a new column to the DataFrame that calculates the total cost (Price × Quantity).
    


## Data Manipulation: Filtering, Sorting, and Grouping

### Overview

Pandas provides powerful functions to filter, sort, and group data for analysis.

### Examples
    

In [None]:

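# Filtering data: keep only the rows where the condition is True
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame:", filtered_df, sep="\n")

# Sorting data: highest salary first
sorted_df = df.sort_values(by='Salary', ascending=False)
print("Sorted DataFrame:", sorted_df, sep="\n")

# Grouping data: mean salary per age
# (every age is unique here, so each group contains a single row)
grouped_data = df.groupby('Age')['Salary'].mean()
print("Grouped Data:", grouped_data, sep="\n")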
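
Grouping is easier to see when the grouping column has repeated values. The sketch below uses a categorical column for that, and also shows combined filter conditions and multi-column sorting. The `staff` table, its `Dept` column, and all values are invented for this illustration.

```python
import pandas as pd

# Hypothetical data: 'Dept' is added only for this illustration
staff = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                      'Dept': ['HR', 'IT', 'IT', 'HR'],
                      'Age': [25, 30, 35, 30],
                      'Salary': [50000, 60000, 70000, 55000]})

# Combine conditions with & (and) or | (or); each condition needs parentheses
over_25_in_it = staff[(staff['Age'] > 25) & (staff['Dept'] == 'IT')]
print(over_25_in_it)

# Sort by Age ascending, then by Salary descending within equal ages
by_age_salary = staff.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(by_age_salary)

# Group by a categorical column and compute several aggregates at once
dept_stats = staff.groupby('Dept')['Salary'].agg(['mean', 'sum', 'count'])
print(dept_stats)
```

Python's `and` and `or` keywords do not work on whole columns, which is why the element-wise operators `&` and `|` are used for the filter.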
    


### Exercise

1. Filter a DataFrame to show only rows where the "Salary" is greater than 55000.
2. Sort a DataFrame by multiple columns (e.g., "Age" and "Salary").
3. Group a DataFrame by a categorical column and calculate the sum of another column.
    


## Handling Missing Data

### Overview

Missing data is common in datasets. Pandas provides functions to handle missing values, such as:

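- `isna()`: Detect missing values; returns a boolean mask of the same shape.
- `fillna()`: Fill missing values with a specified value. To propagate neighboring values instead, use `ffill()` or `bfill()`.
- `dropna()`: Remove rows (or columns) that contain missing values.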
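
A minimal sketch of the three functions listed above, assuming a small made-up table; the `scores` name, its columns, and its values are illustrative only.

```python
import pandas as pd
import numpy as np

# Illustrative data with gaps (np.nan and None are both treated as missing)
scores = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'Dana'],
                       'Age': [25, np.nan, 35, 40],
                       'Salary': [50000, 60000, np.nan, 70000]})

# isna() returns a boolean mask; summing it counts missing values per column
missing_counts = scores.isna().sum()
print(missing_counts)

# Fill each numeric column with its own mean (mean() skips missing values)
filled = scores.fillna({'Age': scores['Age'].mean(),
                        'Salary': scores['Salary'].mean()})
print(filled)

# Drop rows only when specific columns are missing
named_only = scores.dropna(subset=['Name'])
print(named_only)
```

Whether to fill or drop depends on the analysis: filling keeps every row but introduces estimated values, while dropping keeps only observed values at the cost of fewer rows.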

### Examples
    

In [None]:

# Handling missing data
data_with_missing = {'Name': ['Alice', 'Bob', None],
                     'Age': [25, None, 35],
                     'Salary': [50000, 60000, None]}
df_missing = pd.DataFrame(data_with_missing)
print("Original DataFrame with missing values:", df_missing, sep="\n")

# Filling missing values: the dict sets one fill value per column
# ('Name' is not listed, so its missing value is left as-is)
filled_df = df_missing.fillna({'Age': 30, 'Salary': 0})
print("DataFrame after filling missing values:", filled_df, sep="\n")

# Dropping missing values: removes every row with at least one missing value
# (only the first row is complete here)
dropped_df = df_missing.dropna()
print("DataFrame after dropping missing values:", dropped_df, sep="\n")


### Exercise

1. Create a DataFrame with missing values and fill them using the mean of the column.
2. Drop rows with any missing values from a DataFrame.
3. Replace missing values in a Series with a constant value.
    


## Merging and Joining DataFrames

### Overview

Pandas allows combining multiple DataFrames using:

- `merge()`: Combines DataFrames based on key columns, similar to a SQL join.
- `join()`: Combines DataFrames on their index by default.
- `concat()`: Concatenates DataFrames along a particular axis (rows or columns).
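
The behavior of `merge()` depends on its `how` parameter, which matters most when the keys do not fully overlap. The sketch below uses invented `ID` values to compare three settings; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Invented data: ID 3 appears only on the left, ID 4 only on the right
employees = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
salaries = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 80000]})

# inner: keep only IDs present in both DataFrames
inner = pd.merge(employees, salaries, on='ID', how='inner')
print(inner)

# left: keep every row of the left DataFrame; unmatched Salary becomes NaN
left = pd.merge(employees, salaries, on='ID', how='left')
print(left)

# outer: keep all IDs from both sides; _merge shows each row's origin
outer = pd.merge(employees, salaries, on='ID', how='outer', indicator=True)
print(outer)
```

When `how` is not given, `merge()` performs an inner join, so rows without a match on both sides are silently dropped.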

### Examples
    

In [None]:

# Merging DataFrames on the shared 'ID' column
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})
merged_df = pd.merge(df1, df2, on='ID')
print("Merged DataFrame:", merged_df, sep="\n")

# Joining DataFrames: join() matches rows on the index,
# so 'Name' is made the index of df1 first
df3 = pd.DataFrame({'Age': [25, 30]}, index=['Alice', 'Bob'])
joined_df = df1.set_index('Name').join(df3)
print("Joined DataFrame:", joined_df, sep="\n")

# Concatenating DataFrames along rows (axis=0)
# df1 and df2 share only the 'ID' column, so the other columns are filled with NaN
concat_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print("Concatenated DataFrame:", concat_df, sep="\n")


### Exercise

1. Merge two DataFrames with different keys and explore the `how` parameter (e.g., `left`, `right`, `outer`, `inner`).
2. Join two DataFrames on their index and add a new column.
3. Concatenate three DataFrames along rows and reset the index.
    


## Use Case in Data Science

Pandas is used in data science for:

- Cleaning and preprocessing data
- Exploratory data analysis (EDA)
- Aggregating and summarizing data for insights

### Example Use Case
    

In [None]:

# Example: Analyzing sales data
sales_data = {'Product': ['A', 'B', 'A', 'B'],
              'Region': ['North', 'South', 'South', 'North'],
              'Sales': [200, 150, 100, 250]}
df_sales = pd.DataFrame(sales_data)

# Total sales by region
sales_by_region = df_sales.groupby('Region')['Sales'].sum()
print("Total sales by region:", sales_by_region, sep="\n")

# Fill missing sales with the mean
# (this sample has no missing values, so the data is unchanged)
df_sales['Sales'] = df_sales['Sales'].fillna(df_sales['Sales'].mean())
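
A variation of the sales example with one missing entry, invented for illustration, shows the fill step changing the data before aggregation. Filling with the overall mean is only one possible choice and is used here because it matches the step above.

```python
import pandas as pd
import numpy as np

# Invented variation of the sales data with one missing Sales value
sales = pd.DataFrame({'Product': ['A', 'B', 'A', 'B'],
                      'Region': ['North', 'South', 'South', 'North'],
                      'Sales': [200, np.nan, 100, 250]})

# Fill the gap with the mean of the observed values: (200 + 100 + 250) / 3
sales['Sales'] = sales['Sales'].fillna(sales['Sales'].mean())

# Aggregate after cleaning: total sales per product and per region
by_product = sales.groupby('Product')['Sales'].sum()
by_region = sales.groupby('Region')['Sales'].sum()
print(by_product)
print(by_region)
```

Cleaning before aggregating matters because `sum()` skips missing values, so an unfilled gap would lower the totals without any warning.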


### Exercise

1. Create a DataFrame representing a sales dataset with columns for "Product", "Region", and "Sales".
2. Group the dataset by "Product" and calculate the total sales.
3. Merge two DataFrames representing sales and inventory data.
    


## Summary

In this notebook, you learned about:

- Creating and exploring Pandas Series and DataFrames
- Manipulating data: Filtering, sorting, and grouping
- Handling missing data
- Merging and joining DataFrames

These skills are essential for working with structured data in data science.

---

### Final Exercise

1. Create a DataFrame representing student test scores with columns for "Name", "Subject", and "Score".
2. Calculate the average score for each subject and sort the results in descending order.
3. Merge two DataFrames: one containing student details and another containing their scores.
    