Pandas: A Library for Handling Structured Data in Python

Pandas is a powerful and flexible Python library used for data manipulation and analysis. It is built on top of NumPy and provides data structures like Series and DataFrame, which make data analysis tasks easier and more efficient.

Key features of Pandas include:
1. DataFrames and Series: Pandas uses DataFrames (2D labeled tables) and Series (1D labeled arrays) to represent structured data, similar to tables in a database or Excel.
2. Comprehensive Data Operations: Pandas provides tools for filtering, sorting, grouping, merging, concatenating, pivoting, and reshaping data.
3. Handling Missing Data: Pandas offers various methods to detect and handle missing or null values.
4. Integration with Other Libraries: Pandas integrates well with other Python libraries used in data science, such as NumPy, SciPy, and Matplotlib.
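
As a rough illustration of the features listed above, the sketch below builds a Series and a DataFrame and counts missing values per column. The names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array
ages = pd.Series([25, 30, np.nan], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame is a two-dimensional table whose columns are Series
people = pd.DataFrame({'Age': ages, 'City': ['Paris', 'London', 'Paris']})

# isna() marks missing entries; sum() counts them per column
missing_per_column = people.isna().sum()

print(people)
print(missing_per_column)
```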

Pandas is an essential tool for data science and analysis because it simplifies data manipulation, enhances productivity, and integrates with other data science libraries. By using Pandas, data scientists can focus more on analysis and modeling, confident that their data is properly structured and cleaned.

1. Create a DataFrame

Q1: Create a DataFrame from a dictionary of lists. The dictionary should have keys 'Name', 'Age', 'City', and 'Salary'. Populate the DataFrame with at least 5 entries.

In [None]:
import pandas as pd

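# Dictionary of lists: each key becomes a column (sample values)
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 28, 35, 22],
        'City': ['New York', 'London', 'Paris', 'London', 'New York'],
        'Salary': [50000, 60000, 55000, 70000, 45000]}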
df = pd.DataFrame(data)

print(df)


2. Filter DataFrame

Q2: Filter the DataFrame to show only the rows where the Age is greater than 25.

In [None]:
# Filter the DataFrame
filtered_df = df[df['Age'] > 25]

# Display the filtered DataFrame
print(filtered_df)
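
Filters can also combine several conditions. The sketch below is one way to do that; the `people` DataFrame and its values are hypothetical and defined inline so the example runs on its own.

```python
import pandas as pd

# Hypothetical sample data, defined here so the example is self-contained
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 28, 35],
                       'City': ['Paris', 'London', 'Paris', 'London']})

# Each condition needs its own parentheses; & means "and", | means "or"
older_londoners = people[(people['Age'] > 25) & (people['City'] == 'London')]

print(older_londoners)
```

The parentheses matter because `&` binds more tightly than comparison operators in Python.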

3. Add a New Column

Q3: Add a new column 'Bonus' to the DataFrame, where the bonus is 10% of the Salary.

In [None]:
import pandas as pd

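# Sample data with a 'Salary' column so the bonus can be computed
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris'],
        'Salary': [50000, 60000, 55000]}
df = pd.DataFrame(data)

# Add a new column 'Bonus' equal to 10% of 'Salary'
df['Bonus'] = df['Salary'] * 0.10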

print(df)


4. Group By

Q4: Group the DataFrame by 'City' and calculate the average Salary for each city.

In [None]:
import pandas as pd

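# Sample data with the 'City' and 'Salary' columns the question refers to
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 28, 35, 22],
        'City': ['New York', 'London', 'Paris', 'London', 'New York'],
        'Salary': [50000, 60000, 55000, 70000, 45000]}
df = pd.DataFrame(data)

# Group by 'City' and calculate the average salary per city
grouped_df = df.groupby('City')['Salary'].mean()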

print(grouped_df)
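
When more than one statistic per group is needed, named aggregation is one option. The following is a minimal sketch; the `staff` DataFrame and the output column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data
staff = pd.DataFrame({'City': ['Paris', 'London', 'Paris', 'London'],
                      'Salary': [50000, 60000, 54000, 70000]})

# Named aggregation: output column = (input column, aggregation function)
summary = staff.groupby('City').agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'count'),
)

print(summary)
```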


5. Merge DataFrames

Q5: Merge two DataFrames on the 'Name' column. The second DataFrame should have columns 'Name' and 'Department'.

In [None]:
import pandas as pd

# Sample DataFrames that share the 'Name' column
df_employees = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]})
df_departments = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'],
                               'Department': ['HR', 'IT', 'Finance']})

# Left join on 'Name' keeps every employee; names without a match get NaN
merged_df = pd.merge(df_employees, df_departments, on='Name', how='left')
print(merged_df)
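
To see which rows found a match during a merge, `pd.merge` accepts `indicator=True`. The sketch below uses an outer join on hypothetical `left` and `right` DataFrames to show all three possible outcomes.

```python
import pandas as pd

# Hypothetical sample data
left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]})
right = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'],
                      'Department': ['HR', 'IT', 'Finance']})

# An outer join keeps rows from both sides; indicator=True adds a '_merge'
# column with the values 'left_only', 'right_only', or 'both'
merged = pd.merge(left, right, on='Name', how='outer', indicator=True)

print(merged)
```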


6. Handle Missing Data

Q6: Create a DataFrame with missing values and fill the missing values with the mean of the column.

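In [None]:
import pandas as pd
import numpy as np

data = {'Name': ['Alice', 'Bob', np.nan, 'Charlie'],
        'Age': [25, 30, np.nan, 28],
        'Score': [90, 85, np.nan, 75]}
# A separate variable name keeps the DataFrame from the earlier questions intact
df_missing = pd.DataFrame(data)

# Check for missing values
print(df_missing.isnull())

# Drop rows with missing values (careful, data loss possible)
# dropped_df = df_missing.dropna()

# Fill missing values in each numeric column with that column's mean
# ('Name' is not numeric, so its missing value is left as NaN)
numeric_cols = ['Age', 'Score']
df_missing[numeric_cols] = df_missing[numeric_cols].fillna(df_missing[numeric_cols].mean())

# Alternative: forward fill 'Score' from the previous row
# df_missing['Score'] = df_missing['Score'].ffill()

print(df_missing)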


7. Pivot Table

Q7: Create a pivot table from the original DataFrame showing the average Salary for each Age group.

In [None]:
# Create a pivot table (assumes df has the 'Age' and 'Salary' columns from Q1)
pivot_table = df.pivot_table(values='Salary', index='Age', aggfunc='mean')

# Display the pivot table
print(pivot_table)
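
If "Age group" is read as a range of ages rather than each distinct age, one possible approach is to bin the ages with `pd.cut` first. The bin edges, labels, and the `staff` DataFrame below are assumptions made for the sketch.

```python
import pandas as pd

# Hypothetical sample data
staff = pd.DataFrame({'Age': [25, 30, 28, 35, 22],
                      'Salary': [50000, 60000, 55000, 70000, 45000]})

# Bin ages into the intervals (20, 30] and (30, 40]; the labels are arbitrary.
# astype(str) turns the categorical result into plain strings.
staff['AgeGroup'] = pd.cut(staff['Age'], bins=[20, 30, 40],
                           labels=['21-30', '31-40']).astype(str)

# Average salary per age group
age_group_pivot = staff.pivot_table(values='Salary', index='AgeGroup', aggfunc='mean')

print(age_group_pivot)
```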

8. Apply Function

Q8: Apply a custom function to the 'Salary' column that increases each salary by 5%.

In [None]:
# Define a custom function (df is assumed to have the 'Salary' column from Q1)
def increase_salary(salary):
    return salary * 1.05

# Apply the custom function to the 'Salary' column
df['Increased Salary'] = df['Salary'].apply(increase_salary)

# Display the updated DataFrame
print(df)
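
For simple arithmetic like this, operating on the whole column is usually an alternative to `apply`. The sketch below compares the two on a hypothetical `staff` DataFrame.

```python
import pandas as pd

# Hypothetical sample data
staff = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# apply() calls the function once per element
via_apply = staff['Salary'].apply(lambda salary: salary * 1.05)

# Arithmetic on the whole column computes the same values in one step
via_vectorized = staff['Salary'] * 1.05

# Largest absolute difference between the two results
print((via_apply - via_vectorized).abs().max())
```

`apply` remains useful when the per-element logic cannot be expressed as column arithmetic.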

9. Read and Write CSV

Q9: Save the DataFrame to a CSV file and then read it back into a new DataFrame.

In [None]:
import pandas as pd

# Save the DataFrame to a CSV file; index=False omits the row index
df.to_csv('data.csv', index=False)

# Read the CSV file back into a new DataFrame
new_df = pd.read_csv('data.csv')

print(new_df)
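
One detail worth knowing when round-tripping through CSV: by default `to_csv` also writes the row index, which comes back as an extra unnamed column. The sketch below demonstrates this with an in-memory buffer instead of a file; the `sample` DataFrame is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# Default: the index is written as an unnamed first column
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False: only the data columns are written
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```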


10. Multi-indexing

Q10: Create a DataFrame with a multi-index (Name, City) and sort it by the index.

In [None]:
# Create a DataFrame with a multi-index (assumes df has the 'Name' and 'City' columns from Q1)
multi_index_df = df.set_index(['Name', 'City'])

# Sort the DataFrame by the index
sorted_multi_index_df = multi_index_df.sort_index()

# Display the sorted multi-index DataFrame
print(sorted_multi_index_df)
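
Once a multi-index is in place, rows can be selected by one or both levels. The following is a minimal sketch on a hypothetical `staff` DataFrame, using the same (Name, City) level order as above.

```python
import pandas as pd

# Hypothetical sample data
staff = pd.DataFrame({'Name': ['Bob', 'Alice', 'Charlie'],
                      'City': ['London', 'Paris', 'Paris'],
                      'Salary': [60000, 50000, 55000]})

indexed = staff.set_index(['Name', 'City']).sort_index()

# Select all rows for one value of the outer level ('Name')
alice_rows = indexed.loc['Alice']

# Select a single value with a (Name, City) tuple
alice_salary = indexed.loc[('Alice', 'Paris'), 'Salary']

# xs() takes a cross-section on an inner level ('City')
paris_rows = indexed.xs('Paris', level='City')

print(alice_rows)
print(alice_salary)
print(paris_rows)
```

Sorting the index first is generally advisable, since some lookups on an unsorted multi-index are slower or raise performance warnings.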