Pandas: A Library for Handling Structured Data in Python

Pandas is a powerful and flexible Python library used for data manipulation and analysis. It is built on top of NumPy and provides data structures like Series and DataFrame, which make data analysis tasks easier and more efficient.

Key features of Pandas include:
1. DataFrames and Series: Pandas represents structured data with DataFrames (labeled 2D tables) and Series (labeled 1D arrays), much like tables and columns in a database or an Excel sheet.
2. Comprehensive Data Operations: Pandas provides tools for filtering, sorting, grouping, merging, concatenating, pivoting, and reshaping data.
3. Handling Missing Data: Pandas offers various methods to detect and handle missing or null values.
4. Integration with Other Libraries: Pandas integrates well with other Python libraries used in data science, such as NumPy, SciPy, and Matplotlib.

Pandas is an essential tool for data science and analysis because it simplifies data manipulation, enhances productivity, and integrates with other data science libraries. By using Pandas, data scientists can focus more on analysis and modeling, confident that their data is properly structured and cleaned.
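
The relationship between the Series and DataFrame structures described above can be sketched as follows. The names and values are illustrative only; the point is that each column of a DataFrame is itself a Series sharing the table's index.

```python
import pandas as pd

# A Series is a labeled one-dimensional array
ages = pd.Series([24, 27, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame is a table whose columns share a single index
people = pd.DataFrame({
    'Age': ages,
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Selecting one column returns a Series again
age_column = people['Age']
print(type(age_column).__name__)
print(people)
```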

1. Create a DataFrame

Q1: Create a DataFrame from a dictionary of lists. The dictionary should have keys 'Name', 'Age', 'City', and 'Salary'. Populate the DataFrame with at least 5 entries.

In [None]:
import pandas as pd

# Create a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

2. Filter DataFrame

Q2: Filter the DataFrame to show only the rows where the Age is greater than 25.

In [None]:
# Filter the DataFrame
filtered_df = df[df['Age'] > 25]

# Display the filtered DataFrame
print(filtered_df)
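
Filters often need more than one condition. The sketch below rebuilds a small version of the DataFrame so that it runs on its own, and the salary threshold of 85000 is an arbitrary value chosen for illustration. Each condition goes in its own parentheses and the conditions are combined with `&` (and), `|` (or), or `~` (not).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'Salary': [70000, 80000, 75000, 90000, 85000],
})

# Combine boolean masks; Python's `and`/`or` do not work element-wise here
older_high_earners = df[(df['Age'] > 25) & (df['Salary'] >= 85000)]

# query() expresses the same filter as a string expression
same_rows = df.query('Age > 25 and Salary >= 85000')

print(older_high_earners)
```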

3. Add a New Column

Q3: Add a new column 'Bonus' to the DataFrame, where the bonus is 10% of the Salary.

In [None]:
# Add a new column 'Bonus'
df['Bonus'] = df['Salary'] * 0.10

# Display the updated DataFrame
print(df)

4. Group By

Q4: Group the DataFrame by 'City' and calculate the average Salary for each city.

In [None]:
# Group by 'City' and calculate the average Salary
grouped_df = df.groupby('City')['Salary'].mean().reset_index()

# Display the grouped DataFrame
print(grouped_df)
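
In the DataFrame above every city appears exactly once, so each group mean simply equals that row's salary. The following sketch uses made-up data with repeated cities (an assumption for illustration) to show grouping doing real work, and computes several statistics at once with `agg`.

```python
import pandas as pd

# Hypothetical data in which cities repeat
staff = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'Houston', 'Houston', 'Houston'],
    'Salary': [70000, 80000, 75000, 90000, 85000],
})

# Mean salary and head count per city
summary = staff.groupby('City')['Salary'].agg(['mean', 'count']).reset_index()
print(summary)
```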

5. Merge DataFrames

Q5: Merge two DataFrames on the 'Name' column. The second DataFrame should have columns 'Name' and 'Department'.

In [None]:
# Create the second DataFrame
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales']
}

df2 = pd.DataFrame(data2)

# Merge the DataFrames
merged_df = pd.merge(df, df2, on='Name')

# Display the merged DataFrame
print(merged_df)
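
`pd.merge` performs an inner join by default, so rows without a match in both DataFrames are dropped silently. The merge above loses nothing because both tables contain the same five names. The sketch below uses hypothetical mismatched names to show the difference, with `how='left'` keeping every row of the left table and `indicator=True` recording where each row came from.

```python
import pandas as pd

# Hypothetical tables whose names only partly overlap
employees = pd.DataFrame({'Name': ['Alice', 'Bob', 'Frank']})
departments = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Grace'],
    'Department': ['HR', 'Finance', 'IT'],
})

# Default inner join: only names present in both tables survive
inner = pd.merge(employees, departments, on='Name')

# Left join: keep all employees; unmatched rows get NaN for Department
left = pd.merge(employees, departments, on='Name', how='left', indicator=True)

print(inner)
print(left)
```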

6. Handle Missing Data

Q6: Create a DataFrame with missing values and fill the missing values with the mean of the column.

In [None]:
import numpy as np

# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, np.nan, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Phoenix'],
    'Salary': [70000, 80000, np.nan, 90000, 85000]
}

df_with_nan = pd.DataFrame(data_with_nan)

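# Fill missing numeric values with the mean of the column.
# Assign the result back: calling fillna(..., inplace=True) on a selected
# column is deprecated and may not modify the original DataFrame.
df_with_nan['Age'] = df_with_nan['Age'].fillna(df_with_nan['Age'].mean())
df_with_nan['Salary'] = df_with_nan['Salary'].fillna(df_with_nan['Salary'].mean())
# 'City' holds text and has no mean, so use a placeholder value instead
df_with_nan['City'] = df_with_nan['City'].fillna('Unknown')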

# Display the DataFrame with filled values
print(df_with_nan)
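
Before filling gaps it usually helps to see how many there are. As a minimal sketch on a numeric-only version of the data, `isna().sum()` counts the missing values per column, and passing the column means to `fillna` fills every numeric column in one step.

```python
import numpy as np
import pandas as pd

df_with_nan = pd.DataFrame({
    'Age': [24, np.nan, 22, 32, 29],
    'Salary': [70000, 80000, np.nan, 90000, 85000],
})

# Count the missing values in each column
missing_counts = df_with_nan.isna().sum()
print(missing_counts)

# mean() returns one value per column; fillna matches them by column name
filled = df_with_nan.fillna(df_with_nan.mean(numeric_only=True))
print(filled)
```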

7. Pivot Table

Q7: Create a pivot table from the original DataFrame showing the average Salary for each Age group.

In [None]:
# Create a pivot table indexed by 'Age' (each distinct age forms its own group)
pivot_table = df.pivot_table(values='Salary', index='Age', aggfunc='mean')

# Display the pivot table
print(pivot_table)
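
Because every age in this data is distinct, indexing by 'Age' yields one row per person. If "Age group" is meant as a range of ages, one option is to bin the ages with `pd.cut` first. The bin edges and labels below are assumptions chosen for illustration, not part of the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [24, 27, 22, 32, 29],
    'Salary': [70000, 80000, 75000, 90000, 85000],
})

# Illustrative bins: (20, 25], (25, 30], (30, 35]
df['AgeGroup'] = pd.cut(
    df['Age'],
    bins=[20, 25, 30, 35],
    labels=['21-25', '26-30', '31-35'],
)

# Average salary per age range
age_group_pivot = df.pivot_table(
    values='Salary', index='AgeGroup', aggfunc='mean', observed=True
)
print(age_group_pivot)
```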

8. Apply Function

Q8: Apply a custom function to the 'Salary' column that increases each salary by 5%.

In [None]:
# Define a custom function
def increase_salary(salary):
    return salary * 1.05

# Apply the custom function to the 'Salary' column
df['Increased Salary'] = df['Salary'].apply(increase_salary)

# Display the updated DataFrame
print(df)
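
For simple arithmetic like this, `apply` is not strictly required. The sketch below compares it with a vectorized expression that operates on the whole column at once, which is typically faster on large data. `apply` remains useful when the logic cannot be written as column arithmetic.

```python
import pandas as pd

df = pd.DataFrame({'Salary': [70000, 80000, 75000, 90000, 85000]})

def increase_salary(salary):
    return salary * 1.05

# Element-by-element: the function is called once per value
applied = df['Salary'].apply(increase_salary)

# Vectorized: one operation over the whole column
vectorized = df['Salary'] * 1.05

print(applied)
print(vectorized)
```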

9. Read and Write CSV

Q9: Save the DataFrame to a CSV file and then read it back into a new DataFrame.

In [None]:
# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read the CSV file into a new DataFrame
new_df = pd.read_csv('data.csv')

# Display the new DataFrame
print(new_df)
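
The `index=False` argument matters for a clean round trip. As a sketch of what happens without it, the example below writes to in-memory text buffers instead of files (an assumption made so that it leaves nothing on disk). When the index is written, it comes back as an extra column.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})

# to_csv() without a path returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))     # the written index returns as a column
print(list(without_index.columns))  # only the original columns
```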

10. Multi-indexing

Q10: Create a DataFrame with a multi-index (Name, City) and sort it by the index.

In [None]:
# Create a DataFrame with a multi-index
multi_index_df = df.set_index(['Name', 'City'])

# Sort the DataFrame by the index
sorted_multi_index_df = multi_index_df.sort_index()

# Display the sorted multi-index DataFrame
print(sorted_multi_index_df)
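
Once a multi-index is in place, rows can be selected by one or both levels. The following is a minimal sketch on a reduced copy of the data, using `.loc` with a tuple for an exact (Name, City) pair and `xs` to select on the inner level alone.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Salary': [70000, 80000, 75000],
})

indexed = df.set_index(['Name', 'City']).sort_index()

# A tuple addresses both index levels at once
alice_salary = indexed.loc[('Alice', 'New York'), 'Salary']

# xs() selects on a single named level, here the inner 'City' level
chicago_rows = indexed.xs('Chicago', level='City')

print(alice_salary)
print(chicago_rows)
```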