## Handling Missing Data and Imputation Techniques

### Exercise 1: Handling Missing Data - Dropping Rows

You are working on a machine learning project for which you have collected customer data: age, income, and purchase behavior. Some rows are missing a value in the `Purchase` column. Your task is to handle the missing data by dropping those rows.

In [None]:
import pandas as pd

# Define the DataFrame
df = pd.DataFrame({
    'Age': [32, 28, 45, 35],
    'Income': [50000, 60000, 75000, 80000],
    'Purchase': [1, None, 0, None]
})

# TODO: Drop rows where 'Purchase' is missing and assign the result back to df. Use the dropna() method

# Display the updated DataFrame
print(df)

# What are the advantages and limitations of this approach?
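
One possible way to complete this exercise is sketched below. It assumes that `dropna()` with its `subset` parameter is the intended approach; the name `df_clean` is chosen here for illustration and is not part of the exercise.

```python
import pandas as pd

# Same data as in the exercise cell
df = pd.DataFrame({
    'Age': [32, 28, 45, 35],
    'Income': [50000, 60000, 75000, 80000],
    'Purchase': [1, None, 0, None]
})

# Keep only rows where 'Purchase' is present.
# subset= restricts the check to that column, so a missing value
# in another column would not cause a row to be dropped.
df_clean = df.dropna(subset=['Purchase'])

print(df_clean)
print(f"Rows kept: {len(df_clean)} of {len(df)}")
```

With this data, two of the four rows are removed, which illustrates the main trade-off: no artificial values are introduced, but every dropped row also loses its valid `Age` and `Income` entries.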

### Exercise 2: Handling Missing Data - Imputing with Mean Value

You are analyzing a dataset of student performance in various subjects. Some scores are missing and are represented as `NaN`. Your task is to handle the missing data by replacing each missing score with the mean of the available scores for the same subject.

In [None]:
import pandas as pd
import numpy as np

# Define the DataFrame
df = pd.DataFrame({
    'Subject': ['Math', 'Science', 'English', 'Math', 'History']*6,
    'Score': [80, 90, np.nan, 70, np.nan, 95]*5
})

# TODO: Calculate the mean score for each subject

# TODO: Impute missing scores with the mean score for each subject. Use the transform() method and a lambda expression

# Display the updated DataFrame
print(df)

# What are the advantages and limitations of this approach?
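
A minimal sketch of group-wise mean imputation follows, assuming the combination of `groupby()`, `transform()`, and `fillna()` hinted at in the TODO comments. The name `subject_means` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as in the exercise cell
df = pd.DataFrame({
    'Subject': ['Math', 'Science', 'English', 'Math', 'History'] * 6,
    'Score': [80, 90, np.nan, 70, np.nan, 95] * 5
})

# Mean of the available scores per subject; mean() skips NaN by default
subject_means = df.groupby('Subject')['Score'].mean()
print(subject_means)

# transform() returns a Series aligned with the original rows, so each
# missing score is filled with the mean of its own subject group
df['Score'] = df.groupby('Subject')['Score'].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```

Unlike dropping rows, this keeps all 30 rows, but the imputed values pull each subject's distribution toward its mean and therefore reduce its variance.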

## Merging DataFrames

### Exercise 3: Merging Options

In this exercise, you will work with two DataFrames: one lists papers and their authors, the other lists papers and their citation counts. Your task is to merge them on the "Paper Title" column, the only column the two DataFrames share, and to compare the merge types (left, right, inner, and outer) to see which rows each one keeps.

In [None]:
# Define the first DataFrame with paper titles and authors' names
df1 = pd.DataFrame({
    'Paper Title': ['AI in Healthcare', 'Machine Learning Algorithms', 'Natural Language Processing'],
    'Author': ['Youssef Saeed', 'Fatima Ali', 'Ahmed Hassan']
})

# Define the second DataFrame with paper titles and citation counts
df2 = pd.DataFrame({
    'Paper Title': ['AI in Healthcare', 'Machine Learning Algorithms', 'Deep Learning for Image Recognition'],
    'Citations': [50, 100, 75]
})

# TODO: Perform a left merge and store the result in left_merge. Use the merge() method

# Print result
print("\nLeft Merge:")
print(left_merge)

# TODO: Perform a right merge and store the result in right_merge. Use the merge() method

# Print result
print("\nRight Merge:")
print(right_merge)

# TODO: Perform an inner merge and store the result in inner_merge. Use the merge() method

# Print result
print("\nInner Merge:")
print(inner_merge)

# TODO: Perform an outer merge and store the result in outer_merge. Use the merge() method

# Print result
print("\nOuter Merge:")
print(outer_merge)
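
The four merge types can be sketched as follows. This is one possible solution, assuming the join key is "Paper Title", since that is the only column present in both DataFrames.

```python
import pandas as pd

# Same data as in the exercise cell
df1 = pd.DataFrame({
    'Paper Title': ['AI in Healthcare', 'Machine Learning Algorithms', 'Natural Language Processing'],
    'Author': ['Youssef Saeed', 'Fatima Ali', 'Ahmed Hassan']
})
df2 = pd.DataFrame({
    'Paper Title': ['AI in Healthcare', 'Machine Learning Algorithms', 'Deep Learning for Image Recognition'],
    'Citations': [50, 100, 75]
})

# left: every row of df1; citations are NaN where df2 has no match
left_merge = pd.merge(df1, df2, on='Paper Title', how='left')

# right: every row of df2; author is NaN where df1 has no match
right_merge = pd.merge(df1, df2, on='Paper Title', how='right')

# inner: only titles that appear in both DataFrames
inner_merge = pd.merge(df1, df2, on='Paper Title', how='inner')

# outer: every title from either DataFrame
outer_merge = pd.merge(df1, df2, on='Paper Title', how='outer')

for name, result in [('Left', left_merge), ('Right', right_merge),
                     ('Inner', inner_merge), ('Outer', outer_merge)]:
    print(f"\n{name} Merge ({len(result)} rows):")
    print(result)
```

Comparing the row counts (3, 3, 2, and 4 for this data) is a quick way to check which titles each merge type keeps.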

## Concatenating DataFrames

### Exercise 4: Concatenate DataFrames Horizontally and Vertically

You have two DataFrames containing information about authors and their publications in the field of artificial intelligence. Your task is to concatenate these DataFrames both horizontally (side by side) and vertically (stacked), and then display both results.

In [None]:
# Define AI-papers data
df_papers_1 = pd.DataFrame({
    'Author': ['Saud Al-Mutairi', 'Aisha Al-Harbi', 'Fahad Al-Dosari'],
    'Publication': ['AI Review', 'AI Journal', 'AI Insights']
})

# Define additional AI-papers data
df_papers_2 = pd.DataFrame({
    'Author': ['Nora Al-Subai', 'Abdulaziz Al-Sulaiman', 'Hala Al-Mutlaq'],
    'Publication': ['AI Trends', 'AI Today', 'AI Innovations']
})

# TODO: Concatenate horizontally. Use the concat() method

# TODO: Concatenate vertically. Use the concat() method

# TODO: Display the concatenated DataFrames
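
Both directions of concatenation can be sketched with `pd.concat()` and its `axis` parameter. The names `horizontal` and `vertical`, and the choice of `ignore_index=True` for the stacked result, are assumptions made for this illustration.

```python
import pandas as pd

# Same data as in the exercise cell
df_papers_1 = pd.DataFrame({
    'Author': ['Saud Al-Mutairi', 'Aisha Al-Harbi', 'Fahad Al-Dosari'],
    'Publication': ['AI Review', 'AI Journal', 'AI Insights']
})
df_papers_2 = pd.DataFrame({
    'Author': ['Nora Al-Subai', 'Abdulaziz Al-Sulaiman', 'Hala Al-Mutlaq'],
    'Publication': ['AI Trends', 'AI Today', 'AI Innovations']
})

# axis=1 places the frames side by side, aligning rows on the index
horizontal = pd.concat([df_papers_1, df_papers_2], axis=1)

# axis=0 stacks the frames; ignore_index=True renumbers the rows 0..5
# instead of repeating the labels 0..2 from each input
vertical = pd.concat([df_papers_1, df_papers_2], axis=0, ignore_index=True)

print("Horizontal:")
print(horizontal)
print("\nVertical:")
print(vertical)
```

Note that the horizontal result contains two columns named `Author` and two named `Publication`, because both inputs use the same column names.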

## Reshaping and Transforming Data

### Exercise 5: More Practice with pandas

You are given a DataFrame containing sales data for an AI company. The DataFrame has the following columns: "Month", "Product", "Region", "Revenue".

Your task is to perform the following operations:

- Pivot the DataFrame to transform it into a wide format, where each region becomes a row, each unique month becomes a separate column, and the revenue values are filled in the corresponding cells.
- Group the data by region and calculate the total revenue for each region.
- Normalize the revenue values by dividing them by the maximum revenue in each region.
- Create a new column called "Quarter" by extracting the quarter information from the "Month" column.
- Sort the DataFrame by region in ascending order and then by quarter in descending order.

In [None]:
df = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "Product": ["Product A", "Product B", "Product A", "Product B", "Product A", "Product B"],
    "Region": ["North", "South", "North", "South", "North", "South"],
    "Revenue": [1000, 1500, 1200, 1800, 900, 1350]
})

# TODO: Pivot the DataFrame to transform it into a wide format. Use index="Region", columns="Month", and values="Revenue"

# TODO: Group the data by region and calculate the total revenue

# TODO: Normalize the revenue values

# TODO: Extract the quarter information from the "Month" column

# TODO: Sort the DataFrame by region (ascending) and then by quarter (descending)

# Print the results and observe the differences
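
The five operations can be sketched as below. This is one possible solution that follows the pivot hint in the TODO comment (`index="Region"`, `columns="Month"`). The names `wide`, `region_totals`, `month_to_quarter`, and `sorted_df`, the "Normalized Revenue" column, and the "Q1" to "Q4" labels are assumptions made for this illustration.

```python
import pandas as pd

# Same data as in the exercise cell
df = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "Product": ["Product A", "Product B", "Product A", "Product B", "Product A", "Product B"],
    "Region": ["North", "South", "North", "South", "North", "South"],
    "Revenue": [1000, 1500, 1200, 1800, 900, 1350]
})

# Wide format: one row per region, one column per month
wide = df.pivot(index="Region", columns="Month", values="Revenue")

# Total revenue per region
region_totals = df.groupby("Region")["Revenue"].sum()

# Revenue as a fraction of the largest revenue within the same region;
# transform("max") broadcasts each region's maximum back to its rows
df["Normalized Revenue"] = df["Revenue"] / df.groupby("Region")["Revenue"].transform("max")

# Map three-letter month abbreviations to calendar quarters
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_quarter = {m: f"Q{i // 3 + 1}" for i, m in enumerate(months)}
df["Quarter"] = df["Month"].map(month_to_quarter)

# Region ascending, then quarter descending within each region
sorted_df = df.sort_values(by=["Region", "Quarter"], ascending=[True, False])

print(wide)
print(region_totals)
print(sorted_df)
```

Because every month in this sample falls in the first quarter, the descending sort on "Quarter" has no visible effect here; it would matter only with data spanning several quarters.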