<div align="center">

# <b style="font-family: 'LUISS', 'Lato'">Pandas Exercises - Part I</b>
<h2 style="font-family: 'LUISS', 'Lato'">Python and R for Data Science</h2>

<h3 style="font-family: 'LUISS', 'Lato'">Data Science and Management</h3>
<img src="https://lollo93.gitlab.io/python-and-r-labs/dist/img/cliente-luiss.png">

<br><br><br>

</div>

## Exercise 1: DataFrame from a dictionary

Write a function `build_df` that takes a dictionary as input and returns a DataFrame. The dictionary has the following structure:
- keys: column names
- values: lists of values for each column

In [1]:
# Solution goes here
import pandas as pd
def build_df(data):
    df = pd.DataFrame(data)
    return df

### Test your code

Run this code to test your solution:

In [2]:
# Test case 1: Basic functionality
data = {
    'column1': [1, 2, 3, 4],
    'column2': ['a', 'b', 'c', 'd']
}
df = build_df(data)
assert df.equals(pd.DataFrame(data)), "Test case 1 failed"

# Test case 2: Empty dictionary
data = {}
df = build_df(data)
assert df.equals(pd.DataFrame(data)), "Test case 2 failed"

# Test case 3: Single column
data = {
    'column1': [1, 2, 3]
}
df = build_df(data)
assert df.equals(pd.DataFrame(data)), "Test case 3 failed"

# Test case 4: Different data types
data = {
    'column1': [1, 2, 3],
    'column2': [1.1, 2.2, 3.3],
    'column3': ['x', 'y', 'z']
}
df = build_df(data)
assert df.equals(pd.DataFrame(data)), "Test case 4 failed"

## Exercise 2: DataFrame with index labels

Write a Python function `build_df_with_index` that takes a dictionary of data and a list of index labels as arguments, and returns a pandas DataFrame with the specified index labels applied.

In [3]:
# Solution goes here
def build_df_with_index(data, labels):
    # Pass the labels by keyword so the intent is explicit
    return pd.DataFrame(data, index=labels)

### Test your code

Run this code to test your solution:

In [4]:
import pandas as pd
import numpy as np


# Test case 1: Basic functionality with valid data and labels
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = build_df_with_index(exam_data, labels)

# Verify DataFrame shape
assert df.shape == (10, 4), "Test case 1 failed: Incorrect DataFrame shape"

# Verify index labels
assert list(df.index) == labels, "Test case 1 failed: Index labels do not match"

# Verify column names
assert list(df.columns) == ['name', 'score', 'attempts', 'qualify'], "Test case 1 failed: Column names do not match"

# Verify data values
assert df.loc['a', 'name'] == 'Anastasia', "Test case 1 failed: Incorrect data in 'name' column"
assert df.loc['c', 'score'] == 16.5, "Test case 1 failed: Incorrect data in 'score' column"
assert pd.isna(df.loc['d', 'score']), "Test case 1 failed: Expected NaN value in 'score' column"
assert df.loc['j', 'qualify'] == 'yes', "Test case 1 failed: Incorrect data in 'qualify' column"
print("Test case 1 passed.")

# Test case 2: Empty data dictionary and labels
empty_data = {}
empty_labels = []

df_empty = build_df_with_index(empty_data, empty_labels)

# Verify DataFrame is empty
assert df_empty.empty, "Test case 2 failed: DataFrame should be empty"
print("Test case 2 passed.")

# Test case 3: Mismatched data length and labels (should raise an error)
mismatched_labels = ['a', 'b', 'c']
try:
    build_df_with_index(exam_data, mismatched_labels)
    print("Test case 3 failed: Expected an error due to mismatched labels length")
except ValueError:
    print("Test case 3 passed.")

# Test case 4: Non-string labels (numbers as labels)
numeric_labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

df_numeric_labels = build_df_with_index(exam_data, numeric_labels)

# Verify DataFrame index with numeric labels
assert list(df_numeric_labels.index) == numeric_labels, "Test case 4 failed: Index labels do not match"
print("Test case 4 passed.")

Test case 1 passed.
Test case 2 passed.
Test case 3 passed.
Test case 4 passed.


## Exercise 3: Extract first rows from a DataFrame

Define a function `get_first_n_rows(df, n)` that:
  - Takes a pandas DataFrame `df` and an integer `n`.
  - Returns a new DataFrame with the first `n` rows of `df`.

In [5]:
# Solution goes here
def get_first_n_rows(df, n):
    return df.head(n)

### Test your code

Run this code to test your solution:

In [6]:
import numpy as np

exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)

# Test case 1: Get first 3 rows
result = get_first_n_rows(df, 3)
assert result.equals(df.iloc[:3]), "Test case 1 failed: First 3 rows do not match"

# Test case 2: Get first 5 rows
result = get_first_n_rows(df, 5)
assert result.equals(df.iloc[:5]), "Test case 2 failed: First 5 rows do not match"

# Test case 3: Get first 0 rows (empty result)
result = get_first_n_rows(df, 0)
assert result.equals(df.iloc[:0]), "Test case 3 failed: Expected an empty DataFrame"

# Test case 4: Get first 10 rows (full DataFrame)
result = get_first_n_rows(df, 10)
assert result.equals(df), "Test case 4 failed: Expected the full DataFrame"

## Exercise 4: Select columns from a DataFrame

Define a function `select_from_df(df, columns)` that:
- Takes two arguments:
    - `df`: A pandas DataFrame containing the original data.
    - `columns`: A list of column names to include in the output DataFrame.
- Returns a new DataFrame containing only the columns specified in `columns`. If a column in `columns` does not exist in `df`, the function should raise a `KeyError` (as pandas would do).

In [13]:
# Your solution goes here
def select_from_df(df, columns):
    return df[columns]

### Test your code

In [14]:
# Sample DataFrame setup
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, 27, 22, 32],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'score': [85, 90, 88, 95]
}
df = pd.DataFrame(data)

# Test case 1: Select 'name' and 'score' columns
result = select_from_df(df, ['name', 'score'])
expected_df = df[['name', 'score']]
assert result.equals(expected_df), "Test case 1 failed: Expected DataFrame with 'name' and 'score' columns only."

# Test case 2: Select single column 'age'
result = select_from_df(df, ['age'])
expected_df = df[['age']]
assert result.equals(expected_df), "Test case 2 failed: Expected DataFrame with 'age' column only."

# Test case 3: Select all columns
result = select_from_df(df, ['name', 'age', 'city', 'score'])
assert result.equals(df), "Test case 3 failed: Expected full DataFrame."

# Test case 4: Non-existent column
try:
    select_from_df(df, ['name', 'non_existent_column'])
    print("Test case 4 failed: Expected KeyError for non-existent column.")
except KeyError:
    pass
    # print("Test case 4 passed: KeyError raised as expected.")

## Exercise 5: Select rows with column values greater than a specified value

Define a function `select_row_column_greater_than_value(df, column, value)` that:
- Takes three arguments:
    - `df`: A pandas DataFrame containing the original data.
    - `column`: A string representing the name of the column to compare.
    - `value`: A numeric value to compare against.
- Returns a new DataFrame with only the rows where the specified column’s value is greater than `value`.
- If the column does not exist in `df`, the function should raise a `KeyError` (as done by `pandas`).

In [26]:
# your solution goes here
def select_row_column_greater_than_value(df, column, value):
    # Boolean mask keeps only the rows where the column exceeds `value`
    return df.loc[df[column] > value]


### Test your code

In [27]:
# Sample DataFrame setup
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, 27, 22, 32],
    'score': [85, 90, 88, 95]
}
df = pd.DataFrame(data)

# Test case 1: Select rows where 'age' is greater than 25
result = select_row_column_greater_than_value(df, 'age', 25)
expected_df = df[df['age'] > 25]
assert result.equals(expected_df), "Test case 1 failed: Expected rows where 'age' > 25"

# Test case 2: Select rows where 'score' is greater than 88
result = select_row_column_greater_than_value(df, 'score', 88)
expected_df = df[df['score'] > 88]
assert result.equals(expected_df), "Test case 2 failed: Expected rows where 'score' > 88"

# Test case 3: Select rows where 'age' is greater than a high value (empty result)
result = select_row_column_greater_than_value(df, 'age', 100)
assert result.empty, "Test case 3 failed: Expected empty DataFrame when no rows meet criteria"

# Test case 4: Non-existent column
try:
    select_row_column_greater_than_value(df, 'height', 25)
    print("Test case 4 failed: Expected KeyError for non-existent column")
except KeyError:
    # print("Test case 4 passed: KeyError raised as expected")
    pass

## Exercise 6: DataFrame with null values

Define a function that takes a DataFrame as input and returns two values: a subset of the DataFrame containing only the rows with null (NaN) values, and the count of rows with null values.

In particular, define a function `find_and_count_rows_with_null(df)` that:
- Takes one argument:
    - `df`: A pandas DataFrame.
- Returns two values:
    - A new DataFrame containing only the rows from `df` that have at least one null value.
    - An integer representing the number of rows in `df` that contain at least one null value.

In [None]:
# Solution goes here
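def find_and_count_rows_with_null(df):
    # Boolean mask: True for rows that contain at least one null value
    mask = df.isna().any(axis=1)
    # Return the matching rows and how many of them there are
    return df[mask], int(mask.sum())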

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, np.nan, np.nan, 32],
    'score': [85, 90, np.nan, 95]
}
df = pd.DataFrame(data)

find_and_count_rows_with_null(df)

(      name  age  score
 1      Bob  NaN   90.0
 2  Charlie  NaN    NaN,
 2)

### Test your code

In [51]:
import numpy as np

# Sample DataFrame setup
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, np.nan, 22, 32],
    'score': [85, 90, np.nan, 95]
}
df = pd.DataFrame(data)

# Test case 1: Basic functionality
rows_with_null, count_null_rows = find_and_count_rows_with_null(df)
expected_rows = df[df.isnull().any(axis=1)]
assert rows_with_null.equals(expected_rows), "Test case 1 failed: Expected rows with null values do not match"
assert count_null_rows == 2, "Test case 1 failed: Expected count of rows with null values is 2"

# Test case 2: No null values
data_no_nulls = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, 28, 22, 32],
    'score': [85, 90, 88, 95]
}
df_no_nulls = pd.DataFrame(data_no_nulls)
rows_with_null, count_null_rows = find_and_count_rows_with_null(df_no_nulls)
assert rows_with_null.empty, "Test case 2 failed: Expected empty DataFrame for rows with null values"
assert count_null_rows == 0, "Test case 2 failed: Expected count of rows with null values is 0"

# Test case 3: All rows have null values in at least one column
data_all_nulls = {
    'name': ['Alice', 'Bob', np.nan, 'David'],
    'age': [np.nan, np.nan, np.nan, np.nan],
    'score': [85, np.nan, np.nan, 95]
}
df_all_nulls = pd.DataFrame(data_all_nulls)
rows_with_null, count_null_rows = find_and_count_rows_with_null(df_all_nulls)
assert rows_with_null.equals(df_all_nulls), "Test case 3 failed: Expected entire DataFrame for rows with null values"
assert count_null_rows == 4, "Test case 3 failed: Expected count of rows with null values is 4"

## Exercise 7: Sum of values from a DataFrame

Define a function that takes a DataFrame and a column name as input and returns the sum of all values in the specified column. In particular, define a function `sum_from_column(df, column)` that:
- Takes two arguments:
    - `df`: A pandas DataFrame containing the data.
    - `column`: A string representing the name of the column to sum.
- Returns the sum of all values in the specified column.
- If the column contains any null values, the function should ignore these values in the sum.
- If the column does not exist in `df`, the function should raise a `KeyError`.
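
One possible solution is sketched below. It is not the reference answer. It relies on `Series.sum`, which skips NaN values by default, and on plain column lookup, which raises `KeyError` for a missing name. The `demo` DataFrame is made-up data used only for illustration.

```python
import numpy as np
import pandas as pd

def sum_from_column(df, column):
    # df[column] raises KeyError if the column does not exist.
    # Series.sum skips NaN by default, so an all-NaN column sums to 0.
    return df[column].sum()

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({'age': [24, np.nan, 22, 32]})
total = sum_from_column(demo, 'age')
print(total)
```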

In [70]:
# Solution goes here

### Test your code

In [16]:
import numpy as np
import pandas as pd

# Sample DataFrame setup
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, np.nan, 22, 32],
    'score': [85, 90, 88, 95]
}
df = pd.DataFrame(data)

# Test case 1: Sum over 'age' column with null values
result = sum_from_column(df, 'age')
assert result == 78.0, f"Test case 1 failed: Expected sum of ages to be 78.0, but got {result}"

# Test case 2: Sum over 'score' column with no null values
result = sum_from_column(df, 'score')
assert result == 358, f"Test case 2 failed: Expected sum of scores to be 358, but got {result}"

# Test case 3: Non-existent column
try:
    sum_from_column(df, 'height')
    print("Test case 3 failed: Expected KeyError for non-existent column")
except KeyError:
    # print("Test case 3 passed: KeyError raised as expected")
    pass

# Test case 4: Sum over an empty column (all values are NaN)
data_all_null = {'col': [np.nan, np.nan, np.nan]}
df_all_null = pd.DataFrame(data_all_null)
result = sum_from_column(df_all_null, 'col')
assert result == 0, f"Test case 4 failed: Expected sum of 0 for an all-NaN column, but got {result}"

## Exercise 8: Count values greater than a specified threshold in a column

Define a function that takes a DataFrame, a column name, and a threshold value, and returns the count of values in the specified column that are greater than the threshold.

In particular, define a function `count_values_greater_than(df, column, threshold)` that:
- Takes three arguments:
    - `df`: A pandas DataFrame containing the data.
    - `column`: A string representing the name of the column to check.
    - `threshold`: A numeric value to compare against.
- Returns an integer representing the count of values in the specified column that are greater than the given threshold.
- If the column contains any null values, ignore these values in the count.
- If the column does not exist in `df`, the function should raise a `KeyError`.
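
A minimal sketch of one way to approach this, assuming a numeric column: comparing a Series with a number yields a boolean mask, and a comparison involving NaN is `False`, so null values drop out of the count without extra handling. The `demo` data is hypothetical.

```python
import numpy as np
import pandas as pd

def count_values_greater_than(df, column, threshold):
    # df[column] raises KeyError if the column does not exist.
    # NaN > threshold evaluates to False, so nulls are never counted.
    mask = df[column] > threshold
    # Summing a boolean mask counts the True entries
    return int(mask.sum())

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({'age': [24, np.nan, 22, 32]})
count = count_values_greater_than(demo, 'age', 23)
print(count)
```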


In [22]:
# Solution goes here

### Test your code

In [24]:
import numpy as np
import pandas as pd

# Sample DataFrame setup
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, np.nan, 22, 32],
    'score': [85, 90, 88, 95]
}
df = pd.DataFrame(data)

# Test case 1: Count values in 'age' column greater than 23
result = count_values_greater_than(df, 'age', 23)
assert result == 2, f"Test case 1 failed: Expected count of 2, but got {result}"

# Test case 2: Count values in 'score' column greater than 89
result = count_values_greater_than(df, 'score', 89)
assert result == 2, f"Test case 2 failed: Expected count of 2, but got {result}"

# Test case 3: Non-existent column
try:
    count_values_greater_than(df, 'height', 23)
    print("Test case 3 failed: Expected KeyError for non-existent column")
except KeyError:
    # print("Test case 3 passed: KeyError raised as expected")
    pass

# Test case 4: Count values greater than threshold when all values are below it
result = count_values_greater_than(df, 'age', 100)
assert result == 0, f"Test case 4 failed: Expected count of 0 for threshold above all values, but got {result}"

## Exercise 9: Add a new column to a DataFrame

Write a function `add_column(df, column_name, values)` that:
- Takes a pandas DataFrame `df`, a string `column_name`, and a list `values`.
- Adds a new column to `df` with the name `column_name` and the values from `values`.
- Returns the modified DataFrame.
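
One way this could look, as a sketch only: the function below assigns the column in place, which means the caller's DataFrame is modified and the same object is returned. That reading of "modified DataFrame" is an assumption. A non-mutating variant could use `df.assign` instead.

```python
import pandas as pd

def add_column(df, column_name, values):
    # In-place assignment; pandas raises ValueError if len(values)
    # does not match the number of rows in a non-empty DataFrame
    df[column_name] = values
    return df

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({'A': [1, 2, 3]})
out = add_column(demo, 'B', [4, 5, 6])
print(out)
```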

In [28]:
# Solution goes here

### Test your code

In [27]:
import pandas as pd

# Test case 1: Add a column with correct length
df = pd.DataFrame({'A': [1, 2, 3]})
result = add_column(df.copy(), 'B', [4, 5, 6])
assert 'B' in result.columns and all(result['B'] == [4, 5, 6]), "Test case 1 failed"

# Test case 2: Add a column to empty DataFrame
df_empty = pd.DataFrame()
result = add_column(df_empty.copy(), 'D', [])
assert 'D' in result.columns and result.empty, "Test case 2 failed"

## Exercise 10: Sort a DataFrame by a column

Write a function `sort_by_column(df, column, ascending=True)` that:
- Takes a pandas DataFrame `df`, a string `column`, and a boolean `ascending` (default `True`).
- Returns a new DataFrame sorted by the specified column in ascending or descending order.
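
A short sketch of a possible solution, built on `DataFrame.sort_values`, which returns a new DataFrame and leaves the input unchanged. The `demo` data is made up for illustration.

```python
import pandas as pd

def sort_by_column(df, column, ascending=True):
    # sort_values returns a sorted copy; the original row order in df is kept
    return df.sort_values(by=column, ascending=ascending)

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]})
sorted_demo = sort_by_column(demo, 'A')
print(sorted_demo)
```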

In [None]:
# Solution goes here

### Test your code

In [30]:
import pandas as pd

# Test case 1: Sort ascending
df = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]})
result = sort_by_column(df, 'A')
assert list(result['A']) == [1, 2, 3], "Test case 1 failed"

# Test case 2: Sort descending
result = sort_by_column(df, 'B', ascending=False)
assert list(result['B']) == [9, 8, 7], "Test case 2 failed"

# Test case 3: Sort with duplicate values
df_dup = pd.DataFrame({'A': [2, 1, 2], 'B': [5, 6, 7]})
result = sort_by_column(df_dup, 'A')
assert list(result['A']) == [1, 2, 2], "Test case 3 failed"

## Exercise 11: Filter with multiple conditions, sort, and select columns

Write a function `complex_filter_sort_select(df)` that:
- Takes a pandas DataFrame `df`.
- Filters the rows where **(the value in 'age' is greater than 25 and 'score' is at least 90) or (the value in 'name' is 'Anna')**.
- Sorts the filtered DataFrame by `'score'` in descending order.
- Returns a new DataFrame containing only the columns `'name'`, `'age'`, and `'score'`.
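
The three steps above can be sketched as follows. This is one possible approach, not the reference solution. Note the parentheses around each comparison: `&` and `|` bind more tightly than `>` and `==` in Python, so they are required. The `demo` data, including the extra `city` column, is hypothetical.

```python
import pandas as pd

def complex_filter_sort_select(df):
    # (age > 25 AND score >= 90) OR name == 'Anna'
    mask = ((df['age'] > 25) & (df['score'] >= 90)) | (df['name'] == 'Anna')
    filtered = df.loc[mask]
    # Sort by score, highest first, then keep only the requested columns
    ordered = filtered.sort_values(by='score', ascending=False)
    return ordered[['name', 'age', 'score']]

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({
    'name': ['Anna', 'Eve', 'Raj'],
    'age': [20, 30, 45],
    'score': [70, 91, 80],
    'city': ['Rome', 'Milan', 'Turin'],
})
out = complex_filter_sort_select(demo)
print(out)
```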

In [None]:
# Solution goes here

### Test your code

In [32]:
import pandas as pd

data = {
    'name': ['Anna', 'Bob', 'Cleo', 'Dan'],
    'age': [23, 35, 29, 40],
    'score': [88, 92, 85, 95]
}
df = pd.DataFrame(data)

# Test: should select Anna (name=='Anna'), Bob (age>25 and score>=90), Dan (age>25 and score>=90)
result = complex_filter_sort_select(df)
expected = pd.DataFrame({
    'name': ['Dan', 'Bob', 'Anna'],
    'age': [40, 35, 23],
    'score': [95, 92, 88]
}, index=[3,1,0])
assert result.reset_index(drop=True).equals(expected.reset_index(drop=True)), "Test failed"

## Exercise 12: Compute the average grade for each student

Write a function `add_average_grade(df)` that:
- Takes a pandas DataFrame `df` where each row represents a student and each column (e.g., 'exam1', 'exam2', ...) represents the grade for an exam.
- Adds a new column called `'average'` containing the mean of the grades for each student (row).
- Returns the modified DataFrame.
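
A minimal sketch of one possible solution, assuming that every column present when the function is called is a numeric exam grade. If the DataFrame also held non-grade columns, they would need to be excluded before taking the mean. The `demo` data is made up.

```python
import pandas as pd

def add_average_grade(df):
    # axis=1 computes the mean across columns, i.e. one value per row.
    # Assumption: all existing columns are numeric exam grades.
    df['average'] = df.mean(axis=1)
    return df

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({'exam1': [28, 22], 'exam2': [30, 18]},
                    index=['Alice', 'Bob'])
out = add_average_grade(demo)
print(out)
```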

In [None]:
# Solution goes here

### Test your code

In [34]:
import pandas as pd

data = {
    'exam1': [28, 22, 30],
    'exam2': [30, 18, 27],
    'exam3': [26, 25, 29]
}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Carla'])

result = add_average_grade(df.copy())
expected_averages = [(28+30+26)/3, (22+18+25)/3, (30+27+29)/3]
assert all(abs(a-b) < 1e-6 for a, b in zip(result['average'], expected_averages)), "Test failed"

## Exercise 13: Replace NaN, drop rows, and filter with multiple conditions and index

Write a function `clean_and_filter(df, score_column, col1, col2, val1, val2, min_index=None, use_and=True)` that:
- Takes a pandas DataFrame `df`.
- Replaces all NaN values in the column `score_column` with 0.
- Drops all rows that have any remaining NaN values in any other column.
- Filters the DataFrame to keep only the rows where:
    - If `use_and` is True: the value in `col1` is equal to `val1` **and** the value in `col2` is equal to `val2`.
    - If `use_and` is False: the value in `col1` is equal to `val1` **or** the value in `col2` is equal to `val2`.
    - In addition, only keep rows whose index is greater than or equal to `min_index` if `min_index` is not None.
- Returns the resulting DataFrame.
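
The steps above can be sketched as follows. This is one possible reading of the specification, not a definitive implementation. It works on a copy so the caller's DataFrame is left untouched, which is an assumption, since the text does not say whether in-place modification is acceptable. The `demo` data is hypothetical.

```python
import numpy as np
import pandas as pd

def clean_and_filter(df, score_column, col1, col2, val1, val2,
                     min_index=None, use_and=True):
    out = df.copy()
    # Step 1: replace NaN in the score column with 0
    out[score_column] = out[score_column].fillna(0)
    # Step 2: drop rows that still contain NaN in any other column
    out = out.dropna()
    # Step 3: combine the two equality conditions with AND or OR
    cond1 = out[col1] == val1
    cond2 = out[col2] == val2
    mask = (cond1 & cond2) if use_and else (cond1 | cond2)
    # Optional index filter
    if min_index is not None:
        mask = mask & (out.index >= min_index)
    return out[mask]

# Illustrative data (not part of the exercise tests)
demo = pd.DataFrame({
    'score': [np.nan, 5, 7],
    'A': ['x', 'y', 'x'],
    'B': [1, 2, np.nan],
})
and_rows = clean_and_filter(demo, 'score', 'A', 'B', 'x', 1)
or_rows = clean_and_filter(demo, 'score', 'A', 'B', 'x', 2, use_and=False)
print(and_rows)
print(or_rows)
```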

In [None]:
# Solution goes here

### Test your code

In [24]:
import pandas as pd
import numpy as np

data = {
    'score': [np.nan, 5, np.nan, 8, 7],
    'A': ['x', 'y', 'x', 'z', 'x'],
    'B': [1, 2, 1, 2, np.nan]
}
df = pd.DataFrame(data)

# Test 1: Replace NaN in 'score', drop other NaN, filter A=='x' and B==1, min_index=1
result = clean_and_filter(df, 'score', 'A', 'B', 'x', 1, min_index=1, use_and=True)
expected = pd.DataFrame({'score': [0.0], 'A': ['x'], 'B': [1.0]}, index=[2])
assert result.reset_index(drop=True).equals(expected.reset_index(drop=True)), "Test 1 failed"

# Test 2: Replace NaN in 'score', drop other NaN, filter A=='x' or B==2, min_index=1
result = clean_and_filter(df, 'score', 'A', 'B', 'x', 2, min_index=1, use_and=False)
expected = pd.DataFrame({'score': [5.0, 0.0, 8.0], 'A': ['y', 'x', 'z'], 'B': [2.0, 1.0, 2.0]}, index=[1,2,3])
assert result.reset_index(drop=True).equals(expected.reset_index(drop=True)), "Test 2 failed"

# Test 3: Replace NaN in 'score', drop other NaN, filter A=='z' and B==2, no min_index filter
result = clean_and_filter(df, 'score', 'A', 'B', 'z', 2, min_index=None, use_and=True)
expected = pd.DataFrame({'score': [8.0], 'A': ['z'], 'B': [2.0]}, index=[3])
assert result.reset_index(drop=True).equals(expected.reset_index(drop=True)), "Test 3 failed"
