Task 1: Build and load the dataset
Create a list of dictionaries called attendance_raw with exactly 24 records. Each record must include:

- student_id in the format S001 to S024
- cohort as one of ["alpha", "beta", "gamma"]
- attended_sessions as an integer between 0 and 6
- expected_sessions as the integer 6

Then load the list into a DataFrame named attendance. Print the first five rows and call info() to confirm the structure and data types.

In [1]:
import pandas as pd
import numpy as np

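np.random.seed(19)  # fixed seed so the random data is reproducible

# Column-oriented dict: each key maps to 24 values (the scalar 6 is broadcast to every row)
attendance_raw = {
    "student_id": [f"S{i:03}" for i in range(1, 25)],
    "cohort": np.random.choice(["alpha", "beta", "gamma"], size=24),
    "attended_sessions": np.random.randint(0, 7, size=24),  # upper bound 7 is exclusive
    "expected_sessions": 6,
}

attendance = pd.DataFrame(attendance_raw)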
print(attendance.head(), '\n')
attendance.info()  # info() prints its report directly and returns None

  student_id cohort  attended_sessions  expected_sessions
0       S001   beta                  1                  6
1       S002  gamma                  1                  6
2       S003   beta                  2                  6
3       S004  gamma                  5                  6
4       S005  alpha                  3                  6

<class 'pandas.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   student_id         24 non-null     str
 1   cohort             24 non-null     str
 2   attended_sessions  24 non-null     int32
 3   expected_sessions  24 non-null     int64
dtypes: int32(1), int64(1), str(2)
memory usage: 804.0 bytes
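
The task statement asks for a list of dictionaries, whereas the cell above passes a dictionary of columns to pd.DataFrame. A minimal sketch of the record-oriented form follows. It assumes the same seed and the same order of random calls as the cell above; the helper names cohorts and sessions are introduced here for illustration only. Because the values are converted to plain Python ints, the dtype reported by info() for attended_sessions may differ from the output shown above.

```python
import numpy as np
import pandas as pd

np.random.seed(19)  # same seed and call order as the original cell
cohorts = np.random.choice(["alpha", "beta", "gamma"], size=24)
sessions = np.random.randint(0, 7, size=24)  # integers 0..6 inclusive

# One dictionary per student: a list of 24 records
attendance_raw = [
    {
        "student_id": f"S{i:03}",
        "cohort": str(cohort),              # plain Python str
        "attended_sessions": int(attended),  # plain Python int
        "expected_sessions": 6,
    }
    for i, (cohort, attended) in enumerate(zip(cohorts, sessions), start=1)
]

attendance = pd.DataFrame(attendance_raw)
print(attendance.head())
attendance.info()
```

pd.DataFrame accepts both shapes; the list-of-records form simply matches the wording of the task.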


Task 2: Set an index and validate alignment
Set student_id as the index and store the result in attendance_indexed. Create a Series named excused_absences with at least 10 student IDs (some IDs must not exist in the DataFrame). Add this Series to attended_sessions to create a new column adjusted_attendance. Confirm that rows without matching IDs become missing in adjusted_attendance. Then fill missing values in adjusted_attendance with the original attended_sessions and show the updated column.

In [2]:
attendance_indexed = attendance.set_index("student_id")

# Excused absences keyed by student ID; S056, S028, and S058 do not exist in the DataFrame
excused_info = {
    "S003": 1,
    "S006": 2,
    "S056": 3,
    "S001": 1,
    "S007": 1,
    "S028": 2,
    "S002": 2,
    "S015": 4,
    "S004": 7,
    "S058": 2,
}

excused_absences = pd.Series(excused_info)

# Addition aligns on index labels; students with no entry in excused_absences get NaN
attendance_indexed["adjusted_attendance"] = attendance_indexed["attended_sessions"] + excused_absences
missing_rows = attendance_indexed["adjusted_attendance"].isna().sum()  # count of students without a match
print("DATAFRAME WITH NAN ROWS: ", attendance_indexed, '\n')

# Fall back to the original count wherever no excused absence was recorded
attendance_indexed["adjusted_attendance"] = attendance_indexed["adjusted_attendance"].fillna(attendance_indexed["attended_sessions"])
print("UPDATED DATAFRAME: ", attendance_indexed)



DATAFRAME WITH NAN ROWS:             cohort  attended_sessions  expected_sessions  adjusted_attendance
student_id                                                                  
S001         beta                  1                  6                  2.0
S002        gamma                  1                  6                  3.0
S003         beta                  2                  6                  3.0
S004        gamma                  5                  6                 12.0
S005        alpha                  3                  6                  NaN
S006        alpha                  1                  6                  3.0
S007        gamma                  0                  6                  1.0
S008        gamma                  4                  6                  NaN
S009        alpha                  5                  6                  NaN
S010        gamma                  5                  6                  NaN
S011        gamma                  0              
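
The alignment behaviour that the task asks to confirm can be checked explicitly on a small example. The sketch below uses a hypothetical four-student Series (values copied from the first rows above) and a three-entry excused Series; the names attended, excused, unmatched_ids, and missing_ids are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of attendance_indexed["attended_sessions"]
attended = pd.Series(
    [1, 1, 2, 5],
    index=pd.Index(["S001", "S002", "S003", "S004"], name="student_id"),
    name="attended_sessions",
)
# S056 has no counterpart in `attended`
excused = pd.Series({"S001": 1, "S003": 1, "S056": 3})

# Series addition aligns on labels and returns the union of both indexes;
# any label present on only one side yields NaN
total = attended + excused

# Assigning to a DataFrame column keeps only the frame's own labels,
# which reindex() reproduces here
adjusted = total.reindex(attended.index)

unmatched_ids = excused.index.difference(attended.index)  # IDs dropped by the reindex
missing_ids = adjusted.index[adjusted.isna()]              # students left as NaN

# Restore the original count where nothing was excused
filled = adjusted.fillna(attended)

print("Unmatched excused IDs:", list(unmatched_ids))
print("Students with NaN:", list(missing_ids))
print(filled)
```

Note that the NaN values force the column to a float dtype, which is why adjusted_attendance is displayed with decimals in the output above.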

Task 3: Clean and normalize categories
Introduce a small inconsistency by modifying a few cohort values to include extra whitespace and inconsistent casing. Then write pandas code to normalize the cohort column by stripping whitespace and converting to lowercase. After cleaning, display the unique cohorts to confirm that the inconsistencies are resolved.

In [3]:
# Introduce inconsistent whitespace and casing in a few cohort labels
attendance_indexed.loc["S001", "cohort"] = "  alpha  "
attendance_indexed.loc["S002", "cohort"] = " BETA "
attendance_indexed.loc["S003", "cohort"] = " GaMmA "

# Normalize: strip surrounding whitespace, then convert to lowercase
attendance_indexed["cohort"] = attendance_indexed["cohort"].str.strip().str.lower()
unique_cohorts = attendance_indexed["cohort"].unique()
print("unique cohorts: ", unique_cohorts)

unique cohorts:  <StringArray>
['alpha', 'beta', 'gamma']
Length: 3, dtype: str
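
The assignments in the cell above also move S001, S002, and S003 into different cohorts (S001, for example, is beta in the Task 1 output and becomes alpha). If the intent is only to perturb the formatting, a variant along the following lines keeps each student's cohort unchanged. The four-row Series is a hypothetical stand-in for attendance_indexed["cohort"].

```python
import pandas as pd

# Hypothetical stand-in for the cohort column
cohort = pd.Series(
    ["beta", "gamma", "beta", "alpha"],
    index=pd.Index(["S001", "S002", "S003", "S004"], name="student_id"),
    name="cohort",
)

# Derive messy variants from each student's existing label
messy = cohort.copy()
messy.loc["S001"] = f"  {cohort.loc['S001']}  "       # extra whitespace
messy.loc["S002"] = f" {cohort.loc['S002'].upper()} "  # whitespace and upper case
messy.loc["S003"] = cohort.loc["S003"].title()         # mixed case

print("Distinct labels before cleaning:", messy.nunique())

# Same normalization as in the cell above
cleaned = messy.str.strip().str.lower()
print("Distinct labels after cleaning:", cleaned.nunique())
```

Comparing cleaned against the original column is then a direct way to confirm that the normalization recovered every label.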


Task 4: Filter and compute summaries
Filter the DataFrame to students where attended_sessions is below expected_sessions. Store the result in low_attendance. Compute the average attended_sessions by cohort using group by. Print the summary and verify that cohorts in the summary match the cleaned cohorts.

In [4]:
# Boolean mask: students who attended fewer sessions than expected
low_attendance = attendance_indexed[attendance_indexed["attended_sessions"] < attendance_indexed["expected_sessions"]]

# Mean attended_sessions per cohort, computed over all students
average_by_cohort = attendance_indexed.groupby("cohort")["attended_sessions"].mean()
print("Average Attendance by Cohort:", average_by_cohort, '\n')

# The group keys should be exactly the cleaned cohort labels
print("Cohorts in summary: ", list(average_by_cohort.index))
print("Cleaned cohorts match:", set(average_by_cohort.index) == set(attendance_indexed["cohort"].unique()))

Average Attendance by Cohort: cohort
alpha    2.875000
beta     2.142857
gamma    3.555556
Name: attended_sessions, dtype: float64 

Cohorts in summary:  ['alpha', 'beta', 'gamma']
Cleaned cohorts match: True
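
The same filter-then-group pattern is sketched below on hypothetical four-row data. Reporting the group size next to the mean is an addition made here for illustration, not something the task requires; it makes a cohort average that rests on very few students easy to spot.

```python
import pandas as pd

# Hypothetical miniature of attendance_indexed
df = pd.DataFrame(
    {
        "cohort": ["alpha", "alpha", "beta", "gamma"],
        "attended_sessions": [6, 2, 3, 5],
        "expected_sessions": 6,
    },
    index=pd.Index(["S001", "S002", "S003", "S004"], name="student_id"),
)

# Row filter with a boolean mask built from a column-to-column comparison
low = df[df["attended_sessions"] < df["expected_sessions"]]

# Mean and group size per cohort in one pass
summary = df.groupby("cohort")["attended_sessions"].agg(["mean", "count"])

print(low)
print(summary)
```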


Task 5: Add a derived field and validate it
Create a new column attendance_ok that is True when attended_sessions is at least expected_sessions, otherwise False. Use a boolean comparison rather than a loop. Then validate the column by confirming that every row in low_attendance has attendance_ok equal to False.

In [5]:
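# Vectorized comparison across the whole column; no explicit loop
attendance_indexed["attendance_ok"] = attendance_indexed["attended_sessions"] >= attendance_indexed["expected_sessions"]
# Every row in low_attendance must have attendance_ok equal to False
print((~attendance_indexed.loc[low_attendance.index, "attendance_ok"]).all())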

True
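
The validation can also be written with assertions, so that a failed check stops the notebook instead of printing False. A minimal sketch on hypothetical four-row data; the second assertion, which checks the remaining rows, goes slightly beyond what the task asks.

```python
import pandas as pd

# Hypothetical miniature of attendance_indexed
df = pd.DataFrame(
    {"attended_sessions": [6, 2, 3, 5], "expected_sessions": 6},
    index=pd.Index(["S001", "S002", "S003", "S004"], name="student_id"),
)

df["attendance_ok"] = df["attended_sessions"] >= df["expected_sessions"]
low = df[df["attended_sessions"] < df["expected_sessions"]]

# No low-attendance row may be flagged as OK
assert not df.loc[low.index, "attendance_ok"].any()

# Every row outside low must be flagged as OK
rest = df.index.difference(low.index)
assert df.loc[rest, "attendance_ok"].all()

print("attendance_ok is consistent with low attendance filter")
```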
