##### Task 1: Build and load the dataset
Create a list of dictionaries called attendance_raw with exactly 24 records. Each record must include:

- student_id in the format S001 to S024
- cohort as one of ["alpha", "beta", "gamma"]
- attended_sessions as an integer between 0 and 6
- expected_sessions as the integer 6

Then load the list into a DataFrame named attendance. Print the first five rows and call info() to confirm the structure and data types.

In [81]:
import pandas as pd
import numpy as np

In [82]:
# Seeded generator so the random data is reproducible
rng = np.random.default_rng(42)

In [83]:
cohorts = ["alpha", "beta", "gamma"]

In [84]:
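# Build 24 records; cohort and attended_sessions come from the seeded generator.
# Note: range(24) yields S000..S023 (as in the output below); range(1, 25)
# would give the S001..S024 format requested in the task.
attendance_raw = [
    {
        "student_id": f"S{i:03d}",
        "cohort": rng.choice(cohorts),
        "attended_sessions": int(rng.integers(0, 7)),  # upper bound is exclusive
        "expected_sessions": 6,
    }
    for i in range(24)
]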
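
The requirements in the task description can also be checked programmatically before loading the data. The following is a minimal sketch, not part of the original notebook: `validate_records` is a hypothetical helper, and the sample records are built deterministically (no random generator) with IDs starting at S001, purely for illustration.

```python
# Hypothetical sample records; the notebook itself draws values from a seeded NumPy generator.
cohorts = ["alpha", "beta", "gamma"]
records = [
    {
        "student_id": f"S{i:03d}",            # i runs 1..24, giving S001..S024
        "cohort": cohorts[i % len(cohorts)],
        "attended_sessions": i % 7,            # always between 0 and 6
        "expected_sessions": 6,
    }
    for i in range(1, 25)
]


def validate_records(rows):
    """Return a list of problems; an empty list means the spec is met."""
    problems = []
    if len(rows) != 24:
        problems.append(f"expected 24 records, found {len(rows)}")
    expected_ids = [f"S{i:03d}" for i in range(1, 25)]
    if sorted(r["student_id"] for r in rows) != expected_ids:
        problems.append("student_id values are not exactly S001..S024")
    for r in rows:
        sid = r["student_id"]
        if r["cohort"] not in cohorts:
            problems.append(f"{sid}: unknown cohort {r['cohort']!r}")
        attended = r["attended_sessions"]
        if not (isinstance(attended, int) and 0 <= attended <= 6):
            problems.append(f"{sid}: attended_sessions out of range")
        if r["expected_sessions"] != 6:
            problems.append(f"{sid}: expected_sessions is not 6")
    return problems


print(validate_records(records))
```

Returning a list of problems rather than raising on the first one makes it easier to see every violation at once.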

In [85]:
attendance = pd.DataFrame(attendance_raw)
attendance

  student_id cohort  attended_sessions  expected_sessions
0       S000  alpha                  5                  6
1       S001   beta                  3                  6
2       S002   beta                  6                  6
3       S003  alpha                  4                  6
4       S004  alpha                  0                  6
5       S005   beta                  6                  6
6       S006  gamma                  5                  6
7       S007  gamma                  5                  6
8       S008   beta                  0                  6
9       S009  gamma                  3                  6


In [86]:
attendance.head()

  student_id cohort  attended_sessions  expected_sessions
0       S000  alpha                  5                  6
1       S001   beta                  3                  6
2       S002   beta                  6                  6
3       S003  alpha                  4                  6
4       S004  alpha                  0                  6


In [87]:
attendance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   student_id         24 non-null     object
 1   cohort             24 non-null     object
 2   attended_sessions  24 non-null     int64 
 3   expected_sessions  24 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 900.0+ bytes
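
Reading the `info()` output by eye works for a small frame, but the same checks can be expressed as assertions. The sketch below assumes a tiny hypothetical frame with the same four columns; `check_structure` is an illustrative helper name, not something defined in the notebook.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Tiny hypothetical frame with the same columns as `attendance`.
sample = pd.DataFrame(
    {
        "student_id": ["S001", "S002", "S003"],
        "cohort": ["alpha", "beta", "gamma"],
        "attended_sessions": [5, 3, 6],
        "expected_sessions": [6, 6, 6],
    }
)


def check_structure(df, n_rows):
    """Assert the row count, column names, integer dtypes, and absence of missing values."""
    expected_columns = ["student_id", "cohort", "attended_sessions", "expected_sessions"]
    assert df.shape == (n_rows, len(expected_columns)), df.shape
    assert list(df.columns) == expected_columns
    assert is_integer_dtype(df["attended_sessions"])
    assert is_integer_dtype(df["expected_sessions"])
    assert df.notna().all().all()  # no missing values in any column
    return True


check_structure(sample, n_rows=3)
```

For the notebook's own data the call would presumably be `check_structure(attendance, n_rows=24)`.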


##### Task 2: Set an index and validate alignment
Set student_id as the index and store the result in attendance_indexed. Create a Series named excused_absences with at least 10 student IDs (some IDs must not exist in the DataFrame). Add this Series to attended_sessions to create a new column adjusted_attendance. Confirm that rows without matching IDs become missing in adjusted_attendance. Then fill missing values in adjusted_attendance with the original attended_sessions and show the updated column.

In [88]:
attendance_indexed = attendance.set_index("student_id")

In [89]:
attendance_indexed.head()

           cohort  attended_sessions  expected_sessions
student_id
S000        alpha                  5                  6
S001         beta                  3                  6
S002         beta                  6                  6
S003        alpha                  4                  6
S004        alpha                  0                  6


In [90]:
excused_absences=pd.Series(
    data=[1, 2, 1, 0, 1, 3, 1, 0, 2, 1],  
    index=["S001", "S002", "S005", "S010", "S012", "S015", "S018", "S020", "S025", "S030"]
)
excused_absences

S001    1
S002    2
S005    1
S010    0
S012    1
S015    3
S018    1
S020    0
S025    2
S030    1
dtype: int64

In [91]:
attendance_indexed['adjusted_attendance'] = attendance_indexed['attended_sessions'] + excused_absences
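
The addition above relies on index alignment: pandas matches values by label, not by position. A minimal sketch with three hypothetical students (values not taken from the notebook) illustrates what happens to labels that appear on only one side.

```python
import pandas as pd

# Hypothetical three-student example; values are not taken from the notebook.
attended = pd.Series([5, 3, 6], index=["S001", "S002", "S003"], name="attended_sessions")
excused = pd.Series([1, 2], index=["S002", "S999"])  # S999 has no matching student

# Arithmetic aligns on index labels and returns the union of both indexes;
# any label missing from either operand produces NaN.
raw_sum = attended + excused
print(raw_sum)

# Restricting the result to the students' index mimics what assigning the
# Series to a DataFrame column does: unmatched labels such as S999 are dropped.
aligned = raw_sum.reindex(attended.index)
print(aligned)
```

Because NaN cannot be stored in an integer column, the result is float even though both inputs are integers.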

In [92]:
print(attendance_indexed[['attended_sessions', 'adjusted_attendance']])

            attended_sessions  adjusted_attendance
student_id                                        
S000                        5                  NaN
S001                        3                  4.0
S002                        6                  8.0
S003                        4                  NaN
S004                        0                  NaN
S005                        6                  7.0
S006                        5                  NaN
S007                        5                  NaN
S008                        0                  NaN
S009                        3                  NaN
S010                        2                  2.0
S011                        6                  NaN
S012                        4                  5.0
S013                        5                  NaN
S014                        3                  NaN
S015                        1                  4.0
S016                        3                  NaN
S017                        0  

In [93]:
attendance_indexed['adjusted_attendance'] = attendance_indexed['adjusted_attendance'].fillna(attendance_indexed['attended_sessions'])
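
The `fillna` step leaves the column as float. If an integer column is preferred, one possible alternative (an assumption about what is wanted, not what the task requires) is to treat students without an excused-absence entry as having zero before adding. The sketch below compares both approaches on hypothetical data.

```python
import pandas as pd

# Hypothetical data for illustration only.
attended = pd.Series([5, 3, 6], index=["S001", "S002", "S003"])
excused = pd.Series([1, 2], index=["S002", "S999"])

# Approach used in the notebook: add, then fall back to the original values.
via_fillna = (attended + excused).reindex(attended.index).fillna(attended)

# Alternative: reindex the excused absences to the students first, filling
# gaps with 0, so no NaN is ever introduced and the dtype stays integer.
via_zero_fill = attended + excused.reindex(attended.index, fill_value=0)

print(via_fillna.dtype, via_zero_fill.dtype)
```

Both approaches give the same numbers; they differ only in dtype and in whether the intermediate missing values are visible, which the task explicitly asks to confirm.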

In [94]:
attendance_indexed

           cohort  attended_sessions  expected_sessions  adjusted_attendance
student_id
S000        alpha                  5                  6                  5.0
S001         beta                  3                  6                  4.0
S002         beta                  6                  6                  8.0
S003        alpha                  4                  6                  4.0
S004        alpha                  0                  6                  0.0
S005         beta                  6                  6                  7.0
S006        gamma                  5                  6                  5.0
S007        gamma                  5                  6                  5.0
S008         beta                  0                  6                  0.0
S009        gamma                  3                  6                  3.0


In [95]:
print(attendance_indexed[['attended_sessions', 'adjusted_attendance']])

            attended_sessions  adjusted_attendance
student_id                                        
S000                        5                  5.0
S001                        3                  4.0
S002                        6                  8.0
S003                        4                  4.0
S004                        0                  0.0
S005                        6                  7.0
S006                        5                  5.0
S007                        5                  5.0
S008                        0                  0.0
S009                        3                  3.0
S010                        2                  2.0
S011                        6                  6.0
S012                        4                  5.0
S013                        5                  5.0
S014                        3                  3.0
S015                        1                  4.0
S016                        3                  3.0
S017                        0  

##### Task 3: Clean and normalize categories
Introduce a small inconsistency by modifying a few cohort values to include extra whitespace and inconsistent casing. Then write pandas code to normalize the cohort column by stripping whitespace and converting to lowercase. After cleaning, display the unique cohorts to confirm that the inconsistencies are resolved.

In [96]:
# Note: S002 was 'beta' before this cell, so this line also moves the student to the alpha cohort
attendance_indexed.loc['S002', 'cohort'] = ' Alpha '
attendance_indexed.loc['S005', 'cohort'] = 'BETA'     
attendance_indexed.loc['S012', 'cohort'] = ' gamma'  
attendance_indexed.loc['S015', 'cohort'] = 'GAmma '

In [97]:
print(attendance_indexed[['cohort']].head(15))

             cohort
student_id         
S000          alpha
S001           beta
S002         Alpha 
S003          alpha
S004          alpha
S005           BETA
S006          gamma
S007          gamma
S008           beta
S009          gamma
S010           beta
S011          alpha
S012          gamma
S013           beta
S014           beta


In [98]:
attendance_indexed['cohort'] = attendance_indexed['cohort'].str.strip().str.lower()

In [99]:
attendance_indexed['cohort'].unique()

array(['alpha', 'beta', 'gamma'], dtype=object)
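
Beyond inspecting `unique()`, the cleaned labels can be compared against the allowed set so that an unexpected value fails loudly. This is a sketch on hypothetical messy values similar to those introduced above; the `allowed` list and variable names are illustrative.

```python
import pandas as pd

# Hypothetical messy values similar to the ones introduced above.
cohort = pd.Series([" Alpha ", "BETA", " gamma", "GAmma ", "beta"])

# Strip surrounding whitespace, then lowercase, using vectorized string methods.
cleaned = cohort.str.strip().str.lower()

allowed = ["alpha", "beta", "gamma"]
unexpected = sorted(set(cleaned) - set(allowed))
assert not unexpected, f"unexpected cohort labels: {unexpected}"

# Optional: a categorical with fixed categories turns any label outside
# `allowed` into a missing value, which makes later typos easy to spot.
as_category = pd.Categorical(cleaned, categories=allowed)
print(as_category.isna().sum())
```

The set difference catches labels that cleaning did not fix (for example a misspelling), which `strip` and `lower` alone would leave in place.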

##### Task 4: Filter and compute summaries
Filter the DataFrame to students where attended_sessions is below expected_sessions. Store the result in low_attendance. Compute the average attended_sessions by cohort using groupby. Print the summary and verify that cohorts in the summary match the cleaned cohorts.

In [100]:
low_attendance = attendance_indexed[attendance_indexed["attended_sessions"] < attendance_indexed["expected_sessions"]]
print(low_attendance[['attended_sessions', 'expected_sessions', 'cohort']])

            attended_sessions  expected_sessions cohort
student_id                                             
S000                        5                  6  alpha
S001                        3                  6   beta
S003                        4                  6  alpha
S004                        0                  6  alpha
S006                        5                  6  gamma
S007                        5                  6  gamma
S008                        0                  6   beta
S009                        3                  6  gamma
S010                        2                  6   beta
S012                        4                  6  gamma
S013                        5                  6   beta
S014                        3                  6   beta
S015                        1                  6  gamma
S016                        3                  6  alpha
S017                        0                  6  gamma
S018                        5                  6

In [101]:
attendance_summary = attendance_indexed.groupby('cohort')['attended_sessions'].mean()
attendance_summary

cohort
alpha    4.333333
beta     3.571429
gamma    3.125000
Name: attended_sessions, dtype: float64

In [102]:
print("Unique cohorts in summary:", attendance_summary.index.tolist())

Unique cohorts in summary: ['alpha', 'beta', 'gamma']
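
Printing the summary index shows the cohorts, but the match with the cleaned column can also be asserted directly. A minimal sketch, assuming a small hypothetical frame indexed by `student_id`:

```python
import pandas as pd

# Hypothetical four-student frame; values are for illustration only.
sample = pd.DataFrame(
    {
        "cohort": ["alpha", "alpha", "beta", "gamma"],
        "attended_sessions": [6, 4, 3, 5],
    },
    index=pd.Index(["S001", "S002", "S003", "S004"], name="student_id"),
)

summary = sample.groupby("cohort")["attended_sessions"].mean()

# The group keys should be exactly the cleaned cohort labels, and nothing else.
assert set(summary.index) == set(sample["cohort"].unique())
assert set(summary.index) <= {"alpha", "beta", "gamma"}
print(summary)
```

If cleaning had been skipped, variants such as `' Alpha '` would appear as separate groups and the second assertion would fail.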


##### Task 5: Add a derived field and validate it
Create a new column attendance_ok that is True when attended_sessions is at least expected_sessions, otherwise False. Use a boolean comparison rather than a loop. Then validate the column by confirming that every row in low_attendance has attendance_ok equal to False.

In [103]:
attendance_indexed['attendance_ok'] = attendance_indexed['attended_sessions'] >= attendance_indexed['expected_sessions']
attendance_indexed

           cohort  attended_sessions  expected_sessions  adjusted_attendance  attendance_ok
student_id
S000        alpha                  5                  6                  5.0          False
S001         beta                  3                  6                  4.0          False
S002        alpha                  6                  6                  8.0           True
S003        alpha                  4                  6                  4.0          False
S004        alpha                  0                  6                  0.0          False
S005         beta                  6                  6                  7.0           True
S006        gamma                  5                  6                  5.0          False
S007        gamma                  5                  6                  5.0          False
S008         beta                  0                  6                  0.0          False
S009        gamma                  3                  6                  3.0          False


In [104]:
print(attendance_indexed[['attended_sessions', 'expected_sessions', 'attendance_ok']].head(10))

            attended_sessions  expected_sessions  attendance_ok
student_id                                                     
S000                        5                  6          False
S001                        3                  6          False
S002                        6                  6           True
S003                        4                  6          False
S004                        0                  6          False
S005                        6                  6           True
S006                        5                  6          False
S007                        5                  6          False
S008                        0                  6          False
S009                        3                  6          False


In [105]:
# Rebuild low_attendance so that it includes the new attendance_ok column
low_attendance = attendance_indexed[attendance_indexed['attended_sessions'] < attendance_indexed['expected_sessions']]
low_attendance

           cohort  attended_sessions  expected_sessions  adjusted_attendance  attendance_ok
student_id
S000        alpha                  5                  6                  5.0          False
S001         beta                  3                  6                  4.0          False
S003        alpha                  4                  6                  4.0          False
S004        alpha                  0                  6                  0.0          False
S006        gamma                  5                  6                  5.0          False
S007        gamma                  5                  6                  5.0          False
S008         beta                  0                  6                  0.0          False
S009        gamma                  3                  6                  3.0          False
S010         beta                  2                  6                  2.0          False
S012        gamma                  4                  6                  5.0          False


In [106]:
# Display to confirm
print(low_attendance[['attended_sessions', 'expected_sessions', 'attendance_ok']])

            attended_sessions  expected_sessions  attendance_ok
student_id                                                     
S000                        5                  6          False
S001                        3                  6          False
S003                        4                  6          False
S004                        0                  6          False
S006                        5                  6          False
S007                        5                  6          False
S008                        0                  6          False
S009                        3                  6          False
S010                        2                  6          False
S012                        4                  6          False
S013                        5                  6          False
S014                        3                  6          False
S015                        1                  6          False
S016                        3           

In [107]:
# Check that every low-attendance student has attendance_ok set to False
all_low_ok_false = (~low_attendance['attendance_ok']).all()
print("All low_attendance rows have attendance_ok = False?", all_low_ok_false)

All low_attendance rows have attendance_ok = False? True
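
The same validation can be written as assertions, which stop the notebook if the derived field ever disagrees with the filter. The sketch below uses three hypothetical students and adds a second, slightly stronger check that the flag is the exact complement of the low-attendance mask.

```python
import pandas as pd

# Hypothetical three-student frame; values are for illustration only.
sample = pd.DataFrame(
    {"attended_sessions": [6, 4, 0], "expected_sessions": [6, 6, 6]},
    index=pd.Index(["S001", "S002", "S003"], name="student_id"),
)

# Vectorized boolean comparison, no loop.
sample["attendance_ok"] = sample["attended_sessions"] >= sample["expected_sessions"]
low_mask = sample["attended_sessions"] < sample["expected_sessions"]
low = sample[low_mask]

# Check 1: no low-attendance row is flagged as OK.
assert not low["attendance_ok"].any()

# Check 2: the flag is the exact complement of the low-attendance mask,
# so every student falls into exactly one of the two groups.
assert (sample["attendance_ok"] == ~low_mask).all()
```

The second check also covers the rows outside `low_attendance`, which the first check does not examine.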
