# **Task 1: Build and load the dataset**
**Create a list of dictionaries called attendance_raw with exactly 24 records. Each record must include:**

- **student_id in the format S001 to S024**
- **cohort as one of ["alpha", "beta", "gamma"]**
- **attended_sessions as an integer between 0 and 6**
- **expected_sessions as the integer 6**

**Then load the list into a DataFrame named attendance. Print the first five rows and call info() to confirm the structure and data types.**

In [1]:
import random
random.seed(42)

In [2]:
import pandas as pd
import numpy as np

In [3]:
cohorts = ['alpha', 'beta', 'gamma']

attendance_raw = []

for i in range(1, 25):
    student_id = 's' + str(i).zfill(3)  # s001 ... s024
    cohort = random.choice(cohorts)
    attended_sessions = random.randint(0, 6)  # randint includes both endpoints
    expected_sessions = 6

    # Build one record as a dict and append it to the list
    attendance_raw.append({
        "student_id": student_id,
        "cohort": cohort,
        "attended_sessions": attended_sessions,
        "expected_sessions": expected_sessions,
    })

In [4]:
attendance = pd.DataFrame(attendance_raw)
attendance 

,student_id,cohort,attended_sessions,expected_sessions
0,s001,gamma,0,6
1,s002,alpha,5,6
2,s003,beta,1,6
3,s004,alpha,1,6
4,s005,gamma,0,6
5,s006,gamma,5,6
6,s007,gamma,0,6
7,s008,gamma,3,6
8,s009,alpha,0,6
9,s010,alpha,1,6


To show the first five rows of the dataset, slice the DataFrame:

In [5]:
attendance[:5]

,student_id,cohort,attended_sessions,expected_sessions
0,s001,gamma,0,6
1,s002,alpha,5,6
2,s003,beta,1,6
3,s004,alpha,1,6
4,s005,gamma,0,6


Alternatively, call head(), which returns the first five rows by default:

In [6]:
attendance.head()

,student_id,cohort,attended_sessions,expected_sessions
0,s001,gamma,0,6
1,s002,alpha,5,6
2,s003,beta,1,6
3,s004,alpha,1,6
4,s005,gamma,0,6


In [7]:
attendance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   student_id         24 non-null     object
 1   cohort             24 non-null     object
 2   attended_sessions  24 non-null     int64 
 3   expected_sessions  24 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 900.0+ bytes
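
Beyond reading the info() output by eye, the requirements listed in the task can also be checked programmatically. The following is a minimal sketch on a hypothetical three-record sample in the same shape as attendance_raw; the helper name `check_structure` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample shaped like attendance_raw (illustrative values only).
sample_raw = [
    {"student_id": "s001", "cohort": "gamma", "attended_sessions": 0, "expected_sessions": 6},
    {"student_id": "s002", "cohort": "alpha", "attended_sessions": 5, "expected_sessions": 6},
    {"student_id": "s003", "cohort": "beta", "attended_sessions": 1, "expected_sessions": 6},
]
sample = pd.DataFrame(sample_raw)


def check_structure(df, n_expected):
    """Return named boolean checks for the record requirements (illustrative helper)."""
    return {
        "row_count": len(df) == n_expected,
        # Accept either case for the leading letter, followed by exactly three digits.
        "id_format": bool(df["student_id"].str.fullmatch(r"[sS]\d{3}").all()),
        "cohort_values": bool(df["cohort"].isin(["alpha", "beta", "gamma"]).all()),
        # between() is inclusive on both ends by default.
        "attended_range": bool(df["attended_sessions"].between(0, 6).all()),
        "expected_is_6": bool((df["expected_sessions"] == 6).all()),
    }


checks = check_structure(sample, n_expected=3)
print(checks)
```

For the full dataset the same helper would be called with `n_expected=24`.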


# **Task 2: Set an index and validate alignment**
**Set student_id as the index and store the result in attendance_indexed. Create a Series named excused_absences with at least 10 student IDs (some IDs must not exist in the DataFrame). Add this Series to attended_sessions to create a new column adjusted_attendance. Confirm that rows without matching IDs become missing in adjusted_attendance. Then fill missing values in adjusted_attendance with the original attended_sessions and show the updated column.**

In [8]:
attendance_indexed = attendance.set_index('student_id')
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions
s001,gamma,0,6
s002,alpha,5,6
s003,beta,1,6
s004,alpha,1,6
s005,gamma,0,6
s006,gamma,5,6
s007,gamma,0,6
s008,gamma,3,6
s009,alpha,0,6
s010,alpha,1,6


In [9]:
# adjusted_attendance: the number of sessions a student attended plus that student's excused absences

In [10]:
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions
s001,gamma,0,6
s002,alpha,5,6
s003,beta,1,6
s004,alpha,1,6
s005,gamma,0,6
s006,gamma,5,6
s007,gamma,0,6
s008,gamma,3,6
s009,alpha,0,6
s010,alpha,1,6


In [11]:
excused_absences = pd.Series([1,2,1,3,1,2,1,1,2,1],
                             index = ['s001','s005','s010','s015','s020','s025','s030','s002','s018','s024'])

In [12]:
# excused_absences: the number of sessions each listed student missed with permission.
# For example, s001 has excused_absences = 1, meaning one session was missed with permission,
# and that session is added to the adjusted_attendance count.

In [13]:
# adjusted_attendance = attended_sessions + excused_absences

# The new column shows:

# "the number of sessions the student actually attended + the number of excused sessions"

# Example:

# s001: attended_sessions = 0
#       excused_absences = 1
# adjusted_attendance = 0 + 1 = 1

In [14]:
attendance_indexed['adjusted_attendance'] = attendance_indexed['attended_sessions'] + excused_absences

In [15]:
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions,adjusted_attendance
s001,gamma,0,6,1.0
s002,alpha,5,6,6.0
s003,beta,1,6,
s004,alpha,1,6,
s005,gamma,0,6,2.0
s006,gamma,5,6,
s007,gamma,0,6,
s008,gamma,3,6,
s009,alpha,0,6,
s010,alpha,1,6,2.0


In [16]:
attendance_indexed['adjusted_attendance'] = attendance_indexed['adjusted_attendance'].fillna(attendance_indexed['attended_sessions'])

In [17]:
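# Not every student in the DataFrame has an entry in excused_absences, so the unmatched rows become NaN.
# IDs that exist only in excused_absences (s025, s030) are dropped when the result is assigned to the column.

# Filling the NaN values restores the original attended_sessions, because those students have no excused sessions.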

In [18]:
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions,adjusted_attendance
s001,gamma,0,6,1.0
s002,alpha,5,6,6.0
s003,beta,1,6,1.0
s004,alpha,1,6,1.0
s005,gamma,0,6,2.0
s006,gamma,5,6,5.0
s007,gamma,0,6,0.0
s008,gamma,3,6,3.0
s009,alpha,0,6,0.0
s010,alpha,1,6,2.0
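
The add-then-fill pattern above relies on pandas aligning the two Series by index label. A minimal sketch of that behaviour follows, using hypothetical three-student data (the ID `s999` is made up to stand for an ID missing from the DataFrame). It also shows one possible alternative, reindexing the excused counts with `fill_value=0` before adding, which avoids the NaN values and the resulting float dtype; this is a sketch under those assumptions, not the notebook's method.

```python
import pandas as pd

# Hypothetical data: s999 appears only in the excused Series.
attended = pd.Series([0, 5, 1], index=["s001", "s002", "s003"], name="attended_sessions")
excused = pd.Series([1, 2], index=["s001", "s999"])

# Plain addition aligns on the union of both indexes;
# labels present on only one side produce NaN.
summed = attended + excused
print(summed)

# Approach used in the notebook: keep only the attendance rows, then fall back
# to the original counts where no excused entry matched.
filled = summed.reindex(attended.index).fillna(attended)

# Alternative: align the excused counts to the attendance index first,
# treating missing students as having 0 excused absences.
adjusted = attended + excused.reindex(attended.index, fill_value=0)
print(adjusted)
```

Both approaches give the same adjusted counts; the second keeps the integer dtype because no missing values are ever created.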


# **Task 3: Clean and normalize categories**
**Introduce a small inconsistency by modifying a few cohort values to include extra whitespace and inconsistent casing. Then write pandas code to normalize the cohort column by stripping whitespace and converting to lowercase. After cleaning, display the unique cohorts to confirm that the inconsistencies are resolved.**

In [19]:
attendance_indexed.loc['s001','cohort'] = ' Gamma '
attendance_indexed.loc['s005','cohort'] = '      ALPHA' 
attendance_indexed.loc['s010','cohort'] = 'Beta '

In [20]:
attendance_indexed['cohort']

student_id
s001         Gamma 
s002          alpha
s003           beta
s004          alpha
s005          ALPHA
s006          gamma
s007          gamma
s008          gamma
s009          alpha
s010          Beta 
s011          alpha
s012          gamma
s013          gamma
s014          gamma
s015          gamma
s016           beta
s017           beta
s018           beta
s019          alpha
s020          alpha
s021           beta
s022           beta
s023          alpha
s024           beta
Name: cohort, dtype: object

In [21]:
attendance_indexed['cohort'] = attendance_indexed['cohort'].str.strip().str.lower()
# .str.strip() removes leading and trailing whitespace
# .str.lower() converts all letters to lowercase

In [22]:
attendance_indexed['cohort']

student_id
s001    gamma
s002    alpha
s003     beta
s004    alpha
s005    alpha
s006    gamma
s007    gamma
s008    gamma
s009    alpha
s010     beta
s011    alpha
s012    gamma
s013    gamma
s014    gamma
s015    gamma
s016     beta
s017     beta
s018     beta
s019    alpha
s020    alpha
s021     beta
s022     beta
s023    alpha
s024     beta
Name: cohort, dtype: object

In [23]:
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions,adjusted_attendance
s001,gamma,0,6,1.0
s002,alpha,5,6,6.0
s003,beta,1,6,1.0
s004,alpha,1,6,1.0
s005,alpha,0,6,2.0
s006,gamma,5,6,5.0
s007,gamma,0,6,0.0
s008,gamma,3,6,3.0
s009,alpha,0,6,0.0
s010,beta,1,6,2.0
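
To confirm that the cleaning step resolved the inconsistencies, the unique values can be compared against the allowed cohort names. The sketch below uses a small hypothetical Series containing the same kinds of inconsistency introduced above; the variable names `allowed` and `unexpected` are illustrative.

```python
import pandas as pd

# Hypothetical cohort values with extra whitespace and mixed casing.
cohort = pd.Series([" Gamma ", "      ALPHA", "Beta ", "alpha"], name="cohort")

# Strip surrounding whitespace, then lowercase.
cleaned = cohort.str.strip().str.lower()

# Any value outside the allowed set would indicate a remaining inconsistency.
allowed = {"alpha", "beta", "gamma"}
unexpected = set(cleaned.unique()) - allowed

print(sorted(cleaned.unique()))
print(unexpected)
```

An empty `unexpected` set means every value maps to one of the three cohorts.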


# **Task 4: Filter and compute summaries**
**Filter the DataFrame to students where attended_sessions is below expected_sessions. Store the result in low_attendance. Compute the average attended_sessions by cohort using groupby. Print the summary and verify that cohorts in the summary match the cleaned cohorts.**

In [24]:
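# Keep only students who attended fewer sessions than expected
low_attendance = attendance_indexed[attendance_indexed['attended_sessions'] < attendance_indexed['expected_sessions']]

In [25]:
# Average over all students, so the summary covers every cleaned cohort
cohort_summary = attendance_indexed.groupby('cohort')['attended_sessions'].mean()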
cohort_summary

cohort
alpha    3.375
beta     2.000
gamma    2.250
Name: attended_sessions, dtype: float64

In [26]:
attendance_indexed['cohort'].unique()

array(['gamma', 'alpha', 'beta'], dtype=object)
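
The filter, the grouped mean, and the cohort comparison can be sketched end to end on a small hypothetical frame. The data and the variable names `low`, `summary`, and `cohorts_match` are assumptions made for illustration; the filter uses a strict `<` so that students who attended every session are excluded.

```python
import pandas as pd

# Hypothetical four-student frame with student_id as the index.
df = pd.DataFrame(
    {
        "cohort": ["alpha", "alpha", "beta", "gamma"],
        "attended_sessions": [6, 2, 4, 1],
        "expected_sessions": [6, 6, 6, 6],
    },
    index=pd.Index(["s001", "s002", "s003", "s004"], name="student_id"),
)

# Strictly below the expected count: s001 (6 of 6) is excluded.
low = df[df["attended_sessions"] < df["expected_sessions"]]

# Mean attended sessions per cohort over all students.
summary = df.groupby("cohort")["attended_sessions"].mean()

# The groupby index should contain exactly the cleaned cohort labels.
cohorts_match = set(summary.index) == set(df["cohort"].unique())

print(low.index.tolist())
print(summary)
print(cohorts_match)
```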

# **Task 5: Add a derived field and validate it**
**Create a new column attendance_ok that is True when attended_sessions is at least expected_sessions, otherwise False. Use a boolean comparison rather than a loop. Then validate the column by confirming that every row in low_attendance has attendance_ok equal to False.**

In [27]:
attendance_indexed['attendance_ok'] = attendance_indexed['attended_sessions'] >= attendance_indexed['expected_sessions']

In [28]:
attendance_indexed

student_id,cohort,attended_sessions,expected_sessions,adjusted_attendance,attendance_ok
s001,gamma,0,6,1.0,False
s002,alpha,5,6,6.0,False
s003,beta,1,6,1.0,False
s004,alpha,1,6,1.0,False
s005,alpha,0,6,2.0,False
s006,gamma,5,6,5.0,False
s007,gamma,0,6,0.0,False
s008,gamma,3,6,3.0,False
s009,alpha,0,6,0.0,False
s010,beta,1,6,2.0,False


In [29]:
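# Check only the low-attendance rows; over all rows the check fails because some students attended every session
low_mask = attendance_indexed['attended_sessions'] < attendance_indexed['expected_sessions']
(~attendance_indexed.loc[low_mask, 'attendance_ok']).all()

np.True_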
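
The task asks for the validation to cover the rows in low_attendance rather than the whole DataFrame. A minimal sketch of a check scoped that way follows, on hypothetical three-student data; the names `low`, `all_low_flagged`, and `whole_frame_all_false` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: s001 attended every session, the others did not.
df = pd.DataFrame(
    {"attended_sessions": [6, 2, 4], "expected_sessions": [6, 6, 6]},
    index=pd.Index(["s001", "s002", "s003"], name="student_id"),
)

# Vectorised boolean comparison, no loop needed.
df["attendance_ok"] = df["attended_sessions"] >= df["expected_sessions"]

# Rows strictly below the expected count.
low = df[df["attended_sessions"] < df["expected_sessions"]]

# Scoped check: no low-attendance row may be flagged as OK.
all_low_flagged = not low["attendance_ok"].any()

# Unscoped check over every row: fails as soon as one student attended all sessions.
whole_frame_all_false = bool((~df["attendance_ok"]).all())

print(all_low_flagged, whole_frame_all_false)
```

The contrast between the two results shows why the check needs to be restricted to the filtered rows.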