In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.graph_objects import Bar, Figure
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.figure_factory as ff
import plotly.subplots as sp

In [2]:
df = pd.read_csv("../Data/clening_data.csv")

In [3]:
fig = px.histogram(df, x="Y", title="Program Completion Status", labels={"Y": "Completion Status"})
fig.update_layout(bargap=0.2, xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['Completed', 'Not Completed']))
fig.show()

# Program Completion Status Chart

This bar chart represents the count of individuals categorized by their **program completion status**:

- **Completed**: The majority of individuals have completed the program, with a count exceeding 5000.
- **Not Completed**: A smaller segment, with a count under 1000, has not completed the program.

The visualization highlights the stark difference between the two groups, emphasizing the high completion rate of the program.
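
The counts behind the chart can be tabulated directly rather than read off the bars. A minimal sketch, assuming `Y` is coded 0 for completed and 1 for not completed (as the tick labels in the cell above imply); `toy` is made-up stand-in data, not the real file:

```python
import pandas as pd

# Made-up stand-in for df: 0 = completed, 1 = not completed
toy = pd.DataFrame({"Y": [0, 0, 0, 0, 1]})

# Count each status, ordered by code so row 0 is always "completed"
counts = toy["Y"].value_counts().sort_index()
percent = counts / counts.sum() * 100

summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```

Run against the real `df`, the same three lines would give the exact counts and the completion rate in one table.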


In [5]:
fig = px.histogram(df, x="Gender", color="Y", barmode="group", title="Completion Status by Gender")
fig.show()


# Completion Status by Gender

This bar chart visualizes the **program completion status** segmented by **gender**:

- The X-axis represents gender, with two categories: "ذكر" (Male) and "أنثى" (Female).
- The Y-axis represents the count of individuals.

Each gender group is further divided into:
- **Completed (Blue - Y=0)**: Indicates the count of participants who completed the program.
- **Not Completed (Red - Y=1)**: Represents those who did not complete the program.

Key observations:
1. Both genders have a significantly higher count in the "Completed" category compared to "Not Completed."
2. Among females ("أنثى"), the proportion of non-completions is slightly higher compared to males ("ذكر").

This chart helps analyze completion trends across genders.
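
Grouped bars show counts, so comparing proportions between genders by eye is unreliable when the groups differ in size. A row-normalized cross-tabulation gives the rates directly. This is a sketch on made-up rows, assuming the same `Gender` and `Y` columns as the real data:

```python
import pandas as pd

# Made-up rows; the real df has the same two columns
toy = pd.DataFrame({
    "Gender": ["ذكر", "ذكر", "ذكر", "ذكر", "أنثى", "أنثى", "أنثى", "أنثى"],
    "Y": [0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize="index" makes every gender row sum to 1
rates = pd.crosstab(toy["Gender"], toy["Y"], normalize="index")

# Column 1 is the share of each gender that did not complete
non_completion = rates[1]
print(non_completion)
```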


In [7]:
fig = px.histogram(df, x="Home Region", color="Y", barmode="group", title="Completion Status by Home Region")
fig.show()

# Completion Status by Home Region

This bar chart illustrates the **program completion status** across various home regions:

- The X-axis lists the regions, such as "منطقة الرياض" (Riyadh Region), "منطقة مكة المكرمة" (Mecca Region), and others.
- The Y-axis represents the count of individuals.

Each region is split into two categories:
- **Completed (Blue - Y=0)**: Individuals who completed the program.
- **Not Completed (Red - Y=1)**: Individuals who did not complete the program.

Key observations:
1. **Riyadh Region ("منطقة الرياض")** dominates with the highest count of completions and non-completions, significantly outpacing other regions.
2. Other regions have considerably lower counts for both categories, indicating less representation or participation.
3. The proportion of non-completions varies slightly among regions but remains much smaller than completions.

This chart provides insights into regional participation and completion trends.


In [8]:
fig = px.histogram(df, x="Level of Education", color="Y", barmode="group", title="Completion Status by Education Level")
fig.show()


# Completion Status by Education Level

This bar chart visualizes the **program completion status** segmented by **education level**:

- The X-axis represents education levels, including categories such as "البكالوريوس" (Bachelor's), "الماجستير" (Master's), "ثانوي" (High School), "الدكتوراه" (Doctorate), and "الدبلوم" (Diploma).
- The Y-axis shows the count of individuals.

Each education level is divided into:
- **Completed (Blue - Y=0)**: Individuals who completed the program.
- **Not Completed (Red - Y=1)**: Individuals who did not complete the program.

Key observations:
1. The **Bachelor's degree ("البكالوريوس")** category has the highest number of completions, significantly surpassing other education levels.
2. Non-completions are also most frequent in the Bachelor's category, but they remain far fewer than completions.
3. Other education levels, such as Master's and Diploma, show much lower participation counts for both completions and non-completions.

This chart shows that Bachelor's degree holders dominate program participation, and therefore both the completion and non-completion counts.


In [9]:
numerical_features = ["PCRF", "GRST", "CAUF", "INFA", "ABIR", "SERU", "TOSL", "APMR", "DTFH", "QWLM"]
df_mean = df.groupby("Y")[numerical_features].mean().T
df_mean.columns = ["Completed", "Not Completed"]

fig = go.Figure()
for col in df_mean.columns:
    fig.add_trace(go.Bar(x=df_mean.index, y=df_mean[col], name=col))
    
fig.update_layout(title="Average Feature Scores by Completion Status", 
                  xaxis_title="Feature", yaxis_title="Average Score", barmode="group")
fig.show()


# Average Feature Scores by Completion Status

This bar chart compares the **average feature scores** between two groups: **Completed** and **Not Completed**.

- The X-axis lists various features (e.g., PCRF, GRST, CAUF, INFA, etc.).
- The Y-axis represents the average score for each feature.

Each feature has two bars:
- **Blue (Completed)**: Represents the average score for individuals who completed the program.
- **Red (Not Completed)**: Represents the average score for individuals who did not complete the program.

Key observations:
1. **CAUF** has the highest average score among both groups, with a slightly higher score for the Completed group.
2. For most features, the Completed group has higher average scores compared to the Not Completed group, particularly in **PCRF, CAUF, and APMR**.
3. Features like **GRST, ABIR, and SERU** show minimal differences between the two groups, indicating less impact on completion status.

This visualization highlights which features correlate more strongly with program completion.
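
To rank features by how far apart the two groups sit, the gap between group means can be computed and scaled. The sketch below is one possible approach, not the notebook's method: it divides each gap by the feature's overall standard deviation so that features on different scales are comparable. The values are made up; only the column names come from the data.

```python
import pandas as pd

# Made-up scores for two of the features
toy = pd.DataFrame({
    "Y": [0, 0, 0, 1, 1, 1],
    "PCRF": [10.0, 12.0, 14.0, 6.0, 8.0, 10.0],
    "GRST": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],
})
features = ["PCRF", "GRST"]

# Mean per completion status, then completed minus not completed
means = toy.groupby("Y")[features].mean()
gap = means.loc[0] - means.loc[1]

# Scale by the overall standard deviation of each feature
standardized_gap = gap / toy[features].std()
print(standardized_gap.sort_values(ascending=False))
```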


In [10]:
# Pie Chart for 'Y'
fig = px.pie(df, names='Y', title="Program Completion Status", hole=0.4)
fig.show()

# Program Completion Status - Pie Chart

This pie chart represents the distribution of **program completion status**:

- **Blue (0 - Completed)**: Represents 84% of the participants who successfully completed the program.
- **Red (1 - Not Completed)**: Represents 16% of the participants who did not complete the program.

Key insight:
The chart highlights a high completion rate, with the vast majority (84%) of participants successfully completing the program, while only a small fraction (16%) did not complete it.


In [11]:
# Boxplot for GPA and Y
fig = px.box(df, x='Y', y='GPA', points='all', title="GPA Distribution by Completion Status")
fig.show()

# GPA Distribution by Completion Status

This box plot, with every individual point overlaid, visualizes the **GPA distribution** segmented by **completion status**:

- The X-axis represents completion status:
  - **0**: Completed
  - **1**: Not Completed
- The Y-axis represents GPA scores.

Key observations:
1. Most GPA values for both groups cluster near the lower end of the scale (close to 0), with a few outliers.
2. A notable outlier exists in the "Completed" group (0) with a GPA near 100, which significantly deviates from the cluster.
3. There is little visible difference in the GPA distribution pattern between the "Completed" and "Not Completed" groups.

This visualization provides insights into the GPA trends among participants, showing that extreme GPAs are rare.
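
Outliers such as the GPA near 100 can be listed programmatically. The sketch below applies the common 1.5 × IQR rule to made-up GPA values; Plotly may compute quartiles with a slightly different method, so the flagged points need not match the box plot exactly.

```python
import pandas as pd

# Made-up GPA values with one extreme point
toy_gpa = pd.Series([3.0, 3.5, 4.0, 4.5, 5.0, 98.0], name="GPA")

# Quartiles and interquartile range
q1, q3 = toy_gpa.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy_gpa[(toy_gpa < lower) | (toy_gpa > upper)]
print(outliers)
```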


In [12]:
# Histogram for Program Days by Y
fig = px.histogram(df, x='Program Days', color='Y', title="Program Days Distribution by Completion Status")
fig.show()

# Program Days Distribution by Completion Status

This histogram represents the **distribution of program days** segmented by **completion status**:

- The X-axis shows the number of program days.
- The Y-axis indicates the count of participants.
- The data is grouped by:
  - **Blue (0 - Completed)**: Participants who completed the program.
  - **Red (1 - Not Completed)**: Participants who did not complete the program.

Key observations:
1. Most records fall at very short durations (roughly 0-10 days), with high counts for both completion statuses.
2. A secondary peak is observed around 50 days for the "Completed" group.
3. For longer durations (above 100 days), participation drops significantly for both groups, though a few participants still show activity.

This chart indicates that most participants tend to complete or exit the program early, with fewer engaging over extended periods.


In [13]:
# Pie Chart for Employment Status
fig = px.pie(df, names='Employment Status', title="Employment Status Distribution", hole=0.3)
fig.show()

# Employment Status Distribution

This pie chart illustrates the distribution of participants based on their **employment status**:

- **42.8% (Blue)**: "موظف" (Employed)
- **17.1% (Red)**: "غير موظف" (Unemployed)
- **15.4% (Green)**: "طالب" (Student)
- **14.6% (Purple)**: "خريج" (Graduate)
- **7.98% (Orange)**: "موظف - طالب" (Employed and Student)
- **2.11% (Cyan)**: "عمل حر" (Freelance)

Key insights:
1. The largest group consists of **Employed** participants, representing 42.8% of the total.
2. **Unemployed** and **Students** make up significant portions, accounting for 17.1% and 15.4%, respectively.
3. The smallest category is **Freelance**, comprising only 2.11%.

This chart provides a clear overview of the employment demographics among the participants.


In [12]:
# Parallel Categories Plot for Gender, Employment Status, and Y
fig = px.parallel_categories(
    df,
    dimensions=['Gender', 'Employment Status', 'Y'],
    title="Parallel Categories Plot for Completion Status"
)
fig.show()

# Parallel Categories Plot for Completion Status

This parallel categories plot illustrates the relationships between **Gender**, **Employment Status**, and **Completion Status**:

- **Left axis (Gender)**: Categorizes participants as "ذكر" (Male) or "أنثى" (Female).
- **Middle axis (Employment Status)**: Includes employment categories such as "موظف" (Employed), "غير موظف" (Unemployed), "طالب" (Student), "خريج" (Graduate), "عمل حر" (Freelance), and "موظف - طالب" (Employed and Student).
- **Right axis (Completion Status)**: 
  - **0**: Completed
  - **1**: Not Completed

Key observations:
1. A significant number of both males and females are in the "موظف" (Employed) category, and most of them have completed the program.
2. "غير موظف" (Unemployed) and "طالب" (Student) categories also contribute to both completion and non-completion statuses, with the majority leaning toward completion.
3. The smaller groups, such as "عمل حر" (Freelance) and "موظف - طالب" (Employed and Student), show a consistent trend towards completion.

This visualization highlights how gender and employment status influence program completion trends.


In [13]:
# Line Plot for Average Program Days by Y
program_days_mean = df.groupby('Y')['Program Days'].mean()
fig = px.line(x=program_days_mean.index, y=program_days_mean, labels={'x': 'Completion Status', 'y': 'Average Program Days'}, title="Average Program Days by Completion Status")
fig.show()

# Average Program Days by Completion Status

This line chart shows the relationship between **completion status** and the **average program days**:

- The X-axis represents **completion status**:
  - **0**: Completed
  - **1**: Not Completed
- The Y-axis represents the **average program days**.

Key observations:
1. With only two categories on the X-axis, the line simply connects two group means, so its slope shows the size of the gap rather than a trend. The average is slightly higher for "Not Completed" (1) than for "Completed" (0).
2. Participants who completed the program had an average of about 19.4 days, while those who did not complete had an average closer to 20.8 days.

This suggests that individuals who did not complete the program tended to remain in the program slightly longer on average.
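
Since the duration distribution is heavily skewed, a mean on its own can hide a lot. A short sketch, on made-up numbers, of reporting the group size and median next to the mean:

```python
import pandas as pd

# Made-up durations: one long program pulls up the mean of group 0
toy = pd.DataFrame({
    "Y": [0, 0, 0, 0, 1, 1],
    "Program Days": [5, 5, 10, 60, 10, 30],
})

# Count, mean and median of Program Days per completion status
summary = toy.groupby("Y")["Program Days"].agg(["count", "mean", "median"])
print(summary)
```

In this toy case both groups have the same mean (20 days) but very different medians, which is the kind of difference a two-point line chart cannot show.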


In [14]:
# Pie Chart for Level of Education
fig = px.pie(df, names='Level of Education', title="Level of Education Distribution")
fig.show()

# Level of Education Distribution

This pie chart illustrates the distribution of participants based on their **level of education**:

- **83.3% (Blue)**: "البكالوريوس" (Bachelor's degree)
- **7.4% (Red)**: "الماجستير" (Master's degree)
- **4.75% (Green)**: "الدبلوم" (Diploma)
- **4.11% (Purple)**: "ثانوي" (High School)
- **0.462% (Orange)**: "الدكتوراه" (Doctorate)

Key insights:
1. The vast majority of participants (83.3%) hold a **Bachelor's degree**, dominating the distribution.
2. Master's degree holders are the second-largest group (7.4%), while Diploma, High School, and Doctorate holders together make up under 10%.

This chart highlights the dominance of Bachelor's degree holders among the participants.


In [4]:
# Create a new 'Program Type' column: 21 days or fewer is 'برنامج' (program), longer is 'معسكر' (bootcamp)
def categorize_program(days):
    if days <= 21:  # duration 
        return "برنامج"
    else:
        return "معسكر"

df['Program Type'] = df['Program Days'].apply(categorize_program)
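
The same rule can be written without a Python-level loop. A sketch using `numpy.where`, assuming the 21-day threshold and the two labels from the cell above; `toy` is made-up data:

```python
import numpy as np
import pandas as pd

# Made-up durations around the 21-day threshold
toy = pd.DataFrame({"Program Days": [5, 21, 22, 90]})

# Vectorized equivalent of categorize_program: <= 21 days is a program
toy["Program Type"] = np.where(toy["Program Days"] <= 21, "برنامج", "معسكر")
print(toy)
```

As with the `apply` version, a missing duration fails the `<= 21` comparison and is therefore labeled "معسكر".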

In [5]:
program_types = df['Program Type'].unique()
fig = sp.make_subplots(
    rows=1,
    cols=len(program_types),
    subplot_titles=[f"{program_type}" for program_type in program_types],
    specs=[[{'type': 'domain'} for _ in program_types]]
)

for i, program_type in enumerate(program_types, 1):
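    # Filter data for the current program type. Reindex by status code so the
    # values always line up with the fixed ['Completed', 'Withdrawn'] labels,
    # even when withdrawals outnumber completions.
    data = (
        df[df['Program Type'] == program_type]['Y']
        .value_counts(normalize=True)
        .reindex([0, 1], fill_value=0) * 100
    )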
    fig.add_trace(
        go.Pie(
            labels=['Completed', 'Withdrawn'],
            values=data.values,
            hole=0.4,
            title=f"{program_type}",
        ),
        row=1,
        col=i
    )

# Update layout
fig.update_layout(
    title="Completion vs Withdrawal Rates for Different Program Types",
    template="plotly_white"
)

fig.show()

# Completion vs Withdrawal Rates for Different Program Types

This visualization includes two pie charts comparing **completion rates** and **withdrawal rates** for two types of programs:

1. **Left Pie Chart - "برنامج" (Program):**
   - **82.3% (Blue)**: Completed
   - **17.7% (Red)**: Withdrawn
   - Indicates a higher withdrawal rate (17.7%) compared to the second program type.

2. **Right Pie Chart - "معسكر" (Bootcamp):**
   - **90.2% (Blue)**: Completed
   - **9.84% (Red)**: Withdrawn
   - Demonstrates a higher completion rate and lower withdrawal rate compared to the general program.

### Key Insights:
- **"Bootcamp" (معسكر)** programs show better retention and success, with a significantly lower withdrawal rate (9.84%) than general **programs (برنامج)**.
- Withdrawal rates are nearly **double** for the general program type compared to the bootcamp.

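Because "Program Type" is derived solely from program duration (more than 21 days counts as a bootcamp), this comparison shows that longer programs have a lower withdrawal rate; it does not by itself show that the bootcamp format causes higher completion.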
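
Pairing a fixed label list such as `['Completed', 'Withdrawn']` with the output of `value_counts()` relies on the order of the counts, and `value_counts()` sorts by frequency, not by code. The sketch below, on a made-up group where withdrawals are the majority, shows how the order flips and how reindexing by code pins it down:

```python
import pandas as pd

# Made-up group where non-completion (1) is more common than completion (0)
toy_y = pd.Series([1, 1, 1, 0])

# Sorted by frequency: the first value belongs to code 1, not code 0
by_frequency = toy_y.value_counts(normalize=True) * 100

# Sorted by code: position 0 is always "completed", position 1 "withdrawn"
by_code = by_frequency.reindex([0, 1], fill_value=0)
print(by_frequency.index.tolist(), by_code.tolist())
```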


In [18]:
# Combine specific Employment Status categories into one
def integrate_employment_status(status):
    if status in ['طالب', 'موظف', 'موظف - طالب']:
        return 'موظف - طالب'
    return status

# Apply the function to Employment Status column
df['Integrated Employment Status'] = df['Employment Status'].apply(integrate_employment_status)

# Group data for each Program Type and calculate withdrawal percentages
withdrawal_data = df.groupby(['Program Type', 'Integrated Employment Status'])['Y'].mean().reset_index()

program_types = withdrawal_data['Program Type'].unique()

custom_colors = ['#4B0082', '#6A5ACD', '#483D8B', '#4169E1', '#9370DB']

# Create subplots
fig = sp.make_subplots(
    rows=1, cols=len(program_types), 
    specs=[[{'type': 'domain'}] * len(program_types)],
    subplot_titles=[f"{ptype}" for ptype in program_types]
)

# Add pie charts for each Program Type
for idx, program_type in enumerate(program_types):
    filtered_data = withdrawal_data[withdrawal_data['Program Type'] == program_type]
    fig.add_trace(
        go.Pie(
            labels=filtered_data['Integrated Employment Status'],
            values=filtered_data['Y'],
            hole=0.4,
            textinfo="percent+label",
            marker=dict(colors=custom_colors[:len(filtered_data)])  

        ),
        row=1, col=idx + 1
    )

# Update layout
fig.update_layout(
    title="Withdrawal Percentages by Program Type",
    template="plotly_white"
)

# Show plot
fig.show()


# Withdrawal Percentages by Program Type

This visualization presents two pie charts, one per program type. Each slice is sized by the **withdrawal rate** (the mean of `Y`) of one employment category, so the percentage printed on a slice is that category's share of the summed rates within the chart, not the withdrawal rate itself. The categories "طالب" (Student), "موظف" (Employed), and "موظف - طالب" (Employed and Student) are merged into a single "موظف - طالب" group.

1. **Left Pie Chart - "برنامج" (Program):**
   - **31.5% (Light Purple)**: "عمل حر" (Freelance)
   - **23.5% (Dark Purple)**: "غير موظف" (Unemployed)
   - **22.9% (Purple)**: "خريج" (Graduate)
   - **22.1% (Blue)**: "موظف - طالب" (Employed/Student, merged)

   Freelance participants have the largest share; the other three groups are within about 1.5 points of one another.

2. **Right Pie Chart - "معسكر" (Bootcamp):**
   - **32.8% (Light Purple)**: "عمل حر" (Freelance)
   - **25.5% (Blue)**: "موظف - طالب" (Employed/Student, merged)
   - **23.2% (Dark Purple)**: "غير موظف" (Unemployed)
   - **18.6% (Purple)**: "خريج" (Graduate)

   Freelance participants again have the largest share, followed by the merged employed/student group.

### Key Insights:
- In both program types, **Freelance ("عمل حر")** participants have the highest withdrawal rate of the four groups.
- Graduates ("خريج") have the lowest rate in bootcamps; in general programs the lowest is the merged employed/student group.
- The shares are more spread out for **Bootcamp** (18.6% to 32.8%) than for **Program** (22.1% to 31.5%). Because each chart is normalized separately, slice sizes cannot be compared between the two charts.
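
A pie chart rescales whatever values it is given so that they sum to 100%. When the values are already rates, as with the per-group mean of `Y` here, the slice percentage is no longer the rate. A small sketch with hypothetical rates (not taken from the data) makes the difference concrete:

```python
import pandas as pd

# Hypothetical withdrawal rates (mean of Y) for three employment groups
rates = pd.Series({"عمل حر": 0.10, "غير موظف": 0.05, "خريج": 0.05})

# What the pie displays: each rate as a share of the sum of rates
pie_share = rates / rates.sum() * 100
print(pie_share.round(1))
```

Here a 10% withdrawal rate is drawn as a 50% slice. A grouped bar chart of `rates` would show the rates themselves.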


In [15]:
# Scatter Matrix for Numerical Features
fig = px.scatter_matrix(df, dimensions=['Age', 'GPA', 'PCRF', 'GRST'], color='Y', title="Scatter Matrix for Numerical Features")
fig.show()

# Scatter Matrix for Numerical Features

This scatter matrix visualizes the relationships between numerical features (Age, GPA, PCRF, GRST) and their relationship to the **completion status (Y)**.

- **Axes:**
  - Features included: Age, GPA, PCRF, GRST.
  - Each subplot shows the relationship between two features.
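- **Color Gradient (Y)**:
  - Yellow marks participants who did not complete the program (Y=1).
  - Dark blue marks participants who completed it (Y=0).
  - Because `Y` is numeric, Plotly draws a continuous color scale even though only these two values occur.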

### Key Observations:
1. **Age vs. GPA**: A slight linear trend is observed, where older participants generally have higher GPA scores, though with notable outliers.
2. **PCRF and GRST**: Most data points cluster around lower values, with no clear pattern in relation to completion status.
3. **GPA Distribution**: A few extreme outliers are visible, especially for high GPA values.
4. **Age**: Shows a relatively consistent spread across completion statuses, but higher concentration among younger participants.

This scatter matrix helps identify patterns and correlations among features and their potential influence on completion outcomes.
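
Because `Y` is stored as a number, Plotly Express colors it on a continuous scale. Mapping the codes to text labels first is one way to get two discrete legend entries instead. A sketch on made-up rows, where `Status` is a hypothetical helper column that does not exist in the notebook:

```python
import pandas as pd

# Made-up rows with the numeric 0/1 status
toy = pd.DataFrame({"Age": [22, 35, 41], "GPA": [3.1, 4.2, 2.8], "Y": [0, 1, 0]})

# Hypothetical helper column holding readable labels
toy["Status"] = toy["Y"].map({0: "Completed", 1: "Not Completed"})
print(toy)

# The label column can then be passed as the color, for example:
# px.scatter_matrix(toy, dimensions=["Age", "GPA"], color="Status")
```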


In [16]:
# Parallel Coordinates for Numerical Features
fig = px.parallel_coordinates(df, dimensions=['Age', 'GPA', 'PCRF', 'Program Days'], color='Y', title="Parallel Coordinates for Numerical Features")
fig.show()

# Parallel Coordinates for Numerical Features

This parallel coordinates plot displays the relationships between numerical features and their influence on **completion status (Y)**:

- **Features:**
  - **Age**: Participant's age.
  - **GPA**: Grade Point Average.
  - **PCRF**: Feature representing specific performance data.
  - **Program Days**: Duration of program participation.

- **Color Gradient (Y):**
  - Yellow marks participants who did not complete the program (Y=1).
  - Dark blue marks participants who completed it (Y=0).

### Key Observations:
1. **Age**: Yellow lines (non-completion) are concentrated among younger participants, while older participants show more mixed outcomes.
2. **GPA**: Lines passing through higher GPAs (above ~40) are mostly yellow, whereas lower GPAs show a mix of both colors.
3. **PCRF**: Higher PCRF values (above ~15) also tend to carry yellow lines.
4. **Program Days**: Longer program durations show both colors, indicating no consistent trend.

### Insights:
In this plot **GPA** and **PCRF** appear to separate the two groups most visibly, while **Age** and **Program Days** show less definitive trends. Because thousands of overlapping lines obscure one another, these readings are tentative.


In [3]:
# Define the list of education specialities to filter by
education_specialities = [
    'هندسة الحاسب الالي والشبكات', 'هندسة الشبكات والاتصالات', 'هندسة الشبكات والاتصالات',
    'هندسة الالكترونيات', 'هندسة اتصالات', 'هندسة الحاسب الالي والشبكات',
    'نظم معلومات حاسوبية-قواعد بيانات', 'نظم معلومات ادارية', 'نظم معلومات جغرافية وخرائط'
]

# Filter the DataFrame for rows where Y equals 1 and Education Speciality matches the specified list
df_filtered = df[(df['Y'] == 1) & (df['Education Speaciality'].isin(education_specialities))]

# Count the occurrences of حضوري (in-person) program presentation by education speciality and employment status
حضوري_counts = df_filtered[df_filtered['Program Presentation Method'] == 'حضوري'].groupby(['Education Speaciality', 'Employment Status']).size()

# Count the occurrences of عن بعد (remote) program presentation by education speciality and employment status
عن_بعد_counts = df_filtered[df_filtered['Program Presentation Method'] == 'عن بعد'].groupby(['Education Speaciality', 'Employment Status']).size()

# Initialize an empty figure object for the bar chart
fig = Figure()

# Add حضوري (in-person) counts to the figure, grouped by employment status
for status in حضوري_counts.index.get_level_values('Employment Status').unique():
    fig.add_trace(
        Bar(
            name=f"حضوري - {status}",  # Label for حضوري by employment status
            x=حضوري_counts.xs(status, level='Employment Status').index,  # X-axis: education specialities
            y=حضوري_counts.xs(status, level='Employment Status').values,  # Y-axis: counts
            marker_color='blue'  # Color for حضوري bars
        )
    )

# Add عن بعد (remote) counts to the figure, grouped by employment status
for status in عن_بعد_counts.index.get_level_values('Employment Status').unique():
    fig.add_trace(
        Bar(
            name=f"عن بعد - {status}",  # Label for عن بعد by employment status
            x=عن_بعد_counts.xs(status, level='Employment Status').index,  # X-axis: education specialities
            y=عن_بعد_counts.xs(status, level='Employment Status').values,  # Y-axis: counts
            marker_color='red'  # Color for عن بعد bars
        )
    )

# Update the layout of the figure
fig.update_layout(
    title='Program Presentation Method by Education Speciality and Employment Status (Y=1)',  # Chart title
    xaxis_title='Education Speciality',  # Label for the X-axis
    yaxis_title='Count',  # Label for the Y-axis
    barmode='stack',  # Bars are stacked for comparison
    legend_title="Program Presentation Method and Employment Status"  # Title for the legend
)

# Show the final bar chart
fig.show()


# Program Presentation Method by Education Speciality and Employment Status (Y=1)

This stacked bar chart shows counts by **education specialty**, **program presentation method**, and **employment status**, restricted to participants who did not complete the program (Y=1).

- **X-axis**: Education specialties such as:
  - "نظم معلومات إدارية" (Management Information Systems)
  - "نظم معلومات جغرافية وخرائط" (Geographic Information Systems and Mapping)
  - "هندسة الشبكات والاتصالات" (Network and Communications Engineering)
  - "هندسة الاتصالات" (Telecommunications Engineering)
  - "هندسة الحاسب الالي والشبكات" (Computer and Network Engineering)

- **Y-axis**: Count of participants.

- **Colors**:
  - **Red (عن بعد)**: Remote delivery, segmented by employment status:
    - Employed, Unemployed, and Student.
  - **Blue (حضوري)**: In-person delivery, segmented by employment status:
    - Graduate, Employed, and Student.

### Key Observations:
1. **Remote (Red)**:
   - Most non-completers from "Management Information Systems" and "Geographic Information Systems" attended remotely, particularly students.
   - Non-completers from "Computer and Network Engineering" were also mostly remote, with a mix of employment statuses.

2. **In-Person (Blue)**:
   - Concentrated in "Telecommunications Engineering" and "Network and Communications Engineering."
   - Employed participants dominate this category.

3. **Education Specialties**:
   - Non-completers from "Management Information Systems" and "Geographic Information Systems" skew more heavily toward remote delivery than those from the engineering specialties, which are more balanced.

### Insights:
Because the chart is restricted to Y=1, it shows where non-completions are concentrated by delivery method, specialty, and employment status. It does not show participants' preferences or the withdrawal rate of each delivery method, which would also require the Y=0 rows.


In [18]:
# Filter data for 'Y' = 1 (participants who dropped out)
DropOut_data = df[df['Y'] == 1]

# Extract unique Program Skill Levels
program_skill_levels = DropOut_data['Program Skill Level'].unique()

# Create subplots for each unique Program Skill Level
fig = make_subplots(
    rows=1,
    cols=len(program_skill_levels),
    subplot_titles=[f"{level}" for level in program_skill_levels],
    specs=[[{'type': 'domain'} for _ in program_skill_levels]]
)

# Add pie charts for each Program Skill Level with Program Main Category Code
for i, skill_level in enumerate(program_skill_levels, 1):
    # Filter data for the current Program Skill Level
    filtered_data = DropOut_data[DropOut_data['Program Skill Level'] == skill_level]['Program Main Category Code'].value_counts(normalize=True) * 100
    fig.add_trace(
        go.Pie(
            labels=filtered_data.index,  # Program Main Category Codes
            values=filtered_data.values,  # Percentage distribution
            hole=0.4,  # Donut-style pie chart
            title=f"{skill_level}"
        ),
        row=1,
        col=i
    )

# Update layout for the entire figure
fig.update_layout(
    title="Program Main Category Code Distribution by Program Skill Level (DropOut Programs)",
    template="plotly_white"
)

# Show the plot
fig.show()

# Program Main Category Code Distribution by Program Skill Level (Dropout Programs)

This visualization consists of three pie charts illustrating the distribution of **program main category codes** within each **program skill level**, restricted to participants who dropped out (Y=1):

1. **Left Pie Chart - "مبتدئ" (Beginner):**
   - **55.9% (Blue - CAUF)**: The largest proportion is from the CAUF category.
   - **35.4% (Red - PCRF)**: The second largest proportion.
   - Remaining categories like APMR, ABIR, and others constitute small portions, each below 3%.

2. **Middle Pie Chart - "متوسط" (Intermediate):**
   - **34% (Green - APMR)**: The largest share among intermediate-level dropouts.
   - **21.9% (Red - PCRF)** and **18.3% (Blue - CAUF)**: Significant proportions as well.
   - Other categories, such as GRST and SERU, make up smaller shares.

3. **Right Pie Chart - "متقدم" (Advanced):**
   - **50.9% (Red - PCRF)**: Dominates the distribution for advanced-level programs.
   - **28.3% (Blue - CAUF)** and **16% (Green - APMR)**: Also significant contributors.
   - The rest of the categories are minor contributors.

### Key Insights:
1. **Beginner-Level Programs**: Heavily dominated by CAUF and PCRF categories.
2. **Intermediate-Level Programs**: APMR takes the lead, but CAUF and PCRF still have substantial shares.
3. **Advanced-Level Programs**: PCRF becomes more dominant, followed by CAUF and APMR.

This visualization highlights the varying distribution of main category codes across skill levels, with specific trends for each level.
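
A category's share among dropouts partly reflects how many participants enrolled in it to begin with. One way to separate the two, sketched here on made-up rows with the same column names, is to compare the mix among dropouts with the mix among all participants at the same skill level:

```python
import pandas as pd

level = "Program Skill Level"
code = "Program Main Category Code"

# Made-up rows: four beginner and two advanced participants
toy = pd.DataFrame({
    level: ["مبتدئ", "مبتدئ", "مبتدئ", "مبتدئ", "متقدم", "متقدم"],
    code: ["CAUF", "CAUF", "CAUF", "PCRF", "PCRF", "PCRF"],
    "Y": [1, 1, 0, 1, 1, 0],
})
dropouts = toy[toy["Y"] == 1]

# Category mix (percent) within each skill level
share_among_dropouts = pd.crosstab(dropouts[level], dropouts[code], normalize="index") * 100
share_among_all = pd.crosstab(toy[level], toy[code], normalize="index") * 100
print(share_among_dropouts.round(1))
print(share_among_all.round(1))
```

A category whose share among dropouts is clearly above its share among all participants is over-represented among dropouts; equal shares mean it simply has many participants.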


In [20]:
# Treemap for Completion Status, Gender, and Employment Status
fig = px.treemap(
    df, 
    path=['Y', 'Gender', 'Employment Status'], 
    values='Total Regestration',
    title="Treemap of Completion Status, Gender, and Employment Status"
)
fig.show()

# Treemap of Completion Status, Gender, and Employment Status

This treemap visualizes the relationships between **completion status (Y)**, **gender**, and **employment status**:

- **Color Key**:
  - **Blue (0)**: Represents participants who completed the program.
  - **Red (1)**: Represents participants who did not complete the program.

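- **Hierarchy** (following `path=['Y', 'Gender', 'Employment Status']`):
  1. **First Level**: Completion status (0 or 1).
  2. **Second Level**: Gender ("ذكر" for Male, "أنثى" for Female).
  3. **Third Level**: Employment status, such as:
     - "موظف" (Employed)
     - "غير موظف" (Unemployed)
     - "طالب" (Student)
     - "خريج" (Graduate)
     - "عمل حر" (Freelance)

- **Tile size**: The sum of the `Total Regestration` column over the rows in each group, not the number of rows.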

### Key Observations:
1. **Completion Status (0 - Blue)**:
   - The majority of participants are grouped under "موظف" (Employed) for both genders.
   - "غير موظف" (Unemployed) and "طالب" (Student) also have sizable proportions, with more representation for females ("أنثى").

2. **Non-Completion Status (1 - Red)**:
   - The proportions for non-completion are noticeably smaller compared to completions.
   - Non-completion is more evenly distributed across employment categories but remains higher for "غير موظف" (Unemployed) and "طالب" (Student).

### Insights:
- **Completion trends** are dominated by employed participants across both genders.
- Within the **non-completion** block, unemployed and student participants take up the most area.
- Females ("أنثى") contribute more to the non-completion block than males ("ذكر"); since tile areas are sums rather than rates, this does not by itself show a gender effect on completion.
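
With `values='Total Regestration'`, each tile's area is the sum of that column over the rows in the group, which can differ from the number of rows. The sketch below reproduces both quantities with `groupby` on made-up rows; what `Total Regestration` measures in the real data is not stated in the notebook, so the numbers here are purely illustrative.

```python
import pandas as pd

# Made-up rows using the notebook's column names
toy = pd.DataFrame({
    "Y": [0, 0, 1],
    "Gender": ["ذكر", "ذكر", "أنثى"],
    "Employment Status": ["موظف", "موظف", "طالب"],
    "Total Regestration": [10, 30, 5],
})
path = ["Y", "Gender", "Employment Status"]

# What the treemap draws: the summed value per leaf of the hierarchy
tile_area = toy.groupby(path)["Total Regestration"].sum()

# For comparison: the number of rows per leaf
row_count = toy.groupby(path).size()
print(tile_area)
print(row_count)
```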


In [21]:
# Sunburst for Completion Status, Gender, and Level of Education
fig = px.sunburst(
    df, 
    path=['Y', 'Gender', 'Level of Education'], 
    values='Total Regestration', 
    title="Sunburst Chart for Completion Status, Gender, and Education Level"
)
fig.show()

# Sunburst Chart for Completion Status, Gender, and Education Level

This sunburst chart visualizes the hierarchical distribution of **completion status (Y)**, **gender**, and **education level**:

- **Center Circle (Y):**
  - **0 (Blue)**: Represents participants who completed the program.
  - **1 (Red)**: Represents participants who did not complete the program.

- **Middle Circle (Gender):**
  - "ذكر" (Male)
  - "أنثى" (Female)

- **Outer Circle (Education Level):**
  - "البكالوريوس" (Bachelor's)
  - "الماجستير" (Master's)
  - "الدبلوم" (Diploma)

### Key Observations:
1. **Completion Status (Y=0 - Blue)**:
   - The majority of participants who completed the program are male and hold a **Bachelor's degree**.
   - A smaller proportion of females with Bachelor's and Master's degrees also completed the program.

2. **Non-Completion Status (Y=1 - Red)**:
   - The non-completion group is dominated by males and females with Bachelor's degrees.
   - This group has fewer participants with Master's or Diploma levels.

### Insights:
- Participants with **Bachelor's degrees** make up the largest share of both completion and non-completion groups.
- **Males** have a slightly higher proportion in the completion group compared to females.
- Education level may play a role: advanced degrees (e.g., Master's) appear to have higher completion rates than Bachelor's or Diploma holders, though the chart shows segment sizes rather than rates.


In [23]:
# Facet Scatter Plot for Age vs GPA by Gender and Completion Status
fig = px.scatter(
    df, 
    x='Age', 
    y='GPA', 
    color='Gender', 
    facet_col='Y', 
    title="Facet Scatter Plot of Age vs GPA by Gender and Completion Status"
)
fig.show()

# Facet Scatter Plot of Age vs GPA by Gender and Completion Status

This scatter plot visualizes the relationship between **Age** and **GPA**, segmented by **gender** and **completion status (Y)**.

- **Panels**:
  - **Y=0 (Left)**: Participants who completed the program.
  - **Y=1 (Right)**: Participants who did not complete the program.

- **Points**:
  - **Blue (ذكر - Male)**.
  - **Red (أنثى - Female)**.

### Key Observations:
1. **Completion Status (Y=0 - Left Panel)**:
   - The majority of GPA values cluster near 0 for both males and females.
   - A few extreme outliers (e.g., GPA close to 100) exist, mainly for females.
   - Participants' ages are widely distributed, with no visible trend across GPA values.

2. **Non-Completion Status (Y=1 - Right Panel)**:
   - Similar clustering of GPA values near 0, with minimal spread among higher GPAs.
   - Age distribution is more compact, with participants mostly under 40.

3. **Gender Distribution**:
   - Females (red points) dominate in GPA outliers for the completion group.
   - Both genders are evenly distributed among low GPA scores in both panels.

### Insights:
- GPA is generally low for both completion and non-completion groups, with very few participants scoring high.
- Age does not strongly correlate with GPA or completion status, although younger participants are slightly more common.
- Gender-related patterns are not prominent except for the occasional GPA outliers among females.


In [27]:
# Heatmap of non-completion rate (mean of Y) by Home Region and Gender
pivot = df.pivot_table(index='Home Region', columns='Gender', values='Y', aggfunc='mean').fillna(0)
heatmap = ff.create_annotated_heatmap(
    z=pivot.values,
    x=pivot.columns.tolist(),
    y=pivot.index.tolist(),
    colorscale='Viridis',
)
heatmap.show()

### Heatmap of Non-Completion Rate by Home Region and Gender

- **Rows**: Home regions (e.g., منطقة نجران, منطقة مكة المكرمة).
- **Columns**: Gender (ذكر for Male and أنثى for Female).
- **Values**: The mean of `Y` in each cell, that is, the proportion of participants of that gender in that region who did not complete the program. Region and gender combinations with no participants appear as 0 because of `fillna(0)`.

- **Color Scale**:
  - Darker shades represent lower non-completion rates.
  - Brighter colors highlight higher non-completion rates.

- **Insights**:
  - The heatmap helps identify region and gender combinations with unusually high or low non-completion rates.
  - Notable regions like منطقة الجوف show distinct variations in the male non-completion rate compared to others.

This visualization is suited to spotting regional and gender disparities in non-completion. Cells backed by few participants can show extreme rates and should be read with caution.
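
The pivot in the cell above takes the mean of `Y`, which for a 0/1 column is the proportion of rows with `Y = 1`. The sketch below rebuilds a tiny version on made-up rows and shows that `fillna(0)` makes an empty cell look the same as a genuine 0% rate:

```python
import pandas as pd

# Made-up rows: no female participants from the second region
toy = pd.DataFrame({
    "Home Region": ["منطقة الرياض"] * 4 + ["منطقة الجوف"] * 2,
    "Gender": ["ذكر", "ذكر", "أنثى", "أنثى", "ذكر", "ذكر"],
    "Y": [0, 1, 0, 0, 1, 1],
})

# Mean of Y per region and gender = non-completion rate
rate = toy.pivot_table(index="Home Region", columns="Gender", values="Y", aggfunc="mean")
print(rate)

# After fillna(0), the empty cell cannot be told apart from a real zero
filled = rate.fillna(0)
print(filled)
```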

In [29]:
# Parallel Categories Plot for Gender, Level of Education, and Y
fig = px.parallel_categories(
    df, 
    dimensions=['Gender', 'Level of Education', 'Y'], 
    title="Parallel Categories Plot for Gender, Education Level, and Completion Status"
)
fig.show()

### Parallel Categories Plot for Gender, Education Level, and Completion Status

- **Axes**:
  - **Gender**: Divided into two categories (ذكر for Male and أنثى for Female).
  - **Level of Education**: Includes categories such as البكالوريوس (Bachelor), الماجستير (Master), etc.
  - **Completion Status (Y)**: Binary values (0 for completed, 1 for not completed).

- **Connections**:
  - Lines connect categories across the axes, showing the flow of individuals through different gender, education levels, and completion statuses.

- **Insights**:
  - The density of lines highlights common paths or trends (e.g., a higher proportion of one gender achieving a specific education level or completion status).
  - Useful for identifying patterns or disparities in education and completion rates based on gender.

This visualization provides an overview of categorical relationships and transitions across multiple dimensions.

In [30]:
# 3D Scatter Plot for Age, Program Days, and GPA, colored by Y
fig = px.scatter_3d(
    df, 
    x='Age', 
    y='Program Days', 
    z='GPA', 
    color='Y', 
    title="3D Scatter Plot of Age, Program Days, and GPA"
)
fig.show()

### 3D Scatter Plot of Age, Program Days, and GPA

- **Description**: The plot visualizes the relationship between three variables:
  - `Age` on the X-axis.
  - `Program Days` on the Y-axis.
  - `GPA` on the Z-axis.
- **Color**: A color gradient encodes completion status `Y`, with yellow marking Y=1 (not completed) and purple marking Y=0 (completed).
- **Insights**:
  - Most data points are clustered, indicating similar values for the variables.
  - The gradient provides a visual cue for interpreting the distribution of `Y` across the dataset.

This plot helps explore how age, program days, and GPA are distributed and correlated.
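
How strongly the three variables move together can also be stated as numbers. A minimal sketch using the pairwise Pearson correlation from `DataFrame.corr()`; the values are made up and chosen to be perfectly correlated so the output is easy to check:

```python
import pandas as pd

# Made-up values: Program Days rises with Age, GPA falls with Age
toy = pd.DataFrame({
    "Age": [20, 25, 30, 35],
    "Program Days": [10, 20, 30, 40],
    "GPA": [4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between the three columns
corr = toy.corr()
print(corr.round(2))
```

Pearson correlation is sensitive to extreme values, so outliers such as a GPA near 100 would be worth inspecting before reading the coefficients from the real data.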