# Overview
Welcome to Tuwaiq Academy, a non-profit training institute in Saudi Arabia. With millions of applicants vying for limited seats in advanced AI programs, this challenge is your chance to revolutionize the allocation process. Your task is to build a predictive model that identifies top candidates likely to successfully complete the training program.

### Goal:
The goal of the competition is to develop an efficient model that sifts through the pool of applicants, pinpointing those with the highest likelihood of completing Tuwaiq Academy's rigorous training programs. By achieving this, you contribute to the strategic utilization of limited seats, ensuring they are granted to individuals your model identifies as having the best potential to excel in the program. Seize the challenge, make a lasting impact!

# Description
### Overview:
Tuwaiq Academy, a non-profit educational institute in the Kingdom of Saudi Arabia, is on a mission to advance AI and emerging technologies. With a surge in applications and limited seats for our coveted advanced programs, we invite you to participate in this competition for educational excellence.

### Objective:
In the face of millions of applicants, Tuwaiq Academy aims to streamline the selection process for our advanced training programs. Your challenge is to develop a predictive model that can accurately identify individuals with the highest likelihood of successfully completing the rigorous training curriculum.

### Why Participate:

* Impactful Contribution: Your efforts directly influence the strategic allocation of limited seats, ensuring they are granted to those most likely to excel in the program.
* Advancing Education: By participating in this competition, you become a key player in shaping the future of education, contributing to the growth of knowledge and expertise in the field.

### Key Details:

* Dataset: We provide registration information for accepted trainees in a variety of training programs, including a completion target column.
* Objective: Develop a robust model to identify candidates with the best potential for successful program completion.
* Outcome: The winning models will significantly enhance the efficiency of the selection process, allowing Tuwaiq Academy to optimize resource allocation and make a lasting impact on the education landscape.

### How to Win:

* Accuracy: Build a model that accurately predicts trainees' likelihood of completing the program.
* Innovation: Stand out by incorporating innovative approaches, algorithms, or features that elevate the predictive power of your model.
* Explanation: Provide insights into the factors driving your model's predictions, demonstrating a clear understanding of the problem.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Importing Required Libraries

In [143]:
import pandas as pd
import numpy as np

# Exploratory Data Analysis (EDA)

In [144]:
df=pd.read_csv('train.csv')
df

Unnamed: 0,Student ID,Age,Gender,Home Region,Home City,Program ID,Program Main Category Code,Program Sub Category Code,Technology Type,Program Skill Level,...,Completed Degree,Level of Education,Education Speaciality,College,University Degree Score,University Degree Score System,Employment Status,Job Type,Still Working,Y
0,4f14c50d-162e-4a15-9cf0-ec129c33bcf0,37.0,ذكر,منطقة الرياض,الرياض,453686d8-4023-4506-b2df-fac8b059ac26,PCRF,PCRF,,,...,نعم,البكالوريوس,هندسة حاسب الالي,,2.44,4.0,غير موظف,,,0
1,0599d409-876b-41a5-af05-749ef0e77d32,21.0,ذكر,منطقة عسير,خميس مشيط,cc8e4e42-65d5-4fa1-82f9-6c6c2d508b60,APMR,SWPS,,متوسط,...,نعم,البكالوريوس,الإذاعة والتلفزيون والفيلم,الفنون والعلوم الإنسانية,5.00,5.0,طالب,,,0
2,38a11c0e-4afc-4261-9c64-e94cc0a272fb,24.0,ذكر,منطقة الرياض,الرياض,e006900d-05a9-4c2b-a36f-0ffb9fce44cd,APMR,,,متوسط,...,نعم,البكالوريوس,Information Technology,,3.50,5.0,موظف,,,0
3,1693e85b-f80e-40ce-846f-395ddcece6d3,23.0,ذكر,منطقة الرياض,الرياض,2ec15f6b-233b-428a-b9f5-e40bc8d14cf9,TOSL,TOSL,,,...,نعم,البكالوريوس,حوسبة تطبيقية - (مسار شبكات الحاسب),,3.55,5.0,خريج,,,0
4,98a0e8d0-5f80-4634-afd8-322aa0902863,23.0,ذكر,منطقة الرياض,الرياض,d32da0e9-1aed-48c3-992d-a22f9ccc741e,CAUF,SWPS,تقليدية,متوسط,...,لا,البكالوريوس,نظم المعلومات الحاسوبية,تكنولوجيا الاتصالات والمعلومات,4.00,5.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6543,cd196579-9590-441b-8787-41078f3cee25,31.0,أنثى,منطقة الرياض,الرياض,4f8c696a-b783-4d40-9776-105f6d3bd624,CAUF,SWPS,,,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,4.40,5.0,,,,0
6544,37bfc11c-ff8c-42dc-9cf9-0d13bb8f7131,27.0,أنثى,منطقة القصيم,بريدة,e94942dd-8684-4746-97ae-df567b9b0a4a,PCRF,PCRF,,مبتدئ,...,نعم,البكالوريوس,علوم الحاسب,,4.46,5.0,موظف,,,0
6545,fc114302-a79f-439f-a08b-fe0a51cf839e,24.0,أنثى,منطقة الرياض,الرياض,02ae0b47-64a6-47a1-b3c5-c0e4df393c30,PCRF,PCRF,تقليدية,مبتدئ,...,لا,البكالوريوس,نظم المعلومات,تكنولوجيا الاتصالات والمعلومات,4.93,5.0,موظف,دوام كامل,Yes,1
6546,4b6d9a36-4402-4c75-bc3a-fca927dbaf65,25.0,ذكر,منطقة الرياض,الرياض,9b4cedaa-fac0-4eac-aa4b-b05b6a0c97ff,PCRF,PCRF,,متوسط,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,4.00,4.0,غير موظف,تدريب,No,0


In [145]:
df.shape

(6548, 24)

In [146]:
df.describe()

Unnamed: 0,Age,Program Days,University Degree Score,University Degree Score System,Y
count,6456.0,6548.0,6467.0,6467.0,6548.0
mean,26.831165,19.691662,8.224432,9.773929,0.158674
std,5.535967,32.112061,19.120384,21.259962,0.3654
min,18.0,3.0,0.0,4.0,0.0
25%,23.0,5.0,3.3,5.0,0.0
50%,25.0,12.0,4.0,5.0,0.0
75%,29.0,19.0,4.51,5.0,0.0
max,57.0,292.0,100.0,100.0,1.0


In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6548 entries, 0 to 6547
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Student ID                      6548 non-null   object 
 1   Age                             6456 non-null   float64
 2   Gender                          6548 non-null   object 
 3   Home Region                     6546 non-null   object 
 4   Home City                       6546 non-null   object 
 5   Program ID                      6548 non-null   object 
 6   Program Main Category Code      6548 non-null   object 
 7   Program Sub Category Code       5613 non-null   object 
 8   Technology Type                 3566 non-null   object 
 9   Program Skill Level             4902 non-null   object 
 10  Program Presentation Method     6548 non-null   object 
 11  Program Start Date              6548 non-null   object 
 12  Program End Date                65

## Removing Duplicates

In [148]:
for i in df.columns:
    print(df[i].value_counts())
    print("---------------------")
# The first Student ID appears 9 times, so we will check whether the data contains duplicates.

415cfcb1-8dfa-459d-b719-d942cc5e19e1    9
4baf0168-b21d-4f18-b85c-170d1023cb1c    6
b1ae7dcd-2de0-4797-9331-4e00f7099d3e    6
cc92fa1e-494d-406f-8faf-9019d15ed5d0    6
647325d0-ef30-457b-b41d-e00e110e06f2    6
                                       ..
6aa34c90-a5c8-4377-ac0c-de1b22ff008a    1
059afdb5-cad3-4493-9c88-01a983b85273    1
5d28891f-4b57-4bdd-85b9-6c10bc1447a8    1
c6b28dff-1870-4302-af40-c959b81c310b    1
008f3386-0d43-45a4-8372-b282e5a0101a    1
Name: Student ID, Length: 5196, dtype: int64
---------------------
23.0    892
24.0    760
25.0    552
22.0    548
26.0    460
27.0    395
21.0    314
29.0    297
28.0    291
30.0    230
20.0    225
31.0    216
32.0    199
34.0    153
33.0    136
35.0    112
19.0     95
37.0     95
38.0     86
36.0     73
39.0     68
41.0     56
42.0     49
40.0     42
44.0     26
43.0     24
18.0     17
45.0     10
46.0      8
48.0      6
49.0      6
47.0      4
55.0      3
50.0      3
52.0      1
51.0      1
56.0      1
53.0      1
57.0      1
Nam

In [149]:
df.loc[df['Student ID'] == '415cfcb1-8dfa-459d-b719-d942cc5e19e1']
# The same student appears several times, but the program details differ in each row: the student registered for 9 different programs, so these rows are not duplicates.

Unnamed: 0,Student ID,Age,Gender,Home Region,Home City,Program ID,Program Main Category Code,Program Sub Category Code,Technology Type,Program Skill Level,...,Completed Degree,Level of Education,Education Speaciality,College,University Degree Score,University Degree Score System,Employment Status,Job Type,Still Working,Y
418,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,d0f06377-196d-486c-bad8-15c04f8cdad3,PCRF,PCRF,ناشئة,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
785,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,f2374bcb-111c-402d-beaf-37433687db4d,TOSL,,,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
1700,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,d8fbc53c-e042-4e45-b8c7-42b4ad0528ee,CAUF,,,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
2302,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,642f04b2-67f8-4403-98f1-c5c5755eb358,CAUF,SWPS,داعمة,متوسط,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
2818,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,b3c56c6d-dd46-4f26-93ee-c16e211aa3b9,PCRF,PCRF,داعمة,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
2969,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,5b4d389b-ce84-46c4-87f0-70531fdaf1e7,INFA,INFA,,,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
4270,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,1a3faefb-3708-40ab-82e6-4fcf07983665,PCRF,PCRF,داعمة,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
4861,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,0d6336e2-4a79-4c32-8f15-073f9a4f6976,PCRF,PCRF,تقليدية,مبتدئ,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
5705,415cfcb1-8dfa-459d-b719-d942cc5e19e1,44.0,ذكر,منطقة الرياض,الرياض,9144fab5-59f3-4b98-bc3c-0784a04883fe,TOSL,TOSL,,,...,نعم,الدبلوم,شبكات الحاسب الآلي,تكنولوجيا الاتصالات والمعلومات,4.0,5.0,موظف,دوام كامل,Yes,0
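
The distinction drawn above (a repeated `Student ID` versus a truly duplicated row) can be checked programmatically. The following is a minimal sketch on hypothetical toy data, not the competition file; the values `a`, `b`, `p1` and so on are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): student "a" enrolls in two programs,
# student "b" appears twice with a fully identical row.
toy = pd.DataFrame({
    "Student ID": ["a", "a", "b", "b", "c"],
    "Program ID": ["p1", "p2", "p1", "p1", "p3"],
    "Y": [0, 1, 0, 0, 1],
})

# Rows per student: a repeated ID alone does not imply duplication.
enrollments_per_student = toy.groupby("Student ID").size()

# Rows that repeat an earlier row in every column are the true duplicates.
identical_rows = int(toy.duplicated().sum())

# Same student in the same program, regardless of the other columns.
same_student_program = int(
    toy.duplicated(subset=["Student ID", "Program ID"]).sum()
)

print(enrollments_per_student.to_dict(), identical_rows, same_student_program)
```

Comparing the two counts shows whether any student is listed twice for the same program with differing details, which a plain `df.duplicated()` would not flag.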


In [150]:
df.duplicated().sum()
# The data contains identical rows, so we need to remove the duplicates.

48

In [151]:
# Show every row involved in a duplication, sorted so identical rows sit next to each other, to confirm they are true duplicates
duplicated = df[df.duplicated(keep=False)].sort_values(by=df.columns.tolist())
duplicated

Unnamed: 0,Student ID,Age,Gender,Home Region,Home City,Program ID,Program Main Category Code,Program Sub Category Code,Technology Type,Program Skill Level,...,Completed Degree,Level of Education,Education Speaciality,College,University Degree Score,University Degree Score System,Employment Status,Job Type,Still Working,Y
1475,023f77b4-1136-4049-890c-2ad8128db5ba,24.0,ذكر,منطقة الرياض,الرياض,21c233a1-0101-44e6-8af8-39d545bddaaf,PCRF,PCRF,,,...,نعم,البكالوريوس,علوم الحاسبات,تكنولوجيا الاتصالات والمعلومات,3.00,4.0,موظف,دوام كامل,Yes,0
3740,023f77b4-1136-4049-890c-2ad8128db5ba,24.0,ذكر,منطقة الرياض,الرياض,21c233a1-0101-44e6-8af8-39d545bddaaf,PCRF,PCRF,,,...,نعم,البكالوريوس,علوم الحاسبات,تكنولوجيا الاتصالات والمعلومات,3.00,4.0,موظف,دوام كامل,Yes,0
5319,1acd4e67-2970-4cc2-bcd0-ed1326379e40,35.0,ذكر,منطقة الرياض,الرياض,0d6336e2-4a79-4c32-8f15-073f9a4f6976,PCRF,PCRF,تقليدية,مبتدئ,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,2.00,4.0,موظف,دوام كامل,Yes,0
6052,1acd4e67-2970-4cc2-bcd0-ed1326379e40,35.0,ذكر,منطقة الرياض,الرياض,0d6336e2-4a79-4c32-8f15-073f9a4f6976,PCRF,PCRF,تقليدية,مبتدئ,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,2.00,4.0,موظف,دوام كامل,Yes,0
3780,206447b4-433d-4950-b445-7e60e75b6ca8,40.0,ذكر,المنطقة الشرقية,الأحساء,1f09a274-8f35-41a1-9a5e-61a5f1cb98fc,TOSL,,,متوسط,...,نعم,البكالوريوس,علوم الحاسبات,تكنولوجيا الاتصالات والمعلومات,3.12,4.0,غير موظف,دوام كامل,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4189,ecbedd25-274c-439e-a31f-bbd7ac009714,32.0,ذكر,منطقة الرياض,الرياض,ab263106-20a8-42aa-9626-278e62ae3a49,CAUF,SWPS,تقليدية,متقدم,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,5.00,5.0,موظف,دوام كامل,Yes,0
2152,f2a20174-e002-45bf-a4a8-63be34062a60,24.0,ذكر,منطقة الباحة,الباحة,1f09a274-8f35-41a1-9a5e-61a5f1cb98fc,TOSL,,,متوسط,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,2.51,4.0,موظف,تدريب,Yes,0
2609,f2a20174-e002-45bf-a4a8-63be34062a60,24.0,ذكر,منطقة الباحة,الباحة,1f09a274-8f35-41a1-9a5e-61a5f1cb98fc,TOSL,,,متوسط,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,2.51,4.0,موظف,تدريب,Yes,0
2507,fabd4c95-3908-4dd9-b28d-8a6f05629880,23.0,أنثى,منطقة الرياض,الرياض,1f09a274-8f35-41a1-9a5e-61a5f1cb98fc,TOSL,,,متوسط,...,نعم,البكالوريوس,تقنية المعلومات,تكنولوجيا الاتصالات والمعلومات,4.00,5.0,,,,0


In [152]:
# removing duplicates
df.drop_duplicates(inplace=True)

df.shape

(6500, 24)
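
The shape above is consistent with the earlier count: 6548 rows minus 48 duplicates gives 6500. A small sketch of the same bookkeeping on hypothetical toy data follows. Resetting the index is an optional extra step, not something the notebook does; without it the original labels are kept and the index has gaps.

```python
import pandas as pd

# Toy frame (hypothetical values) with two repeated rows.
toy = pd.DataFrame({
    "Student ID": ["a", "a", "b", "b", "b"],
    "Program ID": ["p1", "p1", "p2", "p2", "p3"],
})

n_before = len(toy)
n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier row

# drop_duplicates keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

# The row count should fall by exactly the number of flagged duplicates.
assert len(deduped) == n_before - n_dupes
print(n_before, n_dupes, len(deduped))
```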

## Changing date format

In [153]:
df['Program Start Date']=pd.to_datetime(df['Program Start Date'])
df['Program End Date']=pd.to_datetime(df['Program End Date'])
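
The two calls above let pandas infer the date format. The sketch below assumes ISO-style `YYYY-MM-DD` strings; the actual format in `train.csv` is not shown here, so the `format` value is an assumption. It parses with an explicit format, turns unparseable values into `NaT` instead of raising, and derives the span between the two dates. The toy values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical); the third start date is deliberately malformed.
toy = pd.DataFrame({
    "Program Start Date": ["2023-05-28", "2023-01-08", "not a date"],
    "Program End Date": ["2023-06-08", "2023-01-12", "2023-02-01"],
})

for col in ["Program Start Date", "Program End Date"]:
    # errors="coerce" yields NaT for values that do not match the format.
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")

# Count values that failed to parse so they can be inspected.
unparsed = int(toy["Program Start Date"].isna().sum())

# Elapsed days between start and end (NaN where either date is missing).
span_days = (toy["Program End Date"] - toy["Program Start Date"]).dt.days

print(unparsed, span_days.tolist())
```

A span computed this way could be compared with the existing `Program Days` column as a sanity check.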

In [154]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6547
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Student ID                      6500 non-null   object        
 1   Age                             6413 non-null   float64       
 2   Gender                          6500 non-null   object        
 3   Home Region                     6498 non-null   object        
 4   Home City                       6498 non-null   object        
 5   Program ID                      6500 non-null   object        
 6   Program Main Category Code      6500 non-null   object        
 7   Program Sub Category Code       5580 non-null   object        
 8   Technology Type                 3542 non-null   object        
 9   Program Skill Level             4855 non-null   object        
 10  Program Presentation Method     6500 non-null   object        
 11  Prog

## Fixing Nulls

In [155]:
df.isna().sum()

Student ID                           0
Age                                 87
Gender                               0
Home Region                          2
Home City                            2
Program ID                           0
Program Main Category Code           0
Program Sub Category Code          920
Technology Type                   2958
Program Skill Level               1645
Program Presentation Method          0
Program Start Date                   0
Program End Date                     0
Program Days                         0
Completed Degree                     0
Level of Education                  22
Education Speaciality              272
College                           3862
University Degree Score             76
University Degree Score System      76
Employment Status                  557
Job Type                          4535
Still Working                     4535
Y                                    0
dtype: int64

In [156]:
# Age:
df['Age'].value_counts()

23.0    886
24.0    750
25.0    549
22.0    546
26.0    459
27.0    392
21.0    314
29.0    296
28.0    288
30.0    230
20.0    225
31.0    216
32.0    194
34.0    152
33.0    136
35.0    111
19.0     95
37.0     93
38.0     84
36.0     72
39.0     67
41.0     56
42.0     49
40.0     41
44.0     26
43.0     24
18.0     17
45.0     10
46.0      8
48.0      6
49.0      6
47.0      4
55.0      3
50.0      3
52.0      1
51.0      1
56.0      1
53.0      1
57.0      1
Name: Age, dtype: int64

In [157]:
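# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['Age'] = df['Age'].fillna(df['Age'].median())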
df['Age'].value_counts()

23.0    886
24.0    750
25.0    636
22.0    546
26.0    459
27.0    392
21.0    314
29.0    296
28.0    288
30.0    230
20.0    225
31.0    216
32.0    194
34.0    152
33.0    136
35.0    111
19.0     95
37.0     93
38.0     84
36.0     72
39.0     67
41.0     56
42.0     49
40.0     41
44.0     26
43.0     24
18.0     17
45.0     10
46.0      8
48.0      6
49.0      6
47.0      4
55.0      3
50.0      3
52.0      1
51.0      1
56.0      1
53.0      1
57.0      1
Name: Age, dtype: int64
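
Comparing the two outputs, only the count for 25.0 changed (549 to 636), which matches the 87 missing ages being filled with the median. The toy sketch below, on hypothetical values, shows the same mechanics. The `was_missing` flag is an optional addition that the notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical) with two missing entries.
ages = pd.Series([23.0, 25.0, np.nan, 30.0, np.nan, 24.0], name="Age")

# Remember which rows were missing before imputing (optional).
was_missing = ages.isna()

# median() skips NaN by default, so it is computed from observed values only.
median_age = ages.median()
filled = ages.fillna(median_age)

print(median_age, filled.tolist())
```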

In [158]:
# Level of Education
df['Level of Education'].value_counts()

البكالوريوس    5391
الماجستير       481
الدبلوم         309
ثانوي           267
الدكتوراه        30
Name: Level of Education, dtype: int64

In [159]:
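# Fill missing values with the most frequent level (assign back rather than inplace=True on a column selection)
df['Level of Education'] = df['Level of Education'].fillna(df['Level of Education'].mode().iloc[0])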
df['Level of Education'].value_counts()

البكالوريوس    5413
الماجستير       481
الدبلوم         309
ثانوي           267
الدكتوراه        30
Name: Level of Education, dtype: int64
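
Here the count for البكالوريوس rose from 5391 to 5413, which matches the 22 missing values being filled with the mode. A minimal sketch of mode imputation on hypothetical toy values:

```python
import pandas as pd

# Toy education levels (hypothetical rows) with two missing entries.
levels = pd.Series(
    ["البكالوريوس", "الماجستير", None, "البكالوريوس", None],
    name="Level of Education",
)

# mode() ignores missing values by default and returns a Series,
# because several values can tie; take the first one.
most_common = levels.mode().iloc[0]
filled = levels.fillna(most_common)

print(filled.value_counts().to_dict())
```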

In [160]:
# Sub Category
main_sub_categories = df.groupby('Program Main Category Code')['Program Sub Category Code']\
                       .apply(lambda x: x.astype(str).unique())
print(main_sub_categories)

Program Main Category Code
ABIR                      [INFA, nan, ABIR]
APMR    [SWPS, nan, SRTA, KLTM, ASCW, QTDY]
CAUF                [SWPS, CRDP, nan, ERST]
DTFH                                  [nan]
GRST                                 [INFA]
INFA                                 [INFA]
PCRF                                 [PCRF]
QWLM                                  [nan]
SERU                           [INFA, ERST]
TOSL                            [TOSL, nan]
Name: Program Sub Category Code, dtype: object


In [161]:
def fill_sub_category(row):
    main_category = row['Program Main Category Code']
    sub_categories = row['Program Sub Category Code']
    
    # Handle case where sub_categories is a single NaN float
    if pd.isna(sub_categories):
        return [main_category]
    
    # Ensure sub_categories is always treated as a list
    if not isinstance(sub_categories, list):
        sub_categories = [sub_categories]
    
    # If all values are nan, fill with main category
    if all(pd.isna(x) for x in sub_categories):
        return [main_category] * len(sub_categories)
    
    # Replace nan values with first non-nan value if exists, otherwise main category
    first_non_nan = next((x for x in sub_categories if not pd.isna(x)), main_category)
    return [first_non_nan if pd.isna(x) else x for x in sub_categories]

# Apply the function row by row to fill NaN values
# (note: it returns a single-element list for every row, not a plain string)
df['Program Sub Category Code'] = df.apply(fill_sub_category, axis=1)

In [162]:
main_sub_categories = df.groupby('Program Main Category Code')['Program Sub Category Code']\
                       .apply(lambda x: x.astype(str).unique())
print(main_sub_categories)

Program Main Category Code
ABIR                                 [['INFA'], ['ABIR']]
APMR    [['SWPS'], ['APMR'], ['SRTA'], ['KLTM'], ['ASC...
CAUF             [['SWPS'], ['CRDP'], ['CAUF'], ['ERST']]
DTFH                                           [['DTFH']]
GRST                                           [['INFA']]
INFA                                           [['INFA']]
PCRF                                           [['PCRF']]
QWLM                                           [['QWLM']]
SERU                                 [['INFA'], ['ERST']]
TOSL                                           [['TOSL']]
Name: Program Sub Category Code, dtype: object
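
The output above shows every sub-category wrapped in a single-element list (for example `['INFA']`), because `fill_sub_category` returns lists. If plain strings are wanted, one possible alternative is to pass the main-category column to `fillna`, which aligns on the index and fills each missing sub-category with the main category from the same row. This is a sketch on hypothetical toy rows, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) using category codes that appear in the data.
toy = pd.DataFrame({
    "Program Main Category Code": ["PCRF", "TOSL", "CAUF", "DTFH"],
    "Program Sub Category Code": ["PCRF", np.nan, "SWPS", np.nan],
})

# fillna with a Series fills each NaN from the matching index label.
toy["Program Sub Category Code"] = toy["Program Sub Category Code"].fillna(
    toy["Program Main Category Code"]
)

print(toy["Program Sub Category Code"].tolist())
```

Keeping the column as strings also means later `value_counts()` or encoding steps operate on hashable values.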


In [163]:
df.isnull().sum()

Student ID                           0
Age                                  0
Gender                               0
Home Region                          2
Home City                            2
Program ID                           0
Program Main Category Code           0
Program Sub Category Code            0
Technology Type                   2958
Program Skill Level               1645
Program Presentation Method          0
Program Start Date                   0
Program End Date                     0
Program Days                         0
Completed Degree                     0
Level of Education                   0
Education Speaciality              272
College                           3862
University Degree Score             76
University Degree Score System      76
Employment Status                  557
Job Type                          4535
Still Working                     4535
Y                                    0
dtype: int64

In [164]:
# Home Region, Home City
# There are only 2 null records in Home Region, and the same rows are also null in Home City, so we will drop them
df[df['Home Region'].isnull()]

Unnamed: 0,Student ID,Age,Gender,Home Region,Home City,Program ID,Program Main Category Code,Program Sub Category Code,Technology Type,Program Skill Level,...,Completed Degree,Level of Education,Education Speaciality,College,University Degree Score,University Degree Score System,Employment Status,Job Type,Still Working,Y
1864,cc394a25-74ed-43f9-92bc-e0021fc969c5,45.0,ذكر,,,ce3562c8-8d27-4ffb-8dfc-e1dd4527b32a,PCRF,[PCRF],تقليدية,مبتدئ,...,نعم,الماجستير,نظم المعلومات الإدارية,الأعمال والإدارة والقانون,4.0,5.0,موظف,دوام كامل,Yes,0
3292,55b23798-369d-4698-8cb9-f8d1684bf1f8,25.0,ذكر,,,899795e1-7bf3-46d0-a58e-824d4033f6da,PCRF,[PCRF],تقليدية,متوسط,...,نعم,البكالوريوس,,,,,,,,0


In [165]:
df.dropna(subset=['Home Region'], inplace=True)

In [166]:
# Education Speaciality
# Since Tuwaiq Academy offers programs related to computer science,
# applicants are likely to have a computer science background, so we fill the nulls with علوم الحاسبات (Computer Science), which is also the most frequent value
df['Education Speaciality'].value_counts()

علوم الحاسبات                               1061
تقنية المعلومات                              705
نظم المعلومات                                453
علوم حاسب                                    346
علوم الحاسب                                  193
                                            ... 
تقنية ميكانيكية - تكييف وتبريد                 1
تقنية المعلومات - مسار علم البيانات والذ       1
برمجه وتطوير الويب                             1
computer security                              1
دراسات وقضايا معاصرة                           1
Name: Education Speaciality, Length: 871, dtype: int64

In [167]:
df['Education Speaciality'] = df['Education Speaciality'].fillna('علوم الحاسبات')

In [168]:
# College
# Since the data is about Tuwaiq Academy, which provides computer-science-related courses, we collapse this column into two values:
# تكنولوجيا الاتصالات والمعلومات (Information and Communication Technology) and غير ذلك (Other), and fill the missing data with تكنولوجيا الاتصالات والمعلومات
df['College'].value_counts()

تكنولوجيا الاتصالات والمعلومات         2293
الأعمال والإدارة والقانون               151
العلوم الطبيعية والرياضيات والإحصاء      62
الهندسة والتصنيع والبناء                 43
الفنون والعلوم الإنسانية                 38
العلوم الاجتماعية والصحافة والإعلام      25
التعليم                                  19
الصحة والرفاة                             4
البرامج والمؤهلات العامة                  2
Name: College, dtype: int64

In [169]:
df['College'] = df['College'].fillna('تكنولوجيا الاتصالات والمعلومات')

df['College'] = df['College'].mask(
    df['College'] != 'تكنولوجيا الاتصالات والمعلومات',
    'غير ذلك'
)

df['College'].value_counts()

تكنولوجيا الاتصالات والمعلومات    6154
غير ذلك                            344
Name: College, dtype: int64
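
Of the 6154 rows now labelled تكنولوجيا الاتصالات والمعلومات, 2293 carried that value originally, so the remaining 3861 come from the fill. The sketch below reproduces the fill-then-collapse step on hypothetical toy values. It also keeps a missing-value flag, which is an optional suggestion and not part of the notebook.

```python
import pandas as pd

ICT = "تكنولوجيا الاتصالات والمعلومات"  # Information and Communication Technology
OTHER = "غير ذلك"  # Other

# Toy college values (hypothetical rows) with one missing entry.
college = pd.Series([ICT, "التعليم", None, "الصحة والرفاة", ICT], name="College")

# Optional: record which rows were originally missing before imputing.
was_missing = college.isna()

# Fill nulls with ICT, then keep ICT and replace everything else with OTHER.
# where() keeps values where the condition holds and substitutes elsewhere.
binary = college.fillna(ICT).where(lambda s: s == ICT, OTHER)

print(binary.value_counts().to_dict(), int(was_missing.sum()))
```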

In [170]:
df.isna().sum()

Student ID                           0
Age                                  0
Gender                               0
Home Region                          0
Home City                            0
Program ID                           0
Program Main Category Code           0
Program Sub Category Code            0
Technology Type                   2958
Program Skill Level               1645
Program Presentation Method          0
Program Start Date                   0
Program End Date                     0
Program Days                         0
Completed Degree                     0
Level of Education                   0
Education Speaciality                0
College                              0
University Degree Score             75
University Degree Score System      75
Employment Status                  556
Job Type                          4534
Still Working                     4534
Y                                    0
dtype: int64
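
The remaining null counts are easier to compare as shares of the 6498 remaining rows: for example, `Job Type` and `Still Working` are each missing in 4534 rows, roughly 70%. A minimal sketch of that calculation on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different amounts of missing data.
toy = pd.DataFrame({
    "Job Type": [np.nan, "x", np.nan, np.nan],
    "Age": [21.0, 25.0, 30.0, np.nan],
    "Y": [0, 1, 0, 0],
})

# isna() gives booleans; their mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_share.to_dict())
```

Ranking columns by missing share can help decide which ones to impute and which to handle differently.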