# EIE Simple EDA

## Import Libraries

In [1]:
import pandas as pd

## Load Datasets & Description

### `Locations` table

Features:
- `KATOTTG_2023` (KATOTTG code): 31862 unique KATOTTG codes. KATOTTG is a unique digital code for each administrative-territorial unit and territory of territorial communities in Ukraine, in use since January 2021, after decentralization. A code consists of the letters "UA" followed by seventeen digits in the format `UA|AA|BB|CCC|DDD|EE|FFFFF`:
    - `AA`: first-level administrative-territorial unit (region, Autonomous Republic of Crimea, or city with a special status), as in KOATUU
    - `BB`: second-level administrative-territorial unit (district)
    - `CCC`: third-level administrative-territorial unit (territorial community, or hromada)
    - `DDD`: fourth-level administrative-territorial unit (the specific locality: city, village, or settlement)
    - `EE`: additional-level administrative-territorial unit (city districts, including those in cities with a special status)
    - `FFFFF`: unique object identifier
- `KOATUU_2020` (KOATUU code): 30940 unique codes. KOATUU is a unique code for each administrative-territorial unit and territory of territorial communities in Ukraine, in use from January 1998 until January 2021. A code has ten digits in the format `AA|BBB|CCC|DD`:
    - `AA`: first-level administrative-territorial unit (region, Autonomous Republic of Crimea, or city with a special status)
    - `BBB`: second-level administrative-territorial unit (cities of regional subordination; districts of the Autonomous Republic of Crimea and of regions; districts in cities with a special status determined by the laws of Ukraine)
    - `CCC`: third-level administrative-territorial unit (cities of district subordination; districts in cities of oblast subordination; urban-type settlements; village councils)
    - `DD`: fourth-level administrative-territorial unit (villages; settlements)
- `category` (type of administrative-territorial units). Range of values:
    - region
    - district
    - hromada
    - village
    - settlement
    - city
    - urban village
    - districts in cities
    - capital
    - abroad
- `ukrainian_name` (Ukrainian name of the administrative-territorial unit): 19474 unique Ukrainian names of regions, districts, hromadas, etc.
- `english_name` (English name, available for the first- and second-level administrative-territorial units only): 714 unique English names of regions and districts.
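The two code formats described above can be split into their components with plain string slicing. The following is a minimal sketch: the function names and dictionary keys are hypothetical, chosen for illustration. It also assumes that a KOATUU value with fewer than ten digits (such as `100000000` in the sample output) lost its leading zeros when the column was read as an integer, so it pads the value back to ten digits.

```python
def parse_katottg(code: str) -> dict:
    """Split a KATOTTG code (UA|AA|BB|CCC|DDD|EE|FFFFF) into its parts."""
    if len(code) != 19 or not code.startswith("UA") or not code[2:].isdigit():
        raise ValueError(f"not a valid KATOTTG code: {code!r}")
    digits = code[2:]  # the seventeen digits after "UA"
    return {
        "region": digits[0:2],           # AA
        "district": digits[2:4],         # BB
        "hromada": digits[4:7],          # CCC
        "locality": digits[7:10],        # DDD
        "city_district": digits[10:12],  # EE
        "object_id": digits[12:17],      # FFFFF
    }


def parse_koatuu(code) -> dict:
    """Split a KOATUU code (AA|BBB|CCC|DD) into its parts.

    Assumption: shorter values lost leading zeros, so pad to ten digits.
    """
    digits = str(code).zfill(10)
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"not a valid KOATUU code: {code!r}")
    return {
        "level_1": digits[0:2],   # AA
        "level_2": digits[2:5],   # BBB
        "level_3": digits[5:8],   # CCC
        "level_4": digits[8:10],  # DD
    }


katottg_parts = parse_katottg("UA01020010010075540")
koatuu_parts = parse_koatuu(120400000)
print(katottg_parts)
print(koatuu_parts)
```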

In [2]:
df_location = pd.read_csv('./final_tables/locations_base.csv')
df_location.head()

Unnamed: 0,KATOTTG_2023,KOATUU_2020,category,ukrainian_name,english_name
0,UA01000000000013043,100000000,region,Автономна Республіка Крим,Avtonomna Respublika Krym
1,UA01020000000022387,120400000,district,Бахчисарайський,Bakhchysaraiskyi
2,UA01020010000048857,8536990200,hromada,Андріївська,
3,UA01020010010075540,8536990201,village,Андріївка,
4,UA01020010020030666,8536990203,settlement,Сонячний,


### `Schools` table

Features:
- `KATOTTG_2023` (KATOTTG code): 7168 unique KATOTTG codes (see the description in the `Locations` table) of school locations.
- `EDRPOU` (unique code of the legal entity or physical person): 12898 unique codes, each of which uniquely identifies a school. EDRPOU codes in Ukraine follow an eight-digit format.
- `year`: year in which the school appears as a graduating institution. The range of values in the dataset is 2016-2023.
- `eotypename` (type of school, in Ukrainian): 37 unique types of schools.

Each (`EDRPOU`, `year`) pair is unique in this table.
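The uniqueness of the (`EDRPOU`, `year`) pairs can be checked with `DataFrame.duplicated`. This is a minimal sketch on a small toy frame; the variable names are hypothetical, and on the real data the same expression would be applied to the loaded Schools table.

```python
import pandas as pd

# Toy stand-in for the Schools table (illustrative rows only).
schools = pd.DataFrame({
    "EDRPOU": ["3065891", "3065891", "3065891", "4601943"],
    "year": ["2018", "2019", "2020", "2023"],
})

# Rows whose (EDRPOU, year) pair occurs more than once.
duplicate_pairs = schools[schools.duplicated(subset=["EDRPOU", "year"], keep=False)]

# True when every (EDRPOU, year) pair appears exactly once.
pairs_are_unique = duplicate_pairs.empty
print(pairs_are_unique)
```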

In [8]:
df_school = pd.read_csv('./final_tables/schools_edrpou.csv', dtype=str)
df_school.head()

Unnamed: 0,KATOTTG_2023,EDRPOU,year,eotypename
0,UA05020030010063857,4601943,2023,заклад фахової передвищої освіти
1,UA05020030010063857,419667,2022,заклад фахової передвищої освіти
2,UA05020030010063857,3065891,2018,вище професійне училище
3,UA05020030010063857,3065891,2019,вище професійне училище
4,UA05020030010063857,3065891,2020,вище професійне училище


### `Schools_stats` table

Features:
- `EDRPOU` (unique code of the legal entity or physical person): 9419 unique codes of schools.
- `eotype` (type of school's location): rural or urban.
- `eolevel` (educational level of the school): general education institutions are classified by the level of education they provide. Level I is primary school (grades 1-3 (4)); level II is middle school (grades 5-9); level III is high school, usually with a specialized field of study (grades 10-11). Schools of the three levels can operate together or independently. Only graduates of schools with level III can enter universities. Range of values in the dataset: I-III, II-III, III.
- `teachstuff` (number of teaching staff). Range of values: 2-215 (integer).
- `nonteachstuff` (number of non-teaching staff). Range of values: 0-165 (integer).
- `teachstuffretage` (number of teaching staff of retirement age). Range of values: 0-54 (integer).
- `pupils` (number of pupils). Range of values: 5-2477 (integer).
- `classes` (number of classes). Range of values: 0-98 (integer).
- `opex` (expenses for the operation of the school, current year, thousands UAH). Range of values: 247.787-48330.2 (float).
- `opexplan` (planned expenses for the operation of the school, next year, thousands UAH). Range of values:
    - 0: no information about planned expenses
    - 56.794-83878.7 (float)
- `hub` (whether the school is a hub school). A hub school is the "main" school of a rural community: it usually has better quality of education and equipment than the other schools in the territory, which become subordinate to it. Range of values: N/A (no info), yes, no.
- `year` (year of the statistics for the school). Range of values: 2019-2020 (integer).
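Because `opexplan` uses 0 to mean "no information", averaging the column as-is would bias the result downward. One way to handle this, sketched below on toy values, is to convert the placeholder to a missing value first. The frame and its numbers are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Schools_stats table (illustrative values only).
stats = pd.DataFrame({
    "EDRPOU": ["11111111", "22222222", "33333333"],
    "opex": [9915.2, 13692.2, 15061.0],
    "opexplan": [11123.0, 0.0, 16011.5],  # 0 means "no information"
})

# Replace the 0 placeholder with NaN so it is skipped by aggregations.
stats["opexplan"] = stats["opexplan"].replace(0, np.nan)

# mean() ignores NaN by default, so only the two known plans are averaged.
mean_plan = stats["opexplan"].mean()
print(mean_plan)
```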

In [20]:
df_school_stats = pd.read_csv('./final_tables/school_stats.csv')
df_school_stats.head()

Unnamed: 0,EDRPOU,eotype,eolevel,teachstuff,nonteachstuff,teachstuffretage,pupils,classes,opex,opexplan,hub,year
0,26244113,urban,I-III,44.0,20.0,8.0,539,20,9915.2,11123.0,no,2019
1,26244107,urban,I-III,59.0,18.0,8.0,840,30,13692.2,13758.8,no,2019
2,26244136,urban,I-III,82.0,23.0,11.0,981,36,16184.7,17254.2,no,2019
3,26244099,urban,I-III,61.0,22.0,6.0,854,32,14321.9,14598.6,no,2019
4,23064971,urban,I-III,70.0,23.0,8.0,864,33,15061.0,16011.5,no,2019


### `Students` table

Features:
- `outid` (Student ID): 2490052 unique student identifiers.
- `birth`: year of student's birth. Range of values: 1944-2010 (integer).
- `sextypename`: sex of the student. Range of values: Male, Female.
- `classprofilename`: class profile. Range of values:
    - Universal
    - Junior Specialist
    - Ukrainian Philology
    - Qualified Worker
    - Foreign Philology
    - Mathematical
    - Technological
    - Historical
    - Other (multi-profile)
    - Information Technology
    - Other
    - Biological and Chemical
    - Physical and Mathematical
    - Economic
    - Legal
    - Military and Sports
    - Sports
    - Biotechnological
    - Ecological
    - Biological
    - Geographical
    - Artistic and Aesthetic
    - Physical
    - Chemical-technological and Agrochemical
    - Biological and Physical
    - Philosophical
    - Physico-chemical
    - NaN
- `regtypename`: original graduation status, 7 unique values. Range of values:
    - A graduate of an Ukrainian school of the current year
    - The institution of punishment
    - A graduate of previous years
    - A graduate of a foreign school
    - A graduate of a vocational pre-higher education institution
    - A student of a higher education institution
    - A student of a higher/vocational pre-higher education institution
- `classlangname`: language of the class. Range of values:
    - Ukrainian
    - Russian
    - Romanian
    - Hungarian
    - Moldovan
    - Polish
    - Other
- `KATOTTG_2023` (KATOTTG code): 18470 unique KATOTTG codes (see the description in the `Locations` table) of the students' home locations.
- `EDRPOU_school` (unique code of the legal entity or physical person): 12898 unique codes of schools.
- `year`: year in which the student takes the test (a student who takes exams in more than one year has a different `outid` in each year). Range of values: 2016-2023 (integer).
- `status`: pre-processed (fixed) graduation status, 7 unique values. Range of values:
    - A graduate of an Ukrainian school of the current year
    - The institution of punishment
    - A graduate of previous years
    - A graduate of a foreign school
    - A graduate of a vocational pre-higher education institution
    - A student of a higher education institution
    - A student of a higher/vocational pre-higher education institution

In [21]:
df_students = pd.read_csv('./final_tables/students.csv', dtype = str)
df_students.head()

Unnamed: 0,outid,birth,sextypename,classprofilename,regtypename,classlangname,KATOTTG_2023,EDRPOU_school,year,status
0,a99c6c63-aa70-4aec-ba42-370f7261e857,1998,Male,,A graduate of an Ukrainian school of the curre...,,UA23080270010078454,26373098,2016,A graduate of an Ukrainian school of the curre...
1,c3136421-569e-422e-ae8f-41c4c931fd70,1998,Female,,A graduate of an Ukrainian school of the curre...,,UA68040210010032567,25880114,2016,A graduate of an Ukrainian school of the curre...
2,30de395e-7a74-452a-8370-6856d240fbfb,1999,Male,,A graduate of an Ukrainian school of the curre...,,UA73060610010033137,21431046,2016,A graduate of an Ukrainian school of the curre...
3,852ca6ab-7fbd-40ad-ae51-39dc94edc9e1,1999,Female,,A graduate of an Ukrainian school of the curre...,,UA14120030010055241,25705061,2016,A graduate of an Ukrainian school of the curre...
4,bc9b70ca-c091-440f-b1de-f04b308f3a54,1999,Male,,A graduate of an Ukrainian school of the curre...,,UA61040490010069060,14040173,2016,A graduate of an Ukrainian school of the curre...
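Since the table is loaded with `dtype=str`, the `birth` and `year` columns are strings and need converting before any arithmetic. The sketch below, on a toy frame, converts them to a nullable integer type; the derived `age_at_test` column is a hypothetical illustration and not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the Students table as loaded with dtype=str.
students = pd.DataFrame({
    "outid": ["a", "b", "c"],
    "birth": ["1998", "1999", None],
    "year": ["2016", "2016", "2017"],
})

# Convert the string columns to nullable integers; bad values become <NA>.
for col in ["birth", "year"]:
    students[col] = pd.to_numeric(students[col], errors="coerce").astype("Int64")

# Hypothetical derived column: approximate age in the test year.
students["age_at_test"] = students["year"] - students["birth"]
print(students)
```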


### `Students_Take_Tests` table

Features:
- `outid` (Student ID): 2490052 unique student identifiers.
- `year`: year in which the student takes the test (a student who takes exams in more than one year has a different `outid` in each year). Range of values: 2016-2023 (integer).
- `score100` (Normalized test score): score of the student on the 100-point scale, 117 unique values. Range of values:
    - N/A: when the student did not pass this test
    - 100-200 (integer)
- `score12` (Test score on the school 12-point scale): 13 unique values. Range of values:
    - N/A: when the student did not pass this test
    - 1-12 (integer)
- `score` (Raw test score): 106 unique values. The scale differs for each subject. Range of values:
    - N/A: when the student did not pass this test
    - 0-104 (integer)
- `test_status` (Status of the test):
    - Accepted: when a student passed the test
    - Absent: when a student did not appear for the test
    - Failed: when a student took the test but failed
    - Canceled: when a student passed the test but canceled the results
    - Not selected 100-200: when a student took the test but chose not to convert the result to the 100-200 scale
    - Not registered for the main session: when a student took the test in additional sessions
- `test_subject` (Name of the test subject): Range of values:
    - umltest: Ukrainian language and literature test (2021)
    - ukrsubtest: Ukrainian language test only for school graduates
    - ukrtest: Ukrainian language and literature test (2016-2020)/Ukrainian language test (2021-2023)
    - mathsttest: Mathematics test only for school graduates
    - mathtest: Mathematics test
    - histtest: History of Ukraine test
    - biotest: Biology test
    - geotest: Geography test
    - chemtest: Chemistry test
    - phystest: Physics test
    - engtest: English language test
    - fratest: French language test
    - deutest: German language test
    - spatest: Spanish language test
    - rustest: Russian language test
- `test_type` (Type of the test): Range of values:
    - EIE: test which was held in 2016-2021 years
    - NMT: test which was held in 2022-2023 years
- `KATOTTG_2023_test_center` (KATOTTG code): 634 unique KATOTTG codes (see the description in the `Locations` table) of the test center locations.
- `EDRPOU_test_center` (EDRPOU code): 2112 unique codes of the legal entities that own test centers (unique test center codes).
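Scores are missing for every status other than a completed test, so a typical first step is to keep only `Accepted` rows before aggregating. The following is a minimal sketch on toy rows; the identifiers `s1`..`s4` and the scores are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Students_Take_Tests table (illustrative rows only).
results = pd.DataFrame({
    "outid": ["s1", "s2", "s3", "s4"],
    "test_subject": ["ukrtest", "ukrtest", "mathtest", "mathtest"],
    "test_status": ["Accepted", "Absent", "Accepted", "Accepted"],
    "score100": [161.0, np.nan, 128.0, 166.0],
})

# Keep only tests with an accepted result, then average per subject.
accepted = results[results["test_status"] == "Accepted"]
mean_by_subject = accepted.groupby("test_subject")["score100"].mean()
print(mean_by_subject)
```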

In [23]:
df_st_take_tests = pd.read_csv('./final_tables/students_take_tests.csv')
df_st_take_tests.head()

Unnamed: 0,outid,year,score100,score12,score,test_status,test_subject,test_type,KATOTTG_2023_test_center,EDRPOU_test_center
0,00000AC7-CDDE-4C77-B979-8B0351AF1305,2017,161.0,9.0,,Accepted,ukrtest,EIE,UA51100270010275193,20995060.0
1,00000dce-36de-4d58-9dc2-7ffc824f597a,2021,,,,Absent,ukrtest,EIE,UA18060090010074365,22061344.0
2,00001a8d-fff5-4c7c-bea2-b0157f7c5655,2021,128.0,6.0,31.0,Accepted,ukrtest,EIE,UA71080490010144486,25922746.0
3,0000268f-9fdd-49b2-9ee2-422778c9c4f1,2016,160.0,8.0,,Accepted,ukrtest,EIE,UA63120270010216514,24486622.0
4,0000324e-f525-49c4-a963-8df0cc02d6d5,2018,166.0,9.0,69.0,Accepted,ukrtest,EIE,UA46100230010074173,34387362.0


### `Test centers` table

Features:
- `KATOTTG_2023` (KATOTTG code): 507 unique KATOTTG codes (see the description in the `Locations` table) of the test center locations.
- `year` (Year): year in which a test center conducted tests. Range of values: 2016-2023.
- `EDRPOU` (unique code of the legal entity or physical person): 2111 unique codes of the legal entities that own test centers (unique test center codes).

In [30]:
df_test_centers = pd.read_csv('./final_tables/test_centers_edrpou.csv')
df_test_centers.head()

Unnamed: 0,KATOTTG_2023,year,EDRPOU
0,UA05020030010063857,2018,3065891
1,UA05020030010063857,2019,3065891
2,UA05020030010063857,2020,3065891
3,UA05020030010063857,2021,3065891
4,UA05020030010063857,2021,35725833


### `Tests` table

Features:

- `test_type` (type of test): Range of values:
    - EIE: test which was held in 2016-2021 years
    - NMT: test which was held in 2022-2023 years
- `test_subject` (name of test subject): Range of values:
    - umltest: Ukrainian language and literature test
    - ukrsubtest: Ukrainian language test only for school graduates
    - ukrtest: Ukrainian language test
    - mathsttest: Standard math test
    - mathtest: Math test
    - histtest: History test
    - biotest: Biology test
    - geotest: Geography test
    - chemtest: Chemistry test
    - phystest: Physics test
    - engtest: English language test
    - fratest: French language test
    - deutest: German language test
    - spatest: Spanish language test
    - rustest: Russian language test

In [15]:
df_tests = pd.read_csv('./final_tables/subjects.csv')
df_tests.head()

Unnamed: 0,test_type,test_subject
0,EIE,histtest
1,EIE,mathsttest
2,EIE,ukrtest
3,EIE,phystest
4,EIE,geotest


### `Years` table

Features:

- `year`: Range of values:
    - 2016-2023 (integer)

The table contains eight unique year values.

Years from 2016 to 2021 stand for EIE test.<br>
Years 2022 and 2023 stand for NMT test.
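The year-to-test mapping stated above can be written as a small helper. This is a sketch only; the function name is hypothetical.

```python
def exam_type_for_year(year: int) -> str:
    """Return the test type held in the given year, per the ranges above."""
    if 2016 <= year <= 2021:
        return "EIE"
    if 2022 <= year <= 2023:
        return "NMT"
    raise ValueError(f"year outside the dataset range 2016-2023: {year}")


# Map every year in the dataset range to its test type.
exam_types = {year: exam_type_for_year(year) for year in range(2016, 2024)}
print(exam_types)
```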

In [16]:
df_years = pd.read_csv('./final_tables/years.csv')
df_years.head()

Unnamed: 0,year
0,2016
1,2017
2,2018
3,2019
4,2020


## Join Tables: Examples

### Join School and School Stats

In [35]:
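# Cast the join key to str in both tables so that the dtypes match
df_school['EDRPOU'] = df_school['EDRPOU'].astype(str)
df_school_stats['EDRPOU'] = df_school_stats['EDRPOU'].astype(str)

# Joining on EDRPOU alone pairs every school-year row with every stats-year row
# of the same school, so both year columns survive as year_x and year_y
df_school_full_info = pd.merge(df_school, df_school_stats, on='EDRPOU', how='inner')
df_school_full_info.head()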

Unnamed: 0,KATOTTG_2023,EDRPOU,year_x,eotypename,eotype,eolevel,teachstuff,nonteachstuff,teachstuffretage,pupils,classes,opex,opexplan,hub,year_y
0,UA05020030010063857,26119498,2016,середня загальноосвітня школа,urban,I-III,97.0,34.0,15.0,1475,45,17771.857,20338.645,no,2019
1,UA05020030010063857,26119498,2016,середня загальноосвітня школа,urban,I-III,83.0,35.0,12.0,1527,46,20798.588,26434.211,,2020
2,UA05020030010063857,26119498,2017,середня загальноосвітня школа,urban,I-III,97.0,34.0,15.0,1475,45,17771.857,20338.645,no,2019
3,UA05020030010063857,26119498,2017,середня загальноосвітня школа,urban,I-III,83.0,35.0,12.0,1527,46,20798.588,26434.211,,2020
4,UA05020030010063857,26119498,2018,середня загальноосвітня школа,urban,I-III,97.0,34.0,15.0,1475,45,17771.857,20338.645,no,2019
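In the output above, each school row is repeated for every statistics year (`year_x` against `year_y`), because the join key is `EDRPOU` alone. If only same-year statistics are wanted, one alternative is to join on both `EDRPOU` and `year`. The sketch below uses toy frames with hypothetical values; on the real tables the `year` columns would first need the same dtype, since `df_school` is loaded with `dtype=str`.

```python
import pandas as pd

# Toy stand-ins for the Schools and Schools_stats tables.
schools = pd.DataFrame({
    "EDRPOU": ["26119498"] * 3,
    "year": [2018, 2019, 2020],
    "eotypename": ["school"] * 3,
})
stats = pd.DataFrame({
    "EDRPOU": ["26119498", "26119498"],
    "year": [2019, 2020],
    "pupils": [1475, 1527],
})

# Key on EDRPOU only: every school year pairs with every stats year (3 x 2 rows).
by_code = schools.merge(stats, on="EDRPOU", how="inner")

# Key on EDRPOU and year: only rows describing the same year are kept.
by_code_and_year = schools.merge(stats, on=["EDRPOU", "year"], how="inner")

print(len(by_code), len(by_code_and_year))
```

Which variant is appropriate depends on the question: statistics exist only for 2019-2020, so the same-year join drops all other school years.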


### Join Location and Students take tests tables

In [41]:
df_st_take_test_new = df_st_take_tests[df_st_take_tests['year'].notna()]

# Build a lookup from the 4-character KATOTTG prefix ("UA" + region digits) to the region name
df_location["KOATUU_2020"] = df_location["KOATUU_2020"].astype(str)
df_location["KATOTTG_2023"] = df_location["KATOTTG_2023"].astype(str)
df_location_prep = df_location.drop_duplicates(subset=['KATOTTG_2023'], keep='first').copy()
df_location_prep['KATOTTG_2023_region'] = df_location_prep['KATOTTG_2023'].str[:4]

# Regions and the capital; the explicit copy keeps the assignment below from
# raising SettingWithCopyWarning
region_df = df_location_prep[df_location_prep.category.isin(['region', 'capital'])][['KATOTTG_2023', 'english_name']].copy()
region_df['KATOTTG_2023_region'] = region_df['KATOTTG_2023'].str[:4]
region_df = region_df.drop(columns='KATOTTG_2023').reset_index(drop=True)

# Top-level "abroad" entries (district digits equal to "00")
abroad_df = df_location_prep[(df_location_prep.category == 'abroad') & (df_location_prep.KATOTTG_2023.str[4:6] == '00')][['KATOTTG_2023', 'english_name']].copy()
abroad_df['KATOTTG_2023_region'] = abroad_df['KATOTTG_2023'].str[:4]
abroad_df = abroad_df.drop(columns='KATOTTG_2023').reset_index(drop=True)

df_reg = pd.concat([region_df, abroad_df], ignore_index=True).rename(columns={'english_name': 'region_name'})

df_location_prep = df_location_prep.merge(df_reg, on='KATOTTG_2023_region', how='left')
df_location_centers = df_location_prep.rename(columns = {
    'KOATUU_2020':'KOATUU_2020_test_center', 
    'region_name':'region_name_test_center', 
    'KATOTTG_2023':'KATOTTG_2023_test_center', 
    'category':'category_test_center'
})

In [44]:
# Attach the region name and KOATUU code of each test center
center_cols = ['region_name_test_center', 'KOATUU_2020_test_center', 'KATOTTG_2023_test_center']
df_st_take_test_new = pd.merge(df_st_take_test_new, df_location_centers[center_cols], on='KATOTTG_2023_test_center', how='inner')[['outid', 'year'] + center_cols]
df_st_take_test_new.head()

Unnamed: 0,outid,year,region_name_test_center,KOATUU_2020_test_center,KATOTTG_2023_test_center
0,00000AC7-CDDE-4C77-B979-8B0351AF1305,2017,Odeska,5110137300,UA51100270010275193
1,0001B809-F52B-41A9-A3BA-19A6E45FA413,2017,Odeska,5110137300,UA51100270010275193
2,000392C7-3909-4F01-9649-965FAD28B2DA,2017,Odeska,5110137300,UA51100270010275193
3,0004d800-8bb3-4976-beb3-2cd267be6d2d,2016,Odeska,5110137300,UA51100270010275193
4,000b1230-0ea3-42f8-a4a6-9b6b8c0d4cb5,2020,Odeska,5110137300,UA51100270010275193
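An inner join silently drops test records whose test center code has no match in the Locations table. To see how many rows would be lost, one option is a left join with `indicator=True`. This is a sketch on toy frames: the code `UA99999999999999999` is a made-up value used only to produce an unmatched row.

```python
import pandas as pd

# Toy stand-ins for the test records and the prepared locations.
tests = pd.DataFrame({
    "outid": ["s1", "s2", "s3"],
    "KATOTTG_2023_test_center": [
        "UA51100270010275193",
        "UA51100270010275193",
        "UA99999999999999999",  # hypothetical code with no location row
    ],
})
centers = pd.DataFrame({
    "KATOTTG_2023_test_center": ["UA51100270010275193"],
    "region_name_test_center": ["Odeska"],
})

# The _merge column reports where each row found its match.
checked = tests.merge(centers, on="KATOTTG_2023_test_center", how="left", indicator=True)

# Rows that an inner join would have dropped.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))
```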


### Join Location and Students tables

In [45]:
# Build a lookup from the 4-character KATOTTG prefix ("UA" + region digits) to the region name
df_location["KOATUU_2020"] = df_location["KOATUU_2020"].astype(str)
df_location["KATOTTG_2023"] = df_location["KATOTTG_2023"].astype(str)
df_location_prep = df_location.drop_duplicates(subset=['KATOTTG_2023'], keep='first').copy()
df_location_prep['KATOTTG_2023_region'] = df_location_prep['KATOTTG_2023'].str[:4]

# Regions and the capital; the explicit copy keeps the assignment below from
# raising SettingWithCopyWarning
region_df = df_location_prep[df_location_prep.category.isin(['region', 'capital'])][['KATOTTG_2023', 'english_name']].copy()
region_df['KATOTTG_2023_region'] = region_df['KATOTTG_2023'].str[:4]
region_df = region_df.drop(columns='KATOTTG_2023').reset_index(drop=True)

# Top-level "abroad" entries (district digits equal to "00")
abroad_df = df_location_prep[(df_location_prep.category == 'abroad') & (df_location_prep.KATOTTG_2023.str[4:6] == '00')][['KATOTTG_2023', 'english_name']].copy()
abroad_df['KATOTTG_2023_region'] = abroad_df['KATOTTG_2023'].str[:4]
abroad_df = abroad_df.drop(columns='KATOTTG_2023').reset_index(drop=True)

df_reg = pd.concat([region_df, abroad_df], ignore_index=True).rename(columns={'english_name': 'region_name'})

df_location_prep = df_location_prep.merge(df_reg, on='KATOTTG_2023_region', how='left')
df_location_st = df_location_prep.rename(columns = {
    'KOATUU_2020':'KOATUU_2020_student', 
    'region_name':'region_name_student', 
    'category':'category_student'
})

In [49]:
# Attach the region name and KOATUU code of each student's home location
student_cols = ['region_name_student', 'KOATUU_2020_student', 'KATOTTG_2023']
df_students_new = pd.merge(df_students, df_location_st[student_cols], on='KATOTTG_2023', how='inner')[['outid', 'year'] + student_cols]
df_students_new.head()

Unnamed: 0,outid,year,region_name_student,KOATUU_2020_student,KATOTTG_2023
0,a99c6c63-aa70-4aec-ba42-370f7261e857,2016,Zaporizka,2323085101,UA23080270010078454
1,45f4177a-57a6-418f-aebe-652a64ced34d,2016,Zaporizka,2323085101,UA23080270010078454
2,c377271c-63de-40c7-9e35-82e5c256c6b4,2016,Zaporizka,2323085101,UA23080270010078454
3,c5cc9493-82b1-4d99-81dd-44d392dd0c4d,2016,Zaporizka,2323085101,UA23080270010078454
4,8674b86e-9931-48bf-a16a-507c1792f7e1,2016,Zaporizka,2323085101,UA23080270010078454
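Both join examples rebuild the same region lookup before merging. That preparation could be factored into one helper, as in the sketch below. The function name `build_region_lookup` is hypothetical, and the result should match `df_reg` only up to row order, since it selects regions and top-level abroad entries with a single mask instead of concatenating two frames.

```python
import pandas as pd


def build_region_lookup(locations: pd.DataFrame) -> pd.DataFrame:
    """Map each 4-character KATOTTG prefix to the English region name."""
    codes = locations["KATOTTG_2023"].astype(str)
    is_region = locations["category"].isin(["region", "capital"])
    is_abroad_root = (locations["category"] == "abroad") & (codes.str[4:6] == "00")

    lookup = locations.loc[is_region | is_abroad_root, ["english_name"]].copy()
    lookup["KATOTTG_2023_region"] = codes[is_region | is_abroad_root].str[:4]
    return lookup.rename(columns={"english_name": "region_name"}).reset_index(drop=True)


# Toy stand-in for the Locations table (rows taken from the sample output).
locations = pd.DataFrame({
    "KATOTTG_2023": ["UA01000000000013043", "UA01020000000022387", "UA01020010010075540"],
    "category": ["region", "district", "village"],
    "english_name": ["Avtonomna Respublika Krym", "Bakhchysaraiskyi", None],
})

region_lookup = build_region_lookup(locations)

# Every location inherits the name of the region its prefix points to.
prepared = locations.assign(
    KATOTTG_2023_region=locations["KATOTTG_2023"].str[:4]
).merge(region_lookup, on="KATOTTG_2023_region", how="left")
print(prepared[["KATOTTG_2023", "category", "region_name"]])
```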
