# Comparative Analysis of Literacy Assessment Scores of Native- and Foreign-Born Children Using PIRLS 2021 Data

I began this project wanting to look at refugee education outcomes in various countries, as a continuation of a previous research project on educational models for refugee children in Greece.

However, I wanted to use PIRLS data as opposed to another international student achievement database such as … because I am first and foremost concerned with literacy education as one of the capabilities with the potential to unlock many other capabilities, following Amartya Sen’s capabilities approach to living a good life. I believe this aspect of schooling to be of the utmost importance for refugee children in particular, as many refugee children face ‘an unknowable future’.

Unfortunately, PIRLS did not have data from Greece or other countries of interest.


# Importing, Organising and Merging Datasets

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [None]:
#TurkeyHC = pd.read_excel("data_folder/ASHTURR5.xlsx")

I chose to use Turkey, Austria, Germany, Egypt, France, Iran, Jordan, The Netherlands and Sweden because:

After an initial look at what the datasets contained, I chose to use the Home Context Survey data (answered by parents) and the Student Context Survey data (answered by students) as the source data for this examination. Helpfully, the Student Context data also included the (aggregated?) assessment scores for each child, so it was not necessary to use the original assessment score dataset.

The data were downloaded from [PIRLS](https://pirls2021.org/data/), where the SPSS files are separated by country code and survey type.

SOURCE: IEA’s Progress in International Reading Literacy Study – PIRLS 2021 Copyright © 2023 International Association for the Evaluation of Educational Achievement (IEA). Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education and Human Development, Boston College.

## Home Context Data

First I will read in the country files of the Home Context Data. This is data from a questionnaire given to the parents of each student. It contains crucial information on the immigration status of students.

In [2]:
TurkeyHC = pd.read_excel("ASHTURR5.xlsx")
AustriaHC = pd.read_excel("ASHAUTR5.xlsx")
GermanyHC = pd.read_excel("ASHDEUR5.xlsx")
EgyptHC = pd.read_excel("ASHEGYR5.xlsx")
FranceHC = pd.read_excel("ASHFRAR5.xlsx")
IranHC = pd.read_excel("ASHIRNR5.xlsx")
JordanHC = pd.read_excel("ASHJORR5.xlsx")
NetherlandsHC = pd.read_excel("ASHNLDR5.xlsx")
SwedenHC = pd.read_excel("ASHSWER5.xlsx")

In [None]:
# Alternative: read the same files from a data_folder subdirectory
# TurkeyHC = pd.read_excel("data_folder/ASHTURR5.xlsx")
# AustriaHC = pd.read_excel("data_folder/ASHAUTR5.xlsx")
# GermanyHC = pd.read_excel("data_folder/ASHDEUR5.xlsx")
# EgyptHC = pd.read_excel("data_folder/ASHEGYR5.xlsx")
# FranceHC = pd.read_excel("data_folder/ASHFRAR5.xlsx")
# IranHC = pd.read_excel("data_folder/ASHIRNR5.xlsx")
# JordanHC = pd.read_excel("data_folder/ASHJORR5.xlsx")
# NetherlandsHC = pd.read_excel("data_folder/ASHNLDR5.xlsx")
# SwedenHC = pd.read_excel("data_folder/ASHSWER5.xlsx")

In [3]:
# Concatenating all the Home Context Data sets into one
HCdf = pd.concat([TurkeyHC, AustriaHC, GermanyHC, EgyptHC, FranceHC, IranHC, JordanHC, NetherlandsHC, SwedenHC])
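The nine reads and the concatenation above could also be written as a loop over country codes. The following is only a sketch: `COUNTRY_CODES`, `pirls_filename` and `load_all` are hypothetical names introduced here for illustration, and the file-name pattern is assumed from the file names used in this notebook.

```python
import pandas as pd

# Country codes as they appear in the file names used in this notebook
COUNTRY_CODES = ["TUR", "AUT", "DEU", "EGY", "FRA", "IRN", "JOR", "NLD", "SWE"]


def pirls_filename(prefix, code):
    """Build a file name such as 'ASHTURR5.xlsx' (hypothetical helper)."""
    return f"{prefix}{code}R5.xlsx"


def load_all(prefix, codes=COUNTRY_CODES, reader=pd.read_excel):
    """Read one file per country and stack them into a single DataFrame.

    `reader` defaults to pd.read_excel; it is a parameter only so the
    function can be tried out without the real files.
    """
    frames = [reader(pirls_filename(prefix, code)) for code in codes]
    return pd.concat(frames)


# Usage, assuming the files sit in the working directory:
# HCdf = load_all("ASH")   # Home Context
# SCdf = load_all("ASG")   # Student Context
```

Keeping the codes in one list means a country can be added or dropped in a single place for both surveys.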

In [4]:
HCdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 5174
Columns: 118 entries, IDCNTRY to SCOPE
dtypes: float64(8), int64(6), object(104)
memory usage: 45.7+ MB


It is interesting that when the dataframes were concatenated, the Dtype of the columns changed from int to object. I believe this is because the data for some countries were entered as strings and for other countries as integers.
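The dtype change can be reproduced on toy data. This is a minimal sketch, assuming (as suggested above) that one country file stores an answer as integers and another stores it as text; the two small frames are made up and are not PIRLS records.

```python
import pandas as pd

# Toy stand-ins for two country files (invented values, not PIRLS data)
country_a = pd.DataFrame({"ASBH02A": [1, 2, 1]})                            # integer codes
country_b = pd.DataFrame({"ASBH02A": ["Yes", "No", "Yes"]}, dtype=object)  # text labels

combined = pd.concat([country_a, country_b])

print(country_a["ASBH02A"].dtype)  # an integer dtype
print(combined["ASBH02A"].dtype)   # object, because the values are now mixed

# Count how many values in the combined column are text
is_text = combined["ASBH02A"].map(lambda value: isinstance(value, str))
print(is_text.sum(), "of", len(combined), "values are strings")
```

Running the same `isinstance` check on a real column is one way to see which share of the rows arrived as text.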

I will now examine the columns in more detail, using the international codebook (Supplement 1) to understand which columns might be useful for my analysis. Those selected were:
- ASBH02A Was your child born in <country>?
- ASBH02B If no, how old was your child when he/she came to <country>?
- ASBH03A What language did your child speak before he/she began school? (language of test)
- ASBH04 How often does your child speak (language of test) at home?
- ASBH15A What is the highest level of education completed by the child's <parents/guardians>? <Parent/Guardian A>
- ASBH15B What is the highest level of education completed by the child's <parents/guardians>? <Parent/Guardian B>
- ASBH16 How far in his/her education do you expect your child to go?
- ASBH17A What kind of work do the child's <parents/guardians> do for their main jobs? <Parent/Guardian A>
- ASBH17B What kind of work do the child's <parents/guardians> do for their main jobs? <Parent/Guardian B>
- ASBH18AA Do the child's <parents/guardians> talk with the child in the following languages? <Parent/Guardian A> (language of test)
- ASBH18AB Do the child's <parents/guardians> talk with the child in the following languages? <Parent/Guardian B> (language of test)

In [5]:
HCcolumns_to_keep = ['IDCNTRY','IDSTUD','ASBH02A','ASBH02B','ASBH03A','ASBH04', 'ASBH15A','ASBH15B','ASBH16','ASBH17A','ASBH17B','ASBH18AA','ASBH18AB']

In [6]:
HCdf = HCdf[HCcolumns_to_keep]

In [7]:
HCdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 5174
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   IDCNTRY   50367 non-null  int64 
 1   IDSTUD    50367 non-null  int64 
 2   ASBH02A   41503 non-null  object
 3   ASBH02B   16269 non-null  object
 4   ASBH03A   41179 non-null  object
 5   ASBH04    41231 non-null  object
 6   ASBH15A   37405 non-null  object
 7   ASBH15B   34597 non-null  object
 8   ASBH16    40406 non-null  object
 9   ASBH17A   36503 non-null  object
 10  ASBH17B   32361 non-null  object
 11  ASBH18AA  40161 non-null  object
 12  ASBH18AB  35131 non-null  object
dtypes: int64(2), object(11)
memory usage: 5.4+ MB


## Student Context Data

Refugee children have distinct psychosocial needs at school, so it was important for me to look at their wellbeing as well as their assessment scores.

SC is the Student Context data, which comes from a questionnaire filled in by the students themselves and also includes some totals on their achievement in the test.

In [8]:
TurkeySC = pd.read_excel("ASGTURR5.xlsx")
AustriaSC = pd.read_excel("ASGAUTR5.xlsx")
EgyptSC = pd.read_excel("ASGEGYR5.xlsx")
FranceSC = pd.read_excel("ASGFRAR5.xlsx")
IranSC = pd.read_excel("ASGIRNR5.xlsx")
JordanSC = pd.read_excel("ASGJORR5.xlsx")
NetherlandsSC = pd.read_excel("ASGNLDR5.xlsx")
SwedenSC = pd.read_excel("ASGSWER5.xlsx")
GermanySC = pd.read_excel("ASGDEUR5.xlsx")

In [None]:
# Alternative: read the same files from a data_folder subdirectory
# TurkeySC = pd.read_excel("data_folder/ASGTURR5.xlsx")
# AustriaSC = pd.read_excel("data_folder/ASGAUTR5.xlsx")
# EgyptSC = pd.read_excel("data_folder/ASGEGYR5.xlsx")
# FranceSC = pd.read_excel("data_folder/ASGFRAR5.xlsx")
# IranSC = pd.read_excel("data_folder/ASGIRNR5.xlsx")
# JordanSC = pd.read_excel("data_folder/ASGJORR5.xlsx")
# NetherlandsSC = pd.read_excel("data_folder/ASGNLDR5.xlsx")
# SwedenSC = pd.read_excel("data_folder/ASGSWER5.xlsx")
# GermanySC = pd.read_excel("data_folder/ASGDEUR5.xlsx")

In [9]:
SCdf = pd.concat([TurkeySC, AustriaSC, EgyptSC, FranceSC, IranSC, JordanSC, NetherlandsSC, SwedenSC, GermanySC])

In [10]:
column_list = SCdf.columns.to_list()
# Join the list into a single string separated by ', '
# Format each column name with quotes
formatted_columns = ', '.join(f"'{col}'" for col in column_list)

# Print the formatted string
print(formatted_columns)

'IDCNTRY', 'IDPOP', 'IDGRADER', 'IDGRADE', 'WAVE', 'IDSCHOOL', 'IDCLASS', 'IDSTUD', 'ITSEX', 'ITADMINI', 'LCID_SA', 'LCID_SQ', 'ITLANG_SA', 'ITLANG_SQ', 'IDBOOK', 'ASBG01', 'ASBG03', 'ASBG04', 'ASBG05A', 'ASBG05B', 'ASBG05C', 'ASBG05D', 'ASBG05E', 'ASBG05F', 'ASBG05G', 'ASBG05H', 'ASBG05I', 'ASBG05J', 'ASBG05K', 'ASBG06', 'ASBG07A', 'ASBG07B', 'ASBG08A', 'ASBG08B', 'ASBG09A', 'ASBG09B', 'ASBG09C', 'ASBG09D', 'ASBG09E', 'ASBG09F', 'ASBG09G', 'ASBG09H', 'ASBG10A', 'ASBG10B', 'ASBG10C', 'ASBG10D', 'ASBG10E', 'ASBG10F', 'ASBG11A', 'ASBG11B', 'ASBG11C', 'ASBG11D', 'ASBG11E', 'ASBG11F', 'ASBG11G', 'ASBG11H', 'ASBG11I', 'ASBG11J', 'ASBR01A', 'ASBR01B', 'ASBR01C', 'ASBR01D', 'ASBR01E', 'ASBR01F', 'ASBR01G', 'ASBR01H', 'ASBR01I', 'ASBR02A', 'ASBR02B', 'ASBR02C', 'ASBR02D', 'ASBR02E', 'ASBR03A', 'ASBR03B', 'ASBR03C', 'ASBR04', 'ASBR05', 'ASBR06A', 'ASBR06B', 'ASBR07A', 'ASBR07B', 'ASBR07C', 'ASBR07D', 'ASBR07E', 'ASBR07F', 'ASBR07G', 'ASBR07H', 'ASBR08A', 'ASBR08B', 'ASBR08C', 'ASBR08D', 'ASBR08

As there were many *more* columns in the Student Context Data, it was helpful to divide these into groups.
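The groups in the next cells are typed out by hand. As an alternative sketch, the same kind of grouping can be derived from the column-name prefixes; `columns_with_prefix` is a hypothetical helper, and `example_columns` is a shortened stand-in for the full column list printed above.

```python
def columns_with_prefix(columns, prefixes):
    """Return the column names that start with any of the given prefixes."""
    # str.startswith accepts a tuple, so several prefixes can be checked at once
    return [col for col in columns if col.startswith(tuple(prefixes))]


# Shortened stand-in for SCdf.columns.to_list()
example_columns = [
    "IDCNTRY", "IDSTUD", "ASBG01", "ASBG10A", "ASBG10B",
    "ASBG11A", "ASRREA01", "ASRREA02", "ASRLIT01",
]

school_experience = columns_with_prefix(example_columns, ["ASBG10", "ASBG11"])
reading_scores = columns_with_prefix(example_columns, ["ASRREA"])

print(school_experience)
print(reading_scores)
```

Selecting by prefix avoids typos in long lists, although the hand-written lists make the exact selection easier to read at a glance.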

In [11]:
SCdemographic_info_columns = ['IDCNTRY','IDSTUD','ASBG01', 'ASBG03','ASDAGE']

In [12]:
SCexperience_in_school_columns = ['ASBG10A', 'ASBG10B', 'ASBG10C', 'ASBG10D', 'ASBG10E', 'ASBG10F', 'ASBG11A', 'ASBG11B', 'ASBG11C', 'ASBG11D', 'ASBG11E', 'ASBG11F', 'ASBG11G', 'ASBG11H', 'ASBG11I', 'ASBG11J']

In [13]:
SCassessment_score_columns = ['ASRREA01', 'ASRREA02', 'ASRREA03', 'ASRREA04', 'ASRREA05', 'ASRLIT01', 'ASRLIT02', 'ASRLIT03', 'ASRLIT04', 'ASRLIT05', 'ASRINF01', 'ASRINF02', 'ASRINF03', 'ASRINF04', 'ASRINF05', 'ASRIIE01', 'ASRIIE02', 'ASRIIE03', 'ASRIIE04', 'ASRIIE05', 'ASRRSI01', 'ASRRSI02', 'ASRRSI03', 'ASRRSI04', 'ASRRSI05']

In [14]:
allSCcolumns = SCdemographic_info_columns + SCexperience_in_school_columns + SCassessment_score_columns

In [15]:
SCdf = SCdf[allSCcolumns]

## Merging Home Context and Student Context Data

In [16]:
# Merge on the two identifier columns shared by both datasets
df = pd.merge(HCdf, SCdf, on=['IDCNTRY', 'IDSTUD'])

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 50366
Data columns (total 57 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   IDCNTRY   50367 non-null  int64  
 1   IDSTUD    50367 non-null  int64  
 2   ASBH02A   41503 non-null  object 
 3   ASBH02B   16269 non-null  object 
 4   ASBH03A   41179 non-null  object 
 5   ASBH04    41231 non-null  object 
 6   ASBH15A   37405 non-null  object 
 7   ASBH15B   34597 non-null  object 
 8   ASBH16    40406 non-null  object 
 9   ASBH17A   36503 non-null  object 
 10  ASBH17B   32361 non-null  object 
 11  ASBH18AA  40161 non-null  object 
 12  ASBH18AB  35131 non-null  object 
 13  ASBG01    49428 non-null  object 
 14  ASBG03    48168 non-null  object 
 15  ASDAGE    50358 non-null  float64
 16  ASBG10A   48311 non-null  object 
 17  ASBG10B   48077 non-null  object 
 18  ASBG10C   47893 non-null  object 
 19  ASBG10D   47822 non-null  object 
 20  ASBG10E   47844 non-null  ob

Our new dataset has 57 columns and 50367 rows of data. Many columns have object dtype, indicating they may contain string values. This will be examined further in the next stage of the process (Cleaning and Filtering the Data).
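One way to check that a merge like this pairs each home questionnaire with exactly one student record is to use the `validate` and `indicator` options of `pd.merge`. The sketch below uses invented toy frames, not PIRLS records, and assumes that a student is identified by the combination of `IDCNTRY` and `IDSTUD`.

```python
import pandas as pd

# Toy frames (invented values): note that IDSTUD 1 appears in two countries
home = pd.DataFrame({
    "IDCNTRY": [40, 40, 792],
    "IDSTUD": [1, 2, 1],
    "ASBH02A": ["Yes", "No", "Yes"],
})
student = pd.DataFrame({
    "IDCNTRY": [40, 40, 792],
    "IDSTUD": [1, 2, 1],
    "ASDAGE": [10.2, 9.8, 10.5],
})

merged = pd.merge(
    home,
    student,
    on=["IDCNTRY", "IDSTUD"],
    how="outer",             # keep unmatched rows so they can be counted
    validate="one_to_one",   # raises MergeError if a key is duplicated on either side
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

print(merged["_merge"].value_counts())
```

If every row is marked `both`, no student was lost or duplicated by the merge. Merging on `IDSTUD` alone would fail the one-to-one check in this toy example, because the same student number occurs in two countries.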

In [18]:
df.to_excel('data99.xlsx')