# Literacy Outcomes of Native- and Foreign-born children - Importing, Organising and Merging Datasets

## Introduction

I have been concerned with the Syrian Civil War and the ensuing refugee crisis since the war began in 2011, when I was a student of Arabic at the University of Edinburgh. Since then, I have worked directly in Egypt, Lebanon and Greece with people displaced by this and other conflicts. Whilst working with unaccompanied asylum-seeking children in Greece, I observed first-hand the country's struggles to integrate asylum-seeking children into its national school system, and ultimately focused on this for my [Master's dissertation](https://sophieespencer.wordpress.com/2019/09/02/growth-unlocked/). In this dissertation, I identified the specific learning needs of refugee and asylum-seeking children as psycho-social support, protection from discrimination and bullying, and language acquisition support. I also reviewed how other countries with high numbers of asylum-seeking children were addressing this challenge.

Subsequently, I wanted to undertake a comparative investigation of countries with high numbers of asylum-seeking children to understand which countries were most successful at integrating foreign-born children into their national school systems. In the following analysis, I use data from the IEA’s Progress in International Reading Literacy Study 2021 (PIRLS 2021), an international assessment of fourth-grade students’ reading abilities conducted across 57 countries, to assess literacy outcomes for native- and foreign-born children across nine countries with significant refugee populations.

The goal of this study is to serve as a preliminary quantitative investigation to identify countries that demonstrate best practice in integrating foreign-born children into host-country school systems, so that further targeted investigations into successful policies and practices can be made. 

## Research Questions

This study is guided by the following research questions:

- How do average literacy scores compare across countries with significant refugee populations?

- Are there significant differences in literacy scores between native- and foreign-born children in these countries?

- How does the age at which foreign-born children arrive in a host country affect their literacy outcomes?

## Importing the Data

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

## About PIRLS

PIRLS (Progress in International Reading Literacy Study) is a global assessment of reading proficiency among fourth graders, conducted every five years since its inception in 2001. In the 2021 cycle, nearly 400,000 children from 57 countries participated. Beyond reading assessments, PIRLS includes context questionnaires for students, parents, teachers, and schools, offering invaluable insights into the realities of learning environments.

For more information on PIRLS, please see the [IEA website](https://www.iea.nl/studies/iea/pirls).

## Selecting Data for Analysis

To keep the analysis manageable, I limited it to a small set of countries, initially aiming for around 10. I settled on nine: Turkey, Austria, Germany, Egypt, France, Iran, Jordan, the Netherlands and Sweden. All host high numbers of Syrian and Afghan refugees, two groups I had previously worked closely with.

While I was hoping to include Greece as a continuation of research I undertook for my dissertation, Greece did not take part in PIRLS 2021. Additionally, although England did participate, it did not conduct the Home Context questionnaire, which reveals a child's nativity status. 

After an initial review of the datasets using the codebooks available [here](https://pirls2021.org/data/), I chose to use the Home Context Survey data (answered by parents) and the Student Context Survey data (answered by students) as the primary data sources for this examination. Conveniently, the Student Context Data included the aggregated assessment scores for each child, making it unnecessary to use the original assessment score dataset.

I then downloaded the 18 SPSS files (the Home and Student Context data for each of the 9 countries) and converted them to Excel.

SOURCE: IEA’s Progress in International Reading Literacy Study – PIRLS 2021 Copyright © 2023 International Association for the Evaluation of Educational Achievement (IEA). Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education and Human Development, Boston College.

## Home Context Data

Next I read in the country files of the Home Context Data. This is data from a questionnaire given to the parents of each student.

In [2]:
TurkeyHC = pd.read_excel("ASHTURR5.xlsx")
AustriaHC = pd.read_excel("ASHAUTR5.xlsx")
GermanyHC = pd.read_excel("ASHDEUR5.xlsx")
EgyptHC = pd.read_excel("ASHEGYR5.xlsx")
FranceHC = pd.read_excel("ASHFRAR5.xlsx")
IranHC = pd.read_excel("ASHIRNR5.xlsx")
JordanHC = pd.read_excel("ASHJORR5.xlsx")
NetherlandsHC = pd.read_excel("ASHNLDR5.xlsx")
SwedenHC = pd.read_excel("ASHSWER5.xlsx")
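The nine `read_excel` calls above follow a single naming pattern, so the same step could also be written as a loop. The following is a minimal sketch rather than the code used in this notebook: the helper names `pirls_filename` and `load_country_files` are hypothetical, and it assumes the converted files sit in the working directory under the names shown above.

```python
import pandas as pd

# Three-letter country codes taken from the file names used above
COUNTRY_CODES = ["TUR", "AUT", "DEU", "EGY", "FRA", "IRN", "JOR", "NLD", "SWE"]


def pirls_filename(prefix: str, code: str) -> str:
    """Build a file name such as 'ASHTURR5.xlsx' (hypothetical helper)."""
    return f"{prefix}{code}R5.xlsx"


def load_country_files(prefix: str, codes=COUNTRY_CODES) -> dict:
    """Read one Excel file per country into a dict keyed by country code."""
    return {code: pd.read_excel(pirls_filename(prefix, code)) for code in codes}


# File names that would be read for the Home Context data
home_context_files = [pirls_filename("ASH", code) for code in COUNTRY_CODES]
print(home_context_files)
```

Calling `load_country_files("ASH")` would then return the nine Home Context dataframes, and `pd.concat(list(...values()))` would combine them.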

I then concatenated these separate dataframes into one. 

In [3]:
# Concatenating all the Home Context Data sets into one
HCdf = pd.concat([TurkeyHC, AustriaHC, GermanyHC, EgyptHC, FranceHC, IranHC, JordanHC, NetherlandsHC, SwedenHC])

In [4]:
HCdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 5174
Columns: 118 entries, IDCNTRY to SCOPE
dtypes: float64(8), int64(6), object(104)
memory usage: 45.7+ MB
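The index range in the output above (`0 to 5174` for 50367 entries) shows that each country's rows kept their original index, so index labels repeat across countries. This does not affect the steps in this notebook, but label-based lookups on such an index would return several rows. A minimal sketch with two made-up dataframes (the values are illustrative, not PIRLS data) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two toy dataframes standing in for two country files (values are made up)
a = pd.DataFrame({"IDCNTRY": [1, 1], "IDSTUD": [101, 102]})
b = pd.DataFrame({"IDCNTRY": [2, 2], "IDSTUD": [101, 102]})

kept = pd.concat([a, b])                      # index labels 0, 1, 0, 1
reset = pd.concat([a, b], ignore_index=True)  # index labels 0, 1, 2, 3

print(kept.index.is_unique)   # False
print(reset.index.is_unique)  # True
```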


I then wanted to scale down this large dataframe by removing columns that would not be used in further analysis. I used the international Supplement 1 from the PIRLS 2021 documentation to select the columns to keep.

Those selected were:
- ASBH02A - Was your child born in (country)?
- ASBH02B - If No, how old was your child when he/she came to (country)?
- ASBH03A - What language did your child speak before he/she began school? (language of test)
- ASBH04 - How often does your child speak (language of test) at home?
- ASBH15A - What is the highest level of education completed by the child's (parents/guardians)? (Parent/Guardian A)
- ASBH15B - What is the highest level of education completed by the child's (parents/guardians)? (Parent/Guardian B)
- ASBH16 - How far in his/her education do you expect your child to go?
- ASBH17A - What kind of work do the child's (parents/guardians) do for their main jobs? (Parent/Guardian A)
- ASBH17B - What kind of work do the child's (parents/guardians) do for their main jobs? (Parent/Guardian B)
- ASBH18AA - Do the child's (parents/guardians) talk with the child in the following languages? (Parent/Guardian A) (language of test)
- ASBH18AB - Do the child's (parents/guardians) talk with the child in the following languages? (Parent/Guardian B) (language of test)

In [5]:
HCcolumns_to_keep = ['IDCNTRY','IDSTUD','ASBH02A','ASBH02B','ASBH03A','ASBH04', 'ASBH15A','ASBH15B','ASBH16','ASBH17A','ASBH17B','ASBH18AA','ASBH18AB']

In [6]:
HCdf = HCdf[HCcolumns_to_keep]
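Selecting with a list of names raises a `KeyError` if any name is misspelled or absent from the data. Checking the names first gives a clearer message. This is a sketch on a made-up one-row dataframe; `toy` and `missing` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy dataframe with a few of the real column names plus one to be dropped
toy = pd.DataFrame(
    {"IDCNTRY": [1], "IDSTUD": [101], "ASBH02A": ["Yes"], "SCOPE": [0]}
)
columns_to_keep = ["IDCNTRY", "IDSTUD", "ASBH02A"]

# Names requested but not present in the dataframe
missing = [col for col in columns_to_keep if col not in toy.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

toy = toy[columns_to_keep]
print(toy.columns.to_list())  # ['IDCNTRY', 'IDSTUD', 'ASBH02A']
```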

In [7]:
HCdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 5174
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   IDCNTRY   50367 non-null  int64 
 1   IDSTUD    50367 non-null  int64 
 2   ASBH02A   41503 non-null  object
 3   ASBH02B   16269 non-null  object
 4   ASBH03A   41179 non-null  object
 5   ASBH04    41231 non-null  object
 6   ASBH15A   37405 non-null  object
 7   ASBH15B   34597 non-null  object
 8   ASBH16    40406 non-null  object
 9   ASBH17A   36503 non-null  object
 10  ASBH17B   32361 non-null  object
 11  ASBH18AA  40161 non-null  object
 12  ASBH18AB  35131 non-null  object
dtypes: int64(2), object(11)
memory usage: 5.4+ MB


## Student Context Data

I repeated the process above for the Student Context Data.

In [8]:
TurkeySC = pd.read_excel("ASGTURR5.xlsx")
AustriaSC = pd.read_excel("ASGAUTR5.xlsx")
EgyptSC = pd.read_excel("ASGEGYR5.xlsx")
FranceSC = pd.read_excel("ASGFRAR5.xlsx")
IranSC = pd.read_excel("ASGIRNR5.xlsx")
JordanSC = pd.read_excel("ASGJORR5.xlsx")
NetherlandsSC = pd.read_excel("ASGNLDR5.xlsx")
SwedenSC = pd.read_excel("ASGSWER5.xlsx")
GermanySC = pd.read_excel("ASGDEUR5.xlsx")

In [9]:
SCdf = pd.concat([TurkeySC, AustriaSC, EgyptSC, FranceSC, IranSC, JordanSC, NetherlandsSC, SwedenSC, GermanySC])

In [10]:
column_list = SCdf.columns.to_list()
# Join the list into a single string separated by ', '
# Format each column name with quotes
formatted_columns = ', '.join(f"'{col}'" for col in column_list)

# Print the formatted string
print(formatted_columns)

'IDCNTRY', 'IDPOP', 'IDGRADER', 'IDGRADE', 'WAVE', 'IDSCHOOL', 'IDCLASS', 'IDSTUD', 'ITSEX', 'ITADMINI', 'LCID_SA', 'LCID_SQ', 'ITLANG_SA', 'ITLANG_SQ', 'IDBOOK', 'ASBG01', 'ASBG03', 'ASBG04', 'ASBG05A', 'ASBG05B', 'ASBG05C', 'ASBG05D', 'ASBG05E', 'ASBG05F', 'ASBG05G', 'ASBG05H', 'ASBG05I', 'ASBG05J', 'ASBG05K', 'ASBG06', 'ASBG07A', 'ASBG07B', 'ASBG08A', 'ASBG08B', 'ASBG09A', 'ASBG09B', 'ASBG09C', 'ASBG09D', 'ASBG09E', 'ASBG09F', 'ASBG09G', 'ASBG09H', 'ASBG10A', 'ASBG10B', 'ASBG10C', 'ASBG10D', 'ASBG10E', 'ASBG10F', 'ASBG11A', 'ASBG11B', 'ASBG11C', 'ASBG11D', 'ASBG11E', 'ASBG11F', 'ASBG11G', 'ASBG11H', 'ASBG11I', 'ASBG11J', 'ASBR01A', 'ASBR01B', 'ASBR01C', 'ASBR01D', 'ASBR01E', 'ASBR01F', 'ASBR01G', 'ASBR01H', 'ASBR01I', 'ASBR02A', 'ASBR02B', 'ASBR02C', 'ASBR02D', 'ASBR02E', 'ASBR03A', 'ASBR03B', 'ASBR03C', 'ASBR04', 'ASBR05', 'ASBR06A', 'ASBR06B', 'ASBR07A', 'ASBR07B', 'ASBR07C', 'ASBR07D', 'ASBR07E', 'ASBR07F', 'ASBR07G', 'ASBR07H', 'ASBR08A', 'ASBR08B', 'ASBR08C', 'ASBR08D', 'ASBR08

As there were many more columns in the Student Context Data, it was helpful to identify data groupings and divide the columns into these groups.

In [11]:
SCdemographic_info_columns = ['IDCNTRY','IDSTUD','ASBG01', 'ASBG03','ASDAGE']

In [12]:
SCexperience_in_school_columns = ['ASBG10A', 'ASBG10B', 'ASBG10C', 'ASBG10D', 'ASBG10E', 'ASBG10F', 'ASBG11A', 'ASBG11B', 'ASBG11C', 'ASBG11D', 'ASBG11E', 'ASBG11F', 'ASBG11G', 'ASBG11H', 'ASBG11I', 'ASBG11J']

In [13]:
SCassessment_score_columns = ['ASRREA01', 'ASRREA02', 'ASRREA03', 'ASRREA04', 'ASRREA05', 'ASRLIT01', 'ASRLIT02', 'ASRLIT03', 'ASRLIT04', 'ASRLIT05', 'ASRINF01', 'ASRINF02', 'ASRINF03', 'ASRINF04', 'ASRINF05', 'ASRIIE01', 'ASRIIE02', 'ASRIIE03', 'ASRIIE04', 'ASRIIE05', 'ASRRSI01', 'ASRRSI02', 'ASRRSI03', 'ASRRSI04', 'ASRRSI05']

The columns beginning with the prefix 'ASBG10' record feelings about school, in response to the question:

What do you think about your school? Tell how much you agree with these statements.


A: I like being in school\
B: I feel safe when I am at school\
C: I feel like I belong at this school\
D: Teachers at my school are fair to me\
E: I am proud to go to this school\
F: I have friends at this school\

1 = Agree a lot\
2 = Agree a little\
3 = Disagree a little\
4 = Disagree a lot

The columns beginning with the prefix 'ASBG11' record responses to the question:

During this year, how often have other students from your school done any of the following things to you, including through texting or the internet?


A: Made fun of me or called me names\
B: Left me out of their games or activities\
C: Spread lies about me\
D: Stole something from me\
E: Damaged something of mine on purpose\
F: Hit or hurt me (e.g., shoving, hitting, kicking)\
G: Made me do things I didn’t want to do\
H: Sent me nasty or hurtful messages online\
I: Shared nasty or hurtful information about me online\
J: Threatened me


1 = At least once a week\
2 = Once or twice a month\
3 = A few times a year\
4 = Never
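Because the two question blocks are distinguished by the `ASBG10` and `ASBG11` prefixes in the column list above, the groups can also be picked out by prefix, and the response codes turned into readable labels. The sketch below uses a made-up dataframe and assumes the responses are stored as the numeric codes 1 to 4 listed above; in the converted Excel files they may be stored differently.

```python
import pandas as pd

# Toy data: two students with made-up responses
toy = pd.DataFrame({
    "IDSTUD": [101, 102],
    "ASBG10A": [1, 4],
    "ASBG10B": [2, 3],
    "ASBG11A": [4, 1],
})

# Select each group of columns by prefix instead of typing every name
school_cols = [c for c in toy.columns if c.startswith("ASBG10")]
bullying_cols = [c for c in toy.columns if c.startswith("ASBG11")]

agreement_labels = {1: "Agree a lot", 2: "Agree a little",
                    3: "Disagree a little", 4: "Disagree a lot"}
frequency_labels = {1: "At least once a week", 2: "Once or twice a month",
                    3: "A few times a year", 4: "Never"}

# Replace the numeric codes with their labels, one column at a time
labelled = toy.copy()
for col in school_cols:
    labelled[col] = labelled[col].map(agreement_labels)
for col in bullying_cols:
    labelled[col] = labelled[col].map(frequency_labels)

print(labelled)
```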

Refugee children have distinct psycho-social needs at school, so it was important for me to examine their wellbeing as well as their assessment scores.

SC refers to the Student Context data: a questionnaire filled in by the students themselves, together with summary scores for their achievement on the test.

In [14]:
allSCcolumns = SCdemographic_info_columns + SCexperience_in_school_columns + SCassessment_score_columns

In [15]:
SCdf = SCdf[allSCcolumns]

## Merging Home Context and Student Context Data

I then merged the two dataframes with `on=None`. This is the default, and it tells pandas to use the columns present in both dataframes, 'IDCNTRY' and 'IDSTUD', as the keys for merging.

In [16]:
df = pd.merge(HCdf, SCdf, on=None)
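Naming the keys explicitly gives the same result and makes the intent visible to the reader. The sketch below uses two made-up dataframes (not PIRLS data) to show the equivalence; the `validate="one_to_one"` argument is an optional extra that makes pandas raise an error if a key pair appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the Home Context and Student Context dataframes.
# IDSTUD 101 appears in two countries, so both keys are needed.
home = pd.DataFrame({"IDCNTRY": [1, 1, 2], "IDSTUD": [101, 102, 101],
                     "ASBH02A": ["Yes", "No", "Yes"]})
student = pd.DataFrame({"IDCNTRY": [1, 1, 2], "IDSTUD": [101, 102, 101],
                        "ASDAGE": [10.2, 9.8, 10.5]})

# on=None: pandas uses the columns common to both dataframes
implicit = pd.merge(home, student, on=None)

# Same merge with the keys spelled out and a uniqueness check
explicit = pd.merge(home, student, on=["IDCNTRY", "IDSTUD"],
                    how="inner", validate="one_to_one")

print(implicit.equals(explicit))  # True
```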

I then inspected the merge to make sure it had worked successfully.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50367 entries, 0 to 50366
Data columns (total 57 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   IDCNTRY   50367 non-null  int64  
 1   IDSTUD    50367 non-null  int64  
 2   ASBH02A   41503 non-null  object 
 3   ASBH02B   16269 non-null  object 
 4   ASBH03A   41179 non-null  object 
 5   ASBH04    41231 non-null  object 
 6   ASBH15A   37405 non-null  object 
 7   ASBH15B   34597 non-null  object 
 8   ASBH16    40406 non-null  object 
 9   ASBH17A   36503 non-null  object 
 10  ASBH17B   32361 non-null  object 
 11  ASBH18AA  40161 non-null  object 
 12  ASBH18AB  35131 non-null  object 
 13  ASBG01    49428 non-null  object 
 14  ASBG03    48168 non-null  object 
 15  ASDAGE    50358 non-null  float64
 16  ASBG10A   48311 non-null  object 
 17  ASBG10B   48077 non-null  object 
 18  ASBG10C   47893 non-null  object 
 19  ASBG10D   47822 non-null  object 
 20  ASBG10E   47844 non-null  ob

The merge was successful: the merged dataframe has the same number of rows as the Home Context dataframe (50367) and the expected number of columns (57, which is the 13 Home Context columns plus the 46 Student Context columns, minus the 2 shared key columns). Many columns have object dtype, indicating they may contain string values. This will be examined further in the next stage of the process (Cleaning and Filtering the Data).
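The column count can be checked by hand from the column lists defined earlier. The sketch below rebuilds those lists (the variable names here are illustrative) and counts the columns a merge on the shared keys should produce.

```python
# Column lists rebuilt from the cells above
hc_columns = ['IDCNTRY', 'IDSTUD', 'ASBH02A', 'ASBH02B', 'ASBH03A', 'ASBH04',
              'ASBH15A', 'ASBH15B', 'ASBH16', 'ASBH17A', 'ASBH17B',
              'ASBH18AA', 'ASBH18AB']

demographic = ['IDCNTRY', 'IDSTUD', 'ASBG01', 'ASBG03', 'ASDAGE']
experience = ([f"ASBG10{letter}" for letter in "ABCDEF"]
              + [f"ASBG11{letter}" for letter in "ABCDEFGHIJ"])
assessment = [f"{prefix}{i:02d}"
              for prefix in ("ASRREA", "ASRLIT", "ASRINF", "ASRIIE", "ASRRSI")
              for i in range(1, 6)]
sc_columns = demographic + experience + assessment

# Key columns appear once in the merged result, so subtract the overlap
shared = set(hc_columns) & set(sc_columns)
expected = len(hc_columns) + len(sc_columns) - len(shared)

print(sorted(shared), expected)  # ['IDCNTRY', 'IDSTUD'] 57
```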

In [18]:
df.to_excel('data99.xlsx')