# Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.
* Investigate the types of variables involved, their likely distributions, and their
relationships with each other.
* Synthesise/simulate a data set as closely matching their properties as possible.
* Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

Initial thoughts - Only a fraction of international students in China reach HSK 5 or 6.

Despite the large number of international students in China, only a minority attempt a standardized test (HSK). Even then, most of those who do sit HSK 4 or HSK 5, with a noticeable drop-off at HSK 6. This is likely because HSK 5 is an entry requirement for most college degrees in China, so anything below HSK 4 or 5 is seen as having little value, while the steep increase in difficulty between levels makes HSK 6 a large commitment. Of those who do attempt the HSK, the pass rate is quite high, likely due to the preparation classes offered as extras in Chinese universities and schools.

I will also be highlighting the distribution of countries that students come from, with South Korea being the clear leader.


## Variables

* HSK level (see if I can get statistics on how many are awarded)
* Education background
* Origin country
* Funding of study
* Level of program they enroll into
* Hours of study
* Scores

## Country of Origin

It turns out I only have access to figures for the top 15 countries from 2018:

* South Korea	50,600
* Thailand	28,608
* Pakistan	28,023
* India	23,198
* United States	20,996
* Russia	19,239
* Indonesia	15,050
* Laos	14,645
* Japan	14,230
* Kazakhstan	11,784
* Vietnam	11,299
* Bangladesh	10,735
* France	10,695
* Mongolia	10,158
* Malaysia	9,479

https://www.researchcghe.org/perch/resources/publications/to-publish-wp46.pdf

The same paper also lists the top 10 countries for 2000-2016 and the total number of international students. As I have HSK test data from 2012, following these proportions would let me estimate how many American students sat each of the tests. Using a normal distribution of scores based on the earlier paper, I could then simulate which students sat the HSK, at what level and with what score.

It also gives the percentage of students enrolled in full-time degrees, etc. I know most degrees require at least HSK 5 to enrol, so I could extrapolate from this to make an educated guess at the percentage of students in Chinese language undergraduate degrees, as those degrees would not require HSK certification to enrol.

It also shows how many students were receiving scholarships until 2013, and what proportion were for non-degree students. If I were to assume it grew at about the same rate as overall international students, and that it's shared proportionally between students from various countries, I could look at who was self-funded versus on scholarship.



In 2018 there were 492,185 International Students in China (http://global.chinadaily.com.cn/a/201904/12/WS5cb05c3ea3104842260b5eed.html#:~:text=Almost%20500%2C000%20international%20students%20studied,ministry%20said%20in%20a%20statement.)

Based on that number and the breakdown above, 278,739 students came from those 15 countries, so 213,446 came from the "Rest of the World". At present I do not have a way to break these down further.




I've since found that there were 81,562 African students in China in 2018 (https://www.studyinternational.com/news/african-students-china-alienated/). There's no breakdown by country though. This would bring rest of the world down to 131884.

* Rest of the World 131,884
* Africa 81,562
* South Korea	50,600
* Thailand	28,608
* Pakistan	28,023
* India	23,198
* United States	20,996
* Russia	19,239
* Indonesia	15,050
* Laos	14,645
* Japan	14,230
* Kazakhstan	11,784
* Vietnam	11,299
* Bangladesh	10,735
* France	10,695
* Mongolia	10,158
* Malaysia	9,479
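
The arithmetic behind these totals can be double-checked with a few lines of Python. This is only a sanity check on the figures quoted above; the variable names are my own.

```python
# Student counts for the top 15 sending countries in 2018 (from the list above)
top_15_counts = [50600, 28608, 28023, 23198, 20996, 19239, 15050, 14645,
                 14230, 11784, 11299, 10735, 10695, 10158, 9479]

total_students_2018 = 492185   # all international students in China, 2018
african_students_2018 = 81562  # African students in China, 2018

top_15_total = sum(top_15_counts)
outside_top_15 = total_students_2018 - top_15_total
rest_of_world_students = outside_top_15 - african_students_2018

print(top_15_total, outside_top_15, rest_of_world_students)
```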

Going to work out proportions from each country

World Population Dataset from https://data.worldbank.org/indicator/SP.POP.TOTL

## Selecting African and Rest of the World countries.

I'm going to use data from the World Bank to find out what proportion of the African population each African country has, and I'll use this as the probability of a student coming from that country. This isn't a perfect measure: in the real world there would be political and academic exchanges with particular countries, and some countries are more likely to have a population that can afford self-funded study in China. Nonetheless, I prefer this route as I do not want the 54 countries of Africa treated as one unit.

I will also be doing the same for the Rest of the World. This comes with the same caveat as for the African countries: population shares do not reflect academic or political exchanges. That said, the countries with large populations that are not in the top 15 or in Africa do tend to have populations that can afford to study abroad. For example, Germany is one of the remaining countries with a larger population, so its probability will be higher than others', and in real life its population is also likely to be better able to afford self-funded study in China.

In [1]:
#Can alter this to change the school size
school_size = 2000

seed = 777

In [2]:
import pandas as pd
import numpy as np

#read_csv documentation https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html
#I've edited this dataset to remove aggregations (e.g. Eurozone, World).
#as well as China, Hong Kong and Macau - as these would not be considered International Students.
populations_df = pd.read_csv('world_populations.csv', usecols = ['Country Name', 'Country Code', '2019'])

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
populations_df.dropna(inplace=True)

populations_df['2019'] = populations_df['2019'].astype('int')


In [3]:
#Country Codes
african_country_codes = ['DZA','AGO','BWA','IOT','BDI','CMR','CPV','CAF','TCD',
                     'COM','MYT','COG','COD','BEN','GNQ','ETH','ERI','ATF',
                     'DJI','GAB','GMB','GHA','GIN','CIV','KEN','LSO','LBR',
                     'LBY','MDG','MWI','MLI','MRT','MUS','MAR','MOZ','NAM',
                     'NER','NGA','GNB','REU','RWA','SHN','STP','SEN','SYC',
                     'SLE','SOM','ZAF','ZWE','SSD','SDN','ESH','SWZ','TGO',
                     'TUN','UGA','EGY','TZA','BFA','ZMB']

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html
#.copy() makes this an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
african_countries = populations_df.loc[populations_df['Country Code'].isin(african_country_codes)].copy()

#Get the sum of african population to work out proportions per country in a moment.
africa_total_population = african_countries['2019'].sum()

african_countries['Proportion of African Population'] = african_countries['2019'] / africa_total_population

african_country = np.array(african_countries['Country Name'])
african_probabilities = np.array(african_countries['Proportion of African Population'])


In [4]:
african_probabilities

array([3.30460644e-02, 2.44280173e-02, 9.05816334e-03, 1.76823970e-03,
       1.55980007e-02, 8.85048222e-03, 4.22111458e-04, 1.98618318e-02,
       3.64224310e-03, 1.22402813e-02, 6.53111241e-04, 6.66174962e-02,
       4.12989549e-03, 1.97391471e-02, 7.47271643e-04, 7.70544808e-02,
       1.04080887e-03, 8.81265656e-04, 8.60278326e-02, 1.66759797e-03,
       1.80201952e-03, 2.33477148e-02, 9.80277537e-03, 1.47443459e-03,
       4.03539989e-02, 1.63128365e-03, 3.78976086e-03, 5.20214234e-03,
       2.07007255e-02, 1.42987945e-02, 1.50888380e-02, 3.47377079e-03,
       9.71516844e-04, 2.79944932e-02, 2.33079395e-02, 1.91471664e-03,
       1.78925144e-02, 1.54252844e-01, 9.69201865e-03, 1.65069693e-04,
       1.25085364e-02, 7.49336396e-05, 5.99715889e-03, 1.18534502e-02,
       4.49473424e-02, 8.49090283e-03, 3.28619897e-02, 4.45230265e-02,
       6.20375007e-03, 8.97646975e-03, 3.39798392e-02, 1.37095210e-02,
       1.12413646e-02])

In [5]:
#Need to remove the top 15 countries from rest of the world too

top_15_country_codes = ['KOR', 'THA', 'PAK', 'IND', 'USA', 'RUS', 'IDN', 'LAO', 'JPN', 'KAZ', 'VNM', 'BGD', 'FRA', 'MNG', 'MYS']

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html
top_15_countries = populations_df.loc[populations_df['Country Code'].isin(top_15_country_codes)]

#Get the sum of the top 15 countries' populations so it can be removed from the world total.
top_15_total_population = top_15_countries['2019'].sum()

top_15_total_population

2961247792

In [6]:
#Now to work out the rest of the world

#World Population without China, Hong Kong & Macau, minus the African population, minus the top 15 countries
rest_of_world_population = 6267671127 - africa_total_population - top_15_total_population

#Exclude the African and top 15 countries; .copy() makes the result safe to add a column to
rest_of_world = populations_df.loc[~populations_df['Country Code'].isin(african_country_codes + top_15_country_codes)].copy()


rest_of_world['Proportion of World Population'] = rest_of_world['2019'] / rest_of_world_population

rest_of_world.head()

rest_of_world_country = np.array(rest_of_world['Country Name'])
rest_of_world_probabilities = np.array(rest_of_world['Proportion of World Population'])

#Probabilities do not add up to 1.0 right now, likely due to my removal of countries. 
#Need to plug the gap, and will assign it as the country "Other"
other = 1 - sum(rest_of_world_probabilities)

#Add "Other" into the selection and probabilities
rest_of_world_country = np.insert(rest_of_world_country, -1, 'Other')
rest_of_world_probabilities = np.insert(rest_of_world_probabilities, -1,  other)
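
An alternative to plugging the gap with "Other" would be to divide each population by the total of the rows actually kept, which makes the probabilities sum to 1 by construction. The sketch below shows the idea on a made-up four-country table; `toy_df`, `excluded_codes` and the population figures are hypothetical stand-ins, not the World Bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for populations_df; the figures are made up
toy_df = pd.DataFrame({
    'Country Name': ['A', 'B', 'C', 'D'],
    'Country Code': ['AAA', 'BBB', 'CCC', 'DDD'],
    '2019': [50, 30, 15, 5],
})
excluded_codes = ['AAA']  # plays the role of the African and top 15 codes

# .copy() gives an independent frame, so adding a column is unambiguous
subset = toy_df.loc[~toy_df['Country Code'].isin(excluded_codes)].copy()

# Dividing by the subset's own total means the weights sum to 1
subset['Proportion'] = subset['2019'] / subset['2019'].sum()

countries = subset['Country Name'].to_numpy()
probabilities = subset['Proportion'].to_numpy()
print(probabilities.sum())
```

The trade-off is that any population missing from the dataset is spread across the listed countries rather than kept visible as its own category.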

Now that I have the probabilities and country names set up for the African and Rest of the World countries, I can use np.random.choice() to pick a student origin from the 17 regions (top 15 countries, Africa and Rest of the world). If Africa or Rest of the World is selected, it can then pick a country from those subsections.

Now I'll need to get the probabilities of each of the 17 regions to be selected.
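
Before working with the real figures, here is a minimal sketch of the two-stage idea on made-up data: first draw a region for each student, then swap each aggregate label for a country drawn from that region. The regions, countries and weights below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(777)

# Made-up regions, countries and weights, for illustration only
regions = np.array(['South Korea', 'Africa', 'Rest of the World'])
region_p = np.array([0.5, 0.3, 0.2])
sub_regions = {
    'Africa': (np.array(['Nigeria', 'Kenya']), np.array([0.7, 0.3])),
    'Rest of the World': (np.array(['Germany', 'Brazil']), np.array([0.4, 0.6])),
}

# Stage 1: draw a region for every student
# (object dtype so longer country names are not truncated on assignment)
origins = rng.choice(regions, 20, p=region_p).astype(object)

# Stage 2: replace each aggregate label with a country from that region
for region, (names, weights) in sub_regions.items():
    mask = origins == region
    origins[mask] = rng.choice(names, mask.sum(), p=weights)

print(origins)
```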

In [7]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice

student_origins = np.array(['Rest of the World', 'Africa', 'South Korea', 'Thailand', 'Pakistan', 'India', 'United States',
                  'Russia', 'Indonesia', 'Laos', 'Japan', 'Kazakhstan', 'Vietnam', 'Bangladesh', 'France', 'Mongolia',
                  'Malaysia'])

#Count of students from each region in 2018
student_origin_counts = np.array([131884, 81562, 50600, 28608, 28023, 23198, 20996, 19239, 15050, 14645, 14230, 11784,
                        11299, 10735, 10695, 10158, 9479])


#Work out the probability that a student came from each region
student_origin_probabilities = student_origin_counts / sum(student_origin_counts)

sum(student_origin_probabilities)

0.9999999999999999

Now let's try picking an origin for each student.

In [8]:
rng = np.random.default_rng(seed)
students = rng.choice(student_origins, school_size, p=student_origin_probabilities)

students

array(['Pakistan', 'Africa', 'Pakistan', ..., 'Rest of the World',
       'United States', 'Rest of the World'], dtype='<U17')
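
A note on reproducibility: the cell above passes `seed` to `default_rng`, whereas the later cells call `np.random.default_rng()` with no argument, so their output will change from run to run. The sketch below, using a made-up three-label example, shows that two generators built from the same seed produce identical draws.

```python
import numpy as np

seed = 777  # same value as the notebook's seed variable
labels = np.array(['Rest of the World', 'Africa', 'South Korea'])  # illustrative subset
weights = np.array([0.5, 0.3, 0.2])  # made-up weights

# Two generators created from the same seed give the same sequence
first = np.random.default_rng(seed).choice(labels, 10, p=weights)
second = np.random.default_rng(seed).choice(labels, 10, p=weights)

# An unseeded generator draws fresh entropy each time it is created
unseeded = np.random.default_rng().choice(labels, 10, p=weights)

print((first == second).all())
```

Creating one seeded generator at the top of the notebook and reusing it in every cell would make the whole simulated data set repeatable.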

In [9]:
#Repeats the "Other" adjustment from cell 6 (this adds a second "Other" entry holding whatever gap remains),
#then checks that the probabilities sum to 1.
other = 1 - sum(rest_of_world_probabilities)
rest_of_world_country = np.insert(rest_of_world_country, -1, 'Other')
rest_of_world_probabilities = np.insert(rest_of_world_probabilities, -1,  other)

sum(rest_of_world_probabilities)

1.0

In [10]:
rng = np.random.default_rng()

#Creating an index of student IDs to use for this
index = np.arange(0,school_size)

#Create the dataframe
chinese_class_df = pd.DataFrame(index = index, columns = ['Nationality'])
#Generate the countries
chinese_class_df['Nationality'] = rng.choice(student_origins, school_size, p=student_origin_probabilities)

#Count how many students landed in each aggregate region, so the right number of countries is drawn
# https://stackoverflow.com/questions/49471442/using-pandas-value-counts-to-get-one-value
is_african = chinese_class_df['Nationality'] == 'Africa'
is_rest_of_world = chinese_class_df['Nationality'] == 'Rest of the World'

#African students have a country picked for them
chinese_class_df.loc[is_african, 'Nationality'] = rng.choice(african_country, is_african.sum(), p=african_probabilities)

#Rest of the world students have a country picked for them
chinese_class_df.loc[is_rest_of_world, 'Nationality'] = rng.choice(rest_of_world_country, is_rest_of_world.sum(), p=rest_of_world_probabilities)


chinese_class_df.value_counts()

Nationality      
South Korea          204
Pakistan             116
Thailand             113
India                 89
United States         87
                    ... 
Jamaica                1
Singapore              1
Kosovo                 1
Equatorial Guinea      1
Papua New Guinea       1
Length: 142, dtype: int64

# Who sat the HSK tests?

Use the 2012 figures for HSK tests taken compared with the number of international students in China, and extrapolate what those numbers would be for 2018. The proportion that sat the tests then serves as the success probability for a Bernoulli distribution (via the binomial function in numpy.random with n = 1).

Taken from Sina Weibo: 2009-2012 HSK takers - http://blog.sina.com.cn/s/blog_53e7c11d0101f02j.html
![here](https://screenshot.click/01_14-ryey4-tgwud.jpg)

Important to note that the current version of the HSK tests was introduced in 2010, so a sharp increase in the first few years isn't a surprise. (Wikipedia references the history - https://en.wikipedia.org/wiki/Hanyu_Shuiping_Kaoshi#Between_2010%E2%80%932020)

Also worthwhile noting that the table above doesn't specify students that took multiple HSK tests. For example someone could reasonably do HSK 1, 2 and 3 within the same year.

Looking at the numbers that sat the test within China (国内).

First let's look at the written HSK exams.
* HSK 一级
* HSK 二级
* HSK 三级
* HSK 四级
* HSK 五级
* HSK 六级

I'm going to calculate the annual total written HSK tests taken in each of these years, and compare to the number of international students in China for each year (from page 36 of https://www.researchcghe.org/perch/resources/publications/to-publish-wp46.pdf).

In [11]:
hsk_written_2010 = 146 + 210 + 1171 + 3842 + 6931 + 5566
hsk_written_2011 = 274 + 755 + 2504 + 11635 + 18018 + 12975
hsk_written_2012 = 658 + 1343 + 4003 + 16158 + 21278 + 17153

print("Total students taking HSK written exam in 2010: " + str(hsk_written_2010))
print("Total students taking HSK written exam in 2011: " + str(hsk_written_2011))
print("Total students taking HSK written exam in 2012: " + str(hsk_written_2012))

total_international_students_2010 = 265090
total_international_students_2011 = 292611
total_international_students_2012 = 328330

print("Proportion of international students taking HSK written exam in 2010: " + str(hsk_written_2010 / total_international_students_2010))
print("Proportion of international students taking HSK written exam in 2011: " + str(hsk_written_2011 / total_international_students_2011))
print("Proportion of international students taking HSK written exam in 2012: " + str(hsk_written_2012 / total_international_students_2012))

Total students taking HSK written exam in 2010: 17866
Total students taking HSK written exam in 2011: 46161
Total students taking HSK written exam in 2012: 60593
Proportion of international students taking HSK written exam in 2010: 0.06739597872420687
Proportion of international students taking HSK written exam in 2011: 0.15775551841865137
Proportion of international students taking HSK written exam in 2012: 0.18454908171656564


As 2012 is the latest year I can find for these types of figures, I will extrapolate the 2012 proportion to the number of international students in 2018. 

There were 492,185 international students in 2018, and if 18.5% took a HSK test we can expect:

In [12]:
hsk_written_total_2018 = 492185 * 0.185
print("Estimated total students taking HSK written exam in 2018: " + str(round(hsk_written_total_2018,0)))

Estimated total students taking HSK written exam in 2018: 91054.0


Now to see how many of my students sat a written HSK exam in 2018.

In [13]:
rng = np.random.default_rng()

#https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.binomial.html#numpy.random.Generator.binomial
chinese_class_df['HSK written test?'] = rng.binomial(1, 0.185, school_size)

chinese_class_df['HSK written test?'].value_counts()

0    1632
1     368
Name: HSK written test?, dtype: int64
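
As a quick plausibility check on that count, the number of takers in a school of 2000 follows a binomial distribution, so its expected value and standard deviation can be worked out directly. This is a small sketch using the same `school_size` and probability as above; the other names are my own.

```python
import math

school_size = 2000
p_written = 0.185

# Mean and standard deviation of a Binomial(n, p) count
expected_takers = school_size * p_written
sd_takers = math.sqrt(school_size * p_written * (1 - p_written))

observed = 368  # count from the output above
z = (observed - expected_takers) / sd_takers

print(expected_takers, round(sd_takers, 2), round(z, 2))
```

The observed 368 is within one standard deviation of the expected 370 or so, which is what I'd hope to see from a correctly specified draw.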

And now the oral exams
* HSK 初级
* HSK 中级
* HSK 高级

In [14]:
hsk_oral_2010 = 51 + 300 + 672
hsk_oral_2011 = 67 + 506 + 1313
hsk_oral_2012 = 56 + 1708 + 1407

print("Total students taking HSK oral exam in 2010: " + str(hsk_oral_2010))
print("Total students taking HSK oral exam in 2011: " + str(hsk_oral_2011))
print("Total students taking HSK oral exam in 2012: " + str(hsk_oral_2012))

total_international_students_2010 = 265090
total_international_students_2011 = 292611
total_international_students_2012 = 328330

print("Proportion of international students taking HSK oral exam in 2010: " + str(hsk_oral_2010 / total_international_students_2010))
print("Proportion of international students taking HSK oral exam in 2011: " + str(hsk_oral_2011 / total_international_students_2011))
print("Proportion of international students taking HSK oral exam in 2012: " + str(hsk_oral_2012 / total_international_students_2012))

Total students taking HSK oral exam in 2010: 1023
Total students taking HSK oral exam in 2011: 1886
Total students taking HSK oral exam in 2012: 3171
Proportion of international students taking HSK oral exam in 2010: 0.0038590667320532648
Proportion of international students taking HSK oral exam in 2011: 0.00644541729463349
Proportion of international students taking HSK oral exam in 2012: 0.00965796607072153


Again, I'll use the 2012 figure, and round it up to 1%

In [15]:
hsk_oral_total_2018 = 492185 * 0.01
print("Estimated total students taking HSK oral exam in 2018: " + str(round(hsk_oral_total_2018,0)))

Estimated total students taking HSK oral exam in 2018: 4922.0


Now to see how many of my students sat the HSK oral exam in 2018

In [16]:
rng = np.random.default_rng()

#https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.binomial.html#numpy.random.Generator.binomial
chinese_class_df['HSK oral test?'] = rng.binomial(1, 0.01, school_size)

chinese_class_df['HSK oral test?'].value_counts()

0    1982
1      18
Name: HSK oral test?, dtype: int64

Did any students sit both the written and oral exams?

In [17]:
chinese_class_df.loc[(chinese_class_df['HSK written test?'] == 1) & (chinese_class_df['HSK oral test?'] == 1)]

Unnamed: 0,Nationality,HSK written test?,HSK oral test?
542,"Egypt, Arab Rep.",1,1
1209,Iraq,1,1
1940,United States,1,1
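
Because the written and oral columns are drawn independently, the expected number of students with both is just the school size multiplied by the two probabilities. The sketch below works that out; note the independence is an assumption of my simulation, and in reality students who sit the oral test may well be more likely to sit a written one.

```python
school_size = 2000
p_written = 0.185
p_oral = 0.01

# Under independence, P(both) = P(written) * P(oral)
expected_both = school_size * p_written * p_oral

print(round(expected_both, 1))
```

The 3 students found above are in line with that expectation.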


### Who sat which exam?

Look at the proportion of test takers that sat each of HSK 1-6 and each of the oral levels (beginner to advanced), and use these as probabilities. If a student has a 1 in the written or oral test column, their level is picked with rng.choice using those probabilities. As I'm extrapolating the 2012 figures to 2018, I'll use the 2012 ratios here too.

In [18]:
#Using the 2012 figures to work out what proportion of HSK takers sat each level that year.

#Listing the 6 HSK levels for the written exam
hsk_written_levels = np.array(['HSK1', 'HSK2', 'HSK3', 'HSK4', 'HSK5', 'HSK6'])

#Dividing the 2012 figures for each level by total HSK tests taken that year. These proportions will be my probabilities
hsk_written_levels_proportions = np.array([658, 1343, 4003, 16158, 21278, 17153]) / hsk_written_2012

hsk_written_levels_proportions

array([0.01085934, 0.02216428, 0.06606374, 0.26666447, 0.35116268,
       0.2830855 ])

We can use these proportions to estimate how many of our estimated 91054 HSK written exam takers took each level in 2018.

In [19]:
#Print the estimate for each of the six levels
for level, proportion in enumerate(hsk_written_levels_proportions, start=1):
    print("Estimated HSK " + str(level) + " takers in 2018: " + str(round(proportion * 91054, 0)))


Estimated HSK 1 takers in 2018: 989.0
Estimated HSK 2 takers in 2018: 2018.0
Estimated HSK 3 takers in 2018: 6015.0
Estimated HSK 4 takers in 2018: 24281.0
Estimated HSK 5 takers in 2018: 31975.0
Estimated HSK 6 takers in 2018: 25776.0


Now to use these proportions as my probabilities to estimate what level each of my school's HSK takers attempted.

In [20]:
#Only want students that have "HSK written test?" set as 1. First count how many there are to use as the size for rng.choice()

chinese_class_df.loc[chinese_class_df['HSK written test?'] == 1, 'HSK Level'] = rng.choice(hsk_written_levels, chinese_class_df['HSK written test?'].sum(), p=hsk_written_levels_proportions) 

chinese_class_df['HSK Level'].value_counts()

HSK5    130
HSK6    108
HSK4     99
HSK3     22
HSK1      5
HSK2      4
Name: HSK Level, dtype: int64
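
To see how closely the simulated levels follow the 2012 proportions, the observed counts can be set against the counts I'd expect for the same number of takers. A rough sketch, with the observed figures copied from the output above:

```python
import numpy as np

# 2012 written test counts for HSK1..HSK6, as used earlier
counts_2012 = np.array([658, 1343, 4003, 16158, 21278, 17153])
proportions = counts_2012 / counts_2012.sum()

# Simulated counts for HSK1..HSK6, copied from the value_counts output above
observed = np.array([5, 4, 22, 99, 130, 108])
expected = observed.sum() * proportions

for level, obs, exp in zip(range(1, 7), observed, expected):
    print(f"HSK{level}: observed {obs}, expected {exp:.1f}")
```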

Now to do the same for the oral test.

In [21]:
hsk_oral_levels = np.array(['Beginner', 'Intermediate', 'Advanced'])
hsk_oral_level_proportions = np.array([56, 1708, 1407]) / hsk_oral_2012

hsk_oral_level_proportions

array([0.01766004, 0.53863135, 0.44370861])

Once again, let's extrapolate this out for 2018. Earlier I estimated 4922 people sat a HSK oral test.

In [22]:
print("Estimated HSK Beginner takers in 2018: " + str(round(hsk_oral_level_proportions[0] * 4922, 0)))
print("Estimated HSK Intermediate takers in 2018: " + str(round(hsk_oral_level_proportions[1] * 4922, 0)))
print("Estimated HSK Advanced takers in 2018: " + str(round(hsk_oral_level_proportions[2] * 4922, 0)))

Estimated HSK Beginner takers in 2018: 87.0
Estimated HSK Intermediate takers in 2018: 2651.0
Estimated HSK Advanced takers in 2018: 2184.0


Now to see which oral tests the students in my school took.

In [23]:
#Only want students that have "HSK oral test?" set as 1. First count how many there are to use as the size for rng.choice()

chinese_class_df.loc[chinese_class_df['HSK oral test?'] == 1, 'HSK Oral Level'] = rng.choice(hsk_oral_levels, chinese_class_df['HSK oral test?'].sum(), p=hsk_oral_level_proportions) 

chinese_class_df['HSK Oral Level'].value_counts()

Intermediate    13
Advanced         4
Beginner         1
Name: HSK Oral Level, dtype: int64

# HSK Results

For those students that sat the HSK tests, we can now simulate what their scores might be. Finding data on HSK scores and results has proven very difficult (both in Chinese and in English). However, there has been a study of 108 American students and their HSK results before and after a semester of study in Beijing (https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625). It should be noted that this paper covers just 108 students from the same country, so in reality there are likely to be numerous other variables that affect the results of a particular student or even a whole cohort. For example, Chinese textbooks for HSK 1 through HSK 3 have grammar and other explanations written in English, and the quality of those translations, or how comparable the grammar rules are to the reader's native language, may influence how well students retain the information and therefore perform on the test.

Similarly, all 108 students in this study took the HSK 4 written test and the intermediate oral test, so we do not have comparable results for the various other levels. That said, all of the tests follow a similar marking structure, with each section scored out of 100, and so for the purposes of this assignment I will be utilising the mean and standard deviation that was found among those 108 students.

This is by no means a perfect simulation of the scores that I can expect at my fictional school, but with the absence of data on HSK test scores or pass/fail rates, it will have to suffice.

## HSK Written Test Results

https://www.chinaeducenter.com/en/exams.php

The HSK written tests follow a similar structure, with each level becoming more difficult. Levels 1 and 2 do not include a writing section, as students at this level are not expected to be able to hand write a large number of characters, and so reading and listening skills are the only areas tested. For levels 3 through to 6; reading, writing and listening are each tested.

For levels 1-5, the student must achieve at least 60% in order to pass the test. For HSK level 6, only a score of 40% is required. The HSK tests are a simple pass/fail grading system.
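
One compact way to express these pass marks is a dictionary keyed by level, which can be mapped onto a column and compared against the score in a single step. This is a sketch on made-up scores, using the 60% and 40% thresholds described above; the `pass_marks` and `results` names are hypothetical.

```python
import pandas as pd

# Pass marks as described in the text: 60% for HSK 1-5, 40% for HSK 6
pass_marks = {'HSK1': 60, 'HSK2': 60, 'HSK3': 60,
              'HSK4': 60, 'HSK5': 60, 'HSK6': 40}

# Hypothetical results, for illustration only
results = pd.DataFrame({
    'HSK Level': ['HSK2', 'HSK4', 'HSK6', 'HSK6'],
    'Total Written Score': [58.0, 71.5, 45.0, 39.9],
})

# Look up the required mark per row, then flag 1 for a pass and 0 for a fail
required = results['HSK Level'].map(pass_marks)
results['HSK Pass'] = (results['Total Written Score'] >= required).astype(int)

print(results)
```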

As each of the tests includes a reading and listening portion scored out of 100, and as I will be using the mean and standard deviation from the above linked study to simulate the results regardless of level, I can use the same function for all the written HSK takers. For this I will use a normal distribution.

In [37]:
#Reading results https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html#numpy.random.Generator.normal

reading_mean = 75.39
reading_sd = 13.85
chinese_class_df.loc[chinese_class_df['HSK written test?'] == 1, 'Reading'] = rng.normal(reading_mean, reading_sd, chinese_class_df['HSK written test?'].sum())

#Reducing the HSK6 results by 1/3 to reflect the higher difficulty. The mean and SD for HSK4 is too high for this test.

chinese_class_df.loc[chinese_class_df['HSK Level'] == 'HSK6', ['Reading']] = (chinese_class_df['Reading'] / 3) * 2

#Normal distribution will pick a few values above 100. This goes beyond the possible score in the section, so I'll cap these at 100.
chinese_class_df.loc[chinese_class_df['Reading'] > 100, 'Reading'] = 100


#Listening test
listening_mean = 70.73
listening_sd = 15.22

chinese_class_df.loc[chinese_class_df['HSK written test?'] == 1, 'Listening'] = rng.normal(listening_mean, listening_sd, chinese_class_df['HSK written test?'].sum())
chinese_class_df.loc[chinese_class_df['HSK Level'] == 'HSK6', ['Listening']] = (chinese_class_df['Listening'] / 3) * 2
chinese_class_df.loc[chinese_class_df['Listening'] > 100, 'Listening'] = 100


#Only HSK 3 - 6 have a writing section, so I'll list those levels to help me subset those tests.
hsk_writing_sections = ['HSK3', 'HSK4', 'HSK5', 'HSK6']

writing_mean = 69.67
writing_sd = 11.95

chinese_class_df.loc[chinese_class_df['HSK Level'].isin(hsk_writing_sections), 'Writing'] = rng.normal(writing_mean, writing_sd, len(chinese_class_df.loc[chinese_class_df['HSK Level'].isin(hsk_writing_sections)]))
chinese_class_df.loc[chinese_class_df['HSK Level'] == 'HSK6', ['Writing']] = (chinese_class_df['Writing'] / 3) * 2
chinese_class_df.loc[chinese_class_df['Writing'] > 100, 'Writing'] = 100

#Combined the 3 sections to get the score
chinese_class_df['Total Written Score'] = chinese_class_df['Reading'] + chinese_class_df['Listening'] + chinese_class_df['Writing']

#HSK 3 - 6 has 3 sections, so final score has to be divided by 3.
chinese_class_df.loc[chinese_class_df['HSK Level'].isin(hsk_writing_sections), 'Total Written Score'] = chinese_class_df['Total Written Score'] / 3

chinese_class_df.loc[~chinese_class_df['HSK Level'].isin(hsk_writing_sections), 'Total Written Score'] = chinese_class_df['Total Written Score'] / 2

hsk_pass_60_tests = ['HSK1','HSK2','HSK3','HSK4','HSK5']

chinese_class_df.loc[(chinese_class_df['HSK Level'].isin(hsk_pass_60_tests)) & (chinese_class_df['Total Written Score'] >= 60), 'HSK Pass'] = 1
chinese_class_df.loc[(chinese_class_df['HSK Level'].isin(hsk_pass_60_tests)) & (chinese_class_df['Total Written Score'] < 60), 'HSK Pass'] = 0
chinese_class_df.loc[(~chinese_class_df['HSK Level'].isin(hsk_pass_60_tests)) & (chinese_class_df['Total Written Score'] >=40), 'HSK Pass'] = 1
chinese_class_df.loc[(~chinese_class_df['HSK Level'].isin(hsk_pass_60_tests)) & (chinese_class_df['Total Written Score'] <40), 'HSK Pass'] = 0

chinese_class_df['HSK Pass'].value_counts()

1.0    340
0.0     19
Name: HSK Pass, dtype: int64
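
The pass and fail counts above add up to 359 (340 + 19), nine fewer than the 368 students who sat a written test. A likely explanation, assuming the cell runs as shown, is that the 5 HSK1 and 4 HSK2 students have no Writing score, so adding the three columns gives NaN for them and they never receive a pass or fail. The sketch below reproduces the effect on two made-up rows and shows one possible fix: averaging across the sections with `mean(axis=1)`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical section scores; HSK1 has no writing section, so Writing is NaN
scores = pd.DataFrame({
    'HSK Level': ['HSK1', 'HSK4'],
    'Reading': [80.0, 70.0],
    'Listening': [60.0, 65.0],
    'Writing': [np.nan, 75.0],
})
sections = ['Reading', 'Listening', 'Writing']

# Plain addition propagates NaN, so the HSK1 total is lost
naive_total = scores['Reading'] + scores['Listening'] + scores['Writing']

# mean(axis=1) skips NaN by default, averaging over the sections actually sat
average_score = scores[sections].mean(axis=1)

print(naive_total.isna().tolist(), average_score.tolist())
```

With this approach the separate divide-by-3 and divide-by-2 steps would no longer be needed, since the row mean already uses the right number of sections.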

## HSK Oral Test Results

Unlike the written tests, which have 2 or 3 distinct sections, the HSK Oral Test only tests spoken language skills, and so has just the 1 speaking section. The passing grade is 60% for all three levels of the oral test (beginner, intermediate, advanced).

The previously mentioned study of 108 students from the US also measured their results on the HSK Oral test. One notable difference here is that, due to the low participation rate of the HSK Oral test, a sample of 108 measurements is actually fairly sizable. Earlier in my extrapolation of 2012 trends to 2018 international student figures, we saw that only about 1% of students in China attempt the HSK Oral test. While the written HSK 4 is seen as the first valuable HSK certification, and HSK 5 is required for attending a college program in Chinese, the oral certifications have very few applications.

In [41]:
hsk_oral_mean = 79
hsk_oral_sd = 9.84

chinese_class_df.loc[chinese_class_df['HSK oral test?'] == 1, 'Speaking'] = rng.normal(hsk_oral_mean, hsk_oral_sd, chinese_class_df['HSK oral test?'].sum())
chinese_class_df.loc[chinese_class_df['Speaking'] > 100, 'Speaking'] = 100

chinese_class_df.loc[chinese_class_df['Speaking'] >= 60, 'HSK Oral Pass'] = 1
chinese_class_df.loc[chinese_class_df['Speaking'] < 60, 'HSK Oral Pass'] = 0


chinese_class_df.describe()




Unnamed: 0,HSK written test?,HSK oral test?,Reading,Listening,Writing,Total Written Score,HSK Pass,Speaking,HSK Oral Pass
count,2000.0,2000.0,368.0,368.0,359.0,359.0,359.0,18.0,18.0
mean,0.184,0.009,68.201168,64.108616,62.230958,64.757627,0.947075,77.711778,1.0
std,0.387581,0.094464,16.662135,17.144918,15.393732,13.239013,0.224196,9.486115,0.0
min,0.0,0.0,25.117496,21.362975,23.886705,37.421959,0.0,65.58059,1.0
25%,0.0,0.0,54.649849,50.478581,50.043757,51.068796,1.0,69.64843,1.0
50%,0.0,0.0,69.561277,64.158689,61.840258,67.823972,1.0,78.374923,1.0
75%,0.0,0.0,80.185438,77.727972,72.301328,75.408398,1.0,83.31983,1.0
max,1.0,1.0,100.0,100.0,100.0,92.846115,1.0,95.568862,1.0


Without adjustment, the HSK6 scores are too high: the mean for the HSK results is around 60, whereas the pass mark for HSK6 is only 40. To account for this, the HSK6 section scores are reduced by 1/3 in the written results cell above.

## HSK results

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625

108 participants from the US sat the intermediate spoken exam and the HSK 4 written exam.

These students stayed in the country for 1 semester (about 3 months).

We also have the mean, min, max and std from that group.

![here](https://screenshot.click/28_19-215cg-skgcm.jpg)

I could use this to create a normal distribution of test scores from US students who have been studying in China. As I know what a passing score is, I could calculate if it was a pass or fail.
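
As a rough sketch of that idea, the pass probability implied by a normal distribution can be computed from its CDF and compared with a simulated pass rate. The mean and standard deviation below are the oral test figures used in the oral results cell; the 60% pass mark is the one described earlier, and the variable names and sample size are my own choices.

```python
import math
import numpy as np

hsk_oral_mean, hsk_oral_sd = 79, 9.84  # figures used in the oral results cell
pass_mark = 60

# Theoretical pass probability P(score >= pass_mark) via the normal CDF
z = (pass_mark - hsk_oral_mean) / hsk_oral_sd
pass_probability = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Simulated check; np.clip keeps every score inside the valid 0-100 range
rng = np.random.default_rng(777)
scores = np.clip(rng.normal(hsk_oral_mean, hsk_oral_sd, 10_000), 0, 100)
simulated_pass_rate = (scores >= pass_mark).mean()

print(round(pass_probability, 3), round(simulated_pass_rate, 3))
```

Clipping at both ends, rather than only capping at 100, also guards against the rare negative score that a normal distribution can produce.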

Another source of data on HSK 4 results http://dpi-proceedings.com/index.php/dtem/article/view/30976/29557

Shows the mean and std for 30 students from Beijing Language & Culture University

![here](https://screenshot.click/28_02-0i7p9-b37me.jpg)



Some more results for 2010 including pass rates and average scores for each HSK level http://www.chinesetest.cn/gonewcontent.do?id=5589387 (Note - these are for tests taken outside China)

## Next steps

* Add variables to show who is self-funded versus who is on a scholarship of some sort.
* Also look at who is on degree vs non-degree.
* May also be able to look at their education background too.


## Potential data points
* Country - np.random.choice with probabilities for top 15 countries
* Course type - degree vs non-degree - binomial with 1 meaning degree
* Self-funded / scholarship - binomial with 1 meaning scholarship
* Attempted HSK written - binomial
* Level attempted - np.random.choice
* Attempt HSK spoken - binomial
* Level attempted - np.random.choice
* Results for each section - normal distributions for each
* Total score - total of the results of each section
* Pass/Fail - total compared to required pass score for that level.


## Resources

https://ejournals.bc.edu/index.php/ihe/article/download/10945/9333/

Includes some statistics on education background and funding.

http://en.moe.gov.cn/documents/reports/201904/t20190418_378692.html

More information on funding, origin country, where they studied, education background.

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625
https://www.researchgate.net/figure/Correlations-among-proficiency-subskills-and-total-scores-of-pre-HSK-and-post-HSK-data_tbl4_325299887

109 US students measured on their Chinese proficiency upon returning to the US after 1 year in Beijing.

https://www.kaggle.com/kerneler/starter-china-scholarship-data-may-8638c810-6

Data on scholarships provided by Chinese universities.

http://blog.sina.com.cn/s/blog_53e7c11d0101f02j.html

Number of people that took HSK from 2009-2012

http://global.chinadaily.com.cn/a/201905/31/WS5cf0b106a3104842260bee25.html
6.8 million tests taken in 2018


https://forum.duolingo.com/comment/30363109/Percentage-of-users-who-complete-their-tree-for-each-language
Duolingo stats from 2019 suggesting 0.0124% complete the content. This covers 1000 characters, so not even HSK 4 level.

https://www.statista.com/statistics/430717/china-foreign-students-by-country-of-origin/
Foreign students by country of origin 2018.

https://www.echinacities.com/china-news/Is-the-HSK-Level-6-Test-Too-Difficult-Foreign-Test-Takers-Seem-to-Think-So
Why people don't go above level 4/5.

https://educationdata.org/international-student-enrollment-statistics
Statistics on US students abroad.