# Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.
* Investigate the types of variables involved, their likely distributions, and their
relationships with each other.
* Synthesise/simulate a data set as closely matching their properties as possible.
* Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

Initial thoughts - Only a fraction of international students in China reach HSK 5 or 6.

Option 1 - simulate number of American students that attempt the various HSK levels and their pass rates.

Option 2 - simulate a school made up of people from the top 15 source countries of international students. See who attempted which test, and their pass rates. South Korea accounts for the largest share of students and of test takers at each HSK level.

## Variables

* HSK level (see if I can get statistics on how many are awarded)
* Education background
* Origin country
* Funding of study
* Level of program they enroll into
* Hours of study
* Scores

## Country of Origin

It turns out I only have figures for the top 15 source countries in 2018:

* South Korea	50,600
* Thailand	28,608
* Pakistan	28,023
* India	23,198
* United States	20,996
* Russia	19,239
* Indonesia	15,050
* Laos	14,645
* Japan	14,230
* Kazakhstan	11,784
* Vietnam	11,299
* Bangladesh	10,735
* France	10,695
* Mongolia	10,158
* Malaysia	9,479

https://www.researchcghe.org/perch/resources/publications/to-publish-wp46.pdf

Also has top 10 countries 2000-2016, and total international students. As I have HSK test data from 2012, if I follow these proportions I could estimate how many American students took on each of the tests. If I use the normal distribution of scores based on the earlier paper, I could simulate what students took on the HSK, what level and what score.

It also gives the percentage of students enrolled in full-time degrees, etc. Most degrees require at least HSK 5 to enrol, whereas undergraduate Chinese language degrees do not require an HSK result. I could extrapolate from this to make an educated guess at the percentage of students in Chinese language undergraduate degrees.

It also shows how many students were receiving scholarships until 2013, and what proportion were for non-degree students. If I were to assume it grew at about the same rate as overall international students, and that it's shared proportionally between students from various countries, I could look at who was self-funded versus on scholarship.



In 2018 there were 492,185 International Students in China (http://global.chinadaily.com.cn/a/201904/12/WS5cb05c3ea3104842260b5eed.html#:~:text=Almost%20500%2C000%20international%20students%20studied,ministry%20said%20in%20a%20statement.)

So, based on that number and the breakdown above, 278,739 students came from those 15 countries, which leaves 213,446 from the "Rest of the World". At present I do not have a way to break these down further.




I've since found that there were 81,562 African students in China in 2018 (https://www.studyinternational.com/news/african-students-china-alienated/). There's no breakdown by country though. This would bring the Rest of the World down to 131,884.

* Rest of the World 131,884
* Africa 81,562
* South Korea	50,600
* Thailand	28,608
* Pakistan	28,023
* India	23,198
* United States	20,996
* Russia	19,239
* Indonesia	15,050
* Laos	14,645
* Japan	14,230
* Kazakhstan	11,784
* Vietnam	11,299
* Bangladesh	10,735
* France	10,695
* Mongolia	10,158
* Malaysia	9,479
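
The figures in the list above can be re-derived from the quoted totals. A quick check of the arithmetic, using only the numbers already given in these notes (the variable names are just for illustration):

```python
# 2018 counts for the top 15 source countries, copied from the list above.
top_15_counts = {
    "South Korea": 50_600, "Thailand": 28_608, "Pakistan": 28_023,
    "India": 23_198, "United States": 20_996, "Russia": 19_239,
    "Indonesia": 15_050, "Laos": 14_645, "Japan": 14_230,
    "Kazakhstan": 11_784, "Vietnam": 11_299, "Bangladesh": 10_735,
    "France": 10_695, "Mongolia": 10_158, "Malaysia": 9_479,
}

total_international_students = 492_185  # China Daily figure quoted above
african_students = 81_562               # Study International figure quoted above

top_15_total = sum(top_15_counts.values())
rest_of_world = total_international_students - top_15_total
rest_of_world_excl_africa = rest_of_world - african_students

print(top_15_total)               # 278739
print(rest_of_world)              # 213446
print(rest_of_world_excl_africa)  # 131884
```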

Going to work out proportions from each country

World Population Dataset from https://data.worldbank.org/indicator/SP.POP.TOTL

## Selecting African and Rest of the World countries.

I'm going to use data from the World Bank to find out what proportion of the African population each African country has, and I'll use this as the probability of a student coming from that country. This isn't a perfect measure: in the real world there would be political and academic exchanges with particular countries, and some countries are more likely to have a population that can afford to go to China for self-funded study. Nonetheless, I prefer this route as I do not want the 54 African countries treated as one unit.

I will also be doing the same for the Rest of the World. This comes with the same caveats as the African countries, in that population shares do not reflect academic or political exchanges. That said, the countries with large populations that are not counted in the top 15 or in Africa do tend to have populations that can afford to study abroad. For example, Germany is one of the remaining countries with a larger population, so its probability will be higher than others, and in real life its population is also likely to be more able to afford self-funded study in China.

In [1]:
import pandas as pd
import numpy as np

#read_csv documentation https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html
#I've edited this dataset to remove aggregations (e.g. Eurozone, World).
#as well as China, Hong Kong and Macau - as these would not be considered International Students.
populations_df = pd.read_csv('world_populations.csv', usecols = ['Country Name', 'Country Code', '2019'])

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
populations_df.dropna(inplace=True)

populations_df['2019'] = populations_df['2019'].astype('int')


In [32]:
#Country Codes
african_country_codes = ['DZA','AGO','BWA','IOT','BDI','CMR','CPV','CAF','TCD',
                     'COM','MYT','COG','COD','BEN','GNQ','ETH','ERI','ATF',
                     'DJI','GAB','GMB','GHA','GIN','CIV','KEN','LSO','LBR',
                     'LBY','MDG','MWI','MLI','MRT','MUS','MAR','MOZ','NAM',
                     'NER','NGA','GNB','REU','RWA','SHN','STP','SEN','SYC',
                     'SLE','SOM','ZAF','ZWE','SSD','SDN','ESH','SWZ','TGO',
                     'TUN','UGA','EGY','TZA','BFA','ZMB']

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html
#.copy() makes african_countries an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
african_countries = populations_df.loc[populations_df['Country Code'].isin(african_country_codes)].copy()

#Get the sum of the African population to work out proportions per country in a moment.
africa_total_population = african_countries['2019'].sum()

african_countries['Proportion of African Population'] = african_countries['2019'] / africa_total_population

african_country = np.array(african_countries['Country Name'])
african_probabilities = np.array(african_countries['Proportion of African Population'])


In [3]:
african_probabilities

array([3.30460644e+00, 2.44280173e+00, 9.05816334e-01, 1.76823970e-01,
       1.55980007e+00, 8.85048222e-01, 4.22111458e-02, 1.98618318e+00,
       3.64224310e-01, 1.22402813e+00, 6.53111241e-02, 6.66174962e+00,
       4.12989549e-01, 1.97391471e+00, 7.47271643e-02, 7.70544808e+00,
       1.04080887e-01, 8.81265656e-02, 8.60278326e+00, 1.66759797e-01,
       1.80201952e-01, 2.33477148e+00, 9.80277537e-01, 1.47443459e-01,
       4.03539989e+00, 1.63128365e-01, 3.78976086e-01, 5.20214234e-01,
       2.07007255e+00, 1.42987945e+00, 1.50888380e+00, 3.47377079e-01,
       9.71516844e-02, 2.79944932e+00, 2.33079395e+00, 1.91471664e-01,
       1.78925144e+00, 1.54252844e+01, 9.69201865e-01, 1.65069693e-02,
       1.25085364e+00, 7.49336396e-03, 5.99715889e-01, 1.18534502e+00,
       4.49473424e+00, 8.49090283e-01, 3.28619897e+00, 4.45230265e+00,
       6.20375007e-01, 8.97646975e-01, 3.39798392e+00, 1.37095210e+00,
       1.12413646e+00])

In [4]:
#Need to remove the top 15 countries from rest of the world too

top_15_country_codes = ['KOR', 'THA', 'PAK', 'IND', 'USA', 'RUS', 'IDN', 'LAO', 'JPN', 'KAZ', 'VNM', 'BGD', 'FRA', 'MNG', 'MYS']

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html
top_15_countries = populations_df.loc[populations_df['Country Code'].isin(top_15_country_codes)]

#Get the sum of the top 15 countries' population so it can be subtracted from the world total.
top_15_total_population = top_15_countries['2019'].sum()

top_15_total_population

2961247792

In [33]:
#Now to work out the rest of the world

#World Population without China, Hong Kong & Macau, minus the African population, minus the top 15 countries
rest_of_world_population = 6267671127 - africa_total_population - top_15_total_population

#Keep every country that is neither African nor in the top 15.
#.copy() gives an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
excluded_country_codes = african_country_codes + top_15_country_codes
rest_of_world = populations_df.loc[~populations_df['Country Code'].isin(excluded_country_codes)].copy()

rest_of_world['Proportion of World Population'] = rest_of_world['2019'] / rest_of_world_population

rest_of_world_country = np.array(rest_of_world['Country Name'])
rest_of_world_probabilities = np.array(rest_of_world['Proportion of World Population'])
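
The population-share approach relies on each probability array summing to 1, because `Generator.choice` raises a `ValueError` when `p` does not. A minimal sketch of dividing by the table's own total and checking the result before sampling. The three-country table, its figures, and the names `toy_df` and `population_probabilities` are invented for illustration and are not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a filtered population table; the figures are invented.
toy_df = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2019": [5_000_000, 3_000_000, 2_000_000],
})


def population_probabilities(df, population_col="2019"):
    """Return each row's share of the table's own total population."""
    populations = df[population_col].to_numpy(dtype=float)
    return populations / populations.sum()


toy_probabilities = population_probabilities(toy_df)

# Check up front that the shares form a valid probability vector.
assert np.isclose(toy_probabilities.sum(), 1.0)

rng = np.random.default_rng(33)
sample = rng.choice(toy_df["Country Name"].to_numpy(), size=5, p=toy_probabilities)
print(toy_probabilities)
print(sample)
```

Dividing by the sum of the rows actually present, rather than by a separately recorded total, keeps the shares valid even if a few rows were dropped earlier (for example by `dropna`).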

Now that I have the probabilities and country names set up for the African and Rest of the World countries, I can use np.random.choice() to pick a student origin from the 17 regions (top 15 countries, Africa and Rest of the world). If Africa or Rest of the World is selected, it can then pick a country from those subsections.

Now I'll need to get the probabilities of each of the 17 regions to be selected.

In [30]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice

student_origins = np.array(['Rest of the World', 'Africa', 'South Korea', 'Thailand', 'Pakistan', 'India', 'United States',
                  'Russia', 'Indonesia', 'Laos', 'Japan', 'Kazakhstan', 'Vietnam', 'Bangladesh', 'France', 'Mongolia',
                  'Malaysia'])

#Count of students from each region in 2018
student_origin_counts = np.array([131884, 81562, 50600, 28608, 28023, 23198, 20996, 19239, 15050, 14645, 14230, 11784,
                        11299, 10735, 10695, 10158, 9479])


#Work out the probability that a student came from each region
student_origin_probabilities = student_origin_counts / sum(student_origin_counts)

sum(student_origin_probabilities)

0.9999999999999999

Now let's try picking an origin for each student.

In [31]:
rng = np.random.default_rng(33)
students = rng.choice(student_origins, 100, p=student_origin_probabilities)

students



array(['South Korea', 'Thailand', 'Vietnam', 'Rest of the World',
       'Thailand', 'Africa', 'Russia', 'Thailand', 'Rest of the World',
       'South Korea', 'Rest of the World', 'Rest of the World',
       'Rest of the World', 'Africa', 'Rest of the World', 'Russia',
       'Indonesia', 'Rest of the World', 'South Korea', 'Africa',
       'Africa', 'Rest of the World', 'France', 'Africa', 'Bangladesh',
       'Africa', 'Rest of the World', 'Rest of the World',
       'Rest of the World', 'Indonesia', 'Laos', 'Indonesia',
       'Rest of the World', 'Africa', 'Thailand', 'United States',
       'United States', 'Japan', 'Russia', 'South Korea', 'India',
       'Russia', 'Rest of the World', 'Mongolia', 'Thailand', 'Pakistan',
       'India', 'United States', 'Rest of the World', 'Rest of the World',
       'Africa', 'Rest of the World', 'Russia', 'South Korea',
       'Rest of the World', 'Africa', 'Rest of the World', 'Russia',
       'Africa', 'Japan', 'United States', 'South Korea

In [27]:
len(student_origin_probabilities)

17
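
The two-stage draw described earlier (pick one of the regions first, then pick a country when the region is Africa or Rest of the World) could look roughly like the sketch below. It is only an outline under stated assumptions: the region list is cut down to three entries, and every country list and probability here is a placeholder. In the notebook the real inputs would be `student_origins`, `student_origin_probabilities`, `african_country`, `african_probabilities`, `rest_of_world_country` and `rest_of_world_probabilities`.

```python
import numpy as np

rng = np.random.default_rng(33)

# Stage 1: regions and their probabilities (placeholder values).
regions = np.array(["Rest of the World", "Africa", "South Korea"])
region_probabilities = np.array([0.5, 0.3, 0.2])

# Stage 2: countries inside the two grouped regions (placeholder shares).
sub_regions = {
    "Africa": (np.array(["Nigeria", "Ethiopia", "Egypt"]),
               np.array([0.5, 0.3, 0.2])),
    "Rest of the World": (np.array(["Germany", "Brazil"]),
                          np.array([0.4, 0.6])),
}


def draw_origin(rng):
    """Return (region, country) for one simulated student."""
    region = str(rng.choice(regions, p=region_probabilities))
    if region in sub_regions:
        countries, probabilities = sub_regions[region]
        return region, str(rng.choice(countries, p=probabilities))
    # A top-15 country is its own region, so no second draw is needed.
    return region, region


origins = [draw_origin(rng) for _ in range(100)]
print(origins[:5])
```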

## HSK results

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625

108 participants from the US did the intermediate spoken exam and HSK 4 written exam.

These students stayed in the country for 1 semester (about 3 months).

We also have the mean, min, max and std from that group.

![here](https://screenshot.click/28_19-215cg-skgcm.jpg)

I could use this to create a normal distribution of test scores from US students who have been studying in China. As I know what a passing score is, I could calculate if it was a pass or fail.
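
A rough sketch of that idea follows. The mean and standard deviation are placeholders, not the values from the paper, and should be swapped for the reported statistics. The maximum of 300 and pass mark of 180 are my assumptions about HSK 4 scoring and should be checked against the official rules.

```python
import numpy as np

rng = np.random.default_rng(33)

# Placeholder values: substitute the mean and std reported in the paper.
assumed_mean = 200.0
assumed_std = 30.0
max_score = 300   # assumed maximum total score for HSK 4
pass_mark = 180   # assumed passing total for HSK 4
n_students = 108  # size of the US group described above

# Draw scores from a normal distribution, then round and clip them
# so every score is a whole number inside the valid range.
raw_scores = rng.normal(loc=assumed_mean, scale=assumed_std, size=n_students)
scores = np.clip(np.rint(raw_scores), 0, max_score).astype(int)

# Compare each score with the pass mark to get a pass/fail flag.
passed = scores >= pass_mark
pass_rate = passed.mean()

print(scores[:10])
print(f"Simulated pass rate: {pass_rate:.2%}")
```

Clipping is a simple way to respect the score bounds. If the reported minimum and maximum sit close to the mean, a truncated normal might be a better fit than clipping.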

Another source of data on HSK 4 results http://dpi-proceedings.com/index.php/dtem/article/view/30976/29557

Shows the mean and std for 30 students from Beijing Language & Culture University

![here](https://screenshot.click/28_02-0i7p9-b37me.jpg)



Some more results for 2010 including pass rates and average scores for each HSK level http://www.chinesetest.cn/gonewcontent.do?id=5589387 (Note - these are for tests taken outside China)

## Next steps

* Work out what proportion of international students might sit HSK (i.e. remove all degrees except Chinese language bachelors)
* Look at how many international students were in China during 2010-2012, and compare to number that sat HSK in China. Extrapolate that number to 2018 figures.
* Look at the proportion of HSK test takers that took each level, and work out their respective probabilities.
* Do the same for the speaking/listening.
* This will give me the probability that a student in the class took each HSK exam.
* I can then simulate their score in each part of the test.
* From this I can work out if they passed or failed.


## Questions to be decided
* Am I just doing US students or shall I simulate international students too? 
This would give me another area to simulate, and does highlight the trend that most of the class are likely to be from Asian countries. If I use the 2010 results, it may also highlight that the average US student has a higher score than the average Korean student - that is for tests taken outside of China though, so isn't fully comparable.


## Potential data points
* Country - np.random.choice with probabilities for top 15 countries
* Course type - degree vs non-degree - binomial with 1 meaning degree
* Self-funded / scholarship - binomial with 1 meaning scholarship
* Attempted HSK written - binomial
* Written level attempted - np.random.choice
* Attempted HSK spoken - binomial
* Spoken level attempted - np.random.choice
* Results for each section - normal distributions for each
* Total score - total of the results of each section
* Pass/Fail - total compared to required pass score for that level.
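
The data points above could be assembled into one table along these lines. This is a sketch only: every probability and score parameter is a placeholder rather than a researched value, only the written test is covered, and the levels are limited to 3-6 so that one scoring rule applies (three sections scored 0-100 with a pass mark of 180, which is my assumption about how those levels are scored).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(33)
n_students = 100

# Placeholder categories and probabilities, for illustration only.
countries = np.array(["South Korea", "Thailand", "United States"])
country_p = np.array([0.5, 0.3, 0.2])
levels = np.array([3, 4, 5, 6])
level_p = np.array([0.2, 0.4, 0.3, 0.1])
sections = ["listening", "reading", "writing"]
pass_mark = 180  # assumed pass mark out of 300

attempted = rng.binomial(1, 0.4, n_students)  # 1 = sat the written HSK

df = pd.DataFrame({
    "country": rng.choice(countries, n_students, p=country_p),
    "degree": rng.binomial(1, 0.5, n_students),       # 1 = degree course
    "scholarship": rng.binomial(1, 0.1, n_students),  # 1 = scholarship
    "attempted_written": attempted,
    # Level is only defined for students who sat the test.
    "level": np.where(attempted == 1,
                      rng.choice(levels, n_students, p=level_p), np.nan),
})

# One normally distributed score per section, rounded and clipped to 0-100.
for section in sections:
    section_scores = np.clip(np.rint(rng.normal(65, 15, n_students)), 0, 100)
    df[section] = np.where(attempted == 1, section_scores, np.nan)

# Total is missing for students who did not sit the test.
df["total"] = df[sections].sum(axis=1, min_count=len(sections))
df["passed"] = df["total"] >= pass_mark

print(df.head())
```

The spoken test could be added the same way with its own attempt flag, level and score columns.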


## Resources

https://ejournals.bc.edu/index.php/ihe/article/download/10945/9333/

Includes some statistics on education background and funding.

http://en.moe.gov.cn/documents/reports/201904/t20190418_378692.html

More information on funding, origin country, where they studied, education background.

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625
https://www.researchgate.net/figure/Correlations-among-proficiency-subskills-and-total-scores-of-pre-HSK-and-post-HSK-data_tbl4_325299887

109 US students measured on their Chinese proficiency upon returning to the US after 1 year in Beijing.

https://www.kaggle.com/kerneler/starter-china-scholarship-data-may-8638c810-6

Data on scholarships provided by Chinese universities.

http://blog.sina.com.cn/s/blog_53e7c11d0101f02j.html

Number of people that took HSK from 2009-2012

http://global.chinadaily.com.cn/a/201905/31/WS5cf0b106a3104842260bee25.html
6.8 million tests taken in 2018


https://forum.duolingo.com/comment/30363109/Percentage-of-users-who-complete-their-tree-for-each-language
Duolingo stats from 2019 suggesting 0.0124% complete the content. This covers 1000 characters, so not even HSK 4 level.

https://www.statista.com/statistics/430717/china-foreign-students-by-country-of-origin/
Foreign students by country of origin 2018.

https://www.echinacities.com/china-news/Is-the-HSK-Level-6-Test-Too-Difficult-Foreign-Test-Takers-Seem-to-Think-So
Why people don't go above level 4/5.

https://educationdata.org/international-student-enrollment-statistics
Statistics on US students abroad.