# Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.
* Investigate the types of variables involved, their likely distributions, and their
relationships with each other.
* Synthesise/simulate a data set as closely matching their properties as possible.
* Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

Initial thoughts - Only a fraction of international students in China reach HSK 5 or 6.

Option 1 - simulate the number of American students who attempt the various HSK levels, and their pass rates.

Option 2 - simulate a school made up of people from the top 15 source countries for international students, and see who attempted which test and their pass rates. South Korea is the largest single source of students, and likely of test takers at each HSK level.

## Variables

* HSK level (see if I can get statistics on how many are awarded)
* Education background
* Origin country
* Funding of study
* Level of program they enroll into
* Hours of study
* Scores

## Country of Origin

It turns out I only have access to figures for the top 15 countries from 2018:

* South Korea: 50,600
* Thailand: 28,608
* Pakistan: 28,023
* India: 23,198
* United States: 20,996
* Russia: 19,239
* Indonesia: 15,050
* Laos: 14,645
* Japan: 14,230
* Kazakhstan: 11,784
* Vietnam: 11,299
* Bangladesh: 10,735
* France: 10,695
* Mongolia: 10,158
* Malaysia: 9,479

https://www.researchcghe.org/perch/resources/publications/to-publish-wp46.pdf

The same paper also has the top 10 countries for 2000-2016, and the total number of international students. As I have HSK test data from 2012, following these proportions would let me estimate how many American students sat each of the tests. If I use a normal distribution of scores based on the earlier paper, I could simulate which students sat the HSK, at what level, and with what score.
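A minimal sketch of how the 2018 counts listed above could be turned into sampling probabilities. It only covers the top 15 countries, so students from anywhere else are ignored. The sample size and seed are my own arbitrary choices, and I use NumPy's `default_rng().choice`, the Generator equivalent of `np.random.choice`.

```python
import numpy as np

# 2018 student counts for the top 15 countries, as listed above.
counts = {
    "South Korea": 50600, "Thailand": 28608, "Pakistan": 28023,
    "India": 23198, "United States": 20996, "Russia": 19239,
    "Indonesia": 15050, "Laos": 14645, "Japan": 14230,
    "Kazakhstan": 11784, "Vietnam": 11299, "Bangladesh": 10735,
    "France": 10695, "Mongolia": 10158, "Malaysia": 9479,
}

countries = np.array(list(counts.keys()))
totals = np.array(list(counts.values()), dtype=float)

# Normalise so the probabilities sum to 1 (top 15 only).
probs = totals / totals.sum()

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility
n_students = 100                      # assumption: the minimum size in the brief

# Draw a country of origin for each simulated student.
origin = rng.choice(countries, size=n_students, p=probs)
```

Because the probabilities are normalised over the top 15 only, each country's share will be overstated compared with its share of all international students.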

It also has the percentage of students enrolled in full-time degrees, etc. I know most degrees require at least HSK 5 in order to enrol, so I could extrapolate from this to make an educated guess at the percentage of students in Chinese language undergraduate degrees, as these would not require an HSK result to enrol.

It also shows how many students were receiving scholarships until 2013, and what proportion were for non-degree students. If I were to assume it grew at about the same rate as overall international students, and that it's shared proportionally between students from various countries, I could look at who was self-funded versus on scholarship.



## HSK results

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625

108 participants from the US did the intermediate spoken exam and the HSK 4 written exam.

These students stayed in the country for 1 semester (about 3 months).

We also have the mean, min, max and std from that group.

![here](https://screenshot.click/28_19-215cg-skgcm.jpg)

I could use this to create a normal distribution of test scores from US students who have been studying in China. As I know what a passing score is, I could calculate if it was a pass or fail.
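A rough sketch of that idea follows. The mean and standard deviation here are placeholders I made up, to be replaced with the values from the table above. The 300-point maximum and the 180 pass mark for HSK 4 are my assumption and should be checked against the official test description.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# PLACEHOLDER values: replace with the mean and std reported in the study.
hsk4_mean = 200.0
hsk4_std = 30.0

# Assumption: HSK 4 is scored out of 300 and 180 is a pass.
HSK4_MAX = 300
HSK4_PASS = 180

n_students = 108  # same size as the US group in the study

# Draw scores from a normal distribution, then round and clip them
# so every score is a whole number inside the valid range.
raw = rng.normal(loc=hsk4_mean, scale=hsk4_std, size=n_students)
scores = np.clip(np.round(raw), 0, HSK4_MAX).astype(int)

# Pass/fail follows directly from the pass mark.
passed = scores >= HSK4_PASS
```

The reported min and max from the study could be used instead of 0 and 300 when clipping, if I want the simulated range to match the observed one more closely.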

Another source of data on HSK 4 results: http://dpi-proceedings.com/index.php/dtem/article/view/30976/29557

It shows the mean and std for 30 students from Beijing Language & Culture University.

![here](https://screenshot.click/28_02-0i7p9-b37me.jpg)



Some more results for 2010 including pass rates and average scores for each HSK level http://www.chinesetest.cn/gonewcontent.do?id=5589387 (Note - these are for tests taken outside China)

## Next steps

* Work out what proportion of international students might sit HSK (i.e. remove all degrees except Chinese language bachelors)
* Look at how many international students were in China during 2010-2012, and compare to number that sat HSK in China. Extrapolate that number to 2018 figures.
* Look at the proportion of HSK testers that took each level, and work out their respective probabilities.
* Do the same for the speaking/listening.
* This will give me the probability that a student in the class took each HSK exam.
* I can then simulate their score in each part of the test.
* From this I can work out if they passed or failed.
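The probability step in the list above could be sketched like this: the chance that a student took a given level is the chance they sat the HSK at all, multiplied by the share of test takers at that level. Both inputs below are hypothetical numbers, not figures from any of the sources.

```python
import numpy as np

# HYPOTHETICAL inputs: replace with researched values.
p_sits_hsk = 0.4                                                # P(student sits any HSK)
level_counts = np.array([10, 20, 30, 25, 10, 5], dtype=float)  # testers at levels 1-6

# Share of test takers at each level, given that they sat the HSK.
p_level_given_sits = level_counts / level_counts.sum()

# Unconditional probability that a student in the class sat each level.
p_level = p_sits_hsk * p_level_given_sits

# Add a "did not sit" outcome (coded as 0) so the probabilities sum to 1.
outcomes = np.arange(0, 7)
probs = np.concatenate(([1 - p_sits_hsk], p_level))

rng = np.random.default_rng(seed=7)  # arbitrary seed
level_attempted = rng.choice(outcomes, size=100, p=probs)
```

The same pattern would apply to the spoken exam, using its own probabilities.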


## Questions to be decided
* Am I just doing US students or shall I simulate international students too? 
This would give me another area to simulate, and does highlight the trend that most of the class are likely to be from Asian countries. If I use the 2010 results, it may also highlight that the average US student has a higher score than the average Korean student - that is for tests taken outside of China though, so isn't fully comparable.


## Potential data points
* Country - np.random.choice with probabilities for top 15 countries
* Course type - degree vs non-degree - binomial with 1 meaning degree
* Self-funded / scholarship - binomial with 1 meaning scholarship
* Attempted HSK written - binomial
* Written level attempted - np.random.choice
* Attempted HSK spoken - binomial
* Spoken level attempted - np.random.choice
* Results for each section - normal distributions for each
* Total score - total of the results of each section
* Pass/Fail - total compared to required pass score for that level.
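To check that these pieces fit together, here is a small sketch that builds most of the columns above as NumPy arrays. Every probability and score parameter is a placeholder. The country list is shortened to three entries, the spoken exam is left out, and I assume each written level from 3 to 6 has three 100-point sections with a pass mark of 180, which needs confirming.

```python
import numpy as np

rng = np.random.default_rng(seed=2018)  # arbitrary seed
n = 100

# All probabilities and score parameters below are PLACEHOLDERS.
countries = np.array(["South Korea", "Thailand", "United States"])
country = rng.choice(countries, size=n, p=[0.5, 0.3, 0.2])

degree = rng.binomial(1, 0.5, size=n)       # 1 = degree course
scholarship = rng.binomial(1, 0.1, size=n)  # 1 = scholarship
sat_written = rng.binomial(1, 0.4, size=n)  # 1 = attempted HSK written
sat = sat_written == 1

# Level attempted, with 0 meaning the student did not sit the written test.
levels = np.array([3, 4, 5, 6])
drawn = rng.choice(levels, size=n, p=[0.3, 0.4, 0.2, 0.1])
level = np.where(sat, drawn, 0)

# Three section scores (listening, reading, writing), each out of 100.
sections = rng.normal(loc=65, scale=12, size=(n, 3))
sections = np.clip(np.round(sections), 0, 100)

# Total is the sum of the sections; NaN where no test was taken.
section_sum = sections.sum(axis=1)
total = np.where(sat, section_sum, np.nan)

# Assumption: pass mark of 180 out of 300 for every level simulated here.
passed = sat & (section_sum >= 180)

data = {
    "country": country, "degree": degree, "scholarship": scholarship,
    "sat_written": sat_written, "level": level, "total": total,
    "passed": passed,
}
```

In the notebook, the `data` dictionary could be passed to a pandas `DataFrame` for display in an output cell.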


## Resources

https://ejournals.bc.edu/index.php/ihe/article/download/10945/9333/

Includes some statistics on education background and funding.

http://en.moe.gov.cn/documents/reports/201904/t20190418_378692.html

More information on funding, origin country, where they studied, education background.

https://www.researchgate.net/figure/Descriptive-statistics-of-general-proficiency-measured-by-HSK_tbl1_312107625
https://www.researchgate.net/figure/Correlations-among-proficiency-subskills-and-total-scores-of-pre-HSK-and-post-HSK-data_tbl4_325299887

109 US students measured on their Chinese proficiency upon returning to the US after 1 year in Beijing.

https://www.kaggle.com/kerneler/starter-china-scholarship-data-may-8638c810-6

Data on scholarships provided by Chinese universities.

http://blog.sina.com.cn/s/blog_53e7c11d0101f02j.html

Number of people that took HSK from 2009-2012

http://global.chinadaily.com.cn/a/201905/31/WS5cf0b106a3104842260bee25.html
6.8 million tests taken in 2018


https://forum.duolingo.com/comment/30363109/Percentage-of-users-who-complete-their-tree-for-each-language
Duolingo stats from 2019 suggesting 0.0124% complete the content. This covers 1000 characters, so not even HSK 4 level.

https://www.statista.com/statistics/430717/china-foreign-students-by-country-of-origin/
Foreign students by country of origin 2018.

https://www.echinacities.com/china-news/Is-the-HSK-Level-6-Test-Too-Difficult-Foreign-Test-Takers-Seem-to-Think-So
Why people don't go above level 4/5.

https://educationdata.org/international-student-enrollment-statistics
Statistics on US students abroad.