## Programming for Data Analysis - Project

### Problem statement

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python.


We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

* Investigate the types of variables involved, their likely distributions, and their relationships with each other.

* Synthesise/simulate a data set as closely matching their properties as possible.

* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

#### Note:
This project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

### Example project idea

As a lecturer I might pick the real-world phenomenon of the performance of students
studying a ten-credit module. After some research, I decide that the most interesting
variable related to this is the mark a student receives in the module - this is going to be
one of my variables (grade).

Upon investigation of the problem, I find that the number of hours on average a
student studies per week (hours), the number of times they log onto Moodle in the
first three weeks of term (logins), and their previous level of degree qualification (qual)
are closely related to grade. 

The hours and grade variables will be non-negative real numbers with two decimal places, logins will be a non-zero integer and qual will be a categorical variable with four possible values: none, bachelors, masters, or phd.

After some online research, I find that full-time post-graduate students study on average four hours per week with a standard deviation of a quarter of an hour and that a normal distribution is an acceptable model of such a variable. Likewise, I investigate the other three variables, and I also look at the relationships between the variables.

I devise an algorithm (or method) to generate such a data set, simulating values of the
four variables for two-hundred students. I detail all this work in my notebook, and then
I add some code in to generate a data set with those properties.
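As a rough illustration of the example idea, the sketch below simulates the four variables for two hundred students. Only the mean (4 hours) and standard deviation (0.25) of `hours` come from the brief. The distributions chosen for `logins`, `qual` and `grade`, and every other parameter, are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily
n = 200

# hours: normal with mean 4 and sd 0.25, as stated in the brief
hours = rng.normal(loc=4.0, scale=0.25, size=n).round(2)

# logins: assumed Poisson with rate 10, plus 1 so the count is never zero
logins = rng.poisson(lam=10, size=n) + 1

# qual: four categories with assumed probabilities
qual = rng.choice(["none", "bachelors", "masters", "phd"], size=n,
                  p=[0.1, 0.6, 0.25, 0.05])

# grade: assumed linear dependence on hours and logins plus noise, clipped to 0-100
noise = rng.normal(loc=0.0, scale=8.0, size=n)
grade = np.clip(10 * hours + 1.5 * logins + noise, 0, 100).round(2)

students = pd.DataFrame({"hours": hours, "logins": logins,
                         "qual": qual, "grade": grade})
print(students.head())
```

Building `grade` from the other variables, rather than drawing it independently, is one simple way to give the simulated columns a relationship with each other.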

#### Reference: Malawi Evidence of Tobacco Companies Affecting Restrictions: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2741530/


#### Reference: Philip Morris smoking advertising in India (Reuters): https://www.reuters.com/investigates/special-report/pmi-india/

#### Reference: Number of tobacco smokers worldwide from 2000 to 2025, by country income: https://www.statista.com/statistics/937428/tobacco-smoking-numbers-globally-country-income/#__sid=js4

#### Reference: ASH Fact sheet: Tobacco and the Developing World (ASH): https://ash.org.uk/wp-content/uploads/2019/07/ASH-Factsheet_Developing-World_v3.pdf

#### Reference (The Conversation): https://theconversation.com/big-tobacco-goes-after-the-young-in-developing-markets-in-a-case-of-history-repeated-82043

#### Reference: Cigarette consumption per year, 1970-2015: https://www.bmj.com/content/bmj/365/bmj.l2231.full.pdf

#### Reference: tobacco industry Indonesia: https://www.statista.com/topics/5728/tobacco-industry-in-indonesia/

#### Reference: https://tobacco.publichealth.gsu.edu/resources/data/

#### Reference: Improving the implementation of tobacco control policies in low-and middle-income countries: a proposed framework: https://gh.bmj.com/content/4/6/e002078

#### Reference: cigarette labels: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4276461/

#### Reference: WHO report 2019, tobacco controls: https://www.who.int/publications/i/item/WHO-NMH-PND-2019.5

#### WHO information on India tobacco controls: https://www.who.int/tobacco/about/partners/bloomberg/ind/en/#:~:text=Several%20provisions%20of%20the%20law,is%20also%20restricted%20in%20India.

#### Percentage of deaths from smoking as a share overall: Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2017 (GBD 2017) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2018.: http://ghdx.healthdata.org/gbd-results-tool

#### India smoking controls and effect on death rates: https://www.researchgate.net/publication/333906455_Recent_trends_of_tobacco_use_in_India

In [1]:
import pandas as pd

### The Variables

1. **Country:** Categorical variable

> I will choose 5 developed countries where campaigns have been introduced to combat deaths from smoking. I will also choose 5 developing nations where little or no effort has been made to combat smoking.


2. **Years:** Categorical variable

> I will be choosing a 20-year period, from 2000 to 2020, with one observation for each country per year.


3. **Deaths from smoking:** Numerical variable 

> This datapoint should represent deaths attributed to smoking per 100,000 of the population. It will closely resemble the distribution of real data on this subject I discovered on *Our World in Data's* smoking series. 


4. **Rating of Government anti-smoking campaigns:** Numerical variable

> This figure should be a single rating that condenses the government's various efforts to restrict tobacco sales, in the manner of the standardised ratings commonly used to summarise subjects with many component elements. This hypothetical variable will be a floating point number between 1 and 5. It will be the most difficult variable to simulate realistically. I will achieve this by basing it on a similar rating used in a dataset on the number of deaths associated with opioid use per country.


5. **Combined Profits of the Big 4 Smoking Companies:** Numerical variable

> This figure should represent the combined profits of the big 4 tobacco companies (listed above) for every year from 2000 to 2020. Although the real figures are available, this is an exercise in simulation, so we will simulate this data too, while keeping the distribution realistic. For example, the combined profits in 2018 were just under 125 billion dollars.

6. **Influence of Tobacco Lobbying:** Categorical variable
 
> This variable should represent the influence that Big Tobacco and lobbying groups have in the countries and an overall reflection of their efforts to curtail government imposed tobacco control. The variable will contain 4 different categories stored as the following strings: "Weak", "Moderate", "Significant", "Strong". 
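A minimal sketch of how the rating and lobbying variables described in the list above could be generated with `numpy.random`. The number of observations, the uniform distribution for the rating and the category probabilities are all assumptions for illustration, not figures taken from any dataset.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 63  # hypothetical number of country-year observations

# Rating: floats on the 1-5 scale described above, rounded to 2 decimals
rating = rng.uniform(low=1.0, high=5.0, size=n).round(2)

# Lobbying influence: four categories stored as strings; probabilities are assumed
levels = ["Weak", "Moderate", "Significant", "Strong"]
lobbying = rng.choice(levels, size=n, p=[0.2, 0.3, 0.3, 0.2])

print(rating[:5])
print(lobbying[:5])
```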



## Basis for Variables

### Variable 1: Country

#### Comparison of Trends in Smoking Consumption

The countries included in this simulation will fall into two categories: low-income and high-income nations. For each category, 3 countries have been chosen. A separate dataset will be created for each category of country to allow for effective comparison of trends. All the countries included have been chosen as they exhibit varying degrees of smoking consumption, tobacco control and lobbying from the tobacco industry. It will be shown that lower-income countries exhibit a trend towards increased or stable tobacco-related deaths.

The 6 countries included in this project are presented in the table below:

| Low-income Countries | High-income Countries |
|:---:|:---:|
| Indonesia | Sweden |
| Myanmar | France |
| India | Austria |
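The two datasets described above could start from a country-by-year skeleton such as the one sketched here. The helper name `make_skeleton`, the `income` column and the 2000-2020 year range (taken from the code cells later in the notebook) are assumptions.

```python
import numpy as np
import pandas as pd

low_income = ["Indonesia", "Myanmar", "India"]
high_income = ["Sweden", "France", "Austria"]
years = np.arange(2000, 2021)  # assumed range, matching the later code cells


def make_skeleton(countries, income_group):
    """Return one row per country per year for the given income group."""
    return pd.DataFrame({
        "countries": np.repeat(countries, len(years)),
        "years": np.tile(years, len(countries)),
        "income": income_group,
    })


low_df = make_skeleton(low_income, "low")
high_df = make_skeleton(high_income, "high")
print(low_df.shape, high_df.shape)
```

Together the two frames hold 126 rows, which would satisfy the brief's requirement of at least one hundred data points.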

Section ___ below investigates the smoking trends in 1. Indonesia, 2. India and 3. Sweden.

#### Indonesia

The two graphs below display the percentage of people aged 15 and over in South East Asia and the Indian sub-continent who smoked in (1) 2007 and (2) 2018 (World Health Organization, Global Health Observatory Data Repository).

##### 2007: 
<img src="Smoking_se_asia_2007.PNG" alt="Drawing" style="width: 650px;"/>

#####  2018:
<img src="Smoking_se_asia_2018.PNG" alt="Drawing" style="width: 650px;"/>

Firstly, we will look at Indonesia, a low-income country where tobacco lobby influence is said to be strong and where a high level of tobacco consumption has prevailed for the past 30 years, for a myriad of reasons. Recent studies on smoking in Indonesia point to factors such as heavy reliance on the tobacco crop, together with strong influence from Big Tobacco impeding restrictions, to explain the resilience of the smoking industry. The alarming nature of the situation was clear back in 1999, when Catherine Reynolds noted specific details of the steep increase in smoking:

> * Male participation estimates range from 50% to 85%
> * Since 1970–72, per adult consumption of cigarettes (all forms) has more than doubled, from 500 to 1180 per adult
> * By 1985, a Jakarta study reported that 49% of boys and 9% of girls aged 10–14 were daily smokers
> * By 1995, a health department survey estimated that 22.9% of urban 10 year olds, and 24.8% of rural 10 year olds smoke.
 
This appears to have been viewed with enthusiasm by some members of the Indonesian government, as Reynolds notes that a government report issued in 1991 stated: “Prospects for further market growth are considered good. Consumption levels per head of population are low by international standards. . . . A high proportion of Indonesia’s population is in the younger age groups, meaning that the potential population of smokers will be growing rapidly in the next decade at least.”

Keeping this in mind, Indonesia will be utilised as an example of a country where Big Tobacco influence is strong and smoking controls are weak. Myanmar displays a similar trend due to the same issues as Indonesia. As a result, similar distributions of random data will be created for both countries.

#### India

In contrast, India is a country that has successfully introduced smoking controls in recent years. In 2003, India overcame court challenges that sought to prevent the introduction of smoke-free public places, restrictions on tobacco advertising and promotion, and other measures. Following this, the country joined the WHO Framework Convention on Tobacco Control. Throughout the 2010s, the government incrementally brought in measures such as warning labels on tobacco products, higher taxes on products and increased powers for police to sanction those who break advertising laws (WHO, 2015).

Despite this, the percentage of deaths attributed to smoking, as a share of overall deaths in the country, rose from 7.76% in 1990 to 9.03% in 2017 (Global Burden of Disease Collaborative Network, 2017). This underlines that control measures have limits and take time to show results. Chhabra et al. (2019) outline that smoking remains high among middle-aged adults, less educated people and those living in rural areas. However, it is noted that the number of young and educated adult smokers has dropped significantly (Chhabra et al., 2019). This can account for the graph of current tobacco use above, in which tobacco use in India drops from 38.5% of the population in 2007 to 27% in 2018.

With this understanding of the demographics, we can assume that the distribution of datapoints for India should differ from that of Indonesia. As this project will attempt to recreate realistic distributions using random data, it is evident that, in terms of deaths, India still suffers from a high death rate. Therefore, the datapoints of the Death Rates variable should be drawn from a uniform distribution, with some variation added. For the variables of government controls and tobacco company profits, however, the random data produced should be separated into two categories: pre-2003 and post-2003. The introduction of the FCTC catalysed the introduction of control measures. The data before 2003 should resemble that of Indonesia and Myanmar, but the data after 2003 should represent a comparative improvement.
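The pre-2003 and post-2003 split described for India could be sketched as follows. The two rating bands are assumed values chosen only to show the mechanism, and treating 2003 itself as part of the later period is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2003)  # arbitrary seed
years = np.arange(2000, 2021)      # assumed year range

# Assumed rating bands: weak controls before 2003, stronger controls afterwards
pre = rng.uniform(low=1.0, high=2.0, size=years.size)
post = rng.uniform(low=2.5, high=4.0, size=years.size)

# Pick the earlier band for years before 2003 and the later band otherwise
rating = np.where(years < 2003, pre, post).round(2)

india = pd.DataFrame({"countries": "India", "years": years, "rating": rating})
print(india.groupby(india["years"] >= 2003)["rating"].mean())
```

Drawing both bands for every year and then selecting with `np.where` keeps the code short, at the cost of generating some values that are never used.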

### Variable 2: Years



### Variable 4: Rating of Government Anti-smoking Campaigns

#### Smoking Deaths and the Relationship with Investment in Anti-smoking Campaigns

In the past two decades there has been a considerable effort by many developed nations to decrease smoking. This effort has taken many forms. Many countries have raised the taxes that consumers pay on cigarette purchases, both to deter people from smoking and to cover the cost of the healthcare that long-term smokers require. Additionally, restrictions have been placed on the sale of cigarettes, such as removing the branding that tobacco companies are allowed to use. This follows on from laws introduced in the 1990s and 2000s making it illegal to advertise cigarettes to consumers. Finally, campaigns have been launched across Europe, America and in many developing nations (though not all) to highlight the high correlation between smoking and various forms of cancer.

The result is a considerable decrease in smoking-related deaths per 100,000 of the population in the countries where restrictions and campaigns were launched. This success should not be diminished: it highlights that a conscious effort can significantly affect people's mindset, and it forms part of a greater worldwide awareness of what we put into our bodies.
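One way to build this inverse relationship into simulated data is to derive the death rate from the rating and then add noise. The linear form, the intercept of 14, the slope of 2 and the noise level below are all assumed for illustration and are not estimated from real data.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n = 63

rating = rng.uniform(low=1.0, high=5.0, size=n)

# Assumed linear relationship: each rating point lowers deaths by 2 per 100,000,
# with normally distributed noise; values are floored at zero
noise = rng.normal(loc=0.0, scale=1.0, size=n)
deaths = np.clip(14.0 - 2.0 * rating + noise, 0.0, None).round(1)

corr = np.corrcoef(rating, deaths)[0, 1]
print(f"correlation between rating and deaths: {corr:.2f}")
```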

Despite this success, however, there is evidence that in some developing nations, where smoking is more prevalent to begin with and where little has been done to restrict how tobacco companies market and sell their products, smoking-related deaths have not decreased but rather increased.

Looking at the sales of the top 5 tobacco companies, or 'Big Tobacco' as they are collectively known, these companies have seen an increase in revenues in some countries in recent years. Some articles have even highlighted how this is arguably the most profitable time in history for tobacco companies.

Having researched this phenomenon online and compared it against datasets on deaths from smoking accessed on *Our World in Data*, I have chosen to investigate this example and synthesise random data into particular distributions. These distributions are based on plausible conclusions drawn from reliable resources. As the focus of this project is the methodology used in creating random data, giving it a realistic shape and investigating this simulation, the project is not intended to be empirical evidence.

### Variable 5: Combined Profits of the 'Big Four' Cigarette Companies



### Variable 6: Influence of Tobacco Lobbying

#### Concern Over the Tactics of Tobacco Industry in Low-income Countries

In July 2019, ASH (Action on Smoking and Health), a major group responsible for highlighting the consequences of smoking, opened its report on smoking in developing countries with the following statement:

> "Around 1.1 billion people aged 15 and over smoke, with 80% living in LMICs (low and middle income
countries). Tobacco growing and consumption have become concentrated in the developing world where
the health, economic, and environmental burden is heaviest and likely to increase."

ASH's assertion is backed up by data showing a surge in tobacco company profits, by studies carried out throughout Africa and Southeast Asia of tactics used by 'Big Tobacco' to curtail government efforts to tackle smoking, and by numerous verified reports of illegal marketing ploys in low-income countries.

Since 1984, British American Tobacco (BAT), in association with the International Tobacco Growers’ Association (ITGA), an organisation that was founded to lobby for Big Tobacco, has used various tactics to pressure low-income countries to abandon restrictions on the growing and sale of tobacco. In Malawi, a country that has a long history of tobacco crop farming, this led a government official to push back against the World Health Organisation on the issue of tobacco control, on the basis that it would negatively impact the economy of Malawi. Indeed, today tobacco sales account for 70% of Malawi's foreign earnings (Mamudu et al., 2009). This economic dependency, together with the tactics of the tobacco lobbies, led the ITGA to platform tobacco growers' representatives against the bodies of the United Nations responsible for tobacco control. This tactic formed part of a larger strategy, drawing on representatives from the industry in low-income countries, and eventually led to the weakening of the focus on health in the UN's tobacco control narrative (Mamudu et al., 2009).

In 2017, in an exposé on the marketing ploys of Philip Morris in India, Reuters reported that the company responsible for the Marlboro brand had breached India's anti-smoking laws. The tactics used included the placement of colourful advertisements in kiosks across New Delhi and the handing out of free cigarettes to young people at parties (Reuters, 2017).

Indonesia is a country where tobacco growing and consumption are notably high and where tobacco companies focus on quelling restrictions. In 2017, it was reported that Indonesia consumed 322 million cigarettes. A survey in 2019 found that curiosity was the primary reason for Indonesians to start smoking, and that the average age at which a smoker had their first cigarette was 16-18 (Hirschmann, December 2020).

Overall, there is consensus that tobacco lobbying in low-income countries has had a considerable effect on the introduction of even minor forms of tobacco control. A 2016 study specifically examined the introduction of health warning labels on cigarette packaging that comply with the WHO's Framework Convention on Tobacco Control (FCTC). It concluded that countries where state capacity was low were less likely to introduce the warning labels (Hiilamo and Glantz, 2016). This remains an ever-present issue when it comes to tackling cigarette consumption globally.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist
import seaborn as sns
# Set seed of the generator
rng = np.random.default_rng(0)

In [48]:
# Build the country and year columns (beginning with just 3 countries):
# 3 countries x 21 years (2000-2020) = 63 rows
country_names = ["Afghanistan", "Indonesia", "Ireland"]
year_range = np.arange(2000, 2021)
df3 = pd.DataFrame({"countries": np.repeat(country_names, len(year_range)),
                    "years": np.tile(year_range, len(country_names))})

#print(df3)

# Use rng.uniform() to create 63 random floats to represent the deaths per 100000
death = rng.uniform(low=0.5, high=13.3, size=(63,))
# Round to 1 decimal place and assign new variable
deaths = death.round(1)

# Generate 63 random integers in the range 45-404 to represent profits (in billions)
profits = rng.integers(low=45, high=405, size=(63,))
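The cell above draws a separate profit figure for every row, whereas the variable description treats combined profits as one figure per year. One possible alternative, sketched here under that reading, is to draw a single value per year and look it up for each row. The names `frame` and `yearly_profits` are hypothetical, and the bounds are simply reused from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = np.arange(2000, 2021)
countries = ["Afghanistan", "Indonesia", "Ireland"]

frame = pd.DataFrame({"countries": np.repeat(countries, years.size),
                      "years": np.tile(years, len(countries))})

# One simulated profit figure (in billions) per year, indexed by year
yearly_profits = pd.Series(rng.integers(low=45, high=405, size=years.size),
                           index=years)

# Look up each row's year so every country shares the same figure in a given year
frame["profits"] = frame["years"].map(yearly_profits)
print(frame[frame["years"] == 2000])
```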

In [63]:
rating1 = rng.uniform(low=0.5, high= 0.9, size=(63,) )
rate1 = rating1.round(1)

In [None]:
rating2 = rng.uniform(low=1.5, high= 1.9, size=(63,) )
rate2 = rating2.round(1)

In [66]:
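# The bounds 2.5 and 4.8 are not whole numbers, so rng.uniform() is used
# here rather than rng.integers()
a = 5. # shape (defined here so the cell runs on its own)
x = rng.uniform(low=2.5, high=4.8, size=(63,))
# Power-law transform of x
y = a*x**(a-1.)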


In [None]:
# Use rng.uniform() to create 63 random floats to represent the rating of restrictions against smoking companies
rating = rng.uniform(low=0.01, high=4.99, size=(63,))
# Round to 2 decimal places and assign new variable
rate = rating.round(2)

In [86]:
a = 5. # shape
samples = 100
s = np.random.power(6, 63)
s

array([0.61594829, 0.86043806, 0.83900168, 0.99087033, 0.97414128,
       0.98824251, 0.92702514, 0.90145151, 0.74227649, 0.80293491,
       0.8996813 , 0.99889234, 0.66094081, 0.91975935, 0.69867993,
       0.85695967, 0.87900948, 0.94530683, 0.77349556, 0.94945809,
       0.80205785, 0.90667127, 0.92743803, 0.87739437, 0.97186139,
       0.69591275, 0.94530739, 0.9097082 , 0.76857325, 0.88488496,
       0.65627534, 0.89533285, 0.96099708, 0.6485367 , 0.94393608,
       0.98735524, 0.72617011, 0.89316335, 0.66949062, 0.81952475,
       0.99854345, 0.72695854, 0.96621761, 0.6424083 , 0.928477  ,
       0.7229036 , 0.91373522, 0.59821234, 0.82769347, 0.97983369,
       0.90700589, 0.9588419 , 0.98714671, 0.85701171, 0.53082645,
       0.9817696 , 0.96865464, 0.92894731, 0.84367126, 0.74869936,
       0.99441162, 0.92065726, 0.98904831])
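The power-distribution samples above all lie between 0 and 1 and cluster near 1. If they were to feed the rating variable, they would need mapping onto its scale. The sketch below assumes a simple linear rescaling onto 1-5 and uses `rng.power` rather than `np.random.power` so that the seeded generator controls the draw. Both choices are assumptions, not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a power distribution with shape 6 lie in [0, 1], skewed toward 1
s = rng.power(6, size=63)

# Linearly rescale from [0, 1] to the assumed rating scale [1, 5]
rating = (1.0 + 4.0 * s).round(2)

print(rating.min(), rating.max(), np.median(rating))
```

A skew toward the top of the scale would suit countries with strong controls. Using `1 - s` in place of `s` before rescaling would skew the ratings toward the bottom instead.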

In [89]:
a = 63. # shape
samples = (2.5, 4.4)
s = rng.exponential(scale=0.6, size=63)
s

array([0.47889256, 0.31042901, 0.29759433, 0.94377475, 0.11065601,
       1.51689731, 0.08162642, 0.55409974, 0.99495037, 0.56962026,
       3.52856377, 1.17718533, 0.70641409, 0.27593868, 0.40030991,
       0.85159645, 0.20944205, 0.29927152, 1.1479483 , 0.94257671,
       1.55142506, 1.690961  , 0.54506534, 1.1278349 , 3.29753965,
       0.21758614, 0.23932282, 0.2524039 , 0.63449364, 0.54644301,
       0.32211444, 0.08428122, 2.95328211, 0.18998437, 0.38275771,
       0.99631936, 1.96274853, 0.40899286, 0.37166419, 0.0791759 ,
       0.64325502, 0.39663984, 0.20256586, 1.05641139, 0.73390525,
       0.69146781, 0.34182378, 0.6762029 , 4.24361198, 0.13978488,
       1.36574852, 0.32021056, 0.36615747, 0.05121025, 1.15855869,
       0.02692571, 0.03368447, 0.80158545, 0.12356494, 0.96279874,
       0.19948641, 2.00154919, 0.0098469 ])

In [90]:
# Create dataframe of numeric variables (ravel() flattens each array to 1-D)
df1 = pd.DataFrame({'deaths': deaths.ravel(), 'rate': rate.ravel(),
                     'profits': profits.ravel() })

# Concatenate the country/year dataframe and the numeric dataframe column-wise (axis=1)
result = pd.concat([df3, df1], axis=1, join='inner')
print(result)

      countries  years  deaths  rate  profits
0   Afghanistan   2000     8.0  1.48       97
1   Afghanistan   2001     2.0  2.10      212
2   Afghanistan   2002     9.1  4.80      301
3   Afghanistan   2003     0.6  2.30      308
4   Afghanistan   2004     2.8  4.74       62
5   Afghanistan   2005     5.9  0.16      248
6   Afghanistan   2006     5.3  0.34       76
7   Afghanistan   2007     2.0  0.15      397
8   Afghanistan   2008     6.0  3.33      230
9   Afghanistan   2009     8.5  1.11      196
10  Afghanistan   2010     5.3  2.88      189
11  Afghanistan   2011     9.6  3.97      400
12  Afghanistan   2012     3.5  1.66      311
13  Afghanistan   2013     2.3  1.23      194
14  Afghanistan   2014    10.1  3.62      307
15  Afghanistan   2015     9.1  2.38      110
16  Afghanistan   2016     6.0  0.75      188
17  Afghanistan   2017     2.3  0.45      326
18  Afghanistan   2018     9.0  3.68      336
19  Afghanistan   2019    10.1  4.29      142
20  Afghanistan   2020     2.6  4.

In [None]:
# Create dataframe of numeric variables; each array is passed directly so that
# it becomes a column of 63 values rather than a single cell holding an array
df1 = pd.DataFrame({'deaths': deaths,
                    'rate': rate,
                    'profits': profits})

#print(df1)

# Concatenate the country/year dataframe and the numeric dataframe column-wise (axis=1)
result = pd.concat([df3, df1], axis=1, join='inner')
print(result)