# Introduction to Project

## Project Description

For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:

- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set.

## Main Sources Used For Project

- Python[1] is the main programming language used for the project.

- Seaborn[2] is a library used for making attractive and informative statistical graphics in Python.

- Pandas[3] is a library used for data manipulation and analysis.

- numpy.random[4] is a subpackage of the NumPy package for working with random numbers.

## Aim for project

The aim of my project is to look at military spending by country for 2021. I first looked at World Population Review[5], as I wanted to see real-life data and what research had already been done in this area. A lot of that information is based on research by the Stockholm International Peace Research Institute (SIPRI Military Expenditure Database)[6], which I have used as the basis for much of my own research.

The main variables I will research are the population of the country, how much it spends on its military, the country's GDP and, finally, the percentage of GDP that military spending makes up per year.

As of 2021, world military expenditure has passed $2 trillion for the first time. Total global military expenditure increased by 0.7 per cent in real terms in 2021, to reach $2113 billion, with the five largest spenders being the US, China, India, the UK and Russia.[3]

In [7]:
# import python libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import string
import random

I first decided to look at real-life data. I had to adapt the database from SIPRI very slightly: I removed approximately the first 10 lines on each Excel sheet, as I was finding it difficult to read in the data. The lines I removed only provided general information on the database.
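As an alternative to editing the spreadsheet by hand, pandas can skip leading rows while reading. The sketch below is only an illustration: it uses a small in-memory CSV rather than the SIPRI workbook, and the file content and the number of skipped rows are invented for the example. `pd.read_excel` accepts the same `skiprows` parameter.

```python
import io
import pandas as pd

# Hypothetical file content: two lines of notes, then the real table.
raw = io.StringIO(
    "Notes about the database\n"
    "More general information\n"
    "Country,2020,2021\n"
    "A,10,12\n"
    "B,20,25\n"
)

# skiprows=2 drops the two note lines, so the third line becomes the header.
demo = pd.read_csv(raw, skiprows=2)
print(demo.columns.tolist())
print(demo.shape)
```

The right value for `skiprows` depends on how many note lines sit above the header in each sheet, so it would need to be checked against the actual file.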

In [3]:
df = pd.read_excel(r'C:\Users\Kenne\OneDrive\Desktop\Programming\PfDA_Project\data\SIPRI DATA.xlsx', sheet_name= 'Current US')

GDP = pd.read_excel(r'C:\Users\Kenne\OneDrive\Desktop\Programming\PfDA_Project\data\SIPRI DATA.xlsx', sheet_name= 'Share of GDP')

percentage = pd.read_excel(r'C:\Users\Kenne\OneDrive\Desktop\Programming\PfDA_Project\data\SIPRI DATA.xlsx', sheet_name= 'Share of Govt. spending')

print("The shape of the data from Current US$ is: \n",df.shape)

print("The shape of the data from GDP is: \n",GDP.shape)

print("The shape of the data from percentage of spending is: \n",percentage.shape)

The shape of the data from Current US$ is: 
 (191, 75)
The shape of the data from GDP is: 
 (191, 75)
The shape of the data from percentage of spending is: 
 (191, 37)


I decided to look at the shape of the data first. It shows that there are 191 rows in the Current US$ sheet, mostly countries and the continents they are in, and 75 columns, which display the amount spent by a country in dollars. The GDP sheet has exactly the same number of rows and columns, while the share of government spending sheet has 191 rows and 37 columns.
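For reference, `DataFrame.shape` returns a `(rows, columns)` tuple. A minimal sketch with a made-up frame (the column names and values are placeholders, not SIPRI data):

```python
import pandas as pd

# Toy frame: 3 rows (one per country) and 2 columns; names are illustrative.
toy = pd.DataFrame({"Country": ["A", "B", "C"], "Spend": [1.0, 2.0, 3.0]})

# shape is always (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```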

I also wanted to display that my idea for the 

In [14]:
df = pd.read_excel(r'C:\Users\Kenne\OneDrive\Desktop\Programming\PfDA_Project\data\SIPRI DATA.xlsx', sheet_name= 'Current US')

df.isnull().any(axis = 1).sum()

56

I saw an issue with the dataset: 56 rows contain at least one missing value.

If I were to do further analysis on this dataset, I think I would take specific years as opposed to all the data.
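The idea of counting gaps and then narrowing the analysis to specific years could be sketched as follows. This is a toy example: the country names, year columns and values are invented, and the real sheet may label its year columns differently.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; names, years and values are invented.
toy = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "2020": [1.0, np.nan, 3.0],
        "2021": [1.5, 2.5, 3.5],
    }
)

# Rows with at least one missing value (the same check as in the cell above).
rows_with_gaps = toy.isnull().any(axis=1).sum()

# Missing values per column, to see which years are incomplete.
gaps_per_column = toy.isnull().sum()

# Keep a single year and drop any rows that lack a value for it.
one_year = toy[["Country", "2021"]].dropna()
print(rows_with_gaps, len(one_year))
```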

## 2. Investigate the types of variables involved, their likely distributions, and their relationships with each other

In this part of the assignment, I am going to create the dataset using numpy.random and begin to investigate the distributions of the variables and their relationships with one another. 

Firstly, I will investigate the variables in the datasets used by SIPRI. I will use histograms and boxplots to show some of the variables in this dataset. I will then explore relationships that may exist between the variables using visualisations such as scatterplots and pairplots, and statistics such as correlation and covariance.
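The statistics mentioned above can be computed directly with pandas. The sketch below uses invented data, not the SIPRI figures: spending is deliberately built from GDP so that a relationship exists to be measured, and the seed and value ranges are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Invented data: spending is a small share of GDP, so the two are related.
gdp = rng.uniform(10, 1000, size=100)
share = rng.uniform(0.01, 0.05, size=100)
toy = pd.DataFrame({"GDP": gdp, "Spending": gdp * share})

summary = toy.describe()  # count, mean, std, quartiles for each variable
corr = toy.corr()         # Pearson correlation matrix
cov = toy.cov()           # covariance matrix
print(corr)
print(cov)
```

The same frame could then be passed to plotting functions such as `sns.histplot` or `sns.pairplot` to inspect the distributions and pairwise relationships visually.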

I will use this to create a synthesised data set.

I have decided to use the actual country names, purely because I find it easier to analyse data with a real name as opposed to a pseudonym. 

I will create each country's GDP, total spending and percentage of spending using numpy.random.

I will then use this data to visualise any relationships we may find.

For the purposes of this assignment, I am going to say that this is the data for 2021.

In my analysis of the dataset, I realised that I couldn't use a single randint call for all 4 columns, as each needs a different range of values, so I decided to generate each column separately and append the datasets together using code. 
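One possible way to give each column its own range while still ending up with a single table is to draw each column separately and collect the results in a dictionary. This is a minimal sketch, not the approach used in the cell below: the three country names are placeholders, the seed is arbitrary, and it uses NumPy's `default_rng` generator in place of `np.random.randint`. The ranges are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for repeatability
names = ["Country A", "Country B", "Country C"]  # placeholder names

# One (low, high) range per column; high is exclusive, as with randint.
ranges = {
    "Total Spending": (2_000_000, 190_000_000),
    "Total GDP": (1, 6),
    "Percentage of spending": (1, 36),
    "Per Capita": (4, 2025),
}

# Draw each column separately, then build one frame with all four columns.
data = {
    col: rng.integers(low, high, size=len(names))
    for col, (low, high) in ranges.items()
}
combined = pd.DataFrame(data, index=names)
print(combined)
```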

In [37]:
import numpy as np
import pandas as pd

# Used 248 as that is the number of countries

# Total spending: random integers from 2,000,000 up to (but not including) 190,000,000
Total_Spend = np.random.randint(2000000,190000000,size=(248,1))

# I created the list by taking all countries from this website - https://pytutorial.com/python-country-list

countries = ['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of the', 'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', "Korea, Democratic People's Republic of", 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 
'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'South Sudan', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Yemen', 'Zambia', 'Zimbabwe']

df1 = pd.DataFrame(Total_Spend, countries, columns= ['Total Spending'])



# GDP column: random integers from 1 to 5 (the upper bound of 6 is exclusive)
GDP = np.random.randint(1,6,size=(248,1))

df2 = pd.DataFrame(GDP, countries, columns= ['Total GDP'])



# Percentage of spending: random integers from 1 to 35
Percentage_Of_Spending = np.random.randint(1,36,size=(248,1))

df3 = pd.DataFrame(Percentage_Of_Spending, countries, columns= ['Percentage of spending'])


# Per capita spending: random integers from 4 to 2024
Per_Capita = np.random.randint(4,2025,size=(248,1))

df4 = pd.DataFrame(Per_Capita, countries, columns= ['Per Capita'])


# Write the first frame, then append the others; mode="a" adds each one below the previous rows
df1.to_csv("sythesized.csv")
df2.to_csv("sythesized.csv", mode="a")
df3.to_csv("sythesized.csv", mode="a")
df4.to_csv("sythesized.csv", mode="a")


## Information on How I Synthesised the Data

I was unable to get the columns to sit next to each other in the output of this code so, due to time constraints, I amended the data manually using Excel.
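For context, writing with `mode="a"` appends each frame underneath the previous one in the file, which is why the columns did not line up. One way to place them side by side in code would be `pd.concat` with `axis=1`, which aligns frames on their shared index. The sketch below uses two placeholder countries and writes to an in-memory buffer instead of a real file; it is an illustration, not the method used in this notebook.

```python
import io
import numpy as np
import pandas as pd

names = ["Country A", "Country B"]  # placeholder names
df_a = pd.DataFrame(np.random.randint(1, 6, size=(2, 1)), names, columns=["Total GDP"])
df_b = pd.DataFrame(np.random.randint(4, 2025, size=(2, 1)), names, columns=["Per Capita"])

# axis=1 aligns the frames on their index and puts the columns side by side.
side_by_side = pd.concat([df_a, df_b], axis=1)

# Written once, so the CSV has a single header row; a buffer stands in for a file.
buffer = io.StringIO()
side_by_side.to_csv(buffer)
print(buffer.getvalue())
```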

I have called the new file sythesized2. I added a dollar sign using Excel, just for simplicity, given the size of the numbers.

I tried to add the percentage sign to the GDP column, but I kept facing errors.
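Currency and percentage signs can also be added in pandas by formatting each value as a string. A common source of errors in this situation is trying to join a string directly to a numeric column, although that may not be what happened here. The sketch below uses placeholder names and values, and formats a copy so the original numeric columns remain available for analysis.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Total Spending": [2500000, 180000000], "Total GDP": [2, 5]},
    index=["Country A", "Country B"],  # placeholder names and values
)

# Format a copy so the numeric columns stay usable for calculations.
formatted = toy.copy()
formatted["Total Spending"] = toy["Total Spending"].map("${:,}".format)
formatted["Total GDP"] = toy["Total GDP"].map("{}%".format)
print(formatted)
```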

I tried to base the synthesised data as closely as possible on the original data file, which is why I ran the randint code 4 times: I wanted to provide different values for each column, so that the data would be as close to the original as I could possibly get it.