# Project 1 (Due Nov 13)

The goal of the first project is to non-parametrically model some phenomenon of interest, and generate sequences of values. There are six options below:

- Chordonomicon: 680,000 chord progressions of popular music songs. Create a chord generator, similar to what we did with Bach in class, but for a particular artist or genre. (https://github.com/spyroskantarelis/chordonomicon)
- Financial time series, S&P 500 stocks: There are 500 time series here. Model how individual time series evolve over time, either together or separately. (https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks)
- MIT-BIH Arrhythmia Database: Arrhythmia is an abnormal heart rhythm. This is a classic dataset of 48 half-hour ECG recordings from 47 subjects, excerpted from a collection of 4,000 24-hour ambulatory recordings. (https://www.physionet.org/content/mitdb/1.0.0/)
- Ukraine Conflict Monitor: The ACLED Ukraine Conflict Monitor, maintained by the Armed Conflict Location & Event Data Project, provides near real-time information on the ongoing war in Ukraine, including an interactive map, a curated data file, and weekly situation updates. Coverage starts in 2022 and includes battles, explosions/remote violence, violence against civilians, protests, and riots:
https://acleddata.com/monitor/ukraine-conflict-monitor
- SIPRI Arms Trade: The SIPRI Arms Transfers Database is a comprehensive public resource tracking all international transfers of major conventional arms from 1950 to the present. For each deal, information includes: number ordered, supplier/recipient identities, weapon types, delivery dates, and deal comments. The database can address questions about: who are suppliers and recipients of major weapons, what weapons have been transferred by specific countries, and how supplier-recipient relationships have changed over time.
https://www.sipri.org/databases/armstransfers
- Environmental Protection Agency data: The EPA has excellent data on the release of toxic substances, and I also tracked down data on air quality and asthma. You can put these together to look at how changes in toxic releases correlate with air quality and respiratory disease over time:
https://www.epa.gov/data
https://www.epa.gov/toxics-release-inventory-tri-program/tri-toolbox
https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm
https://www.earthdata.nasa.gov/topics/atmosphere/air-quality/data-access-tools

If you have other data sources that you're interested in, I am willing to consider them, as long as they lend themselves to an interesting analysis.

Submit a document or notebook that clearly addresses the following:
1. Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why? (10/100 pts)
2. What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate. (10/100 pts)
3. Describe your non-parametric model (empirical cumulative distribution functions, kernel density estimators, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon so that it reproduces realistic properties of the data? What challenges did you have to overcome? (15/100 pts)
4. Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential). (15/100 pts)
5. Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? (15/100 pts)
6. Write a conclusion that explains the limitations of your analysis and potential for future work on this topic. (10/100 pts)
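
As one illustration of the Markov transition models named in item 3 and the sequence generation in item 4, here is a minimal sketch, not a required approach. The toy `sequence`, the function names `fit_transition_matrix` and `generate_sequence`, the uniform fallback for unseen states, and the seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fit_transition_matrix(sequence):
    """Estimate first-order transition probabilities from observed state pairs."""
    current = pd.Series(sequence[:-1], name="current")
    following = pd.Series(sequence[1:], name="next")
    counts = pd.crosstab(current, following)
    # Make the matrix square so every observed state has a row and a column.
    states = sorted(set(sequence))
    counts = counts.reindex(index=states, columns=states, fill_value=0)
    row_totals = counts.sum(axis=1)
    # Normalise each row; avoid dividing by zero for states with no outgoing pairs.
    probs = counts.div(row_totals.where(row_totals > 0, 1), axis=0)
    # Assumption: states never seen as a "current" state get a uniform row.
    probs.loc[row_totals == 0] = 1.0 / len(states)
    return probs


def generate_sequence(probs, start, length, seed=0):
    """Sample a new sequence by walking the fitted transition matrix."""
    rng = np.random.default_rng(seed)
    states = list(probs.columns)
    out = [start]
    for _ in range(length - 1):
        idx = rng.choice(len(states), p=probs.loc[out[-1]].to_numpy())
        out.append(states[idx])
    return out


# Toy chord-like sequence, purely illustrative.
sequence = ["C", "G", "Am", "F", "C", "G", "F", "C", "Am", "F", "G", "C"]
probs = fit_transition_matrix(sequence)
new_sequence = generate_sequence(probs, start="C", length=8)
print(new_sequence)
```

A first-order chain can only emit transitions that appear in the training data, which is one property worth checking when evaluating generated sequences in item 5.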

In addition, submit a GitHub repo containing your code and a description of how to obtain the original data from the source. Make sure the code is commented, where appropriate. Include a .gitignore file. We will look at your commit history briefly to determine whether everyone in the group contributed. (10/100 pts)

In class, each group will give a brief presentation and critique the other groups' work. Participating in your group's presentation and constructively critiquing the other groups' presentations accounts for the remaining 15/100 pts.


In [1]:
import pandas as pd

In [2]:
# Load the SIPRI trade register export, skipping the 11 lines that precede the header row
data = pd.read_csv('trade-register.csv', encoding='latin-1', skiprows=11)

In [3]:
data

Unnamed: 0,Recipient,Supplier,Year of order,Unnamed: 4,Number ordered,.1,Weapon designation,Weapon description,Number delivered,.2,Year(s) of delivery,status,Comments,SIPRI TIV per unit,SIPRI TIV for total order,SIPRI TIV of delivered weapons
0,Afghanistan,Turkiye,2007.0,,24.0,,M-114 155mm,towed gun,24.0,,2007,Second hand,Second-hand; aid,0.20,4.80,4.80
1,Afghanistan,United States,2004.0,?,188.0,?,M-113,armoured personnel carrier,188.0,?,2005,Second hand,Second-hand; aid; M-113A2 version; incl 15 M-5...,0.10,18.80,18.80
2,Afghanistan,United States,2016.0,,53.0,,S-70 Black Hawk,transport helicopter,53.0,?,2017; 2018; 2019; 2020,Second hand but modernized,Second-hand UH-60A modernized to UH-60A+ befor...,4.29,227.37,227.37
3,Afghanistan,Soviet Union,1973.0,?,100.0,?,T-62,tank,100.0,?,1975; 1976,New,,1.80,180.00,180.00
4,Afghanistan,Soviet Union,1978.0,?,500.0,?,T-55,tank,500.0,?,1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986...,Second hand,Second-hand; aid,0.50,250.00,250.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28345,Zimbabwe,China,2004.0,?,10.0,?,Type-85,armoured personnel carrier,10.0,?,2004,New,,0.30,3.00,3.00
28346,Zimbabwe,China,2004.0,?,5.0,?,Type-89/ZSD-89,armoured personnel carrier,5.0,?,2004,New,ARV version,0.30,1.50,1.50
28347,Zimbabwe,Soviet Union,1975.0,?,15.0,?,T-34-85,tank,15.0,?,1975,Second hand,Second-hand; supplier uncertain,0.38,5.70,5.70
28348,Zimbabwe,Ukraine,2005.0,,6.0,,AI-25,turbofan,6.0,,2005,New,For 6 K-8 trainer aircraft from China,0.60,3.60,3.60


In [4]:
# Drop the columns at positions 3, 5, and 9 ('Unnamed: 4', '.1', and '.2' in the output above)
data = data.drop(data.columns[[3, 5, 9]], axis=1)

In [5]:
data.head()

Unnamed: 0,Recipient,Supplier,Year of order,Number ordered,Weapon designation,Weapon description,Number delivered,Year(s) of delivery,status,Comments,SIPRI TIV per unit,SIPRI TIV for total order,SIPRI TIV of delivered weapons
0,Afghanistan,Turkiye,2007.0,24.0,M-114 155mm,towed gun,24.0,2007,Second hand,Second-hand; aid,0.2,4.8,4.8
1,Afghanistan,United States,2004.0,188.0,M-113,armoured personnel carrier,188.0,2005,Second hand,Second-hand; aid; M-113A2 version; incl 15 M-5...,0.1,18.8,18.8
2,Afghanistan,United States,2016.0,53.0,S-70 Black Hawk,transport helicopter,53.0,2017; 2018; 2019; 2020,Second hand but modernized,Second-hand UH-60A modernized to UH-60A+ befor...,4.29,227.37,227.37
3,Afghanistan,Soviet Union,1973.0,100.0,T-62,tank,100.0,1975; 1976,New,,1.8,180.0,180.0
4,Afghanistan,Soviet Union,1978.0,500.0,T-55,tank,500.0,1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986...,Second hand,Second-hand; aid,0.5,250.0,250.0
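
Dropping columns by position depends on the column order of the export staying the same. A minimal sketch of selecting the same kind of columns by name instead, assuming the placeholder headers look like those in the cell 3 output (`Unnamed: 4`, `.1`, `.2`); the toy frame and the helper name `is_placeholder_column` are hypothetical:

```python
import re

import pandas as pd


def is_placeholder_column(name):
    """True for headers such as 'Unnamed: 4' or '.1' that pandas gives to blank columns."""
    name = str(name).strip()
    return name.startswith("Unnamed:") or re.fullmatch(r"\.\d+", name) is not None


# Toy frame reusing header names shown in the cell 3 output (values are illustrative).
toy = pd.DataFrame(
    {
        "Recipient": ["Afghanistan", "Afghanistan"],
        "Year of order": [2007.0, 2004.0],
        "Unnamed: 4": [None, "?"],
        "Number ordered": [24.0, 188.0],
        ".1": [None, "?"],
    }
)

placeholder_columns = [c for c in toy.columns if is_placeholder_column(c)]
trimmed = toy.drop(columns=placeholder_columns)
print(trimmed.columns.tolist())
```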


In [6]:
# Let's look at the number of missing values in each column

data.isnull().sum()

Recipient                            2
Supplier                             0
Year of order                        0
Number ordered                     110
Weapon designation                   2
Weapon description                   2
Number delivered                     2
Year(s) of delivery                  2
status                               2
Comments                          5179
SIPRI TIV per unit                   4
SIPRI TIV for total order            4
SIPRI TIV of delivered weapons       4
dtype: int64

In [7]:
data[data["SIPRI TIV of delivered weapons"].isna()]

Unnamed: 0,Recipient,Supplier,Year of order,Number ordered,Weapon designation,Weapon description,Number delivered,Year(s) of delivery,status,Comments,SIPRI TIV per unit,SIPRI TIV for total order,SIPRI TIV of delivered weapons
24138,Thailand,Ukraine,2008.0,14.0,BTR-3,armoured personnel carrier,13.0,2010; 2012,New,Part of THB4b ($120 m) deal (for 96 BTR-3 in s...,,,
24139,,0.25,3.5,,,,,,,,,,
26821,United Nations**,Israel,2015.0,3.0,Hermes-900,UAV,3.0,2016,New,3-year lease; for use with UN peacekeeping for...,,,
26822,,3,9.0,,,,,,,,,,


In [8]:
# Drop the two rows with no Recipient found above; they are not complete records
data = data.drop([24139, 26822], axis=0)

In [9]:
data[data["SIPRI TIV of delivered weapons"].isna()]

Unnamed: 0,Recipient,Supplier,Year of order,Number ordered,Weapon designation,Weapon description,Number delivered,Year(s) of delivery,status,Comments,SIPRI TIV per unit,SIPRI TIV for total order,SIPRI TIV of delivered weapons
24138,Thailand,Ukraine,2008.0,14.0,BTR-3,armoured personnel carrier,13.0,2010; 2012,New,Part of THB4b ($120 m) deal (for 96 BTR-3 in s...,,,
26821,United Nations**,Israel,2015.0,3.0,Hermes-900,UAV,3.0,2016,New,3-year lease; for use with UN peacekeeping for...,,,
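
Cell 8 drops the two stray rows by their hard-coded index labels, which would need updating if the file were re-exported with a different row order. A minimal sketch of a rule-based alternative, assuming the stray rows can be recognised by a missing `Recipient` (as in the output of cell 7); the small `toy` frame and the helper name `drop_fragment_rows` are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd


def drop_fragment_rows(df, key="Recipient"):
    """Split the frame into rows with a value in `key` and rows without one."""
    is_fragment = df[key].isna()
    return df.loc[~is_fragment].copy(), df.loc[is_fragment].copy()


# Toy frame mimicking the pattern in cell 7: a complete record followed by a fragment.
toy = pd.DataFrame(
    {
        "Recipient": ["Thailand", np.nan, "United Nations**", np.nan, "Zimbabwe"],
        "Supplier": ["Ukraine", "0.25", "Israel", "3", "China"],
        "Number ordered": [14.0, np.nan, 3.0, np.nan, 10.0],
    },
    index=[24138, 24139, 26821, 26822, 28345],
)

clean, fragments = drop_fragment_rows(toy)
print(fragments.index.tolist())
```

Returning the dropped rows alongside the cleaned frame makes it easy to inspect what was removed before discarding it.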


In [10]:
data.head()

Unnamed: 0,Recipient,Supplier,Year of order,Number ordered,Weapon designation,Weapon description,Number delivered,Year(s) of delivery,status,Comments,SIPRI TIV per unit,SIPRI TIV for total order,SIPRI TIV of delivered weapons
0,Afghanistan,Turkiye,2007.0,24.0,M-114 155mm,towed gun,24.0,2007,Second hand,Second-hand; aid,0.2,4.8,4.8
1,Afghanistan,United States,2004.0,188.0,M-113,armoured personnel carrier,188.0,2005,Second hand,Second-hand; aid; M-113A2 version; incl 15 M-5...,0.1,18.8,18.8
2,Afghanistan,United States,2016.0,53.0,S-70 Black Hawk,transport helicopter,53.0,2017; 2018; 2019; 2020,Second hand but modernized,Second-hand UH-60A modernized to UH-60A+ befor...,4.29,227.37,227.37
3,Afghanistan,Soviet Union,1973.0,100.0,T-62,tank,100.0,1975; 1976,New,,1.8,180.0,180.0
4,Afghanistan,Soviet Union,1978.0,500.0,T-55,tank,500.0,1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986...,Second hand,Second-hand; aid,0.5,250.0,250.0


## Question 1

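The data provides an overview of the international arms trade since 1950. Each row is one deal and records the receiving country, the supplier country, the weapon designation and description, the number ordered and delivered, the year of order, the year(s) of delivery, the status of the weapons, comments on the deal, and the SIPRI TIV. TIV (trend-indicator value) is SIPRI's common unit for the volume of arms transfers; it is based on the military capability of a weapon rather than its financial cost. The data comes from the Stockholm International Peace Research Institute (SIPRI) and can be found at https://www.sipri.org/databases/armstransfers. SIPRI created the database so that analysts, researchers, policymakers, and the media can better understand the arms trade and how it has changed over time.

The raw file needed some cleaning. We dropped three placeholder columns that were not useful, along with two rows (index 24139 and 26822) that were not complete records and appear to be fragments of the rows before them. Some data is missing: before those two rows were dropped, `Number ordered` was missing in 110 rows and `Comments` in 5,179, and after the drop two deals (index 24138 and 26821) still have no TIV values. The remaining columns are complete apart from the two dropped rows.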
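
To back up a statement about missing data with numbers, the counts from `data.isnull().sum()` can be reported alongside each column's share of missing rows. A minimal sketch on a hypothetical toy frame (the values and the helper name `missing_summary` are illustrative, not taken from the SIPRI file):

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
        }
    )
    return summary.sort_values("n_missing", ascending=False)


# Toy frame with a few gaps, purely illustrative.
toy = pd.DataFrame(
    {
        "Recipient": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
        "Number ordered": [24.0, np.nan, 10.0, 5.0],
        "Comments": ["Second-hand; aid", np.nan, np.nan, "ARV version"],
    }
)

summary = missing_summary(toy)
print(summary)
```

Applied to the real frame, the same call would be `missing_summary(data)`.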