# **Arms Data: ETL**

## Objectives

* Evaluate size of data set
* Check data types
* Normalise values and columns
* Prepare features for K-means clustering 

## Inputs

* This ETL notebook only requires the unprocessed arms transfer dataset as found in the Data/Raw folder

## Outputs

* By the end of the notebook I will have produced a processed version of the dataset, with the added features needed for clustering




---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, so when running the notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\Capstone\\Arms-Import-Export-Analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\Capstone\\Arms-Import-Export-Analysis'

In [4]:
print(os.listdir())


['.git', '.gitignore', '.python-version', '.slugignore', '.venv', 'Data', 'jupyter_notebooks', 'Procfile', 'README.md', 'requirements.txt', 'setup.sh']
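
The cells above move up one level every time they are run, so re-running them would leave the working directory above the project root. A minimal sketch of a guard against that is shown here, assuming the notebooks folder is named `jupyter_notebooks` as in the path above; `project_root` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd if cwd is the notebooks folder, otherwise cwd itself."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Illustrative relative paths only; they do not need to exist on disk
print(project_root(Path("Arms-Import-Export-Analysis/jupyter_notebooks")))
print(project_root(Path("Arms-Import-Export-Analysis")))
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`, which changes directory on the first run and leaves it unchanged on later runs.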


# Cleaning and normalising 

In this section the focus will be on transforming the existing data as opposed to feature engineering 

First the necessary libraries will be imported 

In [5]:
import pandas as pd

Now the CSV file containing the dataset can be parsed into a DataFrame 

In [15]:
arms_df = pd.read_csv('Data/Raw/trade-register-military.csv')
arms_df.head()

| | Recipient | Supplier | Year of order | Unnamed: 4 | Number ordered | .1 | Weapon designation | Weapon description | Number delivered | .2 | Year(s) of delivery | status | Comments | SIPRI TIV per unit | SIPRI TIV for total order | SIPRI TIV of delivered weapons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Russia | 2002.0 | | 3.0 | | Mi-17 | transport helicopter | 3.0 | | 2002 | Second hand | Second-hand; aid | 2.9 | 8.7 | 8.7 |
| 1 | Afghanistan | Turkiye | 2007.0 | | 24.0 | | M-114 155mm | towed gun | 24.0 | | 2007 | Second hand | Second-hand; aid | 0.2 | 4.8 | 4.8 |
| 2 | Afghanistan | United States | 2004.0 | ? | 188.0 | ? | M-113 | armoured personnel carrier | 188.0 | ? | 2005 | Second hand | Second-hand; aid; M-113A2 version; incl 15 M-5... | 0.1 | 18.8 | 18.8 |
| 3 | Afghanistan | United States | 2016.0 | | 53.0 | | S-70 Black Hawk | transport helicopter | 53.0 | ? | 2017; 2018; 2019; 2020 | Second hand but modernized | Second-hand UH-60A modernized to UH-60A+ befor... | 4.29 | 227.37 | 227.37 |
| 4 | Afghanistan | Soviet Union | 1973.0 | ? | 100.0 | ? | T-62 | tank | 100.0 | ? | 1975; 1976 | New | | 1.8 | 180.0 | 180.0 |


It is worth noting that this dataset measures each order by TIV (Trend Indicator Value). This is SIPRI's way of calculating military capability, and it is used in place of the total market value.
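
In the five preview rows, the total-order TIV appears to equal the number ordered multiplied by the per-unit TIV. The sketch below checks that arithmetic on values copied from the preview. It is an observation about those five rows only, not a documented SIPRI rule, and the helper name `total_matches` and its tolerance are assumptions.

```python
import math

# (number ordered, TIV per unit, TIV for total order) copied from the five preview rows
preview_rows = [
    (3, 2.9, 8.7),
    (24, 0.2, 4.8),
    (188, 0.1, 18.8),
    (53, 4.29, 227.37),
    (100, 1.8, 180.0),
]


def total_matches(units, tiv_per_unit, tiv_total, tol=0.01):
    """True when units * per-unit TIV equals the stated total, within a small tolerance."""
    return math.isclose(units * tiv_per_unit, tiv_total, abs_tol=tol)


# One boolean per preview row
checks = [total_matches(*row) for row in preview_rows]
print(checks)
```

Whether the same relationship holds across the full dataset would need to be checked on `arms_df` itself.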

First the structure and data types will be checked

In [17]:
arms_df.shape

(29507, 13)

At just under 30,000 rows and 13 columns, the dataset is large enough to support the planned analysis and clustering

In [18]:
arms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29507 entries, 0 to 29506
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Recipient                       29503 non-null  object 
 1   Supplier                        29507 non-null  object 
 2   Year of order                   29507 non-null  float64
 3   Number ordered                  29217 non-null  float64
 4   Weapon designation              29503 non-null  object 
 5   Weapon description              29503 non-null  object 
 6   Number delivered                29503 non-null  float64
 7   Year(s) of delivery             28348 non-null  object 
 8   status                          29503 non-null  object 
 9   Comments                        24211 non-null  object 
 10  SIPRI TIV per unit              29499 non-null  float64
 11  SIPRI TIV for total order       29499 non-null  float64
 12  SIPRI TIV of delivered weapons  29499 non-null  float64

The Year of order values are stored as floats and need to be converted to datetime. The Year(s) of delivery column often holds several years in a single string, so it may need different handling.
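
One possible way to handle the multi-year delivery strings is to split them into lists of integers. This is a minimal sketch, assuming the years are always separated by semicolons as in the preview rows; `parse_delivery_years` is a hypothetical helper and not part of the original notebook.

```python
def parse_delivery_years(value):
    """Split a 'YYYY; YYYY' string into a list of ints; missing values give an empty list."""
    if not isinstance(value, str):
        return []  # NaN or None: no delivery year recorded
    # Keep only the parts that are purely numeric after trimming whitespace
    return [int(part) for part in value.split(";") if part.strip().isdigit()]


print(parse_delivery_years("2017; 2018; 2019; 2020"))
print(parse_delivery_years("2002"))
print(parse_delivery_years(float("nan")))
```

Applied to the DataFrame, this could look like `arms_df["Year(s) of delivery"].apply(parse_delivery_years)`, after which the first and last delivery years can be derived from each list.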

In [19]:
arms_df.describe()

| | Year of order | Number ordered | Number delivered | SIPRI TIV per unit | SIPRI TIV for total order | SIPRI TIV of delivered weapons |
|---|---|---|---|---|---|---|
| count | 29507.0 | 29217.0 | 29503.0 | 29499.0 | 29499.0 | 29499.0 |
| mean | 1989.283792 | 124.013896 | 122.131071 | 7.553335 | 76.846316 | 70.923255 |
| std | 31.289445 | 806.611909 | 837.095692 | 27.676713 | 264.162104 | 245.60212 |
| min | 0.36 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1973.0 | 3.0 | 3.0 | 0.24 | 4.4 | 3.7 |
| 50% | 1988.0 | 10.0 | 10.0 | 1.0 | 15.0 | 14.0 |
| 75% | 2009.0 | 50.0 | 43.0 | 5.0 | 50.0 | 47.0 |
| max | 2024.0 | 50000.0 | 50000.0 | 1250.0 | 10117.5 | 10117.5 |


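The values shown above will provide a good benchmark for exploratory analysis. One value stands out: the minimum Year of order is 0.36, which is not a valid year and will need investigating.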
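
A minimal sketch of how implausible year values could be flagged before any type conversion is given here. The toy Series stands in for `arms_df["Year of order"]`, and the bounds `MIN_YEAR` and `MAX_YEAR` are placeholder assumptions, not values taken from SIPRI.

```python
import pandas as pd

# Toy values; 0.36 mirrors the minimum reported by describe()
years = pd.Series([2002.0, 1973.0, 0.36, 2024.0], name="Year of order")

# Assumed plausible range (placeholders, adjust to the dataset's documented coverage)
MIN_YEAR, MAX_YEAR = 1900, 2100

is_whole = years.mod(1).eq(0)                 # a year should have no fractional part
in_range = years.between(MIN_YEAR, MAX_YEAR)  # inclusive on both ends
suspect = years[~(is_whole & in_range)]
print(suspect)

# Valid years stored as nullable integers; suspect values become <NA>
clean_years = years.where(is_whole & in_range).astype("Int64")
print(clean_years)
```

Storing years as nullable integers is one option; converting to datetime, as suggested earlier, would work on the same filtered values.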

Now the number of duplicated rows and the count of null values per column are displayed

In [23]:
arms_df.duplicated().sum()


12

In [22]:
arms_df.isnull().sum()

Recipient                            4
Supplier                             0
Year of order                        0
Number ordered                     290
Weapon designation                   4
Weapon description                   4
Number delivered                     4
Year(s) of delivery               1159
status                               4
Comments                          5296
SIPRI TIV per unit                   8
SIPRI TIV for total order            8
SIPRI TIV of delivered weapons       8
dtype: int64
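
The counts above suggest two clean-up steps: removing exact duplicate rows and removing rows that lack a key field. The sketch below shows both on a toy frame with illustrative values. Treating `Recipient` as essential and leaving missing `Comments` in place are assumptions, not decisions taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for arms_df; the values are illustrative only
toy = pd.DataFrame({
    "Recipient": ["Afghanistan", "Afghanistan", None, "Albania"],
    "Supplier": ["Russia", "Russia", "Unknown", "China"],
    "Comments": ["aid", "aid", None, None],
})

# Drop rows that are exact copies of an earlier row
deduped = toy.drop_duplicates()

# Drop rows missing an essential field; free-text Comments may stay null
cleaned = deduped.dropna(subset=["Recipient"]).reset_index(drop=True)

print(len(toy), len(deduped), len(cleaned))
```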

Three unlabelled columns ('Unnamed: 4', '.1' and '.2') hold only '?' markers or empty values in the preview, so they will be dropped

In [16]:
arms_df = arms_df.drop(columns=[arms_df.columns[9], 
                                arms_df.columns[5], 
                                arms_df.columns[3]]
)
arms_df.head()


| | Recipient | Supplier | Year of order | Number ordered | Weapon designation | Weapon description | Number delivered | Year(s) of delivery | status | Comments | SIPRI TIV per unit | SIPRI TIV for total order | SIPRI TIV of delivered weapons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Russia | 2002.0 | 3.0 | Mi-17 | transport helicopter | 3.0 | 2002 | Second hand | Second-hand; aid | 2.9 | 8.7 | 8.7 |
| 1 | Afghanistan | Turkiye | 2007.0 | 24.0 | M-114 155mm | towed gun | 24.0 | 2007 | Second hand | Second-hand; aid | 0.2 | 4.8 | 4.8 |
| 2 | Afghanistan | United States | 2004.0 | 188.0 | M-113 | armoured personnel carrier | 188.0 | 2005 | Second hand | Second-hand; aid; M-113A2 version; incl 15 M-5... | 0.1 | 18.8 | 18.8 |
| 3 | Afghanistan | United States | 2016.0 | 53.0 | S-70 Black Hawk | transport helicopter | 53.0 | 2017; 2018; 2019; 2020 | Second hand but modernized | Second-hand UH-60A modernized to UH-60A+ befor... | 4.29 | 227.37 | 227.37 |
| 4 | Afghanistan | Soviet Union | 1973.0 | 100.0 | T-62 | tank | 100.0 | 1975; 1976 | New | | 1.8 | 180.0 | 180.0 |
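
Dropping by position removes whichever columns sit at indices 3, 5 and 9 at the time, so re-running the cell would delete real data columns. A minimal sketch of a name-based alternative is given here, assuming the three columns are labelled 'Unnamed: 4', '.1' and '.2' as in the first preview; the toy frame is illustrative.

```python
import pandas as pd

# Toy frame with the same unlabelled columns as the raw preview
toy = pd.DataFrame({
    "Recipient": ["Afghanistan"],
    "Unnamed: 4": [None],
    "Number ordered": [3.0],
    ".1": [None],
    "Number delivered": [3.0],
    ".2": [None],
})

flag_columns = ["Unnamed: 4", ".1", ".2"]

# errors="ignore" skips names that are already gone, so the step is safe to re-run
trimmed = toy.drop(columns=flag_columns, errors="ignore")
trimmed_again = trimmed.drop(columns=flag_columns, errors="ignore")

print(list(trimmed_again.columns))
```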


---



# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is added
except Exception as e:
  print(e)
