# Process Documentation by Sophia Wang

This notebook will guide you through every step taken to create this subset of data from the original dataset.

The original dataset records the national material capabilities of 71 countries year by year from 1816 to 2016. This subset pulls out the raw values for the USA and China only. The six indicators of a state's material capability are iron and steel production (irst), military expenditure (milex), military personnel (milper), primary energy consumption (pec), total population (tpop), and urban population (upop). The dataset also includes the Composite Index of National Capability (cinc), which combines the six indicators into a single score. This subset keeps cinc and leaves out pec.

The purpose of creating this subset of data is to help people who want to compare the national material capabilities of the United States and China and observe how they changed in each year from 1861 to 2007. 

The original dataset is available at the link below:
https://correlatesofwar.org/wp-content/uploads/NMC_v4_0.csv

### Acknowledgements

This subset was created using *Jupyter Lab* through *Anaconda*. You can download [Anaconda](https://unc-libraries-data.github.io/Python/Setup.html) here.

### Getting Started

You'll begin by importing the packages that you'll need to use with Python.

You'll want to load pandas with the usual `import pandas` and an extra `as pd` statement. This allows you to call functions from `pandas` with `pd.<function>` instead of `pandas.<function>` for convenience. `as pd` is **not** necessary to load the package.

Note, you also need to import the `numpy` package, which is going to help pandas do some of its math.

In [4]:
import numpy as np
import pandas as pd

You'll also need to create your dataframe object by using pandas to read in your .csv file.

`pd.read_csv` reads the tabular data from a Comma Separated Values (csv) file into a dataframe object that you'll define as `df`.

To create your dataframe object, define `df` by calling the `pd.read_csv()` function on your data file, inserting the relative file path into the parentheses.

In [5]:
df=pd.read_csv("NMC_v4_0.csv")

### Creating the Subset

After successfully importing the original dataset, you'll now create the subset by filtering the data in the following steps. First, you'll use the `.loc` indexer to create a dataframe from the rows with index labels 45 through 191, which are all the USA rows from 1861 to 2007. Note that a `.loc` slice includes both its start and end labels.

Since the "ccode" (country code) column is not very useful for your purpose, you'll include only the columns named "stateabb", "year", "irst", "milex", "milper", "tpop", "upop", and "cinc" (representing state abbreviation, year, iron and steel production, military expenditure, military personnel, total population, urban population, and the composite index of national capability).

In [6]:
df.loc[45:191,["stateabb","year","irst","milex","milper","tpop","upop","cinc"]]

Unnamed: 0,stateabb,year,irst,milex,milper,tpop,upop,cinc
45,USA,1861,664,95362,217,32351.0,2759.0,0.144429
46,USA,1862,715,136769,673,33188.0,2886.0,0.176391
47,USA,1863,860,119662,960,34026.0,3018.0,0.178967
48,USA,1864,1031,157390,1032,34863.0,3156.0,0.192913
49,USA,1865,845,30471,1063,35701.0,3301.0,0.135465
...,...,...,...,...,...,...,...,...
187,USA,2003,93677,404920000,1427,290448.0,78621.0,0.142094
188,USA,2004,99681,455908000,1450,293192.0,79745.0,0.143169
189,USA,2005,94897,495326000,1473,295896.0,80805.0,0.148290
190,USA,2006,98557,521840000,1546,298755.0,81880.0,0.146377
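
Hard-coded row labels such as `45:191` only work while the file keeps exactly this row order. As an alternative sketch (not the approach used in this notebook), you could select rows by their values instead. The helper name `subset_country` and the small `sample` frame, built from a few rows printed in this notebook, are assumptions for illustration only.

```python
import pandas as pd

COLUMNS = ["stateabb", "year", "irst", "milex", "milper", "tpop", "upop", "cinc"]

# Illustrative frame made from a few rows shown in this notebook's output.
# In the notebook itself you would pass the full `df` instead.
sample = pd.DataFrame(
    [
        ["USA", 1861, 664, 95362, 217, 32351.0, 2759.0, 0.144429],
        ["USA", 1862, 715, 136769, 673, 33188.0, 2886.0, 0.176391],
        ["USA", 2006, 98557, 521840000, 1546, 298755.0, 81880.0, 0.146377],
        ["CHN", 1861, 10, -9, 1000, 374670.0, -9.0, 0.172672],
        ["CHN", 2006, 422989, 35223000, 2255, 1311020.0, 729423.0, 0.190264],
    ],
    columns=COLUMNS,
)


def subset_country(frame, abb, start_year, end_year):
    """Hypothetical helper: keep one country's rows within a year range."""
    # Build a boolean mask; `between` includes both end years by default.
    mask = (frame["stateabb"] == abb) & frame["year"].between(start_year, end_year)
    return frame.loc[mask, COLUMNS]


usa_early = subset_country(sample, "USA", 1861, 1865)
print(usa_early)
```

With the full dataset, `subset_country(df, "USA", 1861, 2007)` should select the same rows as `df.loc[45:191, ...]`, provided the row order matches the one described above.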


Second, still using the `.loc` indexer, create a dataframe from the rows with index labels 12208 through 12354, which are all the CHN (China) rows from 1861 to 2007.

You can simply copy the code in the last step and change the range of row numbers from "45:191" to "12208:12354".

In [7]:
df.loc[12208:12354,["stateabb","year","irst","milex","milper","tpop","upop","cinc"]]

Unnamed: 0,stateabb,year,irst,milex,milper,tpop,upop,cinc
12208,CHN,1861,10,-9,1000,374670.0,-9.0,0.172672
12209,CHN,1862,10,-9,1000,370015.0,-9.0,0.170761
12210,CHN,1863,10,-9,1000,365419.0,-9.0,0.164349
12211,CHN,1864,10,-9,1000,360880.0,-9.0,0.159389
12212,CHN,1865,10,-9,1000,356397.0,-9.0,0.161060
...,...,...,...,...,...,...,...,...
12350,CHN,2003,222413,75500000,2250,1288400.0,664989.0,0.169247
12351,CHN,2004,280486,87150000,2252,1296075.0,692653.0,0.182570
12352,CHN,2005,355790,29873000,2255,1303720.0,710800.0,0.183922
12353,CHN,2006,422989,35223000,2255,1311020.0,729423.0,0.190264
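
Notice the `-9` entries in the early China rows. Assuming `-9` is the dataset's placeholder for missing values (check the codebook to confirm), you may want to convert it to `NaN` before doing any math, so it is not treated as a real quantity. Here is a minimal sketch on two of the rows shown above; `chn_sample` and `indicator_cols` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Two China rows copied from the output above.
chn_sample = pd.DataFrame(
    [
        ["CHN", 1861, 10, -9, 1000, 374670.0, -9.0, 0.172672],
        ["CHN", 2006, 422989, 35223000, 2255, 1311020.0, 729423.0, 0.190264],
    ],
    columns=["stateabb", "year", "irst", "milex", "milper", "tpop", "upop", "cinc"],
)

# Assumption: -9 marks a missing value in these indicator columns.
indicator_cols = ["irst", "milex", "milper", "tpop", "upop"]

# Work on a copy so the original subset stays untouched.
cleaned = chn_sample.copy()
cleaned[indicator_cols] = cleaned[indicator_cols].replace(-9, np.nan)

print(cleaned[["year", "milex", "upop"]])
```

Columns that receive a `NaN` are stored as floats afterwards, which is how pandas represents missing numeric values by default.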


### Exporting Your New Subsets

Now you have two subsets, one for the USA and one for China, to compare and analyze.

The last step is to export the new subsets you just created with the `.to_csv()` method, putting the filename and extension within the parentheses.

You can also add `index=False` to your statement, which tells pandas not to include the default index numbers. Note the capital `F`: Python does not recognize `false`.

`RI_subset.to_csv("RI_subset.csv", index=False)`
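
To see what `index=False` changes, here is a small sketch that compares the two outputs. It relies on the fact that `.to_csv()` returns the CSV text as a string when no filename is given. The three-column `sample` frame is an illustration built from two USA rows shown earlier.

```python
import pandas as pd

# Two USA rows from the earlier output, keeping their original row labels.
sample = pd.DataFrame(
    {"stateabb": ["USA", "USA"], "year": [1861, 1862], "irst": [664, 715]},
    index=[45, 46],
)

# With no filename, to_csv() returns the CSV text instead of writing a file.
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print(with_index.splitlines()[0])     # ,stateabb,year,irst
print(with_index.splitlines()[1])     # 45,USA,1861,664
print(without_index.splitlines()[0])  # stateabb,year,irst
```

The leading comma in the first header is the unnamed index column. When such a file is read back with `pd.read_csv`, that column typically shows up as `Unnamed: 0`.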

Define the USA subset as `RI_usa` and the China subset as `RI_chn`.

In [8]:
RI_usa = df.loc[45:191,["stateabb","year","irst","milex","milper","tpop","upop","cinc"]]

In [9]:
RI_chn = df.loc[12208:12354,["stateabb","year","irst","milex","milper","tpop","upop","cinc"]]

For the final step, use the `.to_csv()` method as shown below to export the two subsets as new .csv files.

In [10]:
RI_usa.to_csv("RI_usa.csv", index=False)

In [11]:
RI_chn.to_csv("RI_chn.csv", index=False)
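
As an optional check that is not part of the original workflow, you could read an exported file back in and confirm it has the columns and row count you expect. The sketch below writes a two-row sample to a temporary folder so that it runs on its own; the file name `RI_usa_sample.csv` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["stateabb", "year", "irst", "milex", "milper", "tpop", "upop", "cinc"]

# Two USA rows from the earlier output, keeping their original row labels.
usa_sample = pd.DataFrame(
    [
        ["USA", 1861, 664, 95362, 217, 32351.0, 2759.0, 0.144429],
        ["USA", 1862, 715, 136769, 673, 33188.0, 2886.0, 0.176391],
    ],
    columns=COLUMNS,
    index=[45, 46],
)

# Write to a temporary folder, then read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "RI_usa_sample.csv")
    usa_sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(len(reloaded), "rows")
```

For the real files, you would run `pd.read_csv("RI_usa.csv")` and `pd.read_csv("RI_chn.csv")` and compare the row counts with `len(RI_usa)` and `len(RI_chn)`.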