# Cleaning and Formatting my data




This is my data:

In [1]:
IRdisplay::display_html('<iframe width="700" height="300" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ubcCBaveg-58jcVmbErpO5kZswjFyHN5YlB8tB1a8B4fzU4sqZ08jkOKx4kBz1qtDNkJJWH8vBYF/pubhtml?gid=2024244899&single=true"></iframe>')

You can find it [here](https://docs.google.com/spreadsheets/d/1e1Pll_MGF6dVi4KJTXTiLfRXzkjf7ZdBhd58yt3Vkl8/edit?usp=sharing) too, on GoogleDrive.

**FOR this STEP, you should read from GitHub**.

This is the link to my CSV:

In [2]:
# the link as CSV
linkToData="https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ubcCBaveg-58jcVmbErpO5kZswjFyHN5YlB8tB1a8B4fzU4sqZ08jkOKx4kBz1qtDNkJJWH8vBYF/pub?gid=0&single=true&output=csv"

Read the data:

In [3]:
dirty=read.csv(linkToData,check.names=F)

As usual, I check the data types:

In [4]:
str(dirty)

'data.frame':	6 obs. of  6 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  "Peru, South America" "USA, North America" "Canada, North America" "Côte D'Ivoire, Africa" ...
 $ var1           : chr  "1500" "2500" "3500" "2500" ...
 $ var 2          : chr  "1'200" "1'300" "--" "" ...
 $ var@3          : chr  "500" "$1 500" "1.5" "_" ...
 $ category       : chr  "a" "A" "Ba" "Ba" ...


Now, I identify which are textual, numerical, or categorical.

* Columns **identification** and **identification2** are *textual*.
* The columns from **var1** to **var@3** are all *numerical*. But since their type is _chr_, each of them must contain some non-numerical characters.
* Column **category** is *categorical*. Keep in mind that categorical types will NEVER be recognised as such when read from a CSV. They will always be understood as text (_chr_).

The **column names** are always *textual*.



# PART 1. EXPLORATION



### 1.1. **Exploring TEXT**


When data is textual, you need to explore the cells to verify all the characters are part of the **alphabet**.

Let me use R's **grep()** function:

In [5]:
# show me the cells that have a character outside the alphabet
dirty$identification[grep("[^a-zA-Z]",dirty$identification)]

**United Kingdom** is not dirty, but it was flagged because the space is outside the alphabet. What about:

In [6]:
dirty$identification[grep("\\W",dirty$identification)]

or...

In [7]:
dirty$identification[grep("[^\\w\\s]",dirty$identification,perl=T)]


Then the safe option is:

In [8]:
dirty$identification[grep("[^a-zA-Z\\s]",dirty$identification,perl = T)]
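To see why each pattern behaves differently, here is a small sketch on a made-up vector (`toyNames` is hypothetical, not part of my data). It uses `grepl()` so that every pattern returns one TRUE/FALSE per cell:

```r
# Hypothetical values that mimic the identification column
toyNames <- c("USA", "United Kingdom", "Israel [note]", "Per\u00fa")

# One pattern per column; perl = TRUE so that \\s works inside brackets
patterns <- c(lettersOnly    = "[^a-zA-Z]",
              nonWord        = "\\W",
              lettersOrSpace = "[^a-zA-Z\\s]")

# TRUE means the cell was flagged by that pattern
flagged <- sapply(patterns, function(p) grepl(p, toyNames, perl = TRUE))
rownames(flagged) <- toyNames
flagged
```

Only the last pattern leaves "United Kingdom" alone while still flagging the brackets and the accented letter.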

A similar exploration should be done in the **column names**:

In [9]:
# allowing numbers, not spaces
names(dirty)[grep("[^0-9a-zA-Z]",names(dirty),perl = T)]

And in the case of the column with **categorical data**:

In [10]:
dirty$category[grep("[^a-zA-Z]",dirty$category,perl = T)]

### 1.2. **Exploring NUMBERS**

If numbers are recognised as such, no cleaning is needed. If they are not, the column has been read as text, and we can look for the offending cells with the regex **\D** (any non-digit, the opposite of **\d**):

In [11]:
dirty$var1[grep("\\D",dirty$var1,perl = T)]

In [12]:
dirty$'var 2'[grep("\\D",dirty$'var 2',perl = T)]

In [13]:
### Why the error?
# dirty$var@3[grep("\\D",dirty$var@3,perl = T)]

Notice I need to use quotes (**''**) to access the variables with dirty names (a space between words, or the special character **@**, which R reads as an operator). That is why you clean the column names first:

In [14]:
dirty$'var@3'[grep("\\D",dirty$'var@3',perl=T)]
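The commented line fails because R does not read `var@3` as a single name, so the name has to be quoted. A minimal sketch of the equivalent ways to reach such a column, using a hypothetical `toyDF`:

```r
# Hypothetical data frame with non-syntactic column names
toyDF <- data.frame("var 2" = c("1'200", "--"),
                    "var@3" = c("500", "_"),
                    check.names = FALSE,
                    stringsAsFactors = FALSE)

viaQuotes    <- toyDF$'var@3'     # quotes after $
viaBackticks <- toyDF$`var@3`     # backticks, the usual R convention
viaBrackets  <- toyDF[["var@3"]]  # [[ ]] takes the name as a plain string

# all three return the same character vector
identical(viaQuotes, viaBackticks) && identical(viaQuotes, viaBrackets)
```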

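Some cells hold good values, but others cannot be kept as they are. Use **\D** with care, because numbers are complex: it also flags the decimal point of a valid value such as 1.5. So I prefer a class that tolerates the characters a number may contain: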

In [15]:
dirty$'var@3'[grep("[^\\d+\\.*\\d*]", dirty$'var@3', perl=T,invert = F)]
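The bracket expression above is a negated character class: it flags any cell holding a character other than a digit, `+`, `.` or `*`. As a stricter alternative (a sketch of mine, not what the rest of this notebook uses), an anchored pattern can describe a whole number with an optional decimal part; `toyValues` is hypothetical:

```r
toyValues <- c("500", "$1 500", "1.5", "_", "-", "330", "1.2.3")

# ^ and $ force the whole cell to match: digits, then optionally a dot and more digits
looksNumeric <- grepl("^\\d+(\\.\\d+)?$", toyValues, perl = TRUE)

# cells that are NOT valid numbers
toyValues[!looksNumeric]
```

Unlike the character class, this version also rejects a value such as "1.2.3", which contains only allowed characters but is not a number.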

# PART 2. CLEANING

As mentioned, cleaning may mean:

a. Making bad characters disappear.

b. Keeping the good characters.


Let's start with the _column names_:

## 2.1  The Column names

In [16]:
names(dirty)[grep("[^0-9a-zA-Z]",names(dirty),perl = T)]

How do you say "if there is a space or a weird character, make it disappear"? You *replace* it with "" (nothing):

In [17]:
# option 1
gsub("\\W",'',names(dirty), perl=T )


In [18]:
# option 2
gsub("[^\\w]",'',names(dirty), perl=T )

In [19]:
# option 3
gsub("[^0-9a-zA-Z]",'',names(dirty), perl=T )
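The three options agree on my column names, but they are not identical in general: `\W` and `[^\w]` treat the underscore as a valid character, while the explicit class removes it. A quick check with hypothetical names:

```r
toyColNames <- c("var 2", "var@3", "var_4")

keepsUnderscore <- gsub("\\W", "", toyColNames, perl = TRUE)            # "var_4" survives
dropsUnderscore <- gsub("[^0-9a-zA-Z]", "", toyColNames, perl = TRUE)   # "var_4" becomes "var4"

keepsUnderscore
dropsUnderscore
```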

Choose any and make the change:

In [20]:
names(dirty)=gsub("[^0-9a-zA-Z]",'',names(dirty), perl=T )
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,"Peru, South America",1500,1'200,500,a
USA,"USA, North America",2500,1'300,$1 500,A
Canada,"Canada, North America",3500,--,1.5,Ba
Côte D'Ivoire,"Côte D'Ivoire, Africa",2500,,_,Ba
Israel [note],"Israel [note], Asia",Dk,250k,-,?
United Kingdom,"United Kingdom, Europe",2550,310000,330,Ba


The column names were cleaned by **Making bad characters disappear** 🙂

## 2.2  The dataframe contents

The contents include:
* The data columns. Generally numbers and categories.
* The identifier column(s). Generally text.



### 2.2.1 The identifier column(s)

We have two of those. Let's check the **identification** column:

In [21]:
dirty$identification[grep("[^a-zA-Z\\s]",dirty$identification,perl = T)]

Not all the characters detected are invalid. The **only** problem here is the brackets. Then:

* Option 1: Whatever is inside brackets (including the brackets) has to go!

In [22]:
gsub("\\[.*\\]",'',dirty$identification,perl = T)

* Option 2: Splitting

In [23]:
strsplit(dirty$identification,split = '[',fixed=T)

You got a list. BUT you need a data frame column. Then:

In [24]:
## saving result
resultSplitIn2=strsplit(dirty$identification,split = '[',fixed=T)
# collect the first piece of each split into a vector
goodColumn=c()
for (elements in resultSplitIn2){
  goodColumn=c(goodColumn,elements[1]) # keeping the left part!
}
goodColumn
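The loop can also be written without growing a vector step by step. This sketch uses `vapply()` on a hypothetical `toyID` vector and trims the space left before the bracket:

```r
toyID <- c("USA", "Israel [note]", "United Kingdom")

pieces <- strsplit(toyID, split = "[", fixed = TRUE)

# take the first piece of every element; vapply guarantees a character vector
firstPiece <- vapply(pieces, function(p) p[1], character(1))
cleanID <- trimws(firstPiece)
cleanID
```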

When you are happy, make the change:

In [25]:
dirty$identification=goodColumn
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,"Peru, South America",1500,1'200,500,a
USA,"USA, North America",2500,1'300,$1 500,A
Canada,"Canada, North America",3500,--,1.5,Ba
Côte D'Ivoire,"Côte D'Ivoire, Africa",2500,,_,Ba
Israel,"Israel [note], Asia",Dk,250k,-,?
United Kingdom,"United Kingdom, Europe",2550,310000,330,Ba


The **splitting** option seems very convenient for **identification2**:

In [26]:
## you want to keep [2]:
## saving result
resultSplitIn2=strsplit(dirty$identification2,split = ',', fixed = T)
# collect the second piece of each split into a vector
goodColumn=c()
for (elements in resultSplitIn2){
  goodColumn=c(goodColumn,elements[2]) # keeping the right part!
}
goodColumn
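If a cell held more than one comma, `elements[2]` would keep only the middle piece. A sketch of an alternative under that assumption: `sub()` removes everything up to the first comma (and the spaces after it) in one step; `toyID2` is hypothetical:

```r
toyID2 <- c("Peru, South America", "Israel [note], Asia", "Korea, Republic of, Asia")

# ^[^,]*,\\s* : from the start, anything that is not a comma, the comma, then optional spaces
rightPart <- sub("^[^,]*,\\s*", "", toyID2, perl = TRUE)
rightPart
```

Because `sub()` replaces only the first match, everything after the first comma is kept, and no leading space is left behind.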

If this is OK, then:

In [27]:
dirty$identification2=goodColumn
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1'200,500,a
USA,North America,2500,1'300,$1 500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250k,-,?
United Kingdom,Europe,2550,310000,330,Ba


### 2.2.2 The Categorical columns

The **category** requires a frequency table:

In [28]:
table(dirty$category)


 ?  A Ba  a 
 1  1  3  1 

You can conclude that the **a** is wrong, it should be **A**.

In [29]:
#what about:
gsub('a','A', dirty$category,fixed=T) #fixed uses NO REGEX

That changed **Ba** to **BA**!

In [30]:
## maybe
## ^: start of string
## $: end  of string
gsub('^a$','A', dirty$category)
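To see the effect of the anchors side by side, here is the same pair of calls on a hypothetical `toyCat` vector:

```r
toyCat <- c("a", "A", "Ba", "?")

noAnchors   <- gsub("a", "A", toyCat, fixed = TRUE)  # every 'a' is replaced, so "Ba" becomes "BA"
withAnchors <- gsub("^a$", "A", toyCat)              # only cells that are exactly "a" are replaced

noAnchors
withAnchors
```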

The simpler way:

In [31]:
dirty[dirty$category=='a','category']='A'

dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1'200,500,A
USA,North America,2500,1'300,$1 500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250k,-,?
United Kingdom,Europe,2550,310000,330,Ba



As you can see, some symbols stand for missing values. We could replace them now or later.

Let me first check the **numeric columns**

### 2.2.3. The numerical columns

In [32]:
gsub(',','',dirty$var1)


Then,

In [33]:
dirty$var1=gsub(',','',dirty$var1)
dirty


identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1'200,500,A
USA,North America,2500,1'300,$1 500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250k,-,?
United Kingdom,Europe,2550,310000,330,Ba


The **var2** is more complicated.

In [34]:
# save where you have the issue
dirty$var2_temp=grepl("\\'|k",dirty$var2,fixed=F)
dirty

identification,identification2,var1,var2,var3,category,var2_temp
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>
Perú,South America,1500,1'200,500,A,True
USA,North America,2500,1'300,$1 500,A,True
Canada,North America,3500,--,1.5,Ba,False
Côte D'Ivoire,Africa,2500,,_,Ba,False
Israel,Asia,Dk,250k,-,?,True
United Kingdom,Europe,2550,310000,330,Ba,False


In [35]:
## now replace
dirty$var2=gsub("\\'|k",'',dirty$var2)
dirty

identification,identification2,var1,var2,var3,category,var2_temp
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>
Perú,South America,1500,1200,500,A,True
USA,North America,2500,1300,$1 500,A,True
Canada,North America,3500,--,1.5,Ba,False
Côte D'Ivoire,Africa,2500,,_,Ba,False
Israel,Asia,Dk,250,-,?,True
United Kingdom,Europe,2550,310000,330,Ba,False


In [36]:
# now the real value
ifelse(dirty$var2_temp,paste0(dirty$var2,'000'),dirty$var2)

In [37]:
# then
dirty$var2=ifelse(dirty$var2_temp,paste0(dirty$var2,'000'),dirty$var2)
dirty$var2_temp=NULL
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1200000,500,A
USA,North America,2500,1300000,$1 500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250000,-,?
United Kingdom,Europe,2550,310000,330,Ba


The **var3** can be solved like this:

In [38]:
dirty['var3']=gsub("\\$|\\s",'',dirty$var3)
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1200000,500,A
USA,North America,2500,1300000,1500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250000,-,?
United Kingdom,Europe,2550,310000,330,Ba


## 2.3. Detecting missing values:


Wrong representations of missing values should be replaced with care. Do it according to the data type.

Then, let's start with the **categorical** column:

In [39]:
badSymbolCat=grep('\\W+',dirty$category,value = T)
badSymbolCat

Once found:

In [40]:
dirty$category=gsub(badSymbolCat,NA,dirty$category,fixed = T)
dirty

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1200000,500,A
USA,North America,2500,1300000,1500,A
Canada,North America,3500,--,1.5,Ba
Côte D'Ivoire,Africa,2500,,_,Ba
Israel,Asia,Dk,250000,-,
United Kingdom,Europe,2550,310000,330,Ba
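`gsub()` takes a single pattern, so the call above only works because one bad symbol was found. A sketch for the case where `grep()` returns several symbols, using a hypothetical `toyCat`:

```r
toyCat <- c("A", "?", "Ba", "--", "A")

# every distinct cell that contains a non-word character
badSymbols <- unique(grep("\\W+", toyCat, value = TRUE))

# %in% compares whole cells, so it is safe with any number of symbols
toyCat[toyCat %in% badSymbols] <- NA
toyCat
```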


Let's go for the **numerical** cases:

In [41]:
dirty$var1[grep("[^\\d+\\.*\\d*]",dirty$var1,perl = T)]

In [42]:
dirty$var2[grep("[^\\d+\\.*\\d*]",dirty$var2,perl = T)]

In [43]:
dirty$var3[grep("[^\\d+\\.*\\d*]", dirty$var3, perl=T,invert = F)]

Here I apply a function to several columns, instead of one by one:

In [44]:
sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]})

Nice output:

In [45]:
unlist(sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]}))

In [46]:
unique(unlist(sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]})))

Let's improve readability:

In [47]:
detectWrongNA= function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]}
badSymbolNum=sapply(dirty[, c('var1','var2','var3')],detectWrongNA)
badSymbolNum_unlist=unlist(badSymbolNum)
badSymbolNum_vector=unique(badSymbolNum_unlist)
badSymbolNum_vector

Let's clean those columns:

In [48]:


dirty[, c('var1','var2','var3')]=lapply(dirty[, c('var1','var2','var3')],function(col) ifelse((col %in% badSymbolNum_vector), NA, col))

dirty


identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1200000,500,A
USA,North America,2500,1300000,1500,A
Canada,North America,3500,,1.5,Ba
Côte D'Ivoire,Africa,2500,,,Ba
Israel,Asia,,250000,,
United Kingdom,Europe,2550,310000,330,Ba


In [49]:
str(dirty)

'data.frame':	6 obs. of  6 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  " South America" " North America" " North America" " Africa" ...
 $ var1           : chr  "1500" "2500" "3500" "2500" ...
 $ var2           : chr  "1200000" "1300000" NA "" ...
 $ var3           : chr  "500" "1500" "1.5" NA ...
 $ category       : chr  "A" "A" "Ba" "Ba" ...
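Notice in the `str()` output that **var2** still holds an empty string (`""`): the pattern needs at least one character to match, so an empty cell is never flagged. It can be turned into NA explicitly. A minimal sketch on a hypothetical vector:

```r
toyVar2 <- c("1200000", "1300000", NA, "", "250000")

# is.na() is checked first so the existing NA is left untouched
isEmpty <- !is.na(toyVar2) & toyVar2 == ""
toyVar2[isEmpty] <- NA
toyVar2
```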


Always guard against leading and trailing spaces!

In [50]:
dirty[,]=sapply(dirty[,],trimws) #use it when all are CHR!
nowClean=dirty[,]
str(nowClean)

'data.frame':	6 obs. of  6 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  "South America" "North America" "North America" "Africa" ...
 $ var1           : chr  "1500" "2500" "3500" "2500" ...
 $ var2           : chr  "1200000" "1300000" NA "" ...
 $ var3           : chr  "500" "1500" "1.5" NA ...
 $ category       : chr  "A" "A" "Ba" "Ba" ...
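`sapply()` returns a character matrix, which is why the comment warns that every column must be text. A sketch that is also safe when some columns are already numeric (hypothetical `toyDF`):

```r
toyDF <- data.frame(region = c(" Asia", "Europe "),
                    value  = c(1.5, 330),
                    stringsAsFactors = FALSE)

# lapply keeps each column's type; trimws is applied only to character columns
toyDF[] <- lapply(toyDF, function(col) if (is.character(col)) trimws(col) else col)
str(toyDF)
```

Assigning into `toyDF[]` keeps the object a data frame instead of replacing it with a list.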


In [51]:
nowClean

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Perú,South America,1500,1200000,500,A
USA,North America,2500,1300000,1500,A
Canada,North America,3500,,1.5,Ba
Côte D'Ivoire,Africa,2500,,,Ba
Israel,Asia,,250000,,
United Kingdom,Europe,2550,310000,330,Ba


## 2.4. Saving the cleaned data

I will save the cleaned dataframe **locally**:

In [54]:
folder <- " DataCleanAndFormatted" # note the leading space: it shows up as %20 in the GitHub URL

# Create the folder only if it does not exist yet
if (!dir.exists(folder)) {
  dir.create(folder)
}

# the file is written whether or not the folder already existed
write.csv(nowClean,file.path(folder,"nowClean.csv"),row.names=F)

**The cleaned file will be sent to GitHub**

The formatting part will read this file from GitHub.

# PART 3. FORMATTING

Let me read the cleaned data from **GITHUB**

In [55]:
linkCleanData='https://github.com/MAGALLANESJoseManuel/deli2_test/raw/refs/heads/main/%20DataCleanAndFormatted/nowClean.csv'
cleanData=read.csv(linkCleanData)
str(cleanData)

'data.frame':	6 obs. of  6 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  "South America" "North America" "North America" "Africa" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
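Because the bad symbols were replaced before saving, `read.csv()` could infer `int` and `num` on its own. When the symbols that stand for missing values are known in advance, they can also be declared at read time with the `na.strings` argument. A sketch with hypothetical inline CSV text:

```r
# Hypothetical CSV content, passed as text instead of a file
csvText <- "id,var1\nA,1500\nB,--\nC,\nD,2550"

# every string listed in na.strings is read as NA
toy <- read.csv(text = csvText, na.strings = c("", "NA", "--"))
str(toy)
```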


In [56]:
cleanData

identification,identification2,var1,var2,var3,category
<chr>,<chr>,<int>,<int>,<dbl>,<chr>
Perú,South America,1500.0,1200000.0,500.0,A
USA,North America,2500.0,1300000.0,1500.0,A
Canada,North America,3500.0,,1.5,Ba
Côte D'Ivoire,Africa,2500.0,,,Ba
Israel,Asia,,250000.0,,
United Kingdom,Europe,2550.0,310000.0,330.0,Ba


## The numerical data

Since the numeric data was clean, `read.csv()` already recognised those columns as numbers (_int_ and _num_), so you need not format them.

## The categorical data

We have one categorical column, currently as text.

In [57]:
cleanData$category

Create a new column that holds the categories as labels, stored as a factor. If the levels are NOT ordinal, just use text labels (if you had ordinal levels, you should add numbers at the beginning of each label).

In [58]:
# create and rename

cleanData$category_label=factor(cleanData$category,
                                levels = c('A','Ba'),
                                labels = c('Not Allied', 'Allied'))

# result
cleanData

identification,identification2,var1,var2,var3,category,category_label
<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<fct>
Perú,South America,1500.0,1200000.0,500.0,A,Not Allied
USA,North America,2500.0,1300000.0,1500.0,A,Not Allied
Canada,North America,3500.0,,1.5,Ba,Allied
Côte D'Ivoire,Africa,2500.0,,,Ba,Allied
Israel,Asia,,250000.0,,,
United Kingdom,Europe,2550.0,310000.0,330.0,Ba,Allied


In [59]:
# verifying
str(cleanData)

'data.frame':	6 obs. of  7 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  "South America" "North America" "North America" "Africa" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
 $ category_label : Factor w/ 2 levels "Not Allied","Allied": 1 1 2 2 NA 2
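For an ordinal variable, the same `factor()` call takes `ordered = TRUE`, and the order of `levels` defines the ranking. A sketch with hypothetical levels (my **category** column is not ordinal):

```r
toyLevel <- c("low", "high", "mid", NA, "low")

# labels start with a number so that the ranking is visible in the text itself
toyOrdinal <- factor(toyLevel,
                     levels  = c("low", "mid", "high"),
                     labels  = c("1_Low", "2_Mid", "3_High"),
                     ordered = TRUE)

toyOrdinal
belowHigh <- toyOrdinal < "3_High"   # comparisons are allowed for ordered factors
belowHigh
```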


Now create a representation of the categories using numbers:

In [60]:
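RENAME_category <- c("Not Allied"=0 ,"Allied"=1)
# look up by label text: a factor used directly as an index is read as its integer codes
cleanData$category_int=unname(RENAME_category[as.character(cleanData$category_label)])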

#result
cleanData

identification,identification2,var1,var2,var3,category,category_label,category_int
<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<fct>,<dbl>
Perú,South America,1500.0,1200000.0,500.0,A,Not Allied,0.0
USA,North America,2500.0,1300000.0,1500.0,A,Not Allied,0.0
Canada,North America,3500.0,,1.5,Ba,Allied,1.0
Côte D'Ivoire,Africa,2500.0,,,Ba,Allied,1.0
Israel,Asia,,250000.0,,,,
United Kingdom,Europe,2550.0,310000.0,330.0,Ba,Allied,1.0


In [61]:
# verifying
str(cleanData)

'data.frame':	6 obs. of  8 variables:
 $ identification : chr  "Perú" "USA" "Canada" "Côte D'Ivoire" ...
 $ identification2: chr  "South America" "North America" "North America" "Africa" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
 $ category_label : Factor w/ 2 levels "Not Allied","Allied": 1 1 2 2 NA 2
 $ category_int   : num  0 0 1 1 NA 1
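A word of caution on this kind of lookup: when a factor is used directly inside `[ ]`, R indexes by the factor's integer codes, not by its labels, so the result is only right when the lookup vector lists the labels in the same order as the factor levels. Converting with `as.character()` makes the lookup go by name. A sketch of the difference, with a hypothetical lookup written in the opposite order:

```r
toyLabel  <- factor(c("Not Allied", "Allied", NA), levels = c("Not Allied", "Allied"))
toyLookup <- c("Allied" = 1, "Not Allied" = 0)   # opposite order to the levels

byPosition <- unname(toyLookup[toyLabel])                 # 1 0 NA (wrong)
byName     <- unname(toyLookup[as.character(toyLabel)])   # 0 1 NA (right)

byPosition
byName
```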


## The TEXT data

We have two columns of text data.

In [62]:
cleanData[,1:2]

identification,identification2
<chr>,<chr>
Perú,South America
USA,North America
Canada,North America
Côte D'Ivoire,Africa
Israel,Asia
United Kingdom,Europe


Text should be written consistently, all in lower case or all in upper case. This will be needed during the **integration** stage. Let me choose upper case:

In [63]:
lapply(cleanData[,1:2],toupper)

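We need to get rid of non-ASCII characters. The `//TRANSLIT` suffix asks `iconv()` to approximate them with ASCII characters (without it, a value such as Perú becomes NA); the exact approximation depends on the platform.

In [66]:
lapply(lapply(cleanData[,1:2],toupper),iconv,from="UTF-8",to="ASCII//TRANSLIT")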

Let's create a function to improve readability:

In [67]:
# //TRANSLIT approximates non-ASCII characters instead of returning NA
formatText=function(column){iconv(toupper(column),from="UTF-8",to="ASCII//TRANSLIT")}
cleanData[,1:2]=lapply(cleanData[,1:2],formatText)

# result
cleanData

identification,identification2,var1,var2,var3,category,category_label,category_int
<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<fct>,<dbl>
PER'U,SOUTH AMERICA,1500.0,1200000.0,500.0,A,Not Allied,0.0
USA,NORTH AMERICA,2500.0,1300000.0,1500.0,A,Not Allied,0.0
CANADA,NORTH AMERICA,3500.0,,1.5,Ba,Allied,1.0
C^OTE D'IVOIRE,AFRICA,2500.0,,,Ba,Allied,1.0
ISRAEL,ASIA,,250000.0,,,,
UNITED KINGDOM,EUROPE,2550.0,310000.0,330.0,Ba,Allied,1.0


In [68]:
str(cleanData)

'data.frame':	6 obs. of  8 variables:
 $ identification : chr  "PER'U" "USA" "CANADA" "C^OTE D'IVOIRE" ...
 $ identification2: chr  "SOUTH AMERICA" "NORTH AMERICA" "NORTH AMERICA" "AFRICA" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
 $ category_label : Factor w/ 2 levels "Not Allied","Allied": 1 1 2 2 NA 2
 $ category_int   : num  0 0 1 1 NA 1
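How `iconv()` handles accented letters depends on the platform's iconv library, which is why **Perú** shows up here as `PER'U`. If plain `PERU` is needed for the integration stage, one portable option is an explicit replacement table. A sketch, where the table and the function name are my own assumptions and cover only the accents in this data:

```r
# Hypothetical replacement table: accented letters found in the data and their ASCII versions
accented <- c("\u00fa", "\u00f4")   # u with acute accent, o with circumflex
plain    <- c("u", "o")

# replace each accented letter literally (fixed = TRUE means no regex)
stripAccents <- function(x) {
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x
}

toyNames <- c("Per\u00fa", "C\u00f4te D'Ivoire", "Canada")
formatted <- toupper(stripAccents(toyNames))
formatted
```

The accents are removed before `toupper()` so that the upper-casing step only ever sees ASCII letters.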


## Saving

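The formatted data **SHOULD NOT** be saved as CSV, because a CSV file does not keep the data types (a factor is read back as text). In R, choose **RDS**. Here I save both files so they can be compared: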

In [71]:
folder = "DataCleanAndFormatted"

# Create the folder only if it does not exist yet
if (!dir.exists(folder)) {
  dir.create(folder)
}

# RDS keeps the data types; the CSV is written only for comparison
saveRDS(cleanData,file.path(folder,"formatted_Data.RDS"))
write.csv(cleanData,file.path(folder,"formatted_Data.csv"),row.names=F)

Which can be read like this:

In [72]:
formatted_Data=readRDS(file.path(folder,"formatted_Data.RDS"))
str(formatted_Data)

'data.frame':	6 obs. of  8 variables:
 $ identification : chr  "PER'U" "USA" "CANADA" "C^OTE D'IVOIRE" ...
 $ identification2: chr  "SOUTH AMERICA" "NORTH AMERICA" "NORTH AMERICA" "AFRICA" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
 $ category_label : Factor w/ 2 levels "Not Allied","Allied": 1 1 2 2 NA 2
 $ category_int   : num  0 0 1 1 NA 1


In [73]:
formatted_Data_csv=read.csv(file.path(folder,"formatted_Data.csv"))
str(formatted_Data_csv)

'data.frame':	6 obs. of  8 variables:
 $ identification : chr  "PER'U" "USA" "CANADA" "C^OTE D'IVOIRE" ...
 $ identification2: chr  "SOUTH AMERICA" "NORTH AMERICA" "NORTH AMERICA" "AFRICA" ...
 $ var1           : int  1500 2500 3500 2500 NA 2550
 $ var2           : int  1200000 1300000 NA NA 250000 310000
 $ var3           : num  500 1500 1.5 NA NA 330
 $ category       : chr  "A" "A" "Ba" "Ba" ...
 $ category_label : chr  "Not Allied" "Not Allied" "Allied" "Allied" ...
 $ category_int   : int  0 0 1 1 NA 1
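The two `str()` outputs differ in **category_label**: the RDS file returns a factor, the CSV returns plain text. A self-contained sketch of the same round trip, written to temporary files so nothing in the project is touched (`toyDF` is hypothetical, and R 4.0 or later is assumed, where `read.csv()` does not create factors by default):

```r
toyDF <- data.frame(label = factor(c("Allied", "Not Allied", NA),
                                   levels = c("Not Allied", "Allied")))

# temporary paths, removed automatically when the R session ends
rdsPath <- tempfile(fileext = ".RDS")
csvPath <- tempfile(fileext = ".csv")
saveRDS(toyDF, rdsPath)
write.csv(toyDF, csvPath, row.names = FALSE)

fromRDS <- readRDS(rdsPath)
fromCSV <- read.csv(csvPath)

class(fromRDS$label)    # the factor and its level order survive
class(fromCSV$label)    # the factor is gone
```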


In [74]:
summary(formatted_Data_csv)

 identification     identification2         var1           var2        
 Length:6           Length:6           Min.   :1500   Min.   : 250000  
 Class :character   Class :character   1st Qu.:2500   1st Qu.: 295000  
 Mode  :character   Mode  :character   Median :2500   Median : 755000  
                                       Mean   :2510   Mean   : 765000  
                                       3rd Qu.:2550   3rd Qu.:1225000  
                                       Max.   :3500   Max.   :1300000  
                                       NA's   :1      NA's   :2        
      var3          category         category_label      category_int
 Min.   :   1.5   Length:6           Length:6           Min.   :0.0  
 1st Qu.: 247.9   Class :character   Class :character   1st Qu.:0.0  
 Median : 415.0   Mode  :character   Mode  :character   Median :1.0  
 Mean   : 582.9                                         Mean   :0.6  
 3rd Qu.: 750.0                                         3rd Qu.:1.0  
 Max

In [75]:
summary(formatted_Data)

 identification     identification2         var1           var2        
 Length:6           Length:6           Min.   :1500   Min.   : 250000  
 Class :character   Class :character   1st Qu.:2500   1st Qu.: 295000  
 Mode  :character   Mode  :character   Median :2500   Median : 755000  
                                       Mean   :2510   Mean   : 765000  
                                       3rd Qu.:2550   3rd Qu.:1225000  
                                       Max.   :3500   Max.   :1300000  
                                       NA's   :1      NA's   :2        
      var3          category            category_label  category_int
 Min.   :   1.5   Length:6           Not Allied:2      Min.   :0.0  
 1st Qu.: 247.9   Class :character   Allied    :3      1st Qu.:0.0  
 Median : 415.0   Mode  :character   NA's      :1      Median :1.0  
 Mean   : 582.9                                        Mean   :0.6  
 3rd Qu.: 750.0                                        3rd Qu.:1.0  
 Max.   :1