# Cleaning and Formatting my data




This is my data:

In [1]:
IRdisplay::display_html('<iframe width="700" height="300" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ubcCBaveg-58jcVmbErpO5kZswjFyHN5YlB8tB1a8B4fzU4sqZ08jkOKx4kBz1qtDNkJJWH8vBYF/pubhtml?gid=2024244899&single=true"></iframe>')

You can find it [here](https://docs.google.com/spreadsheets/d/1e1Pll_MGF6dVi4KJTXTiLfRXzkjf7ZdBhd58yt3Vkl8/edit?usp=sharing) too, on GoogleDrive.

**For this step, you should read from GitHub.**

This is the link to my CSV:

In [None]:
# the link as CSV
linkToData="https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ubcCBaveg-58jcVmbErpO5kZswjFyHN5YlB8tB1a8B4fzU4sqZ08jkOKx4kBz1qtDNkJJWH8vBYF/pub?gid=0&single=true&output=csv"

Read the data:

In [None]:
dirty=read.csv(linkToData,check.names=F)

As usual, I check the data types:

In [None]:
str(dirty)

Now, I identify which are textual, numerical, or categorical.

* Columns **identification** and **identification2** are *textual*.
* The columns from **var1** to **var@3** are all *numerical*. If R reports one of them as _chr_ (character), the column must contain some non-numerical characters.
* Column **category** is *categorical*. Keep in mind that categorical types will NEVER be recognised as such when read from a CSV. They will always be understood as text (_chr_).

The **column names** are always *textual*.
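A quick way to list which columns arrived as text is to ask for the class of every column. This is only a sketch on a toy data frame; the names `toy`, `colTypes` and `textCols`, and the values, are made up for illustration.

```r
# Toy data frame (hypothetical values) mimicking a freshly read CSV
toy <- data.frame(id   = c("Peru", "Chile"),
                  var1 = c("1,200", "350"),  # the comma forces the column to be text
                  var2 = c(10.5, 20),
                  stringsAsFactors = FALSE)

# One class per column
colTypes <- sapply(toy, class)
colTypes

# Names of the columns that were read as text
textCols <- names(colTypes)[colTypes == "character"]
textCols
```

Any column that should be numerical but shows up in `textCols` is a candidate for the exploration that follows.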



# PART 1. EXPLORATION



### 1.1. **Exploring TEXT**


When data is textual, you need to explore the cells to verify all the characters are part of the **alphabet**.

Let me use R's **grep()** function:

In [None]:
# show me the cells that have a character outside the alphabet
dirty$identification[grep("[^a-zA-Z]",dirty$identification)]

United Kingdom is not dirty. But the space is outside the alphabet. What about:

In [None]:
dirty$identification[grep("\\W",dirty$identification)]

or...

In [None]:
dirty$identification[grep("[^\\w\\s]",dirty$identification,perl=T)]


Then the safe option is:

In [None]:
dirty$identification[grep("[^a-zA-Z\\s]",dirty$identification,perl = T)]
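To see how these patterns differ, here is a small sketch on a toy vector (the values in `toyNames` are invented, not taken from the data above):

```r
toyNames <- c("Peru", "United Kingdom", "Chile[1]", "Brazil2")

# Letters only: the space in "United Kingdom" is flagged too
onlyLetters <- grep("[^a-zA-Z]", toyNames, value = TRUE)
onlyLetters

# Non-word characters: digits count as 'word' characters, the space does not
nonWord <- grep("\\W", toyNames, value = TRUE)
nonWord

# Letters and whitespace allowed: only brackets and digits are flagged
lettersAndSpace <- grep("[^a-zA-Z\\s]", toyNames, perl = TRUE, value = TRUE)
lettersAndSpace
```

Using `value = TRUE` returns the matching cells directly, which is equivalent to indexing the vector with the positions that **grep()** returns.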

A similar exploration should be done in the **column names**:

In [None]:
# allowing numbers, not spaces
names(dirty)[grep("[^0-9a-zA-Z]",names(dirty),perl = T)]

And in the case of the column with **categorical data**:

In [None]:
dirty$category[grep("[^a-zA-Z]",dirty$category,perl = T)]

### 1.2. **Exploring NUMBERS**

If numbers are recognised as such, no cleaning is needed. If they are not, the column has been read as text, and we can locate the offending cells with the regex **\d** (and its variations, such as **\D** for non-digits):

In [None]:
dirty$var1[grep("\\D",dirty$var1,perl = T)]

In [None]:
dirty$'var 2'[grep("\\D",dirty$'var 2',perl = T)]

In [None]:
### Why the error?
# dirty$var@3[grep("\\D",dirty$var@3,perl = T)]

Notice I need quotes (**''**) to access the variables with dirty names (a space between words, or the **@** special character). That is why you clean the column names first:

In [None]:
dirty$'var@3'[grep("\\D",dirty$'var@3',perl=T)]

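Some cells hold good values, but other values cannot be kept as they are. Use **\D** with care, because numbers are complex: a valid decimal point is also a non-digit. So I prefer something like this, which flags the cells containing any character other than a digit, a dot, a plus sign, or an asterisk (inside square brackets, **+** and **\*** are literal characters, not quantifiers):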

In [None]:
dirty$'var@3'[grep("[^\\d+\\.*\\d*]", dirty$'var@3', perl=T,invert = F)]
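A character class only checks characters one by one, so it cannot tell whether the whole cell has the shape of a number. As a hedged alternative, an anchored pattern can describe the full shape. The sketch below uses an invented vector `toyValues`, and assumes a valid number is a run of digits with an optional decimal part:

```r
toyValues <- c("12", "3.5", "1.2.3", "$40", "7k", "..")

# Character-class check: flags cells with any character other than digits or dots
flaggedByClass <- toyValues[grep("[^\\d.]", toyValues, perl = TRUE)]
flaggedByClass

# Anchored check: the whole cell must be digits, optionally followed by a dot and digits
isNumber <- grepl("^\\d+(\\.\\d+)?$", toyValues, perl = TRUE)
flaggedByShape <- toyValues[!isNumber]
flaggedByShape
```

The anchored version also catches cells such as "1.2.3" or "..", which contain only allowed characters but are not numbers. It would need extending if negative values or scientific notation can appear.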

# PART 2. CLEANING

As mentioned, cleaning may mean:

a. Making bad characters disappear.

b. Making good characters stay.


Let's start with the _column names_:

## 2.1  The Column names

In [None]:
names(dirty)[grep("[^0-9a-zA-Z]",names(dirty),perl = T)]

How can you say: if there is a space or a weird character, make it disappear? (That is, *replace* it with "".)

In [None]:
# option 1
gsub("\\W",'',names(dirty), perl=T )


In [None]:
# option 2
gsub("[^\\w]",'',names(dirty), perl=T )

In [None]:
# option 3
gsub("[^0-9a-zA-Z]",'',names(dirty), perl=T )

Choose any and make the change:

In [None]:
names(dirty)=gsub("[^0-9a-zA-Z]",'',names(dirty), perl=T )
dirty

The column names were cleaned by **Making bad characters disappear** 🙂
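The options are not always interchangeable. In this sketch (the names in `toyCols` are hypothetical, with an underscore added on purpose), **\W** keeps the underscore because it counts as a word character, while the explicit class removes it:

```r
toyCols <- c("var1", "var 2", "var@3", "var_4")

# Options 1 and 2: remove non-word characters (the underscore survives)
keepUnderscore <- gsub("\\W", "", toyCols)
keepUnderscore

# Option 3: keep only letters and digits (the underscore is removed)
lettersDigits <- gsub("[^0-9a-zA-Z]", "", toyCols)
lettersDigits

# A value of 0 means no two cleaned names collide
anyDuplicated(lettersDigits)
```

Checking for duplicates is worth the extra line, since removing characters can turn two different dirty names into the same clean name.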

## 2.2  The dataframe contents

The contents include:
* The data columns. Generally numbers and categories.
* The identifier column(s). Generally text.



### 2.2.1 The identifier column(s)

We have two of those. Let's check the **identification** column:

In [None]:
dirty$identification[grep("[^a-zA-Z\\s]",dirty$identification,perl = T)]

Not all characters detected are invalid. The **only** problem here is the brackets. Then:

* Option 1: Whatever is inside the brackets (including the brackets themselves) has to go!

In [None]:
gsub("\\[.*\\]",'',dirty$identification,perl = T)

* Option 2: Splitting

In [None]:
strsplit(dirty$identification,split = '[',fixed=T)

You got a list, BUT you need a data frame column. Then:

In [None]:
## saving result
resultSplitIn2=strsplit(dirty$identification,split = '[',fixed=T)
# keep the first piece of each split, collected as a vector
goodColumn=c()
for (elements in resultSplitIn2){
  goodColumn=c(goodColumn,elements[1]) # the text before the '['
}
goodColumn

When you are happy, make the change:

In [None]:
dirty$identification=goodColumn
dirty
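The loop can also be written without growing a vector step by step. This is just an alternative sketch on invented values (`toyId` is not the real column); **vapply()** picks the first piece of every split and guarantees a character result:

```r
toyId <- c("Peru[1]", "Chile", "United Kingdom[a]")

# Split at the literal '[' (fixed = TRUE means no regex is involved)
pieces <- strsplit(toyId, split = "[", fixed = TRUE)

# Take the first piece of each element; character(1) enforces one string per cell
firstPiece <- vapply(pieces, function(p) p[1], character(1))
firstPiece
```

Replacing `p[1]` with `p[2]` would keep the second piece instead, which returns `NA` for cells that had nothing to split.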

The **splitting** option seems very convenient for **identification2**:

In [None]:
## you want to keep [2]:
## saving result
resultSplitIn2=strsplit(dirty$identification2,split = ',', fixed = T)
# keep the second piece of each split, collected as a vector
goodColumn=c()
for (elements in resultSplitIn2){
  goodColumn=c(goodColumn,elements[2]) # keeping the right part!
}
goodColumn

If this is OK, then:

In [None]:
dirty$identification2=goodColumn
dirty

### 2.2.2 The Categorical columns

The **category** requires a frequency table:

In [None]:
table(dirty$category)

You can conclude that the **a** is wrong, it should be **A**.

In [None]:
#what about:
gsub('a','A', dirty$category,fixed=T) #fixed uses NO REGEX

That changed **Ba** to **BA**!

In [None]:
## maybe
## ^: start of string
## $: end  of string
gsub('^a$','A', dirty$category)

The simpler way:

In [None]:
dirty[dirty$category=='a','category']='A'

dirty


As you can see, there are some symbols standing in for missing values. We could change them now, or later.

Let me first check the **numeric columns**.

### 2.2.3. The numerical columns

In [None]:
gsub(',','',dirty$var1)


Then,

In [None]:
dirty$var1=gsub(',','',dirty$var1)
dirty


The **var2** is more complicated.

In [None]:
# save where you have the issue
dirty$var2_temp=grepl("\\'|k",dirty$var2,fixed=F)
dirty

In [None]:
## now replace
dirty$var2=gsub("\\'|k",'',dirty$var2)
dirty

In [None]:
# now the real value
ifelse(dirty$var2_temp,paste0(dirty$var2,'000'),dirty$var2)

In [None]:
# then
dirty$var2=ifelse(dirty$var2_temp,paste0(dirty$var2,'000'),dirty$var2)
dirty$var2_temp=NULL
dirty
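Pasting '000' works when the digits before the **k** are whole numbers. A hedged alternative is to do the arithmetic instead of editing the text. The sketch below uses invented values, and assumes that only **k** means thousands and that the apostrophe is just noise, which may not match the real column:

```r
toyVar2 <- c("1500", "2k", "'300", "45k", "1.5k")

# Remember which cells carried the thousands suffix
hasK <- grepl("k", toyVar2, fixed = TRUE)

# Drop the apostrophes and the suffix, leaving only the number as text
digitsOnly <- gsub("'|k", "", toyVar2)

# Convert to numeric and scale the flagged cells
asNumber <- as.numeric(digitsOnly) * ifelse(hasK, 1000, 1)
asNumber
```

With this approach a value like "1.5k" becomes 1500, whereas pasting '000' onto "1.5" would produce "1.5000".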

The **var3** can be solved like this:

In [None]:
dirty['var3']=gsub("\\$|\\s",'',dirty$var3)
dirty

## 2.3. Detecting missing values:


Wrong representations of missing values should be replaced with care. Do it according to the data type.

Then, let's start with the **categorical** column:

In [None]:
badSymbolCat=grep('\\W+',dirty$category,value = T)
badSymbolCat

Once found:

In [None]:
# set to NA every cell flagged above (this also works when several cells were flagged)
dirty$category[dirty$category %in% badSymbolCat]=NA
dirty

Let's go for the **numerical** cases:

In [None]:
dirty$var1[grep("[^\\d+\\.*\\d*]",dirty$var1,perl = T)]

In [None]:
dirty$var2[grep("[^\\d+\\.*\\d*]",dirty$var2,perl = T)]

In [None]:
dirty$var3[grep("[^\\d+\\.*\\d*]", dirty$var3, perl=T,invert = F)]

Here I apply a function to several columns, instead of one by one:

In [None]:
sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]})

Nice output:

In [None]:
unlist(sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]}))

In [None]:
unique(unlist(sapply(dirty[, c('var1','var2','var3')], function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]})))

Let's improve readability:

In [None]:
detectWrongNA= function(col){col[grep("[^\\d+\\.*\\d*]", col, perl=T,invert = F)]}
badSymbolNum=sapply(dirty[, c('var1','var2','var3')],detectWrongNA)
badSymbolNum_unlist=unlist(badSymbolNum)
badSymbolNum_vector=unique(badSymbolNum_unlist)
badSymbolNum_vector

Let's clean those columns:

In [None]:


dirty[, c('var1','var2','var3')]=lapply(dirty[, c('var1','var2','var3')],function(col) ifelse((col %in% badSymbolNum_vector), NA, col))

dirty


In [None]:
str(dirty)
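When the wrong symbols are already known, they can also be declared at reading time through the `na.strings` argument of **read.csv()**. This is a sketch on an inline CSV; the symbols "?" and "--" and the values are assumptions for illustration, not the ones in the real file:

```r
# A tiny CSV given as text (hypothetical content)
csvText <- "id,category,var1\nPeru,A,10\nChile,?,--\nBrazil,Ba,30"

# Every cell equal to one of these strings is read as NA
toy <- read.csv(text = csvText, na.strings = c("?", "--"),
                stringsAsFactors = FALSE)
str(toy)
```

Since the symbol no longer pollutes the column, `var1` can be read as a number straight away. Note that `na.strings` only matches whole cells, so it does not replace the exploration done above.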

Always guard against leading and trailing spaces!

In [None]:
dirty[,]=sapply(dirty[,],trimws) #use it when all are CHR!
nowClean=dirty[,]
str(nowClean)

In [None]:
nowClean
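A small sketch of what **trimws()** does, on an invented data frame. Here I use **lapply()** with `toy[]` on the left side, a common idiom that keeps the data frame structure while replacing every column:

```r
toy <- data.frame(id  = c(" Peru", "Chile "),
                  cat = c("A ", " Ba"),
                  stringsAsFactors = FALSE)

# Trim both ends of every cell; toy[] keeps the data frame shape
toy[] <- lapply(toy, trimws)
toy
```

As in the cell above, this assumes every column is text; a numeric column passed through **trimws()** would come back as character.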

## 2.4. Saving the cleaned data

I will save the cleaned dataframe **locally**:

In [None]:
folder <- "dataFormatted"

# Create the folder only if it does not exist yet
if (!dir.exists(folder)) {
  dir.create(folder)
}

# Write the file in either case
write.csv(nowClean,file.path(folder,"nowClean.csv"),row.names=F)

**The cleaned file will be sent to GitHub.**

The formatting part will read this file from GitHub.

# PART 3. FORMATTING

Let me read the cleaned data from **GitHub**:

In [None]:
linkCleanData='https://github.com/MAGALLANESJoseManuel/deli2_test/raw/refs/heads/main/DataCleanAndFormatted/nowClean.csv'
cleanData=read.csv(linkCleanData)
str(cleanData)

In [None]:
cleanData

## The numerical data

Since the numeric data was clean, you need not format those columns.
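The reason is that **read.csv()** infers the type of each column again when the cleaned file is read. A minimal sketch with an inline CSV (invented values) shows it:

```r
# Clean text, as it would look after the cleaning part (hypothetical content)
csvText <- "id,var1,var3\nPERU,1200,3.5\nCHILE,NA,4"

reRead <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Columns holding only digits (and NA) come back as numbers
reReadTypes <- sapply(reRead, class)
reReadTypes
```

If a numeric column still shows up as character at this point, some cell survived the cleaning and the exploration should be repeated.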

## The categorical data

We have one categorical column, currently as text.

In [None]:
cleanData$category

Create a categorical column of labels. If the levels are NOT ordinal, plain text labels are enough (if you had ordinal levels, you should add numbers at the beginning of each label).

In [None]:
# create and rename

cleanData$category_label=factor(cleanData$category,
                                levels = c('A','Ba'),
                                labels = c('Not Allied', 'Allied'))

# result
cleanData

In [None]:
# verifying
str(cleanData)
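For comparison, this is a sketch of the ordinal case mentioned above. The column, its levels and the numbered labels are all hypothetical, since this data has no ordinal variable:

```r
toyLevel <- c("low", "high", "medium", "low")

# ordered = TRUE makes the order of 'levels' meaningful
toyOrd <- factor(toyLevel,
                 levels  = c("low", "medium", "high"),
                 labels  = c("1_Low", "2_Medium", "3_High"),
                 ordered = TRUE)
toyOrd

# Ordered factors allow comparisons against a level
belowHigh <- toyOrd < "3_High"
belowHigh
```

The numbers at the beginning of the labels keep the categories in the right order even where they are sorted alphabetically, for instance in a plot legend or after an export.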

Now create a representation of the categories using numbers:

In [None]:
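RENAME_category <- c("Not Allied"=0 ,"Allied"=1)
# index by the label text: a factor used directly as an index is treated as its integer codes
cleanData$category_int=unname(RENAME_category[as.character(cleanData$category_label)])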

#result
cleanData

In [None]:
# verifying
str(cleanData)
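A word of caution on looking values up with a factor: R indexes with the factor's integer codes, not with its labels. The sketch below uses invented values, and a lookup vector written on purpose in a different order from the factor levels to make the difference visible:

```r
lab <- factor(c("Allied", "Not Allied", "Allied"),
              levels = c("Not Allied", "Allied"))

# Lookup written in a different order than the levels (on purpose)
lookup <- c("Allied" = 1, "Not Allied" = 0)

# Indexing by the factor uses the codes 2, 1, 2, so it picks by position
byCode <- unname(lookup[lab])
byCode

# Indexing by the label text picks by name, which is what we want
byName <- unname(lookup[as.character(lab)])
byName
```

The two results only agree when the lookup vector happens to follow the same order as the levels, so converting with **as.character()** is the safer habit.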

## The TEXT data

We have two columns of text data.

In [None]:
cleanData[,1:2]

The text format should have all characters in lower or upper case. This will be needed during the **integration** stage. Let me choose upper case:

In [None]:
lapply(cleanData[,1:2],toupper)

We need to get rid of non-ASCII characters.

In [None]:
lapply(lapply(cleanData[,1:2],toupper),iconv,from="UTF-8",to="ASCII//TRANSLIT")

Let's create a function to improve readability:

In [None]:
formatText=function(column){iconv(toupper(column),from="UTF-8",to="ASCII//TRANSLIT")}
cleanData[,1:2]=lapply(cleanData[,1:2],formatText)

# result
cleanData
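Here is the same idea on a toy vector (invented values, with the accented letters written as Unicode escapes so the script does not depend on the file encoding). I show no expected result for the accented cells, because the output of `//TRANSLIT` depends on the platform:

```r
toyText <- c("Per\u00fa", "C\u00f4te d'Ivoire", "Chile")

# Same steps as formatText above, under a different name to avoid clashing with it
formatTextSketch <- function(column) {
  iconv(toupper(column), from = "UTF-8", to = "ASCII//TRANSLIT")
}

formatted <- formatTextSketch(toyText)
formatted

# Cells that could not be converted come back as NA, so it is worth counting them
sum(is.na(formatted))
```

Plain ASCII cells such as "Chile" are only upper-cased. For the others, it is a good idea to inspect the result on the machine where the code will actually run.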

## Saving

The formatted data **SHOULD NOT** be saved as CSV, because a CSV stores only text and would lose the factor column created above. In R, choose **RDS**:

In [None]:
folder = "DataCleanAndFormatted"

# Create the folder only if it does not exist yet
if (!dir.exists(folder)) {
  dir.create(folder)
}

# Save the file in either case
saveRDS(cleanData,file.path(folder,"formatted_Data.RDS"))

Which can be read like this:

In [None]:
formatted_Data=readRDS(file.path(folder,"formatted_Data.RDS"))
str(formatted_Data)

In [None]:
formatted_Data
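To see why RDS is preferred, this sketch saves the same toy data frame (invented values) in both formats, using temporary files so nothing is written to the project folder:

```r
toy <- data.frame(id = c("PERU", "CHILE"),
                  category_label = factor(c("Allied", "Not Allied"),
                                          levels = c("Not Allied", "Allied")),
                  stringsAsFactors = FALSE)

csvFile <- tempfile(fileext = ".csv")
rdsFile <- tempfile(fileext = ".rds")

write.csv(toy, csvFile, row.names = FALSE)
saveRDS(toy, rdsFile)

fromCsv <- read.csv(csvFile, stringsAsFactors = FALSE)
fromRds <- readRDS(rdsFile)

class(fromCsv$category_label)  # the factor came back as plain text
class(fromRds$category_label)  # the factor survived
levels(fromRds$category_label) # and so did the order of its levels
```

The CSV round trip would force you to rebuild the factor every time the file is read, while the RDS file returns the data frame as it was saved.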