# Casey's R Notebook

### Introduction

In [1]:
path1 = 'C:/Users/casey/Dropbox/SMU_DataScience/MSDS_7333_QuantifyingTheWorld/Homework/CaseStudy1/offline.final.trace.txt'

rawData = readLines(path1)

### Explore and Parse the Raw Data

In [2]:
paste0("Total Records in Offline dataset: ",length(rawData))

In [4]:
print(rawData[1:6])

[1] "# timestamp=2006-02-11 08:31:58"
[2] "# usec=250"
[3] "# minReadings=110"

The first three lines are comments, so let's drop every line that starts with "#" and take a look at a raw data string.

In [5]:
rawData2 = rawData[substr(rawData,1,1) != "#"]
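
As a quick sanity check of that filter, here is a minimal sketch on a made-up three-line input (the `sampleLines` vector is invented for illustration and is not read from the trace file). It also shows `startsWith()`, available in base R 3.3.0 and later, as an equivalent way to write the same test.

```r
# Toy input (made up for illustration): two comment lines and one data line.
sampleLines = c("# timestamp=2006-02-11 08:31:58",
                "# usec=250",
                "t=1139643118358;id=00:02:2D:21:0F:33")

# Keep only the lines whose first character is not "#".
kept = sampleLines[substr(sampleLines, 1, 1) != "#"]

# startsWith() expresses the same test more directly.
keptAlt = sampleLines[!startsWith(sampleLines, "#")]

length(kept)
identical(kept, keptAlt)
```

Because the test is applied to every element, comment lines are removed wherever they occur in the file, not only at the top.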

In [6]:
rawRecord = strsplit(rawData2[1], ";")[[1]]
rawRecord

Splitting the character string on ",", "=", and ";" with the regular expression "[,=;]" breaks it into individual labels and values.

In [7]:
dataStrings = strsplit(rawData2[1], "[,=;]")[[1]]
dataStrings
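
To make the token layout concrete, the sketch below splits a shortened stand-in record. The values are copied from the output in this notebook, but the field labels (`t`, `id`, `pos`, `degree`) and the exact layout of `sampleLine` are assumptions, since the raw string itself is not reproduced here.

```r
# Assumed sample record: the labels "t", "id", "pos", "degree" are an
# assumption; the values come from the output shown in this notebook.
sampleLine = paste0("t=1139643118358;id=00:02:2D:21:0F:33;",
                    "pos=0.0,0.0,0.0;degree=0.0;",
                    "00:14:bf:b1:97:8a=-38,2437000000,3;",
                    "00:14:bf:b1:97:90=-56,2427000000,3")

# "[,=;]" is a character class: split wherever a comma, an equals sign,
# or a semicolon occurs. The colons inside MAC addresses are left alone.
tokens = strsplit(sampleLine, "[,=;]")[[1]]

length(tokens)             # 10 header tokens plus 4 tokens per reading
tokens[c(2, 4, 6:8, 10)]   # the six header values
tokens[-(1:10)]            # everything after the header: the readings
```

Under this layout the first 10 tokens alternate between labels and values, which is why the later cells pick positions 2, 4, 6 to 8, and 10 for the header and treat the rest as groups of four.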

Now that the large string has been broken up into its value and label parts, we can organize the pieces into a proper matrix. We start with the per-MAC readings, since the number of readings varies from record to record and determines how many rows we need. We build two matrices, one for the readings and one for the record-level fields repeated on every row, and then bind them together.

In [8]:
m1 = matrix(dataStrings[-(1:10)], ncol = 4, byrow = TRUE)
m1

0,1,2,3
00:14:bf:b1:97:8a,-38,2437000000,3
00:14:bf:b1:97:90,-56,2427000000,3
00:0f:a3:39:e1:c0,-53,2462000000,3
00:14:bf:b1:97:8d,-65,2442000000,3
00:14:bf:b1:97:81,-65,2422000000,3
00:14:bf:3b:c7:c6,-66,2432000000,3
00:0f:a3:39:dd:cd,-75,2412000000,3
00:0f:a3:39:e0:4b,-78,2462000000,3
00:0f:a3:39:e2:10,-87,2437000000,3
02:64:fb:68:52:e6,-88,2447000000,1


In [9]:
m2 = matrix(dataStrings[c(2,4,6:8,10)], nrow = nrow(m1), ncol = 6, byrow = TRUE)
m2

0,1,2,3,4,5
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
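
The reason `m2` repeats the same six values on every row is that `matrix()` recycles its input when it is shorter than the requested size. A small sketch of that behaviour, using a hypothetical `nReadings` of 3 in place of `nrow(m1)`:

```r
# Six header values for one record (taken from the output above).
header = c("1139643118358", "00:02:2D:21:0F:33", "0.0", "0.0", "0.0", "0.0")
nReadings = 3   # hypothetical number of rows in m1

# matrix() recycles `header` until nReadings * 6 cells are filled;
# byrow = TRUE lays each copy along a row.
recycled = matrix(header, nrow = nReadings, ncol = 6, byrow = TRUE)

# Equivalent, more explicit construction with rep().
explicit = matrix(rep(header, times = nReadings), ncol = 6, byrow = TRUE)

identical(recycled, explicit)
```

Without `byrow = TRUE` the recycled values would be filled column by column, and each row would no longer hold one complete copy of the header.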


In [10]:
m3 = cbind(m1,m2)
m3

0,1,2,3,4,5,6,7,8,9
00:14:bf:b1:97:8a,-38,2437000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:14:bf:b1:97:90,-56,2427000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:0f:a3:39:e1:c0,-53,2462000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:14:bf:b1:97:8d,-65,2442000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:14:bf:b1:97:81,-65,2422000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:14:bf:3b:c7:c6,-66,2432000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:0f:a3:39:dd:cd,-75,2412000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:0f:a3:39:e0:4b,-78,2462000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
00:0f:a3:39:e2:10,-87,2437000000,3,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0
02:64:fb:68:52:e6,-88,2447000000,1,1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0


Now we can wrap this logic in a function and apply it to every line of the raw data file.

In [27]:
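parseFunc = function(x){
    # Split the record passed in as x (not a fixed line of rawData)
    dataStrings = strsplit(x, "[,=;]")[[1]]
    # Exactly 10 tokens means the record has no signal readings
    if (length(dataStrings) == 10)
        return(NULL)
    # One row per reading: MAC, signal, channel, type
    m1 = matrix(dataStrings[-(1:10)], ncol = 4, byrow = TRUE)
    # Record-level fields, recycled onto every row
    m2 = matrix(dataStrings[c(2,4,6:8,10)], nrow = nrow(m1), ncol = 6, byrow = TRUE)
    m3 = cbind(m2,m1)
    return(m3)
}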
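
A minimal, self-contained check of this parsing logic on two made-up lines, one with readings and one without. The helper name `parseLine` and both sample strings are hypothetical (the field labels are assumed), and the helper parses whatever line it is given. The point is to see that a record with only the 10 header tokens yields `NULL` and that `rbind` ignores it when the pieces are stacked.

```r
# Hypothetical helper mirroring the steps above; `parseLine` is an
# illustrative name, not something defined elsewhere in this notebook.
parseLine = function(x) {
    tokens = strsplit(x, "[,=;]")[[1]]
    # Only the 10 header tokens present: no readings to return.
    if (length(tokens) == 10)
        return(NULL)
    readings = matrix(tokens[-(1:10)], ncol = 4, byrow = TRUE)
    header = matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(readings),
                    ncol = 6, byrow = TRUE)
    cbind(header, readings)
}

# Made-up lines; the field labels are assumed.
withReadings = paste0("t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;",
                      "degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;",
                      "00:14:bf:b1:97:90=-56,2427000000,3")
noReadings = "t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0"

parsed = lapply(c(withReadings, noReadings), parseLine)
stacked = do.call("rbind", parsed)   # rbind() skips the NULL entry
dim(stacked)
```

Each input line contributes as many rows as it has readings, so the stacked result has one row per (record, access point) pair.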

In [97]:
data = lapply(rawData2, parseFunc)
offline = as.data.frame(do.call("rbind", data))

names(offline) = c('time','scanMAC','posX','posY','posZ','orientation','MAC','signal','channel','type')
offline[1:10,]

time,scanMAC,posX,posY,posZ,orientation,MAC,signal,channel,type
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:14:bf:b1:97:8a,-38,2437000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:14:bf:b1:97:90,-56,2427000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:0f:a3:39:e1:c0,-53,2462000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:14:bf:b1:97:8d,-65,2442000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:14:bf:b1:97:81,-65,2422000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:14:bf:3b:c7:c6,-66,2432000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:0f:a3:39:dd:cd,-75,2412000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:0f:a3:39:e0:4b,-78,2462000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,00:0f:a3:39:e2:10,-87,2437000000,3
1139643118358,00:02:2D:21:0F:33,0.0,0.0,0.0,0.0,02:64:fb:68:52:e6,-88,2447000000,1


#### Convert the data type of the numeric columns

In [98]:
numVars = c('time','posX','posY','posZ','orientation','signal')

offline = lapply(offline, as.character)
offline[numVars] = lapply(offline[numVars], as.numeric)

# offline[1:10,]

## Why are my numeric columns getting jacked up???

In [99]:
# data2[numVars] = lapply(data2[numVars], as.numeric)

unlist(lapply(offline, class))

In [100]:
typeof(offline)

In [101]:
colnames(offline)

NULL
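
The `NULL` here is a strong hint about what went wrong: `lapply()` always returns a plain list, so `offline = lapply(offline, as.character)` replaces the data frame with a list and `colnames()` has nothing to report. Below is a minimal sketch on a small stand-in data frame (`df` is made up for illustration) showing the effect and one common way around it, assigning into `df[]` so the data frame structure is kept. The `as.character` step is presumably there because R versions before 4.0 turned character matrix columns into factors by default.

```r
# Small stand-in for `offline`: every column stored as character.
df = data.frame(time = c("1139643118358", "1139643118358"),
                signal = c("-38", "-56"),
                MAC = c("00:14:bf:b1:97:8a", "00:14:bf:b1:97:90"),
                stringsAsFactors = FALSE)

# lapply() returns a plain list, so the data.frame class is lost.
asList = lapply(df, as.character)
class(asList)
colnames(asList)   # a plain list has no dimnames, so this is NULL

# Assigning into df[] keeps the data frame and replaces its columns.
df[] = lapply(df, as.character)
numVars = c("time", "signal")
df[numVars] = lapply(df[numVars], as.numeric)

class(df)
sapply(df, class)
```

Note that `typeof()` reports "list" for a data frame as well, so `class()` or `is.data.frame()` is the more telling check when diagnosing this.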

In [None]:
# offline