# Archaeological Data Analysis: lab module 2

### Author: Kassandra Merino Muniz, Paul Topazio

## Download delimited-text data

We'll make the standard Scala `Source` object available by `import`ing it, then use it to retrieve the content of a URL.

In [1]:
import scala.io.Source
val vases2020cex = "https://raw.githubusercontent.com/kassandramerinomuniz/clas299/master/Painters"

[32mimport [39m[36mscala.io.Source
[39m
[36mvases2020cex[39m: [32mString[39m = [32m"https://raw.githubusercontent.com/kassandramerinomuniz/clas299/master/Painters"[39m

We'll extract a sequence of lines from the URL source, and convert them to our favorite type of Scala collection, a `Vector`.

(The following cell downloads the data:  depending on your internet connection, this might take a moment.)

In [2]:
val lines = Source.fromURL(vases2020cex).getLines().toVector

[36mlines[39m: [32mVector[39m[[32mString[39m] = [33mVector[39m(
  [32m"painter#beazley#museumId#shape#findspot#subject#pleiades"[39m,
  [32m"The Painter of Acropolis#1#Athens Act. 24#Plates#Athens#Woman seated#NA"[39m,
  [32m"The Painter of Acropolis#2#Oxford, Beazley#Plates#unknown#Woman, holding wreath, seated at wool-basket#NA"[39m,
  [32m"The Painter of Acropolis#3#Athens Act. 562#Pyxides#Athens#Women#NA"[39m,
  [32m"The Painter of Acropolis#4#Athens#Pyxides#Attica#Arms and thighs of a woman seated to the right, holding out a wreath or necklace, upper part of a youth in a himation leaning on his stick to left, left arm extended)#NA"[39m,
  [32m"Phintias#1#Lourve G 42#Amphorae#Vulci#A, Apollo and Tityos. B, athletes, Greek letters#NA"[39m,
  [32m"Phintias#2#Tarquinia RC 6843#Amphorae#Tarquinia#A, Dionysos with satyrs and maenads. B, Herakles and Apollo: the struggle for the tripod. On A and B, Greek letters#NA"[39m,
  [32m"Phintias#3#Lourve C 10784#Pelike#unkn
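
`Source.fromURL` opens a connection that the cell above never closes. One possible way to close it automatically, and to capture a failed download as a value rather than an exception, is sketched below. It assumes Scala 2.13 or later (for `scala.util.Using`); the helper name `readLines` is hypothetical, and a small in-memory string stands in for the real file so the sketch runs without a network connection.

```scala
import scala.io.Source
import scala.util.{Try, Using}

// Hypothetical helper: read every line from a Source and close it afterwards.
// Returns Success(lines) or Failure(exception) instead of throwing.
def readLines(source: => Source): Try[Vector[String]] =
  Using(source)(src => src.getLines().toVector)

// Assumption: a two-line in-memory stand-in for the downloaded file.
val sample: Try[Vector[String]] =
  readLines(Source.fromString("painter#beazley\nPhintias#1"))
```

With the real data, the call would be `readLines(Source.fromURL(vases2020cex))`, and the result could be unpacked with `.get` or a pattern match on `Success` and `Failure`.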

## Examine header line

To start with, let's see what the first line looks like, and compare it with the first data line.

In [3]:
lines.head // same as lines(0)

[36mres2[39m: [32mString[39m = [32m"painter#beazley#museumId#shape#findspot#subject#pleiades"[39m

In [4]:
lines(1)

[36mres3[39m: [32mString[39m = [32m"The Painter of Acropolis#1#Athens Act. 24#Plates#Athens#Woman seated#NA"[39m

## Split data strings into columns

Every line is a `String`.  If we break it up using the `split` method, we get an `Array` of `String`s, which we'll convert to a `Vector` of `String`s.  Because we do this for every line after the header (`lines.tail`), we turn a Vector of Strings into a Vector of Vectors of Strings.  Notice that Scala reports the type of the new `data` value as `Vector[Vector[String]]`.


In [5]:
val data = lines.tail.map(ln => ln.split("#").toVector)

[36mdata[39m: [32mVector[39m[[32mVector[39m[[32mString[39m]] = [33mVector[39m(
  [33mVector[39m(
    [32m"The Painter of Acropolis"[39m,
    [32m"1"[39m,
    [32m"Athens Act. 24"[39m,
    [32m"Plates"[39m,
    [32m"Athens"[39m,
    [32m"Woman seated"[39m,
    [32m"NA"[39m
  ),
  [33mVector[39m(
    [32m"The Painter of Acropolis"[39m,
    [32m"2"[39m,
    [32m"Oxford, Beazley"[39m,
    [32m"Plates"[39m,
    [32m"unknown"[39m,
    [32m"Woman, holding wreath, seated at wool-basket"[39m,
    [32m"NA"[39m
  ),
  [33mVector[39m(
    [32m"The Painter of Acropolis"[39m,
    [32m"3"[39m,
    [32m"Athens Act. 562"[39m,
    [32m"Pyxides"[39m,
    [32m"Athens"[39m,
    [32m"Women"[39m,
    [32m"NA"[39m
  ),
  [33mVector[39m(
    [32m"The Painter of Acropolis"[39m,
    [32m"4"[39m,
    [32m"Athens"[39m,
    [32m"Pyxides"[39m,
    [32m"Attica"[39m,
    [32m"Arms and thighs of a woman seated to the right, holding out a wreath or n
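
One detail of `split` is worth knowing before relying on column positions: by default it drops empty strings at the end of the result, so a row whose last field is empty comes back one column short. The sketch below illustrates this with two made-up rows (they are not taken from the data set) and shows the two-argument form `split("#", -1)`, which keeps trailing empty fields.

```scala
// Two hypothetical rows: the second ends with an empty final field.
val rows = Vector("Phintias#1#NA", "Phintias#2#")

// Default split: trailing empty strings are discarded.
val sizesDefault: Vector[Int] = rows.map(row => row.split("#").toVector.size)

// A negative limit keeps trailing empty strings.
val sizesKeepEmpty: Vector[Int] = rows.map(row => row.split("#", -1).toVector.size)

// Quick sanity check: how many rows have each column count?
val rowsPerColumnCount: Map[Int, Int] =
  sizesKeepEmpty.groupBy(identity).map { case (size, group) => (size, group.size) }
```

Applied to the notebook's `data`, a check like `data.map(_.size).distinct` would show whether every record really has the same number of columns as the header.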

Mapping each Vector to the first item in the Vector is equivalent to extracting the first column of the table.  The header line told us that the first column contains painter names.

In [6]:
val painters = data.map(columns => columns(0))

[36mpainters[39m: [32mVector[39m[[32mString[39m] = [33mVector[39m(
  [32m"The Painter of Acropolis"[39m,
  [32m"The Painter of Acropolis"[39m,
  [32m"The Painter of Acropolis"[39m,
  [32m"The Painter of Acropolis"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"Compare"[39m,
  [32m"Phintias, Potter"[39m
)

In [7]:
painters.distinct.sorted

[36mres6[39m: [32mVector[39m[[32mString[39m] = [33mVector[39m(
  [32m"Compare"[39m,
  [32m"Phintias"[39m,
  [32m"Phintias, Potter"[39m,
  [32m"The Painter of Acropolis"[39m
)
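
`distinct` tells us which values occur, but not how often. A minimal sketch of counting occurrences per value follows; it uses a short hand-written sample rather than the full `painters` Vector, and the names `samplePainters`, `counts`, and `ranked` are introduced here for illustration.

```scala
// Assumption: a small hand-written sample standing in for `painters`.
val samplePainters = Vector("The Painter of Acropolis", "Phintias", "Phintias", "Compare")

// Group identical names together, then replace each group by its size.
val counts: Map[String, Int] =
  samplePainters.groupBy(identity).map { case (name, occurrences) => (name, occurrences.size) }

// Sort by descending count, breaking ties alphabetically.
val ranked: Vector[(String, Int)] =
  counts.toVector.sortBy { case (name, n) => (-n, name) }
```

A count of this kind also makes odd values easy to spot: a label that appears only once may be a stray annotation rather than a painter's name.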

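In [0]:
val shapes = data.map(columns => columns(3)) // `shape` is the fourth column, so its index is 3

(As recorded, this cell was run before `data` had been defined in the session, so it failed to compile with `not found: value data`.  Run the cells above first.)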
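
Counting column positions by hand is easy to get wrong: in the header shown earlier, `shape` is the fourth field, so its index is 3. One way to avoid hard-coded numbers is to build a lookup from column name to index out of the header line. The sketch below is one possible approach; `columnIndex`, `column`, and `sampleData` are names introduced here, and the single sample record is copied from the first data line.

```scala
// The header line, copied as a literal.
val headerLine = "painter#beazley#museumId#shape#findspot#subject#pleiades"

// Map each column name to its position: "painter" -> 0, "beazley" -> 1, ...
val columnIndex: Map[String, Int] = headerLine.split("#").zipWithIndex.toMap

// Hypothetical helper: extract one named column from a table of rows.
def column(rows: Vector[Vector[String]], name: String): Vector[String] =
  rows.map(cols => cols(columnIndex(name)))

// One record from the data set, standing in for the full `data` Vector.
val sampleData = Vector(
  Vector("The Painter of Acropolis", "1", "Athens Act. 24", "Plates", "Athens", "Woman seated", "NA")
)
val sampleShapes: Vector[String] = column(sampleData, "shape")
```

With this in place, `column(data, "shape")` would read the same whether or not the column order of the file changes. A misspelled column name fails with a `NoSuchElementException`.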

We want to be sure that all ID values are unique.  We can verify that by comparing the number of items in the `ids` Vector with the number of *distinct values* in the `ids` Vector.  If they're the same, then every value is unique.

In [11]:
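// Assumption: `ids` is not defined in the cells shown above; here it is taken
// to be the `beazley` column (index 1).
val ids = data.map(columns => columns(1))
//println("Records: " + ids.size)
//println("Distinct IDs: " + ids.distinct.size)
if (ids.size == ids.distinct.size) {
    println("All records uniquely identified.")
} else {
    println("Duplicate identifiers in data set.")
}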

Duplicate identifiers in data set.
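
Knowing that duplicates exist is only the first step; the next question is which values are repeated. The sketch below lists them, under the assumption that the identifiers come from the `beazley` column. In the lines shown earlier, that numbering restarts for each painter, so it also tries a compound key made of painter plus number. The helper `duplicates` and the four shortened sample rows are introduced here for illustration.

```scala
// Hypothetical helper: return the values that occur more than once, sorted.
def duplicates(values: Vector[String]): Vector[String] =
  values.groupBy(identity)
    .collect { case (value, occurrences) if occurrences.size > 1 => value }
    .toVector
    .sorted

// Painter and beazley columns of four records from the data set.
val sampleRows = Vector(
  Vector("The Painter of Acropolis", "1"),
  Vector("The Painter of Acropolis", "2"),
  Vector("Phintias", "1"),
  Vector("Phintias", "2")
)

// Assumption: the number alone is used as the identifier.
val repeatedNumbers: Vector[String] = duplicates(sampleRows.map(cols => cols(1)))

// Alternative: combine painter and number into a single key.
val repeatedCompoundKeys: Vector[String] =
  duplicates(sampleRows.map(cols => cols(0) + "#" + cols(1)))
```

In this sample the numbers `1` and `2` each appear twice, while the compound keys are all distinct. Whether a compound key is unique across the full data set would need to be checked against `data` itself.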
