# Intro to D4M

Load the D4M module and the `axis` function from PyPlot.

In [1]:
using D4M,PyPlot.axis

Loaded /usr/java/jdk1.7.0_72/jre/lib/amd64/server/libjvm.so


## Create, Display, Save an Associative Array

Create lists of row, column, and value substrings. Note: the last character in each string is the divider. It can be any character. Common choices are ",", " ", tab, and newline.

In [2]:
row = "a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,a,aa,aaa,b,bb,bbb,"
column = "a,aa,aaa,b,bb,bbb,a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,";
values = "a-a,a-aa,a-aaa,a-b,a-bb,a-bbb,a-a,aa-a,aaa-a,b-a,bb-a,bbb-a,a-a,aa-aa,aaa-aaa,b-b,bb-bb,bbb-bbb,";
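
The trailing-divider convention can be illustrated without D4M. The following is a minimal sketch in plain Julia 1.x; `splitlist` is a hypothetical helper written for this illustration and is not part of D4M.

```julia
# Hypothetical helper (not part of D4M): split a list string on its final character.
function splitlist(s::AbstractString)
    sep = last(s)                # the divider is whatever character ends the string
    return split(chop(s), sep)   # drop the trailing divider, then split on it
end

splitlist("a,aa,aaa,")   # three keys: a, aa, aaa
splitlist("a aa aaa ")   # the same keys, with a space as the divider
```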

Create an associative array, A, from row, column, and values.

In [3]:
A = Assoc(row,column,values)

Assoc(Union{AbstractString, Number}["a", "aa", "aaa", "b", "bb", "bbb"], Union{AbstractString, Number}["a", "aa", "aaa", "b", "bb", "bbb"], Union{AbstractString, Number}["a-a", "a-aa", "a-aaa", "a-b", "a-bb", "a-bbb", "aa-a", "aa-aa", "aaa-a", "aaa-aaa", "b-a", "b-b", "bb-a", "bb-bb", "bbb-a", "bbb-bbb"], 
  [1, 1]  =  1
  [2, 1]  =  7
  [3, 1]  =  9
  [4, 1]  =  11
  [5, 1]  =  13
  [6, 1]  =  15
  [1, 2]  =  2
  [2, 2]  =  8
  [1, 3]  =  3
  [3, 3]  =  10
  [1, 4]  =  4
  [4, 4]  =  12
  [1, 5]  =  5
  [5, 5]  =  14
  [1, 6]  =  6
  [6, 6]  =  16)
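
The printed structure has four parts: sorted unique row keys, sorted unique column keys, sorted unique values, and a sparse matrix whose entries are indices into the value list. A rough sketch of that layout in plain Julia 1.x (standard library only) might look like the following. `toy_assoc` is a hypothetical name, and resolving duplicate entries with `min` is an assumption made for this sketch, not a statement about D4M's internals.

```julia
using SparseArrays

# Hypothetical stand-in for the idea behind Assoc; not D4M's implementation.
function toy_assoc(r, c, v)
    rowkeys, colkeys, valkeys = sort(unique(r)), sort(unique(c)), sort(unique(v))
    i = [searchsortedfirst(rowkeys, x) for x in r]   # row key    -> row index
    j = [searchsortedfirst(colkeys, x) for x in c]   # column key -> column index
    k = [searchsortedfirst(valkeys, x) for x in v]   # value      -> index into valkeys
    # Assumption: duplicate (row, column) pairs keep the smallest value index.
    M = sparse(i, j, k, length(rowkeys), length(colkeys), min)
    return rowkeys, colkeys, valkeys, M
end

rk, ck, vk, M = toy_assoc(["a", "a", "aa", "a"],
                          ["a", "aa", "a", "a"],
                          ["a-a", "a-aa", "aa-a", "a-a"])
vk[M[2, 1]]   # value stored at row "aa", column "a"
```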

Display the associative array in tabular form.

In [4]:
printFull(A)

7×7 Array{Union{AbstractString, Number},2}:
 ""     "a"      "aa"     "aaa"      "b"    "bb"     "bbb"    
 "a"    "a-a"    "a-aa"   "a-aaa"    "a-b"  "a-bb"   "a-bbb"  
 "aa"   "aa-a"   "aa-aa"  ""         ""     ""       ""       
 "aaa"  "aaa-a"  ""       "aaa-aaa"  ""     ""       ""       
 "b"    "b-a"    ""       ""         "b-b"  ""       ""       
 "bb"   "bb-a"   ""       ""         ""     "bb-bb"  ""       
 "bbb"  "bbb-a"  ""       ""         ""     ""       "bbb-bbb"

Save the associative array to a CSV file.

In [5]:
WriteCSV(A,"data/A.csv");

## Read and Select Sub Associative Arrays

Read CSV file into an associative array.

In [None]:
A = ReadCSV("data/A.csv");

Select a subset of rows.

In [None]:
printFull(  A["a,b,",:]  );

Convert values to 0 and 1.

In [None]:
printFull(  logical(A["a,b,",:])  );

Select a subset of columns.

In [None]:
printFull(  A[:,"a,b,"]  );

Convert values to 0 and 1.

In [None]:
printFull(  logical(A[:,"a,b,"])  );
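
Selecting by key and converting to 0/1 can be pictured on an ordinary sparse matrix. This is a sketch in plain Julia 1.x under the assumption that keys map to matrix positions through a sorted key list; the data and variable names are made up for illustration.

```julia
using SparseArrays

rowkeys = ["a", "aa", "b"]                             # made-up row keys
M = sparse([1, 2, 3, 1], [1, 1, 1, 2], [5, 7, 9, 6])   # 3×2 matrix of made-up entries

wanted  = ["a", "b"]
idx     = findall(in(wanted), rowkeys)   # positions of the requested row keys
sub     = M[idx, :]                      # keep only those rows
pattern = sub .!= 0                      # true (1) wherever an entry exists
```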

# Analyze Entities in News Articles

Load entities from 10,000 news articles and print the first few rows.

In [None]:
A = ReadCSV("data/entity.csv");

printFull(  A[1:5,:]  );

Show the dimensions and number of entries of A, then compute its density.

In [None]:
print( [size(A),nnz(A)] );

nnz(A)/(size(A)[1]*size(A)[2])
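
The last expression is the density: stored entries divided by the total number of cells. A small sketch on a standard-library sparse matrix (Julia 1.x, made-up data) shows the same arithmetic.

```julia
using SparseArrays

S = sparse([1, 2, 3], [1, 2, 2], [1.0, 1.0, 1.0], 4, 5)   # 4×5 matrix with 3 entries
density = nnz(S) / prod(size(S))                           # 3 stored entries over 20 cells
```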

## Construct and Display a Sparse Associative Array of the Data

Grab the doc, entity, position, and type columns, then combine type and entity with a '|' separator.

In [None]:
# The first two outputs are unused; naming them r and c avoids
# shadowing the row() and col() functions called later on.
r, c, doc      = find(A[:,"doc,"]);          # Get doc column.
r, c, entity   = find(A[:,"entity,"]);       # Get entity column.
r, c, position = find(A[:,"position,"]);     # Get position column.
r, c, rowType  = find(A[:,"type,"]);         # Get type column.
typeEntity = CatStr(rowType,"|",entity);     # Interleave type and entity strings.
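
The combined keys have the form `TYPE|entity`, as the later `StartsWith("PERSON|,")` selections rely on. As a sketch of the same pairing on ordinary string vectors (plain Julia 1.x, made-up data, not the `CatStr` implementation):

```julia
types    = ["LOCATION", "PERSON", "LOCATION"]   # made-up type values
entities = ["boston", "jane doe", "chicago"]    # made-up entity values

# Element-wise concatenation with "|" between each pair.
type_entity = string.(types, "|", entities)
```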

Create a sparse associative array of all the data and show a few rows.

In [None]:
E = Assoc(doc,typeEntity,position);

print(E[1:2,:])

printFull(E[1:2,:])

Display the dimensions of the data, the number of non-zero entries, and the density of E.

In [None]:
print( [size(E), nnz(E)]  );

nnz(E)/(size(E)[1]*size(E)[2])

Plot the transpose of the first 1,000 rows of the sparse data.

In [None]:
spy(transpose(E[1:1000,:]));
axis("auto")

Create an adjacency matrix by multiplying E<sup>T</sup> * E.

In [None]:
E = logical(E)
spy(E'*E);
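
To see why this product is an adjacency matrix, consider a tiny 0/1 incidence matrix. The sketch below uses plain Julia 1.x sparse matrices with made-up data; entry `[i, j]` of the product counts the documents that contain both entity `i` and entity `j`.

```julia
using SparseArrays

# Rows are documents, columns are entities; 1 means the entity appears in the document.
# Document 1: entities 1 and 2. Document 2: entities 1 and 2. Document 3: entity 3.
E = sparse([1, 1, 2, 2, 3], [1, 2, 1, 2, 3], ones(Int, 5), 3, 3)

A = E' * E   # A[i, j] = number of documents containing both entity i and entity j
```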

## Analyze Relationships

Define relationships to examine.

In [None]:
l = "LOCATION|boston,";
P = StartsWith("PERSON|,");
L = StartsWith("LOCATION|,");
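
Conceptually, a prefix selector picks every column key that begins with the given string. A sketch of that idea on a plain vector of keys (Julia 1.x, made-up keys, not the `StartsWith` implementation):

```julia
colkeys = ["LOCATION|boston", "LOCATION|chicago", "PERSON|jane doe", "TIME|monday"]

person_cols   = findall(k -> startswith(k, "PERSON|"), colkeys)     # positions of PERSON keys
location_cols = findall(k -> startswith(k, "LOCATION|"), colkeys)   # positions of LOCATION keys
```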

Show all people mentioned in more than one news article that also mentions Boston.

In [None]:
people = col(sum(E[row(E[:,l]),P],1)>1)

Show the locations that co-occur most often (more than 15 times in total) with the people found in Boston articles.

In [None]:
print(sum(  E[:,people].' * E[:,L]  ,1) > 15)

Do it all in 1 line of code.

In [None]:
print(sum(  E[:,col(sum(E[row(E[:,l]),P],1)>1)].' * E[:,L]  ,1) > 15)
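
The one-liner chains three steps: find the documents that mention the location, keep the people who appear in more than one of them, then count which locations those people co-occur with. The sketch below walks through the same steps on a made-up 4×4 incidence matrix using plain Julia 1.x sparse matrices. The keys, the data, and the final threshold of 1 (instead of 15) are assumptions chosen to keep the example small.

```julia
using SparseArrays

colkeys = ["LOCATION|boston", "LOCATION|paris", "PERSON|ann", "PERSON|bob"]
# doc 1: boston, ann    doc 2: boston, ann, bob    doc 3: paris, ann    doc 4: paris, bob
E = sparse([1, 1, 2, 2, 2, 3, 3, 4, 4],
           [1, 3, 1, 3, 4, 2, 3, 2, 4], ones(Int, 9), 4, 4)

loc = findall(k -> startswith(k, "LOCATION|"), colkeys)
per = findall(k -> startswith(k, "PERSON|"), colkeys)
b   = findfirst(==("LOCATION|boston"), colkeys)

docs   = findall(!iszero, E[:, b])            # documents that mention boston
counts = vec(sum(E[docs, per], dims=1))       # mentions of each person in those documents
people = per[counts .> 1]                     # people seen in more than one such document
cooc   = vec(sum(E[:, people]' * E[:, loc], dims=1))   # co-occurrence count per location
top    = colkeys[loc[cooc .> 1]]              # locations above the (toy) threshold
```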

Scale to multiple cities at once.

In [None]:
l = "LOCATION|boston,LOCATION|chicago,LOCATION|detroit,";
print(sum(  E[:,col(sum(E[row(E[:,l]),P],1)>1)].' * E[:,L]  ,1) > 15)

Let's make a Location-Location graph:

In [None]:
Locs = E[:,L]'*E[:,L]
Locs = Locs - diag(Locs)

spy(Locs);
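
Subtracting the diagonal removes each location's co-occurrence with itself, leaving only pairs of different locations. A sketch of the same step on a plain sparse matrix (Julia 1.x, made-up counts); how D4M's `diag` behaves on an associative array is not shown here.

```julia
using SparseArrays, LinearAlgebra

G = sparse([2 1 0; 1 3 4; 0 4 5])        # made-up symmetric co-occurrence counts

# Build a diagonal-only matrix from G and subtract it to zero the self-pairs.
G0 = G - spdiagm(0 => Vector(diag(G)))
```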

Which location pairs occur together the most?

In [None]:
print(Locs > 200)

# Analyze DNA Data

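Define a helper that reads a CSV file of DNA sequences and splits each sequence into words of a fixed size, returning an associative array with one row per sequence and one column per word.

In [None]: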
function SplitSequenceCSV(CSVfile::String,DNAwordsize::Integer)

    A = ReadCSV(CSVfile)    # Read in file.
    r, c, v = find(A);      # Get row keys, column keys, and sequence values.
    v = map(lowercase,v)    # Convert sequences to lower case.

    # Create the new column keys
    col=matchall.(Regex("(.{" * string(DNAwordsize) * "})") ,v)
    sizes = length.(col) # Save the lengths to create the row strings
    oneString=join(join.(col,"\n"),"\n")
    col = split(oneString,"\n")
    
    # Create the new row keys
    oneString = join(map(^,r.*"\n",sizes),"")
    newR = split(oneString[1:end-1],"\n")
    
    # Create the Associative Array
    A = Assoc(newR,col,1)
    
    return A
   
end
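
The core of the function is cutting each sequence into consecutive, non-overlapping words. The following sketch isolates that step in plain Julia 1.x. `kmers` is a hypothetical helper, and it assumes ASCII sequences; like the regular expression in the function above, it drops a trailing remainder shorter than the word size.

```julia
# Hypothetical helper: consecutive, non-overlapping words of length k.
# Assumes an ASCII sequence, so character positions equal byte indices.
kmers(seq::AbstractString, k::Integer) =
    [seq[i:i+k-1] for i in 1:k:(length(seq) - k + 1)]

words = kmers(lowercase("ACGTACGTAC"), 4)   # the final "ac" is too short and is dropped
```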

Read bacteria reference DNA and palm sample DNA data into associative arrays.

In [None]:
DNAwordsize = 10;
Eref = SplitSequenceCSV("data/bacteria.csv",DNAwordsize);
Esamp = SplitSequenceCSV("data/palm.csv",DNAwordsize);

Perform BLAST DNA sequence analysis in 1 line of code to find the best bacteria matches.

In [None]:
bestMatches = sum( Eref * Esamp.' ,2) > 20;

print(bestMatches);
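
The product counts, for every reference/sample pair, how many words they share; summing along each row gives a total score per reference. A sketch with plain Julia 1.x sparse matrices follows. The data is made up, both matrices are assumed to use the same word ordering for their columns, and the threshold is lowered from 20 to 2 to fit the toy data.

```julia
using SparseArrays

# Columns are words (in the same order for both matrices); rows are sequences.
# Reference 1 has words 1, 2, 3. Reference 2 has words 3, 4.
Eref  = sparse([1, 1, 1, 2, 2], [1, 2, 3, 3, 4], ones(Int, 5), 2, 4)
# Sample 1 has words 1, 2. Sample 2 has word 2.
Esamp = sparse([1, 1, 2], [1, 2, 2], ones(Int, 3), 2, 4)

shared = Eref * Esamp'               # shared[i, j] = words common to reference i and sample j
score  = vec(sum(shared, dims=2))    # total shared words per reference
best   = findall(score .> 2)         # references above the (toy) threshold
```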

# Analyze Network Data

Read in 80,000 simulated network traffic logs from 1 day and print the first few rows.

In [None]:
A = ReadCSV("data/network.csv");

print(  A[1:5,:]  );

Make the data sparse by turning each column/value pair into its own `column|value` column, then show the dimensions and number of entries.

In [None]:
E = val2col(A,"|");

display( [size(E) nnz(E)] )

print(E[1:5,:])
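
The effect of this transformation can be sketched on a small dense table: every distinct `field|value` pair becomes a column, and each record gets a 1 in the columns that describe it. This is an illustration in plain Julia 1.x with made-up data, not the `val2col` implementation.

```julia
using SparseArrays

fields = ["src", "dest"]                  # made-up field names
table  = ["1.1.1.1" "2.2.2.2";            # made-up records, one per row
          "1.1.1.1" "3.3.3.3"]

# Build a "field|value" key for every cell of the table.
keys_ = [string(fields[j], "|", table[i, j]) for i in axes(table, 1), j in axes(table, 2)]
rows_ = [i for i in axes(table, 1), j in axes(table, 2)]

colkeys = sort(unique(vec(keys_)))                          # one column per distinct key
cols_   = [searchsortedfirst(colkeys, k) for k in vec(keys_)]
E = sparse(vec(rows_), cols_, ones(Int, length(cols_)), size(table, 1), length(colkeys))
```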

In [None]:
size(E[:,StartsWith("src|,")])

Select fields and time windows to explore.

In [None]:
S = StartsWith("src|,");         T1 = StartsWith("time|01:,");
D = StartsWith("dest|,");        T2 = StartsWith("time|05:,");

E1 = E[row(E[:,T1]),:];          # Data from time window 1.
E2 = E[row(E[:,T2]),:];          # Data from time window 2.

Create an adjacency array of network traffic in each time window.

In [None]:
A1 = E1[:,S]' * E1[:,D];
A2 = E2[:,S]' * E2[:,D];

Find source/destination pairs that are common to both time windows.

In [None]:
print(A1 .* A2)
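
An element-wise product is non-zero only where both operands are non-zero, which is why it isolates the pairs present in both windows. A sketch with plain Julia 1.x sparse matrices and made-up counts follows; it assumes both matrices already share the same row and column ordering.

```julia
using SparseArrays

# Rows are sources, columns are destinations; entries are made-up traffic counts.
A1 = sparse([1, 1, 2], [1, 2, 2], [3, 1, 2], 2, 2)   # window 1
A2 = sparse([1, 2], [2, 2], [4, 5], 2, 2)            # window 2

common = A1 .* A2                       # non-zero only where both windows have traffic
pairs_in_both = findall(!iszero, common)
```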