# Inserting WDB data into Cassandra

![png](http://cdn.albertcoronado.com/wp-content/uploads/2014/08/cassandra_logo.png)

## Creating the keyspace

In [782]:
%load_ext cql

The cql extension is already loaded. To reload it, use:
  %reload_ext cql


In [783]:
%%cql
DROP KEYSPACE "SWP";

'No results.'

In [784]:
%%cql
CREATE KEYSPACE "SWP"
WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};

'No results.'

## Using the keyspace

`USE` changes the default keyspace.


In [785]:
%cql USE "SWP";

'No results.'

## Creating tables

* Keyspaces contain tables
* Tables contain data

In [786]:
%%cql
CREATE TABLE "Accused"
(
    "AccusedRef" text,
    "Age" int,
    PRIMARY KEY ("AccusedRef")
);

'No results.'

In [787]:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('SWP')

In [788]:
print "Accused", session.execute("SELECT count(1) FROM \"Accused\" WHERE \"Age\" = 0 ALLOW FILTERING")[0].count

Accused 0




In [789]:
print "Accused", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 50 ALLOW FILTERING")[0].count

Accused 0




## Inserting the data

In [790]:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('SWP')

In [792]:
# Select the records whose age field is populated

!cat /home/learner/notebooks/data/WDB_Accused.txt      \
    | perl -pe 's/^(\r\n)//g'                          \
    | perl -pe 's/([^:]..)\r\n/\1\\r\\n/g'             \
    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta' \
    | cut -d "|" -f 1,12 | sed 's/"//g'                \
    | sed -e '/|$/d'                                   \
    > /home/learner/notebooks/data/WDB_Accused_loadCassandra.txt

#    | perl -pe 's/^(\r\n)//g' \                          # Delete lines that contain only a line break
#    | perl -pe 's/([^:]..)\r\n/\1\\r\\n/g' \             # Escape line breaks whose third character to the left is not ":"
#    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta' \ # Replace the delimiter , with | (except inside "")
#    | cut -d "|" -f 1,12 | sed 's/"//g' \                # Select fields 1 (key) and 12 (Age)
#    | sed -e '/|$/d' \                                   # Drop lines where Age is empty

!head /home/learner/notebooks/data/WDB_Accused_loadCassandra.txt
!wc -l /home/learner/notebooks/data/WDB_Accused_loadCassandra.txt

A/EGD/1005|50
A/EGD/1014|42
A/EGD/1018|43
A/EGD/1022|25
A/EGD/1023|50
A/EGD/1047|50
A/EGD/1082|75
A/EGD/1132|50
A/EGD/1143|50
A/EGD/1153|50
166 /home/learner/notebooks/data/WDB_Accused_loadCassandra.txt
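As a cross-check on the shell pipeline, the same selection can be sketched with Python's `csv` module, which already understands quoted fields that contain commas or line breaks. This is only a sketch: the function name `select_key_and_field` and the three-column sample are hypothetical, the real file is assumed to be comma-separated with double-quoted fields, and the code is written for Python 3 although the notebook itself runs on Python 2.

```python
import csv
import io

def select_key_and_field(csv_text, field_index):
    """Return 'key|value' lines for rows whose chosen field is non-empty.

    field_index is 1-based, like `cut -f`, so 12 would select the Age column.
    """
    selected = []
    # csv.reader handles quoted commas and quoted line breaks, the cases
    # the perl and sed stages deal with in the shell version.
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        if len(row) < field_index:
            continue  # blank or short line
        key, value = row[0], row[field_index - 1]
        if value == "":
            continue  # same effect as sed -e '/|$/d'
        selected.append(key + "|" + value)
    return selected

# Hypothetical three-column sample; the real file has more columns.
sample = (
    'A/EGD/1,"note, with comma",50\r\n'
    '\r\n'
    'A/EGD/2,"plain",\r\n'
    'A/EGD/3,"multi\r\nline",42\r\n'
)
lines = select_key_and_field(sample, 3)
print(lines)  # ['A/EGD/1|50', 'A/EGD/3|42']
```

Parsing with a CSV reader avoids hand-written regular expressions for quoting, at the cost of reading the file from Python instead of streaming it through the shell.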


In [793]:
def insert_Accused(WDB_Accused_csvLine):

    # Each input line looks like "A/EGD/1005|50": key, then age
    fields = WDB_Accused_csvLine.strip().split("|")
    AccusedRef = fields[0]
    Age = int(fields[1])

    # The driver binds the %s placeholders to the values in the list
    session.execute(
        'INSERT INTO "Accused" ("AccusedRef", "Age") VALUES (%s, %s)',
        [AccusedRef, Age])
    

In [794]:
WDB_Accused_data_path = '/home/learner/notebooks/data/WDB_Accused_loadCassandra.txt'

# Insert one row per line; the with block closes the file when done
with open(WDB_Accused_data_path, "r") as WDB_Accused_file:
    for line in WDB_Accused_file:
        insert_Accused(line)

## Queries

In [795]:
print "Accused", session.execute("SELECT * from \"Accused\"")[0]
print "Accused", session.execute("SELECT * from \"Accused\"")[1]

Accused Row(AccusedRef=u'A/EGD/683', Age=53)
Accused Row(AccusedRef=u'A/EGD/1721', Age=25)


## Query by Age

In [796]:
%%cql
CREATE INDEX Accused_Age
   ON "Accused" ("Age");

'No results.'

In [801]:
%%cql
SELECT "Age", count(1) from "Accused" WHERE "Age" = 50 ALLOW FILTERING;



Age,count
50,36


#### Note: the previous statement may fail if it is run immediately after the one before it, because the data has not been consolidated yet.
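One way to cope with a statement that fails when it runs too soon is to retry it after a short pause. The helper below is a minimal sketch, not part of the notebook: `retry` and `flaky_count` are hypothetical names, and `flaky_count` only imitates a query that fails twice before answering, so the example runs without a cluster. It is written for Python 3.

```python
import time

def retry(action, attempts=5, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # narrow this to the driver's errors in real use
            last_error = error
            if attempt < attempts - 1:
                sleep(delay)  # wait before the next try
    raise last_error

# Stand-in for the query: fails twice, then returns a count.
calls = {"n": 0}

def flaky_count():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("index not ready yet")
    return 36

result = retry(flaky_count, delay=0.0)
print(result)  # 36
```

In the notebook, `action` could be a small function that wraps the `session.execute(...)` call; the delay and attempt count are guesses that would need tuning.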


#### To get a distribution of the accused by age, we have to go through every age and group the counts in a structure. Something like this:

In [802]:
print "50 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 50 ALLOW FILTERING")[0].count, "Accuseds."
print "45 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 45 ALLOW FILTERING")[0].count, "Accuseds."
print "25 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 25 ALLOW FILTERING")[0].count, "Accuseds."
print "55 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 55 ALLOW FILTERING")[0].count, "Accuseds."
print "60 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 60 ALLOW FILTERING")[0].count, "Accuseds."
print "39 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 39 ALLOW FILTERING")[0].count, "Accuseds."
print "30 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 30 ALLOW FILTERING")[0].count, "Accuseds."
print "36 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 36 ALLOW FILTERING")[0].count, "Accuseds."
print "43 years old: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Age\" = 43 ALLOW FILTERING")[0].count, "Accuseds."

50 years old:  36 Accuseds.
45 years old:  12 Accuseds.
25 years old:  7 Accuseds.
55 years old:  7 Accuseds.
60 years old:  6 Accuseds.
39 years old:  6 Accuseds.
30 years old:  5 Accuseds.
36 years old:  5 Accuseds.
43 years old:  5 Accuseds.
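The statements above differ only in the age, so they can be generated from a list. The sketch below assumes a callable with the same shape as `session.execute(query, parameters)`; `count_accused_by_age`, `fake_execute` and `fake_ages` are hypothetical names, and the fake executor exists only so the example runs without a cluster. It is written for Python 3.

```python
from collections import namedtuple

QUERY = 'SELECT count(*) FROM "Accused" WHERE "Age" = %s ALLOW FILTERING'

def count_accused_by_age(execute, ages):
    """Run the count query once per age and return {age: count}."""
    counts = {}
    for age in ages:
        rows = execute(QUERY, [age])
        counts[age] = rows[0].count  # first row holds the count column
    return counts

# Stand-in for session.execute: counts matches in a small in-memory list.
Row = namedtuple("Row", ["count"])
fake_ages = [50, 50, 45, 25]

def fake_execute(query, params):
    return [Row(count=fake_ages.count(params[0]))]

counts = count_accused_by_age(fake_execute, [50, 45, 30])
print(counts)
```

This still issues one query per age, so it only removes the repetition in the code, not the cost on the server.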




#### Since Cassandra gives us no grouping functions, we will walk the range of ages and store the results so they can be processed in code.

#### But first we will define materialized views to avoid performance problems, which the index does not solve because these are low-cardinality fields.

# Materialized views
#### We can avoid the warnings by defining materialized views whose layout gives efficient access to the data we want to query.

#### Drop the materialized view in case it exists from earlier runs.

%%cql
DROP MATERIALIZED VIEW IF EXISTS "Accused_Age";


In [805]:
%%cql
CREATE MATERIALIZED VIEW "Accused_Age" AS
       SELECT "AccusedRef", "Age" FROM "Accused"
       WHERE "Age" IS NOT NULL
       PRIMARY KEY ("Age","AccusedRef")
    ;

'No results.'
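To see why this layout helps, the view can be pictured as a second copy of the data keyed by `Age`, with the `AccusedRef` values kept in order inside each age. The toy model below is only an illustration in plain Python (written for Python 3); `build_view` is a hypothetical name and nothing here talks to Cassandra.

```python
from collections import defaultdict

def build_view(base_rows):
    """Group (AccusedRef, Age) rows by Age, skipping rows without an age."""
    view = defaultdict(list)
    for accused_ref, age in base_rows:
        if age is not None:      # mirrors WHERE "Age" IS NOT NULL
            view[age].append(accused_ref)
    for refs in view.values():
        refs.sort()              # rows inside one age are kept in key order
    return view

# A few rows taken from the outputs shown in this notebook.
base = [("A/EGD/1005", 50), ("A/EGD/10", None),
        ("A/EGD/1014", 42), ("A/EGD/1023", 50)]
view = build_view(base)
print(view[50])  # ['A/EGD/1005', 'A/EGD/1023']
```

Looking up one age is a single dictionary access in this model, which is the kind of access the view's primary key `("Age","AccusedRef")` is meant to give.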

#### We store the per-age counts in a numpy array

In [806]:
import numpy as np

acussedsByAge = np.zeros((101,2), dtype=[('age',int),('count',int)] )

# Ages 0..100 inclusive: one row per age, indexed by the age itself
for age in range(101):
    acussedsByAge[age]['age'] = age
    acussedsByAge[age]['count'] = session.execute("SELECT count(*) from \"Accused_Age\" WHERE \"Age\" = %s ALLOW FILTERING", [age])[0].count


#### Now we sort it and show the most frequent ages

In [807]:
acussedsByAgeReverseOrder = np.sort ( acussedsByAge , axis = 0 , order = ['count'] )[::-1]
for item in acussedsByAgeReverseOrder:
    if item["count"][0] < 5:
        break
    print item["age"][0], "years old: ", item["count"][0] , "Accuseds."

50 years old:  31 Accuseds.
45 years old:  12 Accuseds.
60 years old:  6 Accuseds.
55 years old:  6 Accuseds.
43 years old:  5 Accuseds.
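If the ages are already available on the client, the same ranking can be sketched with `collections.Counter` instead of a numpy array. This is an alternative under that assumption, not what the notebook does; `most_frequent_ages` is a hypothetical name, the sample list is made up for illustration, and the code is written for Python 3.

```python
from collections import Counter

def most_frequent_ages(ages, minimum=5):
    """Return (age, count) pairs with at least `minimum` hits, most frequent first."""
    counts = Counter(ages)
    return [(age, n) for age, n in counts.most_common() if n >= minimum]

# Made-up sample, not the notebook's data.
ages = [50] * 6 + [45] * 5 + [25] * 2
top = most_frequent_ages(ages)
for age, n in top:
    print("%d years old: %d Accuseds." % (age, n))
```

`Counter.most_common()` returns the pairs sorted by count in descending order, so no explicit sort or reversal is needed.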


#### Let's add the gender column

In [808]:
# Select the records whose sex field is populated

!cat /home/learner/notebooks/data/WDB_Accused.txt     \
    | perl -pe 's/^(\r\n)//g'                          \
    | perl -pe 's/([^:]..)\r\n/\1\\r\\n/g'             \
    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta' \
    | cut -d "|" -f 1,11 | sed 's/"//g'             \
    | sed -e '/|$/d'                                   \
    | sed 's/$/|/' \
    > /home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt

#    | perl -pe 's/^(\r\n)//g' \                          # Delete lines that contain only a line break
#    | perl -pe 's/([^:]..)\r\n/\1\\r\\n/g' \             # Escape line breaks whose third character to the left is not ":"
#    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta' \ # Replace the delimiter , with | (except inside "")
#    | cut -d "|" -f 1,11 | sed 's/"//g' \                # Select fields 1 (key) and 11 (Sex)
#    | sed -e '/|$/d' \                                   # Drop lines where Sex is empty
#    | sed 's/$/|/' \                                     # Append a trailing | to every line

!head /home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt
!wc -l /home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt

A/EGD/10|Female|
A/EGD/100|Male|
A/EGD/1000|Female|
A/EGD/1001|Female|
A/EGD/1002|Female|
A/EGD/1003|Female|
A/EGD/1004|Female|
A/EGD/1005|Female|
A/EGD/1006|Female|
A/EGD/1007|Female|
3170 /home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt


#### We add the **Sex** column

In [809]:
%%cql
ALTER TABLE "SWP"."Accused" ADD "Sex" ascii;

'No results.'

#### This function lets us set the column for a given *AccusedRef* key

In [810]:
def load_Accused_sex(AccusedRef,Sex):

    # Only "Sex" is written; other columns of an existing row are left untouched
    session.execute(
        'INSERT INTO "Accused" ("AccusedRef", "Sex") VALUES (%s, %s)',
        [AccusedRef, Sex])
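The function above relies on the fact that a CQL `INSERT` behaves as an upsert: writing to a primary key that already exists sets only the named columns and leaves the rest of the row as it was. The dictionary model below is a rough illustration of that behaviour in plain Python (written for Python 3); `upsert` is a hypothetical helper, not a driver call.

```python
def upsert(table, key, **columns):
    """Create the row if it is missing, then overwrite only the given columns."""
    row = table.setdefault(key, {})
    row.update(columns)
    return row

table = {}
upsert(table, "A/EGD/1005", Age=50)        # first load: age only
upsert(table, "A/EGD/1005", Sex="Female")  # second load: adds Sex, keeps Age
upsert(table, "A/EGD/10", Sex="Female")    # key with no age on record
print(table["A/EGD/1005"])
```

This is why rows such as `A/EGD/10` can end up with a `Sex` value and an empty `Age`: the second load creates them even though the first load skipped them.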

#### We load the records whose gender is populated.

In [811]:
WDB_Accused_Sex_data_path = '/home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt'

WDB_Accused_Sex_file = open(WDB_Accused_Sex_data_path, "r")

sRegs = !wc -l /home/learner/notebooks/data/WDB_Accused_loadCassandra_Sex.txt | cut -d " " -f1
nRegs = map(int, sRegs)
i = 0

for line in WDB_Accused_Sex_file:
    i = i + 1
    load_Accused_sex(line.split("|")[0],line.split("|")[1])
    print "\rLoading",i,"of",nRegs[0],

print "\n",i,"records loaded."

Loading 3170 of 3170 
3170 records loaded.


In [812]:
print "Male: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Sex\" = 'Male' ALLOW FILTERING")[0].count, "Accuseds."
print "Female: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Sex\" = 'Female' ALLOW FILTERING")[0].count, "Accuseds."


Male:  468 Accuseds.
Female:  2702 Accuseds.




In [813]:
%%cql
SELECT * FROM "Accused" WHERE "Sex" = 'Female' LIMIT 8 ALLOW FILTERING;

AccusedRef,Age,Sex
A/EGD/2291,,Female
A/EGD/683,53.0,Female
A/EGD/598,,Female
A/EGD/1391,,Female
A/EGD/2280,,Female
A/EGD/1721,25.0,Female
A/JO/2694,,Female
A/EGD/869,,Female


In [814]:
%%cql
SELECT * from "Accused" WHERE "AccusedRef" IN ('A/EGD/10','A/EGD/1005');

AccusedRef,Age,Sex
A/EGD/10,,Female
A/EGD/1005,50.0,Female


In [815]:
%%cql
CREATE INDEX Accused_Sex_idx
   ON "Accused" ("Sex");

'No results.'

In [817]:
print "Male: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Sex\" = 'Male' ALLOW FILTERING")[0].count, "Accuseds."
print "Female: ", session.execute("SELECT count(*) from \"Accused\" WHERE \"Sex\" = 'Female' ALLOW FILTERING")[0].count, "Accuseds."

Male:  468 Accuseds.
Female:  2702 Accuseds.




#### To avoid performance problems, materialized views are a better solution when the field we want to filter on is not selective (categorical).

%%cql
DROP MATERIALIZED VIEW "Accused_Sex";

In [819]:
%%cql
CREATE MATERIALIZED VIEW "Accused_Sex" AS
       SELECT "AccusedRef", "Sex" FROM "Accused"
       WHERE "Sex" IS NOT NULL
       PRIMARY KEY ("Sex","AccusedRef")
       WITH CLUSTERING ORDER BY ("AccusedRef" ASC)
    ;

'No results.'

#### The same queries, but against the materialized views

In [820]:
%%cql
SELECT "Sex", count(1) FROM "Accused_Sex" WHERE "Sex" = 'Male';

Sex,count
Male,11


In [821]:
%%cql
SELECT "Sex", count(1) FROM "Accused_Sex" WHERE "Sex" = 'Female';

Sex,count
Female,220


In [823]:
print "Male: ", session.execute("SELECT count(*) from \"Accused_Sex\" WHERE \"Sex\" = 'Male' ALLOW FILTERING")[0].count, "Accuseds."
print "Female: ", session.execute("SELECT count(*) from \"Accused_Sex\" WHERE \"Sex\" = 'Female' ALLOW FILTERING")[0].count, "Accuseds."

Male:  468 Accuseds.
Female:  2702 Accuseds.


### CONCLUSION:
#### Cassandra is not a good choice for an analytical system in which we need to build complex queries to extract information from the data.
#### Each query requires a dedicated table or a view designed ad hoc. Indexes should only be used when the indexed fields are selective (high cardinality).
#### It is definitely not recommended for working with the chosen dataset.
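The point about cardinality can be made concrete by comparing the number of distinct values in a column with the number of rows. The sketch below uses the gender counts reported earlier (2702 and 468) and synthetic keys in place of the real `AccusedRef` values; `cardinality_ratio` is a hypothetical helper and the code is written for Python 3.

```python
def cardinality_ratio(values):
    """Distinct values divided by total values: near 0 is low, 1.0 is unique."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / float(len(values))

# Gender column as counted in this notebook: two distinct values.
sex = ["Female"] * 2702 + ["Male"] * 468
# Synthetic unique keys standing in for AccusedRef.
refs = ["A/EGD/%d" % i for i in range(3170)]

print(cardinality_ratio(sex))   # close to 0: a poor candidate for an index
print(cardinality_ratio(refs))  # 1.0: every value is distinct
```

A ratio near zero, as for `Sex`, is the situation in which the notebook turns to a materialized view instead of an index.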