[Dataset Guide](https://sorry.vse.cz/~berka/challenge/pkdd1999/berka.htm)

[Design Excerpt](https://dbdesigner.page.link/7iRemaQAUMsbFb2o7)


Student: `....`

# Creating the Database

We import the Python libraries we will use to access the databases.

In [None]:
import pandas as pd
import sqlite3 as sql
import codecs


Let's create our database.

SQLite works at the file level: if the file does not exist, it is created. In Colab the file is always temporary.

To work with this file we use the database **connection** object.

>A database **connection** is the way a database server and its client software communicate with each other.

>The client uses a database connection to send commands and receive responses from the server.



In [None]:
dbfile = "data_berka.db"

con = sql.connect(dbfile)
con

<sqlite3.Connection at 0x7fac518a2b90>
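
As a minimal sketch of the connection life cycle, the snippet below opens a throwaway in-memory database (the special name `:memory:`, used here instead of `data_berka.db` so that nothing is written to disk), sends one command, reads the reply and closes the connection. The variable names are illustrative and not part of the notebook.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM;
# a file name such as "data_berka.db" would create the file if it is missing.
demo_con = sqlite3.connect(":memory:")

# Send a command through the connection and read the reply.
sqlite_version = demo_con.execute("SELECT sqlite_version()").fetchone()[0]
print("Connected to SQLite", sqlite_version)

# Closing releases the connection (and, for ":memory:", discards the data).
demo_con.close()
```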

Next we create one of our **tables**. We normally start with one that has no foreign keys, such as district.

To do so, the CREATE statement lists the attributes the table will have.

```
CREATE TABLE <table_name> (
  <column_name> <column_type> <primary key, not null, etc.>,
  ...
  FOREIGN KEY (<fk_column_name>)
     REFERENCES <referenced_table_name> (<referenced_table_pk_name>)
)
```
The data types SQLite supports are described in [SQLite Data Types](https://www.sqlite.org/datatype3.html)
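
The declared type in SQLite acts as a "type affinity" rather than a strict constraint. The sketch below uses a hypothetical table `affinity_demo` to show how values are stored under the declarations that appear in this notebook (`INT`, `TEXT`, `REAL`, `DECIMAL`, `varchar`). `typeof()` is SQLite's built-in function for inspecting the storage class of a value.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE affinity_demo (a INT, b TEXT, c REAL, d DECIMAL, e varchar)"
)

# Deliberately insert values whose Python type differs from the declared type.
con_demo.execute(
    "INSERT INTO affinity_demo VALUES (?, ?, ?, ?, ?)",
    ("42", 7, 3, "2.5", 99),
)

stored_types = con_demo.execute(
    "SELECT typeof(a), typeof(b), typeof(c), typeof(d), typeof(e) FROM affinity_demo"
).fetchone()

# a: INT     -> INTEGER affinity, the text '42' is stored as an integer
# b: TEXT    -> TEXT affinity, the number 7 is stored as text
# c: REAL    -> REAL affinity, the integer 3 is stored as a float
# d: DECIMAL -> NUMERIC affinity, the text '2.5' is stored as a real
# e: varchar -> TEXT affinity, the number 99 is stored as text
print(stored_types)
con_demo.close()
```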

## Loading the district table

We download the data from our data source


In [None]:
!wget https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/district.asc

--2022-11-25 16:16:33--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/district.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6574 (6.4K) [text/plain]
Saving to: ‘district.asc’


2022-11-25 16:16:33 (61.0 MB/s) - ‘district.asc’ saved [6574/6574]



To create the table we call `execute` with the corresponding CREATE statement

In [None]:
con.execute('CREATE TABLE DISTRICT (A1 INT PRIMARY KEY, A2 TEXT, A3 TEXT, A4 INT,A5 INT, A6 INT,' + 
  'A7 INT, A8 INT, A9 INT, A10 REAL, A11 INT, A12 REAL, A13 REAL, A14 INT, A15 INT, A16 INT)')


<sqlite3.Cursor at 0x7fac517f5880>

A1 is set as the primary key, so in addition to "INT" (integer) the "PRIMARY KEY" clause is added.
When the statement gets long, the Python string can be split across lines by closing the quotes, adding a "+", and opening the quotes again.


Finally we "save the changes" with **commit**

> A **COMMIT** statement in SQL ends a database transaction within a relational database management system (RDBMS) and makes all changes visible to other users.

In [None]:
con.commit()
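
To make the effect of `commit` concrete, here is a small sketch on a hypothetical in-memory table `notes`, assuming the default transaction handling of Python's `sqlite3` module: an uncommitted insert can be undone with `rollback`, while a committed one persists.

```python
import sqlite3

tx_con = sqlite3.connect(":memory:")
tx_con.execute("CREATE TABLE notes (body TEXT)")
tx_con.commit()

# This insert is never committed, so rollback discards it.
tx_con.execute("INSERT INTO notes VALUES ('discarded')")
tx_con.rollback()
after_rollback = tx_con.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

# This insert is committed, so it becomes permanent.
tx_con.execute("INSERT INTO notes VALUES ('kept')")
tx_con.commit()
after_commit = tx_con.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

print(after_rollback, after_commit)
tx_con.close()
```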

Every SQL statement executed against the database returns a cursor, which we will use mainly for queries

> A cursor is a control structure used to traverse (and potentially process) the records of a query result.

In [None]:
cursor = con.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [
     v[0] for v in cursor.fetchall()
     if v[0] != "sqlite_sequence"
]
cursor.close()
tables

['DISTRICT']
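
Besides `fetchall`, a cursor can be consumed one row at a time or iterated directly. The sketch below uses a hypothetical table `region` with three sample rows to show both styles.

```python
import sqlite3

cur_con = sqlite3.connect(":memory:")
cur_con.execute("CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT)")
cur_con.executemany(
    "INSERT INTO region VALUES (?, ?)",
    [(1, "Prague"), (2, "central Bohemia"), (3, "north Moravia")],
)

cursor = cur_con.execute("SELECT id, name FROM region ORDER BY id")

# fetchone() returns the next row of the result, or None when exhausted.
first_row = cursor.fetchone()

# Iterating the cursor continues from where fetchone() stopped.
remaining = [name for _id, name in cursor]

cursor.close()
cur_con.close()
print(first_row, remaining)
```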

3. We load the data into the table


In this case we do it with a Python library, pandas

In [None]:
pd.read_csv('district.asc', sep = ";" ).to_sql('DISTRICT', con, if_exists='append', index = False)
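
For comparison, the same kind of load can be sketched without pandas, using only the standard `csv` module and `executemany`. The in-memory text below is a stand-in for `district.asc`: it assumes the `;` separator and quoted headers and keeps only four of the columns, so it is an illustration rather than the real file.

```python
import csv
import io
import sqlite3

# Stand-in for district.asc: ';' separator, only four of the columns.
sample = io.StringIO(
    '"A1";"A2";"A3";"A4"\n'
    '1;"Hl.m. Praha";"Prague";1204953\n'
    '2;"Benesov";"central Bohemia";88884\n'
)

load_con = sqlite3.connect(":memory:")
load_con.execute(
    "CREATE TABLE DISTRICT (A1 INT PRIMARY KEY, A2 TEXT, A3 TEXT, A4 INT)"
)

# DictReader uses the first line as column names.
reader = csv.DictReader(sample, delimiter=";")
rows = [(r["A1"], r["A2"], r["A3"], r["A4"]) for r in reader]

# Placeholders (?) let SQLite bind the values safely, one tuple per row.
load_con.executemany("INSERT INTO DISTRICT VALUES (?, ?, ?, ?)", rows)
load_con.commit()

loaded = load_con.execute(
    "SELECT A1, A2, A4 FROM DISTRICT ORDER BY A1"
).fetchall()
print(loaded)
load_con.close()
```

The csv module yields strings, but the `INT` columns store them as integers because of SQLite's type affinity.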

4. Let's query the data in the table we just loaded

In [None]:
pd.read_sql_query("SELECT * FROM DISTRICT", con)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,73,Opava,north Moravia,182027,17,49,12,2,7,56.4,8746,3.33,3.74,90,4355,4433
73,74,Ostrava - mesto,north Moravia,323870,0,0,0,1,1,100.0,10673,4.75,5.44,100,18782,18347
74,75,Prerov,north Moravia,138032,67,30,4,2,5,64.6,8819,5.38,5.66,99,4063,4505
75,76,Sumperk,north Moravia,127369,31,32,13,2,7,51.2,8369,4.73,5.88,107,3736,2807


## Loading the remaining tables

### account

In [None]:
!wget https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/account.asc

--2022-11-25 16:16:34--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/account.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155356 (152K) [text/plain]
Saving to: ‘account.asc’


2022-11-25 16:16:34 (9.56 MB/s) - ‘account.asc’ saved [155356/155356]



In [None]:
con.execute('CREATE TABLE IF NOT EXISTS ACCOUNT (account_id INT PRIMARY KEY, district_id INT, frequency TEXT, date INT, '
      + 'FOREIGN KEY (district_id) REFERENCES district (A1))')
con.commit()

It is important to state the data type each column holds. For example: "INT" is an integer, "TEXT" is text, and "DECIMAL" is a number with decimals.

In [None]:
pd.read_csv('account.asc', sep = ";" ).to_sql('ACCOUNT', con, if_exists='append', index = False)


In [None]:
pd.read_sql_query("SELECT * FROM ACCOUNT LIMIT 10", con)

Unnamed: 0,account_id,district_id,frequency,date
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102
5,1972,77,POPLATEK MESICNE,930102
6,1539,1,POPLATEK PO OBRATU,930103
7,793,47,POPLATEK MESICNE,930103
8,2484,74,POPLATEK MESICNE,930103
9,1695,76,POPLATEK MESICNE,930103


#### Manipulating the data


We use the UPDATE statement to translate the values from Czech to English

In [None]:
con.execute("UPDATE ACCOUNT SET frequency = 'monthly' WHERE frequency = 'POPLATEK MESICNE'")
con.execute("UPDATE ACCOUNT SET frequency = 'weekly' WHERE frequency = 'POPLATEK TYDNE'")
con.execute("UPDATE ACCOUNT SET frequency = 'after_tr' WHERE frequency = 'POPLATEK PO OBRATU'")
con.commit()

In [None]:
pd.read_sql_query("SELECT * FROM ACCOUNT LIMIT 10", con)

Unnamed: 0,account_id,district_id,frequency,date
0,576,55,monthly,930101
1,3818,74,monthly,930101
2,704,55,monthly,930101
3,2378,16,monthly,930101
4,2632,24,monthly,930102
5,1972,77,monthly,930102
6,1539,1,after_tr,930103
7,793,47,monthly,930103
8,2484,74,monthly,930103
9,1695,76,monthly,930103
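
The three UPDATE statements can also be driven from a dictionary with parameter placeholders, which makes it easy to see how many rows each translation touched. This is a sketch on a reduced, in-memory copy of ACCOUNT with three sample rows; the `translations` dictionary and the `changed` counter are illustrative names.

```python
import sqlite3

upd_con = sqlite3.connect(":memory:")
upd_con.execute("CREATE TABLE ACCOUNT (account_id INT PRIMARY KEY, frequency TEXT)")
upd_con.executemany(
    "INSERT INTO ACCOUNT VALUES (?, ?)",
    [
        (576, "POPLATEK MESICNE"),
        (1539, "POPLATEK PO OBRATU"),
        (3818, "POPLATEK MESICNE"),
    ],
)

translations = {
    "POPLATEK MESICNE": "monthly",
    "POPLATEK TYDNE": "weekly",
    "POPLATEK PO OBRATU": "after_tr",
}

changed = {}
for czech, english in translations.items():
    cur = upd_con.execute(
        "UPDATE ACCOUNT SET frequency = ? WHERE frequency = ?",
        (english, czech),
    )
    # rowcount reports how many rows this UPDATE modified.
    changed[czech] = cur.rowcount
upd_con.commit()

remaining_values = sorted(
    row[0] for row in upd_con.execute("SELECT DISTINCT frequency FROM ACCOUNT")
)
print(changed, remaining_values)
upd_con.close()
```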


### client

1. Download the data

In [None]:
!wget https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/client.asc

--2022-11-25 16:16:34--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/client.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94820 (93K) [text/plain]
Saving to: ‘client.asc’


2022-11-25 16:16:34 (11.6 MB/s) - ‘client.asc’ saved [94820/94820]



2. Create the table

In [None]:
con.execute("CREATE TABLE IF NOT EXISTS CLIENT (client_id INT PRIMARY KEY,birth_number varchar,	district_id INT, "
            + "FOREIGN KEY (district_id) REFERENCES district (A1))");
con.commit()


A "foreign key" is a key shared with another table. The relationship has to be declared against the table that holds the referenced column.
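
Declaring a FOREIGN KEY documents the relationship, but standard SQLite builds only enforce it when `PRAGMA foreign_keys = ON` has been issued on the connection. The sketch below, on reduced in-memory versions of `district` and `client`, shows a row with a nonexistent district being rejected once enforcement is enabled.

```python
import sqlite3

fk_con = sqlite3.connect(":memory:")

# Enforcement is per connection and is off by default in standard builds.
fk_con.execute("PRAGMA foreign_keys = ON")

fk_con.execute("CREATE TABLE district (A1 INT PRIMARY KEY, A2 TEXT)")
fk_con.execute(
    "CREATE TABLE client (client_id INT PRIMARY KEY, district_id INT, "
    "FOREIGN KEY (district_id) REFERENCES district (A1))"
)

fk_con.execute("INSERT INTO district VALUES (1, 'Hl.m. Praha')")
fk_con.execute("INSERT INTO client VALUES (2, 1)")  # district 1 exists

try:
    # District 999 does not exist, so this violates the foreign key.
    fk_con.execute("INSERT INTO client VALUES (99, 999)")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("Rejected:", err)

client_rows = fk_con.execute("SELECT COUNT(*) FROM client").fetchone()[0]
fk_con.close()
```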

3. Load the data into the table

In [None]:
pd.read_csv('client.asc', sep = ";" ).to_sql('CLIENT', con, if_exists='append', index = False)

4. Check that the creation and the load worked

In [None]:
pd.read_sql_query("SELECT * FROM CLIENT LIMIT 10", con)

Unnamed: 0,client_id,birth_number,district_id
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5
5,6,190922,12
6,7,290125,15
7,8,385221,51
8,9,351016,60
9,10,430501,57


5. Code to drop the table after an incorrect design

In [None]:
con.execute("DROP TABLE CLIENT");
con.commit()
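
When a design has to be redone several times, the drop and the create can be wrapped together so the cell is safe to rerun. This is a sketch with a hypothetical helper `recreate_client`; `DROP TABLE IF EXISTS` avoids an error when the table is not there yet.

```python
import sqlite3

redo_con = sqlite3.connect(":memory:")

def recreate_client(con):
    """Drop CLIENT if present and create it again, empty."""
    con.execute("DROP TABLE IF EXISTS CLIENT")
    con.execute(
        "CREATE TABLE CLIENT (client_id INT PRIMARY KEY, "
        "birth_number TEXT, district_id INT)"
    )
    con.commit()

# First call: the table does not exist yet, and no error is raised.
recreate_client(redo_con)
redo_con.execute("INSERT INTO CLIENT VALUES (1, '706213', 18)")
redo_con.commit()

# Second call: the old table and its row are discarded.
recreate_client(redo_con)
rows_after = redo_con.execute("SELECT COUNT(*) FROM CLIENT").fetchone()[0]
print(rows_after)
redo_con.close()
```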

### disposition

1. Download the data

In [None]:
!wget https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/disp.asc

--2022-11-25 16:16:35--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/disp.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 129716 (127K) [text/plain]
Saving to: ‘disp.asc’


2022-11-25 16:16:35 (8.50 MB/s) - ‘disp.asc’ saved [129716/129716]



2. Create the table


In [None]:
con.execute("CREATE TABLE IF NOT EXISTS  DISPOSITION (disp_id integer PRIMARY KEY, client_id integer,	account_id integer,	type varchar, "
      "FOREIGN KEY (client_id) REFERENCES client (client_id), " + 
      "FOREIGN KEY (account_id) REFERENCES account (account_id))")

con.commit()

3. Load the data into the table


In [None]:
pd.read_csv('disp.asc', sep = ";" ).to_sql('DISPOSITION', con, if_exists='append', index = False)

4. Check the table

In [None]:
pd.read_sql_query("SELECT * FROM DISPOSITION LIMIT 10", con)

Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT
5,6,6,4,OWNER
6,7,7,5,OWNER
7,8,8,6,OWNER
8,9,9,7,OWNER
9,10,10,8,OWNER


### loan


1. Download the data

In [None]:
!wget  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/loan.asc

--2022-11-25 16:16:35--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/loan.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27037 (26K) [text/plain]
Saving to: ‘loan.asc’


2022-11-25 16:16:36 (59.1 MB/s) - ‘loan.asc’ saved [27037/27037]



2. Create the table


In [None]:
con.execute("CREATE TABLE LOAN(loan_id INT PRIMARY KEY, account_id INT,"
+ " date TEXT, amount INT, duration INT, payments REAL, status TEXT,"
+"FOREIGN KEY(account_id) REFERENCES ACCOUNT (account_id))")
con.commit()

3. Load the data into the table


In [None]:
pd.read_csv('loan.asc', sep = ";" ).to_sql('LOAN', con, if_exists='append', index = False)

4. Check the table

In [None]:
pd.read_sql_query("SELECT * FROM LOAN LIMIT 10", con)

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5314,1787,930705,96396,12,8033.0,B
1,5316,1801,930711,165960,36,4610.0,A
2,6863,9188,930728,127080,60,2118.0,A
3,5325,1843,930803,105804,36,2939.0,A
4,7240,11013,930906,274740,60,4579.0,A
5,6687,8261,930913,87840,24,3660.0,A
6,7284,11265,930915,52788,12,4399.0,A
7,6111,5428,930924,174744,24,7281.0,B
8,7235,10973,931013,154416,48,3217.0,A
9,5997,4894,931104,117024,24,4876.0,A


### trans

1. Download the data into the environment

In [None]:
!wget  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/trans.asc

--2022-11-25 16:16:36--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/trans.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69406578 (66M) [text/plain]
Saving to: ‘trans.asc’


2022-11-25 16:16:39 (213 MB/s) - ‘trans.asc’ saved [69406578/69406578]



2. Run the CREATE

In [None]:
con.execute("CREATE TABLE TRANS(trans_id INT PRIMARY KEY, account_id INT,"
+ " date TEXT, type TEXT, operation TEXT, amount REAL, balance REAL, k_symbol TEXT, bank TEXT, account INT,"
+"FOREIGN KEY(account_id) REFERENCES ACCOUNT (account_id))")
con.commit()

3. Load the data into the table

In [None]:
pd.read_csv('trans.asc', sep = ";" ).to_sql('TRANS', con, if_exists='append', index = False)

4. Check that the data loaded

In [None]:
pd.read_sql_query("SELECT * FROM TRANS LIMIT 10", con)

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,
5,771035,2632,930102,PRIJEM,VKLAD,1100.0,1100.0,,,
6,452728,1539,930103,PRIJEM,VKLAD,600.0,600.0,,,
7,725751,2484,930103,PRIJEM,VKLAD,1100.0,1100.0,,,
8,497211,1695,930103,PRIJEM,VKLAD,200.0,200.0,,,
9,232960,793,930103,PRIJEM,VKLAD,800.0,800.0,,,
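
Before loading a file into a table created by hand, it can help to compare the file header with the table's declared columns. The sketch below is illustrative: the table definition deliberately omits `operation` and names the last column `account_dest`, and the header line is a stand-in built from the column names shown in the output above (its quoting is an assumption).

```python
import csv
import io
import sqlite3

chk_con = sqlite3.connect(":memory:")
chk_con.execute(
    "CREATE TABLE TRANSACTIONS (trans_id INT PRIMARY KEY, account_id INT, "
    "date TEXT, type TEXT, amount REAL, balance REAL, k_symbol TEXT, "
    "bank TEXT, account_dest INT)"
)

# Stand-in for the first line of trans.asc.
header_line = (
    '"trans_id";"account_id";"date";"type";"operation";'
    '"amount";"balance";"k_symbol";"bank";"account"\n'
)
file_columns = next(csv.reader(io.StringIO(header_line), delimiter=";"))

# PRAGMA table_info returns one row per column; the name is the second field.
table_columns = [
    row[1] for row in chk_con.execute("PRAGMA table_info(TRANSACTIONS)")
]

missing_in_table = [c for c in file_columns if c not in table_columns]
missing_in_file = [c for c in table_columns if c not in file_columns]
print("In file but not in table:", missing_in_table)
print("In table but not in file:", missing_in_file)
chk_con.close()
```

Any name reported in either list means an `append` load would not line up with the declared schema.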

### credit card

1. Get the data

In [None]:
!wget https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/card.asc

--2022-11-25 16:42:32--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/card.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31588 (31K) [text/plain]
Saving to: ‘card.asc.1’


2022-11-25 16:42:32 (13.5 MB/s) - ‘card.asc.1’ saved [31588/31588]



2. Run the CREATE

In [None]:
con.execute("CREATE TABLE CREDIT_CARD (card_id INT PRIMARY KEY,disp_id integer,	type varchar,	issued varchar,"
  + "FOREIGN KEY (disp_id) REFERENCES DISPOSITION (disp_id))");
con.commit()

3. Load the data

In [None]:
pd.read_csv('card.asc', sep = ";" ).to_sql('CREDIT_CARD', con, if_exists='append', index = False)

4. Check the creation and the load


In [None]:
pd.read_sql_query('SELECT * FROM CREDIT_CARD LIMIT 10', con)

Unnamed: 0,card_id,disp_id,type,issued
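
An empty result shows only the column names, so an explicit row count makes the check less ambiguous. Below is a sketch of a hypothetical helper, `row_counts`, that counts the rows of every table listed in `sqlite_master`; the two demo tables are reduced stand-ins.

```python
import sqlite3

def row_counts(con):
    """Return {table_name: number_of_rows} for every table in the database."""
    names = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master itself, so quoting them is enough here.
    return {
        name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }

cnt_con = sqlite3.connect(":memory:")
cnt_con.execute("CREATE TABLE DISPOSITION (disp_id INT PRIMARY KEY, type TEXT)")
cnt_con.execute("CREATE TABLE CREDIT_CARD (card_id INT PRIMARY KEY, disp_id INT)")
cnt_con.executemany(
    "INSERT INTO DISPOSITION VALUES (?, ?)", [(1, "OWNER"), (2, "OWNER")]
)
cnt_con.commit()

counts = row_counts(cnt_con)
print(counts)
cnt_con.close()
```

A count of 0 for a table that should hold data is a signal to rerun or inspect the load step.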


### permanent order

In [None]:
con.execute("DROP TABLE IF EXISTS PERMANENT_ORDER");
con.commit()

1. Download the data

In [None]:
!wget  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/order.asc

--2022-11-25 16:40:05--  https://raw.githubusercontent.com/zhouxu-ds/loan-default-prediction/main/data/order.asc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 273800 (267K) [text/plain]
Saving to: ‘order.asc.1’


2022-11-25 16:40:06 (10.2 MB/s) - ‘order.asc.1’ saved [273800/273800]



2. Create the table


In [None]:
con.execute("CREATE TABLE PERMANENT_ORDER(order_id INT PRIMARY KEY, account_id INT, bank_to TEXT,"
+"account_to INT, amount DECIMAL, k_symbol TEXT)")
con.commit()

3. Load the data into the table


In [None]:
pd.read_csv('order.asc', sep = ";" ).to_sql('PERMANENT_ORDER', con, if_exists='append', index = False)

4. Check the table

In [None]:
pd.read_sql_query("SELECT * FROM PERMANENT_ORDER LIMIT 10", con)

Unnamed: 0,order_id,account_id,bank_to,account_to,amount,k_symbol
0,29401,1,YZ,87144583,2452.0,SIPO
1,29402,2,ST,89597016,3372.7,UVER
2,29403,2,QR,13943797,7266.0,SIPO
3,29404,3,WX,83084338,1135.0,SIPO
4,29405,3,CD,24485939,327.0,
5,29406,3,AB,59972357,3539.0,POJISTNE
6,29407,4,UV,26693541,2078.0,SIPO
7,29408,4,UV,5848086,1285.0,SIPO
8,29409,5,GH,37390208,2668.0,SIPO
9,29410,6,AB,44486999,3954.0,SIPO
