# 7.1. Data Collection

Module M-227-04: Programming for Data Analytics

Instructor: prof. Dmitry Pavlyuk

## Ways of Data Collection

1. From files - mainly for static or rarely updated data
    * Discussed in Week 3
2. From a database - standardised language (SQL), highly optimised queries
    * Good for in-house use, but rarely available to the public
3. Using an Application Programming Interface (API) - structured requests/responses
    * The most appropriate and efficient way, but not always available
4. Web scraping - weakly structured data
    * Widely used, but has many questionable aspects
5. Regular expressions - unstructured text data with some patterns
    * A useful tool for extracting data from text

## Accessing Databases

## Database Creation - Loading data

We re-use data on students, student groups, courses, and results from Week 5.

In [1]:
import pandas as pd
data_dir = "../week5/data/"
courses_df = pd.read_csv(data_dir+'st_courses.csv').set_index("course_id")
groups_df = pd.read_csv(data_dir+'st_groups.csv').set_index("group_id")
students_df = pd.read_csv(data_dir+'st_students.csv').set_index("student_id")
results_df = pd.read_csv(data_dir+'st_results.csv').set_index(["course_id","student_id"])

## Database Creation - SQLite

__SQLite__ is a C library that provides a lightweight disk-based database that doesn't require a separate server process and allows accessing the database using the SQL query language. Some applications can use SQLite for internal data storage. It's also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle.
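
To illustrate the "no separate server process" point, here is a minimal sketch that opens a throwaway in-memory database with Python's built-in `sqlite3` module. The `demo` table and its rows are hypothetical and are not part of the course data.

```python
import sqlite3

# ":memory:" creates a temporary database held in RAM;
# passing a file path instead would create or open a database file on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only
conn.execute("CREATE TABLE demo (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "first"), (2, "second")])

rows = conn.execute("SELECT id, label FROM demo ORDER BY id").fetchall()
print(rows)
conn.close()
```

No installation or configuration is needed beyond a standard Python interpreter, which is what makes SQLite convenient for prototyping.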

In [2]:
import sqlite3
DB_PATH = 'students.db'
conn = sqlite3.connect(DB_PATH)
courses_df.to_sql("course", conn, if_exists="replace")
groups_df.to_sql("student_group", conn, if_exists="replace")
students_df.to_sql("student", conn, if_exists="replace")
results_df.to_sql("result", conn, if_exists="replace")
conn.close()
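
As a quick check that `to_sql` stored what was expected, a table can be read back into a DataFrame with `pd.read_sql_query`. The sketch below uses a hypothetical two-row DataFrame and an in-memory database rather than the course CSV files.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for courses_df
toy_df = pd.DataFrame(
    {"course_name": ["Course A", "Course B"]},
    index=pd.Index([1, 2], name="course_id"),
)

conn = sqlite3.connect(":memory:")
# The named index is written as an ordinary column called course_id
toy_df.to_sql("course", conn, if_exists="replace")

# Read the table back and restore course_id as the index
back_df = pd.read_sql_query("SELECT * FROM course", conn, index_col="course_id")
conn.close()
print(back_df)
```

Because the index is written as a regular column, `index_col` is needed on the way back to recover the original shape.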

## Reviewing the database schema

In [3]:
conn = sqlite3.connect(DB_PATH)
print("Table COURSE:\n", conn.cursor().execute("PRAGMA table_info('course')").fetchall())
print("Table STUDENT_GROUP:\n",conn.cursor().execute("PRAGMA table_info('student_group')").fetchall())
print("Table STUDENT:\n",conn.cursor().execute("PRAGMA table_info('student')").fetchall())
print("Table RESULT:\n",conn.cursor().execute("PRAGMA table_info('result')").fetchall())
conn.close()

Table COURSE:
 [(0, 'course_id', 'INTEGER', 0, None, 0), (1, 'course_name', 'TEXT', 0, None, 0)]
Table STUDENT_GROUP:
 [(0, 'group_id', 'INTEGER', 0, None, 0), (1, 'group_name', 'TEXT', 0, None, 0), (2, 'group_year_started', 'INTEGER', 0, None, 0)]
Table STUDENT:
 [(0, 'student_id', 'INTEGER', 0, None, 0), (1, 'lastname', 'TEXT', 0, None, 0), (2, 'firstname', 'TEXT', 0, None, 0), (3, 'group_id', 'INTEGER', 0, None, 0)]
Table RESULT:
 [(0, 'course_id', 'INTEGER', 0, None, 0), (1, 'student_id', 'INTEGER', 0, None, 0), (2, 'attendance', 'INTEGER', 0, None, 0), (3, 'grade', 'REAL', 0, None, 0)]
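
Each tuple returned by `PRAGMA table_info` holds six fields: column position (`cid`), `name`, declared `type`, a `notnull` flag, the default value (`dflt_value`), and a primary-key flag (`pk`). In the output above the `notnull` and `pk` flags are all 0, meaning the tables carry no such constraints. The sketch below labels the fields for a hypothetical table that does declare a primary key and a NOT NULL column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with explicit constraints, for comparison with the output above
conn.execute(
    "CREATE TABLE course (course_id INTEGER PRIMARY KEY, course_name TEXT NOT NULL)"
)

# Field names of the rows returned by PRAGMA table_info
fields = ("cid", "name", "type", "notnull", "dflt_value", "pk")
schema = [
    dict(zip(fields, row))
    for row in conn.execute("PRAGMA table_info('course')")
]
conn.close()

for column in schema:
    print(column)
```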


## Reading data from the _course_ table

In [4]:
conn = sqlite3.connect(DB_PATH)
cursor = conn.execute("SELECT * FROM course")
for row in cursor:
    print(row)
conn.close()

(1, 'Information Systems and Technologies')
(2, 'Mathematics for data analytics')
(3, 'Modern Database Technologies')
(4, 'Programming for Data Analytics')
(5, 'Advanced Artificial Intelligence')


In [5]:
conn = sqlite3.connect(DB_PATH)
cursor = conn.execute("SELECT course_id, course_name FROM course")
for row in cursor:
    print(row)
conn.close()

(1, 'Information Systems and Technologies')
(2, 'Mathematics for data analytics')
(3, 'Modern Database Technologies')
(4, 'Programming for Data Analytics')
(5, 'Advanced Artificial Intelligence')
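
When a query depends on a value held in a Python variable, the `sqlite3` module supports `?` placeholders, which avoids building SQL by string concatenation. A minimal sketch on a hypothetical in-memory `course` table (the course names and `wanted_id` are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows, not the real course list
conn.execute("CREATE TABLE course (course_id INTEGER, course_name TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [(1, "Course A"), (2, "Course B"), (3, "Course C")],
)

wanted_id = 2  # value supplied at run time
# The parameters are passed separately as a tuple, one item per "?"
row = conn.execute(
    "SELECT course_id, course_name FROM course WHERE course_id = ?",
    (wanted_id,),
).fetchone()
conn.close()
print(row)
```

`fetchone()` returns a single row as a tuple, or `None` if nothing matches.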


## Reading data: LEFT JOIN

In [6]:
conn = sqlite3.connect(DB_PATH)
cursor = conn.execute("""
SELECT 
    student_id, 
    firstname, 
    lastname, 
    group_name 
FROM 
    student LEFT JOIN student_group ON student.group_id = student_group.group_id;
""")
for row in cursor:
    print(row)
conn.close()

(101, 'Jurijs', 'M', '4201MDA')
(102, 'Kaspars', 'J', '4101MDA')
(103, 'Jānis', 'Z', '4203MDA')
(104, 'Iļja', 'P', '4201MDA')
(105, 'Andris', 'Z', '4201MDA')
(106, 'Ņikita', 'Z', '4203MDA')
(107, 'Ahmed', 'J', '4101MDA')
(108, 'Tamanjit', 'K', '4201MDA')
(109, 'Alexey', 'K', '4203MDA')
(110, 'Vjačeslavs', 'M', '4203MDA')
(111, 'Jevgenijs', 'B', '4203MDA')
(112, 'Ērika', 'T', '4203MDA')
(113, 'Simeon', 'I', '4101MDA')
(114, 'Oskars', 'K', '4101MDA')
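
In the output above every student has a group, so a LEFT JOIN and an INNER JOIN would return the same rows. The difference shows up only when a row on the left has no match. The sketch below uses hypothetical toy tables in which one student has no group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: Boris is not assigned to any group
conn.executescript("""
CREATE TABLE student_group (group_id INTEGER, group_name TEXT);
CREATE TABLE student (student_id INTEGER, firstname TEXT, group_id INTEGER);
INSERT INTO student_group VALUES (1, 'G1');
INSERT INTO student VALUES (1, 'Anna', 1);
INSERT INTO student VALUES (2, 'Boris', NULL);
""")

# LEFT JOIN keeps every student; missing group columns become NULL (None in Python)
left_rows = conn.execute("""
SELECT firstname, group_name
FROM student LEFT JOIN student_group ON student.group_id = student_group.group_id
ORDER BY student_id
""").fetchall()

# INNER JOIN keeps only students with a matching group
inner_rows = conn.execute("""
SELECT firstname, group_name
FROM student INNER JOIN student_group ON student.group_id = student_group.group_id
ORDER BY student_id
""").fetchall()
conn.close()

print("LEFT JOIN: ", left_rows)
print("INNER JOIN:", inner_rows)
```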


## Reading data: GROUP BY

In [7]:
conn = sqlite3.connect(DB_PATH)
cursor = conn.execute("""
SELECT 
    student_group.group_id, 
    group_name,
    count(student.student_id)  -- counts matched students only; count(*) would report 1 for an empty group
FROM 
    student_group LEFT JOIN student ON student.group_id = student_group.group_id
GROUP BY
    student_group.group_id
""")
for row in cursor:
    print(row)
conn.close()

(1, '4101MDA', 4)
(2, '4201MDA', 4)
(3, '4203MDA', 6)
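
A subtlety when counting over a LEFT JOIN: `count(*)` counts rows, including the NULL-padded row produced for a group with no students, whereas `count(student.student_id)` counts only non-NULL values. With the course data both give the same numbers because every group has students. The sketch below uses hypothetical toy tables with one empty group to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: group G2 has no students
conn.executescript("""
CREATE TABLE student_group (group_id INTEGER, group_name TEXT);
CREATE TABLE student (student_id INTEGER, group_id INTEGER);
INSERT INTO student_group VALUES (1, 'G1');
INSERT INTO student_group VALUES (2, 'G2');
INSERT INTO student VALUES (1, 1);
INSERT INTO student VALUES (2, 1);
""")

rows = conn.execute("""
SELECT
    group_name,
    count(*),                  -- counts joined rows, including the NULL-padded one
    count(student.student_id)  -- counts only rows with a matched student
FROM
    student_group LEFT JOIN student ON student.group_id = student_group.group_id
GROUP BY
    student_group.group_id
ORDER BY
    student_group.group_id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

For the empty group, `count(*)` reports 1 while `count(student.student_id)` reports 0.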


# Thank you