# Import transaction data and aggregate it

This workflow demonstrates how to connect to a SQLite database from Python in order to import and aggregate data.

## Problem Statement
Create scripts that process the CSV files, insert the data into a table, and then use that data in the next step to perform various aggregations. The columns in the CSV files are as follows:

"transactionId", // a unique transaction identifier <br />
"user", // a unique user identifier <br />
"datetime", // a timestamp in ms (JavaScript format) <br />
"operation", // a description of the transaction being charged <br />
"quantity", <br />
"unitPrice" // the revenue of the transaction is quantity * unitPrice <br />

We use SQLite as the database so that the test is self-contained.
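
To make the column semantics concrete, here is a minimal sketch that converts a JavaScript-style millisecond timestamp to a UTC datetime and computes the revenue of a single row. The helper names `ms_to_utc` and `revenue` are hypothetical, and using `Decimal` for the price is an assumption made to avoid floating-point rounding in currency values. The sample values are those of the first row of `transactions_2020-08-01.csv`.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch (JavaScript format) to UTC."""
    return EPOCH + timedelta(milliseconds=ms)


def revenue(quantity: int, unit_price: str) -> Decimal:
    """Revenue of one transaction: quantity * unitPrice."""
    return quantity * Decimal(unit_price)


print(ms_to_utc(1596265204486))  # 2020-08-01 07:00:04.486000+00:00
print(revenue(6, "58.77"))       # 352.62
```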

# Expected Result

### Part 1: Create an import script

In the first part, we will create a script that imports the transactions of the CSV files into an SQL table.

- The script should accept the CSV filename.
- It should put the information into the DB, in a table that collects all transactions.
- All data from CSVs imported by the script will be placed in this same table.
- It should handle duplicate rows (e.g. the same CSV imported twice by mistake, or existing data updated by processing a new file).
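
One possible way to meet these requirements is sketched below, assuming that `transactionId` can serve as the primary key of the table. With that key in place, `INSERT OR REPLACE` overwrites an existing row instead of adding a duplicate. The function names `import_csv` and `main` and the `--db` option are hypothetical; the table and column names follow the rest of this workflow.

```python
import argparse
import csv
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tran(
    transactionId INTEGER PRIMARY KEY,  -- assumption: unique per transaction
    user TEXT,
    datetime INTEGER,
    operation TEXT,
    quantity INTEGER,
    unitPrice NUMERIC
)"""


def import_csv(connection: sqlite3.Connection, csv_path: str) -> int:
    """Load one CSV file into tran and return the number of rows read."""
    connection.execute(SCHEMA)
    with open(csv_path, newline="") as handle:
        rows = [
            (int(r["transactionId"]), r["user"], int(r["datetime"]),
             r["operation"], int(r["quantity"]), float(r["unitPrice"]))
            for r in csv.DictReader(handle)
        ]
    # A row with an already-known transactionId replaces the stored one,
    # so importing the same file twice does not create duplicates.
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR REPLACE INTO tran VALUES (?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)


def main(argv=None) -> None:
    """Command-line entry point: accepts the CSV filename."""
    parser = argparse.ArgumentParser(description="Import a transactions CSV")
    parser.add_argument("csv_file")
    parser.add_argument("--db", default="sample.db")
    args = parser.parse_args(argv)
    connection = sqlite3.connect(args.db)
    try:
        print(import_csv(connection, args.csv_file), "rows processed")
    finally:
        connection.close()
```

Calling `main(["transactions_2020-08-01.csv"])`, or wiring `main()` to the command line, would import that file into `sample.db`.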

### Part 2: Create an aggregation script

In the second part, we will create a program that reads from the SQL table created in the previous part and aggregates the data into two new DB tables, from which the following information can be retrieved:

- Aggregated by user / day: Number of operations and revenue per User per Day
- Aggregated by user / hour: Number of operations and revenue per User per Hour

### Getting Started

###### First, import the required libraries, including sqlite3 for the database

In [10]:
import pandas as pd
import sqlite3
from csv import reader
from csv import DictReader
import datetime

###### Second, define the database file and connect to it.

In [11]:
database = 'sample.db'
connection = sqlite3.connect(database)

###### Read CSV file 

In [12]:
df = pd.read_csv('transactions_2020-08-01.csv')

###### Inspect the data stored in the CSV file

In [13]:
print(df)
df.info()

       transactionId     user       datetime    operation  quantity  unitPrice
0              10000  user_23  1596265204486  OPERATION_3         6      58.77
1              10001  user_19  1596265208999  OPERATION_3         7      65.35
2              10002  user_69  1596265211607  OPERATION_6         1      36.57
3              10003  user_65  1596265215179  OPERATION_5         8      98.37
4              10004  user_21  1596265223091  OPERATION_4         2      74.43
...              ...      ...            ...          ...       ...        ...
19995          29995  user_97  1596351585521  OPERATION_2         5      84.54
19996          29996  user_13  1596351592813  OPERATION_1         1      37.30
19997          29997  user_74  1596351594605  OPERATION_7         1      21.99
19998          29998  user_68  1596351595906  OPERATION_2         6      48.98
19999          29999   user_2  1596351599448  OPERATION_1         3      69.57

[20000 rows x 6 columns]
<class 'pandas.core.frame.

###### Create table named 'tran' to store the records fetched from CSV files.

In [15]:
cur = connection.cursor()
sql = '''
CREATE TABLE tran(
transactionId INT, 
user TEXT, 
datetime NUMERIC, 
operation TEXT, 
quantity INT, 
unitPrice NUMERIC
)'''
cur.execute(sql)

<sqlite3.Cursor at 0xe87af20>

###### Open the CSV file, read the data, and insert the records into the tran table.

In [16]:
with open('transactions_2020-08-02.csv', 'r', newline='') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        # The millisecond timestamp is stored as-is; the aggregation queries convert it.
        cur.execute(
            'INSERT INTO tran VALUES (?, ?, ?, ?, ?, ?)',
            (row['transactionId'], row['user'], row['datetime'],
             row['operation'], row['quantity'], row['unitPrice'])
        )
    # Commit once, after all rows have been inserted
    connection.commit()
    print("Records Inserted successfully........")

Records Inserted successfully........


###### Create new table named 'Agg_User_Per_Day' to store 'Number of operations and revenue per User per Day'

In [17]:
sql = '''
CREATE TABLE Agg_User_Per_Day(
user TEXT,
No_of_Operations INTEGER,
Revenue NUMERIC
)'''
cur.execute(sql)
print("Table created successfully........")


Table created successfully........


###### Aggregate the data fetched from the tran table and insert it into the new table.
This performs the following:
- Aggregated by user / day: Number of operations and revenue per User per Day

In [18]:
query = '''
           SELECT user,
                  count(operation),
                  quantity*unitPrice as Revenue 
           FROM  tran 
           WHERE strftime('%Y-%m-%d', datetime / 1000, 'unixepoch') = '2020-08-02' 
           GROUP BY user 
           '''

cur.execute(query)
results = cur.fetchall()
print(results)

cur.executemany('INSERT INTO Agg_User_Per_Day VALUES (?,?,?);',results)
# Commit your changes in the database
connection.commit()

[('user_0', 165, 302.1), ('user_1', 193, 10.83), ('user_10', 162, 244.89999999999998), ('user_11', 159, 42.51), ('user_12', 173, 84.87), ('user_13', 185, 61.67), ('user_14', 171, 167.68), ('user_15', 164, 498.26000000000005), ('user_16', 159, 339.04), ('user_17', 171, 22.05), ('user_18', 165, 751.44), ('user_19', 195, 135.75), ('user_2', 157, 174.4), ('user_20', 163, 264), ('user_21', 198, 61.63), ('user_22', 191, 29.73), ('user_23', 176, 686.08), ('user_24', 171, 407.65), ('user_25', 195, 585.4), ('user_26', 179, 674.56), ('user_27', 180, 181.35), ('user_28', 160, 52.0), ('user_29', 186, 75.24), ('user_3', 201, 873.36), ('user_30', 162, 279.02), ('user_31', 174, 55.62), ('user_32', 163, 754.0), ('user_33', 180, 628.92), ('user_34', 172, 223.84), ('user_35', 193, 6.65), ('user_36', 176, 257.46), ('user_37', 165, 250.52), ('user_38', 183, 390.32), ('user_39', 207, 269.4), ('user_4', 213, 317.55), ('user_40', 178, 500.64), ('user_41', 187, 119.72), ('user_42', 180, 16.56), ('user_43', 16
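
Note that in the query above, `quantity*unitPrice` is not wrapped in an aggregate function, so for each `user` group SQLite returns the value of a single row rather than the total. The following is a minimal sketch of a per-user, per-day aggregation, assuming the intended revenue is the sum over all of a user's transactions. The `Day` column is an addition that the table above does not have, and the three sample rows are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # throwaway in-memory database
connection.execute(
    "CREATE TABLE tran(transactionId INT, user TEXT, datetime NUMERIC,"
    " operation TEXT, quantity INT, unitPrice NUMERIC)"
)
# Made-up rows for illustration only (not taken from the CSV files).
connection.executemany(
    "INSERT INTO tran VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "user_1", 1596351600000, "OPERATION_1", 2, 10.5),
        (2, "user_1", 1596355200000, "OPERATION_2", 1, 5.25),
        (3, "user_2", 1596351600000, "OPERATION_1", 3, 1.5),
    ],
)

# Day column added (an assumption) so rows from different days stay distinct.
connection.execute(
    "CREATE TABLE Agg_User_Per_Day(Day TEXT, user TEXT,"
    " No_of_Operations INTEGER, Revenue NUMERIC)"
)

daily_query = """
    SELECT strftime('%Y-%m-%d', datetime / 1000, 'unixepoch') AS Day,
           user,
           COUNT(*) AS No_of_Operations,
           SUM(quantity * unitPrice) AS Revenue  -- total over the group
    FROM tran
    GROUP BY strftime('%Y-%m-%d', datetime / 1000, 'unixepoch'), user
    ORDER BY 1, 2
"""
daily_rows = connection.execute(daily_query).fetchall()
with connection:  # commits the inserts on success
    connection.executemany(
        "INSERT INTO Agg_User_Per_Day VALUES (?, ?, ?, ?)", daily_rows
    )
print(daily_rows)
```

Because the day is part of the grouping key, the `WHERE` filter on a hard-coded date is no longer needed.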

###### Fetch the aggregated data from Agg_User_Per_Day

In [19]:
cur.execute("SELECT * FROM Agg_User_Per_Day ")
results = cur.fetchall()
print(results)

[('user_0', 165, 302.1), ('user_1', 193, 10.83), ('user_10', 162, 244.89999999999998), ('user_11', 159, 42.51), ('user_12', 173, 84.87), ('user_13', 185, 61.67), ('user_14', 171, 167.68), ('user_15', 164, 498.26000000000005), ('user_16', 159, 339.04), ('user_17', 171, 22.05), ('user_18', 165, 751.44), ('user_19', 195, 135.75), ('user_2', 157, 174.4), ('user_20', 163, 264), ('user_21', 198, 61.63), ('user_22', 191, 29.73), ('user_23', 176, 686.08), ('user_24', 171, 407.65), ('user_25', 195, 585.4), ('user_26', 179, 674.56), ('user_27', 180, 181.35), ('user_28', 160, 52), ('user_29', 186, 75.24), ('user_3', 201, 873.36), ('user_30', 162, 279.02), ('user_31', 174, 55.62), ('user_32', 163, 754), ('user_33', 180, 628.92), ('user_34', 172, 223.84), ('user_35', 193, 6.65), ('user_36', 176, 257.46), ('user_37', 165, 250.52), ('user_38', 183, 390.32), ('user_39', 207, 269.4), ('user_4', 213, 317.55), ('user_40', 178, 500.64), ('user_41', 187, 119.72), ('user_42', 180, 16.56), ('user_43', 164, 4

###### Create new table named 'Agg_User_Per_Hour' to store 'Number of operations and revenue per User per Hour'

In [20]:
sql = '''
CREATE TABLE Agg_User_Per_Hour(
Hour TEXT,
user TEXT,
No_of_Operations INTEGER,
Revenue NUMERIC
)'''
cur.execute(sql)

<sqlite3.Cursor at 0xe87af20>

###### Aggregate the data fetched from the tran table and insert it into the new table. This performs the following:
- Aggregated by user / hour: Number of operations and revenue per User per Hour

In [23]:
query = '''
    SELECT  strftime('%Y-%m-%d %H:00:00', datetime / 1000, 'unixepoch') as datetime,
            user,
            count(operation) as count,
            quantity*unitPrice as Revenue
    FROM  tran
    WHERE strftime('%Y-%m-%d', datetime / 1000, 'unixepoch')  = '2020-08-02'
    GROUP BY strftime('%Y-%m-%d %H:00:00', datetime / 1000, 'unixepoch') ,user
        '''
cur.execute(query)
results = cur.fetchall()
print(results)

cur.executemany('INSERT INTO Agg_User_Per_Hour VALUES (?,?,?,?);',results)
# Commit your changes in the database
connection.commit()

[('2020-08-02 07:00:00', 'user_0', 15, 302.1), ('2020-08-02 07:00:00', 'user_1', 13, 10.83), ('2020-08-02 07:00:00', 'user_10', 10, 244.89999999999998), ('2020-08-02 07:00:00', 'user_11', 11, 42.51), ('2020-08-02 07:00:00', 'user_12', 9, 84.87), ('2020-08-02 07:00:00', 'user_13', 12, 61.67), ('2020-08-02 07:00:00', 'user_14', 9, 167.68), ('2020-08-02 07:00:00', 'user_15', 10, 498.26000000000005), ('2020-08-02 07:00:00', 'user_16', 12, 339.04), ('2020-08-02 07:00:00', 'user_17', 8, 22.05), ('2020-08-02 07:00:00', 'user_18', 8, 751.44), ('2020-08-02 07:00:00', 'user_19', 11, 135.75), ('2020-08-02 07:00:00', 'user_2', 9, 174.4), ('2020-08-02 07:00:00', 'user_20', 13, 264), ('2020-08-02 07:00:00', 'user_21', 9, 61.63), ('2020-08-02 07:00:00', 'user_22', 9, 29.73), ('2020-08-02 07:00:00', 'user_23', 12, 686.08), ('2020-08-02 07:00:00', 'user_24', 15, 407.65), ('2020-08-02 07:00:00', 'user_25', 10, 585.4), ('2020-08-02 07:00:00', 'user_26', 13, 674.56), ('2020-08-02 07:00:00', 'user_27', 8, 
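
Re-running the cell above would insert the same hourly rows a second time. The sketch below shows one way to make the hourly aggregation safe to re-run, assuming a composite primary key on `(Hour, user)` and using `SUM` for the revenue. The function name `refresh_hourly` is hypothetical, and the primary key is an addition to the table definition used in this workflow.

```python
import sqlite3

HOURLY_SCHEMA = """
CREATE TABLE IF NOT EXISTS Agg_User_Per_Hour(
    Hour TEXT,
    user TEXT,
    No_of_Operations INTEGER,
    Revenue NUMERIC,
    PRIMARY KEY (Hour, user)  -- assumption: one row per user per hour
)"""

# INSERT OR REPLACE overwrites the row for an existing (Hour, user) pair.
REFRESH_HOURLY = """
INSERT OR REPLACE INTO Agg_User_Per_Hour
SELECT strftime('%Y-%m-%d %H:00:00', datetime / 1000, 'unixepoch'),
       user,
       COUNT(*),
       SUM(quantity * unitPrice)
FROM tran
GROUP BY strftime('%Y-%m-%d %H:00:00', datetime / 1000, 'unixepoch'), user
"""


def refresh_hourly(connection: sqlite3.Connection) -> int:
    """Recompute the hourly aggregates; return the row count of the table."""
    with connection:  # commits on success, rolls back on error
        connection.execute(HOURLY_SCHEMA)
        connection.execute(REFRESH_HOURLY)
    return connection.execute(
        "SELECT COUNT(*) FROM Agg_User_Per_Hour"
    ).fetchone()[0]
```

Calling `refresh_hourly(connection)` after each import keeps the aggregate table in step with `tran` without accumulating duplicate rows.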

###### Fetch the aggregated data from Agg_User_Per_Hour

In [25]:
cur.execute("SELECT * FROM Agg_User_Per_Hour")
results = cur.fetchall()
print(results)

[('2020-08-02 07:00:00', 'user_0', 15, 302.1), ('2020-08-02 07:00:00', 'user_1', 13, 10.83), ('2020-08-02 07:00:00', 'user_10', 10, 244.89999999999998), ('2020-08-02 07:00:00', 'user_11', 11, 42.51), ('2020-08-02 07:00:00', 'user_12', 9, 84.87), ('2020-08-02 07:00:00', 'user_13', 12, 61.67), ('2020-08-02 07:00:00', 'user_14', 9, 167.68), ('2020-08-02 07:00:00', 'user_15', 10, 498.26000000000005), ('2020-08-02 07:00:00', 'user_16', 12, 339.04), ('2020-08-02 07:00:00', 'user_17', 8, 22.05), ('2020-08-02 07:00:00', 'user_18', 8, 751.44), ('2020-08-02 07:00:00', 'user_19', 11, 135.75), ('2020-08-02 07:00:00', 'user_2', 9, 174.4), ('2020-08-02 07:00:00', 'user_20', 13, 264), ('2020-08-02 07:00:00', 'user_21', 9, 61.63), ('2020-08-02 07:00:00', 'user_22', 9, 29.73), ('2020-08-02 07:00:00', 'user_23', 12, 686.08), ('2020-08-02 07:00:00', 'user_24', 15, 407.65), ('2020-08-02 07:00:00', 'user_25', 10, 585.4), ('2020-08-02 07:00:00', 'user_26', 13, 674.56), ('2020-08-02 07:00:00', 'user_27', 8, 

# Possible Flaws in this workflow:
1. We have incorporated both the insertion and the selection logic in the same script. As a result, every time the script is re-run just to view the data, the records may be inserted into the table again as duplicates. To avoid this, we can split the script in two and separate the insertion logic from the selection logic. This reduces the run time and avoids duplication.
2. Further, to prevent the same file from being inserted into the database multiple times, we can store the filename and its upload date-time in a separate column and validate against it.
3. We can use the available data in more detail and generate various reports, such as:
- Total revenue by each operation
- Number of transactions per day
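
Points 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the `imported_files` table, the `register_file` function, and the two query constants are hypothetical names, and the filename is recorded in a separate table rather than in a column of `tran`.

```python
import sqlite3
from datetime import datetime, timezone


def register_file(connection: sqlite3.Connection, filename: str) -> bool:
    """Record an imported file; return False if it was registered before."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS imported_files("
        " filename TEXT PRIMARY KEY, imported_at TEXT)"
    )
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO imported_files VALUES (?, ?)",
                (filename, datetime.now(timezone.utc).isoformat()),
            )
    except sqlite3.IntegrityError:
        # The primary key on filename rejects a second registration.
        return False
    return True


# Report: total revenue by each operation.
REVENUE_BY_OPERATION = """
SELECT operation, SUM(quantity * unitPrice) AS revenue
FROM tran
GROUP BY operation
ORDER BY operation
"""

# Report: number of transactions per day (UTC).
TRANSACTIONS_PER_DAY = """
SELECT strftime('%Y-%m-%d', datetime / 1000, 'unixepoch') AS day,
       COUNT(*) AS transactions
FROM tran
GROUP BY strftime('%Y-%m-%d', datetime / 1000, 'unixepoch')
ORDER BY 1
"""
```

An import script could call `register_file(connection, filename)` first and skip the file when it returns `False`; the reports can then be fetched with `connection.execute(REVENUE_BY_OPERATION).fetchall()`.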

# Conclusion:
We can analyse this data using various tools such as Node.js, Python, Tableau, and Power BI. Additionally, we can visualize this data in Tableau or Power BI to make it easy to understand and presentable.