<a href="https://colab.research.google.com/github/SupunGurusinghe/sqlite-plus-colab/blob/main/sg_project1_wrong_mapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **What is Data Mapping?**

> Data mapping is crucial to the success of many data processes. One misstep in data mapping can ripple throughout your organization, leading to replicated errors, and ultimately, to inaccurate analysis.

> Nearly every enterprise will, at some point, move data between systems. And different systems store similar data in different ways. So to move and consolidate data for analysis or other tasks, a roadmap is needed to ensure the data gets to its destination accurately.

> For processes like data integration, data migration, data warehouse automation, data synchronization, automated data extraction, or other data management projects, quality in data mapping will determine the quality of the data to be analyzed for insights.

Ref: https://www.talend.com/resources/data-mapping/


> The quality in mapping applications depends on the quality of stored data. In mapping applications, the data quality has a great influence over the results. The data are used without considering the contained errors, and this can lead to erroneous results, disorienting information and bad decisions that can produce high costs to the user. Any difference between the real world and the dataset is considered an error.

Ref: https://www.researchgate.net/publication/279274566_Mapping_Data_-_Quality_Quantity_Or_Both

Here we focus on one aspect of data mapping: checking that data is loaded into the correct columns accurately.

### **Load data**

In [18]:
import pandas as pd

url1 = "https://raw.githubusercontent.com/SupunGurusinghe/sqlite-plus-colab/main/superstore_data.csv"
superstore_df1 = pd.read_csv(url1, encoding='windows-1252')

url2 = "https://raw.githubusercontent.com/SupunGurusinghe/sqlite-plus-colab/main/superstore.csv"
superstore_df2 = pd.read_csv(url2, encoding='windows-1252')


### **Background**

> **Scenario:** Two database tables, `storefront1` (source) and `storefront2` (target), store supermarket data.
---
**Output:** All rows of each table



#### **Source Table**

In [19]:
import sqlite3

conn = sqlite3.connect('test_database')
c = conn.cursor()

# dropping an existing table
c.execute("DROP TABLE IF EXISTS storefront1")

c.execute('''CREATE TABLE storefront1 (
  RowID INT, 
  OrderID VARCHAR(30),
  OrderDate VARCHAR(12),
  ShipDate VARCHAR(12),
  ShipMode VARCHAR(15),
  CustomerID VARCHAR(20),
  CustomerName VARCHAR(50),
  Segment VARCHAR(20),
  Country VARCHAR(20),
  City VARCHAR(20),
  State VARCHAR(20),
  PostalCode VARCHAR(15),
  Region VARCHAR(20),
  ProductID VARCHAR(30),
  Category VARCHAR(20),
  SubCategory VARCHAR(20),
  ProductName VARCHAR(20),
  Sales DOUBLE,
  Quantity INT,
  Discount DOUBLE,
  Profit DOUBLE)
''')

conn.commit()

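# Note: if_exists='replace' drops the table created above and lets pandas recreate it,
# so the column names and types come from the DataFrame, not from the CREATE TABLE statement.
superstore_df1.to_sql('storefront1', conn, if_exists='replace', index=False)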
 
c.execute('''  
SELECT * FROM storefront1
          ''')

results = c.fetchall()
c.close()

df1 = pd.DataFrame(results, columns= ['RowID', 'OrderID', 'OrderDate', 'ShipDate', 'ShipMode', 'CustomerID', 'CustomerName', 'Segment', 'Country', 'City', 'State', 'PostalCode', 'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName', 'Sales', 'Quantity', 'Discount', 'Profit'])
df1.head()


Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,...,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Suplease,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164

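The table information printed later in this notebook reports `PostalCode` as `INTEGER` rather than `VARCHAR(15)`. That is consistent with how `to_sql(..., if_exists='replace')` behaves: pandas drops the existing table and creates a new one from the DataFrame. The sketch below illustrates this on a small in-memory database. The table name `demo`, the helper `column_types`, and the two-row DataFrame are made up for illustration and are not part of the notebook.

```python
import sqlite3

import pandas as pd


def column_types(connection, table):
    """Return {column name: declared type} from PRAGMA table_info (hypothetical helper)."""
    rows = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {name: col_type for _cid, name, col_type, *_rest in rows}


# Assumed toy data; the postal codes are taken from the preview above.
demo_df = pd.DataFrame({"RowID": [1, 2], "PostalCode": [42420, 90036]})
demo_conn = sqlite3.connect(":memory:")

# Case 1: 'replace' discards the declared schema and infers types from the DataFrame.
demo_conn.execute("CREATE TABLE demo (RowID INT, PostalCode VARCHAR(15))")
demo_df.to_sql("demo", demo_conn, if_exists="replace", index=False)
types_after_replace = column_types(demo_conn, "demo")

# Case 2: 'append' inserts into the existing table and keeps the declared schema.
demo_conn.execute("DROP TABLE demo")
demo_conn.execute("CREATE TABLE demo (RowID INT, PostalCode VARCHAR(15))")
demo_df.to_sql("demo", demo_conn, if_exists="append", index=False)
types_after_append = column_types(demo_conn, "demo")

print(types_after_replace)
print(types_after_append)
```

If the declared schema should be kept, `if_exists='append'` is one option, provided the DataFrame column names match the column names of the existing table.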

#### **Target Table**

In [20]:
conn = sqlite3.connect('test_database')
c = conn.cursor()

# dropping an existing table
c.execute("DROP TABLE IF EXISTS storefront2")

c.execute('''CREATE TABLE storefront2 (
  RowID INT, 
  OrderID VARCHAR(30),
  OrderDate VARCHAR(12),
  ShipDate VARCHAR(12),
  ShipMode VARCHAR(15),
  CustomerID VARCHAR(20),
  CustomerName VARCHAR(50),
  Segment VARCHAR(20),
  Country VARCHAR(20),
  City VARCHAR(20),
  State VARCHAR(20),
  PostalCode VARCHAR(15),
  Region VARCHAR(20),
  ProductID VARCHAR(30),
  Category VARCHAR(20),
  SubCategory VARCHAR(20),
  ProductName VARCHAR(20),
  Sales DOUBLE,
  Quantity INT,
  Discount DOUBLE,
  Profit DOUBLE)
''')

conn.commit()

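# Note: if_exists='replace' drops the table created above and lets pandas recreate it,
# so the column names and types come from the DataFrame, not from the CREATE TABLE statement.
superstore_df2.to_sql('storefront2', conn, if_exists='replace', index=False)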
 
c.execute('''  
SELECT * FROM storefront2
          ''')

results = c.fetchall()
c.close()

df2 = pd.DataFrame(results, columns= ['RowID', 'OrderID', 'OrderDate', 'ShipDate', 'ShipMode', 'CustomerID', 'CustomerName', 'Segment', 'Country', 'City', 'State', 'PostalCode', 'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName', 'Sales', 'Quantity', 'Discount', 'Profit'])
df2.head()


Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,...,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


**Table to DataFrame**

In [13]:
def create_df(table_name):
  """
    Convert all rows of a table into a dataframe

    Parameters
    ----------
    table_name : str
        Table name to be considered

    Variables
    ----------
    names: list of tuples
        All the column names of {table_name}
    result_list: list
        List of column names of {table_name}
    df: dataframe
        All the rows and columns of {table_name}

    Returns
    -------
    df: dataframe
        All the rows and columns of {table_name}
  """
  c = conn.cursor()
  # create dataframe from a table
  c.execute("SELECT name FROM pragma_table_info(?) ORDER BY cid", [table_name])
  names = c.fetchall()

  # each row of names is a 1-tuple holding a column name
  result_list = [name[0] for name in names]

  c.execute(f'SELECT * FROM {table_name}')
  results = c.fetchall()

  df = pd.DataFrame(results, columns= result_list)
  c.close()
  return df

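As an alternative sketch, pandas can build the DataFrame and pick up the column names in one call with `pd.read_sql_query`. The helper name `table_to_df`, the in-memory database, and the `demo_orders` table are assumptions for illustration. The identifier quoting is a simple precaution for table names that contain spaces or quotes.

```python
import sqlite3

import pandas as pd


def table_to_df(connection, table_name):
    """Read every row of table_name into a DataFrame (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so quote the identifier instead.
    quoted = '"' + table_name.replace('"', '""') + '"'
    return pd.read_sql_query(f"SELECT * FROM {quoted}", connection)


# Assumed toy table; the Sales values are taken from the preview above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE demo_orders ("Row ID" INTEGER, Sales REAL)')
demo_conn.execute("INSERT INTO demo_orders VALUES (1, 261.96), (2, 731.94)")

demo_df = table_to_df(demo_conn, "demo_orders")
print(demo_df.columns.tolist())
```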

**The simplest way to check the data mapping is to inspect the table information**

For that, we can use SQLite's built-in `PRAGMA table_info` statement with the following syntax:
> PRAGMA table_info(table_name);


In [10]:
def table_info_compare(table1, table2):
  """
    Print the column information of the source and target tables.

    Columns in the result set include the column name, data type, whether or 
    not the column can be NULL, and the default value for the column. 
    The "pk" column in the result set is zero for columns that are not part of 
    the primary key, and is the index of the column in the primary key for 
    columns that are part of the primary key.

    Parameters
    ------------
    table1: str
      Name of source table
    table2: str
      Name of target table

    Returns
    ------------
    None
  """
  c = conn.cursor()

  c.execute('SELECT * FROM pragma_table_info(?) ORDER BY cid', [table1])
  print('Source Table Information\n')
  for result in c.fetchall():
    print(result)

  c.execute('SELECT * FROM pragma_table_info(?) ORDER BY cid', [table2])
  print('\nTarget Table Information\n')
  for result in c.fetchall():
    print(result)

  c.close()

table_info_compare('storefront1', 'storefront2')

Source Table Information

(0, 'RowID', 'INTEGER', 0, None, 0)
(1, 'OrderID', 'TEXT', 0, None, 0)
(2, 'OrderDate', 'TEXT', 0, None, 0)
(3, 'ShipDate', 'TEXT', 0, None, 0)
(4, 'ShipMode', 'TEXT', 0, None, 0)
(5, 'CustomerID', 'TEXT', 0, None, 0)
(6, 'CustomerName', 'TEXT', 0, None, 0)
(7, 'Segment', 'TEXT', 0, None, 0)
(8, 'Country', 'TEXT', 0, None, 0)
(9, 'City', 'TEXT', 0, None, 0)
(10, 'State', 'TEXT', 0, None, 0)
(11, 'PostalCode', 'INTEGER', 0, None, 0)
(12, 'Region', 'TEXT', 0, None, 0)
(13, 'ProductID', 'TEXT', 0, None, 0)
(14, 'Category', 'TEXT', 0, None, 0)
(15, 'SubCategory', 'TEXT', 0, None, 0)
(16, 'ProductName', 'TEXT', 0, None, 0)
(17, 'Sales', 'REAL', 0, None, 0)
(18, 'Quantity', 'INTEGER', 0, None, 0)
(19, 'Discount', 'REAL', 0, None, 0)
(20, 'Profit', 'REAL', 0, None, 0)

Target Table Information

(0, 'Row ID', 'INTEGER', 0, None, 0)
(1, 'Order ID', 'TEXT', 0, None, 0)
(2, 'Order Date', 'TEXT', 0, None, 0)
(3, 'Ship Date', 'TEXT', 0, None, 0)
(4, 'Ship Mode', 'TEXT', 0,

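The two printed listings can also be compared programmatically instead of by eye. The following is a minimal sketch and not part of the original notebook: `schema_diff`, the in-memory database, and the `demo_source` and `demo_target` tables are hypothetical, with column names chosen to mirror the difference visible above (`RowID` versus `Row ID`).

```python
import sqlite3
from itertools import zip_longest


def schema_diff(connection, source, target):
    """List positions where column name or declared type differ (hypothetical helper)."""

    def columns(table):
        cursor = connection.execute(
            "SELECT name, type FROM pragma_table_info(?) ORDER BY cid", [table]
        )
        return cursor.fetchall()

    mismatches = []
    # zip_longest pads the shorter table with None, so extra columns are reported too.
    pairs = zip_longest(columns(source), columns(target))
    for position, (source_col, target_col) in enumerate(pairs):
        if source_col != target_col:
            mismatches.append((position, source_col, target_col))
    return mismatches


# Assumed toy tables that only imitate the first columns of the real ones.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_source (RowID INTEGER, OrderID TEXT, Sales REAL)")
demo_conn.execute('CREATE TABLE demo_target ("Row ID" INTEGER, "Order ID" TEXT, Sales REAL)')

demo_mismatches = schema_diff(demo_conn, "demo_source", "demo_target")
for mismatch in demo_mismatches:
    print(mismatch)
```

An empty list would indicate that both tables have the same column names and declared types in the same order.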
### **Comparison of Two Data Sets using Python**

> Here, we will explore how to compare two large files/datasets efficiently while creating a meaningful summary using the Python library “datacompy”.

Ref: https://medium.com/analytics-vidhya/comparison-of-two-data-sets-using-python-a24a6d8beb13

`datacompy` is a package for comparing two DataFrames. It originally started as a replacement for SAS’s PROC COMPARE for pandas DataFrames, with more functionality than just `pandas.DataFrame.equals(pandas.DataFrame)`.

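To see why a report is more useful than `DataFrame.equals`, consider this small pandas-only sketch. The two frames are made up for illustration and reuse the `Category` values shown for rows 2 and 5 in the table previews above. `equals` only answers yes or no, so locating the differing rows takes extra work.

```python
import pandas as pd

# Assumed toy frames; only the Category of RowID 5 differs.
demo_source = pd.DataFrame(
    {"RowID": [2, 5], "Category": ["Furniture", "Office Suplease"]}
)
demo_target = pd.DataFrame(
    {"RowID": [2, 5], "Category": ["Furniture", "Office Supplies"]}
)

# equals() collapses the whole comparison into a single boolean.
frames_equal = demo_source.equals(demo_target)

# Finding where the frames differ needs an element-wise comparison.
cell_differs = demo_source.ne(demo_target)
rows_with_diffs = demo_source.loc[cell_differs.any(axis=1), "RowID"].tolist()

print(frames_equal)
print(rows_with_diffs)
```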
**Installing datacompy**

In [11]:
!pip install datacompy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datacompy
  Downloading datacompy-0.8.2-py3-none-any.whl (23 kB)
Collecting ordered-set<=4.1.0,>=4.0.2
  Downloading ordered_set-4.1.0-py3-none-any.whl (7.6 kB)
Installing collected packages: ordered-set, datacompy
Successfully installed datacompy-0.8.2 ordered-set-4.1.0


**Details :**
> datacompy takes two dataframes as input and gives us a human-readable report containing statistics that let us know the similarities and dissimilarities between the two dataframes.

> It will try to join two dataframes either on a list of join columns, or on indexes.

> Column-wise comparisons attempt to match values even when dtypes don't match. So if, for example, you have a column with decimal.Decimal values in one dataframe and an identically-named column with float64 data type in another, it will tell you that the dtypes are different but will still try to compare the values.

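The join-then-compare idea can be sketched with plain pandas and NumPy. This is only an illustration of the concept under stated assumptions, not datacompy's actual implementation: the frames `demo_left` and `demo_right` and their altered `Sales` values are invented, and the tolerance mirrors the `abs_tol=0.0001`, `rel_tol=0` settings used in this notebook.

```python
import numpy as np
import pandas as pd

# Assumed toy frames: RowID 3 exists only on the left, RowID 4 only on the right.
demo_left = pd.DataFrame({"RowID": [1, 2, 3], "Sales": [261.96, 731.94, 14.62]})
demo_right = pd.DataFrame({"RowID": [1, 2, 4], "Sales": [261.96004, 731.0, 22.368]})

# Step 1: join on the key column and record where each row came from.
joined = demo_left.merge(
    demo_right, on="RowID", how="outer", suffixes=("_src", "_tgt"), indicator=True
)
only_in_source = joined.loc[joined["_merge"] == "left_only", "RowID"].tolist()
only_in_target = joined.loc[joined["_merge"] == "right_only", "RowID"].tolist()

# Step 2: for rows present on both sides, compare values within a tolerance.
both = joined[joined["_merge"] == "both"]
sales_close = np.isclose(
    both["Sales_src"].to_numpy(), both["Sales_tgt"].to_numpy(), rtol=0, atol=0.0001
)
unequal_row_ids = both.loc[~sales_close, "RowID"].tolist()

print(only_in_source, only_in_target, unequal_row_ids)
```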
In [25]:
import datacompy

df1 = create_df('storefront1')
df2 = create_df('storefront2')
compare = datacompy.Compare(df1, df2, join_columns='RowID', abs_tol=0.0001, rel_tol=0, df1_name='Source Table', df2_name='Target Table')


**Generate the output (in the form of a report)**
> print(compare.report())

In [26]:
print(compare.report())

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

      DataFrame  Columns  Rows
0  Source Table       21  9994
1  Target Table       21  9994

Column Summary
--------------

Number of columns in common: 21
Number of columns in Source Table but not in Target Table: 0
Number of columns in Target Table but not in Source Table: 0

Row Summary
-----------

Matched on: rowid
Any duplicates on match values: No
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 9,994
Number of rows in Source Table but not in Target Table: 0
Number of rows in Target Table but not in Source Table: 0

Number of rows with some compared columns unequal: 37
Number of rows with all compared columns equal: 9,957

Column Comparison
-----------------

Number of columns compared with some values unequal: 18
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 95

Columns with Unequal Values or Types
-----------------------