# `Avoiding Duplicate Entries PSQL`

# <font color=red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# `Let's start from the` <font color=red>INSERT</font> `statement: you have 2 options`

When you decide to `insert` either a *`batch file, numerous CSV files or just a single line`*, there are distinctions to consider for `avoiding duplicate entries`.

`We have a few options depending on what scenario you fall into:`

`--------------------------------------------`

**`1. )`** Is this a situation where you want to find duplicate entries that you **`already have in an existing DB?`**

+ **`DISTINCT`** keyword

+ **`GROUP BY`** clause

+ **`INNER JOIN`** two or more tables

+ **`&&`** which is an `overlap` operator

**`2. )`** Are you trying to **`prevent creating duplicate entries before adding`** to a database?



+ **`ON CONFLICT DO NOTHING`**

+ **`UNIQUE`** constraint

+ Option to build a **`trigger/function`**

`----------------------------------------------`

+ **`Performance Issues:`**

When using **`ON CONFLICT`**, PostgreSQL does `pre-checks to find conflicts` before insertion. If the checks pass, the insert is performed. Otherwise, the row is skipped and processing moves on.

+ This is a step in the right direction because you avoid the overhead of writing a heap tuple that is later deleted. When that happens you create dead tuples, which waste storage space.


+ `BLOAT:` when we scan tables and update old entries with new ones, each replaced entry ("tuple") is left behind as a dead tuple. Over time this will affect speed and storage space if it is not controlled
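
One way to keep an eye on dead tuples is the statistics view `pg_stat_user_tables`. The sketch below is only a starting point, and it assumes a table named `customer_orders` exists (it is the example table used later in these notes).

```sql
-- Live vs. dead tuples per table, worst offenders first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Reclaim dead tuples in one table and refresh planner statistics.
-- VACUUM cannot run inside a transaction block.
VACUUM (VERBOSE, ANALYZE) customer_orders;
```

Autovacuum normally handles this in the background, so a manual `VACUUM` is mostly useful after a large batch of updates or deletes.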

`----------------------------------------------`

# <font color=red>Important Side Note:</font> *Batch Inserts* use `COPY` 

+ Consider using a `Temp Table` if you want to run operations such as `removing duplicates` or checking constraints before loading the real table; this can speed things up and helps prevent data loss
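
A minimal sketch of that staging pattern follows. The table `orders_demo`, the staging table `orders_staging` and the file path are all hypothetical placeholders, not names from the original notes.

```sql
-- Hypothetical target table.
CREATE TABLE IF NOT EXISTS orders_demo (
    transaction_id integer PRIMARY KEY,
    user_name      varchar(255) NOT NULL,
    order_date     date
);

-- Staging table: same columns, but no primary key or unique index,
-- so the raw file always loads even if it contains duplicates.
CREATE TEMP TABLE orders_staging (LIKE orders_demo INCLUDING DEFAULTS);

-- Server-side load; the path is a placeholder.
-- From psql, \copy does the same with a client-side file.
COPY orders_staging FROM '/path/to/orders.csv' WITH (FORMAT csv, HEADER true);

-- Keep one row per key within the batch and skip keys
-- that already exist in the target table.
INSERT INTO orders_demo
SELECT DISTINCT ON (transaction_id) *
FROM orders_staging
ORDER BY transaction_id
ON CONFLICT (transaction_id) DO NOTHING;
```

The `DISTINCT ON` step handles duplicates inside the file, while `ON CONFLICT` handles rows that are already in the database.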

`----------------------------------------------`

**`ON CONFLICT DO NOTHING`**

+ When a duplicate row is about to be added, it is ignored and does NOT create `bloat` such as `dead rows` of data and wasted space during the `insert`.
    + A `dead tuple` adds overhead in the form of storage space. Avoiding this will help in the long run.
+ `Consideration for UPDATES`: an update deletes the old row and inserts a new one, and the old deleted row stays behind as a dead row, which can cause bloat.
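
Here is a small sketch of the clause in action. The table `signup_demo` is a hypothetical name used only for this illustration; note that `ON CONFLICT` needs a primary key or unique index on the table to have something to conflict with.

```sql
CREATE TABLE IF NOT EXISTS signup_demo (
    user_name varchar(60) PRIMARY KEY,
    email     varchar(60)
);

-- First attempt: the row is new, psql reports "INSERT 0 1".
INSERT INTO signup_demo (user_name, email)
VALUES ('J_Shingles07', 'js07@somemail')
ON CONFLICT (user_name) DO NOTHING;

-- Second attempt with the same key: no error, psql reports "INSERT 0 0".
INSERT INTO signup_demo (user_name, email)
VALUES ('J_Shingles07', 'js07@somemail')
ON CONFLICT (user_name) DO NOTHING;
```

Leaving out the `(user_name)` target and writing just `ON CONFLICT DO NOTHING` skips the row on a violation of any unique constraint on the table.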

`-----------------------------------------------------------------`

# `UNIQUE INDEX` or `Constraint`

+ If you decide to use this `constraint`, understand that you can apply it to
`1 or more columns`; you `cannot have rows with repeating information`, and `NULL` values are viewed as `not equal` to each other.
    + If you need `NULL` values treated as equal, there is a provision for that: `NULLS NOT DISTINCT` (PostgreSQL 15 and later)
    + When used with `multiple columns`, the combination of values across those columns cannot be repeated
+ `When to USE:`
    + If you have `columns` that are `called often` in queries
    + When a `Where` clause or `Join` are performed often for these columns
+ `Avoid IF:`
    + Columns are `UPDATED` often
    + Small table size
    + Columns `Not used often`
    + This DOES create overhead for `INSERT` or `UPDATES` so take into consideration

+ `Unique Index` creates unique indexed columns
    + While `Unique Constraint` ensures that duplicates are not created and makes a unique index as well
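
To see how `NULL` handling changes the outcome, here is a sketch with two hypothetical tables (`nulls_demo_default` and `nulls_demo_strict` are names made up for this example). It assumes PostgreSQL 15 or newer.

```sql
-- Default behaviour: NULLs are considered distinct from each other.
CREATE TABLE nulls_demo_default (
    user_name varchar(60),
    email     varchar(60),
    UNIQUE (user_name, email)
);

-- NULLS NOT DISTINCT: NULLs are treated as equal.
CREATE TABLE nulls_demo_strict (
    user_name varchar(60),
    email     varchar(60),
    UNIQUE NULLS NOT DISTINCT (user_name, email)
);

INSERT INTO nulls_demo_default VALUES ('msmith21', NULL);
INSERT INTO nulls_demo_default VALUES ('msmith21', NULL);  -- accepted

INSERT INTO nulls_demo_strict VALUES ('msmith21', NULL);
INSERT INTO nulls_demo_strict VALUES ('msmith21', NULL);   -- rejected as a duplicate key
```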

`------------------------------------------------`

# EX.)

**`Create Table with Unique Constraint`**


`CREATE TABLE Customer_orders ( Transaction_id integer PRIMARY KEY,
User_name VARCHAR(255) NOT NULL,
Order_date date,
Quantity_Items integer,
Order_notes varchar(200),
CONSTRAINT User_unique UNIQUE (User_name)
);`

*If we were to add multiple columns just add a comma inside the parenthesis and your additional columns*
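
As a sketch of the multi-column form, the statement below adds a constraint over two columns of the `Customer_orders` table above. The constraint name `user_order_date_unique` is a made-up example.

```sql
-- The same user may appear many times, and the same date may appear many times,
-- but each (User_name, Order_date) pair may appear only once.
ALTER TABLE Customer_orders
ADD CONSTRAINT user_order_date_unique UNIQUE (User_name, Order_date);
```

This only adds something if the single-column `User_unique` constraint is not also in place, since that one already forbids repeating a user name at all.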

`-------------`

**`Alter Table:`** Use this if you have an existing table


`ALTER TABLE Customer_orders
ADD CONSTRAINT User_unique UNIQUE (User_name);`

`-------------`

**`Delete "Drop" Constraint`**

`ALTER TABLE Customer_orders
DROP CONSTRAINT User_unique;`

`-------------`

**`Unique Index`**


`CREATE UNIQUE INDEX idx_transaction_id
ON Customer_orders(Transaction_id);`

`-------------`

# **`Show Indexes for Current Database`**


`SELECT
tablename,
indexname,
indexdef
FROM
pg_indexes
WHERE
schemaname = 'public'
ORDER BY
tablename,
indexname;`

`---------Output--------------`


|    tablename    |           indexname           |                                              indexdef                                              |
|:---------------:|:-----------------------------:|:--------------------------------------------------------------------------------------------------:|
| Refunds         | Refund_id_pkey                | CREATE UNIQUE INDEX Refund_id_pkey ON public.Refunds USING btree(Refund_id)                        |
| Customer_orders | Customer_orders_pkey          |   CREATE UNIQUE INDEX Customer_orders_pkey ON public.Customer_orders USING btree(Transaction_id)   |
| Customer_orders | Customer_orders_User_name_key | CREATE UNIQUE INDEX Customer_orders_User_name_key ON public.Customer_orders USING btree(User_name) |
| Pizza_orders    | Pizza_orders_User_name_key    | CREATE UNIQUE INDEX Pizza_orders_User_name_key ON public.Pizza_orders USING btree(User_name)       |



# **`Show Indexes for Current Table`**


`SELECT
indexname,
indexdef
FROM
pg_indexes 
WHERE
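tablename = 'customer_orders';`

*Unquoted identifiers are folded to lowercase, so the catalog stores the table name as `customer_orders`*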

`------Output-------`


|           indexname           |                                              indexdef                                              |
|:-----------------------------:|:--------------------------------------------------------------------------------------------------:|
| Customer_orders_pkey          |  CREATE UNIQUE INDEX Customer_orders_pkey ON public.Customer_orders USING btree(Transaction_id)    |
| Customer_orders_User_name_key | CREATE UNIQUE INDEX Customer_orders_User_name_key ON public.Customer_orders USING btree(User_name) |

`------------------------------------------------`

# `Triggers & When To USE/NOT Use`

Think of a trigger as a callback tied to an event on a table. When the event occurs, the trigger fires and runs its trigger function.

+ For a `bulk insert, avoid triggers`; for single-line insertions the cost may not be a big deal, up to a point.

Triggers can do automatic tasks such as `INSERT, DELETE, UPDATE` for example. 
+ Consider two options:
    + are you doing row-by-row
    + once per statement, irrespective of the number of rows
    
`The trigger will be specified: before, after or instead of some operation!`
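
To make the row-by-row versus per-statement distinction concrete, here is a sketch of a statement-level trigger. The table `bulk_demo`, the function `note_bulk_change` and the trigger name are all hypothetical.

```sql
CREATE TABLE IF NOT EXISTS bulk_demo (id integer PRIMARY KEY);

CREATE OR REPLACE FUNCTION note_bulk_change()
    RETURNS trigger
    LANGUAGE plpgsql
AS $$
BEGIN
    -- TG_OP and TG_TABLE_NAME are built-in trigger variables.
    RAISE NOTICE '% on % finished', TG_OP, TG_TABLE_NAME;
    RETURN NULL;  -- the return value is ignored for AFTER triggers
END;
$$;

CREATE TRIGGER bulk_demo_after_insert
    AFTER INSERT ON bulk_demo
    FOR EACH STATEMENT
    EXECUTE PROCEDURE note_bulk_change();

-- Three rows, one statement: the notice is raised once, not three times.
INSERT INTO bulk_demo VALUES (1), (2), (3);
```

With `FOR EACH ROW` instead, the same insert would run the function three times, which is where the bulk-insert cost comes from.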

Here are a few examples of what you can do with them: 
+ `Scheduling a task`
+ `Possible Error Handling`
+ `Check for changes in your data`
+ `Logging or auditing`
    + Ex.) Consider if you were auditing users and their actions for some instance
    + Ex.) Auditing private information for users which will not be shown due to sensitivity/restrictions
        + Like: login/user information, time something may occur and you capture this, or database related information that is private
+ `Validating tasks`

`---------------------------`

+ **Good use case:** consider speeding up an update with a trigger that uses a temp table for its operations. Here is a starter to think about: [In Memory Options as a start](https://www.enterprisedb.com/postgres-tutorials/how-tune-postgresql-memory) | [secondary resource](https://postgrespro.com/docs/enterprise/10/in-memory) | [AWS PSQL In Memory](https://aws.amazon.com/blogs/database/introducing-optimized-reads-for-amazon-rds-for-postgresql/)

`---------------------------`

**`Downsides or when to Avoid:`**

+ Possible `performance` slowdowns such as increased server load
+ Not recommended if you have a high volume of data being used
+ `Debugging` can be an issue; for example, client-side applications won't see the trigger
+ `Stored procedures within a trigger`: try to `avoid` if possible
+ Be leery of cross-database triggers (especially if maintenance is due) and of cross-server triggers, due to speed concerns
+ `Triggers firing more triggers`: try to `stay away` from this
+ `Recursive triggers` turned on (debugging or performance issues)
    + `Try to reduce the number of write operations!`
+ `Iterating`: a row-by-row read or comparison, such as with `WHILE` loops or `CURSORs`, drastically reduces performance

**`PSQL Slight Differences from MySQL Triggers:`**

+ Truncating triggers
+ A trigger function is needed; the trigger itself calls it
+ Normal operations: *`Create Trigger`*, *`Drop Trigger`*, *`Alter Trigger`*, *`Disable/Enable Trigger`*

**A Few Take Aways:**

+ A user needs the `TRIGGER` privilege on the table and `EXECUTE` on the function to use one
+ `pg_catalog` (for example its `pg_trigger` table) will allow you to check all triggers for a given database
+ If you are creating multiple triggers that fire on the same object and event, they will run in alphabetical order by name

`------------------------------------------------`

# EX.)

There are a few steps to set this up:

+ Create a trigger function (it takes no parameters and returns `TRIGGER`)
+ Create the trigger
    + 1.) give the trigger a name
    + 2.) choose `BEFORE` or `AFTER` the event occurs
    + 3.) what are you doing? `INSERT`, `DELETE`, `UPDATE`, `TRUNCATE`
    + 4.) name the `table_name` after the `ON` keyword
    + 5.) Is this a row-by-row reference, `FOR EACH ROW`, or once per statement, `FOR EACH STATEMENT`
    + 6.) finish with `EXECUTE PROCEDURE` followed by your function name


# **`Create Trigger`**

**Assume we have this Table**

`CREATE TABLE Customer_orders ( Transaction_id integer PRIMARY KEY,
User_name VARCHAR(60) UNIQUE NULLS NOT DISTINCT,
logged_into_acct TIMESTAMP(6) NOT NULL,
Order_date date,
Quantity_Items integer,
Order_notes varchar(200),
email varchar(60) UNIQUE NULLS NOT DISTINCT,
order_placed_at TIMESTAMP(6) NOT NULL
);`


**Next, create the function that our trigger will call**
+ I decided to make this one about changing an email for a customer

`--------------------------------------------------`

`CREATE OR REPLACE FUNCTION log_User_Email_changes()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS
$$
BEGIN
    IF NEW.email <> OLD.email THEN
        INSERT INTO email_change_logs(email,User_name,email_changed_at)
        VALUES(OLD.email,OLD.User_name,now());
    END IF;
    RETURN NEW;
END;
$$;`

`-------------------------------------------------`

**Create the New Table:** *`email_change_logs`*

`CREATE TABLE email_change_logs (
email varchar(60) UNIQUE NULLS NOT DISTINCT,
User_name VARCHAR(60),
email_changed_at TIMESTAMP(6) NOT NULL
);`

`--------------------------------------------------`
    


# **`Finally, the trigger itself`**


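`CREATE TRIGGER log_User_Email_changes
AFTER INSERT
ON Customer_orders
FOR EACH ROW
EXECUTE PROCEDURE log_User_Email_changes();`

*The table name is left unquoted on purpose: the table was created without quotes, so PostgreSQL stores the name in lowercase and `"Customer_orders"` in double quotes would not be found. Also note that there is no `OLD` row on `INSERT`, so the `NEW.email <> OLD.email` check in the function cannot succeed for this trigger; only an `UPDATE` trigger will write to the log.*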
    
`--------------------------------------------------`

# *INSERT into the Customer_orders Table, which fires our Trigger*


`--------------------------------------------------`

**`INSERT INTO Customer_orders(Transaction_id,User_name,logged_into_acct,Order_date,Quantity_Items,
Order_notes, email, order_placed_at) VALUES
( 111,'HorseTooth_John05','2023-08-03 12:50:50','2023-08-02 11:00:05', 0,'',
'johnyBeebop@mail.com','2023-08-03 22:20:10');`**

`--------------------------------------------------`


**`SELECT * FROM Customer_orders`**

| Transaction_id |           User_name          |    logged_into_acct   |       Order_date      | Quantity_Items | Order_notes |          email         | order_placed_at       |
|:--------------:|:----------------------------:|:---------------------:|:---------------------:|:--------------:|:-----------:|:----------------------:|-----------------------|
|       111      | 'HorseTooth_John05' | '2023-08-03 12:50:50' | '2023-08-02' |        0       |      ''     | 'johnyBeebop@mail.com' | '2023-08-03 22:20:10' |

`--------------------------------------------------`

**`SELECT * FROM email_change_logs`**

|             email            |      User_name      | email_changed_at      |
|:----------------------------:|:-------------------:|-----------------------|
| 'HorseTooth_Cowboy@mail.com' | 'HorseTooth_John05' | '2023-08-03 22:20:10' |

`--------------------------------------------------`

# **`Update`**


`CREATE TRIGGER email_update_users
BEFORE UPDATE
ON Customer_orders
FOR EACH ROW
EXECUTE PROCEDURE log_User_Email_changes();`
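
To try the update trigger out, something like the sketch below should work, assuming the trigger is attached to the `Customer_orders` table and the row with `Transaction_id` 111 from the earlier insert exists. The new address is a made-up value.

```sql
-- Changing the email makes NEW.email differ from OLD.email,
-- so the trigger function writes the previous address to the log.
UPDATE Customer_orders
SET email = 'john_new@mail.com'   -- hypothetical new address
WHERE Transaction_id = 111;

-- The log should now hold the old email, the user name and the change time.
SELECT email, User_name, email_changed_at
FROM email_change_logs;
```

An update that leaves the email untouched fires the trigger too, but the `IF` check fails and nothing is logged.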



`--------------------------------------------------`


# **`Delete Event Trigger`**

`CREATE TRIGGER customer_Data_delete_trigger
AFTER DELETE
ON Customer_orders
FOR EACH ROW
EXECUTE PROCEDURE after_delete_fcn();`

*`after_delete_fcn()` would be a function you created with whatever you want to do*
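
As one possible shape for that function, here is a sketch that archives the key fields of each deleted row. The `deleted_orders_log` table and the body of the function are assumptions made for illustration; it must be created before the `CREATE TRIGGER` statement is run.

```sql
-- Hypothetical audit table for deleted orders.
CREATE TABLE IF NOT EXISTS deleted_orders_log (
    transaction_id integer,
    user_name      varchar(60),
    deleted_at     timestamp NOT NULL
);

CREATE OR REPLACE FUNCTION after_delete_fcn()
    RETURNS trigger
    LANGUAGE plpgsql
AS $$
BEGIN
    -- OLD holds the row that was just deleted.
    INSERT INTO deleted_orders_log (transaction_id, user_name, deleted_at)
    VALUES (OLD.Transaction_id, OLD.User_name, now());
    RETURN OLD;  -- the return value is ignored for AFTER triggers
END;
$$;
```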

`--------------------------------------------------`


# **`Drop`**

`DROP TRIGGER log_User_Email_changes ON Customer_orders;`

`--------------------------------------------------`

# `See All Triggers for a Database`

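`SELECT tgname, tgrelid::regclass AS table_name FROM pg_trigger WHERE NOT tgisinternal;`

*Run this while connected to the database you want to inspect; `pg_trigger` only lists triggers for the current database*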


`--------------------------------------------------`


# `Distinct:`

+ Collapse `Duplicate` rows in a result based on *`1 column or more`* using a `SELECT` statement.

+ You can use `Distinct ON` in your query to take the first occurrence of a duplicate

+ `When NOT TO USE:`
    + `Where` clause
    + More than 1 `DISTINCT` keyword in a given SELECT statement, such as calling DISTINCT col_1, DISTINCT col_2
    + `Group By`
    
`------------------------------------------------`

# EX.)


`CREATE TABLE Customer_acct ( first_name VARCHAR(60) NOT NULL, last_name VARCHAR(60) NOT NULL, User_name VARCHAR(60) UNIQUE NULLS NOT DISTINCT, Acct_notes varchar(200), email varchar(60) UNIQUE NULLS NOT DISTINCT, login_time TIMESTAMP(6) NOT NULL, Origin_country_abrev CHAR(3) NOT NULL, birth_yr CHAR(4) );`


| first_name |  last_name |     User_name     |        Acct_notes        |            email            | login_time       | Origin_country_abrev | birth_yr |
|:----------:|:----------:|:-----------------:|:------------------------:|:---------------------------:|------------------|:--------------------:|:--------:|
|   "Jason"  | "Shingles" |   "J_Shingles07"  |            ''            |       "js07@somemail"       | 2023-08-04 12:50 |         "CA"         |   1975   |
|   "John"   |  "Rivers"  | "whoseYoDaddy_11" |      '3 login fails'     |      "whoseYodd@jjmail"     | 2023-08-04 10:00 |         "USA"        |   1999   |
|   "John"   |   "Cash"   |   "JCash_bruh99"  |       'new account'      |   "johnnyCsh_money@mymail"  | 2023-08-03 06:33 |         "UK"         |   2001   |
|   "Ricky"  |  "Dovers"  |   "rDov_wham11"   |    'user name updated'   |     "ricky_d69@yourmail"    | 2023-05-03 09:50 |         "MX"         |   1989   |
|   "Mike"   |   "Smith"  |     "msmith21"    | 'origin country updated' | "mikeySmithy001@travelmail" | 2022-08-03 12:50 |         "USA"        |   2002   |
|   "Chris"  |   "Smith"  |   "CSmith_0906"   |            ''            |   "whodat191@creycraymail"  | 2023-08-03 12:50 |         "CA"         |   1979   |





**`One Column`**
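
A minimal sketch against the `Customer_acct` table above:

```sql
-- "John" appears twice in the sample rows but is returned once,
-- so five first names come back instead of six.
SELECT DISTINCT first_name
FROM Customer_acct
ORDER BY first_name;
```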



**`More Than 1 Column`**
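
With more than one column, `DISTINCT` applies to the combination of values. A sketch using the same `Customer_acct` table:

```sql
-- "Smith" is listed twice here because the two Smith rows
-- have different countries, so the pairs are not duplicates.
SELECT DISTINCT last_name, Origin_country_abrev
FROM Customer_acct
ORDER BY last_name, Origin_country_abrev;
```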



`------------------------------------------------`
    
    
# `Distinct ON:`

This is similar to the above, except it is useful when you want a specific `ordering` of your non-duplicated entries, such as when you have `multiple entries but choose the first occurrence`.
+ If, for instance, you have multiple columns but non-unique rows, this can be a beneficial use case: it outputs the first occurrence of each duplicate
    
+ The theory, from what I read, is that a `temp file` is created during the `group by`, along with other tasks such as reads/writes, and this overhead slows things down by 5-10%. I cannot say this is true for PSQL, and the linked test appears to be for the Jet engine, but here is the read which suggests it: [distinct vs group by speed](https://nolongerset.com/distinct-vs-group-by-jet-speed-test/)
    
+ If using this in conjunction with `Order By`, the `Distinct ON` columns need to come first in the `Order By`, in the same order

`------------------------------------------------`

# EX.)

`More Than 1 column but 1 Table`
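
A sketch on the `Customer_acct` table above, keeping one row per last name and choosing the most recent login as the "first occurrence":

```sql
-- The DISTINCT ON column must lead the ORDER BY;
-- the rest of the ORDER BY decides which row wins.
-- For the two Smith rows, the 2023 login (Chris) is kept.
SELECT DISTINCT ON (last_name)
       last_name, first_name, User_name, login_time
FROM Customer_acct
ORDER BY last_name, login_time DESC;
```

More expressions can be listed inside `DISTINCT ON (...)` as long as the `ORDER BY` starts with the same ones.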



`More Than 1 column but 2 Tables`
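
For the two-table case, here is a sketch that joins `Customer_acct` to a second table and keeps the latest row per user. `Login_events` is a hypothetical table invented for this example; it is not part of the original notes.

```sql
-- Hypothetical second table: one row per login attempt.
CREATE TABLE IF NOT EXISTS Login_events (
    User_name  varchar(60),
    login_time timestamp NOT NULL,
    succeeded  boolean NOT NULL
);

-- One row per user: the most recent login event joined to the account.
SELECT DISTINCT ON (a.User_name)
       a.User_name, a.email, e.login_time, e.succeeded
FROM Customer_acct AS a
JOIN Login_events  AS e ON e.User_name = a.User_name
ORDER BY a.User_name, e.login_time DESC;
```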


`------------------------------------------------`

# `Group By:`

+ Aggregate usage and flexibility
+ Should be `faster than` *distinct*, `but check if this is true` on your own data. There are a lot of considerations with this though!
    + The problem is the differing opinions and information out there to read!
    
`------------------------------------------------`

# EX.)
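
A common way to find duplicates with `GROUP BY` is to count rows per group and keep the groups with more than one. A sketch on the `Customer_acct` table above:

```sql
-- With the sample rows this returns a single row: Smith, 2.
SELECT last_name, COUNT(*) AS occurrences
FROM Customer_acct
GROUP BY last_name
HAVING COUNT(*) > 1;
```

Grouping by several columns works the same way and reports combinations that repeat.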



`------------------------------------------------`

#  `&&` Overlap Operator Comparing Arrays

+ The output is a boolean `(T/F)` telling you whether the two arrays share any element; getting the `number of duplicate` entries takes an extra step

`------------------------------------------------`

# EX.)
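
A small sketch with literal arrays. The values are made up for illustration; the counting query is one possible way to get the number of shared elements, not the only one.

```sql
-- Do the two arrays have at least one element in common?
SELECT ARRAY[1, 2, 3, 4] && ARRAY[3, 4, 5] AS has_overlap;  -- true
SELECT ARRAY[1, 2, 3]    && ARRAY[4, 5]    AS has_overlap;  -- false

-- Counting the shared elements needs a separate query.
SELECT COUNT(*) AS shared
FROM (
    SELECT unnest(ARRAY[1, 2, 3, 4])
    INTERSECT
    SELECT unnest(ARRAY[3, 4, 5])
) AS common;  -- 2
```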



`------------------------------------------------`

# `Inner Join`

+ Use two or more tables to construct a join on something common to both, such as a primary key, to find the rows that appear in both so the duplicates can be removed.

`------------------------------------------------`

# EX.)
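
As a sketch, suppose new accounts are waiting in a staging table before being added to `Customer_acct`. The table `incoming_accts` is a hypothetical name used only here.

```sql
-- Hypothetical staging table holding rows waiting to be loaded.
CREATE TEMP TABLE incoming_accts (
    User_name varchar(60),
    email     varchar(60)
);

-- Rows returned here already exist in Customer_acct, so they are duplicates.
SELECT i.User_name, i.email
FROM incoming_accts AS i
INNER JOIN Customer_acct AS c
        ON c.User_name = i.User_name;

-- Remove those duplicates from the staging table before loading the rest.
DELETE FROM incoming_accts AS i
USING Customer_acct AS c
WHERE c.User_name = i.User_name;
```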



`------------------------------------------------`

**`If you want coded PSQL or Psycopg examples, let me know for a future video`**

Also, I am considering other PSQL topics for future videos, so stay tuned!

# Like, Share & <font color=red>SUB</font>scribe

# `Citations`

# ◔̯◔

https://tomcam.github.io/postgres/

https://aws.amazon.com/blogs/database/hidden-dangers-of-duplicate-key-violations-in-postgresql-and-how-to-avoid-them/

https://www.freecodecamp.org/news/how-to-remove-duplicate-data-in-sql/#:~:text=One%20of%20the%20easiest%20ways,values%20from%20a%20particular%20column

https://subscription.packtpub.com/book/data/9781803248974/5/ch05lvl1sec63/preventing-duplicate-rows

https://stackoverflow.com/questions/67616081/preventing-insert-on-duplicate-values-postgres

https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/30118648#30118648

https://www.postgresql.org/docs/current/btree-gist.html

https://codingsight.com/sql-insert-into-select-5-easy-ways-to-handle-duplicates/

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/development/remove-duplicate-rows-sql-server-tab

https://www.mongodb.com/community/forums/t/batch-insert-upsert-avoiding-duplicates/163725 (mongodb)

https://stackoverflow.com/questions/53722405/how-to-insert-bulk-rows-and-ignore-duplicates-in-postgresql-9-3

https://www.psycopg.org/psycopg3/docs/advanced/async.html

https://www.postgresql.org/docs/current/ddl-generated-columns.html

https://www.appsloveworld.com/postgresql/100/58/how-to-do-a-bulk-insert-while-avoiding-duplicates-in-postgresql

https://alibaba-cloud.medium.com/use-of-the-postgresql-upsert-insert-on-conflict-do-function-f366ac8afd52 (good examples)

https://www.postgresqltutorial.com/postgresql-tutorial/how-to-delete-duplicate-rows-in-postgresql/

https://www.delftstack.com/howto/postgres/postgresql-insert-on-duplicate-update/ (cool read, look at RACE ex.)

https://www.geeksforgeeks.org/multiple-indexes-vs-multi-column-indexes/ (Index info)

https://www.postgresql.org/docs/current/indexes-multicolumn.html (Index key points)

https://devcenter.heroku.com/articles/postgresql-indexes (Index info 2)

https://rbranson.medium.com/10-things-i-hate-about-postgresql-20dbab8c2791 (issues with PSQL)

https://www.enterprisedb.com/postgres-tutorials/how-select-distinct-values-query-results-postgresql

https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-select-distinct/

https://nolongerset.com/distinct-vs-group-by-jet-speed-test/

https://database.guide/2-ways-to-delete-duplicate-rows-in-postgresql-ignoring-the-primary-key/ (good examples)

https://www.c-sharpcorner.com/article/different-ways-to-find-and-delete-duplicate-rows-from-a-table-in-sql-server/

https://copyprogramming.com/howto/sql-delete-duplicate-combined-rows-in-postgresql (interesting read)

https://medium.com/flatiron-engineering/uniqueness-in-postgresql-constraints-versus-indexes-4cf957a472fd

https://www.geeksforgeeks.org/postgresql-list-indexes/

`Triggers`

https://stackoverflow.com/questions/460316/are-database-triggers-necessary#:~:text=In%20this%20case%20triggers%20cause,that%20triggers%20are%20indeed%20harmful

https://www.tutorialspoint.com/What-are-the-advantages-disadvantages-and-restrictions-of-using-MySQL-triggers

https://www.red-gate.com/simple-talk/databases/sql-server/database-administration-sql-server/sql-server-triggers-good-scary/ (good read!)

https://www.sqlservercentral.com/articles/postgresql-triggers-part-1

https://www.enterprisedb.com/postgres-tutorials/everything-you-need-know-about-postgresql-triggers

https://www.postgresqltutorial.com/postgresql-triggers/creating-first-trigger-postgresql/

https://www.techonthenet.com/postgresql/unique.php

`Bulk Insert`

https://www.enterprisedb.com/blog/7-best-practice-tips-postgresql-bulk-data-loading

https://www.commandprompt.com/education/how-to-insert-bulk-data-in-postgresql/

https://www.sqlshack.com/working-with-line-numbers-and-errors-using-bulk-insert/

https://www.2ndquadrant.com/en/blog/7-best-practice-tips-for-postgresql-bulk-data-loading/

https://www.cockroachlabs.com/docs/stable/performance-best-practices-overview

https://www.highgo.ca/2020/12/08/bulk-loading-into-postgresql-options-and-comparison/