## SCD Type 1 – Overwrite Old Data (No History Tracking)

**Definition:**  
SCD Type 1 handles changes by **overwriting** the existing data in the dimension table. When an attribute of an existing record changes, the new value replaces the old one in the same row, so the previous value is lost.

### Use Case:
This approach is suitable when **historical data is not required**, and only the latest information needs to be maintained. Common use cases:
- Correcting spelling errors in names
- Updating email addresses or phone numbers
- Fixing invalid or outdated data

### Business Rule:
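- If the record exists and any attribute has changed, **update** the existing row.
- If the record exists and nothing has changed, leave the row as it is.
- If the record is new (i.e., its `CustomerID` is not found in the dimension table), **insert** it.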

### Technique:
1. Load the existing dimension table (`customers_base`) from the database.
2. Load the new incoming dataset (with updated values).
3. For each record in the incoming dataset:
   - If `CustomerID` exists in the base table:
     - Compare field values.
     - If differences are found, **overwrite** the existing row using `UPDATE`.
   - If `CustomerID` is not found:
     - **Insert** the record.
4. Write the result to a new table `customers_scd1`.
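
The steps above can be sketched row by row in plain Python before moving to pandas. This is only an illustration under stated assumptions: `apply_scd1` is a hypothetical helper, the dimension table is modelled as a list of dictionaries, and the sample rows are made up for the demo.

```python
def apply_scd1(base_rows, incoming_rows, key="customerid"):
    """Return (result_rows, counts) after an SCD Type 1 upsert (hypothetical helper)."""
    # Index the base table by its key so each lookup is a dictionary access.
    table = {row[key]: dict(row) for row in base_rows}
    counts = {"updated": 0, "inserted": 0, "unchanged": 0}
    for row in incoming_rows:
        current = table.get(row[key])
        if current is None:
            table[row[key]] = dict(row)   # key not found: insert
            counts["inserted"] += 1
        elif current != row:
            table[row[key]] = dict(row)   # values differ: overwrite, old values are lost
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1      # identical row: nothing to do
    return list(table.values()), counts


# Made-up sample data for the sketch.
base = [
    {"customerid": 101, "city": "Hyderabad"},
    {"customerid": 102, "city": "Hyderabad"},
]
incoming = [
    {"customerid": 101, "city": "Bangalore"},  # changed
    {"customerid": 102, "city": "Hyderabad"},  # unchanged
    {"customerid": 107, "city": "Delhi"},      # new
]
result, counts = apply_scd1(base, incoming)
print(counts)
```

The pandas cells in this notebook reach the same result with set-based operations (join, filter, concatenate) instead of a per-row loop.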

### Summary:
- Unchanged records: **Kept as they are**
- Changed records: **Overwritten with new values**
- New records: **Inserted**
- No tracking of old values is maintained


In [70]:
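import urllib.parse  # import the submodule explicitly so urllib.parse.quote_plus is available
from sqlalchemy import create_engine
import pandas as pd
from datetime import datetime

# Raw string so the backslash in the instance name is not read as an escape sequence
server=r'DESKTOP-HJVSCEN\MSSQLSERVER1'
database='Python ETL'
username='sa'
password='Ka@12345678'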


ConnectionString = f"""
    DRIVER={{ODBC Driver 18 for SQL Server}};
    SERVER={server};
    DATABASE={database};
    UID={username};
    PWD={password};
    TrustServerCertificate=yes;
"""
# URL-encode the connection string for SQLAlchemy
params=urllib.parse.quote_plus(ConnectionString)

engine=create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
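
The connection cell above hard-codes the credentials. One possible variation, shown here only as a sketch, reads them from environment variables instead. The `SCD_DB_*` variable names and the `build_odbc_params` helper are assumptions made for this example, and the demo values are placeholders. The sketch also assumes the values contain no `;` or `}` characters, which would need extra quoting in an ODBC connection string.

```python
import os
import urllib.parse


def build_odbc_params(env=os.environ):
    """Build a URL-encoded ODBC connection string from environment variables."""
    # SCD_DB_* are hypothetical variable names chosen for this sketch.
    parts = {
        "DRIVER": "{ODBC Driver 18 for SQL Server}",
        "SERVER": env["SCD_DB_SERVER"],
        "DATABASE": env["SCD_DB_NAME"],
        "UID": env["SCD_DB_USER"],
        "PWD": env["SCD_DB_PASSWORD"],
        "TrustServerCertificate": "yes",
    }
    raw = ";".join(f"{name}={value}" for name, value in parts.items())
    # URL-encode the whole string so it can be embedded in a SQLAlchemy URL.
    return urllib.parse.quote_plus(raw)


# Placeholder values standing in for real environment variables.
demo_env = {
    "SCD_DB_SERVER": "localhost",
    "SCD_DB_NAME": "Python ETL",
    "SCD_DB_USER": "etl_user",
    "SCD_DB_PASSWORD": "secret",
}
params = build_odbc_params(demo_env)
url = f"mssql+pyodbc:///?odbc_connect={params}"  # this string can be passed to create_engine
print(url)
```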


In [71]:
existing_df=pd.read_sql('select * from customers_base',con=engine)
existing_df

Unnamed: 0,customerid,name,city,email,lastupdated
0,101,Tanuj,Hyderabad,rangatanuj@gmail.com,2025-01-20
1,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
2,103,John,Pune,john@gmail.com,2025-03-24
3,104,Smrithi,Mumbai,smrithi@gmail.com,2025-04-26
4,105,Chiru,Banglore,chiru@gmail.com,2025-05-28


In [None]:
incoming_df = pd.DataFrame([
    {"customerid": 102, "name": "Meenu", "city": "Hyderabad", "email": "meenu@gmail.com", "lastupdated": datetime(2025, 2, 22)},     # No change
    {"customerid": 104, "name": "Smrithi", "city": "Chennai", "email": "smrithi@gmail.com", "lastupdated": datetime(2025, 6, 23)},   # Changed city
    {"customerid": 101, "name": "Tanuj", "city": "Bangalore", "email": "tanuj.new@gmail.com", "lastupdated": datetime(2025, 6, 23)},  # Changed city and email
    {"customerid": 102, "name": "Meenu", "city": "Hyderabad", "email": "meenu@gmail.com", "lastupdated": datetime(2025, 2, 22)},     # No change
    {"customerid": 107, "name": "Ravi", "city": "Delhi", "email": "ravi@gmail.com", "lastupdated": datetime(2025, 6, 20)}            # New record
])
incoming_df

Unnamed: 0,customerid,name,city,email,lastupdated
0,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
1,104,Smrithi,Chennai,smrithi@gmail.com,2025-06-23
2,101,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
3,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
4,107,Ravi,Delhi,ravi@gmail.com,2025-06-20


**Drop duplicate rows from `incoming_df`, keeping the first row for each `customerid`**

In [73]:
incoming_df=incoming_df.drop_duplicates(subset=['customerid'])
incoming_df

Unnamed: 0,customerid,name,city,email,lastupdated
0,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
1,104,Smrithi,Chennai,smrithi@gmail.com,2025-06-23
2,101,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
4,107,Ravi,Delhi,ravi@gmail.com,2025-06-20
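
`drop_duplicates` keeps the first occurrence by default, which is fine here because the two rows for customer 102 are identical. If duplicate rows could carry different values, one option is to keep the most recent row per customer. The sketch below assumes `lastupdated` is a reliable recency indicator; the conflicting rows for customer 102 are made up for the demo.

```python
from datetime import datetime

import pandas as pd

# Made-up sample: customer 102 appears twice with different cities.
rows = pd.DataFrame([
    {"customerid": 102, "city": "Hyderabad", "lastupdated": datetime(2025, 2, 22)},
    {"customerid": 102, "city": "Pune", "lastupdated": datetime(2025, 6, 1)},
    {"customerid": 107, "city": "Delhi", "lastupdated": datetime(2025, 6, 20)},
])

latest = (
    rows.sort_values("lastupdated")                            # oldest first
        .drop_duplicates(subset=["customerid"], keep="last")   # keep the newest row per key
        .sort_values("customerid")
        .reset_index(drop=True)
)
print(latest)
```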


**Match incoming records to existing ones using an inner join on `customerid`**

In [74]:
merged_df=pd.merge(existing_df,incoming_df,how='inner',on='customerid',suffixes=('_old','_new'))
merged_df

Unnamed: 0,customerid,name_old,city_old,email_old,lastupdated_old,name_new,city_new,email_new,lastupdated_new
0,101,Tanuj,Hyderabad,rangatanuj@gmail.com,2025-01-20,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
1,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
2,104,Smrithi,Mumbai,smrithi@gmail.com,2025-04-26,Smrithi,Chennai,smrithi@gmail.com,2025-06-23


**Keep only the rows where at least one old value differs from its new value**

In [75]:
updates_df=merged_df[
    (merged_df['name_old']!=merged_df['name_new'])|
    (merged_df['city_old']!=merged_df['city_new'])|
    (merged_df['email_old']!=merged_df['email_new'])
]
updates_df

Unnamed: 0,customerid,name_old,city_old,email_old,lastupdated_old,name_new,city_new,email_new,lastupdated_new
0,101,Tanuj,Hyderabad,rangatanuj@gmail.com,2025-01-20,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
2,104,Smrithi,Mumbai,smrithi@gmail.com,2025-04-26,Smrithi,Chennai,smrithi@gmail.com,2025-06-23
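
The filter above names each column by hand and compares with `!=`. With more columns, or with missing values, a small loop may be easier to maintain. The sketch below is one way to do it: `find_changes` is a hypothetical helper, the sample frame is made up, and two missing values are treated as "no change", which is an assumption about the desired behaviour.

```python
import pandas as pd


def find_changes(merged, columns):
    """Return rows of `merged` where any `<col>_old` differs from `<col>_new`."""
    changed = pd.Series(False, index=merged.index)
    for col in columns:
        old, new = merged[f"{col}_old"], merged[f"{col}_new"]
        # Missing values never compare equal, so mask out pairs that are both missing.
        both_missing = old.isna() & new.isna()
        changed |= (old != new) & ~both_missing
    return merged[changed]


# Made-up sample: 101 changed, 102 unchanged, 103 missing on both sides.
merged = pd.DataFrame({
    "customerid": [101, 102, 103],
    "city_old": ["Hyderabad", "Hyderabad", None],
    "city_new": ["Bangalore", "Hyderabad", None],
})
changes = find_changes(merged, ["city"])
print(changes)
```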


1. **Select only the columns that are needed.**
2. **Rename the `_new` columns to match the existing table's column names, because the frames are concatenated at the end. `pd.concat` aligns on column names, so mismatched names would put the values in separate, partly empty columns instead of stacking them under the same headers.**
3. **Try commenting out the rename statement to see the difference.**

In [76]:
# Select the new values and rename them in one chained call. rename() returns a new
# frame, so no slice is modified in place and no SettingWithCopyWarning is raised.
only_updates_df=updates_df[['customerid','name_new','city_new','email_new','lastupdated_new']].rename(columns={
    'name_new':'name',
    'city_new':'city',
    'email_new':'email',
    'lastupdated_new':'lastupdated'
})
only_updates_df

Unnamed: 0,customerid,name,city,email,lastupdated
0,101,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
2,104,Smrithi,Chennai,smrithi@gmail.com,2025-06-23
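
To see why the rename matters, the following minimal sketch concatenates two tiny made-up frames with and without matching column names. The frame names are illustrative only.

```python
import pandas as pd

# Made-up frames: one uses "city", the other still has the "_new" suffix.
unchanged = pd.DataFrame({"customerid": [102], "city": ["Hyderabad"]})
updates = pd.DataFrame({"customerid": [104], "city_new": ["Chennai"]})

# Without the rename, concat keeps both columns and fills the gaps with NaN.
misaligned = pd.concat([unchanged, updates], ignore_index=True)

# With the rename, the values stack under a single "city" column.
aligned = pd.concat(
    [unchanged, updates.rename(columns={"city_new": "city"})],
    ignore_index=True,
)
print(list(misaligned.columns))
print(list(aligned.columns))
```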


**Get the records that are not present in the existing table. These new rows will be inserted at the end of the table.**

In [77]:
only_new_df=incoming_df[~incoming_df['customerid'].isin(existing_df['customerid'])]
only_new_df

Unnamed: 0,customerid,name,city,email,lastupdated
4,107,Ravi,Delhi,ravi@gmail.com,2025-06-20
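
The `isin` filter above is one way to find new records. Another option, sketched here with small made-up frames, is a left join with `indicator=True`, which labels each incoming row by where it was found. The sketch assumes `customerid` is unique in the existing table; otherwise the join would repeat rows.

```python
import pandas as pd

# Made-up frames for the sketch.
existing = pd.DataFrame({"customerid": [101, 102], "city": ["Hyderabad", "Hyderabad"]})
incoming = pd.DataFrame({"customerid": [101, 107], "city": ["Bangalore", "Delhi"]})

# Join only against the key column; `_merge` records whether a match was found.
flagged = incoming.merge(
    existing[["customerid"]], on="customerid", how="left", indicator=True
)
new_rows = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(new_rows)
```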


**Get the unchanged data: existing rows whose values have not changed and therefore need no update.**

In [78]:
unchanged_df=existing_df[~existing_df['customerid'].isin(updates_df['customerid'])]
unchanged_df

Unnamed: 0,customerid,name,city,email,lastupdated
1,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
2,103,John,Pune,john@gmail.com,2025-03-24
4,105,Chiru,Banglore,chiru@gmail.com,2025-05-28


**Now concatenate the rows of `unchanged_df` (no changes needed), `only_updates_df` (existing rows with updated values), and `only_new_df` (rows not present in the existing table).**

In [84]:
scd_1_df=pd.concat([unchanged_df,only_updates_df,only_new_df],axis=0,ignore_index=True)
# Assign the sorted result so scd_1_df itself stays ordered by customerid
scd_1_df=scd_1_df.sort_values(by='customerid').reset_index(drop=True)
scd_1_df

Unnamed: 0,customerid,name,city,email,lastupdated
0,101,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
1,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
2,103,John,Pune,john@gmail.com,2025-03-24
3,104,Smrithi,Chennai,smrithi@gmail.com,2025-06-23
4,105,Chiru,Banglore,chiru@gmail.com,2025-05-28
5,107,Ravi,Delhi,ravi@gmail.com,2025-06-20
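
Step 4 of the technique writes the result to a table named `customers_scd1`. A minimal sketch of that step with `DataFrame.to_sql` follows. To keep it self-contained, it uses an in-memory SQLite connection as a stand-in for the SQL Server engine and a small made-up frame in place of the real result; against the real database, the SQLAlchemy engine would be passed as the connection instead.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for the final SCD Type 1 frame.
scd_1 = pd.DataFrame({"customerid": [101, 107], "city": ["Bangalore", "Delhi"]})

# In-memory SQLite is an assumption for the demo, not the notebook's SQL Server.
conn = sqlite3.connect(":memory:")

# if_exists="replace" recreates the table on each run; index=False skips the row index.
scd_1.to_sql("customers_scd1", conn, if_exists="replace", index=False)

check = pd.read_sql("select * from customers_scd1 order by customerid", conn)
conn.close()
print(check)
```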


### Another Method:

In [87]:
merged_df

Unnamed: 0,customerid,name_old,city_old,email_old,lastupdated_old,name_new,city_new,email_new,lastupdated_new
0,101,Tanuj,Hyderabad,rangatanuj@gmail.com,2025-01-20,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
1,102,Meenu,Hyderabad,meenu@gmail.com,2025-02-22,Meenu,Hyderabad,meenu@gmail.com,2025-02-22
2,104,Smrithi,Mumbai,smrithi@gmail.com,2025-04-26,Smrithi,Chennai,smrithi@gmail.com,2025-06-23


In [89]:
changed_email=merged_df[merged_df['email_old']!=merged_df['email_new']]
changed_email

Unnamed: 0,customerid,name_old,city_old,email_old,lastupdated_old,name_new,city_new,email_new,lastupdated_new
0,101,Tanuj,Hyderabad,rangatanuj@gmail.com,2025-01-20,Tanuj,Bangalore,tanuj.new@gmail.com,2025-06-23
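
A more compact way to get an SCD Type 1 result, which may or may not be what this section was heading towards, is to index both frames by `customerid` and use `combine_first`: incoming values take priority and existing rows fill in whatever the incoming data does not cover. The frames below are made up for the sketch. One caveat: a missing value in the incoming data does not overwrite an existing value with this approach.

```python
import pandas as pd

# Made-up frames for the sketch.
existing = pd.DataFrame({
    "customerid": [101, 102, 103],
    "city": ["Hyderabad", "Hyderabad", "Pune"],
})
incoming = pd.DataFrame({"customerid": [101, 107], "city": ["Bangalore", "Delhi"]})

result = (
    incoming.set_index("customerid")
            .combine_first(existing.set_index("customerid"))  # incoming wins, existing fills gaps
            .sort_index()
            .reset_index()
)
print(result)
```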


RangeIndex(start=-1, stop=4, step=1)
