## What is Data Integration?

Data integration is the process of combining data from multiple sources (databases, files, APIs, and other systems) into a single, consistent dataset for analysis or modeling.

**1. Merging Datasets**

In [2]:
import pandas as pd

# Customers dataset
customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Transactions dataset
transactions = pd.DataFrame({
    'transaction_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 104],
    'amount': [250, 450, 300]
})

# Merge on 'customer_id'
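# An inner join keeps only customer_ids present in both tables,
# so transaction 1003 (customer 104) and customer 103 are dropped
merged = pd.merge(transactions, customers, on='customer_id', how='inner')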
print("🔗 Inner Join:")
print(merged)

🔗 Inner Join:
   transaction_id  customer_id  amount   name
0            1001          101     250  Alice
1            1002          102     450    Bob
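
The inner join above silently drops transaction 1003 and customer 103. A minimal sketch of how the unmatched rows could be surfaced, reusing the same two tables: an outer join with `indicator=True` labels each row's origin, and `validate='many_to_one'` raises an error if `customer_id` is not unique in `customers`. The names `audit`, `orphan_transactions`, and `inactive_customers` are illustrative, not part of the original notebook.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

transactions = pd.DataFrame({
    'transaction_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 104],
    'amount': [250, 450, 300]
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
# validate='many_to_one' checks that each customer_id appears once in customers.
audit = pd.merge(
    transactions, customers,
    on='customer_id', how='outer',
    indicator=True, validate='many_to_one'
)

# Transactions whose customer_id has no match in customers
orphan_transactions = audit[audit['_merge'] == 'left_only']

# Customers with no transactions at all
inactive_customers = audit[audit['_merge'] == 'right_only']

print(orphan_transactions[['transaction_id', 'customer_id', 'amount']])
print(inactive_customers[['customer_id', 'name']])
```

Checking these two subsets before settling on `how='inner'` makes it an explicit decision which rows are discarded.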


**2. Concatenating Datasets**

In [3]:
# January sales
jan_sales = pd.DataFrame({
    'sale_id': [1, 2],
    'amount': [100, 200]
})

# February sales
feb_sales = pd.DataFrame({
    'sale_id': [3, 4],
    'amount': [150, 180]
})

# Combine vertically
all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True)
print("\n📦 Combined Sales Data:")
print(all_sales)


📦 Combined Sales Data:
   sale_id  amount
0        1     100
1        2     200
2        3     150
3        4     180
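
After concatenation, nothing in `all_sales` records which month a row came from. One possible way to keep that provenance is to tag each frame before combining. The `month` column and its labels `'jan'` and `'feb'` are assumptions added for illustration.

```python
import pandas as pd

jan_sales = pd.DataFrame({'sale_id': [1, 2], 'amount': [100, 200]})
feb_sales = pd.DataFrame({'sale_id': [3, 4], 'amount': [150, 180]})

# Tag each source with a (hypothetical) 'month' label before stacking,
# so the origin of every row survives the concatenation
all_sales = pd.concat(
    [jan_sales.assign(month='jan'), feb_sales.assign(month='feb')],
    ignore_index=True
)

# The label can then drive per-source summaries
monthly_totals = all_sales.groupby('month')['amount'].sum()

print(all_sales)
print(monthly_totals)
```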


**3. Deduplicating After Integration**

In [4]:
dup_data = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': [100, 200, 200, 300]
})

# Drop rows that are exact duplicates across all columns (the first occurrence is kept)
cleaned = dup_data.drop_duplicates()
print("\n🧹 Deduplicated Data:")
print(cleaned)


🧹 Deduplicated Data:
   id  value
0   1    100
1   2    200
3   3    300
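
`drop_duplicates()` with no arguments only removes rows that match in every column. Integrated data often contains the same key with conflicting values, which that call leaves untouched. The sketch below uses a hypothetical variant of the data in which `id` 2 appears with two different values, and assumes the later row should win.

```python
import pandas as pd

# Hypothetical data: id 2 appears twice with different values
records = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': [100, 200, 250, 300]
})

# Exact-row deduplication removes nothing, since no two rows are identical
exact = records.drop_duplicates()

# keep=False flags every row that shares an id, for inspection
conflicts = records[records.duplicated(subset='id', keep=False)]

# Deduplicate on the key only, keeping the last row per id (an assumed rule)
latest = records.drop_duplicates(subset='id', keep='last').reset_index(drop=True)

print(conflicts)
print(latest)
```

Which row to keep (`'first'`, `'last'`, or an aggregate) depends on the data source; `keep='last'` here is only one reasonable choice.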


**4. Handling Schema Mismatches**

In [5]:
# Dataset A
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Score': [85, 90]
})

# Dataset B with different column names
df2 = pd.DataFrame({
    'name': ['Charlie', 'David'],
    'score': [95, 80]
})

# Normalize schema: lowercase the column names so both DataFrames share the same columns
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()

# Combine
combined_df = pd.concat([df1, df2], ignore_index=True)
print("\n🧾 Combined with Schema Normalization:")
print(combined_df)


🧾 Combined with Schema Normalization:
      name  score
0    Alice     85
1      Bob     90
2  Charlie     95
3    David     80
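
Lowercasing only fixes mismatches in letter case. When sources use different column names or types, an explicit mapping is one option. The following sketch assumes a second source with hypothetical columns `full_name` and `points` (stored as strings); `COLUMN_MAP` and `normalize` are illustrative helpers, not part of the original notebook.

```python
import pandas as pd

df_a = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})

# Hypothetical source with different column names and string-typed scores
df_b = pd.DataFrame({'full_name': ['Charlie', 'David'], 'points': ['95', '80']})

# Maps source-specific column names to the target schema (assumed names)
COLUMN_MAP = {'full_name': 'name', 'points': 'score'}

def normalize(df):
    """Return a copy of df with columns 'name' and a numeric 'score'."""
    out = df.rename(columns=str.lower).rename(columns=COLUMN_MAP)
    out['score'] = pd.to_numeric(out['score'])  # unify the dtype across sources
    return out[['name', 'score']]

combined = pd.concat([normalize(df_a), normalize(df_b)], ignore_index=True)
print(combined)
print(combined.dtypes)
```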


**5. Data Consolidation**

In [7]:
customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

transactions = pd.DataFrame({
    'transaction_id': [201, 202, 203],
    'customer_id': [101, 102, 104],  # Note: 104 doesn't exist in customers
    'amount': [250, 400, 100]
})

# A left join keeps every transaction; customers not found get NaN for 'name'
merged_data = pd.merge(transactions, customers, on='customer_id', how='left')

print("🔗 Data Integration Result:")
print(merged_data)

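# dropna=True (the default) excludes rows whose 'name' is NaN,
# so transaction 203 is left out of the totals
consolidated = merged_data.groupby('name', dropna=True)['amount'].sum().reset_index()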

print("\n📦 Data Consolidation Result:")
print(consolidated)

🔗 Data Integration Result:
   transaction_id  customer_id  amount   name
0             201          101     250  Alice
1             202          102     400    Bob
2             203          104     100    NaN

📦 Data Consolidation Result:
    name  amount
0  Alice     250
1    Bob     400
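
In the result above, the 100 from transaction 203 disappears from the consolidated totals because its `name` is NaN. If every transaction should be accounted for, one possible approach is to group by `customer_id` and label the missing names. The `'Unknown'` placeholder is an assumption made for this sketch.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

transactions = pd.DataFrame({
    'transaction_id': [201, 202, 203],
    'customer_id': [101, 102, 104],
    'amount': [250, 400, 100]
})

merged_data = pd.merge(transactions, customers, on='customer_id', how='left')

# Label unmatched customers instead of letting groupby drop them
# ('Unknown' is a hypothetical placeholder)
merged_data['name'] = merged_data['name'].fillna('Unknown')

# Grouping by the key as well avoids merging two customers who share a name
consolidated_all = (
    merged_data
    .groupby(['customer_id', 'name'], as_index=False)['amount']
    .sum()
)

print(consolidated_all)
```

With this version the consolidated amounts add up to the total of all transactions, which gives a simple reconciliation check.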
