# Tutorial: Data Integration from Facebook Lead Forms

This notebook will guide you through the process of fetching data from Facebook Lead Gen Forms and preparing it for analysis or database storage. 

### What you will learn:
1. How to set up your environment credentials.
2. **How our data is structured (using Pydantic models).**
3. How to connect to the Facebook Business API.
4. How to fetch leads from specific forms within a time range.
5. **How to inspect the raw data that Facebook returns.**
6. How to clean and map the data into a standard format.

## Step 1: Import Required Libraries
First, we need to import the necessary Python tools. We use `facebook_business` to talk to Facebook and `pydantic` to define our data structure.

In [None]:
import os
import logging
import pandas as pd
import json
from typing import List, Optional
from datetime import datetime, timedelta, timezone
from dotenv import load_dotenv
from pydantic import BaseModel, Field

# Facebook SDK components
from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.leadgenform import LeadgenForm

# For BigQuery (database storage; imported here but not used in the cells below)
from google.cloud import bigquery

# Set up logging to see what's happening
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## Step 2: Understanding the Data Structure (Pydantic)
Before we fetch any data, let's define what we want it to look like. We use **Pydantic models** to define the exact "shape" of a lead form submission. This makes the code more reliable and makes it easier for non-technical readers to see which information we keep.

In [None]:
class FormDataItem(BaseModel):
    """Represents a single answer in a form (e.g., Name: John)"""
    key: Optional[str] = Field(None, description="The field name, like 'phone_number'")
    value: Optional[str] = Field(None, description="The actual answer from the user")

class FormSubmission(BaseModel):
    """Represents the entire lead submission"""
    form_submission_id: str = Field(..., description="Unique ID for this lead (starts with l:)")
    form_type: str = Field("fb_lead_gen", description="Source type (always 'fb_lead_gen' here)")
    timestamp: Optional[str] = Field(None, description="Date and time the lead was created")
    ad_id: Optional[str] = Field(None, description="The ID of the Facebook Ad that generated this lead")
    data: List[FormDataItem] = Field(default_factory=list, description="The list of answers from the form")

# Example of how it looks:
example_lead = FormSubmission(
    form_submission_id="l:12345",
    timestamp="2024-02-09 12:00:00",
    data=[
        FormDataItem(key="first_name", value="Tester"),
        FormDataItem(key="car_model", value="X9")
    ]
)
print("STRUCTURE PREVIEW:")
print(example_lead.model_dump_json(indent=2))

## Step 3: Configuration and Environment Variables
To access Facebook, you need an access **token** (`FB_TOKEN`) and the **IDs** of the forms you want to check (`FB_LEAD_IDS`, a comma-separated list).

In [None]:
# Load environment variables from .env file if it exists
load_dotenv()

access_token = os.environ.get("FB_TOKEN")
form_ids_str = os.environ.get("FB_LEAD_IDS")
bank = os.environ.get("BANK")

if not access_token or not form_ids_str:
    print("Error: FB_TOKEN or FB_LEAD_IDS not found in environment!")
else:
    print("Credentials loaded successfully.")
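
`FB_LEAD_IDS` holds several form IDs in one comma-separated string, so a stray space or a trailing comma can end up inside an ID. The following is a minimal sketch of a more forgiving parser. `parse_form_ids` is a hypothetical helper name, not part of this notebook or the Facebook SDK, and the IDs are made up.

```python
from typing import List, Optional


def parse_form_ids(raw: Optional[str]) -> List[str]:
    """Split a comma-separated string of IDs, dropping blanks and spaces.

    Hypothetical helper for illustration only.
    """
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


# Made-up IDs with untidy spacing and a trailing comma
print(parse_form_ids(" 111, 222 ,333,"))
print(parse_form_ids(None))
```

Returning an empty list for a missing variable lets later loops simply do nothing instead of failing.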

## Step 4: Initialize the Facebook API
We tell the Facebook SDK which token to use for all subsequent requests.

In [None]:
if access_token:
    FacebookAdsApi.init(access_token=access_token)
    print("Facebook API Initialized.")

## Step 5: Define Search Parameters
We search for leads created in the last 15 days and request up to 10 leads per page.

In [None]:
# Read the clock once so both ends of the window share the same reference point
now = datetime.now().replace(second=0, microsecond=0)
start_time = (now - timedelta(days=15)).timestamp()
end_time = now.timestamp()

fields = ['created_time', 'field_data', 'ad_id', 'form_id']
params = {
    'limit': 10,
    'filtering': [
        {'field': 'time_created', 'operator': 'GREATER_THAN', 'value': int(start_time)},
        {'field': 'time_created', 'operator': 'LESS_THAN', 'value': int(end_time)}
    ]
}
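
`datetime.now()` returns a naive local time, and `.timestamp()` interprets it in the machine's time zone. As a sketch of an alternative, the same window can be built from an explicit, timezone-aware UTC time and wrapped in a function so that "now" can be fixed for testing. `build_time_filter` is a hypothetical helper; the `field` and `operator` values are copied from the `params` above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Union


def build_time_filter(
    days: int, now: Optional[datetime] = None
) -> List[Dict[str, Union[str, int]]]:
    """Return a 'filtering' list covering the last `days` days.

    Hypothetical helper for illustration; `now` can be passed in to make
    the result reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Round down to the whole minute, as the cell above does
    end = now.replace(second=0, microsecond=0)
    start = end - timedelta(days=days)
    return [
        {"field": "time_created", "operator": "GREATER_THAN", "value": int(start.timestamp())},
        {"field": "time_created", "operator": "LESS_THAN", "value": int(end.timestamp())},
    ]


# Fixed reference time so the printed values are reproducible
fixed_now = datetime(2024, 2, 9, 12, 0, 30, tzinfo=timezone.utc)
time_filter = build_time_filter(15, now=fixed_now)
print(time_filter)
```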

## Step 5.1: Try Fetching Raw Form Data
Before we clean the data, let's see what Facebook gives us. This is helpful to understand the 'native' format of the leads.

In [None]:
if form_ids_str:
    # Take the first ID and trim any surrounding whitespace
    first_form_id = form_ids_str.split(",")[0].strip()
    print(f"Fetching 1 example lead from Form: {first_form_id}...")
    try:
        # We set limit to 1 just to see the structure
        raw_params = params.copy()
        raw_params['limit'] = 1
        
        leads = LeadgenForm(first_form_id).get_leads(fields=fields, params=raw_params)
        
        if leads:
            first_lead = leads[0]
            print("\n--- RAW DATA FROM FACEBOOK ---")
            # We convert the object to a dictionary for pretty printing
            print(json.dumps(dict(first_lead), indent=2))
        else:
            print("No leads found in this time range.")
    except Exception as e:
        print(f"Error fetching raw data: {e}")
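
For orientation, the later cells read each lead's `id`, `created_time`, `ad_id`, and a `field_data` list whose entries each have a `name` and a list of `values`. The dictionary below is a hand-written illustration of that shape with made-up values, not real output from Facebook. It shows how `field_data` can be read into a simple name-to-answer dictionary.

```python
# Hand-written sample in the shape the later cells expect.
# All IDs and answers are made up for illustration.
sample_lead = {
    "id": "12345",
    "created_time": "2024-02-09T05:00:00+0000",
    "ad_id": "67890",
    "form_id": "11111",
    "field_data": [
        {"name": "first_name", "values": ["Tester"]},
        {"name": "phone", "values": ["+66-81-234-5678"]},
        {"name": "comment", "values": []},
    ],
}

# Take the first value of each field, or "" when the list is empty
answers = {
    field["name"]: (field["values"][0] if field["values"] else "")
    for field in sample_lead["field_data"]
}
print(answers)
```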

## Step 6: Fetch and Process (Mapping & Cleaning)
Now we apply the `form_mapping` rules to turn the messy raw data above into the clean `FormSubmission` structure we defined in Step 2.

In [None]:
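# Import our mapping configuration from the src package in the parent directory
import sys
sys.path.append('..')
try:
    from src.form_mapping import form_mapping
except ImportError:
    # Without the mapping module, field names are kept exactly as Facebook sends them
    form_mapping = {}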

all_cleaned_leads = []

if form_ids_str:
    for form_id in (fid.strip() for fid in form_ids_str.split(",") if fid.strip()):
        try:
            leads = LeadgenForm(form_id).get_leads(fields=fields, params=params)
            
            for lead in leads:
                # 1. Parse the creation time and convert it to Bangkok time (UTC+7)
                dt_utc = datetime.strptime(lead['created_time'], "%Y-%m-%dT%H:%M:%S%z")
                bangkok_time = dt_utc.astimezone(timezone(timedelta(hours=7))).strftime("%Y-%m-%d %H:%M:%S")
                
                data_items = []
                
                # 2. Process answers
                for f in lead.get('field_data', []):
                    key = f['name']
                    value = f['values'][0] if f['values'] else ""

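                    # Normalize Thai phone numbers: "+66" becomes "0" and dashes are removed
                    if key == 'phone':
                        value = value.replace("+66", "0").replace("-", "")
                    # Branch codes use underscores instead of dashes
                    if key == 'target_branch':
                        value = value.replace("-", "_")

                    # Rename the field using this form's mapping; keep the original name if none is defined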
                    key = form_mapping.get(form_id, {}).get("clean", {}).get(key, key)
                    data_items.append(FormDataItem(key=key, value=value))

                if form_id == '2214442656030450':
                    data_items.append(FormDataItem(key='car_model', value='x9'))

                # 3. Create the VALIDATED object
                submission = FormSubmission(
                    form_submission_id=f"l:{lead['id']}",
                    timestamp=bangkok_time,
                    ad_id=lead.get('ad_id'),
                    data=data_items
                )
                
                all_cleaned_leads.append(submission.model_dump())
            
            print(f"Processed form {form_id}")
        except Exception as e:
            print(f"Error in form {form_id}: {e}")

print(f"\nTotal leads: {len(all_cleaned_leads)}")
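
The per-field rules in the loop above can also be written as small pure functions, which makes them easy to check without calling Facebook. This is a sketch of the same rules, not a replacement for the cell. `to_bangkok_time` and `clean_answer` are hypothetical names, and the `mapping` argument stands in for the per-form `"clean"` dictionary, whose real contents are not shown in this notebook.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

BANGKOK = timezone(timedelta(hours=7))


def to_bangkok_time(created_time: str) -> str:
    """Convert a timestamp like '2024-02-09T05:00:00+0000' to Bangkok time."""
    parsed = datetime.strptime(created_time, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.astimezone(BANGKOK).strftime("%Y-%m-%d %H:%M:%S")


def clean_answer(key: str, value: str, mapping: Dict[str, str]) -> Tuple[str, str]:
    """Apply the value rules, then rename the key if the mapping has an entry."""
    if key == "phone":
        value = value.replace("+66", "0").replace("-", "")
    if key == "target_branch":
        value = value.replace("-", "_")
    return mapping.get(key, key), value


# Made-up mapping and answers for illustration
example_mapping = {"phone": "phone_number"}
print(to_bangkok_time("2024-02-09T05:00:00+0000"))
print(clean_answer("phone", "+66-81-234-5678", example_mapping))
print(clean_answer("target_branch", "bkk-01", example_mapping))
```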

## Step 7: Final Display
The cleaned leads can now be shown as a table. Each row is one submission, and the `data` column holds that submission's list of answers.

In [None]:
if all_cleaned_leads:
    display(pd.DataFrame(all_cleaned_leads).head())
else:
    print("No data to show.")
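
In the table above, each row's `data` cell still holds a list of key/value pairs. If one column per answer is more convenient, the list can be flattened first. This is a minimal sketch, assuming records shaped like the output of `model_dump()` in Step 6. `flatten_submission` is a hypothetical helper and the sample record is made up.

```python
from typing import Any, Dict, List


def flatten_submission(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn the nested 'data' list into top-level columns.

    Hypothetical helper for illustration. If a key repeats, the last
    answer wins.
    """
    flat = {k: v for k, v in record.items() if k != "data"}
    for item in record.get("data", []):
        flat[item["key"]] = item["value"]
    return flat


# Made-up record in the shape produced by FormSubmission.model_dump()
sample_record = {
    "form_submission_id": "l:12345",
    "form_type": "fb_lead_gen",
    "timestamp": "2024-02-09 12:00:00",
    "ad_id": None,
    "data": [
        {"key": "first_name", "value": "Tester"},
        {"key": "car_model", "value": "X9"},
    ],
}

rows: List[Dict[str, Any]] = [flatten_submission(sample_record)]
print(rows[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `all_cleaned_leads`.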