## Phase 2: Gathering Data

### What Data Is Needed?

Once the problem is framed, identify all relevant data sources.

### Data Sources

**Internal Sources**

1. **Transactional Data**
   - Database tables with customer transactions
   - Tools: SQL, PostgreSQL, MySQL, BigQuery, Snowflake
   
   ```sql
   -- MySQL/BigQuery date syntax; PostgreSQL would use CURRENT_DATE - INTERVAL '24 months'
   SELECT customer_id, transaction_amount, transaction_date, product_category
   FROM transactions
   WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH);
   ```

2. **Customer Data**
   - CRM systems (Salesforce, HubSpot)
   - Contains: demographics, contact info, communication history
   
3. **Log Data**
   - Website/app activity logs
   - Contains: page views, clicks, session duration
   - Tools: Apache Kafka, Splunk, CloudWatch
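
As a rough illustration of how raw activity logs become the signals listed above (page views, clicks, session duration), the sketch below aggregates a small, made-up event table with pandas. The column names (`session_id`, `event_type`, `timestamp`) and the `summarize_sessions` helper are assumptions for illustration, not the schema of any particular logging tool.

```python
import pandas as pd

# Hypothetical event log; real events would be exported from a log pipeline.
events = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2"],
    "event_type": ["page_view", "click", "page_view", "page_view", "click"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00",
        "2024-01-01 10:01:30",
        "2024-01-01 10:05:00",
        "2024-01-01 11:00:00",
        "2024-01-01 11:00:45",
    ]),
})


def summarize_sessions(events: pd.DataFrame) -> pd.DataFrame:
    """Return one row per session: page views, clicks, duration in seconds."""
    flagged = events.assign(
        is_page_view=events["event_type"].eq("page_view").astype(int),
        is_click=events["event_type"].eq("click").astype(int),
    )
    summary = flagged.groupby("session_id").agg(
        page_views=("is_page_view", "sum"),
        clicks=("is_click", "sum"),
        session_start=("timestamp", "min"),
        session_end=("timestamp", "max"),
    ).reset_index()
    # Duration is the gap between the first and last event in the session
    summary["duration_seconds"] = (
        summary["session_end"] - summary["session_start"]
    ).dt.total_seconds()
    return summary


session_summary = summarize_sessions(events)
print(session_summary[["session_id", "page_views", "clicks", "duration_seconds"]])
```

Measuring duration from first to last event undercounts single-event sessions (they come out as zero seconds), so a production pipeline would likely need a more careful definition.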

**External Sources**

1. **Public Datasets**
   - Kaggle, UCI Machine Learning Repository, OpenML
   - Free, but usually curated for learning or competitions rather than production use
   
2. **APIs**
   - Weather data (OpenWeatherMap)
   - Financial data (Alpha Vantage, IEX Cloud)
   - Market data (Yahoo Finance)
   
3. **Web Scraping** (with legal/ethical approval)
   - Tools: BeautifulSoup, Selenium, Scrapy
   - Must respect ToS and robots.txt
   
4. **Third-party Data Providers**
   - Demographic data, credit scores, market data
   - Providers: Experian, Equifax, Bloomberg
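
Respecting `robots.txt` before scraping can be checked programmatically. The sketch below uses Python's standard-library `urllib.robotparser` against an inline, made-up `robots.txt`; the `is_allowed` helper, the user agent name, and the `example.com` URLs are hypothetical. It does not cover a site's Terms of Service, which still need to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Against a live site you would instead call
# parser.set_url(".../robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-research-bot", "https://example.com/products"))
print(is_allowed(ROBOTS_TXT, "my-research-bot", "https://example.com/private/data"))
```

With these rules the first URL is permitted and the second is blocked, so a scraper would skip anything under `/private/`.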

### Data Requirements Checklist

```
Data requirements checklist:

[✓] Does the data contain the prediction TARGET?
    Example: Churn status (yes/no) for historical customers

[✓] Does the data cover sufficient time period?
    Example: At least 12-24 months for seasonality patterns

[✓] Is the volume adequate?
    Example: As a rule of thumb, >1,000 positive examples for binary classification

[✓] Are there privacy/compliance concerns?
    Example: PII (personally identifiable information), GDPR, HIPAA

[✓] Is the data quality acceptable?
    Example: <10% missing values, no obvious errors

[✓] Will the data be available at prediction time?
    Example: Avoid features recorded after the prediction date (data leakage)
```
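
Some of the checklist items can be checked automatically once the data is in a DataFrame. The following is a minimal sketch, assuming a pandas DataFrame with a binary target column and a date column; the function name `check_data_requirements` and the toy data are hypothetical, and the default thresholds simply mirror the rule-of-thumb examples in the checklist. Privacy/compliance and prediction-time availability are not covered here because they need human review.

```python
import pandas as pd


def check_data_requirements(df, target_col, date_col,
                            min_months=12, min_positives=1000, max_missing=0.10):
    """Return pass/fail flags for a few of the checklist items."""
    results = {}
    # Target: the label column must exist
    results["has_target"] = target_col in df.columns
    # Time coverage: approximate months between earliest and latest date
    span_days = (df[date_col].max() - df[date_col].min()).days
    results["covers_time_period"] = span_days / 30.44 >= min_months
    # Volume: count of positive (1) labels
    results["enough_positives"] = (
        results["has_target"] and int(df[target_col].sum()) >= min_positives
    )
    # Quality: worst per-column share of missing values
    results["missing_ok"] = bool(df.isna().mean().max() <= max_missing)
    return results


# Toy data, so the thresholds are lowered for the demonstration
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "snapshot_date": pd.to_datetime(
        ["2023-01-01", "2023-06-01", "2023-12-01", "2024-02-01"]
    ),
    "age": [34, None, 51, 29],
    "churned": [0, 1, 0, 1],
})

report = check_data_requirements(
    toy, "churned", "snapshot_date",
    min_months=12, min_positives=2, max_missing=0.30,
)
print(report)
```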

### Example: Gathering Data for Churn Prediction


```python
import pandas as pd
import numpy as np

# `connection` is assumed to be an open database connection (for example a
# SQLAlchemy engine or DB-API connection) created earlier; it is not defined here.

# Query 1: Get customer transactions (last 24 months)
query_transactions = """
SELECT 
    customer_id,
    SUM(transaction_amount) as total_spent,
    COUNT(*) as transaction_count,
    MAX(transaction_date) as last_transaction_date,
    MIN(transaction_date) as first_transaction_date
FROM transactions
WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)
GROUP BY customer_id
"""

# Query 2: Get customer demographics
query_demographics = """
SELECT 
    customer_id,
    age,
    gender,
    account_creation_date,
    account_type,
    country
FROM customers
"""

# Query 3: Get support tickets (engagement signal)
query_support = """
SELECT 
    customer_id,
    COUNT(*) as support_tickets,
    MAX(ticket_date) as last_support_date
FROM support_tickets
WHERE ticket_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)
GROUP BY customer_id
"""

# Query 4: Get target variable (churn = no purchase in the last 30 days)
# Note: this label is computed as of today, the same point in time as the
# features above. For a real model, compute features up to a cutoff date and
# derive the label from the period after it, so no feature leaks the outcome.
query_churn = """
SELECT 
    customer_id,
    CASE WHEN last_purchase_date < DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 
         THEN 1 ELSE 0 END as churned
FROM customers
"""

# Run each query and load the result into a DataFrame
transactions_df = pd.read_sql(query_transactions, connection)
demographics_df = pd.read_sql(query_demographics, connection)
support_df = pd.read_sql(query_support, connection)
churn_df = pd.read_sql(query_churn, connection)

# Merge all tables (inner joins drop customers with no transactions in the window)
raw_data = transactions_df.merge(demographics_df, on='customer_id')
raw_data = raw_data.merge(support_df, on='customer_id', how='left')
raw_data = raw_data.merge(churn_df, on='customer_id')

# Customers with no tickets are NaN after the left join; count them as zero
raw_data['support_tickets'] = raw_data['support_tickets'].fillna(0)

print(f"Dataset shape: {raw_data.shape}")
print(f"Churn rate: {raw_data['churned'].mean():.2%}")
```
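
Merges like the ones above can silently duplicate or drop customers when keys repeat or are missing. As a hedged sketch on made-up data (the frames and values below are illustrative, not the tutorial's tables), pandas' `merge` accepts `validate` and `indicator` arguments that make these problems visible:

```python
import pandas as pd

# Toy stand-ins for the per-customer aggregates
transactions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spent": [120.0, 45.5, 300.0],
})
support = pd.DataFrame({
    "customer_id": [1, 3],
    "support_tickets": [2, 1],
})

merged = transactions.merge(
    support,
    on="customer_id",
    how="left",
    validate="one_to_one",  # raises MergeError if customer_id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# A missing match means the customer filed no tickets, so fill with zero
merged["support_tickets"] = merged["support_tickets"].fillna(0).astype(int)
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print(f"Customers without support tickets: {unmatched}")
```

Comparing the row count before and after each merge is a cheap additional check that no customers were lost or duplicated.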


### Tools Used in Data Collection

| Tool | Purpose | Use Case |
|------|---------|----------|
| SQL / BigQuery | Query databases | Extract transactional data |
| Apache Spark | Large-scale data processing | Process 100GB+ datasets |
| Pandas | Data manipulation in Python | Small-medium datasets |
| Kafka | Stream data collection | Real-time data pipelines |
| Scrapy / BeautifulSoup | Web scraping | Collect web data |
| APIs | Access third-party data | Weather, financial data |
| AWS S3 / GCS | Cloud storage | Store raw data files |

---
