
# 1. Metric Schema (Columns + Formulas)

Think of your dataset as **entity-centric** (one row = one business).

---

## Core Entity Columns (Raw)

| Column            | Description                          |
| ----------------- | ------------------------------------ |
| `business_id`     | Generated UUID                       |
| `business_name`   | Raw name                             |
| `category`        | Normalized category                  |
| `latitude`        | Geo                                  |
| `longitude`       | Geo                                  |
| `address_raw`     | Original address                     |
| `phone`           | Contact                              |
| `source_list`     | Sources found (Maps, Justdial, etc.) |
| `first_seen_date` | Earliest signal                      |
| `last_seen_date`  | Latest signal                        |

---

## Derived Metric Columns (IMPORTANT)

### 1. Existence & Confidence

```text
source_count = len(source_list)

contact_score =
  +1 if phone exists
  +1 if phone appears in ≥2 sources

address_confidence =
  fuzzy_match_score(addresses) / 100
```
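A minimal Python sketch of these three columns follows. It rests on a few assumptions that are not in the schema above: `phones_by_source` is a hypothetical mapping from source name to the phone string found there, `source_count` deduplicates the list before counting, and the standard library's `difflib.SequenceMatcher` stands in for `fuzzy_match_score`. Because `SequenceMatcher.ratio()` already returns a value in [0, 1], the division by 100 is not needed here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def source_count(source_list):
    """Number of distinct sources that list the business."""
    return len(set(source_list))


def contact_score(phones_by_source):
    """0, 1, or 2: +1 if any phone exists, +1 if one number appears in >= 2 sources.

    phones_by_source is a hypothetical dict: source name -> phone string or None.
    """
    phones = [p for p in phones_by_source.values() if p]
    if not phones:
        return 0
    most_common = max(phones.count(p) for p in set(phones))
    return 1 + (1 if most_common >= 2 else 0)


def address_confidence(addresses):
    """Mean pairwise similarity of the address strings, in [0, 1]."""
    cleaned = [a.lower().strip() for a in addresses if a]
    if len(cleaned) < 2:
        return 0.0  # assumption: a single address cannot be cross-checked
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(cleaned, 2)]
    return sum(ratios) / len(ratios)
```

In practice, phone numbers should be normalized (country code, spacing) before comparison, otherwise the same number can look like two different ones.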

---

### 2. Digital Visibility Metrics

| Metric                    | Formula                                 |
| ------------------------- | --------------------------------------- |
| `platform_coverage_score` | (# platforms present / total platforms) |
| `review_count`            | Total reviews                           |
| `review_velocity`         | reviews_last_30_days / 30               |
| `photo_density`           | photos / years_active                   |

```text
digital_visibility_score =
  0.4 * platform_coverage_score +
  0.3 * log(review_count + 1) +
  0.2 * review_velocity +
  0.1 * photo_density
```
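As written, `log(review_count + 1)`, `review_velocity`, and `photo_density` are unbounded, so they can swamp the coverage term. One possible way to keep the score in [0, 1] is to cap and rescale each input first. The sketch below does that; the `max_*` caps and the handling of very young listings are illustrative assumptions, not values from the schema.

```python
import math


def digital_visibility_score(platforms_present, total_platforms, review_count,
                             reviews_last_30_days, photos, years_active,
                             max_reviews=1000, max_velocity=1.0,
                             max_photo_density=50.0):
    """Weighted visibility score in [0, 1]; the max_* caps are assumptions."""
    coverage = platforms_present / total_platforms
    # Log-scale the review count, then divide by the log of the cap.
    reviews_norm = min(math.log(review_count + 1) / math.log(max_reviews + 1), 1.0)
    # review_velocity = reviews per day over the last 30 days.
    velocity_norm = min((reviews_last_30_days / 30) / max_velocity, 1.0)
    # Treat listings younger than one year as one year old to avoid dividing by zero.
    photo_density = photos / max(years_active, 1.0)
    photo_norm = min(photo_density / max_photo_density, 1.0)
    return (0.4 * coverage + 0.3 * reviews_norm
            + 0.2 * velocity_norm + 0.1 * photo_norm)
```

The caps are best chosen from your own data, for example a high percentile of each column, so that a handful of outliers does not flatten everyone else.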

---

### 3. Activity & Demand Signals

| Metric                | Formula                           |
| --------------------- | --------------------------------- |
| `recency_score`       | exp(-days_since_last_review / 30) |
| `owner_response_rate` | replies / total_reviews           |
| `hours_consistency`   | open_days / 7                     |

```text
activity_score =
  0.5 * recency_score +
  0.3 * owner_response_rate +
  0.2 * hours_consistency
```
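The formula translates almost directly into Python. The only assumption added in this sketch is that a business with zero reviews gets an `owner_response_rate` of 0 rather than raising a division error.

```python
import math


def activity_score(days_since_last_review, replies, total_reviews, open_days):
    """Weighted activity score in [0, 1] from the three signals in the table."""
    # Exponential decay: 1.0 for a review today, about 0.37 after 30 days.
    recency_score = math.exp(-days_since_last_review / 30)
    # Assumption: no reviews means no measurable response rate, so use 0.
    owner_response_rate = replies / total_reviews if total_reviews else 0.0
    hours_consistency = open_days / 7
    return (0.5 * recency_score
            + 0.3 * owner_response_rate
            + 0.2 * hours_consistency)
```

Because all three inputs already lie in [0, 1] and the weights sum to 1, this score needs no further normalization.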

---

### 4. Market & Spatial Metrics

| Metric                     | Formula                               |
| -------------------------- | ------------------------------------- |
| `business_density`         | businesses / km²                      |
| `competition_radius`       | avg distance to 5 nearest competitors |
| `review_per_business_area` | area_reviews / area_businesses        |
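
A dependency-free sketch of `competition_radius` and `business_density` is shown below. It assumes coordinates are `(latitude, longitude)` tuples in degrees and uses the haversine great-circle distance with a mean Earth radius of 6371 km. The brute-force sort is fine for a few thousand points; a spatial index would be the usual choice beyond that.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def competition_radius(business, competitors, k=5):
    """Average distance in km to the k nearest competitors, or None if there are none."""
    distances = sorted(haversine_km(*business, *c) for c in competitors)
    nearest = distances[:k]
    return sum(nearest) / len(nearest) if nearest else None


def business_density(n_businesses, area_km2):
    """Businesses per square kilometre."""
    return n_businesses / area_km2
```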

---

### Final Business Confidence Score (MOST IMPORTANT)

```text
business_confidence_score =
  0.35 * source_count_norm +
  0.25 * contact_score +
  0.25 * activity_score +
  0.15 * digital_visibility_score
```

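This is your **headline metric**. Scale every input to [0, 1] before combining: `source_count_norm` is `source_count` normalized (for example, divided by the number of sources you track), and `contact_score` ranges from 0 to 2, so divide it by 2. The weights sum to 1, so the result then also lies in [0, 1].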
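
A sketch of the headline score under explicit normalization assumptions: `total_sources` (the number of sources you track) is a hypothetical parameter used to define `source_count_norm`, and `contact_score` is halved so that it lies in [0, 1] like the other inputs. `activity_score` and `digital_visibility_score` are assumed to be scaled to [0, 1] already.

```python
def business_confidence_score(source_count, contact_score, activity_score,
                              digital_visibility_score, total_sources=5):
    """Weighted confidence score in [0, 1] when all inputs are in range."""
    # Assumption: normalize by the number of tracked sources, capped at 1.
    source_count_norm = min(source_count / total_sources, 1.0)
    # contact_score is 0, 1, or 2 in the schema; rescale it to [0, 1].
    contact_norm = contact_score / 2
    return (0.35 * source_count_norm
            + 0.25 * contact_norm
            + 0.25 * activity_score
            + 0.15 * digital_visibility_score)
```

For example, a business found in 2 of 5 sources, with `contact_score = 1`, `activity_score = 0.6`, and `digital_visibility_score = 0.4`, scores 0.35 × 0.4 + 0.25 × 0.5 + 0.25 × 0.6 + 0.15 × 0.4 = 0.475.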

---

# 2. Turn This Into a GitHub Project (Portfolio-Grade)

### Recommended Repo Structure

```text
small-business-market-intelligence/
│
├── data/
│   ├── raw/
│   ├── processed/
│   └── enriched/
│
├── notebooks/
│   ├── 01_entity_cleaning.ipynb
│   ├── 02_deduplication.ipynb
│   ├── 03_metric_engineering.ipynb
│   └── 04_spatial_analysis.ipynb
│
├── src/
│   ├── scoring.py
│   ├── geo_metrics.py
│   └── validation.py
│
├── dashboards/
│   └── streamlit_app.py
│
├── README.md
└── requirements.txt
```

---

### README Structure (Very Important)

Use this framing:

> **Problem**
> Small and informal businesses lack financial transparency.
>
> **Solution**
> Use alternative data + proxy metrics to estimate business activity, visibility, and market gaps.
>
> **Metrics Designed**
> Digital Visibility, Activity Score, Market Density, Confidence Score.
>
> **Impact**
> Identifies underserved markets & digitization opportunities.

This makes recruiters **stop scrolling**.

---

# 3. Visualize These Metrics on a Map (Game-Changer)

![Image](https://i.sstatic.net/sSg26.jpg)

![Image](https://miro.medium.com/1%2ATHec9J1LxeNjnu1UmUL1mA.png)

![Image](https://d1a3f4spazzrp4.cloudfront.net/kepler.gl/website/hero/kepler.gl-hexagon.png)

![Image](https://d1a3f4spazzrp4.cloudfront.net/kepler.gl/website/hero/kepler.gl-contours.png)

---

## Map Layers You Should Build

### Layer 1: Business Density Heatmap

* Color = businesses/km²

### Layer 2: Confidence Score Clusters

* Green = high confidence
* Red = low confidence (opportunity!)

### Layer 3: Market Gap Zones

* Low density + high reviews nearby
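
One way to make the Layer 3 rule concrete is a pair of thresholds per grid cell. The sketch below is illustrative only: the field names `density` and `nearby_reviews` and both threshold parameters are assumptions, not part of the schema above.

```python
def market_gap_cells(cells, max_density, min_nearby_reviews):
    """Return ids of cells with few businesses but strong review demand nearby.

    cells is a hypothetical dict: cell id -> {"density": businesses per km2,
    "nearby_reviews": total reviews in neighbouring cells}.
    """
    return sorted(
        cell
        for cell, stats in cells.items()
        if stats["density"] <= max_density
        and stats["nearby_reviews"] >= min_nearby_reviews
    )
```

Reasonable thresholds depend on the city and category, so they are worth deriving from the distribution of your own cells rather than fixing by hand.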

---

### Tools (Python)

| Tool        | Use                     |
| ----------- | ----------------------- |
| `geopandas` | Spatial joins           |
| `folium`    | Interactive maps        |
| `kepler.gl` | Pro-level visualization |
| `h3`        | Hex-based density       |

Example concept:

```text
Hex Cell → Avg Confidence Score → Color
```

This is **startup-grade visualization**.
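
The cell → average → color pipeline can be sketched without any mapping library. To stay dependency-free, the example below bins points into a square latitude/longitude grid as a stand-in for H3 hex cells; with `h3` installed you would swap `cell_id` for the library's own indexing function. The cell size and the red-to-green ramp are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def cell_id(lat, lon, cell_deg=0.01):
    """Square grid cell key (a stand-in for an H3 hex index)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def average_score_by_cell(rows, cell_deg=0.01):
    """rows: iterable of (lat, lon, score). Returns {cell: mean score}."""
    buckets = defaultdict(list)
    for lat, lon, score in rows:
        buckets[cell_id(lat, lon, cell_deg)].append(score)
    return {cell: sum(v) / len(v) for cell, v in buckets.items()}


def score_to_color(score):
    """Linear ramp from red (score 0.0) to green (score 1.0) as a hex color."""
    s = min(max(score, 0.0), 1.0)
    red, green = round(255 * (1 - s)), round(255 * s)
    return f"#{red:02x}{green:02x}00"
```

The resulting `{cell: color}` mapping can then be handed to whichever map layer you build, for example as the fill color of each cell polygon.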

---

# 4. Statistical Validation (This Makes It Legit)

You *must* prove your metrics aren't arbitrary.

---

## 1. Internal Consistency

* Correlation between:

  * `activity_score` ↔ `review_velocity`
  * `confidence_score` ↔ `source_count`

```text
Expected: strong positive correlation
```
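
A small sketch of the correlation check, using a hand-rolled Pearson coefficient so it runs without extra packages. Note that `source_count` is itself an input to the confidence score, so some positive correlation there is expected by construction; the `activity_score` versus `review_velocity` pair is the more informative test because `review_velocity` is not part of `activity_score`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Since review counts are usually heavily skewed, a rank-based coefficient such as Spearman's may be a better fit than Pearson's for real data.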

---

## 2. Construct Validity

Test assumptions:

* Businesses with **recent reviews** should have **higher confidence**
* Multi-source businesses should dominate top quartile

Use:

* Boxplots
* Mann–Whitney U test
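
To show what the Mann–Whitney U test measures, here is a pure-Python sketch of the U statistic and the related effect size, comparing confidence scores of two groups (for example, businesses with recent reviews versus those without). It computes no p-value; for that, `scipy.stats.mannwhitneyu` is the usual tool. The function names are illustrative.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def common_language_effect(group_a, group_b):
    """Probability that a random member of group_a outscores one of group_b."""
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))
```

An effect size near 0.5 means the two groups are indistinguishable; the assumption "recent reviews imply higher confidence" predicts a value well above 0.5.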

---

## 3. Stability Over Time

If you scrape twice:

```text
confidence_score_t1 ≈ confidence_score_t2
```

Low drift = good metric.
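
Drift can be quantified as the mean absolute change in score for businesses that appear in both scrapes. This sketch assumes each scrape is a dict keyed by `business_id`; what counts as "low" drift is a judgment call that depends on the time between scrapes.

```python
def score_drift(scores_t1, scores_t2):
    """Mean absolute score change over businesses present in both scrapes."""
    shared = scores_t1.keys() & scores_t2.keys()
    if not shared:
        return None  # nothing to compare
    return sum(abs(scores_t1[b] - scores_t2[b]) for b in shared) / len(shared)
```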

---

## 4. Ground Truth Sampling (Optional but 🔥)

Manually verify:

* 50 high-score businesses
* 50 low-score businesses

Measure:

```text
precision = businesses verified as real / businesses predicted as real (the high-score sample)
```

This alone makes your project **research-level**.
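
A sketch of how the manual check could be summarized. It assumes each sample is recorded as a list of booleans (`True` when the manual check confirmed the business exists); the key names in the returned dict are illustrative. The low-score sample is reported as the share that turned out to be real anyway, which shows how many genuine businesses the score misses.

```python
def verified_fraction(sample):
    """Share of manually checked businesses that were confirmed to exist."""
    return sum(sample) / len(sample) if sample else None


def ground_truth_summary(high_score_sample, low_score_sample):
    """Summarize manual verification of the high- and low-score samples."""
    return {
        # Precision of the "high score means real" prediction.
        "precision_high": verified_fraction(high_score_sample),
        # Real businesses the score would have dismissed.
        "real_rate_low": verified_fraction(low_score_sample),
    }
```

A useful metric should show a clear gap between the two numbers; if they are close, the score is not separating real businesses from stale listings.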

---

# What We Are Building (Big Picture)

We are NOT building:

* A scraper
* A dataset

We ARE building:

* A **Market Intelligence Engine**
* An **Alternative Credit / Opportunity Model**
* A **Digitization Gap Detector**

This thinking is used by:

* Hyperlocal delivery startups
* MSME fintech lenders
* Geo-analytics companies
* Consulting firms

---


