# **NFT Authenticity Model - Chronological Flow** (Made by Claude Sonnet 4)


## 1. **Initialization Phase** (`__init__`)
When the `NFTAuthenticityModel` is created:
- Sets up empty dictionaries to store different ML models (`self.models`)
- Initializes storage for model performance scores and results
- Creates a `StandardScaler` for feature normalization
- Defines the path where the final model will be saved
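
As a rough illustration of the state described above, the constructor might look like the following sketch. Only `self.models` and the use of `StandardScaler` come from the description; the other attribute names and the default path are hypothetical.

```python
from pathlib import Path

from sklearn.preprocessing import StandardScaler


class NFTAuthenticityModel:
    """Skeleton showing only the state set up at construction time."""

    def __init__(self, model_path: str = "models/nft_authenticity_model.pkl"):
        self.models = {}                    # model name -> fitted estimator
        self.model_scores = {}              # hypothetical name: model name -> headline score
        self.results = {}                   # hypothetical name: model name -> detailed results
        self.scaler = StandardScaler()      # fitted later, during training
        self.model_path = Path(model_path)  # hypothetical default location


model = NFTAuthenticityModel()
```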

## 2. **Data Loading Phase** (`prepare_data()`)
The system looks for training data:
- Searches the `training_data/` directory for JSON files with NFT data
- Picks the file that sorts last by filename and treats it as the most recent (this assumes filenames embed a sortable timestamp)
- Loads the JSON data containing NFT collection features
- Extracts feature arrays and collection names from each record
- Converts everything into a pandas DataFrame for easier manipulation
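
A minimal sketch of this loading step is shown below. The record layout (`collection_name` plus a `features` mapping) and the function name `load_latest_training_file` are assumptions for illustration; the real JSON schema may differ.

```python
import json
from pathlib import Path

import pandas as pd


def load_latest_training_file(data_dir: str = "training_data") -> pd.DataFrame:
    """Load the JSON file that sorts last by name and return it as a DataFrame."""
    files = sorted(Path(data_dir).glob("*.json"))
    if not files:
        raise FileNotFoundError(f"no JSON files found in {data_dir!r}")
    latest = files[-1]  # "most recent" only if names embed a sortable timestamp
    with latest.open(encoding="utf-8") as fh:
        records = json.load(fh)

    # Assumed record shape: {"collection_name": str, "features": {name: value}}
    rows = [
        {"collection_name": rec["collection_name"], **rec["features"]}
        for rec in records
    ]
    return pd.DataFrame(rows)
```

Sorting by filename is only a proxy for recency; sorting by modification time would be an alternative if filenames carry no date.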

## 3. **Label Creation Phase** (`create_synthetic_labels()`)
Since there's no ground truth about which NFTs are legitimate vs suspicious:
- **Creates synthetic labels using a scoring system**:
  - +3 points: Verified collection
  - +1 point each: Has Discord, has Twitter
  - +2 points: Has floor price > 0
  - +2 points: High trading volume (>1000)
  - +1 point: Many owners (>1000)
  - +1 point: Reddit mentions exist
  - +3 points: Collection is in known legitimate list
- **Labels collections as legitimate (1) if score ≥ 6, suspicious (0) otherwise**
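
The scoring rule above can be sketched as a plain function. The point values and thresholds follow the list; the field names `is_verified`, `has_discord`, `has_twitter`, and `collection_name`, and the contents of the known-legitimate set, are assumptions.

```python
def legitimacy_score(row: dict, known_legitimate: set) -> int:
    """Sum the heuristic points for one collection record."""
    score = 0
    if row.get("is_verified"):
        score += 3
    if row.get("has_discord"):
        score += 1
    if row.get("has_twitter"):
        score += 1
    if row.get("floor_price", 0) > 0:
        score += 2
    if row.get("total_volume", 0) > 1000:
        score += 2
    if row.get("num_owners", 0) > 1000:
        score += 1
    if row.get("reddit_mentions", 0) > 0:
        score += 1
    if row.get("collection_name") in known_legitimate:
        score += 3
    return score


def synthetic_label(row: dict, known_legitimate: set, threshold: int = 6) -> int:
    """Return 1 (legitimate) if the score reaches the threshold, else 0 (suspicious)."""
    return 1 if legitimacy_score(row, known_legitimate) >= threshold else 0


known = {"Example Collection"}  # placeholder list, not the real one
strong = {"is_verified": True, "has_discord": True, "floor_price": 0.5}
weak = {"has_discord": True, "has_twitter": True, "floor_price": 0.5}
print(legitimacy_score(strong, known), synthetic_label(strong, known))  # 6 1
print(legitimacy_score(weak, known), synthetic_label(weak, known))      # 4 0
```

The maximum possible score under this rule is 14, so the threshold of 6 can be reached without verification, for example through volume, floor price, owners, and one social signal.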

## 4. **Feature Engineering Phase** (`engineer_features()`)
Transforms raw data into more meaningful features:
- **Volume per owner**: `total_volume / num_owners` (liquidity indicator)
- **Market cap to volume ratio**: `market_cap / total_volume` (market efficiency)
- **Price premium**: `average_price / floor_price` (pricing structure)
- **Social engagement score**: `reddit_mentions + reddit_engagement`
- **Liquidity indicator**: `√(total_volume × num_owners)` (market activity)
- Handles division by zero and infinite values safely
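
The derived features above might be computed along these lines. This is a sketch: the output column names other than `volume_per_owner` are assumptions, and so is the choice to map undefined ratios to `0.0` (the description only says they are handled safely).

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the five derived columns to a copy of the raw feature table."""
    out = df.copy()
    out["volume_per_owner"] = out["total_volume"] / out["num_owners"]
    out["market_cap_to_volume"] = out["market_cap"] / out["total_volume"]
    out["price_premium"] = out["average_price"] / out["floor_price"]
    out["social_engagement"] = out["reddit_mentions"] + out["reddit_engagement"]
    out["liquidity_indicator"] = np.sqrt(out["total_volume"] * out["num_owners"])

    # Division by zero yields inf (x/0) or NaN (0/0); map both to 0.0 (assumed policy).
    new_cols = [
        "volume_per_owner", "market_cap_to_volume", "price_premium",
        "social_engagement", "liquidity_indicator",
    ]
    out[new_cols] = out[new_cols].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return out


raw = pd.DataFrame({
    "total_volume": [100.0, 0.0],
    "num_owners": [4.0, 0.0],
    "market_cap": [500.0, 10.0],
    "average_price": [2.0, 1.0],
    "floor_price": [1.0, 0.0],
    "reddit_mentions": [3.0, 0.0],
    "reddit_engagement": [7.0, 0.0],
})
features = engineer_features(raw)
```

In this toy input the second row has zero volume, owners, and floor price, so all of its ratios fall back to `0.0`.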

## 5. **Data Augmentation Phase** (`generate_synthetic_data()`)
If the dataset is too small (<10 samples):
- **Generates 70% legitimate collections** with characteristics like:
  - High verification rate, strong social media presence
  - Higher floor prices, market caps, and trading volumes
  - More owners and positive Reddit sentiment
- **Generates 30% suspicious collections** with characteristics like:
  - Low verification rate, weak social media presence
  - Lower prices, volumes, and fewer owners
  - Less Reddit engagement and more negative sentiment
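
One way to generate such a 70/30 mix is sketched below. Every distribution and parameter here (probabilities, log-normal locations, Poisson means) is an illustrative assumption, not the values used by the actual implementation, and only a subset of features is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical per-group parameters; the real generator's values are not known.
LEGIT = dict(p_verified=0.9, p_social=0.9, log_floor=0.0,
             log_volume=8.5, mean_owners=3000, sentiment=0.3)
SUSPICIOUS = dict(p_verified=0.1, p_social=0.3, log_floor=-3.0,
                  log_volume=4.0, mean_owners=150, sentiment=-0.2)


def _sample_group(rng, n, label, p_verified, p_social, log_floor,
                  log_volume, mean_owners, sentiment):
    """Draw n synthetic collections for one class."""
    return pd.DataFrame({
        "is_verified": rng.random(n) < p_verified,
        "has_discord": rng.random(n) < p_social,
        "has_twitter": rng.random(n) < p_social,
        "floor_price": rng.lognormal(mean=log_floor, sigma=1.0, size=n),
        "total_volume": rng.lognormal(mean=log_volume, sigma=1.0, size=n),
        "num_owners": rng.poisson(mean_owners, size=n),
        "reddit_sentiment": rng.normal(sentiment, 0.2, size=n),
        "label": label,
    })


def generate_synthetic_data(n_samples=100, legit_fraction=0.7, seed=0):
    """Return a shuffled table with roughly legit_fraction legitimate rows."""
    rng = np.random.default_rng(seed)
    n_legit = round(n_samples * legit_fraction)
    data = pd.concat(
        [
            _sample_group(rng, n_legit, 1, **LEGIT),
            _sample_group(rng, n_samples - n_legit, 0, **SUSPICIOUS),
        ],
        ignore_index=True,
    )
    return data.sample(frac=1.0, random_state=seed).reset_index(drop=True)


synthetic = generate_synthetic_data(100)
```

Because the labels are assigned by construction, a model trained on this data learns the generator's assumptions, which is worth keeping in mind when reading the evaluation metrics.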

## 6. **Model Training Phase** (`train_models()`)
Trains four different machine learning algorithms:

### Data Splitting:
- Splits data 80/20 for training/testing (if enough data exists)
- Uses stratified sampling to maintain class balance
- Creates scaled versions of features for algorithms that need normalization
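
The split and scaling steps could look like the following sketch, using random placeholder data. Fitting the scaler on the training split only is an assumption here (it is the usual way to avoid leaking test statistics), as are the `random_state` values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 50 samples, 4 features, 70% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = np.array([1] * 35 + [0] * 15)

# 80/20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Stratification needs at least two samples per class, which is one plausible reason the split is only performed when enough data exists.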

### Models Trained:
1. **Random Forest**: Ensemble of decision trees; insensitive to feature scale, so it needs no normalization
2. **Gradient Boosting**: Sequential learning, builds strong classifier from weak ones
3. **Logistic Regression**: Linear model, uses scaled features
4. **Support Vector Machine (SVM)**: Finds optimal decision boundary, uses scaled features

### For Each Model:
- Fits the model on training data
- Makes predictions on test data
- Calculates performance metrics (accuracy, precision, recall, F1-score, ROC-AUC)
- Extracts feature importance (which features matter most)
- Stores all results in `ModelResults` objects
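
The per-model loop can be sketched as follows on generated toy data. The `ModelResults` fields, the hyperparameters, and the way feature importance is derived (tree importances, absolute logistic coefficients, nothing for the RBF SVM) are assumptions; the actual class may store more or different information.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


@dataclass
class ModelResults:  # hypothetical field layout
    name: str
    accuracy: float
    precision: float
    recall: float
    f1: float
    roc_auc: float
    feature_importance: Optional[np.ndarray]


X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_tr)

# (estimator, needs_scaled_features)
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0), False),
    "gradient_boosting": (GradientBoostingClassifier(random_state=0), False),
    "logistic_regression": (LogisticRegression(max_iter=1000), True),
    "svm": (SVC(), True),
}

results = {}
for name, (model, needs_scaling) in candidates.items():
    tr = scaler.transform(X_tr) if needs_scaling else X_tr
    te = scaler.transform(X_te) if needs_scaling else X_te
    model.fit(tr, y_tr)
    pred = model.predict(te)
    # ROC-AUC needs a continuous score: probabilities if available,
    # otherwise the signed distance to the decision boundary.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(te)[:, 1]
    else:
        scores = model.decision_function(te)
    if hasattr(model, "feature_importances_"):
        importance = model.feature_importances_
    elif hasattr(model, "coef_"):
        importance = np.abs(model.coef_).ravel()
    else:
        importance = None  # e.g. an RBF-kernel SVM exposes neither
    results[name] = ModelResults(
        name=name,
        accuracy=accuracy_score(y_te, pred),
        precision=precision_score(y_te, pred, zero_division=0),
        recall=recall_score(y_te, pred, zero_division=0),
        f1=f1_score(y_te, pred, zero_division=0),
        roc_auc=roc_auc_score(y_te, scores),
        feature_importance=importance,
    )
```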

## 7. **Model Evaluation Phase** (`evaluate_models()`)
- Compares all models side-by-side in a performance table
- Ranks models by F1-score (balance of precision and recall)
- Identifies the best performing model
- Prints detailed comparison showing strengths/weaknesses
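
Ranking by F1-score amounts to sorting a metrics table. The sketch below uses placeholder numbers chosen only to exercise the ranking; they are not measured results from this model.

```python
import pandas as pd

# Placeholder metrics for illustration only, not real measurements.
metrics = {
    "random_forest": {"accuracy": 0.80, "precision": 0.80, "recall": 0.80, "f1": 0.80},
    "gradient_boosting": {"accuracy": 0.85, "precision": 0.85, "recall": 0.85, "f1": 0.85},
    "logistic_regression": {"accuracy": 0.70, "precision": 0.70, "recall": 0.70, "f1": 0.70},
    "svm": {"accuracy": 0.75, "precision": 0.75, "recall": 0.75, "f1": 0.75},
}

# One row per model, best F1 first.
table = pd.DataFrame.from_dict(metrics, orient="index").sort_values(
    "f1", ascending=False
)
best_name = table.index[0]

print(table.to_string(float_format="{:.2f}".format))
print(f"Best model by F1: {best_name}")
```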

## 8. **Model Persistence Phase** (`save_models()` & `save_best_model()`)
- **Saves all trained models** as `.pkl` files using joblib
- **Saves the feature scaler** (needed for future predictions)
- **Saves metadata** including:
  - Training date and time
  - Feature names and their order
  - Model performance metrics
  - Preprocessing steps used
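
A minimal sketch of the persistence step, assuming a hypothetical `save_artifacts` helper and hypothetical file names (`best_model.pkl`, `scaler.pkl`, `metadata.json`). It uses `joblib.dump` for the estimator and scaler and plain JSON for the metadata, then reloads the model to confirm the round trip.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def save_artifacts(model, scaler, feature_names, metrics, out_dir):
    """Write the model, the scaler, and a JSON metadata file to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out / "best_model.pkl")
    joblib.dump(scaler, out / "scaler.pkl")
    metadata = {
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "feature_names": list(feature_names),  # order matters at predict time
        "metrics": metrics,
        "preprocessing": ["engineer_features", "StandardScaler"],
    }
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return out


# Tiny round-trip demonstration on toy data.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y = [0, 0, 1, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    out = save_artifacts(model, scaler, ["feat_a", "feat_b"], {"f1": 1.0}, tmp)
    reloaded = joblib.load(out / "best_model.pkl")
    reloaded_scaler = joblib.load(out / "scaler.pkl")
    meta = json.loads((out / "metadata.json").read_text(encoding="utf-8"))
    reloaded_pred = list(reloaded.predict(reloaded_scaler.transform(X)))
```

Pickled estimators are generally only safe to load with a compatible scikit-learn version and from a trusted source, so recording the library version in the metadata can be a useful addition.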

## 9. **Prediction Capability** (`predict_collection()`)
For new NFT collections:
- Takes raw collection features as input
- Applies the same feature engineering transformations
- Uses the saved scaler if needed (for Logistic Regression/SVM)
- Returns prediction with confidence scores:
  - `is_legitimate`: Binary prediction (True/False)
  - `legitimacy_probability`: Confidence in legitimacy (0-1)
  - `risk_score`: Complement of the legitimacy probability (`1 - legitimacy_probability`)
  - `confidence`: Model's certainty in its prediction
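
The returned dictionary could be assembled as in the sketch below. It takes already-engineered features, assumes the model exposes `predict_proba` with class `1` meaning legitimate, and defines `confidence` as the larger of the two class probabilities; that last definition and the 0.5 decision threshold are assumptions, since the description does not specify them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def predict_collection(model, features, scaler=None):
    """Score one collection given its engineered feature vector."""
    x = np.asarray(features, dtype=float).reshape(1, -1)
    if scaler is not None:  # only for models trained on scaled features
        x = scaler.transform(x)
    # Column 1 is P(class 1) when the model's classes_ are [0, 1].
    p_legit = float(model.predict_proba(x)[0, 1])
    return {
        "is_legitimate": p_legit >= 0.5,
        "legitimacy_probability": p_legit,
        "risk_score": 1.0 - p_legit,
        "confidence": max(p_legit, 1.0 - p_legit),  # assumed definition
    }


# Toy model: low feature values are class 0, high values are class 1.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

high = predict_collection(model, [12, 12], scaler)
low = predict_collection(model, [0, 0], scaler)
```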

## 10. **Main Execution Flow** (`main()`)
When run as a script:
1. Creates `NFTAuthenticityModel` instance
2. Loads and prepares training data
3. Generates synthetic data if dataset is too small
4. Trains all four models
5. Evaluates and compares performance
6. Saves the best model for future use
7. Provides success/error feedback

## Key Design Decisions:

**Why Multiple Models?** Different algorithms have different strengths: ensemble methods like Random Forest capture non-linear feature interactions, while linear models like Logistic Regression are easier to interpret.

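**Why Synthetic Labels?** NFT authenticity is subjective and ground truth is rare. The heuristic approach uses observable market signals that are assumed to correlate with legitimacy, so the trained models approximate this scoring rule rather than an independently verified ground truth.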

**Why Feature Engineering?** Raw features like "total_volume" become more meaningful when combined (e.g., "volume_per_owner" indicates liquidity quality).

**Why Data Augmentation?** The models need enough samples to learn patterns and to be evaluated on a held-out split. Synthetic data generation guarantees a minimum dataset size while aiming to keep feature distributions realistic.

The model essentially learns to recognize patterns that distinguish established, actively traded NFT collections from potentially suspicious or low-quality ones, based on market activity, social presence, and verification status.

---

## Technical Stack:
- **Python Libraries**: pandas, numpy, scikit-learn, joblib
- **ML Algorithms**: Random Forest, Gradient Boosting, Logistic Regression, SVM
- **Data Sources**: OpenSea API, Reddit API
- **Output Format**: Pickled models (.pkl) with JSON metadata