# 🎮 Introduction to Graph ML: Predict Stream Languages on Twitch

## Problem Statement

You're working as a **Data Scientist at Twitch** 🧙

Every day, new users join the platform and start streaming. Your manager wants to identify the **language** of these new streams. Converting audio to text and running language detection is expensive.

**Alternative approach:** Use the graph structure!

### Key Hypothesis
- Users mostly chat in a **single language**
- If a user chats in two streams → both streams likely use the **same language**
- Exception: English (many people understand basic English, so English streams can share an audience with streams in other languages)

### Graph Representation
- **Monopartite graph**: `(:Stream)-[:SHARED_AUDIENCE]->(:Stream)`
- **Undirected**: Audience sharing is symmetric, so the stored relationship direction carries no meaning
- **Weighted**: Count of shared audience members
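
To make the weighting concrete, here is a minimal sketch of how shared-audience edges could be derived from raw chat activity. The `(user, stream)` event format, the function name, and the sample data are assumptions for illustration; the dataset loaded later in this notebook already contains the computed relationships.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_audience_edges(chat_events):
    """Return {(stream_a, stream_b): shared_user_count} for undirected pairs."""
    streams_by_user = defaultdict(set)
    for user, stream in chat_events:
        streams_by_user[user].add(stream)

    weights = Counter()
    for streams in streams_by_user.values():
        # Sorting gives every undirected pair a single canonical key.
        for a, b in combinations(sorted(streams), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical chat log: one (user, stream) pair per chat appearance.
events = [
    ("u1", "s_es_1"), ("u1", "s_es_2"),
    ("u2", "s_es_1"), ("u2", "s_es_2"),
    ("u3", "s_es_2"), ("u3", "s_en_1"),
]
edges = shared_audience_edges(events)
print(edges)
```

Two users chat in both Spanish streams, so that edge gets weight 2, while the single bilingual user creates a weight-1 edge to the English stream.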

---

## 📦 Setup and Imports

In [None]:
# Import custom modules
from graph import GraphConnector
from data_loader import TwitchDataLoader
from ml import GraphMLPipeline
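# Import only the plotting helpers this notebook calls (avoids a wildcard import)
from viz import (
    plot_language_distribution,
    plot_degree_distribution,
    plot_feature_importance,
    plot_embedding_space_2d,
    create_analysis_dashboard,
)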

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = [16, 9]

print("✓ Imports successful")

## 🔌 Step 1: Connect to Neo4j

First, establish a connection to the Neo4j database.

In [None]:
# Initialize connection
graph = GraphConnector(
    uri="bolt://neo4j:7687",
    user="neo4j",
    password="graphml2024"
)

# Test connection
if graph.test_connection():
    print("✓ Successfully connected to Neo4j")
    print(f"  URI: {graph.uri}")
else:
    print("✗ Failed to connect to Neo4j")
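
Outside a workshop setting you would normally keep credentials out of the notebook. A minimal sketch, assuming environment variables named `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` (these names and the helper are hypothetical, not part of the provided modules), with the values from the cell above as fallbacks:

```python
import os


def neo4j_settings(env=os.environ):
    """Collect connection settings, falling back to the notebook's defaults."""
    return {
        "uri": env.get("NEO4J_URI", "bolt://neo4j:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "graphml2024"),
    }


settings = neo4j_settings()
# The keys match the keyword arguments used above:
# graph = GraphConnector(**settings)
```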

### Check available databases

In [None]:
databases = graph.run_query("SHOW DATABASES")
databases

## 📊 Step 2: Load Twitch Data

Load stream data and shared audience relationships into Neo4j.

In [None]:
# Initialize data loader
loader = TwitchDataLoader(graph)

# Load all data
stats = loader.load_all()

print("\n📊 Data Loading Summary:")
print(f"  Streams: {stats['streams']}")
print(f"  Relationships: {stats['relationships']}")
print("\n🌍 Language Distribution:")
for lang_stat in stats['languages'][:10]:
    print(f"  {lang_stat['language']:6s}: {lang_stat['count']:4d} streams")

### Visualize language distribution

In [None]:
df_languages = pd.DataFrame(stats['languages'])
plot_language_distribution(df_languages)
plt.show()

### Explore the graph structure

In [None]:
# Check degree distribution
plot_degree_distribution(graph)
plt.show()

## 🎯 Step 3: Create Graph Projection

Create a GDS graph projection for running algorithms.

**Question:** Should we use DIRECTED or UNDIRECTED projection?

**Answer:** UNDIRECTED because:
- Shared audience is bidirectional
- If a user chats in both streams A and B, the connection goes both ways
- Language similarity is symmetric

In [None]:
# Initialize ML pipeline
ml_pipeline = GraphMLPipeline(graph)

# Create graph projection
projection_stats = ml_pipeline.create_graph_projection(
    graph_name="twitch",
    orientation="UNDIRECTED"
)

projection_stats
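
What `create_graph_projection` runs internally is not shown here. As a rough sketch, a projection like this is commonly created with the GDS procedure `gds.graph.project`. The helper below only builds the Cypher string; the helper name and the relationship property name `weight` are assumptions.

```python
ALLOWED_ORIENTATIONS = {"NATURAL", "REVERSE", "UNDIRECTED"}


def projection_query(graph_name, orientation="UNDIRECTED", weight_property="weight"):
    """Build a Cypher call that projects Stream nodes and SHARED_AUDIENCE edges."""
    if orientation not in ALLOWED_ORIENTATIONS:
        raise ValueError(f"Unsupported orientation: {orientation}")
    return (
        f"CALL gds.graph.project('{graph_name}', 'Stream', "
        f"{{SHARED_AUDIENCE: {{orientation: '{orientation}', "
        f"properties: '{weight_property}'}}}}) "
        "YIELD graphName, nodeCount, relationshipCount"
    )


query = projection_query("twitch")
print(query)
# Could then be executed with: graph.run_query(query)
```

String interpolation keeps the sketch short; for anything user-supplied, prefer query parameters.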

## 🧬 Step 4: Generate Node Embeddings with Node2Vec

Use the Node2Vec algorithm to create vector representations of streams.

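**Node2Vec parameters:**
- `embeddingDimension=8`: Size of each embedding vector
- `walkLength=80`: Number of steps in each random walk
- `inOutFactor=0.5`: BFS vs. DFS bias; values below 1.0 push walks outward (DFS-like), values above 1.0 keep them close to the start node (BFS-like)
- `returnFactor=1.0`: Tendency to step back to the previously visited node; values below 1.0 make returning more likely

The call below passes only `embedding_dimension`, `walk_length`, and `iterations`, so `inOutFactor` and `returnFactor` take whatever defaults `run_node2vec` defines.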
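
To see how the two bias parameters shape a walk, here is a small sketch of the second-order transition rule from the original node2vec formulation, which `returnFactor` and `inOutFactor` appear to mirror. It ignores edge weights for simplicity, and the toy graph and function name are made up for illustration.

```python
def transition_probabilities(graph, previous, current,
                             return_factor=1.0, in_out_factor=1.0):
    """Probability of each next step, given the walk just moved previous -> current."""
    weights = {}
    for candidate in graph[current]:
        if candidate == previous:
            # Stepping straight back to where we came from.
            weights[candidate] = 1.0 / return_factor
        elif candidate in graph[previous]:
            # Staying at distance 1 from the previous node (BFS-like move).
            weights[candidate] = 1.0
        else:
            # Moving further away from the previous node (DFS-like move).
            weights[candidate] = 1.0 / in_out_factor
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Toy undirected graph as adjacency sets.
toy = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
probs = transition_probabilities(toy, previous="a", current="b", in_out_factor=0.5)
print(probs)
```

With `in_out_factor=0.5`, the outward move to `d` is twice as likely as returning to `a` or moving sideways to `c`.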

In [None]:
# Run Node2Vec
n2v_stats = ml_pipeline.run_node2vec(
    graph_name="twitch",
    embedding_dimension=8,
    walk_length=80,
    iterations=10
)

n2v_stats

## 📐 Step 5: Analyze Embedding Distances

Compare Euclidean distance and cosine similarity between the embeddings of connected nodes.

In [None]:
# Get distance metrics
df_distances = ml_pipeline.analyze_embedding_distances()

print(f"Analyzed {len(df_distances)} node pairs")
print(f"\nDistance Statistics:")
print(df_distances[['euclidean', 'cosine', 'weight']].describe())

### Plot distance distributions

In [None]:
ml_pipeline.plot_distance_distributions(df_distances)
plt.show()

**Observations:**
- **Euclidean distance**: Shows a wider distribution because it depends on vector magnitude as well as direction
- **Cosine similarity**: Clusters near 1.0 because it compares direction only
- Cosine similarity is usually preferred for comparing embeddings (the angle between vectors matters more than their length)
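
The difference between the two metrics is easy to check by hand. This sketch uses two made-up vectors that point in the same direction but differ in length:

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]  # same direction as u, twice the length

print(euclidean(u, v))
print(cosine_similarity(u, v))
```

The Euclidean distance is 3.0 even though the vectors are perfectly aligned, while the cosine similarity is 1.0.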

## 📊 Step 6: Analyze Degree by Similarity

Check if cosine similarity correlates with node degree.

In [None]:
# Get degree statistics by similarity
df_degree = ml_pipeline.analyze_degree_by_similarity()

df_degree.head(10)

In [None]:
# Plot degree by similarity
ml_pipeline.plot_degree_by_similarity(df_degree)
plt.show()

In [None]:
# Plot weight by similarity
ml_pipeline.plot_weight_by_similarity(df_degree)
plt.show()

**Insights:**
- Higher cosine similarity → often higher average degree
- Streams with similar embeddings tend to be more connected
- Weight correlates with similarity (stronger shared audience = more similar)

## 🤖 Step 7: Prepare Training Data

Extract embeddings and labels for machine learning.

In [None]:
# Prepare data
df_training = ml_pipeline.prepare_training_data()

print(f"Training samples: {len(df_training)}")
print(f"Unique languages: {len(ml_pipeline.label_mapping)}")
print(f"\nLanguage encoding:")
for i, lang in enumerate(ml_pipeline.label_mapping[:10]):
    print(f"  {i}: {lang}")

df_training.head()
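
The language encoding printed above maps each language to an integer class ID. How `prepare_training_data` orders the classes is not shown; the sketch below assumes a simple alphabetical ordering just to illustrate the idea.

```python
def encode_labels(languages):
    """Map each language string to an integer code; return codes and the mapping."""
    label_mapping = sorted(set(languages))  # position in this list = class ID
    code_of = {lang: code for code, lang in enumerate(label_mapping)}
    return [code_of[lang] for lang in languages], label_mapping


codes, mapping = encode_labels(["en", "es", "en", "de"])
print(codes, mapping)
```

Keeping the mapping around is what lets you translate predicted class IDs back into language names later.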

## 🌲 Step 8: Train Random Forest Classifier

Use embeddings as features to predict stream language.

In [None]:
# Train classifier
results = ml_pipeline.train_classifier(
    df_training,
    test_size=0.2,
    random_state=42,
    n_estimators=100,
    max_depth=10,
    min_samples_split=5
)

# Print report
ml_pipeline.print_classification_report(results['classification_report'])

## 📊 Step 9: Visualize Results

### Confusion Matrix

In [None]:
ml_pipeline.plot_confusion_matrix(
    results['confusion_matrix'],
    results['label_mapping'].astype(str)
)
plt.show()

### Comprehensive Dashboard

In [None]:
create_analysis_dashboard(results, graph)
plt.show()

### Feature Importance

In [None]:
plot_feature_importance(results['model'], top_n=8)
plt.show()

### Embedding Space Visualization

In [None]:
# Sample data for visualization (too many points slow down plotting)
df_sample = df_training.sample(n=min(1000, len(df_training)), random_state=42)

plot_embedding_space_2d(df_sample, method='pca')
plt.show()

## 💭 Discussion Questions

### 1. What do you think about the confusion matrix?

**Analysis:**
- Diagonal values should be high (correct predictions)
- Off-diagonal shows misclassifications
- Look for patterns: Which languages are confused with each other?
- English might be confused with others (hypothesis about English understanding)
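
One way to look for such patterns is to rank the off-diagonal cells. This is a minimal sketch on made-up labels; in the notebook you would feed it the true and predicted test labels instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))


def most_confused(y_true, y_pred, top_n=3):
    """Return the most frequent misclassifications as ((true, predicted), count)."""
    errors = Counter({
        pair: count
        for pair, count in confusion_counts(y_true, y_pred).items()
        if pair[0] != pair[1]
    })
    return errors.most_common(top_n)


# Hypothetical labels for illustration.
y_true = ["en", "en", "es", "es", "es", "pt", "pt"]
y_pred = ["en", "es", "es", "en", "en", "es", "pt"]

print(most_confused(y_true, y_pred, top_n=1))
```

Here the top entry is Spanish streams predicted as English, which is the kind of pattern the English hypothesis would predict.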

### 2. What is the appropriate metric to show your manager?

**Recommendation: F1-Score (Weighted Average)**

**Why?**
- **Accuracy** can be misleading with imbalanced classes
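- **Precision**: Of the streams predicted as language X, what fraction really are X?
- **Recall**: Of the streams that really are language X, what fraction did we find?
- **F1-Score**: Harmonic mean of precision and recall
- **Weighted F1**: Averages per-class F1 weighted by class size, so the score reflects the real language mix (important since English dominates); compare it with macro F1 to check that weak minority languages are not being masked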

**Business Context:**
- False positives (wrong language): Viewers get recommendations in wrong language → bad UX
- False negatives (missed language): Stream not categorized properly → lost discoverability
- Both matter → F1-Score balances both concerns
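
To see how the averaging choice changes the headline number, here is a small sketch that computes per-class, macro, and weighted F1 by hand. The labels are made up: a dominant class that is predicted well and a rare class that is predicted poorly.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every language counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each class truly occurs."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical test set: 8 English streams, 2 Korean streams.
y_true = ["en"] * 8 + ["ko"] * 2
y_pred = ["en"] * 8 + ["en", "ko"]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

The weighted score comes out higher than the macro score because the well-predicted majority class carries most of the weight, so reporting both gives your manager a more honest picture.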

### 3. How can you improve the classifier quality?

**Strategies:**

**a) Better Embeddings:**
- Increase `embeddingDimension` (try 16, 32, 64)
- Tune Node2Vec hyperparameters (`walkLength`, `inOutFactor`, `returnFactor`)
- Try other embedding algorithms (GraphSAGE, GCN)

**b) More Features:**
- Add centrality metrics (PageRank, betweenness)
- Include node properties (views, account age)
- Community detection features

**c) Better Model:**
- Hyperparameter tuning (grid search, randomized search)
- Try other models (XGBoost, Neural Networks)
- Ensemble methods

**d) Handle Imbalance:**
- SMOTE for minority classes
- Class weights in model
- Stratified sampling

**e) Data Quality:**
- Remove noisy data (dead accounts, very low degree nodes)
- Add temporal features (streaming time patterns)
- Incorporate chat message patterns
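
As an example of the class-weight idea from (d), the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. The label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution: English dominates.
weights = balanced_class_weights(["en"] * 8 + ["ko"] * 2)
print(weights)
```

Rare languages get a weight above 1 and the dominant language a weight below 1, so mistakes on minority classes cost more during training.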

---

## 🎯 Next Steps and Extensions

### Try These Experiments:

1. **Change embedding dimension** and compare results
2. **Add more graph features** (PageRank, Louvain communities)
3. **Try different classifiers** (XGBoost, Neural Network)
4. **Implement cross-validation** for robust evaluation
5. **Analyze misclassifications** in detail
6. **Build a confusion matrix** for specific language pairs
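
For experiment 4, the folds should preserve the language mix, otherwise a rare language can vanish from a fold entirely. Below is a minimal pure-Python sketch of stratified fold assignment; the function name and sample labels are illustrative, and in practice scikit-learn's `StratifiedKFold` is the usual tool.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=42):
    """Split sample indices into k folds with a similar label mix in each."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 10 English and 5 Korean streams.
labels = ["en"] * 10 + ["ko"] * 5
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(sorted(labels[i] for i in fold))
```

Each fold serves once as the test set while the remaining folds are used for training; averaging the metric across folds gives a more stable estimate than a single 80/20 split.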

### Production Considerations:

- **Real-time inference**: How to handle new streams?
- **Model updates**: Retrain periodically as graph evolves
- **Monitoring**: Track prediction confidence, drift
- **A/B testing**: Compare with baseline (audio transcription)
- **Cost analysis**: Graph ML vs. traditional NLP

---

## 🧹 Cleanup

In [None]:
# Close connection
graph.close()
print("✓ Connection closed")