# Applied Analytics Portfolio

**Predicting and Explaining Healthcare App Quality**

Group: `___`

Names & Student IDs: `___`

---

## 1. Introduction

Briefly describe the **decision context**:

- Mental health unit that wants to recommend high-quality healthcare apps.
- Patients have a range of mental illnesses and somatic comorbidities.

Explain **why prediction helps** and what the **overall goal** of this portfolio is:

- Use app metadata and user reviews to estimate whether an app is likely to be highly rated.
- Identify key factors that drive user-perceived app quality.

Conclude with a short **structure overview** of the notebook/report (what is done in Sections 2–5).

## 2. Data Understanding and Preparation

### 2.1 Research Goal and Operationalization
- Formulate a **precise prediction question**.
- Define what **"high-quality" / "highly rated"** means (e.g., rating threshold + minimum number of ratings).
- Specify which apps you will include (e.g., which categories, filters).
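One way to make the "highly rated" definition concrete is a small labeling function. The sketch below is illustrative only: the column names `averageRating` and `ratingCount` are taken from the placeholder code in Section 3, and the thresholds (4.5 stars, 50 ratings) are assumptions that each group should replace and justify.

```python
import pandas as pd

def add_high_quality_label(apps, rating_col="averageRating", count_col="ratingCount",
                           min_rating=4.5, min_count=50):
    """Return a copy of `apps` with a binary `high_quality` column.

    An app counts as high quality only if it meets BOTH the rating threshold
    and the minimum number of ratings. Thresholds are hypothetical defaults.
    """
    out = apps.copy()
    meets_rating = out[rating_col] >= min_rating
    meets_count = out[count_col] >= min_count
    out["high_quality"] = (meets_rating & meets_count).astype(int)
    return out

# Tiny made-up example to show the behavior
demo = pd.DataFrame({
    "averageRating": [4.8, 4.9, 3.2, 4.6],
    "ratingCount": [120, 3, 500, 50],
})
labeled = add_high_quality_label(demo)
print(labeled)
```

The second app has a high rating but only three ratings, so it is not labeled as high quality. This is the reason for combining both criteria.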

### 2.2 Data Overview
- Number of apps and reviews after filtering.
- Brief description of key variables (metadata and text).

### 2.3 Cleaning and Filtering
- Handle outliers (e.g., extreme prices, extremely low number of ratings).
- Check missing values in important variables and decide on imputation vs. dropping.
- Document inclusion criteria and any comparator groups (e.g., medical vs. non-medical apps).
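The cleaning steps above could be sketched as follows. This is a minimal example under stated assumptions: the column names `price_eur` and `review_count` match the dataset used in Section 4.1, while the 99th-percentile price cap and the minimum of 5 reviews are hypothetical choices, and `missing_share` and `filter_apps` are illustrative helper names.

```python
import numpy as np
import pandas as pd

def missing_share(df, cols):
    """Share of missing values per column, largest first."""
    return df[cols].isna().mean().sort_values(ascending=False)

def filter_apps(df, price_col="price_eur", count_col="review_count",
                max_price_quantile=0.99, min_count=5):
    """Drop extreme prices and apps with very few reviews; report what was dropped."""
    price_cap = df[price_col].quantile(max_price_quantile)
    # Missing prices are treated as 0 (free) here, which is itself an assumption
    price_ok = df[price_col].fillna(0) <= price_cap
    count_ok = df[count_col].fillna(0) >= min_count
    mask = price_ok & count_ok
    info = {"price_cap": float(price_cap), "dropped": int((~mask).sum())}
    return df.loc[mask].copy(), info

# Made-up data for illustration
demo = pd.DataFrame({
    "price_eur": [0.0, 0.0, 1.99, 4.99, 999.0, np.nan],
    "review_count": [10, 2, 30, 8, 50, 12],
})
ms = missing_share(demo, ["price_eur", "review_count"])
kept, info = filter_apps(demo)
print(ms)
print(info)
```

Returning the number of dropped rows alongside the filtered table makes it easier to document the inclusion criteria in the report.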

In [None]:
# 2. Data Understanding and Preparation
# TODO: Load your datasets here and perform basic checks
import pandas as pd

# Example:
# apps = pd.read_csv('apps.csv')
# reviews = pd.read_csv('reviews.csv')
# apps.head()


## 3. Data Exploration

- Explore distributions of ratings, number of ratings, prices, categories, etc.
- Visualize relevant relationships (e.g., rating vs. price, rating vs. category).
- Use basic text mining on reviews: word frequencies, simple sentiment or topic structure.
- Create and justify **new features** that may help prediction (e.g., sentiment score, review length, price bins).
- Comment on what these patterns suggest about app quality.
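The feature ideas listed above (review length, price bins, log-scaled counts) might look like this in pandas. The data is made up, the column names follow the dataset used in Section 4.1, and the bin edges are assumptions rather than recommended values.

```python
import numpy as np
import pandas as pd

# Made-up app and review tables for illustration
apps = pd.DataFrame({
    "app_id": [1, 2, 3],
    "price_eur": [0.0, 2.99, 14.99],
    "review_count": [0, 9, 99],
})
reviews = pd.DataFrame({
    "app_id": [2, 2, 3],
    "review": ["Great app", "Crashes a lot after the update", "ok"],
})

# Log-scale the heavily skewed review count
apps["log_review_count"] = np.log1p(apps["review_count"])

# Price bins: free, low, high (bin edges are hypothetical)
apps["price_bin"] = pd.cut(
    apps["price_eur"],
    bins=[-0.01, 0.0, 4.99, np.inf],
    labels=["free", "low", "high"],
)

# Review length in words, averaged per app
reviews["review_length"] = reviews["review"].str.split().str.len()
mean_len = (
    reviews.groupby("app_id")["review_length"]
    .mean()
    .rename("mean_review_length")
    .reset_index()
)
features = apps.merge(mean_len, on="app_id", how="left")
print(features)
```

Apps without any review text end up with a missing `mean_review_length`, which has to be handled explicitly before modeling.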

In [None]:
# 3. Data Exploration
# TODO: EDA plots and feature creation
import numpy as np  # needed for np.log1p in the example below
import matplotlib.pyplot as plt

# Example placeholder:
# apps['log_ratings'] = np.log1p(apps['ratingCount'])
# apps['averageRating'].hist()
# plt.show()


## 4. Modeling Approach

### 4.1 Review Sentiment with Zero-/Few-Shot Learning (SetFit or alternative)

1. Define sentiment classes (e.g., positive / neutral / negative).
2. Manually label a small, balanced subset of reviews.
3. Fine-tune a SetFit model or an LLM and evaluate its performance.
4. Aggregate predicted sentiment to the **app level** (e.g., share of positive reviews).

These aggregated sentiment metrics will be used as features in Section 4.2.
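Step 3 asks for an evaluation of the sentiment model. A minimal sketch of such an evaluation on a held-out, manually labeled set is shown below. The labels and predictions are made up; in practice `y_pred` would come from the fitted model's predictions on held-out review texts, and `evaluate_sentiment` is a hypothetical helper name.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate_sentiment(y_true, y_pred, labels=(0, 1)):
    """Basic classification metrics for a held-out, manually labeled review set."""
    labels = list(labels)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", labels=labels),
        # Rows are true classes, columns are predicted classes
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }

# Made-up labels (1 = positive, 0 = negative) and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = evaluate_sentiment(y_true, y_pred)
print(metrics["accuracy"])
print(metrics["confusion_matrix"])
```

The held-out reviews should be real store reviews that were not used for training, so that the estimate reflects performance on the data the model is actually applied to.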

In [3]:
# Install the libraries required for the SetFit model
!pip install setfit datasets tqdm pandas scikit-learn



In [4]:
import pandas as pd
import numpy as np
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset
from tqdm.notebook import tqdm
import os

# --- CONFIGURATION: ADJUST PATHS ---
# Enter your own paths to the CSV files here
# (Cem's paths are used as placeholders)
PATH_APPS = r'C:/Users/cbogo/Downloads/apple_apps_medizin.csv'
PATH_REVIEWS = r'C:/Users/cbogo/Downloads/app_reviews_medizin.csv'

# ==========================================
# STEP 1: LOAD & PREPARE DATA
# ==========================================
# ENTER THE NEW FILE NAMES HERE
PATH_APPS_NEW = r'C:/Users/cbogo/Downloads/apple_apps_medizin_1.csv' # <-- Adjust!
PATH_REVIEWS_NEW = r'C:/Users/cbogo/Downloads/app_reviews_medizin_1.csv' # <-- Adjust!

print("1. Loading the NEW dataset...")
try:
    apps_df = pd.read_csv(PATH_APPS_NEW, low_memory=False)
    reviews_df = pd.read_csv(PATH_REVIEWS_NEW, low_memory=False)

    # Check: are the new columns really present?
    if 'price_eur' in apps_df.columns and 'privacy_not_collected_bool' in apps_df.columns:
        print("--> Great! New fields 'price_eur' and 'privacy' found.")
    else:
        print("WARNING: New fields not found. Did you load the correct file?")

    # Continue working on a copy
    filtered_apps = apps_df.copy()

    # --- CONVERSION OF THE NEW FIELDS ---

    # 1. Price: use 'price_eur' directly (no custom cleaning needed anymore)
    # Missing values are assumed to be 0.0
    filtered_apps['price_numeric'] = pd.to_numeric(filtered_apps['price_eur'], errors='coerce').fillna(0.0)

    # 2. Make sure the review count is numeric (text to number)
    filtered_apps['review_count'] = pd.to_numeric(filtered_apps['review_count'], errors='coerce').fillna(0).astype(int)

    # 3. Prepare the privacy flag for later (bool to number: True=1, False=0)
    # This is needed in Section 4.2
    filtered_apps['privacy_flag'] = filtered_apps['privacy_not_collected_bool'].astype(int)

    print(f"--> Apps loaded: {len(filtered_apps)}")
    print(f"--> Reviews loaded: {len(reviews_df)}")
    print("--> Data preparation complete (with new fields).")

except FileNotFoundError:
    print("ERROR: The new files were not found. Please check the paths!")
    raise  # Stop here: the following steps need apps_df and reviews_df

# ==========================================
# STEP 2: MODEL TRAINING (FEW-SHOT)
# ==========================================
print("\n2. Training sentiment model (few-shot learning)...")

# We use a hand-written, balanced set (8 positive / 8 negative).
# This avoids class imbalance, since randomly sampled reviews are often about 90% positive.
training_data = {
    "text": [
        # POSITIVE (label 1)
        "Die App hilft mir sehr gut, meine Medikamente zu managen.",
        "Endlich eine Übersicht, die einfach zu bedienen ist. Super!",
        "Tolle Funktionen und sehr übersichtlich gestaltet.",
        "Der Symptom-Tracker ist genau das, was ich gesucht habe.",
        "Sehr hilfreich im Alltag mit meiner Krankheit.",
        "Schnelle Ladezeiten und stürzt nie ab. Perfekt.",
        "Ich fühle mich durch die App viel sicherer.",
        "Klasse Support und regelmäßige Updates.",
        # NEGATIVE (label 0)
        "Die App stürzt ständig ab, völlig unbrauchbar.",
        "Viel zu teuer für das, was geboten wird.",
        "Datenschutz ist hier eine Katastrophe. Nie wieder.",
        "Funktioniert nach dem Update überhaupt nicht mehr.",
        "Werbung nervt total und macht die Nutzung unmöglich.",
        "Unübersichtlich und kompliziert. Ich lösche sie wieder.",
        "Die Verbindung zum Server bricht andauernd ab.",
        "Leider keine Hilfe, reine Zeitverschwendung."
    ],
    "label": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
}

# Create the dataset
train_dataset = Dataset.from_dict(training_data)

# Load the model (multilingual, for German/English)
model_id = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id, device="cpu") # Force CPU for compatibility

# Initialize the trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    metric="accuracy"
)

# Start training (takes about 1-2 minutes)
# The original call was trainer.train(batch_size=4, epochs=1), but the training log
# reports batch size 16, so those keyword arguments were evidently not applied.
# They are dropped here; set batch size and epochs when configuring the trainer instead.
trainer.train()
print("--> Training complete.")

# ==========================================
# STEP 3: INFERENCE (APPLY TO REAL DATA)
# ==========================================
print("\n3. Starting analysis of the real reviews...")

# Keep only reviews of the relevant apps
valid_ids = filtered_apps['app_id'].unique()
reviews_analysis = reviews_df[reviews_df['app_id'].isin(valid_ids)].copy()
reviews_analysis = reviews_analysis.dropna(subset=['review'])

# PERFORMANCE OPTIMIZATION:
# Use at most 10 reviews per app. head(10) keeps the first 10 rows per app in
# file order, so these are the newest reviews only if the file is sorted that way.
# Reason: analyzing all 70k reviews would take >10 hours on CPU.
# 10 reviews per app give a rough trend, not a precise estimate.
reviews_analysis = reviews_analysis.groupby('app_id').head(10)

print(f"--> Analyzing {len(reviews_analysis)} reviews (up to 10 per app)...")

# Predict in batches, with a progress bar
texts = reviews_analysis["review"].tolist()
results = []
batch_size = 8

for i in tqdm(range(0, len(texts), batch_size), desc="Running sentiment inference"):
    batch = texts[i : i + batch_size]
    preds = model.predict(batch)
    results.extend(preds.tolist())

reviews_analysis["ai_sentiment"] = results

# ==========================================
# STEP 4: AGGREGATION & FEATURE ENGINEERING
# ==========================================
print("\n4. Computing scores per app...")

# Compute the mean (1 = positive, 0 = negative)
app_stats = reviews_analysis.groupby('app_id')['ai_sentiment'].agg(['mean', 'count']).reset_index()
app_stats.columns = ['app_id', 'sentiment_score', 'sentiment_count_ai']

# Merge with the app table
# (Drop old columns first, if present, to avoid errors when the cell is run more than once)
cols_to_drop = ['sentiment_score', 'sentiment_count_ai', 'share_positive', 'share_negative']
filtered_apps = filtered_apps.drop(columns=[c for c in cols_to_drop if c in filtered_apps.columns])

filtered_apps = filtered_apps.merge(app_stats, on='app_id', how='left')

# Fill missing values (apps without text reviews get a score of 0.5 = neutral)
filtered_apps['sentiment_score'] = filtered_apps['sentiment_score'].fillna(0.5)

# Create the required features
filtered_apps['share_positive'] = filtered_apps['sentiment_score']       # Identical to the score
filtered_apps['share_negative'] = 1.0 - filtered_apps['share_positive']  # The remainder is negative

print("-" * 40)
print("DONE! Section 4.1 complete.")
print("-" * 40)
print(filtered_apps[['app_name', 'review_count', 'share_positive', 'share_negative']].head(10))

1. Loading the NEW dataset...
--> Great! New fields 'price_eur' and 'privacy' found.
--> Apps loaded: 46324
--> Reviews loaded: 443942
--> Data preparation complete (with new fields).

2. Training sentiment model (few-shot learning)...

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.

***** Running training *****
  Num unique pairs = 640
  Batch size = 16
  Num epochs = 1

Step,Training Loss
1,0.2885

--> Training complete.

3. Starting analysis of the real reviews...
--> Analyzing 30524 reviews (up to 10 per app)...

4. Computing scores per app...
----------------------------------------
DONE! Section 4.1 complete.
----------------------------------------
                                            app_name  review_count  \
0         NCLEX-RN tests - practice exam preparation             0   
1                                    Studio360 Cycle             0   
2                                      Smoke Finance             0   
3                               Women's Golf Network             0   
4  White Noise - Natural Calm Sounds for Sleep Cycle             1   
5   Army Fitness Workout Exercises & APFT Calculator             0   
6                                     Champs Fitness             0   
7                          Betsy's Health Foods Inc.             0   
8                                   VT1 Martial Arts             0   
9                                  REDCap Mobile App             0   

   share_positive  share_negative  
0             0.5             0.5  
1             

In [5]:
# --- QUALITY CHECK (sanity check) ---

# 1. Investigate "ghost reviews"
# Find apps that have 0 reviews according to the store but a model-based score
ghost_apps = filtered_apps[
    (filtered_apps['review_count'] == 0) &
    (filtered_apps['share_positive'] != 0.5) # Not equal to 0.5 means: the model scored real text
]

print(f"Apps with '0 reviews' but a real sentiment score: {len(ghost_apps)}")

if len(ghost_apps) > 0:
    sample_id = ghost_apps.iloc[0]['app_id']
    print(f"\nEvidence for app ID {sample_id} ({ghost_apps.iloc[0]['app_name']}):")
    # Fetch the text from the review table
    texts = reviews_analysis[reviews_analysis['app_id'] == sample_id]['review'].tolist()
    print(f"--> Review text found: '{texts[0]}'")
    print("(This is where the score comes from. The review_count of 0 was wrong.)")

# 2. How many scores are "real" vs. "imputed" (0.5)?
# Caveat: apps whose real score happens to be exactly 0.5 are also counted as imputed.
# Testing filtered_apps['sentiment_count_ai'].isna() would be an exact check.
total = len(filtered_apps)
filled = len(filtered_apps[filtered_apps['share_positive'] == 0.5])
real = total - filled

print("\n" + "-"*30)
print("DATA QUALITY OVERVIEW")
print("-"*30)
print(f"Total apps:               {total}")
print(f"With real review texts:   {real}   ({(real/total)*100:.1f}%)")
print(f"Imputed (0.5):            {filled} ({(filled/total)*100:.1f}%)")

Apps with '0 reviews' but a real sentiment score: 180

Evidence for app ID 1265836047 (Breathing Box):
--> Review text found: 'Voice is very cold and App turns itself off when phone locks'
(This is where the score comes from. The review_count of 0 was wrong.)

------------------------------
DATA QUALITY OVERVIEW
------------------------------
Total apps:               46324
With real review texts:   5924   (12.8%)
Imputed (0.5):            40400 (87.2%)
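Since most apps receive the neutral fill value, it may help to keep an explicit indicator of whether a score was observed or imputed, so that the model in Section 4.2 can tell the two apart. The sketch below uses made-up data and the column names from the cell above; `sentiment_missing` is a hypothetical new column, not part of the original pipeline.

```python
import pandas as pd

# Made-up tables: app 1 has a real score of exactly 0.5, app 3 has no reviews
apps = pd.DataFrame({"app_id": [1, 2, 3]})
app_stats = pd.DataFrame({
    "app_id": [1, 2],
    "sentiment_score": [0.5, 0.9],
    "sentiment_count_ai": [10, 4],
})

merged = apps.merge(app_stats, on="app_id", how="left")

# Flag imputed rows BEFORE filling, using the review count rather than the score
merged["sentiment_missing"] = merged["sentiment_count_ai"].isna().astype(int)
merged["sentiment_count_ai"] = merged["sentiment_count_ai"].fillna(0).astype(int)
merged["sentiment_score"] = merged["sentiment_score"].fillna(0.5)

print(merged)
```

With this flag, an app with a genuinely mixed score of 0.5 is no longer confused with an app that has no analyzed reviews at all.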


### 4.2 Predictive Modeling of App Quality

1. **Define the target** variable at app level (e.g., high_quality = 1 if avg rating ≥ threshold and sufficient rating count).
2. **Model A – Simple & interpretable:** Logistic Regression or a small Decision Tree.
3. **Model B – More powerful:** e.g., Random Forest or Gradient Boosting with basic hyperparameter tuning.
4. Compare performance (accuracy, precision, recall, F1, ROC-AUC, etc.) and comment on the trade-off between interpretability and performance.
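The comparison of Model A and Model B could be organized as in the following sketch. It runs on synthetic data from `make_classification` so that it is self-contained. The real feature matrix, the target definition, and any hyperparameter tuning are left to the portfolio, and the model settings shown are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the app-level feature matrix and high_quality target
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Model A: simple and interpretable
    "model_a_logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Model B: more flexible ensemble
    "model_b_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores)
```

Stratifying the split keeps the class ratio similar in the training and test sets, which matters when high-quality apps are the minority class.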

In [None]:
# 4.2 Predictive Modeling of App Quality
# TODO: Build train/test split, fit Model A and Model B, and evaluate
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Example placeholder:
# X = apps_model_features
# y = apps['high_quality']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ...


## 5. Interpretation and Argumentation of Results

1. **Model Interpretation / Explainable AI**
- Inspect and visualize feature importance (e.g., SHAP values or model-specific importances).
- Discuss which features most strongly influence predicted app quality.

2. **Fairness & Bias Reflection**
- Where could sampling bias, measurement error, or missing data affect your results?
- Briefly relate your reflections to fairness notions mentioned in the course.

3. **LLM / SetFit as Method**
- Discuss where these methods might introduce bias or instability.
- Mention how sensitive your results are to label definitions or prompts (short reflection).

4. **Practical Insights for the Clinic**
- List 2–4 concrete, comprehensible recommendations that the mental health unit could use.
- Focus on what your results *suggest they should pay attention to* when recommending apps.
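For the model interpretation in item 1, permutation importance is one model-agnostic alternative to SHAP or model-specific importances. The sketch below uses synthetic data in which only the first feature carries signal. The feature names are borrowed from Section 4.1 for readability, but the data and the resulting ranking are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends almost entirely on the first feature
rng = np.random.default_rng(0)
feature_names = ["sentiment_score", "price_numeric", "log_review_count", "privacy_flag"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in ROC-AUC
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="roc_auc"
)
importance = pd.Series(result.importances_mean, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance)
```

Computing the importances on held-out data shows which features the model relies on for generalization rather than for fitting the training set.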

## 6. AI Tools and References

- Briefly describe where AI tools (e.g., ChatGPT, Copilot) were used, in line with FU guidelines.
- List key papers, blog posts, or documentation that you relied on for methods.
