# Scalable AI Recommendation System for Large Datasets
Portfolio-ready notebook for building a recommender across movies, online courses, and products. It follows the IU portfolio structure (Konzeptionsphase, Erarbeitungs-/Reflexionsphase, Finalisierungsphase) and includes guidance for a Streamlit demo.

## Notebook Navigation
- Work top to bottom; finish and submit Phase 1 before starting Phase 2; finish and submit Phase 2 before Phase 3.
- Keep page limits: Phase 1 concept <= 0.5 page; Phase 2 implementation note <= 0.5 page; Phase 3 abstract = 2 pages.
- Upload each Portfolioteil to PebblePad; during Phase 3, re-upload the updated Phase 1 and Phase 2 parts and confirm the submission in Atlas.
- Feedback after Phase 1 and Phase 2 may take up to 7 days; final feedback/grade after Phase 3 within 6 weeks.
- Submit the Eidesstattliche Erklärung (statutory declaration) electronically via myCampus; follow IU citation and literature rules.

## Submission Rules (condensed from IU portfolio guides)
- Complete the three phases in the given order. Submitting out of order invalidates the attempt.
- Portfolioteil 1 (Konzept): short text (<=0.5 page) plus early sketches or wireframes.
- Portfolioteil 2 (Erarbeitung/Reflexion): short text (<=0.5 page) plus digital draft (prototype screenshot or partial app).
- Portfolioteil 3 (Finalisierung): 2-page abstract PDF + final product + re-uploads of Portfolioteil 1 and 2 (updated) + optional zip with resources.
- The portfolio documents the full learning path: goals, method, execution, reflection, and final artefact.
- Submission channel: PebblePad; verification and final feedback in Atlas.

## Suggested 4-week Timeline
| Day | Milestone |
| --- | --- |
| 1 | Kick-off, define goals, secure data access, outline evaluation. |
| 7 | Submit Phase 1 concept + sketches to PebblePad. |
| 14 | Receive and incorporate Phase 1 feedback; continue building the pipeline and prototype. |
| 21 | Submit Phase 2 implementation note + digital draft. |
| 28 | Finalize models/UI; upload Phase 3 with abstract + final product + updated Phase 1 and 2. |

## Phase 1 – Konzeptionsphase (Portfolioteil 1)
### Problem and goal (answer concisely)
- What user problem are we solving? (movie/course/product discovery, cold start, relevance)
- What is the measurable goal? (e.g., +X% NDCG@10, sub-300 ms latency)
- Which users and item domains are in scope?

### Data sources
- Pick one or more large interaction datasets (MovieLens 25M, Amazon Reviews, course click logs).
- Note size, sparsity, license, and sampling choices.
- Plan item metadata for content features: genres, tags, embeddings from text or images.
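
The size and sparsity figures asked for above can be computed directly from the interaction log. The following is a minimal sketch, assuming interactions are available as `(user_id, item_id)` pairs; the function name `interaction_stats` and the toy data are illustrative and not part of the notebook.

```python
def interaction_stats(pairs):
    """Return basic size and sparsity statistics for (user_id, item_id) pairs."""
    unique_pairs = set(pairs)                   # collapse repeated interactions
    users = {u for u, _ in unique_pairs}
    items = {i for _, i in unique_pairs}
    n_cells = len(users) * len(items)           # size of the full user-item grid
    density = len(unique_pairs) / n_cells if n_cells else 0.0
    return {
        "n_users": len(users),
        "n_items": len(items),
        "n_interactions": len(unique_pairs),
        "density": density,
        "sparsity": 1.0 - density,
    }

# Toy example (hypothetical data): 3 users, 4 items, 5 distinct interactions.
toy = [("u1", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c"), ("u3", "d"), ("u1", "a")]
stats = interaction_stats(toy)
```

On a real dataset the same counts would normally come from a Polars or Pandas group-by rather than Python sets; the sets only keep the sketch dependency-free.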

### Method and tools
- Collaborative filtering: implicit ALS or BPR; sparse matrix factorization; fast item-item cosine similarity.
- Content-based: TF-IDF or embedding similarity; ANN index (faiss/hnswlib/annoy).
- Hybrid: weighted blend or model stacking; backfill with popularity for cold start.
- Dimensionality reduction: truncated SVD to compress signals; SVD++ if implicit feedback should enter the factorization itself.
- Stack suggestion: Python, Polars/Pandas, Scipy sparse, Faiss/HNSWLib, Streamlit, Docker (optional).
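
To illustrate the dimensionality-reduction item, here is a minimal sketch of a rank-k truncated SVD on a small dense matrix with NumPy. The function name `truncated_svd_factors` and the toy ratings are assumptions for illustration; on large sparse data a sparse solver such as `scipy.sparse.linalg.svds` would be the more likely choice.

```python
import numpy as np

def truncated_svd_factors(matrix, k):
    """Return rank-k user and item factors from a dense interaction matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    user_factors = u[:, :k] * s[:k]      # scale user vectors by singular values
    item_factors = vt[:k, :].T           # one k-dimensional vector per item
    return user_factors, item_factors

# Toy 4 users x 5 items matrix (hypothetical ratings, 0 = no interaction).
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
], dtype=float)

user_factors, item_factors = truncated_svd_factors(ratings, k=2)
approx = user_factors @ item_factors.T   # rank-2 reconstruction of the matrix
```

The item factors can then feed an ANN index, and `user_factors @ item_factors.T` gives predicted affinities for every user-item pair.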

### Success metrics and SLOs
- Offline: NDCG@k, HR@k, MAP, coverage, diversity, catalog freshness.
- Online/UX: latency budget (<300 ms per query), memory budget, update cadence.
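
Two of the offline metrics can be written in a few lines. This is a sketch under stated assumptions: relevance is binary, each ranked list contains no duplicates, and the function names `hit_rate_at_k` and `ndcg_at_k` are illustrative rather than taken from a specific library.

```python
import math

def hit_rate_at_k(ranked_items, relevant_items, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)        # rank is 0-based, so position 1 uses log2(2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical data): relevant items sit at positions 2 and 4.
ranking = ["a", "b", "c", "d"]
relevant = {"b", "d"}
hr = hit_rate_at_k(ranking, relevant, k=3)
ndcg = ndcg_at_k(ranking, relevant, k=3)
```

Averaging these per-user values over all test users gives the HR@k and NDCG@k figures to record in the experiment tracker.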

### Risks and mitigations
- Cold start -> content features and popularity backfill.
- Bias toward head items -> re-ranking with diversity or explore-exploit.
- Compute limits -> batch training, incremental updates, vector quantization for ANN.
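
The cold-start mitigation can be sketched as a popularity backfill: when the personalized list is too short, it is padded with globally popular items the user has not seen. The helper names `popularity_ranking` and `backfill` and the toy log are assumptions for illustration.

```python
from collections import Counter

def popularity_ranking(interactions):
    """Rank items by interaction count, most popular first (ties by item id)."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def backfill(personalized, popular, seen, top_n):
    """Pad a short personalized list with popular items the user has not seen."""
    result = list(personalized[:top_n])
    for item in popular:
        if len(result) >= top_n:
            break
        if item not in seen and item not in result:
            result.append(item)
    return result

# Toy interaction log of (user_id, item_id) pairs (hypothetical data).
log = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")]
popular = popularity_ranking(log)
new_user_recs = backfill([], popular, seen=set(), top_n=2)
sparse_user_recs = backfill(["c"], popular, seen={"a"}, top_n=3)
```

A brand-new user receives the popularity list as is, while a user with a little history keeps the personalized items at the top and only gets popular items as padding.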

### Deliverable (submit before Phase 2)
- Concept text (<=0.5 page) covering goal, idea, chosen methods, tools, data.
- Initial sketch: system diagram or Streamlit wireframe.
- Upload as Portfolioteil 1 in PebblePad and wait for feedback before proceeding.

### Phase 1 draft slots
- Concept text (keep <=0.5 page):
  - ...
- Sketch or architecture reference (file path or link):
  - ...
- Notes to incorporate after feedback:
  - ...

## Phase 2 – Erarbeitungs-/Reflexionsphase (Portfolioteil 2)
### Implementation plan
- Data pipeline: ingest -> clean -> encode -> split -> persist sparse matrices and feature stores.
- Models:
  - Collaborative filtering (implicit ALS/BPR) on sparse interactions.
  - Content-based similarity on item text/metadata embeddings.
  - Hybrid scorer = w1*CF + w2*Content (+ optional popularity prior); tune weights on validation.
- Retrieval speed: build ANN index for item vectors; batch precompute user vectors; cache top-N per segment.
- Streamlit UI: selectors (user/item), recommended list with scores, "Recommended because you liked X" explainer, filters (domain, recency).
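
The hybrid scorer from the plan can be sketched as follows. The min-max normalization step, the function names, and the default weights are assumptions chosen for illustration; the weights are placeholders to be tuned on validation data.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_scores(cf, content, popularity=None, w1=0.6, w2=0.3, w3=0.1):
    """Weighted blend w1*CF + w2*Content (+ w3*popularity prior if given)."""
    blended = w1 * min_max(cf) + w2 * min_max(content)
    if popularity is not None:
        blended = blended + w3 * min_max(popularity)
    return blended

# Toy scores for four candidate items (hypothetical values).
cf = [0.2, 0.9, 0.1, 0.5]
content = [0.8, 0.1, 0.3, 0.7]
scores = hybrid_scores(cf, content)
top_items = np.argsort(-scores)[:2]      # indices of the two best items
```

Normalizing each signal before blending keeps one scorer from dominating simply because its raw scores live on a larger scale.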

### Experiment tracker (add rows)
| Date | Model/params | Metric@k | Latency | Notes / next step |
| ---- | ------------ | -------- | ------- | ----------------- |
|      |              |          |         |                   |

### Reflection log (summarize to <=0.5 page for submission)
- Status vs plan; scope changes.
- Risks encountered and mitigations.
- Resource use (time/hardware); blockers.
- Evidence of progress: link to prototype screenshot or partial Streamlit run.

### Deliverable (submit before Phase 3)
- Implementation note (<=0.5 page) + digital draft/prototype screenshot.
- Upload as Portfolioteil 2 in PebblePad; incorporate feedback (allow up to 7 days).

In [None]:
# Skeleton: lightweight item-item recommender using cosine similarity
# Un-comment imports when libraries are installed and data paths are set.

# import numpy as np
# import pandas as pd
# from scipy import sparse
# from sklearn.metrics.pairwise import cosine_similarity

# interactions = pd.read_csv("data/interactions.csv")  # user_id, item_id, rating or implicit score
# user_index = {u: i for i, u in enumerate(interactions.user_id.unique())}
# item_index = {i: j for j, i in enumerate(interactions.item_id.unique())}
# inv_item_index = {j: i for i, j in item_index.items()}  # built once, not on every call

# rows = interactions.user_id.map(user_index).to_numpy()
# cols = interactions.item_id.map(item_index).to_numpy()
# NOTE: coo_matrix needs one value per (row, col) pair, so fall back to a vector of ones.
# if "event_strength" in interactions.columns:
#     data = interactions["event_strength"].to_numpy(dtype=float)
# else:
#     data = np.ones(len(interactions))
# shape = (len(user_index), len(item_index))
# matrix = sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()

# item_sims = cosine_similarity(matrix.T, dense_output=False)  # sparse item-item similarities

# def recommend_for_user(user_id, top_n=10):
#     uid = user_index[user_id]
#     scores = matrix[uid].dot(item_sims).toarray().ravel()
#     scores[matrix[uid].indices] = -np.inf  # skip items the user already interacted with
#     top_items = scores.argsort()[::-1][:top_n]
#     return [(inv_item_index[i], float(scores[i])) for i in top_items]

# Example:
# recs = recommend_for_user(list(user_index.keys())[0], top_n=5)
# print(recs)


In [None]:
# Streamlit UI sketch (save as app.py when ready)
# import streamlit as st
# users = []  # populate from user_index keys
# items = []  # populate from item_index keys
# st.title("AI Recommender Demo")
# mode = st.radio("Mode", ["User", "Item seed"])
# if mode == "User":
#     user = st.selectbox("Pick a user", users)
#     if st.button("Recommend"):
#         recs = recommend_for_user(user)
#         for rank, (item, score) in enumerate(recs, start=1):
#             st.write(f"{rank}. {item} (score={score:.3f})")
#         # One caption for the whole list, not one per item.
#         st.caption("Recommended because these items are similar to ones you interacted with.")
# else:
#     seed = st.selectbox("Pick an item you liked", items)
#     # TODO: add item-item recommendations from row item_index[seed] of item_sims
#     # (item_sims is indexed by integer position, not by raw item id).
#     st.caption("Explain: similar content or co-consumed with the selected item.")


## Phase 3 – Finalisierungsphase (Portfolioteil 3)
### Abstract template (2 pages, PDF for submission)
- Context and goal (Problemabgrenzung/Zielsetzung).
- Methodology (CF, content, hybrid, SVD, ANN; data choice; why).
- Implementation highlights (pipeline, scaling decisions, Streamlit UI, monitoring).
- Results (offline metrics, latency, qualitative examples).
- Reflection (what worked, gaps, next steps).
- Resources (data sources, software, versions).

### Final product checklist
- Final trained artefacts or reproducible pipeline (notebook or scripts).
- Streamlit app ready to run (`streamlit run app.py` or container notes).
- Updated Portfolioteil 1 and Portfolioteil 2 files.
- Optional zip with additional resources.

### Submission steps
- Upload abstract PDF, final product, and updated P1/P2 to PebblePad.
- Verify submission in Atlas; note final feedback window (<=6 weeks).
- Submit the Eidesstattliche Erklärung (statutory declaration) in myCampus.

### Reflection prompts
- Did outcomes meet goals and metrics? If not, why?
- What would you change with more time or data?
- How does the solution generalize across movies, courses, products?

## Rubric alignment (map work to grading weights)
| Criterion (Kriterium) | Weight | Where addressed |
| --------- | ------- | ---------------- |
| Problemabgrenzung/Zielsetzung (problem definition/objectives) | 10% | Phase 1 goal statement; Abstract intro. |
| Methodik/Idee/Vorgehen (methodology/idea/approach) | 20% | Phase 1 method plan; Phase 2 experiment design; Abstract method. |
| Qualität der Umsetzung (quality of implementation) | 40% | Phase 2 pipeline/prototype; Phase 3 final app + metrics. |
| Kreativität/Richtigkeit (creativity/correctness) | 20% | Hybrid strategy, explainability, UI choices, metric gains. |
| Formalia (formal requirements) | 10% | Page limits, citation style, PebblePad/Atlas process, Eidesstattliche Erklärung. |


## Big-data and scalability notes
- Store interactions as sparse CSR/CSC matrices; keep user/item index arrays as int32; use chunked ETL with Polars or DuckDB when data exceeds RAM.
- Update models in batches or incrementally; schedule nightly retrains; rebuild or refresh the ANN index from the retrained vectors.
- ANN choices: hnswlib for high recall at low latency on CPU; faiss for GPU search or compressed IVF-PQ indexes; annoy for lightweight, memory-mapped CPU indexes. Alternatively, precompute item-item top-K lists offline.
- Cache popular items per segment for cold start; use content encoder to embed new items immediately.
- Measure latency end-to-end (vector lookup + ANN + re-ranking) and log to a monitoring dashboard.
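
End-to-end latency can be measured with the standard library before any monitoring stack exists. This is a minimal sketch: `measure_latency` is an illustrative helper, and `fake_recommend` is a hypothetical stand-in for the real lookup, ANN, and re-ranking path.

```python
import statistics
import time

def measure_latency(fn, queries, warmup=2):
    """Time fn(query) for each query and return latency percentiles in ms."""
    for q in queries[:warmup]:
        fn(q)                                   # warm caches before timing
    samples = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples), "n": len(samples)}

def fake_recommend(user_id):
    """Stand-in for vector lookup + ANN + re-ranking (hypothetical workload)."""
    return sorted(range(1000), key=lambda i: (i * 7919 + user_id) % 1000)[:10]

report = measure_latency(fake_recommend, list(range(50)))
within_budget = report["p95"] < 300.0           # 300 ms per-query budget from the SLOs
```

Reporting p95 rather than the mean keeps occasional slow queries visible, which matters for a per-query latency budget.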


## Attachments / artefact paths (fill in)
- Data location: `...`
- Model artefacts: `...`
- Streamlit entrypoint: `app.py`
- Abstract PDF: `abstract.pdf`
- Phase 1 concept file: `phase1_concept.pdf`
- Phase 2 draft file: `phase2_draft.pdf`
- Optional resources zip: `resources.zip`
