### ***Collaborative filtering***

Collaborative filtering is a method used by recommender systems to make automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). 

The underlying assumption of collaborative filtering is that if person A shares person B's opinion on one issue, A is more likely to share B's opinion on a different issue than the opinion of a randomly chosen person.

***Types of Collaborative Filtering***
- User-user collaborative filtering - Uses a user similarity matrix to recommend items liked by similar users
- Item-item collaborative filtering - Uses an item similarity matrix to recommend items similar to those the user already liked
- Hybrid methods - Combine collaborative filtering with content-based filtering for better recommendations

***User-Based CF***

- Finds users similar to the target user
- Recommends items liked by similar users

📘 Example:
- “People who have similar movie tastes to you also liked Interstellar.”

🔧 Uses: Cosine similarity or Pearson correlation between user rating vectors.
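
As a rough illustration of the two similarity measures named above, the sketch below compares user rating vectors on a small made-up matrix. The data, the `0 = not rated` convention, and the helper names `cosine_similarity` and `pearson_similarity` are assumptions for this example only.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def pearson_similarity(u, v):
    """Pearson correlation computed only over items both users rated."""
    both = (u > 0) & (v > 0)
    if both.sum() < 2:
        return 0.0  # not enough overlap to measure correlation
    uc = u[both] - u[both].mean()
    vc = v[both] - v[both].mean()
    denom = np.linalg.norm(uc) * np.linalg.norm(vc)
    return float(uc @ vc / denom) if denom > 0 else 0.0

cos_01 = cosine_similarity(ratings[0], ratings[1])
cos_02 = cosine_similarity(ratings[0], ratings[2])
pear_01 = pearson_similarity(ratings[0], ratings[1])
pear_02 = pearson_similarity(ratings[0], ratings[2])
```

In this toy data, users 0 and 1 agree closely, while users 0 and 2 disagree on the two items they both rated, so Pearson correlation for that pair is negative even though cosine similarity stays positive.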

***Item-Based CF***
- Finds items similar to the ones the user liked
- Recommends similar items

📘 Example:
- “Since you liked Inception, you might like Shutter Island.”

🔧 Uses: Similarity between item rating vectors (columns instead of rows).
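
The "columns instead of rows" idea can be sketched by transposing the user-item matrix so that each row holds one item's ratings. The matrix values and the third title are invented for illustration, and `most_similar_item` is a hypothetical helper.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 = not rated.
item_names = ["Inception", "Shutter Island", "Frozen"]
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 0.0],
    [1.0, 0.0, 5.0],
    [5.0, 5.0, 2.0],
])

# Transposing puts one item per row, so each row is that item's rating vector.
item_vectors = ratings.T
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for never-rated items
unit = item_vectors / norms
item_sim = unit @ unit.T  # (n_items, n_items) cosine similarity

def most_similar_item(item_index):
    """Return the name of the item closest to item_index, excluding itself."""
    scores = item_sim[item_index].copy()
    scores[item_index] = -np.inf
    return item_names[int(np.argmax(scores))]

suggestion = most_similar_item(item_names.index("Inception"))
```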

***Order of implementation***

1. Import necessary libraries  
    - pandas, numpy, scipy.sparse, sklearn.metrics, sklearn.model_selection  
    - similarity / nearest-neighbors: sklearn.neighbors, sklearn.metrics.pairwise or libraries like implicit, faiss  
    - optional: surprise, lightfm, matplotlib / seaborn for EDA and plotting

2. Load the dataset  
    - read CSV/Parquet or database, inspect columns (user_id, item_id, rating, timestamp)  
    - handle duplicates, outliers, and type conversions  
    - create time-based train/validation/test splits if applicable
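
A minimal sketch of the loading step, assuming the four columns listed above. The in-memory frame stands in for a real `pd.read_csv` call, and the 75/25 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical interaction log; in practice this would come from a file or database.
df = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2", "u2", "u3", "u3", "u1", "u2"],
    "item_id":   ["i1", "i2", "i1", "i3", "i2", "i3", "i2", "i4"],
    "rating":    [5, 3, 4, 2, 5, 1, 4, 3],
    "timestamp": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Keep only the most recent rating for each (user, item) pair.
df = (
    df.sort_values("timestamp")
      .drop_duplicates(subset=["user_id", "item_id"], keep="last")
      .reset_index(drop=True)
)

# Temporal split: the oldest 75% of rows train the model, the newest 25% test it.
cutoff = int(len(df) * 0.75)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]
```

Splitting on time rather than at random keeps future interactions out of the training set, which is closer to how the model would be used after deployment.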

3. Create a user–item matrix  
    - pivot to dense or sparse matrix (scipy.sparse.csr_matrix)  
    - preserve implicit feedback (counts) vs explicit ratings semantics  
    - keep mappings for user_id/item_id ↔ matrix indices
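
One possible way to build the sparse matrix and its id mappings is shown below. The table contents are made up, and the mapping names (`user_to_index`, `index_to_item`, and so on) are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings table with user_id, item_id, and rating columns.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "item_id": ["i1", "i2", "i2", "i1", "i3"],
    "rating":  [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Categorical codes give contiguous 0-based row and column indices.
user_cat = df["user_id"].astype("category")
item_cat = df["item_id"].astype("category")
rows = user_cat.cat.codes.to_numpy()
cols = item_cat.cat.codes.to_numpy()

# Mappings between raw ids and matrix indices, in both directions.
index_to_user = dict(enumerate(user_cat.cat.categories))
index_to_item = dict(enumerate(item_cat.cat.categories))
user_to_index = {u: i for i, u in index_to_user.items()}
item_to_index = {it: j for j, it in index_to_item.items()}

# Note: csr_matrix sums duplicate (row, col) entries, so deduplicate beforehand.
matrix = csr_matrix(
    (df["rating"].to_numpy(), (rows, cols)),
    shape=(len(index_to_user), len(index_to_item)),
)
```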

4. Handle missing values and normalization  
    - for explicit CF: leave as NaN and predict only missing entries, or mean-center rows for similarity  
    - for implicit CF: convert to binary or confidence-weighted values (e.g., confidence = 1 + alpha * interactions)  
    - apply global/user/item bias removal or scaling as needed
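
The two normalization routes above might look like this on small arrays. The data is invented, and `alpha = 40.0` is only an assumed starting value that would need tuning.

```python
import numpy as np

# Hypothetical explicit ratings, with NaN marking unrated items.
ratings = np.array([
    [5.0, 3.0, np.nan],
    [np.nan, 4.0, 2.0],
])

# Mean-center each user's row using only the ratings that exist.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = ratings - user_means                      # NaN stays NaN
centered_filled = np.nan_to_num(centered, nan=0.0)   # unrated items contribute nothing

# Hypothetical implicit feedback: interaction counts per (user, item).
interactions = np.array([
    [3, 0, 1],
    [0, 2, 0],
])
alpha = 40.0  # assumed weighting constant; tune on validation data
preference = (interactions > 0).astype(float)  # binary "did interact" signal
confidence = 1.0 + alpha * interactions         # confidence = 1 + alpha * interactions
```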

5. Compute similarity matrix  
    - user-based: cosine or Pearson correlation (cosine on mean-centered rows over co-rated items)  
    - item-based: cosine, adjusted cosine (subtract each user's mean rating before comparing item columns), or item-item correlation  
    - optimize: compute only top-k neighbors, use approximate nearest neighbors or sparse operations
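
To avoid materializing a full similarity matrix, one option is to ask scikit-learn's `NearestNeighbors` for only the top-k neighbors of each row of a sparse matrix. The sketch below assumes a tiny made-up matrix and `k = 1`. Calling `kneighbors()` with no argument returns neighbors of the fitted rows and excludes each row itself.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-item matrix (rows = users, columns = items, 0 = no rating).
user_item = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]))

k = 1  # number of neighbors to keep per user
nn = NearestNeighbors(n_neighbors=k, metric="cosine", algorithm="brute")
nn.fit(user_item)

# With no query argument, each fitted row is matched against the other rows only.
distances, neighbor_ids = nn.kneighbors()
neighbor_sims = 1.0 - distances  # cosine distance -> cosine similarity
```

Fitting on `user_item.T` instead would give item-item neighbors with the same code.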

6. Predict missing ratings (scoring)  
    - neighborhood-based: weighted sum of neighbor ratings with normalization and regularization  
    - model-based: matrix factorization (ALS, SGD), factorization machines, or neural approaches  
    - apply baseline corrections (global/user/item biases) to reduce systematic error
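
A neighborhood-based prediction can be sketched as a similarity-weighted average of neighbors' deviations from their own mean rating. The ratings, the precomputed `user_sim` matrix, and the helper `predict_rating` are assumptions for this example, and regularization (for instance, shrinking similarities with little overlap) is left out for brevity.

```python
import numpy as np

# Hypothetical ratings (rows = users, columns = items, NaN = unrated).
ratings = np.array([
    [5.0, 4.0, np.nan],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])
# Hypothetical precomputed user-user similarities.
user_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

def predict_rating(user, item, k=2):
    """Mean-centered weighted average over the k most similar users who rated item."""
    user_means = np.nanmean(ratings, axis=1)
    rated = ~np.isnan(ratings[:, item])
    rated[user] = False  # never use the target user as their own neighbor
    candidates = np.flatnonzero(rated)
    if candidates.size == 0:
        return float(user_means[user])  # fall back to the user's own mean
    top = candidates[np.argsort(-user_sim[user, candidates])[:k]]
    weights = user_sim[user, top]
    denom = np.abs(weights).sum()
    if denom == 0:
        return float(user_means[user])
    deviations = ratings[top, item] - user_means[top]
    return float(user_means[user] + weights @ deviations / denom)

prediction = predict_rating(user=0, item=2)
```

Adding the deviation to the target user's mean is a simple form of the baseline correction mentioned above: a neighbor who rates everything high does not inflate the prediction.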

7. Generate top‑N recommendations  
    - score candidate items, exclude already-observed items for the user  
    - rerank by diversity / serendipity / popularity penalties if required  
    - return item ids with scores and explanations (optional)
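
The core of this step, scoring and masking out already-observed items, might look like the sketch below. The scores and the `top_n` helper are hypothetical, and reranking for diversity or popularity is not shown.

```python
import numpy as np

def top_n(scores, seen_mask, n=2):
    """Return (item_index, score) pairs for the n best unseen items, best first."""
    masked = np.where(seen_mask, -np.inf, scores)
    n = min(n, int((~seen_mask).sum()))  # cannot return more than the unseen count
    order = np.argsort(-masked)[:n]
    return [(int(i), float(masked[i])) for i in order]

# Hypothetical predicted scores for one user over five items.
scores = np.array([4.8, 3.1, 4.5, 2.0, 4.9])
seen = np.array([True, False, False, False, True])  # items 0 and 4 already observed
recommendations = top_n(scores, seen, n=2)
```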

8. Evaluate model performance  
    - rating prediction: RMSE, MAE  
    - ranking: Precision@K, Recall@K, MAP@K, NDCG@K, Hit Rate  
    - use stratified or temporal cross-validation and holdout sets; evaluate offline-to-online gap
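
As a small illustration, the rating and ranking metrics listed above can be written in a few lines. The function names and sample values are made up for this sketch.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user's ranked list and relevant-item set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical held-out ratings and predictions.
error_rmse = rmse([4, 3, 5], [3, 3, 4])
error_mae = mae([4, 3, 5], [3, 3, 4])

# Hypothetical ranked recommendations and the items the user actually liked.
precision, recall = precision_recall_at_k(
    ["i3", "i1", "i7", "i2"], {"i1", "i2", "i9"}, k=3
)
```

In practice the ranking metrics are averaged over all users in the test set.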

9. (Optional) Apply dimensionality reduction / latent-factor models  
    - SVD / truncated SVD, ALS, NMF; choose number of latent factors and regularization  
    - compare latent models vs neighborhood methods on your metrics
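
A latent-factor model can be approximated with a truncated SVD of the mean-centered rating matrix. The sketch below uses NumPy's dense `np.linalg.svd` on invented data and keeps an assumed `n_factors = 2`. Treating unrated cells as the user's mean is a simplification, not how ALS or SGD-based factorization handles missing entries.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 2.0, 4.0, 5.0],
    [5.0, 5.0, 0.0, 2.0],
])
observed = ratings > 0

# Center each user's observed ratings; unrated cells become 0 (the user's mean).
user_means = ratings.sum(axis=1) / observed.sum(axis=1)
centered = np.where(observed, ratings - user_means[:, None], 0.0)

n_factors = 2  # assumed number of latent factors
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
user_factors = U[:, :n_factors] * s[:n_factors]  # shape (n_users, n_factors)
item_factors = Vt[:n_factors, :]                 # shape (n_factors, n_items)

# Rank-k reconstruction plus the user means gives a score for every cell.
low_rank = user_factors @ item_factors
predicted = low_rank + user_means[:, None]
```

For large sparse data, a sparse solver such as `scipy.sparse.linalg.svds` or an ALS implementation would usually replace the dense decomposition.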

10. (Optional) Tune hyperparameters and selection criteria  
     - grid or random search for similarity metric, neighborhood size k, regularization, learning rate, latent factors  
     - use validation metrics and early stopping; consider computational cost

11. Production and deployment considerations  
     - handle cold-start (side information / hybrid methods), incremental updates, and batch retraining cadence  
     - scale with sharding, approximate nearest neighbors, or precomputed candidate lists  
     - monitor drift, A/B test online, and ensure privacy/compliance (anonymization, consent)

12. Documentation and reproducibility  
     - log data provenance, preprocessing steps, hyperparameters, and model artifacts  
     - provide reproducible notebooks, unit tests for core transformation functions, and CI for retraining pipelines