# 🎬 MovieLens Recommendation System - Project Structure

This notebook provides an overview of the MovieLens recommendation system project structure and demonstrates the key components. We'll explore:

1. Project Directory Overview
2. Environment Setup
3. Data Flow Explanation
4. Testing Hybrid Recommender
5. Streamlit Integration

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root (the parent of the current working directory) to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import project modules
from src.hybrid_recommender import HybridRecommender
from src.data_processor import MovieLensProcessor
from config import PROCESSED_DATA_DIR, RAW_DATA_DIR

## 1. Project Directory Overview

The project follows a modular structure with clear separation of concerns:

### 📁 Project Structure
```
movielens-recommender/
├── data/
│   ├── raw/                    # Raw MovieLens dataset
│   └── processed/              # Cleaned and processed data
├── src/
│   ├── data_processor.py       # Data preprocessing
│   ├── collaborative_filter.py # Collaborative filtering
│   ├── content_filter.py       # Content-based filtering
│   ├── hybrid_recommender.py   # Hybrid system
│   └── utils.py                # Utility functions
├── streamlit_app/              # Web interface
├── notebooks/                  # Analysis notebooks
├── models/                     # Saved models
├── requirements.txt            # Python dependencies
└── config.py                   # Global configuration
```
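
The notebook imports `PROCESSED_DATA_DIR`, `RAW_DATA_DIR`, `QUESTIONNAIRE_MOVIES`, and `TIME_PERIODS` from `config.py`, but the file itself is not shown here. The following is only a sketch of what such a module might look like. Every value (paths, the movie entry, the year ranges) and the helper names `PROJECT_ROOT` and `DATA_DIR` are placeholder assumptions, not the project's actual configuration.

```python
# Hypothetical sketch of config.py; all values below are illustrative assumptions.
from pathlib import Path

# Resolve paths relative to this file so imports work from any working directory.
# Fall back to the current directory when run interactively (no __file__).
try:
    PROJECT_ROOT = Path(__file__).resolve().parent
except NameError:
    PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"

# Each questionnaire entry carries the keys the notebook reads: movie_id and title.
QUESTIONNAIRE_MOVIES = [
    {"movie_id": 1, "title": "Toy Story"},  # placeholder entry
]

# Period name -> (first_year, last_year), both inclusive; ranges are placeholders.
TIME_PERIODS = {
    "Classiques": (1900, 1969),
    "Golden Age": (1970, 1989),
    "Modernes": (1990, 2030),
    "Toutes époques": (1900, 2030),  # catch-all, listed last
}
```

Keeping paths as `pathlib.Path` objects still works with the `os.path.join` and `os.path.exists` calls used in this notebook, since both accept path-like objects.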

## 2. Environment Setup

The project requires Python 3.8+ and several dependencies. Let's verify the environment setup:

In [None]:
# Check Python version
import platform
print(f"Python version: {platform.python_version()}")

# Report the installed version of each required package.
# importlib.metadata (standard library since Python 3.8) replaces the deprecated pkg_resources.
from importlib import metadata

required_packages = [
    'pandas',
    'numpy',
    'scikit-learn',
    'streamlit',
    'plotly',
    'matplotlib',
    'seaborn'
]

print("\nInstalled package versions:")
for package in required_packages:
    try:
        version = metadata.version(package)
        print(f"✓ {package}: {version}")
    except metadata.PackageNotFoundError:
        print(f"✗ {package}: Not installed")

## 3. Data Flow Explanation

The recommendation system follows this data flow:

1. **Raw Data**: MovieLens dataset is downloaded and extracted
2. **Preprocessing**: Data cleaning and feature engineering
3. **Model Training**: Collaborative and content-based filtering
4. **Hybrid System**: Combining both approaches
5. **User Interface**: Streamlit web application

Let's examine each step:

In [None]:
# Initialize data processor
processor = MovieLensProcessor()

# Download the dataset if the raw data directory is missing or empty
if not os.path.isdir(RAW_DATA_DIR) or not os.listdir(RAW_DATA_DIR):
    processor.download_movielens()

# Load and process data
processor.load_data()
processor.clean_data()

# Display basic statistics
stats = processor.get_basic_stats()
print("📊 Dataset Statistics:")
for key, value in stats.items():
    if key != 'rating_distribution':
        print(f"- {key}: {value}")

# Plot rating distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=stats['rating_distribution'].index, 
            y=stats['rating_distribution'].values)
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()
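
`get_basic_stats()` is defined in `src/data_processor.py` and is not shown in this notebook. The cell above only relies on it returning a dict whose `rating_distribution` entry is a pandas Series. A minimal sketch of a function with that shape, run on a tiny made-up ratings table, might look like this. The function name `basic_stats`, the column names, and the other dict keys are assumptions.

```python
import pandas as pd

def basic_stats(ratings: pd.DataFrame) -> dict:
    """Summarize a ratings table with user_id, movie_id and rating columns."""
    return {
        "n_users": ratings["user_id"].nunique(),
        "n_movies": ratings["movie_id"].nunique(),
        "n_ratings": len(ratings),
        # Count of each rating value, ordered by rating rather than by frequency.
        "rating_distribution": ratings["rating"].value_counts().sort_index(),
    }

# Tiny made-up sample, for illustration only.
sample = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 5, 4],
})
sample_stats = basic_stats(sample)
print(sample_stats["n_users"], sample_stats["n_movies"], sample_stats["n_ratings"])
```

Sorting the distribution by index keeps the bars in rating order on the plot, which is usually easier to read than frequency order.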

## Questionnaire Movie Analysis

Let's analyze the movies selected for the initial questionnaire to validate their relevance:

In [None]:
import ast

from config import QUESTIONNAIRE_MOVIES

# Load the cleaned movie and rating data
df_movies = pd.read_csv(os.path.join(PROCESSED_DATA_DIR, 'movies_clean.csv'))
df_ratings = pd.read_csv(os.path.join(PROCESSED_DATA_DIR, 'ratings_clean.csv'))

# Collect rating statistics for each questionnaire movie
questionnaire_stats = []
for movie in QUESTIONNAIRE_MOVIES:
    movie_id = movie['movie_id']
    movie_ratings = df_ratings[df_ratings['movie_id'] == movie_id]
    movie_rows = df_movies[df_movies['movie_id'] == movie_id]
    if movie_rows.empty:
        # Skip IDs missing from the cleaned data instead of raising IndexError
        print(f"⚠ movie_id {movie_id} not found in movies_clean.csv")
        continue
    movie_info = movie_rows.iloc[0]

    # Named movie_stats so the `stats` dict from the earlier cell is not overwritten
    movie_stats = {
        'title': movie['title'],
        'n_ratings': len(movie_ratings),
        'avg_rating': movie_ratings['rating'].mean(),
        'genres': movie_info['genres']
    }
    questionnaire_stats.append(movie_stats)

# Display the statistics
df_questionnaire = pd.DataFrame(questionnaire_stats)
print("📊 Questionnaire Movie Statistics:")
print(df_questionnaire.to_string(index=False))

# Visualize the genre distribution
all_genres = []
for genres in df_questionnaire['genres']:
    if isinstance(genres, str):
        # Genres are stored as a stringified list; literal_eval parses it
        # without executing arbitrary code, unlike eval
        all_genres.extend(ast.literal_eval(genres))

genre_counts = pd.Series(all_genres).value_counts()

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.values, y=genre_counts.index)
plt.title('Genre Distribution of Questionnaire Movies')
plt.xlabel('Number of Movies')
plt.ylabel('Genre')
plt.show()
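
The genre-counting step above has to parse each `genres` value, because a list column comes back from CSV as plain text. How `movies_clean.csv` encodes genres is not shown in this notebook: the cell assumes a stringified Python list, while the raw MovieLens files use pipe-separated names. The helper below is a hypothetical sketch (the name `parse_genres` is an assumption) that accepts either form and tolerates missing values.

```python
import ast

def parse_genres(value):
    """Return a list of genre names from a CSV cell (hypothetical helper)."""
    if not isinstance(value, str) or not value.strip():
        return []  # NaN or empty cell
    text = value.strip()
    if text.startswith("["):
        # List literal such as "['Action', 'Sci-Fi']"; literal_eval only
        # accepts Python literals, so it cannot run arbitrary code.
        try:
            return list(ast.literal_eval(text))
        except (ValueError, SyntaxError):
            return []
    # Otherwise treat it as the pipe-separated raw MovieLens format.
    return [genre for genre in text.split("|") if genre]

print(parse_genres("['Action', 'Sci-Fi']"))
print(parse_genres("Comedy|Drama"))
```

With a helper like this, the counting loop could reduce to something like `df_questionnaire['genres'].apply(parse_genres).explode().value_counts()`.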

## User Interaction Analysis

Let's examine user interaction patterns to optimize our recommendation system:

In [None]:
# Aggregate per-user rating statistics
user_stats = df_ratings.groupby('user_id').agg({
    'rating': ['count', 'mean', 'std']
}).round(2)

user_stats.columns = ['n_ratings', 'avg_rating', 'rating_std']

print("\n📊 User Statistics:")
print(f"Average number of ratings per user: {user_stats['n_ratings'].mean():.1f}")
print(f"Mean of per-user average ratings: {user_stats['avg_rating'].mean():.2f}")

# Visualize the distribution of the number of ratings per user
plt.figure(figsize=(10, 6))
sns.histplot(data=user_stats, x='n_ratings', bins=30)
plt.title('Distribution of the Number of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Users')
plt.show()

# Build the user-item matrix (rows: users, columns: movies, NaN where no rating exists)
user_item_matrix = df_ratings.pivot_table(
    index='user_id',
    columns='movie_id',
    values='rating'
)

# Sparsity is the share of empty cells in the matrix
sparsity = (user_item_matrix.isna().sum().sum() /
            (user_item_matrix.shape[0] * user_item_matrix.shape[1]))

print(f"\n🔍 User-item matrix sparsity: {sparsity:.1%}")
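
The sparsity reported above is the fraction of empty cells, which equals 1 − (observed ratings) / (users × movies). That means it can be computed from three counts without materializing the full pivot table, which may matter for larger MovieLens variants. The sketch below assumes the same `user_id`, `movie_id`, and `rating` column names; the function name `matrix_sparsity` and the sample data are made up for illustration.

```python
import pandas as pd

def matrix_sparsity(ratings: pd.DataFrame) -> float:
    """Fraction of empty cells in the user-item matrix, computed from counts."""
    # Count each (user, movie) pair once, matching pivot_table, which
    # collapses duplicate pairs into a single averaged cell.
    n_observed = len(ratings.drop_duplicates(["user_id", "movie_id"]))
    n_users = ratings["user_id"].nunique()
    n_movies = ratings["movie_id"].nunique()
    return 1.0 - n_observed / (n_users * n_movies)

# 3 users x 3 movies = 9 cells, 4 of them filled, so sparsity is 5/9.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
print(f"{matrix_sparsity(toy):.1%}")
```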

## Temporal Analysis

Let's analyze how the movies are distributed over time to validate our proposed periods:

In [None]:
from config import TIME_PERIODS

# Map a release year to the first period whose inclusive range contains it
def get_time_period(year):
    for period, (start, end) in TIME_PERIODS.items():
        if start <= year <= end:
            return period
    return 'Unknown'

# Add the period to each movie
df_movies['period'] = df_movies['year'].apply(get_time_period)

# Visualize the distribution over time.
# The period names are kept as defined in config.py (assumed to be the keys of TIME_PERIODS).
plt.figure(figsize=(12, 6))
sns.countplot(data=df_movies, x='period', order=['Classiques', 'Golden Age', 'Modernes', 'Toutes époques'])
plt.title('Distribution of Movies by Period')
plt.xticks(rotation=45)
plt.show()

# Compute rating statistics per period
df_with_periods = df_movies.merge(df_ratings, on='movie_id')
period_stats = df_with_periods.groupby('period').agg({
    'rating': ['count', 'mean', 'std']
}).round(3)

period_stats.columns = ['n_ratings', 'avg_rating', 'rating_std']
print("\n📊 Statistics by Period:")
print(period_stats)
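
`get_time_period` returns the first period whose range contains the year, so the result depends on the order of the entries in `TIME_PERIODS` whenever ranges overlap. A catch-all period such as 'Toutes époques' would, if it spans every year, only receive movies when it comes after the specific periods. The sketch below illustrates this with placeholder year bounds; the real values live in `config.py` and are not shown here, and the names `EXAMPLE_PERIODS` and `first_matching_period` are assumptions.

```python
# Placeholder bounds; the real TIME_PERIODS lives in config.py and is not shown.
EXAMPLE_PERIODS = {
    "Classiques": (1900, 1969),
    "Golden Age": (1970, 1989),
    "Modernes": (1990, 2030),
    "Toutes époques": (1900, 2030),  # catch-all, listed last on purpose
}

def first_matching_period(year, periods):
    """Return the first period whose inclusive range contains year."""
    for name, (start, end) in periods.items():
        if start <= year <= end:
            return name
    return "Unknown"  # also reached for NaN, since comparisons with NaN are False

print(first_matching_period(1977, EXAMPLE_PERIODS))          # Golden Age
print(first_matching_period(float("nan"), EXAMPLE_PERIODS))  # Unknown

# Moving the catch-all to the front makes it shadow every specific period.
reordered = {"Toutes époques": (1900, 2030), **EXAMPLE_PERIODS}
print(first_matching_period(1977, reordered))                # Toutes époques
```

Movies labeled 'Unknown' are not listed in the `order` argument of the count plot, so they are left off that chart; checking `df_movies['period'].value_counts()` shows how many there are.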

## 4. Testing Hybrid Recommender

Now let's test the hybrid recommendation system with some sample user ratings:

In [None]:
# Initialize the hybrid recommender
hybrid = HybridRecommender()

# Sample user ratings
test_ratings = {
    1: 5,    # Toy Story
    50: 4,   # Star Wars
    269: 3   # Fargo
}

# Get recommendations
recommendations = hybrid.get_hybrid_recommendations(
    test_ratings,
    preferred_genres=['Action', 'Sci-Fi'],
    discovery_type="balanced"
)

# Display recommendations
print("🎬 HYBRID RECOMMENDATIONS:")
for i, rec in enumerate(recommendations[:5], 1):
    print(f"\n{i}. {rec['title']} (Score: {rec['hybrid_score']:.3f})")
    print(f"   {hybrid.get_explanation(rec)}")
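
The internals of `HybridRecommender` are not shown in this notebook. One common way to produce a hybrid score is a weighted sum of normalized collaborative and content-based scores, and the sketch below illustrates that idea only. It is not the project's implementation: the function names, the `alpha` weight, and the sample scores are all assumptions.

```python
def min_max_normalize(scores):
    """Rescale a dict of scores to [0, 1]; constant inputs map to 0.5."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 0.5 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}

def blend_scores(collab, content, alpha=0.5):
    """Weighted sum of two score dicts keyed by movie id.

    alpha weights the collaborative score; (1 - alpha) weights the content score.
    A movie missing from one source contributes 0 for that source.
    """
    collab_n, content_n = min_max_normalize(collab), min_max_normalize(content)
    movie_ids = set(collab_n) | set(content_n)
    blended = {
        movie_id: alpha * collab_n.get(movie_id, 0.0)
        + (1 - alpha) * content_n.get(movie_id, 0.0)
        for movie_id in movie_ids
    }
    # Highest blended score first.
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for four hypothetical movie ids.
collab_scores = {101: 4.5, 102: 3.0, 103: 3.9}
content_scores = {101: 0.2, 102: 0.9, 104: 0.6}
ranking = blend_scores(collab_scores, content_scores, alpha=0.6)
print(ranking)
```

Normalizing first matters because predicted ratings and similarity scores live on different scales; without it, the source with the larger range would dominate the sum regardless of `alpha`.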

## 5. Streamlit Integration

The recommendation system is exposed through a Streamlit web interface. Here's how to run it:

```bash
# Activate the virtual environment (run only the line for your OS)
venv\Scripts\activate      # Windows
source venv/bin/activate   # macOS/Linux

# Launch Streamlit app
streamlit run streamlit_app/app.py
```

The Streamlit app provides:
- User questionnaire for initial preferences
- Movie rating interface
- Personalized recommendations
- Explanation for each recommendation

## Conclusion

This notebook has demonstrated:
1. The project's modular structure
2. Environment setup and validation
3. Data processing pipeline
4. Hybrid recommendation system
5. Integration with Streamlit

Next steps:
- Complete the Streamlit interface
- Add more sophisticated recommendation algorithms
- Improve explanation generation
- Add user feedback collection