## üåçüöÄ Incredible India Explorer
AI-Powered Indian Travel Recommender System
### üìã Overview

This notebook creates a comprehensive travel recommendation system using:
- **Content-Based Filtering**: Recommends destinations based on features (type, tags, activities, etc.)
- **Collaborative Filtering**: Recommends based on user behavior patterns
- **Hybrid Approach**: Combines both methods for better recommendations
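
How the two methods are combined is not shown in this section. One common option, sketched here under that caveat, is a weighted sum of min-max normalized scores. The names `min_max_normalize` and `blend_scores`, the weight `alpha`, and the sample scores are all hypothetical and not taken from the notebook.

```python
def min_max_normalize(scores):
    """Rescale a {item: score} dict to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every item as equally strong.
        return {item: 1.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def blend_scores(content_scores, collab_scores, alpha=0.5, top_n=5):
    """Blend two score dicts; alpha is the weight of the content-based side."""
    content = min_max_normalize(content_scores)
    collab = min_max_normalize(collab_scores)
    blended = {}
    for item in set(content) | set(collab):
        # An item missing from one side contributes 0 from that side.
        blended[item] = (alpha * content.get(item, 0.0)
                         + (1 - alpha) * collab.get(item, 0.0))
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Illustrative scores only (not taken from the notebook's data).
content_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
collab_scores = {"B": 4.0, "C": 2.0, "D": 5.0}
ranked = blend_scores(content_scores, collab_scores, alpha=0.5, top_n=3)
print(ranked)
```

Normalizing first matters because the two models score on different scales: cosine similarities lie in 0..1, while summed rating-weighted scores do not.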

### 📊 Datasets Used

1. **Destination_df.csv** - 9,964 destinations with 33 features
2. **Users_df.csv** - 10,000 user profiles
3. **Users_History_df.csv** - 12,275 trip records
4. **Reviews_df.csv** - 10,000 reviews

---

## 📚 Step 1: Import Required Libraries

In [7]:

# Data manipulation
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
import pickle
import warnings
import os
import gc
warnings.filterwarnings('ignore')
print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## 📂 Step 2: Load Datasets

In [8]:
print("Loading datasets...")
print("="*80)

# Load all datasets
users_df = pd.read_csv('Users_df.csv')
destinations_df = pd.read_csv('Destination_df.csv')
history_df = pd.read_csv('Users_History_df.csv')
reviews_df = pd.read_csv('Reviews_df.csv')

print(f"✅ Users Dataset: {users_df.shape}")
print(f"✅ Destinations Dataset: {destinations_df.shape}")
print(f"✅ User History Dataset: {history_df.shape}")
print(f"✅ Reviews Dataset: {reviews_df.shape}")
print("="*80)

Loading datasets...
✅ Users Dataset: (10000, 16)
✅ Destinations Dataset: (9964, 33)
✅ User History Dataset: (12275, 16)
✅ Reviews Dataset: (10000, 7)


## 🔍 Step 3: Exploratory Data Analysis

In [9]:
# Display basic information
print("\n🔍 Destination Dataset Info:")
print("-"*80)
print(f"Total Destinations: {len(destinations_df):,}")
print(f"\nColumns ({len(destinations_df.columns)}):")
for i, col in enumerate(destinations_df.columns, 1):
    print(f"  {i:2d}. {col:<30} ({destinations_df[col].dtype})")

print("\n📊 Key Statistics:")
print("-"*80)
print(f"Destination Types: {destinations_df['Type'].nunique()}")
print(f"States/UTs: {destinations_df['State/UT'].nunique()}")
print(f"Cities: {destinations_df['City'].nunique()}")


🔍 Destination Dataset Info:
--------------------------------------------------------------------------------
Total Destinations: 9,964

Columns (33):
   1. DestinationID                  (int64)
   2. Name                           (object)
   3. City                           (object)
   4. State/UT                       (object)
   5. Type                           (object)
   6. Tags                           (object)
   7. Popularity                     (float64)
   8. RatingCount                    (int64)
   9. BestTimeToVisit                (object)
  10. WeatherSummary                 (object)
  11. Latitude                       (object)
  12. Longitude                      (float64)
  13. EntryFee                       (object)
  14. AverageCost                    (object)
  15. RecommendedDuration            (object)
  16. Activities                     (object)
  17. Accessibility                  (object)
  18. NearestAirport                 (object)
  19. NearestRailw

In [10]:
# Check missing values in key columns
print("\n🔍 Missing Values Analysis:")
print("-"*80)

key_cols = ['Name', 'Type', 'Tags', 'Description', 'Activities', 'City', 'State/UT', 'Popularity']
missing_data = []

for col in key_cols:
    if col in destinations_df.columns:
        missing = destinations_df[col].isna().sum()
        pct = (missing / len(destinations_df)) * 100
        missing_data.append({
            'Column': col,
            'Missing': missing,
            'Percentage': f"{pct:.2f}%"
        })

missing_df = pd.DataFrame(missing_data)
print(missing_df.to_string(index=False))


🔍 Missing Values Analysis:
--------------------------------------------------------------------------------
     Column  Missing Percentage
       Name        0      0.00%
       Type        0      0.00%
       Tags        0      0.00%
Description        0      0.00%
 Activities        0      0.00%
       City        0      0.00%
   State/UT        0      0.00%
 Popularity        0      0.00%


In [11]:
# Display top destination types
print("\n📊 Top 10 Destination Types:")
print("-"*80)
top_types = destinations_df['Type'].value_counts().head(10)
for i, (dtype, count) in enumerate(top_types.items(), 1):
    print(f"{i:2d}. {dtype:<25} : {count:>5} destinations")

print("\n🗺️ Top 10 States:")
print("-"*80)
top_states = destinations_df['State/UT'].value_counts().head(10)
for i, (state, count) in enumerate(top_states.items(), 1):
    print(f"{i:2d}. {state:<25} : {count:>5} destinations")


📊 Top 10 Destination Types:
--------------------------------------------------------------------------------
 1. Nature                    :  1629 destinations
 2. Adventure                 :  1421 destinations
 3. Wildlife                  :   698 destinations
 4. Heritage                  :   598 destinations
 5. Religious                 :   564 destinations
 6. Historical                :   512 destinations
 7. Beach                     :   500 destinations
 8. Museum                    :   434 destinations
 9. Hill Station              :   433 destinations
10. Temple                    :   418 destinations

🗺️ Top 10 States:
--------------------------------------------------------------------------------
 1. West Bengal               :   661 destinations
 2. Sikkim                    :   529 destinations
 3. Ladakh                    :   504 destinations
 4. Tamil Nadu                :   500 destinations
 5. Himachal Pradesh          :   475 destinations
 6. Uttarakhand  

## 🧹 Step 4: Data Preprocessing

In [12]:
print("\n🧹 Data Preprocessing...")
print("="*80)

# Create a working copy
df = destinations_df.copy()

# Fill missing values
print("\n1. Filling missing values...")
df['Tags'] = df['Tags'].fillna('')
df['Description'] = df['Description'].fillna('')
df['Activities'] = df['Activities'].fillna('')
df['Type'] = df['Type'].fillna('Unknown')
df['City'] = df['City'].fillna('')
df['State/UT'] = df['State/UT'].fillna('')
print("   ✅ Missing values handled")

# Display sample data
print("\n2. Sample preprocessed data:")
print("-"*80)
print(df[['Name', 'Type', 'City', 'State/UT']].head(3))


🧹 Data Preprocessing...

1. Filling missing values...
   ✅ Missing values handled

2. Sample preprocessed data:
--------------------------------------------------------------------------------
          Name        Type    City       State/UT
0    Taj Mahal  Historical    Agra  Uttar Pradesh
1   Amber Fort  Historical  Jaipur      Rajasthan
2  Goa Beaches       Beach  Panaji            Goa


## 🎯 Step 5: Feature Engineering for Content-Based Filtering

In [13]:
print("\n🎯 Feature Engineering...")
print("="*80)

def create_content_features(row):
    """
    Combine multiple columns to create rich feature representation
    
    Weighting strategy:
    - Type: 3x (most important - primary category)
    - Tags: 2x (important - detailed characteristics)
    - Activities: 1x (standard weight)
    - Description: 1x (limited to 200 chars to avoid noise)
    - State/UT: 1x (for regional similarity)
    """
    features = []
    
    # Type (weighted 3x)
    features.append(str(row['Type']) * 3)
    
    # Tags (weighted 2x)
    features.append(str(row['Tags']) * 2)
    
    # Activities
    features.append(str(row['Activities']))
    
    # Description (truncated)
    desc = str(row['Description'])[:200]
    features.append(desc)
    
    # State/UT (regional similarity)
    features.append(str(row['State/UT']))
    
    return ' '.join(features)

print("\nCreating combined content features...")
df['content_features'] = df.apply(create_content_features, axis=1)

print("✅ Content features created!")
print(f"\nSample feature string (first 300 chars):")
print("-"*80)
print(df['content_features'].iloc[0][:300] + "...")
print("-"*80)


🎯 Feature Engineering...

Creating combined content features...
✅ Content features created!

Sample feature string (first 300 chars):
--------------------------------------------------------------------------------
HistoricalHistoricalHistorical heritage, architecture, culture, monument, guided tour, photography, historic landmark, iconic, white, marble, tourist favorite, must visit, photogenicheritage, architecture, culture, monument, guided tour, photography, historic landmark, iconic, white, marble, tourist...
--------------------------------------------------------------------------------
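
The sample string above shows the repeated fields fused together (`HistoricalHistoricalHistorical`, `photogenicheritage`), because multiplying a string repeats it without separators. A word-level tokenizer would then see one long token rather than three occurrences of `Historical`, so the intended 3x and 2x weights may not take effect as term counts. A minimal sketch of a space-separated variant follows; `weighted_text` and `create_content_features_spaced` are hypothetical names, and the sample row is made up. Switching to it would change the feature strings and therefore the similarity scores reported later in this notebook.

```python
def weighted_text(value, weight):
    """Repeat a field `weight` times, separated by spaces so tokens stay intact."""
    return " ".join([str(value)] * weight)


def create_content_features_spaced(row):
    """Variant of create_content_features that keeps repeated fields separated."""
    parts = [
        weighted_text(row["Type"], 3),    # Type weighted 3x
        weighted_text(row["Tags"], 2),    # Tags weighted 2x
        str(row["Activities"]),           # Activities weighted 1x
        str(row["Description"])[:200],    # Description truncated to 200 chars
        str(row["State/UT"]),             # State/UT weighted 1x
    ]
    return " ".join(parts)


# Illustrative row (a plain dict stands in for a DataFrame row; values are made up).
row = {
    "Type": "Historical",
    "Tags": "heritage, monument",
    "Activities": "guided tour",
    "Description": "A marble mausoleum.",
    "State/UT": "Uttar Pradesh",
}
features = create_content_features_spaced(row)
print(features)
```

Because the function only indexes `row` by column name, it should also work with `df.apply(create_content_features_spaced, axis=1)`.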


## 🤖 Step 6: Build TF-IDF Vectorizer

In [14]:
print("\n🤖 Building TF-IDF Vectorizer...")
print("="*80)

# Initialize TF-IDF Vectorizer with optimized parameters
tfidf = TfidfVectorizer(
    max_features=5000,        # Limit to top 5000 features (memory efficient)
    stop_words='english',     # Remove common English words
    ngram_range=(1, 2),       # Use both unigrams and bigrams
    min_df=2,                 # Ignore terms appearing in < 2 documents
    max_df=0.8                # Ignore terms appearing in > 80% documents
)

print("\nFitting TF-IDF vectorizer...")
tfidf_matrix = tfidf.fit_transform(df['content_features'])

print(f"\n✅ TF-IDF Matrix Created!")
print(f"   Shape: {tfidf_matrix.shape}")
print(f"   ({tfidf_matrix.shape[0]:,} destinations × {tfidf_matrix.shape[1]:,} features)")
print(f"   Sparsity: {(1.0 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")
print(f"   Memory: ~{tfidf_matrix.data.nbytes / (1024**2):.2f} MB")


🤖 Building TF-IDF Vectorizer...

Fitting TF-IDF vectorizer...

✅ TF-IDF Matrix Created!
   Shape: (9964, 5000)
   (9,964 destinations × 5,000 features)
   Sparsity: 99.26%
   Memory: ~2.82 MB
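
To make the `min_df=2` and `max_df=0.8` settings concrete, here is a simplified sketch of document-frequency filtering. It is not scikit-learn's implementation: it assumes whitespace tokenization and ignores stop words, n-grams, and `max_features`. The function name and the sample documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df=0.8):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    df_counts = Counter()
    for doc in docs:
        # Count each term at most once per document.
        df_counts.update(set(doc.lower().split()))
    max_count = max_df * n_docs
    return sorted(t for t, c in df_counts.items() if min_df <= c <= max_count)


# Illustrative documents only.
docs = [
    "fort heritage india",
    "beach sunset india",
    "fort museum india",
    "lake trek india",
    "beach fort india",
]
kept = filter_by_document_frequency(docs)
print(kept)
```

In this toy corpus, `india` appears in all five documents (above the 80% cut-off) and is dropped, as are terms that appear in only one document.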


## 📊 Step 7: Compute Cosine Similarity Matrix

In [15]:
print("\n📊 Computing Cosine Similarity Matrix...")
print("="*80)
print("⏳ This may take a few minutes for large datasets...\n")

# Use linear_kernel: TfidfVectorizer L2-normalizes rows by default, so the
# plain dot product already equals cosine similarity and skips re-normalization
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

print(f"\n✅ Cosine Similarity Matrix Created!")
print(f"   Shape: {cosine_sim.shape}")
print(f"   ({cosine_sim.shape[0]:,} × {cosine_sim.shape[1]:,})")
print(f"   Memory: ~{cosine_sim.nbytes / (1024**2):.2f} MB")

# Show some sample similarities
print(f"\n📈 Sample Similarities for '{df['Name'].iloc[0]}':")
print("-"*80)
sample_sims = cosine_sim[0][:10]
for i, sim in enumerate(sample_sims):
    print(f"   {df['Name'].iloc[i]:<40} : {sim:.4f}")


📊 Computing Cosine Similarity Matrix...
⏳ This may take a few minutes for large datasets...


✅ Cosine Similarity Matrix Created!
   Shape: (9964, 9964)
   (9,964 × 9,964)
   Memory: ~757.46 MB

📈 Sample Similarities for 'Taj Mahal':
--------------------------------------------------------------------------------
   Taj Mahal                                : 1.0000
   Amber Fort                               : 0.2591
   Goa Beaches                              : 0.0115
   Alleppey Backwaters                      : 0.0259
   Pangong Lake                             : 0.0188
   Varanasi Ghats                           : 0.0516
   Hawa Mahal                               : 0.4146
   Munnar Tea Gardens                       : 0.0198
   Khajuraho Temples                        : 0.2666
   Ajanta Ellora Caves                      : 0.3022
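
Two points behind this step can be checked with a small sketch. First, `linear_kernel` can stand in for `cosine_similarity` because `TfidfVectorizer` L2-normalizes each row by default, and the dot product of unit vectors is their cosine. Second, the reported memory follows from storing a dense 9,964 × 9,964 matrix of 8-byte floats. The helper names and the toy vectors below are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


# Toy vectors, not real TF-IDF rows.
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]
cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(abs(cos_ab - dot_normalized) < 1e-12)

# Dense float64 storage for an n x n matrix, in MB (1 MB = 1024**2 bytes).
n = 9964
dense_mb = n * n * 8 / 1024**2
print(round(dense_mb, 2))
```

Since memory grows with the square of the number of destinations, keeping only each row's top neighbours would be one way to shrink the stored artifact, at the cost of recomputing anything outside that list.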


## 🗂️ Step 8: Create Indices Mapping

In [16]:
print("\n🗂️ Creating Indices Mapping...")
print("="*80)

# Create mapping: destination name -> dataframe index
# (names are not unique, so a duplicated name keeps only its last index)
indices = pd.Series(df.index, index=df['Name']).to_dict()

print(f"\n✅ Indices Mapping Created!")
print(f"   Total mappings: {len(indices):,}")
print(f"\nSample mappings:")
print("-"*80)
for i, (name, idx) in enumerate(list(indices.items())[:5]):
    print(f"   {name:<40} -> Index {idx}")


🗂️ Creating Indices Mapping...

✅ Indices Mapping Created!
   Total mappings: 9,866

Sample mappings:
--------------------------------------------------------------------------------
   Taj Mahal                                -> Index 0
   Amber Fort                               -> Index 1
   Goa Beaches                              -> Index 2
   Alleppey Backwaters                      -> Index 9925
   Pangong Lake                             -> Index 9908
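
The mapping holds 9,866 entries for 9,964 rows, and `Alleppey Backwaters` maps to index 9925, which suggests that some names repeat and the dictionary keeps only the last row for each. A minimal sketch of an alternative that records every row for a name is shown below. `build_name_index` is a hypothetical helper, the names are illustrative, and positions are assumed to match the DataFrame's default integer index.

```python
from collections import defaultdict


def build_name_index(names):
    """Map each name to the list of all row positions where it appears."""
    name_to_rows = defaultdict(list)
    for position, name in enumerate(names):
        name_to_rows[name].append(position)
    return dict(name_to_rows)


# Illustrative names only; "Lake View" appears twice.
names = ["Taj Mahal", "Lake View", "Amber Fort", "Lake View"]
name_index = build_name_index(names)

# A plain name -> position dict keeps only the last occurrence.
last_wins = {name: pos for pos, name in enumerate(names)}

duplicates = {n: rows for n, rows in name_index.items() if len(rows) > 1}
print(name_index["Lake View"])
print(last_wins["Lake View"])
print(duplicates)
```

With a list per name, the caller can decide which row to use, for example by also matching on city or state.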


## ✅ Step 9: Test Content-Based Recommendations

In [17]:
def get_content_recommendations(destination_name, cosine_sim, indices, df, top_n=10):
    """
    Get content-based recommendations for a given destination
    
    Parameters:
    -----------
    destination_name : str
        Name of the destination
    cosine_sim : numpy array
        Cosine similarity matrix
    indices : dict
        Name to index mapping
    df : DataFrame
        Destinations dataframe
    top_n : int
        Number of recommendations to return
    
    Returns:
    --------
    DataFrame with recommendations
    """
    try:
        # Get index
        idx = indices.get(destination_name)
        if idx is None:
            return pd.DataFrame()
        
        # Get similarity scores
        sim_scores = list(enumerate(cosine_sim[idx]))
        
        # Sort by similarity (descending)
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        
        # Get top N, skipping the first entry (assumed to be the query itself;
        # rows with identical features can tie with it at 1.0)
        sim_scores = sim_scores[1:top_n+1]
        
        # Get destination indices
        dest_indices = [i[0] for i in sim_scores]
        
        # Create recommendations dataframe
        recommendations = df.iloc[dest_indices][['Name', 'City', 'State/UT', 'Type', 'Popularity']].copy()
        recommendations['Similarity_Score'] = [round(i[1], 3) for i in sim_scores]
        
        return recommendations
    
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()

# Test the function
print("\n✅ Testing Content-Based Recommendations:")
print("="*80)

test_destination = df['Name'].iloc[0]
print(f"\n🎯 Getting recommendations for: {test_destination}")
print("-"*80)

recommendations = get_content_recommendations(test_destination, cosine_sim, indices, df, top_n=5)

if not recommendations.empty:
    print("\nTop 5 Similar Destinations:")
    print(recommendations.to_string(index=False))
    print("\n✅ Content-based recommendations working perfectly!")
else:
    print("❌ No recommendations found")


✅ Testing Content-Based Recommendations:

🎯 Getting recommendations for: Taj Mahal
--------------------------------------------------------------------------------

Top 5 Similar Destinations:
                           Name    City       State/UT       Type  Popularity  Similarity_Score
                  Jaswant Thada Jodhpur      Rajasthan Historical        9.10             0.722
Hoshang Shah Tomb & Jami Masjid   Mandu Madhya Pradesh Historical        9.85             0.595
                   Rumi Darwaza Lucknow  Uttar Pradesh Historical        9.60             0.583
                     Jag Mandir Udaipur      Rajasthan Historical        9.20             0.568
                 Blue City View Jodhpur      Rajasthan Historical        9.20             0.548

✅ Content-based recommendations working perfectly!
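
The function above drops the first sorted entry on the assumption that it is the query destination. When another row has identical features and also scores 1.0, the entry that is dropped may be the wrong one. A minimal sketch that excludes the query by index instead is given below; `top_similar` and the similarity row are hypothetical.

```python
def top_similar(sim_row, query_idx, top_n=5):
    """Return (index, score) pairs for the top_n most similar items, never the query."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    # Sort by score (descending), then by index so ties are deterministic.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:top_n]


# Illustrative row: the query is item 3, and item 1 has identical features (score 1.0).
sim_row = [0.20, 1.00, 0.75, 1.00, 0.10]

result = top_similar(sim_row, query_idx=3, top_n=3)
print(result)

# Slicing off the first sorted entry removes item 1 and keeps the query itself.
naive = sorted(enumerate(sim_row), key=lambda x: x[1], reverse=True)[1:4]
print(naive)
```

For large rows, a partial sort such as `numpy.argpartition` could replace the full sort, but that is an optimization rather than a change in behaviour.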


## 👥 Step 10: Build Collaborative Filtering Model

In [18]:
print("\n👥 Building Collaborative Filtering Model...")
print("="*80)

print("\n1. Creating User-Item Matrix...")
print("   Using ExperienceRating as interaction value")

# Create user-item matrix from history data
user_item_matrix = history_df.pivot_table(
    index='UserID',
    columns='DestinationID',
    values='ExperienceRating',
    fill_value=0
)

print(f"\n✅ User-Item Matrix Created!")
print(f"   Shape: {user_item_matrix.shape}")
print(f"   ({user_item_matrix.shape[0]:,} users × {user_item_matrix.shape[1]:,} destinations)")
print(f"   Total ratings: {(user_item_matrix > 0).sum().sum():,}")

# Calculate sparsity
total_cells = user_item_matrix.shape[0] * user_item_matrix.shape[1]
non_zero = (user_item_matrix != 0).sum().sum()
sparsity = (1 - (non_zero / total_cells)) * 100
print(f"   Sparsity: {sparsity:.2f}%")
print(f"   Average ratings per user: {non_zero / user_item_matrix.shape[0]:.1f}")


👥 Building Collaborative Filtering Model...

1. Creating User-Item Matrix...
   Using ExperienceRating as interaction value

✅ User-Item Matrix Created!
   Shape: (7275, 7420)
   (7,275 users × 7,420 destinations)
   Total ratings: 12,274
   Sparsity: 99.98%
   Average ratings per user: 1.7
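
`pivot_table` averages repeated (user, destination) pairs by default, so several trips to the same place collapse into one cell. That is one possible reason the matrix holds 12,274 ratings for 12,275 trip records, though the notebook does not confirm the cause. With only about 1.7 ratings per user, a sparse structure can hold the same information without the zero cells. The sketch below uses plain dictionaries; `build_sparse_ratings` and the records are hypothetical.

```python
from collections import defaultdict


def build_sparse_ratings(records):
    """Build {user: {destination: mean rating}} from (user, destination, rating) tuples."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user, dest, rating in records:
        sums[user][dest] += rating
        counts[user][dest] += 1
    # Average repeated visits, mirroring pivot_table's default mean aggregation.
    return {
        user: {dest: sums[user][dest] / counts[user][dest] for dest in dests}
        for user, dests in sums.items()
    }


# Illustrative trip records; user 1 rated destination 10 twice.
records = [(1, 10, 4), (1, 10, 2), (1, 20, 5), (2, 10, 3)]
ratings = build_sparse_ratings(records)
n_stored = sum(len(dests) for dests in ratings.values())
print(ratings)
print(n_stored)
```

Four records become three stored ratings here, because the repeated pair is averaged into a single value.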




In [19]:
print("\n2. Computing User Similarity Matrix...")
print("   ⏳ This may take a few minutes...\n")

# Compute user similarity using cosine similarity
user_similarity = cosine_similarity(user_item_matrix)

# Convert to DataFrame for easier handling
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print(f"\n✅ User Similarity Matrix Created!")
print(f"   Shape: {user_similarity_df.shape}")
print(f"   ({user_similarity_df.shape[0]:,} × {user_similarity_df.shape[1]:,})")
print(f"   Memory: ~{user_similarity_df.values.nbytes / (1024**2):.2f} MB")


2. Computing User Similarity Matrix...
   ⏳ This may take a few minutes...


✅ User Similarity Matrix Created!
   Shape: (7275, 7275)
   (7,275 × 7,275)
   Memory: ~403.79 MB
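
Because unrated destinations are filled with 0, two users have a non-zero cosine similarity only if they rated at least one destination in common. A minimal sketch of the same measure on sparse rating dictionaries makes that explicit. `sparse_cosine` and the sample ratings are hypothetical.

```python
import math


def sparse_cosine(u, v):
    """Cosine similarity between two {destination: rating} dicts (missing = 0)."""
    shared = set(u) & set(v)
    if not shared:
        # No destination in common: the dot product is zero.
        return 0.0
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(u[d] * v[d] for d in shared) / (norm_u * norm_v)


# Illustrative ratings only.
ratings = {
    1: {10: 4.0, 20: 5.0},
    2: {10: 4.0, 20: 5.0},
    3: {30: 3.0},
}
sim_same = sparse_cosine(ratings[1], ratings[2])
sim_disjoint = sparse_cosine(ratings[1], ratings[3])
print(round(sim_same, 6), sim_disjoint)
```

With a 99.98% sparse matrix, most user pairs are likely to fall into the second case, so much of the dense 7,275 × 7,275 result is probably zeros.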


## ✅ Step 11: Test Collaborative Filtering

In [20]:
def get_collaborative_recommendations(user_id, user_similarity_df, user_item_matrix, df, top_n=10):
    """
    Get collaborative filtering recommendations for a user
    
    Parameters:
    -----------
    user_id : int
        User ID
    user_similarity_df : DataFrame
        User similarity matrix
    user_item_matrix : DataFrame
        User-item interaction matrix
    df : DataFrame
        Destinations dataframe
    top_n : int
        Number of recommendations
    
    Returns:
    --------
    DataFrame with recommendations
    """
    try:
        if user_id not in user_similarity_df.index:
            return pd.DataFrame()
        
        # Get similar users (top 10)
        similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:11]
        
        # Get destinations rated by target user
        user_rated = user_item_matrix.loc[user_id]
        user_rated_dest = user_rated[user_rated > 0].index.tolist()
        
        # Aggregate ratings from similar users
        recommendations = {}
        for sim_user, similarity in similar_users.items():
            sim_user_ratings = user_item_matrix.loc[sim_user]
            for dest_id, rating in sim_user_ratings.items():
                if rating > 0 and dest_id not in user_rated_dest:
                    if dest_id not in recommendations:
                        recommendations[dest_id] = 0
                    recommendations[dest_id] += rating * similarity
        
        # Sort and get top N
        top_dest = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
        
        if not top_dest:
            return pd.DataFrame()
        
        # Get destination details
        dest_ids = [d[0] for d in top_dest]
        result = df[df['DestinationID'].isin(dest_ids)][['DestinationID', 'Name', 'City', 'State/UT', 'Type', 'Popularity']].copy()
        
        # Add predicted scores
        score_map = {d[0]: round(d[1], 3) for d in top_dest}
        result['Predicted_Score'] = result['DestinationID'].map(score_map)
        result = result.sort_values('Predicted_Score', ascending=False)
        
        return result
    
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()

# Test the function
print("\n✅ Testing Collaborative Filtering Recommendations:")
print("="*80)

test_user = user_item_matrix.index[0]
print(f"\n🎯 Getting recommendations for User ID: {test_user}")
print("-"*80)

collab_recommendations = get_collaborative_recommendations(
    test_user, user_similarity_df, user_item_matrix, df, top_n=5
)

if not collab_recommendations.empty:
    print("\nTop 5 Recommended Destinations:")
    print(collab_recommendations.to_string(index=False))
    print("\n✅ Collaborative filtering recommendations working perfectly!")
else:
    print("❌ No recommendations found")


✅ Testing Collaborative Filtering Recommendations:

🎯 Getting recommendations for User ID: 1
--------------------------------------------------------------------------------

Top 5 Recommended Destinations:
 DestinationID                      Name         City          State/UT       Type  Popularity  Predicted_Score
          9396              Royal Konark       Konark            Odisha       Fort        7.23            5.126
          5777                    Nagpur       Nagpur       Maharashtra       Lake        7.88            0.513
           625        Martand Sun Temple     Anantnag   Jammu & Kashmir Historical        9.50            0.000
          3429 Pin Valley to Mudh Winter         Mudh             Spiti    Offbeat        9.92            0.000
          2041  Bhismaknagar Winter Fort Bhismaknagar Arunachal Pradesh Historical        9.70            0.000

✅ Collaborative filtering recommendations working perfectly!
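
Three of the five results above have a predicted score of 0.000. That is consistent with neighbours whose similarity is 0 still contributing their destinations, since `rating * similarity` is then 0. The scores are also raw sums, so they are not bounded by the rating scale. A minimal sketch of one alternative follows: it keeps only neighbours with positive similarity and divides by the summed similarity to get a weighted average. `predict_scores` and the sample data are hypothetical.

```python
def predict_scores(target, similarities, ratings, top_k=10):
    """Similarity-weighted average rating over neighbours with positive similarity."""
    neighbours = sorted(
        ((u, s) for u, s in similarities.items() if u != target and s > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )[:top_k]
    seen = set(ratings.get(target, {}))
    weighted, weights = {}, {}
    for user, sim in neighbours:
        for dest, rating in ratings.get(user, {}).items():
            if dest in seen:
                continue  # skip destinations the target already rated
            weighted[dest] = weighted.get(dest, 0.0) + sim * rating
            weights[dest] = weights.get(dest, 0.0) + sim
    return {dest: weighted[dest] / weights[dest] for dest in weighted}


# Illustrative data only; u4 has zero similarity to the target u1.
ratings = {
    "u1": {"A": 5.0},
    "u2": {"A": 4.0, "B": 4.0},
    "u3": {"A": 5.0, "B": 2.0, "C": 5.0},
    "u4": {"D": 5.0},
}
similarities = {"u1": 1.0, "u2": 0.8, "u3": 0.4, "u4": 0.0}
scores = predict_scores("u1", similarities, ratings)
print(scores)
```

Destination D never appears because its only rater has zero similarity, and the remaining scores stay within the range of the input ratings.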


## 💾 Step 12: Save All Models and Files

In [21]:
print("\n💾 Saving all models and files...")
print("="*80)

# 1. Save TF-IDF Vectorizer
print("\n[1/5] Saving TF-IDF Vectorizer...")
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
size = os.path.getsize('tfidf_vectorizer.pkl') / 1024
print(f"      ✅ tfidf_vectorizer.pkl ({size:.1f} KB)")

# 2. Save Cosine Similarity Matrix
print("\n[2/5] Saving Cosine Similarity Matrix...")
with open('cosine_similarity.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)
size = os.path.getsize('cosine_similarity.pkl') / (1024 * 1024)
print(f"      ✅ cosine_similarity.pkl ({size:.1f} MB)")

# 3. Save Indices
print("\n[3/5] Saving Indices Mapping...")
with open('indices.pkl', 'wb') as f:
    pickle.dump(indices, f)
size = os.path.getsize('indices.pkl') / 1024
print(f"      ✅ indices.pkl ({size:.1f} KB)")

# 4. Save User-Item Matrix
print("\n[4/5] Saving User-Item Matrix...")
user_item_matrix.to_pickle('user_item_matrix.pkl')
size = os.path.getsize('user_item_matrix.pkl') / (1024 * 1024)
print(f"      ✅ user_item_matrix.pkl ({size:.1f} MB)")

# 5. Save User Similarity DataFrame
print("\n[5/5] Saving User Similarity Matrix...")
user_similarity_df.to_pickle('user_similarity_df.pkl')
size = os.path.getsize('user_similarity_df.pkl') / (1024 * 1024)
print(f"      ✅ user_similarity_df.pkl ({size:.1f} MB)")

print("\n" + "="*80)
print("🎉 ALL MODELS SAVED SUCCESSFULLY!")
print("="*80)


💾 Saving all models and files...

[1/5] Saving TF-IDF Vectorizer...
      ✅ tfidf_vectorizer.pkl (212.6 KB)

[2/5] Saving Cosine Similarity Matrix...
      ✅ cosine_similarity.pkl (757.5 MB)

[3/5] Saving Indices Mapping...
      ✅ indices.pkl (246.8 KB)

[4/5] Saving User-Item Matrix...
      ✅ user_item_matrix.pkl (412.0 MB)

[5/5] Saving User Similarity Matrix...
      ✅ user_similarity_df.pkl (403.8 MB)

🎉 ALL MODELS SAVED SUCCESSFULLY!
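
A quick round-trip check can confirm that a saved artifact loads back unchanged. The sketch below writes a small dictionary to a temporary directory and reads it again. `save_pickle`, `load_pickle`, and the sample mapping are hypothetical, and how `app.py` actually loads these files is not shown in this notebook. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path using the highest pickle protocol available."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Small stand-in for the real name -> index mapping.
indices_example = {"Taj Mahal": 0, "Amber Fort": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "indices_example.pkl")
    save_pickle(indices_example, path)
    size_bytes = os.path.getsize(path)
    restored = load_pickle(path)

print(restored == indices_example, size_bytes > 0)
```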


## ✅ Step 13: Verification and Final Summary

In [22]:
print("\n" + "="*80)
print("VERIFICATION - Checking All Required Files")
print("="*80)

required_files = [
    'tfidf_vectorizer.pkl',
    'cosine_similarity.pkl',
    'indices.pkl',
    'user_item_matrix.pkl',
    'user_similarity_df.pkl'
]

data_files = [
    'Users_df.csv',
    'Destination_df.csv',
    'Users_History_df.csv',
    'Reviews_df.csv'
]

print("\n📦 Model Files:")
print("-"*80)
all_models_present = True
for file in required_files:
    if os.path.exists(file):
        size = os.path.getsize(file) / (1024 * 1024)
        print(f"   ✅ {file:<30} ({size:.2f} MB)")
    else:
        print(f"   ❌ {file:<30} NOT FOUND")
        all_models_present = False

print("\n📊 Data Files:")
print("-"*80)
all_data_present = True
for file in data_files:
    if os.path.exists(file):
        size = os.path.getsize(file) / (1024 * 1024)
        print(f"   ✅ {file:<30} ({size:.2f} MB)")
    else:
        print(f"   ❌ {file:<30} NOT FOUND")
        all_data_present = False

print("\n" + "="*80)

if all_models_present and all_data_present:
    print("✅ SUCCESS! All required files are present and ready.")
    print("\n📱 You can now run the Streamlit application:")
    print("\n   streamlit run app.py")
    print("\n💡 The app will load all models automatically!")
else:
    print("⚠️  WARNING: Some files are missing. Please check above.")


VERIFICATION - Checking All Required Files

📦 Model Files:
--------------------------------------------------------------------------------
   ✅ tfidf_vectorizer.pkl           (0.21 MB)
   ✅ cosine_similarity.pkl          (757.46 MB)
   ✅ indices.pkl                    (0.24 MB)
   ✅ user_item_matrix.pkl           (411.95 MB)
   ✅ user_similarity_df.pkl         (403.85 MB)

📊 Data Files:
--------------------------------------------------------------------------------
   ✅ Users_df.csv                   (1.63 MB)
   ✅ Destination_df.csv             (7.59 MB)
   ✅ Users_History_df.csv           (1.34 MB)
   ✅ Reviews_df.csv                 (1.39 MB)

✅ SUCCESS! All required files are present and ready.

📱 You can now run the Streamlit application:

   streamlit run app.py

💡 The app will load all models automatically!


## 📊 Step 14: Model Statistics and Performance Metrics

In [23]:
print("\n" + "="*80)
print("MODEL STATISTICS AND PERFORMANCE METRICS")
print("="*80)

print("\n📊 Dataset Statistics:")
print("-"*80)
print(f"   • Total Destinations: {len(destinations_df):,}")
print(f"   • Total Users: {len(users_df):,}")
print(f"   • Total Trips: {len(history_df):,}")
print(f"   • Total Reviews: {len(reviews_df):,}")
print(f"   • Destination Types: {destinations_df['Type'].nunique()}")
print(f"   • States/UTs: {destinations_df['State/UT'].nunique()}")

print("\n🤖 Content-Based Filtering Model:")
print("-"*80)
print(f"   • TF-IDF Features: {tfidf_matrix.shape[1]:,}")
print(f"   • Destinations Covered: {len(indices):,}")
print(f"   • Similarity Matrix Size: {cosine_sim.shape[0]:,} × {cosine_sim.shape[1]:,}")
print(f"   • Feature Sparsity: {(1.0 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")

print("\n👥 Collaborative Filtering Model:")
print("-"*80)
print(f"   • Active Users: {user_item_matrix.shape[0]:,}")
print(f"   • Rated Destinations: {user_item_matrix.shape[1]:,}")
print(f"   • Total Ratings: {(user_item_matrix > 0).sum().sum():,}")
total_cells = user_item_matrix.shape[0] * user_item_matrix.shape[1]
non_zero = (user_item_matrix != 0).sum().sum()
print(f"   • Matrix Sparsity: {(1 - (non_zero / total_cells)) * 100:.2f}%")
print(f"   • Avg Ratings per User: {non_zero / user_item_matrix.shape[0]:.1f}")
print(f"   • User Similarity Matrix: {user_similarity_df.shape[0]:,} × {user_similarity_df.shape[1]:,}")

print("\n🏆 Top Destination Types:")
print("-"*80)
top_types = destinations_df['Type'].value_counts().head(10)
for i, (dtype, count) in enumerate(top_types.items(), 1):
    print(f"   {i:2d}. {dtype:<25} : {count:>5} destinations")

print("\n🗺️  Top States/UTs:")
print("-"*80)
top_states = destinations_df['State/UT'].value_counts().head(10)
for i, (state, count) in enumerate(top_states.items(), 1):
    print(f"   {i:2d}. {state:<25} : {count:>5} destinations")

print("\n" + "="*80)
print("✅ NOTEBOOK EXECUTION COMPLETE!")
print("="*80)
print("\n🎉 All models trained and saved successfully!")
print("📱 Ready to deploy: streamlit run app.py")
print("\n" + "="*80)


MODEL STATISTICS AND PERFORMANCE METRICS

📊 Dataset Statistics:
--------------------------------------------------------------------------------
   • Total Destinations: 9,964
   • Total Users: 10,000
   • Total Trips: 12,275
   • Total Reviews: 10,000
   • Destination Types: 44
   • States/UTs: 165

🤖 Content-Based Filtering Model:
--------------------------------------------------------------------------------
   • TF-IDF Features: 5,000
   • Destinations Covered: 9,866
   • Similarity Matrix Size: 9,964 × 9,964
   • Feature Sparsity: 99.26%

👥 Collaborative Filtering Model:
--------------------------------------------------------------------------------
   • Active Users: 7,275
   • Rated Destinations: 7,420
   • Total Ratings: 12,274
   • Matrix Sparsity: 99.98%
   • Avg Ratings per User: 1.7
   • User Similarity Matrix: 7,275 × 7,275

🏆 Top Destination Types:
----------------------------------------------------------------------------