<a href="https://colab.research.google.com/github/Gcarmnonapy7/CIFAR-Classificator/blob/main/movies_cluesterization_%26crowd_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



1. Input Data
   - TMDb metadata
   - Box office data
   - Holiday/weekend
        |
        v
2. Data Cleaning
   - Parse genres/keywords
   - Fill missing values
   - Log-transform nums
   - One-hot categories
        |
        v
3. Feature Engineering
   - Genre strength
   - Keyword strength
   - Budget / revenue
   - Historical trends
        |
        v
4. Movie Similarity / Clustering
   - Cluster movies by features
   - Cosine similarity for keywords
   - Output: cluster label per movie
        |
        v
5. ML Crowd Prediction
   - Input: features + cluster
   - Target: total predicted crowd
   - Output: predicted total audience
        |
        v
6. Weekly Distribution
   - Split total predicted crowd per week
   - Adjust for holidays, weekends
   - Output: weekly predicted attendance
        |
        v
7. Cinema & Room Assignment
   - Simulate cinemas & rooms
   - Assign movies to rooms based on weekly attendance
   - Respect room capacities
   - Output: schedule per cinema/week
        |
        v
8. Output / Team Consumption
   - Export Excel / CSV / JSON
   - Columns: Movie, Cinema, Room, Week, Predicted Crowd


In [35]:
# Import dependencies
import json
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [36]:
import kagglehub
import os
path_tmdb = kagglehub.dataset_download("tmdb/tmdb-movie-metadata")
path_movie_box = kagglehub.dataset_download("aditya126/movies-box-office-dataset-2000-2024")

Using Colab cache for faster access to the 'tmdb-movie-metadata' dataset.
Using Colab cache for faster access to the 'movies-box-office-dataset-2000-2024' dataset.


In [37]:
print(os.listdir(path_tmdb))
print(os.listdir(path_movie_box))

['tmdb_5000_movies.csv', 'tmdb_5000_credits.csv']
['enhanced_box_office_data(2000-2024)u.csv']


In [38]:
#Import datasets

tmdb = pd.read_csv(os.path.join(path_tmdb,'tmdb_5000_movies.csv'))
box = pd.read_csv(os.path.join(path_movie_box,'enhanced_box_office_data(2000-2024)u.csv'))

$$\text{Predicted Crowd}_{w} = \frac{\sum_{i} \text{similarity}_{i} \times \text{crowd}_{i,w}}{\sum_{i} \text{similarity}_{i}}$$
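The predicted crowd for week w is a similarity-weighted average: each reference movie i contributes its week-w crowd in proportion to its similarity to the target movie. A minimal sketch with NumPy follows. It assumes non-negative similarities and a crowd matrix of shape (movies, weeks); the function name `predict_weekly_crowd` and the toy numbers are illustrative and not part of the notebook.

```python
import numpy as np

def predict_weekly_crowd(similarity, crowd):
    """Similarity-weighted average of weekly crowds (hypothetical helper).

    similarity: shape (n_movies,), similarity of each reference movie to the target.
    crowd: shape (n_movies, n_weeks), crowd of each reference movie per week.
    """
    similarity = np.asarray(similarity, dtype=float)
    crowd = np.asarray(crowd, dtype=float)
    total = similarity.sum()
    if total == 0:
        raise ValueError("at least one similarity must be non-zero")
    # Numerator: sum_i similarity_i * crowd_{i,w}; denominator: sum_i similarity_i
    return similarity @ crowd / total

# Toy data (made up): two reference movies, three weeks
similarity = [0.75, 0.25]
crowd = [[1000, 600, 200],
         [200, 200, 200]]
weekly = predict_weekly_crowd(similarity, crowd)
print(weekly)
```

Dividing by the sum of similarities keeps the prediction on the same scale as the observed crowds, whatever the absolute size of the similarity scores.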

In [39]:
print(tmdb.shape)
print(box.shape)

(4803, 20)
(5000, 13)


In [40]:
box.info(
    verbose=True,
    show_counts=True
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Rank                  5000 non-null   int64  
 1   Release Group         5000 non-null   object 
 2   $Worldwide            5000 non-null   float64
 3   $Domestic             5000 non-null   float64
 4   Domestic %            5000 non-null   float64
 5   $Foreign              5000 non-null   float64
 6   Foreign %             5000 non-null   float64
 7   Year                  5000 non-null   int64  
 8   Genres                4822 non-null   object 
 9   Rating                4830 non-null   object 
 10  Vote_Count            4830 non-null   float64
 11  Original_Language     4830 non-null   object 
 12  Production_Countries  4800 non-null   object 
dtypes: float64(6), int64(2), object(5)
memory usage: 507.9+ KB


In [41]:
tmdb.info(
    verbose=True,
    show_counts=True
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [42]:
def detect_outliers(df):
  pass
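The `detect_outliers` stub above is left empty in the notebook. One common option, shown here only as a sketch, is the interquartile-range (IQR) rule. The function name `detect_outliers_iqr`, the multiplier `k=1.5`, and the toy data are assumptions, not the author's chosen method.

```python
import pandas as pd

def detect_outliers_iqr(df, columns, k=1.5):
    """Return a boolean DataFrame marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame columns
    return (df[columns] < lower) | (df[columns] > upper)

# Toy data (made up): one value far above the rest
toy = pd.DataFrame({"revenue": [10, 12, 11, 13, 12, 500]})
mask = detect_outliers_iqr(toy, ["revenue"])
print(toy[mask.any(axis=1)])
```

For heavily skewed columns such as revenue or budget, it may be more sensible to apply the rule after a log transform, since the raw scale flags most blockbusters as outliers.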

In [43]:
def clean_keywords(column_keywords):

    if column_keywords is None or (isinstance(column_keywords, float) and pd.isna(column_keywords)):
        return []

    if isinstance(column_keywords, str):
        try:
            column_keywords = json.loads(column_keywords)
        except (json.JSONDecodeError, TypeError):
            return []

    if isinstance(column_keywords, list):
        return [item['name'] if isinstance(item, dict) else str(item) for item in column_keywords]

    return []

tmdb['keywords'] = tmdb['keywords'].apply(clean_keywords)

In [44]:
tmdb['keywords']

,keywords
0,"[culture clash, future, space war, space colon..."
1,"[ocean, drug abuse, exotic island, east india ..."
2,"[spy, based on novel, secret agent, sequel, mi..."
3,"[dc comics, crime fighter, terrorist, secret i..."
4,"[based on novel, mars, medallion, space travel..."
...,...
4798,"[united states–mexico barrier, legs, arms, pap..."
4799,[]
4800,"[date, love at first sight, narration, investi..."
4801,[]


In [45]:
exploded_keywords = tmdb.explode('keywords')

words_freq = (
    exploded_keywords
    .groupby('keywords')
    ['revenue']
    .mean()
)

tmdb['keyword_strength'] = tmdb['keywords'].apply(
    lambda keywords: np.median([words_freq[word] for word in keywords]) if len(keywords) > 0 else 0
)
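To make the keyword-strength computation concrete, here is a toy walk-through of the same steps (explode, per-keyword mean revenue, per-movie median). The three-row DataFrame and the name `keyword_avg` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "keywords": [["spy", "sequel"], ["spy"], []],
    "revenue": [300.0, 100.0, 50.0],
})

# Mean revenue of all movies carrying each keyword (empty lists drop out)
keyword_avg = toy.explode("keywords").groupby("keywords")["revenue"].mean()

# Per-movie strength: median of its keywords' averages, 0 when it has none
toy["keyword_strength"] = toy["keywords"].apply(
    lambda kws: float(np.median([keyword_avg[k] for k in kws])) if kws else 0.0
)
print(toy)
```

Each keyword's average includes the revenue of the movie being scored, so the feature carries some information about the movie's own revenue. That is worth keeping in mind if revenue (or something derived from it) is later used as a prediction target.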

In [46]:
tmdb['keyword_strength'].describe()

,keyword_strength
count,4803.0
mean,88384940.0
std,77671390.0
min,0.0
25%,43176290.0
50%,73160930.0
75%,115159100.0
max,877244800.0


In [47]:
columns_to_rename_box_df = {
    'Release Group': 'title',
    '$Worldwide': 'worldwide_gross',
    'Rating': 'rating',
    'Production_Countries': 'countries',
    'Original_Language': 'language',
    '$Domestic': 'contry_view',  # the box column is named '$Domestic'
}

In [48]:
box.rename(columns=columns_to_rename_box_df,inplace=True) # title and total crowd around the world

In [49]:
# === EDA ===
import matplotlib.pyplot as plt
import seaborn as sns


sns.set_style('darkgrid')

def get_top_countries(df, column_name, top_n=10):
    # Split the comma-separated values into one row per name.
    # The caller's DataFrame is not modified, so the cell can be re-run safely.
    names = df[column_name].str.split(',').explode().str.strip()

    return names.value_counts().head(top_n)

get_top_countries(box,'countries')

countries,count
United States of America,3153
United Kingdom,616
France,454
Japan,436
China,394
Germany,362
Canada,243
South Korea,214
Hong Kong,191
India,158


In [51]:
DROP_COLUMNS_TMDB = {
    'overview',
    'original_title',
    'title',
    'production_companies',
    'production_countries',
    'id',
    'homepage',
    'status',
    'tagline',
    'spoken_languages'
}

tmdb = tmdb.drop(columns=DROP_COLUMNS_TMDB)

In [52]:
def clean_genres(col):
  if pd.isna(col) or col == '':
    return ''

  genres = json.loads(col)

  return ' '.join([g['name'].replace(' ','_').lower() for g in genres])

tmdb['genres'] = tmdb['genres'].apply(clean_genres)
tmdb['release_year'] = pd.to_datetime(tmdb['release_date']).dt.year


In [53]:
tmdb['genres_list'] = tmdb['genres'].apply(lambda x : x.split())
genre_exploded = tmdb.explode('genres_list')
genre_counts = genre_exploded['genres_list'].value_counts()
genre_avg_revenue = (
    genre_exploded
    .groupby('genres_list')
    ['revenue'].mean())

tmdb['genre_revenue_strength'] = tmdb['genres_list'].apply(lambda genres : np.mean([genre_avg_revenue[gen] for gen in genres] ) if len(genres) > 0 else 0)

In [61]:
display(genre_avg_revenue.sort_values(ascending=False).head(15))

print("\nSummary of Genre Revenue Strength:")
display(tmdb['genre_revenue_strength'].describe())

genres_list,revenue
animation,225693000.0
adventure,208660200.0
fantasy,193354200.0
family,162345500.0
science_fiction,152456500.0
action,141213100.0
war,84155870.0
thriller,81044290.0
mystery,78300930.0
comedy,71289500.0



Summary of Genre Revenue Strength:


,genre_revenue_strength
count,4775.0
mean,88047840.0
std,40093820.0
min,5101770.0
25%,61136040.0
50%,71289500.0
75%,111128700.0
max,225693000.0


In [54]:
# Ensure tmdb is a proper copy to avoid SettingWithCopyWarning
tmdb = tmdb.copy()

# Fill 0 budgets with the median of their genre group
tmdb['budget'] = tmdb.groupby('genres')['budget'].transform(
    lambda x: x.replace(0, x.median())
)

# Fallback: rows that are still 0 (groups whose median budget is 0, e.g. all budgets were 0) get the global median of non-zero budgets
global_median_budget = tmdb[tmdb['budget'] > 0]['budget'].median()
tmdb.loc[tmdb['budget'] == 0, 'budget'] = global_median_budget

print(f"Rows with 0 budget remaining: {len(tmdb[tmdb['budget'] == 0])}")
print(f"Global median used for fallback: {global_median_budget}")

Rows with 0 budget remaining: 0
Global median used for fallback: 18000000.0


In [55]:
tmdb = tmdb[tmdb['genres_list'].apply(len) > 0]

In [56]:
tmdb[['genres_list','genre_revenue_strength']]

,genres_list,genre_revenue_strength
0,"[action, adventure, fantasy, science_fiction]",1.739210e+08
1,"[adventure, fantasy, action]",1.810758e+08
2,"[action, adventure, crime]",1.386747e+08
3,"[action, crime, drama, thriller]",8.513107e+07
4,"[action, adventure, science_fiction]",1.674433e+08
...,...,...
4797,"[foreign, thriller]",4.070447e+07
4798,"[action, crime, thriller]",9.613602e+07
4799,"[comedy, romance]",6.564595e+07
4800,"[comedy, drama, romance, tv_movie]",4.585203e+07


In [57]:
print(len(tmdb[tmdb['genres_list'].apply(len) == 0]))  # verify that no rows without a genre remain

0


In [None]:
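columns_to_log = ['revenue', 'budget', 'vote_count', 'keyword_strength']

def transform_log1p(data, columns_to_log):
    # Work on a copy so re-running the cell does not apply log1p to tmdb twice
    data = data.copy()
    # Only transform the requested columns that actually exist
    valid_columns = [col for col in columns_to_log if col in data.columns]
    data[valid_columns] = np.log1p(data[valid_columns])
    return data

tmdb_log1p = transform_log1p(tmdb, columns_to_log)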

In [None]:
cols_to_check = ['budget', 'revenue', 'vote_count', 'keyword_strength']
tmdb_model = tmdb_log1p[(tmdb_log1p[cols_to_check] > 0).all(axis=1)].copy()
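Because `log1p(x) > 0` only when `x > 0`, the filter above drops rows where any of the checked columns was 0 before the transform. If a model is later trained on the log-transformed target (an assumption, since no model appears in this part of the notebook), its predictions would be in log space and would need `np.expm1` to return to the original scale. A small round-trip sketch with made-up values:

```python
import numpy as np

revenue = np.array([0.0, 1_000.0, 2_500_000.0])

log_revenue = np.log1p(revenue)    # log(1 + x); maps 0 to 0
restored = np.expm1(log_revenue)   # exp(x) - 1, the inverse of log1p

print(log_revenue)
print(restored)
```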

In [None]:
print(tmdb_log1p.shape)
print(tmdb_model.shape)

In [None]:
tmdb_model.describe()

In [None]:
# Merging box and tmdb on title (inner)
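A possible shape for this merge is sketched below on toy frames. Note that `title` is listed in `DROP_COLUMNS_TMDB` earlier in this notebook, so it would have to be kept in `tmdb` for a title-based join to work. The helper `normalize_title`, the `title_key` column, and all values are hypothetical.

```python
import pandas as pd

def normalize_title(title):
    """Lowercase and collapse whitespace so near-identical titles match."""
    return " ".join(str(title).lower().split())

# Toy frames standing in for tmdb and box (values are made up)
tmdb_toy = pd.DataFrame({"title": ["Avatar", "Spectre ", "Only In TMDb"],
                         "budget": [237, 245, 1]})
box_toy = pd.DataFrame({"title": ["avatar", "Spectre", "Only In Box"],
                        "worldwide_gross": [2788, 880, 5]})

tmdb_toy["title_key"] = tmdb_toy["title"].map(normalize_title)
box_toy["title_key"] = box_toy["title"].map(normalize_title)

# Inner join keeps only titles present in both frames
merged = tmdb_toy.merge(box_toy, on="title_key", how="inner",
                        suffixes=("_tmdb", "_box"))
print(merged[["title_key", "budget", "worldwide_gross"]])
```

Titles can repeat across years (remakes, re-releases), so joining on the title together with `release_year` in `tmdb` and `Year` in `box` may give cleaner matches than the title alone.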