# Data Analysis - Movie Recommendation System

This notebook focuses on initial data exploration, cleaning, and basic statistical analysis of the movie dataset.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import ast
import warnings
warnings.filterwarnings('ignore')  # silences every warning; narrow or remove this when debugging

## 2. Load Data

In [2]:
df = pd.read_csv('../../data/final/sample.csv')
df.head()

| | title | genres | positive_users | positive_count | negative_users | negative_count | vote_average | vote_count | status | release_date | revenue | runtime | adult | budget | original_language | overview | poster_path | production_companies | keywords | tmdb_id |
| --- | --- | --- | --- | ---: | --- | ---: | ---: | ---: | --- | --- | ---: | ---: | --- | ---: | --- | --- | --- | --- | --- | ---: |
| 0 | Much Ado About Nothing: Shakespeare's Globe Th... | ['comedy', 'drama'] | [61947, 172263] | 2 | [] | 0 | 8.0 | 2 | Released | 2012-10-09 | 0 | 167 | False | 0 | English | Much Ado About Nothing is a comedic play by Wi... | /jkzvbyBQNo4NrLYtmllz6TIryJY.jpg | Shakespeare's Globe | theater play | 210695 |
| 1 | Piranhaconda | ['horror', 'sci-fi'] | [18184, 177617, 198781] | 3 | [49217, 62537, 68978, 71504, 78715, 80729, 814... | 22 | 5.068 | 139 | Released | 2012-06-16 | 0 | 90 | False | 1000000 | English | A hybrid creature - half piranha and half anac... | /oqgQghwDwyv4PzyaoMUnhgtiCjt.jpg | New Horizons Picture | ransom, hawaii, water monster, filmmaking, kil... | 115084 |
| 2 | Edge of Fury | ['thriller'] | [] | 0 | [189614] | 1 | 5.5 | 6 | Released | 1958-05-01 | 0 | 77 | False | 0 | English | A psychopathic young beachcomber pretends to b... | /A0C3I8F6PRim9FmCytNfP1RLouk.jpg | Wisteria Productions | summer house | 35128 |
| 3 | Bird of Paradise | ['adventure', 'drama', 'romance'] | [] | 0 | [37118, 189614, 289344] | 3 | 5.139 | 18 | Released | 1932-08-12 | 753000 | 80 | False | 0 | English | When a young South Seas sailor falls overboard... | /n2vjVQ1Y3m0LBTm59bYm1NfclqW.jpg | RKO Radio Pictures | exotic island, volcano, love, pre-code | 92341 |
| 4 | Critters 2: The Main Course | ['comedy', 'horror', 'sci-fi'] | [1109, 1429, 6807, 36214, 48682, 49121, 52961,... | 52 | [1511, 2242, 2364, 4605, 4880, 5773, 6416, 793... | 274 | 5.957 | 370 | Released | 1988-04-29 | 3813293 | 86 | False | 4500000 | English | A batch of unhatched critter eggs are mistaken... | /5rR3dVJXnXskkBBWKbLg5EDKWtT.jpg | Sho Films, New Line Cinema | spacecraft, small town, bounty hunter, hamburg... | 10127 |
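
The preview shows `genres`, `positive_users`, and `negative_users` as bracketed lists, but a CSV stores them as plain strings. The notebook imports `ast` without showing where it is used, so the following is only a minimal sketch of one way such columns could be parsed. The frame `sample` copies a few values from the preview, and `parse_list` is a hypothetical helper, not part of the original notebook.

```python
import ast

import pandas as pd

# Stand-in frame built from three rows of the preview above;
# in the notebook the data would come from pd.read_csv instead.
sample = pd.DataFrame({
    "title": ["Piranhaconda", "Edge of Fury", "Bird of Paradise"],
    "genres": [
        "['horror', 'sci-fi']",
        "['thriller']",
        "['adventure', 'drama', 'romance']",
    ],
    "positive_users": ["[18184, 177617, 198781]", "[]", "[]"],
})


def parse_list(cell):
    """Hypothetical helper: turn a string such as "['a', 'b']" into a list."""
    if isinstance(cell, list):
        return cell  # already parsed, leave untouched
    return ast.literal_eval(cell)  # evaluates literals only, never arbitrary code


for col in ["genres", "positive_users"]:
    sample[col] = sample[col].apply(parse_list)

# With real lists, lengths can be compared against the stored *_count columns.
sample["n_positive"] = sample["positive_users"].apply(len)
print(sample[["title", "genres", "n_positive"]])
```

`ast.literal_eval` is preferred over `eval` here because it refuses anything that is not a Python literal.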


## 3. Initial Exploration

In [3]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   title                 15000 non-null  str    
 1   genres                15000 non-null  str    
 2   positive_users        15000 non-null  str    
 3   positive_count        15000 non-null  int64  
 4   negative_users        15000 non-null  str    
 5   negative_count        15000 non-null  int64  
 6   vote_average          15000 non-null  float64
 7   vote_count            15000 non-null  int64  
 8   status                15000 non-null  str    
 9   release_date          15000 non-null  str    
 10  revenue               15000 non-null  int64  
 11  runtime               15000 non-null  int64  
 12  adult                 15000 non-null  bool   
 13  budget                15000 non-null  int64  
 14  original_language     15000 non-null  str    
 15  overview              15000 no
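
The summary lists `release_date` as a string column. If date arithmetic or a release year is needed later, the column could be converted along these lines. This is a sketch under the assumption that every date follows the `YYYY-MM-DD` pattern seen in the preview; the variable names are illustrative.

```python
import pandas as pd

# Release dates copied from the five preview rows.
dates = pd.Series(
    ["2012-10-09", "2012-06-16", "1958-05-01", "1932-08-12", "1988-04-29"],
    name="release_date",
)

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Count entries that failed to parse before relying on the result.
print("unparseable dates:", parsed.isna().sum())

release_year = parsed.dt.year
print(release_year.tolist())
```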

In [4]:
df.describe()

| | positive_count | negative_count | vote_average | vote_count | revenue | runtime | budget | tmdb_id |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| count | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 |
| mean | 763.696467 | 251.449467 | 6.425128 | 801.752533 | 31054010.0 | 100.474933 | 10575800.0 | 180215.2 |
| std | 3548.358195 | 794.137707 | 1.017764 | 2222.053151 | 111258300.0 | 29.571787 | 27826540.0 | 224555.4 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 25% | 2.0 | 2.0 | 5.8 | 27.0 | 0.0 | 89.0 | 0.0 | 18208.25 |
| 50% | 13.0 | 12.0 | 6.625 | 105.0 | 0.0 | 99.0 | 0.0 | 61585.0 |
| 75% | 174.0 | 107.0 | 7.1 | 492.0 | 11072460.0 | 113.0 | 7456758.0 | 308291.8 |
| max | 85220.0 | 14719.0 | 10.0 | 34495.0 | 2923706000.0 | 743.0 | 379000000.0 | 1136736.0 |
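
The median of both `revenue` and `budget` is 0, and `runtime` has a minimum of 0. One plausible reading, which the data shown here does not confirm, is that 0 stands in for "unknown". Under that assumption, the sketch below measures how common the zeros are and masks them before summarising. The frame `sample` reuses the five preview rows, and `placeholder_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Values taken from the five preview rows shown earlier.
sample = pd.DataFrame({
    "revenue": [0, 0, 0, 753000, 3813293],
    "budget": [0, 1000000, 0, 0, 4500000],
    "runtime": [167, 90, 77, 80, 86],
})

# Assumption: a zero in these columns means the value was not recorded.
placeholder_cols = ["revenue", "budget", "runtime"]

# Percentage of rows holding a zero in each column.
zero_share = (sample[placeholder_cols] == 0).mean() * 100
print(zero_share.round(1))

# Mask zeros as NaN so that median(), mean(), etc. skip them.
masked = sample[placeholder_cols].replace(0, np.nan)
print(masked.median())
```

Masking on a copy keeps the original values available in case the zeros turn out to be genuine.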


## 4. Missing Values Analysis

In [5]:
missing_values = df.isnull().sum()
missing_percent = missing_values / len(df) * 100  # reuse the counts computed above
pd.concat([missing_values, missing_percent], axis=1, keys=['Missing Count', 'Percent'])

| | Missing Count | Percent |
| --- | ---: | ---: |
| title | 0 | 0.0 |
| genres | 0 | 0.0 |
| positive_users | 0 | 0.0 |
| positive_count | 0 | 0.0 |
| negative_users | 0 | 0.0 |
| negative_count | 0 | 0.0 |
| vote_average | 0 | 0.0 |
| vote_count | 0 | 0.0 |
| status | 0 | 0.0 |
| release_date | 0 | 0.0 |
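
With twenty columns, a table that lists every column is long even when most rows are zero. A possible refinement is to report only the affected columns, worst first. The sketch below uses a made-up frame `toy` (its values are not from the dataset) and a hypothetical helper `missing_summary`.

```python
import pandas as pd

# Made-up data for illustration only; not taken from the movie dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["text", None, "text", None],
    "keywords": ["x", "y", None, "z"],
    "runtime": [90, 100, 110, 120],
})


def missing_summary(frame):
    """Hypothetical helper: null count and percentage per affected column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Percent": counts / len(frame) * 100,
    })
    # Drop columns without nulls and sort the rest in descending order.
    affected = summary[summary["Missing Count"] > 0]
    return affected.sort_values("Missing Count", ascending=False)


report = missing_summary(toy)
print(report)
```

An empty result from `missing_summary(df)` would indicate that no column contains nulls.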


## 5. Statistical Analysis of Numerical Columns

In [6]:
numerical_cols = ['vote_average', 'vote_count', 'revenue', 'runtime', 'budget']
df[numerical_cols].agg(['mean', 'median', 'std', 'min', 'max'])

| | vote_average | vote_count | revenue | runtime | budget |
| --- | ---: | ---: | ---: | ---: | ---: |
| mean | 6.425128 | 801.752533 | 31054010.0 | 100.474933 | 10575800.0 |
| median | 6.625 | 105.0 | 0.0 | 99.0 | 0.0 |
| std | 1.017764 | 2222.053151 | 111258300.0 | 29.571787 | 27826540.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 10.0 | 34495.0 | 2923706000.0 | 743.0 | 379000000.0 |
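
For `vote_count` the mean (about 802) is far above the median (105), which points to a long right tail. A common way to quantify and tame such a tail is to compute the skewness and apply a `log1p` transform. The sketch below does this on the five `vote_count` values from the preview rows; it illustrates the technique and is not a step taken in the original notebook.

```python
import numpy as np
import pandas as pd

# vote_count values from the five preview rows shown earlier.
vote_count = pd.Series([2, 139, 6, 18, 370], name="vote_count")

print("mean:", vote_count.mean(), "median:", vote_count.median())

# Positive skewness means the tail extends toward large values.
print("skewness:", vote_count.skew())

# log1p(x) = log(1 + x) is defined at x = 0 and compresses large values.
log_votes = np.log1p(vote_count)
print(log_votes.agg(["mean", "median", "std"]))
```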


## 6. Categorical Distribution

In [7]:
print("Language Distribution:")
print(df['original_language'].value_counts().head(10))

print("\nStatus Distribution:")
print(df['status'].value_counts())

Language Distribution:
original_language
English     10547
French        742
Japanese      558
Spanish       390
Italian       319
German        313
Korean        258
Russian       246
Hindi         235
Chinese       185
Name: count, dtype: int64

Status Distribution:
status
Released    15000
Name: count, dtype: int64
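
Two things stand out in these counts: English accounts for 10,547 of the 15,000 rows (about 70.3%), and `status` holds a single value, so it cannot help distinguish movies. The sketch below shows how relative frequencies and constant columns could be computed. The frame `toy` is made up for illustration, and `constant_cols` is an illustrative name.

```python
import pandas as pd

# Made-up frame for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "original_language": ["English", "English", "French", "English", "Japanese"],
    "status": ["Released"] * 5,
})

# normalize=True returns proportions instead of raw counts.
language_share = toy["original_language"].value_counts(normalize=True) * 100
print(language_share.round(1))

# A column with exactly one distinct value is constant across all rows.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```

Constant columns found this way are candidates for dropping before modelling.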
