# BUDGET CLASSIFICATION USING DECISION TREES

_**Classifying the budget level of movies for production companies based on parameters such as country, director, genre, expected revenue, main star, etc.**_
Data card is available at https://www.kaggle.com/datasets/danielgrijalvas/movies.

In [None]:
# Imports required packages

import numpy as np
import pandas as pd

import re

from sklearn.model_selection import train_test_split

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

## Data Collection

In [None]:
# Loads dataset from csv file
movies = pd.read_csv("movies.csv")

In [None]:
# Displays a few of the data samples
display(movies.head())

## Exploratory Data Analysis (EDA)

In [None]:
# Checks for basic information about the dataset

movies.info()

**Observations from the basic information**

- Features are both numeric and non-numeric
- Most features have missing values
- The important attribute 'budget' has quite a few missing values

In [None]:
# Checks for the descriptive statistics of the dataset
movies.describe()

In [None]:
# Checks for attributes having missing values in the dataset
movies.isnull().sum()

Because **_budget_** is the dependent attribute (the target) in this experiment, instances with missing values for this attribute need to be removed.
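
As a minimal sketch of this step, the share of missing values per attribute can be computed with `isnull().mean()`, and rows lacking the target can be dropped with `dropna(subset=...)`. The `toy` frame and its values are hypothetical stand-ins, not rows from the movies dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies dataset
toy = pd.DataFrame({
    "budget": [1.0e7, np.nan, 5.0e7, np.nan],
    "rating": ["R", "PG", None, "PG-13"],
})

# Percentage of missing values per attribute
missing_pct = toy.isnull().mean() * 100

# Keeps only the rows where 'budget' is present; other attributes may still be missing
toy_with_budget = toy.dropna(subset=["budget"])
```

Restricting `dropna` to a subset removes only the rows that cannot be labelled, whereas a bare `dropna()` removes every row with any missing value.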

**Checking the values in each categorical attribute**

In [None]:
movies.rating.value_counts()

In [None]:
movies.genre.value_counts()

In [None]:
movies.released.value_counts()

In [None]:
movies.director.value_counts()

In [None]:
movies.writer.value_counts()

In [None]:
movies.star.value_counts()

In [None]:
movies.country.value_counts()

In [None]:
movies.company.value_counts()

## Data Preparation

### Checking for Duplicate Instances

In [None]:
# Drops duplicate instances, if any
movies.drop_duplicates(keep='first', inplace=True)

### Removing Non-required Attributes

In [None]:
# Drops the attribute 'name' from the dataset as it is an identifier-like attribute

movies.drop(["name"], axis=1, inplace=True)

### Removing Instances with Missing Values

In [None]:
# Deletes the instances with missing 'budget' values. Instances with missing values in other
# attributes are also deleted, as they make up only a tiny portion of the dataset:
# 'rating': 1.0%, 'released': 0.02%, 'score': 0.03%, 'votes': 0.03%, 'writer': 0.03%,
# 'star': 0.01%, 'country': 0.03%, 'gross': 2%, 'company': 0.2%, 'runtime': 0.05%

movies.dropna(inplace=True)

In [None]:
# Checks the shape of the dataset after removing instances with missing values

movies.shape

### Preparing Target

In [None]:
# Checks for the budget distribution
movies.budget.plot(kind='hist')

In [None]:
# Segments budget values into bins (labels) so that the train-test split can be stratified
# on them and so that they can serve as the target in this classification experiment.
movies["budget_level"] = pd.cut(movies.budget,
       bins=[0., 25000000., 50000000., 75000000., 100000000., 125000000., 150000000., 
             175000000., 200000000., 225000000., np.inf],
       labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
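
To illustrate how `pd.cut` assigns labels at the bin edges, here is a small sketch with hypothetical budgets and only three bins. By default the intervals are right-closed, so a value equal to an upper edge falls into the lower bin, and a value equal to the lowest edge (here `0.0`) falls outside every bin.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets in dollars, chosen to sit on and around the bin edges
budgets = pd.Series([0.0, 1.0, 25_000_000.0, 25_000_001.0, 300_000_000.0])

# Intervals are (0, 25M], (25M, 50M], (50M, inf)
levels = pd.cut(
    budgets,
    bins=[0.0, 25_000_000.0, 50_000_000.0, np.inf],
    labels=[1, 2, 3],
)
```

If the dataset contained a budget of exactly zero, its level would be missing; passing `include_lowest=True` to `pd.cut` is one way to keep such a row in the first bin.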

In [None]:
# Drops the 'budget' attribute as it is no longer required after binning
movies.drop(["budget"], axis=1, inplace=True)

### Preparing Test Dataset

In [None]:
# Splits dataset into train and test dataset
X_train, X_test = train_test_split(
    movies, test_size=0.20, random_state=42, stratify=movies["budget_level"])
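
The effect of `stratify` can be seen on a small example. This sketch uses a hypothetical frame with an 80/20 class imbalance and checks that both splits keep that proportion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of level 1 and 20 rows of level 2
toy = pd.DataFrame({"feature": range(100), "budget_level": [1] * 80 + [2] * 20})

train, test = train_test_split(
    toy, test_size=0.20, random_state=42, stratify=toy["budget_level"])

# Class proportions in each split
train_share = train["budget_level"].value_counts(normalize=True)
test_share = test["budget_level"].value_counts(normalize=True)
```

Note that `train_test_split` raises a `ValueError` when a stratification class has only one member, so very sparse budget levels may need to be merged before splitting.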

In [None]:
# Separates target attribute from train dataset
y_train = X_train.budget_level.copy()
# Reassigns instead of dropping in place to avoid modifying a slice of 'movies'
X_train = X_train.drop(columns=["budget_level"])

In [None]:
# Separates target attribute from test dataset
y_test = X_test.budget_level.copy()
# Reassigns instead of dropping in place to avoid modifying a slice of 'movies'
X_test = X_test.drop(columns=["budget_level"])

## Modeling

In [None]:
# Stores names of the categorical attributes for later use in pipeline
categorical_attributes = ["rating", "genre", "released", "director", "writer", "star", "country", "company"]

In [None]:
# Configures transformation for categorical attributes
column_transformer = ColumnTransformer([
    # One-hot encodes the categorical attributes. All other (numeric) attributes are
    # dropped, because ColumnTransformer defaults to remainder='drop'.
    ("categorical_pipeline", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_attributes)])
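
The `handle_unknown='ignore'` setting matters here because attributes such as director or star are likely to contain values in the test data that never appear in the train data. The sketch below uses hypothetical genre values to show that an unseen category is encoded as a row of zeros instead of raising an error. It calls `.toarray()` on the default sparse output rather than passing `sparse_output=False`, purely so that it also runs on older scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train and test values; 'Western' does not occur in the train data
train = pd.DataFrame({"genre": ["Action", "Comedy", "Drama"]})
test = pd.DataFrame({"genre": ["Comedy", "Western"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow the sorted train categories: Action, Comedy, Drama
encoded = encoder.transform(test).toarray()
```

A movie whose categorical values are all unseen would therefore reach the model as an all-zero vector, which limits how much the classifier can infer about it.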

In [None]:
# Configures the model pipeline containing the attribute transformer and the model
model_pipeline = Pipeline([
    ("data_transformation", column_transformer),
    ("modeling", RandomForestClassifier(oob_score=True, random_state=42))
])

In [None]:
# Fits the pipeline (transformer and model) on the train data
model_pipeline.fit(X_train, y_train)
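
Because the forest is created with `oob_score=True`, the fitted estimator exposes an out-of-bag accuracy estimate through its `oob_score_` attribute. The sketch below shows how that value can be read from a pipeline step. It uses synthetic data from `make_classification` and a one-step pipeline named `toy_pipeline`, both of which are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the encoded movie attributes (an assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

toy_pipeline = Pipeline([
    ("modeling", RandomForestClassifier(oob_score=True, random_state=42)),
])
toy_pipeline.fit(X, y)

# Accuracy estimated from the samples each tree did not see in its bootstrap sample
oob_accuracy = toy_pipeline.named_steps["modeling"].oob_score_
```

In this notebook the same lookup would be `model_pipeline.named_steps["modeling"].oob_score_`, which gives a validation-style estimate without touching the test data.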

### Prediction and Performance Analysis

**Predicting on the train dataset and evaluating performance**

In [None]:
# Makes predictions on train data
y_train_predictions = model_pipeline.predict(X_train)

# Shows a few of the predictions
y_train_predictions

In [None]:
# Computes the accuracy score on train data
accuracy_score(y_train, y_train_predictions)

**Predicting on the test dataset and evaluating performance**

In [None]:
# Makes predictions on test data
y_test_predictions = model_pipeline.predict(X_test)

In [None]:
# Shows a few of the predictions
y_test_predictions

In [None]:
# Computes the accuracy score on test data
accuracy_score(y_test, y_test_predictions)
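
Accuracy alone can hide how the errors are spread across budget levels, especially if the levels turn out to be imbalanced. As a rough sketch, a confusion matrix makes this visible. The labels below are hypothetical and are not results from this model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted budget levels
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 3, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true levels, columns are predicted levels, in the order given by 'labels'
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
```

Here the accuracy is 0.625 even though level 2 is never predicted correctly. With the fitted pipeline, `ConfusionMatrixDisplay.from_predictions(y_test, y_test_predictions)` would plot the same kind of matrix for the test data.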