The University of Melbourne, School of Computing and Information Systems

# COMP30027 Machine Learning, 2024 Semester 1
## Project 2: IMDB Movie Rating Prediction

The goal of this project is to build and critically analyze supervised machine learning methods for predicting IMDB movie ratings from predictor variables that include the movie title, duration, director and actor names and Facebook likes, keywords, genre, country, budget, and others. There are five possible outcomes, with 0 being the lowest and 4 the highest.

This assignment aims to reinforce the largely theoretical lecture concepts surrounding data representation, classifier construction, evaluation and error analysis by applying them to an open-ended problem. You will also have an opportunity to practice your general problem-solving, written communication, and critical thinking skills.

# Imports


In [92]:
import pandas as pd

# general
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif

# classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# visualisation
from sklearn import tree
import matplotlib.pyplot as plt

# Read data into DataFrames


In [93]:
# Read training and testing dataset into dataframes
train_df = pd.read_csv("project_data/train_dataset.csv")
test_df = pd.read_csv("project_data/test_dataset.csv")

# Separate features and class for training df
X_train = train_df.drop(columns=["imdb_score_binned"])
y_train = train_df["imdb_score_binned"]

# Test dataset only contains features without labels 
X_test = test_df

# Split the training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
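
The split above is purely random. If the five rating bins are imbalanced, a stratified split keeps the class proportions the same in the training and validation sets. The following is a minimal sketch on made-up labels (`X_toy`, `y_toy` and their class counts are hypothetical, not taken from the project data); it only illustrates the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for imdb_score_binned (classes 0-4)
y_toy = pd.Series([0] * 10 + [1] * 20 + [2] * 100 + [3] * 50 + [4] * 20,
                  name="imdb_score_binned")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Compare the class proportions of the two splits
train_props = y_tr.value_counts(normalize=True).sort_index()
val_props = y_va.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "validation": val_props}))
```

Whether stratification changes the reported accuracies on the real data is not something this sketch can show; it would need to be re-run on the project dataset.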


# Preprocess Data

In [94]:
# Define preprocessing steps for numerical and categorical (string-valued) columns, selected by dtype
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


### K-NN 

In [95]:
# Define the pipeline with preprocessing and KNN classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('knn', KNeighborsClassifier())
])

# Define parameter grid
param_grid = {
    'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],  # Different scaling methods
    'knn__n_neighbors': [3, 5, 7, 9],  # Different values of k
    'knn__metric': ['euclidean', 'manhattan']  # Different distance metrics
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters and the best mean cross-validated accuracy (averaged over the 5 folds)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Accuracy:", best_score)

# Evaluate on the validation set using the best model
best_model = grid_search.best_estimator_
val_accuracy = accuracy_score(y_val, best_model.predict(X_val))
print("Validation Accuracy with Best Model:", val_accuracy)

Best Parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 9, 'preprocessor__num__scaler': StandardScaler()}
Best Accuracy: 0.6566805266805267
Validation Accuracy with Best Model: 0.6888519134775375
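
The grid search above prints only the winning configuration. Inspecting `cv_results_` shows how close the other configurations were and how much the fold scores vary. The sketch below uses synthetic data from `make_classification` and a simplified pipeline (`demo_pipeline`, `demo_grid` are illustrative names, not part of the project code), so the numbers it prints say nothing about the IMDB data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption: 3 classes, 6 numeric features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same style of grid as above: swap the scaler step, vary k and the metric
demo_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__metric": ["euclidean", "manhattan"],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, ordered from best to worst mean CV accuracy
results = pd.DataFrame(demo_search.cv_results_)
summary = results[[
    "param_scaler", "param_knn__n_neighbors", "param_knn__metric",
    "mean_test_score", "std_test_score", "rank_test_score",
]].sort_values("rank_test_score")
print(summary.head())
```

If the top few rows differ by less than one `std_test_score`, the choice between them is probably not meaningful.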


### Decision Tree


In [96]:
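# Define imputation, feature selection, and the decision tree classifier
imputer = SimpleImputer(strategy='mean')
feature_selector = SelectKBest(score_func=f_classif, k=8)
# No random_state is set, so the fitted tree (and the accuracy below) can vary between runs
classifier = DecisionTreeClassifier()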

# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', imputer),
    ('feature_selector', feature_selector),
    ('classifier', classifier)
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate on the validation set
accuracy = pipeline.score(X_val, y_val)
print("Validation Accuracy:", accuracy)


Validation Accuracy: 0.5940099833610649
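
To interpret a pipeline like the one above, it helps to check which features `SelectKBest` kept and to read the top of the fitted tree. The sketch below does this on synthetic data with made-up feature names (`f0` ... `f11`); `max_depth=3` and `random_state=0` are assumptions added to keep the printed tree short and reproducible, and are not settings used in the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, n_classes=3, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

demo_tree = Pipeline([
    ("feature_selector", SelectKBest(score_func=f_classif, k=8)),
    ("classifier", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
demo_tree.fit(X_demo, y_demo)

# Boolean mask over the input columns: True for the 8 features that were kept
mask = demo_tree.named_steps["feature_selector"].get_support()
selected = feature_names[mask]
print("Selected features:", list(selected))

# Text rendering of the fitted tree, labelled with the selected feature names
rules = export_text(demo_tree.named_steps["classifier"], feature_names=list(selected))
print(rules)
```

In the real pipeline the selector runs after one-hot encoding, so the names passed to `export_text` would have to come from the fitted preprocessor rather than from the raw column list.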


### SVM


In [99]:
# Define feature selection, imputation, and SVM classifier
imputer = SimpleImputer()
feature_selector = SelectKBest(score_func=f_classif)
classifier = SVC()

# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', imputer),
    ('feature_selector', feature_selector),
    ('classifier', classifier)
])

# Define parameter grid
param_grid = {
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selector__k': [5, 8, 10] 
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters and the best mean cross-validated accuracy (averaged over the 5 folds)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Accuracy:", best_score)

# Evaluate on the validation set using the best model
best_model = grid_search.best_estimator_
val_accuracy = best_model.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)

Best Parameters: {'feature_selector__k': 8, 'imputer__strategy': 'mean'}
Best Accuracy: 0.6745746708246708
Validation Accuracy with Best Model: 0.6788685524126455
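
A single accuracy figure does not show which rating bins a model confuses, or whether it beats a trivial majority-class predictor. The sketch below illustrates both checks on a handful of made-up labels and predictions (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the results above). For simplicity the baseline is fitted and scored on the same toy labels; on the real data it would be fitted on the training split and scored on the validation split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical validation labels and model predictions (classes 0-4)
y_true_demo = np.array([0, 1, 2, 2, 2, 2, 2, 3, 3, 4])
y_pred_demo = np.array([2, 2, 2, 2, 2, 2, 3, 3, 2, 4])

model_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Majority-class baseline: always predicts the most frequent label it was fitted on
X_dummy = np.zeros((len(y_true_demo), 1))  # features are ignored by this strategy
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true_demo)
baseline_accuracy = baseline.score(X_dummy, y_true_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2, 3, 4])
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print("Model accuracy:", model_accuracy)
print("Baseline accuracy:", baseline_accuracy)
print(cm)
print("Per-class recall:", per_class_recall)
```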
