# Week One Final Project: Movie Revenue Prediction and Exploration

Where has week one gone? We have one more project for you to put a nice little bow on all of the hard work you've done so far. For this project, be persistent, be curious, and ask questions if you get stuck!

## The Project

You and your teammates will create one prediction model and *AT LEAST* three plots or charts. Everyone will present their model and their charts during the final session of the day.
* Model predictions will be ranked according to their r-squared values and we will crown a winner!
* Your plots should be driven by curiosity. Everyone will present at least one plot.
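
Since models are ranked by r-squared, it may help to recall what the score measures. The sketch below computes it by hand with NumPy on made-up numbers (the `actual` and `predicted` values and the `r_squared` helper are illustrative, not from the movie dataset); scikit-learn's `r2_score` should give the same values.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error left over after predicting
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0, 400.0]      # made-up revenues
predicted = [110.0, 190.0, 310.0, 390.0]   # made-up predictions

score = r_squared(actual, predicted)
baseline = r_squared(actual, [250.0] * 4)  # always predicting the mean
print(score, baseline)
```

A score of 1 is a perfect fit and 0 is no better than always predicting the mean. On validation data the score can also go negative, which means the model is doing worse than that baseline.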

## Helper Functions

We've provided helper functions below. If you need help remembering what they do, refer to the `airbnb_solution.ipynb` example.

In [1]:
# We'll use these packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

pd.set_option('display.max_columns', 100)

# Read in the data!
movie_data = pd.read_csv("movie_dataset_final.csv")


In [None]:
# Helper Function: Feature Engineering
# Use this to turn dictionary columns into useful features
# We use the genre column as an example

column = "genre"  # Change this to a different column if you prefer

# Note: the dataframe loaded above is called movie_data, and we index it with the
# variable `column` (not the literal string 'column').
movie_data[f'{column}_list'] = movie_data[column].apply(lambda x: [dic['name'] for dic in x])

# Let's initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Now we fit and transform the 'genre_list' column and put the result in a new DataFrame.
# Passing index=movie_data.index keeps the rows aligned when we concat below.
binary_matrix = pd.DataFrame(mlb.fit_transform(movie_data[f'{column}_list']), columns=mlb.classes_, index=movie_data.index)

# Now you have a binary matrix, where each genre is a feature.
# To add these features back into your original dataframe, we use the pandas function 'concat'.
# Keep the new column names in a list so you can add them to your features later.
new_feature_names = list(binary_matrix.columns)
movie_data = pd.concat([movie_data, binary_matrix], axis=1)
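
The helper above expects each cell to already be a list of dictionaries. Columns read from a CSV often arrive as strings instead, so here is a minimal sketch on a toy dataframe, assuming cells look like `"[{'id': 1, 'name': 'Action'}]"`. The `toy` frame and the `parse_names` helper are made up for illustration; check what your own column actually contains before relying on this.

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): dictionary columns stored as strings, as a CSV would give us
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": [
        "[{'id': 1, 'name': 'Action'}, {'id': 2, 'name': 'Comedy'}]",
        "[{'id': 2, 'name': 'Comedy'}]",
        "[]",
    ],
})

def parse_names(cell):
    """Turn a string (or list) of dicts into a list of their 'name' values."""
    if isinstance(cell, str):
        cell = ast.literal_eval(cell)  # safely parse the string into Python objects
    return [d["name"] for d in cell]

toy["genre_list"] = toy["genre"].apply(parse_names)

mlb = MultiLabelBinarizer()
binary_matrix = pd.DataFrame(
    mlb.fit_transform(toy["genre_list"]),
    columns=mlb.classes_,
    index=toy.index,  # keep rows aligned for concat
)
toy = pd.concat([toy, binary_matrix], axis=1)
print(toy[["title", "Action", "Comedy"]])
```

Each genre becomes its own 0/1 column, and a movie with no genres simply gets zeros everywhere.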



In [None]:
# Helper Function: Two Bar Chart Plots
groupby_variable = "column_1"
y_value = "column_2"

# Top chart: the average of y_value per group. Bottom chart: how many movies are in each group.
fig, axs = plt.subplots(2, 1, figsize=(12, 6))
movie_data.groupby(groupby_variable)[y_value].mean().plot(kind="bar", ax=axs[0], title=f"Average {y_value}")
movie_data.groupby(groupby_variable)[y_value].count().plot(kind="bar", ax=axs[1], title="Count of Movies")
fig.tight_layout()
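
The two charts show the mean and the count per group. Looking at the same numbers as a table can help you spot groups whose average rests on only a handful of rows. A small sketch on made-up data (the `group` and `value` columns are placeholders, not columns from the movie dataset):

```python
import pandas as pd

# Made-up data standing in for a grouping column and a numeric column
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# One row per group, with the same two statistics the bar charts display
summary = toy.groupby("group")["value"].agg(["mean", "count"])
print(summary)
```

Here group `c` has the highest average but only one row behind it, which is exactly the kind of thing worth mentioning when you present a plot.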

In [None]:
# Helper Function: Scatter Plot

x_value = "column_1"
y_value = "column_2"

movie_data.plot(x=x_value, y=y_value, kind="scatter", alpha=0.2)

In [None]:
# Helper Function: Model Training
# Template: features = ["column_1", "column_2", "column_3", "etc..."]
# Never include the target column in features: the model would just read off the answer.
features = ["popularity"]

target = "revenue"
model_type = "regression"


if model_type == "regression":
    model = RandomForestRegressor()
else:
    model = RandomForestClassifier()

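shuffled_data = movie_data.sample(frac=1)  # Shuffle our data
split_index = int(len(shuffled_data) * 0.8)  # 80% for training, 20% for validation
train_data = shuffled_data[:split_index].copy()  # .copy() avoids SettingWithCopyWarning below
validation_data = shuffled_data[split_index:].copy()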

model.fit(train_data[features], train_data[target])

train_data[f"predicted_{target}"] = model.predict(train_data[features])
validation_data[f"predicted_{target}"] = model.predict(validation_data[features])

# How do we measure our success?
print("Training Data Statistics")
print("mean_absolute_error: ", mean_absolute_error(train_data[target], train_data[f"predicted_{target}"]))
print("mean_squared_error: ", mean_squared_error(train_data[target], train_data[f"predicted_{target}"]))
print("R**2: ", r2_score(train_data[target], train_data[f"predicted_{target}"]))
print("")

print("Validation Data Statistics")
print("mean_absolute_error: ", mean_absolute_error(validation_data[target], validation_data[f"predicted_{target}"]))
print("mean_squared_error: ", mean_squared_error(validation_data[target], validation_data[f"predicted_{target}"]))
print("R**2: ", r2_score(validation_data[target], validation_data[f"predicted_{target}"]))

print("")
# Feature importances: how much each feature contributed to the model's splits
for name, importance in zip(features, model.feature_importances_):
    print(f"{name}: {importance}")
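
The shuffle in the helper above is not seeded, so your scores will move a little from run to run. If you want to compare two feature sets fairly, one option is scikit-learn's `train_test_split` with a fixed `random_state`. This is a sketch on synthetic data: the `x1`, `x2`, and `y` columns are invented for illustration, and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends on x1 only, x2 is pure noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 3.0 * toy["x1"] + rng.normal(scale=0.1, size=200)

features, target = ["x1", "x2"], "y"

# Same 80/20 split every run because random_state is fixed
train, valid = train_test_split(toy, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(train[features], train[target])

train_r2 = r2_score(train[target], model.predict(train[features]))
valid_r2 = r2_score(valid[target], model.predict(valid[features]))
print(f"train R^2: {train_r2:.3f}  validation R^2: {valid_r2:.3f}")
```

The validation score is the one to report, since the training score only tells you how well the model memorized rows it has already seen.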