# **Reel Returns**
#### *Machine Learning Insights into Movie Profitability*

In [None]:
# Importing dependencies
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, classification_report, balanced_accuracy_score, mean_squared_error, r2_score

In [None]:
# Installing gdown (comment out if unneeded)
%pip install gdown --quiet

# Importing gdown
import gdown

# **Data**

## Movie Data

Numerous factors contribute to successes and failures within the film industry, on both critical and financial scales. Taking a simplified view that favors generalized classifications and suits the purposes of our modeling, we chose several contributing factors to highlight and predict each scale of success within our selected dataset. The features ultimately chosen were:

* Vote Average (a given movie's rating from zero (0) to ten (10))
* Vote Count
* Revenue (total earnings in USD for a given movie)
* Runtime
* Budget (total expenses in USD to produce and promote a given movie)
* Title
* Original Title
* Genres
* Production Companies

To create our `critical_success` indicator - a classification as to how well received a title was by fans and critics - we focused on the `vote_average` feature to create guidelines for ranges of scores.

To engineer a `financial_success` indicator - a measure of what level of returns a title produced - we used the percentage calculated as `budget` subtracted from `revenue`, then divided by `budget`, and compared the results to industry standard breakpoints.

---

The following dataset is courtesy of __[Kaggle](https://www.kaggle.com/)__.

**__[TMDB_all_movies.csv](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates?select=TMDB_all_movies.csv)__**

Per the dataset description:

* This dataset was curated from __[The Movie Database](https://www.themoviedb.org/?language=en-US)__, and inspired by __[asaniczka](https://www.kaggle.com/asaniczka)__'s __[dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)__
* While updated daily, the dataset used in this notebook was downloaded on **7/7/2024**

In [None]:
# Declaring `url` and `output` for dataset
url = 'https://drive.google.com/file/d/1Om73B4In4cHj0Rf6aIGOi3-8iXWbDAWg/view?usp=sharing'
output = 'Resources/TMDB_all_movies.csv'

# Downloading dataset
gdown.download(url, output, fuzzy=True, quiet=True)

In [None]:
# Reading in dataset
tmdb_data = pd.read_csv(output)

### Defining functions

The following functions will be used to help streamline the flow of the code.

#### Universal functions

Applicable to all datasets

**Null Percentages**

Calculating the percentage of null values, by feature, in a given dataset

**Records Total**

Confirming the total records for a given dataset

In [None]:
# Defining a function to calculate the percentage of null values in
# each feature of given DF
def null_percentages(df):
    return df.isnull().sum()/len(df)*100

# Defining a function to display the total records for a given DF
def records_total(df):
    print(f'Total records: {df.shape[0]}')
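
As a quick sanity check, the null-percentage helper can be exercised on a small hand-made frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the TMDB data.

```python
import numpy as np
import pandas as pd

def null_percentages(df):
    # Share of missing values per column, expressed as a percentage
    return df.isnull().sum() / len(df) * 100

# Hypothetical four-row frame: one missing title, two missing budgets
toy = pd.DataFrame({
    'title': ['A', 'B', None, 'D'],
    'budget': [100.0, np.nan, np.nan, 50.0],
    'runtime': [90, 95, 100, 105],
})

pct = null_percentages(toy)
print(pct)
```

Each value is the count of nulls in a column divided by the number of rows, so a column with no missing values reports `0.0`.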

#### Situational functions

Applicable to some datasets or situations

**Records Total by Feature Value**

Confirming the total records for a given dataset based on a stated value for a single selected feature

In [None]:
# Defining a function to display the total records for a given DF based on
# a stated value for a single selected feature
def records_total_feat(df, feature, value):
    print(f"Total selected records for '{feature}' of '{value}': {df.loc[df[feature] == value].shape[0]}")

### Initial EDA

In [None]:
# Viewing `tmdb_data`
tmdb_data.head()

In [None]:
# Applying `null_percentages` to `tmdb_data`
null_percentages(tmdb_data)

#### Reducing features and dataset

Dropping unneeded features and reducing the dataset to released movies produced in the United States

In [None]:
# Creating a list of unneeded features
movie_cols_to_drop = [
    'imdb_id', 'overview', 'tagline', 'director_of_photography', 'music_composer'
]

# Dropping unneeded features
tmdb_data.drop(columns=movie_cols_to_drop, inplace=True)

# Reapplying `null_percentages` to `tmdb_data`
null_percentages(tmdb_data)

In [None]:
# Confirming values of `status`
tmdb_data['status'].value_counts()

In [None]:
# Confirming values of `production_countries`
tmdb_data['production_countries'].value_counts()

In [None]:
# Reducing dataset to only `Released` movies produced in `United States of America`
df_movies = tmdb_data.loc[
    (tmdb_data['production_countries'] == 'United States of America') &
    (tmdb_data['status'] == 'Released')
].copy()

# Applying `null_percentages` to `df_movies`
null_percentages(df_movies)
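
Note that the equality test above keeps only records whose `production_countries` value is exactly `'United States of America'`. The sketch below illustrates the consequence on made-up rows. It assumes co-productions are stored as comma-separated strings, which is a guess based on how `genres` is split later, not something confirmed for this dataset.

```python
import pandas as pd

# Hypothetical rows; the comma-separated co-production format is an assumption
toy = pd.DataFrame({
    'title': ['Solo', 'Co-production', 'Foreign'],
    'production_countries': [
        'United States of America',
        'United States of America, Canada',
        'France',
    ],
    'status': ['Released', 'Released', 'Released'],
})

# Exact match, as used in this notebook: co-productions are excluded
exact = toy.loc[toy['production_countries'] == 'United States of America']

# Alternative (not used here): substring match also keeps co-productions
contains = toy.loc[
    toy['production_countries'].str.contains(
        'United States of America', regex=False, na=False
    )
]

print(len(exact), len(contains))
```

The exact match is the stricter choice and keeps the "domestic only" reading of the dataset unambiguous.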

### Engineering

Engineering the two success target values, converting `release_date` to datetime, and reducing the `genres` feature

#### Critical success

With a rating scale of zero (0) to ten (10), ranges can be established to break a given movie's critical success down by the following scale:

* **0 to 2.5**: `panned`
* **2.5 to 5**: `alright`
* **5 to 7.5**: `well liked`
* **7.5 to 10**: `critical success`
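
Because the ranges above share their endpoints, it helps to see where boundary scores land. The following is a minimal sketch on made-up scores, assuming the right-closed intervals that `pd.cut` uses by default, with `include_lowest=True` so that a score of exactly `0` is not dropped.

```python
import pandas as pd

bins = [0, 2.5, 5, 7.5, 10]
labels = ['panned', 'alright', 'well liked', 'critical success']

# Hypothetical scores chosen to sit on and around the bin edges
scores = pd.Series([0.0, 2.5, 2.6, 5.0, 7.5, 7.6, 10.0])

# Intervals are (low, high], so an edge value falls into the lower class
rated = pd.cut(scores, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'vote_average': scores, 'critical_success': rated}))
```

Under these settings a score of exactly `2.5` is rated `panned` and a score of exactly `7.5` is rated `well liked`.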

In [None]:
# Confirming values of `vote_average`
df_movies['vote_average'].describe()

In [None]:
# Creating bins to rate values of `vote_average`
bins = [0, 2.5, 5, 7.5, 10]

# Labelling bins
critical_success = ['panned', 'alright', 'well liked', 'critical success']

# Slicing the data and placing values in `critical_success`
df_movies['critical_success'] = pd.cut(
    df_movies['vote_average'], bins, labels=critical_success, include_lowest=True
)

# Confirming binned correctly
df_movies.head()

#### Financial success

Classifying a given movie's financial success can be accomplished by generating a percentage to represent the return on investment (`roi`) calculated as:

> ((`revenue`-`budget`)/`budget`) * 100

The resulting value can then be compared to industry standard breakpoints to describe the following classifications:

* **Less than 0%**: `failure`
* **Exactly 0%**: `broke even`
* **Between 0% and 50%**: `modest returns`
* **Between 50% and 100%**: `moderate returns`
* **Between 100% and 500%**: `excellent returns`
* **Over 500%**: `extraordinary returns`

*Note: For the purposes of our modeling, only records with a `budget` NOT equal to zero (0) will be retained*
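
To make the breakpoints concrete, here is a small worked sketch on made-up budgets and revenues. The numbers are hypothetical; the binning assumes the default right-closed intervals of `pd.cut`, so an `roi` of exactly `50`, `100`, or `500` falls into the lower of the two neighboring classes.

```python
import pandas as pd

# Hypothetical movies, all with a budget of 100 for easy arithmetic
toy = pd.DataFrame({
    'budget':  [100, 100, 100, 100, 100, 100],
    'revenue': [40, 100, 150, 200, 600, 700],
})

# ROI as a percentage: ((revenue - budget) / budget) * 100
toy['roi'] = (toy['revenue'] - toy['budget']) / toy['budget'] * 100

bins = [-float('inf'), 0, 50, 100, 500, float('inf')]
labels = [
    'failure', 'modest returns', 'moderate returns',
    'excellent returns', 'extraordinary returns'
]
toy['financial_success'] = pd.cut(toy['roi'], bins, labels=labels)

# An ROI of exactly 0 first lands in 'failure', then is relabelled
toy['financial_success'] = toy['financial_success'].cat.add_categories('broke even')
toy.loc[toy['roi'] == 0, 'financial_success'] = 'broke even'

print(toy)
```

The relabelling step is needed because a single-point class such as "exactly 0%" cannot be expressed as one of the `pd.cut` intervals.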

In [None]:
# Confirming total records with a `budget` of `0`
records_total_feat(df_movies, 'budget', 0)

In [None]:
# Selecting only records with a `budget` NOT equal to `0`
df_movies = df_movies[df_movies['budget'] != 0].copy()

# Confirming new total records with a `budget` of `0`
records_total_feat(df_movies, 'budget', 0)

In [None]:
# Calculating `roi` as described above
df_movies['roi'] = (
    (df_movies['revenue'] - df_movies['budget'])/df_movies['budget']
) * 100

# Confirming values of `roi`
df_movies['roi'].describe()

In [None]:
# Confirming total records with a `roi` of `0`
records_total_feat(df_movies, 'roi', 0)

In [None]:
# Creating bins to rate values of `roi`
bins = [-float('inf'), 0, 50, 100, 500, float('inf')]

# Labelling bins
financial_success = [
    'failure', 'modest returns', 'moderate returns',
    'excellent returns', 'extraordinary returns'
]

# Slicing the data and placing values in `financial_success`
df_movies['financial_success'] = pd.cut(
    df_movies['roi'], bins, labels=financial_success, include_lowest=True
)

# Adding classification 'broke even' to `financial_success`
df_movies['financial_success'] = df_movies['financial_success'].cat.add_categories('broke even')

# Classifying where `roi` equals `0` as `broke even`
df_movies.loc[df_movies['roi'] == 0, 'financial_success'] = 'broke even'

# Confirming when `roi` is `0`, `financial_success` is 'broke even'
df_movies.loc[df_movies['roi'] == 0, 'financial_success'].value_counts()

#### Datetime values

Converting `release_date` to a datetime value and extracting the year, month, and day for later merging

In [None]:
# Confirming dtype for `release_date`
df_movies['release_date'].dtype

In [None]:
# Converting 'release_date' to datetime
df_movies['release_date'] = pd.to_datetime(
    df_movies['release_date'], format='%Y-%m-%d', errors='coerce'
)

# Confirming conversion
df_movies['release_date'].dtype
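
The effect of `errors='coerce'` can be seen on a few made-up strings: anything that does not match the stated format becomes `NaT` instead of raising an error. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one impossible date,
# one non-date string, and one missing value
raw = pd.Series(['2024-07-07', '1999-13-40', 'unknown', None])

parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print('NaT count:', parsed.isna().sum())
```

This is why the `NaT` count is checked, and the affected records dropped, before the date parts are extracted.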

In [None]:
# Confirming records with `NaT` (Not a Time) values for `release_date`
print(
    'Total records where `release_date` has a `NaT` value: ' +\
    str(df_movies['release_date'].isna().sum())
)

In [None]:
# Dropping records where 'release_date' is `NaT`
df_movies.dropna(subset=['release_date'], inplace=True)

# Reconfirming records with `NaT` (Not a Time) values for `release_date` 
print(
    'Total records where `release_date` has a `NaT` value: ' +\
    str(df_movies['release_date'].isna().sum())
)

In [None]:
# Extracting year, month, and day from 'release_date'
df_movies['released_year'] = df_movies['release_date'].dt.year
df_movies['released_month'] = df_movies['release_date'].dt.month
df_movies['released_day'] = df_movies['release_date'].dt.day

# Converting `released_year`, `released_month`, and `released_day` to integers
df_movies['released_year'] = df_movies['released_year'].astype(int)
df_movies['released_month'] = df_movies['released_month'].astype(int)
df_movies['released_day'] = df_movies['released_day'].astype(int)

#### Genres

Reducing the `genres` feature to a single value

*Note: In the cases where multiple genres are listed for a given movie, only the first listed genre will be retained*
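
As an illustration of that rule, the sketch below runs the explode-then-deduplicate route used in this section on made-up records and compares it with a direct "take the first element" shortcut. The `toy` frame is hypothetical, and the shortcut is shown only for comparison; it is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical records with comma-separated genres
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'genres': ['Action, Comedy', 'Drama', 'Horror, Thriller, Mystery'],
})

# Route used in this notebook: split, explode, strip, keep first row per id
exploded = toy.assign(genres=toy['genres'].str.split(',')).explode('genres')
exploded['genres'] = exploded['genres'].str.strip()
first_via_explode = exploded.drop_duplicates(subset=['id'], keep='first')

# Shortcut: take the first list element directly
first_direct = toy['genres'].str.split(',').str[0].str.strip()

print(first_via_explode)
print(first_direct.tolist())
```

Both routes keep the first listed genre because `explode` preserves the original order of each list.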

In [None]:
# Confirming values of `genres`
df_movies['genres'].value_counts()

In [None]:
# Splitting strings in `genres` into lists
df_movies['genres'] = df_movies['genres'].str.split(',')

# Separating records with multiple `genres` into individual records
df_movies = df_movies.explode('genres')

# Stripping white spaces from `genres`
df_movies['genres'] = df_movies['genres'].str.strip()

# Reconfirming values of `genres`
df_movies['genres'].value_counts()

In [None]:
# Identifying duplicate records in `df_movies` by `id`
df_movies_duplicate = df_movies[df_movies.duplicated(subset=['id'], keep=False)]

# Confirming total records of `df_movies_duplicate`
records_total(df_movies_duplicate)

In [None]:
# Dropping duplicate records by `id`
df_movies = df_movies.drop_duplicates(subset=['id'], keep='first')

# Identifying remaining duplicate records in `df_movies` by `id`
df_movies_duplicate = df_movies[df_movies.duplicated(subset=['id'], keep=False)]

# Confirming total records of `df_movies_duplicate` and `df_movies`
print('For `df_movies_duplicate`:')
records_total(df_movies_duplicate)
print('\nFor `df_movies`:')
records_total(df_movies)

## Economics Data

While classifying economic states is a complex and nuanced issue, it is not unreasonable to draw broad-strokes generalizations about a given timeframe based on a limited set of factors. To serve the purposes of our modeling, the following three factors were chosen to highlight the economic status at a given movie's release date:

* Consumer Confidence Indicator (CCI)
* Consumer Price Index (CPI)
* Unemployment Rate

These features stand as adequate datapoints to answer three respective questions:

* How likely are people to be spending money?
* How much do things cost when they do spend money?
* How many people have jobs to earn money to spend?

As detailed below, this information came as monthly measures over several decades. To create our `Economic Climate` indicator - a classification as to whether or not the economics of a given time were on the better side for consumers - we will need to calculate a rolling 12-month percent change in the mean of those monthly values in order to show if a given feature was on a positive or negative trend for the provided period.

---

The following datasets are courtesy of __[Kaggle](https://www.kaggle.com/)__.

**__['CCI_OECD.csv'](https://www.kaggle.com/datasets/iqbalsyahakbar/cci-oecd)__**

*renamed from `DP_LIVE_16112023095843236.csv`*

Per the Organisation for Economic Co-operation and Development (OECD):

* The CCI is an indication of developments for future households' consumption and saving based on expected financial situation, sentiment regarding the general economic situation, employment status, and capacity for savings
* An indicator above `100` indicates an optimistic outlook and a greater likelihood to spend money over cautious saving
* An indicator below `100` indicates a pessimistic outlook and both a higher likelihood to save money and a lower tendency to consume

**__['US_inflation_rates.csv'](https://www.kaggle.com/datasets/pavankrishnanarne/us-inflation-dataset-1947-present)__**

Per the dataset description:

* The CPI is a critical economic indicator for measuring the purchasing power of money over time, measuring the average change over time in the prices paid by urban consumers for goods and services
* The CPI is the value at the end of the respective month

---

The following dataset is courtesy of the __[Economic Policy Institute’s (EPI) State of Working America Data Library](https://www.epi.org/data/)__.

**__['Unemployment.csv'](https://www.epi.org/data/#?subject=unemp)__**

Per EPI description:

* Unemployment is the share of the labor force without a job
* Monthly percentages calculated as a rolling 12-month average (mean)

In [None]:
# Reading in datasets
df_unemp = pd.read_csv("./Resources/EPI Data Library - Unemployment.csv")
df_cci = pd.read_csv("./Resources/CCI_OECD.csv")
df_inflation = pd.read_csv("./Resources/US_inflation_rates.csv")

### Defining functions

Since each dataset will need similar preprocessing, the following functions will be used to help streamline the flow of the code.

#### Universal functions

Applicable to all datasets

**EDA routine**

Labelling and displaying pertinent information about a given dataset for the purposes of expedited EDA

**Copying datasets**

Creating a working copy of a given dataset with unneeded features dropped, preserving the original DF

**Renaming needed features**

Renaming selected features for a given dataset

**Rolling mean and mean percent change**

Calculating the rolling 12-month mean and the rolling 12-month percent change for a given feature

In [None]:
# Defining a function to display the `.describe()`, `.shape`, and `.dtypes`
# for a given DF
def eda_routine(df):
    print('Describe:')
    display(df.describe())
    print(f'Shape: {df.shape}\n')
    print('Data types:')
    display(df.dtypes)

# Defining a function to copy a dataset with only the needed features
def copy_df(df, features_to_keep):
    df_copy = df[features_to_keep].copy()
    return df_copy

# Defining a function to rename needed features
def rename_features(df, feature1, feature1new, feature2, feature2new):
    df.rename(columns={
        feature1: feature1new,
        feature2: feature2new
    }, inplace=True)
    return df

# Defining a function to calculate the rolling 12-month means and percent changes
# for a given feature
def rolling_calcs(df, feature, feature_mean, feature_pct_chng):
    df[feature_mean] = df[feature].rolling(window=12).mean()
    df[feature_pct_chng] = df[feature_mean].pct_change(periods=12) * 100
    return df
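
One consequence of stacking a 12-month rolling mean and a 12-period percent change is a run of leading `NaN` values. The sketch below applies the same function to a made-up monthly series (36 months of steady growth, purely illustrative) to show how many rows are affected.

```python
import numpy as np
import pandas as pd

def rolling_calcs(df, feature, feature_mean, feature_pct_chng):
    # 12-month rolling mean, then the percent change of that mean
    # relative to its value 12 months earlier
    df[feature_mean] = df[feature].rolling(window=12).mean()
    df[feature_pct_chng] = df[feature_mean].pct_change(periods=12) * 100
    return df

# Hypothetical series: values 100, 101, ..., 135 on month-start dates
idx = pd.date_range('2000-01-01', periods=36, freq='MS')
toy = pd.DataFrame({'value': np.arange(100.0, 136.0)}, index=idx)
toy = rolling_calcs(toy, 'value', 'value_mean', 'value_pct')

print('NaN in rolling mean:', toy['value_mean'].isna().sum())
print('NaN in percent change:', toy['value_pct'].isna().sum())
print(toy.iloc[22:25])
```

The rolling mean needs 12 observations, and the percent change needs a valid mean from 12 months earlier, so the first usable percent change appears in the 24th month of a series.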

#### Situational functions

Applicable to select datasets

**Datetime indexing**

Converting the feature containing the raw datetime information into a suitable datetime index

*Cannot be used on `Unemployment` dataset*

**Removing '%'**

Removing the `'%'` from a given feature and converting the remaining `object` dtype to `float`

*Specifically for `Unemployment` dataset*

In [None]:
# Defining a function to set a `Date` feature as a datetime index
def datetime_index(df, datetime_feature):
    df[datetime_feature] = pd.to_datetime(df[datetime_feature])
    df.set_index(datetime_feature, inplace=True)
    df.sort_index(inplace=True)
    return df

# Defining a function to remove '%' and convert data to `float`
def convert_percentage(feature):
    return float(feature.strip('%'))

# Defining a function to apply `convert_percentage`
def apply_percentage(df, feature):
    df[feature] = df[feature].apply(convert_percentage)
    return df
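
For reference, the percentage conversion can be tried on a few made-up strings. The sketch also shows a vectorized equivalent using pandas string methods; that variant is offered only as an alternative and is not what this notebook uses.

```python
import pandas as pd

def convert_percentage(feature):
    # '5.8%' -> 5.8
    return float(feature.strip('%'))

# Hypothetical values in the style of a percentage column
toy = pd.DataFrame({'All': ['5.8%', '10.0%', '3.5%']})

# Element-wise route, as in `apply_percentage`
via_apply = toy['All'].apply(convert_percentage)

# Vectorized alternative with the same result
vectorized = toy['All'].str.rstrip('%').astype(float)

print(via_apply.tolist())
print(vectorized.tolist())
```

Either route fails on values that are not strings ending in a number plus `'%'`, so it is worth checking the column for missing or malformed entries first.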

### CCI

This dataset came with international records and unneeded features, so only records for US CCI will be retained. Once those records have been selected, the resulting DF will need to be prepared for concatenation with the remaining economic datasets. To do this, the `TIME` feature will be converted to datetime and set as the index.

#### Initial EDA

In [None]:
# Viewing `df_cci`
df_cci.head()

In [None]:
# Applying `eda_routine` to `df_cci`
eda_routine(df_cci)

#### Reducing dataset

Reducing the dataset to only domestic data

In [None]:
# Confirming values of `LOCATION`
df_cci['LOCATION'].unique()

In [None]:
# Copying domestic data from `df_cci` to `df_cci_us`
df_cci_us = df_cci.loc[df_cci['LOCATION'] == 'USA'].copy()

#### Applying defined functions

In [None]:
# Copying `df_cci_us` and dropping unneeded features
df_cci_form = copy_df(df_cci_us, ['TIME', 'Value'])

# Renaming retained features
df_cci_form = rename_features(
    df_cci_form, 'TIME', 'Date', 'Value', 'CCI Value'
)

# Converting `Date` to a datetime index
df_cci_form = datetime_index(df_cci_form, 'Date')

# Calculating rolling 12-month means and percent change in means
df_cci_form = rolling_calcs(
    df_cci_form, 'CCI Value', 'CCI Rolling Mean', 'CCI Rolling Percent Change'
)

# Confirming `df_cci_form` ready to concatenate
display(df_cci_form.head())
display(df_cci_form.tail())

### Inflation

Seeing as the dataset came with only the needed features, little will be needed to prepare the DF for concatenation with the other economic datasets. `date` will be converted to datetime and set as the index.

#### Initial EDA

In [None]:
# Viewing `df_inflation`
df_inflation.head()

In [None]:
# Applying `eda_routine` to `df_inflation`
eda_routine(df_inflation)

#### Applying defined functions

In [None]:
# Copying `df_inflation` and dropping unneeded features
df_inflation_form = copy_df(df_inflation, ['date', 'value'])

# Renaming retained features
df_inflation_form = rename_features(
    df_inflation_form, 'date', 'Date', 'value', 'CPI Value'
)

# Converting `Date` to a datetime index
df_inflation_form = datetime_index(df_inflation_form, 'Date')

# Calculating rolling 12-month means and percent change in means
df_inflation_form = rolling_calcs(
    df_inflation_form,
    'CPI Value',
    'CPI Rolling Mean',
    'CPI Rolling Percent Change'
)

# Confirming `df_inflation_form` ready to concatenate
display(df_inflation_form.head())
display(df_inflation_form.tail())

### Unemployment

This dataset came with unneeded features that will need to be dropped, and the retained values will need to be converted to `float`. Additionally, the `Date` feature will need to be converted to datetime and set as the index in preparation for concatenation with the other economic datasets.

#### Initial EDA

In [None]:
# Viewing `df_unemp`
df_unemp.head()

In [None]:
# Applying `eda_routine` to `df_unemp`
eda_routine(df_unemp)

#### Applying defined functions (first pass)

Given the nature of the `Date` feature in this dataset, the datetime indexing will need to be handled outside of the defined functions

In [None]:
# Copying `df_unemp` and dropping unneeded features
df_unemp_form = copy_df(df_unemp, ['Date', 'All'])

#### Renaming and indexing

This dataset only needed one feature, `All`, to be renamed; therefore, the `rename_features` defined function is not applicable

Additionally, the `Date` feature will need to be engineered into a workable datetime feature

In [None]:
# Renaming the retained feature
df_unemp_form.rename(columns={'All': 'Unemployment Rate (%)'}, inplace=True)

In [None]:
# Creating a dictionary of Months
month_map = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

# Mapping integer month values to `Date Month`
df_unemp_form['Date Month'] = df_unemp_form['Date'].str.slice(0,3).map(month_map)

# Slicing `Date Year`
df_unemp_form['Date Year'] = df_unemp_form['Date'].str.slice(4,8)

# Converting `Date` to datetime using `Date Month` and `Date Year`
df_unemp_form['Date'] = pd.to_datetime({
    'year': df_unemp_form['Date Year'],
    'month': df_unemp_form['Date Month'],
    'day': 1
})

# Dropping engineered features `Date Month` and `Date Year`
df_unemp_form.drop(columns=['Date Month', 'Date Year'], inplace=True)

# Setting `Date` as index
df_unemp_form.set_index('Date', inplace=True)

# Ensuring index is sorted with ascending dates
df_unemp_form.sort_index(inplace=True)
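
The slicing logic above can be checked on a few made-up strings. The sketch assumes the raw `Date` values look like `'Jan-1979'` (a three-letter month, one separator character, then a four-digit year), which is inferred from the slice positions and not confirmed against the file. Under that assumption, a format string gives the same result in one call.

```python
import pandas as pd

# Hypothetical raw values; the 'Mon-YYYY' layout is an assumption
raw = pd.Series(['Jan-1979', 'Feb-1979', 'Dec-2023'])

month_map = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

# Route used in this notebook: slice the parts, then assemble a date
sliced = pd.to_datetime({
    'year': raw.str.slice(4, 8),
    'month': raw.str.slice(0, 3).map(month_map),
    'day': 1
})

# Alternative under the same assumption: parse with an explicit format
parsed = pd.to_datetime(raw, format='%b-%Y')

print(sliced.tolist())
print(parsed.tolist())
```

The slicing route has the advantage of not caring which separator character sits between the month and the year.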

#### Applying defined functions (second pass)

In [None]:
# Applying `apply_percentage` to `Unemployment Rate (%)`
df_unemp_form = apply_percentage(df_unemp_form, 'Unemployment Rate (%)')

# Calculating rolling 12-month means and percent change in means
df_unemp_form = rolling_calcs(
    df_unemp_form,
    'Unemployment Rate (%)',
    'Unemployment Rate (%) Rolling Mean',
    'Unemployment Rate Rolling Percent Change',
)

# Confirming `df_unemp_form` ready to concatenate
display(df_unemp_form.head())
display(df_unemp_form.tail())

## Combined Economics

With all datasets set to a monthly datetime index, the relevant features can be combined into one DF, and any NaN records can be dropped.

#### Concatenation

In [None]:
# Reconfirming total records and features for datasets
print('CCI:')
records_total(df_cci_form)
print('\nInflation (CPI):')
records_total(df_inflation_form)
print('\nUnemployment:')
records_total(df_unemp_form)

In [None]:
# Concatenating the economic datasets into `df_economics`
df_economics = pd.concat(
    [
        df_cci_form,
        df_inflation_form,
        df_unemp_form
    ], axis=1, join='outer'
)
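
To illustrate what the outer join does when the date ranges differ, here is a minimal sketch with two made-up monthly frames that only partly overlap. The frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames: `a` covers Jan-Mar 2000, `b` covers Feb-Apr 2000
a = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0]},
    index=pd.date_range('2000-01-01', periods=3, freq='MS')
)
b = pd.DataFrame(
    {'B': [10.0, 20.0, 30.0]},
    index=pd.date_range('2000-02-01', periods=3, freq='MS')
)

# Outer join keeps every date and fills the gaps with NaN
combined = pd.concat([a, b], axis=1, join='outer')

# Dropping NaN rows then leaves only the months present in both frames
overlap = combined.dropna()

print(combined)
print(overlap)
```

Concatenating with an outer join and then dropping `NaN` rows therefore ends at the same set of dates as an inner join, while leaving the gaps visible for inspection in between.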

#### `NaN` records

In [None]:
# Confirming total records and features
df_economics.shape

In [None]:
# Checking total `NaN` values per feature
# (`.sum()` counts the `True` flags; `.count()` would return the row count instead)
df_economics.isna().sum()

In [None]:
# Dropping `NaN` records
df_economics.dropna(inplace=True)

# Confirming remaining records
records_total(df_economics)

In [None]:
# Confirming final economic DF
display(df_economics.head())
display(df_economics.tail())

### Engineering

Engineering the economic target values

#### Economic Climate

As stated, the goal is to create an indicator for `Economic Climate` based on broad-strokes observations of our datasets. Having calculated the rolling 12-month percent change for each feature, based on its rolling 12-month mean, we can look for a positive or negative change in values and flag the movement accordingly. From there, we can make the following simple statements:

* For **CCI**, a positive change is "good", as it indicates an increase in the likelihood of consumers to spend money
* For **CPI**, a negative change is "good", as it indicates a decrease in the costs for goods and services
* For **Unemployment Rate**, a negative change is "good", as it indicates an increase in the share of the labor force that is employed

Therefore, we can interpret movement contrary to those changes as "bad". With this simplified view of the features, we can draw a classification as follows:

* If **at least two (2) features** have a "good" value, we can set `Economic Climate` to `Comfortable to Good`
* If **at least two (2) features** have a "bad" value, we can set `Economic Climate` to `Lean to Bad`

In this way, we can gauge whether the economic state at a given release date supports or disproves our hypothesis.
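
The two-out-of-three rule can also be expressed as a simple vote count. The sketch below is an alternative formulation on made-up percent changes, not the implementation used in this notebook; it treats a change of exactly zero as "negative", matching the `<= 0` test applied when the flags are assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical rolling percent changes for four months
toy = pd.DataFrame({
    'CCI Rolling Percent Change': [1.2, -0.5, 0.3, -2.0],
    'CPI Rolling Percent Change': [2.0, 3.1, -0.4, -1.0],
    'Unemployment Rate Rolling Percent Change': [4.0, -1.5, -0.2, -6.0],
})

# One boolean "good" vote per feature
good = pd.DataFrame({
    'cci': toy['CCI Rolling Percent Change'] > 0,       # rising confidence
    'cpi': toy['CPI Rolling Percent Change'] <= 0,      # flat or falling prices
    'unemp': toy['Unemployment Rate Rolling Percent Change'] <= 0,  # flat or falling unemployment
})

# Two or more "good" votes -> 'Comfortable to Good', otherwise 'Lean to Bad'
good_votes = good.sum(axis=1)
toy['Economic Climate'] = np.where(
    good_votes >= 2, 'Comfortable to Good', 'Lean to Bad'
)

print(toy[['Economic Climate']].assign(good_votes=good_votes))
```

With three binary flags there are only eight combinations, so a vote count and an explicit list of all eight conditions give the same classification.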

In [None]:
# Confirming ranges and statistics of `df_economics`
df_economics.describe()

In [None]:
# Creating a list of features
features_to_flag = [
    'CCI Rolling Percent Change',
    'CPI Rolling Percent Change',
    'Unemployment Rate Rolling Percent Change'
]

# Looping through `features_to_flag` to assign `positive` and `negative` indicators
for col in df_economics[features_to_flag].columns:
    new_col = str(col) + ' Flag'
    df_economics.loc[df_economics[col] > 0, new_col] = 'positive'
    df_economics.loc[df_economics[col] <= 0, new_col] = 'negative'

# Creating a list of flagged features
flag_cols = [
    'CCI Rolling Percent Change Flag',
    'CPI Rolling Percent Change Flag',
    'Unemployment Rate Rolling Percent Change Flag'
]

# Confirming indicators applied
df_economics[flag_cols].head()

In [None]:
# Creating a list of conditions and classifications
conditions = [
    ((df_economics[flag_cols[0]] == 'positive') &   # CCI = 'positive'/good
    (df_economics[flag_cols[1]] == 'positive') &    # CPI = 'positive'/bad
    (df_economics[flag_cols[2]] == 'positive'),     # Unemployment = 'positive'/bad
    'Lean to Bad'),
    ((df_economics[flag_cols[0]] == 'positive') &   # CCI = 'positive'/good
    (df_economics[flag_cols[1]] == 'positive') &    # CPI = 'positive'/bad
    (df_economics[flag_cols[2]] == 'negative'),     # Unemployment = 'negative'/good
    'Comfortable to Good'),
    ((df_economics[flag_cols[0]] == 'positive') &   # CCI = 'positive'/good
    (df_economics[flag_cols[1]] == 'negative') &    # CPI = 'negative'/good
    (df_economics[flag_cols[2]] == 'positive'),     # Unemployment = 'positive'/bad
    'Comfortable to Good'),
    ((df_economics[flag_cols[0]] == 'negative') &   # CCI = 'negative'/bad
    (df_economics[flag_cols[1]] == 'positive') &    # CPI = 'positive'/bad
    (df_economics[flag_cols[2]] == 'positive'),     # Unemployment = 'positive'/bad
    'Lean to Bad'),
    ((df_economics[flag_cols[0]] == 'negative') &   # CCI = 'negative'/bad
    (df_economics[flag_cols[1]] == 'negative') &    # CPI = 'negative'/good
    (df_economics[flag_cols[2]] == 'positive'),     # Unemployment = 'positive'/bad
    'Lean to Bad'),
    ((df_economics[flag_cols[0]] == 'negative') &   # CCI = 'negative'/bad
    (df_economics[flag_cols[1]] == 'positive') &    # CPI = 'positive'/bad
    (df_economics[flag_cols[2]] == 'negative'),     # Unemployment = 'negative'/good
    'Lean to Bad'),
    ((df_economics[flag_cols[0]] == 'positive') &   # CCI = 'positive'/good
    (df_economics[flag_cols[1]] == 'negative') &    # CPI = 'negative'/good
    (df_economics[flag_cols[2]] == 'negative'),     # Unemployment = 'negative'/good
    'Comfortable to Good'),
    ((df_economics[flag_cols[0]] == 'negative') &   # CCI = 'negative'/bad
    (df_economics[flag_cols[1]] == 'negative') &    # CPI = 'negative'/good
    (df_economics[flag_cols[2]] == 'negative'),     # Unemployment = 'negative'/good
    'Comfortable to Good')
]

# Declaring `Economic Climate` with a `PLACEHOLDER` value
df_economics['Economic Climate'] = 'PLACEHOLDER'

# Applying conditions and classifications to `Economic Climate`
for condition, classification in conditions:
    df_economics.loc[condition, 'Economic Climate'] = classification

# Confirming classifications applied
df_economics['Economic Climate'].value_counts()

### Visualizations

Generating visualizations for `df_economics`

#### Features and baselines

Declaring some helpful lists and values for plotting

In [None]:
# Creating a list of features
features_to_plot = [
    'CCI Value',
    'CPI Value',
    'Unemployment Rate (%)'
]

# Creating a value of `0` to show positive and negative values
zero_line = pd.DataFrame({
    'Date': df_economics.index,
    'val': [x for x in 0*df_economics[features_to_flag[2]]]
})
zero_line.set_index('Date', inplace=True)

# Creating a value of `100` to show break point for CCI
hundred_line = pd.DataFrame({
    'Date': df_economics.index,
    'val': [x for x in (0*df_economics[features_to_flag[0]])+100]
})
hundred_line.set_index('Date', inplace=True)

#### CCI

In [None]:
# Visualizing trends for `CCI Value`
plt.plot(hundred_line, color='black', linestyle='--')
plt.plot(df_economics[features_to_plot[0]], label='CCI', color='blue')
plt.title('Values above 100 (visualized)\n indicate consumers more likely to spend vs save')
plt.legend()
plt.show()

In [None]:
# Visualizing trends for `CCI Rolling Percent Change`
plt.plot(zero_line, color='black', linestyle='--')
plt.plot(df_economics[features_to_flag[0]], label='% Changes in CCI', color='blue')
plt.legend()
plt.show()

#### CPI

In [None]:
# Visualizing trends for `CPI Value`
plt.plot(df_economics[features_to_plot[1]], label='CPI', color='red')
plt.legend()
plt.show()

In [None]:
# Visualizing trends for `CPI Rolling Percent Change`
plt.plot(zero_line, color='black', linestyle='--')
plt.plot(df_economics[features_to_flag[1]], label='% Changes in CPI', color='red')
plt.legend()
plt.show()

#### Unemployment

In [None]:
# Visualizing trends for `Unemployment Rate (%)`
plt.plot(df_economics[features_to_plot[2]], label='Unemployment Rate (%)', color='red')
plt.legend()
plt.show()

In [None]:
# Visualizing trends for `Unemployment Rate Rolling Percent Change`
plt.plot(zero_line, color='black', linestyle='--')
plt.plot(df_economics[features_to_flag[2]], label='% Changes in Unemployment Rate', color='red')
plt.legend()
plt.show()

#### Economic Climate

In [None]:
# Visualizing total years classified in `Economic Climate`
plt.barh(
    y=df_economics['Economic Climate'].value_counts().index,
    width=df_economics['Economic Climate'].value_counts()/12,
    color=['darkblue', 'darkgreen'],
    label=['26.42', '16.25']
)
plt.title(
    'Years from 1981 to 2023 Classified as',
    loc='left',
    pad=15
)
plt.legend()
plt.show()

### Indexing

In [None]:
# Resetting the index to recreate `Date` for later merging
df_economics.reset_index(inplace=True)

## Combined Data

A combined dataset will need to be prepared for modeling

### Merging

With both `df_movies` and `df_economics` prepared, the two datasets can be merged into one final working DF

#### Indexing

The `df_movies` dataset will need to be set to a `Date` index, and the year and month will need to be extracted from the `Date` of the `df_economics` dataset

In [None]:
# Creating a 'Date' for a datetime index
df_movies['Date'] = pd.to_datetime({
    'year': df_movies['released_year'],
    'month': df_movies['released_month'],
    'day': df_movies['released_day']
})

# Setting `Date` as index
df_movies.set_index('Date', inplace=True)

# Ensuring index is sorted with ascending dates
df_movies.sort_index(inplace=True)

In [None]:
# Creating a `Year` and `Month` to serve as merge keys
df_economics['Year'] = df_economics['Date'].dt.strftime('%Y').astype(int)
df_economics['Month'] = df_economics['Date'].dt.strftime('%m').astype(int)

# Renaming to `Year` and `Month` to match the merge keys
df_movies.rename(columns={
    'released_year': 'Year',
    'released_month': 'Month'
}, inplace=True)

#### Merging

Generating the final record counts before and after merging the two datasets

In [None]:
# Confirming total records before merging
records_total(df_economics)
records_total(df_movies)

In [None]:
# Combining datasets by merging on `Year` and `Month`
df_combined = pd.merge(df_economics, df_movies, how='left', on=['Year', 'Month'])

# Confirming total records after merging
records_total(df_combined)
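
The record count after a left merge on `Year` and `Month` can be larger or smaller than either input suggests. The sketch below uses two made-up frames to show both effects: a month with several movies repeats its economic row, and a month with no movies is kept with `NaN` movie fields.

```python
import pandas as pd

# Hypothetical economic records: one row per month
econ = pd.DataFrame({
    'Year': [2000, 2000, 2000],
    'Month': [1, 2, 3],
    'Economic Climate': ['Lean to Bad', 'Comfortable to Good', 'Lean to Bad'],
})

# Hypothetical movies: two released in January, none in February
movies = pd.DataFrame({
    'Year': [2000, 2000, 2000],
    'Month': [1, 1, 3],
    'title': ['A', 'B', 'C'],
})

merged = pd.merge(econ, movies, how='left', on=['Year', 'Month'])

print(merged)
print('Rows without a movie:', merged['title'].isna().sum())
```

Movies released outside the months covered by the economic data do not appear in the result at all, since the left frame determines which months are kept.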

### EDA

Continuing EDA on the compiled DF

#### Target value

Concatenating the two engineered target values from the `df_movies` dataset with the engineered target from the `df_economics` dataset

In [None]:
# Creating the eventual `Target` for modeling
df_combined['Target'] = df_combined['critical_success'].astype(str) + ' ' +\
                        df_combined['financial_success'].astype(str) + ' ' +\
                        df_combined['Economic Climate'].astype(str)


#### Reducing features and dataset

Dropping unneeded features and removing `NaN` records

In [None]:
# Creating a list of features to drop
cols_to_drop = [
    'Date',
    'CCI Rolling Mean',
    'CPI Rolling Mean',
    'Unemployment Rate (%) Rolling Mean',
    'Year',
    'Month',
    'id',
    'cast',
    'original_language',
    'director',
    'writers',
    'producers',
    'popularity', 
    'critical_success',
    'financial_success',
    'release_date',
    'released_day',
    'production_countries',
    'status',
    'spoken_languages'
]

# Dropping unneeded features
df_combined.drop(columns=cols_to_drop, inplace=True)

In [None]:
# Dropping `NaN` records
df_combined.dropna(inplace=True)

In [None]:
# Confirming total records after dropping `NaN` records
print(f'Total records: {df_combined.shape[0]}')

#### Visualizations

Generating visualizations for `df_combined`

In [None]:
# Aggregating `roi` on `mean()` by `genres`
agg_roi_mean_genre_most = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] != 'TV Movie') &
        (df_combined['genres'] != 'Comedy') &
        (df_combined['genres'] != 'Action')
    ].groupby('genres')['roi'].mean()
).reset_index()

# Plotting
ax = agg_roi_mean_genre_most.plot(
    kind='bar', x='genres', y='roi',
    figsize=(10, 6), legend=False, color='darkred'
)

# Adding title and labels
plt.title("Mean ROI by Genre\n (Less 'TV Movie', 'Comedy', and 'Action')")
plt.xlabel('Genres')
plt.ylabel('Mean ROI (%)')
plt.xticks(rotation=45, ha='right')

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()
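
The genre filters in these cells chain several `!=` or `==` comparisons. As an alternative sketch, `Series.isin` expresses the same selection with a single list. The frame below is toy data with made-up `roi` values, and `excluded` is a name introduced only for this example.

```python
import pandas as pd

# Toy data with hypothetical ROI values
demo = pd.DataFrame({
    'genres': ['TV Movie', 'Comedy', 'Action', 'Drama', 'Drama', 'Horror'],
    'roi': [900.0, 400.0, 350.0, 50.0, 150.0, 80.0],
})

excluded = ['TV Movie', 'Comedy', 'Action']

# Chained comparisons, as written in the notebook
chained = demo.loc[
    (demo['genres'] != 'TV Movie') &
    (demo['genres'] != 'Comedy') &
    (demo['genres'] != 'Action')
]

# Equivalent filter using `isin`
with_isin = demo.loc[~demo['genres'].isin(excluded)]

# `as_index=False` returns a flat DataFrame without a separate `reset_index()`
agg = with_isin.groupby('genres', as_index=False)['roi'].mean()
print(agg)
```

Dropping the `~` gives the complementary selection, which matches the "only" variants of these plots.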

In [None]:
# Aggregating `roi` on `mean()` by `genres`
agg_roi_mean_genre_top_3 = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] == 'Comedy') |
        (df_combined['genres'] == 'Action')
    ].groupby('genres')['roi'].mean()
).reset_index()

# Plotting
ax = agg_roi_mean_genre_top_3.plot(
    kind='bar', x='genres', y='roi',
    figsize=(10, 6), legend=False, color='darkred'
)

# Adding title and labels
plt.title("Mean ROI by Genre\n ('Comedy' and 'Action' Only)")
plt.xlabel('Genres')
plt.ylabel('Mean ROI (%)')
plt.xticks(rotation=45, ha='right')

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()

In [None]:
# Aggregating `roi` on `mean()` by `genres`
agg_roi_mean_genre_top_1 = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] == 'TV Movie')
    ].groupby('genres')['roi'].mean()
).reset_index()

# Plotting
ax = agg_roi_mean_genre_top_1.plot(
    kind='bar', x='genres', y='roi',
    figsize=(10, 6), legend=False, color='darkred'
)

# Adding title and labels
plt.title("Mean ROI by Genre\n ('TV Movie' Only)")
plt.xlabel('Genres')
plt.ylabel('Mean ROI (%)')
plt.xticks(rotation=45, ha='right')

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()

In [None]:
# Aggregating `revenue` on `sum()` by `genres`
agg_rev_sum_genre = pd.DataFrame(
    df_combined.groupby('genres')['revenue'].sum()
).reset_index()

# Plotting
ax = agg_rev_sum_genre.plot(
    kind='bar', x='genres', y='revenue',
    figsize=(10, 6), legend=False, color='green'
)

# Adding title and labels
plt.title('Total Revenue by Genre')
plt.xlabel('Genres')
plt.ylabel('Total Revenue (USD)')
plt.xticks(rotation=45, ha='right')

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()

In [None]:
# Aggregating `roi` on `mean()` by `Economic Climate`
agg_roi_mean_economy = pd.DataFrame(
    df_combined.groupby('Economic Climate')['roi'].mean()
).reset_index()

# Plotting
ax = agg_roi_mean_economy.plot(
    kind='bar', x='Economic Climate', y='roi',
    figsize=(10, 6), legend=False, color=['darkblue', 'darkred']
)

# Adding title and labels
plt.title('Mean ROI by Economic Climate')
plt.xlabel('Economic Climate')
plt.ylabel('ROI (%)')
plt.xticks(rotation=0)

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()

In [None]:
# Aggregating `revenue` on `sum()` by `Economic Climate`
agg_rev_sum_economy = pd.DataFrame(
    df_combined.groupby('Economic Climate')['revenue'].sum()
).reset_index()

# Plotting
ax = agg_rev_sum_economy.plot(
    kind='bar', x='Economic Climate', y='revenue',
    figsize=(10, 6), legend=False, color=['darkblue', 'darkred']
)

# Adding title and labels
plt.title('Total Revenue by Economic Climate')
plt.xlabel('Economic Climate')
plt.ylabel('Total Revenue (USD)')
plt.xticks(rotation=0)

# Format the y-axis to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

plt.tight_layout()

# Show plot
plt.show()

In [None]:
# Aggregating `roi` on `mean()` by `Economic Climate` and `genres`
agg_roi_mean_most = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] != 'TV Movie') &
        (df_combined['genres'] != 'Comedy') &
        (df_combined['genres'] != 'Action')
    ].groupby(['Economic Climate', 'genres'])['roi'].mean()
).reset_index()

# Pivoting the table for plotting
pivot_table = agg_roi_mean_most.pivot(index='genres', columns='Economic Climate', values='roi')

# Plotting
ax = pivot_table.plot(kind='bar', figsize=(14, 8), color=['blue', 'red'])

# Adding title and labels
plt.title("Mean ROI by Genre and Economic Climate\n (Less 'TV Movie', 'Comedy', and 'Action')")
plt.xlabel('Genres')
plt.ylabel('ROI (%)')
plt.legend(title='Economic Climate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Formatting to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Displaying plot
plt.show()

In [None]:
# Aggregating `roi` on `mean()` by `Economic Climate` and `genres`
agg_roi_mean_both_top_3 = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] == 'Comedy') |
        (df_combined['genres'] == 'Action')
    ].groupby(['Economic Climate', 'genres'])['roi'].mean()
).reset_index()

# Pivoting the table for plotting
pivot_table = agg_roi_mean_both_top_3.pivot(index='genres', columns='Economic Climate', values='roi')

# Plotting
ax = pivot_table.plot(kind='bar', figsize=(14, 8), color=['blue', 'red'])

# Adding title and labels
plt.title("Mean ROI by Genre and Economic Climate\n ('Comedy' and 'Action' Only)")
plt.xlabel('Genres')
plt.ylabel('ROI (%)')
plt.legend(title='Economic Climate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Formatting to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Displaying plot
plt.show()

In [None]:
# Aggregating `roi` on `mean()` by `Economic Climate` and `genres`
agg_roi_mean_both_top_1_good = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] == 'TV Movie') &
        (df_combined['Economic Climate'] == 'Comfortable to Good')
    ].groupby(['Economic Climate', 'genres'])['roi'].mean()
).reset_index()

# Aggregating `roi` on `mean()` by `Economic Climate` and `genres`
agg_roi_mean_both_top_1_bad = pd.DataFrame(
    df_combined.loc[
        (df_combined['genres'] == 'TV Movie') &
        (df_combined['Economic Climate'] == 'Lean to Bad')
    ].groupby(['Economic Climate', 'genres'])['roi'].mean()
).reset_index()

# Pivoting the tables for plotting
pivot_table_good = agg_roi_mean_both_top_1_good.pivot(index='genres', columns='Economic Climate', values='roi')
pivot_table_bad = agg_roi_mean_both_top_1_bad.pivot(index='genres', columns='Economic Climate', values='roi')
                                                
# Creating subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Plotting for 'Lean to Bad'
ax1 = pivot_table_bad.plot(kind='bar', ax=axes[0], color='red')
ax1.set_title("Mean ROI by Genre in 'Lean to Bad' Economic Climate\n ('TV Movie' Only)")
ax1.set_xlabel('Genres')
ax1.set_ylabel('ROI (%)')
ax1.legend(title='Economic Climate')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Plotting for 'Comfortable to Good'
ax2 = pivot_table_good.plot(kind='bar', ax=axes[1], color='blue')
ax2.set_title("Mean ROI by Genre in 'Comfortable to Good' Economic Climate\n ('TV Movie' Only)")
ax2.set_xlabel('Genres')
ax2.set_ylabel('ROI (%)')
ax2.legend(title='Economic Climate')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Adjusting layout
plt.tight_layout()

# Displaying plot
plt.show()

In [None]:
# Aggregating `revenue` on `sum()` by `Economic Climate` and `genres`
agg_rev_sum_both = pd.DataFrame(
    df_combined.groupby(['Economic Climate', 'genres'])['revenue'].sum()
).reset_index()

# Pivoting the table for plotting
pivot_table = agg_rev_sum_both.pivot(index='genres', columns='Economic Climate', values='revenue')

# Plotting
ax = pivot_table.plot(kind='bar', figsize=(14, 8), color=['blue', 'red'])

# Adding title and labels
plt.title('Total Revenue by Genre and Economic Climate')
plt.xlabel('Genres')
plt.ylabel('Revenue (USD)')
plt.legend(title='Economic Climate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Formatting to avoid scientific notation
ax.get_yaxis().set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Displaying plot
plt.show()

#### Reducing features

Dropping the final unneeded feature before proceeding

In [None]:
# Dropping unneeded `Economic Climate`
df_combined.drop(columns=['Economic Climate'], inplace=True)

# **Train Test Splitting**

In [None]:
# Setup X and y variables
X = df_combined.drop(columns='Target')
y = df_combined['Target']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)
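
With a multi-class `Target`, a plain random split can leave some classes under-represented in the test set. One option, not used in the notebook, is the `stratify` argument of `train_test_split`. The sketch below uses toy data with two made-up classes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 16 records of class 'a' and 4 of class 'b' (hypothetical)
X_demo = pd.DataFrame({'feature': range(20)})
y_demo = pd.Series(['a'] * 16 + ['b'] * 4, name='Target')

# `stratify=y_demo` keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=13, stratify=y_demo
)

test_counts = y_te.value_counts()
print(test_counts)
```

Note that stratifying raises a `ValueError` when a class has only one record, so very rare `Target` combinations would need to be grouped or removed first.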

# **Scaling and Encoding**

In [None]:
# Defining columns to scale and encode
col_to_scale = [
    'CCI Value', 'CCI Rolling Percent Change', 'CPI Value',
    'CPI Rolling Percent Change', 'Unemployment Rate (%)', 
    'Unemployment Rate Rolling Percent Change','vote_average', 'vote_count',
    'revenue','runtime','budget', 'roi'
]

col_to_encode = [
    'CCI Rolling Percent Change Flag', 'CPI Rolling Percent Change Flag',
    'Unemployment Rate Rolling Percent Change Flag', 'title', 'original_title',
    'genres', 'production_companies'
]

In [None]:
# Creating an instance of `StandardScaler()`
scaler = StandardScaler()

# Fitting and transforming to `col_to_scale`
X_train_scaled = scaler.fit_transform(X_train[col_to_scale])
X_test_scaled = scaler.transform(X_test[col_to_scale])

# Converting results to DF for later concatenation
X_train_scaled = pd.DataFrame(X_train_scaled, columns=col_to_scale)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=col_to_scale)

In [None]:
# Creating an instance for `OneHotEncoder()` for `X_train[col_to_encode]`
encoder_x = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fitting to `col_to_encode`
encoder_x.fit(X_train[col_to_encode])

# Creating an instance of `LabelEncoder()` for `y_train`
encoder_y = LabelEncoder()

# Fitting
encoder_y.fit(y_train.values.ravel())

# Transforming `X_train[col_to_encode]` and `X_test[col_to_encode]`
X_train_encoded = encoder_x.transform(X_train[col_to_encode])
X_test_encoded = encoder_x.transform(X_test[col_to_encode])

# Transforming `y_train` and `y_test`
y_train_encoded = encoder_y.transform(y_train.values.ravel())
y_test_encoded = encoder_y.transform(y_test.values.ravel())

# Converting results to DF for later concatenation
X_train_encoded = pd.DataFrame(X_train_encoded, columns=encoder_x.get_feature_names_out())
X_test_encoded = pd.DataFrame(X_test_encoded, columns=encoder_x.get_feature_names_out())
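
One thing to watch with the `LabelEncoder` step above: `transform` raises a `ValueError` if the test labels include a class that never appeared in the training labels. The sketch below reproduces this on toy labels and shows one possible workaround, fitting the encoder on all labels. The label values and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels; 'mid' appears only in the test split
y_train_demo = np.array(['high', 'low', 'low'])
y_test_demo = np.array(['low', 'mid'])

# Fitting on training labels only, as in the cell above
encoder = LabelEncoder().fit(y_train_demo)
try:
    encoder.transform(y_test_demo)
    unseen_error = None
except ValueError as err:
    unseen_error = str(err)
print(unseen_error)

# Possible workaround: fit on every label so all classes are known
encoder_all = LabelEncoder().fit(np.concatenate([y_train_demo, y_test_demo]))
encoded_test = encoder_all.transform(y_test_demo)
print(encoded_test)
```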

In [None]:
# Concatenating the `col_to_scale` with `col_to_encode` for `X_train` and `X_test`
X_train = pd.concat([X_train_scaled, X_train_encoded], axis=1)
X_test = pd.concat([X_test_scaled, X_test_encoded], axis=1)

In [None]:
# Confirming total records after concatenation
print(f'Total X records: {X_train.shape[0] + X_test.shape[0]}')

# Modeling

Playtime!!

# Eric's Space

# Funda's Space

# Kalvin's Space

In [None]:
# Create an untuned KNN classifier
untuned_model = KNeighborsClassifier()

# Train the untuned model
untuned_model.fit(X_train, y_train_encoded)

# Check the model's accuracy on the test set
untuned_y_test_pred = untuned_model.predict(X_test)
print(accuracy_score(y_test_encoded, untuned_y_test_pred))
print(precision_score(y_test_encoded, untuned_y_test_pred, average='weighted', zero_division=0))

In [None]:
# Instantiate PCA, keeping the smallest number of components that explains 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
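
Passing a float between 0 and 1 as `n_components` asks `PCA` to keep just enough components for the cumulative explained variance to exceed that fraction. The sketch below checks this on synthetic data; the array and its per-feature scales are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features with decreasing variance
rng = np.random.default_rng(0)
scales = np.array([10, 5, 3, 1, 1, 0.5, 0.5, 0.1, 0.1, 0.1])
X_demo = rng.normal(size=(200, 10)) * scales

pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f'Components kept: {pca_demo.n_components_}')
print(f'Variance explained: {cumulative[-1]:.3f}')
```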

In [None]:
# Create an untuned KNN classifier
untuned_model = KNeighborsClassifier()

# Train the untuned model
untuned_model.fit(X_train_pca, y_train_encoded)

# Check the model's accuracy on the test set
untuned_y_test_pred = untuned_model.predict(X_test_pca)
print(accuracy_score(y_test_encoded, untuned_y_test_pred))
print(precision_score(y_test_encoded, untuned_y_test_pred, average='weighted', zero_division=0))

In [None]:
# Create a KNN classifier and loop through different k values to find which has the highest accuracy.
# Note: Only odd k values are used to reduce the chance of tied votes (with more than two classes, ties remain possible).
train_scores = []
test_scores = []
for k in range(1, 40, 2):
    loop_model = KNeighborsClassifier(n_neighbors=k)
    loop_model.fit(X_train, y_train_encoded)
    train_score = loop_model.score(X_train, y_train_encoded)
    test_score = loop_model.score(X_test, y_test_encoded)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f"k: {k}, Train/Test Score: {train_score:.3f}/{test_score:.3f}")
    
# Plot the results
plt.plot(range(1, 40, 2), train_scores, marker='o', label="training scores")
plt.plot(range(1, 40, 2), test_scores, marker="x", label="testing scores")
plt.xlabel("k neighbors")
plt.ylabel("accuracy score")
plt.legend()
plt.show()

In [None]:
# Choose the best k, and refit the KNN classifier by using that k value.
# Note that k=23 provides the best accuracy where the classifier is not overfitting.
loop_model = KNeighborsClassifier(n_neighbors=23)
loop_model.fit(X_train, y_train_encoded)

# Check the model's accuracy on the test set
loop_y_test_pred = loop_model.predict(X_test)
print(accuracy_score(y_test_encoded, loop_y_test_pred))
print(precision_score(y_test_encoded, loop_y_test_pred, average='weighted', zero_division=0))

In [None]:
'''
The grid search below, used to tune the KNN classifier's hyperparameters, returned
a best k value of 1 with an accuracy score of 0.578.
The code has been commented out because it took 24 minutes to run.
'''
# # Create a grid search KNN classifier
# grid_model = KNeighborsClassifier()

# # Define the parameter grid tuned KNN classifier
# param_grid = {'n_neighbors': list(range(1, 25, 2)),
#             'weights': ['uniform', 'distance'],
#             'leaf_size': [10, 50, 100, 500]
# }

# # Create a GridSearchCV model
# grid = GridSearchCV(grid_model, param_grid, verbose=3)

# # Fit the model by using the grid search estimator.
# # This will take the KNN model and try each combination of parameters.
# grid.fit(X_train, y_train_encoded)

# # Best parameter and score
# print(f"Best k: {grid.best_params_['n_neighbors']}")
# print(f"Best cross-validated accuracy: {grid.best_score_}")
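
To give a rough sense of why the grid search is slow, the number of model fits can be counted before running it. The sketch below copies the `param_grid` from the commented-out cell and assumes the default 5-fold cross-validation, which applies when `cv` is not passed to `GridSearchCV` (scikit-learn 0.22 and later).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the commented-out cell above
param_grid = {
    'n_neighbors': list(range(1, 25, 2)),
    'weights': ['uniform', 'distance'],
    'leaf_size': [10, 50, 100, 500],
}

# Every combination of the three parameter lists
n_candidates = len(ParameterGrid(param_grid))

# Assumption: default 5-fold cross-validation (no `cv` argument given)
n_folds = 5
n_fits = n_candidates * n_folds

print(f'Candidates: {n_candidates}, total fits: {n_fits}')
```

Since `leaf_size` affects the speed and memory of the neighbor search rather than the predictions, removing it from the grid would be one way to cut the number of fits by a factor of four.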

# Odele's Space

#### Peta-Gaye's LR modelling

In [None]:
# Declare a logistic regression model.
logistic_regression_model = LogisticRegression(max_iter=500, solver='lbfgs')

In [None]:
# Fit and save the logistic regression model using the training data
df_combined_lr_model = logistic_regression_model.fit(X_train, y_train_encoded)

In [None]:
# Generate predictions from the logistic regression model using the test data
lr_predictions = logistic_regression_model.predict(X_test)

# Review the predictions
lr_predictions

In [None]:
# Display the accuracy score for the test dataset.
accuracy_score(y_test_encoded, lr_predictions)

In [None]:
# Display the precision score for the test dataset.
precision_score(y_test_encoded, lr_predictions, average='weighted', zero_division=1)
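
The precision cells in this notebook do not all use the same `zero_division` value (`1` for logistic regression and random forest, `0` for KNN and AdaBoost). When a class is never predicted, this setting changes the weighted score, so the models' precision values may not be directly comparable. A small sketch on toy labels shows the effect:

```python
from sklearn.metrics import precision_score

# Toy labels: class 2 exists in the truth but is never predicted
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1]

# Precision for class 2 is undefined, so `zero_division` fills it in
p_zero = precision_score(y_true, y_pred, average='weighted', zero_division=0)
p_one = precision_score(y_true, y_pred, average='weighted', zero_division=1)

print(f'zero_division=0: {p_zero:.3f}')
print(f'zero_division=1: {p_one:.3f}')
```

Using one value across all models (for example `zero_division=0`) would keep the comparison consistent.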

#### End Peta-Gaye's LR modelling

#### Eric's AdaBoost modelling

In [None]:
# Declaring an `AdaBoostClassifier` model
ada_model = AdaBoostClassifier(algorithm='SAMME', random_state=1)

In [None]:
# Fitting the model
ada_model.fit(X_train, y_train_encoded)

In [None]:
# Displaying model scores
print(f'Training score: {ada_model.score(X_train, y_train_encoded)}')
print(f'Testing score: {ada_model.score(X_test, y_test_encoded)}')

In [None]:
# Predicting with the model
ada_pred = ada_model.predict(X_test)

In [None]:
# Displaying the accuracy score for the test dataset
accuracy_score(y_test_encoded, ada_pred)

In [None]:
# Display the precision score for the test dataset
precision_score(y_test_encoded, ada_pred, average='weighted', zero_division=0)

#### End Eric's AdaBoost modelling

#### Funda's Linear Regression Modeling

In [None]:
# Select the "roi" column from X_train and X_test
X_train_roi = X_train[["roi"]]
X_test_roi = X_test[["roi"]]

# Initialize and fit the model with Y as the independent variable and ROI as the dependent variable
model = LinearRegression()
model.fit(y_train_encoded.reshape(-1, 1), X_train_roi)

# Predict ROI on the test set using Y
predicted_roi = model.predict(y_test_encoded.reshape(-1, 1))

# Calculate and print the metrics
print("Mean Squared Error:", mean_squared_error(X_test_roi, predicted_roi))
print("R2 Score:", r2_score(X_test_roi, predicted_roi))

# Plotting the results with flipped axes
plt.figure(figsize=(10, 6))
plt.scatter(y_test_encoded, X_test_roi, color='blue', label='Actual Data')
plt.plot(y_test_encoded, predicted_roi, color='red', linewidth=2, label='Regression Line')
plt.ylabel('ROI')
plt.xlabel('Encoded Target (y)')
plt.title('Linear Regression of ROI on Encoded Target')
plt.legend()
plt.show()

#### End Funda's Linear Regression Modeling

# Peta's Space

# Vadim's Space

In [None]:
# Declare a Random Forest Classifier model
random_forest_model = RandomForestClassifier(random_state=1)

In [None]:
# Fit and save the Random Forest Classifier model using the training data
random_forest_model.fit(X_train, y_train_encoded)

# Generate predictions from the model using the test data
RFM_pred = random_forest_model.predict(X_test)

# Review the predictions
RFM_pred

In [None]:
# Displaying model scores
print(f'Training score: {random_forest_model.score(X_train, y_train_encoded)}')
print(f'Testing score: {random_forest_model.score(X_test, y_test_encoded)}')

In [None]:
# Display the accuracy score for the test dataset
accuracy_score(y_test_encoded, RFM_pred)

In [None]:
# Display the precision score for the test dataset.
precision_score(y_test_encoded, RFM_pred, average='weighted', zero_division=1)

# Findings

# Additional

# Citations and Licenses

## Citations

### **Unemployment.csv**

Economic Policy Institute, *State of Working America Data Library*, “Unemployment”, 2024

## Licenses

### **TMBD_all_movies.csv**

Copyright 2024 __[Alan Vourc'h](https://www.kaggle.com/alanvourch)__

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at

> __http://www.apache.org/licenses/LICENSE-2.0__

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

### **CCI_OECD.csv** and **US_inflation_rates.csv**

CC0: Public Domain