<a href="https://colab.research.google.com/github/brook-miller/mbai-417-data/blob/main/operationalizing-data/in-class/content_recommender-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating content recommendations for Kellogg+

> “He called me out of the blue and said, ‘Do you want to do Moneyball for Hollywood?’” Marolda recalls, referring to the book and movie about how the Oakland A’s used statistical analysis to improve their team’s fortunes. “I thought for 30 seconds and said, ‘Yeah.’” Legendary hired Marolda and acquired part of StratBridge’s business.
>
> In the sports world, says Marolda, using analytics “never created a great athlete. But they can help a manager put him in the best position to succeed. It’s the same with entertainment content. You want to put it in the best position to succeed. We’re just trying to increase our odds and be smarter.”  
>
> [Making Movies the Moneyball Way, Boston Globe, 2016](https://www.bostonglobe.com/business/technology/2016/03/31/making-movies-moneyball-way/Uzgwh2cdGthA1N3nZHqz0N/story.html)

For the new streaming service Kellogg+, surveys were conducted to identify the favorite movies, as well as key demographic information, of a sample of Kellogg+ customers.

Kellogg+ wants to attract more diverse and younger audiences to its platform and would like a profile of its audience's favorite movies to guide decisions about what types of content to produce for the streaming service.

In addition to the favorite movies in [favorite-movies.csv](https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/data-governance/homework/favorite-movies.csv), the team procured data from [IMDB](https://www.imdb.com/) with key statistics and genre information for each movie in [imdb-movie-data.csv](https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/data-governance/homework/imdb-movie-data.csv).

For this assignment:
1. Remove PII and create a unique id for each of the 3,198 responses.
2. Organize the data and remove movie titles that can't be matched to the IMDB data.
3. Determine the most popular movies for the 24 and under age group.
4. Find out the top genre preferences of the 24 and under age group.
5. Determine if favorite movies vary by ethnicity (all age groups).

*Bonus: Some execs believe that content is too long.  Can we determine if the 24 and under age group prefers shorter movies?
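
For step 1, the notebook below hashes social security numbers with murmur3, which is a fast, non-cryptographic hash. As a hedged alternative sketch (not the method used in this notebook), a keyed hash from the Python standard library makes ids harder to reverse by brute force over the small space of possible SSNs. The function name `pseudonymous_id`, the `SECRET_KEY` value, and the sample SSNs are all hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption): in practice it would be generated
# randomly and stored outside the notebook.
SECRET_KEY = b"replace-with-a-private-random-key"

def pseudonymous_id(ssn: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable 16-hex-character id derived from an SSN via HMAC-SHA256."""
    # Normalize formatting so "123-45-6789" and "123456789" map to the same id
    normalized = ssn.replace("-", "").strip()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Made-up sample values, not taken from the survey data
sample = ["123-45-6789", "987-65-4321", "123456789"]
ids = [pseudonymous_id(s) for s in sample]
print(ids)
```

Assuming the `social_security_num` column holds strings, the same function could be applied with `favoritesdf["social_security_num"].apply(pseudonymous_id)` before the PII columns are dropped.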


In [None]:
#@title installing pandasql 
#@markdown While pandas is unrivaled for data manipulation, it can be more challenging for reporting. The pandasql library is installed here as an alternative for reporting;
#@markdown see this [short tutorial / examples](https://www.analyticsvidhya.com/blog/2021/07/pandasql-best-way-to-run-sql-queries-and-codes-in-jupyter-notebook-using-python/).
!pip install pandasql
from pandasql import sqldf
sqlquery = lambda q: sqldf(q, globals())

In [None]:
#@title standard imports - we'll use in most EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

from datetime import datetime, timedelta
from dateutil.parser import parse
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [None]:
#@title loading data
favoritesdf = pd.read_csv("https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/data-governance/homework/favorite-movies.csv")
moviesdf = pd.read_csv("https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/data-governance/homework/imdb-movie-data.csv")

In [None]:
favoritesdf

In [None]:
moviesdf = moviesdf.drop_duplicates(subset=["Title"])
moviesdf

In [None]:
#@title An example query using pandasql if you're more comfortable using that for reporting
age_genders = sqlquery("""
  select gender_age, count(*) as counts
    from favoritesdf
    group by gender_age
    order by counts desc
  """)
age_genders

In [None]:
#@title The same pandasql pattern, counting responses by ethnicity
ethnicities = sqlquery("""
  select ethnicity, count(*) as counts
    from favoritesdf
    group by ethnicity
    order by counts desc
  """)
ethnicities

In [None]:
#@title using murmur3 to hash the social security numbers
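!pip install mmh3
import mmh3

# Count distinct SSNs (compare with the number of rows in favoritesdf)
print(favoritesdf["social_security_num"].nunique())
# mmh3.hash returns a signed 32-bit integer for each SSN string
favoritesdf["id"] = favoritesdf["social_security_num"].apply(mmh3.hash)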

In [None]:
#@title data wrangling, organizing data to get all movies into a single column
fav2 = favoritesdf.drop(["name", "email_address", "social_security_num"], axis = 1)
fav3 = pd.melt(fav2, id_vars=["gender_age", "ethnicity", "id"], value_vars=["movie1", "movie2", "movie3", "movie4", "movie5"],
        var_name='temp', value_name='movie')
fav4 = fav3.drop(["temp"], axis = 1)
movies2 = moviesdf[["Title"]]
fav5 = pd.merge(left = fav4, right = movies2,  left_on = "movie", right_on = "Title")
fav6 = fav5.drop(["Title"], axis = 1)
# n and expand are passed by keyword, as required by pandas 2.x
fav6[['gender', 'age']] = fav6['gender_age'].str.split(' ', n=1, expand=True)
fav8 = fav6[["id","gender", "age", "ethnicity", "movie"]]

In [None]:
#@title long tail
fig = px.histogram(fav8, x="movie").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
fig = px.histogram(fav8.loc[(fav8.age == "14-17") | (fav8.age == "18-24")], x="movie").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
fav8

In [None]:
#@title grouping by gender, age, ethnicity and creating count
grouped = fav8.groupby(by=["gender", "age", "ethnicity", "movie"])["id"].count().reset_index(name="count")
grouped

In [None]:
#@title creating a column for every age, gender, ethnicity combination and a total column (the column names are actually tuples)
fav9 = grouped.pivot(index=["movie"], columns=["gender","age","ethnicity"], values = "count").fillna(value=0)
fav9.columns = fav9.columns.to_flat_index()
fav9["total"] = fav9.sum(axis =1)
fav9 = fav9.sort_values(by="total", ascending=False)
fav10 = fav9.loc[fav9["total"] >= 10].reset_index()
fav10

In [None]:
#@title creating a new summary dataframe for plotting
female_columns = [col for col in fav10.columns if "Female" in col[0]]
youth_columns = [col for col in fav10.columns if "17" in col[1] or "24" in col[1]]

summary_df = fav10[["movie", "total"]].copy()
# Fraction of each movie's mentions that came from female / 24-and-under respondents
summary_df["female"] = fav10[female_columns].sum(axis=1) / fav10["total"]
summary_df["youth"] = fav10[youth_columns].sum(axis=1) / fav10["total"]
summary_df = summary_df.sort_values(by="youth", ascending=False)
summary_df

In [None]:
plotdf = summary_df.sort_values(by="total", ascending=False)[:100]
fig = px.scatter(plotdf, x="female", y="youth", text="movie", size = "total", height=1000, color="total")
fig.update_traces(textposition='bottom center')
fig.show()

# Lab Exercise

In the code cells below, create a plot showing, for the top movies, the percentage of female responses vs. the percentage of "African-American or Black" responses.

In [None]:
#@title creating a new summary dataframe for plotting
female_columns = [col for col in fav10.columns if "Female" in col[0] ]
aab_columns = [col for col in fav10.columns if "Afri" in col[2]]
                                                  
summary_df = fav10[["movie", "total"]].copy()
summary_df["female"] = fav10[female_columns].sum(axis =1) / fav10["total"]
summary_df["african-american_black"] = fav10[aab_columns].sum(axis=1) / fav10["total"]
summary_df = summary_df.sort_values(by="african-american_black", ascending=False)

plotdf = summary_df.sort_values(by="total", ascending=False)[:100]
fig = px.scatter(plotdf, x="female", y="african-american_black", text="movie", size = "total", height=1000, color="total")
fig.update_traces(textposition='bottom center')
fig.show()

# Towards Feature Engineering
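
The cells in this section turn each movie into a numeric feature vector (audience counts plus one-hot genre flags), rescale every feature to the range 0 to 1, and then rank movies by Euclidean distance to a chosen title. The sketch below illustrates that idea in plain Python on made-up data; the titles, the feature values, and the helper names `min_max_scale` and `nearest` are assumptions for illustration, not part of the notebook or of scikit-learn.

```python
import math

# Hypothetical toy features: two audience counts followed by two one-hot genre flags
features = {
    "Movie A": [40.0, 10.0, 1.0, 0.0],
    "Movie B": [38.0, 12.0, 1.0, 0.0],
    "Movie C": [5.0, 30.0, 0.0, 1.0],
}

def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

titles = list(features)
scaled = dict(zip(titles, min_max_scale(list(features.values()))))

def nearest(title, k=2):
    """Return the k other titles closest to `title` by Euclidean distance."""
    others = [t for t in titles if t != title]
    return sorted(others, key=lambda t: math.dist(scaled[title], scaled[t]))[:k]

print(nearest("Movie A"))
```

Unlike the final cell of this notebook, the sketch leaves the selected movie out of its own results. Scaling matters here because raw counts would otherwise dominate the 0/1 genre flags in the distance calculation.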

In [None]:
#@title adding genre columns for each movie
fav11 = pd.merge(left=fav10, right=moviesdf[["Title", "Genres"]], left_on="movie", right_on="Title", how="left" ).drop("Title", axis=1)
fav11["Genres"] = fav11["Genres"].apply(lambda x : x.replace(",","|").replace(" ",""))
genres = fav11["Genres"].str.get_dummies().add_prefix('genre_')
fav12 = pd.concat([fav11, genres], axis=1)
fav13 = fav12.drop(["Genres", "total"], axis=1)
fav13

In [None]:
#@title sklearn scaler errors on tuple column names, scale features and then use t-SNE to represent the features in 2 dimensions
def tuplerenamer(colname):
  # Join tuple column names such as (gender, age, ethnicity) into a single string
  if isinstance(colname, tuple):
    return '_'.join(colname)
  return colname

from sklearn.preprocessing import MinMaxScaler

prepped = fav13.rename(columns= lambda x : tuplerenamer(x))

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prepped.iloc[:,1:])
prepped.iloc[:,1:] = scaled

from sklearn.manifold import TSNE
model_tsne = TSNE(n_components=2, random_state=0)
Y = model_tsne.fit_transform(scaled)

prepped[["x", "y"]] = Y
prepped

In [None]:
#@title visualizing the titles in t-SNE embedded space
prepped["total"] = fav12["total"]
plotdf = prepped.sort_values(by="total", ascending=False)[:150]
fig = px.scatter(plotdf, x="x", y="y", text="movie", size = "total", height=1000, color="total")
fig.update_traces(textposition='bottom center')
fig.show()

In [None]:
#@title Find recommended movies by choosing a movie { run: "auto" }
from scipy.spatial import distance
movies = prepped["movie"].tolist()
dropdown = 'Avengers: Endgame' #@param ['Avengers: Endgame', 'Titanic', 'Toy Story', 'Forrest Gump', 'Pulp Fiction', 'Superbad']

numerics = prepped.iloc[:,1:61] #keeping only the numeric columns
selected = numerics[prepped["movie"] == dropdown].iloc[0] #finding the row of the movie from the selected drop down, as a 1-D Series (distance.euclidean expects 1-D inputs)
prepped["dist"]= numerics.apply(lambda row: distance.euclidean(row, selected), axis=1) #adding a distance calculation

# Create a new dataframe with distances.
results = prepped.sort_values(by="dist") #order the results by movies closest to the selected movie
print(results["movie"].tolist()[:5])

# For more: [Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241/)