<a href="https://colab.research.google.com/github/JaohBlack/Analytics/blob/main/AB%20Testing%20Movie%20prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Brief Description**

This assignment uses statsmodels for regression analysis, in particular log-log models whose coefficients can be read as percentage changes. It also uses SciPy for A/B testing analysis, mlxtend for market basket analysis, and sentence-transformers with scikit-learn to build a movie recommendation system, first from description embeddings alone and then with genres incorporated, with a comparison against TF-IDF.

**Question 1: Price Elasticity Analysis**

1. Using the dataset mmix_data.csv, estimate a log-log regression model where the dependent variable is the natural log of quantity sold (ln_quantity), and the independent variable is the natural log of price (ln_price).

2. Interpret the coefficient of ln_price in the regression output. What does the price elasticity value indicate about consumer sensitivity to price changes?

3. Based on the estimated elasticity, recommend whether you should consider increasing or decreasing the price of HDTVs to maximize revenue. Justify your answer.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
data = pd.read_csv('mmix_data.csv').dropna()

#1
data_logged = (
    data
    .assign(ln_quantity = lambda d: np.log(d.quantity),
        ln_price = lambda d: np.log(d.price))
)

formula = 'ln_quantity ~ ln_price'
mmix_log_reg = smf.ols(formula = formula, data = data_logged).fit()
print(mmix_log_reg.summary())


                            OLS Regression Results                            
Dep. Variable:            ln_quantity   R-squared:                       0.221
Model:                            OLS   Adj. R-squared:                  0.220
Method:                 Least Squares   F-statistic:                     207.0
Date:                Mon, 16 Dec 2024   Prob (F-statistic):           1.69e-41
Time:                        16:16:09   Log-Likelihood:                -542.31
No. Observations:                 731   AIC:                             1089.
Df Residuals:                     729   BIC:                             1098.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     19.4093      1.034     18.776      0.0


2. The coefficient for ln_price is -2.5715, so the estimated price elasticity of demand is -2.5715. A 1% increase in price is associated with a 2.57% decrease in quantity demanded. Conversely, a 1% decrease in price is associated with a 2.57% increase in quantity demanded.

3. The magnitude of the elasticity (greater than 1 in absolute value) indicates elastic demand: consumers are quite sensitive to price changes, so a small change in price leads to a proportionally larger change in quantity demanded. To maximize revenue, the company should consider decreasing the price of HDTVs. A decrease in price leads to a proportionally larger increase in quantity sold, which increases overall revenue.
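
The revenue argument above can be checked with a small calculation. The sketch below assumes demand follows the constant-elasticity form Q = A * P**elasticity exactly (the log-log model without noise) and ignores costs, so it speaks to revenue only, not profit. The function name `revenue_change_pct` is introduced here for illustration.

```python
# Sketch: revenue response to a price change under constant elasticity.
# Assumption: Q = A * P**elasticity holds exactly, so revenue R = P * Q
# scales with P**(1 + elasticity).

def revenue_change_pct(price_change_pct: float, elasticity: float) -> float:
    """Percent change in revenue (P * Q) for a given percent change in price."""
    price_ratio = 1 + price_change_pct / 100
    # Quantity scales by price_ratio**elasticity, so revenue scales by
    # price_ratio**(1 + elasticity).
    revenue_ratio = price_ratio ** (1 + elasticity)
    return (revenue_ratio - 1) * 100

elasticity = -2.5715  # ln_price coefficient reported in the answer above

for change in (-5, -1, 1, 5):
    result = revenue_change_pct(change, elasticity)
    print(f"price {change:+d}% -> revenue {result:+.2f}%")
```

With an elasticity below -1 the exponent `1 + elasticity` is negative, so price cuts raise revenue and price increases lower it. At an elasticity of exactly -1 revenue would not move at all.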

**Question 2: Budget Allocation for Advertising Channels**

1.	Start by creating the dataset mmix_data_logged by performing the following transformations on mmix_data.csv:

    - Compute the natural logarithm of quantity sold (ln_quantity): ln_quantity = log(quantity).
    - Compute the natural logarithm of price (ln_price): ln_price = log(price).
    - Compute the natural logarithms of the advertising expenditures for each channel:
        - ln_digital_ad = log(digital_ad)
        - ln_digital_search = log(digital_search)
        - ln_print = log(print + 1) (to handle zero expenditures in print ads, add 1 before taking the logarithm).
        - ln_tv = log(tv)

Once these transformations are completed, use this dataset (mmix_data_logged) to estimate a regression model where the dependent variable is ln_quantity, and the independent variables are ln_digital_ad, ln_digital_search, ln_print, ln_tv, and ln_price.

2.	Interpret the coefficients of ln_digital_ad, ln_print, and ln_tv. Which channel shows the highest sales elasticity?

3.	Assume a total advertising budget of \$20,000. Propose an optimal budget allocation across digital ads, digital search, print, and TV to maximize predicted sales. Use the regression model to calculate predicted sales for your proposed allocation and explain your reasoning.

In [None]:
#1
mmix_data_logged = (
    data
    .assign(ln_quantity = lambda d: np.log(d.quantity),
            ln_price = lambda d: np.log(d.price),
            ln_digital_ad = lambda d: np.log(d.digital_ad),
            ln_digital_search = lambda d: np.log(d.digital_search),
            ln_print = lambda d: np.log(d.print + 1),
            ln_tv = lambda d: np.log(d.tv))
)

formula = 'ln_quantity ~ ln_digital_ad + ln_digital_search + ln_print + ln_tv + ln_price'
mmix_log_reg1 = smf.ols(formula = formula, data = mmix_data_logged).fit()
print(mmix_log_reg1.summary())

#3
# Allocation 1: 20% digital ads, 30% digital search, 35% print, 15% TV
digital_ad_spend = np.log(0.20 * 20000)
digital_search_spend = np.log(0.30 * 20000)
print_spend = np.log(0.35 * 20000 + 1)  # +1 to match the ln_print transformation
tv_spend = np.log(0.15 * 20000)
ln_price = mmix_data_logged['ln_price'].mean()  # hold price at its mean log value
predicted_ln_quantity1 = (
    mmix_log_reg1.params['Intercept'] +
    mmix_log_reg1.params['ln_digital_ad'] * digital_ad_spend +
    mmix_log_reg1.params['ln_digital_search'] * digital_search_spend +
    mmix_log_reg1.params['ln_print'] * print_spend +
    mmix_log_reg1.params['ln_tv'] * tv_spend +
    mmix_log_reg1.params['ln_price'] * ln_price
        )
print(predicted_ln_quantity1)

# Allocation 2: move 10% from digital_search_spend to digital_ad_spend
# and 5% from print_spend to tv_spend.
digital_ad_spend = np.log(0.30 * 20000)
digital_search_spend = np.log(0.20 * 20000)
print_spend = np.log(0.30 * 20000 + 1)  # +1 to match the ln_print transformation
tv_spend = np.log(0.20 * 20000)
ln_price = mmix_data_logged['ln_price'].mean()
predicted_ln_quantity2 = (
    mmix_log_reg1.params['Intercept'] +
    mmix_log_reg1.params['ln_digital_ad'] * digital_ad_spend +
    mmix_log_reg1.params['ln_digital_search'] * digital_search_spend +
    mmix_log_reg1.params['ln_print'] * print_spend +
    mmix_log_reg1.params['ln_tv'] * tv_spend +
    mmix_log_reg1.params['ln_price'] * ln_price
        )
print(predicted_ln_quantity2)

# Percentage difference in predicted quantity: allocation 1 relative to allocation 2
difference = predicted_ln_quantity1 - predicted_ln_quantity2
exp_difference = np.exp(difference)
pct_change = (exp_difference - 1) * 100
print(pct_change)

                            OLS Regression Results                            
Dep. Variable:            ln_quantity   R-squared:                       0.295
Model:                            OLS   Adj. R-squared:                  0.291
Method:                 Least Squares   F-statistic:                     60.82
Date:                Mon, 16 Dec 2024   Prob (F-statistic):           6.13e-53
Time:                        16:27:28   Log-Likelihood:                -505.66
No. Observations:                 731   AIC:                             1023.
Df Residuals:                     725   BIC:                             1051.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            15.2940      1.27


2. Interpretation of the coefficients:
- A 1% increase in digital advertising spending is associated with a 0.0644% increase in quantity sold. Digital ads have a positive but modest effect on sales.
- A 1% increase in print advertising spending is associated with a 0.0781% increase in quantity sold. Print ads have a larger effect on sales than digital ads.
- A 1% increase in TV advertising spending is associated with a 0.0546% increase in quantity sold. TV ads also affect sales positively, but the effect is smaller than that of print ads.

- Print advertising therefore shows the highest sales elasticity of the three, with a coefficient of 0.0781, meaning it yields the largest percentage gain in sales per 1% increase in spending.

3. Based on this model, the company should prefer the first allocation (20% digital ads, 30% digital search, 35% print, 15% TV), as it slightly outperforms the second allocation in predicted sales while still maintaining a balanced investment in digital advertising and TV. Reallocating more to digital advertising and TV does not increase predicted sales in this comparison. Of the two allocations tested, the first is therefore the better choice.
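
The comparison above only needs the advertising elasticities, because the intercept and the price term cancel when two allocations are subtracted in logs. The sketch below illustrates that, and also shows a textbook property of log-log models: when every elasticity is positive, predicted sales under a fixed budget are maximized (up to the small `+1` in `ln_print`) by splitting spend in proportion to the elasticities. The `ln_digital_search` coefficient is not reported in the text, so the value used here is a hypothetical placeholder and the printed numbers are illustrative only. All function names are introduced for this sketch.

```python
import math

# Elasticities from the answer above. The digital_search value is NOT
# reported in the text; it is a hypothetical placeholder for illustration.
betas = {
    "digital_ad": 0.0644,
    "digital_search": 0.05,  # hypothetical placeholder
    "print": 0.0781,
    "tv": 0.0546,
}

def log_sales_contribution(allocation, betas):
    """Sum of beta_i * ln(spend_i); print uses ln(spend + 1) as in the model."""
    total = 0.0
    for channel, spend in allocation.items():
        x = spend + 1 if channel == "print" else spend
        total += betas[channel] * math.log(x)
    return total

def pct_difference(alloc_a, alloc_b, betas):
    """Predicted % difference in quantity for allocation A relative to B."""
    diff = (log_sales_contribution(alloc_a, betas)
            - log_sales_contribution(alloc_b, betas))
    return (math.exp(diff) - 1) * 100

def proportional_allocation(budget, betas):
    """Split the budget in proportion to each channel's elasticity."""
    total_beta = sum(betas.values())
    return {ch: budget * b / total_beta for ch, b in betas.items()}

budget = 20_000
first = {
    "digital_ad": 0.20 * budget,
    "digital_search": 0.30 * budget,
    "print": 0.35 * budget,
    "tv": 0.15 * budget,
}
proportional = proportional_allocation(budget, betas)

for channel, spend in proportional.items():
    print(f"{channel}: {spend:,.0f}")
print(f"{pct_difference(proportional, first, betas):+.3f}% vs first allocation")
```

Whether a proportional split actually beats the first allocation depends on the real `ln_digital_search` coefficient from the regression output, so this is a way to generate a candidate allocation rather than a conclusion.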

**Question 3: A/B Testing for Revenue Analysis**

Create a new dataset, mmix_data_weekend, by performing the following steps:

- Load the mmix_data.csv file into a pandas DataFrame.
- Convert the date column to datetime format using pd.to_datetime().
- Extract the day of the week using pd.to_datetime(df['date']).dt.day_name() and create a new column called weekdays.
- Create a new column called weekend that assigns the value 1 for Saturday and Sunday and 0 for all other days

1. Using the mmix_data_weekend dataset, perform a two-sample t-test to determine whether the average revenue on weekends is significantly different from that on weekdays.
2. Formulate the null and alternative hypotheses for the test. State the significance level you are using and justify your choice.
3. Report the test statistic, p-value, and your decision regarding the null hypothesis. What does this result suggest about consumer behavior on weekends versus weekdays?
4. Discuss how ClearView Electronics could use these insights to adjust its promotional strategies.

In [None]:
mmix_data_weekend = pd.read_csv('mmix_data.csv').dropna()

# Weekdays column
mmix_data_weekend = mmix_data_weekend.assign(
    weekdays=lambda d: pd.to_datetime(d['date']).dt.day_name()
)

# Weekend column
mmix_data_weekend = mmix_data_weekend.assign(
    weekend=lambda d: np.where(d.weekdays.isin(['Saturday', 'Sunday']), 1, 0)
)


#1
from scipy import stats
weekday_revenue = mmix_data_weekend[lambda d: d.weekend == 0]["revenue"]
weekend_revenue = mmix_data_weekend[lambda d: d.weekend == 1]["revenue"]
t_stat, p_value = stats.ttest_ind(weekday_revenue, weekend_revenue, equal_var = False)
alpha = 0.05  # significance level stated in the hypothesis section

print(f"t_statistic: {t_stat:.4f}")
print(f"p_value: {p_value:.4f}")
# Hypothesis testing

if p_value < alpha:
  print("Reject the null hypothesis")
  print("Average revenue on weekends is significantly different from average revenue from weekdays")
else:
  print("Fail to reject the null hypothesis")
  print("Average revenue on weekends is same as average from weekdays")

t_statistic: 0.6682
p_value: 0.5043
Fail to reject the null hypothesis
Average revenue on weekends is same as average from weekdays


2. Hypothesis statement
- H0: Average revenue on weekends is the same as average revenue on weekdays.
- H1: Average revenue on weekends is different from average revenue on weekdays.
- I will use a significance level of 0.05 because it is widely accepted across many fields of research, making results easier to compare across studies. At this level there is a 5% chance of rejecting the null hypothesis when it is true (i.e. a Type I error). It strikes a balance between being too lenient and too strict, giving a reasonable threshold for determining statistical significance.

3. Test Results:
- Test Statistic: 0.6682
- P-value: 0.5043
- Decision: Fail to reject the null hypothesis.
    There is insufficient evidence to conclude that weekend revenue differs from weekday revenue: the difference in average revenue is not statistically significant at the 0.05 level. This is consistent with consumer spending being similar across weekends and weekdays, although failing to reject the null hypothesis does not prove that the averages are equal.

4. With this insight, ClearView Electronics can shift from weekend-focused promotions to a more balanced strategy throughout the week, as weekend promotions do not give them any clear advantage. I recommend that they introduce loyalty programs to encourage repeat purchases and offer time-limited, week-long deals. On the other hand, if they want to stimulate weekend sales specifically, they can offer discounts on bulk purchases over the weekend.
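
For reference, the test statistic reported in part 3 comes from Welch's unequal-variance t-test, which is what `stats.ttest_ind(..., equal_var = False)` computes. The sketch below reproduces the statistic and the Welch-Satterthwaite degrees of freedom by hand using only the standard library. The revenue figures are made up for illustration and are not taken from mmix_data.csv, and the helper name `welch_t` is introduced here.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Made-up revenue figures, for illustration only
weekday = [120.0, 135.0, 128.0, 140.0, 131.0, 126.0]
weekend = [125.0, 138.0, 122.0, 133.0]

t_stat, df = welch_t(weekday, weekend)
print(f"t = {t_stat:.4f}, df = {df:.2f}")
```

The p-value additionally requires the t distribution's CDF, which SciPy supplies. The sign of the statistic depends only on which sample is passed first, so it does not affect a two-sided decision.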

**Question 4: Market Basket Analysis and Association Metrics**

1. Using the provided order_products.csv dataset, perform the following steps to prepare the data for Market Basket Analysis:
    - Create a transactional dataset where each row represents an order and each column represents a product.
    - Ensure the dataset includes binary values indicating whether a product was purchased in a transaction.

2. Use the Apriori algorithm to identify frequent itemsets with a minimum support threshold of 0.02. Generate association rules based on these itemsets.

3. Interpret the following metrics for the top 5 association rules:
    - Support: What does this metric tell you about the frequency of itemsets?
    - Confidence: How reliable are the generated rules based on this metric?
    - Lift: Which rule shows the strongest positive association between items?

4. Based on the results, recommend two product bundling or cross-selling strategies FreshMart Grocery Stores could implement to increase the average basket size.

In [None]:
import pandas as pd
import random
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from pprint import pprint

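orders = pd.read_csv('order_products.csv')
# Sample from unique order IDs so the subset holds 10,000 distinct orders
order_ids = orders.order_id.unique().tolist()
random_order_ids = random.sample(order_ids, 10000)
# Filter the orders table (not the mmix `data` frame, which has no order_id)
orders_subset = (
    orders.loc[lambda d: d.order_id.isin(random_order_ids)]
)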

product_ids = (
    pd.read_csv('products.csv')
      .loc[:,["product_id","product_name"]]
)


merged = (
    orders_subset
        .merge(product_ids,
               how = "inner",
               on = "product_id")
)

# Creating the transactional dataset
mkt_basket = (
    merged
        .loc[:,["order_id","product_name"]]
        .assign(quantity = 1)
        .sort_values(by = "order_id")
        .groupby(['order_id', 'product_name'])['quantity']
          .sum()
          .unstack()
          .reset_index()
          .fillna(0)
          .set_index('order_id')
          .astype(bool)  # binary purchased / not purchased flags for apriori
)

#2 Apriori algorithm
frequent_itemsets = apriori(mkt_basket,
                            min_support = 0.02,
                            use_colnames = True)

# Association rule
rules = association_rules(frequent_itemsets,
                          metric = "lift",
                          min_threshold = 1.0,
                          num_itemsets = 2)
print(rules.loc[0:5,["antecedents",
                      "consequents",
                      "support",
                      "lift",
                      "confidence"]])


                antecedents               consequents   support      lift  \
0    (Organic Baby Spinach)  (Bag of Organic Bananas)  0.028768  1.623174   
1  (Bag of Organic Bananas)    (Organic Baby Spinach)  0.028768  1.623174   
2    (Organic Hass Avocado)  (Bag of Organic Bananas)  0.035748  2.447238   
3  (Bag of Organic Bananas)    (Organic Hass Avocado)  0.035748  2.447238   
4     (Organic Raspberries)  (Bag of Organic Bananas)  0.024960  2.258989   
5  (Bag of Organic Bananas)     (Organic Raspberries)  0.024960  2.258989   

   confidence  
0    0.261287  
1    0.178712  
2    0.393939  
3    0.222076  
4    0.363636  
5    0.155059  


3. Interpretation of the metrics (exact values vary between runs because the orders are sampled at random; the figures below refer to the output shown above):
- Support tells how frequently an itemset appears in transactions. Higher support means the itemset is more common. For example, Organic Hass Avocado and Bag of Organic Bananas appear together in about 3.6% of the sampled orders.
- Confidence measures how likely the consequent is to be purchased given that the antecedent is bought. Higher confidence indicates a more reliable rule.
- The highest lift among the rules shown is 2.447238, between Organic Hass Avocado and Bag of Organic Bananas, showing the strongest positive association.

4. To increase average basket size, FreshMart could sell Bag of Organic Bananas and Organic Baby Spinach as a bundle. Bundling these items leverages their positive association (lift of about 1.62) and encourages customers to purchase both.

- In terms of cross-selling, they can suggest Bag of Organic Bananas to customers buying Organic Hass Avocado and vice versa, capitalizing on their lift of about 2.45.
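
To make the three metrics concrete, here is a minimal sketch that computes them by hand on a toy set of transactions. The transactions are made up for illustration and are not from order_products.csv, and the helper names are introduced here rather than taken from mlxtend.

```python
# Toy transactions, made up for illustration
transactions = [
    {"bananas", "avocado"},
    {"bananas", "spinach"},
    {"bananas", "avocado", "spinach"},
    {"avocado"},
    {"milk"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    conf = confidence(antecedent, consequent, transactions)
    return conf / support(consequent, transactions)

a, c = {"avocado"}, {"bananas"}
print("support:   ", support(a | c, transactions))
print("confidence:", confidence(a, c, transactions))
print("lift:      ", lift(a, c, transactions))
```

Lift is symmetric in antecedent and consequent while confidence is not, which is why each pair of mirrored rules in the output above shares one lift value but has two different confidence values.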


**Question 5: Recommendation Systems Using Text Embeddings**

1. Using the movie_titles.csv dataset, generate text embeddings for the description column using a pre-trained model such as all-MiniLM-L6-v2. Calculate a cosine similarity matrix for these embeddings.

2. Create a function that accepts a movie title and returns the top 5 most similar movies based on their description embeddings. Test this function using a movie title of your choice.

3. Analyze the output of your recommendation function. Are the recommendations relevant? What patterns do you observe in the recommended movies?

4. Improve the recommendation system by incorporating genres (listed_in column) into the similarity calculation. Adjust your function to prioritize recommendations within the same genre as the input movie. Compare the new recommendations with the original ones.

5. Comment on how embedding-based recommendations differ from traditional TF-IDF-based recommendations and their potential advantages in enhancing user experience.

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from pprint import pprint

#1
movie_titles = pd.read_csv('movie_titles.csv')
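# Reset the index so row labels match row positions in the similarity matrix
movie_titles_clean = movie_titles.dropna(subset=["description"]).reset_index(drop=True)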

# Generating embeddings using all-MiniLM-L6-v2
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(movie_titles_clean.description)
similarity_matrix = cosine_similarity(embeddings)

#2
# Function to get recommendations based on embeddings
def get_recommendation(title, similarity_matrix, movie_df, top_n=5):
    idx = movie_df.index[movie_df['title'] == title].tolist()[0]
    sim_scores = similarity_matrix[idx]
    sim_scores = sorted(enumerate(sim_scores),
                        key=lambda x: x[1],
                        reverse=True)[1:top_n+1]
    sim_indices, sim_values = zip(*sim_scores)
    recommended_titles = movie_df['title'].iloc[list(sim_indices)]
    return list(zip(recommended_titles, sim_values))

#3
# Testing the function with the movie "Dick Johnson Is Dead"
recommendations = get_recommendation('Dick Johnson Is Dead',
                                     similarity_matrix,
                                     movie_titles_clean)
pprint(recommendations)

# Printing the descriptions of the query and the top 2 recommendations to compare them
for title in ["Dick Johnson Is Dead", "Kodachrome", "Before I Fall"]:
    print(f"Description for {title}")
    pprint(movie_titles[lambda d: d.title == title]['description'].to_list()[0])

[('Kodachrome', 0.49972707),
 ('Before I Fall', 0.4631701),
 ('Life After Beth', 0.46094507),
 ('Casting JonBenet', 0.4440267),
 ('The Sky Is Pink', 0.4438929)]
Description for Dick Johnson Is Dead
('As her father nears the end of his life, filmmaker Kirsten Johnson stages '
 'his death in inventive and comical ways to help them both face the '
 'inevitable.')
Description for Kodachrome
("A record company exec joins his estranged dad, a famous photographer who's "
 'dying, on a road trip to the last lab still developing Kodachrome film.')
Description for Before I Fall
('Forced to continually relive the day she dies in a car crash, a privileged '
 'high schooler must unravel the cosmic mystery of her suddenly looping life.')


In [None]:
movie_titles_clean.head(5)

In [None]:
#4
# Function to get genre-based recommendations
def get_recommendation_genre(title, genre, similarity_matrix, movie_df, top_n=5):
    fdf = movie_df[movie_df['listed_in'].str.contains(genre, regex=False)]
    if title not in fdf['title'].values:
        return []
    idx = fdf.index[fdf['title'] == title].tolist()[0]
    sim_scores = [(i, similarity_matrix[idx][i]) for i in fdf.index]
    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1],
                        reverse=True)[1:top_n+1]
    if sim_scores:
        sim_indices, sim_values = zip(*sim_scores)
        recommended_titles = fdf.loc[list(sim_indices), 'title']
        return list(zip(recommended_titles, sim_values))
    else:
        return []

# Testing the genre-based recommendation function
print("Here is a list after incorporating genres:")
recommendation_genre = get_recommendation_genre('Dick Johnson Is Dead',
                         "Documentaries",
                         similarity_matrix,
                         movie_titles_clean)
pprint(recommendation_genre)

# Printing the descriptions of the query and the top 2 recommendations to compare them
for title in ["Dick Johnson Is Dead", "Casting JonBenet", "A Gray State"]:
    print(f"Description for {title}")
    pprint(movie_titles[lambda d: d.title == title]['description'].to_list()[0])

Here is a list after incorporating genres:
[('Casting JonBenet', 0.4440269),
 ('A Gray State', 0.43848854),
 ('Strong Island', 0.42499763),
 ('Diana: 7 Days That Shook the World', 0.41442138),
 ('27: Gone Too Soon', 0.41334918)]
Description for Dick Johnson Is Dead
('As her father nears the end of his life, filmmaker Kirsten Johnson stages '
 'his death in inventive and comical ways to help them both face the '
 'inevitable.')
Description for Casting JonBenet
("Local actors from JonBenet Ramsey's hometown offer multiple perspectives on "
 'her 1996 murder as they vie to play roles in a dramatization of the case.')
Description for A Gray State
('This documentary dissects the case of a filmmaker whose death, along with '
 'the deaths of his wife and daughter, sparked alt-right conspiracy theories.')



3. The recommendations for Dick Johnson Is Dead using this recommendation system highlight strong thematic overlaps, especially with Kodachrome and Before I Fall, which share themes of death and self-reflection. However, the system lacks genre differentiation, limiting relevance. For example, Kodachrome is a drama while Dick Johnson Is Dead is a documentary, making it a less suitable recommendation. To address this, we need to refine our approach to prioritize movies within the same genre.

4. Incorporating genres into the recommendation system narrowed the suggestions to documentaries, which aligns better with Dick Johnson Is Dead. New recommendations such as Casting JonBenet and A Gray State share thematic and stylistic similarities, focusing on introspective and investigative storytelling. The similarity scores are slightly lower than in the original list (about 0.44 versus 0.50 for the top result), but the genre-aware recommendations are more relevant for this title than the original ones.

5. While TF-IDF focuses on word frequency and importance within the dataset, embeddings capture contextual meaning and semantic relationships between words. This allows embedding-based systems to identify similarities beyond exact word matches, enabling them to recommend movies with related themes or emotions, even if their descriptions use different terms.

- The potential advantages of embedding-based recommendations include a more nuanced understanding of content, improved relevance, and a better user experience, as recommendations can account for subtle contextual and semantic overlaps that TF-IDF might miss due to its reliance on word frequency.
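
The reliance of TF-IDF on shared words can be seen in a tiny example. The sketch below implements a deliberately simplified TF-IDF (term frequency times log(N / document frequency), without the smoothing or normalization that scikit-learn's `TfidfVectorizer` applies) and compares toy descriptions written for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simplified TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy descriptions, made up for illustration
docs = [
    "father nears death",
    "dad is dying",
    "father builds robots",
]
vecs = tfidf_vectors(docs)
print("same theme, no shared words:", cosine(vecs[0], vecs[1]))
print("different theme, shared word:", cosine(vecs[0], vecs[2]))
```

The first pair describes the same situation but scores zero because no token is shared, while the second pair scores above zero purely because of the word "father". An embedding model such as all-MiniLM-L6-v2 is designed to avoid this kind of mismatch, though how it would score these particular sentences is not shown here.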