Day 13 of Python Summer Party

by Interview Master

Shake Shack

New Milkshake Flavor Selection for Launch

You are a Product Analyst working with the Shake Shack R&D team to evaluate customer ratings for experimental milkshake flavors. Your team has collected ratings data from a small sampling test. Your task is to systematically analyze and clean the ratings data to identify top-performing flavors.

In [1]:
import pandas as pd
import numpy as np

# Load the CSV file into a DataFrame
milkshake_ratings = pd.read_csv('milkshake_ratings.csv')
milkshake_ratings_df = milkshake_ratings.copy()
print(milkshake_ratings_df)
print()
print(milkshake_ratings_df.info())


               flavor  rating customer_id rating_date
0   Classic Chocolate     4.5     CUST001  2024-07-05
1    Strawberry Swirl     3.8     CUST002  2024-07-10
2        Vanilla Bean     4.2     CUST003  2024-07-15
3     Caramel Delight     3.5     CUST004  2024-07-20
4          Mocha Bean     NaN     CUST005  2024-07-25
5   Classic Chocolate     4.5     CUST001  2024-07-05
6   Classic Chocolate     5.0     CUST006  2024-08-01
7    Strawberry Swirl     4.0     CUST007  2024-08-02
8        Vanilla Bean     3.9     CUST008  2024-08-03
9     Caramel Delight     4.8     CUST009  2024-10-04
10         Mocha Bean     2.5     CUST010  2024-09-05
11  Classic Chocolate     4.7     CUST011  2024-10-06
12   Strawberry Swirl     NaN     CUST012  2024-10-07
13       Vanilla Bean     4.3     CUST013  2024-10-08
14    Caramel Delight     4.9     CUST014  2024-10-09
15         Mocha Bean     3.3     CUST015  2024-08-10
16  Classic Chocolate     1.0     CUST016  2024-08-11
17   Strawberry Swirl     6.

Question 1 of 3

There was an error in our data collection process, and we unknowingly introduced duplicate rows into our data. Remove any duplicate entries in the customer ratings data to ensure the accuracy of the analysis.

In [2]:
# We can quickly do this by using .duplicated().sum()
duplicate_records = milkshake_ratings_df.duplicated().sum()
print('The number of duplicate values on the data set is:', duplicate_records)
print()

# Identify all duplicate rows, including the first occurrence
all_duplicate_rows = milkshake_ratings_df[milkshake_ratings_df.duplicated(keep=False)]
print('All duplicate rows in the data set are:')
print(all_duplicate_rows)
print()

# Now that we have identified the duplicate rows, we can drop them
clean_milkshake_ratings_df = milkshake_ratings_df.drop_duplicates()
print("Answer 1: Cleaned dataframe with no duplicate rows:")
print(clean_milkshake_ratings_df.info())
print()
print("=" * 100)


The number of duplicate values on the data set is: 1

All duplicate rows in the data set are:
              flavor  rating customer_id rating_date
0  Classic Chocolate     4.5     CUST001  2024-07-05
5  Classic Chocolate     4.5     CUST001  2024-07-05

Answer 1: Cleaned dataframe with no duplicate rows:
<class 'pandas.core.frame.DataFrame'>
Index: 59 entries, 0 to 59
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   flavor       59 non-null     object 
 1   rating       53 non-null     float64
 2   customer_id  59 non-null     object 
 3   rating_date  59 non-null     object 
dtypes: float64(1), object(3)
memory usage: 2.3+ KB
None
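
A side note on the output above: the cleaned frame has 59 entries but the index still runs 0 to 59, because drop_duplicates() keeps the original index labels and simply leaves a gap where the duplicate row was. Here is a minimal sketch of that behavior on a hypothetical three-row frame (the toy values are made up for illustration and are not the real ratings file):

```python
import pandas as pd

# Hypothetical toy data with the same columns as the ratings file; illustrative only
toy = pd.DataFrame({
    "flavor": ["Classic Chocolate", "Classic Chocolate", "Vanilla Bean"],
    "rating": [4.5, 4.5, 4.2],
    "customer_id": ["CUST001", "CUST001", "CUST003"],
    "rating_date": ["2024-07-05", "2024-07-05", "2024-07-15"],
})

# duplicated() flags only the later copies by default (keep='first')
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels
deduped = toy.drop_duplicates()

# reset_index(drop=True) renumbers the rows if the gap in the index is unwanted
deduped_reset = deduped.reset_index(drop=True)

print(n_dupes)
print(list(deduped.index))
print(list(deduped_reset.index))
```

Whether to reset the index is a matter of taste here; the later cells in this notebook work either way.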



Question 2 of 3

For each milkshake flavor, calculate the average customer rating and append this as a new column to the milkshake_ratings DataFrame. Don't forget to clean the DataFrame first by dropping duplicate values.

In [3]:
# Before we work on this question, we need to address the null values. We drop them so they don't affect our analysis
clean_milkshake_ratings_df = clean_milkshake_ratings_df.dropna()
print(clean_milkshake_ratings_df.info())
print()


<class 'pandas.core.frame.DataFrame'>
Index: 53 entries, 0 to 59
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   flavor       53 non-null     object 
 1   rating       53 non-null     float64
 2   customer_id  53 non-null     object 
 3   rating_date  53 non-null     object 
dtypes: float64(1), object(3)
memory usage: 2.1+ KB
None
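
In this data set only the rating column contains nulls, so a plain dropna() and a dropna restricted to that column give the same result. If another column ever had missing values, the two would differ. A small sketch of the difference, using a hypothetical toy frame (the values are assumptions for illustration only):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data; the missing customer_id is invented to show the difference
toy = pd.DataFrame({
    "flavor": ["Mocha Bean", "Mocha Bean", "Strawberry Swirl"],
    "rating": [np.nan, 2.5, 4.0],
    "customer_id": ["CUST005", "CUST010", None],
})

# dropna() removes a row if ANY column is null
dropped_any = toy.dropna()

# dropna(subset=[...]) removes a row only if the listed columns are null
dropped_rating = toy.dropna(subset=["rating"])

# mean() skips NaN by default, so the group average ignores the missing rating
mocha_mean = toy.groupby("flavor")["rating"].mean()["Mocha Bean"]

print(len(dropped_any), len(dropped_rating), mocha_mean)
```

Because mean() already skips NaN, dropping the null ratings mainly matters for the row-level output rather than for the averages themselves.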



In [4]:
# Now that the null values are removed, we can work on this question
# We need to group the data by flavor and get the average rating for each flavor
flavor_ratings = clean_milkshake_ratings_df.groupby('flavor').agg({'rating': 'mean'}).round(2).rename(columns={'rating': 'avg_rating'})
print(flavor_ratings)
print()


                   avg_rating
flavor                       
Caramel Delight          4.20
Classic Chocolate        4.17
Mocha Bean               3.64
Strawberry Swirl         4.24
Vanilla Bean             3.91
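
As an aside, the per-flavor average can also be attached to every row in one step with groupby(...).transform('mean'), which returns a value for each original row instead of one row per group. This is an alternative sketch, not the approach used in this notebook, and the toy frame below is hypothetical:

```python
import pandas as pd

# Hypothetical toy ratings; values chosen so the averages are easy to check by hand
toy = pd.DataFrame({
    "flavor": ["Vanilla Bean", "Vanilla Bean", "Mocha Bean", "Mocha Bean"],
    "rating": [4.0, 3.0, 2.5, 3.5],
})

# transform('mean') broadcasts each group's mean back to the rows of that group,
# so the result lines up with the original index and row order
toy["avg_rating"] = toy.groupby("flavor")["rating"].transform("mean").round(2)

print(toy)
```

The trade-off is that transform keeps the existing row order and index, while a separate summary table (as built above) is handy when the per-flavor averages are also needed on their own.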



In [5]:
# Now we just need to merge the per-flavor averages back onto the cleaned ratings, which adds the avg_rating column
append_milkshake_ratings_df = pd.merge(clean_milkshake_ratings_df, flavor_ratings, how='right', on='flavor').sort_values('customer_id').reset_index(drop=True)
print(append_milkshake_ratings_df.info())
print("\nAnswer 2: Cleaned dataframe with new 'avg_rating' column:")
print(append_milkshake_ratings_df)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   flavor       53 non-null     object 
 1   rating       53 non-null     float64
 2   customer_id  53 non-null     object 
 3   rating_date  53 non-null     object 
 4   avg_rating   53 non-null     float64
dtypes: float64(2), object(3)
memory usage: 2.2+ KB
None

Answer 2: Cleaned dataframe with new 'avg_rating' column:
               flavor  rating customer_id rating_date  avg_rating
0   Classic Chocolate     4.5     CUST001  2024-07-05        4.17
1    Strawberry Swirl     3.8     CUST002  2024-07-10        4.24
2        Vanilla Bean     4.2     CUST003  2024-07-15        3.91
3     Caramel Delight     3.5     CUST004  2024-07-20        4.20
4   Classic Chocolate     5.0     CUST006  2024-08-01        4.17
5    Strawberry Swirl     4.0     CUST007  2024-08-02        4.24
6        Vanilla Bean     

Question 3 of 3

For each row in the dataset, calculate the difference between that customer's rating and the average rating for the flavor. Don't forget to clean the DataFrame first by dropping duplicate values.

In [6]:
# Create a new column with the difference between each rating and its flavor's average rating
append_milkshake_ratings_df['difference'] = append_milkshake_ratings_df['rating'] - append_milkshake_ratings_df['avg_rating']
print(append_milkshake_ratings_df)


               flavor  rating customer_id rating_date  avg_rating  difference
0   Classic Chocolate     4.5     CUST001  2024-07-05        4.17        0.33
1    Strawberry Swirl     3.8     CUST002  2024-07-10        4.24       -0.44
2        Vanilla Bean     4.2     CUST003  2024-07-15        3.91        0.29
3     Caramel Delight     3.5     CUST004  2024-07-20        4.20       -0.70
4   Classic Chocolate     5.0     CUST006  2024-08-01        4.17        0.83
5    Strawberry Swirl     4.0     CUST007  2024-08-02        4.24       -0.24
6        Vanilla Bean     3.9     CUST008  2024-08-03        3.91       -0.01
7     Caramel Delight     4.8     CUST009  2024-10-04        4.20        0.60
8          Mocha Bean     2.5     CUST010  2024-09-05        3.64       -1.14
9   Classic Chocolate     4.7     CUST011  2024-10-06        4.17        0.53
10       Vanilla Bean     4.3     CUST013  2024-10-08        3.91        0.39
11    Caramel Delight     4.9     CUST014  2024-10-09        4.2