
# Day 13 - Shake Shack
You are a Product Analyst working with the Shake Shack R&D team to evaluate customer ratings for experimental milkshake flavors. Your team has collected ratings data from a small sampling test. Your task is to systematically analyze and clean the ratings data to identify top-performing flavors.

In [11]:
# Import required libraries
import pandas as pd
import numpy as np

# Import data file
url = "https://raw.githubusercontent.com/AnamHJ24/datascience-python-challenges/refs/heads/main/Data/Day_13.txt"
milkshake_ratings = pd.read_csv(url)
milkshake_ratings.head()

              flavor  rating customer_id rating_date
0  Classic Chocolate     4.5     CUST001  2024-07-05
1   Strawberry Swirl     3.8     CUST002  2024-07-10
2       Vanilla Bean     4.2     CUST003  2024-07-15
3    Caramel Delight     3.5     CUST004  2024-07-20
4         Mocha Bean     NaN     CUST005  2024-07-25


## Question 1
An error in our data collection process unknowingly introduced duplicate rows into the data. Remove any duplicate entries from the customer ratings data to ensure the accuracy of the analysis.

## Solution

In [12]:
# Remove duplicate rows
milkshake_ratings = milkshake_ratings.drop_duplicates()
print(milkshake_ratings.head())

              flavor  rating customer_id rating_date
0  Classic Chocolate     4.5     CUST001  2024-07-05
1   Strawberry Swirl     3.8     CUST002  2024-07-10
2       Vanilla Bean     4.2     CUST003  2024-07-15
3    Caramel Delight     3.5     CUST004  2024-07-20
4         Mocha Bean     NaN     CUST005  2024-07-25
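
Before dropping duplicates, it can be useful to check how many rows are affected. The sketch below is a minimal illustration on a small, made-up DataFrame (the `sample` data and its values are assumptions, not the real ratings file); it counts fully repeated rows with `duplicated()` and then removes them.

```python
import pandas as pd

# Hypothetical sample data for illustration only; not the real ratings file
sample = pd.DataFrame({
    "flavor": ["Classic Chocolate", "Mocha Bean", "Mocha Bean", "Vanilla Bean"],
    "rating": [4.5, 3.0, 3.0, 4.2],
    "customer_id": ["CUST001", "CUST005", "CUST005", "CUST003"],
})

# duplicated() marks every repeat of an earlier, identical row as True
n_duplicates = int(sample.duplicated().sum())
print(f"Duplicate rows found: {n_duplicates}")

# drop_duplicates() keeps the first occurrence of each row by default
deduplicated = sample.drop_duplicates()
print(deduplicated)
```

Rows count as duplicates only when every column matches, so two different ratings from the same customer would both be kept.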


In [13]:
# Remove rows with incomplete data
milkshake_ratings = milkshake_ratings.dropna()
print(milkshake_ratings.head())

              flavor  rating customer_id rating_date
0  Classic Chocolate     4.5     CUST001  2024-07-05
1   Strawberry Swirl     3.8     CUST002  2024-07-10
2       Vanilla Bean     4.2     CUST003  2024-07-15
3    Caramel Delight     3.5     CUST004  2024-07-20
6  Classic Chocolate     5.0     CUST006  2024-08-01
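
To see what `dropna()` is about to remove, one option is to count missing values per column first. This is a minimal sketch on made-up data (the `sample` DataFrame is an assumption for illustration); it also shows `subset=`, which restricts the check to the `rating` column instead of every column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
sample = pd.DataFrame({
    "flavor": ["Classic Chocolate", "Mocha Bean", "Vanilla Bean"],
    "rating": [4.5, np.nan, 4.2],
    "customer_id": ["CUST001", "CUST005", "CUST003"],
})

# Count missing values in each column before removing anything
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop only the rows where the rating itself is missing
rated_only = sample.dropna(subset=["rating"])
print(rated_only)
```

Dropped rows leave gaps in the index labels, which is why the output above jumps from 3 to 6. Calling `reset_index(drop=True)` afterwards would renumber the rows if consecutive labels are preferred.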


## Question 2
For each milkshake flavor, calculate the average customer rating and append this as a new column to the milkshake_ratings DataFrame. Don't forget to clean the DataFrame first by dropping duplicate values.

## Solution

In [17]:
# Calculate average customer rating per flavor and add new column to dataframe
milkshake_ratings['avg_rating'] = milkshake_ratings.groupby('flavor')['rating'].transform('mean').round(2)
print(milkshake_ratings.head(10))

               flavor  rating customer_id rating_date  avg_rating
0   Classic Chocolate     4.5     CUST001  2024-07-05        4.17
1    Strawberry Swirl     3.8     CUST002  2024-07-10        4.24
2        Vanilla Bean     4.2     CUST003  2024-07-15        3.91
3     Caramel Delight     3.5     CUST004  2024-07-20        4.20
6   Classic Chocolate     5.0     CUST006  2024-08-01        4.17
7    Strawberry Swirl     4.0     CUST007  2024-08-02        4.24
8        Vanilla Bean     3.9     CUST008  2024-08-03        3.91
9     Caramel Delight     4.8     CUST009  2024-10-04        4.20
10         Mocha Bean     2.5     CUST010  2024-09-05        3.64
11  Classic Chocolate     4.7     CUST011  2024-10-06        4.17
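
The solution relies on `transform('mean')` rather than a plain grouped mean. The sketch below, on a small made-up DataFrame (an assumption for illustration), contrasts the two: the grouped mean returns one value per flavor, while `transform` returns one value per original row, which is what lets it be assigned directly as a new column.

```python
import pandas as pd

# Hypothetical sample data for illustration only
sample = pd.DataFrame({
    "flavor": ["Classic Chocolate", "Vanilla Bean", "Classic Chocolate", "Vanilla Bean"],
    "rating": [4.5, 4.2, 5.0, 3.8],
})

# Grouped mean: one value per flavor, indexed by flavor name
per_flavor = sample.groupby("flavor")["rating"].mean()

# transform: the same means, repeated so there is one value per original row
per_row = sample.groupby("flavor")["rating"].transform("mean")

print(per_flavor)
print(per_row)
```

Because `per_row` shares the index of `sample`, no merge step is needed to line the averages up with the individual ratings.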


## Question 3
For each row in the dataset, calculate the difference between that customer's rating and the average rating for the flavor. Don't forget to clean the DataFrame first by dropping duplicate values.

## Solution

In [21]:
# Calculate difference between customer and average ratings
milkshake_ratings['difference'] = milkshake_ratings['rating'] - milkshake_ratings['avg_rating']

print(milkshake_ratings.head(10))

               flavor  rating customer_id rating_date  avg_rating  difference
0   Classic Chocolate     4.5     CUST001  2024-07-05        4.17        0.33
1    Strawberry Swirl     3.8     CUST002  2024-07-10        4.24       -0.44
2        Vanilla Bean     4.2     CUST003  2024-07-15        3.91        0.29
3     Caramel Delight     3.5     CUST004  2024-07-20        4.20       -0.70
6   Classic Chocolate     5.0     CUST006  2024-08-01        4.17        0.83
7    Strawberry Swirl     4.0     CUST007  2024-08-02        4.24       -0.24
8        Vanilla Bean     3.9     CUST008  2024-08-03        3.91       -0.01
9     Caramel Delight     4.8     CUST009  2024-10-04        4.20        0.60
10         Mocha Bean     2.5     CUST010  2024-09-05        3.64       -1.14
11  Classic Chocolate     4.7     CUST011  2024-10-06        4.17        0.53
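
One way to sanity-check a difference column like this is to confirm that the differences within each flavor cancel out. The sketch below uses made-up data and, as an assumption that departs from the solution above, subtracts the unrounded mean; with the rounded `avg_rating`, the per-flavor sums may be slightly off from zero.

```python
import pandas as pd

# Hypothetical sample data for illustration only
sample = pd.DataFrame({
    "flavor": ["Classic Chocolate", "Vanilla Bean", "Classic Chocolate", "Vanilla Bean"],
    "rating": [4.5, 4.2, 5.0, 3.8],
})

# Subtract the unrounded per-flavor mean from each rating
flavor_mean = sample.groupby("flavor")["rating"].transform("mean")
sample["difference"] = sample["rating"] - flavor_mean

# Deviations from a group's own mean should cancel out within that group
check = sample.groupby("flavor")["difference"].sum()
print(check.round(6))
```

A positive `difference` means the customer rated the flavor above its average, and a negative one means below.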
