# RFP: Targeted Taco Bell Ads

## Project Overview
You are invited to submit a proposal that answers the following question:

### What ad will you create and why?

*Please submit your proposal by **1/30/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you will need to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Customer Demographics](https://drive.google.com/file/d/1HK42Oa3bhhRDWR1y1wVBDAQ2tbNwg1gS/view?usp=sharing)
- [Ad Response Data](https://drive.google.com/file/d/1cuLqXPNKhP66m5BP9BAlci2G--Vopt-Z/view?usp=sharing)

*Note, a level 5 dataset combines these two datasets.*

In [1]:
import pandas as pd

# Read each CSV into a separate dataframe
customer_data = pd.read_csv('customer_data.csv')
ad_data = pd.read_csv('ad_data.csv')

# Merge the dataframes based on 'customer_id'
combined_data = pd.merge(customer_data, ad_data, on='customer_id', how='inner')  # 'inner' ensures only matching customer_id rows are kept

# Show information about each dataframe
print("Customer Data Info:")
customer_data.info()
print("\nAd Data Info:")
ad_data.info()

# Show the columns of each dataframe
print("\nCustomer Data Columns:")
print(customer_data.columns)

print("\nAd Data Columns:")
print(ad_data.columns)

# Display combined data info
print("\nCombined Data Info:")
combined_data.info()

# Optionally, display the first few rows of the merged dataframe to inspect
print("\nCombined Data (First 5 rows):")
print(combined_data.head())


Customer Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  10000 non-null  int64  
 1   state        10000 non-null  object 
 2   sex          10000 non-null  object 
 3   age          10000 non-null  float64
 4   occupation   10000 non-null  object 
 5   family_size  10000 non-null  int64  
 6   income       10000 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 547.0+ KB

Ad Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customer_id       10000 non-null  int64 
 1   ad_type           10000 non-null  object
 2   ad_medium         10000 non-null  object
 3   ad_response       10000 non-null  bool  
 4   items_purchased   10000 non-

### 2. Training Your Model
In the cell seen below, write the code you need to train a K-means clustering model. Make sure you describe the center of each cluster found.

*Note, level 5 work uses at least 3 features to train a K-means model using only the Python standard library and Pandas. Level 4 work uses external libraries such as scikit-learn or NumPy.*
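
One thing to consider before clustering: K-means compares rows by Euclidean distance, so a feature measured in large units, such as `income`, can outweigh one measured in small units, such as `family_size`. The following is a minimal sketch of min-max scaling with Pandas alone. The toy values and the helper name `min_max_scale` are assumptions for illustration, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy rows; the column names follow the customer data above.
toy = pd.DataFrame({
    "age": [20.0, 40.0, 60.0],
    "family_size": [1, 3, 5],
    "income": [20000, 60000, 100000],
})


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to the range [0, 1]."""
    span = frame.max() - frame.min()
    # Replace a zero span with 1 so constant columns do not divide by zero.
    span = span.replace(0, 1)
    return (frame - frame.min()) / span


scaled = min_max_scale(toy)
print(scaled)
```

After scaling, each feature contributes on a comparable scale to the distance calculation. Cluster centers are then expressed in scaled units, so they may need converting back before they are described.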

In [4]:
import pandas as pd
from tqdm import tqdm  # Import tqdm for the loading bar

# Read in the customer_data and ad_data
customer_data = pd.read_csv('customer_data.csv')
ad_data = pd.read_csv('ad_data.csv')

# Merge customer_data and ad_data based on customer_id
combined_data = pd.merge(customer_data, ad_data, on='customer_id', how='inner')

# Inspect the columns of the combined data
print("Combined Data Columns:")
print(combined_data.columns)

# Function to apply clustering on a given dataframe
def cluster_data(dataframe, columns, k=10):
    print(f"\nClustering DataFrame with columns: {columns}")

    # Select only the columns that are present in the dataframe
    available_columns = [col for col in columns if col in dataframe.columns]
    print(f"Available Columns in DataFrame: {available_columns}")

    # Select the relevant columns
    features = dataframe[available_columns]

    # Check if 'state' column exists and filter the data to include only rows where the state is "CO"
    if 'state' in features.columns:
        # .copy() avoids a SettingWithCopyWarning when columns are filled in later
        features = features[features['state'] == 'CO'].copy()
    else:
        print("No 'state' column found. Skipping the state filter.")

    # Check if the filtered dataset is empty
    if features.empty:
        raise ValueError("No data available for the state of Colorado ('CO'). Please check the dataset.")

    # Drop the 'state' column, as it's no longer needed after filtering
    features = features.drop(columns=['state'], errors='ignore')

    # Fill missing data where applicable
    if 'ad_type' in features.columns:
        features['ad_type'] = features['ad_type'].fillna('Unknown')
    if 'ad_medium' in features.columns:
        features['ad_medium'] = features['ad_medium'].fillna('Unknown')
    if 'ad_response' in features.columns:
        features['ad_response'] = features['ad_response'].fillna(0)  # Assuming 0 means no response
    if 'items_purchased' in features.columns:
        features['items_purchased'] = features['items_purchased'].fillna(0)
    if 'drinks_purchased' in features.columns:
        features['drinks_purchased'] = features['drinks_purchased'].fillna(0)

    # One-hot encode categorical columns
    features_encoded = pd.get_dummies(features, drop_first=True)

    # Check for remaining missing data after filling
    print("Missing Data After Filling:")
    print(features_encoded.isnull().sum())

    # Drop rows with remaining missing data (if any)
    features_encoded = features_encoded.dropna()

    # Initialize the centroids by sampling k rows from the encoded data
    centroids = features_encoded.sample(n=k, random_state=42).values.tolist()

    # Euclidean distance function for clustering
    def euclidean_distance(point1, point2):
        return sum((p1 - p2) ** 2 for p1, p2 in zip(point1, point2)) ** 0.5

    # Run the K-means clustering algorithm with tqdm for the loading bar
    for _ in tqdm(range(100), desc="Clustering Iterations"):
        clusters = {i: [] for i in range(k)}
        for _, row in features_encoded.iterrows():
            distances = [euclidean_distance(row, centroid) for centroid in centroids]
            cluster_index = distances.index(min(distances))
            clusters[cluster_index].append(row.values)

        new_centroids = []
        for i in range(k):
            cluster_points = clusters[i]
            if cluster_points:
                new_centroids.append([sum(x) / len(x) for x in zip(*cluster_points)])
            else:
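                # Re-seed an empty cluster with a random row, stored as a list so the equality check below works
                new_centroids.append(features_encoded.sample(n=1).values[0].tolist())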

        # If centroids do not change, break the loop
        if centroids == new_centroids:
            break
        centroids = new_centroids

    # Output the final cluster centers with feature names
    feature_names = features_encoded.columns.tolist()
    for i, center in enumerate(centroids):
        print(f"Cluster {i + 1} center:")
        for feature, value in zip(feature_names, center):
            print(f"  {feature}: {value}")
        print()

columns = ['customer_id', 'state', 'sex', 'age', 'occupation', 'family_size',
           'income', 'ad_type', 'ad_medium', 'ad_response', 
           'items_purchased', 'drinks_purchased']

cluster_data(combined_data, columns)


Combined Data Columns:
Index(['customer_id', 'state', 'sex', 'age', 'occupation', 'family_size',
       'income', 'ad_type', 'ad_medium', 'ad_response', 'items_purchased',
       'drinks_purchased'],
      dtype='object')

Clustering DataFrame with columns: ['customer_id', 'state', 'sex', 'age', 'occupation', 'family_size', 'income', 'ad_type', 'ad_medium', 'ad_response', 'items_purchased', 'drinks_purchased']
Available Columns in DataFrame: ['customer_id', 'state', 'sex', 'age', 'occupation', 'family_size', 'income', 'ad_type', 'ad_medium', 'ad_response', 'items_purchased', 'drinks_purchased']
Missing Data After Filling:
customer_id                                                                                             0
age                                                                                                     0
family_size                                                                                             0
income                                              

Clustering Iterations:   7%|▋         | 7/100 [00:00<00:05, 15.57it/s]

Cluster 1 center:
  customer_id: 3545.5
  age: 41.75
  family_size: 2.875
  income: 109643.75
  ad_response: 0.625
  sex_M: 0.5
  occupation_Food Service: 0.0
  occupation_Government: 0.375
  occupation_Healthcare: 0.625
  occupation_IT: 0.0
  occupation_Other: 0.0
  occupation_Retail: 0.0
  occupation_Retired: 0.0
  occupation_Student: 0.0
  occupation_Unemployed: 0.0
  ad_type_BOGO - Garlic Steak Nacho Fries: 0.25
  ad_type_DISCOUNT-10%: 0.125
  ad_type_DISCOUNT-20%: 0.125
  ad_type_DISCOUNT-5%: 0.125
  ad_type_DISCOUNT-50%: 0.0
  ad_type_REWARD - Free Baja Blast with purchase of $20 or more: 0.125
  ad_type_REWARD - Free Garlic Steak Nacho Fries with purchase of $20 or more: 0.0
  ad_medium_15 sec YouTube ad: 0.25
  ad_medium_30 sec Hulu commercial: 0.125
  ad_medium_30 sec cable TV ad: 0.0
  ad_medium_Instagram photo ad: 0.375
  ad_medium_Newspaper ad: 0.0
  ad_medium_Static Facebook ad: 0.25
  items_purchased_['beefy 5 layer burrito', 'cheesy bean and rice burrito']: 0.0
  items_p




Cluster Two:

- Predominantly female
- Students
- Variety of ages
- Respond well to YouTube ads
- Families of 1-2
- Low income
- Enjoy drinks

Item to sell: Baja Blast!

### 3. Testing Your Model
In the cell seen below, write the code you need to test your K-means model. Then, interpret your findings.

*Note, level 5 testing uses both an elbow plot and a silhouette score to evaluate your model. Level 4 uses one or the other.*
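
As a rough illustration of the two measures named above, the sketch below computes inertia (the quantity plotted on the y-axis of an elbow plot) and the mean silhouette score for several values of k. It uses only the Python standard library. The toy points and the function names (`k_means`, `inertia`, `silhouette_score`) are assumptions for illustration, and the small K-means routine is a simplified stand-in rather than the model trained in the previous section.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two points given as lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=100, seed=42):
    """Basic K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda i: distance(p, centroids[i])) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(centroids[i])  # keep the old centroid if a cluster is empty
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(distance(p, centroids[label]) ** 2 for p, label in zip(points, labels))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; values near 1 suggest well-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a single-point cluster contributes 0 by convention
        # a: mean distance to the other points in the same cluster
        a = sum(distance(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(distance(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two visibly separate groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]

for k in (2, 3, 4):
    centroids, labels = k_means(points, k)
    print(k, round(inertia(points, centroids, labels), 3), round(silhouette_score(points, labels), 3))
```

Plotting inertia against k and looking for the point where the curve flattens gives the elbow. The k with the highest silhouette score is a second, independent suggestion for the number of clusters.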

#### Interpret your elbow plot and/or silhouette score here.

### 4. Final Answer

In the first cell seen below, describe the cluster you have chosen to target with your ad, making sure to describe the type of ad they were the most likely to respond to. Then, use software of your choosing to create the ad you will need to target this cluster. You do not need to create an ad for both the nacho fries and the Baja Blast. You can focus on one if that's what your cluster cares about most.

In the second cell seen below, include a link to your ad.

*Note, a level 5 ad uses the medium (static image or video) the cluster most likely responded to.* 
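
One possible way to find the medium a cluster responded to most is to compare response rates per `ad_medium` within that cluster. The sketch below assumes a hypothetical `cluster` column holding each customer's K-means label and uses made-up rows. `ad_medium` and `ad_response` follow the ad response data, and `best_medium` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical rows; `cluster` is an assumed column of K-means labels.
labeled = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1],
    "ad_medium": [
        "15 sec YouTube ad", "15 sec YouTube ad",
        "Newspaper ad", "Newspaper ad",
        "Newspaper ad", "Static Facebook ad",
    ],
    "ad_response": [True, True, False, True, False, True],
})


def best_medium(frame, cluster_id):
    """Return the ad medium with the highest response rate in one cluster."""
    members = frame[frame["cluster"] == cluster_id]
    # The mean of a boolean column is the share of True values.
    rates = members.groupby("ad_medium")["ad_response"].mean()
    return rates.idxmax(), rates.max()


medium, rate = best_medium(labeled, 0)
print(f"Cluster 0 responds best to: {medium} ({rate:.0%})")
```

The same grouping applied to `ad_type` would suggest which offer to feature in the ad.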

#### Describe the cluster you are targeting here.

#### Link your ad here.

In [3]:
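import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Load the penguins dataset
df = pd.read_csv("penguins.csv")

# Keep only the numeric measurement columns (drops 'sex' and any other text columns)
df = df.drop(columns=['sex'])
df = df.select_dtypes(include=['float64', 'int64'])

# Remove rows with missing measurements
df = df.dropna()

print(df.head())

# Hold out 10% of the rows for testing
train, test = train_test_split(df, test_size=0.1, random_state=42)

# Fit a 3-cluster K-means model on the training rows
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(train)

# Confirm the number of clusters
print(kmeans.n_clusters)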

   culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g
0              39.1             18.7              181.0       3750.0
1              39.5             17.4              186.0       3800.0
2              40.3             18.0              195.0       3250.0
4              36.7             19.3              193.0       3450.0
5              39.3             20.6              190.0       3650.0
3
