<a href="https://colab.research.google.com/github/zhouy185/Individual-Assignment-II/blob/main/Assignment_2_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Matthew Zychowicz

**Student Number**: 400145284

# Customer Data Visualization

Answer the following questions using the Nata Supermarkets dataset. Your goal is to use `plotly.express` to create visualizations and draw insights based on the data.

Before starting your work, upload your data file directly to Colab and load it as a pandas data frame.

**Note**: Recall that the dataset has missing values. Drop rows with missing values before you proceed to tasks.

In [1]:
import pandas as pd

df = pd.read_csv('NataData.csv')

# See rows with missing values
print(df[df.isnull().any(axis=1)])

# Drop rows with missing values
df = df.dropna()

print(df.head())



         ID  Year_Birth   Education Marital_Status  Income  Kidhome  Teenhome  \
10     1994        1983  Graduation        Married     NaN        1         0   
27     5255        1986  Graduation         Single     NaN        1         0   
43     7281        1959         PhD         Single     NaN        0         0   
48     7244        1951  Graduation         Single     NaN        2         1   
58     8557        1982  Graduation         Single     NaN        1         0   
71    10629        1973    2n Cycle        Married     NaN        1         0   
90     8996        1957         PhD        Married     NaN        2         1   
91     9235        1957  Graduation         Single     NaN        1         1   
92     5798        1973      Master       Together     NaN        0         0   
128    8268        1961         PhD        Married     NaN        0         1   
133    1295        1963  Graduation        Married     NaN        0         1   
312    2437        1989  Gra

## Task A (30 points)

In this question, we investigate how customers' online activity relates to their web purchasing behavior.

### Task A.1 (15 points)

i. Choose an appropriate chart type that can reveal **whether people who visit the website more often also make more purchases online**. In your chart, each data record corresponding to a customer should be represented by a **point**, the coordinates of which represent the *number of website visits last month* and the *number of online purchases*, respectively.
  * Develop the chart with visual encodings --  use **size** to encode customers' **income** and **color** to encode the **campaign response** (whether accepted any campaigns) in your chart.
  * When a user hovers their mouse on elements in the chart, let the chart show relevant information on *income*, *education*, and *marital status* of the corresponding data record.

ii. Further, answer the following questions in a **text (markdown) cell** and make sure to explain your answer based on the chart:
  * Do more frequent site visitors tend to buy more?
  * How does income affect online purchases?
  * Are campaign responders visually distinguishable? If so, how are they different?


**Hint**: You can use the `size_max=` argument to adjust the scale of the point sizes and make the size differences easier to see.

In [2]:
import plotly.express as px

# Convert Response to string to make it categorical
df['Response'] = df['Response'].astype(str)

# Create a scatter plot of website visits vs. online purchases
fig = px.scatter(df, x='NumWebVisitsMonth', y='NumWebPurchases', 
    size='Income', 
    facet_col='Response',  # Separate plots due to overlapping of points issue
    hover_data=['Income', 'Education', 'Marital_Status'],
    size_max=40,
    color='Response',
    color_discrete_map={'0': '#FF6B6B', '1': '#4ECDC4'})
    
fig.update_traces(marker_symbol=None)
fig.update_layout(title='Website Visits vs. Online Purchases by Response')
fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### Do more frequent site visitors tend to buy more?

Yes, there is a slight positive relationship between website visits and online purchases. As the number of website visits increases from 0 to around 7-8 visits per month, online purchases tend to increase modestly. However, the relationship is weak - the increase in purchases is minimal even with more visits.
</div>
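
A visual impression of a "slight positive relationship" can be cross-checked numerically. The sketch below is a minimal illustration on invented toy values, not the Nata data; only the column names `NumWebVisitsMonth` and `NumWebPurchases` come from the notebook. It computes a Pearson correlation and the median purchase count at each visit level.

```python
import pandas as pd

# Toy stand-in for the Nata data; the values are invented for illustration.
toy = pd.DataFrame({
    "NumWebVisitsMonth": [1, 2, 3, 4, 5, 6, 7, 8],
    "NumWebPurchases":   [2, 1, 3, 2, 4, 3, 5, 4],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend.
r = toy["NumWebVisitsMonth"].corr(toy["NumWebPurchases"])

# Median purchases at each visit count smooths out overplotted points.
median_by_visits = toy.groupby("NumWebVisitsMonth")["NumWebPurchases"].median()

print(f"Pearson r = {r:.2f}")
print(median_by_visits)
```

Running the same two lines on `df` would give a number to quote alongside the chart-based reading.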

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### How does income affect online purchases?

Income has a strong positive effect on online purchases. The larger bubbles (representing higher income) are concentrated in the upper portions of both charts, indicating that higher-income customers make significantly more online purchases. This shows income is a key predictor of purchasing behavior regardless of campaign response.
</div>

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### Are campaign responders visually distinguishable? If so, how are they different?

Campaign responders are NOT clearly distinguishable based on website visits and online purchases alone. The two panels overlap substantially: both groups span roughly 2-10 website visits and 0-11 online purchases, with most customers making 3-11 purchases at similar visit frequencies and a further cluster at lower purchase levels. The concentration and patterns look very similar between Response=1 and Response=0, which suggests that website visit frequency and purchase count alone are not strong differentiators of campaign response.
</div>

### Task A.2 (15 points)

i. Use the same type of chart as in A.1 to explore whether wealthier customers necessarily spend more overall.
* In your chart, the value of the $x$ and $y$ axes should represent *income* and *total spending (across all types of products)*, respectively.
* Use **size** to visually encode **number of online purchases** and **color** to encode **campaign response** (whether the customer accepted any campaign).
* When a user hovers their mouse on elements in the chart, let the chart show relevant information on *income*, *education*, and *marital status* of the corresponding data record.

ii. Further, answer the following questions in a **text (markdown) cell** and make sure to explain your answer based on the chart:

  * What is the relationship between income and spending?

  * Do campaign responders cluster in particular regions of the plot?

  * For those who purchased more times online, do they tend to spend more or less?

**Hint**: You may use `range_x=[start_value,end_value]`  to adjust the visible range on the x-axis for better visualizing the distribution of the points.

In [3]:
import plotly.express as px

# Calculate total spending across all product types
df['TotalSpending'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + 
                       df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

# Create scatter plot
fig = px.scatter(df, x='Income', y='TotalSpending', 
                 size='NumWebPurchases', 
                 color='Response',
                 facet_col='Response', # Separate plots due to overlapping of points issue
                 hover_data=['Income', 'Education', 'Marital_Status'],
                 size_max=40,
                 # Response was cast to str in the previous cell, so the keys must be strings
                 color_discrete_map={'0': 'green', '1': 'purple'})

# Add title and axis labels
fig.update_layout(
    title='Income vs. Total Spending',
    yaxis_title='Total Spending ($)',
    legend_title='Campaign Response'
)

# Label the x-axis of every facet and set axis ranges for better visualization
fig.update_xaxes(title_text='Income ($)', range=[0, 120000])
fig.update_yaxes(range=[0, 3000])

fig.update_traces(marker_symbol=None)

fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### What is the relationship between income and spending?

There is a strong positive relationship between income and total spending. As income increases, spending increases proportionally for both campaign responders and non-responders. Customers earning below $50k typically spend under $1,000, while those earning $80k+ often spend over $2,000. This linear relationship reflects that higher-income customers have greater disposable income for purchases.

</div>
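
The income thresholds quoted above can be checked by binning income and summarizing spending per band. This is a sketch on invented values; the band edges and labels are illustrative assumptions, and only the column names `Income` and `TotalSpending` come from the notebook.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Income":        [20000, 35000, 48000, 62000, 75000, 90000],
    "TotalSpending": [60, 150, 400, 900, 1500, 2100],
})

# Band edges are an illustrative choice, not taken from the assignment.
bands = pd.cut(
    toy["Income"],
    bins=[0, 50000, 80000, 120000],
    labels=["up to 50k", "50k-80k", "above 80k"],
)

# observed=True keeps only bands that actually contain customers.
summary = toy.groupby(bands, observed=True)["TotalSpending"].agg(["count", "median"])
print(summary)
```

A median that rises steadily from band to band supports the "positive relationship" reading; the `count` column shows how many customers each median rests on.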

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### Do campaign responders cluster in particular regions of the plot?

No, campaign responders do not show distinct clustering patterns. The Response=0 (green) and Response=1 (purple) distributions are similar, covering comparable income ranges ($20k-$100k) and spending levels ($500-$2,500) with similar spread across the plot. The two panels look nearly identical, indicating that campaign response is not strongly associated with specific income-spending combinations. This suggests factors other than income and spending levels may be more important in determining campaign response.

</div>

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### For those who purchased more times online, do they tend to spend more or less?

Customers who make more online purchases tend to spend more overall. There is a positive correlation between online purchase frequency and total spending, with customers making 6-10 online purchases generally spending more than those with 0-3 purchases.

</div>

**Hint**: In this exercise, you may need to use the `df.melt()` function to convert a wide table into a long table. You can find an example of this function in the assignment repository (see **melt_example.ipynb**).
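
As a small illustration of what `melt()` does, the sketch below reshapes a two-row wide table into a long one. The numbers are invented; the column names mirror those used in this notebook.

```python
import pandas as pd

# Wide table: one row per group, one column per product category (invented values).
wide = pd.DataFrame({
    "Education": ["Graduation", "PhD"],
    "MntWines":  [300, 400],
    "MntFruits": [30, 20],
})

# Long table: one row per (group, product category) pair.
long = wide.melt(
    id_vars="Education",
    value_vars=["MntWines", "MntFruits"],
    var_name="Product_Category",
    value_name="Total_Spending",
)
print(long)
```

The long layout is what lets a single `color='Product_Category'` argument split each bar by category.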

### Task B.1 (15 points)

i. Consider that customers are grouped by their **education levels**. Develop a bar chart with several sets of bars, in which:

* Each set of bars (i.e., bars of the same color) represents the spending by customers in different groups on a specific product category. For example, spending by customers "Graduation", "Master", "PhD", ..., on *fish products*.  
* The chart should allow you to easily visualize both the **total spending** by customers in a group on all product categories and the **composition** of their spending on those product categories.

ii. Further, answer the following questions in a **text (markdown) cell** and make sure to explain your answer based on the chart:
* Which customer group contributes the highest revenue?

**Hint**: By default, each row in the dataset is shown as a separate rectangle in the bar, and these rectangles have borders that can make the chart look cluttered. In that case, use `fig.update_traces(opacity=1, marker_line_width=0)` to remove the border lines.

In [4]:
import plotly.express as px

# Product spending columns
product_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 
                   'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

# Calculate total spending by education level for each product category
education_spending = df.groupby('Education')[product_columns].sum().reset_index()

# Reshape data to long format for plotting
education_spending_long = education_spending.melt(
    id_vars='Education',
    value_vars=product_columns,
    var_name='Product_Category',
    value_name='Total_Spending'
)

# Clean up product category names (remove 'Mnt' prefix)
education_spending_long['Product_Category'] = education_spending_long['Product_Category'].str.replace('Mnt', '')

# Create stacked bar chart
fig = px.bar(
    education_spending_long,
    x='Education',
    y='Total_Spending',
    color='Product_Category',
    title='Customer Spending by Education Level Across Product Categories',
    labels={
        'Total_Spending': 'Total Spending ($)',
        'Education': 'Education Level',
        'Product_Category': 'Product Category'
    },
    color_discrete_sequence=px.colors.qualitative.Set1
)

fig.update_layout(
    xaxis_tickangle=-45,
    height=600,
    showlegend=True,
    legend_title_text='Product Category',
    barmode='stack'
)

fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### Which customer group contributes the highest revenue?

Graduation customers contribute the highest revenue with approximately $690,000 in total spending, significantly exceeding all other education levels (PhD: $320,000, Master: $220,000, 2n Cycle: $100,000, Basic: $2,000).

</div>
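
A group total depends on both how many customers the group has and how much each one spends. The sketch below, on invented values, separates the two by reporting the customer count, the total, and the per-customer mean side by side.

```python
import pandas as pd

# Invented values: a large group of moderate spenders and a small group of heavy spenders.
toy = pd.DataFrame({
    "Education":     ["Graduation", "Graduation", "Graduation", "PhD", "PhD"],
    "TotalSpending": [500, 600, 700, 900, 800],
})

by_group = toy.groupby("Education")["TotalSpending"].agg(
    customers="count", total="sum", mean="mean"
)
print(by_group)
print("Highest total:", by_group["total"].idxmax())
print("Highest mean: ", by_group["mean"].idxmax())
```

In this toy case the largest group has the highest total while the smaller group has the higher mean, which is why the group size is worth reporting next to revenue.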

### Task B.2 (15 points)

i. Consider that customers are grouped by their **marital statuses**. Develop a bar chart with several sets of bars, in which:

* Each set of bars (bars of the same color) represents the spending by customers from different groups on a specific product category. For example, customers of "single", "married", ..., on Fruits.
* Within each group, the chart should allow you to easily compare the spending of customers on different product categories (e.g., identify the highest spending category by single customers).

ii. Further, answer the following questions in a **text (markdown) cell** and make sure to explain your answer based on the chart:
* For each marital status group, which category of product do they spend the highest amount on?
* Does the marital status affect customers' spending distribution across the product categories?  How?

In [5]:
import plotly.express as px

# Product spending columns
product_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 
                   'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

# Calculate total spending by marital status for each product category
marital_spending = df.groupby('Marital_Status')[product_columns].sum().reset_index()

# Reshape data to long format for plotting
marital_spending_long = marital_spending.melt(
    id_vars='Marital_Status',
    value_vars=product_columns,
    var_name='Product_Category',
    value_name='Total_Spending'
)

# Clean up product category names (remove 'Mnt' prefix)
marital_spending_long['Product_Category'] = marital_spending_long['Product_Category'].str.replace('Mnt', '')

# Create GROUPED bar chart
fig = px.bar(
    marital_spending_long,
    x='Marital_Status',
    y='Total_Spending',
    color='Product_Category',
    barmode='group',  # This creates side-by-side bars
    title='Customer Spending by Marital Status Across Product Categories',
    labels={
        'Total_Spending': 'Total Spending ($)',
        'Marital_Status': 'Marital Status',
        'Product_Category': 'Product Category'
    },
    color_discrete_sequence=px.colors.qualitative.Set1
)

fig.update_layout(
    xaxis_tickangle=-45,
    height=600,
    showlegend=True,
    legend_title_text='Product Category'
)

fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### For each marital status group, which category of product do they spend the highest amount on?

Wines is the highest spending category for ALL marital status groups without exception. Married customers lead with approximately $256,000 on wines, followed by Together ($176,000), Single ($137,000), Divorced ($75,000), and Widow ($28,000). The Wines bars are clearly the tallest across every marital status, indicating wine products generate the most revenue regardless of relationship status. Meat Products rank second in most groups.

</div>

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### Does the marital status affect customers' spending distribution across the product categories? How?

Marital status affects spending volume but has little visible effect on how spending is distributed across categories. Married customers show the highest absolute spending in every category, followed by Together and Single customers, while Divorced and Widow customers spend considerably less. Because the bars show group totals, this gap may reflect the number of customers in each group as much as household income and size. Spending proportions remain consistent across all groups, with Wines dominating, Meat Products ranking second, and the other categories following in a similar order. Marital status therefore relates mainly to how much customers spend, not to what they buy, which suggests product preferences are driven by variables other than relationship status.

</div>
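
Whether the spending distribution differs between groups is easier to judge from shares than from raw totals. The sketch below uses invented totals and divides each group's row by its own sum, so every group adds up to 1.0.

```python
import pandas as pd

# Invented group totals: Married spends twice as much as Single in every category.
totals = pd.DataFrame(
    {"Wines": [600, 300], "MeatProducts": [300, 150], "Fruits": [100, 50]},
    index=pd.Index(["Married", "Single"], name="Marital_Status"),
)

# Divide each row by its own sum so the values become within-group shares.
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares.round(2))
```

Here the two rows of shares are identical even though the totals differ, which is the pattern "volume changes, distribution does not" would produce.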

## Part C (30 points)

In this part, let us investigate variability in customers' online purchasing behavior.

i. We will use a visualization chart that shows distribution of customers' online purchases. The chart should:
* Show multiple distributions, each being the distribution of number of online purchases by customers of a certain generation (Gen X vs. Millennials,  ignore other generations)
* Allow users to observe the 1st, 2nd, and 3rd quartiles of the distribution, along with any outliers.

ii. Based on the chart, answer the following question:
* How would you comment on the pattern of online purchases across different generations?

iii. In your chart, further divide each distribution into subgroups of customers based on the **number of kids at home** (not including teens).
How would you comment on the pattern of online purchases across the different subgroups of customers?




In [6]:
import plotly.express as px
import pandas as pd

# Define generations based on birth year
# Gen X: Born 1965-1980, Millennials: Born 1981-1996
df['Generation'] = pd.cut(
    df['Year_Birth'],
    bins=[1964, 1980, 1996, 2100],
    labels=['Gen X', 'Millennials', 'Other']
)

# Filter for only Gen X and Millennials
df_gen = df[df['Generation'].isin(['Gen X', 'Millennials'])]

# Create box plot
fig = px.box(
    df_gen,
    x='Generation',
    y='NumWebPurchases',
    title='Distribution of Online Purchases by Generation',
    color='Generation'
)

fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### How would you comment on the pattern of online purchases across different generations?

Gen X and Millennials show similar central tendencies but different variability in online purchasing behavior. Both generations have comparable median values of approximately 3 online purchases, and their interquartile ranges overlap considerably, indicating the middle 50% of customers shop online with similar frequency. However, Gen X shows greater variability with a higher upper quartile (Q3 around 6 vs. 5 for Millennials) and more extreme outliers, including power users making up to 25 purchases compared to Millennials' maximum of 11. While typical behavior is similar, Gen X demonstrates more diverse shopping patterns while Millennials show more uniform habits.

</div>
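
The quartiles and outliers read off a box plot can also be computed directly. This sketch uses invented values and the common 1.5 × IQR rule for the upper fence; pandas interpolates quantiles linearly by default, so its numbers may differ slightly from those plotly draws.

```python
import pandas as pd

# Invented purchase counts, six customers per generation.
toy = pd.DataFrame({
    "Generation": ["Gen X"] * 6 + ["Millennials"] * 6,
    "NumWebPurchases": [1, 2, 3, 4, 6, 25, 1, 2, 3, 4, 5, 11],
})

# One row per generation, one column per quartile.
q = (
    toy.groupby("Generation")["NumWebPurchases"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
q.columns = ["Q1", "median", "Q3"]

# Points above Q3 + 1.5 * IQR are the ones a box plot marks as outliers.
q["IQR"] = q["Q3"] - q["Q1"]
q["upper_fence"] = q["Q3"] + 1.5 * q["IQR"]
print(q)
```

Comparing each group's maximum against `upper_fence` shows which extreme values count as outliers under this rule.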

In [7]:
import plotly.express as px

# Create box plot with kids at home subdivision
fig = px.box(
    df_gen,
    x='Generation',
    y='NumWebPurchases',
    color='Kidhome',
    title='Distribution of Online Purchases by Generation and Number of Kids at Home',
    labels={
        'NumWebPurchases': 'Number of Web Purchases',
        'Generation': 'Generation',
        'Kidhome': 'Number of Kids at Home'
    }
)

fig.show()

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

### How would you comment on the pattern of online purchases across the different subgroups of customers?

Having children dramatically reduces online purchase frequency, and this effect is consistent across both generations. Customers with 0 kids show the highest online shopping activity with medians around 4-5 purchases for both Gen X and Millennials. This drops by 50-60% to approximately 2 purchases when customers have 1 child, and remains similar with 2 kids. The number of kids is a stronger predictor than generation, as the difference between childless customers and parents is much larger than between Gen X and Millennials. The major behavioral shift occurs at the one child threshold, suggesting that once parents adapt to having children, their shopping behavior stabilizes regardless of additional children.

</div>
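
The subgroup medians described above can be tabulated with a two-key `groupby`. The values below are invented for illustration; the column names `Generation`, `Kidhome`, and `NumWebPurchases` follow the notebook.

```python
import pandas as pd

# Invented rows: two customers in each generation/kids combination.
toy = pd.DataFrame({
    "Generation": ["Gen X", "Gen X", "Gen X", "Gen X",
                   "Millennials", "Millennials", "Millennials", "Millennials"],
    "Kidhome": [0, 0, 1, 1, 0, 0, 1, 1],
    "NumWebPurchases": [4, 6, 2, 2, 5, 3, 1, 3],
})

# Count and median for every (generation, kids) subgroup.
table = (
    toy.groupby(["Generation", "Kidhome"])["NumWebPurchases"]
    .agg(["count", "median"])
)

# Pivot the medians so generations are rows and kid counts are columns.
medians = table["median"].unstack("Kidhome")
print(medians)
```

Reading across a row shows the effect of kids within a generation, and reading down a column shows the generation gap at a fixed number of kids. The `count` column in `table` indicates how many customers each median is based on.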

## Part D (2 points)

Briefly describe how you used Gen. AI in this assignment.

<div style="background-color:#004d00; color:white; padding:12px; border-radius:6px;">

I used Gen AI for code completions after writing comments that described what I wanted it to do. I edited the completions (because they aren't perfect) and made sure I understood the code. My IDE of choice is Cursor, and I used its built-in AI functionality with a Pro subscription.

</div>

**Note**: The remaining 8 points will be assigned to readability of your submission.