# Week 2: Categorical Data Analysis

**Course:** Statistics I (BSMA1002)  
**Level:** Foundation  
**Week:** 2

## Topics Covered
1. Frequency distributions and tables
2. Bar charts and pie charts
3. Contingency tables and cross-tabulation
4. Relative frequencies and percentages
5. Mode for categorical data

## Learning Objectives
- Organize categorical data using frequency distributions
- Create effective visualizations for categorical variables
- Analyze relationships between categorical variables
- Calculate and interpret frequency statistics


In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization settings
# Requires Matplotlib >= 3.6; older releases name this style 'seaborn-darkgrid'
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print('✓ Libraries imported successfully')

## 1. Frequency Distributions

### Definition
A **frequency distribution** shows how often each category or value appears in a dataset.

### Types of Frequencies

1. **Absolute Frequency**: Count of observations in each category
   $$f_i = \text{count of category } i$$

2. **Relative Frequency**: Proportion of observations in each category
   $$\text{RF}_i = \frac{f_i}{n} \text{ where } n = \sum f_i$$

3. **Percentage Frequency**: Relative frequency as percentage
   $$\text{PF}_i = \text{RF}_i \times 100\%$$

4. **Cumulative Frequency**: Running total of frequencies; meaningful only when the categories have a natural order (ordinal data)
   $$\text{CF}_i = \sum_{j=1}^{i} f_j$$

### Properties
- Sum of all frequencies equals total observations: $\sum f_i = n$
- Sum of relative frequencies equals 1: $\sum \text{RF}_i = 1$
- Sum of percentage frequencies equals 100%: $\sum \text{PF}_i = 100\%$
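
The four frequency types and the three properties above can be checked by hand on a very small dataset. The sketch below uses only the standard library; the `responses` list is hypothetical toy data, not part of the course dataset, and the alphabetical ordering used for the cumulative column is chosen purely for illustration.

```python
from collections import Counter

# Hypothetical toy data: ten survey answers (illustrative only)
responses = ['Yes', 'No', 'Yes', 'Maybe', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Maybe']

counts = Counter(responses)      # absolute frequencies f_i
n = sum(counts.values())         # total number of observations n

rows = []
running_total = 0
for category in sorted(counts):  # alphabetical order: Maybe, No, Yes
    f = counts[category]
    running_total += f           # cumulative frequency CF_i
    rows.append((category, f, f / n, 100 * f / n, running_total))

for category, f, rf, pf, cf in rows:
    print(f'{category:6s} f={f}  RF={rf:.2f}  PF={pf:.1f}%  CF={cf}')

# Totals used to check the three properties listed above
total_f = sum(row[1] for row in rows)
total_rf = sum(row[2] for row in rows)
total_pf = sum(row[3] for row in rows)
print(f'sum f = {total_f}, sum RF = {total_rf:.2f}, sum PF = {total_pf:.1f}%')
```

The last cumulative value always equals $n$, which is a quick sanity check on any frequency table.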


In [None]:
# Create sample customer data
categories = ['Electronics', 'Clothing', 'Food', 'Books', 'Sports']
purchases = np.random.choice(categories, size=200, 
                            p=[0.25, 0.30, 0.20, 0.15, 0.10])

# Create frequency distribution
freq_dist = pd.Series(purchases).value_counts().sort_index()
rel_freq = freq_dist / len(purchases)
pct_freq = rel_freq * 100
# Categories are nominal and sorted alphabetically, so the running total is illustrative only
cum_freq = freq_dist.cumsum()

# Create comprehensive frequency table
freq_table = pd.DataFrame({
    'Frequency': freq_dist,
    'Relative Freq': rel_freq.round(3),
    'Percentage': pct_freq.round(1),
    'Cumulative': cum_freq
})

print('Frequency Distribution Table:')
print(freq_table)
print(f'\nTotal observations: {len(purchases)}')
print(f'Verification: Sum of frequencies = {freq_dist.sum()}')
print(f'Verification: Sum of relative frequencies = {rel_freq.sum():.3f}')

## 2. Categorical Data Visualizations

### Bar Charts
- Display frequencies using rectangular bars
- Height represents frequency
- Bars can be vertical or horizontal
- Categories are discrete (gaps between bars)

**When to use:**
- Comparing frequencies across categories
- Nominal or ordinal data
- More than 2-3 categories

### Pie Charts
- Display relative frequencies as slices of a circle
- Slice angle = $\frac{f_i}{n} \times 360^\circ$
- Shows parts of a whole

**When to use:**
- Showing proportions
- 5-7 categories maximum
- Emphasizing percentage composition

**Best Practices:**
- Use bar charts for precise comparisons
- Use pie charts for showing composition
- Avoid 3D effects (distort perception)
- Order categories logically
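
The slice-angle formula from the pie chart notes can be worked through by hand. This is a minimal sketch with hypothetical frequencies (the `freqs` dictionary is made up for illustration); it only does the arithmetic and does not draw a chart.

```python
# Hypothetical frequencies for four categories (illustrative only)
freqs = {'A': 50, 'B': 30, 'C': 15, 'D': 5}
n = sum(freqs.values())

# Slice angle = (f_i / n) * 360 degrees
angles = {cat: 360 * f / n for cat, f in freqs.items()}

for cat, angle in angles.items():
    print(f'{cat}: {freqs[cat]:3d} observations -> {angle:5.1f} degrees')

# The slices must fill the whole circle
total_angle = sum(angles.values())
print(f'Total: {total_angle:.1f} degrees')
```

Category D gets only 18 degrees here, which illustrates why small categories are hard to read on a pie chart and why a bar chart is preferred for precise comparisons.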


In [None]:
# Create bar chart visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vertical bar chart
freq_dist.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Purchase Categories - Bar Chart', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Product Category', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(freq_dist.values):
    axes[0].text(i, v + 1, str(v), ha='center', va='bottom', fontweight='bold')

# Horizontal bar chart with percentages
pct_freq.sort_values().plot(kind='barh', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Purchase Categories - Percentage', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Percentage (%)', fontsize=11)
axes[1].set_ylabel('Product Category', fontsize=11)
axes[1].grid(axis='x', alpha=0.3)

# Add percentage labels
for i, v in enumerate(pct_freq.sort_values().values):
    axes[1].text(v + 0.5, i, f'{v:.1f}%', ha='left', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

print('Most popular category:', freq_dist.idxmax())
print(f'with {freq_dist.max()} purchases ({pct_freq.max():.1f}%)')

In [None]:
# Create pie chart
fig, ax = plt.subplots(figsize=(10, 7))

colors = plt.cm.Set3(range(len(freq_dist)))
wedges, texts, autotexts = ax.pie(freq_dist.values, 
                                    labels=freq_dist.index,
                                    autopct='%1.1f%%',
                                    colors=colors,
                                    startangle=90,
                                    explode=[0.05 if i == freq_dist.argmax() else 0 
                                            for i in range(len(freq_dist))])

# Enhance text formatting
for text in texts:
    text.set_fontsize(12)
    text.set_fontweight('bold')

for autotext in autotexts:
    autotext.set_color('black')  # dark text stays readable on the pastel Set3 colors
    autotext.set_fontsize(10)
    autotext.set_fontweight('bold')

ax.set_title('Distribution of Purchase Categories', fontsize=14, fontweight='bold', pad=20)
plt.axis('equal')
plt.show()

print('\nPie chart shows the composition of purchases across categories')
print(f'Largest slice: {freq_dist.idxmax()} ({pct_freq.max():.1f}%)')

## 3. Contingency Tables (Cross-Tabulation)

### Definition
A **contingency table** displays the frequency distribution of two or more categorical variables simultaneously.

### Structure
- Rows represent categories of one variable
- Columns represent categories of another variable
- Cells contain joint frequencies
- Margins contain row/column totals

### Joint Frequency
Count of observations in cell $(i,j)$:
$$n_{ij} = \text{count where row } = i \text{ and column } = j$$

### Marginal Frequency
- Row total: $n_{i\cdot} = \sum_j n_{ij}$
- Column total: $n_{\cdot j} = \sum_i n_{ij}$
- Grand total: $n = \sum_i \sum_j n_{ij}$
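
To make the notation concrete, the joint and marginal frequencies can be counted directly from a list of paired observations. The sketch below uses only the standard library; the `pairs` data and the category labels are hypothetical and chosen so the arithmetic is easy to follow.

```python
from collections import Counter

# Hypothetical (row, column) observations: (age group, membership)
pairs = [
    ('Young', 'Basic'), ('Young', 'Basic'), ('Young', 'Premium'),
    ('Older', 'Basic'), ('Older', 'Premium'), ('Older', 'Premium'),
    ('Older', 'Premium'), ('Young', 'Basic'),
]

joint = Counter(pairs)                         # joint frequencies n_ij
row_totals = Counter(r for r, _ in pairs)      # row marginals n_i.
col_totals = Counter(c for _, c in pairs)      # column marginals n_.j
grand_total = len(pairs)                       # grand total n

rows = sorted(row_totals)
cols = sorted(col_totals)

# Print the table with margins
print(f"{'':8s}" + ''.join(f'{c:>9s}' for c in cols) + f"{'Total':>9s}")
for r in rows:
    cells = ''.join(f'{joint[(r, c)]:>9d}' for c in cols)
    print(f'{r:8s}{cells}{row_totals[r]:>9d}')
print(f"{'Total':8s}" + ''.join(f'{col_totals[c]:>9d}' for c in cols)
      + f'{grand_total:>9d}')
```

Each row total is the sum of the cells in that row, each column total is the sum of the cells in that column, and both sets of margins add up to the same grand total.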

### Applications
- Analyze relationships between categorical variables
- Customer segmentation analysis
- Survey data analysis
- Market research


In [None]:
# Create customer segmentation data
n_customers = 300
age_groups = np.random.choice(['18-25', '26-35', '36-50', '51+'], size=n_customers, 
                              p=[0.20, 0.35, 0.30, 0.15])
membership = np.random.choice(['Basic', 'Premium', 'VIP'], size=n_customers,
                             p=[0.50, 0.35, 0.15])

# Create DataFrame
customer_df = pd.DataFrame({
    'Age Group': age_groups,
    'Membership': membership
})

# Create contingency table with margins
contingency = pd.crosstab(customer_df['Age Group'], 
                         customer_df['Membership'], 
                         margins=True, 
                         margins_name='Total')

print('Contingency Table: Age Group vs Membership Type')
print('='*60)
print(contingency)
print('\n')

# Calculate and display proportions
print('Proportion Table (%):')
print('='*60)
prop_table = pd.crosstab(customer_df['Age Group'], 
                        customer_df['Membership'], 
                        normalize='all') * 100
print(prop_table.round(1))

In [None]:
# Create stacked bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Remove 'Total' row for visualization
contingency_no_total = contingency.drop('Total').drop('Total', axis=1)

# Stacked bar chart
contingency_no_total.plot(kind='bar', stacked=True, ax=axes[0], 
                         color=['#FF6B6B', '#4ECDC4', '#45B7D1'],
                         edgecolor='black')
axes[0].set_title('Membership Distribution by Age Group', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Age Group', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].legend(title='Membership', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].grid(axis='y', alpha=0.3)
plt.setp(axes[0].xaxis.get_majorticklabels(), rotation=45)

# Grouped bar chart
contingency_no_total.plot(kind='bar', ax=axes[1], 
                         color=['#FF6B6B', '#4ECDC4', '#45B7D1'],
                         edgecolor='black')
axes[1].set_title('Membership Comparison Across Age Groups', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Age Group', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].legend(title='Membership', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].grid(axis='y', alpha=0.3)
plt.setp(axes[1].xaxis.get_majorticklabels(), rotation=45)

plt.tight_layout()
plt.show()

## 4. Mode for Categorical Data

### Definition
The **mode** is the category that appears most frequently in the dataset.

### Properties
- The only measure of central tendency that is valid for nominal data
- Can be computed for every data type (nominal, ordinal, and numerical)
- May have multiple modes (multimodal)
- Not affected by extreme values

### Types
- **Unimodal**: One mode
- **Bimodal**: Two modes
- **Multimodal**: More than two modes

### Formula
$$\text{Mode} = \text{category with maximum frequency}$$
$$\text{Mode} = \arg\max_i f_i$$
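
Because $\arg\max$ can be attained by more than one category, a mode calculation should be able to return ties. The following is a minimal sketch using the standard library; `find_modes` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter

def find_modes(values):
    """Return every category tied for the highest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        return []                      # no data, no mode
    top = max(counts.values())         # maximum frequency
    return sorted(cat for cat, f in counts.items() if f == top)

unimodal = find_modes(['red', 'blue', 'red', 'green'])
bimodal = find_modes(['red', 'blue', 'red', 'blue', 'green'])

print('Unimodal example:', unimodal)
print('Bimodal example: ', bimodal)
```

In pandas, `Series.mode()` likewise returns all tied categories, whereas `value_counts().idxmax()` reports only one of them, so the choice matters when ties are possible.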

### Applications
- Most popular product
- Most common customer type
- Typical response in surveys


In [None]:
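# Calculate mode for different categorical variables
# pandas is used here because recent SciPy releases no longer accept
# non-numeric (string) arrays in scipy.stats.mode

# Product categories mode
purchase_counts = pd.Series(purchases).value_counts()
product_mode = purchase_counts.idxmax()        # most frequent category
product_mode_count = purchase_counts.max()     # its frequency
print('Product Category Analysis:')
print(f'Mode: {product_mode}')
print(f'Frequency: {product_mode_count}')
print(f'Percentage: {(product_mode_count / len(purchases)) * 100:.1f}%')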

print('\n' + '='*60 + '\n')

# Membership mode by age group
print('Mode Membership Type by Age Group:')
for age in sorted(customer_df['Age Group'].unique()):
    age_data = customer_df[customer_df['Age Group'] == age]['Membership']
    mode_membership = age_data.mode().values[0]
    mode_count = (age_data == mode_membership).sum()
    mode_pct = (mode_count / len(age_data)) * 100
    print(f'{age:8s}: {mode_membership:8s} ({mode_count:2d} out of {len(age_data):2d}, {mode_pct:.1f}%)')

## 5. Real-World Application: E-commerce Analytics

### Business Problem
An e-commerce company wants to analyze customer purchase patterns to:
- Identify popular product categories
- Understand customer segmentation
- Optimize inventory and marketing strategies

### Analysis Approach
1. Frequency analysis of product categories
2. Cross-tabulation of customer demographics and purchases
3. Visualization of purchase patterns
4. Identify modes and trends


In [None]:
# Generate comprehensive e-commerce dataset
np.random.seed(123)
n_transactions = 500

ecommerce_data = pd.DataFrame({
    'Product_Category': np.random.choice(
        ['Electronics', 'Fashion', 'Home', 'Sports', 'Books'], 
        size=n_transactions, p=[0.30, 0.25, 0.20, 0.15, 0.10]
    ),
    'Customer_Type': np.random.choice(
        ['New', 'Regular', 'VIP'], 
        size=n_transactions, p=[0.40, 0.45, 0.15]
    ),
    'Payment_Method': np.random.choice(
        ['Credit Card', 'Debit Card', 'PayPal', 'Cash'], 
        size=n_transactions, p=[0.40, 0.30, 0.20, 0.10]
    )
})

# Comprehensive analysis
print('E-COMMERCE ANALYTICS DASHBOARD')
print('='*70)

print('\n1. Product Category Performance:')
product_stats = ecommerce_data['Product_Category'].value_counts()
for category, count in product_stats.items():
    pct = (count / n_transactions) * 100
    bar = '█' * int(pct / 2)
    print(f'{category:15s}: {bar:25s} {count:3d} ({pct:5.1f}%)')

print('\n2. Customer Segmentation:')
customer_stats = ecommerce_data['Customer_Type'].value_counts()
for cust_type, count in customer_stats.items():
    pct = (count / n_transactions) * 100
    print(f'{cust_type:10s}: {count:3d} transactions ({pct:5.1f}%)')

print('\n3. Cross-Analysis: Product Category by Customer Type:')
cross_tab = pd.crosstab(ecommerce_data['Customer_Type'], 
                       ecommerce_data['Product_Category'])
print(cross_tab)

print('\n4. Key Insights:')
mode_category = product_stats.idxmax()
mode_customer = customer_stats.idxmax()
print(f'   • Most popular category: {mode_category} ({product_stats.max()} purchases)')
print(f'   • Dominant customer type: {mode_customer} ({customer_stats.max()} transactions)')
print(f'   • Total transactions analyzed: {n_transactions}')

## Summary & Key Takeaways

### Core Concepts
1. **Frequency Distributions**
   - Organize categorical data into tables
   - Calculate absolute, relative, percentage, and cumulative frequencies
   - Essential for understanding data composition

2. **Visualizations**
   - **Bar charts**: Best for comparing categories
   - **Pie charts**: Best for showing proportions
   - Choose based on data and communication goals

3. **Contingency Tables**
   - Analyze relationships between two categorical variables
   - Powerful tool for segmentation and cross-analysis
   - Foundation for chi-square tests (later weeks)

4. **Mode**
   - Only measure of central tendency for nominal data
   - Identifies most frequent category
   - Critical for business decision-making

### Practical Applications
- Customer segmentation
- Market research analysis
- Survey data interpretation
- Product portfolio optimization
- Business intelligence reporting

### Next Steps
- Week 3: Numerical data visualization (histograms, box plots)
- Week 4: Central tendency measures (mean, median)
- Week 5: Dispersion measures (variance, standard deviation)
