CCM Computing Programs Survey Analysis
Author: Logan Ash
Date: February 21, 2025
Purpose: To analyze and visualize trends in CCM computing program enrollment from 2020 to 2024

----------------------- SETUP & IMPORTS ------------------------------

In [None]:
# Section 1: Import necessary libraries and configure settings
# Import core data handling libraries
import pandas as pd
import numpy as np

In [None]:
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import utility libraries
import re
import os
from datetime import datetime

In [None]:
# Set visualization style for consistent appearance
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')
plt.rcParams.update({'font.size': 12})

In [None]:
# Display options for better output readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', '{:.2f}'.format)

----------------------- DATA LOADING ------------------------------

In [None]:
# Section 2: Load and examine the data files
# List of survey files to analyze (one for each year)
files = [
    'Majors Survey Results  Fall 2020.csv',
    'Majors Survey Results  Fall 2021.csv',
    'Majors Survey Results  Fall 2022.csv',
    'Majors Survey Results  Fall 2023.csv',
    'Majors Survey Results  Fall 2024.csv'
]

In [None]:
# Initial exploration of one file to understand structure
df_2023 = pd.read_csv('Majors Survey Results  Fall 2023.csv')

In [None]:
# Display basic information about the dataset
print(f"Dataset shape: {df_2023.shape}")
print("\nFirst 5 column names:")
print(df_2023.columns[:5].tolist())

----------------------- DATA EXPLORATION ---------------------------

Section 3: Explore the data structure
Examine core statistics and content before cleaning

In [None]:
# Check basic summary statistics
df_2023.describe().T

In [None]:
# View the first few rows to understand data format
df_2023.head(2)

In [None]:
# Check column data types
df_2023.dtypes

In [None]:
# Define key columns for analysis
key_columns = [
    'Which course are you enrolled in?',
    'What degree program are you currently enrolled in?',
    'Gender',
    'Age ',
    'Race/ethnicity',
    'On a scale of 1 to 5, with 1 being not at all interested and 5 being extremely interested, how interested are you in taking more computing classes?'
]

In [None]:
# Check for missing values in key columns
print("Missing values in key columns:")
df_2023[key_columns].isnull().sum()

----------------------- DATA CLEANING ------------------------------

Section 4: Define standardization functions
These functions will clean and normalize data across all survey years

In [None]:
def standardize_degree_program(program):
    """
    Standardize degree program names to consistent categories.
    
    Parameters:
    program (str): Raw program name from survey
    
    Returns:
    str: Standardized program category
    """
    if pd.isna(program):
        return 'Unknown'
    
    # Trim whitespace; case is handled by re.I in the patterns below
    normalized = str(program).strip()
    
    # Group similar programs using regex pattern matching
    if re.search(r'computer science', normalized, re.I):
        return 'Computer Science'
    elif re.search(r'information technology', normalized, re.I):
        return 'Information Technology'
    elif re.search(r'game development', normalized, re.I):
        return 'Game Development'
    elif re.search(r'engineer', normalized, re.I):  # also matches 'engineering'
        return 'Engineering'
    elif re.search(r'sharetime|share time|csip', normalized, re.I):
        return 'ShareTime Program'
    elif re.search(r'certification|certificate|achievement', normalized, re.I):
        return 'Certificate Program'
    elif re.search(r'undecided', normalized, re.I):
        return 'Undecided'
    elif re.search(r'non degree|non-degree', normalized, re.I):
        return 'Non-Degree'
    elif re.search(r'data', normalized, re.I):
        return 'Data Analytics/Science'
    else:
        return 'Other'
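
Because the branches above are checked in order, a value that matches several patterns is assigned to the first one that hits. The standalone sketch below illustrates that precedence rule. `PROGRAM_PATTERNS` and `classify_program` are illustrative names that do not appear elsewhere in this notebook, and the sample strings are made up rather than taken from the survey files.

```python
import re

# Ordered (pattern, label) pairs mirroring the branches of
# standardize_degree_program; earlier entries win. Hypothetical table.
PROGRAM_PATTERNS = [
    (r'computer science', 'Computer Science'),
    (r'information technology', 'Information Technology'),
    (r'game development', 'Game Development'),
    (r'engineer', 'Engineering'),
    (r'sharetime|share time|csip', 'ShareTime Program'),
    (r'certification|certificate|achievement', 'Certificate Program'),
    (r'undecided', 'Undecided'),
    (r'non degree|non-degree', 'Non-Degree'),
    (r'data', 'Data Analytics/Science'),
]

def classify_program(text):
    """Return the label of the first pattern found in text, else 'Other'."""
    for pattern, label in PROGRAM_PATTERNS:
        if re.search(pattern, text, re.I):
            return label
    return 'Other'

# Made-up inputs chosen to show which branch wins
samples = [
    'Computer Science (Data Science option)',  # matches two patterns
    'Data Analytics Certificate',              # 'certificate' is checked before 'data'
    'Liberal Arts',                            # matches nothing
]
for s in samples:
    print(f"{s!r} -> {classify_program(s)}")
```

If a data-focused certificate should count as 'Data Analytics/Science' instead, the `data` entry would need to move above the certificate entry; which ordering is right depends on how the survey responses are actually worded.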

In [None]:
def standardize_gender(gender):
    """
    Standardize gender values to consistent categories.
    
    Parameters:
    gender (str): Raw gender value from survey
    
    Returns:
    str: Standardized gender category
    """
    if pd.isna(gender):
        return 'Unknown'
    
    normalized_gender = str(gender).lower().strip()
    
    if normalized_gender in ['man', 'male']:
        return 'Man'
    elif normalized_gender in ['woman', 'female', 'women']:
        return 'Woman'
    elif 'non-binary' in normalized_gender or 'nonbinary' in normalized_gender:
        return 'Non-binary'
    elif 'prefer not' in normalized_gender or 'not say' in normalized_gender or 'do not identify' in normalized_gender:
        return 'Prefer not to say'
    else:
        return 'Other'

In [None]:
def standardize_age(age):
    """
    Standardize age ranges to consistent categories.
    
    Parameters:
    age (str): Raw age value from survey
    
    Returns:
    str: Standardized age category
    """
    if pd.isna(age):
        return 'Unknown'
    
    # Fix any formatting issues
    age_str = str(age).strip()
    
    # Handle specific known formats
    if age_str == '18 and younger"':
        return '18 and younger'
    if age_str == '25-34"':
        return '25-34'
    
    return age_str
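
The two special cases above each remove a trailing double quote from one known label. A more general cleanup could strip stray quotes from any label, as in the sketch below. `clean_age_label` is a hypothetical helper, and the dash replacement is an extra precaution that assumes some exports might use an en dash in ranges; nothing in the survey files is known to require it.

```python
def clean_age_label(raw):
    """Strip whitespace and stray quotes, and use a plain hyphen in ranges."""
    label = str(raw).strip().strip('"\'').strip()
    # Assumption: replace en/em dashes in case a range was exported with one
    label = label.replace('\u2013', '-').replace('\u2014', '-')
    return label

# Made-up raw values, including the two quote cases handled above
for raw in ['25-34"', ' 18 and younger" ', '19\u201320']:
    print(repr(raw), '->', repr(clean_age_label(raw)))
```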

In [None]:
def standardize_race(race):
    """
    Standardize race/ethnicity values to consistent categories.
    
    Parameters:
    race (str): Raw race/ethnicity value from survey
    
    Returns:
    str: Standardized race/ethnicity category
    """
    if pd.isna(race):
        return 'Unknown'
    
    # Convert to string
    race_str = str(race)
    
    # Handle multi-selection responses
    if ';' in race_str:
        return 'Multi-Racial'
    
    # Normalize single selections
    if re.search(r'white|caucasian', race_str, re.I):
        return 'White/Caucasian'
    elif re.search(r'black|african', race_str, re.I):
        return 'Black/African American'
    elif re.search(r'hispanic|latino', race_str, re.I):
        return 'Hispanic or Latino'
    elif re.search(r'asian', race_str, re.I):
        return 'Asian'
    elif re.search(r'native|indigenous|american indian', race_str, re.I):
        return 'Native American/Indigenous'
    elif re.search(r'choose not|prefer not|not say', race_str, re.I):
        return 'Prefer not to say'
    else:
        return 'Other'

In [None]:
def standardize_impact(impact):
    """
    Standardize impact values to consistent categories.
    
    Parameters:
    impact (str): Raw impact value from survey
    
    Returns:
    str: Standardized impact category
    """
    if pd.isna(impact):
        return 'Unknown'
    
    impact_str = str(impact).strip()
    
    if impact_str == 'High Impact':
        return 'High'
    if impact_str == 'Some Impact':
        return 'Medium'
    if impact_str == 'No Impact':
        return 'Low'
    if impact_str == 'N/A':
        return 'Not Applicable'
    
    return impact_str

----------------------- DATA INTEGRATION ---------------------------

Section 5: Process and combine all files
Read, clean, and combine all five years of data

In [None]:
# Create an empty list to store dataframes
dataframes = []

In [None]:
for file in files:
    # Extract year from filename
    year = re.search(r'Fall (\d{4})', file).group(1)
    print(f"Processing {year} data...")
    
    # Read the CSV file
    df = pd.read_csv(file)
    
    # Add year column
    df['year'] = int(year)
    
    # Select and rename columns
    selected_cols = [
        'year',
        'Which course are you enrolled in?',
        'What degree program are you currently enrolled in?',
        'Gender',
        'Age ',
        'Race/ethnicity',
        'On a scale of 1 to 5, with 1 being not at all interested and 5 being extremely interested, how interested are you in taking more computing classes?'
    ]
    
    try:
        # Create a new dataframe with selected columns
        df_selected = df[selected_cols].copy()
        
        # Rename columns for clarity and consistency
        df_selected.columns = [
            'year',
            'course',
            'degree_program',
            'gender',
            'age',
            'race_ethnicity',
            'interest_level'
        ]
        
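        # Apply standardization functions
        df_selected['degree_program'] = df_selected['degree_program'].apply(standardize_degree_program)
        df_selected['gender'] = df_selected['gender'].apply(standardize_gender)
        df_selected['age'] = df_selected['age'].apply(standardize_age)
        df_selected['race_ethnicity'] = df_selected['race_ethnicity'].apply(standardize_race)
        # Coerce ratings to numbers so mean() and >= comparisons work; invalid entries become NaN
        df_selected['interest_level'] = pd.to_numeric(df_selected['interest_level'], errors='coerce')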
        
        # Add to list of dataframes
        dataframes.append(df_selected)
        
        print(f"Processed {len(df)} rows from {year}")
    except KeyError as e:  # an expected column is missing from this year's file
        print(f"Error processing {file}: {e}")
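
The loop above depends on every file using exactly the same column names, including the trailing space in 'Age '. One way to see why a file was skipped is to compare its header against the expected names before selecting columns. The sketch below assumes that approach; `find_missing_columns` is a hypothetical helper and the header shown is made up.

```python
def find_missing_columns(header, expected):
    """Split expected names into exact misses and near-misses.

    A near-miss is an expected name that matches a header name once
    surrounding whitespace is ignored (for example 'Age ' vs 'Age').
    """
    header_set = set(header)
    stripped = {name.strip(): name for name in header}
    missing, near = [], {}
    for name in expected:
        if name in header_set:
            continue
        candidate = stripped.get(name.strip())
        if candidate is not None:
            near[name] = candidate   # same name apart from whitespace
        else:
            missing.append(name)     # no plausible match at all
    return missing, near

header = ['Gender', 'Age', 'Race/ethnicity']   # made-up header for illustration
expected = ['Gender', 'Age ', 'Degree']
missing, near = find_missing_columns(header, expected)
print("Missing:", missing)
print("Whitespace mismatches:", near)
```

In the notebook, the header would come from `df.columns.tolist()` and the expected names from `selected_cols` without the added 'year' entry.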

In [None]:
# Combine all dataframes
combined_df = pd.concat(dataframes, ignore_index=True)

In [None]:
# Show the combined dataframe info
print("\nCombined Dataset Information:")
print(f"Total rows: {len(combined_df)}")
print(f"Columns: {combined_df.columns.tolist()}")

In [None]:
# Display the first few rows of the combined dataset
combined_df.head()

In [None]:
# Verify standardized values distribution
print("\nDegree Program Distribution:")
print(combined_df['degree_program'].value_counts())

In [None]:
print("\nGender Distribution:")
print(combined_df['gender'].value_counts())

In [None]:
print("\nAge Distribution:")
print(combined_df['age'].value_counts())

In [None]:
print("\nRace/Ethnicity Distribution:")
print(combined_df['race_ethnicity'].value_counts())

----------------------- DATA ANALYSIS ---------------------------

Section 6: Analyze trends over time
Create year-by-year summaries of key metrics

In [None]:
# Create a yearly summary of key metrics
yearly_summary = combined_df.groupby('year').agg(
    total_students=('degree_program', 'count'),
    cs_students=('degree_program', lambda x: (x == 'Computer Science').sum()),
    it_students=('degree_program', lambda x: (x == 'Information Technology').sum()),
    women_students=('gender', lambda x: (x == 'Woman').sum()),
    young_students=('age', lambda x: (x == '18 and younger').sum()),
    nonwhite_students=('race_ethnicity', lambda x: (~x.isin(['White/Caucasian', 'Unknown', 'Prefer not to say'])).sum())
)

In [None]:
# Calculate percentage metrics for better comparison
yearly_summary['cs_percent'] = (yearly_summary['cs_students'] / yearly_summary['total_students'] * 100).round(1)
yearly_summary['it_percent'] = (yearly_summary['it_students'] / yearly_summary['total_students'] * 100).round(1)
yearly_summary['women_percent'] = (yearly_summary['women_students'] / yearly_summary['total_students'] * 100).round(1)
yearly_summary['young_percent'] = (yearly_summary['young_students'] / yearly_summary['total_students'] * 100).round(1)
yearly_summary['nonwhite_percent'] = (yearly_summary['nonwhite_students'] / yearly_summary['total_students'] * 100).round(1)
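
Each percentage above divides by total_students, which counts every row, including rows whose value was standardized to 'Unknown' or 'Prefer not to say'. The sketch below shows how the choice of denominator changes a share. `share` is a hypothetical helper and the counts are made up for illustration, not taken from the survey.

```python
from collections import Counter

def share(counts, target, exclude=()):
    """Percentage of `target` among all categories not listed in `exclude`."""
    denominator = sum(n for label, n in counts.items() if label not in exclude)
    if denominator == 0:
        return float('nan')
    return round(counts.get(target, 0) / denominator * 100, 1)

# Made-up counts for a single year
gender_counts = Counter({'Man': 70, 'Woman': 20, 'Unknown': 10})

all_rows = share(gender_counts, 'Woman')                         # denominator: every row
known_only = share(gender_counts, 'Woman', exclude=('Unknown',)) # denominator: known values
print(all_rows, known_only)
```

Either denominator can be defended; what matters is using the same one in every year so that the trend lines stay comparable.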

In [None]:
# Display yearly summary table
yearly_summary

In [None]:
# Analyze interest levels by year
interest_by_year = combined_df.dropna(subset=['interest_level']).groupby('year').agg(
    avg_interest=('interest_level', 'mean'),
    response_count=('interest_level', 'count')
)

In [None]:
# Calculate high interest percentages (4-5 rating)
high_interest_by_year = combined_df.dropna(subset=['interest_level']).copy()  # copy avoids SettingWithCopyWarning on the assignment below
high_interest_by_year['high_interest'] = high_interest_by_year['interest_level'] >= 4
high_interest_counts = high_interest_by_year.groupby('year').agg(
    high_interest_count=('high_interest', 'sum'),
    total_responses=('interest_level', 'count')
)
high_interest_counts['high_interest_percent'] = (high_interest_counts['high_interest_count'] / high_interest_counts['total_responses'] * 100).round(1)

In [None]:
# Combine with interest_by_year
interest_by_year = pd.merge(interest_by_year, high_interest_counts[['high_interest_percent']], left_index=True, right_index=True)

In [None]:
# Display interest summary
interest_by_year

----------------------- DATA VISUALIZATION ---------------------------

Section 7: Create overview visualizations of key metrics and trends
This section shows the primary metrics in a dashboard-style layout

In [None]:
# Build all six panels in one cell so they share a single figure.
# With the inline backend a figure is rendered when its cell finishes,
# so panels added in later cells would land on new figures.
plt.figure(figsize=(15, 10))

# 1. Total enrollment trend
plt.subplot(2, 3, 1)
plt.plot(yearly_summary.index, yearly_summary['total_students'], 'o-', linewidth=2)
plt.title('Total Student Enrollment by Year')
plt.xlabel('Year')
plt.ylabel('Number of Students')
plt.grid(True)

# 2. Program distribution trend
plt.subplot(2, 3, 2)
plt.plot(yearly_summary.index, yearly_summary['cs_percent'], 'o-', label='Computer Science')
plt.plot(yearly_summary.index, yearly_summary['it_percent'], 's-', label='Information Technology')
plt.title('Program Distribution Trend')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.legend()
plt.grid(True)

# 3. Gender diversity trend
plt.subplot(2, 3, 3)
plt.plot(yearly_summary.index, yearly_summary['women_percent'], 'o-', color='#FF8042')
plt.title('Women Representation (%)')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.grid(True)

# 4. Age distribution trend
plt.subplot(2, 3, 4)
plt.plot(yearly_summary.index, yearly_summary['young_percent'], 'o-', color='#8884d8')
plt.title('Students 18 and Younger (%)')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.grid(True)

# 5. Interest level trend
# Dividing the percentage by 20 maps 0-100% onto the 0-5 rating axis
plt.subplot(2, 3, 5)
plt.plot(interest_by_year.index, interest_by_year['avg_interest'], 'o-', color='#8884d8', label='Avg Interest (1-5)')
plt.plot(interest_by_year.index, interest_by_year['high_interest_percent']/20, 's-', color='#82ca9d', label='High Interest % (÷20)')
plt.title('Interest in Further Computing Classes')
plt.xlabel('Year')
plt.ylabel('Average Interest Level')
plt.legend()
plt.grid(True)

# 6. Diversity trend
plt.subplot(2, 3, 6)
plt.plot(yearly_summary.index, yearly_summary['nonwhite_percent'], 'o-', color='#82ca9d')
plt.title('Non-White Student Representation (%)')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.grid(True)

# Save figure with meaningful filename and ensure proper layout
plt.tight_layout()
plt.savefig('ccm_survey_trends.png', dpi=300)
plt.show()

----------------------- DETAILED VISUALIZATIONS ---------------------------

Section 8: Create detailed category breakdown visualizations
These visualizations show more detailed breakdowns of the key categories

In [None]:
# 1. Detailed program distribution by year (stacked bar)
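program_data = combined_df.groupby(['year', 'degree_program']).size().unstack(fill_value=0)
# Fix the column order; fill_value=0 shows categories absent from the data as zeros, not NaN.
# 'Unknown' is listed so that rows with a missing program are not silently dropped.
program_data = program_data.reindex(columns=['Computer Science', 'Information Technology', 'Engineering',
                                           'Game Development', 'ShareTime Program', 'Certificate Program',
                                           'Undecided', 'Non-Degree', 'Data Analytics/Science', 'Other',
                                           'Unknown'], fill_value=0)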

In [None]:
# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one
program_data.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Program Distribution by Year')
plt.xlabel('Year')
plt.ylabel('Number of Students')
plt.legend(title='Degree Program', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('ccm_program_distribution.png', dpi=300)
plt.show()

In [None]:
# 2. Gender distribution by year (stacked bar)
gender_data = combined_df.groupby(['year', 'gender']).size().unstack().fillna(0)

In [None]:
# DataFrame.plot creates the figure itself; no separate plt.figure() call is needed
gender_data.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Gender Distribution by Year')
plt.xlabel('Year')
plt.ylabel('Number of Students')
plt.legend(title='Gender', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('ccm_gender_distribution.png', dpi=300)
plt.show()

In [None]:
# 3. Age distribution by year (stacked bar)
age_data = combined_df.groupby(['year', 'age']).size().unstack().fillna(0)
# Sort columns by age category for logical order
age_order = ['18 and younger', '19-20', '21-24', '25-34', '35-64', '65+', 'Unknown']
age_data = age_data.reindex(columns=[col for col in age_order if col in age_data.columns])

In [None]:
# DataFrame.plot creates the figure itself; no separate plt.figure() call is needed
age_data.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Age Distribution by Year')
plt.xlabel('Year')
plt.ylabel('Number of Students')
plt.legend(title='Age Group', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('ccm_age_distribution.png', dpi=300)
plt.show()

In [None]:
# 4. Racial/ethnic diversity by year (stacked bar)
race_data = combined_df.groupby(['year', 'race_ethnicity']).size().unstack().fillna(0)

In [None]:
# DataFrame.plot creates the figure itself; no separate plt.figure() call is needed
race_data.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Racial/Ethnic Distribution by Year')
plt.xlabel('Year')
plt.ylabel('Number of Students')
plt.legend(title='Race/Ethnicity', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('ccm_race_distribution.png', dpi=300)
plt.show()

In [None]:
# 5. Interest levels histogram by year
plt.figure(figsize=(10, 6))
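# discrete=True centers one bar group on each integer rating instead of using 5 equal-width bins
sns.histplot(data=combined_df.dropna(subset=['interest_level']), x='interest_level', hue='year', multiple='dodge', discrete=True)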
plt.title('Distribution of Interest Levels by Year')
plt.xlabel('Interest Level (1-5)')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('ccm_interest_distribution.png', dpi=300)
plt.show()

----------------------- DEMOGRAPHIC ANALYSIS ---------------------------

Section 9: Analyze relationships between variables
Explore how demographic factors relate to each other

In [None]:
# 1. Interest level by gender
interest_by_gender = combined_df.dropna(subset=['interest_level']).groupby(['gender']).agg(
    avg_interest=('interest_level', 'mean'),
    count=('interest_level', 'count')
).sort_values('count', ascending=False)

In [None]:
# Display results
interest_by_gender

In [None]:
# Visualize gender interest levels
plt.figure(figsize=(10, 6))
sns.barplot(x=interest_by_gender.index, y='avg_interest', data=interest_by_gender)
plt.title('Average Interest Level by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Interest Level (1-5)')
plt.ylim(0, 5)
for i, v in enumerate(interest_by_gender['avg_interest']):
    plt.text(i, v + 0.1, f"{v:.2f}", ha='center')
plt.tight_layout()
plt.savefig('ccm_interest_by_gender.png', dpi=300)
plt.show()

In [None]:
# 2. Interest level by degree program
interest_by_program = combined_df.dropna(subset=['interest_level']).groupby(['degree_program']).agg(
    avg_interest=('interest_level', 'mean'),
    count=('interest_level', 'count')
).sort_values('count', ascending=False).head(6)  # Top 6 programs by count

In [None]:
# Display results
interest_by_program

In [None]:
# Visualize program interest levels
plt.figure(figsize=(12, 6))
sns.barplot(x=interest_by_program.index, y='avg_interest', data=interest_by_program)
plt.title('Average Interest Level by Degree Program')
plt.xlabel('Degree Program')
plt.ylabel('Average Interest Level (1-5)')
plt.ylim(0, 5)
plt.xticks(rotation=45, ha='right')
for i, v in enumerate(interest_by_program['avg_interest']):
    plt.text(i, v + 0.1, f"{v:.2f}", ha='center')
plt.tight_layout()
plt.savefig('ccm_interest_by_program.png', dpi=300)
plt.show()

In [None]:
# 3. Degree program by gender (proportions)
program_gender = pd.crosstab(combined_df['degree_program'], combined_df['gender'], normalize='index') * 100
program_gender = program_gender.sort_values('Woman', ascending=False)

In [None]:
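# Filter to top programs for clarity, keeping the sort by share of women from above
# (selecting with .loc[top_programs] would reorder the rows by enrollment count instead)
top_programs = combined_df['degree_program'].value_counts().head(6).index
program_gender_filtered = program_gender.loc[program_gender.index.isin(top_programs)]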

In [None]:
# Display results
program_gender_filtered

In [None]:
# Visualize gender distribution by program
# Pass figsize to DataFrame.plot, which creates its own figure
program_gender_filtered[['Man', 'Woman']].plot(kind='bar', stacked=False, figsize=(12, 7))
plt.title('Gender Distribution by Degree Program')
plt.xlabel('Degree Program')
plt.ylabel('Percentage (%)')
plt.legend(title='Gender')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('ccm_gender_by_program.png', dpi=300)
plt.show()

----------------------- SAVE CLEANED DATA ---------------------------

Section 10: Save the cleaned data for future use

In [None]:
# Save the cleaned combined dataset
combined_df.to_csv('ccm_cleaned_survey_data.csv', index=False)
print("Cleaned data saved to 'ccm_cleaned_survey_data.csv'")

----------------------- FINDINGS SUMMARY ---------------------------

Section 11: Summary of key findings
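
The percentages quoted in the summary are typed in by hand, so they can drift out of date if the data or the cleaning rules change. One option is to build each sentence from the computed tables. The sketch below assumes that approach; `describe_change` is a hypothetical helper, and the example values are the CS percentages already quoted in the summary.

```python
def describe_change(label, start_year, start_value, end_year, end_value):
    """Build one findings sentence from two yearly percentages."""
    delta = end_value - start_value
    if delta > 0:
        direction = 'rose'
    elif delta < 0:
        direction = 'fell'
    else:
        direction = 'held steady'
    return (f"{label} {direction} from {start_value:.1f}% in {start_year} "
            f"to {end_value:.1f}% in {end_year} ({delta:+.1f} points).")

sentence = describe_change('The share of CS majors', 2020, 22.5, 2024, 31.5)
print(sentence)
```

In the notebook, the two values could be read from the table instead, for example `yearly_summary.loc[2020, 'cs_percent']` and `yearly_summary.loc[2024, 'cs_percent']`.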

In [None]:
print("""
# Key Findings from CCM Computing Programs Survey Analysis

1. **Enrollment Trends**:
   - Total enrollment in computing programs, measured by survey responses, declined from 298 students in 2020 to 197 in 2024.
   - Computer Science has become the most popular program since 2021, overtaking Information Technology.
   - The percentage of CS majors increased from 22.5% in 2020 to 31.5% in 2024.

2. **Gender Distribution**:
   - Computing programs at CCM show a significant gender imbalance, with men representing 73-82% of students.
   - Women's representation decreased from 22.1% in 2020 to 16.8% in 2024, a drop of 5.3 percentage points.
   - There has been a small increase in non-binary student representation.

3. **Age Demographics**:
   - The student population is trending younger, with students 18 and younger increasing from 33.9% in 2020 to 40.6% in 2024.
   - The 19-24 age group remains a significant portion of the student body.

4. **Interest in Computing**:
   - Students consistently show moderate-to-high interest in taking additional computing classes.
   - The average interest level has remained relatively stable, between 3.4 and 3.7 out of 5.
   - The percentage of students with high interest (4-5 rating) has increased from 45.6% in 2020 to 59.7% in 2024.

5. **Racial/Ethnic Diversity**:
   - The student population has become increasingly diverse over the years.
   - Non-white student representation grew from 51.3% in 2020 to 58.9% in 2024.
   - The Hispanic or Latino population has remained consistently significant (around 20-21%).
   - There has been growth in multi-racial student representation.

6. **Program Distribution**:
   - The most popular programs across all years are:
     - Computer Science (30.7%)
     - Information Technology (23.6%)
     - Engineering-related programs (8.8%)
     - ShareTime Program (6.8%)

7. **Interest Across Demographics**:
   - Interest levels are fairly consistent across gender groups (average 3.5-3.7)
   - Engineering and ShareTime Program students show the highest interest levels
   - Computer Science majors show slightly higher interest than Information Technology majors
""")