# Part 2: Data Exploration and Cleaning

**Name:** Brayden Uglione

**Date:** 10/1/24

**Exercise:** Data Cleaning with Pandas  

**Purpose:** Explore and clean survey data from computing and non-computing majors.

## Data Import
Import pandas, read each semester's survey results (Fall 2020 through Fall 2023) into its own dataframe, and combine them into one.

In [1]:
import pandas as pd

# Read in the datasets
df0 = pd.read_csv('Non-Majors Survey Results/Non-Majors Survey Results - Fall 2020.csv')
df1 = pd.read_csv('Non-Majors Survey Results/Non-Majors Survey Results - Fall 2021.csv', encoding='latin-1')
df2 = pd.read_csv('Non-Majors Survey Results/Non-Majors Survey Results - Fall 2022.csv')
df3 = pd.read_csv('Non-Majors Survey Results/Non-Majors Survey Results - Fall 2023.csv')

# Combine all dataframes
df = pd.concat([df0, df1, df2, df3], ignore_index=True)

# Save the combined (uncleaned) dataset
df.to_csv('concat_non_majors_survey_results.csv', index=False)
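
Once the files are concatenated, nothing in the combined frame records which semester a row came from. The following is a minimal sketch of one way to keep that information. It assumes a hypothetical `survey_year` column and uses two tiny stand-in frames with made-up columns in place of the real CSV files:

```python
import pandas as pd

# Stand-in frames; the real notebook would use the frames read from the CSV files.
fall_2020 = pd.DataFrame({'Timestamp': ['t1', 't2'], 'Course': ['A', 'B']})
fall_2021 = pd.DataFrame({'Timestamp': ['t3'], 'Course': ['C']})

# Tag each frame with its year before combining (survey_year is a hypothetical name).
frames = {2020: fall_2020, 2021: fall_2021}
tagged = [frame.assign(survey_year=year) for year, frame in frames.items()]

combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

With the year attached, later steps could group or filter by semester without going back to the individual files.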

## Data Exploration
Explore the dataset to understand its structure and contents.

In [None]:
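# Display basic information (info() prints its report itself and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()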

In [None]:
# Show first few rows
print(df.head())

In [None]:
# Display summary statistics
print(df.describe())

In [None]:
# Show column names and data types
print(df.dtypes)
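
Survey exports tend to be mostly text with many blank answers, so two further checks may be useful during exploration: counting missing entries per column and tallying the answers in a single column. This sketch uses a toy frame whose column names and values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the survey data; column names and values are hypothetical.
sample = pd.DataFrame({
    'course': ['CMP 101', 'CMP 126', None, 'CMP 101'],
    'age': ['18 and younger', None, None, '19-20'],
})

# Count missing entries in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Tally the distinct answers in one column, keeping missing values visible.
course_counts = sample['course'].value_counts(dropna=False)
print(course_counts)
```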

## Data Cleaning
Clean the dataset by renaming columns, removing irrelevant features, and condensing values.
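
Before applying the renaming step to the full dataset, it can help to try it on a few sample headers. The headers below are hypothetical examples of what a survey export might contain:

```python
import pandas as pd

# Hypothetical raw headers resembling a survey export.
raw = pd.Index(['Timestamp', 'Which course are you currently enrolled in?', 'Timestamp.1'])

cleaned = (
    raw.str.lower()
    .str.replace(' ', '_', regex=False)          # spaces become underscores
    .str.replace('[^a-z0-9_]', '', regex=True)   # drop everything else, including '?' and '.'
)
print(list(cleaned))
```

Note that the dot in a suffixed name such as `Timestamp.1` disappears, so any later check has to match the cleaned name (`timestamp1`), not the original.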

In [None]:
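# Rename columns to lowercase with underscores. The regex strips '?' along with
# every other character outside a-z, 0-9, and '_', so no separate '?' pass is needed.
df.columns = (
    df.columns.str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('[^a-z0-9_]', '', regex=True)
)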

# Identify and remove irrelevant columns. The rename above strips dots, so names
# like '.1' can no longer occur; a prefix match on 'timestamp' is the only check
# that can still apply.
columns_to_drop = [col for col in df.columns if col.startswith('timestamp')]
df = df.drop(columns=columns_to_drop, errors='ignore')

# Clean and condense course names
course_column = 'which_course_are_you_currently_enrolled_in'
if course_column in df.columns:
    df[course_column] = df[course_column].replace({
        'CMP 126 Computer Technology and Applications': 'CMP 126',
        'CMP 101 Computer Information Literacy': 'CMP 101',
        'CMP 135 Computer Concepts with Applications': 'CMP 135',
    })

# Clean and condense motivation responses: recode Yes/No as 1/0
motivation_prefix = 'what_motivated_you_to_seek_a_computing_class_at_ccm'
motivation_columns = [col for col in df.columns if col.startswith(motivation_prefix)]
for col in motivation_columns:
    df[col] = df[col].replace({'Yes': 1, 'No': 0})

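# Handle missing values: label every missing entry 'Unknown'. Wherever a response
# was missing, the recoded 1/0 columns will now hold a mix of numbers and text.
df = df.fillna('Unknown')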

# Save cleaned dataset
df.to_csv('cleaned_non_majors_survey_results.csv', index=False)

# Display info of cleaned dataset (info() prints directly and returns None)
df.info()
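
Filling every gap with the string 'Unknown' leaves the recoded Yes/No columns holding both numbers and text. If those columns need to stay numeric for later analysis, one alternative is pandas' nullable integer dtype. This is a sketch on a toy series, offered as an option and not as part of the original exercise:

```python
import pandas as pd

# Toy stand-in for one motivation column, with one missing response.
responses = pd.Series(['Yes', 'No', None, 'Yes'])

# Map Yes/No to 1/0, then use the nullable Int64 dtype so the gap stays <NA>.
encoded = responses.map({'Yes': 1, 'No': 0}).astype('Int64')
print(encoded)

# Numeric summaries still work and skip the missing entry.
print(encoded.sum())
```

Unlike `replace`, `map` turns any answer that is not in the dictionary into a missing value, so this approach only suits columns that are known to contain nothing but Yes, No, and blanks.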