<center>
<table>
    <tr>
        <th><h1>Decadal Insights: An Exploration of Degree Selection and Completion Rates Across Demographics (2001-2022)</h1></th>
    </tr>
    <tr>
        <td><h3>Author: Charles Atchison</h3></td>
    </tr>
    <tr>
        <td><h3>Date: May 4, 2024</h3></td>
    </tr>
</table>
</center>


## Project Overview

This project conducts a comprehensive analysis of higher education data spanning from 2001 to 2022. It aims to uncover how demographic attributes such as gender, race, and immigration status influence degree selection and completion rates across different academic disciplines and award levels.

### Objectives

The objectives of this project include:

- **Analyzing Trends**: Identify how different demographic groups select degrees and the influence of demographic attributes on these decisions.
- **Comparing Completion Rates**: Evaluate how demographic factors impact the rates of degree completion, identifying disparities and factors contributing to success or challenges in higher education.
- **Providing Insights**: Deliver actionable insights to educational institutions to enhance academic offerings and support services for a diverse student body.

### Analytical Question

How do demographic attributes such as gender, race, and immigration status influence degree selection and completion rates across different academic disciplines and award levels in higher education institutions from 2001 to 2022?

### Dataset

The data for this analysis was sourced from the Integrated Postsecondary Education Data System (IPEDS), available through the National Center for Education Statistics. You can access the dataset from the following [link](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx?year=2000&surveyNumber=3&sid=c610fb17-7be5-4a97-8c02-99076b161130&rtid=7).

By leveraging this dataset, the project will employ statistical and machine learning techniques to analyze and visualize trends, contributing valuable insights into the dynamics of higher education demographics over two decades.

---
## System Architecture

### 1. **Data Ingestion**

#### Description
The Data Ingestion stage involved the automated aggregation of datasets from the years 2001 to 2022, streamlining the process to form a unified data framework. This phase was crucial for preparing the datasets for comprehensive analysis, ensuring that data from various years could be analyzed collectively to identify trends and patterns over time.

#### Data Acquisition
Data was acquired by scraping annual datasets available from the Integrated Postsecondary Education Data System (IPEDS), hosted by the National Center for Education Statistics. Each dataset corresponded to a specific year and contained various metrics related to higher education institutions, such as enrollment numbers, graduation rates, and demographic information.

#### Data Combination
The data from each year was aggregated into a single dataset. Each yearly file was read into its own DataFrame and tagged with its year, and the resulting DataFrames were then concatenated into one. Because column names vary between survey years, the concatenation keeps the union of all columns and leaves values missing where a given year lacks a column; these differences are resolved in the cleanup stage.
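The way concatenation treats differing column labels can be illustrated with a minimal sketch. The two frames and their values below are hypothetical stand-ins for yearly files; only the column labels are taken from this dataset.

```python
import pandas as pd

# Two toy "yearly" frames with hypothetical values. The labels differ in
# case and naming, as they do between survey years in this dataset.
df_2001 = pd.DataFrame({"unitid": [100636], "crace01": [0]})
df_2008 = pd.DataFrame({"UNITID": [100654], "CNRALM": [3]})

df_2001["year"] = 2001
df_2008["year"] = 2008

# pd.concat matches column labels exactly (case-sensitive) and keeps their
# union, filling with NaN wherever a frame lacks a column.
combined = pd.concat([df_2001, df_2008], ignore_index=True)

print(combined.shape)         # two rows, five distinct column labels
print(combined.isna().sum())  # each year-specific column has one missing value
```

This is why the combined frame can be much wider than any single yearly file, and why a consolidation step is needed afterwards.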

#### Outcome
The combined dataset resulted in a comprehensive DataFrame containing 5,797,529 records and 197 columns. This aggregation facilitates a more seamless analysis, allowing for cross-year comparisons and trend analysis across multiple dimensions of the data.

The final aggregated data was stored in a Parquet file, chosen for its efficiency in handling large datasets. This format supports advanced data compression and encoding schemes, which are optimized for complex data processing operations that are anticipated in later stages of this project.


In [36]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Base path where the data files are stored
base_path = 'Data'

# List to store each year's DataFrame
dfs = []

# Loop through each year from 2001 to 2022
for year in range(2001, 2023):
    file_path = f'{base_path}/C{year}_A/c{year}_a.csv'
    try:
        # Read the CSV file
        df = pd.read_csv(file_path)
        
        # Add a year column to keep track of data by year
        df['year'] = year
        
        # Append the DataFrame to the list
        dfs.append(df)
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred while reading {file_path}: {str(e)}")

# Concatenate all DataFrames into one
raw_df = pd.concat(dfs, ignore_index=True)

# Save the combined DataFrame to a Parquet file
raw_df.to_parquet('raw_combined_data.parquet', index=False)
# Display the shape of the df
print(raw_df.shape)
# Display the data head
raw_df.head()

(5797529, 197)


Unnamed: 0,unitid,majornum,cipcode,awlevel,xcrace01,crace01,xcrace02,crace02,xcrace03,crace03,...,XDVCHSW,DVCHSW,XDVCWHT,DVCWHT,XDVCWHM,DVCWHM,XDVCWHW,DVCWHW,CNRALW,CDISTEDP
0,100636.0,1.0,51.2206,3.0,R,0.0,R,0.0,R,2.0,...,,,,,,,,,,
1,100654.0,1.0,52.0101,5.0,R,0.0,R,0.0,R,0.0,...,,,,,,,,,,
2,100663.0,1.0,9.0101,5.0,Z,0.0,R,1.0,R,1.0,...,,,,,,,,,,
3,100663.0,1.0,51.9999,5.0,Z,0.0,Z,0.0,R,1.0,...,,,,,,,,,,
4,100663.0,1.0,26.0101,7.0,Z,0.0,R,1.0,Z,0.0,...,,,,,,,,,,


---
### 2. **Data Cleanup and Transformation**
- **Description**: This phase focuses on standardizing, cleaning, and transforming raw data to prepare it for comprehensive analysis.
- **Tools**: Utilized Pandas for data manipulation and NumPy for numerical data operations.

In tackling the inconsistencies and disorganized nature of the raw dataset, I devised a strategic approach to streamline and standardize the data. I created a dictionary named `column_mappings` to align new, uniform column names with their various labels found in the dataset. This method ensured that data associated with similar content but differing labels, such as 'unitid' and 'UNITID', were consolidated under a single standardized column name.

To implement this standardization, I iterated over each mapping in the dictionary. For each new column name defined, I verified the presence of its corresponding old columns in the raw DataFrame. If present, I migrated the data into a newly established, clean DataFrame. In situations where multiple old columns mapped to a single new column, I amalgamated the data from these columns, selecting the first non-null value for each entry. This approach effectively managed duplicates and missing values, resulting in a consolidated and more practical dataset. I then used `clean_data.head()` to preview the cleaned data, ensuring the accuracy and integrity of the transformation process.
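The "first non-null value" rule can be seen on a small example. This is a sketch with hypothetical values, not the project's data; it shows the `combine_first` chain used in the loop next to an equivalent vectorized form.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: each row is populated under only one
# of the labels that map to the same standardized column.
raw = pd.DataFrame({
    "crace01": [0.0, np.nan, np.nan],
    "CRACE01": [np.nan, 2.0, np.nan],
    "CNRALM":  [np.nan, np.nan, 5.0],
})

# Same idea as the loop: start from the first label, fill gaps from the rest.
merged = raw["crace01"].combine_first(raw["CRACE01"]).combine_first(raw["CNRALM"])

# Equivalent vectorized form: back-fill across columns, keep the first column.
merged_alt = raw[["crace01", "CRACE01", "CNRALM"]].bfill(axis=1).iloc[:, 0]

print(merged.tolist())
print(merged_alt.tolist())
```

Both forms give priority to the label listed first, so the order of names in each mapping matters whenever more than one source column is populated for the same row.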

In the context of this dataset, the `cipcode` column is particularly noteworthy: it holds Classification of Instructional Programs (CIP) codes, six-digit codes formatted as xx.xxxx that classify instructional program specialties within educational institutions. For now, I have retained these codes as numerical values to simplify sorting, filtering, and grouping during the early stages of analysis. One side effect is that leading zeros are dropped, so a code such as 09.0101 appears as 9.0101.

Following the initial numerical analysis, I plan to map these `cipcode` numbers to their respective educational program descriptions. This mapping will transform the numerical codes into meaningful descriptions, for example, converting '01.0101' into 'Agricultural Business and Management, General'. This crucial step will render the data more interpretable and relatable for stakeholders, thereby enhancing the utility and impact of the analytical reports.
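A minimal sketch of that planned mapping follows. It assumes the codes are stored as floats, restores the xx.xxxx text form, and then looks up a description. The `cip_titles` dictionary is a hypothetical lookup table whose single entry is the example given above; a full table would have to come from the official CIP listing.

```python
import pandas as pd

# Numeric CIP codes as they appear after loading (leading zeros are lost).
cip_numeric = pd.Series([1.0101, 9.0101, 51.2206])

# Rebuild the xx.xxxx form: four decimals, zero-padded to seven characters.
cip_text = cip_numeric.map(lambda code: f"{code:07.4f}")

# Hypothetical lookup table; its single entry is the example from the text.
cip_titles = {"01.0101": "Agricultural Business and Management, General"}

# Codes without an entry in the lookup table become NaN.
titles = cip_text.map(cip_titles)

print(cip_text.tolist())
print(titles.tolist())
```

Converting to text before the lookup avoids matching on floating-point values, which can differ in their last digits.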

| Column Name | Code Value | Value Label | Description |
|-------------|------------|-------------|-------------|
| UNITID      |            |             | Unique ID for Each Institution (entity) |
| CIPCODE     |            |             | CIP Code - 2000 Classification. A six-digit code in the form xx.xxxx that identifies instructional program specialties within educational institutions. |
| MAJORNUM    |            |             | First or Second Major |
|             | 1          | First major | First major |
|             | 2          | Second major | Second major |
| AWLEVEL     |            |             | Award level code |
|             | 3          | Associate's degree | Associate's degree |
|             | 5          | Bachelor's degree | Bachelor's degree |
|             | 7          | Master's degree | Master's degree |
|             | 9          | Doctor's degree | Doctor's degree |
|             | 10         | First-professional degree | First-professional degree |
|             | 1          | Award of less than 1 academic year | Award of less than 1 academic year |
|             | 2          | Award of at least 1 but less than 2 academic years | Award of at least 1 but less than 2 academic years |
|             | 4          | Award of at least 2 but less than 4 academic years | Award of at least 2 but less than 4 academic years |
|             | 6          | Postbaccalaureate certificate | Postbaccalaureate certificate |
|             | 8          | Post-master's certificate | Post-master's certificate |
|             | 11         | First-professional certificate | First-professional certificate |
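For readable summaries, the award level codes can be translated into labels. The sketch below is deliberately partial: it covers only the four degree-level codes from the table, and the `AWLEVEL_LABELS` name, the sample values, and the "Other award" fallback are assumptions for illustration.

```python
import pandas as pd

# Partial mapping: only the degree-level codes from the table above.
AWLEVEL_LABELS = {
    3: "Associate's degree",
    5: "Bachelor's degree",
    7: "Master's degree",
    9: "Doctor's degree",
}

# Hypothetical sample; the codes are stored as floats in the cleaned data.
awlevel = pd.Series([3.0, 5.0, 5.0, 7.0, 4.0])

# Cast to int before the lookup, and group unmapped codes under one label.
labels = awlevel.astype(int).map(AWLEVEL_LABELS).fillna("Other award")

print(labels.value_counts())
```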

2001 Data Dictionary:

| Variable Name | Description                                  |
|---------------|----------------------------------------------|
| CRACE01       | Nonresident alien men                        |
| CRACE02       | Nonresident alien women                      |
| CRACE03       | Black non-Hispanic men                       |
| CRACE04       | Black non-Hispanic women                     |
| CRACE05       | American Indian or Alaskan Native men        |
| CRACE06       | American Indian or Alaskan Native women      |
| CRACE07       | Asian or Pacific Islander men                |
| CRACE08       | Asian or Pacific Islander women              |
| CRACE09       | Hispanic men                                 |
| CRACE10       | Hispanic women                               |
| CRACE11       | White non-Hispanic men                       |
| CRACE12       | White non-Hispanic women                     |
| CRACE13       | Race/ethnicity unknown men                   |
| CRACE14       | Race/ethnicity unknown women                 |
| CRACE15       | Grand total men                              |
| CRACE16       | Grand total women                            |

2002 - 2007 Data Dictionary:

| Variable Name | Data Type | Field Width | Format | Imputation Variable | Description                               |
|---------------|-----------|-------------|--------|---------------------|-------------------------------------------|
| CRACE01       | N         | 6           | Cont   | XCRACE01            | Nonresident alien men                     |
| CRACE02       | N         | 6           | Cont   | XCRACE02            | Nonresident alien women                   |
| CRACE03       | N         | 6           | Cont   | XCRACE03            | Black non-Hispanic men                    |
| CRACE04       | N         | 6           | Cont   | XCRACE04            | Black non-Hispanic women                  |
| CRACE05       | N         | 6           | Cont   | XCRACE05            | American Indian/Alaska Native men         |
| CRACE06       | N         | 6           | Cont   | XCRACE06            | American Indian/Alaska Native women       |
| CRACE07       | N         | 6           | Cont   | XCRACE07            | Asian or Pacific Islander men             |
| CRACE08       | N         | 6           | Cont   | XCRACE08            | Asian or Pacific Islander women           |
| CRACE09       | N         | 6           | Cont   | XCRACE09            | Hispanic men                              |
| CRACE10       | N         | 6           | Cont   | XCRACE10            | Hispanic  women                           |
| CRACE11       | N         | 6           | Cont   | XCRACE11            | White non-Hispanic men                    |
| CRACE12       | N         | 6           | Cont   | XCRACE12            | White non-Hispanic women                  |
| CRACE13       | N         | 6           | Cont   | XCRACE13            | Race/ethnicity unknown men                |
| CRACE14       | N         | 6           | Cont   | XCRACE14            | Race/ethnicity unknown women              |
| CRACE15       | N         | 6           | Cont   | XCRACE15            | Total men                                 |
| CRACE16       | N         | 6           | Cont   | XCRACE16            | Total women                               |
| CRACE17       | N         | 6           | Cont   | XCRACE17            | Nonresident alien total                   |
| CRACE18       | N         | 6           | Cont   | XCRACE18            | Black non-Hispanic  total                 |
| CRACE19       | N         | 6           | Cont   | XCRACE19            | American Indian/Alaska Native total       |
| CRACE20       | N         | 6           | Cont   | XCRACE20            | Asian or Pacific Islander total           |
| CRACE21       | N         | 6           | Cont   | XCRACE21            | Hispanic total                            |
| CRACE22       | N         | 6           | Cont   | XCRACE22            | White non-Hispanic total                  |
| CRACE23       | N         | 6           | Cont   | XCRACE23            | Race/ethnicity unknown total              |
| CRACE24       | N         | 6           | Cont   | XCRACE24            | Grand total                               |

2008 - 2010 Data Dictionary:

| Variable Name | Data Type | Field Width | Format | Imputation Variable | Description                                    |
|---------------|-----------|-------------|--------|---------------------|------------------------------------------------|
| CNRALM        | N         | 6           | Cont   | XCNRALM             | Nonresident alien men                          |
| CNRALW        | N         | 6           | Cont   | XCNRALW             | Nonresident alien women                        |
| CRACE03       | N         | 6           | Cont   | XCRACE03            | Black non-Hispanic men - old                   |
| CRACE04       | N         | 6           | Cont   | XCRACE04            | Black non-Hispanic women - old                 |
| CRACE05       | N         | 6           | Cont   | XCRACE05            | American Indian or Alaska Native men - old     |
| CRACE06       | N         | 6           | Cont   | XCRACE06            | American Indian or Alaska Native women - old   |
| CRACE07       | N         | 6           | Cont   | XCRACE07            | Asian or Pacific Islander men - old            |
| CRACE08       | N         | 6           | Cont   | XCRACE08            | Asian or Pacific Islander women - old          |
| CRACE09       | N         | 6           | Cont   | XCRACE09            | Hispanic men - old                             |
| CRACE10       | N         | 6           | Cont   | XCRACE10            | Hispanic women - old                           |
| CRACE11       | N         | 6           | Cont   | XCRACE11            | White non-Hispanic men - old                   |
| CRACE12       | N         | 6           | Cont   | XCRACE12            | White non-Hispanic women - old                 |
| CUNKNM        | N         | 6           | Cont   | XCUNKNM             | Race/ethnicity unknown men                     |
| CUNKNW        | N         | 6           | Cont   | XCUNKNW             | Race/ethnicity unknown women                   |
| CTOTALM       | N         | 6           | Cont   | XCTOTALM            | Grand total men                                |
| CTOTALW       | N         | 6           | Cont   | XCTOTALW            | Grand total women                              |
| CNRALT        | N         | 6           | Cont   | XCNRALT             | Nonresident alien total                        |
| CRACE18       | N         | 6           | Cont   | XCRACE18            | Black non-Hispanic  total - old                |
| CRACE19       | N         | 6           | Cont   | XCRACE19            | American Indian or Alaska Native total - old   |
| CRACE20       | N         | 6           | Cont   | XCRACE20            | Asian or Pacific Islander total - old          |
| CRACE21       | N         | 6           | Cont   | XCRACE21            | Hispanic total - old                           |
| CRACE22       | N         | 6           | Cont   | XCRACE22            | White non-Hispanic total - old                 |
| CUNKNT        | N         | 6           | Cont   | XCUNKNT             | Race/ethnicity unknown total                   |
| CTOTALT       | N         | 6           | Cont   | XCTOTALT            | Grand total                                    |

2011 - 2022 Data Dictionary:

| Variable Name | Data Type | Field Width | Format | Imputation Variable | Description                                                |
|---------------|-----------|-------------|--------|---------------------|------------------------------------------------------------|
| CTOTALT       | N         | 6           | Cont   | XCTOTALT            | Grand total                                                |
| CTOTALM       | N         | 6           | Cont   | XCTOTALM            | Grand total men                                            |
| CTOTALW       | N         | 6           | Cont   | XCTOTALW            | Grand total women                                          |
| CAIANT        | N         | 6           | Cont   | XCAIANT             | American Indian or Alaska Native total                     |
| CAIANM        | N         | 6           | Cont   | XCAIANM             | American Indian or Alaska Native men                       |
| CAIANW        | N         | 6           | Cont   | XCAIANW             | American Indian or Alaska Native women                     |
| CASIAT        | N         | 6           | Cont   | XCASIAT             | Asian total                                                |
| CASIAM        | N         | 6           | Cont   | XCASIAM             | Asian men                                                  |
| CASIAW        | N         | 6           | Cont   | XCASIAW             | Asian women                                                |
| CBKAAT        | N         | 6           | Cont   | XCBKAAT             | Black or African American total                            |
| CBKAAM        | N         | 6           | Cont   | XCBKAAM             | Black or African American men                              |
| CBKAAW        | N         | 6           | Cont   | XCBKAAW             | Black or African American women                            |
| CHISPT        | N         | 6           | Cont   | XCHISPT             | Hispanic or Latino total                                   |
| CHISPM        | N         | 6           | Cont   | XCHISPM             | Hispanic or Latino men                                     |
| CHISPW        | N         | 6           | Cont   | XCHISPW             | Hispanic or Latino women                                   |
| CNHPIT        | N         | 6           | Cont   | XCNHPIT             | Native Hawaiian or Other Pacific Islander total            |
| CNHPIM        | N         | 6           | Cont   | XCNHPIM             | Native Hawaiian or Other Pacific Islander men              |
| CNHPIW        | N         | 6           | Cont   | XCNHPIW             | Native Hawaiian or Other Pacific Islander women            |
| CWHITT        | N         | 6           | Cont   | XCWHITT             | White total                                                |
| CWHITM        | N         | 6           | Cont   | XCWHITM             | White men                                                  |
| CWHITW        | N         | 6           | Cont   | XCWHITW             | White women                                                |
| C2MORT        | N         | 6           | Cont   | XC2MORT             | Two or more races total                                    |
| C2MORM        | N         | 6           | Cont   | XC2MORM             | Two or more races men                                      |
| C2MORW        | N         | 6           | Cont   | XC2MORW             | Two or more races women                                    |
| CUNKNT        | N         | 6           | Cont   | XCUNKNT             | Race/ethnicity unknown total                               |
| CUNKNM        | N         | 6           | Cont   | XCUNKNM             | Race/ethnicity unknown men                                 |
| CUNKNW        | N         | 6           | Cont   | XCUNKNW             | Race/ethnicity unknown women                               |
| CNRALT        | N         | 6           | Cont   | XCNRALT             | Nonresident alien total                                    |
| CNRALM        | N         | 6           | Cont   | XCNRALM             | Nonresident alien men                                      |
| CNRALW        | N         | 6           | Cont   | XCNRALW             | Nonresident alien women                                    |



In [37]:
# Define the column mappings
column_mappings = {
    'unitid': ['unitid', 'UNITID'],
    'majornum': ['majornum', 'MAJORNUM'],
    'cipcode': ['cipcode', 'CIPCODE'],
    'awlevel': ['awlevel', 'AWLEVEL'],
    'nonresident_alien_men': ['crace01', 'CRACE01', 'cnralm', 'CNRALM'],
    'nonresident_alien_women': ['crace02', 'CRACE02', 'cnralw', 'CNRALW'],
    'black_non_hispanic_men': ['crace03', 'CRACE03', 'cbkaam', 'CBKAAM'],
    'black_non_hispanic_women': ['crace04', 'CRACE04', 'cbkaaw', 'CBKAAW'],
    'american_indian_alaskan_men': ['crace05', 'CRACE05', 'caianm', 'CAIANM'],
    'american_indian_alaskan_women': ['crace06', 'CRACE06', 'caianw', 'CAIANW'],
    'asian_pacific_islander_men': ['crace07', 'CRACE07', 'casiam', 'CASIAM'],
    'asian_pacific_islander_women': ['crace08', 'CRACE08', 'casiaw', 'CASIAW'],
    'hispanic_men': ['crace09', 'CRACE09', 'chispm', 'CHISPM'],
    'hispanic_women': ['crace10', 'CRACE10', 'chispw', 'CHISPW'],
    'white_non_hispanic_men': ['crace11', 'CRACE11', 'cwhitm', 'CWHITM'],
    'white_non_hispanic_women': ['crace12', 'CRACE12', 'cwhitw', 'CWHITW'],
    'race_ethnicity_unknown_men': ['crace13', 'CRACE13', 'cunknm', 'CUNKNM'],
    'race_ethnicity_unknown_women': ['crace14', 'CRACE14', 'cunknw', 'CUNKNW'],
    'total_men': ['crace15', 'CRACE15', 'ctotalm', 'CTOTALM'],
    'total_women': ['crace16', 'CRACE16', 'ctotalw', 'CTOTALW'],
    'year': ['year', 'YEAR']
}

# Initialize an empty DataFrame to store clean data
clean_data = pd.DataFrame()

# Process each mapping
for new_col, old_cols in column_mappings.items():
    for old_col in old_cols:
        if old_col in raw_df.columns:  # Check if the old column exists in DataFrame
            if new_col not in clean_data:
                clean_data[new_col] = raw_df[old_col]
            else:
                # Combine non-NA values
                clean_data[new_col] = clean_data[new_col].combine_first(raw_df[old_col])

print(clean_data.shape)
clean_data.head()

(5797529, 21)


Unnamed: 0,unitid,majornum,cipcode,awlevel,nonresident_alien_men,nonresident_alien_women,black_non_hispanic_men,black_non_hispanic_women,american_indian_alaskan_men,american_indian_alaskan_women,...,asian_pacific_islander_women,hispanic_men,hispanic_women,white_non_hispanic_men,white_non_hispanic_women,race_ethnicity_unknown_men,race_ethnicity_unknown_women,total_men,total_women,year
0,100636.0,1.0,51.2206,3.0,0.0,0.0,2.0,4.0,0.0,0.0,...,0.0,0.0,0.0,14.0,12.0,3.0,3.0,19.0,19.0,2001
1,100654.0,1.0,52.0101,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2001
2,100663.0,1.0,9.0101,5.0,0.0,1.0,1.0,10.0,0.0,0.0,...,1.0,0.0,2.0,21.0,28.0,2.0,0.0,24.0,42.0,2001
3,100663.0,1.0,51.9999,5.0,0.0,0.0,1.0,2.0,0.0,0.0,...,0.0,1.0,0.0,2.0,12.0,0.0,0.0,4.0,14.0,2001
4,100663.0,1.0,26.0101,7.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,3.0,0.0,0.0,3.0,4.0,2001


---
### 3. **Data Storage**

Efficient data management is crucial for both maintaining the integrity of raw data and ensuring the accessibility of processed information. In this project, data is handled as follows:

- **Raw Data Storage**: The raw yearly files are stored locally in CSV format. Once merged, the combined DataFrame is written to a single Parquet file, which can be loaded and manipulated directly with tools like Pandas.

- **Processed Data Storage**: After cleaning and transforming the data, it is stored in Parquet format on the local filesystem. Parquet is a columnar storage file format that offers efficient data compression and encoding schemes. This format is optimized for performance in handling large datasets and provides excellent support for advanced data operations. By storing the processed data in Parquet files, we ensure that it is ready for efficient access and detailed analysis, significantly enhancing the performance of data retrieval and processing tasks.

In [38]:
# Save the cleaned DataFrame as a Parquet file
clean_data.to_parquet('cleaned_data.parquet')  

---
### 4. **Data Analytics**
- **Description**: Leveraging data to extract actionable insights and answer key questions about educational trends.
- **SQL-Based Analysis**: Utilization of SQLite to execute SQL queries on the processed data, enabling precise data manipulation and extraction.
- **Visualization**: Employing Matplotlib and Seaborn for static charts, alongside Plotly for dynamic, interactive visualizations.
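As a rough sketch of the SQL-based step, the cleaned data can be loaded into SQLite and queried from Pandas. The table name `completions`, the in-memory database, and the sample rows are assumptions for illustration; the column names match the cleaned dataset.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values).
sample = pd.DataFrame({
    "year": [2001, 2001, 2002],
    "awlevel": [5, 7, 5],
    "total_men": [24, 3, 10],
    "total_women": [42, 4, 15],
})

# In-memory database; the table name 'completions' is an assumption.
conn = sqlite3.connect(":memory:")
sample.to_sql("completions", conn, index=False)

# Yearly totals for bachelor's degrees (award level 5).
query = """
    SELECT year,
           SUM(total_men)   AS men,
           SUM(total_women) AS women
    FROM completions
    WHERE awlevel = 5
    GROUP BY year
    ORDER BY year
"""
by_year = pd.read_sql_query(query, conn)
conn.close()

print(by_year)
```

For the full dataset, a file-backed database would probably be preferable to `:memory:` so that the table does not have to be rebuilt in every session.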

## Analytics and Visualization Goals

The primary objective of our data analysis is to understand various educational trends, focusing on demographic distributions and degree completion rates across different academic disciplines. We aim to answer critical questions such as:

- What are the predominant trends in major selections among different demographic groups?
- How do completion rates vary among different racial and ethnic groups?
- What correlations exist between degree levels and student demographics?
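One way to approach questions like these is to aggregate counts by year and award level and then compare group shares. The sketch below uses hypothetical values and computes the share of awards conferred to women; the `women_share` column name is an assumption.

```python
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values).
sample = pd.DataFrame({
    "year": [2001, 2001, 2022, 2022],
    "awlevel": [5, 7, 5, 7],
    "total_men": [60, 20, 40, 10],
    "total_women": [40, 20, 60, 30],
})

# Sum counts per year and award level, then derive the share for one group.
totals = sample.groupby(["year", "awlevel"])[["total_men", "total_women"]].sum()
totals["women_share"] = totals["total_women"] / (
    totals["total_men"] + totals["total_women"]
)

# One row per year, one column per award level.
print(totals["women_share"].unstack("awlevel"))
```

The same pattern applies to any of the race and ethnicity columns by swapping in the relevant counts.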

### Detailed Sections

#### **Descriptive Analysis**
- Utilize descriptive statistics to provide a foundational understanding of the dataset, summarizing key features like central tendency, dispersion, and the shape of the dataset's distribution.
- Perform initial exploratory data analysis (EDA) to identify patterns, anomalies, or inconsistencies in the data.

In [39]:
clean_data.describe()

Unnamed: 0,unitid,majornum,cipcode,awlevel,nonresident_alien_men,nonresident_alien_women,black_non_hispanic_men,black_non_hispanic_women,american_indian_alaskan_men,american_indian_alaskan_women,...,asian_pacific_islander_women,hispanic_men,hispanic_women,white_non_hispanic_men,white_non_hispanic_women,race_ethnicity_unknown_men,race_ethnicity_unknown_women,total_men,total_women,year
count,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,2304057.0,5797529.0,5797529.0,5797529.0,5797529.0,...,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0,5797529.0
mean,200302.6,1.108518,38.02773,5.091122,0.8920149,0.640071,1.278536,2.532304,0.09329112,0.1497845,...,1.108821,1.707383,2.790939,8.219068,11.41709,0.7957928,1.089117,14.09935,20.26064,2012.292
std,82860.18,0.3110341,23.96211,3.660663,11.30516,6.335996,9.888295,19.43573,1.066686,1.558462,...,12.09771,17.67103,27.80695,53.24095,67.38594,9.417075,14.47786,87.49778,117.9142,6.185534
min,100636.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2001.0
25%,149222.0,1.0,15.0303,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2007.0
50%,186380.0,1.0,42.2804,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,2.0,2013.0
75%,220075.0,1.0,51.1105,5.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,4.0,6.0,0.0,0.0,7.0,10.0,2018.0
max,498571.0,2.0,99.0,21.0,2272.0,1576.0,1771.0,4706.0,422.0,303.0,...,1863.0,3899.0,5165.0,10464.0,13413.0,3931.0,7907.0,13847.0,28341.0,2022.0


### Descriptive Analysis and Insights

#### Overview
The dataset under investigation provides a comprehensive view of academic demographics and program classification at various educational institutions. With data entries amounting to nearly 5.8 million records, this extensive dataset includes a range of attributes from demographic information to program details, mapped through standardized classification codes.

#### Key Statistical Insights
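- **Count:** Most columns hold values for all 5,797,529 records, indicating no missing values in those fields. The exception visible in the summary is `nonresident_alien_women`, with only 2,304,057 non-null entries, which should be investigated before that column is used in analysis.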
- **Mean and Median Values:**
  - `unitid` holds unique identifiers for educational institutions, so its mean (about 200,303) and median (186,380) describe only how the identifiers are distributed, not any property of the institutions.
  - `majornum`, indicating the major number, has both a mean and a median of approximately 1, suggesting most data pertains to primary majors.
  - The `cipcode` mean of around 38 and median of around 42 hint at which ranges of program codes are most common. Because CIP codes are categorical labels, these figures should be read as a rough indication rather than a true numeric center.
  - `awlevel`, representing the award level, with a mean close to 5 and a median also at 5, may indicate a common award level across the dataset (potentially bachelor's degrees).
- **Standard Deviation:** High standard deviations in demographic counts like `nonresident_alien_men`, `black_non_hispanic_men`, and `white_non_hispanic_women` reflect significant variability in completion counts across different institutions or programs.
- **Minimum and Maximum Values:**
  - Minimum values are 0 across the demographic categories, while maxima differ widely: 2,272 for `nonresident_alien_men` versus 10,464 for `white_non_hispanic_men`. This spread could be due to the size and diversity of institutions.
- **Quartiles:**
  - The 25th percentile is 0 in the demographic categories, which suggests that many records report no awards for a given demographic segment.
  - The 75th percentile for `total_men` and `total_women` is 7 and 10, respectively, so three quarters of records fall at or below these values, while the maxima (13,847 and 28,341) show that there are extreme outliers.

#### Implications for Further Analysis
The substantial spread and variability in demographics related to non-resident aliens and specific ethnic groups could warrant a deeper dive to understand the factors influencing these distributions. The concentration of data around specific `cipcodes` and `awlevel` suggests commonality in program offerings and degree levels that dominate the dataset.

The disparity between the 75th percentile and max values in many demographic fields indicates the presence of outliers, likely representing very large institutions or those with specific demographic focuses. This aspect highlights the need for outlier management and normalization in further statistical or predictive analysis to ensure robustness and representativeness.
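A minimal sketch of one possible approach to outlier handling is shown below: flag values above the upper interquartile-range fence, and apply a `log1p` transform to compress the long right tail. The sample values are hypothetical, and the 1.5 × IQR threshold is a common convention rather than a choice made in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed counts resembling the summary above.
counts = pd.Series([0, 0, 1, 2, 2, 4, 7, 10, 13847], name="total_men")

# Upper fence of the interquartile-range rule.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = counts > upper_fence

# log1p keeps zeros at zero while shrinking very large values.
log_counts = np.log1p(counts)

print(upper_fence)
print(int(is_outlier.sum()))
print(log_counts.round(2).tolist())
```

Flagging rather than dropping outliers keeps large institutions in the data, which matters here because they account for a substantial share of all awards.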

In [40]:
# Ensure the 'images' directory exists
os.makedirs('images', exist_ok=True)

def plot_distribution_save(data, column, title, filename):
    """
    Plot and save the distribution of a variable using a histogram overlaid with a kernel density estimate (KDE).
    
    Parameters:
    data (DataFrame): The dataset containing the column to be plotted.
    column (str): The column name for which the distribution is plotted.
    title (str): The title of the plot.
    filename (str): The filename to save the plot.
    """
    plt.figure(figsize=(10, 6))
    sns.histplot(data[column], kde=True, color='blue', edgecolor='black')
    plt.title(title)
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.savefig(f'images/{filename}.png')
    plt.close()

def boxplot_save(data, column, title, filename):
    """
    Plot and save a boxplot for a specified column.
    
    Parameters:
    data (DataFrame): The dataset containing the data.
    column (str): The column name for which the boxplot is plotted.
    title (str): The title of the plot.
    filename (str): The filename to save the plot.
    """
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=data[column])
    plt.title(title)
    plt.xlabel(column)
    plt.grid(True)
    plt.savefig(f'images/{filename}.png')
    plt.close()

def correlation_heatmap_save(data, title, filename):
    """
    Plot and save a correlation heatmap of the DataFrame.
    
    Parameters:
    data (DataFrame): The dataset containing the data.
    title (str): The title of the heatmap.
    filename (str): The filename to save the heatmap.
    """
    plt.figure(figsize=(12, 10))
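    # Restrict to numeric columns so non-numeric data cannot break the correlation
    correlation_matrix = data.select_dtypes('number').corr()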
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=2, linecolor='black')
    plt.title(title)
    plt.savefig(f'images/{filename}.png')
    plt.close()

# Calling the functions
plot_distribution_save(clean_data, 'majornum', 'Distribution of Major Numbers', 'majornum_distribution')
plot_distribution_save(clean_data, 'cipcode', 'Distribution of CIP Codes', 'cipcode_distribution')
plot_distribution_save(clean_data, 'awlevel', 'Distribution of Award Levels', 'awlevel_distribution')
boxplot_save(clean_data, 'awlevel', 'Boxplot for Award Levels', 'awlevel_boxplot')
correlation_heatmap_save(clean_data, 'Correlation Matrix of Variables', 'correlation_matrix')

# Printing out the filenames for your review
print("Generated files:")
for file in os.listdir("images"):
    print(file)

Generated files:
american_indian_alaskan_men_popularity_over_time.png
american_indian_alaskan_women_popularity_over_time.png
asian_pacific_islander_men_popularity_over_time.png
asian_pacific_islander_women_popularity_over_time.png
awlevel_boxplot.png
awlevel_distribution.png
black_non_hispanic_men_popularity_over_time.png
black_non_hispanic_women_popularity_over_time.png
cipcode_distribution.png
correlation_matrix.png
hispanic_men_popularity_over_time.png
hispanic_women_popularity_over_time.png
majornum_distribution.png
nonresident_alien_men_popularity_over_time.png
nonresident_alien_women_popularity_over_time.png
race_ethnicity_unknown_men_popularity_over_time.png
race_ethnicity_unknown_women_popularity_over_time.png
total_men_popularity_over_time.png
total_women_popularity_over_time.png
white_non_hispanic_men_popularity_over_time.png
white_non_hispanic_women_popularity_over_time.png


### Detailed Analysis and Insights from Visualizations

#### 1. **Distribution of Major Numbers**
![Distribution of Major Numbers](images/majornum_distribution.png)
The histogram for `majornum` shows nearly all records at 1, with a sparse presence at 2. This indicates that the majority of the entries in the dataset concern primary majors (indicated by a `majornum` of 1), with a smaller proportion representing secondary majors (indicated by a `majornum` of 2). The absence of values between these two points reflects that `majornum` is a two-valued flag rather than a continuous measure.

#### 2. **Distribution of CIP Codes**
![Distribution of CIP Codes](images/cipcode_distribution.png)
The distribution of CIP codes shows several peaks, indicating concentrations of particular fields of study. Notable spikes around codes like 10, 40, and 50 reflect popular academic fields, while lower frequencies between these peaks may represent specialized or less common fields. Because the histogram bins neighboring codes together, the peak near 50 most likely captures the adjacent two-digit series 51 (Health Professions) and 52 (Business), both of which are prevalent across institutions. The peak near 40 would need a per-series count before it can be attributed to a specific field.
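
Since CIP codes are category labels rather than quantities, counting records per two-digit series may be easier to read than a histogram. The sketch below assumes `cipcode` is stored as a float such as `51.3801`, whose integer part identifies the series. The `toy` frame and the `cip_series` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of six-digit CIP codes stored as floats
toy = pd.DataFrame(
    {"cipcode": [51.3801, 51.0000, 52.0201, 52.0301, 52.0201, 11.0701, 24.0101]}
)

# The integer part of a CIP code is its two-digit series
toy["cip_series"] = np.floor(toy["cipcode"]).astype(int)

# Count records per series, ordered by series number
series_counts = toy["cip_series"].value_counts().sort_index()
print(series_counts)
```

A bar chart of `series_counts` would show each series separately instead of merging neighbors into one histogram bin.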

#### 3. **Distribution of Award Levels**
![Distribution of Award Levels](images/awlevel_distribution.png)
The histogram for `awlevel` demonstrates significant activity at levels 3, 5, and above 10, suggesting a commonality of associate degrees, bachelor's degrees, and advanced degrees. The spikes at specific levels might correlate with standard educational pathways, with few entries at intermediate levels, indicating less common types of certifications or diplomas.
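
Award levels are also coded categories, so a labeled frequency table can complement the histogram. In this sketch the labels for codes 3 and 5 follow the reading given above, every other code is left unlabeled, and the `toy` frame and `awlevel_labels` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical sample of award-level codes
toy = pd.DataFrame({"awlevel": [3, 5, 5, 3, 5, 7, 17, 1, 5]})

# Labels for 3 and 5 follow the text above; unmapped codes keep their number
awlevel_labels = {3: "Associate's degree", 5: "Bachelor's degree"}
toy["awlevel_label"] = toy["awlevel"].map(awlevel_labels).fillna(
    "code " + toy["awlevel"].astype(str)
)

level_counts = toy["awlevel_label"].value_counts()
print(level_counts)
```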

#### 4. **Boxplot for Award Levels**
![Boxplot for Award Levels](images/awlevel_boxplot.png)
The boxplot for award levels shows that the interquartile range is tightly concentrated around lower levels, with outliers scattered across higher values. This concentration at lower award levels (like associate and bachelor's degrees) likely reflects the commonality of these degrees as terminal educational achievements in many fields.

#### 5. **Correlation Matrix of Variables**
![Correlation Matrix of Variables](images/correlation_matrix.png)
The correlation matrix provides insight into how different demographic and institutional variables interact with each other. For instance:
- Strong positive correlations between gender-specific columns (e.g., `hispanic_men` and `hispanic_women`) suggest similar enrollment patterns across genders within the same ethnic group.
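- `total_men` and `total_women` show high positive correlation, indicating that programs with many men also tend to have many women. This reflects program size rather than gender balance, which would require comparing shares.
- The low correlation between `unitid` and most demographic variables is expected, since `unitid` is an institution identifier rather than a measurement. It should not be read as evidence about how demographics vary across institutions.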
- The negative correlations seen sporadically throughout the matrix might indicate inverse relationships in enrollment trends between different demographic groups across certain programs or institutions.

Together, these visualizations and their analysis provide a richer understanding of the structure and dynamics within the data, enabling targeted inquiries and informed decision-making in subsequent analyses.
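
As a possible refinement of the correlation step, the sketch below drops identifier columns and uses a rank-based (Spearman) correlation, which is less sensitive to the heavy right skew of the count columns. The choice of Spearman is an assumption, not the notebook's method, and the `toy` values and the `id_columns` list are hypothetical.

```python
import pandas as pd

# Hypothetical rows; `unitid` is an identifier, the rest are counts
toy = pd.DataFrame({
    "unitid": [100654, 100663, 100706, 100724, 100751],
    "hispanic_men": [0, 2, 5, 1, 40],
    "hispanic_women": [1, 3, 9, 2, 55],
    "total_men": [3, 10, 30, 7, 400],
    "total_women": [5, 14, 45, 9, 520],
})

# Drop identifier columns, then rank-correlate the remaining numeric columns
id_columns = ["unitid"]
demo_corr = (
    toy.drop(columns=id_columns)
    .select_dtypes("number")
    .corr(method="spearman")
)
print(demo_corr.round(2))
```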

In [41]:
# Ensure the images directory exists
os.makedirs('images', exist_ok=True)

# Load and clean the CIPCode DataFrame
cip_code_df = pd.read_csv('Data/CIPCode2010.csv')
cip_code_df['CIPCode'] = cip_code_df['CIPCode'].str.replace(r'^="|"$', '', regex=True)
cip_code_df['CIPCode'] = pd.to_numeric(cip_code_df['CIPCode'], errors='coerce')

# Exclude entries with cipcode 99.0000 and convert 'cipcode' to float for merging
clean_data['cipcode'] = pd.to_numeric(clean_data['cipcode'], errors='coerce')
filtered_data = clean_data[clean_data['cipcode'] != 99.0000]

# Ensure all demographic columns are included in the aggregation
demographics = ['nonresident_alien_men', 'nonresident_alien_women', 'black_non_hispanic_men', 'black_non_hispanic_women', 'american_indian_alaskan_men', 'american_indian_alaskan_women', 'asian_pacific_islander_men', 'asian_pacific_islander_women', 'hispanic_men', 'hispanic_women', 'white_non_hispanic_men', 'white_non_hispanic_women', 'race_ethnicity_unknown_men', 'race_ethnicity_unknown_women', 'total_men', 'total_women']
aggregated_data = filtered_data.groupby(['year', 'cipcode']).agg({demo: 'sum' for demo in demographics}).reset_index()

# Merge with CIP code descriptions
result = pd.merge(aggregated_data, cip_code_df, left_on='cipcode', right_on='CIPCode', how='left')
result['cipcode'] = result['CIPTitle'].fillna(result['cipcode'])

# Generate and save plots for each demographic
for demo in demographics:
    top_5_per_year = result.groupby('year').apply(lambda x: x.nlargest(5, demo)).reset_index(drop=True)
    total_enrollments_per_year = result.groupby('year')[demo].sum()
    top_5_per_year['rate_per_1000'] = top_5_per_year.apply(
        lambda row: (row[demo] / total_enrollments_per_year.loc[row['year']] * 1000 if total_enrollments_per_year.loc[row['year']] != 0 else 0), axis=1)
    total_counts = top_5_per_year.groupby('cipcode')[demo].sum()
    sorted_cipcodes = total_counts.sort_values(ascending=False).index

    plt.figure(figsize=(14, 8))
    for cip in sorted_cipcodes:
        subset = top_5_per_year[top_5_per_year['cipcode'] == cip]
        plt.plot(subset['year'], subset['rate_per_1000'], label=f'{cip} ({int(total_counts[cip])} total)')

    plt.title(f'Degree Popularity per 1,000 {demo.replace("_", " ").title()} Over Years')
    plt.xlabel('Year')
    plt.ylabel(f'Enrollment Rate per 1,000 {demo.replace("_", " ").title()}')
    
    # Adjust the layout and legend position
    plt.tight_layout(rect=[0, 0, 0.75, 1])  # Adjust the rect to make room for the legend
    plt.legend(title='CIP Codes', bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
    
    plt.grid(True)
    plot_filename = f'images/{demo}_popularity_over_time.png'
    plt.savefig(plot_filename, bbox_inches='tight')  # Save with bbox_inches='tight' to include legend
    plt.close()
    print(f'Saved {plot_filename}')

Saved images/nonresident_alien_men_popularity_over_time.png
Saved images/nonresident_alien_women_popularity_over_time.png
Saved images/black_non_hispanic_men_popularity_over_time.png
Saved images/black_non_hispanic_women_popularity_over_time.png
Saved images/american_indian_alaskan_men_popularity_over_time.png
Saved images/american_indian_alaskan_women_popularity_over_time.png
Saved images/asian_pacific_islander_men_popularity_over_time.png
Saved images/asian_pacific_islander_women_popularity_over_time.png
Saved images/hispanic_men_popularity_over_time.png
Saved images/hispanic_women_popularity_over_time.png
Saved images/white_non_hispanic_men_popularity_over_time.png
Saved images/white_non_hispanic_women_popularity_over_time.png
Saved images/race_ethnicity_unknown_men_popularity_over_time.png
Saved images/race_ethnicity_unknown_women_popularity_over_time.png
Saved images/total_men_popularity_over_time.png
Saved images/total_women_popularity_over_time.png
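
The rate used in these plots is a field's count divided by the demographic's total across all fields in the same year, multiplied by 1,000. A vectorized way to compute the same quantity, without the row-wise `apply`, might look like the sketch below. The `toy` frame and its field names are hypothetical.

```python
import pandas as pd

# Hypothetical aggregated counts for one demographic column
toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002, 2002],
    "cipcode": ["Business", "Nursing", "Welding", "Business", "Nursing"],
    "total_women": [600, 300, 100, 500, 500],
})

# Yearly total broadcast back onto each row
yearly_total = toy.groupby("year")["total_women"].transform("sum")

# where() turns zero totals into NaN so the division is safe; fillna restores 0
toy["rate_per_1000"] = (
    toy["total_women"] * 1000 / yearly_total.where(yearly_total != 0)
).fillna(0)
print(toy)
```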


### Detailed Analysis of Degree Popularity Over Time by Demographic

#### 1. **Total Men**
![Total Men Degree Popularity](images/total_men_popularity_over_time.png)
The graph for total men shows steady interest in Business Administration and Liberal Arts, with both programs maintaining high enrollment rates per 1,000 men over the years. There is notable stability in these choices, likely reflecting the broader appeal and applicability of these fields in various career paths. Technical fields like Mechanical Engineering and Computer Science show growth, which may be tied to the increasing focus on technology and innovation in the job market. Notably, there's a slight decline in Business Administration toward the end, potentially indicating a shift towards more specialized or newly emerging fields.

#### 2. **Total Women**
![Total Women Degree Popularity](images/total_women_popularity_over_time.png)
For total women, there's a significant rise in enrollment in Nursing and Psychology, illustrating a strong interest in health and social sciences. This could reflect broader societal trends where there is increasing emphasis on healthcare professions and mental health awareness. Business Administration also shows a growing trend, highlighting that women are increasingly entering fields traditionally dominated by men, promoting gender diversity in the professional sphere.

#### 3. **Black Non-Hispanic Men**
![Black Non-Hispanic Men Degree Popularity](images/black_non_hispanic_men_popularity_over_time.png)
The chart for black non-Hispanic men indicates a high and stable enrollment in Business Administration and Liberal Arts, similar to the trend observed in the total men demographic. There’s a noteworthy presence in vocational training fields such as Welding Technology and HVAC, which might suggest a strong orientation towards immediate employment opportunities and practical skills. The dips and rises in these technical fields could reflect economic cycles and local job market demands.

#### 4. **Black Non-Hispanic Women**
![Black Non-Hispanic Women Degree Popularity](images/black_non_hispanic_women_popularity_over_time.png)
For black non-Hispanic women, Nursing and Cosmetology are the standout fields, both showing peaks that suggest cycles of high demand or popularity. The trend in Nursing aligns with global needs for healthcare professionals, whereas the interest in Cosmetology could be driven by cultural trends and the personal care industry’s growth. Psychology also shows a consistent increase, which may indicate a growing interest and recognition of the importance of mental health.

#### 5. **Nonresident Alien Men**
![Nonresident Alien Men Degree Popularity](images/nonresident_alien_men_popularity_over_time.png)
Nonresident alien men show distinct preferences for technical and high-skill areas such as Computer Science and Electrical Engineering, likely reflecting the international appeal of technical degrees that promise high returns on investment in terms of career opportunities, especially in the tech industry. The sharp declines after peaks might indicate changes in immigration policies, economic conditions, or shifts in the global educational landscape.

#### 6. **Nonresident Alien Women**
![Nonresident Alien Women Degree Popularity](images/nonresident_alien_women_popularity_over_time.png)
The trends among nonresident alien women are most pronounced in Business Administration and Liberal Arts, emphasizing the broad appeal of these fields. Notably, there is significant variability in specialized fields like Electrical Engineering and Nursing, possibly reflecting targeted career paths that are influenced by the prospects back in their home countries or in the global job market.

#### 7. **Asian Pacific Islander Men**
![Asian Pacific Islander Men Degree Popularity](images/asian_pacific_islander_men_popularity_over_time.png)
- **Trends**: The enrollment for Business Administration peaks around 2010 and then gradually declines, while Computer Science sees a steady increase post-2010, reflecting a shift in vocational preference towards STEM fields.
- **Observations**: The sharp increase in enrollments in Economics and Computer Science around 2010 may correlate with market demands and increased job opportunities in these fields. A notable decline in traditional fields like Electrical Engineering suggests a pivot towards more modern and versatile disciplines.

#### 8. **Asian Pacific Islander Women**
![Asian Pacific Islander Women Degree Popularity](images/asian_pacific_islander_women_popularity_over_time.png)
- **Trends**: Nursing rises steadily from the mid-2000s, then peaks and stabilizes, indicating a strong, sustained demand for healthcare professionals. Business Administration maintains a high enrollment rate, though it experiences some fluctuations.
- **Observations**: The high enrollment in Nursing and Psychology might reflect societal trends and employment stability in these fields. The decline in Cosmetology and Accounting suggests shifting career interests among Asian Pacific Islander women.

#### 9. **Hispanic Men**
![Hispanic Men Degree Popularity](images/hispanic_men_popularity_over_time.png)
- **Trends**: Liberal Arts shows an initial increase but begins to wane after 2010. In contrast, vocational programs like Automotive Mechanics and Welding see fluctuating but generally stable interest.
- **Observations**: The rise and fall of Liberal Arts could be tied to economic factors where students might choose more directly vocational pathways during economic downturns. The steady interest in trade skills highlights a consistent demand for practical and applied skills.

#### 10. **Hispanic Women**
![Hispanic Women Degree Popularity](images/hispanic_women_popularity_over_time.png)
- **Trends**: Business Administration and Nursing show strong growth, with Nursing peaking in the late 2010s. Liberal Arts experiences a significant decline after 2010.
- **Observations**: The growth in Nursing and decline in Liberal Arts may reflect a strategic choice towards professions with more perceived job security and financial stability.

#### 11. **Race Ethnicity Unknown Men**
![Race Ethnicity Unknown Men Degree Popularity](images/race_ethnicity_unknown_men_popularity_over_time.png)
- **Trends**: There is a notable diversity in fields with initial high enrollments in Business Administration and Law, but a sharp decline post-2010 in all fields except Political Science and Government.
- **Observations**: The declines could be indicative of changing demographic profiles or data collection methods that better classify race and ethnicity over time.

#### 12. **Race Ethnicity Unknown Women**
![Race Ethnicity Unknown Women Degree Popularity](images/race_ethnicity_unknown_women_popularity_over_time.png)
- **Trends**: Registered Nursing and Business Administration see growth, while fields like Cosmetology decline significantly.
- **Observations**: This shift suggests a move towards more academically demanding and financially rewarding fields, possibly influenced by broader economic factors and societal values on education and gender roles.

#### 13. **White Non-Hispanic Men**
![White Non-Hispanic Men Degree Popularity](images/white_non_hispanic_men_popularity_over_time.png)
- **Trends**: Steady interest in Business Administration and a notable increase in Political Science and Government post-2010.
- **Observations**: The increase in Political Science might reflect greater political engagement or career prospects in public service and governance.

#### 14. **White Non-Hispanic Women**
![White Non-Hispanic Women Degree Popularity](images/white_non_hispanic_women_popularity_over_time.png)
- **Trends**: Consistent high enrollments in Nursing and Business Administration, with a sharp increase in Nursing around 2010.
- **Observations**: This trend underscores the ongoing demand for healthcare professionals and the appeal of stable, well-paying jobs in the nursing sector.

#### 15. **American Indian Alaskan Men**
![American Indian Alaskan Men Degree Popularity](images/american_indian_alaskan_men_popularity_over_time.png)
- **Trends**: Volatile interest across various fields, with no clear long-term growth in any specific area.
- **Observations**: The fluctuating data may indicate challenges in higher education access or participation among this demographic.

#### 16. **American Indian Alaskan Women**
![American Indian Alaskan Women Degree Popularity](images/american_indian_alaskan_women_popularity_over_time.png)
- **Trends**: Similar to their male counterparts, showing volatile interest with slight increases in fields like Nursing and Psychology.
- **Observations**: These changes might reflect shifting priorities towards fields that offer community-focused roles or stable employment.


# Results

The data presented in the graphs reveals several interesting trends and disparities in degree popularity across various demographics and genders. This analysis will delve into these differences, discussing potential factors contributing to the observed patterns and their implications for higher education and the workforce.

## Gender Disparities

### Business Administration

Business Administration maintains a strong appeal across both genders, but there are notable differences:

- Women show a steady increase in enrollment over time, potentially indicating a growing interest in business careers and a shift towards greater gender equality in the field.
- Men, while still enrolling in high numbers, show a slight decline towards the end of the observed period. This could suggest a shift towards more specialized or emerging fields.

Possible explanations:
- Increasing emphasis on gender diversity in the corporate world may have encouraged more women to pursue business degrees.
- Changing job market demands and the rise of new industries might have drawn some men away from traditional business programs.

### Nursing

Nursing exhibits a significant gender disparity, with women consistently enrolling at much higher rates than men.

Possible explanations:
- Historical gender roles and stereotypes associating nursing with feminine caregiving may have influenced this trend.
- The perceived stability and growth of the healthcare industry might make nursing attractive to students who prioritize job security, although this dataset cannot show whether that priority differs by gender.
- Lack of male role models in nursing could perpetuate the gender imbalance.
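
A gender disparity of this kind can be quantified as the share of women within each field. The following is a minimal sketch with hypothetical numbers. The `women_share` column is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical per-field totals
toy = pd.DataFrame({
    "cipcode": ["Nursing", "Business Administration", "Welding"],
    "total_men": [100, 500, 90],
    "total_women": [900, 500, 10],
})

# Share of women among all students counted in each field
toy["women_share"] = toy["total_women"] / (toy["total_men"] + toy["total_women"])
print(toy.sort_values("women_share", ascending=False))
```

A share near 0.5 indicates balance, while values near 0 or 1 indicate a field dominated by one gender.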

## Racial and Ethnic Disparities

### Liberal Arts

Liberal Arts enrollment shows varying trends across different racial and ethnic groups:

- Black non-Hispanic men and women maintain relatively stable interest, suggesting the broad appeal and versatility of a liberal arts education.
- Hispanic men and women, as well as those with unknown race/ethnicity, experience significant declines in liberal arts enrollment over time.

Possible explanations:
- Economic pressures and job market uncertainties might lead some groups to favor more directly vocational or professional degrees.
- Cultural values and family expectations could influence the perceived utility of a liberal arts education.
- Differential access to resources, such as career guidance and exposure to a wide range of academic options, might shape degree choices.

### STEM Fields

Enrollment in STEM fields like Computer Science and Engineering shows notable disparities:

- Asian Pacific Islander and nonresident alien men exhibit strong and growing interest in these fields.
- Other racial and ethnic groups, particularly underrepresented minorities, have lower enrollment rates.

Possible explanations:
- Early exposure and access to STEM education and resources can greatly influence degree choices.
- Cultural factors, such as family emphasis on STEM careers or the presence of role models, may contribute to the high enrollment of Asian Pacific Islander and nonresident alien students.

### Vocational and Technical Programs

Enrollment in vocational and technical programs, such as Welding and Automotive Mechanics, varies across demographics:

- These programs maintain relatively stable interest among Hispanic and black non-Hispanic men.
- Women across all racial and ethnic groups have lower enrollment rates in these fields.

Possible explanations:
- Traditional gender roles and societal expectations may influence the perceived suitability of these careers for men and women.
- The immediate employment opportunities and practical skills offered by these programs might appeal to groups facing economic pressures.
- Lack of exposure and limited access to vocational education in some communities could affect enrollment patterns.

## Intersectional Considerations

It's essential to recognize that students' identities are multidimensional, and the intersection of race, ethnicity, gender, and other factors can shape their experiences and choices in unique ways.

For example:
- Black non-Hispanic women show high enrollment in both Nursing and Cosmetology, suggesting the influence of both gender and race on career paths.
- The low and fluctuating enrollment of American Indian and Alaskan students across fields underscores the need to consider the specific challenges and barriers faced by indigenous communities.

## Conclusion

The analysis of degree popularity trends across demographics and genders reveals significant disparities and highlights the multifaceted influences on students' educational choices. By understanding these patterns and their underlying factors, educators, policymakers, and industry leaders can better tailor academic offerings and support services to a diverse student body.

# System Architecture and Decision Making

The architecture and design of this data analytics project were carefully considered to ensure efficiency, scalability, and reproducibility. The key components of the architecture include data ingestion, data cleanup and transformation, data storage, and data analytics.

### Data Ingestion
- The data ingestion process was designed to efficiently aggregate datasets from multiple years into a unified data framework. This design choice enables comprehensive analysis across time periods, facilitating the identification of trends and patterns.
- The decision to scrape annual datasets from the Integrated Postsecondary Education Data System (IPEDS) means the analysis rests on a single, reputable source collected under a consistent reporting framework. This approach supports data integrity and comparability across years.
- The choice to concatenate individual datasets into a single DataFrame streamlines the data processing pipeline, reducing complexity and enabling more efficient analysis.
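
The concatenation step described above could be sketched as follows. The in-memory frames stand in for the annual IPEDS files, and the `frames_by_year` dictionary is a hypothetical name, so this illustrates the idea rather than reproducing the project's ingestion code.

```python
import pandas as pd

# Hypothetical per-year frames standing in for the annual IPEDS files
frames_by_year = {
    2001: pd.DataFrame({"unitid": [1, 2], "total_men": [10, 20]}),
    2002: pd.DataFrame({"unitid": [1, 2], "total_men": [12, 18]}),
}

# Tag each frame with its year, then stack them into one DataFrame
combined = pd.concat(
    [frame.assign(year=year) for year, frame in frames_by_year.items()],
    ignore_index=True,
)
print(combined)
```

Adding the `year` column before concatenating keeps each row traceable to its source file.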

### Data Cleanup and Transformation
- The data cleanup and transformation phase was designed to standardize and cleanse the raw data, preparing it for analysis. This step is crucial for ensuring data quality and consistency.
- The use of Pandas for data manipulation and NumPy for numerical operations leverages the strengths of these widely-used libraries, providing a robust and efficient toolkit for data transformation tasks.
- The creation of a `column_mappings` dictionary to align new, uniform column names with their various labels in the dataset demonstrates a systematic approach to data standardization. This technique ensures consistency and facilitates the integration of data from different sources or years.
- The decision to handle duplicates and missing values by selecting the first non-null value for each entry effectively manages data inconsistencies while preserving data integrity.
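
A minimal sketch of the mapping idea follows, assuming each uniform column name maps to a list of labels used in different years. The raw labels and values below are illustrative only, and the real `column_mappings` dictionary may be structured differently.

```python
import pandas as pd

# Hypothetical mapping: uniform name -> labels it may carry in different years
column_mappings = {
    "total_men": ["CRACE15", "CTOTALM"],
    "total_women": ["CRACE16", "CTOTALW"],
}

# Hypothetical raw rows where each year populates only one of the labels
raw = pd.DataFrame({
    "CRACE15": [10.0, None],
    "CTOTALM": [None, 12.0],
    "CRACE16": [15.0, None],
    "CTOTALW": [None, 17.0],
})

clean = pd.DataFrame(index=raw.index)
for uniform_name, source_labels in column_mappings.items():
    present = [label for label in source_labels if label in raw.columns]
    # Back-fill across columns so the first column holds the first non-null value
    clean[uniform_name] = raw[present].bfill(axis=1).iloc[:, 0]
print(clean)
```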

### Data Storage
- The data storage architecture was designed to optimize performance and accessibility for large datasets. The choice to store the raw data in CSV format allows for easy import and manipulation using tools like Pandas.
- The decision to store the processed data in Parquet format on the local filesystem leverages the benefits of columnar storage, which offers efficient data compression and encoding schemes. This approach significantly enhances the performance of data retrieval and processing tasks, making it well-suited for handling large datasets.

### Data Analytics
- The data analytics component was designed to extract actionable insights and answer key questions about educational trends. The use of SQLite for SQL-based analysis enables precise data manipulation and extraction, while the utilization of Matplotlib, Seaborn, and Plotly for visualization provides a range of options for creating informative and engaging visual representations of the data.
- The choice to perform descriptive analysis and generate visualizations such as histograms, box plots, and correlation matrices allows for a comprehensive understanding of the dataset's characteristics and relationships between variables. These techniques provide a solid foundation for further exploratory analysis and hypothesis testing.
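
The SQL-based analysis mentioned above might follow a pattern like this sketch, which loads a frame into an in-memory SQLite database and aggregates it with a query. The table name `completions` and the sample values are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical slice of the cleaned data
toy = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "awlevel": [3, 5, 3, 5],
    "total_women": [40, 60, 45, 70],
})

# An in-memory database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
toy.to_sql("completions", conn, index=False)

query = """
    SELECT year, SUM(total_women) AS women
    FROM completions
    GROUP BY year
    ORDER BY year
"""
yearly = pd.read_sql_query(query, conn)
conn.close()
print(yearly)
```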

### Scalability and Future Enhancements
- The architecture of this project was designed with scalability in mind. The use of Parquet format for data storage enables efficient handling of large datasets, allowing for future growth and expansion.
- The modular design of the data processing pipeline, with separate stages for data ingestion, cleanup, transformation, and analytics, allows for easy extension and modification. New data sources or analysis techniques can be easily integrated into the existing framework.
- The use of widely-adopted libraries and tools such as Pandas, NumPy, Matplotlib, and Seaborn ensures compatibility and ease of maintenance, as well as access to a wide range of resources and community support.

The architecture and design decisions in this data analytics project prioritize efficiency, scalability, and reproducibility. The chosen tools and methodologies enable comprehensive analysis, insightful visualizations, and the ability to handle large datasets. The modular design allows for future enhancements and extensions, making it a robust foundation for ongoing educational data analysis and research.
