In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [2]:
# Read the Excel file into a DataFrame
df = pd.read_excel('quiz1.xlsx', sheet_name='usnews3.data.9 .SS (v5.0)')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'quiz1.xlsx'

In [None]:
df.describe().T

In [None]:
#Check if there are any duplicate colleges based on both name and state.

is_duplicate = df.duplicated(subset=['College Name', 'State'])
duplicated_rows = df[is_duplicate]
duplicated_rows

In [None]:
#Summary statistics for the target variable Graduation Rate

target_stats = df['Graduation rate'].describe()
print(target_stats)

The maximum value of the target variable is 118, which exceeds the 100% ceiling that a graduation rate can logically reach.
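
A range check like this can be made reusable. The sketch below is a minimal illustration on toy data, not the notebook's actual dataset; the helper name `out_of_range` and the toy rows are assumptions introduced here.

```python
import pandas as pd

def out_of_range(df, column, low=0.0, high=100.0):
    """Return rows whose value in `column` lies outside [low, high].

    Missing values are not flagged; they are handled separately.
    """
    values = df[column]
    mask = values.notna() & ~values.between(low, high)
    return df[mask]

# Toy data standing in for the real DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "College Name": ["A", "B", "C", "D"],
    "Graduation rate": [57.0, 118.0, None, 100.0],
})

suspect = out_of_range(toy, "Graduation rate")
print(suspect)
```

On the real data, the same call with `df` in place of `toy` would list every college whose rate falls outside 0 to 100, rather than relying on the maximum alone.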

In [None]:
# Plot histogram
plt.hist(df['Graduation rate'], bins=25, edgecolor='k')
plt.xlabel('Graduation Rate')
plt.ylabel('Frequency')
plt.title('Distribution of Graduation Rates')
plt.show()

The histogram of the target variable shows one college above the 100% graduation rate. Next, we take a closer look at that college.

In [None]:
df[df['Graduation rate'] == 118.0]

In [None]:
# Overwrite the impossible 118 recorded for Cazenovia College with 57.8
df.loc[df['College Name'] == 'Cazenovia College', 'Graduation rate'] = 57.8

In [None]:
df['Graduation rate'].describe()

After the correction, the maximum graduation rate is 100.

In [None]:
null_counts = df.isnull().sum()
print(null_counts)

Before dealing with null values, I want to remove the rows that have nulls in the target column, because an imputed target is not guaranteed to reflect the true graduation rate.

In [None]:
df = df.dropna(subset=['Graduation rate'])
null_counts = df.isnull().sum()
print(null_counts)

In [None]:
# Group the data by the 'Public (1)/ Private (2)' column and calculate the mean graduation rate
grouped = df.groupby('Public (1)/ Private (2)')
mean_grad_rate = grouped['Graduation rate'].mean()

# Plot the mean graduation rates
labels = ['Public', 'Private']
plt.bar(labels, mean_grad_rate)
plt.xlabel('College Type')
plt.ylabel('Mean Graduation Rate')
plt.title('Mean Graduation Rate by College Type')
plt.show()

College type will likely be a good indicator of graduation rate: private colleges graduate students at an average rate over 10% higher than public universities.
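
The bar chart relies on the group means coming back in the order 1 (public), 2 (private). A minimal sketch of a safer variant, assuming the codes 1 and 2 described in the column name, maps each code to its label explicitly so the labels cannot be misaligned. The toy values and the `type_labels` dictionary are illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical values)
toy = pd.DataFrame({
    "Public (1)/ Private (2)": [2, 1, 2, 1],
    "Graduation rate": [80.0, 60.0, 70.0, 50.0],
})

# Explicit code-to-label mapping, taken from the column name
type_labels = {1: "Public", 2: "Private"}

mean_by_type = (
    toy.groupby("Public (1)/ Private (2)")["Graduation rate"]
    .mean()
    .rename(index=type_labels)
)

# Difference between the two group means, in the units of the rate itself
gap = mean_by_type["Private"] - mean_by_type["Public"]
print(mean_by_type)
print(f"Private minus public: {gap:.1f}")
```

The relabeled series can be passed to `plt.bar(mean_by_type.index, mean_by_type.values)`, so the tick labels come from the data.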

In [None]:
# Impute null values with column medians
df_filled = df.fillna(df.median(numeric_only=True))

# Check for remaining null values
null_counts = df_filled.isnull().sum()
print(null_counts)
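
Median imputation with `numeric_only=True` fills only the numeric columns. The toy example below (hypothetical column values) shows that a missing value in a text column such as `State` would survive this step, which is why the null counts are worth re-checking afterwards.

```python
import numpy as np
import pandas as pd

# Toy data: one text column and one numeric column, each with a missing value
toy = pd.DataFrame({
    "State": ["NY", None, "CA"],
    "Tuition": [10.0, np.nan, 30.0],
})

# Same pattern as above: fill each numeric column with its own median
filled = toy.fillna(toy.median(numeric_only=True))

remaining = filled.isnull().sum()
print(filled)
print(remaining)
```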

In [None]:
df.hist(figsize=(15,10))
plt.subplots_adjust(hspace=.5);

# PCA Analysis for Feature Selection

In [None]:
# Drop the "College Name" and "State" columns from the DataFrame used for PCA
df_pca = df_filled.drop(["College Name", "State"], axis=1)

# Create separate variables for the "College Name" and "State" columns
college_names = df_filled["College Name"]
states = df_filled["State"]

In [None]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_pca)

# Apply PCA
pca = PCA(n_components=0.9)
X_pca = pca.fit_transform(X_scaled)
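
Passing a float to `n_components` asks scikit-learn to keep just enough components for the cumulative explained variance to exceed that fraction. The sketch below checks this on synthetic data (the random matrix is an assumption made for illustration) by comparing the fitted `n_components_` with the count derived from a full PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features standing in for the real data
rng = np.random.default_rng(0)
base = rng.normal(size=(150, 3))
mixed = base @ rng.normal(size=(3, 3)) + 0.2 * rng.normal(size=(150, 3))
X = np.column_stack([base, mixed])
X_scaled = StandardScaler().fit_transform(X)

# Full PCA: cumulative share of variance after each component
full = PCA().fit(X_scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 0.9
expected_k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# PCA with a float threshold should keep the same number of components
pca_90 = PCA(n_components=0.9).fit(X_scaled)
print(expected_k, pca_90.n_components_)
```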

In [None]:
# Plot cumulative explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance by Number of Components')

plt.show()

In [None]:
# Plot scatter plot of PCA components
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA Components')
plt.show()

The scatter plot of the first two principal components shows how the data points are distributed, and whether they cluster, in the reduced-dimensional space.

The points appear to follow a parabolic or curved pattern, which indicates a potential non-linear relationship in the data. The curvature suggests that there may be interactions or higher-order patterns that a linear model would not easily capture.

In [None]:
# Get explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Print explained variance ratio for each component
for i, variance_ratio in enumerate(explained_variance_ratio):
    print(f"Component {i+1}: {variance_ratio:.4f}")

Components 1 and 2 together explain approximately 53% of the total variance in the dataset, so they capture a substantial share of the variability in the original features. Explained variance describes the inputs, however, not predictive power: whether these components are useful for predicting the target variable, graduation rate, still has to be checked against the target. Note also that `df_pca` drops only "College Name" and "State", so the "Graduation rate" column itself is part of the data the PCA was fitted on.

Components 3 and above contribute much less, with component 3 explaining less than 7% of the variance. These later components may capture more noise or less meaningful patterns in the data.

In summary, components 1 and 2 capture a substantial amount of the variance in the dataset, while the remaining components contribute relatively little. Focusing on the leading components reduces the dimensionality of the data while retaining most of the structure that PCA can summarize.
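
One way to check whether the leading components carry information about the target is to fit PCA on the predictors only and then correlate each component's scores with the target. The sketch below does this on synthetic data that is deliberately constructed so that the highest-variance component is unrelated to the target; all arrays here are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Two strongly correlated features share one latent factor,
# so they dominate the first principal component.
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n),
    latent + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])

# The synthetic target depends only on the third feature,
# which is independent of the latent factor.
y = 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)

# PCA is fitted on the predictors only; the target is kept out.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=4)
scores = pca.fit_transform(X_scaled)

# Absolute correlation of each component's scores with the target
correlations = [abs(np.corrcoef(scores[:, i], y)[0, 1]) for i in range(4)]
for i, (ratio, corr) in enumerate(zip(pca.explained_variance_ratio_, correlations)):
    print(f"Component {i+1}: variance share {ratio:.2f}, |corr with y| {corr:.2f}")
```

In this constructed case the first component explains roughly half the variance yet is nearly uncorrelated with the target, so the same check on the real data would show how much the 53% figure actually says about graduation rate.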

# Possible ML Models for Predicting Graduation Rate

1) Polynomial Regression: Since the scatter plot exhibits a parabolic or curved pattern, polynomial regression can be explored as a model that can capture non-linear relationships. By including polynomial terms of the principal components as additional features, a polynomial regression model may better capture the non-linearities in the data.

2) Random Forest Regression: Random forest regression is an ensemble learning model that can handle non-linear relationships effectively. It combines multiple decision trees to make predictions and can capture complex interactions and patterns in the data. By utilizing the principal components as input features, a random forest regression model may provide accurate predictions of graduation rates.

3) Linear Regression: Despite the presence of non-linear patterns in the scatter plot, it is still worth considering a linear regression model as a baseline approach. By using the principal components as input features, we can build a linear regression model to estimate graduation rates. However, it is important to keep in mind that the linear regression model may not capture the full complexity of the underlying relationships.
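
The three candidates above could be compared with cross-validation. The following is a minimal sketch on synthetic data with a curved relationship, assuming two principal components as inputs. The array `X_pcs`, the synthetic target, and the hyperparameters (degree 2, 100 trees, 5 folds) are illustrative choices, not results from this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for two principal components and a target
# with a quadratic dependence on the second component.
rng = np.random.default_rng(42)
X_pcs = rng.normal(size=(300, 2))
y = 60 + 5 * X_pcs[:, 0] - 3 * X_pcs[:, 1] ** 2 + rng.normal(scale=1.0, size=300)

models = {
    "linear": LinearRegression(),
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LinearRegression(),
    ),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate
scores = {
    name: cross_val_score(model, X_pcs, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

With the real data, `X_pcs` would be replaced by the PCA scores and `y` by the graduation rate, with the PCA fitted on the predictors only so the target does not leak into the inputs.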