# Return on Investment of US College Majors

By Mijail Mariano, Chenchen Feng, & David Schneemann

## Project Goal

Our goal with this project is to predict return on investment for US college undergraduate majors. We use US Department of Education data to identify statistically significant metrics in order to predict this "cost-to-earnings" ratio of return on investment. We explore how potential features within the data produce certain return on investment ratios and provide insight into why and how these factors contribute to this target variable. With this information and the following recommendations, we intend to give current and prospective students a greater understanding of the value and trade-offs tied to a given college major and its potential career paths.

## Project Description

The cost of attending college and its associated benefits and drawbacks are primary topics of conversation for young people today. The actual return on investment of such an endeavor is difficult to quantify and leaves prospective students guessing when choosing a college and degree to pursue and whether college is even the right choice. This analysis hopes to provide more clarity for students in making these key decisions about their future.

In order to more accurately predict return on investment, we will analyze the attributes (features) of college majors. We gathered data for all US colleges from the US Department of Education College Scorecard and US Census earnings data provided by University of Minnesota's IPUMS database. 
Our chosen dataset includes all gathered data for the academic years of 2017-2018 and 2018-2019 and earnings data from 2019 spanning 1% of the US population. 
In total, our initial dataset contains 225,000 records and 3000+ features. 
Our final cleaned and prepared dataset contains roughly the same number of records but narrows to roughly 100 key features. 

Our approach to this project involves clustering by groups of these features in order to best predict return on investment while simultaneously providing depth and insight into what features most affect return on investment and overall college outcomes.

-----------------------------------------------------------------------------------------------------------
## Initial Questions

-----------------------------------------------------------------------------------------------------------
## Clustering Questions

#### 1. Does bedroom, bathroom, and garage space count affect log error?

#### 2.  Does location, latitude, and longitude affect log error?
    
#### 3. Do sqft, lot_sq_ft, and bath_bed_ratio affect log error?

-----------------------------------------------------------------------------------------------------------
## Data Dictionary

In order to effectively meet our goals, the following module imports are required. \
Below is the full list of modules we imported and used to complete this analysis.

In [1]:
# regular imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import math

# default pandas decimal number display format
pd.options.display.float_format = '{:20,.2f}'.format

import warnings
warnings.filterwarnings("ignore")

# Modeling, feature selection, and statistics
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.cluster import KMeans
import sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from scipy import stats
from scipy.stats import pearsonr, spearmanr, kruskal
from scipy.stats.mstats import winsorize
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

import csv
import acquire
import prepare

| Variable      | Meaning |
| ----------- | ----------- |
| logerror      | The measured log error of a home       |
| home_value      | The total tax assessed value of the parcel       |
| bedrooms   | The total number of bedrooms in a home        |
| bathrooms      | The total number of bathrooms in a home       |
| garage_spaces      | The total number of car slots in a garage       |
| year_built      | The year the home was built       |
| age      | The age of the home       |
| location      | Location of a home by county      |
| sq_ft      | The total square feet of a home       |
| lot_sq_ft      | The total square feet of a property lot       |
| latitude   | Location using the latitudinal metric        |
| longitude      | Location using the longitudinal metric       |
| bath_bed_ratio      | Ratio of bathrooms to bedrooms of a home       |

## Acquire Data

##### We acquire our data by utilizing the acquire.py file.
This file pulls selected features from FieldOfStudyData1718_1819_PP and joins them with MERGED2018_19_PP. \
We then join this merged dataframe with select earnings data from the IPUMS dataset, namely for the years 2017, 2018, and 2019. \
Our resulting table returns 56,079 entries of data with the following attributes.
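
The join described above can be sketched with pandas. This is a minimal illustration, not the contents of `acquire.py`: the toy frames, the `join_scorecard` helper, and the choice of `UNITID` as the join key are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two College Scorecard files (all values are invented).
field_of_study = pd.DataFrame({
    "UNITID": [1001, 1001, 1002],
    "CIPDESC": ["Biology", "Economics", "Biology"],
})
institutions = pd.DataFrame({
    "UNITID": [1001, 1002],
    "INSTNM": ["College A", "College B"],
})


def join_scorecard(programs, schools, key="UNITID"):
    """Left-join program-level rows onto institution-level rows.

    `key` is an assumed shared identifier column; `validate` makes pandas
    raise if an institution appears more than once on the right side.
    """
    return programs.merge(schools, on=key, how="left", validate="many_to_one")


merged = join_scorecard(field_of_study, institutions)
print(f"Shape of resulting df: {merged.shape}")
```

A left join keeps every program row even when no institution matches, so unmatched rows surface as nulls instead of silently disappearing.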

In [2]:
# Call the acquire.py function and assign its output to df
df = acquire.get_bach_df()

dataframe shape: (0, 139)
Shape of resulting df:(0, 139)


## Prepare Data

##### We prepare our data by utilizing the prepare.py file.
This file:
- Handles null values
- Converts some variables to integers for optimization
- Cleans variables by dropping numerous extraneous features and renaming columns
- Performs feature engineering, including
    - Condensing majors into predominant major_categories
    - Consolidating the family income bracket columns
    - Engineering our target variable, ROI (at 5, 10, and 20 years), from earnings data and net price variables
- Splits the prepared df into train, validate, and test sets
- Handles outliers by "capping" via the "winsorize" method
- Manually imputes select features
- Uses an iterative imputer to programmatically handle remaining missing values

Our resulting dataframes are ready for exploration and evaluation.
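
To make the ROI target more concrete, here is a minimal sketch of one way a multi-horizon ROI could be derived from earnings and net price. The formula, the four-year enrollment assumption, the column names, and the `add_roi_columns` helper are all assumptions for illustration; the calculation in `prepare.obtain_target_variables` may differ.

```python
import pandas as pd


def add_roi_columns(df, earnings_col, net_price_col,
                    horizons=(5, 10, 20), years_enrolled=4):
    """Add one ROI column per horizon (assumed formula, not the project's).

    ROI = (cumulative earnings - total cost) / total cost, where
    total cost = annual net price * years enrolled (assumed to be 4).
    """
    out = df.copy()
    total_cost = out[net_price_col] * years_enrolled
    for years in horizons:
        cumulative_earnings = out[earnings_col] * years
        out[f"roi_{years}yr"] = (cumulative_earnings - total_cost) / total_cost
    return out


# Invented example values: flat annual earnings and a flat annual net price.
toy = pd.DataFrame({"median_earnings": [50_000.0], "avg_net_price": [20_000.0]})
toy_roi = add_roi_columns(toy, "median_earnings", "avg_net_price")
print(toy_roi[["roi_5yr", "roi_10yr", "roi_20yr"]])
```

This sketch ignores earnings growth, discounting, and forgone wages, each of which would lower or reshape the ratio.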

In [3]:
df = prepare.clean_college_df(df)
df = prepare.clean_high_percentage_nulls(df)
df = prepare.obtain_target_variables(df)
print(f'Shape of resulting df:{df.shape}')

dataframe shape: (0, 117)
dataframe shape: (0, 127)
Shape of resulting df:(0, 127)


In [4]:
# Creating income brackets

income_0_30000 = [
'other_fam_income_0_30000',
 'private_fam_income_0_30000',
 'program_fam_income_0_30000',
 'pub_fam_income_0_30000']

income_30001_48000 = [
 'other_fam_income_30001_48000',
 'private_fam_income_30001_48000',
 'program_fam_income_30001_48000',
 'pub_fam_income_30001_48000']

income_48001_75000 = [
'other_fam_income_48001_75000',
'private_fam_income_48001_75000',
'program_fam_income_48001_75000',
'pub_fam_income_48001_75000']

income_75001_110000 = [
'other_fam_income_75001_110000',
'private_fam_income_75001_110000',
'program_fam_income_75001_110000',
'pub_fam_income_75001_110000']

income_over_110000 = [
'other_fam_income_over_110000',
'private_fam_income_over_110000',
'program_fam_income_over_110000',
'pub_fam_income_over_110000']



In [5]:
df = prepare.get_fam_income_col(df, income_0_30000, "fam_income_0_30000")
df = prepare.get_fam_income_col(df, income_30001_48000, "fam_income_30001_48000")
df = prepare.get_fam_income_col(df, income_48001_75000, "fam_income_48001_75000")
df = prepare.get_fam_income_col(df, income_75001_110000, "fam_income_75001_110000")
df = prepare.get_fam_income_col(df, income_over_110000, "fam_income_over_110000")
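
The body of `prepare.get_fam_income_col` is not shown here. One plausible reading is that it collapses the four institution-type columns for a bracket into a single column by taking the first non-null value per row. The sketch below illustrates that reading only; the `combine_bracket_cols` helper and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def combine_bracket_cols(df, source_cols, new_col):
    """Collapse several sparse columns into one (assumed behavior).

    Back-filling along the column axis pulls the first non-null value in
    each row into the leftmost column, which is then kept as `new_col`.
    """
    out = df.copy()
    out[new_col] = out[source_cols].bfill(axis=1).iloc[:, 0]
    return out


# Invented values: each row is populated in at most one source column.
toy = pd.DataFrame({
    "private_fam_income_0_30000": [0.25, np.nan, np.nan],
    "pub_fam_income_0_30000": [np.nan, 0.40, np.nan],
})
combined = combine_bracket_cols(toy, list(toy.columns), "fam_income_0_30000")
print(combined["fam_income_0_30000"])
```

Rows with no value in any source column stay null and are left for the later imputation steps.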

In [None]:
# Creating region category feature from `state_post_code`
df['us_region'] = df.apply(prepare.label_states, axis=1)
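
For reference, a region label can also be assigned with a vectorized lookup instead of a row-wise `apply`. The region scheme used by `prepare.label_states` is not shown, so this sketch substitutes the four US Census Bureau regions; the `add_region` helper and the "Other" fallback for territories are assumptions.

```python
import pandas as pd

# US Census Bureau regions, used here as a stand-in for the project's scheme.
CENSUS_REGIONS = {
    "Northeast": ["CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"],
    "Midwest": ["IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO",
                "NE", "ND", "SD"],
    "South": ["DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV", "AL",
              "KY", "MS", "TN", "AR", "LA", "OK", "TX"],
    "West": ["AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA",
             "HI", "OR", "WA"],
}
# Invert to a flat {state code: region} lookup.
STATE_TO_REGION = {
    state: region
    for region, states in CENSUS_REGIONS.items()
    for state in states
}


def add_region(df, state_col="state_post_code"):
    """Map postal codes to regions; unmapped codes (e.g. territories) get 'Other'."""
    out = df.copy()
    out["us_region"] = out[state_col].map(STATE_TO_REGION).fillna("Other")
    return out


toy = pd.DataFrame({"state_post_code": ["TX", "MN", "PR"]})
print(add_region(toy))
```

`Series.map` performs the lookup in one pass over the column, which tends to be much faster than `apply(..., axis=1)` on a frame of this size.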


In [6]:
train, validate, test = prepare.split_data(df)

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
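
The error above is raised because the dataframe passed to the split has zero rows (`n_samples=0`), which matches the `(0, 127)` shape printed earlier. Below is a minimal sketch of a split that fails early with a clearer message. `split_data_checked` is hypothetical and is not the `prepare.split_data` implementation; the 0.2 and 0.3 fractions are inferred from the sample shapes reported further down, and the seed is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_checked(df, seed=123):
    """Split into train/validate/test (roughly 56/24/20), refusing empty input."""
    if df.empty:
        raise ValueError(
            "Cannot split an empty dataframe; check that the acquire step "
            "found its source files and returned rows."
        )
    # Hold out 20% for test, then 30% of the remainder for validate.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train, validate, test


toy = pd.DataFrame({"x": range(100)})
train_s, validate_s, test_s = split_data_checked(toy)
print(len(train_s), len(validate_s), len(test_s))
```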

In [None]:
# Capping outliers on train df
train = prepare.percentile_capping(train, 0.1, 0.1)
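
As an illustration of capping, the sketch below applies `scipy.stats.mstats.winsorize` to each numeric column with 10% limits on both tails, mirroring the `0.1, 0.1` arguments above. The `cap_numeric` helper is hypothetical, and how `prepare.percentile_capping` handles nulls or non-numeric columns is not shown.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def cap_numeric(df, lower=0.1, upper=0.1):
    """Winsorize every numeric column: clip the lowest and highest fractions.

    Values in the bottom `lower` share are raised to the nearest kept value,
    and values in the top `upper` share are lowered likewise. This sketch
    assumes the columns contain no NaNs.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        capped = winsorize(out[col].to_numpy(), limits=(lower, upper))
        out[col] = np.asarray(capped)
    return out


# Invented values with one obvious high outlier.
toy = pd.DataFrame({"net_price": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
capped = cap_numeric(toy)
print(capped["net_price"].tolist())
```

Capping keeps every row, which matters here because dropping outliers would also drop the majors and institutions they belong to.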

In [None]:
# Running imputer func on train
train_imputed = prepare.train_iterative_imputer(train)


In [None]:
# Performing final imputation on validate and test dfs
validate_imputed, test_imputed = prepare.impute_val_and_test(train, validate, test)
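
The train-then-apply pattern used for imputation can be sketched as follows: the imputer is fit on train only and then reused on validate and test, so no information from the held-out sets leaks into the fitted model. The helper names and toy data are hypothetical, and the settings inside `prepare.train_iterative_imputer` are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def fit_imputer(train, seed=0):
    """Fit an IterativeImputer on the training sample only."""
    imputer = IterativeImputer(random_state=seed, max_iter=10)
    imputer.fit(train)
    return imputer


def apply_imputer(imputer, df):
    """Fill missing values using the already-fitted imputer."""
    filled = imputer.transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)


# Invented numeric data with a few missing entries.
toy_train = pd.DataFrame({
    "earnings": [40.0, 50.0, np.nan, 70.0, 80.0, 90.0],
    "net_price": [10.0, 12.0, 15.0, np.nan, 20.0, 22.0],
})
toy_validate = pd.DataFrame({"earnings": [60.0, np.nan],
                             "net_price": [np.nan, 18.0]})

imputer = fit_imputer(toy_train)
toy_train_imputed = apply_imputer(imputer, toy_train)
toy_validate_imputed = apply_imputer(imputer, toy_validate)
print(toy_validate_imputed)
```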

In [None]:
# Checking shape on our samples to confirm appropriate split

print('Imputed Train shape: {}.'.format(train_imputed.shape))
print('Imputed Validate shape: {}.'.format(validate_imputed.shape))
print('Imputed Test shape: {}.'.format(test_imputed.shape))

Total df shape: (48250, 32).
Train shape: (27020, 32).
Validate shape: (11580, 32).
Test shape: (9650, 32).


## Set the Data Context

#### Note: Not all visuals, analysis, and work are shown within this Final Report. 
#### All our work, from start to finish, is available in our `working_notebook.ipynb` file for your reference.

Our acquired and prepared dataset contains 48,250 records. \
In the process of exploring this data and setting initial hypotheses, we created a figure plotting select categorical variables against our target variable, return on investment. Using this figure, we determined potential correlations with each of the features stated in our initial hypotheses. The following exploration seeks to answer these questions.