# Step 1:
College completion data question - Does a high level of institutional expenditure per award reliably predict that an institution's graduation rate falls in the top quartile (at or above the 75th percentile)?

Job placement data question - Can we predict a student's placement status using their degree percentage and work experience while controlling for their MBA specialization?

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [13]:
college = pd.read_csv('cc_institution_details.csv')
college.head()

Unnamed: 0,index,unitid,chronname,city,state,level,control,basic,hbcu,flagship,...,vsa_grad_after6_transfer,vsa_grad_elsewhere_after6_transfer,vsa_enroll_after6_transfer,vsa_enroll_elsewhere_after6_transfer,similar,state_sector_ct,carnegie_ct,counted_pct,nicknames,cohort_size
0,0,100654,Alabama A&M University,Normal,Alabama,4-year,Public,Masters Colleges and Universities--larger prog...,X,,...,36.4,5.6,17.2,11.1,232937|100724|405997|113607|139533|144005|2285...,13,386,99.7|07,,882.0
1,1,100663,University of Alabama at Birmingham,Birmingham,Alabama,4-year,Public,Research Universities--very high research acti...,,,...,,,,,196060|180461|201885|145600|209542|236939|1268...,13,106,56.0|07,UAB,1376.0
2,2,100690,Amridge University,Montgomery,Alabama,4-year,Private not-for-profit,Baccalaureate Colleges--Arts & Sciences,,,...,,,,,217925|441511|205124|247825|197647|221856|1353...,16,252,100.0|07,,3.0
3,3,100706,University of Alabama at Huntsville,Huntsville,Alabama,4-year,Public,Research Universities--very high research acti...,,,...,0.0,0.0,0.0,0.0,232186|133881|196103|196413|207388|171128|1900...,13,106,43.1|07,UAH,759.0
4,4,100724,Alabama State University,Montgomery,Alabama,4-year,Public,Masters Colleges and Universities--larger prog...,X,,...,,,,,100654|232937|242617|243197|144005|241739|2354...,13,386,88.0|07,ASU,1351.0


In [14]:
job = pd.read_csv('Placement_Data_Full_Class.csv')
job.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


# Step 2 (college completion)
Generic Question: How does university spending influence student completion outcomes?

Independent business metric: Completion cost per graduate. This measures the financial efficiency of an institution in moving a student from enrollment to an award.

Data preparation:

Variable types: Convert control and level to factors.

One hot encoding: Apply to control (Public, Private not-for-profit, Private for-profit).

Normalize: Scale student_count, aid_value, and exp_award_value.

Drop unneeded variables: Remove metadata: unitid, chronname, city, state, site, and coordinates.

Target variable: Create high_grad_target (1 if grad_100_value is at or above the 75th percentile, else 0).
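The preparation plan above could be sketched roughly as follows. This is a minimal illustration, not the final pipeline: `toy_c` holds made-up values standing in for cc_institution_details.csv, `prep_college` is a hypothetical helper name, and Min-Max scaling is assumed as the normalization method.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for cc_institution_details.csv; all values are made up.
toy_c = pd.DataFrame({
    "unitid": [1, 2, 3, 4],
    "chronname": ["A", "B", "C", "D"],
    "city": ["w", "x", "y", "z"],
    "state": ["AL", "AL", "VA", "VA"],
    "level": ["4-year", "2-year", "4-year", "4-year"],
    "control": ["Public", "Private not-for-profit", "Private for-profit", "Public"],
    "student_count": [4000.0, 300.0, 1200.0, 9000.0],
    "aid_value": [7000.0, 12000.0, 5000.0, 9000.0],
    "exp_award_value": [100000.0, 60000.0, 40000.0, 150000.0],
})

NUM_COLS = ["student_count", "aid_value", "exp_award_value"]
META_COLS = ["unitid", "chronname", "city", "state"]


def prep_college(df):
    """Hypothetical helper: drop metadata, set factors, one-hot encode, scale."""
    # errors="ignore" lets the helper run even if a metadata column is absent
    out = df.drop(columns=META_COLS, errors="ignore").copy()
    # Treat control and level as categorical factors
    for col in ["control", "level"]:
        out[col] = out[col].astype("category")
    # One-hot encode control into 0/1 indicator columns
    out = pd.get_dummies(out, columns=["control"], dtype=int)
    # Min-Max scale the numeric columns to the [0, 1] range
    out[NUM_COLS] = MinMaxScaler().fit_transform(out[NUM_COLS])
    return out


prepped_c = prep_college(toy_c)
print(prepped_c.columns.tolist())
```

On the real data, the scaler should probably be fit on the training partition only and then applied to the tune and test partitions, so that information from held-out rows does not leak into the scaling.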

In [None]:
df_c = pd.read_csv('cc_institution_details.csv').dropna(subset=['grad_100_value'])

# Create target
q3_threshold = df_c['grad_100_value'].quantile(0.75)
df_c['high_grad_target'] = (df_c['grad_100_value'] >= q3_threshold).astype(int)

# prevalence
prevalence_c = df_c['high_grad_target'].mean()
print(f"College Completion Prevalence: {prevalence_c:.2%}")

# Partition
train_c, rem_c = train_test_split(df_c, train_size=0.7, stratify=df_c['high_grad_target'], random_state=42)
tune_c, test_c = train_test_split(rem_c, test_size=0.5, stratify=rem_c['high_grad_target'], random_state=42)

College Completion Prevalence: 25.01%


# Step 2 (job placement)
Generic Question: What student profile characteristics are most predictive of successful campus placement?

Independent business metric: placement rate by specialization. This allows the university to identify which career tracks (Mkt&HR vs Mkt&Fin) require more corporate outreach or curriculum adjustment.
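As a rough sketch of this metric, the placement rate can be computed with a groupby on specialisation. The frame below contains only the five rows shown in job.head() above, so the rates are illustrative and not the full-dataset values.

```python
import pandas as pd

# The specialisation and status columns from the five rows in job.head()
sample_p = pd.DataFrame({
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&Fin"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
})

# Boolean flag per student, then the mean of that flag within each specialization
placed = sample_p["status"].eq("Placed")
rate_by_spec = placed.groupby(sample_p["specialisation"]).mean()
print(rate_by_spec)
```

The same two lines should work on the full `job` frame loaded earlier.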

Data Preparation Plan:

Variable types: Set gender, ssc_b, hsc_b, hsc_s, degree_t, workex, and specialisation as factors.

Collapse factors: If any subject area in hsc_s has very low frequency, collapse it into an "Other" category.

One hot encoding: Apply to non-binary categories like hsc_s and degree_t.

Normalize: Apply Min-Max scaling to all percentage columns (ssc_p, hsc_p, degree_p, etest_p, mba_p).

Drop unneeded variables: Remove sl_no (a serial-number index with no predictive meaning) and salary (target leakage: salary is recorded only for students who were placed).

Target variable: Map status to Placed = 1 and Not Placed = 0.
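One way the plan above might look in code is sketched below. It uses a subset of columns from the five rows shown in job.head(). `collapse_rare`, `prep_placement`, and the `min_count` threshold are hypothetical names and choices, and mapping workex to 0/1 is an assumption about how the binary factors would be handled.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Subset of columns from the five rows shown in job.head()
sample_p = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5],
    "hsc_s": ["Commerce", "Science", "Arts", "Science", "Commerce"],
    "degree_t": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt", "Sci&Tech", "Comm&Mgmt"],
    "workex": ["No", "Yes", "No", "No", "No"],
    "degree_p": [58.0, 77.48, 64.0, 52.0, 73.3],
    "mba_p": [58.8, 66.28, 57.8, 59.43, 55.5],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, 250000.0, None, 425000.0],
})

PCT_COLS = ["ssc_p", "hsc_p", "degree_p", "etest_p", "mba_p"]


def collapse_rare(series, min_count):
    """Replace levels seen fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")


def prep_placement(df, min_count=2):
    """Hypothetical helper following the preparation plan."""
    # Drop the row index and the leaky salary column
    out = df.drop(columns=["sl_no", "salary"]).copy()
    # Binary target: Placed = 1, Not Placed = 0
    out["status_bin"] = out.pop("status").map({"Placed": 1, "Not Placed": 0})
    # Assumption: binary factors are encoded as a single 0/1 column
    out["workex"] = out["workex"].map({"Yes": 1, "No": 0})
    # Collapse low-frequency hsc_s levels, then one-hot encode
    out["hsc_s"] = collapse_rare(out["hsc_s"], min_count)
    out = pd.get_dummies(out, columns=["hsc_s", "degree_t"], dtype=int)
    # Min-Max scale whichever percentage columns are present
    pct = [c for c in PCT_COLS if c in out.columns]
    out[pct] = MinMaxScaler().fit_transform(out[pct])
    return out


prepped_p = prep_placement(sample_p)
print(prepped_p.columns.tolist())
```

With only five rows, Arts appears once and is collapsed into "Other"; on the full data a larger `min_count` (or a relative frequency threshold) would likely be more sensible.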

In [None]:
# Load and basic target setup
df_p = pd.read_csv('Placement_Data_Full_Class.csv')
df_p['status_bin'] = df_p['status'].map({'Placed': 1, 'Not Placed': 0})

# prevalence
prevalence_p = df_p['status_bin'].mean()
print(f"Job Placement Prevalence: {prevalence_p:.2%}")

# Partition
train_p, rem_p = train_test_split(df_p, train_size=0.7, stratify=df_p['status_bin'], random_state=42)
tune_p, test_p = train_test_split(rem_p, test_size=0.5, stratify=rem_p['status_bin'], random_state=42)

Job Placement Prevalence: 68.84%
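A quick sanity check on the stratified partitions is to compare each partition's size and prevalence with the overall figure. The sketch below uses synthetic labels only: the 215-row, 148-positive vector is an assumption chosen to reproduce the 68.84% prevalence printed above, and `partition_report` is a hypothetical helper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels only: 148 positives out of 215 rows (about 68.84%)
toy = pd.DataFrame({"status_bin": [1] * 148 + [0] * 67})

# Same 70/15/15 stratified scheme as the cell above
train, rem = train_test_split(toy, train_size=0.7, stratify=toy["status_bin"], random_state=42)
tune, test = train_test_split(rem, test_size=0.5, stratify=rem["status_bin"], random_state=42)


def partition_report(parts, target):
    """Return row count and target prevalence for each named partition."""
    rows = {
        name: {"n": len(part), "prevalence": part[target].mean()}
        for name, part in parts.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


report = partition_report({"train": train, "tune": tune, "test": test}, "status_bin")
print(report)
```

The same helper could be called with `train_c`, `tune_c`, `test_c` and `"high_grad_target"` to check the college partitions.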


# Step 3 
Instincts
In the Job Placement data, I suspect that workex and mba_p (MBA percentage) will have the highest feature importance. Employers likely value the combination of recent academic performance and prior work experience. For College Completion, I expect a strong association between control and graduation rates, as public and private institutions often differ in student-to-faculty ratios, which in turn affect these metrics.

Concerns
Key concerns include excluding the salary column from the placement features, since it directly proxies the target; handling missing financial data in the cc_institution_details dataset carefully, to avoid biasing the model toward well-documented schools; and addressing extreme exp_award_value outliers with robust scaling or a log transformation, so they do not distort normalization. There may also be systematic differences between public and private institutions, and several features reflect socioeconomic factors that could influence the results. Graduation rates are affected by external factors not captured in the data (such as the pandemic), and a single 75th-percentile cutoff for the target may oversimplify institutional performance.
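To illustrate the outlier concern, the sketch below compares a log transform and robust scaling on a handful of made-up exp_award_value figures with one extreme value. The numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up expenditure values; the last one is an extreme outlier
exp = pd.DataFrame({"exp_award_value": [40000.0, 55000.0, 60000.0, 75000.0, 5000000.0]})

# Plain Min-Max scaling: the outlier squeezes typical values toward 0
exp["exp_minmax"] = MinMaxScaler().fit_transform(exp[["exp_award_value"]]).ravel()

# log1p compresses the long right tail before any further scaling
exp["exp_log"] = np.log1p(exp["exp_award_value"])

# RobustScaler centers on the median and divides by the interquartile range
exp["exp_robust"] = RobustScaler().fit_transform(exp[["exp_award_value"]]).ravel()

print(exp)
```

Under Min-Max scaling the four typical values all land below 0.01, while the robust-scaled values stay spread around zero; which option is better for modeling would still need to be checked on the real distribution.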