The dataset covers the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the <b>[American Community Survey](https://www.census.gov/programs-surveys/acs/)</b>, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their <b>[GitHub repo](https://github.com/fivethirtyeight/data/tree/master/college-majors)</b>.

In [1]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
# To display multiple output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

# To display all columns of dataframe
pd.set_option('display.max_columns', None)

In [4]:
recent_grads = pd.read_csv('recent-grads.csv')

# displaying the first five rows of the dataframe
recent_grads.head()

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,1849,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,756.0,679.0,77.0,Engineering,0.101852,7,640,556,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,856.0,725.0,131.0,Engineering,0.153037,3,648,558,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258.0,1123.0,135.0,Engineering,0.107313,16,758,1069,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,32260.0,21239.0,11021.0,Engineering,0.341631,289,25694,23170,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


Four columns in total contain null values, and all of those null values belong to a single row. So if we delete rows with null values, we should lose only one row.
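The check described above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the actual `recent-grads.csv` data: the name `toy` and all of its values are hypothetical, and the column names are borrowed from the header shown earlier only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (all values are made up)
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 300.0],
    'Men': [60.0, np.nan, 120.0],
    'Women': [40.0, np.nan, 180.0],
    'ShareWomen': [0.4, np.nan, 0.6],
})

# Null count per column, keeping only the columns that contain nulls
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]

# Rows that contain at least one null value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(null_counts)
print('Rows with nulls:', len(rows_with_nulls))
print('Rows after dropna:', len(toy.dropna()))
```

If the number of rows with nulls is one, `dropna()` removes exactly that row, which is the situation described for `recent_grads`.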

In [7]:
print('Number of rows in dataframe before dropping null values : ',recent_grads.shape[0])

Number of rows in dataframe before dropping null values :  173


In [32]:
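# (rows, columns) of the dataframe; 172 rows means the row with null values has already been dropped
recent_grads.shape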

(172, 21)

In [5]:
# Dropping rows with null values
recent_grads = recent_grads.dropna()
print('Number of rows in dataframe after dropping null values : ',recent_grads.shape[0])

Number of rows in dataframe after dropping null values :  172


In [9]:
import statsmodels.api as sm

In [33]:
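# Regress College_jobs on Men and Women.
# Note: sm.OLS does not add an intercept on its own, so this model is fitted without one.
model = sm.OLS(recent_grads['College_jobs'], recent_grads[['Men','Women']])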

In [34]:
results = model.fit()

In [35]:
results.summary()

Dep. Variable:,College_jobs,R-squared:,0.788
Model:,OLS,Adj. R-squared:,0.786
Method:,Least Squares,F-statistic:,316.5
Date:,"Thu, 19 Dec 2019",Prob (F-statistic):,4.89e-58
Time:,09:14:00,Log-Likelihood:,-1849.7
No. Observations:,172,AIC:,3703.0
Df Residuals:,170,BIC:,3710.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Men,0.0117,0.040,0.289,0.773,-0.068,0.092
Women,0.4611,0.028,16.321,0.000,0.405,0.517

Omnibus:,92.918,Durbin-Watson:,1.678
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1017.004
Skew:,1.69,Prob(JB):,1.45e-221
Kurtosis:,14.423,Cond. No.,2.9
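The model summarised above has no intercept, because `sm.OLS` only uses the columns it is given. The sketch below illustrates, with NumPy and synthetic data, what changes when a column of ones is added to the design matrix. The data, the coefficients 500, 0.1 and 0.4, and all variable names are assumptions made for illustration and say nothing about the `recent_grads` results. In statsmodels, the usual way to add that column is `sm.add_constant`.

```python
import numpy as np

# Synthetic predictors (made up; not the recent_grads data)
rng = np.random.default_rng(0)
men = rng.uniform(100, 5000, size=50)
women = rng.uniform(100, 5000, size=50)

# Made-up response with a nonzero intercept and no noise
y = 500.0 + 0.1 * men + 0.4 * women

# Design matrix without an intercept column (as in the model above)
X_no_const = np.column_stack([men, women])

# Design matrix with a leading column of ones, which acts as the intercept
X_const = np.column_stack([np.ones_like(men), men, women])

# Ordinary least squares solutions for both designs
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print('Without intercept (Men, Women):', coef_no_const)
print('With intercept (const, Men, Women):', coef_const)
```

With the intercept column, the fit recovers the coefficients used to build `y`. Without it, the slopes have to absorb the constant term, so they shift away from those values.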
