<a href="https://colab.research.google.com/github/chrisseiler96/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
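As a minimal sketch of both hints, here they are applied to a made-up three-row frame. The column names and values are illustrative only and are not taken from the Adult data.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "sex": ["Male", "Female", "Male"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

# Rescale the numeric column to the [0, 1] range.
encoded["age"] = minmax_scale(encoded["age"])

print(encoded)
```

After scaling, `age` runs from 0 to 1, the same range as the one-hot columns.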

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
import pandas as pd

In [0]:
dfcols = ['age', 'workclass','fnlwgt', 'education', 'education-num','marital-status','occupation', 'relationship', 'race', 'sex', 'capital-gain',
'capital-loss','hours-per-week','native-country', 'pay']

In [0]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, names = dfcols)

In [4]:
df.head(50)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,pay
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


### Checking for nulls. The dataset file claims it has null values.

In [5]:
df.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
pay               0
dtype: int64

In [0]:
import numpy as np

In [0]:
df = df.replace(" ?", np.nan)
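The replacement above works because every field in `adult.data` is preceded by a space, so missing values appear as `" ?"`. An alternative sketch is to handle both at load time with `skipinitialspace` and `na_values`. The example reads two inline rows (the second row is invented) with a shortened, illustrative column list instead of the real file.

```python
import io
import pandas as pd

# Two rows in the same "comma + space" layout as adult.data; the second row is made up.
sample = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 180211, >50K\n"
)

# skipinitialspace drops the blank after each comma; na_values turns "?" into NaN.
demo = pd.read_csv(
    sample,
    header=None,
    names=["age", "workclass", "fnlwgt", "pay"],  # shortened column list for the demo
    skipinitialspace=True,
    na_values="?",
)

print(demo["workclass"].isna().sum())
```

If the real file were loaded this way, the labels would become `"<=50K"` and `">50K"` without the leading space, so later string comparisons would need to match that.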


In [8]:
df.head(50)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,pay
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [9]:
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
pay                  0
dtype: int64

In [10]:
df.shape

(32561, 15)

In [0]:
#df.dropna(subset=['workclass'], how='all', inplace = True)

In [12]:
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
pay                  0
dtype: int64

In [0]:
#df.dropna(subset=['native-country'], how='all', inplace = True)

In [14]:
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
pay                  0
dtype: int64

In [0]:
#df.dropna(subset=['occupation'], how='all', inplace = True)

### Null values are kept (the `dropna` calls above are commented out)
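The instructions suggest treating missing categorical values as their own "unknown" category. A minimal sketch of that idea on a toy column follows; the label `"Unknown"` is an arbitrary choice, not something defined by the dataset.

```python
import numpy as np
import pandas as pd

# Toy column standing in for workclass / occupation / native-country.
toy = pd.DataFrame({"workclass": [" Private", np.nan, " State-gov"]})

# Replace NaN with an explicit label so one-hot encoding gives it its own column.
toy["workclass"] = toy["workclass"].fillna("Unknown")

dummies = pd.get_dummies(toy["workclass"], dtype=int)
print(dummies.columns.tolist())
```

This would need to run while the column is still plain `object` dtype, since filling a categorical column with a label outside its categories raises an error.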

Encoding categorical values

In [0]:
df = df.replace(to_replace =[" <=50K"],  
                            value =0) 

In [0]:
df = df.replace(to_replace =[" >50K"],  
                            value =1) 
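The two `replace` calls scan the whole DataFrame. A narrower sketch encodes only the target column, shown here on a toy Series with the same leading-space labels; the variable names are illustrative.

```python
import pandas as pd

# Toy target values with the same leading space as the raw file.
pay = pd.Series([" <=50K", " >50K", " <=50K"], name="pay")

# Strip whitespace, compare against the positive label, and cast True/False to 1/0.
target = (pay.str.strip() == ">50K").astype(int)

print(target.tolist())
```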

In [18]:
df.pay.value_counts()

0    24720
1     7841
Name: pay, dtype: int64

In [19]:
df.workclass.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

In [20]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
pay                int64
dtype: object

In [21]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,pay
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [0]:
df['workclass'] = pd.Categorical(df['workclass'])

In [0]:
df['education'] = pd.Categorical(df['education'])

In [0]:
df['marital-status'] = pd.Categorical(df['marital-status'])

In [0]:
df['occupation'] = pd.Categorical(df['occupation'])

In [0]:
df['relationship'] = pd.Categorical(df['relationship'])

In [0]:
df['race'] = pd.Categorical(df['race'])

In [0]:
df['sex'] = pd.Categorical(df['sex'])

In [0]:
df['native-country'] = pd.Categorical(df['native-country'])
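The eight conversions above can also be written as a single call. A sketch on a toy frame, with `categorical_cols` as an illustrative name:

```python
import pandas as pd

toy = pd.DataFrame({
    "race": [" White", " Black"],
    "sex": [" Male", " Female"],
    "age": [39, 28],
})

# Columns to convert; on the real data this would be the eight object columns.
categorical_cols = ["race", "sex"]

# astype accepts a {column: dtype} mapping and leaves other columns untouched.
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```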

In [30]:
df.dtypes

age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
pay                  int64
dtype: object

In [0]:
dummylist = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']

In [32]:
for dummy in dummylist:
  print(df[dummy].unique())

[State-gov, Self-emp-not-inc, Private, Federal-gov, Local-gov, NaN, Self-emp-inc, Without-pay, Never-worked]
Categories (8, object): [State-gov, Self-emp-not-inc, Private, Federal-gov, Local-gov,
                         Self-emp-inc, Without-pay, Never-worked]
[Bachelors, HS-grad, 11th, Masters, 9th, ..., 5th-6th, 10th, 1st-4th, Preschool, 12th]
Length: 16
Categories (16, object): [Bachelors, HS-grad, 11th, Masters, ..., 10th, 1st-4th, Preschool, 12th]
[Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed]
Categories (7, object): [Never-married, Married-civ-spouse, Divorced, Married-spouse-absent,
                         Separated, Married-AF-spouse, Widowed]
[Adm-clerical, Exec-managerial, Handlers-cleaners, Prof-specialty, Other-service, ..., Tech-support, NaN, Protective-serv, Armed-Forces, Priv-house-serv]
Length: 15
Categories (14, object): [Adm-clerical, Exec-managerial, Handlers-cleaners, Prof-specialty, ...,
               

In [0]:
df_dummies = pd.get_dummies(df[['workclass','education','marital-status','occupation','relationship','race','sex','native-country']], prefix = 'category')
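Passing `prefix = 'category'` gives every dummy column the same prefix, so the source column can no longer be read from the name. If `prefix` is omitted, `get_dummies` uses each source column's name instead. A small sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", "State-gov"],
    "sex": ["Male", "Female"],
})

# With no prefix argument, each dummy column is named <source column>_<value>.
dummies = pd.get_dummies(toy, dtype=int)

print(dummies.columns.tolist())
```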

In [34]:
df_dummies.shape

(32561, 99)

In [35]:
df.shape

(32561, 15)

In [0]:
df = pd.concat([df, df_dummies], axis=1)

In [37]:
df.shape

(32561, 114)

In [0]:
pd.set_option('display.max_columns', None)

In [39]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,pay,category_ Federal-gov,category_ Local-gov,category_ Never-worked,category_ Private,category_ Self-emp-inc,category_ Self-emp-not-inc,category_ State-gov,category_ Without-pay,category_ 10th,category_ 11th,category_ 12th,category_ 1st-4th,category_ 5th-6th,category_ 7th-8th,category_ 9th,category_ Assoc-acdm,category_ Assoc-voc,category_ Bachelors,category_ Doctorate,category_ HS-grad,category_ Masters,category_ Preschool,category_ Prof-school,category_ Some-college,category_ Divorced,category_ Married-AF-spouse,category_ Married-civ-spouse,category_ Married-spouse-absent,category_ Never-married,category_ Separated,category_ Widowed,category_ Adm-clerical,category_ Armed-Forces,category_ Craft-repair,category_ Exec-managerial,category_ Farming-fishing,category_ Handlers-cleaners,category_ Machine-op-inspct,category_ Other-service,category_ Priv-house-serv,category_ Prof-specialty,category_ Protective-serv,category_ Sales,category_ Tech-support,category_ Transport-moving,category_ Husband,category_ Not-in-family,category_ Other-relative,category_ Own-child,category_ Unmarried,category_ Wife,category_ Amer-Indian-Eskimo,category_ Asian-Pac-Islander,category_ Black,category_ Other,category_ White,category_ Female,category_ Male,category_ Cambodia,category_ Canada,category_ China,category_ Columbia,category_ Cuba,category_ Dominican-Republic,category_ Ecuador,category_ El-Salvador,category_ England,category_ France,category_ Germany,category_ Greece,category_ Guatemala,category_ Haiti,category_ Holand-Netherlands,category_ Honduras,category_ Hong,category_ Hungary,category_ India,category_ Iran,category_ Ireland,category_ Italy,category_ Jamaica,category_ Japan,category_ Laos,category_ Mexico,category_ Nicaragua,category_ Outlying-US(Guam-USVI-etc),category_ Peru,category_ Philippines,category_ Poland,category_ Portugal,category_ Puerto-Rico,category_ Scotland,category_ South,category_ Taiwan,category_ Thailand,category_ Trinadad&Tobago,category_ United-States,category_ Vietnam,category_ Yugoslavia
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [40]:
df['hours-per-week'].value_counts()

40    15217
50     2819
45     1824
60     1475
35     1297
20     1224
30     1149
55      694
25      674
48      517
38      476
15      404
70      291
10      278
32      266
24      252
65      244
36      220
42      219
44      212
16      205
12      173
43      151
37      149
8       145
52      138
80      133
56       97
28       86
99       85
      ...  
19       14
64       14
51       13
85       13
68       12
98       11
11       11
63       10
78        8
29        7
77        6
59        5
31        5
96        5
67        4
91        3
76        3
81        3
73        2
89        2
97        2
88        2
86        2
61        2
95        2
92        1
94        1
87        1
74        1
82        1
Name: hours-per-week, Length: 94, dtype: int64

Creating a feature (left commented out; it is not used in the model below)

In [0]:
#df['fulltime'] = df['hours-per-week']

In [0]:
#for val in df['fulltime']:
  #print(df.fulltime[val])
 # if df.fulltime[val] >= 30:
 #   df.fulltime[val] = 1
#  else:
 #   df.fulltime[val] = 0
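The commented-out loop indexes `df.fulltime` by value rather than by position, so it would not have produced the intended flag. A vectorized sketch of the same idea on toy data, keeping the 30-hour threshold from the loop:

```python
import pandas as pd

toy = pd.DataFrame({"hours-per-week": [40, 13, 30, 16]})

# 1 if the person works at least 30 hours per week, else 0.
toy["fulltime"] = (toy["hours-per-week"] >= 30).astype(int)

print(toy)
```

Adding such a column to `df` would change the feature count, so the shapes and coefficients printed later would no longer match.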

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [43]:
df.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,32561.0,38.581647,13.640433,17.0,28.0,37.0,48.0,90.0
fnlwgt,32561.0,189778.366512,105549.977697,12285.0,117827.0,178356.0,237051.0,1484705.0
education-num,32561.0,10.080679,2.572720,1.0,9.0,10.0,12.0,16.0
capital-gain,32561.0,1077.648844,7385.292085,0.0,0.0,0.0,0.0,99999.0
capital-loss,32561.0,87.303830,402.960219,0.0,0.0,0.0,0.0,4356.0
hours-per-week,32561.0,40.437456,12.347429,1.0,40.0,40.0,45.0,99.0
pay,32561.0,0.240810,0.427581,0.0,0.0,0.0,0.0,1.0
category_ Federal-gov,32561.0,0.029483,0.169159,0.0,0.0,0.0,0.0,1.0
category_ Local-gov,32561.0,0.064279,0.245254,0.0,0.0,0.0,0.0,1.0
category_ Never-worked,32561.0,0.000215,0.014661,0.0,0.0,0.0,0.0,1.0


In [44]:
from sklearn.linear_model import LogisticRegression


X = df.drop(['pay','workclass','education','marital-status','occupation','relationship','race','sex','native-country'], axis=1)
y = df['pay']

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.7979484659562053
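The model above is fit on unscaled features, so a column like `fnlwgt` (values in the hundreds of thousands) and a 0/1 dummy get coefficients on very different scales. A minimal sketch of scaling before fitting follows. It uses synthetic data rather than the Adult data, and the names `X_toy`, `y_toy`, `scaled_model` and the `max_iter=1000` setting are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X and y: one wide-range and one narrow-range feature.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "fnlwgt_like": rng.uniform(1e4, 1e6, size=200),
    "hours_like": rng.uniform(1, 99, size=200),
})
# In this toy setup the label depends only on hours_like.
y_toy = (X_toy["hours_like"] > 40).astype(int)

# Scale every feature to [0, 1], then fit; the pipeline reuses the same scaling at predict time.
scaled_model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)

# Pull the coefficients back out with their feature names.
log_reg_step = scaled_model.named_steps["logisticregression"]
coefs = pd.Series(log_reg_step.coef_[0], index=X_toy.columns)
print(coefs)
```

With both features on the same range, the coefficient for the feature that actually drives the label is clearly the larger one.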

In [45]:
log_reg.coef_

array([[-5.31497770e-03, -3.67688592e-06, -2.22800292e-03,
         3.37327744e-04,  7.77525068e-04, -9.81596745e-03,
         1.34945356e-04,  5.30977475e-05, -2.25384391e-06,
        -1.60956057e-03,  3.19753167e-04, -1.66383727e-05,
        -1.43905144e-05, -5.12944380e-06, -2.34144236e-04,
        -3.36116054e-04, -1.07123659e-04, -3.99483209e-05,
        -7.85395537e-05, -1.63503272e-04, -1.28980427e-04,
        -3.20032013e-05, -3.32815728e-05,  8.36132716e-04,
         2.16898311e-04, -1.43308999e-03,  5.28334605e-04,
        -1.59891423e-05,  2.47327302e-04, -7.78166281e-04,
        -9.53629566e-04,  5.54739970e-06,  3.14693967e-03,
        -9.86767432e-05, -3.15915402e-03, -2.53828596e-04,
        -2.39390916e-04, -6.51687963e-04, -1.59078562e-06,
        -2.09481395e-04,  9.54563103e-04, -2.10020489e-04,
        -3.46101323e-04, -3.43734645e-04, -9.56818593e-04,
        -4.76640522e-05,  7.58475453e-04,  6.18308940e-05,
        -5.81657880e-05,  3.94252527e-05, -1.26952299e-0

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(22792, 105)
(9769, 105)
(22792,)
(9769,)


In [47]:
log_reg = LogisticRegression().fit(X_train, y_train)
log_reg.score(X_train, y_train)



0.7956739206739206

In [48]:
log_reg.coef_


array([[-5.23179418e-03, -3.69321834e-06, -2.21152071e-03,
         3.37003187e-04,  7.58933953e-04, -9.54792695e-03,
         1.40834756e-04,  4.48362503e-05, -3.05044448e-06,
        -1.53522686e-03,  2.97853606e-04, -1.83190180e-05,
        -5.82666499e-06, -4.65267096e-06, -2.14586264e-04,
        -3.26184439e-04, -9.40339550e-05, -3.88178182e-05,
        -7.63432963e-05, -1.44805024e-04, -1.14853530e-04,
        -1.89517920e-05, -2.55939154e-05,  8.10800539e-04,
         2.08215143e-04, -1.40427407e-03,  5.00765286e-04,
        -1.43415719e-05,  2.29959865e-04, -7.59848569e-04,
        -9.16537632e-04,  3.22668616e-06,  2.99059776e-03,
        -8.71326304e-05, -3.01094266e-03, -2.36413687e-04,
        -2.25691243e-04, -6.22711488e-04, -2.59048018e-06,
        -2.04805901e-04,  9.32749156e-04, -2.07661327e-04,
        -3.25245833e-04, -3.46217836e-04, -8.90849458e-04,
        -4.56007364e-05,  6.69929348e-04,  6.09884919e-05,
        -3.50526096e-05,  5.56934890e-05, -1.19125422e-0

In [49]:
log_reg.predict(X_test)


array([0, 0, 0, ..., 0, 0, 1])

In [50]:
X.shape

(32561, 105)

In [51]:
from sklearn.metrics import accuracy_score

y_pred = log_reg.predict(X_test)

accuracy_score(y_test, y_pred)

0.8021291841539564
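For context, here is a quick sketch of the majority-class baseline, using the class counts printed by `df.pay.value_counts()` earlier. It is computed on the full dataset rather than on the test split, so the comparison with the accuracy above is approximate.

```python
# Class counts taken from the df.pay.value_counts() output.
n_low, n_high = 24720, 7841

# Accuracy of a model that always predicts the majority class (0, i.e. <=50K).
baseline_accuracy = n_low / (n_low + n_high)

print(round(baseline_accuracy, 4))  # 0.7592
```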

In [60]:
model_coefs = pd.DataFrame(log_reg.coef_, 
                           columns = X.columns)
model_coefs.T

Unnamed: 0,0
age,-5.231794e-03
fnlwgt,-3.693218e-06
education-num,-2.211521e-03
capital-gain,3.370032e-04
capital-loss,7.589340e-04
hours-per-week,-9.547927e-03
category_ Federal-gov,1.408348e-04
category_ Local-gov,4.483625e-05
category_ Never-worked,-3.050444e-06
category_ Private,-1.535227e-03


In [61]:
model_coefs.mean().sort_values(ascending=False)


category_ Married-civ-spouse    2.990598e-03
category_ Husband               2.665054e-03
category_ Exec-managerial       9.327492e-04
category_ Bachelors             8.108005e-04
capital-loss                    7.589340e-04
category_ Male                  7.349929e-04
category_ Prof-specialty        6.699293e-04
category_ Masters               5.007653e-04
category_ Wife                  3.580565e-04
capital-gain                    3.370032e-04
category_ Self-emp-inc          2.978536e-04
category_ Prof-school           2.299599e-04
category_ Doctorate             2.082151e-04
category_ Federal-gov           1.408348e-04
category_ Protective-serv       6.098849e-05
category_ Tech-support          5.569349e-05
category_ Local-gov             4.483625e-05
category_ India                 1.229015e-05
category_ France                1.051925e-05
category_ Italy                 1.031189e-05
category_ Japan                 8.816889e-06
category_ Taiwan                7.160163e-06
category_ 
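Raw logistic regression coefficients are changes in log-odds. One common way to read them is to exponentiate them into odds ratios. The sketch below does this for three coefficients copied from the output above. They come from the unscaled fit, so each ratio is per one raw unit of the feature.

```python
import numpy as np
import pandas as pd

# A few coefficients copied from the model_coefs output (unscaled features).
coefs = pd.Series({
    "category_ Married-civ-spouse": 2.990598e-03,
    "age": -5.231794e-03,
    "hours-per-week": -9.547927e-03,
})

# exp(coefficient) is the multiplicative change in the odds of pay == 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coefs).sort_values(ascending=False)

print(odds_ratios)
```

Values above 1 raise the odds of earning more than 50K and values below 1 lower them.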

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis



---

### 1. The largest positive coefficient is for `Married-civ-spouse` (married to a civilian spouse). I would hypothesize this is because military employees are included in this dataset, which would also explain why being male is a good predictor. Having a bachelor's degree is another good predictor.

### 2. The largest negative coefficients are for being female, not being married, age, and, surprisingly, hours per week. Because the numeric features were not scaled before fitting, the `age` and `hours-per-week` coefficients are not directly comparable with those of the one-hot columns. I would hypothesize that my explanation above gives insight into the rest: if many of the high earners are married military workers, who would mostly be male, that would push the female and unmarried coefficients down. This is speculative, since `workclass` shows that most rows are `Private` (22,696 of 32,561). I would also hypothesize that age is negatively correlated because many young military workers may marry young and get a large pay increase.

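### 3. The model fits the data reasonably well, with an accuracy score of ~80% on the test split. However, only about 24% of the people in the data (7,841 of 32,561) are paid above the 50K threshold, so a model that always predicts 0 would already be about 76% accurate.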









---

### 1. I would use quantile regression to model the bottom tier of grades directly (students below a chosen percentile threshold on certain metrics). It may also be possible to use a hazard model if you can capture suitable data for it.

### 2. Because time is involved, I would use some type of survival analysis, which models the time until an event (here, a product launch). Simple logistic regression could also be used if the data you can capture is categorical (for example, `6months_since_last_product: 1`).

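### 3. Ridge regression would work best for this: with very detailed measurements on only a few dozen plants, there may be more features than observations, and the L2 penalty keeps the coefficients stable. Logistic regression would apply only if the outcome were recast as a category.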
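As a rough sketch of the ridge case, the example below fits `sklearn.linear_model.Ridge` to synthetic data with more features than rows. The data, the variable names, and `alpha=1.0` are all illustrative assumptions, not values from any real plant study.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 30 "plants" with 200 measurements each (more features than rows).
rng = np.random.default_rng(0)
X_plants = rng.normal(size=(30, 200))
true_weights = np.zeros(200)
true_weights[:5] = 1.0  # only the first five measurements matter in this toy setup
y_yield = X_plants @ true_weights + rng.normal(scale=0.1, size=30)

# alpha sets the strength of the L2 penalty; 1.0 is an arbitrary illustrative value.
ridge = Ridge(alpha=1.0).fit(X_plants, y_yield)

print(ridge.coef_.shape)
```

Ordinary least squares has no unique solution when features outnumber rows, whereas the ridge penalty still yields one finite coefficient per feature.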