
# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
!pip install -U seaborn

Requirement already up-to-date: seaborn in /Users/samirgadkari/anaconda3/lib/python3.6/site-packages (0.9.0)


In [0]:
import seaborn as sns
print(sns.__version__)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

0.9.0


In [0]:
adults = pd.read_csv('https://archive.ics.uci.edu/ml/'
                     'machine-learning-databases/adult/adult.data',
                      header = None,
                      names = ['age', 'workclass',
                               'fnlwgt', 'education',
                               'education-num', 'marital-status',
                               'occupation', 'relationship',
                               'race', 'sex',
                               'capital-gain', 'capital-loss',
                               'hours-per-week', 'native-country',
                               'gt_or_lt_50K'],
                      index_col = False)

In [0]:
### From the adult_names file
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [0]:
adults.shape

(32561, 15)

In [0]:
adults.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,gt_or_lt_50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [0]:
adults.columns.to_series().groupby(adults.dtypes).groups

{dtype('int64'): Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
        'hours-per-week'],
       dtype='object'),
 dtype('O'): Index(['workclass', 'education', 'marital-status', 'occupation',
        'relationship', 'race', 'sex', 'native-country', 'gt_or_lt_50K'],
       dtype='object')}

In [0]:
def print_uniques(df):
    df1 = df.select_dtypes('O')
    cols = df1.columns
    for col in cols:
        uniques = df[col].unique()
        print('{:>15}:'.format(col),
              'num:', uniques.size,
              ','.join(uniques))
print_uniques(adults)

      workclass: num: 9  State-gov, Self-emp-not-inc, Private, Federal-gov, Local-gov, ?, Self-emp-inc, Without-pay, Never-worked
      education: num: 16  Bachelors, HS-grad, 11th, Masters, 9th, Some-college, Assoc-acdm, Assoc-voc, 7th-8th, Doctorate, Prof-school, 5th-6th, 10th, 1st-4th, Preschool, 12th
 marital-status: num: 7  Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed
     occupation: num: 15  Adm-clerical, Exec-managerial, Handlers-cleaners, Prof-specialty, Other-service, Sales, Craft-repair, Transport-moving, Farming-fishing, Machine-op-inspct, Tech-support, ?, Protective-serv, Armed-Forces, Priv-house-serv
   relationship: num: 6  Not-in-family, Husband, Wife, Own-child, Unmarried, Other-relative
           race: num: 5  White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other
            sex: num: 2  Male, Female
 native-country: num: 42  United-States, Cuba, Jamaica, India, ?, Mexico, South, Puerto-Rico, Honduras

### Missing values denoted as '?' in categorical columns.  Let's fix them.

In [0]:
cols = adults.select_dtypes(['O']).columns
cols

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country', 'gt_or_lt_50K'],
      dtype='object')

In [0]:
for col in cols:
    adults[col] = adults[col].replace({'?': 'unknown', ' ?': 'unknown'})
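
The `' ?'` key is needed because the categorical values in this file carry a leading space (visible in column names such as `workclass_ Federal-gov` later on). One alternative, sketched here on a small in-memory sample rather than the real file, is to let `read_csv` drop that whitespace with `skipinitialspace=True`, so only the bare `'?'` has to be replaced. The sample rows and the shortened column list are made up for illustration.

```python
import io

import pandas as pd

# Made-up rows in the same "comma + space" layout as adult.data
sample = io.StringIO(
    "39, State-gov, Adm-clerical, <=50K\n"
    "54, ?, ?, >50K\n"
    "31, Private, Prof-specialty, >50K\n"
)
col_names = ['age', 'workclass', 'occupation', 'gt_or_lt_50K']

# skipinitialspace=True strips the space that follows each delimiter
df = pd.read_csv(sample, header=None, names=col_names, skipinitialspace=True)

# Treat the missing-value marker as its own category in non-numeric columns
cat_cols = df.select_dtypes(exclude='number').columns
df[cat_cols] = df[cat_cols].replace('?', 'unknown')

print(df)
```

With the whitespace gone, the one-hot column names would also come out as `workclass_Private` rather than `workclass_ Private`.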

In [0]:
print_uniques(adults)

      workclass: num: 9  State-gov, Self-emp-not-inc, Private, Federal-gov, Local-gov,unknown, Self-emp-inc, Without-pay, Never-worked
      education: num: 16  Bachelors, HS-grad, 11th, Masters, 9th, Some-college, Assoc-acdm, Assoc-voc, 7th-8th, Doctorate, Prof-school, 5th-6th, 10th, 1st-4th, Preschool, 12th
 marital-status: num: 7  Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed
     occupation: num: 15  Adm-clerical, Exec-managerial, Handlers-cleaners, Prof-specialty, Other-service, Sales, Craft-repair, Transport-moving, Farming-fishing, Machine-op-inspct, Tech-support,unknown, Protective-serv, Armed-Forces, Priv-house-serv
   relationship: num: 6  Not-in-family, Husband, Wife, Own-child, Unmarried, Other-relative
           race: num: 5  White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other
            sex: num: 2  Male, Female
 native-country: num: 42  United-States, Cuba, Jamaica, India,unknown, Mexico, South, Puerto

### Convert >50K to 1 and <=50K to 0

In [0]:
adults['gt_or_lt_50K'] = adults['gt_or_lt_50K'].replace({'>50K': 1, ' >50K': 1, '<=50K': 0, ' <=50K': 0})

In [0]:
adults.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,gt_or_lt_50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,0
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,0
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,1
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,1
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,1


### Find and handle null values in numerical columns

In [0]:
pd.options.mode.use_inf_as_na = True

In [0]:
cols = adults.select_dtypes(['int64']).columns
cols

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'gt_or_lt_50K'],
      dtype='object')

In [0]:
adults.isnull().sum().sum()

0
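
The `pd.options.mode.use_inf_as_na` option is deprecated in recent pandas releases. A sketch of an equivalent check that does not rely on it, using a made-up toy frame: convert infinities to NaN explicitly, then count missing values as before.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only): one infinite value and one NaN
toy = pd.DataFrame({'a': [1.0, np.inf, 3.0],
                    'b': [1.0, np.nan, 2.0]})

# Turn +/- infinity into NaN so isnull() counts them too
cleaned = toy.replace([np.inf, -np.inf], np.nan)
n_missing = int(cleaned.isnull().sum().sum())

print(n_missing)
```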

### No null values in numerical columns
### Let's continue to one-hot encode categorical columns as required

In [0]:
df1 = adults[adults.select_dtypes(['O']).columns]
df1_encoded = pd.get_dummies(df1)
df2 = adults[adults.select_dtypes(['int64']).columns]
adults_encoded = df1_encoded.join(df2)
adults_encoded

Unnamed: 0,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,workclass_unknown,education_ 10th,...,native-country_ Vietnam,native-country_ Yugoslavia,native-country_unknown,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,gt_or_lt_50K
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,39,77516,13,2174,0,40,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,50,83311,13,0,0,13,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,38,215646,9,0,0,40,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,53,234721,7,0,0,40,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,28,338409,13,0,0,40,0
5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,37,284582,14,0,0,40,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,49,160187,5,0,0,16,0
7,0,0,0,0,0,1,0,0,0,0,...,0,0,0,52,209642,9,0,0,45,1
8,0,0,0,1,0,0,0,0,0,0,...,0,0,0,31,45781,14,14084,0,50,1
9,0,0,0,1,0,0,0,0,0,0,...,0,0,0,42,159449,13,5178,0,40,1


In [0]:
adults_encoded['gt_or_lt_50K'].unique()

array([0, 1])

In [0]:
adults_encoded.columns.to_series().groupby(adults_encoded.dtypes).groups

{dtype('uint8'): Index(['workclass_ Federal-gov', 'workclass_ Local-gov',
        'workclass_ Never-worked', 'workclass_ Private',
        'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
        'workclass_ State-gov', 'workclass_ Without-pay', 'workclass_unknown',
        'education_ 10th',
        ...
        'native-country_ Puerto-Rico', 'native-country_ Scotland',
        'native-country_ South', 'native-country_ Taiwan',
        'native-country_ Thailand', 'native-country_ Trinadad&Tobago',
        'native-country_ United-States', 'native-country_ Vietnam',
        'native-country_ Yugoslavia', 'native-country_unknown'],
       dtype='object', length=102),
 dtype('int64'): Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
        'hours-per-week', 'gt_or_lt_50K'],
       dtype='object')}
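
`pd.get_dummies` as used above keeps one indicator column per category, so the indicators for each original feature always sum to 1. scikit-learn's `LogisticRegression` is regularized by default and can still fit this, but the redundancy makes individual coefficients harder to read. A small sketch, on made-up rows, of the `drop_first=True` variant, which drops one category per feature and makes it the reference level:

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female'],
                    'age': [39, 50, 38]})

# One indicator per category
full = pd.get_dummies(toy, columns=['sex'])

# One indicator fewer: the first category (alphabetically) becomes the reference
reduced = pd.get_dummies(toy, columns=['sex'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, the coefficient on `sex_Male` reads directly as "Male relative to Female".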

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

### Let's scale the numeric features

In [0]:
cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
        'hours-per-week']
for col in cols:
    data = np.reshape(adults_encoded[col].values, (-1, 1))
    scaler = StandardScaler()
    scaler.fit(data)
    adults_encoded[col] = scaler.transform(data)

[39 50 38 ... 58 22 52]




In [0]:
adults_encoded.head(20)

Unnamed: 0,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,workclass_unknown,education_ 10th,...,native-country_ Vietnam,native-country_ Yugoslavia,native-country_unknown,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,gt_or_lt_50K
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,0
5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,-0.115955,0.898201,1.523438,-0.14592,-0.21666,-0.035429,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0.763796,-0.280358,-1.974858,-0.14592,-0.21666,-1.979184,0
7,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0.983734,0.188195,-0.42006,-0.14592,-0.21666,0.369519,1
8,0,0,0,1,0,0,0,0,0,0,...,0,0,0,-0.55583,-1.364279,1.523438,1.761142,-0.21666,0.774468,1
9,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0.250608,-0.28735,1.134739,0.555214,-0.21666,-0.035429,1
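
`StandardScaler` accepts a 2-D input and scales each column independently, so the per-column loop can be collapsed into a single call. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric columns (illustrative only)
toy = pd.DataFrame({'age': [39.0, 50.0, 38.0, 53.0],
                    'hours-per-week': [40.0, 13.0, 40.0, 45.0]})
num_cols = ['age', 'hours-per-week']

# fit_transform scales every listed column to mean 0 and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy[num_cols]),
                      columns=num_cols, index=toy.index)

print(scaled.round(3))
```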


In [0]:
X = adults_encoded.drop(['gt_or_lt_50K'], axis = 1)
y = adults_encoded['gt_or_lt_50K']
log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.8531064770737999
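
The score above is accuracy on the same rows the model was trained on. Two quick checks help put such a number in context: accuracy on a held-out split, and the accuracy of always predicting the most frequent class. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the class weights and sizes are arbitrary assumptions, not properties of the adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption, not the adult data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, weights=[0.75, 0.25],
                                     random_state=0)

# Hold out 25% of the rows, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

test_acc = model.score(X_test, y_test)          # held-out accuracy
baseline_acc = baseline.score(X_test, y_test)   # majority-class accuracy

print('model:', test_acc, 'baseline:', baseline_acc)
```

On the real data, the same comparison would use `adults_encoded` in place of the synthetic arrays.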

In [0]:
coeffs = log_reg.coef_[0]
cols = X.columns  # feature names only (no target), in the same order as coef_
coeffs_zipped = list(zip(cols, coeffs))  # we did not look at abs(coef_) here
coeffs_zipped.sort(key = lambda x: x[1], reverse = True)
coeffs_zipped

[('capital-gain', 2.308516922571051),
 ('marital-status_ Married-AF-spouse', 1.405812569698593),
 ('marital-status_ Married-civ-spouse', 1.2579294202231448),
 ('native-country_ Cambodia', 1.0372772378146318),
 ('relationship_ Wife', 1.0143217850602728),
 ('occupation_ Exec-managerial', 0.8616673668705758),
 ('native-country_ Italy', 0.8570886248992131),
 ('education-num', 0.7557519134316374),
 ('occupation_ Tech-support', 0.726205641420045),
 ('occupation_ Protective-serv', 0.6459302926136646),
 ('occupation_ Prof-specialty', 0.5920269984579891),
 ('native-country_ France', 0.569821113268248),
 ('education_ 5th-6th', 0.5568135328152521),
 ('native-country_ Yugoslavia', 0.5565147761864135),
 ('native-country_ Philippines', 0.5409461857704255),
 ('workclass_ Federal-gov', 0.5406366244979757),
 ('native-country_ Germany', 0.5277123740687741),
 ('education_ 1st-4th', 0.5116677338284811),
 ('native-country_ Ireland', 0.4718642728242551),
 ('native-country_ Japan', 0.4633450491184909),
 ('na

In [0]:
coeffs_zipped = list(zip(cols, np.abs(coeffs)))  # we're looking at abs(coef_) here
coeffs_zipped.sort(key = lambda x: x[1], reverse = True)
coeffs_zipped

[('capital-gain', 2.308516922571051),
 ('occupation_ Priv-house-serv', 1.6373919878329355),
 ('marital-status_ Married-AF-spouse', 1.405812569698593),
 ('native-country_ Columbia', 1.2879779871304833),
 ('marital-status_ Married-civ-spouse', 1.2579294202231448),
 ('marital-status_ Never-married', 1.2092333059359588),
 ('education_ Preschool', 1.1827809010550254),
 ('relationship_ Own-child', 1.1560253432762735),
 ('sex_ Female', 1.1300773279607017),
 ('native-country_ Cambodia', 1.0372772378146318),
 ('relationship_ Wife', 1.0143217850602728),
 ('workclass_ Without-pay', 1.0103452022678443),
 ('native-country_ Dominican-Republic', 0.9583763271298901),
 ('occupation_ Farming-fishing', 0.9128013392305521),
 ('occupation_ Exec-managerial', 0.8616673668705758),
 ('native-country_ Italy', 0.8570886248992131),
 ('marital-status_ Separated', 0.8419104958561208),
 ('relationship_ Other-relative', 0.8400696592943107),
 ('native-country_ South', 0.7979410517732731),
 ('native-country_ Vietnam', 

In [0]:
### Looks like these are the features with the largest coefficient magnitudes (positive or negative) for earning more than 50K:
  - 'capital-gain'
  - 'occupation_ Priv-house-serv'
  - 'marital-status_ Married-AF-spouse'
  - 'native-country_ Columbia'
  - 'marital-status_ Married-civ-spouse'
  - 'marital-status_ Never-married'
### Let's just take these and see what score we get

In [0]:
features = ['capital-gain',
            'marital-status_ Married-AF-spouse',
            'marital-status_ Married-civ-spouse',
            'marital-status_ Never-married',
            'native-country_ Columbia',
            'occupation_ Priv-house-serv']
X = adults_encoded[features]
y = adults_encoded['gt_or_lt_50K']
log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.7937716900586591

In [0]:
coeffs = log_reg.coef_[0]
cols = features
coeffs_zipped = list(zip(cols, coeffs))
coeffs_zipped.sort(key = lambda x: x[1], reverse = True)
coeffs_zipped

[('capital-gain', 2.3670893779855247),
 ('marital-status_ Married-civ-spouse', 2.104577444344599),
 ('marital-status_ Married-AF-spouse', 1.7910408349902298),
 ('marital-status_ Never-married', -0.7621859882521643),
 ('native-country_ Columbia', -1.4874709778484336),
 ('occupation_ Priv-house-serv', -2.1450309068039273)]
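
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to talk about. The sketch below reuses the six coefficients printed above (rounded). For the scaled `capital-gain` column the ratio is per one standard deviation; for the 0/1 indicator columns it compares rows with the indicator set against rows without it, holding the other features fixed.

```python
import numpy as np
import pandas as pd

# Coefficients copied (rounded) from the six-feature model output above
coefs = pd.Series({
    'capital-gain': 2.3671,
    'marital-status_ Married-civ-spouse': 2.1046,
    'marital-status_ Married-AF-spouse': 1.7910,
    'marital-status_ Never-married': -0.7622,
    'native-country_ Columbia': -1.4875,
    'occupation_ Priv-house-serv': -2.1450,
})

# exp(log-odds change) = multiplicative change in the odds of earning > 50K
odds_ratios = np.exp(coefs).sort_values(ascending=False)

print(odds_ratios.round(3))
```

A ratio above 1 raises the odds of earning more than 50K, and a ratio below 1 lowers them.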

### With this smaller set of features, we get a training accuracy of about 0.79.
### With all features, our training accuracy was about 0.85.
### So these six features alone recover most of the full model's accuracy on people earning > 50K.

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis


### Top features positively correlated with income:
  - 'capital-gain'
  - 'marital-status_ Married-AF-spouse'
  - 'marital-status_ Married-civ-spouse'
  
### Top features negatively correlated with income:
  - 'marital-status_ Never-married'
  - 'native-country_ Columbia'
  - 'occupation_ Priv-house-serv'
  
### How well the model explains the data, and insights we derive from it
  - With the top 3 positively and top 3 negatively weighted features, we get an accuracy of about 0.79.
  - With all features, our accuracy was about 0.85.
  - This smaller set of features captures most of the relationship with earning > 50K. We lose only about 6 percentage points of accuracy (0.853 vs. 0.794). Both scores are measured on the training data.
  - 'capital-gain': You can only have capital gains if you own investments (stocks, property, and so on). Owning investments usually means you have savings, which is more likely when your income is large.
  - 'marital-status': If you're married, you're steady and can help each other overcome difficulties. Overcoming difficulties may mean you get better at work and steadily earn more. Not sure why only certain marital statuses came out higher.
  - 'marital-status_ Never-married': You have no one to help you through tough times, and no one to push you to do better, which may be why this is negatively associated with income.
  - 'native-country_ Columbia': Not sure what this is about. "Columbia" is the dataset's spelling of Colombia. A country category like this probably covers few rows, so its coefficient may be noisy.
  - 'occupation_ Priv-house-serv': If you're working as a maid in a house and living there, you're probably not making much. Your housing costs will come out of income that you could have made if you were living on your own.
  
  In short:
  - Invest
  - Get married
  - Live in a nice country (get there somehow)
  - Do jobs related to making decisions/thinking rather than labor

### Match the model to the situation
  1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
  
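  Since we're looking at tiers, we should use Quantile Regression. Instead of fitting the conditional mean, it fits a chosen quantile of the grade distribution (for example the 10th percentile), so the model focuses on what drives the bottom tier rather than the average student.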
  
  2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
  
  Here we should use Survival Analysis. We are modeling the time until an event (a product launch), and some products will not have launched by the time the study ends, so their launch times are censored.
  
  3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.
  
  This is a case where you may have more features than observations. Here you should use Ridge Regression, whose penalty shrinks the coefficients and moves you away from overfitting your model to the data. Try different alpha values and plot each against the cross-validated mean squared error to find an alpha that adds enough bias to avoid overfitting, but not so much that the model underfits.
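
  The alpha search described for the plant data can be sketched as follows. The data here are synthetic (30 "plants" with 100 features, of which only 5 matter), and the alpha grid and fold count are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: far more features than observations
rng = np.random.default_rng(0)
n_plants, n_features = 30, 100
X_demo = rng.normal(size=(n_plants, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_plants)

# Cross-validated mean squared error for each candidate alpha
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_mse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             scoring='neg_mean_squared_error', cv=5)
    cv_mse[alpha] = -scores.mean()  # flip sign: sklearn reports negative MSE

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print('best alpha:', best_alpha)
```

  Plotting `cv_mse` against alpha on a log scale shows the usual U shape when both overfitting and underfitting are present in the grid.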