<a href="https://colab.research.google.com/github/donaldocelaj/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
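As a rough illustration of the two hints above, the sketch below one-hot encodes a categorical column and rescales the numeric columns to the [0, 1] range. The `toy` frame and its values are hypothetical and stand in for the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Toy frame (hypothetical values) with one categorical and two numeric columns.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "age": [25.0, 40.0, 55.0],
    "fnlwgt": [100000.0, 200000.0, 300000.0],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)

# Rescale each numeric column independently to the [0, 1] range.
numeric_cols = ["age", "fnlwgt"]
encoded[numeric_cols] = minmax_scale(encoded[numeric_cols])

print(encoded)
```

After this step every column lies between 0 and 1, so no single feature dominates purely because of its units.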

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [0]:
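# adult.data has no header row, so this default read uses the first record as column names (one row is lost)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')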

In [4]:
df.shape

(32560, 15)

In [5]:
df.columns

Index(['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married',
       ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0',
       ' 40', ' United-States', ' <=50K'],
      dtype='object')
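The column index above is made of data values, which suggests the first record was read as a header. A minimal sketch of avoiding that, assuming `header=None` plus explicit `names` is acceptable here; the two records are copied from the outputs in this notebook and read from an in-memory buffer (`sample`, a stand-in for the real file) so the example runs offline.

```python
import io
import pandas as pd

# The first two records of adult.data, as they appear in this notebook's outputs.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# Same column names as the notebook assigns in its next cell.
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'captial_gain', 'capital_loss', 'hours-per-week',
                'native-country', 'class']

# header=None keeps the first record as data; names supplies the labels.
df_sample = pd.read_csv(sample, header=None, names=column_names)

print(df_sample.shape)
print(df_sample['age'].tolist())
```

Applied to the full file, the same two arguments would keep every record, including the one that currently ends up in the column index.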

In [0]:
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital-status',
              'occupation', 'relationship', 'race', 'sex', 'captial_gain', 'capital_loss',
              'hours-per-week', 'native-country', 'class']

In [7]:
df.head(50)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital-status,occupation,relationship,race,sex,captial_gain,capital_loss,hours-per-week,native-country,class
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
5,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
6,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
7,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
8,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
9,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K


In [8]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
captial_gain      0
capital_loss      0
hours-per-week    0
native-country    0
class             0
dtype: int64

In [0]:
#Looking through the data I saw multiple ' ?' in the data
df = df.replace(' ?', np.nan)

In [10]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
captial_gain         0
capital_loss         0
hours-per-week       0
native-country     583
class                0
dtype: int64

In [11]:
# Fill each missing value by carrying the previous row's value forward
df['workclass'] = df['workclass'].ffill()
df['occupation'] = df['occupation'].ffill()
df['native-country'] = df['native-country'].ffill()
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
captial_gain      0
capital_loss      0
hours-per-week    0
native-country    0
class             0
dtype: int64
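Forward filling assigns each missing entry whatever value the previous row happened to hold. An alternative, following the note in Part 1 that missing categorical values can be treated as their own category, is sketched below. The rows in `toy` are hypothetical; only the `' ?'` marker comes from the data.

```python
import pandas as pd

# Hypothetical rows; ' ?' is the missing-value marker seen in the dataset.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " Adm-clerical"],
})

# Treat the marker as its own category instead of imputing a neighbor's value.
cleaned = toy.replace(" ?", "unknown")

# One-hot encoding then yields an explicit indicator for the unknown category.
encoded = pd.get_dummies(cleaned, dtype=int)

print(encoded.filter(like="unknown"))
```

This keeps the fact that a value was missing visible to the model, which forward filling hides.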

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
age               32560 non-null int64
workclass         32560 non-null object
fnlwgt            32560 non-null int64
education         32560 non-null object
education_num     32560 non-null int64
marital-status    32560 non-null object
occupation        32560 non-null object
relationship      32560 non-null object
race              32560 non-null object
sex               32560 non-null object
captial_gain      32560 non-null int64
capital_loss      32560 non-null int64
hours-per-week    32560 non-null int64
native-country    32560 non-null object
class             32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [13]:
df['class'].value_counts()

 <=50K    24719
 >50K      7841
Name: class, dtype: int64

In [0]:
df['class'] = df['class'].map({' <=50K':0, ' >50K':1})

In [15]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
captial_gain       int64
capital_loss       int64
hours-per-week     int64
native-country    object
class              int64
dtype: object

In [16]:
df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital-status,occupation,relationship,race,sex,captial_gain,capital_loss,hours-per-week,native-country,class
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,0
5,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,0
6,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,1
7,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,1
8,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,1
9,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,1


In [0]:
df2 = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])

In [18]:
df2.shape

(32560, 106)

In [19]:
df2.head(10)

Unnamed: 0,age,fnlwgt,education_num,captial_gain,capital_loss,hours-per-week,class,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,37,284582,14,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,49,160187,5,0,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,52,209642,9,0,0,45,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,31,45781,14,14084,0,50,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
8,42,159449,13,5178,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
9,37,280464,10,0,0,80,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
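To illustrate why the scaling hint matters, here is a small sketch on synthetic data (not the census data). Two features with very different ranges influence the outcome equally, and standardizing them before fitting puts their coefficients on a comparable footing. The names `X_demo`, `y_demo`, `small_scale`, and `large_scale` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features: equally important, but on very different scales.
X_demo = pd.DataFrame({
    "small_scale": rng.normal(0, 1, n),
    "large_scale": rng.normal(0, 100000, n),
})

# The true log-odds depend equally on both features once they are standardized.
log_odds = 1.5 * X_demo["small_scale"] + 1.5 * X_demo["large_scale"] / 100000
y_demo = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Standardize, then fit; coefficients are now per standard deviation of each feature.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_demo, y_demo)

coefs = pd.Series(
    pipeline.named_steps["logisticregression"].coef_[0],
    index=X_demo.columns,
).sort_values()
print(coefs)
```

Without the scaler, the coefficient on `large_scale` would be expressed per single unit of a feature that varies by hundreds of thousands, so it would look negligible next to `small_scale` despite mattering just as much.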

In [0]:
X = df2.drop(['class'], axis=1)

In [0]:
y = df2['class']

In [22]:
model = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=25000)
model.fit(X, y)
model.score(X, y)

0.7957923832923833

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [24]:
print(model.score(X_test, y_test))

0.7998771498771499
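An accuracy figure is easier to judge against a baseline. The sketch below computes the accuracy of always predicting the majority class, using the class counts and the full-data score printed earlier in this notebook; no model is refit here.

```python
# Class counts copied from the df['class'].value_counts() output above.
n_low, n_high = 24719, 7841

# Accuracy obtained by always predicting the majority class (<=50K).
baseline_accuracy = n_low / (n_low + n_high)

# model.score(X, y) as reported above.
model_accuracy = 0.7957923832923833

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"gain over baseline:      {model_accuracy - baseline_accuracy:.4f}")
```

Because the classes are imbalanced, the gain over this baseline is a more informative summary than the raw accuracy alone.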


In [25]:
model = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=40000)
model.fit(X, y)
model.score(X, y)

0.7957923832923833

In [26]:
print(model.coef_)

[[-1.37941013e-05 -3.24997433e-06 -3.15029809e-06  1.60624665e-04
   3.50288751e-04 -1.54406738e-05  3.11746545e-08 -8.03054606e-09
  -6.87461486e-10 -7.91255436e-07  8.87736079e-08 -4.68259407e-08
  -2.35723099e-08 -1.99807408e-09 -8.24303822e-08 -1.12660093e-07
  -3.74711823e-08 -1.39378372e-08 -2.66874311e-08 -6.18146762e-08
  -4.71731590e-08 -1.68869969e-08 -2.54310586e-08  2.36485965e-07
   6.94718117e-08 -5.64743492e-07  1.60794720e-07 -5.37853758e-09
   8.05582606e-08 -3.05117417e-07 -3.64919686e-07  1.44467047e-09
   8.90369108e-07 -3.63370099e-08 -1.05960577e-06 -8.99878113e-08
  -9.33850085e-08 -2.54507533e-07 -5.97549718e-10 -1.26922899e-07
   2.58820546e-07 -8.83482964e-08 -1.26410083e-07 -1.39686188e-07
  -3.47161814e-07 -1.77367211e-08  2.02944817e-07  1.12519300e-08
  -5.91179940e-08  4.10144049e-09 -6.90511592e-08  7.96089285e-07
  -6.71016986e-07 -9.52385082e-08 -5.62635323e-07 -3.27835670e-07
   1.08215697e-07 -3.53083951e-08 -2.62611105e-08 -1.69399884e-07
  -2.08413

In [27]:
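# Caution: df2.columns still includes 'class', so after 'hours-per-week' each name below is paired
# with the next feature's coefficient; X.columns (used for sorted_coef below) lines up correctly.
columns= list(df2.columns)
coefficients = list(model.coef_[0])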

print(columns)
print(coefficients)

['age', 'fnlwgt', 'education_num', 'captial_gain', 'capital_loss', 'hours-per-week', 'class', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'marital-status_ Divorced', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'marital-status_ Married-spouse-absent', 'marital-status_ Never-married', 'marital-status_ Separated', 'marital-status_ Widowed', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-manag

In [28]:
zipset = zip(columns, coefficients)
zipset = set(zipset)
print(zipset)

{('native-country_ South', 2.4611239540462006e-09), ('native-country_ Vietnam', 7.650036596417234e-10), ('education_ 9th', -1.6886996876678147e-08), ('native-country_ Ireland', 1.5543113415756758e-09), ('native-country_ Germany', -1.4081527529684625e-09), ('occupation_ Exec-managerial', -8.83482963825241e-08), ('relationship_ Husband', -6.710169857170158e-07), ('workclass_ Local-gov', -6.874614862159568e-10), ('workclass_ Self-emp-inc', -4.682594068053317e-08), ('workclass_ Never-worked', -7.912554355309496e-07), ('occupation_ Farming-fishing', -1.2641008279604676e-07), ('race_ Other', -5.006108131148072e-07), ('native-country_ Honduras', 7.398520547750988e-10), ('education_ Assoc-acdm', -2.5431058562703995e-08), ('education_ Some-college', -3.649196864322046e-07), ('native-country_ Italy', -4.20077614835203e-09), ('education_ Prof-school', -3.051174166225868e-07), ('education_ Doctorate', -5.647434923469572e-07), ('capital_loss', 0.0003502887511010761), ('education_ 1st-4th', -2.66874

In [0]:
sorted_coef = pd.Series(coefficients, index=X.columns)

In [39]:
sorted_coef = sorted_coef.sort_values(ascending=True)
sorted_coef

hours-per-week                       -1.544067e-05
age                                  -1.379410e-05
fnlwgt                               -3.249974e-06
education_num                        -3.150298e-06
marital-status_ Never-married        -1.059606e-06
sex_ Female                          -8.358275e-07
workclass_ Private                   -7.912554e-07
relationship_ Not-in-family          -6.710170e-07
native-country_ United-States        -6.634209e-07
education_ HS-grad                   -5.647435e-07
relationship_ Own-child              -5.626353e-07
race_ White                          -5.006108e-07
marital-status_ Divorced             -3.649197e-07
occupation_ Other-service            -3.471618e-07
relationship_ Unmarried              -3.278357e-07
education_ Some-college              -3.051174e-07
occupation_ Adm-clerical             -2.545075e-07
race_ Black                          -1.693999e-07
occupation_ Machine-op-inspct        -1.396862e-07
occupation_ Craft-repair       
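Raw coefficients on unscaled features are hard to read directly. One common way to interpret them, sketched here, is to convert a coefficient into a multiplicative change in the odds for a chosen change in the feature. This reading assumes a standard binary logistic model in which each coefficient is a change in log-odds per unit; whether it carries over unchanged to the `multi_class='multinomial'` fit used in this notebook is an assumption. The two coefficients are copied from the notebook's outputs, and the `delta` values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Two coefficients copied from the coefficient outputs in this notebook.
coef = pd.Series({"captial_gain": 1.606247e-04, "hours-per-week": -1.544067e-05})

# Illustrative changes in each feature (arbitrary choices for this sketch).
delta = pd.Series({"captial_gain": 1000, "hours-per-week": 10})

# exp(coef * delta) is the factor by which the odds change for that increase.
odds_multiplier = np.exp(coef * delta)
print(odds_multiplier)
```

A multiplier above 1 means the odds of income above 50K go up with the feature, and a multiplier below 1 means they go down.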

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

In [40]:
# 3 features positively correlated with >50K income
sorted_coef[-3:]

marital-status_ Married-civ-spouse    8.903691e-07
captial_gain                          1.606247e-04
capital_loss                          3.502888e-04
dtype: float64

In [41]:
# 3 features negatively correlated with >50k income
sorted_coef[:3]

hours-per-week   -0.000015
age              -0.000014
fnlwgt           -0.000003
dtype: float64

In [42]:
#How well my model does
print(model.score(X, y))

0.7957923832923833


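The model is about 80% accurate, but always predicting <=50K would already be right for 24719 of the 32560 rows (about 76%), so it only modestly improves on that baseline. It was surprising to see capital gain and capital loss come out as the strongest positive features, capital loss in particular. The negative coefficients on age, hours-per-week, and education_num also seem odd. The features were not scaled before fitting, so the coefficients are expressed per unit of features with very different ranges and are not directly comparable; their sizes and signs should be read with caution.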

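1. Quantile regression. We care about students in the bottom tier of grades, so we want to model a low quantile of the outcome rather than its mean.
2. Survival analysis. This is a time-to-event problem with censored data: for some products we do not yet know when they will be released.
3. Ridge regression. With many detailed features and only a few dozen plants, an unregularized regression is likely to overfit. Ridge reduces that risk by introducing some bias through a penalty that shrinks the coefficients.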
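The ridge answer can be illustrated with a small sketch on synthetic data, assuming a setup with more features than training observations (the names `X_plants` and `y_yield`, the sizes, and `alpha=10.0` are all made up for illustration). Ordinary least squares fits the training data essentially perfectly, while ridge shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_plants, n_features = 40, 30

# Synthetic measurements: only the first three features actually matter.
X_plants = rng.normal(size=(n_plants, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.0, 0.5]
y_yield = X_plants @ true_coef + rng.normal(scale=1.0, size=n_plants)

# 20 training rows against 30 features: more parameters than observations.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_plants, y_yield, test_size=0.5, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

print("OLS   train R^2:", ols.score(X_tr, y_tr), "test R^2:", ols.score(X_te, y_te))
print("Ridge train R^2:", ridge.score(X_tr, y_tr), "test R^2:", ridge.score(X_te, y_te))
print("coefficient norms (OLS, ridge):",
      np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

A near-perfect training score with a much weaker test score is the overfitting symptom the answer refers to; the ridge penalty gives up some training fit in exchange for smaller, more stable coefficients.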