# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!
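
As a rough illustration of what one-hot encoding does, here is a minimal sketch on a made-up three-row frame (the values are invented and the frame is not the challenge data). `drop_first=True` is optional; it drops one level per categorical column so the remaining dummy columns are not perfectly collinear.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values).
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "HS-grad", "Bachelors"],
})

# One-hot encode the categorical column. drop_first=True drops the first
# level ("Bachelors"), which becomes the reference category.
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True)
print(encoded.columns.tolist())
```

With a level dropped, each remaining dummy coefficient in a later model is read relative to that reference category.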

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
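
To see what min-max scaling does to columns with very different ranges, here is a small sketch on invented numbers (two columns standing in for something like age and capital gain). Each column is rescaled independently to the interval [0, 1].

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Two made-up columns on very different scales.
raw = np.array([[20.0, 0.0],
                [40.0, 5000.0],
                [60.0, 10000.0]])

# Per column: (x - column_min) / (column_max - column_min)
scaled = minmax_scale(raw)
print(scaled)
```

After scaling, a one-unit change means "from the column minimum to the column maximum" for every feature, which is what makes the coefficients roughly comparable.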

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.
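
The raw file marks missing categorical values with `?`. A minimal sketch of checking for that marker and keeping it as its own category is below. The three rows and the shortened column list are made up for illustration, and `skipinitialspace=True` is an optional choice that the cells in this notebook do not use (which is why the encoded column names further down contain a leading space).

```python
import io
import pandas as pd

# Three made-up rows in the same comma-plus-space layout as adult.data.
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
    "28, Private, Bachelors, <=50K\n"
)
cols = ["age", "working_class", "education", "fifty_k"]  # shortened, illustrative

# skipinitialspace strips the blank after each comma, so values read as
# "?" rather than " ?".
toy = pd.read_csv(sample, header=None, names=cols, skipinitialspace=True)

# Count the placeholder per column, then keep it as an explicit category.
print((toy == "?").sum())
toy["working_class"] = toy["working_class"].replace("?", "unknown")
```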

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [10]:
#Import pandas, load adult.data and create column names
import pandas as pd

names1 = ['age', 'working_class','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','fifty_k']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None )
df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [12]:
#rename columns
df.columns = names1
df.head(1)

Unnamed: 0,age,working_class,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,fifty_k
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [20]:
#one-hot encode all categorical data
obj_df = df.select_dtypes(include=['object']).copy()
df_encoded = pd.get_dummies(df, columns=list(obj_df), drop_first=True)


In [21]:
df_encoded.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,working_class_ Federal-gov,working_class_ Local-gov,working_class_ Never-worked,working_class_ Private,...,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,fifty_k_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [71]:
df_encoded.isna().sum().sum()

0

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#Plot the class counts of the target (0 = <=50K, 1 = >50K) rather than one bar per row
df_encoded['fifty_k_ >50K'].value_counts().plot.bar()


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not any particular plot (i.e. don't worry about polishing it).
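
One common way to make a logistic regression coefficient easier to read is to exponentiate it, which turns a change in log-odds into an odds ratio. The sketch below uses synthetic data with a single binary feature; the probabilities 0.3 and 0.6 are assumptions chosen for illustration, not values from the adult dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one binary feature that raises the chance of y = 1.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)
p = np.where(x == 1, 0.6, 0.3)          # assumed true probabilities
y = (rng.random(2000) < p).astype(int)

model = LogisticRegression().fit(x.reshape(-1, 1), y)

# A coefficient is a change in log-odds; exp() turns it into an odds ratio.
odds_ratio = np.exp(model.coef_[0][0])
print(round(odds_ratio, 2))
```

An odds ratio above 1 means the feature is associated with higher odds of the positive class, and below 1 with lower odds. Note that scikit-learn applies L2 regularization by default, so the estimate is shrunk slightly toward 1.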

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [30]:
#Separate data into features and target, and train and test
#Scaling has not been performed on this first attempt
import numpy as np
from sklearn.model_selection import train_test_split
X = df_encoded.loc[:,'age':'native-country_ Yugoslavia']
y = df_encoded.loc[:,'fifty_k_ >50K']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)


In [39]:
#Run regression and see score
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
log_reg = log.fit(X_train, y_train)
log_reg.score(X, y)



0.7972420994441203

In [40]:
#Set predictions variable
predictions = log.predict(X_test)

In [43]:
#Check predictions score
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

0.8026242322724735

In [55]:
#Print coefficients for all variables without scaling
print('Coefficients:\n')
for name, coef in zip(X.columns, log.coef_[0]):
    print(name, ': ', coef)

Coefficients:

age :  -0.004331431056283785
fnlwgt :  -3.5276733136160007e-06
education-num :  -0.002174801676520017
capital-gain :  0.00034435383011747664
capital-loss :  0.0007885400757040971
hours-per-week :  -0.01097542148279274
working_class_ Federal-gov :  0.0001401706797506627
working_class_ Local-gov :  5.607824273266792e-05
working_class_ Never-worked :  -3.0388381189261e-06
working_class_ Private :  -0.001769400155056068
working_class_ Self-emp-inc :  0.0003502722764794547
working_class_ Self-emp-not-inc :  -1.6225602938440587e-05
working_class_ State-gov :  -9.59779678008305e-07
working_class_ Without-pay :  -3.7059648254401763e-06
education_ 11th :  -0.0003747270328781491
education_ 12th :  -0.00011866654458144316
education_ 1st-4th :  -4.7688923418212265e-05
education_ 5th-6th :  -8.873340015383073e-05
education_ 7th-8th :  -0.00018398198409642845
education_ 9th :  -0.00014275647065825918
education_ Assoc-acdm :  -5.099898594286039e-05
education_ Assoc-voc :  -2.1704995374

In [58]:
#Scale data for next attempt at logistic regression
from sklearn.preprocessing import minmax_scale

df_encoded[['age','fnlwgt','capital-gain','capital-loss','hours-per-week']] = minmax_scale(df_encoded[['age','fnlwgt','capital-gain','capital-loss','hours-per-week']])




In [59]:
#Separate data again
X = df_encoded.loc[:,'age':'native-country_ Yugoslavia']
y = df_encoded.loc[:,'fifty_k_ >50K']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

In [60]:
#Run logistic regression
log = LogisticRegression()
log_reg = log.fit(X_train, y_train)
log_reg.score(X, y)



0.8504038573753877

In [63]:
#Set predictions
predictions = log.predict(X_test)

In [64]:
#Check accuracy of predictions
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

0.8496184626837893

In [67]:
#Print new coefficients
print('Coefficients after min_max scaling: \n')
for name, coef in zip(X.columns, log.coef_[0]):
    print(name, ': ', coef)

Coefficients after min_max scaling: 

age :  1.9441693007692153
fnlwgt :  1.0983267623162127
education-num :  0.14656629458160425
capital-gain :  16.288223159811785
capital-loss :  2.4645217461806657
hours-per-week :  2.7285412856987548
working_class_ Federal-gov :  0.8053279345903519
working_class_ Local-gov :  0.17819004488175175
working_class_ Never-worked :  -0.13811454456106279
working_class_ Private :  0.35909288410431384
working_class_ Self-emp-inc :  0.5628829087394709
working_class_ Self-emp-not-inc :  -0.11131522079921767
working_class_ State-gov :  0.09400118931510996
working_class_ Without-pay :  -0.45933499865428684
education_ 11th :  -0.34695414059217144
education_ 12th :  -0.07914888059605861
education_ 1st-4th :  -0.368496503730575
education_ 5th-6th :  -0.09056333414136558
education_ 7th-8th :  -0.4135484456009051
education_ 9th :  -0.3001855090515108
education_ Assoc-acdm :  0.12990564927140633
education_ Assoc-voc :  0.36915453183043717
education_ Bachelors :  0.6459

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.
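
Once the features are on a comparable scale, one simple way to find the strongest positive and negative features is to pair each coefficient with its column name and sort. This is a sketch on synthetic data; the column names `up`, `down`, and `noise` and the effect sizes are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.random((500, 3)), columns=["up", "down", "noise"])
# Made-up target: "up" raises and "down" lowers the log-odds of y = 1.
logit = 4 * toy_X["up"] - 4 * toy_X["down"]
toy_y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(toy_X, toy_y)

# Pair each coefficient with its column name, then sort ascending:
# most negative first, most positive last.
coefs = pd.Series(model.coef_[0], index=toy_X.columns).sort_values()
print(coefs)
```

Calling `coefs.head(3)` and `coefs.tail(3)` on a fitted model would then list candidate answers for the two "3 features" questions.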

---
### Peyton's Answers
1. Wife, age, and having a doctorate are positively correlated with income above 50k.

2. Own-child, Other-relative, and `native-country_ Laos` are all negatively correlated with income above 50k.

3. This model does a solid job of explaining the data. The score, which for scikit-learn's `LogisticRegression` is mean accuracy (the fraction of correct predictions) rather than an R^2-style measure, is about .85 for the test data. This is indicative of fairly strong predictive power, although accuracy is best judged against the share of the majority class. Before deriving too many "insights" from the data, an initial sanity check can be performed by looking at variables that are commonly known to predict increased income, such as higher education and having a wife. Both hold in this model. Some insights that may be drawn from the data (greater and lower incomes are relative to the 50k threshold):


    Which countries are associated with greater and lower incomes 
    
    Which categories of jobs are associated with greater and lower incomes 
    
    Which relationship statuses are associated with greater and lower incomes
    
    The relationship between race and incomes
    
    The relationship between sex and incomes
    
    The relationship between number of hours worked per week and income
    
You can also compare the impact of different factors by comparing the magnitude of their coefficients, but this could also be a reach without further exploration and processing of the data. 
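
Accuracy is easier to interpret next to a majority-class baseline. The sketch below shows the comparison on synthetic data in which roughly a quarter of the labels are positive; that class balance and the features are assumptions for illustration, not figures computed in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(13)
X_toy = rng.normal(size=(1000, 3))
# Made-up target: roughly a quarter positives, driven by the first feature.
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=1000) > 0.75).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.33, random_state=13)

# The baseline always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
clf = LogisticRegression().fit(Xtr, ytr)

baseline_acc = baseline.score(Xte, yte)
model_acc = clf.score(Xte, yte)
print("baseline accuracy:", baseline_acc)
print("model accuracy:   ", model_acc)
```

The gap between the two numbers, rather than the raw accuracy, is the better indication of how much the features contribute.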

---

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:

---
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.

### Answer

Quantile Regression: I don't actually think that quantile regression is appropriate here; it is simply an even worse fit for the other two situations. Quantile regression could tell you how the features affect the lowest performers, but it would not predict which observations will be low performers. Quantile regression might be more appropriate for designing an intervention (e.g. we see that the coefficient on feature X is quite large for our lowest performers, so let's design an intervention based around X).

---
  
  
---
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.

### Answer

Survival Analysis: this tool is used when modeling time to an event. In this case, the event is a product launch. 
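
As a rough sketch of the idea, the Kaplan-Meier estimator below computes the probability that a company has *not yet* launched by each time point, while still using companies whose next launch was never observed (censored durations). The durations are hypothetical, and the estimator is written out by hand with NumPy rather than with a survival analysis library.

```python
import numpy as np

# Hypothetical months between launches; observed=0 means the company had
# not launched again when observation stopped (a censored duration).
durations = np.array([3, 5, 5, 8, 12, 12, 15])
observed = np.array([1, 1, 0, 1, 1, 0, 1])

survival = 1.0
curve = {}
for t in np.unique(durations[observed == 1]):
    at_risk = np.sum(durations >= t)                      # still waiting at t
    events = np.sum((durations == t) & (observed == 1))   # launches at t
    survival *= 1 - events / at_risk                      # Kaplan-Meier step
    curve[int(t)] = round(float(survival), 3)

print(curve)
# {3: 0.857, 5: 0.714, 8: 0.536, 12: 0.357, 15: 0.0}
```

Censored companies still count in `at_risk` until they drop out, which is the main thing an ordinary regression on "time to launch" would get wrong.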

---


---
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

### Answer

Ridge Regression: this tool regularizes a model fit to data that has too many features relative to the number of observations. Fantastically detailed physical information likely means many features, while the total number of observations is quite limited.
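
The effect of the ridge penalty in a "few plants, many measurements" setting can be sketched on synthetic data. The sizes (30 plants, 200 features), the true coefficients, and `alpha=10.0` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n_plants, n_features = 30, 200             # far more features than plants
X_plants = rng.normal(size=(n_plants, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = 2.0                        # only a few features matter (assumed)
y_yield = X_plants @ true_coef + rng.normal(scale=1.0, size=n_plants)

ols = LinearRegression().fit(X_plants, y_yield)
ridge = Ridge(alpha=10.0).fit(X_plants, y_yield)

# The L2 penalty shrinks the coefficient vector relative to unpenalized
# least squares, which can fit the training data exactly when features
# outnumber observations.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("ridge coefficient norm:", ridge_norm)
```

In practice `alpha` would be chosen by cross-validation (for example with `RidgeCV`) rather than fixed by hand.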

---


Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis
