# Lambda School Data Science - Making Data-backed Assertions

This is, for many, the main point of data science - to create and support reasoned arguments based on evidence. It's not a topic to master in a day, but it is worth spending focused time thinking about it and structuring your approach.

## Lecture - generating a confounding variable

The prewatch material told a story about a hypothetical health condition where both drug usage and overall health outcome were related to gender. That makes gender a confounding variable, one that obscures the possible relationship between the drug and the outcome.

Let's use Python to generate data that actually behaves in this fashion!

In [0]:
#y = "health level" - predicted variable, dependent variable
#x = "took the drug" - explanatory variable, independent variable
#omitted variable == confounding variable
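Before moving to the lecture's own scenario, here is a minimal sketch of the drug/gender story from the prewatch material. All probabilities below are hypothetical, chosen only for illustration, and the drug is deliberately given no effect at all. Because gender drives both who takes the drug and who ends up healthy, the pooled data still shows a gap between takers and non-takers.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Hypothetical numbers, chosen only for illustration:
P_DRUG = {"F": 0.7, "M": 0.3}      # women take the drug more often
P_HEALTHY = {"F": 0.8, "M": 0.5}   # women also tend to have better outcomes
# Note: the drug itself has NO effect on the outcome in this simulation.

people = []
for _ in range(20000):
    gender = random.choice("FM")
    took_drug = random.random() < P_DRUG[gender]
    healthy = random.random() < P_HEALTHY[gender]
    people.append((gender, took_drug, healthy))


def healthy_rate(rows):
    """Share of rows with a healthy outcome."""
    return sum(healthy for _, _, healthy in rows) / len(rows)


def drug_gap(rows):
    """Healthy rate of drug takers minus healthy rate of non-takers."""
    takers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return healthy_rate(takers) - healthy_rate(others)


overall_gap = drug_gap(people)
gap_by_gender = {g: drug_gap([r for r in people if r[0] == g]) for g in "FM"}

print(f"overall gap: {overall_gap:+.3f}")
for g, gap in gap_by_gender.items():
    print(f"gap within {g}: {gap:+.3f}")
```

With these assumed probabilities the pooled gap should land near 0.12 (0.71 versus 0.59 in expectation), while the gap within each gender hovers around zero. Splitting by the confounder is what reveals that the drug does nothing here.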

In [4]:
import random
dir(random)  # Reminding ourselves what we can do here

['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']
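
Of the names listed above, the cells that follow only call `random.uniform`, `random.random`, and `random.shuffle`. As a minimal sketch (the seed value 42 is an arbitrary assumption, not something the lecture used), seeding the generator first makes every draw repeatable, which helps if you want to rerun the notebook and get the same users:

```python
import random

random.seed(42)  # arbitrary seed, chosen only for illustration
first_run = [random.uniform(10, 600) for _ in range(3)]

random.seed(42)  # re-seeding restarts the same pseudo-random sequence
second_run = [random.uniform(10, 600) for _ in range(3)]

print(first_run == second_run)  # True: identical draws after identical seeds
print(all(10 <= t <= 600 for t in first_run))  # True: draws stay within the bounds
```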

In [5]:
# Let's think of another scenario:
# We work for a company that sells accessories for mobile phones.
# They have an ecommerce site, and we are supposed to analyze logs
# to determine what sort of usage is related to purchases, and thus guide
# website development to encourage higher conversion.

# The hypothesis - users who spend longer on the site tend
# to spend more. Seems reasonable, no?

# But there's a confounding variable! If they're on a phone, they:
# a) Spend less time on the site, but
# b) Are more likely to be interested in the actual products!

# Let's use namedtuple to represent our data

from collections import namedtuple
# purchased and mobile are bools, time_on_site in seconds
User = namedtuple('User', ['purchased','time_on_site', 'mobile'])

example_user = User(False, 12, False)
print(example_user)

User(purchased=False, time_on_site=12, mobile=False)


In [6]:
# And now let's generate 1000 example users
# 750 mobile, 250 not (i.e. desktop)
# A desktop user has a base conversion likelihood of 10%
# And it goes up by 1% for each 15 seconds they spend on the site
# And they spend anywhere from 10 seconds to 10 minutes on the site (uniform)
# Mobile users spend on average half as much time on the site as desktop users
# But have three times the base likelihood of buying something

users = []

for _ in range(250):
  # Desktop users
  time_on_site = random.uniform(10, 600)
  purchased = random.random() < 0.1 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, False))
  
for _ in range(750):
  # Mobile users
  time_on_site = random.uniform(5, 300)
  purchased = random.random() < 0.3 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, True))
  
random.shuffle(users)
print(users[:10])

[User(purchased=False, time_on_site=164.79646490243542, mobile=False), User(purchased=True, time_on_site=227.08534865609417, mobile=True), User(purchased=False, time_on_site=103.39200081274801, mobile=True), User(purchased=False, time_on_site=285.4405206761921, mobile=False), User(purchased=True, time_on_site=250.8546254707819, mobile=True), User(purchased=False, time_on_site=236.66981547662084, mobile=False), User(purchased=False, time_on_site=26.30949284211047, mobile=True), User(purchased=True, time_on_site=168.57387994910874, mobile=False), User(purchased=True, time_on_site=222.0377544854059, mobile=True), User(purchased=True, time_on_site=58.44279672845045, mobile=True)]
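
As a sanity check on the generator above, the expected purchase rate of each group can be worked out by hand. The purchase probability is linear in `time_on_site` and never exceeds 1 for these ranges, so averaging over users only needs the mean of the uniform distribution. The helper name `expected_purchase_rate` is hypothetical, introduced here only for illustration:

```python
def expected_purchase_rate(base, low, high, per_second=1 / 1500):
    """Expected purchase probability when time_on_site ~ Uniform(low, high).

    Each user buys with probability base + time * per_second, so averaging
    over users only requires the mean time, (low + high) / 2.
    """
    mean_time = (low + high) / 2
    return base + mean_time * per_second


desktop = expected_purchase_rate(0.1, 10, 600)   # 0.1 + 305 / 1500
mobile = expected_purchase_rate(0.3, 5, 300)     # 0.3 + 152.5 / 1500
pooled = (250 * desktop + 750 * mobile) / 1000   # weighted by group size

print(f"desktop: {desktop:.3f}, mobile: {mobile:.3f}, pooled: {pooled:.3f}")
# desktop: 0.303, mobile: 0.402, pooled: 0.377
```

Rates measured on any one random sample of 1000 users will scatter around these values.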


In [7]:
# Let's put this in a dataframe so we can look at it more easily
import pandas as pd
user_data = pd.DataFrame(users)
user_data.head()

,purchased,time_on_site,mobile
0,False,164.796465,False
1,True,227.085349,True
2,False,103.392001,True
3,False,285.440521,False
4,True,250.854625,True
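
The column names in the table above come straight from the namedtuple's field names. A small sketch with two made-up rows shows the same behavior in isolation:

```python
from collections import namedtuple

import pandas as pd

User = namedtuple('User', ['purchased', 'time_on_site', 'mobile'])
toy_users = [User(False, 12.0, False), User(True, 250.5, True)]  # made-up rows

toy_frame = pd.DataFrame(toy_users)
print(list(toy_frame.columns))  # ['purchased', 'time_on_site', 'mobile']
print(toy_frame.dtypes)         # one dtype per field
```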


In [8]:
# Let's use crosstabulation to try to see what's going on
pd.crosstab(user_data['purchased'], user_data['time_on_site'])

time_on_site,5.660696932816149,6.269948851041285,6.52853902133554,7.160036244521695,7.666598108619139,7.980772235761906,8.586973805971457,9.27166227122264,9.631781636865963,10.37954764145567,...,589.3069726958985,591.0795331671477,592.7833214011026,593.0035014430161,593.1815901666499,594.0376470426305,595.1833036800011,595.4262827616765,595.8073612824071,596.4584359773023
purchased,,,,,,,,,,,,,,,,,,,,,
False,0,0,0,1,0,1,0,1,1,1,...,0,0,0,1,1,1,1,1,0,1
True,1,1,1,0,1,0,1,0,0,0,...,1,1,1,0,0,0,0,0,1,0


In [9]:
# OK, that's not quite what we want
# Time is continuous! We need to put it in discrete buckets
# Pandas calls these bins, and pandas.cut helps make them

time_bins = pd.cut(user_data['time_on_site'], 5)  # 5 equal-width bins
pd.crosstab(user_data['purchased'], time_bins)

time_on_site,"(5.07, 123.82]","(123.82, 241.98]","(241.98, 360.139]","(360.139, 478.299]","(478.299, 596.458]"
purchased,,,,,
False,230,201,107,35,35
True,116,143,83,17,33
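
Note that `pd.cut` with an integer splits the range of values into bins of equal width, so the bins can hold very different numbers of users (346 in the first bin above, 52 in the fourth). If bins with roughly equal counts are wanted instead, `pd.qcut` is one option. A minimal sketch on made-up values with one large outlier:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])  # made-up data with one large outlier

equal_width = pd.cut(values, 2)    # two bins spanning equal ranges of values
equal_count = pd.qcut(values, 2)   # two bins split at the median

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()
print(width_counts)  # [4, 1]
print(count_counts)  # [3, 2]
```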


In [10]:
# We can make this a bit clearer by normalizing (each column then sums to 1)
pd.crosstab(user_data['purchased'], time_bins, normalize='columns')

time_on_site,"(5.07, 123.82]","(123.82, 241.98]","(241.98, 360.139]","(360.139, 478.299]","(478.299, 596.458]"
purchased,,,,,
False,0.66474,0.584302,0.563158,0.673077,0.514706
True,0.33526,0.415698,0.436842,0.326923,0.485294
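
The `normalize` argument changes which question the table answers. As a sketch on a tiny made-up frame (the column name `visit` and its labels are hypothetical), `normalize='columns'` gives the share of purchases within each visit length, while `normalize='index'` gives the share of visit lengths within each purchase outcome:

```python
import pandas as pd

toy = pd.DataFrame({
    'purchased': ['yes', 'no', 'no', 'no', 'yes', 'yes'],
    'visit':     ['short', 'short', 'short', 'short', 'long', 'long'],
})

by_column = pd.crosstab(toy['purchased'], toy['visit'], normalize='columns')
by_row = pd.crosstab(toy['purchased'], toy['visit'], normalize='index')

# Share of short visits that ended in a purchase: 1 of 4
print(by_column.loc['yes', 'short'])
# Share of purchases that came from short visits: 1 of 3
print(by_row.loc['yes', 'short'])
```

For the hypothesis in this lecture, the column-normalized version is the relevant one, since it reads as a purchase rate per time bin.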


In [11]:
# That doesn't clearly support our hypothesis
# The purchase rate doesn't rise steadily with time on site; it even dips in the 360-478 second bin

# But we know why, since we generated the data!
# Let's look at mobile and purchased
pd.crosstab(user_data['purchased'], user_data['mobile'], normalize='columns')

mobile,False,True
purchased,,
False,0.7,0.577333
True,0.3,0.422667


In [14]:
# Yep, mobile users are more likely to buy things
# But we're still not seeing the *whole* story until we look at all 3 at once

# Live/stretch goal - how can we do that?

pd.crosstab(user_data['mobile'],
            [user_data['purchased'], time_bins],
            rownames=['device'],
            colnames=['purchased', 'time on site'],
            normalize='index')

purchased,False,False,False,False,False,True,True,True,True,True
time on site,"(5.07, 123.82]","(123.82, 241.98]","(241.98, 360.139]","(360.139, 478.299]","(478.299, 596.458]","(5.07, 123.82]","(123.82, 241.98]","(241.98, 360.139]","(360.139, 478.299]","(478.299, 596.458]"
device,,,,,,,,,,
False,0.116,0.18,0.124,0.14,0.14,0.032,0.02,0.048,0.068,0.132
True,0.268,0.208,0.101333,0.0,0.0,0.144,0.184,0.094667,0.0,0.0
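
The table above shows, for each device, how users are spread across the (purchased, time bin) combinations. To compare purchase rates directly, one alternative is to group by device and time bin and take the mean of `purchased`. This is a sketch under assumptions: it regenerates users with the same parameters as the lecture but an arbitrary seed, so the numbers will not match the outputs above, and it swaps in `groupby` rather than `crosstab`.

```python
import random

import pandas as pd

random.seed(42)  # arbitrary seed, only so the sketch is repeatable

rows = []
# (mobile, number of users, base purchase probability, min seconds, max seconds)
for mobile, n, base, low, high in [(False, 250, 0.1, 10, 600),
                                   (True, 750, 0.3, 5, 300)]:
    for _ in range(n):
        t = random.uniform(low, high)
        rows.append({'mobile': mobile,
                     'time_on_site': t,
                     'purchased': random.random() < base + t / 1500})

df = pd.DataFrame(rows)
bins = pd.cut(df['time_on_site'], 5)  # 5 equal-width bins

# The mean of a boolean column is the share of True values, i.e. the purchase rate
rate = df.groupby(['mobile', bins], observed=True)['purchased'].mean()
print(rate)
```

Within each device the rate should tend to rise with time on site, which is the relationship that the pooled table hides.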


## Assignment - what's going on here?

Consider the data in `persons.csv` (already prepared for you, in the repo for the week). It has four columns - a unique id, followed by age (in years), weight (in lbs), and exercise time (in minutes/week) of 1200 (hypothetical) people.

Try to figure out which variables are possibly related to each other, and which may be confounding relationships.
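
Besides crosstabs, one quick way to screen pairs of numeric columns is `DataFrame.corr()`. The sketch below uses a tiny made-up frame, not values from `persons.csv`; only the column names are borrowed from the assignment. A correlation matrix shows which pairs move together, but it cannot tell you which relationship is direct and which runs through a third variable.

```python
import pandas as pd

# Made-up values for illustration only, NOT taken from persons.csv
toy = pd.DataFrame({
    'age':           [20, 30, 40, 50, 60, 70],
    'exercise_time': [200, 220, 180, 150, 60, 30],
    'weight':        [130, 125, 140, 150, 190, 200],
})

corr = toy.corr()  # pairwise Pearson correlations
print(corr.round(2))
```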

In [16]:
# TODO - your code here
# Use what we did live in lecture as an example

# HINT - you can find the raw URL on GitHub and potentially use that
# to load the data with read_csv, or you can upload it yourself
person_data = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-1-Sprint-1-Dealing-With-Data/master/module4-databackedassertions/persons.csv')
person_data.head(30)

,Unnamed: 0,age,weight,exercise_time
0,0,44,118,192
1,1,41,161,35
2,2,46,128,220
3,3,39,216,57
4,4,28,116,182
5,5,58,103,165
6,6,55,161,107
7,7,21,188,37
8,8,55,216,79
9,9,50,127,267


In [22]:
print(pd.crosstab(person_data['exercise_time'], person_data['age']).shape)
pd.crosstab(person_data['exercise_time'], person_data['age'])

(294, 63)


age,18,19,20,21,22,23,24,25,26,27,...,71,72,73,74,75,76,77,78,79,80
exercise_time,,,,,,,,,,,,,,,,,,,,,
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,2,1,0,0,0,0,0,1,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
6,0,0,0,0,0,0,0,1,0,1,...,1,0,0,0,0,1,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
9,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
time_bins = pd.cut(person_data['exercise_time'], 5)  # 5 equal-width bins
age_bins = pd.cut(person_data['age'], 9)  # 9 equal-width bins
pd.crosstab(age_bins, time_bins)

exercise_time,"(-0.3, 60.0]","(60.0, 120.0]","(120.0, 180.0]","(180.0, 240.0]","(240.0, 300.0]"
age,,,,,
"(17.938, 24.889]",22,32,34,32,15
"(24.889, 31.778]",33,15,26,30,36
"(31.778, 38.667]",35,34,25,30,27
"(38.667, 45.556]",20,31,19,22,35
"(45.556, 52.444]",25,24,23,19,38
"(52.444, 59.333]",20,27,24,24,33
"(59.333, 66.222]",30,42,19,25,7
"(66.222, 73.111]",44,57,41,10,0
"(73.111, 80.0]",49,54,12,0,0


In [26]:
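# In the counts above, the longest exercise times drop off sharply after about age 60:
# few people over 60 exceed 240 minutes/week, and nobody over 73 exceeds 180

# normalize so each exercise_time column sums to 1 (proportions, not %)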
pd.crosstab(age_bins, time_bins, normalize='columns')

exercise_time,"(-0.3, 60.0]","(60.0, 120.0]","(120.0, 180.0]","(180.0, 240.0]","(240.0, 300.0]"
age,,,,,
"(17.938, 24.889]",0.079137,0.101266,0.152466,0.166667,0.078534
"(24.889, 31.778]",0.118705,0.047468,0.116592,0.15625,0.188482
"(31.778, 38.667]",0.125899,0.107595,0.112108,0.15625,0.141361
"(38.667, 45.556]",0.071942,0.098101,0.085202,0.114583,0.183246
"(45.556, 52.444]",0.089928,0.075949,0.103139,0.098958,0.198953
"(52.444, 59.333]",0.071942,0.085443,0.107623,0.125,0.172775
"(59.333, 66.222]",0.107914,0.132911,0.085202,0.130208,0.036649
"(66.222, 73.111]",0.158273,0.18038,0.183857,0.052083,0.0
"(73.111, 80.0]",0.176259,0.170886,0.053812,0.0,0.0


### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  What are the variable types in the data?
2.  What are the relationships between the variables?
3.  Which relationships are "real", and which spurious?


## Stretch goals and resources

The following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub.

- [Spurious Correlations](http://tylervigen.com/spurious-correlations)
- [NIH on controlling for confounding variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017459/)

Stretch goals:

- Produce your own plot inspired by the Spurious Correlation visualizations (and consider writing a blog post about it - both the content and how you made it)
- Pick one of the techniques that NIH highlights for confounding variables - we'll be going into many of them later, but see if you can find which Python modules may help (hint - check scikit-learn)