
# Lambda School Data Science - Making Data-backed Assertions

This is, for many, the main point of data science - to create and support reasoned arguments based on evidence. It's not a topic to master in a day, but it is worth some focused time thinking about and structuring your approach to it.

## Lecture - generating a confounding variable

The prewatch material told a story about a hypothetical health condition where both the drug usage and overall health outcome were related to gender - thus making gender a confounding variable, obfuscating the possible relationship between the drug and the outcome.

Let's use Python to generate data that actually behaves in this fashion!

In [11]:
import random
dir(random)  # Reminding ourselves what we can do here

['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']

In [12]:
# Let's think of another scenario:
# We work for a company that sells accessories for mobile phones.
# They have an ecommerce site, and we are supposed to analyze logs
# to determine what sort of usage is related to purchases, and thus guide
# website development to encourage higher conversion.

# The hypothesis - users who spend longer on the site tend
# to spend more. Seems reasonable, no?

# But there's a confounding variable! If they're on a phone, they:
# a) Spend less time on the site, but
# b) Are more likely to be interested in the actual products!

# Let's use namedtuple to represent our data

from collections import namedtuple
# purchased and mobile are bools, time_on_site in seconds
User = namedtuple('User', ['purchased','time_on_site', 'mobile'])

example_user = User(False, 12, False)
print(example_user)

User(purchased=False, time_on_site=12, mobile=False)


In [13]:
# And now let's generate 1000 example users
# 750 mobile, 250 not (i.e. desktop)
# A desktop user has a base conversion likelihood of 10%
# And it goes up by 1% for each 15 seconds they spend on the site
# And they spend anywhere from 10 seconds to 10 minutes on the site (uniform)
# Mobile users spend on average half as much time on the site as desktop
# But have three times as much base likelihood of buying something

users = []

for _ in range(250):
  # Desktop users
  time_on_site = random.uniform(10, 600)
  purchased = random.random() < 0.1 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, False))
  
for _ in range(750):
  # Mobile users
  time_on_site = random.uniform(5, 300)
  purchased = random.random() < 0.3 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, True))
  
random.shuffle(users)
print(users[:10])

[User(purchased=True, time_on_site=53.5929467752866, mobile=True), User(purchased=True, time_on_site=150.76910585097826, mobile=True), User(purchased=True, time_on_site=170.03487135583882, mobile=True), User(purchased=True, time_on_site=104.15982735126963, mobile=True), User(purchased=False, time_on_site=126.13567816767824, mobile=True), User(purchased=False, time_on_site=258.2386014244204, mobile=True), User(purchased=False, time_on_site=136.21318607819438, mobile=False), User(purchased=False, time_on_site=254.18809546002478, mobile=False), User(purchased=False, time_on_site=101.25037047071163, mobile=True), User(purchased=False, time_on_site=163.24518725479996, mobile=True)]


In [14]:
# Let's put this in a dataframe so we can look at it more easily
import pandas as pd
user_data = pd.DataFrame(users)
user_data.head()

   purchased  time_on_site  mobile
0       True     53.592947    True
1       True    150.769106    True
2       True    170.034871    True
3       True    104.159827    True
4      False    126.135678    True


In [15]:
# Let's use crosstabulation to try to see what's going on
pd.crosstab(user_data['purchased'], user_data['time_on_site'])

time_on_site,6.129894981962449,6.9656622293171875,7.347251770453456,7.367046730094224,7.44476098999802,7.94949158636552,9.24251473426126,9.245871213959026,9.402234725702828,9.420099436747357,...,573.3331918857986,574.9077802930975,576.0503267132963,584.9338508591928,589.0609969252337,589.289925493586,589.8807690567894,591.8667515062981,593.8984699426222,597.5552982534095
purchased
False,0,1,1,0,1,1,1,0,1,1,...,1,1,0,1,0,0,0,1,1,0
True,1,0,0,1,0,0,0,1,0,0,...,0,0,1,0,1,1,1,0,0,1


In [49]:
# OK, that's not quite what we want
# Time is continuous! We need to put it in discrete buckets
# Pandas calls these bins, and pandas.cut helps make them

time_bins = pd.cut(user_data['time_on_site'], 5)  # 5 equal-sized bins
pd.crosstab(user_data['purchased'], time_bins.astype(str))

time_on_site  (124.415, 242.7]  (242.7, 360.985]  (360.985, 479.27]  (479.27, 597.555]  (5.538, 124.415]
purchased
False                      196               100                 33                 28               254
True                       145                96                 21                 21               106


In [50]:
# We can make this a bit clearer by normalizing (getting %)
pd.crosstab(user_data['purchased'], time_bins.astype(str), normalize='columns')

time_on_site  (124.415, 242.7]  (242.7, 360.985]  (360.985, 479.27]  (479.27, 597.555]  (5.538, 124.415]
purchased
False                  0.57478          0.510204           0.611111           0.571429          0.705556
True                   0.42522          0.489796           0.388889           0.428571          0.294444


In [48]:
# That doesn't clearly support our hypothesis
# The purchase rate doesn't rise steadily with time on site; it even dips in the longer bins

# But we know why, since we generated the data!
# Let's look at mobile and purchased
pd.crosstab(user_data['purchased'], user_data['mobile'].astype(str), normalize='columns')

mobile     False      True
purchased
False        0.7  0.581333
True         0.3  0.418667


In [0]:
# Yep, mobile users are more likely to buy things
# But we're still not seeing the *whole* story until we look at all 3 at once

# Live/stretch goal - how can we do that?
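
One possible way to look at all three variables at once is to pass a list of row variables to `pd.crosstab`, so each row is a (device, time bin) pair. The sketch below regenerates the data with the lecture's rules so it runs on its own. The fixed seed, the 2-minute bin edges, and the bin labels are assumptions made here for illustration; they are not part of the lecture.

```python
import random
from collections import namedtuple

import pandas as pd

random.seed(42)  # assumption: a fixed seed, not used in the lecture, for repeatable output

User = namedtuple('User', ['purchased', 'time_on_site', 'mobile'])

users = []
for _ in range(250):
    # Desktop users, same rules as the lecture cell
    time_on_site = random.uniform(10, 600)
    purchased = random.random() < 0.1 + (time_on_site / 1500)
    users.append(User(purchased, time_on_site, False))
for _ in range(750):
    # Mobile users, same rules as the lecture cell
    time_on_site = random.uniform(5, 300)
    purchased = random.random() < 0.3 + (time_on_site / 1500)
    users.append(User(purchased, time_on_site, True))

user_data = pd.DataFrame(users)

# Assumption: fixed 2-minute bins with hypothetical labels, so the text labels sort in order
edges = [0, 120, 240, 360, 480, 600]
labels = ['0-2 min', '2-4 min', '4-6 min', '6-8 min', '8-10 min']
time_bins = pd.cut(user_data['time_on_site'], bins=edges, labels=labels).astype(str)

# Rows: (mobile, time bin); columns: purchased; each row is normalized to sum to 1
three_way = pd.crosstab(
    [user_data['mobile'], time_bins],
    user_data['purchased'],
    normalize='index',
)
print(three_way)
```

Reading down the `True` column within each device type compares purchase rates across time bins while holding `mobile` fixed, which is the comparison the two-variable tables could not make.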

## Assignment - what's going on here?

Consider the data in `persons.csv` (already prepared for you, in the repo for the week). It has four columns - a unique id, followed by age (in years), weight (in lbs), and exercise time (in minutes/week) of 1200 (hypothetical) people.

Try to figure out which variables are possibly related to each other, and which may be confounding relationships.
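
As a rough starting point, a correlation matrix gives a quick numeric summary of which pairs of variables move together. The sketch below uses only the first ten rows of `persons.csv`, copied from the file contents shown in this notebook, so that it runs without the file. The resulting numbers are illustrative only and say little about the full 1200-person dataset; the variable names `sample_csv`, `persons`, and `corr` are chosen here for the example.

```python
import io

import pandas as pd

# First ten rows of persons.csv, copied from the notebook's upload output.
# Assumption: this tiny sample is for illustration; load the full file for real analysis.
sample_csv = """,age,weight,exercise_time
0,44,118,192
1,41,161,35
2,46,128,220
3,39,216,57
4,28,116,182
5,58,103,165
6,55,161,107
7,21,188,37
8,55,216,79
9,50,127,267
"""

# The first (unnamed) column is the unique id, so use it as the index
persons = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Pairwise Pearson correlations between age, weight, and exercise_time
corr = persons.corr()
print(corr.round(2))
```

With the real file, replacing `io.StringIO(sample_csv)` with `'persons.csv'` should give the same kind of table for all rows. A correlation only measures linear association, so it complements the binned crosstabs rather than replacing them.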

In [19]:
from google.colab import files

files.upload()


Saving persons.csv to persons.csv


{'persons.csv': b',age,weight,exercise_time\n0,44,118,192\n1,41,161,35\n2,46,128,220\n3,39,216,57\n4,28,116,182\n5,58,103,165\n6,55,161,107\n7,21,188,37\n8,55,216,79\n9,50,127,267\n10,21,160,228\n11,43,102,78\n12,73,209,44\n13,27,165,48\n14,21,169,171\n15,36,131,194\n16,49,171,191\n17,69,172,147\n18,18,122,271\n19,55,157,111\n20,19,218,28\n21,34,143,24\n22,20,116,267\n23,20,159,241\n24,32,117,181\n25,71,103,21\n26,21,164,229\n27,79,189,38\n28,72,149,110\n29,26,117,279\n30,29,157,91\n31,40,168,115\n32,78,208,67\n33,70,169,172\n34,32,163,175\n35,61,133,147\n36,58,145,164\n37,41,158,63\n38,69,138,159\n39,40,200,78\n40,35,112,270\n41,80,186,87\n42,72,211,100\n43,63,158,151\n44,74,152,83\n45,52,140,187\n46,71,136,75\n47,27,192,6\n48,23,120,264\n49,49,149,171\n50,61,193,71\n51,49,140,280\n52,50,109,194\n53,60,134,162\n54,70,244,18\n55,34,101,182\n56,60,170,182\n57,47,200,105\n58,53,122,259\n59,69,153,43\n60,31,109,164\n61,68,109,25\n62,41,128,225\n63,28,142,215\n64,64,154,249\n65,31,231,2\n6

In [59]:
# Use what we did live in lecture as an example

# HINT - you can find the raw URL on GitHub and potentially use that
# to load the data with read_csv, or you can upload it yourself
import pandas as pd

df = pd.read_csv('persons.csv')

# Split each variable into 5 equal-width bins
age_cut = pd.cut(df['age'], 5)
exercise_cut = pd.cut(df['exercise_time'], 5)
weight_cut = pd.cut(df['weight'], 5)

# Cross-tabulate each pair of binned variables
# Note: astype(str) sorts the column labels as text, so bins can appear out of numeric order
print(pd.crosstab(age_cut, weight_cut.astype(str)), '\n')
print(pd.crosstab(age_cut, exercise_cut.astype(str)), '\n')
print(pd.crosstab(weight_cut, exercise_cut.astype(str)), '\n')

weight          (129.2, 158.4]  (158.4, 187.6]  (187.6, 216.8]  \
age                                                              
(17.938, 30.4]              86              49              34   
(30.4, 42.8]                62              49              31   
(42.8, 55.2]                62              49              26   
(55.2, 67.6]                71              45              44   
(67.6, 80.0]                54              66              44   

weight          (216.8, 246.0]  (99.854, 129.2]  
age                                              
(17.938, 30.4]               7               80  
(30.4, 42.8]                 7              104  
(42.8, 55.2]                 8               78  
(55.2, 67.6]                 9               53  
(67.6, 80.0]                22               60   

exercise_time   (-0.3, 60.0]  (120.0, 180.0]  (180.0, 240.0]  (240.0, 300.0]  \
age                                                                            
(17.938, 30.4]           

exercise_time    (-0.3, 60.0]  (120.0, 180.0]  (180.0, 240.0]  (240.0, 300.0]  (60.0, 120.0]
weight
(99.854, 129.2]            53              71              79             107             65
(129.2, 158.4]             44              67              74              74             76
(158.4, 187.6]             61              56              38              10             93
(187.6, 216.8]             76              29               1               0             73
(216.8, 246.0]             44               0               0               0              9


### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  What are the variable types in the data?

**Ans**: All three variables (age, weight, and exercise time) are integers.

2.  What are the relationships between the variables?

**Ans**: Weight and exercise time: heavier people tend to exercise for less time per week than lighter people.

Age and exercise time: exercise time does not seem to be strongly related to age. People of all ages may exercise for short periods, but older people are less likely to exercise for long periods.

Age and weight: there seems to be no real relationship, except that older people are more likely than others to weigh more than 200 lbs.

3.  Which relationships are "real", and which spurious?

**Ans**: A real relationship is one where the values of two variables depend on each other, so that a change in one goes along with a systematic change in the other. The relationship between weight and exercise time looks real: as weight increases, exercise time decreases.

A spurious relationship is one where two variables appear to be associated but have no direct logical connection. The relationships between age and weight, and between age and exercise time, appear to be spurious.
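
One hedged way to probe whether a relationship is spurious is to check whether it survives after controlling for a third variable. The sketch below uses synthetic data, not `persons.csv`: it assumes a hypothetical chain in which age influences exercise time and exercise time influences weight, with made-up coefficients and noise levels. It then compares the raw age-weight correlation with the partial correlation after removing the linear effect of exercise time from both. The helper `residuals` is defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical causal chain (all numbers are assumptions for illustration):
# age -> exercise time -> weight, with no direct age -> weight effect
age = rng.uniform(18, 80, n)
exercise = 300 - 2.5 * (age - 18) + rng.normal(0, 40, n)
weight = 220 - 0.3 * exercise + rng.normal(0, 15, n)


def residuals(y, x):
    """Return what is left of y after subtracting a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation: age and weight look related
raw = np.corrcoef(age, weight)[0, 1]

# Partial correlation: correlate the parts of age and weight not explained by exercise
partial = np.corrcoef(residuals(age, exercise), residuals(weight, exercise))[0, 1]

print(f"raw age-weight correlation:       {raw:.2f}")
print(f"partial (controlling exercise):   {partial:.2f}")
```

In this simulated setup the age-weight association is driven entirely by exercise time, so the partial correlation should be close to zero while the raw one is not. Whether the real data behaves this way is a question for the assignment, not something this sketch establishes.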

## Stretch goals and resources

Following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub.

- [Spurious Correlations](http://tylervigen.com/spurious-correlations)
- [NIH on controlling for confounding variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017459/)

Stretch goals:

- Produce your own plot inspired by the Spurious Correlation visualizations (and consider writing a blog post about it - both the content and how you made it)
- Pick one of the techniques that NIH highlights for confounding variables - we'll be going into many of them later, but see if you can find which Python modules may help (hint - check scikit-learn)