It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing).

In [1]:
# Download the data and save it to a file called "housing.data".

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
urlretrieve(data_url, "housing.data")

('housing.data', <httplib.HTTPMessage instance at 0x104702518>)

The data file does not contain the column names in its first line, so we'll need to add them manually. You can find the names and explanations [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names). We've extracted the names below for convenience. Feel free to rename them if other names would be more helpful to you.

In [2]:
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

Load the data in through any method you choose. Make sure to include the column names so that you may conduct your analysis more easily.

In [3]:
import pandas as pd

# The file is whitespace-delimited and has no header row.
data = pd.read_csv("housing.data", header=None, names=names, sep=r"\s+")

data.head()

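|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |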


Exercise 1: Conduct a brief integrity check of your data. This integrity check should include,
but is not limited to, checking for missing values and making sure all values make logical
sense (e.g., is one variable a percentage, but some observations exceed 100%?).

Summarize your findings in a few sentences, including what you checked and, if appropriate, any 
steps you took to rectify potential integrity issues.

In [4]:
## There is no missing data and there are no other data integrity issues.
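
One way such a check could look is sketched below. It runs on the five rows shown in the `data.head()` output rather than the full file, and the bounds it tests (percentages between 0 and 100, `CHAS` restricted to 0 or 1, no negative values) are assumptions drawn from the variable descriptions, not an official validation rule set. The name `sample` is introduced here for illustration only.

```python
import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
         "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# First five rows of the data set, copied from the data.head() output.
rows = [
    [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.0900, 1, 296.0, 15.3, 396.90, 4.98, 24.0],
    [0.02731, 0.0, 7.07, 0, 0.469, 6.421, 78.9, 4.9671, 2, 242.0, 17.8, 396.90, 9.14, 21.6],
    [0.02729, 0.0, 7.07, 0, 0.469, 7.185, 61.1, 4.9671, 2, 242.0, 17.8, 392.83, 4.03, 34.7],
    [0.03237, 0.0, 2.18, 0, 0.458, 6.998, 45.8, 6.0622, 3, 222.0, 18.7, 394.63, 2.94, 33.4],
    [0.06905, 0.0, 2.18, 0, 0.458, 7.147, 54.2, 6.0622, 3, 222.0, 18.7, 396.90, 5.33, 36.2],
]
sample = pd.DataFrame(rows, columns=names)

# 1. Count missing values in each column.
missing_counts = sample.isnull().sum()

# 2. Columns assumed to be percentages should lie in [0, 100].
percent_cols = ["ZN", "INDUS", "AGE", "LSTAT"]
percent_ok = ((sample[percent_cols] >= 0) & (sample[percent_cols] <= 100)).all().all()

# 3. The dummy variable should only take the values 0 and 1.
chas_ok = sample["CHAS"].isin([0, 1]).all()

# 4. None of these variables should be negative.
nonnegative_ok = (sample >= 0).all().all()

print(missing_counts.sum(), percent_ok, chas_ok, nonnegative_ok)
```

Replacing `sample` with the full `data` frame applies the same checks to all 506 observations.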

Exercise 2: For what two attributes does it make the *least* sense to calculate mean and median? Why?

In [5]:
## CHAS and RAD. CHAS is a dummy variable (1 if the tract bounds the Charles River, 0 otherwise),
## so its values are labels rather than quantities. RAD is an index of accessibility to radial
## highways. It has many low values and, after a large gap, a cluster of higher values. It stands
## to reason that this is not a "true" quantitative variable in the sense that the difference
## between RAD = 1 and RAD = 2 may not be the same as the difference between RAD = 2 and RAD = 3.

In [7]:
import numpy as np
for i in data.columns.values:
    for j in data.columns.values:
        print(i,j,np.corrcoef(data[i],data[j]))

('CRIM', 'CRIM', array([[ 1.,  1.],
       [ 1.,  1.]]))
('CRIM', 'ZN', array([[ 1.        , -0.20046922],
       [-0.20046922,  1.        ]]))
('CRIM', 'INDUS', array([[ 1.        ,  0.40658341],
       [ 0.40658341,  1.        ]]))
('CRIM', 'CHAS', array([[ 1.        , -0.05589158],
       [-0.05589158,  1.        ]]))
('CRIM', 'NOX', array([[ 1.        ,  0.42097171],
       [ 0.42097171,  1.        ]]))
('CRIM', 'RM', array([[ 1.       , -0.2192467],
       [-0.2192467,  1.       ]]))
('CRIM', 'AGE', array([[ 1.        ,  0.35273425],
       [ 0.35273425,  1.        ]]))
('CRIM', 'DIS', array([[ 1.        , -0.37967009],
       [-0.37967009,  1.        ]]))
('CRIM', 'RAD', array([[ 1.        ,  0.62550515],
       [ 0.62550515,  1.        ]]))
('CRIM', 'TAX', array([[ 1.        ,  0.58276431],
       [ 0.58276431,  1.        ]]))
('CRIM', 'PTRATIO', array([[ 1.        ,  0.28994558],
       [ 0.28994558,  1.        ]]))
('CRIM', 'B', array([[ 1.        , -0.38506394],
       [-0.38

Exercise 3: Which two variables have the strongest linear association? Report both variables, the metric you chose as the basis for your comparison, and the value of that metric. *(Hint: Make sure you consider only variables for which it makes sense to find a linear association.)*

In [8]:
## Solution: NOX (nitric oxides concentration) and DIS (weighted distances to five Boston employment centers)
## have the strongest linear association. Using the Pearson correlation coefficient as the metric,
## the correlation between NOX and DIS is -0.76923.
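
As an alternative to scanning the printed `np.corrcoef` output by eye, the search can be automated. The sketch below is one possible approach, not the method used for the solution above: it uses `DataFrame.corr()` (Pearson by default) and picks the pair with the largest absolute coefficient. The helper `strongest_pair` and the `toy` data are hypothetical and do not come from `housing.data`.

```python
from itertools import combinations

import pandas as pd

# Toy data for illustration only; these are not values from housing.data.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [9.8, 8.1, 6.2, 3.9, 2.0],  # falls almost linearly as "a" rises
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
    "flag": [0, 1, 0, 1, 0],         # dummy variable, excluded below
})

def strongest_pair(df, exclude=()):
    """Return the pair of columns with the largest |Pearson r|, and that r."""
    corr = df.drop(columns=list(exclude)).corr()
    # Visit each unordered pair of distinct columns exactly once.
    best = max(combinations(corr.columns, 2),
               key=lambda pair: abs(corr.loc[pair[0], pair[1]]))
    return best, corr.loc[best[0], best[1]]

pair, r = strongest_pair(toy, exclude=["flag"])
print(pair, round(r, 4))
```

With the housing data, the call might look like `strongest_pair(data, exclude=["CHAS", "RAD"])`, which leaves out the two variables identified in Exercise 2.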

Exercise 4: Which variable has the most symmetric distribution? Which variable has the most left-skewed (negatively skewed) distribution? Which variable has the most right-skewed (positively skewed) distribution? (Do not scale for this exercise.) Defend your method for determining these variables.

In [9]:
for i in data.columns.values:
    diff = np.mean(data[i]) - np.median(data[i])  # mean minus median
    spread = np.max(data[i]) - np.min(data[i])    # range of the variable
    print(i, diff, diff / spread)

('CRIM', 3.3570135573122535, 0.037732022987018228)
('ZN', 11.363636363636363, 0.11363636363636363)
('INDUS', 1.4467786561265044, 0.053034408215780954)
('CHAS', 0.069169960474308304, 0.069169960474308304)
('NOX', 0.016695059288537206, 0.034351973844726762)
('RM', 0.076134387351786792, 0.014587926298483772)
('AGE', -8.9250988142292158, -0.091916568632638682)
('DIS', 0.58759268774703433, 0.053432575339144153)
('RAD', 4.5494071146245059, 0.19780030933150025)
('TAX', 78.237154150197625, 0.14930754608816341)
('PTRATIO', -0.59446640316203414, -0.063241106719365336)
('B', -34.765968379447429, -0.087664452013332575)
('LSTAT', 1.293063241106724, 0.03568055301067119)
('MEDV', 1.3328063241106989, 0.029617918313571086)


In [10]:
## Solution: Using mean(data) - median(data) as the metric to assess how symmetric a variable is,
## the variable with the most symmetric distribution is NOX. The variable with the most left-skewed
## distribution is B. The variable with the most right-skewed distribution is TAX.

## You can use different metrics - but be sure you can defend your choice!

Exercise 5: As you may have noticed, the spread of the distribution contributed significantly to the numbers that helped you to answer Exercise 4. Repeat Exercise 4, but scale your results by the range of that variable.

In [11]:
## Solution: Using (mean(data) - median(data)) / (max(data) - min(data)) as the metric to assess how symmetric
## a variable is, the variable with the most symmetric distribution is RM (its scaled value, about 0.0146,
## is the closest to zero). The variable with the most left-skewed distribution is AGE. The variable with
## the most right-skewed distribution is RAD.

## You can use different metrics - but be sure you can defend your choice!
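
To illustrate what a different metric might look like, the sketch below compares the scaled mean-minus-median gap with the sample skewness returned by pandas' `Series.skew()`. This is only one defensible alternative, and the two metrics need not rank variables identically. The `toy` data and the helper `scaled_mean_median_gap` are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Toy columns for illustration only; not values from housing.data.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tail": [1, 1, 2, 2, 3, 4, 20],      # one large value pulls the mean up
    "left_tail": [1, 17, 18, 19, 19, 20, 20],  # one small value pulls the mean down
})

def scaled_mean_median_gap(s):
    """(mean - median) / range, the metric used in Exercise 5."""
    return (s.mean() - s.median()) / (s.max() - s.min())

gap = toy.apply(scaled_mean_median_gap)  # one value per column
skew = toy.skew()                        # sample skewness per column

# Positive values suggest right skew, negative values left skew,
# and values near zero suggest symmetry.
print(pd.DataFrame({"gap": gap, "skew": skew}))
```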

Exercise 6: Conduct a full univariate analysis on MEDV, CHAS, TAX, and RAD. For each variable, you should answer the three questions generally asked in a univariate analysis using the most appropriate metrics. If you feel there is additional information that is relevant, include it. 

In [12]:
## Sketch of Answer: You should report at least one measure of center, one measure of spread, and
## a description (metric-based or plot-based) of the shape of the distribution of each variable.
## Defending these choices is better (e.g., "median is a better measure of center than mean
## because..."). Including multiple measures of center and/or spread and interpreting what these
## reveal about the distribution of a variable is especially good. Finally, including a plot
## that goes along with these metrics and this description would turn this answer from a "good"
## one into a "great" one. A report to a supervisor should ideally include these points.
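
The metrics mentioned in this sketch could be collected as follows. This is a minimal illustration under stated assumptions, not a complete analysis: `univariate_summary` and `categorical_summary` are hypothetical helper names, the `medv` values are the five MEDV values from the `data.head()` output, and the `chas` values are made up for the example.

```python
import pandas as pd

def univariate_summary(s):
    """Center, spread, and shape metrics for a quantitative variable."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),
        "iqr": q3 - q1,
        "range": s.max() - s.min(),
        "skew": s.skew(),
    }

def categorical_summary(s):
    """Mode and category proportions for a dummy or categorical variable."""
    proportions = s.value_counts(normalize=True)
    return {"mode": proportions.idxmax(), "proportions": proportions.to_dict()}

medv = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2])  # MEDV values from data.head()
chas = pd.Series([0, 0, 1, 0, 0, 0, 1, 0])        # hypothetical values

medv_summary = univariate_summary(medv)
chas_summary = categorical_summary(chas)
print(medv_summary)
print(chas_summary)
```

For MEDV and TAX, `univariate_summary(data["MEDV"])` would be the natural call. For CHAS and RAD, the frequency-based summary is likely the better fit, in line with the answer to Exercise 2.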

Exercise 7: Exercises 3 through 6 have used inferential statistics, descriptive statistics, or both. For each exercise, identify the branch of statistics on which you relied for your answer.

In [13]:
## Solution: For all exercises, we relied only on descriptive statistics.

Exercise 8: It seems likely that this data set is a census - that is, it includes the entire target population. Suppose that 506 observations were too many for our computer (as unlikely as this might be) and we needed to pare the data down to fewer observations. Set the seed equal to the sum of the first ten rows of 'RAD' and use the random.sample() function to select 50 observations. Find the mean 'AGE' of these observations. ([This documentation](https://docs.python.org/2/library/random.html) may be helpful.)

In [14]:
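import random

# The slice [0:9] selects rows 0 through 8 (nine rows), whose RAD values sum to 29.
# The first ten rows would be data['RAD'][0:10].
sum(data['RAD'][0:9])
random.seed(29)
rand_sample = random.sample(list(data['AGE']), 50)  # 50 values, drawn without replacement
np.mean(rand_sample)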

68.438000000000002

Exercise 9: In Exercise 8, identify the type of sampling used.

In [15]:
## Solution: Simple random sampling was used.
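
For comparison, pandas offers its own way to draw a simple random sample of rows. The sketch below uses `DataFrame.sample` on a hypothetical stand-in frame (`frame`, with AGE set to 0 through 505), so it will not reproduce the mean reported in Exercise 8. It uses a different random number generator than `random.sample()`, and the same seed value gives a different selection.

```python
import pandas as pd

# Hypothetical stand-in for the 506-row data set; AGE here is just 0..505.
frame = pd.DataFrame({"AGE": range(506)})

# Draw 50 rows without replacement; random_state plays the role of the seed.
sample_a = frame.sample(n=50, random_state=29)
sample_b = frame.sample(n=50, random_state=29)

# Every row has the same chance of selection, and no row is chosen twice.
mean_age = sample_a["AGE"].mean()
print(len(sample_a), sample_a.index.is_unique, sample_a.equals(sample_b))
```

Fixing `random_state` makes the draw reproducible, which is the same purpose `random.seed()` serves in the exercise.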

BONUS: Of the remaining types of sampling about which we learned, describe (but do not execute) how you might implement at least one of these types of sampling.

In [1]:
## Potential Solution: Stratified random sampling is a method used when we want to protect ourselves from
## a potentially "bad" or "skewed" simple random sample. The variable CHAS takes on two values: 1 and 0. Rather
## than selecting 50 observations at random, we could look at the proportion of 1s and 0s for the CHAS variable,
## select 50 * (proportion of 1s) observations where CHAS = 1 and then select 50 * (proportion of 0s)
## observations where CHAS = 0.
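
Although the exercise does not ask for an implementation, the description above could be sketched roughly as follows. The helper `stratified_sample` and the `toy` data are hypothetical. Rounding each stratum's share to a whole number is an assumption of this sketch, and it means the total can differ slightly from the requested size when the proportions do not divide evenly.

```python
import pandas as pd

def stratified_sample(df, by, n, seed):
    """Sample about n rows, keeping each stratum's share of the population."""
    proportions = df[by].value_counts(normalize=True)
    parts = []
    for value, share in proportions.items():
        k = int(round(n * share))  # rows to draw from this stratum
        stratum = df[df[by] == value]
        parts.append(stratum.sample(n=k, random_state=seed))
    return pd.concat(parts)

# Toy data for illustration only: 10 rows with CHAS = 1 and 90 with CHAS = 0.
toy = pd.DataFrame({"CHAS": [1] * 10 + [0] * 90, "AGE": range(100)})

result = stratified_sample(toy, by="CHAS", n=50, seed=0)
print(result["CHAS"].value_counts().to_dict())
```

Here the sample keeps the 10% share of CHAS = 1 rows found in the toy population, which a simple random sample of the same size would only match on average.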