# Practice: Basic Statistics I: Averages

For this practice, let's use the Boston dataset.

In [118]:
# Import the numpy package so that we can use its mean function to calculate averages
import numpy as np

In [119]:
# Import the load_boston function
# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version
from sklearn.datasets import load_boston

In [120]:
# Import pandas, so that we can work with the data frame version of the Boston data
import pandas as pd

In [121]:
# Load the Boston data
boston = load_boston()

In [122]:
# This will provide the characteristics for the Boston dataset
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [123]:
# Here, I'm including the prices of Boston's houses, which is boston['target'], as a column with the other 
# features in the Boston dataset.
boston_data = np.concatenate((boston['data'], pd.DataFrame(boston['target'])), axis = 1)

In [124]:
# Convert the Boston data to a data frame format, so that it's easier to view and process
boston_df = pd.DataFrame(boston_data, columns = np.concatenate((boston['feature_names'], 'MEDV'), axis = None))
boston_df

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
5,0.02985,0.0,2.18,0.0,0.458,6.430,58.7,6.0622,3.0,222.0,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0.0,0.524,6.012,66.6,5.5605,5.0,311.0,15.2,395.60,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.90,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5.0,311.0,15.2,386.71,17.10,18.9
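
The two cells above attach the target to the feature matrix and then label the columns. A minimal sketch of the same idea on a toy slice, assuming three rows and two features copied from the table above. It uses `np.column_stack` and `np.append` in place of the `pd.DataFrame` wrapper, and the names `features`, `target`, `feature_names`, and `toy_df` are illustrative rather than taken from the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for boston['data'], boston['target'], and boston['feature_names'].
# Values are copied from the first three rows of the table above (CRIM and RM only).
features = np.array([[0.00632, 6.575],
                     [0.02731, 6.421],
                     [0.02729, 7.185]])
target = np.array([24.0, 21.6, 34.7])
feature_names = np.array(['CRIM', 'RM'])

# column_stack treats the 1-D target as one extra column.
combined = np.column_stack((features, target))
# np.append returns a new array with 'MEDV' added at the end.
columns = np.append(feature_names, 'MEDV')

toy_df = pd.DataFrame(combined, columns=columns)
print(toy_df)
```

Either approach should give a frame whose last column holds the prices. `np.column_stack` simply avoids building a temporary data frame to make the target two-dimensional.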


In [125]:
# Determine the mean of each feature
averages_column = np.mean(boston_df, axis = 0)
print(averages_column)

CRIM         3.593761
ZN          11.363636
INDUS       11.136779
CHAS         0.069170
NOX          0.554695
RM           6.284634
AGE         68.574901
DIS          3.795043
RAD          9.549407
TAX        408.237154
PTRATIO     18.455534
B          356.674032
LSTAT       12.653063
MEDV        22.532806
dtype: float64
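
The `axis` argument decides which direction gets collapsed. A small sketch on a two-row array may make this easier to see. The values are the RM, AGE, and MEDV entries of rows 0 and 1 from the table above, and the name `sample` is illustrative:

```python
import numpy as np

# Two rows, three columns (RM, AGE, MEDV), copied from rows 0 and 1 above.
sample = np.array([[6.575, 65.2, 24.0],
                   [6.421, 78.9, 21.6]])

col_means = np.mean(sample, axis=0)  # collapse the rows: one mean per column
row_means = np.mean(sample, axis=1)  # collapse the columns: one mean per row

print(col_means)  # three values, one for each column
print(row_means)  # two values, one for each row
```

With `axis=0` the result has one entry per column, which is why the output above lists one average per feature.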


In [126]:
# Determine the mean of each row
averages_row = np.mean(boston_df, axis = 1)
print(averages_row)

0      59.635666
1      56.235315
2      55.298456
3      52.585755
4      53.731875
5      53.256432
6      61.520342
7      64.543646
8      64.077024
9      62.390724
10     63.379471
11     62.601226
12     59.316977
13     59.951733
14     60.346704
15     59.437714
16     56.999681
17     60.880721
18     50.586658
19     60.061236
20     61.470549
21     61.904810
22     62.467812
23     62.892474
24     62.291561
25     55.078724
26     60.745351
27     55.671012
28     61.852906
29     60.935961
         ...    
476    90.554622
477    88.216579
478    89.752321
479    89.804136
480    88.547301
481    88.859420
482    89.237483
483    86.210763
484    84.816184
485    86.797169
486    89.342475
487    86.874712
488    92.280340
489    88.865841
490    87.484433
491    92.284703
492    91.821659
493    65.674786
494    65.189448
495    64.136614
496    67.542371
497    66.809291
498    66.533016
499    66.892266
500    67.353899
501    57.842659
502    58.516841
503    59.5819

So we can determine the averages by row, but should we do this? Why or why not?

**Answer:** These values are very hard to interpret, because each feature is measured in different units and on a different scale, so taking an average across features does not make sense.
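
To see how scale drives a row average, here is a rough sketch using row 0 of the data frame, with values copied from the table shown earlier. The names `row0` and `big_share` are illustrative:

```python
import pandas as pd

# Row 0 of boston_df, copied from the table shown earlier.
row0 = pd.Series({
    'CRIM': 0.00632, 'ZN': 18.0, 'INDUS': 2.31, 'CHAS': 0.0, 'NOX': 0.538,
    'RM': 6.575, 'AGE': 65.2, 'DIS': 4.09, 'RAD': 1.0, 'TAX': 296.0,
    'PTRATIO': 15.3, 'B': 396.90, 'LSTAT': 4.98, 'MEDV': 24.0,
})

row_mean = row0.mean()
# Share of the row total that comes from the two largest-valued columns.
big_share = (row0['TAX'] + row0['B']) / row0.sum()

print(round(row_mean, 6))   # matches the first row average printed above
print(round(big_share, 2))  # fraction of the total contributed by TAX and B
```

TAX and B together make up roughly 83% of the row total, so the row average mostly tracks those two columns and says very little about crime rate, room count, or price.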

Let's put together what you have learned about averages and subsetting to do the next problems. 

We will determine the average price for houses along the Charles River and that for houses NOT along the river.

In [130]:
# Use the query method to define a subset of boston_df that only includes houses along the river (CHAS == 1).
along_river = boston_df.query('CHAS == 1')
along_river

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
142,3.32105,0.0,19.58,1.0,0.871,5.403,100.0,1.3216,5.0,403.0,14.7,396.9,26.82,13.4
152,1.12658,0.0,19.58,1.0,0.871,5.012,88.0,1.6102,5.0,403.0,14.7,343.28,12.12,15.3
154,1.41385,0.0,19.58,1.0,0.871,6.129,96.0,1.7494,5.0,403.0,14.7,321.02,15.12,17.0
155,3.53501,0.0,19.58,1.0,0.871,6.152,82.6,1.7455,5.0,403.0,14.7,88.01,15.02,15.6
160,1.27346,0.0,19.58,1.0,0.605,6.25,92.6,1.7984,5.0,403.0,14.7,338.92,5.5,27.0
162,1.83377,0.0,19.58,1.0,0.605,7.802,98.2,2.0407,5.0,403.0,14.7,389.61,1.92,50.0
163,1.51902,0.0,19.58,1.0,0.605,8.375,93.9,2.162,5.0,403.0,14.7,388.45,3.32,50.0
208,0.13587,0.0,10.59,1.0,0.489,6.064,59.1,4.2392,4.0,277.0,18.6,381.32,14.66,24.4
209,0.43571,0.0,10.59,1.0,0.489,5.344,100.0,3.875,4.0,277.0,18.6,396.9,23.09,20.0
210,0.17446,0.0,10.59,1.0,0.489,5.96,92.1,3.8771,4.0,277.0,18.6,393.25,17.27,21.7


What do you notice about the CHAS column? 

**Answer:** It's all 1.0! This means that we successfully subsetted all houses that are along the Charles River. Great work!
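
The same subset can also be taken with a boolean mask instead of `query`. A minimal sketch, assuming a four-row sample with only the CHAS and MEDV columns (values copied from the tables above; `sample_df`, `by_query`, and `by_mask` are illustrative names):

```python
import pandas as pd

# Small sample: rows 0 and 1 are not along the river, rows 142 and 152 are.
sample_df = pd.DataFrame(
    {'CHAS': [0.0, 0.0, 1.0, 1.0], 'MEDV': [24.0, 21.6, 13.4, 15.3]},
    index=[0, 1, 142, 152],
)

# Subset with a query string, as in the cell above.
by_query = sample_df.query('CHAS == 1')
# Subset with a boolean mask built from the CHAS column.
by_mask = sample_df[sample_df['CHAS'] == 1]

# Both approaches should select exactly the same rows.
print(by_query.equals(by_mask))
```

Which form to use is mostly a matter of taste. `query` reads closer to plain language, while the mask form works with any Python expression that produces a boolean Series.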

In [128]:
# Now determine the average price for these houses. 'MEDV' is the column name for the prices. 
averages_price_along_river = np.mean(along_river['MEDV'])
averages_price_along_river

28.44

Now try determining the average for houses NOT along the river.

In [129]:
# Determine the average price for houses that are NOT along the Charles River (when CHAS = 0). 
not_along_river = boston_df.query('CHAS == 0')
averages_price_not_along_river = np.mean(not_along_river['MEDV'])
averages_price_not_along_river

22.093842887473482
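
As a sanity check, the two group averages should combine back into the overall MEDV average when each is weighted by its group size. The sketch below uses only numbers printed earlier in this notebook. Recovering the group size from the CHAS mean is an assumption that relies on CHAS being a 0/1 column, in which case its mean is the fraction of ones:

```python
n_total = 506                        # number of instances, from the dataset description
chas_mean = 0.069170                 # mean of the CHAS column, from the column averages
mean_river = 28.44                   # average MEDV for CHAS == 1
mean_not_river = 22.093842887473482  # average MEDV for CHAS == 0

# For a 0/1 column, mean * count gives the number of ones.
n_river = round(n_total * chas_mean)
n_not_river = n_total - n_river

# Weighted average of the two group means.
overall = (n_river * mean_river + n_not_river * mean_not_river) / n_total

print(n_river, n_not_river)
print(round(overall, 6))
```

The result agrees with the MEDV value in the column averages (22.532806), which suggests the two subsets together cover every row exactly once.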

Good work! You're becoming an expert in subsetting and determining averages on subsetted data. This will be integral for your capstone projects and future careers as data scientists! 