# Practice: Basic Statistics I: Averages

For this practice, let's use the California dataset.

In [1]:
# Import the numpy package so that we can use its mean function to calculate averages
import numpy as np

In [2]:
# Import the fetch_california_housing function to load the California data later on
from sklearn.datasets import fetch_california_housing

In [3]:
# Import pandas, so that we can work with the data frame version of the California data
import pandas as pd

In [4]:
# Load the California data
california = fetch_california_housing()

In [5]:
# This will provide the characteristics for the California dataset
print(california.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [6]:
# Convert the housing object to a data frame format, so that it's easier to view and process
california_df = pd.DataFrame(california['data'], columns = california['feature_names'])
# Here, I'm adding the prices of California's houses, california['target'],
# as a column alongside the other features in the California dataset.
california_df['HouseValue'] = california['target']
california_df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | HouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [7]:
# Determine the mean of each feature
averages_column = np.mean(california_df, axis = 0)
print(averages_column)

MedInc           3.870671
HouseAge        28.639486
AveRooms         5.429000
AveBedrms        1.096675
Population    1425.476744
AveOccup         3.070655
Latitude        35.631861
Longitude     -119.569704
HouseValue       2.068558
dtype: float64


In [8]:
# Determine the mean of each row
averages_row = np.mean(california_df, axis = 1)
print(averages_row)

0         33.562744
1        262.094029
2         54.061360
3         60.454940
4         61.045845
            ...    
20635     88.830077
20636     34.017826
20637    105.942697
20638     76.494316
20639    148.160729
Length: 20640, dtype: float64


So we can determine the averages by row, but should we do this? Why or why not?

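**Answer:** These values are very hard to interpret, because each row mixes features measured in different units and on different scales (income, house age, population counts, coordinates), so averaging across them does not make sense.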
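To see why, here is a minimal sketch that rebuilds the first two rows of the table above by hand. The names `sample_df`, `row_means`, and `row_means_without_population` are hypothetical and not part of the exercise. Population is in the hundreds or thousands while most other features are single or double digits, so the row average mostly tracks Population.

```python
import pandas as pd

# First two rows of california_df, copied from the table above
# (sample_df is an illustrative name, not part of the exercise)
sample_df = pd.DataFrame(
    {
        "MedInc": [8.3252, 8.3014],
        "HouseAge": [41.0, 21.0],
        "AveRooms": [6.984127, 6.238137],
        "AveBedrms": [1.023810, 0.971880],
        "Population": [322.0, 2401.0],
        "AveOccup": [2.555556, 2.109842],
        "Latitude": [37.88, 37.86],
        "Longitude": [-122.23, -122.22],
        "HouseValue": [4.526, 3.585],
    }
)

# Row averages blend income, years, people, and degrees into one number
row_means = sample_df.mean(axis=1)
print(row_means)

# Dropping Population changes the row averages drastically,
# which shows how much that single column drives them
row_means_without_population = sample_df.drop(columns="Population").mean(axis=1)
print(row_means_without_population)
```

The first printout should match the first two values of `averages_row` above. The second shows that removing a single column moves both averages by a wide margin.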

Let's put together what you have learned about averages and subsetting to do the next problems. 

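We will determine the average price for houses less than 20 years old, and then the average price for houses 20 years old or more. Keep in mind that each row is a block group, so `HouseAge` is the median house age and `HouseValue` is the median house value, expressed in units of $100,000.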

In [9]:
# Use the query method to define a subset of california_df that only includes houses less than 20 years old.
newer_houses = california_df.query('HouseAge < 20')
newer_houses

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | HouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 59 | 2.5625 | 2.0 | 2.771930 | 0.754386 | 94.0 | 1.649123 | 37.82 | -122.29 | 0.600 |
| 75 | 0.9241 | 17.0 | 2.817768 | 1.052392 | 762.0 | 1.735763 | 37.81 | -122.28 | 1.775 |
| 77 | 1.1111 | 19.0 | 5.830918 | 1.173913 | 721.0 | 3.483092 | 37.81 | -122.28 | 1.083 |
| 80 | 1.5000 | 17.0 | 3.197232 | 1.000000 | 609.0 | 2.107266 | 37.81 | -122.28 | 1.625 |
| 87 | 0.7600 | 10.0 | 2.651515 | 1.054545 | 546.0 | 1.654545 | 37.81 | -122.27 | 1.625 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20632 | 3.1250 | 15.0 | 6.023377 | 1.080519 | 1047.0 | 2.719481 | 39.26 | -121.45 | 1.156 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


What do you notice about the HouseAge column? 

**Answer:** At a glance, the displayed values are all less than 20! This is a good sign that we successfully subsetted the houses that are less than 20 years old. Great work!
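A glance only covers the handful of rows that are displayed. Here is a minimal sketch of a programmatic check, using a small made-up data frame in place of `california_df`. The names `toy_df`, `newer`, and `older` are hypothetical, and the values are invented for illustration.

```python
import pandas as pd

# Made-up data standing in for california_df (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "HouseAge": [2.0, 17.0, 19.0, 20.0, 41.0, 52.0],
        "HouseValue": [0.6, 1.775, 1.083, 2.0, 4.526, 3.521],
    }
)

newer = toy_df.query("HouseAge < 20")

# The largest age in the subset should still be below 20
print(newer["HouseAge"].max())

# Every row in the subset satisfies the condition
print((newer["HouseAge"] < 20).all())

# The subset and its complement together account for every row
# (this assumes HouseAge has no missing values, as DESCR states for this dataset)
older = toy_df.query("HouseAge >= 20")
print(len(newer) + len(older) == len(toy_df))
```

The same three checks could be run on `newer_houses` and `california_df` directly.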

In [10]:
# Now determine the average price for these houses. 'HouseValue' is the column name for the prices. 
averages_newer_houses = np.mean(newer_houses['HouseValue'])
averages_newer_houses

1.9326925875085794

Now try determining the average for houses 20 years or older.

In [11]:
# Determine the average price for houses that are 20 years or older. 
older_houses = california_df.query('HouseAge >= 20')
averages_older_houses = np.mean(older_houses['HouseValue'])
averages_older_houses

2.122016487307589
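As a cross-check, both subset averages can be computed in one pass by grouping on the condition itself. The overall average should then equal the subset averages weighted by the number of rows in each subset. Here is a minimal sketch on made-up data, where `toy_df`, `is_older`, `summary`, `weighted_mean`, and the values are all hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up data standing in for california_df (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "HouseAge": [2.0, 17.0, 19.0, 20.0, 41.0, 52.0],
        "HouseValue": [0.6, 1.775, 1.083, 2.0, 4.526, 3.521],
    }
)

# Boolean label: True for houses 20 years or older, False otherwise
is_older = (toy_df["HouseAge"] >= 20).rename("is_older")

# Mean HouseValue and row count for each group in one call
summary = toy_df.groupby(is_older)["HouseValue"].agg(["mean", "count"])
print(summary)

# The size-weighted average of the group means recovers the overall mean
weighted_mean = np.average(summary["mean"], weights=summary["count"])
print(np.isclose(weighted_mean, toy_df["HouseValue"].mean()))
```

Applied to `california_df`, the same idea would tie `averages_newer_houses` and `averages_older_houses` back to the overall `HouseValue` mean reported earlier.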

Good work! You're becoming an expert in subsetting and determining averages on subsetted data. This will be integral for your capstone projects and future careers as data scientists! 