For this exercise, let's use the California housing dataset.

In [1]:
# Import the fetch_california_housing function
from sklearn.datasets import fetch_california_housing

In [2]:
# Import pandas, so that we can work with the data frame version of the California housing dataset
import pandas as pd

In [3]:
# Load the California housing data
california = fetch_california_housing()

In [4]:
# This will provide the characteristics for the California housing dataset
print(california.DESCR)

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.




In [5]:
# Convert the California housing data to a data frame format, so that it's easier to view and process
california_df = pd.DataFrame(california['data'], columns = california['feature_names'])
california_df['HouseValue'] = california['target']
california_df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,HouseValue
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
5,4.0368,52.0,4.761658,1.103627,413.0,2.139896,37.85,-122.25,2.697
6,3.6591,52.0,4.931907,0.951362,1094.0,2.128405,37.84,-122.25,2.992
7,3.1200,52.0,4.797527,1.061824,1157.0,1.788253,37.84,-122.25,2.414
8,2.0804,42.0,4.294118,1.117647,1206.0,2.026891,37.84,-122.26,2.267
9,3.6912,52.0,4.970588,0.990196,1551.0,2.172269,37.84,-122.25,2.611


Determine the percentage of recently built houses (i.e. houses with an age less than 10 years). 

In [6]:
# Determine number of houses with an age less than 10 years. 
num_houses_new = sum(california_df['HouseAge'] < 10)

# Determine the total number of houses in the dataset
total_num = len(california_df['HouseAge'])

# Calculate the percentage of recently built houses.
num_houses_new/total_num*100

6.3226744186046515
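
As a side note, the count-then-divide steps above can be collapsed: the mean of a boolean Series is the fraction of True values. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the real dataset) so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for california_df['HouseAge']
toy_df = pd.DataFrame({'HouseAge': [5.0, 12.0, 41.0, 8.0, 52.0]})

# Boolean mask: True where the age is less than 10 years
is_new = toy_df['HouseAge'] < 10

# The mean of a boolean Series is the fraction of True values,
# so multiplying by 100 gives the percentage directly
pct_new = is_new.mean() * 100

# Same result as the count-then-divide approach used above
pct_new_long = sum(is_new) / len(toy_df) * 100

print(pct_new, pct_new_long)
```

On the real data, the equivalent expression would be `(california_df['HouseAge'] < 10).mean() * 100`.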

What is the easiest way to calculate the percentage of houses that are 10 years old or older? Try to do this in one line of code.

In [7]:
100 - num_houses_new/total_num*100

93.677325581395351

That's right! Just take the difference from 100%.

Now, let's check that this is correct by calculating the percentage directly with comparison operators (<, >, <=, >=, !=, ==).

In [8]:
# Determine number of houses with an age of 10 years or greater. 
num_houses_older = sum(california_df['HouseAge'] >= 10)

# Calculate the percentage of older houses.
num_houses_older/total_num*100

93.677325581395351
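
The two results agree because `< 10` and `>= 10` split the rows into two groups that do not overlap. Here is a minimal sketch of that check on made-up data (`toy_df` is hypothetical), assuming there are no missing ages:

```python
import pandas as pd

# Hypothetical toy data standing in for california_df['HouseAge']
toy_df = pd.DataFrame({'HouseAge': [5.0, 12.0, 41.0, 8.0, 52.0]})

# Count each group with its own comparison
num_new = (toy_df['HouseAge'] < 10).sum()
num_older = (toy_df['HouseAge'] >= 10).sum()
total = len(toy_df)

# With no missing ages, every row falls into exactly one group,
# so the counts add up to the total and the percentages to 100
counts_match = (num_new + num_older == total)
pct_sum = num_new / total * 100 + num_older / total * 100

print(counts_match, pct_sum)
```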

Nicely done! 

Let's do another problem. Determine the percentage of houses that are younger than 20 years old and have an average value of at least $80,000. The target is expressed in units of $100,000, so this means an average house value greater than or equal to 0.8.

You'll be using logical operators (&, |) to solve this problem. 

In [9]:
# Determine number of houses with an age less than 20 years and valued at $80,000 or more
num_houses_interest = sum((california_df['HouseAge'] < 20) & (california_df['HouseValue'] >= .8))

# Calculate the percentage of houses that meet both conditions.
num_houses_interest/total_num*100

26.148255813953487
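
The parentheses around each comparison matter: in Python, `&` binds more tightly than `<` and `>=`, so without them the expression is not read as two separate conditions. One way to keep this readable is to name each mask first, as in this sketch on made-up data (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for california_df
toy_df = pd.DataFrame({
    'HouseAge':   [5.0, 12.0, 41.0, 8.0, 52.0],
    'HouseValue': [0.6, 1.5, 2.3, 3.1, 0.7],
})

# Build each condition as its own boolean mask
is_young = toy_df['HouseAge'] < 20
is_valuable = toy_df['HouseValue'] >= 0.8

# & combines the masks element by element (logical AND)
both = is_young & is_valuable

# Cross-check against plain Python `and`, applied row by row
row_by_row = [
    age < 20 and value >= 0.8
    for age, value in zip(toy_df['HouseAge'], toy_df['HouseValue'])
]

pct_both = both.mean() * 100
print(both.tolist() == row_by_row, pct_both)
```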

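Now let's calculate the percentage of houses that fall outside that group: houses that are 20 years or older, or have an average value of less than $80,000, or both. This is the complement of the previous condition, so the two percentages must add up to 100%.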

Let's start with the easiest way to calculate this percentage. Try to do this in one line of code.

In [10]:
100 - num_houses_interest/total_num*100

73.851744186046517

Great! Now let's determine this with logical operators. 

In [11]:
# Determine number of houses with an age greater than or equal to 20 years or valued at less than $80,000
num_houses_interest_other = sum((california_df['HouseAge'] >= 20) | (california_df['HouseValue'] < .8))

# Calculate the percentage of houses that meet at least one of these conditions.
num_houses_interest_other/total_num*100

73.851744186046503
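
Note that the code above uses `|` (or), not `&` (and). By De Morgan's law, the opposite of "young and valuable" is "old or cheap", which is why this result matches the one-line answer. The sketch below illustrates this on made-up data (`toy_df` is hypothetical) and shows that using `&` instead would count a different, smaller group:

```python
import pandas as pd

# Hypothetical toy data standing in for california_df
toy_df = pd.DataFrame({
    'HouseAge':   [5.0, 12.0, 41.0, 8.0, 52.0],
    'HouseValue': [0.6, 1.5, 2.3, 3.1, 0.7],
})

# Original condition: younger than 20 years AND valued at 0.8 or more
both = (toy_df['HouseAge'] < 20) & (toy_df['HouseValue'] >= 0.8)

# Its complement, written two ways
negated = ~both                                                        # NOT (A AND B)
either = (toy_df['HouseAge'] >= 20) | (toy_df['HouseValue'] < 0.8)     # (NOT A) OR (NOT B)

# Swapping | for & selects only houses that fail BOTH conditions
neither = (toy_df['HouseAge'] >= 20) & (toy_df['HouseValue'] < 0.8)

pct_both = both.mean() * 100
pct_either = either.mean() * 100
pct_neither = neither.mean() * 100

print((negated == either).all(), pct_both + pct_either, pct_neither)
```

Only `pct_both + pct_either` is guaranteed to come to 100; `pct_neither` leaves out the houses that fail just one of the two conditions.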

Good work on learning how to calculate percentages!