# Exercise 2: Visualise Ames housing dataset with Altair
In this exercise, we will learn how to get a better understanding of a dataset and the relationship between variables using data visualisation techniques.
The dataset used for this exercise is the Ames Housing dataset compiled by Dean De Cock: http://www.amstat.org/publications/jse/v19n3/decock.pdf

1. Open a new Colab notebook and import the pandas and altair packages:

In [0]:
import pandas as pd
import altair as alt

2. Assign the link to the Ames dataset to a variable called 'file_url':

In [0]:
file_url = 'https://raw.githubusercontent.com/TrainingByPackt/The-Data-Science-Workshop/master/Chapter10/dataset/ames_iowa_housing.csv'

3. Using the read_csv function from the pandas package, load the dataset into a new variable called 'df':

In [0]:
df = pd.read_csv(file_url)

4. Plot a histogram of the variable 'SalePrice' using the mark_bar() and encode() methods from the altair package. Use alt.X and alt.Bin to set a bin step size of 50,000:

In [0]:
alt.Chart(df).mark_bar().encode(
    alt.X("SalePrice:Q", bin=alt.Bin(step=50000)), 
    y='count()'
)

This chart shows that most properties have a sale price between about 100k and 150k. There are also a few outliers with a sale price over 500k.
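To check this visual impression numerically, we can tabulate the same bins with pandas. The snippet below is only a minimal sketch: the helper name bin_counts and the toy_prices values are made up for illustration (they are not taken from the Ames dataset), and it assumes that bins start at multiples of the step, which may not match the bin edges Altair draws in every case.

```python
import pandas as pd


def bin_counts(values: pd.Series, step: int) -> pd.Series:
    """Count observations per bin of width `step`, keyed by each bin's lower edge."""
    lower_edges = (values // step) * step
    return lower_edges.value_counts().sort_index()


# Made-up prices for illustration only, NOT taken from the Ames dataset.
toy_prices = pd.Series([95_000, 120_000, 135_000, 149_000, 210_000, 520_000])
toy_counts = bin_counts(toy_prices, 50_000)
print(toy_counts)
```

In the notebook, bin_counts(df['SalePrice'], 50000) should give one count per bar of the histogram, under the same assumption about bin edges.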

5. Now let's plot the histogram for 'LotArea' but this time with a bin step size of 10000:

In [0]:
alt.Chart(df).mark_bar().encode(
    alt.X("LotArea:Q", bin=alt.Bin(step=10000)), 
    y='count()'
)

'LotArea' has a totally different distribution compared to 'SalePrice'. Most of the observations are between 0 and 20,000. The remaining observations represent only a small proportion of the dataset. We can also notice some extreme outliers over 150,000.
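A statement such as "most of the observations are under 20,000" can be quantified by computing the share of rows below a threshold. The helper share_below and the toy_areas values below are hypothetical and only illustrate the idea; they are not part of the original exercise or of the Ames dataset.

```python
import pandas as pd


def share_below(values: pd.Series, threshold: float) -> float:
    """Return the fraction of observations strictly below `threshold`."""
    return float((values < threshold).mean())


# Made-up lot areas for illustration only, NOT taken from the Ames dataset.
toy_areas = pd.Series([7_000, 8_500, 9_600, 11_000, 14_000, 21_000, 45_000, 160_000])
toy_share = share_below(toy_areas, 20_000)  # 5 of the 8 toy values are below 20,000
print(toy_share)
```

In the notebook, share_below(df['LotArea'], 20000) gives the proportion for the real data, and df['LotArea'].quantile(0.99) is another way to see where the long tail starts.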

6. Plot a scatter plot with 'LotArea' as the x-axis and 'SalePrice' as y-axis to understand the interactions between these 2 variables:

In [0]:
alt.Chart(df).mark_circle().encode(
    x='LotArea:Q',
    y='SalePrice:Q'
)

There appears to be a positive correlation between the size of the property and the sale price. If we look only at the properties with 'LotArea' under 50,000, we can see a roughly linear relationship: if we draw a straight line from coordinates (0, 0) to (20000, 800000), its slope is 800,000 / 20,000 = 40, so 'SalePrice' increases by 40,000 for each additional 1,000 of 'LotArea'. The formula of this straight line is: SalePrice = 40000 * LotArea / 1000, that is, SalePrice = 40 * LotArea. Note that this line is drawn by eye: it approximates a regression line but is not a fitted one. We can also see that some large properties do not follow this pattern. For instance, the property with a size of 160,000 was sold for less than 300,000.
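The eyeballed line can be compared with a least-squares fit. The sketch below uses numpy.polyfit on the rows with 'LotArea' under 50,000; the helper name fit_line and the toy data are assumptions made for illustration (the toy points are constructed to lie exactly on the eyeballed line, so they say nothing about the real dataset).

```python
import numpy as np
import pandas as pd


def fit_line(data: pd.DataFrame, max_area: float = 50_000) -> tuple:
    """Least-squares fit of SalePrice = slope * LotArea + intercept,
    using only the rows with LotArea below `max_area`."""
    subset = data[data["LotArea"] < max_area]
    slope, intercept = np.polyfit(subset["LotArea"], subset["SalePrice"], deg=1)
    return float(slope), float(intercept)


# Slope of the eyeballed line through (0, 0) and (20000, 800000).
eyeballed_slope = (800_000 - 0) / (20_000 - 0)

# Made-up points lying exactly on that line, plus one very large lot
# that the LotArea filter drops. NOT taken from the Ames dataset.
toy = pd.DataFrame({
    "LotArea": [5_000, 10_000, 15_000, 20_000, 160_000],
    "SalePrice": [200_000, 400_000, 600_000, 800_000, 280_000],
})
toy_slope, toy_intercept = fit_line(toy)
print(eyeballed_slope, toy_slope, toy_intercept)
```

In the notebook, fit_line(df) returns the fitted slope and intercept for the real data, which may differ noticeably from the eyeballed value of 40.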

7. Now let's plot the histogram for 'OverallQual' but this time with the default bin step size (bin=True):

In [0]:
alt.Chart(df).mark_bar().encode(
    alt.X("OverallQual", bin=True), 
    y='count()'
)

We can see that the values contained in this column are discrete: they can only take a finite number of values (any integer between 1 and 10). This variable is not numerical but ordinal: the order matters, but some mathematical operations, such as adding the value 2 to the value 8, are not meaningful. This column is an arbitrary mapping used to assess the overall quality of the property. We will see in the next chapter how to change the type of such a column.
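To make the difference between ordinal and numerical concrete, here is one possible way to represent such a column in pandas, using an ordered categorical type. This is a sketch under our own assumptions and not necessarily the approach used in the next chapter; the names quality_scale and toy_quality and the values are made up.

```python
import pandas as pd

# An ordered scale from 1 (worst) to 10 (best).
quality_scale = pd.CategoricalDtype(categories=list(range(1, 11)), ordered=True)

# Made-up ratings for illustration only, NOT taken from the Ames dataset.
toy_quality = pd.Series([5, 8, 2, 10, 5]).astype(quality_scale)

# Order-based operations still work on an ordered categorical.
best = toy_quality.max()
above_five = int((toy_quality > 5).sum())

# Arithmetic is rejected, which matches the idea that "8 + 2" has no meaning here.
try:
    toy_quality + 2
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False

print(best, above_five, arithmetic_allowed)
```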

8. Build a boxplot with 'OverallQual:O' (':O' specifies that this column is ordinal) on the x-axis and 'SalePrice' on the y-axis using the mark_boxplot() method:

In [0]:
alt.Chart(df).mark_boxplot().encode(
    x='OverallQual:O',
    y='SalePrice:Q'
)

We can see that 'SalePrice' is higher for higher 'OverallQual' values. This confirms that the variable is ordered by ascending value and that a higher value means better overall quality for the property. We can also notice that the price range is wider for properties with the highest overall quality value (10): it goes from 100,000 to 600,000, while properties with medium quality (5) range from 90,000 to 200,000. This suggests that other factors also affect the sale price.
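The ranges read off the boxplot can be cross-checked with a grouped summary in pandas. The helper price_summary below is a hypothetical name, and the toy values are made up to mirror the figures quoted above rather than taken from the dataset. Keep in mind that, by default, Altair's boxplot whiskers extend to 1.5 times the interquartile range, so the minimum and maximum computed here include the points drawn as outliers.

```python
import pandas as pd


def price_summary(data: pd.DataFrame, by: str) -> pd.DataFrame:
    """Minimum, median, maximum and range of SalePrice for each value of `by`."""
    summary = data.groupby(by)["SalePrice"].agg(["min", "median", "max"])
    summary["range"] = summary["max"] - summary["min"]
    return summary


# Made-up rows for illustration only, NOT taken from the Ames dataset.
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 10, 10, 10],
    "SalePrice": [90_000, 130_000, 200_000, 100_000, 350_000, 600_000],
})
toy_summary = price_summary(toy, "OverallQual")
print(toy_summary)
```

In the notebook, price_summary(df, 'OverallQual') gives the same table for the real data, and the helper accepts any other grouping column.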

9. Now let's plot a bar chart with 'YearBuilt' on the x-axis and 'count()' on the y-axis. Don't forget to specify that 'YearBuilt' is an ordinal variable and not a numerical one using ':O':

In [0]:
alt.Chart(df).mark_bar().encode(
    alt.X('YearBuilt:O'),
    y='count()'
)

We can see that fewer of the sold properties were built before 1920, and more were built recently (after 2000).

10. Plot a boxplot similar to step 8, but with 'YearBuilt' on the x-axis:

In [0]:
alt.Chart(df).mark_boxplot().encode(
    x='YearBuilt:O',
    y='SalePrice:Q'
)

Overall, the sale price is higher for more recently built properties, although very old properties (built before 1935) do not follow this trend.

11. Let's look at how the sold properties are distributed across 'Neighborhood' by plotting a bar chart similar to step 9:

In [0]:
alt.Chart(df).mark_bar().encode(
    x='Neighborhood',
    y='count()'
)

The number of sold properties differs depending on their location. The neighborhood 'NAmes' has the highest number of properties sold: over 220. On the other hand, neighborhoods such as 'Blueste' or 'NPkVill' have just a few properties sold.

12. Let's analyse the relationship between 'SalePrice' and 'Neighborhood' by plotting a boxplot similar to step 10:

In [0]:
alt.Chart(df).mark_boxplot().encode(
    x='Neighborhood:N',  # ':N' marks a nominal (unordered) category
    y='SalePrice:Q'
)

The location of the property sold has a significant impact on the sale price. The neighborhoods 'NoRidge', 'NridgHt' and 'StoneBr' have higher prices overall. It is also worth noticing that there are some extreme outliers for 'NoRidge', where some properties have been sold at a much higher price than other properties in this neighborhood.

Congratulations on completing this exercise. We saw that data visualisation gives us valuable insights about a dataset. For instance, using a scatter plot, we identified a roughly linear relationship between 'SalePrice' and 'LotArea', where the price tends to increase as the size of the property gets bigger. Histograms helped us understand the distribution of the numerical variables, and bar charts gave us a similar view for categorical variables. For example, we saw that more properties were sold in some neighborhoods than in others. Finally, we were able to analyse and compare the impact of different values of a variable on 'SalePrice' using boxplots. We saw that the better the overall quality of a property, the higher its sale price tends to be. Data visualisation is a very important tool for data scientists to explore and analyse datasets.