# Springboard Data Science Career Track Unit 4 Challenge - Tier 3 Complete

## Objectives
Hey! Great job getting through those challenging DataCamp courses. You're learning a lot in a short span of time. 

In this notebook, you're going to apply the skills you've been learning, bridging the gap between the controlled environment of DataCamp and the *slightly* messier work that data scientists do with actual datasets!

Here’s the mystery we’re going to solve: ***which boroughs of London have seen the greatest increase in housing prices, on average, over the last two decades?***


A borough is just a fancy word for district. You may be familiar with the five boroughs of New York… well, there are 32 boroughs within Greater London [(here's some info for the curious)](https://en.wikipedia.org/wiki/London_boroughs). Some of them are more desirable areas to live in, and the data will reflect that with a greater rise in housing prices.

***This is the Tier 3 notebook, which means it's not filled in at all: we'll just give you the skeleton of a project, the brief and the data. It's up to you to play around with it and see what you can find out! Good luck! If you struggle, feel free to look at easier tiers for help; but try to dip in and out of them, as the more independent work you do, the better it is for your learning!***

This challenge will make use of only what you learned in the following DataCamp courses: 
- Prework courses (Introduction to Python for Data Science, Intermediate Python for Data Science)
- Data Types for Data Science
- Python Data Science Toolbox (Part One) 
- pandas Foundations
- Manipulating DataFrames with pandas
- Merging DataFrames with pandas

Of the tools, techniques and concepts in the above DataCamp courses, this challenge should require the application of the following: 
- **pandas**
    - **data ingestion and inspection** (pandas Foundations, Module One) 
    - **exploratory data analysis** (pandas Foundations, Module Two)
    - **tidying and cleaning** (Manipulating DataFrames with pandas, Module Three) 
    - **transforming DataFrames** (Manipulating DataFrames with pandas, Module One)
    - **subsetting DataFrames with lists** (Manipulating DataFrames with pandas, Module One) 
    - **filtering DataFrames** (Manipulating DataFrames with pandas, Module One) 
    - **grouping data** (Manipulating DataFrames with pandas, Module Four) 
    - **melting data** (Manipulating DataFrames with pandas, Module Three) 
    - **advanced indexing** (Manipulating DataFrames with pandas, Module Four) 
- **matplotlib** (Intermediate Python for Data Science, Module One)
- **fundamental data types** (Data Types for Data Science, Module One) 
- **dictionaries** (Intermediate Python for Data Science, Module Two)
- **handling dates and times** (Data Types for Data Science, Module Four)
- **function definition** (Python Data Science Toolbox - Part One, Module One)
- **default arguments, variable length, and scope** (Python Data Science Toolbox - Part One, Module Two) 
- **lambda functions and error handling** (Python Data Science Toolbox - Part One, Module Four) 

## The Data Science Pipeline

This is Tier Three, so we'll get you started. But after that, it's all in your hands! When you feel done with your investigations, look back over what you've accomplished, and prepare a quick presentation of your findings for the next mentor meeting. 

Data Science is magical. In this case study, you'll get to apply some complex machine learning algorithms. But as  [David Spiegelhalter](https://www.youtube.com/watch?v=oUs1uvsz0Ok) reminds us, there is no substitute for simply **taking a really, really good look at the data.** Sometimes, this is all we need to answer our question.

Data Science projects generally adhere to the four stages of the Data Science Pipeline:
1. Sourcing and loading 
2. Cleaning, transforming, and visualizing 
3. Modeling 
4. Evaluating and concluding 


### 1. Sourcing and Loading 

Any Data Science project kicks off by importing ***pandas***. The documentation of this wonderful library can be found [here](https://pandas.pydata.org/). As you've seen, pandas is conveniently connected to the [NumPy](http://www.numpy.org/) and [Matplotlib](https://matplotlib.org/) libraries. 

***Hint:*** This part of the data science pipeline will test those skills you acquired in the pandas Foundations course, Module One. 

#### 1.1. Importing Libraries

In [1]:
# Import the pandas and numpy libraries as pd and np, respectively.
import pandas as pd
import numpy as np
# dataframe_image (third-party) adds the .dfi accessor used later to export tables as images.
import dataframe_image as dfi

# Load the pyplot collection of functions from matplotlib, as plt 
from matplotlib import pyplot as plt

#### 1.2.  Loading the data
Your data comes from the [London Datastore](https://data.london.gov.uk/): a free, open-source data-sharing portal for London-oriented datasets. 

In [None]:
# First, make a variable called url_LondonHousePrices, and assign it the following link, enclosed in quotation-marks as a string:
# https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK%20House%20price%20index.xls

url_LondonHousePrices = 'https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK%20House%20price%20index.xls'

# The dataset we're interested in contains the Average prices of the houses, and is actually on a particular sheet of the Excel file. 
# As a result, we need to specify the sheet name in the read_excel() method.
# Put this data into a variable called properties.  
properties = pd.read_excel(url_LondonHousePrices, sheet_name='Average price', index_col= None)

### 2. Cleaning, transforming, and visualizing
This second stage is arguably the most important part of any Data Science project. The first thing to do is take a proper look at the data. Cleaning forms the majority of this stage, and can be done either before or after transformation.

The end goal of data cleaning is to have tidy data. When data is tidy: 

1. Each variable has a column.
2. Each observation forms a row.

Keep the end goal in mind as you move through this process; every step will take you closer. 
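
As a small illustration of what "tidy" means here, the sketch below reshapes a hypothetical wide table (one column per month) into one row per observation. The borough names are real, but the prices and the column names `borough`, `month`, and `price` are invented for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical wide table: one column per month (values are made up).
wide = pd.DataFrame({
    "borough": ["Camden", "Hackney"],
    "1995-01": [120000.0, 61000.0],
    "1995-02": [119000.0, 60000.0],
})

# melt() keeps "borough" as an identifier and turns every other column
# into a (month, price) pair, giving one observation per row.
tidy = wide.melt(id_vars="borough", var_name="month", value_name="price")
print(tidy)
```

After melting, each variable (borough, month, price) has its own column, which is the shape the later grouping and plotting steps rely on.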



***Hint:*** This part of the data science pipeline should test those skills you acquired in: 
- Intermediate Python for data science, all modules.
- pandas Foundations, all modules. 
- Manipulating DataFrames with pandas, all modules.
- Data Types for Data Science, Module Four.
- Python Data Science Toolbox - Part One, all modules

**2.1. Exploring your data** 

Think about your pandas functions for checking out a dataframe. 

In [None]:
print('Dataframe head:\n', properties.head(), '\n')
print('Dataframe shape:\n', properties.shape, '\n')
print('Dataframe indices:\n', properties.index, '\n')
print('Dataframe columns:\n', properties.columns, '\n')

**2.2. Cleaning the data**

You might find you need to transpose your dataframe, check what its row indexes are, and reset the index. You might also need to assign the values of the first row to your column headings. (Hint: recall the `.columns` attribute of DataFrames, as well as the `iloc[]` indexer.)

Don't be afraid to use StackOverflow for help with this.
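
The transpose-and-promote steps can be tried on a miniature frame first. The sketch below assumes the raw sheet looks roughly like this: dates down the first column (named `Unnamed: 0`), one column per borough, and an ID code in the first row. The ID codes and prices are illustrative placeholders only.

```python
import pandas as pd

# Hypothetical miniature of the raw sheet (values are illustrative only).
raw = pd.DataFrame({
    "Unnamed: 0": [pd.NaT, pd.Timestamp("1995-01-01"), pd.Timestamp("1995-02-01")],
    "Camden": ["E09000007", 120932.9, 119508.9],
    "Hackney": ["E09000012", 61296.5, 63187.1],
})

# Transpose so that each borough becomes a row.
t = raw.transpose()

# Promote the first row (the dates) to column headers, drop that row,
# and move the borough names out of the index into a regular column.
t.columns = t.iloc[0]
cleaned = t.drop("Unnamed: 0").reset_index()
print(cleaned)
```

The column holding the ID codes ends up labelled `NaT`, because the date cell above the codes was empty; this is why a rename step is needed afterwards.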

In [None]:
#Transposing dataframe and then printing it.
properties_t = properties.transpose()
print('Transpose of dataframe:\n', properties_t.head(), '\n')

#Setting the first row as column headers and then dropping it.
properties_t.columns = properties_t.iloc[0]
properties_cleaned = properties_t.drop('Unnamed: 0').reset_index()

print('Dataframe after cleaning:')
properties_cleaned.head()


**2.3. Cleaning the data (part 2)**

You might have to **rename** a couple of columns. How do you do this? The clue's pretty bold...

In [None]:
#Renaming the columns containing London Boroughs and IDs
properties_renamed = properties_cleaned.rename(columns={'index':'london_borough', pd.NaT:'borough_id'})

#Checking indices and columns of new dataframe.
print(properties_renamed.index)
print(properties_renamed.columns)
print('\n')

#Printing renamed dataframe.
print('Dataframe after renaming:')
properties_renamed.head() 

**2.4. Transforming the data**

Remember what Wes McKinney said about tidy data? 

You might need to **melt** your DataFrame here. 

In [None]:
#Melting the dataframe, then printing it.
properties_melted = properties_renamed.melt(id_vars=['london_borough', 'borough_id'], var_name='month', \
                                          value_name='ave_housing_price')
print('Dataframe after melting: ')
properties_melted

Remember to make sure your column data types are all correct. Average prices, for example, should be floating point numbers... 
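
One hedged alternative to applying `float()` element by element is `pd.to_numeric`, which converts a whole column at once and can turn unparseable entries into `NaN` instead of raising. The sample values below are invented for illustration.

```python
import pandas as pd

# Hypothetical object-dtype column mixing strings, floats, and a bad entry.
prices = pd.Series(["120932.9", 61296.5, "n/a"], dtype="object")

# errors="coerce" replaces anything that cannot be parsed with NaN.
as_float = pd.to_numeric(prices, errors="coerce")

print(as_float.dtype)
print(as_float)
```

Whether coercing bad entries to `NaN` is appropriate depends on the data; for this dataset the column may already convert cleanly, in which case `apply(float)` or `astype(float)` is enough.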

In [None]:
#Checking the type of all column entries.
for column in properties_melted.columns:
    print('Type of ', column, ': ', properties_melted[column].dtype)
print('\n')

#Converting the entries of the ave_housing_price column to float.
#.copy() gives an independent dataframe, so properties_melted is left unchanged.
properties_melted_v2 = properties_melted.copy()
properties_melted_v2['ave_housing_price'] = properties_melted_v2['ave_housing_price'].apply(float)

#Checking type of ave_housing_price column after applying float().
print('ave_housing_price after applying float():')
properties_melted_v2['ave_housing_price'].dtype

**2.5. Cleaning the data (part 3)**

Do we have an equal number of observations in the ID, Average Price, Month, and London Borough columns? Remember that there are only 32 London Boroughs. How many entries do you have in that column? 

Check out the contents of the London Borough column, and if you find null values, get rid of them however you see fit. 
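
As a minimal sketch of one way to do this, `dropna` with the `subset` argument removes only the rows whose ID is missing. The frame below is a made-up stand-in; the row labelled `Unnamed: 34` is a hypothetical example of an empty spacer column that became a row after transposing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the melted dataframe.
df = pd.DataFrame({
    "london_borough": ["Camden", "Unnamed: 34", "Hackney"],
    "borough_id": ["E09000007", np.nan, "E09000012"],
    "ave_housing_price": [120932.9, np.nan, 61296.5],
})

# Count missing values per column before cleaning.
print(df.isna().sum())

# Keep only rows that have a borough_id, then renumber the index.
no_nan = df.dropna(subset=["borough_id"]).reset_index(drop=True)
print(no_nan)
```

Restricting `subset` to the ID column avoids dropping rows that are missing something less essential.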

In [None]:
#For every column in the dataframe, check how many NaN values there are.
print('Number of NaN values in each column:\n', properties_melted_v2[:].isna().sum(), '\n')

#Make a boolean filter to get rid of rows with NaN values in borough_id.
#In this filter, True means the row will be kept.
NaN_filter = properties_melted_v2['borough_id'].notna()

#Applying filter
properties_no_nan = properties_melted_v2.loc[NaN_filter]

#Checking how many rows were eliminated by subtracting the number of 
#rows in the new dataframe from the number of rows in the old dataframe
print('Number of rows eliminated: ', properties_melted_v2.shape[0] - properties_no_nan.shape[0], '\n')

#Another check to see if there are any NaN values remaining in the new dataframe.
print('Number of NaN values in each column after filtering:\n', \
      properties_no_nan[:].isna().sum(), '\n')

#Checking how many unique values are in the 'london_borough' column. There should be 32. However...
print('Unique values in the "boroughs" column: ', properties_no_nan['london_borough'].nunique(), '\n')

#One way to distinguish whether a string in the london_borough column is or is not a borough
#is to look at it manually. Luckily, it seems our source data file already had the boroughs
#sorted at the top, as can be seen from the following list.
print('List of unique strings in the london_borough column:')
for unique_borough in properties_no_nan['london_borough'].unique():
    print(unique_borough)
print('Note how only the first 32 are actually boroughs.\n')

#Making an array of strings containing only the boroughs by slicing the previous list.
#Then, checking that it has 32 items
boroughs_array = properties_no_nan['london_borough'].unique()[:32]
print('Number of boroughs in boroughs_array: ', len(boroughs_array))

#Finally, we use the array of boroughs to filter our previous dataframe
properties_boroughs_only \
= properties_no_nan.loc[properties_no_nan['london_borough'].isin(boroughs_array)].reset_index(drop=True)

#As a final check, counting the total number of unique strings in the london_borough column of our updated
#dataframe
print('Unique strings in the london_borough column of the updated dataframe: ', \
      properties_boroughs_only['london_borough'].nunique())

#Print our most recent dataframe.
print('Dataframe with only London Boroughs included:\n')
properties_boroughs_only

**2.6. Visualizing the data**

To visualize the data, why not subset on a particular London Borough? Maybe do a line plot of Month against Average Price?

In [None]:
#The previously defined boroughs_array will be used to make ave_housing_price vs month plots for each
#London Borough

#Defining function that uses the first n entries in an array 'array' of borough names to plot 
#average housing vs time from a dataframe df
def plot_n_boroughs(df, array, n):
    for i in range(n):
        filtered_df = df.loc[df['london_borough'] == array[i]]
        plt.plot(filtered_df['month'], filtered_df['ave_housing_price'], label=array[i])
    plt.xlabel('Time')
    plt.ylabel('Average Housing Price')
    plt.title('Average Housing Price vs. Time for various London Boroughs')
    plt.legend(title='Group')
    
#To get a sense of what the data looks like, 5 arbitrary boroughs will be plotted.
plot_n_boroughs(properties_boroughs_only, boroughs_array, 5)
plt.show()



To limit the number of data points you have, you might want to extract the year from every value in your *Month* column. 

To this end, you *could* apply a ***lambda function***. Your logic could work as follows:
1. look through the `Month` column
2. extract the year from each individual value in that column 
3. store that corresponding year as a separate column. 

Whether you go ahead with this is up to you. Just so long as you answer our initial brief: which boroughs of London have seen the greatest house price increase, on average, over the past two decades? 
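
The three steps above can be sketched on a toy column as follows. The lambda version mirrors the logic described; the `.dt.year` accessor is shown as an equivalent vectorized alternative, assuming the column already has a datetime dtype. The dates are invented for the example.

```python
import pandas as pd

# Hypothetical month column with a datetime dtype.
df = pd.DataFrame({
    "month": pd.to_datetime(["1998-01-01", "1998-02-01", "2018-01-01"]),
})

# Option 1: a lambda applied to each Timestamp.
df["year_lambda"] = df["month"].apply(lambda t: t.year)

# Option 2: the vectorized .dt accessor (requires a datetime column).
df["year_dt"] = df["month"].dt.year

print(df)
```

Both columns hold the same values; the accessor is usually the shorter choice when the dtype is already datetime.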

In [None]:
#Make an independent copy of the most recent dataframe and add a year column
properties_final = properties_boroughs_only.copy()
properties_final['year'] = properties_final['month'].apply(lambda x: x.year)

print(properties_final['ave_housing_price'])

#What follows are some consistency checks.

#Checking how many times each year appears. Since there are 12 months for each of the 32 boroughs, these
#numbers should be 384 for each year, with the exception of the last year, which only has 3 months.
print('Number of times each year appears:\n', properties_final['year'].value_counts(), '\n')

#Checking total month entries for each London Borough. Since the years go from 1995 till 2022 and 2022
#only has 3 months, this number should be 327. Also print out the total number of London Boroughs.

print('Number of month entries per London Borough:')
count = 0
for group_tuple in list(properties_final.groupby('london_borough')):
    count += 1
    print(group_tuple[0], ': ', len(group_tuple[1]['month']))
print('Number of Boroughs: ', count, '\n')

#Change float format of ave_housing_price column:
#properties_final['ave_housing_price'] = properties_final['ave_housing_price'].apply(lambda x: '%.2f' % x)
print(properties_final['ave_housing_price'])

#Print out the final dataframe.
print('Final dataframe:')
properties_final

### 3. Modeling

Consider creating a function that will calculate a ratio of house prices, comparing the price of a house in 2018 to the price in 1998.

Consider calling this function create_price_ratio.

You'd want this function to:
1. Take a filter of dfg, specifically where this filter constrains the London_Borough, as an argument. For example, one admissible argument should be: dfg[dfg['London_Borough']=='Camden'].
2. Get the Average Price for that Borough, for the years 1998 and 2018.
3. Calculate the ratio of the Average Price for 1998 divided by the Average Price for 2018.
4. Return that ratio.

Once you've written this function, you ultimately want to use it to iterate through all the unique London_Boroughs and work out the ratio capturing the difference of house prices between 1998 and 2018.

Bear in mind: you don't have to write a function like this if you don't want to. If you can solve the brief otherwise, then great! 
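
As one possible alternative to a per-borough function, the same ratio can be computed for all boroughs at once with `groupby` and `unstack`. This is only a sketch on invented data: the boroughs `A` and `B` and their prices are hypothetical, and the column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical monthly prices for two made-up boroughs in 1998 and 2018.
df = pd.DataFrame({
    "london_borough": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [1998, 1998, 2018, 2018, 1998, 1998, 2018, 2018],
    "ave_housing_price": [100.0, 110.0, 400.0, 440.0, 200.0, 220.0, 400.0, 440.0],
})

# Mean price per borough and year, with one column per year.
yearly = (
    df.groupby(["london_borough", "year"])["ave_housing_price"]
    .mean()
    .unstack("year")
)

# Ratio of the 1998 average to the 2018 average; a SMALLER ratio means
# a LARGER increase, so ascending order puts the biggest riser first.
ratio = (yearly[1998] / yearly[2018]).sort_values()
print(ratio)
```

Because the ratio is defined as 1998 divided by 2018, the borough with the greatest increase is the one with the lowest value, not the highest.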

***Hint***: This section should test the skills you acquired in:
- Python Data Science Toolbox - Part One, all modules

In [None]:
#Function definition
def create_price_ratio(borough_arg):
    filtered_table = properties_final[properties_final['london_borough'] == borough_arg]
    p_table = filtered_table.pivot_table(index='year', values='ave_housing_price') #Calculates average by default
    return p_table.at[1998, 'ave_housing_price'] / p_table.at[2018, 'ave_housing_price']

#Create dictionary containing each London Borough and its associated price ratio as a key-value pair.
ratio_dict = {}
for borough in properties_final['london_borough'].unique():
    ratio = create_price_ratio(borough)
    ratio_dict[borough] = ratio
    
#Sort the dictionary in ascending order of ratio.
ratio_dict_sorted = dict(sorted(ratio_dict.items(), key=lambda x: x[1]))

#Making a pandas dataframe from this dictionary
ratios_df = pd.DataFrame(list(ratio_dict_sorted.items()))
ratios_df_split = pd.concat([ratios_df.iloc[:16], ratios_df.iloc[16:].reset_index(drop=True)], axis=1)
ratios_df_split.rename(columns={0:'London Borough', 1:'Ratio'}, inplace=True)
blankIndex = [''] * len(ratios_df_split)
ratios_df_split.index = blankIndex

#Printing this dataframe.
print('Price ratio of each London borough:\n', ratios_df_split)

#Exporting this dataframe.
ratios_df_split.dfi.export('ratios_table.png')


#Extract the key of the first item on the dictionary:
answer = list(ratio_dict_sorted.keys())[0]

#Print out the final answer.
print('The London Borough that has shown the greatest increase in average house pricing \
from 1998 to 2018 is:\n', answer, '\n')

In [None]:
#These final lines of code make Average Housing Price vs. Time plots of 5 arbitrary boroughs plus Hackney. 
#A quick glance tells us that, while Hackney had the largest price increase between 1998 and 2018, 
#it is not the borough with the most expensive housing.
print('The purpose of this plot is to show that, while Hackney had the largest increase, it does not have \
the most\n expensive housing.')

df_hackney = properties_final[properties_final['london_borough'] == 'Hackney'] 
plt.plot(df_hackney['month'], df_hackney['ave_housing_price'], label='Hackney')
plot_n_boroughs(properties_final, boroughs_array, 5)
plt.show()
plt.clf()

#Bar plot of the 5 London boroughs that showed the greatest increase in average housing price.
ratios_df.iloc[:5].plot(y=1, kind='bar', rot=45, \
                        legend=False, title='Ratio of average housing price of 1998 to 2018', \
                       ylabel='ratio', x=0, xlabel='')
plt.savefig('ratios_bar.png', bbox_inches='tight')
plt.show()
plt.clf()

#Average housing price vs time plot of these 5 boroughs.
array_sorted_by_ratio = np.array(ratios_df[0])
plot_n_boroughs(properties_final, array_sorted_by_ratio, 5)
plt.savefig('line_plot_top_ratios', bbox_inches='tight')
plt.show()
plt.clf()

In [None]:
#Creating dataframe containing the max housing price that each borough ever had.
max_price_df = properties_final.pivot_table(values='ave_housing_price', index='london_borough', aggfunc='max')
#print(max_price_df)
#print(max_price_df.columns)
max_price_df.rename(columns={'ave_housing_price':'max_housing_price'}, inplace=True)
max_price_df.sort_values('max_housing_price', ascending=False, inplace=True)
max_price_df.reset_index(inplace=True)
max_price_df['max_housing_price'] = max_price_df['max_housing_price'].apply(lambda x: '%.2f' % x)
print(max_price_df)
#print(max_price_df.columns)
#print(max_price_df.iloc[:16], '\n\n', max_price_df.iloc[16:].reset_index(drop=True))
max_price_df_split = pd.concat([max_price_df.iloc[:16], max_price_df.iloc[16:].reset_index(drop=True)], axis=1)
#print(max_price_df_split)

#max_price_df_split.rename(columns={0:'London Borough', 1:'Max Housing Price'}, inplace=True)
blankIndex = [''] * len(max_price_df_split)
max_price_df_split.index = blankIndex
max_price_df_split.dfi.export('max_price.png')
print(max_price_df_split)

#Creating numpy array containing the boroughs sorted by max housing price.
array_sorted_by_max_price = np.array(max_price_df.iloc[:, 0])
plot_n_boroughs(properties_final, array_sorted_by_max_price, 5)
plt.savefig('line_plot_top_max_price', bbox_inches='tight')
plt.show()
plt.clf()

### 4. Conclusion
What can you conclude? Type out your conclusion below. 

Look back at your notebook. Think about how you might summarize what you have done, and prepare a quick presentation on it to your mentor at your next meeting. 

We hope you enjoyed this practical project. It should have consolidated your data hygiene and pandas skills by looking at a real-world problem involving just the kind of dataset you might encounter as a budding data scientist. Congratulations, and looking forward to seeing you at the next step in the course!

Conclusions:
The preceding analysis shows that, from 1998 to 2018, the London borough with the greatest average housing price increase was Hackney. A quick glance at the 1998-to-2018 average housing price ratios for all boroughs shows that many other boroughs have similar ratios, so Hackney is by no means an outlier.

Another interesting observation is that every borough shows a sharp decrease in housing prices at about the year 2009. Based on the timing, my current hypothesis is that it may be related to the United States housing bubble that occurred at around that time, but more analysis is needed to draw definite conclusions.

Yet another interesting observation is that certain boroughs show more variability in their housing prices than others, City of London being a noteworthy example. Perhaps this could also be used to somehow quantify the "popularity" of living in each borough.

Finally, it should be noted that, while Hackney showed the greatest increase in housing prices, it is by no means the borough with the most expensive housing.