# Setup

In [None]:
# Download utils.py to working directory
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/ML-Challenge/week2-data-analysis/master/utils.py', 'utils.py')

In [None]:
# Import utils
# We'll be using this module throughout the lesson
import utils

# Intro to pandas DataFrames

In this lesson, we will introduce pandas DataFrames. We will use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. We will also practice building DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas.

### Inspecting Data

We can use the DataFrame methods `.head()` and `.tail()` to view the first few and last few rows of a DataFrame. We have imported pandas as `pd` and loaded population data from 1960 to 2018 as a DataFrame `urban_population`. This dataset was obtained from the [World Bank](https://databank.worldbank.org/reports.aspx?source=2&type=metadata&series=SP.URB.TOTL.IN.ZS#).

Let's use `urban_population.head()` and `urban_population.tail()` to verify that the first and last rows match a file on disk. In later exercises, we will see how to extract values from DataFrames with indexing.

In [None]:
# First 5 rows
utils.urban_population.head()

In [None]:
# Last 5 rows
utils.urban_population.tail()

### DataFrame data types

Pandas is aware of the data types in the columns of our DataFrame. It is also aware of null and `NaN` ('Not-a-Number') types which often indicate missing data. 

We can use `urban_population.info()` to determine information about the total count of non-null entries and infer the total count of null entries, which likely indicates missing data.

In [None]:
utils.urban_population.info()
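
The non-null counts that `.info()` reports can also be computed directly. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the World Bank data) to show how the null count follows from the row count and the non-null count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the World Bank dataset
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Urban population (%)': [55.2, np.nan, 71.0, np.nan],
})

# Non-null entries per column: the figure that .info() reports
non_null = toy.count()

# Null entries per column: number of rows minus the non-null count
nulls = toy.isna().sum()

print(non_null)
print(nulls)
```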

### NumPy and pandas working together

Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, we can use the DataFrame attribute `.values` to represent a DataFrame `df` as a NumPy array. We can also pass pandas data structures to NumPy functions.

In this example, we have loaded world population data every 10 years since 1960 into the DataFrame `world_population`. This dataset was derived from the one used in the previous exercise.

In [None]:
# Import numpy
import numpy as np

In [None]:
# Create array of DataFrame values: np_vals
np_vals = utils.world_population.values

In [None]:
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

In [None]:
# Create a new DataFrame by passing the DataFrame to np.log10(): df_log10
df_log10 = np.log10(utils.world_population)

In [None]:
# Print the type of each original and new data container
for x in ['np_vals', 'np_vals_log10', 'utils.world_population', 'df_log10']:
    print(x, 'has type', type(eval(x)))
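
To make the difference concrete without the lesson's data, here is a minimal sketch on a made-up DataFrame (`toy` and its numbers are assumptions). It uses `.to_numpy()`, which pandas documents as an alternative to `.values`, and shows that a NumPy function applied to a DataFrame returns a DataFrame with the labels preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for world_population
toy = pd.DataFrame({'Total Population': [1e9, 1e10]}, index=[1960, 1970])

# Convert to a plain NumPy array; the row and column labels are dropped
arr = toy.to_numpy()
log_arr = np.log10(arr)

# Apply the same NumPy function to the DataFrame; labels are kept
log_df = np.log10(toy)

print(type(log_arr))
print(type(log_df))
```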

## Building DataFrames from scratch

### Zip lists to build a DataFrame

In this example, we're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. `list_keys` contains the column names 'Country' and 'Total'. `list_values` contains the full names of each country and the number of gold medals awarded. The values have been taken from [Wikipedia](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table).

In [None]:
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]

We will use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, we'll make use of the `list()`, `zip()`, `dict()` and `pd.DataFrame()` functions. Pandas has been imported as `pd`.

In [None]:
import pandas as pd

Note: The [zip()](https://docs.python.org/3/library/functions.html#zip) function in Python 3 returns a `zip` object, which is a lazy iterator rather than a list. To convert this `zip` object into a list, we'll need to use `list()`.

In [None]:
# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))

# Inspect the list using print()
print(zipped)

In [None]:
# Build a dictionary with the zipped list: data
data = dict(zipped)
print(data)

In [None]:
# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
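
One reason to wrap `zip()` in `list()` is that a `zip` object can only be consumed once. The short sketch below illustrates this with shortened versions of the lists above (the names `keys`, `values`, and `pairs` are chosen here for illustration).

```python
keys = ['Country', 'Total']
values = [['United States', 'Soviet Union'], [1118, 473]]

# A zip object is an iterator: it yields each pair once
pairs = zip(keys, values)

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to yield

print(first_pass)
print(second_pass)
```

Storing the result of `list(zip(...))`, as done above, avoids this surprise when the pairs are needed more than once.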

### Labeling data

We can use the DataFrame attribute `df.columns` to view and assign new string labels to columns in a pandas DataFrame.

In this example, we have defined a DataFrame `artists` containing top Billboard hits from the 1980s (from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_the_1980s#1980)). Each row has the year, artist, song name and the number of weeks at the top. However, this DataFrame has the column labels `a`, `b`, `c`, `d`. Our job is to use the `df.columns` attribute to re-assign descriptive column labels.

In [None]:
utils.artists

In [None]:
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']

In [None]:
# Assign the list of labels to the columns attribute: df.columns
utils.artists.columns = list_labels
utils.artists
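
Assigning to `.columns` replaces every label at once. When only some labels need to change, `DataFrame.rename()` with a mapping is an alternative. The sketch below uses a made-up one-row DataFrame (`toy` and its placeholder values are assumptions).

```python
import pandas as pd

# Hypothetical one-row stand-in for the artists DataFrame
toy = pd.DataFrame([[1980, 'Artist X', 'Song Y', 6]],
                   columns=['a', 'b', 'c', 'd'])

# Rename only the columns listed in the mapping; returns a new DataFrame
renamed = toy.rename(columns={'a': 'year', 'd': 'chart weeks'})

print(list(renamed.columns))
print(list(toy.columns))  # the original is left unchanged
```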

### Building DataFrames with broadcasting

We can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames: a scalar value supplied alongside a list is repeated for every row. In this example, we're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list `cities`.

In [None]:
utils.cities

In [None]:
# Make a string with the value 'PA': state
state = 'PA'

In [None]:
# Construct a dictionary: data
data = {'state': state, 'city': utils.cities}

In [None]:
# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
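
The scalar is repeated only because another entry in the dictionary fixes the number of rows. As a rough sketch with placeholder city names (assumptions, not the lesson's list), the following shows the broadcast and what happens when every value is a scalar.

```python
import pandas as pd

# Hypothetical placeholder names, not the lesson's list of cities
cities = ['City A', 'City B', 'City C']

# The scalar 'PA' is repeated to match the length of the list
toy = pd.DataFrame({'state': 'PA', 'city': cities})
print(toy)

# With only scalars there is no length to broadcast to, so pandas raises
try:
    pd.DataFrame({'state': 'PA'})
except ValueError as err:
    all_scalar_error = str(err)
    print(all_scalar_error)
```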

## Importing & exporting data

### Reading a flat file

In previous examples, the data was preloaded for us using the pandas function `read_csv()`. Now, let's look at how we can use it to read the World Bank population data we saw earlier into a DataFrame. The file is available at the path `data/urban_population.csv`.

In [None]:
# Just read the file
urban_population = pd.read_csv('data/urban_population.csv')

# Print the DataFrame
urban_population

In [None]:
# Show DataFrame info
urban_population.info()

Something seems off. Two of the columns should have a dtype of `float64` and contain some missing values. When exporting the dataset from the World Bank, the default placeholder for missing values is `..`. The `read_csv()` function allows us to specify additional strings to treat as `NaN` through its `na_values` parameter.

In [None]:
# Read the file again, treating '..' as missing data
urban_population = pd.read_csv('data/urban_population.csv', na_values=['..'])

# Show DataFrame info
urban_population.info()

We also have one column too many: we don't need the `Time Code` column. We can drop it after reading the file using the `df.drop()` method, or we can skip it at read time with the `usecols` parameter of `read_csv()`.

In [None]:
cols = ['Country Name', 'Country Code', 'Year', 'Total Population', 'Urban population (% of total)']
urban_population = pd.read_csv('data/urban_population.csv', usecols=cols, na_values=['..'])

urban_population.head()
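
The effect of `na_values` and `usecols` can be reproduced on a tiny in-memory file. This is a sketch with made-up rows (the country names and numbers are assumptions); `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row CSV using '..' as the missing-value placeholder
csv_text = """Country Name,Time Code,Year,Total Population
Country A,YR1960,1960,1000
Country B,YR1960,1960,..
"""

# Read as-is: '..' forces the population column to a non-numeric dtype
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Treat '..' as NaN and skip the 'Time Code' column at read time
clean = pd.read_csv(io.StringIO(csv_text),
                    na_values=['..'],
                    usecols=['Country Name', 'Year', 'Total Population'])
print(clean.dtypes)
```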

### Delimiters, headers, and extensions

Not all data files are clean and tidy. Pandas provides methods for reading those not-so-perfect data files that we encounter far too often.

In this example, we have monthly stock data for four companies downloaded from [Yahoo Finance](https://finance.yahoo.com/?guccounter=1). The data is stored as one row for each company and each column is the end-of-month closing price. The file name is in the variable `utils.file_messy`.

In addition, this file has three aspects that may cause trouble for lesser tools: multiple header lines, comment records (rows) interleaved throughout the data rows, and space delimiters instead of commas.

Let's use pandas to read the data from this problematic `file_messy` using non-default input options with `read_csv()` so as to tidy up the mess at read time. Then, we'll write the cleaned-up data to the CSV file `data/file_clean.csv`, as we might do in a real data workflow.

To learn more about the optional input parameters needed, we'll use `help()` on the pandas function `pd.read_csv()`.

In [None]:
help(pd.read_csv)

In [None]:
# Read the raw file as-is: df1
df1 = pd.read_csv(utils.file_messy)

# Print the output of df1.head()
df1.head()

Contents of `file_messy`

```
The following stock data was collect on 2016-AUG-25 from an unknown source
These kind of ocmments are not very useful, are they?
probably should just throw this line away too, but not the next since those are column labels
name Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# So that line you just read has all the column headers labels
IBM 156.08 160.01 159.81 165.22 172.25 167.15 164.75 152.77 145.36 146.11 137.21 137.96
MSFT 45.51 43.08 42.13 43.47 47.53 45.96 45.61 45.51 43.56 48.70 53.88 55.40
# That MSFT is MicroSoft
GOOGLE 512.42 537.99 559.72 540.50 535.24 532.92 590.09 636.84 617.93 663.59 735.39 755.35
APPLE 110.64 125.43 125.97 127.29 128.76 127.81 125.34 113.39 112.80 113.36 118.16 111.73
# Maybe we should have bought some Apple stock in 2008?
```

In [None]:
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(utils.file_messy, delimiter=' ', header=3, comment='#')

# Print the output of df2.head()
df2.head()

In [None]:
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv('data/file_clean.csv', index=False)
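
The same three options can be tried on a shortened in-memory version of the messy file. The text below is a made-up miniature (an assumption for illustration) with two junk lines instead of three, so the header sits on line index 2 rather than 3.

```python
import io
import pandas as pd

# Hypothetical miniature of the messy file: junk lines, a header, comment rows
raw_text = """collected from an unknown source
another junk line
name Jan Feb
# a comment row between header and data
IBM 156.08 160.01
MSFT 45.51 43.08
# trailing comment
"""

# Space-delimited, header on line index 2, rows starting with '#' skipped
toy = pd.read_csv(io.StringIO(raw_text), delimiter=' ', header=2, comment='#')
print(toy)

# Write back out as comma-separated text without the index
clean_text = toy.to_csv(index=False)
print(clean_text)
```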

## Plotting with pandas

### Plotting series using pandas

Data visualization is often a very effective first step in gaining a rough understanding of a data set to be analyzed. Pandas provides data visualization by both depending upon and interoperating with the matplotlib library. We will now explore some of the basic plotting mechanics with pandas as well as related matplotlib options. We will use the DataFrame method `df.plot()` to visualize the data, and then explore the optional matplotlib input parameters that this `.plot()` method accepts.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 10]

The pandas `.plot()` method makes calls to matplotlib to construct the plots. This means that we can use the skills we've learned in the previous visualization lesson to customize the plot. In this example, we'll add a custom title and axis labels to the figure.

Before plotting, let's inspect the DataFrame using `df.head()`. Also, let's use `type(df)` and note that it is a single-column DataFrame.

In [None]:
# Create a plot with color='red'
utils.august_weather[['Temperature (degrees F)']].plot(color='red')

# Add a title
plt.title('Temperature in Austin')

# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')

# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')

# Display the plot
plt.show()

### Plotting DataFrames

Comparing data from several columns can be very illuminating. Pandas makes doing so easy with multi-column DataFrames. By default, calling `df.plot()` will cause pandas to over-plot all column data, with each column as a single line. In this example, we have pre-loaded three columns of data from a weather data set - temperature, dew point, and pressure - but the problem is that pressure has different units of measure. The pressure data, measured in Atmospheres, has a different vertical scaling than that of the other two data columns, which are both measured in degrees Fahrenheit.

Let's plot all columns as a multi-line plot, to see the nature of the vertical scaling problem. Then, we'll use a list of column names passed into the DataFrame `df[column_list]` to limit plotting to just one column, and then just 2 columns of data.

In [None]:
utils.august_weather.plot()
plt.show()

In [None]:
# Plot all columns as subplots
utils.august_weather.plot(subplots=True)
plt.show()

In [None]:
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list = ['Temperature (degrees F)','Dew Point (degrees F)']
utils.august_weather[column_list].plot()
plt.show()

# Extracting and transforming data

In this chapter, we will learn how to index, slice, filter, and transform DataFrames using a variety of datasets.

## Index DataFrames

### Square Brackets

We can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

In [None]:
# Print out winner column as Pandas Series
utils.election['winner']

In [None]:
# Print out winner column as Pandas DataFrame
utils.election[['winner']]

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

In [None]:
# Print out DataFrame with winner and turnout columns
utils.election[['winner', 'turnout']]

Square brackets can do more than just selecting columns. We can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the election DataFrame:

```
utils.election[0:5]
```

The result is another DataFrame containing only the rows we specified.

In [None]:
# Print out first 3 observations
utils.election[:3]

In [None]:
# Print out fourth, fifth and sixth observation
utils.election[3:6]
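
The three bracket forms can be compared side by side on a small made-up DataFrame (`toy`, its county labels, and its values are assumptions, not the election data).

```python
import pandas as pd

# Hypothetical stand-in for the election DataFrame
toy = pd.DataFrame(
    {'winner': ['Obama', 'Romney', 'Romney'], 'turnout': [68.6, 70.1, 66.3]},
    index=['County A', 'County B', 'County C'],
)

as_series = toy['winner']    # single brackets: a Series
as_frame = toy[['winner']]   # double brackets: a one-column DataFrame
first_two = toy[0:2]         # a slice selects rows by position

print(type(as_series))
print(type(as_frame))
print(first_two)
```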

### Positional and labeled indexing (loc and iloc)

With [loc](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing) and [iloc](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing) we can do practically any data selection operation on DataFrames we can think of. `loc` is label-based, which means that we have to specify rows and columns based on their row and column labels. `iloc` is integer index based, so we have to specify rows and columns by their integer index like we did in the previous example.

In [None]:
# Print out observation for Indiana
utils.election.loc['Indiana']

In [None]:
# Same but using iloc
utils.election.iloc[31]

In [None]:
# Print out observations for Indiana and Northampton
utils.election.loc[['Indiana', 'Northampton']]

In [None]:
# Same but using iloc
utils.election.iloc[[31, 47]]
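
The integer positions passed to `iloc` do not have to be counted by hand: `Index.get_loc()` returns the position of a label. The sketch below uses a made-up DataFrame (`toy` and its values are assumptions).

```python
import pandas as pd

# Hypothetical stand-in for the election DataFrame
toy = pd.DataFrame(
    {'winner': ['Obama', 'Romney', 'Romney'], 'turnout': [68.6, 70.1, 66.3]},
    index=['County A', 'County B', 'County C'],
)

# Look up the integer position of a row label
pos = toy.index.get_loc('County B')
print(pos)

# The label-based and position-based lookups return the same row
same_row = toy.iloc[pos].equals(toy.loc['County B'])
print(same_row)
```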

`loc` and `iloc` also allow us to select both rows and columns from a DataFrame.

In [None]:
# Print out winner value of Indiana
print(utils.election.loc['Indiana', 'winner'])

In [None]:
# Print sub-DataFrame
utils.election.loc[['Indiana', 'Northampton'], ['winner', 'turnout']]

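It's also possible to select only columns with `loc` and `iloc`. In both cases, we simply put a slice going from beginning to end in front of the comma. For example, with a DataFrame named `cars` (not part of this lesson's data) whose columns at positions 1 and 2 are `country` and `drives_right`: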

```
cars.loc[:, 'country']
cars.iloc[:, 1]

cars.loc[:, ['country','drives_right']]
cars.iloc[:, [1, 2]]
```

In [None]:
# Print out winner column as Series
utils.election.loc[:, 'winner']

In [None]:
# Print out winner column as DataFrame
utils.election.loc[:, ['winner']]

In [None]:
# Print out winner and turnout as DataFrame
utils.election.loc[:, ['winner', 'turnout']]

### Indexing and column rearrangement

There are circumstances in which it's useful to modify the order of DataFrame columns. We do that now by extracting just three columns from the Pennsylvania election results DataFrame.

In [None]:
# Create a separate dataframe with the columns ['winner', 'total', 'voters']
utils.election[['winner', 'total', 'voters']].head()

## Slicing DataFrames

### Slicing rows

The Pennsylvania US election results data set that we have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this example, we're going to perform slicing on the county names of the election DataFrame.

In [None]:
# Slice the row labels 'Perry' to 'Potter': p_counties
utils.election.loc['Perry':'Potter']

In [None]:
# Slice the row labels 'Potter' to 'Perry' in reverse order
utils.election.loc['Potter':'Perry':-1]
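
One detail worth checking when slicing by label is that `.loc` includes the end label, whereas positional slicing with `.iloc` excludes the end position. A minimal sketch on made-up data (the `total` values are assumptions):

```python
import pandas as pd

# Hypothetical stand-in with an alphabetically sorted index
toy = pd.DataFrame(
    {'total': [100, 200, 300, 400]},
    index=['Adams', 'Berks', 'Centre', 'Dauphin'],
)

by_label = toy.loc['Berks':'Centre']         # both end points included
by_position = toy.iloc[1:2]                  # end position excluded
reversed_rows = toy.loc['Centre':'Berks':-1]

print(by_label)
print(by_position)
print(reversed_rows)
```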

### Slicing columns

Similar to row slicing, columns can be sliced by label using `.loc[]`.

In [None]:
# Slice the columns from 'Obama' to 'winner'
utils.election.loc[:, 'Obama':'winner'].head()

In [None]:
# Slice the columns from the starting column to 'Obama'
utils.election.loc[:, :'Obama']

In [None]:
# Slice the columns from 'Romney' to the end
utils.election.loc[:, 'Romney':].head()

### Subselecting DataFrames with lists

We can use lists to select specific row and column labels with the `.loc[]` accessor.

In [None]:
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']

# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']

# Create the new DataFrame
utils.election.loc[rows, cols]

## Filtering DataFrames

### Thresholding data

In this example, we want to prepare a boolean array to select all of the rows where voter turnout exceeded 70%.

In [None]:
# Create the boolean array: high_turnout
high_turnout = utils.election['turnout'] > 70

# Filter the election DataFrame with the high_turnout array
utils.election[high_turnout]
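
Boolean arrays can also be combined with `&` (and) and `|` (or) before filtering. The sketch below uses a made-up DataFrame (`toy` and its numbers are assumptions) to select rows that satisfy two conditions at once.

```python
import pandas as pd

# Hypothetical stand-in for the election DataFrame
toy = pd.DataFrame(
    {'turnout': [68.0, 72.5, 75.0], 'margin': [12.0, 0.6, 8.3]},
    index=['County A', 'County B', 'County C'],
)

# One boolean Series per condition
high = toy['turnout'] > 70
close = toy['margin'] < 1

# Rows where turnout is high
print(toy[high])

# Rows where turnout is high AND the margin is small
print(toy[high & close])
```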

### Filtering columns using other columns

The election results DataFrame has a column labeled `'margin'` which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than 1%, the results would be too-close-to-call.

In this example we'll use a boolean selection to filter the rows where the margin was less than 1. Then we'll convert these rows of the 'winner' column to `np.nan` to indicate that these results are too close to declare a winner.

In [None]:
# Create the boolean array: too_close
election = utils.election.copy()
too_close = election['margin'] < 1

# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = np.nan

# Print the output of election.info()
election.info()

### Filtering using NaNs

In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The `.dropna()` method is used to perform this action. We'll now practice using this method on a dataset obtained from [Vanderbilt University](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html), which consists of data from passengers on the Titanic.

We'll also use the `.shape` attribute, which returns the number of rows and columns in a tuple from a DataFrame, or the number of rows from a Series, to see the effect of dropping missing values from a DataFrame.

Finally, we'll use the `thresh=` keyword argument to drop columns from the full dataset that have fewer than 1000 non-missing values.

In [None]:
utils.titanic.info()

In [None]:
# Select the 'age' and 'cabin' columns: df
df = utils.titanic[['age', 'cabin']]

# Print the shape of df
df.shape

In [None]:
# Drop rows in df with how='any' and print the shape
df.dropna(how='any').shape

In [None]:
# Drop rows in df with how='all' and print the shape
df.dropna(how='all').shape

In [None]:
# Drop columns in titanic with less than 1000 non-missing values
utils.titanic.dropna(thresh=1000, axis='columns').info()
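
The three `dropna()` variants are easier to follow on a DataFrame small enough to check by eye. The values below are made up (assumptions, not Titanic records), and the threshold is scaled down to 3.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the titanic DataFrame
toy = pd.DataFrame({
    'age': [29.0, np.nan, np.nan, 40.0],
    'cabin': ['B5', np.nan, 'C22', np.nan],
    'fare': [10.0, 20.0, 30.0, 40.0],
})
sub = toy[['age', 'cabin']]

# how='any': drop a row if at least one value is missing
any_shape = sub.dropna(how='any').shape

# how='all': drop a row only if every value is missing
all_shape = sub.dropna(how='all').shape

# thresh=3 on columns: keep columns with at least 3 non-missing values
kept = toy.dropna(thresh=3, axis='columns')

print(any_shape, all_shape, list(kept.columns))
```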

## Transforming DataFrames

### Using apply() to transform a column

The `.apply()` method can be used on a pandas DataFrame to apply an arbitrary Python function along an axis: by default, the function is called once per column and receives that column as a Series. In this example we'll revisit our weather dataset.

In [None]:
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)

In [None]:
utils.weather.head()

In [None]:
# Apply the function over 'Temperature (degrees F)' and 'Dew Point (degrees F)': df_celsius
df_celsius = utils.weather[['Temperature (degrees F)', 'Dew Point (degrees F)']].apply(to_celsius)

# Reassign the columns of df_celsius
df_celsius.columns = ['Temperature (degrees C)', 'Dew Point (degrees C)']

# Print the output of df_celsius.head()
df_celsius.head()
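
To see what `.apply()` hands to the function, the sketch below runs it on a made-up two-row DataFrame (`toy` and its temperatures are assumptions). Because `to_celsius` only uses arithmetic, calling it on the DataFrame directly gives the same result.

```python
import numpy as np
import pandas as pd

def to_celsius(F):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return 5/9*(F - 32)

# Hypothetical stand-in for the weather DataFrame
toy = pd.DataFrame({
    'Temperature (degrees F)': [32.0, 212.0],
    'Dew Point (degrees F)': [50.0, 68.0],
})

# By default, .apply() passes each column to the function as a Series
received = toy.apply(lambda col: type(col).__name__)
print(received)

via_apply = toy.apply(to_celsius)
direct = to_celsius(toy)  # plain arithmetic also works on the whole DataFrame

print(via_apply)
print(via_apply.equals(direct))
```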

### Using .map() with a dictionary

The `.map()` method is used to transform values according to a Python dictionary look-up.

In [None]:
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
utils.election['color'] = utils.election['winner'].map(red_vs_blue)

# Print the output of election.head()
utils.election.head()
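
One behaviour to keep in mind is that values missing from the dictionary are mapped to `NaN`. A small sketch with a made-up Series (the `'Tie'` entry is an assumption added to trigger this case):

```python
import pandas as pd

# Hypothetical winners, including a value absent from the dictionary
winner = pd.Series(['Obama', 'Romney', 'Tie'])
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}

# Values without a matching key become NaN
color = winner.map(red_vs_blue)
print(color)
```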

### Using vectorized functions

When performance is paramount, we should avoid using `.apply()` and `.map()` because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, we can loop over the data at the same speed as compiled code! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

We can even write our own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas.

In this example we're going to import the `zscore` function from `scipy.stats` and use it to compute the deviation in voter turnout in Pennsylvania from the mean in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean.

Instead of using `.apply()` as we did in the earlier examples, the `zscore` UFunc will take a pandas Series as input and return a NumPy array. We will then assign the values of the NumPy array to a new column in the DataFrame.

In [None]:
# Import zscore from scipy.stats
from scipy.stats import zscore

In [None]:
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(utils.election['turnout'])

# Print the type of turnout_zscore
print(type(turnout_zscore))

In [None]:
# Assign turnout_zscore to a new column: election['turnout_zscore']
utils.election['turnout_zscore'] = turnout_zscore

# Print the output of election.head()
utils.election.head()
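
The z-score itself can be written with vectorized pandas arithmetic. The sketch below uses made-up turnout values (an assumption) and passes `ddof=0` to `.std()`, since `scipy.stats.zscore` uses the population standard deviation by default while pandas defaults to `ddof=1`.

```python
import pandas as pd

# Hypothetical turnout percentages
turnout = pd.Series([60.0, 70.0, 80.0])

# z = (x - mean) / standard deviation, computed on the whole Series at once
z = (turnout - turnout.mean()) / turnout.std(ddof=0)
print(z)
```

By construction the result has mean 0 and (population) standard deviation 1.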

# Exploratory data analysis

## Visual exploratory data analysis

### pandas line plots

Earlier in this lesson, we saw that the `.plot()` method will place the Index values on the x-axis by default. In the following example, we'll practice making line plots with specific columns on the x and y axes.

We will work with a dataset consisting of monthly stock prices in 2015 for AAPL, GOOG, and IBM. The stock prices were obtained from Yahoo Finance. We will plot the `'Month'` column on the x-axis and the Apple and IBM prices (columns `'APPLE'` and `'IBM'`) on the y-axis using a list of column names.

In [None]:
utils.stock_data

In [None]:
# Create a list of y-axis column names: y_columns
y_columns = ['APPLE', 'IBM']

# Generate a line plot
utils.stock_data.plot(x='Month', y=y_columns)

# Add the title
plt.title('Monthly stock prices')

# Add the y-axis label
plt.ylabel('Price ($US)')

# Display the plot
plt.show()

### pandas scatter plots

Pandas scatter plots are generated using the `kind='scatter'` keyword argument. Scatter plots require that the x and y columns be chosen by specifying the `x` and `y` parameters inside `.plot()`. Scatter plots also take an `s` keyword argument to set the size of each marker, which matplotlib interprets as the marker area in points squared.

In this example, we're going to plot fuel efficiency (miles-per-gallon) versus horse-power for 392 automobiles manufactured from 1970 to 1982 from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

The size of each circle is provided as a NumPy array called `sizes`. This array contains the normalized 'weight' of each automobile in the dataset.

In [None]:
# Generate a scatter plot
utils.auto_mpg.plot(kind='scatter', x='hp', y='mpg', s=utils.sizes, c='blue')

# Add the title
plt.title('Fuel efficiency vs Horse-power')

# Add the x-axis label
plt.xlabel('Horse-power')

# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')

# Display the plot
plt.show()

### pandas box plots

While pandas can plot multiple columns of data in a single figure, making plots that share the same x and y axes, there are cases where two columns cannot be plotted together because their units do not match. The `.plot()` method can generate subplots for each column being plotted. Here, each plot will be scaled independently.

In this example we will generate box plots for fuel efficiency (mpg) and weight from the automobiles data set. To do this in a single figure, we'll specify `subplots=True` inside `.plot()` to generate two separate plots.

In [None]:
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']

# Generate the box plots
utils.auto_mpg[cols].plot(kind='box', subplots=True)

# Display the plot
plt.show()

### pandas hist, pdf and cdf

Pandas histograms, drawn with `.plot(kind='hist')` or the `.hist()` method, can show not only raw counts but also estimates of probability density functions (PDFs) and cumulative distribution functions (CDFs).

We will plot a PDF and CDF for the `fraction` column of the tips dataset. This column contains information about what fraction of the total bill is comprised of the tip.

When plotting the PDF, we need to specify `density=1` (equivalent to `density=True`) in our call to `.plot(kind='hist')`, and when plotting the CDF, we need to specify `cumulative=True` in addition to `density=1`.

In [None]:
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)

# Plot the PDF
utils.tips.fraction.plot(ax=axes[0], kind='hist', density=1, bins=30, range=(0,.3))
# Pass None positionally: the `b` keyword was renamed to `visible` in Matplotlib 3.5
axes[0].grid(None)

# Plot the CDF
utils.tips.fraction.plot(ax=axes[1], kind='hist', density=1, cumulative=True, bins=30, range=(0,.3))
axes[1].grid(None)

# Show the plots
plt.show()
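
What `density` and `cumulative` compute can be checked numerically with `np.histogram`. This is a sketch on made-up tip fractions (the array is an assumption): with density normalization the bar areas sum to 1, and accumulating those areas gives a curve that ends at 1.

```python
import numpy as np

# Hypothetical tip fractions
fractions = np.array([0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25])

# density=True scales the bar heights so that the total bar area is 1
density, edges = np.histogram(fractions, bins=3, range=(0, 0.3), density=True)
widths = np.diff(edges)
print((density * widths).sum())

# Accumulating the bar areas gives the empirical CDF at each bin's right edge
cdf = np.cumsum(density * widths)
print(cdf)
```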

## Statistical exploratory data analysis

### Median vs mean

In many data sets, there can be large differences in the mean and median value due to the presence of outliers.

In this example, we'll investigate the mean, median, and max fare prices paid by passengers on the Titanic and generate a box plot of the fare prices. 

In [None]:
# Print summary statistics of the fare column with .describe()
utils.titanic.fare.describe()

In [None]:
# Generate a box plot of the fare column
utils.titanic.fare.plot(kind='box')

# Show the plot
plt.show()
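
The pull of a single outlier on the mean is easy to reproduce with a handful of numbers. The fares below are made up for illustration (an assumption, not values from the Titanic data).

```python
import pandas as pd

# Hypothetical fares: four modest values and one very large one
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 512.0])

print(fares.mean())    # pulled upward by the single large fare
print(fares.median())  # unaffected by how large the outlier is
```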

### Quantiles

In this example, we'll investigate the distribution of life expectancy in countries around the world. This dataset contains life expectancy for persons born each year from 1800 to 2015. Since country names change or results are not reported, not every country has values. This dataset was obtained from Gapminder.

First, we will determine the number of countries reported in 2015. There are a total of 260 unique countries in the entire dataset. Then, we will compute the 5th and 95th percentiles of life expectancy over the entire dataset. Finally, we will make a box plot of life expectancy every 50 years from 1800 to 2000. Notice the large change in the distributions over this period.

In [None]:
# Print the number of countries reported in 2015
utils.life_expectancy['2015'].count()

In [None]:
# Print the 5th and 95th percentiles
utils.life_expectancy.quantile([0.05, 0.95])

In [None]:
# Generate a box plot
years = ['1800','1850','1900','1950','2000']
utils.life_expectancy[years].plot(kind='box')
plt.show()
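
On a DataFrame, `.quantile()` with a list returns one row per requested quantile and one column per numeric column. The sketch below uses made-up life expectancies (`life` and its values are assumptions) small enough to verify by hand with the default linear interpolation.

```python
import pandas as pd

# Hypothetical life expectancies for five countries in two years
life = pd.DataFrame({
    '1800': [30.0, 32.0, 34.0, 36.0, 38.0],
    '2000': [60.0, 65.0, 70.0, 75.0, 80.0],
})

# Number of non-missing entries in one column
print(life['2000'].count())

# 5th and 95th percentiles of every column
q = life.quantile([0.05, 0.95])
print(q)
```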

---
**[Week 2 - Data Analysis and Visualisation](https://radu-enuca.gitbook.io/ml-challenge/data-analysis-and-visualisation)**

*Have questions or comments? Visit the ML Challenge Mattermost Channel.*