*This is part of Kaggle's [Learn Machine Learning](https://www.kaggle.com/learn/machine-learning) series.*

# Selecting and Filtering Data
Your dataset has too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

To show you the techniques, we'll start by picking a few variables using our intuition. Later tutorials will show you statistical techniques to automatically prioritize variables.

Before we can choose variables (columns), it is helpful to see a list of all columns in the dataset. That is done with the **columns** property of the DataFrame (the last line of the code below).

In [1]:
import pandas as pd

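# path to the Melbourne housing data file
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the CSV file into a DataFrame
melbourne_data = pd.read_csv(melbourne_file_path)
# list every column name in the dataset
print(melbourne_data.columns)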

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
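
As a small sketch of working with the **columns** property, the snippet below turns the column index into a plain Python list and checks whether a given name is present. The `toy_data` DataFrame and its values are made up for illustration; they stand in for the Melbourne file so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the Melbourne file (assumption)
toy_data = pd.DataFrame({
    'Suburb': ['Abbotsford', 'Richmond'],
    'Rooms': [2, 3],
    'Price': [1480000.0, 1035000.0],
})

# Convert the column index to an ordinary list of strings
column_names = toy_data.columns.tolist()
print(column_names)

# Check whether a column exists before trying to select it
has_price = 'Price' in toy_data.columns
print(has_price)
```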


There are many ways to select a subset of your data. We'll start with two main approaches: selecting a single column and selecting multiple columns.

## Selecting a Single Column
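You can pull out a variable (or column) with **dot-notation**, provided its name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data. Here's an example: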

In [2]:
# store the series of prices separately as melbourne_price_data.
melbourne_price_data = melbourne_data.Price
# the head command returns the top few lines of data.
print(melbourne_price_data.head())

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
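
Dot-notation only works for column names that are valid Python identifiers, so a name containing a space or colon (such as `'Unnamed: 0'` in the column list above) needs bracket notation instead. The following is a minimal sketch on made-up data; `toy_data` and the `'Building Area'` column name are assumptions for illustration, not part of the Melbourne dataset.

```python
import pandas as pd

# Hypothetical toy data (assumption); one column name contains a space
toy_data = pd.DataFrame({
    'Price': [1480000.0, 1035000.0, 1465000.0],
    'Building Area': [None, 79.0, 150.0],
})

# Both forms return the same Series when the name is a valid identifier
price_by_dot = toy_data.Price
price_by_brackets = toy_data['Price']
print(price_by_dot.equals(price_by_brackets))

# Only bracket notation works for a name with a space in it
building_area = toy_data['Building Area']
print(type(building_area).__name__)
```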


## Selecting Multiple Columns
You can select multiple columns from a DataFrame by providing a list of column names inside brackets. Remember, each item in that list should be a string (with quotes).

In [4]:
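# list the names of the columns to keep
columns_of_interest = ['Landsize', 'BuildingArea']
# select those columns; the result is a new DataFrame
two_columns_of_data = melbourne_data[columns_of_interest]

# display the selected columns
two_columns_of_data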

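|   | Landsize | BuildingArea |
|---|---|---|
| 0 | 202.0 | NaN |
| 1 | 156.0 | 79.0 |
| 2 | 134.0 | 150.0 |
| 3 | 94.0 | NaN |
| 4 | 120.0 | 142.0 |
| 5 | 181.0 | NaN |
| 6 | 245.0 | 210.0 |
| 7 | 256.0 | 107.0 |
| 8 | NaN | NaN |
| 9 | NaN | NaN |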
9,,


We can verify that we got the columns we need with the **describe** command, which prints summary statistics for each numeric column.

In [5]:
two_columns_of_data.describe()

|   | Landsize | BuildingArea |
|---|---|---|
| count | 13603.0 | 7762.0 |
| mean | 558.116371 | 151.220219 |
| std | 3987.326586 | 519.188596 |
| min | 0.0 | 0.0 |
| 25% | 176.5 | 93.0 |
| 50% | 440.0 | 126.0 |
| 75% | 651.0 | 174.0 |
| max | 433014.0 | 44515.0 |
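
The `count` row in the summary differs between the two columns (13603.0 versus 7762.0). `describe` counts only non-missing values, so a lower count indicates missing entries in that column. A minimal sketch of this behaviour, using a made-up `toy_data` DataFrame rather than the Melbourne data:

```python
import pandas as pd

# Hypothetical toy data (assumption) with two missing BuildingArea values
toy_data = pd.DataFrame({
    'Landsize': [202.0, 156.0, 134.0, 94.0],
    'BuildingArea': [None, 79.0, 150.0, None],
})

# count reflects only the non-missing values in each column
summary = toy_data.describe()
print(summary.loc['count'])

# isna().sum() reports how many values are missing per column
missing_per_column = toy_data.isna().sum()
print(missing_per_column)
```

The other statistics, such as the mean, are likewise computed over the non-missing values only.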


# Your Turn
In the notebook with your code:
1. Print a list of the columns
2. From the list of columns, find the name of the column with the sales prices of the homes. Use dot-notation to extract it to a variable (as you saw above when creating `melbourne_price_data`).
3. Use the `head` command to print out the top few lines of the variable you just created.
4. Pick any two variables and store them in a new DataFrame (as you saw above when creating `two_columns_of_data`).
5. Use the `describe` command on the DataFrame you just created to see summaries of those variables.

