This notebook introduces the libraries and tools that you will need for this project. It presents them on a simple dataset: the first 8 columns hold the coordinates of a square, and the last column holds the area of this square.

Later you will have to use these libraries and tools on the house price dataset.

# Libraries First Steps

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) is a library that makes it easy to manipulate different data structures. Here we will only present the essential functions that you will need.

[Numpy](https://docs.scipy.org/doc/) is a library for scientific computing. It contains many useful mathematical functions.

In [None]:
# Import the libraries used throughout this notebook
import numpy as np
import pandas as pd

## Open the data

The first function that you need is `read_csv`, which reads a file containing a matrix of data separated by commas (CSV stands for comma-separated values). Each line represents one vector (in this toy example, the coordinates of a square).

In [None]:
squares = pd.read_csv('data/squares.csv') # Read the file
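
To see what `read_csv` does without depending on a file on disk, here is a minimal sketch that parses a small CSV text held in memory. The column names and values are made up for illustration and are not the real content of `data/squares.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk:
# one header line, then one line per data point
csv_text = "x0,y0,area\n0.0,0.0,1.0\n0.5,0.5,0.25\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names taken from the header line
```

The header line becomes the column names, and every following line becomes one row of the resulting DataFrame.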

The following representation of a square is used in this dataset:
![](img/square.png)

In [None]:
squares.head() # Display the first lines

## Simple description

Analysing the individual columns is a useful first step: it shows how the data is distributed.

In [None]:
squares.describe()

This command displays 8 numbers for each column of the dataset. The **count** is the number of non-missing values in the given column.
Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

The second value is the **mean**, which is the average. Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
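
The statistics described above can be checked by hand on a tiny example. The sketch below uses a hypothetical one-column DataFrame with one missing value, so that the **count** differs from the number of rows.

```python
import pandas as pd

# Hypothetical column: four numbers and one missing value
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, None]})

summary = toy.describe()
print(summary)

# Each statistic is one row of the result
count = summary.loc["count", "value"]   # non-missing values only
mean = summary.loc["mean", "value"]
median = summary.loc["50%", "value"]
```

Here the DataFrame has 5 rows but the count is 4, because the missing value is ignored by every statistic.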

How many points are present in this dataset? Use `len(dataset)` or look at the **count** row of the table returned by `.describe()`.

## Selection of data points

To select data points based on the value of a column, proceed as follows:

In [None]:
selection = squares.x0 < 0.25 # Create a boolean Series of True and False
squares[selection].head() # Select only the rows where selection is True
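
The same mechanism on a hypothetical four-row DataFrame makes it easier to see which rows are kept. This is only a sketch on made-up values, not the `squares` data.

```python
import pandas as pd

# Hypothetical data with a single column
toy = pd.DataFrame({"x0": [0.1, 0.3, 0.2, 0.9]})

mask = toy.x0 < 0.25   # boolean Series: True, False, True, False
subset = toy[mask]     # keeps only the rows where mask is True

# True counts as 1 and False as 0, so summing the mask
# gives the number of selected rows
n_selected = mask.sum()
print(subset)
print(n_selected)
```

Note that the selected rows keep their original index labels.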

Use the previous commands to answer the question:
How many squares have an area lower than 0.125?

## Applying a function on a column

Let's apply a function to a column in order to clean it, change its format, or even change its values.

For our toy example, we will say that each square for which x0 < 0.5 is two times too small. The goal is to compute the new coordinates and the new area of these squares.

In [None]:
# Work on a copy so that the original data stays untouched
modifiedSquares = squares.copy()

First, let's look at how to define a function in Python: the keyword `def` introduces the definition, followed by the function's name and the arguments it takes as input.

```
def functionName(argument1, argument2, ...):
    """
        Optional description of the function
    """
    result = ... 
    return result
```
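
As a concrete instance of this template, here is a small sketch. The function `shift` and its `offset` argument are hypothetical names chosen for illustration. Because arithmetic on a pandas column acts on every element at once, the same function works on a whole column.

```python
import pandas as pd


def shift(column, offset=1.0):
    """
        Hypothetical helper: add a constant offset to every value
    """
    result = column + offset
    return result


col = pd.Series([0.0, 0.5, 1.0])
shifted = shift(col, offset=2.0)  # the argument can be passed by name
print(shifted)
```

The default value `offset=1.0` is used whenever the caller does not provide that argument.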

In [None]:
# To complete
# Define the function multiplyByTwo that multiplies a given column by two

Second, to apply a function to all columns, you can use: 
`squares.apply(function)`
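
The sketch below illustrates `apply` on a hypothetical two-column DataFrame. It shows the difference between calling `apply` on the whole DataFrame (the function receives one column at a time) and on a single column (the function receives one value at a time).

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({"a": [1.0, 4.0], "b": [9.0, 16.0]})

# DataFrame.apply: the function receives each column in turn (default axis=0)
centered = toy.apply(lambda col: col - col.min())

# Series.apply: the function receives each element of one column
a_plus_one = toy["a"].apply(lambda v: v + 1)

print(centered)
print(a_plus_one)
```

In both cases a new object is returned and `toy` itself is left unchanged, so the result has to be assigned if you want to keep it.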

In [None]:
# To complete
modifiedSquares[...] = ...

## Some tests

First, you want to verify that we haven't lied and that the column `area` really is the area of each square.

To do so, compute the distance between two points (the length of a side) and then square it.

Just a little hint:
- `np.power(value, exponent)` computes the power of a number
- `np.sqrt(value)` computes the square root of a number
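
Both functions work element by element on arrays and on pandas columns. A minimal sketch on hypothetical values:

```python
import numpy as np

values = np.array([3.0, 4.0])

# Element-wise power: each value is raised to the exponent 2
squared = np.power(values, 2)

# Square root of the sum of the squared values
root = np.sqrt(squared.sum())

print(squared)
print(root)
```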

In [None]:
# To complete
def side(x0, y0, x1, y1):
    """
        The dataset contains the columns x0, y0 and x1, y1
        This function acts on these columns to compute
        the distance between points 0 and 1
        and returns the result
    """
    result = 
    return result

In [None]:
# To complete
# If we told the truth, the square of the side length of each square
# should be equal to the area column of the same dataset
squaredLength = 

# Do not modify the assert, a statement that
# verifies a condition and raises an error if it is not True
assert np.allclose(squaredLength, squares.area), "You have certainly made a mistake"
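
The check uses `np.allclose` rather than `==` because floating-point computations carry small rounding errors. The following sketch, on hypothetical values, shows two arrays that are equal on paper but not bit for bit.

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

# Exact comparison fails: 0.1 + 0.2 is not stored exactly as 0.3
exact = bool((a == b).all())

# allclose compares within a small tolerance
close = bool(np.allclose(a, b))

print(exact, close)
```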

In [None]:
# To complete
# Execute the same computation for modifiedSquares
msquaredLength = 
assert np.allclose(msquaredLength, modifiedSquares.area), "What is the mistake ?"

Why does it not work? It seems that you have multiplied every column by two. What is the mistake? And what should you do to correct it?

# Real data 

## Manipulation with pandas

Now you are ready to play with some real data!

In [None]:
# To complete
# Open the data located at data/trainTest.csv
house_price_train = 

Answer the following questions:
- What is the average price of a house?
- How many variables (called features) describing a house do you have?
- During which period of time was this dataset collected? Do you think that all the columns will be useful for predicting the price? If not, delete the ones that you think are useless, using the function `.drop(columns="column_to_delete")`.
- Are there missing data? Which variable seems the most affected?
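
The tools needed for these questions can be tried first on a toy table. In the sketch below, the column names `Id`, `Price` and `Garage` are hypothetical and are not taken from the house price dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing values (NaN)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Price": [100.0, np.nan, 300.0],
    "Garage": [np.nan, np.nan, "attached"],
})

# isnull() marks each missing cell with True; sum() counts them per column
missing_per_column = toy.isnull().sum()

# idxmax() returns the label of the column with the largest count
most_affected = missing_per_column.idxmax()

# drop returns a new DataFrame without the given column
without_id = toy.drop(columns="Id")

print(missing_per_column)
print(most_affected)
print(list(without_id.columns))
```

`drop` does not modify `toy` in place, so assign its result to a variable to keep the reduced table.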

Your goal will be to predict the price of a house given all the variables that describe it.

## Quantitative vs Qualitative 

Have you looked at the different kinds of data you have in this dataset? In data science, there are two main kinds of data:

**Quantitative** data deals with numbers and things you can measure objectively: dimensions such as height, width, and length. Temperature and humidity. Prices. Area and volume.

**Qualitative** data deals with characteristics and descriptors that can't be easily measured but can be observed subjectively, such as smells, tastes, textures and colors.

In [None]:
# To complete
# Find a quantitative and a qualitative column of data 
# (you can refer to the file describing the different columns 
# to have a better idea - description.txt)

Why is it important? And how can we deal with it?

Try to transform a qualitative column into a numerical column.  
What is the limit of this approach?

## What kind of problem is it?
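
As a starting point, here is a sketch of two common ways pandas can turn labels into numbers, shown on a hypothetical `color` column. Neither is presented as the expected answer; comparing their outputs is a good way to think about the question above.

```python
import pandas as pd

# Hypothetical qualitative column
toy = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Option 1: one integer code per distinct label, in order of appearance
codes, categories = pd.factorize(toy["color"])

# Option 2: one indicator column per distinct label
one_hot = pd.get_dummies(toy["color"])

print(codes, list(categories))
print(one_hot)
```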

In machine learning we distinguish three kinds of problems:
- **Supervised learning**: you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.
- **Unsupervised learning**: you only have input data (X) and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it.
- **Semi supervised learning**: you have a large amount of input data (X) and only some of it is labeled (Y). These problems sit in between supervised and unsupervised learning.
    
Let's give a few examples:  
- **Supervised learning**: You are a football fan and you really want to know which team will win the next World Cup. To find out, you have acquired the results of all the matches played by all the teams so far (with every input variable you could collect about each match: players, date, temperature ...), as well as the performances of the individual players. The label (output variable) is the number of goals scored by each team during the match.
- **Unsupervised learning**: You have several measures (evolution of weight, height, IQ, psychological tests, number of friends, ... and much more) taken on several American men over 20 years, and you would like to see if you can distinguish two groups in this population: one happy and one unhappy. However, you have no direct way of measuring happiness.
- **Semi supervised learning**: You work for Google and you want to recognize objects in a given image. Labelling each pixel of each picture that Google has would be reeeeaaaally long and expensive. So you will only label some of them, but you still want to take advantage of all the images you have, because the more data you have, the better your model can be!

What is the house price problem?

Supervised learning problems can be further divided into **regression** and **classification**. In classification, the goal is to predict the class of an object (a discrete value). In regression, the goal is to predict a continuous value.

If you have extra time, choose an article at [FiveThirtyEight](http://fivethirtyeight.com/) and try to determine what kind of problem it describes.