# A group project

## Setup

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import pandas as pd

# Display up to 999 rows when printing a DataFrame
pd.set_option("display.max_rows", 999)

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12


# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

Select a PATH pointing to your working directory - where your datasets are saved

In [None]:
PATH = ## Your code here ##

Then use the \<os> module to change your current (default) working directory to PATH, your new working directory

In [None]:
## Your code here ##

**Import the \<California_Houses.csv> dataset from your working directory**

In [None]:
df = ## Your code here ##

In [None]:
## Your code here ##

**Each row of the dataset represents one district in California**<br>
Have a look at the first and last few rows

In [None]:
## Your code here ##

### Create a categorical variable \<Closest_city> indicating the closest CA city and drop the distance to each city

*Hint : you may wish to associate the name of each city to the smallest distance among the four cities*

Save the transformed dataset as "housing"

In [None]:
#######################################
#
# Your lines (and cells) of code here
#
#######################################

In [None]:
housing

**Display the summary of your new dataframe**

In [None]:
## Your code here ##

# Part 1 - Data explorations

### What do you notice? 

- which attributes are quantitative ?
- which attributes are not quantitative? and what are their types?

### *``Your answers here``*

### Find out what categories exist in 'Closest_city' column and how many districts belong to each category.


In [None]:
## Your code here ##

### Show a summary of the quantitative attributes
**Use tables and plots**

In [None]:
## Your code for the tables here ##

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
## Your code for the plots here ##

#### Look more carefully into the distribution of the "Median_Income" column
Hint : you may wish to change the \<bins> parameter

In [None]:
## Your code here ##

In [None]:
## Your code for a boxplot here ##

### Explain the following lines of code

### *``Your answers here``*

In [None]:
cat=[np.min(housing["Median_Income"])]
for i in [0.20, 0.40, 0.60, 0.80]:
    cat.append(housing["Median_Income"].quantile(i))
cat.append(np.max(housing["Median_Income"]))
print(cat)

In [None]:
housing["income_cat"]=pd.cut(housing["Median_Income"], bins=cat, labels = [1,2,3,4,5], include_lowest=True)

In [None]:
housing["income_cat"].value_counts()

In [None]:
housing["income_cat"].hist();

In [None]:
housing.info()

### Create a Test Set through stratified random sampling on the income variable

**Why a test set ?** <br>
**Why stratify the data on the income variable before generating a test set ?** <br>
Hint : use a seed = 42 for the reproducibility of the (re)samplings

### *``Your answers here``*

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_strat, test_strat = ## Your code here ##

Now generate an equivalent random split without stratification

In [None]:
train_random, test_random = ## Your code here ##

#### Explain precisely what each of the two following cells does.

### *``Your answers here``*

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(test_strat),
    "Random": income_cat_proportions(test_random),
}).sort_index()

compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props

**For safety, copy the stratified train set to be used for modeling** <br>
Call your working copy \<houses_df>

In [None]:
## Your code here ##

### Data Visualization : scatter plots

**Plot each row (observation) in the dataset as a geographical point** <br>
Hint : You may use figsize=(10,10) and alpha=0.2

In [None]:
## Your code here ##

### Geographic map of California house values per district with population density

**Try to understand and comment on the following code**

### *``Your comments here``*

In [None]:
import warnings
warnings.filterwarnings("ignore")

import matplotlib.image as mpimg
california_img=mpimg.imread("california.png")
ax = houses_df.plot(kind="scatter", x="Longitude", y="Latitude", figsize=(14,10),
                       s=houses_df['Population']/100, label="Population",
                       c="Median_House_Value", cmap=plt.get_cmap("jet"),colorbar=False, alpha=0.4)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = houses_df["Median_House_Value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
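# One tick per label, rescaled to the 0-1 range of the colorbar attached to the image
cbar = plt.colorbar(ticks=tick_values/prices.max())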
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.show()

### Bivariate Analysis

**How are the different variables related to each other two by two ?**

Compute the correlation matrix of all the quantitative variables <br>
Call it "houses_cor"

In [None]:
## your code here ##

In [None]:
houses_cor

Display the most important correlations with the target variable : \<Median_House_Value> <br>
Why is it pertinent to choose this variable as outcome (target) ?

In [None]:
## Your code here ##

### *``Your comments here``*

What are the predictors of \<Median_House_Value> worthy of interest and the nature of their relationship with the target ? <br>
**Provide the scatter plots of those predictors with the target**

In [None]:
## Your code here ##

**What notable observations can you draw from these scatter plots?** <br>
Is the total number of rooms or bedrooms per district meaningful ? <br>
If so explain why, if not what are your suggestions ?

### *``Your answers here``*

**Create three new variables :**
1. Rooms per household
2. Bedrooms per room
3. People per household

In [None]:
#####################################

## Your lines (cells) of code here ##

#####################################

**Let us again look at the correlation between the predictors and the target**

In [None]:
## Your code here ##

# Part 2 - Preparing data for Machine Learning

## 2.1 - Missing Values

### 2.1.1 - Case when there are missing values in one variable

Let us create a dataset where 10% of the values are missing in one variable

### *``Comment on each of the following cells``*

In [None]:
## Your comment here ##

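# np.random.choice is driven by NumPy's seed, not by Python's random.seed
np.random.seed(42)
# replace=False guarantees 1651 distinct rows
miss = np.random.choice(houses_df.index, 1651, replace=False)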

In [None]:
miss

In [None]:
## Your comment here ##

houses_miss = houses_df.copy()

In [None]:
## Your comment here ##

houses_miss.loc[miss,"Tot_Bedrooms"] = None

In [None]:
## Your comment here ##

houses_miss.info()

**When we have missing values, there are two main possibilities :**
1. We simply drop the rows that contain missing values
2. We estimate the missing values through an imputation method - the simplest and safest is to use the median

In [None]:
## Your comment here ##

houses_drop = houses_miss.dropna(subset=["Tot_Bedrooms"])

In [None]:
## Your comment here ##

houses_drop.info()

In [None]:
## Your comment here ##

Bed_med = houses_miss["Tot_Bedrooms"].median()
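# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas
houses_miss["Tot_Bedrooms"] = houses_miss["Tot_Bedrooms"].fillna(Bed_med)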

In [None]:
## Your comment here ##

houses_miss.info()

### 2.1.2 - Case where you have missing values in several variables

Let us now build a dataset with multiple missing values : <br>
Start with a function generating missing values in a chosen column of a dataframe

In [None]:
def col_miss (df, col, max_miss):
    '''
    df : a pandas dataframe
    col : the name of the variable column
    max_miss : the maximum number of missing values
    returns a data frame with a random number of missing values on col
    '''

    ########################################
    ## Your lines of code here            ##
    ########################################


In [None]:
# Copy the train set
housing_miss = houses_df.copy()
housing_miss.info()

Generate some missing values in the first 10 predictors of \<housing_miss>

In [None]:
## Your code here ##  

In [None]:
housing_miss.info()

Now, we have a dataset with missing values in all the quantitative predictors

#### How many missing values are there in each variable ?

In [None]:
## Your code here ##

#### Let us use sklearn to do multiple imputation, with existing modules

In [None]:
# Start with simple imputer
from sklearn.impute import SimpleImputer

**Using Simple Imputer, impute missing data in each variable by replacing missing values with the mean**

In [None]:
## Your code here ##

In [None]:
X_df = pd.DataFrame(X, columns=df.columns)
X_df.info()

**Using KNNImputer, impute missing data in each variable**

In [None]:
## Your code here ##

In [None]:
## Your code here ##

In [None]:
## Your code here ##

### Which imputer is better : SimpleImputer or KNNImputer ?
Please give some arguments

### *``Your answers here``*

## 2.2 - Categorical variables...

In [None]:
houses_df.info()

We have two categorical variables :
1. "Closest_city" is nominal
2. "income_cat" is ordinal

**Explain the difference between nominal and ordinal variables**

### *``Your answers here``*

### 2.2.1 - Introducing onehot encoding

"Closest_city" has four modalities : the four city names <br>
"income_cat" has five modalities : the five intervals that we have labeled 1,2,3,4,5. However, as you can see in the graph and in the original bin edges - cat = [0.4999, 2.3523, 3.1406, 3.9669399999999997, 5.10972, 15.0001] - these intervals do not have equal widths, so you cannot meaningfully add or subtract their labels. <br>
**In short, both categorical variables should be considered nominal**

### *So, what is one hot encoding ?*

1. Count the number of modalities in your categorical variable - assume we have k modalities
2. Create k binary dummy variables, one per modality : each dummy equals 1 for the rows that belong to its modality and 0 otherwise <br>
*Warning : keeping all k dummies makes them perfectly collinear with the intercept (the "dummy variable trap"), so this encoding should not be used as is with the closed-form (analytical) solution or with Linear Regression without regularization. <br>
For more information : [see here](https://inmachineswetrust.com/posts/drop-first-columns/#cell7)*
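
As a hedged illustration of the two steps above, here is a minimal sketch on a made-up toy column (the `city` values and all variable names are assumptions, not the homework data). It uses `pandas.get_dummies` rather than the scikit-learn encoder used in this notebook, and also shows the `drop_first=True` variant that keeps k-1 dummies.

```python
import pandas as pd

# Toy data (hypothetical): a nominal variable with k = 3 modalities
toy = pd.DataFrame({"city": ["LA", "SF", "LA", "SD"]})

# Step 1: count the modalities
k = toy["city"].nunique()

# Step 2: one binary dummy column per modality (k columns)
dummies = pd.get_dummies(toy["city"], dtype=int)

# Variant keeping k-1 columns, which avoids perfect collinearity with an intercept
dummies_k_minus_1 = pd.get_dummies(toy["city"], drop_first=True, dtype=int)

print(k)
print(dummies)
print(dummies_k_minus_1.shape)
```

Each row of `dummies` contains exactly one 1, which is why the k columns always sum to the constant column used for the intercept.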

In [None]:
# Let us select our categorical variables
houses_cat = houses_df[["Closest_city","income_cat"]]

In [None]:
# Call the one-hot encoder. Choose a dense rather than a sparse output
from sklearn.preprocessing import OneHotEncoder as OHE
onehot = OHE(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
houses_onehot = onehot.fit_transform(houses_cat)

In [None]:
houses_onehot

In [None]:
onehot.categories_

### 2.2.2 - Building a Pipeline

Let us first discover how a pipeline operates...

In [None]:
# Copy once again the original train set
housing_df = train_strat.copy()
housing_df.info()

Let us start with a custom transformer to be used to add attributes

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


In [None]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
houses_plus = attr_adder.transform(housing_df.values)

#### What does this code actually accomplish ?
Hint : sklearn uses duck typing, not inheritance.<br>
To find out more about Duck Typing, [go here](https://youtu.be/N6sst3aH_FA)
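
To make the hint concrete, here is a minimal sketch of duck typing in plain NumPy, independent of scikit-learn. The classes `Centerer` and `SquareAdder` and the helper `run_steps` are hypothetical names invented for this illustration; the only point is that the helper works with any object exposing `fit` and `transform`, whatever its class hierarchy.

```python
import numpy as np


class Centerer:
    """Hypothetical step: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_


class SquareAdder:
    """Hypothetical step: appends the square of the first column."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.c_[X, X[:, 0] ** 2]


def run_steps(steps, X):
    # Only the presence of fit/transform matters, not a common base class
    for step in steps:
        X = step.fit(X).transform(X)
    return X


X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_out = run_steps([Centerer(), SquareAdder()], X_toy)
print(X_out)
```

The two classes share no parent class, yet both can be chained because they follow the same method convention.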

### *``Your answers here``*

In [None]:
# Check your answers...

houses_plus_df = pd.DataFrame(houses_plus,
                              columns=list(housing_df.columns)+["rooms_per_household", "population_per_household"],
                              index=housing_df.index)

houses_plus_df.info()

### 2.2.3 - Pipeline for the quantitative variables

For the quantitative variables, let us generate a pipeline with the following steps
1. Impute missing values with the "median" method
2. Add three new attributes : rooms per household, population per household and bedrooms per room
3. Standardize the training set

#### Let us restart with a new training set from a dataset with missing values

In [None]:
train = housing_miss[housing_miss.columns[0:12]]
train.info()

In [None]:
quanti_features = list(train.columns[1:10])
cat_features = ["Closest_city","income_cat"]
train_quanti = train[quanti_features]

In [None]:
train_quanti.info()

**Define the pipeline which will go through three steps :**
1. Impute missing data with the median method
2. Add three new combined attributes
3. Standardize the quantitative features

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

quanti_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()), # this will add 3 attributes
        ('std_scaler', StandardScaler()),
    ])

houses_quanti = quanti_pipeline.fit_transform(train_quanti)

In [None]:
houses_quanti

#### Explain the previous code cells and what they aim to accomplish

### *``Your answers here``*

**How many features should we have at this step ?** 

In [None]:
## Justify your answer with code here ##

### 2.2.4 - Pipeline for the quantitative and categorical variables

Now let us include the categorical variables

In [None]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
        ("num", quanti_pipeline, quanti_features),
        ("cat", OHE(), cat_features),
    ])

houses_ready = full_pipeline.fit_transform(train)

**How many features should we have at this step ? Explain !**

In [None]:
## Justify your answer with code here ##

# Part 3 - Machine Learning

First of all, define the target (outcome) and the predictors (features)

In [None]:
y = ## Your code here ##
X = ## Your code here ##

## 3.1 - Learning and evaluating with the training set only

#### Linear Regression
Start with the most classical Linear Regression <br>
Check that this algorithm does not use Ordinary Least Squares with matrix inversion
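
For context, the sketch below contrasts the two approaches on made-up toy data (all names and numbers are assumptions, unrelated to the housing dataset): the closed-form normal equation with an explicit matrix inverse, and NumPy's least-squares solver, which relies on a matrix decomposition instead. It does not inspect scikit-learn itself; for that, the `LinearRegression` documentation states which solver the estimator wraps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix: an intercept column and two made-up predictors
X_toy = np.c_[np.ones(50), rng.normal(size=(50, 2))]
true_beta = np.array([1.0, 2.0, -3.0])
y_toy = X_toy @ true_beta + rng.normal(scale=0.1, size=50)

# Closed form (normal equation) with an explicit matrix inversion
beta_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Least-squares solver: no explicit inverse is formed
beta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(beta_inv)
print(beta_lstsq)
```

On well-conditioned data both routes give the same coefficients; they differ in numerical stability when predictors are nearly collinear.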

In [None]:
## Your code here ##

Estimate performance with Mean Squared Error and Mean Absolute Error
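
As a reminder of what the two metrics measure, here is a small hand computation on made-up numbers (the arrays are assumptions chosen for easy arithmetic, not model outputs).

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred            # [1, 0, -2, -1]
mse = np.mean(errors ** 2)          # (1 + 0 + 4 + 1) / 4
mae = np.mean(np.abs(errors))       # (1 + 0 + 2 + 1) / 4
rmse = np.sqrt(mse)                 # back in the units of the target

print(mse, mae, rmse)
```

Squaring gives large errors more weight in the MSE than in the MAE, which is worth keeping in mind when the two metrics rank models differently.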

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
## Your code here ##

In [None]:
## Print your results (code) ##

#### Decision Tree Regression

In [None]:
## Your code here ##

Estimate performance with Mean Squared Error and Mean Absolute Error

In [None]:
## Your code here ##

In [None]:
## Print your results (code) ##

#### Comment on these first results

### *``Your answers here``*

## 3.2 - Estimating the models with *cross validation*

**Explain what the following function accomplishes**

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

### *``Your answers here``*

#### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

In [None]:
from sklearn.model_selection import cross_val_score

lr_scores = ## Your code here ##
lr_rmse = ## Your code here ##
## Your code here ##

#### Penalized Linear Regression (ElasticNet)

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
####################
## Your code here ##
####################

#### Decision Trees

In [None]:
## Your code here ##

In [None]:
####################
## Your code here ##
####################

***Compare the results of the evaluation (scores) using the training set with the evaluation (scores) using cross validation. <br>
What are your conclusions ?***

### *``Your answers here``*

#### Random Forests

In [None]:
## Your code here ##

In [None]:
####################
## Your code here ##
####################

#### Support Vector Machines

In [None]:
## Your code here ##

In [None]:
####################
## Your code here ##
####################

## 3.3 - Tuning the model with Grid Search and Randomized Search

#### Example : Random Forest

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
rf_grid = {'n_estimators': [30,60,100], 'max_features': [8,10,15]}

####################
## Your code here ##
####################

In [None]:
## Display the hyperparameters of the best model (code) ##

In [None]:
## Print the score of the best model (code)##

#### Example : ElasticNet

In [None]:
en_grid = {'alpha': np.logspace(-3, 4, 10), 'l1_ratio':np.linspace(0,1,11) }

####################
## Your code here ##
####################

In [None]:
## Display the hyperparameters of the best model (code) ##

In [None]:
## Print the score of the best model (code)##

#### Example : Decision Tree

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# min_samples_split must be at least 2 when given as an integer
cart_grid = {"min_samples_split": range(2,10),"min_samples_leaf": range(1,60)}

####################
## Your code here ##
####################

In [None]:
## Display the hyperparameters of the best model (code) ##

In [None]:
## Print the score of the best model (code)##

## 3.4 - Final question : how good are our models in predicting unseen data ?

In [None]:
# Start by checking the structure of the test set (code) #
## Your code here ##

### Prepare your test set to be evaluated on the tuned models

In [None]:
# Now prepare our test set
## Your code here ##

In [None]:
# Define your target (outcome) and your predictors #
## Your code here ##

#### Random Forest

In [None]:
# Estimate the performance of your test set on the best cross-validated Random Forest model #

####################
## Your code here ##
####################

#### ElasticNet

In [None]:
# Estimate the performance of your test set on the best cross-validated ElasticNet model #

####################
## Your code here ##
####################

#### Decision Tree

In [None]:
# Estimate the performance of your test set on the best cross-validated Decision Tree model #

####################
## Your code here ##
####################

#### Linear Regression

In [None]:
# Estimate the performance of your test set on the Linear Regression model #

####################
## Your code here ##
####################

### Machine Learning Conclusion
**In light of all this information, what have you learned about :**
1. Overfitting
2. Tuning a learner
3. Model performance
4. Anything else ?

Please write a complete but concise essay on your learning experience

### <span style="color:blue">Full Homework to be submitted on session #7</span>