# . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# Decision Tree Linear Regression
# . . . . . . . . . . . . . . . . . . . . . . . . . . . .




Afin de créer un arbre de décision, nous allons étudier un CSV sur les revenus d'individus aux USA.
Cela contient les informations sur le statut marital, l'âge, le type d'emploi etc.
Les datas sont de 1994. Nous voulons predire si le salaire sera inferieur voire égal à 50k ou supérieur à 50k

Voilà la source : [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult)
Le CSV est inclus.

### <span style="color:purple">Importe le fichier income.csv fourni
 </span>

In [45]:
import pandas as pd

income = pd.read_csv('income.csv', index_col = False) # Set index_col to False to avoid pandas thinking that the first column is row indexes (it's age)
income.head(15)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


### <span style="color:purple">Combien d'observations y a t-il ?
 </span>

In [33]:
len(income)

32561

 <span style="color:green">Attendu : 32561</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Catégories

### <span style="color:purple">Prends le temps d'interpreter les datas. Quels sont les differents facteurs, les	Θ. Sous quelle forme sont elles classifiées.<br> Sex donne "Male" ou "Female". "workclass" contient plusieurs sortes de réponses.   Afin de pouvoir les analyser nous allons les convertirs en valeures numériques en fonction des categories.   Pour cela utilise la class [pandas.Categorical()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Categorical.html)</span>

In [46]:
to_convert = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']
mappings = {}
reverse_mappings = {}
for key in to_convert:
    c = pd.Categorical(income[key])
    mappings[key] = {category: i for (i, category) in enumerate(c.categories)}
    reverse_mappings[key] = {i: category for (i, category) in enumerate(c.categories)}
mappings['sex']


{' Female': 0, ' Male': 1}

In [47]:
income.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


### <span style="color:purple">Convertis le reste des catgorical columns dans `income` (education, marital_status, occupation, relationship, race, sex, native_country, and high_income) en categories numerique.</span>


In [48]:
for key in to_convert:
    income[key] = income[key].replace(mappings[key])
income.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0


### <span style="color:purple">A present divise income en deux `DataFrames` en fonction de `workclass` afin de diviser ceux travaillant dans le secteur privé ou non. </span>
*python hint : Booleans to split a dataframe.*

In [56]:
private_code = mappings['workclass'][' Private']
income_private = income[income['workclass'] == private_code]
income_non_private = income[income['workclass'] != private_code]

In [57]:
income_private.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0


In [58]:
income_non_private.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0


### ------------------------------------------------------------------------------------------------------------------------------

###  <span style="color:purple">Calcule une proportion / probabilite</span>
 ##### <span style="color:purple">Quelle est la proportion de `private_income` et de `public_incomes` ?
 </span>

In [59]:
total = len(income)
n_private = len(income_private)
n_non_private = len(income_non_private)
private_incomes_prop = float(n_private) / float(total)
public_incomes_prop = float(n_non_private) / float(total)

print("private_incomes_proportion",private_incomes_prop)
print("public_incomes_proportion",public_incomes_prop)

private_incomes_proportion 0.6970301894904948
public_incomes_proportion 0.3029698105095052


 <span style="color:green">private_incomes 0.6970301894904948 <br>
public_incomes 0.3029698105095052</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Entropy

$$-\sum _{ i=1 }^{ c }{ P({ x }_{ i })\log _{ 2 }{ P({ x }_{ i }) }  } $$

In [71]:
import numpy as np

def log2(n):
    return np.log(n) / np.log(2)

def calc_entropy(column):
    p = np.array(column.value_counts()) / len(column)
    entropy = -np.dot(p, log2(p))
    return entropy

print("high_income", calc_entropy(income["high_income"]))
print("workclass", calc_entropy(income["workclass"] ))

high_income 0.796383955202
workclass 1.64797692751


 <span style="color:green">Attendu : <br> high_income 0.796383955202 <br>
workclass 1.64797692751</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Information Gain

###  <span style="color:purple">Calcule le `Information Gain` de `age` en fonction de l'objectif final `high_income`</span>
 
$$Entropy(T)\quad =\sum _{ i=1 }^{ c }{ P({ x }_{ i })\log _{ b }{ P({ x }_{ i }) }  } \\ \\ IG(T,\quad A)\quad =\quad Entropy(T)-\sum _{ v\epsilon A }^{  }{ \frac { |{ T }_{ v }| }{ |T| }  } .Entropy({ T }_{ v })$$


In [77]:
import math


median_age = income['age'].median()
print('Median age: {} yr old'.format(median_age)) #trouve le médian de age.

# crée deux subset en fonction du médian
income_young = income[income['age'] <= median_age]
income_old = income[income['age'] > median_age]

# calcule la proportion de chacuns des splits.
young_prop = float(len(income_young)) / float(len(income))
old_prop = float(len(income_old)) / float(len(income))
print('Young people: {} %'.format(young_prop * 100.))
print('Old people: {} %'.format(old_prop * 100.))

Median age: 37.0 yr old
Young people: 51.22999907865238 %
Old people: 48.77000092134762 %


In [78]:
# Calcule l'entropie de high_income l'objectif final
income_entropy = calc_entropy(income["high_income"])

In [89]:
#calculez l' `age_information_gain`

age_information_gain = income_entropy - young_prop * calc_entropy(income_young['high_income']) \
                                      - old_prop * calc_entropy(income_old['high_income'])
age_information_gain

0.047028661304692021

 <span style="color:green">Attendu:0.047028661304691965</span>

 ### <span style="color:purple">Créer une fonction `calc_information_gain`</span>

In [155]:
def calc_information_gain(data, split_name, target_name):
    # calcule l'entropy d'origine
    original_entropy = calc_entropy(data[target_name])
    
    # trouve le médiant de la colonne
    median = data[split_name].mean()
    
    # crée deux subset en fonction du médian
    data_low = data[data[split_name] <= median]
    data_high = data[data[split_name] > median]
    
    # calcule le subset entropy de chacun des set
    to_subtract = 0
    for subset in [data_low, data_high]:
        to_subtract += float(len(subset)) / float(len(data)) * calc_entropy(subset[target_name])
        
    # Retourne information gain
    return original_entropy - to_subtract

# Vérifie que la valeur coïncide avec `income`, "age", "high_income"
print(calc_information_gain(income, "age", "high_income"))

0.0423657194597


 <span style="color:green">Attendu: 0.0470286613047</span>

 ### <span style="color:purple">Puis une `liste information_gains` de toutes les colonnes</span>

In [156]:
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
information_gains = []

#faire une boucle
for col in columns:
    gain = calc_information_gain(income, col, 'high_income')
    information_gains.append((col, gain))
information_gains

[('age', 0.042365719459676932),
 ('workclass', 3.7603253844054052e-13),
 ('education_num', 0.065012984132774232),
 ('marital_status', 0.1114272573715438),
 ('occupation', 0.00012688978571828713),
 ('relationship', 0.047362416650269412),
 ('race', 0.0057026139559477329),
 ('sex', 0.037171387438321157),
 ('hours_per_week', 0.040622468671234868),
 ('native_country', 0.00076407312508641745)]

### ------------------------------------------------------------------------------------------------------------------------------

 ### <span style="color:purple">Selectionne le nom de colonne avec la valeur la plus élevée</span>

In [157]:
def get_gain(col_gain):
    col, gain = col_gain
    return gain

highest_gain = max(information_gains, key = get_gain)
highest_gain

('marital_status', 0.1114272573715438)

### ------------------------------------------------------------------------------------------------------------------------------

 ### <span style="color:purple">A l'aide de la recusivite tu peux creer une fonction qui creera l integralite de l'arbre</span>

In [158]:
class Node:
    def __init__(self, criterion, subtree_low, subtree_high):
        self.criterion = criterion
        self.subtree_low = subtree_low
        self.subtree_high = subtree_high
        
    def __repr__(self):
        column_name, median = self.criterion
        return '{} <> {}'.format(column_name, median)



def build_tree(data, target_name, columns):
    if len(columns) == 0 or len(data) == 0:
        return None
    information_gains = []
    for col in columns:
        gain = calc_information_gain(data, col, target_name)
        information_gains.append((col, gain))
    highest_gain = max(information_gains, key = lambda col_gain: col_gain[1])
    split_name, median = highest_gain
    median = data[split_name].mean()
    # crée deux subsets en fonction de la médiane
    data_low = data[data[split_name] <= median]
    data_high = data[data[split_name] > median]
    new_columns = [col for col in columns if col != split_name]
    return Node(
        criterion = highest_gain,
        subtree_low = build_tree(data_low, target_name, new_columns),
        subtree_high = build_tree(data_high, target_name, new_columns)
    )

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
tree = build_tree(income, 'high_income', columns)

In [159]:
tree

marital_status <> 0.1114272573715438

In [160]:
tree.subtree_low

education_num <> 0.07967647835077063

In [161]:
tree.subtree_high

age <> 0.0257850963147917

In [162]:
tree.subtree_low.subtree_low

relationship <> 0.035421141859372285

In [163]:
tree.subtree_low.subtree_high

relationship <> 0.055762797799215

In [164]:
tree.subtree_high.subtree_low

hours_per_week <> 0.009762526234422517

In [165]:
tree.subtree_high.subtree_high

education_num <> 0.04060715789315511