In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Feature Selection

The element that has the biggest impact in the quality of your model is data features. You can only include in your model the attributes that you have and if they are not relevant, partially relevant or don't caputre the causality relationships behind the model, or introduce other relationships that correspond to other causes different from the ones that you want to investigate, then you'll have a poor model. 

Selecting the relevant features that add to your model is therefore of the utmost importance. 

In this notebook we will deal with four approaches:

        1) Univariate Selection.
        2) Recursive feature elimination.
        3) PCA - Principal Component Analysis.
        4) Estimating feature importance.

Feature selection is a process where you select those features in your data that contribute most to the variable of interest. Irrelevant features decrease the accuracy of many models because you try to adjust on noise, this is particularly important in the case of linear models, such as linear and logistic regressions, where all features are always taken into accout. Three are the main benefits of feature selection:

        1) Reduces overfitting. Less redundant data implies less decisions made on noise. 
        2) Improves accuracy. Less misleading data results in a more accurate model. 
        3) Reduces training time. Less data implies faster training. 
        
Scikitlearn has a nice and short article on feature selection where you can learn more https://scikit-learn.org/stable/modules/feature_selection.html

Again we will use the Pima Indians onset of diabetes dataset. 


yo: (which ost important features, except PCA)
UNIVARIATE: try to learn which feature is the most important one. But every model has a bias. Bias are not necessarily bad. so not so much overfitting. 
RECURSIVE: you take any model. e.g. you have 4 attributes, build the model with another combination of 3 and exhaust all the combinations. and then you rank accuracy. from that you can estimate the importance of that attribute. this is good because it realtes to a model. you are testing it with the model, it is not a theoretical thing. It is like this. Problem: you use more computational power. But it is real. Of course, it changes if you use another model. Is model-related.
One feature out and you try and with that you see the behaviour. And rank. without this attribute accuracy was X, while the other was Y. But computational power not so important today.
PCA: comes from statistic. is a completely differrent thing from the other three. you don't try to see the most relevant one. you. Reduce complexity. You create a matrix and eigenvalues... less dimensions. Theses dimensions are made of other ones. They are just combinations of things. Difficult to explain in real life. But many times no attribute dominates so difficult to explain. If you just carea about accuracy, fits very well the model. But not a good way to go for interpretation. New dimensions don't mean anything. (XGBoost does this for you?)
ESTIMATING: no bias.


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional Machine Learning dataset, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The datasets consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores de likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

In [2]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
X
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Univariate Selection </h1>

One approach is to use statistical tests for example the Pearson Chi-Squared $\chi^2$ is commonly used to select the most significant features. 

We will use the <b> SelectKBest </b> class in scikit-learn.


In [3]:
# Univariate selection using Chi-squared 
set_printoptions(precision=3)
p_indians.head()

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 

# feature selection (we select the 4 best)
test = SelectKBest(score_func=chi2, k=4) #selecciona los mejores, la metodología siendo chi2 y seleccionando 4
fit = test.fit(X,Y)
print("Scores")

fit.scores_

print("The 4 attributes with the highest scores are: glucose, insulin, bmi and age ")
print()

features=fit.transform(X)
features[0:5,:]

# en el transform solo elimina las colunmnas.

# miras variabilidad explicada por chi2 de cada una de las variables y te quedas con el mayor valor.

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Scores


array([ 111.52 , 1411.887,   17.605,   53.108, 2175.565,  127.669,
          5.393,  181.304])

The 4 attributes with the highest scores are: glucose, insulin, bmi and age 



array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       [ 89. ,  94. ,  28.1,  21. ],
       [137. , 168. ,  43.1,  33. ]])

<h1>Recursive Feature Elimination</h1>

This is a very intuitive approach. It consist on recursively removing attributes and building a model with those atrributes remaining. It uses the model accuracy to identify which atrributes or combination of attributes contribute the most. 

We will use it with a logistic regression, but the choice of algorithm doesn't matter too much as long as your are consistent. 

Recursive Feature Elimination uses the <b>RFE </b> class. 


In [4]:
# Recursive Feature Elimiantion

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression # el modelo del training, no de feature selection.

p_indians.head()

#Logistic regression
model = LogisticRegression(solver='liblinear') 

#A:es model-related. Buscar el mejor modelo es exploration.
#does not overfit as much, esa es la ventaja.

rfe = RFE(model, 3) #  we want to find the 3 top features
fit = rfe.fit(X, Y)

print(f'Number of features {fit.n_features_:d}')
print(f'Selected features {fit.support_}')
print(f'Ranking of features {fit.ranking_}')
print()
print("Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)")

#los resultados son completamente diferentes a lo que obteníamos antes.
#esto es real. You test it.
#rfe is real
#classification, you look for accuracy, is the normal thing.




# A:
# Coges todas las variables y vas quitando y viendo cómo afecta a la varianza que explicas.
# Está usando los coeficientes obtenidos en el fit del modelo que le he dado para saber qué tiene que eliminar.
# Hace algo similar a penalizar los coeficientes.

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Number of features 3
Selected features [ True False False False False  True  True False]
Ranking of features [1 2 3 5 6 1 1 4]

Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)


<b><font color="red" size=6>Mission 1</font>

For this and the next mission we will use data from Kaggle In concrete from the World University Rankings Competition https://www.kaggle.com/mylesoneill/world-university-rankings

a) Using the Shanghai rankings find the top 3 most important features to explain them with both univariate and recursive (in recursive because we are using log regression create an output variable of being in the top 50 or not).
<br><br>
b) Same for the Times ranking. 
<br><br>
c) Does it change if we choose the top 10 or top 100?
</b>

In [5]:
# A) Using the Shanghai rankings find the top 3 most important features to explain them with both univariate and recursive 
# (in recursive because we are using log regression create an output variable of being in the top 50 or not).

# Import dataset
sg_data = pd.read_csv('shanghaiData.csv')
sg_data['world_rank'] = sg_data['world_rank'].str.split('-').str.get(0).astype(float)
sg_data.drop(columns=['national_rank','total_score'], inplace=True)
sg_data.dropna(inplace=True)
sg_data.head()

    # 'world_rank' series presents several anomalies in 101-152 onwards, therefore the need to change that and convert to float.
    # Eliminate attributes that are not necessary: 'national_rank' and 'total_score'
    # Get rid of null values
    
# Select input and target variables with iloc (integer position-based) and transform to numpy arrays.
# I do not select university_name and year as input variables.
X = sg_data.iloc[:,2:8].values
Y = sg_data.iloc[:,0:1].values


print('------ Univariate ------')

# Imports
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Select 3 most important features
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X,Y)
fit.scores_
features = fit.transform(X)
features

# Show the results in a meaningful manner.
results = sorted(zip(fit.scores_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
results

    # reverse: sort in Descending order.
    # .columns: get the labels of the df.
    # show only first three
print('The selected features are: '+ str([x[1] for x in results]))




print('------ Recursive ------')

# Imports
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Top50, since using log regression need for classification. And converting it into array.
sg_data['Top50'] = sg_data['world_rank']<51
Y = sg_data['Top50'].values

# Select the 3 most important features.
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
fit.support_  # show selected features

# ¿HACE FALTA ESTO?
features = fit.transform(X)
features

# Show results
results = sorted(zip(fit.support_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))




Unnamed: 0,world_rank,university_name,alumni,award,hici,ns,pub,pcp,year
0,1.0,Harvard University,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2.0,University of Cambridge,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3.0,Stanford University,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4.0,"University of California, Berkeley",71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5.0,Massachusetts Institute of Technology (MIT),74.0,80.6,66.7,65.8,64.3,53.0,2005


------ Univariate ------


array([ 75443.056, 123638.503,  51511.303,  42133.099,  15032.945,
        14897.774])

array([[100. , 100. , 100. ],
       [ 99.8,  93.4,  53.3],
       [ 41.1,  72.2,  88.5],
       ...,
       [ 13.6,   0. ,   3.6],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,  14.9]])

[(123638.50331046723, 'award'),
 (75443.05612636959, 'alumni'),
 (51511.303069663074, 'hici')]

The selected features are: ['award', 'alumni', 'hici']
------ Recursive ------


array([False,  True,  True,  True, False, False])

array([[100. , 100. , 100. ],
       [ 93.4,  53.3,  56.6],
       [ 72.2,  88.5,  70.9],
       ...,
       [  0. ,   3.6,  10.8],
       [  0. ,   0. ,  12.2],
       [  0. ,  14.9,   7.5]])

The selected features are: ['ns', 'hici', 'award']


In [6]:
# B) Same for the Times ranking. 

# Import dataset and do corresponding adjustments.
times = pd.read_csv('timesData.csv')
times.world_rank.replace('=.','.', regex=True, inplace=True)
times['world_rank'] = times['world_rank'].str.split('-').str.get(0).astype(float)
times.world_rank = pd.to_numeric(times.world_rank, errors='coerce')
times.dropna(subset=['world_rank'], axis=0, inplace=True) # ??????????
times.international = pd.to_numeric(times.international, errors='coerce')
times.income = pd.to_numeric(times.income, errors='coerce')
times.num_students.replace('\D','', regex=True, inplace=True)
times.num_students = pd.to_numeric(times.num_students, errors='coerce')
times.international_students.replace('\D','', regex=True, inplace=True)
times.international_students = pd.to_numeric(times.international_students, errors='coerce')
times.international_students = times.international_students/100
times['females'] = times.female_male_ratio.str.split(':').str.get(0) # ¿alguna forma mejor?
times.females = pd.to_numeric(times.females, errors='coerce')
times.drop(columns='total_score',inplace=True)
times.dropna(inplace=True)
times.head()

    # regex: bool or same types as to_replace, default False
    #... Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. 
    #...Alternatively,this could be a regular expression or a list, dict, or array of regular expressions 
    #...in which case to_replace must be None.
    #... Second, if regex=True then all of the strings in both lists will be interpreted as regexs.
    
    #pd.to_numeric: convert argument to a numeric type.\ If ‘coerce’, then invalid parsing will be set as NaN
    
    # dropna: subset: Labels along other axis to consider, 
    # e.g. if you are dropping rows these would be a list of columns to include.
    #... axis : {0 or ‘index’, 1 or ‘columns’}
    

print('------ Univariate ------')

# Select input and target variables with iloc (integer position-based) and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
Y = times.iloc[:,0:1].values
    #pandas.concat : Concatenate pandas objects along a particular axis with optional set logic along the other axes.
    
# Specific imports
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Select the 3 most important features.
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X,Y.astype(int)) # convert float to integer, otherwise gives back error in this case!!
fit.scores_
features = fit.transform(X)
features

# Show in a meaningful manner
results = sorted(zip(fit.scores_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))



print('------ Recursive ------')

# Specific imports
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Select target variable and convert it to array
times['Top50'] = times['world_rank']<51
Y = times['Top50'].values

# Select the 3 most important features.
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
fit.support_
features = fit.transform(X)
features

# Show in a meaningful manner
results = sorted(zip(fit.support_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,females
1,2.0,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,0.27,33 : 67,2011,33.0
2,3.0,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,0.33,37 : 63,2011,37.0
3,4.0,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,0.22,42 : 58,2011,42.0
5,6.0,University of Cambridge,United Kingdom,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,0.34,46 : 54,2011,46.0
6,6.0,University of Oxford,United Kingdom,88.2,77.2,93.9,95.1,73.5,19919.0,11.6,0.34,46 : 54,2011,46.0


------ Univariate ------


array([1.412e+04, 5.210e+03, 2.249e+04, 1.364e+04, 4.396e+03, 1.948e+06,
       1.962e+03, 4.587e+01, 5.467e+02])

array([[9.770e+01, 9.800e+01, 2.243e+03],
       [9.780e+01, 9.140e+01, 1.107e+04],
       [9.830e+01, 9.810e+01, 1.560e+04],
       ...,
       [1.450e+01, 7.600e+00, 3.127e+04],
       [2.010e+01, 1.600e+01, 1.012e+04],
       [1.620e+01, 1.830e+01, 8.663e+03]])

The selected features are: ['num_students', 'research', 'teaching']
------ Recursive ------


array([False, False,  True,  True, False, False, False,  True, False])

array([[9.80e+01, 9.99e+01, 2.70e-01],
       [9.14e+01, 9.99e+01, 3.30e-01],
       [9.81e+01, 9.92e+01, 2.20e-01],
       ...,
       [7.60e+00, 1.93e+01, 2.00e-02],
       [1.60e+01, 1.35e+01, 8.00e-02],
       [1.83e+01, 2.86e+01, 4.00e-02]])

The selected features are: ['research', 'international_students', 'citations']


In [7]:
# C) Does it change if we choose the top 10 or top 100?

# I just copy-paste the previous code and change to <11 and <101

print('------ Shanghai10 ------')
sg_data.head()
X = sg_data.iloc[:,2:8].values
sg_data['Top10'] = sg_data['world_rank']<11   # HERE THE DIFFERENCE
Y = sg_data['Top10'].values
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
fit.support_
features = fit.transform(X)
features
results = sorted(zip(fit.support_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))
print('------ Shanghai100 ------')
sg_data['Top100'] = sg_data['world_rank']<101   # HERE THE DIFFERENCE
Y = sg_data['Top100'].values
fit = rfe.fit(X, Y)
fit.support_
features = fit.transform(X)
features
results = sorted(zip(fit.support_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))


print('------ Times10 ------')
times.head()
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top10'] = times['world_rank']<11            # HERE THE DIFFERENCE
Y = times['Top10'].values
fit = rfe.fit(X, Y)
fit.support_
features = fit.transform(X)
features
results = sorted(zip(fit.support_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))
print('------ Times100 ------')
times['Top100'] = times['world_rank']<101         # HERE THE DIFFERENCE
Y = times['Top100'].values
fit = rfe.fit(X, Y)
fit.support_
features = fit.transform(X)
features
results = sorted(zip(fit.support_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

------ Shanghai10 ------


Unnamed: 0,world_rank,university_name,alumni,award,hici,ns,pub,pcp,year,Top50
0,1.0,Harvard University,100.0,100.0,100.0,100.0,100.0,72.4,2005,True
1,2.0,University of Cambridge,99.8,93.4,53.3,56.6,70.9,66.9,2005,True
2,3.0,Stanford University,41.1,72.2,88.5,70.9,72.3,65.0,2005,True
3,4.0,"University of California, Berkeley",71.8,76.0,69.4,73.9,72.2,52.7,2005,True
4,5.0,Massachusetts Institute of Technology (MIT),74.0,80.6,66.7,65.8,64.3,53.0,2005,True


array([ True, False,  True, False,  True, False])

array([[100. , 100. , 100. ],
       [ 99.8,  53.3,  70.9],
       [ 41.1,  88.5,  72.3],
       ...,
       [ 13.6,   3.6,  25.1],
       [  0. ,   0. ,  28.8],
       [  0. ,  14.9,  25. ]])

The selected features are: ['pub', 'hici', 'alumni']
------ Shanghai100 ------


array([False,  True,  True,  True, False, False])

array([[100. , 100. , 100. ],
       [ 93.4,  53.3,  56.6],
       [ 72.2,  88.5,  70.9],
       ...,
       [  0. ,   3.6,  10.8],
       [  0. ,   0. ,  12.2],
       [  0. ,  14.9,   7.5]])

The selected features are: ['ns', 'hici', 'award']
------ Times10 ------


Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,females,Top50
1,2.0,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,0.27,33 : 67,2011,33.0,True
2,3.0,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,0.33,37 : 63,2011,37.0,True
3,4.0,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,0.22,42 : 58,2011,42.0,True
5,6.0,University of Cambridge,United Kingdom,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,0.34,46 : 54,2011,46.0,True
6,6.0,University of Oxford,United Kingdom,88.2,77.2,93.9,95.1,73.5,19919.0,11.6,0.34,46 : 54,2011,46.0,True


array([False, False, False,  True, False, False, False,  True,  True])

array([[9.99e+01, 2.70e-01, 3.30e+01],
       [9.99e+01, 3.30e-01, 3.70e+01],
       [9.92e+01, 2.20e-01, 4.20e+01],
       ...,
       [1.93e+01, 2.00e-02, 3.60e+01],
       [1.35e+01, 8.00e-02, 2.80e+01],
       [2.86e+01, 4.00e-02, 4.30e+01]])

The selected features are: ['international_students', 'females', 'citations']
------ Times100 ------


array([False, False,  True,  True, False, False, False,  True, False])

array([[9.80e+01, 9.99e+01, 2.70e-01],
       [9.14e+01, 9.99e+01, 3.30e-01],
       [9.81e+01, 9.92e+01, 2.20e-01],
       ...,
       [7.60e+00, 1.93e+01, 2.00e-02],
       [1.60e+01, 1.35e+01, 8.00e-02],
       [1.83e+01, 2.86e+01, 4.00e-02]])

The selected features are: ['research', 'international_students', 'citations']


<h1>Principal Component Analysis</h1>

Principal Component Analysis is a data reduction technique using linear algebra. The idea here is to "compress" several dimensions into pricipal components. 

One problem of PCA is the explainability. Once you compressed the attributes into principal components you can no longer to refer them individually establishing causality links or relationships. 

A property of PCA is that you can choose the number of dimensions or principal components. In our example we will select 3 principal components. 

For Principal Component Analysis you use the <b>PCA</b> class. 


In [8]:
from sklearn.decomposition import PCA

p_indians.head()

#PCA
pca = PCA(n_components=3) # número de componentes que quiero. Menor o igual a los que tienes.
pca_fit = pca.fit(X)

print(f"Explained variance: {pca_fit.explained_variance_ratio_}")
print()

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print("Principal Components have little resemblance to the source data attributes")
print()
print(pca_fit.components_)


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Explained variance: [1.000e+00 3.215e-06 1.598e-06]

Principal Components have little resemblance to the source data attributes

[[ 0.000 -0.000  0.000 -0.000 -0.000  1.000  0.000 -0.000  0.000]
 [ 0.442  0.315  0.570  0.542  0.287  0.000 -0.065  0.002 -0.011]
 [ 0.178 -0.588  0.187 -0.343  0.626 -0.000 -0.052 -0.002 -0.276]]


In [9]:
#in each vector try to explain as much variance as possible.
#insulina con otras cosas, pero insulina domina en el primer vector. Explica 88% of variance. la gran parte de la explicación.
#el segundo vector es glucosa. Correlacionado negativamente.solo explica 6%
#el tercero is whatever.
#probablemente lo puedas reducir a dos vectores o incluso a uno.
#pero cuando lo quieres explicar.. difícil porque not relate to real thing.
#if you care about fit then perfect.
#este es el mejor caso porque tienes a uno que domina el vector, pero si tuvieras a varios dominando... entonces difícil.
#reduce complexity with less attributes through calculating the variances of the dimensions.
#these are the eigenvalues of the matrix.

<h1>Feature Importance </h1>

One of the added features of tree based algorithms is that they can be used to estimate the importance of each feature and use it to refine the model to different levels depending on where we want to situate ourselves in the tension between explainability and accuracy. 

In this example we are going to use the ExtraTreesClassifier, but the technique is commonly used in all tree algoritms. 

For this example of assessing feature importance with trees we will use the <b>ExtraTreesClassifier</b> class. 


In [10]:
from sklearn.ensemble import ExtraTreesClassifier

p_indians.head()

model = ExtraTreesClassifier(n_estimators=100) # the number of trees in the forest (tree en este caso como variable, ensemble)
model.fit(X,Y) # Adaptar ensemble a tu modelo.

print(model.feature_importances_) # Sacar los números de importancia... cuanto mayor más importante. 
# la importancia podría ser como un peso, un coeficiente.

#esto es un random forest.
#hay 3 cosas importantes: cuántos trees in the forest, the learning? rate and finally how deep are the trees.
#3 parameters. But each library has a name for the same.
#presentarías el vector como un histograma para que se pueda ver qué es lo más importante y como rankeado por importancia.
#you run it a few times, if the result is the same then is pretty robust. If not, then maybe add more trees, etc.


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.272  0.044  0.275  0.236  0.050  0.030  0.028  0.036  0.030]


<b><font color="red" size=6>Mission 2</font>

a) Using the Shangai Data find the top attributes with a tree classifier for top-10, top-50 and top-100.  
<br>
b) Same for the Times ranking. 
<br><br>

</b>

In [11]:
# A) Using the Shangai Data find the top attributes with a tree classifier for top-10, top-50 and top-100.

In [12]:
sg_data.head()

print('------ Shanghai10 ------')

# Specific import
from sklearn.ensemble import ExtraTreesClassifier

# Select variables and transform to numpy arrays
X = sg_data.iloc[:,2:8].values
Y = sg_data['Top10'].values

# Feature importance
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)

# Show results
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))


print('------ Shanghai50 ------')

# Copy-paste and just change the target variable top 'Top50'
from sklearn.ensemble import ExtraTreesClassifier
Y = sg_data['Top50'].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

print('------ Shanghai100 ------')

# Copy-paste and just change the target variable top 'Top100'
from sklearn.ensemble import ExtraTreesClassifier
Y = sg_data['Top100'].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, sg_data.iloc[:,2:8].columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

Unnamed: 0,world_rank,university_name,alumni,award,hici,ns,pub,pcp,year,Top50,Top10,Top100
0,1.0,Harvard University,100.0,100.0,100.0,100.0,100.0,72.4,2005,True,True,True
1,2.0,University of Cambridge,99.8,93.4,53.3,56.6,70.9,66.9,2005,True,True,True
2,3.0,Stanford University,41.1,72.2,88.5,70.9,72.3,65.0,2005,True,True,True
3,4.0,"University of California, Berkeley",71.8,76.0,69.4,73.9,72.2,52.7,2005,True,True,True
4,5.0,Massachusetts Institute of Technology (MIT),74.0,80.6,66.7,65.8,64.3,53.0,2005,True,True,True


------ Shanghai10 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.198  0.398  0.119  0.118  0.041  0.127]
The selected features are: ['award', 'alumni', 'pcp']
------ Shanghai50 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.089  0.217  0.237  0.244  0.138  0.075]
The selected features are: ['ns', 'hici', 'award']
------ Shanghai100 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.126  0.226  0.219  0.167  0.166  0.096]
The selected features are: ['award', 'hici', 'ns']


In [13]:
# B) Same for the Times ranking. 

times.head()

print('------ Times10 ------')

# Specific import
from sklearn.ensemble import ExtraTreesClassifier

# Select variables and transform to numpy arrays
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
Y = times['Top10'].values

# Feature importance
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)

# Show results
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))


print('------ Times50 ------')

# Copy-paste and just change the target variable top 'Top50'
from sklearn.ensemble import ExtraTreesClassifier
Y = times['Top50'].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

print('------ Times100 ------')

# Copy-paste and just change the target variable top 'Top100'
from sklearn.ensemble import ExtraTreesClassifier
Y = times['Top100'].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
print(model.feature_importances_)
results = sorted(zip(model.feature_importances_, pd.concat([times.iloc[:,3:11],times['females']],axis=1).columns),reverse=True)[:3]
print('The selected features are: '+ str([x[1] for x in results]))

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,females,Top50,Top10,Top100
1,2.0,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,0.27,33 : 67,2011,33.0,True,True,True
2,3.0,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,0.33,37 : 63,2011,37.0,True,True,True
3,4.0,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,0.22,42 : 58,2011,42.0,True,True,True
5,6.0,University of Cambridge,United Kingdom,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,0.34,46 : 54,2011,46.0,True,True,True
6,6.0,University of Oxford,United Kingdom,88.2,77.2,93.9,95.1,73.5,19919.0,11.6,0.34,46 : 54,2011,46.0,True,True,True


------ Times10 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.183  0.104  0.191  0.144  0.099  0.067  0.073  0.066  0.072]
The selected features are: ['research', 'teaching', 'citations']
------ Times50 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.269  0.058  0.287  0.168  0.053  0.036  0.040  0.051  0.038]
The selected features are: ['research', 'teaching', 'citations']
------ Times100 ------


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.260  0.046  0.280  0.231  0.052  0.030  0.029  0.042  0.030]
The selected features are: ['research', 'teaching', 'citations']
